
USGS R Package Development: 10-year Reflections - posit::conf(2023)
Presented by Laura DeCicco
Ten years ago, the first set of git commits was submitted to a new R software package repository, "dataRetrieval", with the goal of providing an easy way for R users to retrieve U.S. Geological Survey (USGS) water data. At that time, the perception within the USGS was that the use of R was exclusive to an elite group of "very serious scientists." Fast forward, and we now find many newer USGS hires having a solid grasp of the language from the start, along with the use of R in a wide variety of applications. In this talk, I'll discuss my experiences maintaining the dataRetrieval package, how it's shaped my career and impacted USGS R usage, and why data providers should consider sponsoring their own R packages wrapping their data API services.
Presented at Posit Conference, Sept 19-20, 2023. Learn more at posit.co/conference.
Talk Track: Lightning talks. Session Code: TALK-1171
Transcript
This transcript was generated automatically and may contain errors.
My name is Laura DeCicco and I work at the United States Geological Survey. Today I'll be talking about a package I developed and maintained for the USGS called dataRetrieval. What is USGS? It's a United States federal government agency. Its mission is to monitor, analyze, and predict current and evolving Earth system interactions and deliver actionable information to decision makers.
The USGS has scientists that study volcanoes, earthquakes, ecosystems, and has a long history of making cool maps. But what I'm here to tell you about today is the story of USGS water data and how you can get that into R.
USGS water data
There's a huge variety of data that the USGS collects concerning water. Surface water data includes things like gauge height and discharge. Scientists use this data to study how water levels are changing over time, monitor current river conditions, and predict flooding. The USGS also studies groundwater data. This includes information about springs and wells that indicate aquifer conditions. We can use this data to understand how human activities and climate change affect the supply of groundwater.
There are estimates for how much water is used in the United States and generally what that water is used for. Water use data might include things like the volume of water used for irrigation or the amount of water in a public water supply system. Water quality data includes physical data like temperature, chemical data, and biological data. For example, with water quality data we can study the effects of road salt on streams or lakes. Or maybe we're looking at chemicals of emerging concern such as pesticides, microplastics, or PFAS.
Origins of the dataRetrieval package
I started at the USGS about 13 years ago and not too long after that I was asked to learn R. It was around then that the USGS was deciding to move our support officially from SAS to R. There were some scientists already using R but it wasn't very widespread. For the most part Excel was king. Keep in mind RStudio at that time hadn't been around much and alternative IDEs were not very user friendly.
One thing that became obvious right away was that it was very difficult to get our USGS water data into R. This data bottleneck seemed like the first big hurdle that would need to be overcome to convince people to invest the time to learn R.
I'd been hired into a smaller group within the USGS that had a great mix of scientists and software developers. This meant that from the very beginning of my R development I was being taught about version control, unit testing, project management, and other software development best practices. And so on one fateful day in November 2012 I pushed up a set of code to GitHub. There were some hurdles we had to jump through to work in the open, but I've been lucky to have had extremely supportive supervisors who advocate for open source development.
Benefits and community growth
It did not take long to start seeing some immediate benefits of creating a package to help users get their data. Standardizing the data retrieval phase of code made collaborating easier and jumping into analyzing the data quicker. More emphasis was being placed on reproducible science, and dataRetrieval made that a little easier. Community engagement flourished. We introduced still-active GitHub and GitLab organizations, social media, blogging, and various training sessions.
dataRetrieval has been downloaded over 186,000 times. For a pretty niche package that's good. dataRetrieval users are fantastic at finding problems with our data and services, and honestly they're also really great at helping fix those problems.
There have been many USGS R packages developed over the years. We recently got a Posit Connect license and it's been a joy to watch it flourish with Shiny apps, Quarto reports, and Plumber APIs. As a main point of contact for dataRetrieval I've been contacted by users in many different fields, and it's a real highlight of my career.
I wanted to share some examples of all the cool projects that use dataRetrieval, but this talk's way too short and there are way too many to pick from. Here's a small list of examples of topics that I know use dataRetrieval extensively. There have also been many cool data visualizations that use dataRetrieval in their creation process. One recently came out from the New York Times about groundwater. Within the USGS we have a group that creates stunning water-related visualizations. I'd encourage you all to check out the USGS VisLab group.
Expanding beyond R
As RStudio pivoted to Posit to connect with more diverse users, so did the USGS realize we really needed a similar product for Python. So this last year we officially added support for both a Python and a Julia data retrieval package.
I hope what you got from this talk was that making a relatively small investment in an R package is a great way to foster a thriving R community. For the USGS that's meant people have been able to get to their data analysis faster with reproducible, transparent data pipelines. In our current era of rapid hydrologic change this is critical for our USGS mission.
If you're interested in checking out the dataRetrieval package, here are some links that should help you get started. Thank you.
