Resources

Danielle Dempsey | Save an ocean of time: streamline data wrangling with R | RStudio (2022)

My organization currently has over 250 oceanographic sensors deployed around the coast of Nova Scotia, Canada. Together, these generate around 4 million rows of data every year. I was shocked when I discovered my colleagues manually compiled, formatted, and analyzed these data using hundreds of Excel spreadsheets. This was highly time consuming, error prone, and lacked traceability. To improve this workflow, I developed an R package that reduced processing time by 95%. The package has since become integral to our data pipeline, including quality control, analysis, visualization, and report generation in RMarkdown. The resulting datasets have already proven invaluable to industry leaders looking to invest in Nova Scotia’s coastal resources. Talk materials are available at https://github.com/dempsey-CMAR/2022_rstudio_conf(opens in a new tab). Session: Cat herding: solving big problems by bringing people together

Oct 24, 2022
14 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Writing code can improve your workflow. We, the people in this room, and our friends joining online, we know this. But writing code can also be tricky, frustrating, and time-consuming. We also know this. And in my experience, it can be difficult to convince yourself, or maybe a manager, that it is worth spending time up front to write code for tasks like data wrangling, especially if there's an existing process that's already good enough.

So my goal today is to demonstrate that it is worth spending this time up front to write code, because in the long run, it will save time, reduce errors, and improve the workflow. I'm going to use some examples from my work, and maybe these will motivate you to go off and write some code, or maybe you can use these examples to help convince a sceptical manager to let you write that code.

About CMAR and the data challenge

So hi, everyone. My name is Danielle Dempsey. I'm a research scientist at a small company called the Centre for Marine Applied Research, or CMAR for short, in Nova Scotia, Canada. CMAR coordinates this really great coastal monitoring program to measure ocean variables like temperature, dissolved oxygen, and salinity from all around the coast of the province. This data can be used by scientists to study coastal ecosystems, industry to inform their site selection, government to inform their policy and management decisions, and the list could go on.

So this is really valuable data, but for several years, it was rarely used because it was just stored on somebody's desktop in a myriad of CSV and Excel files that were really difficult and time-consuming to compile into a useful format.

So why was this so difficult? It has to do with how the data are collected using what we call sensor strings. So a sensor string consists of a rope that's anchored to the ocean floor and suspended just below the surface by a buoy. There are sensors attached to the rope at different depths through the water column, and these sensors are recording data every one minute to one hour. Each sensor string is deployed in one location for a couple of months, up to a year. So they're really collecting a lot of data.

And we use three different types of sensors that each report the data back in a slightly different format, like you can see on the screen. And you can start to appreciate how working with this data can be challenging, especially because there can be seven or more sensors on each sensor string, and each of these data files can be tens of thousands of rows long.

The old copy-and-paste process

So part of the reason that I was hired about two years ago was to help compile all of this data into a nice, tidy format that we could post online for any interested stakeholder to download and use in their own analyses. So about three weeks after I started this new position, I had a video call with my manager so she could show me the existing process for how this data is compiled.

And I can remember very vividly sitting in my home office on a beautiful Thursday afternoon while she showed me this process. And let me tell you, this involved copying. It involved pasting. It involved sorting the data by a specific column. And it even involved writing down the Excel row numbers to use later on in the process. It was truly horrible. And I thought to myself, I will never do that.

I offered to write some code to help with this, but my manager was a little bit skeptical at first, particularly because at this point I wasn't very confident in my data wrangling or package development skills. But eventually she gave me the thumbs up and I wrote a package called strings to help automate a lot of this process. To be clear, the package is named after our sensor strings and has nothing to do with character strings. So this was maybe a bit of a naive name choice, but it's only caused one small moment of confusion so far, and that's another story.

Writing code saves time

Writing code will save you time. Yes, you'll have to spend some time up front to develop the code, and you may have some learning curves to surmount, particularly if you're new to R, new to coding, but it will be worth it.

So I spent a few weeks developing the backbone of the strings package, and as I said, at this time I didn't have a lot of experience with data wrangling, package development, working with dates and times, or importing and exporting large data files. So I spent a pretty substantial amount of time looking into different packages to help with those challenges. And these are just some of the packages that I ended up working with. You'll recognize most of them from the tidyverse, although there are a couple from outside of that collection.

And the point I want to make here is that even though I hadn't used most of these packages before working on strings, I now use almost all of them in almost all of my coding projects, and that's because of the time that I spent learning about them to work on the strings package. So this is just one small example of how the skills that you learn working on one coding project are often really transferable to other projects.

And in my case, the time that I spent working on the strings package also saved us time for processing that coastal monitoring program data. We estimate that it would have taken two years of work hours to compile all of that data with the horrendous old copy and paste method. Instead, within a couple of months, I had developed the package, myself and our student at the time had compiled and formatted all of the data, posted it online, and generated summary reports through rmarkdown.

It's also now really easy to train new people to compile this data using templates that I wrote to go with the package. Our current intern has compiled 71 deployments so far on top of all of his other work in just the 11 weeks that he's been with us so far. Something that would have taken over nine months of dedicated work time with the old method.

Writing code reduces errors

Writing code to help with your data wrangling can also help find and reduce errors in your data set. This will give you and your clients more confidence in the data, and also make whatever analysis you plan on doing with that data more reliable. And in my experience, errors can kind of creep into the data at different points along the data pipeline, especially wherever humans are involved. And so I've learned to write code to anticipate and sometimes even fix those errors.

For the first one, after we had finished compiling all that sensor string data, I decided to plot the latitude and longitude of each sensor string to get an idea of the spatial coverage of the data. So you can imagine my surprise when my map looks something like this. So most of the sensor strings are showing up clustered around Nova Scotia, which is what we would expect. But there's also some sensor strings showing up way over here in Kazakhstan, which is not close to Nova Scotia and not even anywhere near the ocean.

After a little bit of digging, I discovered that our contractors who actually deploy these sensor strings have to manually record the latitude and longitude of each location. And because Nova Scotia is in the western hemisphere, longitude is typically recorded as a negative value. But every so often, it accidentally gets entered as a positive value, which is what's causing some of these sensor strings to show up way over there in Central Asia. So now, when the package first reads in this coordinate data, I'll get a warning if any of those longitude values are positive. So we can go in and fix that mistake before compiling all the data.

A kind of more insidious error that I've seen is data from the wrong sensor in a file of data that's ready to be compiled. And this just happens when someone has what I call a human moment and accidentally puts the wrong file in the folder. In this example, data from one of the sensors on the blue sensor string has been put in the folder with all of the data from sensors on the green sensor string. And this issue is really twofold. One, the data set for the blue sensor string will be incomplete, and two, we will get false information about the area where the green sensor string was deployed. So this is really a critical error to be able to find and fix.

Luckily, the data files are automatically named with the serial number of the sensor. And we keep track of which sensors are associated with each sensor string in deployment logs. And so I've written code in the package that will give an error if there's any missing extra or different files in that folder based on what we would expect from the log.

Writing code improves traceability

Writing code to help automate your data wrangling can also make your workflow more traceable and reproducible, especially when compared to something like copying and pasting. And you can explain this to your non-coding managers with a metaphor.

So copying and pasting in Excel is kind of like making a salad from just whatever is left in your fridge. Most of the time, it will probably taste okay. But eventually you will get something that doesn't taste quite right. Because you have no record of what went into that salad, it's really hard to tell what exactly is making it taste funny. And it's also hard to avoid making the same mistake in the future. As my colleague said when I was trying this metaphor out on her, you can't take the dressing back out of the salad.

As my colleague said when I was trying this metaphor out on her, you can't take the dressing back out of the salad.

In contrast, writing code is more like writing and following a recipe. You can follow it step by step to get a delicious salad every time. And as an added bonus, you can even give this recipe to a friend so they can make their own salad or help you make dinner.

As a more concrete example, for another part of our coastal monitoring program, we had these 65 current datasets that had to be compiled from a kind of messy text format into a tidier CSV format that was easier to work with. So I offered to write some code to help with this. But my managers again decided to have a couple of people instead copy, paste, and format in Excel. This took hours and hours of several people's time. But eventually they finished.

And then we discovered that there had been a small mistake in how these original text files were exported from the sensor software. So all of the data needed to be compiled again. So this time I was asked to write some code. And within a few weeks, I had written a package that automated most of this workflow. So now if we want to make any other changes to this process, we've actually made a couple, just need to go in, tweak the code, press a button, and all of this data will be compiled within just a few minutes.

And so if you have a repetitive task that you've been thinking about automating, I encourage you to go home after the conference and start writing some code. You don't have to write a whole package in a day. You can start with a function or two and just see where it goes from there. And if you do get some pushback from a skeptical manager, remind them that writing code will save time, reduce errors, and improve the workflow. And if you do use those points and it works, please let me know, because it would make my day.

And the last thing I'll say is that my previously skeptical manager is now totally on board and really supportive of all my coding. She even showed me this super R, because she says that I've convinced her that R can do anything that we put our minds to. So thank you, and happy coding.