Resources

Hezi Buba | Do It For Yourself: Creating a data input platform using R | RStudio (2022)

The Belmaker lab fieldwork involves underwater surveys where we document the observed marine species. Historically, Microsoft’s Access was used for data input. But as the number of surveyors increased, a new user-friendly platform was needed to avoid bottlenecks. In this talk, I share my experience of developing such a platform for the lab using tidyverse packages. This experience demonstrates the concept of Minimum Viable Products as the use of Google Sheets evolved from manually creating the sheets to a single line of code. We encourage other researchers to expand their use of R to create a tailor made data processing workflow - from the data input stage. Session: Take a sad process and make it better: project and process makeovers

Oct 24, 2022
16 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

So, I want to share with you a story of evolution, and as you can see, this evolution unsurprisingly starts at sea. See, this is me, and I just started out my master's degree in marine ecology, and I went out to the Red Sea to do some fish sampling, and when I was there, I was completely overwhelmed by the beauty of the coral reef, with all these beautiful shapes, colors, and really nice-looking fish.

But when I was done sampling and looking at all these crazy fish, and it was time for me to go in and fill in the data, I was welcomed by this, ugly, unintuitive, hard-to-navigate Microsoft Access forms, and this is the starting point for our journey.

This journey is a walk-through of how I moved from using Microsoft Access to using Google Sheets and using R to do our own lab custom-made database.

And I hope that whether you're using Microsoft Access now or Google Sheets or any of these different alternatives for making data forms, you will find some lessons that I've learned, something that you could use in your workflow.

Understanding the lab's needs

So the first step in this walk-through was to understand our needs. And to understand our needs, I just have to say a few words about how exactly we operate. So we are, as I said, a marine ecology lab, and we are going, using scuba diving to do some underwater surveys. So we go out there, as you can see here, with water-resistant piece of paper and a pencil, and we just do some records of the fish species, and they're abundant when we are underwater.

So why do we need Microsoft, why did we need Microsoft Access to do that? So first, we know we needed a big data set. So as you can see from this photo, these are taken several years apart. We are doing this for a long time. And we have no intention of stopping any time soon.

Another thing, we need a strict, for our database, we need a strict form of data validation. Because this fellow, their name is Pomacantus Imperator, and if you don't write that exactly as the bottom row, you're going to have a lot of problems down the road. And that's true for any of the cells that we have in our data form.

Most importantly, we need relational data. So our data is structured in a relational format, where we have the trip, the topmost level is mainly the location where we are diving in the world, and the date. Then we have the site, which is, like, more close-up of who is diving where exactly, like, in the actual dive site. Then we have the transect, the underwater environmental data. So that's the visibility, the depth, the substrate. Lastly, we have the observation itself. The species of the fish and its abundance.

Problems with Microsoft Access

When we used Microsoft Access, unknowingly, we made some sacrifices. So first, we had really bad user interface. So for us, it was really hard to work with Microsoft Access. I remember clicking the wrong button a few times, getting this really loud error message. It was a hell to work. At least for me, it was a hell to work.

The last thing that was really detrimental to our workflow is that we couldn't figure out how to do simultaneous data entry for our needs. Sorry. So what happened is we had to actually e-mail the database file from one person to another, and so we took turns filling in the data. So sometimes it could take more than a week until it was your turn to do so, and it was really detrimental to the quality of the data.

Lastly, since we didn't build the database ourselves, it was really hard for us to make any customization needed for our needs. So for example, we are working on both the Mediterranean Sea and the Red Sea. There are completely different seas with completely different animals. We need different protocols, and for this, we need a different database. So that was also a problem.

The Google Sheets prototype

Taking into account all of these things, we set out to create our own database. So what we did was we create. So that was our prototype now, okay? We have this giant Google spreadsheet for an entire expedition, and every day, we had someone in charge of manually creating different spreadsheets within the spreadsheet, different worksheets for each of the different surveyors of that day.

So when I created this Google Sheet prototype, I needed to take some drastic measures so I won't have any user error, and I didn't know how to do that. So my first idea was to have this really horrendous color coding, and I'm going to show you. We had this bright red trip data. So these are different forms. All the forms are in different colors. So we have this trip data in red, then we had this yellow, is it yellow here? This yellow site data, the green transect data, and lastly, that blue observer data.

So as you can see, really nice colors, and to connect this, to connect all of these, the observers themselves had to copy and paste IDs so we could create this relational data.

So what happens is that for every expedition, we have this, a lot of different worksheets within a spreadsheet. As you can see, some red, some blue, it's a whole mess, but this is where we used Google Sheets. It was Google Sheets at the time, now it's Google Sheets 4. We could read by that color coding, and then using that ID that the users had to copy and paste, we could use dplyr to connect it all to a flat data set that we can work with.

Now I know it's a mess, and if you have this expression right now, yeah, this is the expression that all my colleagues had when I presented that idea, and I told them, let's stick with it, it can be worse than what we currently have. It wasn't worse. But the thing is, it doesn't have to be perfect, all right?

So this is us. This is Crete 2019, and I know it doesn't look like that, but we are currently hard at work filling in the data. So as you can see, we're doing that together. So that's an important thing. We managed to solve one big problem, which was the simultaneous data entry. So now that we can work online, as you can see, it's the blue one.

Automating with googlesheets4 and googledrive

So once we managed to solve this problem, and I know that the prototype works, it's time for us to adjust, to make the necessary adjustments to make this data form, database, much better to work with. So how did we do that? Mainly by increasing our dependency of Google Drive and Google Sheets 4 package. So we have a great tool for actually working with Google Drive and Google Sheets, and we managed to automate the entire process.

So what we did at first, we had an offline file that replaced the old yellow and red worksheets that we had. Now they're offline. And from this, we can set out to automate the whole thing. And we used Google Drive for the housekeeping and navigation, and for the manipulation, we used Google Sheets 4.

More specifically, we had Drive MacDir to create individual folders for every sampling day. So now instead of having one big Google spreadsheets that's just somewhere on Drive, we have this hierarchy when we have an expedition directory, and within it for each day, we have a different folder. So it's much more organized.

For Sheets 4 Create, the function we used to upload that offline file from which we create the metadata, we used Drive Find and Drive CP in order to copy and paste preexisting templates that we had to today's folder. So then we used Sheet Copy and Range Write in order to manipulate those preexisting templates, and to manipulate these preexisting templates into whatever necessary for today. So change in site numbers, hour, et cetera.

Lastly, what I really like is that we have Drive Share, and we use this function in order to send out specific e-mail for specific users so they can only change their own data. So they can only fill out their own form. And it's really good because they don't have to mess around with anything in the database. And they can't accidentally erase or change someone else's data. And for them, it's really easy because all they have to do now is get an e-mail, and they sit out and they fill out the forms.

Lastly, what I really like is that we have Drive Share, and we use this function in order to send out specific e-mail for specific users so they can only change their own data. So they can only fill out their own form. And it's really good because they don't have to mess around with anything in the database. And they can't accidentally erase or change someone else's data.

So this is what we get. This is a folder for a day, and we have the metadata up there, and as you can see, we have individual spreadsheets for each of the different observers.

User-oriented data forms

In order to do that, I had to do some, to be creative in a way, and let's zoom in into one of the spreadsheets. So first thing, as you can see, the colors are much better. They're not really bright anymore.

And I wanted to have a strong connection between the environmental data and the fish data. So we're doing ecology, which is basically the relationship between a species and its environment. So we wanted to have this demonstrated in our database, in our forms. To do that, you can see columns A and B don't follow the same logic as columns C and H. Some people would say it's unorthodox, incorrect.

For those of you who can't see that, A and B, each row does not represent an observation. It's a long format. But once you realize that you can go from this long format into a wide format fairly easily using tidyverse, then it opens up a whole world of possibilities for you.

For example, on another project where we do quadrat sampling, I managed to take the exact piece of paper that surveyors go to, to the field, and I managed to copy it almost verbatim into a digital form, making the transition from the actual survey data, from the actual real-world data to the digital data much more intuitive for my users.

And I encourage you all to think about how user-oriented your data forms are at the moment. Because I know I've been taught as well that you should make your data as computer-readable as possible for analysis, visualization, et cetera. But at the same time, all of us were taught how easy it is to manipulate the data shape. So we could all stand to benefit from having more user-oriented data so we could then get so we could minimize the errors from those users, because it would be much more intuitive for them to fill in the data forms and have we make our data much better at the end.

And I encourage you all to think about how user-oriented your data forms are at the moment. Because I know I've been taught as well that you should make your data as computer-readable as possible for analysis, visualization, et cetera. But at the same time, all of us were taught how easy it is to manipulate the data shape.

Further improvements and lessons learned

So this was our project. This is how it went for quite a while. This worked quite well. It was really automated, and we worked for two years or so on this type of forms. But as we continued to use this, we realized that we needed to make additional changes, because as we kept using this database, so did our needs change.

So we realized that having an offline file was creating a bottleneck, because we needed to have someone who knows exactly how to type in the specific observer's name and all kinds of stuff. Then we decided to put that on the cloud as well. And then using the Google Sheets GS4 browse function, we can open up the browser directly from RStudio, and then it opened up something like that, sorry. And then you just fill in with data validation, you fill in who are the observers for today.

So as you can see, I took this opportunity to thank all my colleagues here, my collaborators, whose feedback was really what drove this thing forward. So while I did the coding and I presented the idea and everything, their feedback is what drove this thing forward. If it wasn't for them, we would still be stuck in that ugly call off phase, honestly. So thanks for that.

Another thing, since we built this thing from the ground up, now it's really easy for us to make adjustments whenever we need. So for example, let's say we have another project now. We just have a template, change a few lines of code, not even change, we just add a few lines of code, and we now have a whole new database for another project, really easy to work with.

So basically we can do whatever we want using Google Sheets 4 and RStudio. And now it's even more automated in a sense that you just run one line of code. For example, you select what kind of projects are you working on, fill in the data, like the actual observer data, and just here, as you can see here also, you just confirm that it's correct, and then R does anything for you, everything for you. You can take a few minutes, make some coffee, and then you spend the afternoon filling data with your friends together and colleagues.

I hope this convinced you on the power of using R right from the start of the data manipulation stage. I mean, right from actually getting the data.

If you want to explore this option of having a more custom-made database, some things to consider. So the first thing, consider what exactly do you need out of your database. Think about whether you are currently meeting your needs or are you simply overkilling your needs with what you are doing right now. And in order to meet those needs, don't be afraid to be creative. If you need a triangular-shaped pyramid data, you can do it, okay? Don't be afraid to pursue the type of data that you want.

Lastly, pay attention to whatever your user's feedback is and make sure to keep moving forward with whatever project you are currently working on. And if you want to see how we keep moving this project forward or you're interested in the code, try to replicate this for your own needs, then you can go to our GitHub page and I'll make sure to upload this slide as well to GitHub. And if you have any questions, I'm free, I'm willing to help. And thank you for listening. Thank you.