Resources

Weihuang Wong & Kiegan Rice | How to be a pollinatoR | RStudio (2022)

R users are part of data ecosystems comprising both statistical and non-statistical applications. We may work with SAS or Stata datafiles; non-R users may help run R scripts; or we may need to generate outputs in Word or Excel. Just as pollinators support biodiversity, we believe R users can be constructive members of diverse data ecosystems. Our talk will: (1) outline what it means to be constructive, (2) highlight tools that can help R users contribute to their ecosystems, and (3) describe practices that can improve workflows involving diverse groups of staff and software. We hope our talk will inspire R users to think creatively and empathetically about how R can be a force for good in diverse data ecosystems. Talk materials are available at https://rsconnect.norc.org/rstudioconf-pollinator. Session: Working with people is hard

Oct 24, 2022
17 min


Transcript

This transcript was generated automatically and may contain errors.

My name is Wei, and I'm here together with Kiegan, who is down there getting mic'd up. And today we're going to talk to you about bees.

So bees. We think R users have a lot in common with bees. For one thing, we share a love of hexagons. But more importantly, bees are pollinators. And what we know is that the pollinator population is really important to the health of an ecosystem. And I just noticed we have this beautiful backdrop behind us with these really nice butterflies. So we didn't plan this, but it's kind of appropriate.

And what we hope to convince you of by the end of our talk is that, like bees, R users can enrich their own data ecosystem with effort and empathy.

What is a data ecosystem?

So what do we mean by data ecosystem? To explain, I think it's going to be helpful to tell you a bit about ourselves. So Kiegan and I work at NORC. We are a nonpartisan social research organization. And at NORC, our coworkers, our partners, and our clients use different types of software, so SAS, Excel, Stata. And we produce lots of documentation. So policy briefs, technical reports, data codebooks, and we do that in a variety of formats like Word and PDF.

We work with expert coders, non-coders, and everyone in between. As R users and developers, we have to leave the hive and we have to interact with all these people in our ecosystem. And to do so, I think we're very lucky to have all these great tools that folks in our community have built to help facilitate these interactions.

So that's what we mean by an ecosystem. It's a diverse environment where we have folks with a range of skills using different types of software and creating different kinds of products.

Why enrich the ecosystem?

Why do we feel so strongly about enriching this data ecosystem? Well, three reasons. So in our work, we either succeed or we fail together. So if our work is late or we bust our budget, the client is not going to care that our team did this or the SAS team didn't do that or manual review took two weeks longer than expected. So it's really important for us to make sure that everybody succeeds, whether we're a SAS user or a Stata user or a non-coder.

The second thing is, well, as R users, we do better work when others do better work. So it really helps us when others thrive.

And the third thing is that the more we can lower the barrier to R for non-R users and beginner R users, and the more we can show that R not only works with other software, but it also works for other software, then the easier it is for us to convince our bosses to invest in the infrastructure and the support for R and other open source software and also to pay for us to come to this conference next year.

Empathy and the concept of umwelt

Before we dive into how to be a pollinator, just one word about what we mean by empathy. When we work in an ecosystem, we quickly realize that different people have different ways of thinking about the world. And biologists have a word for it. It's called umwelt, and an umwelt is an animal's perceptual world. So every species senses the world in a different way. So two species sharing the same physical environment have completely different experiences of the same environment. So bats and whales, for example, they hear the world in a way that human beings do not.

Here is where empathy comes in. So among all the species, humans are unique in that we have a capacity to appreciate the umwelt of other species. So in our context, empathy means being able to perceive code and perceive data in a way that other people do.


Lesson 1: Do no harm

Now, Kiegan and I are going to share with you three lessons we've learned on how to be a pollinator. And the first lesson is do no harm. As R users and developers, it's really important for us to be really careful and intentional when we interact with others in our ecosystem and when we deal with foreign data formats.

So here is one example of reading data from Stata into R. And they say that the lessons that stick with you most are the ones you learn from your mistakes. And so this is one that's very dear to my heart. Anyways, this example comes from the General Social Survey, or the GSS. And the GSS is one of the most widely cited studies of American society. It's a study that's been run by NORC for the past 50 years.

And here we have an entry from the GSS codebook. This variable describes the number of years of education received by the respondent's mother. And the codebook tells us what the unique values of this variable are and the number of cases with each value. And then at the bottom, we have this section called reserved codes. And these are missing values. So there are three different types of missing values. And we have 316 cases that are missing because they said they don't know. And then we have 100 cases that are missing because they said that they did not grow up with a female caregiver. So they were not asked this question.

Now when we read this Stata data file into R, we see that the three different missing value types have been collapsed into a single NA. And this looks bad, right? Because it looks like we've just lost information. So why does this happen?

If folks are familiar with SAS or Stata, these two packages have extended missing types, so they can support multiple missing value types, whereas R only has a single system missing value. So what do we do about this? Well, it turns out that the R data object that we have, whether we've read in the data using haven or the foreign package, still contains these missing value types. And it takes a little bit of effort and a few more lines of code to recover the different types. Folks who are interested can see haven's conversion semantics vignette for more information.
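A minimal sketch of what recovering those types can look like, assuming the file was read with haven, which stores Stata's extended missing values as tagged NAs. The vector and the tag letters below are made up for illustration and are not taken from the GSS file.

```r
library(haven)

# Hypothetical stand-in for a column returned by haven::read_dta().
# The tags "d" (don't know) and "i" (not asked) are illustrative only.
maeduc <- c(12, 16, tagged_na("d"), tagged_na("i"), 8, tagged_na("d"))

# Base R treats every tagged value as an ordinary NA.
n_missing <- sum(is.na(maeduc))

# na_tag() recovers which kind of missing each value is
# (it returns NA for values that carry no tag).
tags <- na_tag(maeduc)
tag_counts <- table(tags)
print(tag_counts)
```

Counting the tags gives back a breakdown by type of missing value, which can then be checked against the reserved codes in the codebook.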

But our point here is that when you're working with a new or foreign data format, it really is important to take a moment to learn about the idiosyncrasies of that specific format. And it's part of doing no harm.

Here is another example. Here we have a screenshot of an Excel spreadsheet. The contents don't really matter, but you can see that a few rows have been highlighted. So there is data in the formatting. And these could be cases that have been flagged by a team for review. So some people have feelings about formatting as data. But our view is that this data can be important, and it's important that we preserve this data. And the great thing about R is that there are packages like openxlsx that really help us to retain this information in a very easy way. And we should take the effort, a few lines of code, to preserve this information.
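One way this can look in practice, as a sketch rather than the speakers' actual code: edit the workbook object with openxlsx instead of reading only the values and writing a fresh file. The file, sheet name, and columns here are hypothetical, and the workbook is built in the script only so the example is self-contained.

```r
library(openxlsx)

path <- tempfile(fileext = ".xlsx")

# Build a small hypothetical workbook with one row highlighted for review.
cases <- data.frame(id = 1:3, score = c(10, 42, 7))
wb <- createWorkbook()
addWorksheet(wb, "cases")
writeData(wb, "cases", cases)
flag <- createStyle(fgFill = "#FFFF00")
addStyle(wb, "cases", flag, rows = 3, cols = 1:2, gridExpand = TRUE)
saveWorkbook(wb, path, overwrite = TRUE)

# Load the whole workbook (not just the values), add a column, and save.
# Working on the workbook object carries the existing cell styles along.
wb2 <- loadWorkbook(path)
cases2 <- readWorkbook(wb2, sheet = "cases")
cases2$reviewed <- c(FALSE, TRUE, FALSE)
writeData(wb2, "cases", cases2)
saveWorkbook(wb2, path, overwrite = TRUE)

# The styles that were in the file are still attached after the round trip.
styles_after <- getStyles(loadWorkbook(path))
```

By contrast, reading the sheet into a data frame and writing it out to a new file would keep the values but drop the highlighting.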

So the lesson here is it's going to take a bit of effort and a bit of empathy, but it's really important as pollinators that we take care with other people's data and we do no harm.

Lesson 2: Make things that just work

So the second principle of how to be a pollinator is to make things that just work. So what we really mean by this is we want to remove barriers for either people interacting with R or people who are using things that we're creating in R to make it so that things just work for them.

So a good example of making things that just work is to consider how your data are being output. So you might take the usual approach of I'm going to share this data that I've cleaned and wrangled in R. It's really awesome looking. It's tidy. I'm going to save it out as a CSV. So that's one approach. If you think about who the next user is going to be, you could also take a different approach. And again, with one line of code here, using the foreign package, we can call write.foreign() and include the argument package = "SAS". And that gives us a SAS script. So now not only do we have the CSV, but the next person who uses this data, if they're going to use it in SAS, they have a script that reads in the data and adds the correct formatting and labels for them automatically.

So this is a really great way to make sure that the next person has an easier time. They go ahead and run this script, and the data just works. They don't have to go digging for the formatting and the labels and re-encode everything in a way that makes sense to SAS. It just works. It's already there.
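A short sketch of that one-line export, assuming the foreign package that ships with R. The data frame and file names are hypothetical. write.foreign() writes a plain-text data file plus a SAS script that reads it.

```r
library(foreign)

# Hypothetical cleaned data; the factor becomes a SAS format in the script.
survey <- data.frame(
  id = 1:3,
  region = factor(c("Midwest", "South", "Midwest")),
  income = c(52000, 61000, 48500)
)

data_file <- file.path(tempdir(), "survey.csv")
code_file <- file.path(tempdir(), "survey.sas")

# The one line that does the work: data file plus a matching SAS script.
write.foreign(survey, datafile = data_file, codefile = code_file,
              package = "SAS")

# Inspect the generated SAS code.
sas_code <- readLines(code_file)
cat(sas_code, sep = "\n")
```

The SAS user runs the generated script as is, after checking that the path to the data file in it points to wherever the file ends up on their machine.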

Another example is something you've maybe run into, a message like this: "Looks like there's a hiccup. The server-hosted dashboard is not filtering based on the uploaded file, but it's working on my local computer." So maybe you've gotten this message, this panicked email. Maybe you've sent this panicked email at some point. It's working on one computer, but it's not working on another. It's working on the server, but it's not working locally. So for somebody who's not super confident in figuring out what's going wrong when errors come up, this can be really scary.

So we can remove this possible barrier by, from the start, using something like renv. And there are a lot of different options, but essentially, if you don't know what renv is, go look it up. But it allows us to keep track of the package versions we're using when we're writing and implementing code and share those with any other people or environments that we're going to use this code on. So the next time we go to run this code, it works on the server and it works locally. It works on your machine and it works on my machine. And when we come back to this code in six months, it runs exactly like it's supposed to. It just works.
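For anyone who has not used renv, the basic workflow is roughly the following. This is a sketch of the typical three calls, not a description of the speakers' own setup. The first call is guarded because it creates files in the project.

```r
# Run interactively from the project's root directory; this creates a
# project-local library and an renv.lock file, so it is guarded here.
if (interactive() && requireNamespace("renv", quietly = TRUE)) {
  renv::init()
}

# After installing or upgrading packages, record the versions in renv.lock:
#   renv::snapshot()
#
# On the server, a colleague's machine, or your own machine in six months,
# reinstall the recorded versions:
#   renv::restore()
```

Committing renv.lock alongside the code is what lets the server and the local machine end up with the same package versions.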

So again, with a little bit of effort and a bit of empathy for how other people are going to be interacting with the things you are creating in R, you can build a positive interaction, right? A positive interaction with R or something you've created in R and kind of smooth out those friction points that people run into and make things that just work for people.

Lesson 3: Promote growth

So the third principle is to promote growth. And we really want to bring as many people into the R fold and into the R community at our organization as possible. And so we do this in a couple of ways. The first that we do pretty often is have newer R users interact with existing scripts and processes. So maybe a more advanced R user is going to write a script or some sort of process to complete a task. Maybe it's a data analysis, maybe it creates a table, maybe it creates a chart. And we can have newer users take that already written code and run it with maybe changes to certain parameters or flags within that code.

And this may lead to the following scenario. You're trying to figure out what exactly happened to the script that you wrote. So you ran lines 1 through 57, then you changed overwrite to true, then 61 through 66, then back to 37. And this can be a really kind of frustrating conversation to have for both people if you're trying to figure out exactly what happened and exactly what went wrong. And now the person interacting with this code who's maybe a newer R user is going to feel frustrated and a little scared.

So there's a couple of ways that you can just make this a little bit easier and provide a more low-risk way for them to interact with this code. So one is you can use a dialog box. So something like the svDialogs package. You can write, again, one line of code, and you can have a user go run an R script, and when it gets to a certain point where you've inserted this dialog box code, a dialog box will pop up and give them the option to select yes or no. So maybe it's whether you're going to include a certain set of NAs or not. Maybe it's whether you're going to run this for males or females. Whatever it is, it's a much easier way for them to interact with the code, and it's much less likely that you're going to run into this tracking down what lines of code got run at what time and in what order.
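A sketch of the dialog box idea, assuming svDialogs is installed. The helper name ask_yes_no() and the example data are hypothetical. The fallback default is an added assumption so the same script also runs unattended.

```r
# Ask a yes/no question in a dialog box when a person is running the script;
# fall back to a default when it runs unattended or svDialogs is missing.
ask_yes_no <- function(question, default = FALSE) {
  if (interactive() && requireNamespace("svDialogs", quietly = TRUE)) {
    answer <- svDialogs::dlg_message(question, type = "yesno")$res
    return(identical(answer, "yes"))
  }
  default
}

scores <- c(4, 5, NA, 3)

# The analyst clicks a button instead of editing a flag in the code.
drop_missing <- ask_yes_no("Drop missing scores before averaging?",
                           default = TRUE)
result <- mean(scores, na.rm = drop_missing)
print(result)
```

Because the choice is made through the dialog, the script can be run top to bottom every time, which avoids the question of which lines were run in which order.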

So with a little more effort, you can also give them a more user-friendly interface to interact with. So here's an example of a process where we have analysts run dozens of reports over a couple of months each year, and we built a Shiny dashboard for them. So they can see what reports need to be run. Some of the metadata here is redacted, but they can see how many reports are left to run and how many have already been run, and they can select a specific report in the table and go to the details of that report, see all the metadata that they need, and see all these action buttons that let them take certain steps in the process.

So the users are interacting with buttons in this dashboard without actually having to run the code themselves, and they're able to see exactly the products of this, these beautifully created R Markdown documents, without having to run the code themselves or dig into it.
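A heavily simplified sketch of that kind of interface, assuming the shiny package. The report list, input names, and the placeholder "run" step are all hypothetical and stand in for the real dashboard described above.

```r
library(shiny)

# Hypothetical report tracker; a real one might live in a file or database.
reports <- data.frame(
  report = c("Site A", "Site B", "Site C"),
  status = c("done", "pending", "pending"),
  stringsAsFactors = FALSE
)

ui <- fluidPage(
  titlePanel("Report runner"),
  tableOutput("tracker"),
  selectInput("report", "Report", choices = reports$report),
  actionButton("run", "Run selected report"),
  textOutput("message")
)

server <- function(input, output, session) {
  tracker <- reactiveVal(reports)
  message_text <- reactiveVal("")

  observeEvent(input$run, {
    current <- tracker()
    # Placeholder for the real work, e.g. rendering an R Markdown report.
    current$status[current$report == input$report] <- "done"
    tracker(current)
    message_text(paste("Finished:", input$report))
  })

  output$tracker <- renderTable(tracker())
  output$message <- renderText(message_text())
}

app <- shinyApp(ui, server)

# Launch only when a person is at the console.
if (interactive()) runApp(app)
```

The analyst only sees the table, the dropdown, and the button, while the code that does the work stays inside the observer.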

So a good example of how this can promote growth is we had an analyst start out and say, "Hey, I'm honestly not sure what happened. I caused this error. I thought I did everything right." So I dug into it, figured out the error, fixed it, and got back to them. And they said, "I'd really be interested in what went wrong. I really want to get better at using R." So I explained the process and what went wrong and how it was fixed.

Cut to a few weeks later, and I get the following message after they hit an error: "I'm looking at the log right now. So far, I only see warnings, but I'm going to go look for the error." So now within the span of about a month, this person went from not having any idea how to figure out what caused the error to digging into the logs themselves and trying to figure out what the error was and really kind of learning on their own and trying to get better.


So with a little bit of empathy and a little bit more effort maybe building out a whole user interface, we've provided people a learning opportunity and a low-risk environment to see how powerful R code can be and see how it runs without having to open the door to maybe scary mistakes that they're not sure how to fix. So we can promote growth in users who may be interested in learning more.

Closing thoughts

So what we want to leave you with today is to think about the question, how can you pollinate your own data workflows using R? So can you do no harm? Can you make things that just work? And can you promote growth? And I would posit that you can. Thank you, thank you. But think about how you can be this bee, right? Think about how you can travel between different environments and work with different people and think about what you can learn from having empathy for the perspective of others and think about what positive impact you can leave behind.