Resources

Irene Steves | The dynamic duo: SQL & R | RStudio

There's a point in every data wrangler's career when their full dataset can no longer fit into just CSV files, and the journey to database-world begins. I reached this point about two years ago, when I transitioned from ecological research to the world of eCommerce fraud prevention. My calls to read_csv became scarcer as I came to rely more and more on databases. In this talk, I'll demonstrate how I use R and SQL to access database tables, and how I incorporate both into my daily workflow, aided by features in the RStudio IDE. I'll also discuss our company's "riskiconn" package for handling database connections and queries, which includes customizations to simplify day-to-day data querying. About Irene: Irene holds an M.Sc. in Ecology and a B.A. in Integrative Biology, through which she first discovered R and data science. Her interest in data led her to the Arctic Data Center at the University of California Santa Barbara, a summer internship at RStudio, and ultimately to the Research & Data Science department at Riskified, where she now explores the complex patterns of fraud in eCommerce. In her free time, she studies Hebrew through podcasts and dubbed kids' movies.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

About two years ago, I started at Riskified, an e-commerce fraud prevention company based in Tel Aviv, Israel. Previously, I'd worked with ecological and environmental sciences data. I'd worked with a lot of different kinds of files — shapefiles, XML, CSVs of course, sometimes thousands at a time — but I hadn't really gotten the chance to work so much with databases. So I was really excited to enter a world where there were millions of orders coming in each day. There were teams dedicated to setting up ETLs to process that incoming data and get it neat and tidy into databases for me. And all I had to do was query that data into R and start playing.

And I was excited because, you know, I had R under my belt, and R was a super multi-tool. It seemed to be able to do almost anything. I used it to create slides and websites. I used it for querying APIs and even building APIs. It did a lot for what was supposedly just a statistical tool. SQL, on the other hand, was a little bit more like a screwdriver. It's a perfectly fine tool, but it seems to be only really good at one thing, which is querying databases. And so going to Riskified, I had a little bit of this feeling of: why bother to learn SQL if I can just use R for everything?

Discovering dbplyr

One reason I was fairly confident is because I knew about this package called dbplyr, which is an extension of dplyr that runs on databases. So instead of running something on a data frame in your R session, you're actually running it on a table in a database, but it feels like it's just a data frame. And what it's doing is taking your tidyverse code, translating it into SQL, sending it off to the database, bringing it back, and then giving you this, you know, table output like you're used to, as if everything happened locally.

And so if we look at how this works, let's just take a simple R example where I have an mtcars table in a database. I take the miles-per-gallon variable, round it to create a new variable, and then I select two variables: this rounded miles per gallon, and then horsepower. If I look at the translation, it also looks fairly good. It looks somewhat similar to the R code. I'm rounding, and I get the horsepower from mtcars, the table. If I were to have written this out by hand, I probably wouldn't write in the backticks, but otherwise it looks almost the same.
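As a minimal sketch of that example, here's roughly what the round-and-select pipeline looks like, using an in-memory SQLite table as a stand-in for a real database (the exact SQL that dbplyr prints will vary by backend):

```r
library(dplyr)
library(dbplyr)

# In-memory SQLite table standing in for a real database table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# Feels like ordinary dplyr, but nothing runs locally
query <- mtcars_db %>%
  mutate(mpg_round = round(mpg)) %>%
  select(mpg_round, hp)

show_query(query)  # print the SQL that dbplyr generates
collect(query)     # actually run it and pull the result into R
```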

And sometimes using dbplyr really saves you a lot of tedious work. For example, if I want all columns except for horsepower, I can use one simple expression in dbplyr, and it generates the SQL code that literally writes out every single column except for horsepower. If I want to take the maximum across all of the different columns, I can also do it very easily in dbplyr. It's just one line of code, and it's able to generate a really long and very repetitive query for me. It's not a hard query to write, but it does take some effort, and it is very repetitive.
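Both shortcuts can be sketched like this, again against a throwaway SQLite table; show_query() reveals the long, repetitive SQL that each one-liner expands into:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# "Everything except horsepower" is one expression in dbplyr,
# but the generated SQL names every remaining column explicitly
mtcars_db %>%
  select(-hp) %>%
  show_query()

# The max of every column is one line here, one MAX(...) per column there
mtcars_db %>%
  summarise(across(everything(), max)) %>%
  show_query()
```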

Limitations of dbplyr

But the thing is, over time you also realize that what you're used to doing in regular dplyr isn't always going to work when you're connected to a database. For example, if I want to look at the number of NAs, or nulls, across every single column, I can use a one-liner in R. If I try to do the same thing on a database table, it will say "no such column .x".

It's angry at me because I'm trying to use an anonymous function, and it doesn't know how to translate the anonymous function. So what I have to do is fix this set of commands. I have to break it up into the individual steps, which is is.na first, and then the sum, and now it'll work.

Well, now it'll work for some databases, and this is where it becomes a little complicated. On the one hand, dbplyr saves you the mental load of learning SQL, a whole new language, but on the other hand, you still have to realize that you're working with databases. It does a lot of the translation to dialects for you, but you have to know what is specific to your database. In some databases, for example, is.na becomes a boolean. In other databases, is.na becomes an integer, and depending on what you're using, you might have to convert it into an integer in order to sum it up, because the database can't sum up booleans.
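A sketch of that fix on a small SQLite table. Whether the anonymous-function version translates at all, and whether the IS NULL result sums directly, genuinely depends on your dbplyr version and backend — which is exactly the juggling described here:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_db <- copy_to(con, data.frame(x = c(1, NA, 3), y = c("a", NA, NA)))

# The local-dplyr idiom that may fail to translate on some backends:
# df_db %>% summarise(across(everything(), ~ sum(is.na(.x))))

# Broken into individually translatable steps: is.na first, then sum
df_db %>%
  mutate(across(everything(), is.na)) %>%
  summarise(across(everything(), sum)) %>%
  collect()

# On backends where IS NULL yields a true boolean, you may need an
# explicit as.integer() before summing
```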

And so you have to make sure to still juggle that information in your mind as you're using dbplyr. You are using something you're used to, but you can't use it exactly in the way that you normally would. And so over time, I did really start to appreciate SQL, the screwdriver, because even screwdrivers have their moments, and sometimes it is easier to just switch over to a different language rather than try to juggle how regular dplyr differs from dbplyr code on your specific database.

And so for example, if I take what I had from before and switch out the ending, now I have a group by and summarize instead. If I were to write this out by hand, it's also a fairly simple SQL query. The ordering is a little bit different, but otherwise it has all the same components. If I now take the machine translation of this, it looks a little bit scary at first. It's going to run just fine. It's just as efficient. It's not going to have any problems, but you can see that it's not as readable. It's harder to distinguish what here is really important and what is just extra code that is used to automatically generate this query.
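For reference, the group-by-and-summarize version might look like this; show_query() is where you see the machine-generated SQL that runs fine but reads worse than the handwritten equivalent:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

mtcars_db %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()
```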

And so at some point, once you have to hand SQL code off to someone else, or if you need to really optimize for speed and put in some of those little database tricks that you know about, you have to dig into the SQL. There's just no way around it. And for me, it's kind of like using Google Translate. I live in Israel. I did not grow up speaking Hebrew. I learned the alphabet just a few years ago. And I can get around with just Google Translate. I can survive. But you also really quickly realize that you can't express yourself exactly in the way that you want to. You lose out on a lot of the nuances of the language. Sometimes you even have to alter the way you input your English sentence in order for the translation to come out okay on the other side.

SQL as a lingua franca

Speaking of language, that's one other really amazing thing about SQL. It's kind of a lingua franca among at least the technical folks in a company. Of course, the database administrators are familiar with it, but also business analysts, data scientists, software developers — everyone has at least that core level of SQL. So if you need to communicate some actions you want to perform on data, this is a really good way to communicate that through code.

And so with time, I realized really that R and SQL belong in the same toolkit. You can survive on just one or the other. But over time, it's really nice to be able to use both very well, especially if you're in this intersection of using R and databases.

Becoming bilingual in R and SQL

The nice thing is, if you know one, it's fairly easy to become bilingual. If you look at just the main verbs in each of the two languages, you see there's a lot of overlap. Select is SELECT, group by is GROUP BY. There are some things you have to learn. You have to realize filter is now WHERE or HAVING in SQL, arrange is ORDER BY, and so on and so forth. So there is a little bit of translation you have to do, but it's not too big of a leap.

And the other thing is if you're used to R, then you know that you kind of write things out in the order that they happen. So let's say I have a table, I filter by some condition. Okay, now I group by some column and get a summarized aggregate column, and then I filter by that column, for example, and then I can arrange by something.

If I were to write this out in SQL, I would start with the table, the condition, the group by. Now the select — the summarize — jumps all the way up to the top. Having is at the bottom again. Order by is also at the bottom. And so you have to remember that the ordering is a little bit different. You don't necessarily write things in the order that you would do them. But other than that, it's something that you get used to fairly quickly.
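Here's that reordering sketched side by side, with each dplyr step annotated with the SQL clause it becomes (run on a local data frame for simplicity):

```r
library(dplyr)

# dplyr: written in the order the steps actually happen
mtcars %>%
  filter(mpg > 15) %>%               # row condition      -> WHERE
  group_by(cyl) %>%                  # grouping           -> GROUP BY
  summarise(avg_hp = mean(hp)) %>%   # aggregate          -> SELECT AVG(...)
  filter(avg_hp > 100) %>%           # post-agg condition -> HAVING
  arrange(avg_hp)                    # ordering           -> ORDER BY

# Handwritten SQL: the select jumps to the top, and the rest keep
# a fixed clause order regardless of when they "happen":
#
#   SELECT cyl, AVG(hp) AS avg_hp
#   FROM mtcars
#   WHERE mpg > 15
#   GROUP BY cyl
#   HAVING AVG(hp) > 100
#   ORDER BY avg_hp
```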

SQL features in RStudio

And if you're like me and you spend a lot of time in RStudio, the nice thing is there are a lot of SQL things you can do from within that familiar environment, especially with the newer versions of the IDE. For example, if you open up a SQL file, you have nice syntax highlighting, so you can see what the verbs are and what the table names are. What I didn't realize for the longest time is that there's this line at the top that says !preview with a connection after it. This is where you can actually set a connection, and if you set a connection to the database, you get the SQL preview pane, where you can preview that SQL query within RStudio. This took me a while to figure out, but it's a really nice feature.
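In a .sql file opened in RStudio, that preview line sits at the top as a comment; the `con` here is an assumed, already-created DBI connection object in your R session:

```sql
-- !preview conn=con

SELECT *
FROM mtcars
LIMIT 10
```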

R Markdown is this kind of magical document where you can mix all kinds of different languages, and R and SQL are no exception. You'll probably start with some kind of YAML header as usual. You might have an R setup chunk where you load packages and the database connection, maybe set up a table like I did here, and then you can have your SQL query. And if you have inline mode enabled, you can actually see those previews within R Markdown and work with both R and SQL interactively within this one document. And finally, you can navigate through your database from within RStudio. You can see what databases you have — schemas, tables, columns within tables — all within that RStudio interface.
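A minimal R Markdown skeleton of that mix might look like the following; the SQLite connection is a stand-in for your real database, and output.var is the chunk option that hands the SQL result back to R as a data frame:

````markdown
---
title: "R and SQL in one document"
output: html_document
---

```{r setup}
library(DBI)
# Stand-in connection; swap in your own driver and credentials
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
```

```{sql, connection=con, output.var="big_engines"}
SELECT cyl, hp, mpg
FROM mtcars
WHERE hp > 150
```

```{r}
# The SQL result arrives as a regular R data frame
summary(big_engines)
```
````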

Real-world workflow at Riskified

And so I've talked a lot about all these cool things about R and SQL together, but what does that look like in real life? I wish I knew more about what people are doing in the wider world, out in the wild, so to speak, but I can only really speak to my experience and the experience of my colleagues at Riskified. On a regular day, I'm often switching between a SQL IDE and RStudio.

I did just talk about all these SQL things you can do within RStudio, but some are still not quite that mature, and so I really like using the SQL IDE for all of the extra auto-formatting, highlighting, and code and field suggestions, so I don't have to remember those long table names from memory. One other thing is that it's really easy to stop bad queries in a dedicated IDE. Sometimes RStudio gets stuck, and I press on that red stop button for ages, and in the end I have to just force quit RStudio and start over.

On the RStudio side, there are a lot of nice R features, obviously: visualizations, reporting, being able to mix languages in the same document, as we mentioned earlier. But I think maybe most importantly, you have this ability to organize your analysis project and, on top of that, have version control. That means you have your SQL files, your data files, your R files, maybe R Markdown files, all together in one space. It's a project that you can now easily give to someone else and have them work on it and have everything work, and on top of that, everything is version controlled, so you can connect directly to GitHub and have it all sit there as well.

And so, speaking of project structure, if you were to take a walk through the GitHub repositories at Riskified, you would often encounter a folder structure like this: you have SQL files, where you're reading in relevant data, doing some initial data processing, and getting that initial sample population ready for analysis; then you have your R files, where you do all the complex steps, the stats, visualizations, etc. Sometimes you're bringing in extra information from the database in between with smaller queries, and then finally, you have this R Markdown report, where you share your results and conclusions with stakeholders, and this is where both of those pieces really come together into one final product.

The riskiconn package

And because we're using R and SQL so much within Riskified, we actually have a dedicated package to handle those database connections and also just add little helpers that help us with our day-to-day flow. For example, we keep all our configurations within this package. You know, anything that's not too sensitive, like a host name, port, etc., we just stick into the package. If we need to update it, we just update it in that one place, and as long as people reinstall the package, it gets updated across all the computers. The other thing is that we basically just use one database, at least among the analytical teams, and so unless you explicitly provide a database connection, we assume we know what you want, and so that's built into our very own getQuery function.

There was one point recently where we actually had to migrate from one database to another. It was a bit of a slog, but one thing that did help smooth it over a little bit was an option we added to switch between the old database and the new database. You just had to do it once at the beginning of your R session, and it would last for all of your getQuery calls, so you didn't have to go in and start changing your code to specify the database connection for every single query.

One small thing that we do a lot is caching query results. The first time, you query the database directly; after that, the result is stored in a temporary file. It's cached, ready for reading, and unless you force it to connect to the database again, you get a much quicker response the second time around, and that saves a lot of time as you're iterating through your analysis.
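riskiconn itself is internal to Riskified, so here is only a hypothetical sketch of the caching idea; the function name, arguments, and cache location are all made up for illustration, not the real package:

```r
library(DBI)
library(digest)  # hash the query text to key the cache

# Hypothetical sketch, not the real riskiconn implementation
cached_query <- function(con, sql, refresh = FALSE, cache_dir = tempdir()) {
  # One cache file per distinct query string
  cache_file <- file.path(cache_dir, paste0(digest(sql), ".rds"))

  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))       # fast path: reuse the stored result
  }

  result <- dbGetQuery(con, sql)      # slow path: hit the database
  saveRDS(result, cache_file)
  result
}
```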

And finally, we also have other little helpers, like access to pipelines for bringing large amounts of data from R to the database or the other way around. Sometimes insert statements just don't cut it, and for us, what's easier is to dump the data into some cloud storage and then bring it directly into the database, or to go the other way around.

Conclusion

And so to conclude: R and SQL both have their strengths. R is great for summarizing across columns, selecting specific columns, and translating to multiple dialects if you're working with several databases. SQL, on the other hand, is really needed when you need to optimize for speed, or for readability, because you're handing the code off to another team to review.

And, you know, there's still a lot to learn, I'm sure. I'm still on this journey of learning how to use R and SQL well together, but I would love to hear from you too. Is your story similar to mine, or different? My contact details are here, and thank you so much for listening today.