Resources

Irene Steves | The dynamic duo: SQL & R | RStudio

There's a point in every data wrangler's career when their full dataset can no longer fit into just CSV files, and the journey to database-world begins. I reached this point about two years ago, when I transitioned from ecological research to the world of eCommerce fraud prevention. My calls to read_csv became scarcer as I came to rely more and more on databases. In this talk, I'll demonstrate how I use R and SQL to access database tables, and how I incorporate both into my daily workflow, aided by features in the RStudio IDE. I'll also discuss our company's "riskiconn" package for handling database connections and queries, which includes customizations to simplify day-to-day data querying. About Irene: Irene holds an M.Sc. in Ecology and a B.A. in Integrative Biology, through which she first discovered R and data science. Her interest in data led her to the Arctic Data Center at the University of California Santa Barbara, a summer internship at RStudio, and ultimately to the Research & Data Science department at Riskified, where she now explores the complex patterns of fraud in eCommerce. In her free time, she studies Hebrew through podcasts and dubbed kids' movies.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

About two years ago, I started at Riskified, an e-commerce fraud prevention company based in Tel Aviv, Israel. Previously, I'd worked with ecological and environmental sciences data. I'd worked with a lot of different kinds of files — shapefiles, XML, CSVs of course, sometimes thousands at a time — but I hadn't really gotten the chance to work so much with databases. So I was really excited to enter a world where there were millions of orders coming in each day. There were teams dedicated to setting up ETLs to process that incoming data and get it neat and tidy into databases for me. And all I had to do was query that data into R and start playing.

And I was excited because, you know, I had R under my belt, and R was a super multi-tool. It seemed to be able to do almost anything. I used it to create slides and websites. I used it for querying APIs and even building APIs. It did a lot for what was supposedly just a statistical tool. SQL, on the other hand, was a little bit more like a screwdriver. It's a perfectly fine tool, but it seems to be only really good at one thing, which is querying databases. And so going to Riskified, I had a little bit of this feeling of: why bother to learn SQL if I can just use R for everything?

Discovering dbplyr

One reason I was fairly confident is because I knew about this package called dbplyr, which is an extension of dplyr that runs on databases. So instead of running something on a data frame in your R session, you're actually running it on a table in a database, but it feels like it's just a data frame. And what it's doing is taking your tidyverse code, translating it into SQL, sending it off to the database, bringing it back, and then giving you this, you know, table output like you're used to, as if everything happened locally.

And so if we look at how this works, let's just take a simple R example where I have an mtcars table in a database. I take the miles-per-gallon variable, round it to create a new variable, and then I select two variables: this rounded miles per gallon, and then horsepower. If I look at the translation, it also looks fairly good. It looks somewhat similar to the R code. I'm rounding, and I get the horsepower from mtcars, the table. If I were to have written this out by hand, I probably wouldn't write in the backticks, but otherwise it looks almost the same.
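As a minimal sketch of that example, here's roughly what the round-and-select pipeline looks like, using an in-memory SQLite table as a stand-in for a real database (the exact SQL that dbplyr prints will vary by backend):

```r
library(dplyr)
library(dbplyr)

# In-memory SQLite table standing in for a real database table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# Feels like ordinary dplyr, but nothing runs locally
query <- mtcars_db %>%
  mutate(mpg_round = round(mpg)) %>%
  select(mpg_round, hp)

show_query(query)  # print the SQL that dbplyr generates
collect(query)     # actually run it and pull the result into R
```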

And sometimes using dbplyr really saves you a lot of tedious work. For example, if I want all columns except for horsepower, I can use one simple expression in dbplyr, and it generates the SQL code that literally writes out every single column except for horsepower. If I want to take the maximum across all of the different columns, I can also do it very easily in dbplyr. It's just one line of code, and it's able to generate a really long and very repetitive query for me. It's not a hard query to write, but it does take some effort, and it is very repetitive.
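Both shortcuts can be sketched like this, again against a throwaway SQLite table; show_query() reveals the long, repetitive SQL that each one-liner expands into:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

# "Everything except horsepower" is one expression in dbplyr,
# but the generated SQL names every remaining column explicitly
mtcars_db %>%
  select(-hp) %>%
  show_query()

# The max of every column is one line here, one MAX(...) per column there
mtcars_db %>%
  summarise(across(everything(), max)) %>%
  show_query()
```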

Limitations of dbplyr

But the thing is, over time you also realize that what you're used to doing in regular dplyr isn't always going to work when you're connected to a database. For example, if I want to look at the number of NAs, or nulls, across every single column, I can use a one-liner in R. If I try to do the same thing on a database table, it will say "no such column .x".

It's angry at me because I'm trying to use an anonymous function, and it doesn't know how to translate the anonymous function. So what I have to do is fix this set of commands. I have to break it up into the individual steps, which is is.na first, and then the sum, and now it'll work.

Well, now it'll work for some databases, and this is where it becomes a little complicated. On the one hand, dbplyr saves you the mental load of learning SQL, a whole new language, but on the other hand, you still have to realize that you're working with databases. It does a lot of the translation to dialects for you, but you have to know what is specific to your database. In some databases, for example, is.na becomes a boolean. In other databases, is.na becomes an integer, and depending on what you're using, you might have to convert it into an integer in order to sum it up, because the database can't sum up booleans.
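A sketch of that fix on a small SQLite table. Whether the anonymous-function version translates at all, and whether the IS NULL result sums directly, genuinely depends on your dbplyr version and backend — which is exactly the juggling described here:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
df_db <- copy_to(con, data.frame(x = c(1, NA, 3), y = c("a", NA, NA)))

# The local-dplyr idiom that may fail to translate on some backends:
# df_db %>% summarise(across(everything(), ~ sum(is.na(.x))))

# Broken into individually translatable steps: is.na first, then sum
df_db %>%
  mutate(across(everything(), is.na)) %>%
  summarise(across(everything(), sum)) %>%
  collect()

# On backends where IS NULL yields a true boolean, you may need an
# explicit as.integer() before summing
```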

And so you have to make sure to still juggle that information in your mind as you're using dbplyr. You are using something you're used to, but you can't use it exactly in the way that you normally would. And so over time, I did really start to appreciate SQL, the screwdriver, because even screwdrivers have their moments, and sometimes it is easier to just switch over to a different language rather than try to juggle how regular dplyr differs from dbplyr code on your specific database.

And so for example, if I take what I had from before and switch out the ending, now I have a group by and summarize instead. If I were to write this out by hand, it's also a fairly simple SQL query. The ordering is a little bit different, but otherwise it has all the same components. If I now take the machine translation of this, it looks a little bit scary at first. It's going to run just fine. It's just as efficient. It's not going to have any problems, but you can see that it's not as readable. It's harder to distinguish what here is really important and what is just extra code that is used to automatically generate this query.
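For reference, the group-by-and-summarize version might look like this; show_query() is where you see the machine-generated SQL that runs fine but reads worse than the handwritten equivalent:

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
mtcars_db <- copy_to(con, mtcars)

mtcars_db %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  show_query()
```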

And so at some point, once you have to hand SQL code off to someone else, or if you need to really optimize for speed and put in some of those little database tricks that you know about, you have to dig into the SQL. There's just no way around it. And for me, it's kind of like using Google Translate. I live in Israel. I did not grow up speaking Hebrew. I learned the alphabet just a few years ago. And I can get around with just Google Translate. I can survive. But you also really quickly realize that you can't express yourself exactly in the way that you want to. You lose out on a lot of the nuances of the language. Sometimes you even have to alter the way you input your English sentence in order for the translation to come out okay on the other side.

SQL as a lingua franca

Speaking of language, that's one other really amazing thing about SQL. It's kind of a lingua franca among at least the technical folks in a company. Of course, the database administrators are familiar with it, but also business analysts, data scientists, software developers — everyone has at least that core level of SQL. So if you need to communicate some actions you want to perform on data, this is a really good way to communicate that through code.

And so with time, I realized really that R and SQL belong in the same toolkit. You can survive on just one or the other. But over time, it's really nice to be able to use both very well, especially if you're in this intersection of using R and databases.

Becoming bilingual in R and SQL

The nice thing is, if you know one, it's fairly easy to become bilingual. If you look at just the main verbs in each of the two languages, you see there's a lot of overlap. Select is SELECT, group by is GROUP BY. There are some things you have to learn. You have to realize filter is now WHERE or HAVING in SQL, arrange is ORDER BY, and so on and so forth. So there is a little bit of translation you have to do, but it's not too big of a leap.

And the other thing is if you're used to R, then you know that you kind of write things out in the order that they happen. So let's say I have a table, I filter by some condition. Okay, now I group by some column and get a summarized aggregate column, and then I filter by that column, for example, and then I can arrange by something.

If I were to write this out in SQL, I would start with the table, the condition, the group by. Now the select — the summarize — jumps all the way up to the top. Having is at the bottom again. Order by is also at the bottom. And so you have to remember that the ordering is a little bit different. You don't necessarily write things in the order that you would do them. But other than that, it's something that you get used to fairly quickly.
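Here's that reordering sketched side by side, with each dplyr step annotated with the SQL clause it becomes (run on a local data frame for simplicity):

```r
library(dplyr)

# dplyr: written in the order the steps actually happen
mtcars %>%
  filter(mpg > 15) %>%               # row condition      -> WHERE
  group_by(cyl) %>%                  # grouping           -> GROUP BY
  summarise(avg_hp = mean(hp)) %>%   # aggregate          -> SELECT AVG(...)
  filter(avg_hp > 100) %>%           # post-agg condition -> HAVING
  arrange(avg_hp)                    # ordering           -> ORDER BY

# Handwritten SQL: the select jumps to the top, and the rest keep
# a fixed clause order regardless of when they "happen":
#
#   SELECT cyl, AVG(hp) AS avg_hp
#   FROM mtcars
#   WHERE mpg > 15
#   GROUP BY cyl
#   HAVING AVG(hp) > 100
#   ORDER BY avg_hp
```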

SQL features in RStudio

And if you're like me and you spend a lot of time in RStudio, the nice thing is there are a lot of SQL things you can do from within that familiar environment, especially with the newer versions of the IDE. For example, if you open up a SQL file, you have nice syntax highlighting, so you can see what the verbs are and what the table names are. What I didn't realize for the longest time is that there's this line at the top that says !preview with a connection after it. This is where you can actually set a connection, and if you set a connection to the database, you get the SQL preview pane, where you can preview that SQL query within RStudio. This took me a while to figure out, but it's a really nice feature.
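In a .sql file opened in RStudio, that preview line sits at the top as a comment; the `con` here is an assumed, already-created DBI connection object in your R session:

```sql
-- !preview conn=con

SELECT *
FROM mtcars
LIMIT 10
```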

R Markdown is this kind of magical document where you can mix all kinds of different languages, and R and SQL are no exception. You'll probably start with some kind of YAML header as usual. You might have an R setup chunk where you load packages and the database connection, maybe set up a table like I did here, and then you can have your SQL query. And if you have inline mode enabled, you can actually see those previews within R Markdown and work with both R and SQL interactively within this one document. And finally, you can navigate through your database from within RStudio. You can see what databases you have — schemas, tables, columns within tables — all within that RStudio interface.
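A minimal R Markdown skeleton of that mix might look like the following; the SQLite connection is a stand-in for your real database, and output.var is the chunk option that hands the SQL result back to R as a data frame:

````markdown
---
title: "R and SQL in one document"
output: html_document
---

```{r setup}
library(DBI)
# Stand-in connection; swap in your own driver and credentials
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)
```

```{sql, connection=con, output.var="big_engines"}
SELECT cyl, hp, mpg
FROM mtcars
WHERE hp > 150
```

```{r}
# The SQL result arrives as a regular R data frame
summary(big_engines)
```
````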

Real-world workflow at Riskified

And so I've talked a lot about all these cool things about R and SQL together, but what does that look like in real life? I wish I knew more about what people are doing in the wider world, out in the wild, so to speak, but I can only really speak to my experience and the experience of my colleagues at Riskified. On a regular day, I'm often switching between a SQL IDE and RStudio.

I did just talk about all these SQL things you can do within RStudio, but some are still not quite that mature, and so I really like using the SQL IDE for all of the extra auto-formatting, highlighting, and code and field suggestions, so I don't have to remember those long table names from memory. One other thing is that it's really easy to stop bad queries in a dedicated IDE. Sometimes RStudio gets stuck, and I press on that red stop button for ages, and in the end I have to just force quit RStudio and start over.

On the RStudio side, there are a lot of nice R features, obviously: visualizations, reporting, being able to mix languages in the same document, as we mentioned earlier. But I think maybe most importantly, you have this ability to organize your analysis project and, on top of that, have version control. That means you have your SQL files, your data files, your R files, maybe R Markdown files, all together in one space. It's a project that you can now easily give to someone else and have them work on it and have everything work, and on top of that, everything is version controlled, so you can connect directly to GitHub and have it all sit there as well.

And so, speaking of project structure, if you were to take a walk through the GitHub repositories at Riskified, you would often encounter a folder structure like this: you have SQL files, where you're reading in relevant data, doing some initial data processing, and getting that initial sample population ready for analysis; then you have your R files, where you do all the complex steps, the stats, visualizations, etc. Sometimes you're bringing in extra information from the database in between with smaller queries, and then finally, you have this R Markdown report, where you share your results and conclusions with stakeholders, and this is where both of those pieces really come together into one final product.

The riskiconn package

And because we're using R and SQL so much within Riskified, we actually have a dedicated package to handle those database connections and also just add little helpers that help us with our day-to-day flow. For example, we keep all our configurations within this package. You know, anything that's not too sensitive, like a host name, port, etc., we just stick into the package. If we need to update it, we just update it in that one place, and as long as people reinstall the package, it gets updated across all the computers. The other thing is that we basically just use one database, at least among the analytical teams, and so unless you explicitly provide a database connection, we assume we know what you want, and so that's built into our very own getQuery function.

There was one point recently where we actually had to migrate from one database to another. It was a bit of a slog, but one thing that did help smooth it over a little bit was an option we added to switch between the old database and the new database. You just had to do it once at the beginning of your R session, and it would last for all of your getQuery calls, so you didn't have to go in and start changing your code to specify the database connection for every single query.

One small thing that we do a lot is caching query results. The first time, you query the database directly; after that, the result is stored in a temporary file. It's cached, ready for reading, and unless you force it to connect to the database again, you get a much quicker response the second time around, and that saves a lot of time as you're iterating through your analysis.
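riskiconn itself is internal to Riskified, so here is only a hypothetical sketch of the caching idea; the function name, arguments, and cache location are all made up for illustration, not the real package:

```r
library(DBI)
library(digest)  # hash the query text to key the cache

# Hypothetical sketch, not the real riskiconn implementation
cached_query <- function(con, sql, refresh = FALSE, cache_dir = tempdir()) {
  # One cache file per distinct query string
  cache_file <- file.path(cache_dir, paste0(digest(sql), ".rds"))

  if (!refresh && file.exists(cache_file)) {
    return(readRDS(cache_file))       # fast path: reuse the stored result
  }

  result <- dbGetQuery(con, sql)      # slow path: hit the database
  saveRDS(result, cache_file)
  result
}
```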

And finally, we also have other little helpers, like access to pipelines for bringing large amounts of data from R to the database or the other way around. Sometimes insert statements just don't cut it, and for us, what's easier is to dump the data into some cloud storage and then bring it directly into the database, or to go the other way around.

Conclusion

And so to conclude: R and SQL both have their strengths. R is great for summarizing across columns, selecting specific columns, and translating to multiple dialects if you're working with several databases. SQL, on the other hand, is really needed when you need to optimize for speed, or for readability, because you're handing the code off to another team to review.

And, you know, there's still a lot to learn, I'm sure. I'm still on this journey of learning how to use R and SQL well together, but I would love to hear from you too. Is your story similar to mine, or different? My contact details are here, and thank you so much for listening today.