Resources

Ian Cook | Bridging the Gap between SQL and R | RStudio (2020)

Ian Cook | January 31, 2020

Like it or not, SQL is the closest thing we have to a universal language for working with structured data. Celebrating its 50th birthday in 2020, SQL today integrates with thousands of applications and has millions of users worldwide. Data analysts using SQL represent a large audience of potential R users motivated to expand their data science skills. But learning R can be frustrating for SQL users. One major frustration is the inability to directly query R data frames with SQL SELECT statements. Eager to use R for tasks that are not possible with SQL (like data visualization and machine learning), these users are dismayed to find that they must first learn an unfamiliar syntax for data manipulation. The popularity of the sqldf package (which automatically exports an R data frame into an embedded database, then runs a SQL query on it) demonstrates this frustration. But now there is a way to directly query an R data frame without moving the data out of R. In this talk, I introduce tidyquery, a new R package that runs SQL queries directly on R data frames. tidyquery is powered by dplyr and by queryparser, a new pure-R, no-dependency SQL query parser.


Transcript

This transcript was generated automatically and may contain errors.

Hi, good afternoon.

So 2020, this year, marks the 50th birthday of SQL. SQL, or S-Q-L, if you prefer, is, of course, the database query language that's ubiquitous today. Its origins date back to 1970, when Edgar F. Codd at IBM published this seminal paper that established the theoretical basis for SQL. It took a few more years after this, actually, before the language as we know it today came into being. R, by comparison, is younger. The first version dates back to the early 90s, although its precursor, S, also dates back to the 70s.

So let's do a show of hands. Raise your hand if you know at least some SQL. Most people in this room, even here at an R conference, it seems like most people learned SQL first, or at least a good share. In the broader population, vastly more people use SQL than R. And most people who use R learn SQL first.

So my story with R and SQL is I first learned R in graduate school for statistics. But then when I finished my degree and started working in industry, I started having to use SQL a lot. And I've used both R and SQL a great deal ever since. But over the last year or so, I really leveled up my SQL when I created a SQL-focused Coursera course.

SQL and dplyr: two sides of the same coin

So if you know some SQL and you know R and the tidyverse, one of the things you'll notice is that SQL and dplyr are really like two implementations of the same abstract concept. And they have a whole lot in common.

First of all, a table in SQL is essentially equivalent to the concept of a tidy data frame in R. So each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. That's what we call tidy data in R. But in the world of SQL, that is really the essence of Edgar F. Codd's third normal form, which is the most widely used way of structuring the tables in a database.

And once you have your data in this kind of structure, whether you call it tidy or third normal form, the types of operations you can apply to the data are basically the same in dplyr as in SQL.

So dplyr, as I'm sure most of us know, provides a set of functions, so-called verbs. And the idea is that you can use just a few of these to do most of the kinds of data manipulation that you'd ever need to do. So when Hadley created dplyr, essentially what he did was he took the semantics of the SQL SELECT statement and he brought it into R. Of course, the syntax is different, and Hadley did depart from SQL in a few notable ways. But the similarities are really striking.

So if you're familiar with SQL, then you'll recall the clauses of the SQL SELECT statement. SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT are the main clauses. And each of these maps onto a dplyr verb. But it's not exactly a clean one-to-one mapping.

The FROM clause in SQL, I'll start there, just specifies what table to use. So in dplyr, you do that by specifying just the name of a data frame. The SELECT clause in SQL has several different roles, so it corresponds to different dplyr verbs, depending on what you're trying to do. If you're just grabbing existing columns, then that's select() in dplyr. If you're using expressions to define new columns, that's mutate() or transmute(). And if you're using aggregate functions, in that case SELECT corresponds to summarize() in dplyr.

The WHERE clause in a SQL SELECT statement corresponds to filter(). GROUP BY corresponds very nicely to group_by(). HAVING in SQL, that's the clause that you use to filter groups after aggregating, and in dplyr you would use filter() for that too, but after the aggregation. Then ORDER BY, that corresponds to arrange() in dplyr. And finally, LIMIT, that's just head(), or slice() if you prefer.
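To make the mapping concrete, here is a sketch of one query written both ways. The column names come from the iris data set used later in the talk; the thresholds are illustrative, and note that a real SQL engine would require quoting identifiers like Sepal.Length that contain dots.

```r
library(dplyr)

# SQL:
#   SELECT Species, AVG("Sepal.Length") AS avg_len   -- aggregates -> summarize()
#   FROM iris                                        -- FROM -> the data frame
#   WHERE "Sepal.Width" > 3                          -- WHERE -> filter() before grouping
#   GROUP BY Species                                 -- GROUP BY -> group_by()
#   HAVING avg_len > 5                               -- HAVING -> filter() after summarize()
#   ORDER BY avg_len DESC                            -- ORDER BY -> arrange()
#   LIMIT 2                                          -- LIMIT -> head()

iris %>%
  filter(Sepal.Width > 3) %>%
  group_by(Species) %>%
  summarize(avg_len = mean(Sepal.Length)) %>%
  filter(avg_len > 5) %>%
  arrange(desc(avg_len)) %>%
  head(2)
```

Note how HAVING has no dedicated verb: it is just filter() applied after the aggregation, which is why the clause-to-verb mapping is not one-to-one.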

And incidentally, you know, it's not just dplyr and SQL that share this underlying abstraction. This has proven to be such a successful one that it's been widely adopted. pandas in Python, Spark in the big data world, and many, many more tools for working with data all follow the same semantics to varying degrees.

dbplyr: translating dplyr to SQL

Anyway, one of the great things about this common abstraction underlying both SQL and dplyr is that it's enabled dplyr code to be translated into SQL, allowing you to use dplyr to query databases. And this is enabled by the package dbplyr.

With dbplyr, you can connect to a database, refer to a specific table in the database, and then chain on dplyr verbs, just as if you were working with an in-memory R data frame. And what dbplyr does is it translates all those dplyr operations into a SQL SELECT statement, sometimes with the help of a custom backend package specific to that SQL engine. And then finally, it uses the DBI package to run that SQL SELECT statement on the database and return the result back to R.

So you first define a variable. It's effectively just a pointer to a database table. So in this example, this is just creating a small in-memory SQLite database, and then it's copying the iris data frame to it. Then you can use dplyr on that table in the SQL engine, just as if it were an in-memory data frame. And if you're ever curious about what SQL code your dplyr code is being translated into, dbplyr has a handy function, show_query(), that will show you the SQL that it's generating.
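The slides aren't reproduced here, but the pattern described above looks roughly like this sketch, assuming the RSQLite package is installed:

```r
library(dplyr)
library(dbplyr)

# A small in-memory SQLite database, then copy the iris data frame into it
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
iris_db <- copy_to(con, iris)   # iris_db is effectively a pointer to the table

# Use dplyr verbs as if iris_db were a local data frame
result <- iris_db %>%
  group_by(Species) %>%
  summarize(avg_len = mean(Sepal.Length, na.rm = TRUE))

show_query(result)   # prints the SQL SELECT statement dbplyr generates
collect(result)      # runs the query on the database, returns an R data frame
```

Nothing executes on the database until you ask for the result; show_query() lets you inspect the translation at any point in the chain.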

The idea: translating SQL to dplyr

So I've been using dbplyr for several years. I've made a few small contributions to it, and I developed a custom backend package for it. But I've always wondered, like, what about going in the other direction? What about translating SQL to dplyr code? I always thought it would be really cool to have this. It could enable you to use SQL statements to query dataframes in R. And I searched all around, and there was no existing package that did quite this, SQL to dplyr translation.

Well, there's clear evidence that people want this, in the popularity of another R package, sqldf. This is a well-established package. It's very popular, created by Gabor Grothendieck. And it lets you run SQL queries on R data frames. But to do this, it does not parse your SQL statement. Instead, it uses a clever trick. The package includes an embedded database. It loads your R data frame into the database, then runs SQL queries on the database, and then returns the result as an R data frame.

So here's an example of sqldf at work. The main function has the same name as the package, sqldf. And you pass it just one argument, which is a SQL statement enclosed in quotes. The one part of this SQL statement that it does look at is the FROM clause, to identify which data frames you're referring to, and it loads those into the database. But it doesn't do any further parsing of the SQL statement. It just passes it over to the database to run.
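A minimal sketch of that usage, assuming the sqldf package is installed (the query itself is illustrative):

```r
library(sqldf)

# sqldf reads the FROM clause to find the data frame ("iris"),
# copies it into an embedded SQLite database, runs the query there,
# and returns the result as an R data frame
sqldf("SELECT Species, COUNT(*) AS n
       FROM iris
       GROUP BY Species")
```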

I think sqldf is really cool. It eliminates an obstacle for users who are proficient in SQL and not so proficient in dplyr, and who just need to get some data manipulation done in R. So I work as a data science and machine learning educator at Cloudera. And I encounter a lot of learners who just love this, they eat it up. They typically have some experience with SQL, and they're really excited that they can reuse those skills inside R to do the kind of boring data manipulation piece, and they can move on to the more exciting stuff, like machine learning and ggplot2 and shiny apps, which were, you know, the reasons they were excited to learn R in the first place.

But there are a couple of fundamental limitations to sqldf, just inherent to the way it works, that it's important to recognize. Like I said, it works by copying data out of R into a database. That's okay for smaller, maybe medium-sized data, though it does introduce a bit of latency on the first query for a particular data frame, because it has to copy it. With bigger data, however, it could be really impractical to make a copy like this. And architecturally, copying data to a whole different system like this is a bit of an anti-pattern.

Second thing, what sqldf provides feels less like a bridge from SQL to R and more like a dead end. You could just refuse to learn dplyr and do all your data manipulation in R using SQL. But I think it's important to learn dplyr at some point.


Building queryparser

So a few months ago, I set out to create something that doesn't have these shortcomings, and it was immediately apparent to me that this problem breaks down into two separate pieces. One, parsing a SQL query into some kind of abstract representation of the data manipulation specified in the query. And then two, actually applying those data manipulation steps to the data in R data frames.

So the first challenge then was to create a SQL query parser for R. I mean, how hard could that be? It turns out it's pretty hard. So to achieve it, I ended up building kind of a Rube Goldberg machine.

I created this package called queryparser, and it's on CRAN. The main function in queryparser is named parse_query(). It takes a quoted SQL SELECT statement as input, and what it returns is a list with sublists that are named according to the structure of the SELECT statement. So you can see in this example, using the iris table, the SQL statement has a SELECT clause, a FROM clause, and a GROUP BY clause, and the list returned by parse_query() has sublists that are named the same.

And then the items within the sublists are unevaluated R expressions. In this example, see how there are two expressions in the SELECT list in the SQL query? One returns the Species column; one computes the average of the Sepal.Length column for each species. These have been translated from SQL to R, and each is returned within the select sublist as an unevaluated R expression.
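The example from the slide looks roughly like this (the exact printed structure may differ slightly by queryparser version):

```r
library(queryparser)

parse_query("SELECT Species, AVG(Sepal.Length)
             FROM iris
             GROUP BY Species")
# Returns a list with $select, $from, and $group_by sublists.
# The $select sublist holds unevaluated R expressions, with the
# SQL aggregate AVG(Sepal.Length) translated to mean(Sepal.Length).
```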

So I'm really proud of a few things about how queryparser works. One, it is implemented entirely as R code, with no imports, no depends, no compiled code. It's just R code calling base R functions. And this makes it easy to maintain. There are no dependencies to break over time.

However, there is an option in parse_query() to use functions from several tidyverse packages in the translations that it outputs. This option defaults to FALSE, but you can set it to TRUE. So in this example, for instance, COUNT(*) translates to the dplyr function n() to count the number of rows, which is handy in many cases.

queryparser performs semantic translations, not literal translations. A good example of this is how it translates the SQL aggregate expression SUM(1). This is commonly used by folks in the SQL community just to count rows; it does the same thing as COUNT(*). But the sum() function in R doesn't work like that, so queryparser translates this to the dplyr n() function.
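For instance, a quick sketch of that semantic translation, using the tidyverse option described above:

```r
library(queryparser)

# With tidyverse = TRUE, row-counting idioms are translated semantically:
# both COUNT(*) and SUM(1) count rows, so both map to dplyr's n(),
# not to a literal sum(1)
parse_query("SELECT COUNT(*) FROM iris", tidyverse = TRUE)
parse_query("SELECT SUM(1) FROM iris", tidyverse = TRUE)
```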

Also, the parse_query() function checks all the expressions in all the clauses of the query that you pass to it against a list of allowed functions and operators. And it throws an error like this if there's a function or operator that's not on that list. The reason I did this is because I'm imagining that someone will eventually build this into a Shiny app where there's a box that allows users to input a SQL query. And in that case, queryparser would prevent any malicious inputs from getting through.

So the details about how all this works are enough to fill a whole different talk. But in the simplest sense, it kind of goes like this. First, it splits the query into clauses. Then it splits the clauses into strings containing different SQL expressions. And then it manipulates the strings to change them from SQL expressions to R-ish expressions. Things that R will recognize as R code.

This is the most laborious and unsatisfying part of the whole process. The code in QueryParser needs to step through every single expression, character by character, keeping track of when it's inside quotes, when it's inside parentheses, looking for different kinds of SQL syntax and replacing them with the equivalent R syntax. It uses this thing called raw connections inside base R to do that. It's not particularly elegant, but it works.

The next step is it takes those R-like expressions and it parses them using str2lang(), which creates an unevaluated R expression from a string containing R code. And now this is the step that is satisfying. After that, what happens is this magical combination of quoting, substitution, unquoting, deparsing, string manipulation, and reparsing. Some of this is what you might call quasiquotation in an rlang context. But here it's all done with base R. And this is the part that I'd love to do a whole talk about, because it's sort of fascinating how it all works.
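A heavily simplified sketch of that string-to-expression step, using only base R (this is not the actual queryparser internals, just an illustration of the mechanism; str2lang() requires R 3.6 or later):

```r
# A SQL expression arrives as a string
sql_expr <- "AVG(Sepal.Length)"

# String manipulation: rewrite the SQL function name as its R equivalent
r_code <- sub("AVG(", "mean(", sql_expr, fixed = TRUE)

# str2lang() turns the string of R code into an unevaluated R expression
expr <- str2lang(r_code)
expr                       # mean(Sepal.Length)

# Later, the expression can be evaluated in the context of a data frame
eval(expr, envir = iris)   # the mean of iris$Sepal.Length
```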

After that, I have to do a few more manipulations of the abstract syntax tree to make everything work exactly right. Then I perform some basic validity checks and return that list that you saw in the previous slide. All this takes only about 100 milliseconds for a simple query, maybe 200 or 300 milliseconds for a more complex query. It could be much faster, for sure. But to the end user, just parsing one query, it feels instantaneous.

Introducing tidyquery

So moving on to the second piece, tidyquery. This is also on CRAN. And this is the package that most end users will actually use. The main function in tidyquery is named query(). It takes a quoted SELECT statement as input, and it uses the FROM clause to identify which data frame you want to query. Then it uses the rest of the SELECT statement to establish how to manipulate that data frame, performs those manipulations, and returns a new data frame as the result. So basically, if you're familiar with sqldf, then you use tidyquery just the same way. But of course, it works differently under the hood, as I'll describe.

But first, there's another option. You can also use this query() function like a dplyr verb, in a chain with other dplyr verbs above it and below it. The thing is, you just have to leave out the FROM clause, because then the data that's piped in from above takes the place of the FROM clause.
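A sketch of both usages, assuming the tidyquery package is installed (the queries themselves are illustrative):

```r
library(tidyquery)
library(dplyr)

# Standalone: the FROM clause names the data frame to query
query("SELECT Species, COUNT(*) AS n
       FROM iris
       GROUP BY Species")

# As a verb in a pipe: omit the FROM clause,
# and the piped-in data takes its place
iris %>%
  filter(Petal.Length > 4) %>%
  query("SELECT Species, COUNT(*) AS n GROUP BY Species") %>%
  arrange(desc(n))
```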

So despite the syntax being similar to sqldf, under the hood this query() function in tidyquery works totally differently. The first thing it does is call the parse_query() function in queryparser. Then, getting back that list, it determines which sequence of dplyr verbs it needs to call to execute the query. And then it uses dplyr, with the help of rlang, to evaluate the expressions in the context of a data frame.

So if you remember the earlier slide about dbplyr, essentially, tidyquery is like the inverse of dbplyr, which Hadley was quick to note.

Currently queryparser and tidyquery work for the most common types of SQL queries. They have some limitations. I'll not go through these point by point, in the interest of time. But I do have plans to implement all of this, though I don't have any specific time frames yet. If you feel strongly that you'd like to see one of these features implemented sooner rather than later, let me know. I've talked to several people at the conference who voted for subqueries.

So one of the things I'm really excited about is that tidyquery doesn't work only on R data frames. It can work on any data object that dplyr can work on. So for example, data.table. With Hadley's recent overhaul of the dtplyr package, you can use dplyr code with data.table, and it uses lazy evaluation to run efficiently on data.table. So tidyquery preserves that lazy evaluation.

So basically any time dplyr works with some data source, tidyquery can work with it. I might have to make a couple of little changes to recognize the class name, but it'll all work. And it won't move the data. It'll query it in place. I'm excited to try this with the new version of the Apache Arrow R package that Neal Richardson spoke about yesterday, which now supports dplyr verbs. I haven't tried it, but I'll do that soon.
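As a hedged sketch of the data.table case described above (assuming the dtplyr and tidyquery packages are installed, and that query() dispatches through dplyr's generics as the talk describes):

```r
library(dtplyr)
library(dplyr)
library(tidyquery)

# Wrap iris as a lazy data.table; dtplyr translates dplyr verbs
# to data.table operations and defers execution
iris_dt <- lazy_dt(iris)

# query() pipes in the lazy data.table in place of a FROM clause,
# so the translation stays lazy until the result is collected
result <- iris_dt %>%
  query("SELECT Species, AVG(Petal.Length) AS avg_petal GROUP BY Species")

as_tibble(result)   # forces evaluation via data.table
```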

Showing dplyr translations

There's one more thing about tidyquery that I'm really excited about. Besides running your SQL query, tidyquery can also show you the dplyr code that it translated your query into. To do this, you use the function show_dplyr(). And here's a simple example. This is exactly the same dplyr code that would run if you were to call the query() function.
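A minimal sketch of tidyquery's translation display (the query text is illustrative):

```r
library(tidyquery)

# Instead of running the query, show_dplyr() prints the dplyr code
# that the SQL translates into
show_dplyr("SELECT Species, AVG(Sepal.Length) AS avg_len
            FROM iris
            WHERE Sepal.Width > 3
            GROUP BY Species")
```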

And internally, there's actually just one main function behind both of these. And that function has a great deal of conditional logic to generate the simplest possible dplyr code that it can, given your query. So the dplyr code ends up being basically what you would get if you wrote it as a human.

And here's a more complex example. Lest you think I'm cherry-picking simple examples that will work: no, it can translate some really complicated queries, with joins and so forth. And I think for a lot of people, being able to see instant translations from a language they do understand to one they don't is a really effective means of learning a new language. So I hope that this provides a valuable learning and teaching tool and helps build the bridge from SQL to R. Thank you.
