Lionel Henry | Interactivity and Programming in the Tidyverse

Transcript

This transcript was generated automatically and may contain errors.

I work at RStudio in the Tidyverse team, and today I'll be talking about interactivity and programming in the Tidyverse.

So this is about the notion of data masking in R, which is this idea that you can blend the data with the workspace, so that you can work with the columns in your data frames as if they were objects in your workspace.
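As a minimal sketch of what data masking looks like in practice, the base R function with() can be used. The data frame and its column names below are made up for illustration and are not from the talk.

```r
# A small illustrative data frame (hypothetical columns).
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

# Without data masking: each column is reached through the data structure.
bmi_explicit <- df$weight / (df$height / 100)^2

# With data masking: with() evaluates the expression with the columns of
# `df` visible as if they were objects in the workspace.
bmi_masked <- with(df, weight / (height / 100)^2)

# Both approaches compute the same values.
identical(bmi_explicit, bmi_masked)
```

The second form reads closer to the idea being expressed, which is the benefit described above.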

And I think that's a really great and unique feature of R, that it helps to turn ideas into software because you're directly working with your data, rather than worrying about the data structure that it is in.

On the other hand, it does make some things more difficult, and in particular, it makes it more difficult to reuse code, to write functions around data masking functions.

So we have made progress in the tooling to solve that issue, and also in our approach to teaching it. So if you have learned a little bit about tidy eval before, you'll see some new concepts here. The concepts that you have learned before are still relevant. We still use the bang-bang operator all the time when we create our tools, but these should be considered lower-level tools, and now we are going to see a higher-level approach.

History of data masking in R

So I would like to talk a little bit about the history of data masking in R. So in 1988, the Blue Book was published, which defined the S language, the ancestor of R, and a lot of the things that we still use were in that book.

And one of the first manifestations of data masking was in the attach function, which allows you to take a dataset and attach it to the search path the same way that you do with a package. So that's a little bit different from data masking, but it's still the idea that the data frame is the important scope, and that if you are working interactively with R, you want to be able to work with the data directly.

So attach is not the recommended way of working with data now, but a few years later, the White Book was published, and it was about all of the statistical modeling functions that we still use in R now, like the lm function. And the way that it works is that it takes a data frame and a formula, and in the formula, you have data masking. You can refer to your columns directly.
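A short sketch of data masking inside a model formula, using the built-in mtcars dataset (the choice of dataset and columns is only for illustration):

```r
# The formula refers to the columns mpg and wt directly; lm() looks them
# up in the data frame supplied through the `data` argument.
fit <- lm(mpg ~ wt, data = mtcars)

# The same model written without data masking, for comparison.
fit_explicit <- lm(mtcars$mpg ~ mtcars$wt)

# The estimated coefficients agree; only their labels differ.
all.equal(unname(coef(fit)), unname(coef(fit_explicit)))
```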

And then R was being developed in the 90s, and Peter Dalgaard, who is a member of R-Core, published the frametools package, which was really foundational for R, for the tidyverse, and for dplyr.

And so as you can see, you had the subset.frame function, the select.frame function, and modify.frame. And these are the same kind of functions that we have in dplyr, like filter, select, and mutate. You could use data masking to filter rows, and you could select columns and modify columns inside the data frame.

And that was the first appearance of selections, which are a little bit different from data masking. So in particular, you can use the colon operator to select a range of variables that are contiguous in your data frame, along with other features that make it really easy to select columns.

So these were integrated in R. Subset and select were merged into one single function. Modify became transform.
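The base R versions can be sketched as follows with the built-in mtcars dataset. The particular filter, column range, and the derived wt_kg column are illustrative choices, not examples from the talk.

```r
# subset() combines both ideas: the second argument uses data masking to
# filter rows, and `select` uses a selection, here a range of contiguous
# columns written with the colon operator.
heavy <- subset(mtcars, wt > 5, select = mpg:disp)
names(heavy)

# transform() uses data masking to create or modify columns.
# wt is recorded in units of 1000 lbs, so this converts it to kilograms.
converted <- transform(mtcars, wt_kg = wt * 453.592)
head(converted$wt_kg)
```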

And then there were relatively few developments on data masking in base R. The with function was included, and the within function a few years later. But by and large, most of the developments happened in package space.
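For reference, a minimal sketch of how with and within differ, using a made-up data frame: with() returns the value of the expression, while within() returns a modified copy of the data.

```r
# Hypothetical data for illustration.
df <- data.frame(x = 1:3, y = c(10, 20, 30))

# with() evaluates an expression in the data and returns its value.
total <- with(df, sum(x * y))

# within() evaluates a block in the data and returns a copy of the data
# frame that includes any variables created or changed in the block.
df2 <- within(df, {
  z <- x + y
})

names(df2)
```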

So in 2006, the data.table package was released for the first time. And one of the things that it did, besides performance, is that it allowed you to use data masking in i to subset rows, and you could use data masking in j, but also selections, to select columns.

Then in 2014, dplyr was released. It was very similar to the ideas that were in the frametools package, but with the objective of really pushing data masking into a much more comprehensive API.

The ambiguity problem

So what was the reason that the development of data masking slowed down in base R? I think we can find the answer in the documentation topics for subset and transform. If you read those, you will see a warning that they are really convenience functions intended for interactive use, and that nonstandard evaluation can have unanticipated consequences.

So I think what they mean by that is that it's about the ambiguity between data variables, so variables that are in your data frame, and environment variables, those that are in your workspace and that you have assigned with the arrow operator. And this ambiguity causes two different problems that are distinct but related. First, you can get unexpected masking from data variables. And secondly, data variables cannot get through function arguments, and that makes it difficult to create functions around data masking functions. And so we are going to see solutions that we offer in the Tidyverse to solve these issues.
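Both problems can be reproduced with base R's subset(). This is only a sketch of the problems, not of the tidyverse solutions; the data frame, the threshold variable, and the filter_rows() wrapper are hypothetical names introduced for illustration.

```r
# Hypothetical data: note that the data frame has a column `threshold`.
df <- data.frame(value = c(1, 5, 10), threshold = c(0, 0, 0))

# Problem 1: unexpected masking. The intent is to compare against the
# workspace variable, but the column of the same name takes precedence.
threshold <- 4
masked <- subset(df, value > threshold)       # compares against the column
intended <- df[df$value > threshold, ]        # compares against 4
nrow(masked)
nrow(intended)

# Problem 2: data variables do not get through function arguments.
# A naive wrapper passes `condition` along, but it is evaluated where
# the wrapper was called, and no object `value` exists there.
filter_rows <- function(data, condition) {
  subset(data, condition)
}
result <- tryCatch(
  filter_rows(df, value > 4),
  error = function(e) conditionMessage(e)
)
result
```

In the first case the code runs without complaint and silently keeps every row; in the second it fails with an error about the missing object.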


Lionel Henry | Interactivity and Programming in the Tidyverse | RStudio (2020)

