Resources

Jared Lander | R: Then and Now | RStudio (2020)

R has changed a lot since the meetup was founded 10 years ago. Back then we were using base graphics (or lattice) and the apply family of functions, and we didn't have pipes. At the time there were an impressive 1,800 packages on CRAN; now there are over 15,000, extending R's reach far beyond its traditional domain of statistics and machine learning into publishing, website building, and video generation. The community has grown and changed dramatically during that time, with the New York meetup alone going from 25 to over 10,000 members. During this talk we go through a then-and-now of R code and community to palpably see how everything has changed.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, everyone. So we're going to take a look back at the past 10 years of R. I started getting into R maybe 13 years ago, and when I finished grad school, I discovered there was an R meetup in New York City. That was sort of the beginning of my journey. It's been about 10 years since then, and a lot of things have changed. So we're going to look at R retrospectively in a few different ways.

Community growth

We're going to start with the community and see how that's changed a lot. So this particular meetup, back then called the New York R Meetup, started on April 2, 2009. 21 people RSVP'd, and far fewer attended. It was founded by Josh Reich, who went on to found BankSimple. Ten years later, for our 10th anniversary, we had a meetup in April. Within three hours, 40 people had signed up, and it was sold out in less than a week. Our only constraint now is space. We are physically out of room to hold more people who want to come.

So for the 10th anniversary, we put up this graph here showing our membership growth. Zoomed in, it looks like this. I created this graph originally for our 10th anniversary, and we went from 21 people to 10,697. Since then, we're now at 11,205 as of this morning. So about 1,000 people a year joined the meetup.

This is a picture from the first meetup. In fact, there might be someone in this room in there. And it was a small crowd, and we had a ton of space to work with here, because not that many people showed up. About 10 years later, it turned into this. We have a lot more people coming now, and we also take really epic selfies.

So beyond the New York R Meetup, which I'm obviously partial to, and which, by the way, is no longer called the New York R Meetup. It's the New York Open Statistical Programming Meetup, which is a lot of words, because it's R and friends: a friendly community of many different open source languages. And around the world, there's a great Shiny app out there showing a map of all the different R meetups. As of this morning, there were 913 meetups with over 700,000 members. This did not exist 10 years ago.

Something else that didn't exist 10 years ago was R-Ladies. It's remarkable, the growth that R-Ladies has shown and the turnout they bring to all of our events. This map shows 186 R-Ladies groups around the world with over 62,000 members. It's amazing, the growth of this community; it was not always this large. So large, in fact, that this morning there were over 2,000 people in one room for this conference. Think about that. 2,000 people came to a conference about R.

CRAN and tooling

So what about everything that goes along with R? The growth of CRAN has been amazing. I know it looks like a similar graph to the one I showed before, but it's different data now. About 10 years ago, there were roughly 2,000 packages available on CRAN. There are now over 16,000 add-on packages to make your life better, so you can use R better than you ever did before. We really have to thank CRAN for this. CRAN has done such an amazing job, not of curating our packages exactly, but of making them accessible and exposing them in a way that we can all get easy access to, knowing that they're good packages and they're secure packages.

How we use R itself has changed drastically, too. This is the old interface. Who remembers this interface? Right? It was awful. You entered code one line at a time, and if you made a mistake, you had to start all over again. If you got sophisticated, maybe you wrote it in a text editor like Notepad++ or TextMate, then copied and pasted it into R by flipping back and forth. It was awful. We're all here at this conference; we all know. We have RStudio now. It's so much better. It has a text editor built in. It shows you your environment, your files, your plots, and obviously a much-needed dark background. Otherwise, how do you have street cred at a coffee shop when you're programming?

Then and now: code

So let's take a look at the code, the old code versus the new code, then and now. Because life was much harder back then. I had to walk uphill in the snow both ways without save points.

So as a convoluted example, we're going to look at how you can call multiple functions repeatedly in a row. We're going to take the fourth row of a data frame using just the head and tail functions. I don't think you should ever do this outside of Bash, but it's just to illustrate a point. So 10 years ago, I would have written tail of head of iris, comma, n equals 4, close my parentheses, comma, n equals 1. And remember, n equals 1 goes with tail, whereas n equals 4 goes with head. You read it inside-out, like a nested Excel formula, which, to be honest, is how almost every other programming language is read. Thankfully, we have the pipe now: iris piped into head with n equals 4, piped into tail with n equals 1. We can speak the code. We read it from left to right. It's very natural. But I don't think I need to convince anyone here of the merits of the pipe.
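The contrast described above can be sketched in a couple of lines (using the base |> pipe from R 4.1+; the magrittr %>% pipe works the same way here):

```r
# Then: nested calls, read inside-out
tail(head(iris, n = 4), n = 1)

# Now: piped, read left to right
iris |> head(n = 4) |> tail(n = 1)
```

Both expressions return the fourth row of iris as a one-row data frame.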

So let's talk about working directories and paths. It's one of the toughest subjects I have when teaching; it is very hard to explain to people who've lived in a GUI world. So I'm going to list all the files in a directory called data that's two directories back. I would use the list.files function and go dot, dot, slash, dot, dot, slash, data to go backwards two steps and then move forward. And this works fine. But if I'm doing this in the console, it works one way; if I'm using R Markdown in a different folder, which has a different working directory, it works a different way. I used to jump through so many hoops to make this work out. Plus, list.files by default only gives the file names, not the path to the file. So if you then try to read in these files, it's: where are these files? I don't see the path.

So two big improvements today. We have the dir_ls function from the fs package, which gives you the full path to the files. And the here package, with the here function, lets you combine the merits of both relative paths and absolute paths. It dynamically creates absolute paths on the fly, so your code works whether it's in the console or in a markdown document. It's really nice.
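A sketch of the two approaches, assuming the fs and here packages are installed and a data/ directory exists two levels up (the paths are illustrative, not from the talk's slides):

```r
# Then: a fragile relative path; only the bare file names come back
list.files("../../data")

# Now: fs::dir_ls returns full paths, and here::here builds an
# absolute path from the project root, so the same code works in
# the console and inside an R Markdown document
library(fs)
library(here)
dir_ls(here("data"))
```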

Let's read in a CSV next. read.csv has been around for a long time. I give it my old-fashioned path, I read it in, and I get back a data frame. Of course, since it's a data frame, it tries to print out all the rows and all the columns. I don't know what the types of the columns are, and in particular, the text data. I don't know what that's going to be. If we look at the class of the text data, it's a factor. By default, read.csv converts character data into factor data, unless you know to set stringsAsFactors = FALSE, and you do it without typos: you have to remember that it's strings plural and it's camel case. I learned that argument way too late in my career. They did not teach that in grad school.

So nowadays, we have read_csv. You pass it a file name; in this case, I use the here function to make it a nice path. And it prints out right away. It tells you the different column types you're getting. We go to print it out, and we see it's a tibble, which is New Zealand for data frame. It smartly shows just the first few rows and the first few columns, shows the data types, and has some other formatting niceties. And importantly, to stress it, it does not convert your characters to factors. It leaves them alone.
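Side by side, with a hypothetical pizza.csv standing in for the talk's data (the file name and path are assumptions):

```r
# Then: base read.csv silently turned characters into factors unless
# you remembered the stringsAsFactors argument (plural, camelCase)
pizza <- read.csv("../../data/pizza.csv", stringsAsFactors = FALSE)

# Now: readr::read_csv reports the column types, returns a tibble,
# and leaves character columns alone (assumes readr and here)
library(readr)
library(here)
pizza <- read_csv(here("data", "pizza.csv"))
```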

Selecting columns and rows

So, getting columns. Let's say you want to get a subset of the columns. Well, 10 years ago, I used the aptly named subset function. You give it a data frame, then you give it a vector of column names. Notice there are no quotes; it was early-days nonstandard evaluation. And this worked on both rows and columns. But there's a problem. If you ever read the help page for subset, there's a warning: it can have unanticipated consequences. How can they have that as a function and not expect it to work a certain way?

So you say, well, you know what? I'll use square brackets. So you write out your data frame, square brackets, leave the first entry blank because you're taking all the rows. Then you pass in a vector of column names as characters. And you get your two columns. Awesome, right?

And this syntax works just as well on a tibble. If you want to use square brackets, you can, and it will smartly print out just the first few rows because it is a tibble. But something I didn't learn in grad school was that if you give it one column, it returns a vector instead of a data frame. I never got taught that. So I had all this code like, oh, sometimes there are two columns, sometimes there's one; it's a data frame, it's a vector. I wrote all sorts of code to be like, take the transpose of the transpose of my subsetting. Yeah. Then it was a matrix. It was disgusting. So, not consistent, and it cost me hours and hours.

If someone had told me about drop = FALSE, I would have been fine. Why is that not the default? I know there was obviously a good reason 30 years ago, but not anymore. So we look at our modern tibble and select a single column, and it gives us back a one-column data frame. If you want to set drop = TRUE for some reason, you can, but it's not the default. And that's very important. Sensible defaults go a long way.
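The inconsistency, and the argument that fixes it, in a tiny base R example:

```r
df <- data.frame(a = 1:3, b = letters[1:3])

df[, c("a", "b")]        # two columns: a data frame
df[, "a"]                # one column: silently becomes a vector
df[, "a", drop = FALSE]  # one column, still a data frame
```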


Well, let's forget about square brackets. Let's use dplyr. We pipe our data frame into select, pass it the column names without quotes. And it's nice and easy to use. It's easier to write. If you select one column, you get back a one column data frame as you would hope. If you want a vector, you could use pull to get that column as a vector, just like putting a single name in square brackets does, which is confounding.
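Assuming dplyr is installed, the behavior described above looks like this:

```r
library(dplyr)

iris %>% select(Sepal.Length, Species)  # a data frame
iris %>% select(Sepal.Length)           # one column, still a data frame
iris %>% pull(Sepal.Length)             # explicitly a vector, when you want one
```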

And if you want to combine pipes and square brackets, you're more than welcome to. Square brackets is just a function that you can pipe into: you surround it with backticks, and it works like a function. I don't know why you'd want to do it, but you can.

So let's now select a bunch of rows based on a logical condition. We take our data frame, square brackets, and as the first argument we put a vector with a logical condition, in this case the data frame, dollar sign, column. So we have to refer to the data frame, pizza_base, multiple times. Of course, if you know me, there had to be a pizza example in here.

Using modern technology, we pipe into the filter function. We only need to say the data frame once, and we just use the column name bare. This becomes more important when you have multiple conditions. Say we want a condition on two columns. We have to say pizza_base three times: pizza_base, square brackets, pizza_base dollar price greater than or equal to three, and pizza_base dollar city equals New York. That's a lot of repeating yourself. With filter, we take pizza_modern, pipe it into filter, and just use the column names. You don't need to repeat the data frame. It's clearer, and you get to focus on the logic. That's really important.

So if you want to select both rows and columns simultaneously with square brackets, you do your data frame, square brackets, your logical condition, which involves the data frame name again, comma, the columns you want, and you hope it's more than one column. Remember: drop = FALSE. With dplyr, we get to do each step on its own line, with its own function. We take the data frame, we filter it, then we select it.
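A sketch with a toy stand-in for the talk's pizza data (the real dataset isn't shown, so the price and city columns are assumptions):

```r
pizza_base <- data.frame(
  price = c(2.5, 3, 3.5),
  city  = c("Boston", "New York", "New York")
)

# Then: the data frame name appears three times
pizza_base[pizza_base$price >= 3 & pizza_base$city == "New York",
           c("price", "city"), drop = FALSE]

# Now: each step on its own line (assumes dplyr)
library(dplyr)
pizza_base %>%
  filter(price >= 3, city == "New York") %>%
  select(price, city)
```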

Plotting then and now

So let's see how plotting has evolved. There have been a lot of changes in plotting over the past ten years. We're going to start with the diamonds dataset because it's canonical at this point. This is a base graphics scatter plot. You can argue about whether it looks attractive or not. But I first had to define the colors I was going to use, and I had to know how many colors I wanted. I used the formula interface for plot and passed it a data frame. The formula interface is actually kind of nice. But then I need to provide the colors, so I take my colors vector and subset it based on a vector from the data frame using square brackets. Then I manually add the legend and hope everything lines up. I spent about an hour and a half building this plot; I had forgotten how to do it. Back then, ggplot existed, but it was far from dominant. In fact, lattice was giving it a run for its money. And I'm not going to show you a lattice example, because I've never successfully made one of my own.

So we have ggplot. Everyone here knows this syntax, probably loves it. You have ggplot, pass it a data frame, pass it aesthetics, give it a geom. It can do just about anything. We have fine-grained control. We have great defaults. Even though I'm not using the default theme, it's just easier.
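Roughly what the two versions look like, using the diamonds data that ships with ggplot2 (the specific colors and point shape are assumptions, not the talk's slide):

```r
library(ggplot2)

# Then: base graphics, with colors and the legend managed by hand
cols <- c("red", "orange", "green", "blue", "purple")
plot(price ~ carat, data = diamonds, col = cols[diamonds$cut], pch = 18)
legend("bottomright", legend = levels(diamonds$cut), col = cols, pch = 18)

# Now: ggplot2 maps the color and builds the legend for you
ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()
```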

Then we come to boxplots. Now, I'm not a huge fan of boxplots, because the box represents the middle 50% of the data, which means those thin little lines are the other half of your data, so you're throwing away half of your data. But it's easy: boxplot, give it a formula, give it a data frame, you're done. We can do better with violin plots. I take a violin plot, I jitter the points underneath it, I add some color to the points, I throw in my quartile lines, and I get to see both the shape and the density of the data. It took a lot more code, but I'm able to see what's happening in the data. I get a real sense of it; I can really tell a good story now.
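A sketch of that upgrade, again on diamonds (assumes ggplot2; the styling choices are illustrative):

```r
library(ggplot2)

# Then: a one-line base boxplot
boxplot(price ~ cut, data = diamonds)

# Now: jittered points underneath a violin with quartile lines
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_jitter(width = 0.2, alpha = 0.05, color = "steelblue") +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75), alpha = 0.7)
```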

So we come to our favorite one-dimensional data plot: histograms. The hist function, not histogram, but hist, because you want to save characters. I pass it a vector; if the data's in a data frame, I do data frame, dollar sign, column. It's basic, it's unattractive, it's not great. With ggplot, I get a lot more control, I can make the aesthetics look a lot nicer, and it's just a better plot.
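The same comparison for histograms (assumes ggplot2; the bin count and fill are illustrative):

```r
library(ggplot2)

# Then: base hist, fed a bare vector
hist(diamonds$price)

# Now: ggplot2, with control over bins and styling
ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 30, fill = "steelblue")
```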

And if I want to, there is no shortage of interactive graphics. The highcharter package lets me call hchart. I still have to pass it a data frame, dollar sign, column, but look how great this is. If you open this presentation in your browser, you can hover and use this interactively. All with a single line of code.

And if you have time series, you can call the plot function. The plot function is overloaded; it does many different types of plots depending on what object you pass to it. You need to know that. You pass it a multiple time series object, and it gives you a multiple time series plot, which is kind of nice.

I'm not going to say it's attractive, but if you have the forecast package, or the fable and feasts packages, you can say autoplot, and you get what I consider a much, much more attractive plot than you do with base graphics, still with just one line of code.

And if you want to go interactive, highcharter does it. Or dygraphs is another great package for interactive time series plots.

Aggregation and modeling

So let's aggregate some data. Let's do a group-by summary. If you use the aggregate function, you pass it a formula with the column you're going to compute on and the column you're going to group by, you pass it a data frame, and you pass it a function. It works, but it's kind of slow. With dplyr, you have group_by and you have summarize. It's much more readable. It's faster. It's more SQL-like, so it makes a lot of sense to a lot of people. And it works on database and Spark backends, so it's really versatile.

If you want to aggregate on two columns with aggregate, you need to use cbind on the left-hand side of the formula. I don't know who thought of this. I did not know how to do this for years; it was so confusing to me. Why am I cbinding two columns that are already in a data frame? With summarize, you just pass it two calculations, and you're good.
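Both styles, shown on iris instead of the talk's data (assumes dplyr for the modern version):

```r
# Then: aggregate, with cbind needed for multiple columns
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
aggregate(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris, FUN = mean)

# Now: group_by + summarize reads like the question you're asking
library(dplyr)
iris %>%
  group_by(Species) %>%
  summarize(mean_sepal = mean(Sepal.Length),
            mean_petal = mean(Petal.Length))
```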

Modeling has changed a lot, too. It changed so much, it became machine learning. And then it became AI.

So we start with R's formula interface, and this is powerful. Traditional model formulation, a powerful tool; in fact, R won a statistical computing award for the formula. It lets you express your outcome and your inputs in a very nice fashion. You can call operations on individual columns. It automatically creates dummy variables for you. It's amazing. But it's been outmatched by wide data, and it doesn't really help you when you have new data you're trying to predict on.

So now we have the recipes package. It's a modern take on design matrix calculations, and it's highly programmatic. You list out all the steps you're going to take. You can list individual columns, or you can say all numerics, or all outcomes, or all nominals. You can create dummy variables. You use the prep function to get all the calculations ready, and you use the juice and bake functions to carry them out. I think Max went a little overboard with his food analogies here, but it works.
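A sketch of that workflow (assumes the recipes package; pizza_train, pizza_test, and the price column are hypothetical names, not from the talk):

```r
library(recipes)

rec <- recipe(price ~ ., data = pizza_train) %>%
  step_dummy(all_nominal_predictors()) %>%     # create dummy variables
  step_normalize(all_numeric_predictors())     # center and scale

prepped <- prep(rec)                 # estimate the required statistics
train_x <- juice(prepped)            # apply them to the training data
test_x  <- bake(prepped, new_data = pizza_test)  # and to new data
```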

So let's go fit our linear model with lm. It's easy. It's tested. It's fast, really fast; it's written in Fortran, so you're not going to get much faster. Call lm, pass it a formula, pass it a data frame. You're done.

Well, using parsnip, it looks like it takes more code, and it does. We declare that we're doing a linear regression, you set your engine to be lm, and then you fit the model with the fit function. The key part here is lm: we're telling it to use lm. If you don't want lm, if you want to go all Bayesian, it's the same exact code, except you change set_engine to stan, and now you're doing MCMC for free.

If your data's already in a cluster and you want to use Spark, just set the engine to spark. It's that simple. If you want to do penalized regression, my favorite algorithm, you set the engine to glmnet. And if you want to use a neural network to do a linear model, you can, because a single-layer neural network is a linear model; you just say set_engine keras. It's all the same.
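A sketch of the parsnip pattern (assumes parsnip; the pizza data and its columns are hypothetical, and each alternative engine needs its backing package installed):

```r
# Then: base lm
mod_old <- lm(price ~ size + city, data = pizza)

# Now: declare the model, pick an engine, then fit
library(parsnip)
spec <- linear_reg() %>% set_engine("lm")
mod  <- fit(spec, price ~ size + city, data = pizza)

# Swapping the engine changes the computation, not your code:
#   set_engine("stan")    Bayesian regression via MCMC
#   set_engine("spark")   fit where the data already lives
#   set_engine("glmnet")  penalized regression
#   set_engine("keras")   single-layer neural network
```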


Speaking of neural nets, everyone's favorite algorithm. They've been in R for decades. The nnet function was written by Brian Ripley and Bill Venables. It takes a formula, like lm, along with a data frame. It only has one hidden layer, but you tell it the size of that layer, and you have a neural network. Today's neural networks take a lot more code. Now, it's amazing that we can use TensorFlow, but look at all the code it takes to do the same thing as earlier. And you compile in place and modify in place, which is a little weird for R users. It can be slower than base R's nnet package for a small model like this, but you can capture much more complex relationships.
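Side by side, as a sketch (nnet ships with R; the keras lines assume the keras package and a working TensorFlow installation, and the layer sizes are illustrative):

```r
# Then: one hidden layer via nnet, formula interface like lm
library(nnet)
mod <- nnet(Species ~ ., data = iris, size = 5)

# Now: keras takes more code and compiles the model in place,
# but can express much deeper architectures
library(keras)
model <- keras_model_sequential() %>%
  layer_dense(units = 5, activation = "relu", input_shape = 4) %>%
  layer_dense(units = 3, activation = "softmax")
model %>% compile(optimizer = "adam",
                  loss = "categorical_crossentropy")
```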

Getting help

Getting help has changed a lot, too. Sometimes you need some help. Back in the day, you might turn to the R-help mailing list, and you might get a gem like this: "Once you appreciate that you have seriously misread the page, things will become a lot clearer." I'll let you guess who wrote that. Or you might respond to that with, "Hopefully this one isn't in the manual, or I'm about to get shot." Scared face. It wasn't friendly.

Nowadays, we have nice people like Mara on Twitter giving free advice, helping you out in the friendliest way possible. It's really amazing how much better it is today to get help. You don't have to worry about someone flaming you on a mailing list.

So a lot has happened in ten years. Quite a lot. And I've been thinking about my future in R and how I've welcomed the newest R programmer to my family. And that should be something like this. This shirt is a baby-themed hex sticker. We take R seriously in this family. We have the tinyverse, Messy R, Burp R, Gigi Poop. Yeah. And, you know, he has lots of R T-shirts, and we match a lot. So thank you very much.