Resources

Jeffrey Arnold | Solving R for data science | RStudio (2019)

While teaching a course using "R for Data Science", I wrote a complete set of solutions to its exercises and posted them on GitHub. Then other people started finding them. And now I'm here. In this talk, I'll discuss why I did it and what I learned from the process: both what I learned about the Tidyverse itself and what I learned from teaching it. About Jeffrey: I am formerly an Assistant Professor of Political Science at the University of Washington and a Core Faculty Member in the Center for Statistics and the Social Sciences, and before that an Instructor of Political Science and QuanTM Pre-Doctoral Fellow at Emory University. I received my Ph.D. in political science from the University of Rochester. Prior to graduate studies, I was a Research Associate/Economist in the Money and Payments Studies research group at the Federal Reserve Bank of New York.


Transcript

This transcript was generated automatically and may contain errors.

So, hello, I'm Jeff Arnold, formerly a data science fellow at Insight Data Science, and I will be working at Instacart beginning in January as a data scientist. So I should get used to saying things like: my opinions do not necessarily represent the opinions of anyone who has or ever will employ me.

Probably don't represent the opinions of myself in the past or the future.

So this is a bit of a click-baity title, I realize, after the fact. If you're parsing it as Solving R for data science, I'm not going to do that. Most of the people here are doing that far better than I, so I will give you time to leave if you're expecting me to solve R for all of data science.

Instead, I'm solving "R for Data Science", as in the book R for Data Science, which needs no introduction, by Wickham and Grolemund. I wrote exercise solutions to it, and that's what I'm going to be discussing.

So hopefully that's not too much of a letdown.

Background and motivation

So, as one would expect, these are a set of solutions to all the exercises in R for Data Science. It is up on GitHub, as one is wont to do, and it is all built with bookdown and compiled into a nice, pretty website.

Why did I write these solutions? What was I thinking at the time? I had to go back and reflect on it, but the main reason came from my previous life as an academic: I was using R for Data Science in a course. I was responsible for teaching the first-year quantitative methods sequence to social science PhD students.

With the students I had, I really couldn't assume any prior statistical experience, nor any prior programming experience, and they were interested in doing applied research, not in the theory behind it, except to the extent that theory moves the research forward. That's actually kind of perfect, because I really like this book: it's not a book about R as a programming language per se, nor a book about statistics that happens to use R on the side. It starts with getting people to use R to get their job done.

And at least when I took quantitative methods courses, it was a lot of this. I know a lot of people here are improving that curriculum, so I don't expect this to be the case much longer. But in practice, most applied research is a lot more like this.

There's the importing, the tidying, the transforming, the visualizing, the modeling. And within that modeling, the proofs behind why all that statistical inference works are a very small sliver: a very important sliver, one we cannot ignore, but you have to go through a lot of stuff just to get there. That's why I adopted this textbook, and it was great.

Now I could spend more time working with the students and less time writing more and more of the same material. But it was a new book, so I figured I should probably see what's in it if I was going to assign it. So I just sat down and worked through it. I knew the tidyverse, so it was fun.

And then, you know, I put it up on GitHub, and I didn't want to make it too obvious that it was R for Data Science, so I took the ROT13 of it and called it E4QF. I'm clearly not too good at cryptography: there are these things called search engines, and apparently people found it.

And so, yeah, that's really how I got there. All of this happened about a month before R for Data Science was even published; I think I assigned it prior to the official publication date. I did that, put it up there, and told my students they could look at it. It was great.

They clearly didn't catch enough typos, as I found out later. And then it just kind of sat around there until eventually people started finding it.

What I learned from the process

So what did I learn from this experience? Well, I would like to be able to say something about the experience of being a beginner working through and learning the material, but I didn't get that part of it. I don't want to say I suffer from expert bias here, because I'm not really an expert.

So, if one is wondering, that is where the repository's name came from. I had a bit of experience with the tidyverse, but I learned a lot through these books. Still, most of what I ended up learning from all of this in practice is, one: if you put stuff on GitHub and make it public, some people might find it useful. And that's really what happened here.

Second, I make lots of typos. So I spent a lot of time on technical fixes for that, because I didn't want to go back and proofread everything by hand. I've also been using this as an opportunity to learn new things: when new markdown tooling comes out, I try it; if I want to learn some more JavaScript or some other new technology, I test it out here.

So the first thing is that I ended up building a lot of automation around deployment. It uses bookdown to build the site; Travis CI builds it and deploys it to GitHub Pages. Then I run a lot of different checks: I check the spelling using the rOpenSci spelling package, run lintr and styler, and then a lot of extra checks: I lint the markdown, I check the HTML, and I check for broken links.

This can be quite useful. Here's one that came up just yesterday: the HTML linter found that the id of something generated by htmlwidgets was a duplicate, so there would have been a duplicate image that I would never have found. I removed the caching and fixed it after a bit of annoyance, but this is something the checks picked up.
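A minimal sketch of the kind of check described, assuming nothing beyond base R. A real setup would use a proper HTML parser (such as the xml2 package) rather than a regex, which is a simplification for illustration:

```r
# Find id attributes that appear more than once in a page, the kind of
# duplicate that an HTML linter would flag. Regex-based, so only a toy.
find_duplicate_ids <- function(html) {
  matches <- regmatches(html, gregexpr('id="[^"]*"', html))[[1]]
  ids <- sub('^id="', "", sub('"$', "", matches))
  unique(ids[duplicated(ids)])
}

# A page where an htmlwidgets-style id got duplicated (made-up HTML):
page <- '<div id="plot1"></div><div id="plot2"></div><div id="plot1"></div>'
find_duplicate_ids(page)
#> [1] "plot1"
```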

So I've been building that up over time. After the site is built, it checks all the links, all 16,000 of them, to make sure they still work. This picks up things like spelling "exercises" wrong in a URL, or whatnot.

Reducing friction for user feedback

A second thing I've tried to do is reduce the friction in getting user feedback. It's been helpful: I put it up, people found it, and I responded as people brought things up. So this is one of my favorite examples.

The question is: brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights. So there are a bunch of different ways to write the select() call. I was being all clever and mentioned the column numbers, and then I said things like: this is bad, it's obfuscated, what do these columns mean? I don't know what column six is; I just wrote the code and I've already forgotten.
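The flavor of that exercise can be sketched without any packages. Below is a base-R analogue using a toy one-row stand-in for nycflights13::flights (the values and the reduced column set are made up); in dplyr the same selections would be written with select(), e.g. select(flights, dep_time, dep_delay, arr_time, arr_delay) or select(flights, starts_with("dep"), starts_with("arr")):

```r
# Toy stand-in for nycflights13::flights (hypothetical values,
# reduced set of columns).
flights <- data.frame(
  year = 2013, dep_time = 517, sched_dep_time = 515, dep_delay = 2,
  arr_time = 830, sched_arr_time = 819, arr_delay = 11
)
wanted <- c("dep_time", "dep_delay", "arr_time", "arr_delay")

by_name     <- flights[wanted]                   # explicit names
by_subset   <- subset(flights, select = wanted)  # subset()
by_position <- flights[c(2, 4, 5, 7)]            # positions: fragile!

names(by_position)
#> [1] "dep_time"  "dep_delay" "arr_time"  "arr_delay"
```

The position-based version is exactly the obfuscated one: nothing in the code says what column 6 means, and reordering the data frame silently breaks it.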

Sure enough, I got an issue raised that I had actually written the wrong column right after writing that. And it was not on purpose; I'm not that clever. But people catch that sort of stuff, and I think that's great. I've had some people contribute in various ways, and I would love more contributions.

But one problem with GitHub issues is how time-consuming they are. These are a bunch from Greg Wilson, which are literally things like "a" versus "the": a whole pile of minor typos. As I said, I make lots of typos, and opening a GitHub issue for each of those is kind of annoying.

So I also tried putting the annotation service Hypothesis on the site, which allows annotation of web pages. That's pretty nice in that it's much easier to just highlight something and leave a comment saying "this is misspelled, this is a typo" rather than opening a GitHub issue. Some places you do want that extra friction for feedback; for these sorts of things, you want less friction.

But then I started thinking that another way forward is to write something where you can just highlight text, and it opens a new GitHub issue with that highlighted text as the body, plus a link to the page where it occurs. For a document like this, that makes it much easier to submit these minor, trivial, "there's a typo here" kinds of issues.
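GitHub does support prefilling a new issue via URL query parameters, so the idea can be sketched in a few lines of base R. The repository path and page URL below are hypothetical placeholders, not the project's real ones:

```r
# Build a "new issue" URL whose body is prefilled with the highlighted
# text and a link back to the page it came from.
new_issue_url <- function(repo, page, highlighted) {
  body <- sprintf("There is a typo here:\n\n> %s\n\nPage: %s",
                  highlighted, page)
  sprintf("https://github.com/%s/issues/new?body=%s",
          repo, utils::URLencode(body, reserved = TRUE))
}

# Hypothetical repo and page, purely for illustration:
u <- new_issue_url("user/repo",
                   "https://example.com/transform.html",
                   "teh select function")
```

In practice this would live in a small piece of JavaScript on the page that reads the current selection and opens the URL, but the URL construction is the whole trick.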

Learning from user behavior

Another thing I started looking at was what I could learn from user behavior. There's been Google Analytics on the site since November, when I started playing around with it. So I thought I'd pull that information in and look at the page views across the various chapters of the R for Data Science solutions.

At the top are the most-viewed chapters, starting with the first couple of intros: data visualization has lots of page views, transformation has a little less, and it decreases rapidly from there, as one would expect with this sort of behavior.

But can we say something about which chapters seem to be harder, or which ones people are turning to? One problem is that the number of exercises in R for Data Science varies dramatically by chapter: data visualization and transformation have a lot of exercises, and strings has a ton.

So let's put together a simple statistical model: model unique users viewing a page as a function of chapter and number of exercises, and then look for pages that are not well explained by it as potentially very hard or very easy, or as ones people want to view for other reasons.

And we see that exploratory data analysis, the third major chapter, seems to receive fewer page views than expected; people seem to tail off toward the end. But iteration and the beginning of the modeling chapters seem to get lots of users, even given how far into the book they are and the number of exercises they have.
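A rough, chapter-level simplification of that kind of model can be sketched in base R. The data here are invented for illustration, not the actual analytics; the idea is to fit a Poisson regression of users on exercise count and rank chapters by residual:

```r
set.seed(1)
# Synthetic stand-in for the analytics data (all numbers invented):
# per-chapter exercise counts and unique users viewing the solutions.
chapters <- data.frame(
  chapter   = c("visualize", "transform", "EDA", "strings", "iteration"),
  exercises = c(30, 40, 20, 45, 25)
)
chapters$users <- rpois(nrow(chapters), 50 * chapters$exercises)

# Model expected users from the number of exercises. Chapters with
# large positive residuals draw more traffic than their exercise count
# predicts: candidates for "harder than expected" material.
fit <- glm(users ~ exercises, family = poisson, data = chapters)
chapters$resid <- residuals(fit, type = "pearson")
chapters[order(-chapters$resid), c("chapter", "resid")]
```

The real analysis would work per page, with chapter as a covariate, but the residual-ranking step is the same.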

What's next

So with that, what's next? I really only had one goal here: letting people know these solutions exist. They're up on GitHub, and they're there for people to contribute to. I'm not trying to keep this as my own; if you see something wrong, or ways it can be improved, please contribute. I'd love to open this up as much as possible. Those are the links, and that's my information. Thank you.

Q&A

We have a few minutes for questions. We have a few mics. We have the throwable mics like we had in the opening keynote.

I really appreciated the infrastructure you put together around testing the HTML and stuff like that. Like, I'd really like to implement that with some things that I've done. Have you written anywhere about a way to do that or a guide or something?

I have not written a guide for that; a lot of this has just evolved as I've tried things out. You can check out the source code. One thing I've tended to use, at least for the HTML and markdown, has actually been Node and JavaScript packages, just because there's so much web infrastructure there; it seemed easier to implement. But that's about all I have. I'll think about writing up more on that.

So which book are you doing next?

I don't know. The one thing I didn't get to, which I would love to ask Hadley at some point, is: what is the deal with the nursery rhymes? I know it started with Little Bunny Foo Foo, but then there are multiple exercises in R for Data Science about implementing nursery rhymes and other songs. In my memory, it was pages of these; in reality, there are only three questions.

But since I had the time, I should have put it in there. For what it's worth, the Baby Shark song seems like one that would fall under that rubric very well. I have a very nice version that uses map, where you can turn the nested for loops into map calls. So that might be the next song to become an R programming teaching tool.