Irene Steves | Teaching data science with puzzles | RStudio (2019)

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, I'm Irene. I was an intern with RStudio last summer working with the tidyverse team, specifically with Jenny Bryan, who many of you know. And now I'm based in Tel Aviv. And today I'll be talking about the data science puzzles I was building last summer, and some of the extra concepts that we were also trying to convey through these puzzles.

And so this story really starts over a year ago in December. I was supposed to be writing my thesis or something. Instead I was drawn into these coding challenges online known as Advent of Code. And as you might guess from the name, these are a series of daily puzzles, from December 1st to the 25th. And they cover all kinds of different computer science concepts, but in really fun ways.

And just to give you a sense of what that looks like, this is just a screenshot from that page. You have some puzzle text telling you what the problem is. You know, there's a robot running up to you and there's some tragedy that you have to solve. At the very bottom there's a link to your unique puzzle input. So maybe there's just a lot of garbled text that you have to somehow parse. And then at the very bottom is a little text box for you to enter your answer and, you know, get your little star. And so these puzzles can be solved in any number of different programming languages. As an R user, I used R. And I found that I was using R in very unfamiliar and unnatural ways. But I was still able to solve a lot of the puzzles.

Little did I know at the time, Jenny Bryan was also solving these puzzles around the same time. And she took that feeling one step further. She thought, well, these puzzles are a lot of fun, but why don't we make a set of puzzles that highlight what R and the tidyverse are actually good at? In other words, how about we make a set of puzzles that focus on the idea of messy data and tidying it up?

And so that's how this project was born. It's currently known, oh, as the Tidies of March. It looked better on my computer. So they've become a series of bite-sized puzzles that focus on core data science skills as championed by the tidyverse set of packages.

I think I forgot to mention at the beginning, there is a GitHub repo with nicer looking slides that will be linked to at the end as well.

The name and goals

And so just before I move on, I wanted to quickly tell you about the name. I think the phrase "beware the Ides of March" is maybe familiar to some of you. It comes from The Tragedy of Julius Caesar by Shakespeare. And so you have this fortune teller coming up to Caesar saying "beware the Ides of March," which means beware the 15th of March. Caesar is like, okay, whatever, dude. And of course, come the 15th of March, Caesar is assassinated, more tragedy ensues, and I think the moral of the story is that you and your data do not have to suffer the same fate. You can avert tragedy by using the tidyverse and tidy tools.

And so for this project, we had a number of different goals. We had some more general goals. We wanted to build up a community around these puzzles, to have a space where people can share their solutions and see other people's solutions, whether it's in tidyverse or Python. And then we wanted to be able to create a bank of solutions to these discrete data wrangling problems. But beyond that, we also wanted to impart to people some very specific skills.

First off, we wanted to help people exercise their wrangling skills: for beginners, introducing them to a lot of these different packages; for more advanced users, showing them some of the lesser-known functions within these packages. And beyond that, we also wanted to promote the concept of workflow and project management.

The R package

And in order to do that, what we did was rather than just have this web interface where you can just go off and use whatever IDE or whatever tool you wanted, we were going to build an R package that guided you through that process and guided you through and modeled some of these best practices so that you can start to internalize some of those things.

And so this is more or less what it looks like. Like with any other package, you start with a library call, then you call an initialize-puzzles function, and you pass it a directory name. And what it does is it creates a new project for you and then sets you up with a number of different files. We'll just take a quick look at that here.

So first, before anything else, if you use this R-mediated route or this R-mediated experience, we force a few things on you. We force you to use an R project. We think it's a very good idea to use these projects and keep all the relevant files in one place, and to keep it portable so that you can access those same files from a different computer and it will just work.

And the usethis logo is in the bottom corner here because it was a big inspiration for this package. It focuses a lot on workflows and on making some of those more meticulous tasks a little bit easier to do. You see here there are a number of other files.

The dot puzzle file stores some of your user information. You have 01_pets. That's your first puzzle. And as the days go on, you can download additional puzzles as additional folders. And then you have a README. So if you now put this project on a GitHub repository, you have that already set up for you and automatically generated.

And so now if we click into the puzzle folder, we see a number of things. Maybe the first thing you notice is the naming convention. You see that everything starts with 01_. We use an underscore rather than a space. We use that leading zero so that once we have ten or more puzzles, it will still order nicely. So basically we're following and modeling the principles of having names that are machine readable, human readable, and that play well with default ordering. And these files specifically, I think you can see here, you have one file that's the data file. That's the file that you need to wrangle. In this case, it's an Excel file. We have a solution script that's been started for you, and I'll show that in a second. And then you have the puzzle text. And so that's the problem that you're trying to solve. Like, you know, Johnny has some problem with his microwave data, and you have to solve it.
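As a rough illustration of the leading-zero point, here is a minimal sketch in base R. Only the name "pets" comes from the talk; the other puzzle names and the numbering are made up for the example.

```r
# Hypothetical puzzle names; only "pets" is mentioned in the talk.
puzzle_names <- c("pets", "sandwiches", "microwaves")
puzzle_numbers <- c(1L, 2L, 10L)

# Without padding, names sort character by character,
# so "10_..." lands before "1_..." and "2_...".
unpadded <- paste0(puzzle_numbers, "_", puzzle_names)

# With a two-digit, zero-padded prefix, default ordering matches numeric order.
padded <- sprintf("%02d_%s", puzzle_numbers, puzzle_names)

# method = "radix" sorts by byte value, independent of the session locale.
sort(unpadded, method = "radix")
sort(padded, method = "radix")
```

The first sort puts the tenth puzzle at the top, while the second keeps the puzzles in the order they were released.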

And so this is what the solution script looks like. There are a number of things that we've kind of put into here. You don't necessarily have to understand it all, but I'll point out a few here. First, we have this prepopulated path. Because of this very rigid structure, we know where everything is. We can just populate that data path, and you can choose the import function you want to use.

There are a few other things. We use here::here so that we can use the same relative path both in your R scripts and in your R Markdown files. R Markdown and R scripts understand the home directory in slightly different ways, and this makes it just work across different file types. This weird notation here with #' and #+ allows you to have an R script that functions in the same way as an R Markdown file, where you can have text mixed in with your code. And then the last bit that I'll point out here is options(tidyverse.quiet = TRUE). That just turns off all the tidyverse attach messages so that when we have this rendered file, we just have that library call without all those extra messages.
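To make those pieces concrete, here is a hedged sketch of what such a starter script might look like. It is not the package's actual template: the folder and file names are assumptions based on the 01_ convention, and the import line is left commented out because the data file is not available here.

```r
#' # Puzzle 01: pets
#'
#' Lines that start with #' are turned into prose when the script is
#' rendered as an R Markdown document.

#+ setup, message = FALSE
# Silence the tidyverse attach message in the rendered output.
options(tidyverse.quiet = TRUE)
library(tidyverse)
library(here)

#' ## Import
#'
#' here() builds the path from the project root, so it resolves the same
#' way from an R script and from an R Markdown file.

#+ import
# Hypothetical folder and file names following the 01_ convention.
data_path <- here("01_pets", "01_pets.xlsx")

# Choose the import function you prefer, for example (not run here):
# pets <- readxl::read_excel(data_path)
```

A script written this way can be rendered with rmarkdown::render() or knitr::spin(), and it still runs top to bottom as an ordinary R script.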

And so jumping back to the README, this is what it looks like. Now that you have a sense of what that repository looks like, here it doesn't look very impressive. It automatically generates a table of contents, which with a single puzzle is just, like, okay. But you can imagine that with 20 puzzles or even with five puzzles, having something that just automatically generates saves you a lot of work. And so the idea behind this is really that we set up a lot of things for you in advance, and when your project is small, it feels like just extra work. But we're setting you up to really easily extend and enlarge that project. So when you have a lot going on, when you want to publish this to GitHub, when you want to create a website out of it, a lot of the architecture is already in place for you to use.

A puzzle example

And so, of course, I can't do this talk without showing you an example of what a puzzle is. And so we'll run through just this very simple example. So let's say we have a sandwich shop. They make really good sandwiches from BLTs to Fluffernutters, which I did not know was actually very popular until this week. It's a marshmallow peanut butter sandwich. But since many of their specialty ingredients keep going bad, they've decided to focus only on their most popular sandwich.

And so to help with the decision, they've collected some data. Here it is. Or here's a sample of it. And you have on the left-hand side the names of their customers. On the right-hand side, you have their favorite sandwiches, semicolon-separated, in no particular order. And then it says at the bottom, in this sample, the Dagwood sandwich is the most popular. In the full dataset, what is the most popular sandwich among the customers? So that question in bold at the bottom, that's the question that you're trying to answer in the end.

And before we wrangle the data, I want to point out that within that puzzle question, we have a test case. We have a sample of the data and the answer to that sample, and that way you can start out by writing a script that works for the sample. And in the case of small datasets, maybe it doesn't make a big difference, but you can imagine that if you're working with hundreds of thousands of rows, really you want to test it on something small, make sure it works, have a script that works for most, if not all, of your use cases, and then move on to that full dataset. And the testthat and testrmd packages are good places to go to learn more about testing.

So this is just the table output of that same table that you saw earlier. I've called it SW in this case, sandwiches. And the first thing we want to do here is take those sandwiches and have just one sandwich per row instead of this kind of list within the column here.

And so the tidyr package has a very useful function called separate_rows. You just take that column, sandwiches, tell it the separator is a semicolon followed by a space, and you have everything worked out for you. Here you might also want to pay attention a little bit. Are there some spelling mistakes? Are there some inconsistencies in capitalization? In this case, we'll just assume everything is perfect, because it normally is. And now we just count it.

And so we count sandwiches; sort = TRUE bumps the most popular sandwich to the top, so we have Dagwood at the top. That's exactly what we expected. Great. We are ready to submit our solution. And so we'll go ahead and do that.
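The two steps just described can be sketched as follows. The rows below are made up for illustration and are not the puzzle's actual sample; they are arranged so that Dagwood comes out on top, as in the talk. The table is called sw, following the talk.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up sample rows; the customer names and sandwich lists are illustrative.
sw <- tribble(
  ~name,   ~sandwiches,
  "Ana",   "Dagwood; BLT",
  "Ben",   "Fluffernutter; Dagwood",
  "Chloe", "BLT; Dagwood; Reuben",
  "Dev",   "Reuben"
)

sw_counts <- sw %>%
  # One sandwich per row; the separator is a semicolon followed by a space.
  separate_rows(sandwiches, sep = "; ") %>%
  # Tally each sandwich and put the most frequent one first.
  count(sandwiches, sort = TRUE)

sw_counts

# Check the sample against its known answer before moving to the full data.
stopifnot(sw_counts$sandwiches[1] == "Dagwood")
```

Once the check on the sample passes, the same pipeline can be pointed at the full dataset.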

So now we go back to our interface, type in submit puzzle, Dagwood, the puzzle number was 11. It says give it another try. Oh, right, that was the sample. That wasn't the full dataset. Let's try again. Bacon, egg, and cheese. I happen to know that's the answer. And it says correct, you are dandy. That's just praise from the praise package, which is very encouraging. And in this case, it's rendered it as a reprex on the side. Later we actually changed it to rendering through the rmarkdown package.

But basically now on the side, you have a way of previewing that solution script that you had written. You can make sure that the final output is what you expect. And both the rmarkdown package and the reprex package force you to do certain things, which are actually really good practice. They force you to make sure that you have all those library calls in place. They force you to make sure that you actually have everything in order. Definitely for myself, when I was starting with R, an R script was this interactive thing where I would run this part and then this line and then this line. And that doesn't really work when you're trying to reproduce it and use that script six months later. And so both of these packages really force you to do certain things right.
