Resources

Irene Steves | Teaching data science with puzzles | RStudio (2019)

Of the many coding puzzles on the web, few focus on the programming skills needed for handling untidy data. During my summer internship at RStudio, I worked with Jenny Bryan to develop a series of data science puzzles known as the "Tidies of March." These puzzles isolate data wrangling tasks into bite-sized pieces to nurture core data science skills such as importing, reshaping, and summarizing data. We also provide access to puzzles and puzzle data directly in R through an accompanying Tidies of March package. I will show how this package models best practices for both data wrangling and project management.

Materials: https://github.com/isteves/ds-puzzles

About the author: Irene Steves

This summer I was an intern at RStudio, where I worked with Jenny Bryan to develop a series of coding challenges to cultivate and reward the mastery of R and the tidyverse. I was previously a Data Science Fellow at the National Center for Ecological Analysis and Synthesis (NCEAS), where I reviewed data submissions to a national repository for completion, clarity, and data management best practices. As a fellow, I also collaborated on a number of open science projects to improve access to Ecological Metadata Language (EML) and datasets in the DataONE network (see metajam, dataspice).

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, I'm Irene. I was an intern with RStudio last summer working with the tidyverse team, specifically with Jenny Bryan, who many of you know. And now I'm based in Tel Aviv. And today I'll be talking about the data science puzzles I was building last summer, and some of the extra concepts that we were trying to also convey through these puzzles.

And so this story really starts over a year ago in December. I was supposed to be writing my thesis or something. Instead I was drawn into these coding challenges online known as Advent of Code. And as you might guess from the name, these are a series of daily puzzles from December 1st to the 25th. And they cover all kinds of different computer science concepts, but in really fun ways.

And just to give you a sense of what that looks like, this is just a screenshot from that page. You have some puzzle text telling you what the problem is. You know, there's a robot running up to you and there's some tragedy that you have to solve. At the very bottom there's a link to your unique puzzle input. So maybe there's just a lot of garbled text that you have to somehow parse. And then at the very bottom is a little text box for you to enter your answer and, you know, get your little star. And so these puzzles can be solved in any number of different programming languages. As an R user, I used R. And I found that I was using R in very unfamiliar and unnatural ways. But I was still able to solve a lot of the puzzles.

Little did I know at the time, but Jenny Bryan was also solving these puzzles around the same time. And she took that feeling one step further. She thought, well, these puzzles are a lot of fun, but why don't we make a set of puzzles that highlight what R and the tidyverse are actually good at? In other words, how about we make a set of puzzles that focus on the idea of messy data and tidying it up?

And so that's how this project was born. It's currently known, oh, as the Tidies of March. It looked better on my computer. So they've become a series of bite-sized puzzles that focus on core data science skills as championed by the tidyverse set of packages.

I think I forgot to mention at the beginning, there is a GitHub repo with nicer looking slides that will be linked to at the end as well.

The name and goals

And so just before I move on, I wanted to quickly tell you about the name. I think the phrase "beware the Ides of March" is maybe familiar to some of you. It comes from The Tragedy of Julius Caesar by Shakespeare. And so you have this fortune teller coming up to Caesar saying "beware the Ides of March," which means beware the 15th of March. Caesar is like, okay, whatever, dude. And of course, come the 15th of March, Caesar is assassinated, more tragedy ensues, and I think the moral of the story is that you and your data do not have to suffer the same fate. You can avert tragedy by using the tidyverse and tidy tools.

And so for this project, we had a number of different goals. We had some more general goals. We wanted to build up a community around these puzzles to have a space where people can share their solutions and see other people's solutions, whether it's in tidyverse or Python. And then we wanted to be able to create a bank of solutions to these discrete data wrangling problems. But beyond that, we also wanted to impart on people some very specific skills.

First off, we wanted to help people exercise their wrangling skills: for beginners, introducing them to a lot of these different packages; for more advanced users, showing them some of the lesser-known functions within these packages. And beyond that, we also wanted to promote the concept of workflow and project management.

The R package

And in order to do that, what we did was rather than just have this web interface where you can just go off and use whatever IDE or whatever tool you wanted, we were going to build an R package that guided you through that process and guided you through and modeled some of these best practices so that you can start to internalize some of those things.

And so this is more or less what it looks like. Like with any other package, you start with a library call, then you call an initialize puzzles function, and you pass it a directory name. And what it does is it creates a new project for you and then sets you up with a number of different files. We'll just take a quick look at that here.

So first, before anything else, if you use this R-mediated route or this R-mediated experience, we force a few things on you. We force you to use an R project. We think it's a very good idea to use these projects and keep all the relevant files in one place, keep it portable so that you can access those same files from a different computer and it will just work.

And the usethis logo is in the bottom corner here because it was a big inspiration for this package. It focuses a lot on workflows and on making some of those more meticulous tasks a little bit easier to do. You see here there are a number of other files.

The .puzzle file stores some of your user information. You have 01_pets. That's your first puzzle. And as the days go on, you can download additional puzzles as additional folders. And then you have a README. So if you now put this project on a GitHub repository, you have that already set up for you and automatically generated.

And so now if we click into the puzzle folder, we see a number of things. Maybe the first thing you notice is the naming convention. You see that everything starts with 01_. We use an underscore rather than a space. We use that leading zero so that once we have more than ten puzzles, it will still order nicely. So basically we're following and modeling the principles of having names that are machine readable, human readable, and that play well with default ordering. And these files specifically, I think you can see here, you have one file that's the data file. That's the file that you need to wrangle. In this case, it's an Excel file. We have a solution script that's been started for you, and I'll show that in a second. And then you have the puzzle text. And so that's the problem that you're trying to solve. Like, you know, Johnny has some problem with his microwave data, and you have to solve it.
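To make the point about leading zeros concrete, here is a minimal sketch in base R. Only the name "pets" comes from the talk; the other puzzle names and numbers are made up for illustration.

```r
# Hypothetical puzzle numbers and names; only "pets" appears in the talk.
puzzle_number <- c(1L, 2L, 11L)
puzzle_name   <- c("pets", "sandwiches", "microwaves")

# Zero-padded to two digits, with an underscore instead of a space.
padded <- sprintf("%02d_%s", puzzle_number, puzzle_name)

# The same names without the leading zero, for comparison.
unpadded <- paste0(puzzle_number, "_", puzzle_name)

# With padding, default (alphabetical) sorting matches the numeric order.
sort(padded)

# Without padding, puzzle 11 sorts ahead of puzzle 2.
sort(unpadded)
```

Two digits of padding are enough for up to 99 puzzles; a longer series would need a wider format such as `%03d`.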

And so this is what the solution script looks like. There are a number of things that we've kind of put into here. You don't necessarily have to understand it all, but I'll point out a few here. First, we have this prepopulated path. Because of this very rigid structure, we know where everything is. We can just populate that data path, and you can choose the import function you want to use.

There are a few other things. We use here::here() so that we can use the same relative path both in your R scripts and in your R Markdown files. R Markdown and R scripts, they understand the home directory in a slightly different way, and this makes it just work across different file types. This weird notation here with #' and #+, that allows you to have an R script that functions in the same way as an R Markdown file, where you can have text mixed in with your code. And then the last bit that I'll point out here is this options(tidyverse.quiet = TRUE). That just turns off all the tidyverse attach messages so that when we have this rendered file, we just have that library call without all those extra messages.
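For readers who have not seen these pieces together, the following is a rough sketch of what such a solution script could look like. It is not the template the package generates; the file name 01_pets.xlsx, the title, and the object names are assumptions made for illustration.

```r
#' ---
#' title: "Puzzle 01: pets"
#' ---
#'
#' Lines starting with #' are rendered as text when the script is
#' rendered (for example with rmarkdown::render() or knitr::spin()).

#+ setup, message = FALSE
# Lines starting with #+ carry chunk options for the code that follows.
options(tidyverse.quiet = TRUE)  # suppress the tidyverse attach messages
library(tidyverse)
library(here)

#' The data path is built relative to the project root, so it resolves
#' the same way from an R script and from an R Markdown document.

#+ import
data_path <- here("01_pets", "01_pets.xlsx")  # assumed file name

# Pick whichever import function suits the file, for example:
# pets <- readxl::read_excel(data_path)
```

Because the path is built with here(), the script does not depend on the working directory of whoever runs or renders it, as long as it lives inside the project.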

And so jumping back to the README, this is what it looks like. Now that you have a sense of what that repository looks like, here it doesn't look very impressive. It automatically generates a table of contents, which with a single puzzle is just, like, okay. But you can imagine that with 20 puzzles or even with five puzzles, having something that just automatically generates saves you a lot of work. And so the idea behind this is really that we set up a lot of things for you in advance, and when your project is small, it feels like just extra work. But we're setting you up to really easily extend and enlarge that project. So when you have a lot going on, when you want to publish this to GitHub, when you want to create a website out of it, a lot of the architecture is already in place for you to use.

A puzzle example

And so, of course, I can't do this talk without showing you an example of what a puzzle is. And so we'll run through just this very simple example. So let's say we have a sandwich shop. They make really good sandwiches from BLTs to Fluffernutters, which I did not know was actually very popular until this week. It's a marshmallow peanut butter sandwich. But since many of their specialty ingredients keep going bad, they've decided to focus only on their most popular sandwich.

And so to help with the decision, they've collected some data. Here it is. Or here's a sample of it. And you have on the left-hand side the names of their customers. On the right-hand side, you have their favorite sandwiches, semicolon-separated, in no particular order. And then it says at the bottom, in this sample, the Dagwood sandwich is the most popular. In the full dataset, what is the most popular sandwich among the customers? So that question in bold at the bottom, that's the question that you're trying to answer in the end.

And before we wrangle the data, I want to point out that within that puzzle question, we have a test case. We have a sample of the data, the answer to that sample, and that way you can start out by writing a script that works for the sample. And in the case of small datasets, maybe it doesn't make a big difference, but you can imagine that if you're working with hundreds of thousands of rows, really you want to test it on something small, make sure it works, have a script that works for most, if not all of your use cases, and then move on to that full dataset. And the testthat and testrmd packages are good places to go learn more about testing.

So this is just the table output of that same table that you saw earlier. I've called it SW in this case, sandwiches. And the first thing we want to do here is take those sandwiches and have just one sandwich per row instead of this kind of list within the column here.

And so the tidyr package has a very useful function called separate_rows(). You just take that column, sandwiches, tell it the separator is semicolon space, and you have everything worked out for you. Here you might also want to pay attention a little bit. Are there some spelling mistakes? Are there some inconsistencies in capitalization? In this case, we'll just assume everything is perfect, because it normally is. And now we just count it.

And so we count sandwiches, and sort = TRUE bumps the most popular sandwich to the top, so we have Dagwood at the top. That's exactly what we expected. Great. We are ready to submit our solution. And so we'll go ahead and do that.
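The two steps just described can be sketched as follows. The customer names and rows below are invented for illustration and are not the puzzle data; they are only arranged so that Dagwood comes out on top, as in the sample from the talk. The column names `name` and `sandwiches` are likewise assumptions.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Made-up sample in the shape described in the talk: one row per customer,
# favorite sandwiches separated by "; " in no particular order.
sw <- tribble(
  ~name,  ~sandwiches,
  "Ana",  "Dagwood; BLT",
  "Ben",  "Fluffernutter; Dagwood",
  "Chen", "Dagwood",
  "Dara", "BLT; Bacon, egg, and cheese"
)

sw_counts <- sw %>%
  separate_rows(sandwiches, sep = "; ") %>%  # one sandwich per row
  count(sandwiches, sort = TRUE)             # most popular first

sw_counts

# The sample acts as a test case: check the known answer before
# running the same pipeline on the full dataset.
stopifnot(sw_counts$sandwiches[1] == "Dagwood")
```

Note that "Bacon, egg, and cheese" contains commas, which is one reason to split on the full "; " separator rather than on punctuation in general. More recent tidyr releases also offer separate_longer_delim() for the same job.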

So now we go back to our interface, type in submit puzzle, Dagwood, the puzzle number was 11. It says give it another try. Oh, right, that was the sample. That wasn't the full dataset. Let's try again. Bacon, egg, and cheese. I happen to know that's the answer. And it says correct, you are dandy. That's just praise from the praise package, which is very encouraging. And in this case, it's rendered it as a reprex on the side. Later we actually changed it to R Markdown, rendering it through the rmarkdown package.

But basically now on the side, you have a way of previewing that solution script that you had written. You can make sure that the final output is what you expect. And both the rmarkdown package and the reprex package, they force you to do certain things, which are actually really good practice. They force you to make sure that you have all those library calls in place. They force you to make sure that you actually have everything in order. Definitely for myself, when I was starting with R, an R script was this interactive thing where I would run this part and then this line and then this line. And that doesn't really work when you're trying to reproduce it and use that script six months later. And so both of these packages really force you to do certain things right.

Conclusion

And so to conclude, with this project, we wanted to show off a lot of things in the tidyverse. But really there's a lot more to data science and to data wrangling than just the functions in the tidyverse. There's also the idea of using projects, using version control, using a solid file and folder structure with good names that are parsable and predictable, using test cases, and using code that is reproducible and rerunnable.

And with that, thank you. My contact information is there. GitHub repo with the slides and other resources. And thank you.

Q&A

Thanks for the awesome presentation. I was just wondering if you got to apply the same thinking to the learnr package. Like does it fit somewhere into the experiment you were trying to do?

I did look at the learnr package at some point. I don't remember exactly what it does anymore. But that is something that Jenny has in mind. And I don't know if it will be incorporated or not. But the learnr package is for sure a good way of teaching as well.

Sorry. What's your goal for the difficulty level of these? Like for beginners, intermediates?

I think we were thinking of like not absolutely fresh beginner. But basically anywhere from like beginner intermediate to advanced also. I mean, if you're advanced, you can do it as kind of like a speed challenge, if you'd like. And I think I found that for advanced users, these puzzles take between like 10 to 30 minutes. And sometimes it's just the luck of thinking of the right solution. And so really we try to cover a really broad range of levels. And we do start out with easier levels and then kind of level up to, you know, JSON files and things like that.

Thanks very much. This looks great. Do you give the suggested solutions at the end somewhere? The reason I'm asking is this separate_rows thing. How many times have I done separate and then gathered together instead of just using that one-liner?

I think, well, it hasn't been fully developed yet. I think the idea is that, so in the puzzles themselves, I didn't show the web interface that we developed. But there are little places for hints. And so things like that would be in the hints where it's like maybe you want to try this tidyr function. It's handy.

Sorry. We nearly lost a laptop there. I was just wondering what sort of feedback you've got so far from people.

So far I've mostly been talking to R-Ladies. I was in Paris and I did a meetup there. And the group of people I was working with there were mostly beginnerish. And for them I think it was an interesting exercise. For sure it took longer than I expected. And I know there are some wording issues that sometimes come up. But in general I think the feedback has been positive. I did some virtual sessions with some R-Ladies online as well. And so far we have had positive feedback. But we haven't really put it to the test yet.

Hi. I teach at a college and I'm just wondering, like, my students don't use R projects. Don't yell at me. So I'm just wondering if there would be a way to use the puzzles without sort of all the lovely tools you've built to go along with it.

So there is a web interface. It works the same way Advent of Code does, which is totally language agnostic. It gives you the puzzle text, the problem you're trying to solve, and it gives you a link to the file that you have to download. And then other than that, there's just a text box for you to submit your final answer. And everything in between, that's up to you. You can use whatever workflow suits you.

This is fascinating. And it makes me want to see the puzzles. Is there a website where we can actually go and try your tidies of code puzzles? You said there is a web interface.

Yes. It is a prototype. It's currently a Shiny R Markdown document that is hosted on RStudio Connect. I don't know what the timeline is, unfortunately, because I am not in charge of the project anymore. You can ask Jenny Bryan. Maybe she would know. But I think sometime in the next year, they will be released. And you can try them.

It is in a private repository on GitHub. But once everything is released, we would like to make it public. Right now, all the puzzle solutions are also contained in the repo. The way we were able to separate the solutions and the user interface was through an API. It gets a little bit more technical. But if you're interested, feel free to talk to me or read my blog post about that.

Are the puzzles grouped by any sort of common category? Are they simply like an increasing difficulty? Would they fit into a curriculum somehow?

Okay. That is something we were considering. I tried that in the beginning. We were thinking of structuring it week by week, having some kind of theme. Originally, we were thinking, okay, week one, we'll work with data import and all the annoying parts of data import. Week two, we'll work on something else. Week four will be tidy eval. We realized that timeline specifically was maybe a little bit unrealistic. I think I've dropped that thematic part of it. But I think it would be cool once we have this set of puzzles to have other themed puzzles. Maybe just a whole month of tidy eval for the folks who really want to do that. Or a whole month on purrr. Some more specific aspect. This is kind of just intro to the tidyverse style.