Ryan Timpe | Learning R with humorous side projects | RStudio (2020)

Transcript

This transcript was generated automatically and may contain errors.

I'm not nervous. I'm just excited. All right.

So my name is Ryan Timpe. I'm a senior data scientist at the LEGO Group on the marketing effectiveness team. Basically, I build marketing mix models in R to help the LEGO Group optimize all their different forms of media spending. You can find me on Twitter at @ryantimpe, where I like to tweet about my dog, dinosaurs, and all the fun things I do in R. You can also check out my Twitter for exclusive bonus content that did not make the cut for this talk.

So I'm here today to talk to you about learning R by creating side projects. Side projects have been a critical part of my learning experience in R, and I want to share some of those experiences with you today. Whenever I'm interested in learning a new package or a new technical concept, I have to come up with a project designed for me to learn it, and then once I'm comfortable I can begin using it in my work.

So, for example, two years ago when tidytext came out, I used data science to answer this really important question: which season of the Golden Girls should I watch when playing a drinking game?

Yeah, this is silly and stupid, but that's my point. R is always changing and we're always learning, so sometimes it feels like a struggle to keep up with all the new technology and packages. Over my eight or so years of using and learning R, I've found that side projects work really well for me to learn these new tools on my own terms. And then once I'm comfortable, I can use them in my work.

So I like to come up with a really ridiculous question like the Golden Girls drinking game. And this works for me because I get to learn the new tool using data and a topic that I'm already familiar with, so the only unknown for me is the new tool itself.

Learning tidytext with the Golden Girls drinking game

So this is what I did with tidytext. If you're not familiar, tidytext is a package by Julia Silge and David Robinson that can take text and turn it into a tidy data table. It's a really powerful tool that can make data out of anything, from literature to financial earnings reports or, in my case, TV show scripts. And I just love the idea of these two professional programmers out there making this really powerful tool for the R community and I'm just using it to get drunk.

But you know what? That worked. So I can make this look a lot more like data science with some charts.

So again, if you and your friends are hanging out one night and playing the Golden Girls drinking game, which season should you watch to maximize your drink consumption? If you're responsible and not familiar with drinking games: you basically watch a television show, and each time one of the characters performs a specific action, you take a sip of your drink. So for the Golden Girls, when Rose talks about her hometown of St. Olaf, when Dorothy talks about her ex-husband Stan, or when any of the women eat cheesecake, you're gonna drink.

Looking at this lovely beach-colored bar chart on the left, I proved, with data science, I remind you, that if you're going to play the Golden Girls drinking game and you want to drink the most, watch season 5. And that's because in the later episodes of the season Rose talks about her hometown a lot, and you're gonna drink about 10 more drinks that season than in season 6.

That said, I'm not sure how good an idea it is to watch an entire season of a TV show for one drinking game. Definitely not healthy, so instead look at the other chart on the right. This shows the cumulative drinks by minute watched for each of the seasons, and here you can see that maybe seasons 4 or 6 are gonna be a better idea for you, because they ramp up the drinks quickly in the first 100 minutes. If you watch season 5, you need to watch 400 minutes of the show, or 16 episodes, to exceed your drink consumption from season 4 or 6.
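As a rough illustration of the counting behind both charts (not the speaker's code, which used tidytext in R), here is a minimal sketch in plain Python. The trigger phrases follow the rules described above, but the sample dialogue, the `SCRIPT` layout, and every function name are made up for the example.

```python
import re
from collections import Counter
from itertools import accumulate

# Trigger phrases taken from the rules in the talk; matching them as
# whole words in the dialogue is an assumption of this sketch.
TRIGGERS = ("st. olaf", "stan", "cheesecake")
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in TRIGGERS) + r")\b",
    re.IGNORECASE,
)

# Made-up sample rows: (season, minute into the season, line of dialogue).
SCRIPT = [
    (4, 3, "Back in St. Olaf we had a herring circus."),
    (4, 9, "Stan called again. Who wants cheesecake?"),
    (5, 5, "Nobody mentioned anything today."),
    (5, 12, "St. Olaf had a festival like that."),
]


def drinks_in(line):
    """Count how many trigger phrases appear in one line of dialogue."""
    return len(PATTERN.findall(line))


def drinks_per_season(script):
    """Total drinks for each season (the bar chart)."""
    totals = Counter()
    for season, _minute, line in script:
        totals[season] += drinks_in(line)
    return dict(totals)


def cumulative_drinks(script, season):
    """Running drink total by minute for one season (the line chart)."""
    rows = sorted(
        (minute, drinks_in(line)) for s, minute, line in script if s == season
    )
    running = accumulate(count for _minute, count in rows)
    return [(minute, total) for (minute, _count), total in zip(rows, running)]


if __name__ == "__main__":
    print(drinks_per_season(SCRIPT))
    print(cumulative_drinks(SCRIPT, 4))
```

Comparing the running totals of two seasons at the same minute mark is what answers the "which season ramps up fastest" question; the per-season totals alone only answer "which season has the most drinks overall".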

So this is how I learned tidytext. I hope you're proud of me.

And I use tidytext all the time now, in all my projects, both at work and for side projects. And I get it, Golden Girls is a very old show and you might not be interested in that. So we can use tidytext for a lot of other things, like the Good Place drinking game. Because once you learn how to do it once, it's really easy to repeat.

So here you drink every time Eleanor says "fork" or Janet reminds someone that she's not a girl, and here you're gonna watch season 1. Or the Jurassic Park drinking game. Literally any drinking game: you give me the TV show or movie and I'll solve it for you using tidytext. And so here we now know that if you watch the two-hour movie, you're gonna consume 80 drinks over the course of that movie.

Learning gganimate with Jurassic Park data

I already had the data because of a different mini side project I did when a different new package came out. I wanted to learn how to make animated ggplots with the gganimate package, and again, using data that means something to me is just way more fun. So I spent three days watching Jurassic Park. I paused the movie every few seconds to figure out which characters were in the scene, to count all the dinosaurs on the screen, and to jot down all the locations. And I did this all for data science.

So here we have animated character paths of the main characters in Jurassic Park and where they move throughout the movie. We have three maps: we have the globe, as they move from the Badlands where they're digging up the bones to the island itself; we have a map of the island in the middle, where they move between the different dinosaur exhibits; and then we have a map of the visitor center with the interior scenes. And in the middle, every time a dinosaur eats one of the characters, a little skull emoji pops up.

So this is a small, silly project, but I learned gganimate this way. I learned all the features and transition elements, how to use them and when to use them, and that set me up for being able to use it in some of my more serious work.

Building datasaurus: learning new tools to solve a fun problem

Other times I approach the learning experience the other way around. I dream up a really fun project I want to complete with R, but I don't have the tools to do so yet.

So take a look at this chart. This is a rolling average of some mortality data from the United States. It's generally decreasing, so that's a really good thing. Take another look, though, and look closely. Does this chart look like a dinosaur to you? And spoiler alert: the answer is yes. It looks just like a dinosaur.

So doodling on charts is a lot of fun. I do it a lot, especially with tablets. But for this project I wanted a lot of doodles, like thousands of them. So in this case, getting a computer to doodle for me was gonna be way more fun. So here we have another dinosaur doodle drawn from this data, but this doodle was made with a lot of data science, and data science that can create any dinosaur from any data.

Yeah, problems you never knew you had.

So a few years ago it seemed like everyone out there was building Twitter bots. And a Twitter bot is a win-win, a computer does all the work and a human gets all the credit, all the Twitter likes and all the retweets and I wanted in on this. So I built datasaurus which takes a time series of data and it finds a dinosaur outline that's closely correlated with it. It then redraws a dinosaur using that time series as the outline, colors it in and displays it on this fun poster. R sends out a tweet, rinse and repeat every few hours forever and you have a Twitter bot.
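The matching step, finding an outline that is closely correlated with a time series, can be sketched as follows. This is only an illustration under stated assumptions: the talk does not say which correlation measure was used, so Pearson correlation is assumed here, and the outline heights, series values, and function names are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical outlines: the height of each dinosaur's top edge, sampled
# at the same number of points as the time series.
OUTLINES = {
    "long_neck": [1, 2, 4, 7, 9, 6, 3, 2],
    "stego": [2, 5, 6, 6, 5, 3, 2, 1],
    "tail_up": [9, 7, 5, 4, 3, 2, 2, 1],
}


def best_match(series, outlines):
    """Return (name, correlation) of the outline most correlated with series."""
    scores = {name: pearson(series, shape) for name, shape in outlines.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


if __name__ == "__main__":
    # Made-up time series that rises to a peak and falls again.
    series = [10, 22, 41, 68, 88, 61, 33, 19]
    print(best_match(series, OUTLINES))
```

Because correlation ignores scale and offset, the series and the outline do not need to be in the same units, which fits the idea of redrawing the chosen dinosaur with the data as its outline.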

The thing is when I set out to do this, I did not know how to do this. My knowledge of R at the time was very limited to data manipulation and regressions and I didn't really have the tools to accomplish my goals here. So doing this I learned a lot of new packages.

So I started with what I could do, and then I would plan out the next step. Whenever there was a roadblock, I would do some research to see what tools and packages were available for me to accomplish this, and then I would get it to work for datasaurus, learn the package along the way, and move on to the next one. So some examples of this: I used the Flickr API to actually get all these dinosaur images onto my computer.

geom_raster actually lets me draw the dinosaur on a ggplot. gridExtra arranges a lot of ggplots on the same chart. rvest, because I wanted trivia facts to make this more scientific, so on the bottom I had to scrape Wikipedia to display some facts. You see the fun color patterns on that? I had to relearn some basic trigonometry from high school, because that's all sines and cosines. rtweet and the Twitter API to put that into a tweet, and then batch processing so I did not have to hit the enter button every time I wanted to make one of these.

So the output of this is really silly, but I learned a ton of new tools that I use every day at my work as a data scientist. Not the dinosaur drawing part, but everything else. And so solving this silly problem just made me a much better data scientist.

