Alex Cookson | The Power of Great Datasets | RStudio

There are a few classic datasets, like mtcars, nycflights, or Titanic passengers. They're okay, but they leave something to be desired for folks learning R: they're kind of boring. There's a big difference between "Okay Datasets" and "Great Datasets". Great Datasets prompt you to exclaim, "That's so cool!" They get your blood pumping and mind racing with questions you want answered. They give tremendous motivation to answer those questions. And in answering those questions, you'll probably learn some R. I want you to curate Great Datasets. You'll contribute to the richness of our community, you'll learn some R yourself, and you'll feel fantastic when someone finds your Great Dataset and exclaims, "That's so cool!"

About Alex: Alex Cookson helps the Customer Intelligence team at the Royal Canadian Mint make the most of their data. When he's not working on A/B testing, recommendation engines, or exploratory data analysis at the Mint, he can be found participating in Tidy Tuesday or thinking up cool datasets to explore. And when he's not doing that, he's probably cycling around Toronto or doting on his two cats, Tom Tom and Ruby.

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Alex Cookson, and I'd like to talk about the power of great datasets.

I'd like to start off this talk by posing a question that I think all of us have asked ourselves at one point: how do I learn to do this thing in R? Whether that's sentiment analysis or data visualization or web scraping, how do we learn something? And Allison Horst, who does a lot of wonderful illustrations about R and statistics, has one that answers this question. It shows how teachers, mentors, and the wonderful RStats community all contribute to taking someone from being a beginner to some level of proficiency with something.

But I always look at this illustration and ask myself, what about great datasets?

What makes a great dataset

Now when I say great datasets, I'm talking about something pretty specific. I'm talking about datasets that are both cool and interesting. And when I say cool, I mean something that when you come across it, you say to yourself, hey, that's cool, or that's really fun. It's something that speaks to you on a deeper level, and it tends to be pretty personal. So what's cool to you might not be cool to me.

And when I say interesting, I mean that it rouses your curiosity. It makes you start asking a whole bunch of questions that you really want the answer to. If you've ever played a video game where you say to yourself just five more minutes or one more turn, and you're still playing it an hour later, this is the dataset version of that. Just one more graph.

Examples of great datasets

So some examples that I enjoy are the Duke Lemur Center dataset, which looks at ages and weights of almost 30 different species of lemur; the Broadway Grosses dataset, which looks at weekly box office grosses, tickets sold, average top ticket prices of Broadway shows going back to 1985; and fictional character personalities, which I consider to be one of the all-time greatest of great datasets.

And this one looks at personality traits of 800 fictional characters across over 250 different spectrums, such as the playful to serious spectrum, where we might find Michael Scott from The Office on the playful side and Worf from Star Trek: The Next Generation way over on the serious side.

And I think this is such a great dataset because I've been watching so much TV and so many movies lately that when I came across it, I thought, oh, this is so fun. It's something that speaks to something I've been spending a lot of my time doing. And I immediately had tons and tons of questions that I wanted to ask it and uncover from this dataset.

Learning R through great datasets

One of those questions was, what are some of my favorite characters' personalities? And I put in a whole bunch of work and ended up with some graphs that looked like this, which takes the six characters from Pride and Prejudice and shows their strongest traits. So we see that Elizabeth Bennet at the top left is a treasure as opposed to trash. She's important. She has a high IQ. She's independent and beautiful and a feminist.

But to put this together, I had to learn to do a whole bunch of things I didn't know how to do before, like use custom fonts so I could get this Georgian era kind of handwriting vibe, to tweak labels using the glue and ggtext packages. I had to choose an appropriate color palette for the subject matter and adjust a ton of theme elements using ggplot2.

A second question I had was, which characters are most similar to or different from one another? And again, I put in a bunch of work and ended up with something like this, which looks at reasonable versus deranged characters plotted against rugged versus refined characters. So here we see that characters like Sandor Clegane and Jayne Cobb are really similar to one another. They're deranged and rugged. And their opposites are Charlotte York from Sex and the City and Annie Edison, who are reasonable and refined.

And to do this, I had to conduct principal component analysis with tidymodels because these are actually principal components. You don't find them in the data. I had to learn to add text and arrow annotations to get this graph here. And I wrote a blog post where this is an interactive chart. So I had to learn how to make an interactive chart using Plotly.

And if I can be honest with you for a moment, I never would have put this much effort, this many hours, into analyzing Titanic passengers. It was because it was such a great, engaging dataset, where I had tons of questions that I wanted the answers to, that I had the drive and the curiosity to actually get those answers and learn a whole bunch of stuff in the process.

So I encourage you to get your own great datasets. And to Allison, I very respectfully suggest that we maybe add one thing to the balloon. Thank you.