
Same Data, Different Tools: Visualizing with R and Python (Olivia Hebner, Summit)
Speaker(s): Olivia Hebner

Abstract: In 2024, our team participated in a data challenge to recreate a visualization from W.E.B. Du Bois’s 1900 Paris Exposition using modern tools. We split into two groups—one using R and the other Python—to compare their strengths and limitations. Both teams used census and geographic data to map county-level populations for 1870 and 1880. Team R used ggplot2 and grid for precise layout control, while Team Python used matplotlib’s subplot system for structuring. This challenge pushed us beyond more traditional data science visualizations, requiring creative approaches to mimic Du Bois’s design. Attendees will gain insights into data wrangling, visualization techniques, and layout design to guide their own projects.

Materials: https://github.com/summitllc/Du-Bois-Challenge-2024

posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Thanks for hanging until the end of the session, I really appreciate it. And thank you to Posit for giving me a microphone, that's not something that most people who know me would do, I'm naturally loud and a yapper, but too late now, no take-backs.
So I am here today to talk about recreating a historical visualization in both R and Python. And that might sound a little similar to the first session of today, so I just want to give a quick thank you to our first presenter for giving all of the background information, I'm going to speed through that a bit more, but I echo all of her sentiments there.
Okay, so I'm starting the presentation a bit unconventionally, and I'm going to throw this slide at you with pretty much no context. One of these visuals was produced in R, and the other was produced in Python. So, some audience participation, get your blood flowing: can you raise your hand if you think the visual on the left was produced in R?
All right, okay, now raise your hand if you think the visual on the left was actually produced in Python? Okay, I see a lot of hesitancy, that's good, that proves my point later. Cool, I'm actually not going to tell you until the end.
So going into our agenda, back to our regularly scheduled programming here, I'll do a very quick introduction of the challenge, I'll walk through common elements that both teams use to create these visuals, we'll do a quick overview of the Python code walkthrough, a quick R code walkthrough, and key takeaways.
The Du Bois challenge
Okay, so each year, the Data Visualization Society, and if you haven't heard of them, I highly recommend you check them out, they publish a challenge to celebrate Black History Month, and the goal of this challenge is to recreate visuals from Du Bois' 1900 Paris exposition using modern tools.
So we at Summit, that's where I work, we approached the challenge a bit differently, because, like, what are rules? You should break them. First, we only selected one visual to recreate; as you saw, there are, I think, eight in each challenge. We all have full-time jobs, and, weirdly, our clients expect us to deliver things on time, even if we're busy with fun passion projects. I don't know.
The second thing we did was we separated into teams of three, instead of taking this on individually, the main reason behind this was I was leading a team, my colleague was leading a team, and then it was comprised of junior data scientists, and we wanted to give them a chance to learn how to collaborate together, how to share ideas together, how to share code on GitHub together and avoid merge conflicts, that did not happen, so also how to resolve merge conflicts. Great learning opportunity.
And then the third thing that we did a bit differently was we limited each team to one technology, so we said, hey, your team R, your team Python, figure out how to do it in there.
Okay, so I have to give a quick disclaimer here. I was the lead of team Python. I have, like, no bias about it. It's really chill. You probably won't notice anything, but the lead of team R is in the audience. I have encouraged her to throw things at me if I start to show my bias too much. I think she grabbed some tomatoes from the salad bar, so, like, look out for that, but you probably won't notice. I mean, I'm very, very unbiased here.
I spoiled the surprise with my first slide, but this was the visual that we decided to recreate, and I do want to quickly highlight that this was created using watercolors, which gives the counties in those maps that almost shaded effect. Also, the lettering and numbering here is extremely detailed, so Du Bois and his team accounted for the height and density of each letter and number that you see here.
Purpose of the talk
We're, like, six slides in, like, four minutes into the presentation at this point, and you may be thinking to yourself, this is a really, really cool visualization, but why are you on stage talking about it?
There are two purposes to my talk today. The first is I want to compare end-to-end R and Python workflows for a historical visualization. As data scientists, we're often tasked with creating, let's say, more traditional visuals. I've heard people at this conference call them boring. I am not saying that. More traditional visuals such as bar graphs, line graphs, pie charts. But this challenge allowed us, and actually forced us, to practice techniques that we often don't use in our day-to-day work.
The second purpose is to show that strong data and design principles produce the same high-quality results, whether you're in R or Python, leading to the conclusion in the yellow box that tool choice can be secondary, and I probably should have added an asterisk here and that said for visualizations. I don't know anything about machine learning. Maybe Python's better. Please don't come at me. I'm sure that there's a case for each, but for visualizations, I stand by this.
Common elements across both teams
Common elements that both the Python team and the R team used. For colors, we both used a free online hex color code identifier. For fonts, we took a very manual approach to identifying a font that closely matched the visual. That was actually pretty difficult and involved one of our junior data scientists going a little crazy, but it was very appreciated: literally putting up the visual on the left and scrolling through fonts on the right, trying to identify what most looked like handwriting.
For the background, we just used a free photo editor to create a similar paper-like effect. Now, shading. Both teams used shading in our maps to try and mimic that natural effect produced by the watercolors. One of our junior data scientists thought to overlay census tracts on top of the original map. So the original map is on the left, and I mean, it's fine. You can definitely tell it was created online and not hand-drawn. But using tigris or pygris, both teams imported Georgia census tract data, and we randomly assigned each tract a transparency level, an alpha level, and we put that map on top of our original map so that you could mimic that shading effect from Du Bois.
So again, the one on the left is without shading. The one on the right is with the shading. I thought this was a really cool idea.
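The random-transparency overlay can be sketched roughly like this in Python. The tract polygons and the brown hex color below are invented stand-ins for illustration; the real project pulled Georgia tract geometries with pygris (or tigris in R).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, fine for scripting
import matplotlib.pyplot as plt

rng = np.random.default_rng(2024)

# Hypothetical stand-ins for census-tract polygons as (x-coords, y-coords);
# the real code used actual Georgia tract geometries.
tracts = [
    ([0, 1, 1, 0], [0, 0, 1, 1]),
    ([1, 2, 2, 1], [0, 0, 1, 1]),
    ([0, 1, 1, 0], [1, 1, 2, 2]),
]

fig, ax = plt.subplots()

# One random alpha per tract mimics the uneven watercolor wash.
alphas = rng.uniform(0.05, 0.35, size=len(tracts))
for (xs, ys), a in zip(tracts, alphas):
    # No edge color, so the tract overlay adds texture without borders.
    ax.fill(xs, ys, facecolor="#654321", alpha=a, edgecolor="none")
```

Drawing this layer on top of the solid county fills is what produces the mottled, hand-painted look.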
And then finally, tilting. We were all really excited to get started. The Data Viz Society provides all of the data you need. We opened up the shapefile. We plotted it. It's tilted. So we just transformed the coordinate reference system here to ensure that the maps appeared straight.
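Straightening a tilted shapefile comes down to reprojecting it. Assuming geopandas, a minimal sketch might look like the following; the toy point and the UTM zone 16N EPSG code are illustrative choices for Georgia, not necessarily what either team used.

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy GeoDataFrame in plain lat/lon; a real workflow would load the
# challenge shapefile with gpd.read_file(...) instead.
gdf = gpd.GeoDataFrame(geometry=[Point(-84.39, 33.75)], crs="EPSG:4326")

# Reproject to a planar CRS appropriate for Georgia (UTM zone 16N here,
# purely as an example) so the plotted map no longer appears tilted.
straight = gdf.to_crs(epsg=26916)
```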
Python code walkthrough
All right. Python. It's weird that it's first. That just happened to be first. I don't know who made these slides, but let's dive into it. So in Python, of course, we use pandas and NumPy for the data cleaning. For the visual, we use matplotlib, specifically its pyplot, lines, and font_manager modules. And again, I'm going to do a high-level walkthrough here. There's a lot of detail that went into this. All of our code is available on GitHub, which is linked at the end.
Okay, we're going to get into some fun stuff now, finally. Here we're plotting our graphs, so we're putting these in axis one and axis four. We're taking our data, which we very originally named shapefile, and we're filling in each county with the hex codes we defined earlier in the file, and we're also adding a very thin black border around each county.
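In spirit, that step looks like the matplotlib sketch below: a 2x2 grid with maps in two of the axes, and each county filled with its bracket color plus a very thin black border. The polygons and hex codes here are invented placeholders; the real code plots a GeoDataFrame (named shapefile) rather than hand-listed coordinates.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# 2x2 grid: maps go in ax1/ax4, the custom legends later go in ax2/ax3.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 10))

# Invented county polygons keyed by hypothetical population-bracket colors.
counties = {
    "#cf3a3a": ([0, 1, 1, 0], [0, 0, 1, 1]),
    "#e0b040": ([1, 2, 2, 1], [0, 0, 1, 1]),
}
for color, (xs, ys) in counties.items():
    # Fill each county and outline it with a very thin black border.
    ax1.fill(xs, ys, facecolor=color, edgecolor="black", linewidth=0.3)
```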
Now we're adding the shading that I talked about. So again, we're going to plot the Georgia census tracts over the current maps in those axes, using those randomized alpha levels to assign transparency, and ensuring that the edge color is set to none, so there are no visible borders around the census tracts. There are only borders around the counties.
Okay, stay with me here. So now we're going to build our custom legends, and those are going to go in our remaining two subplots, axis two and axis three. I left out a couple of code snippets from this slide. The first one is me getting really annoyed by the tick marks of the subplots and just turning those off, using set_axis_off. And I'm also not showing how we defined our custom legend, but we did create a list that we call handles, and that list defines the symbol, in this case circles, the color of the symbols, the labels that correspond to them, the width of the border around the circle, and things like that. So it was very meticulous.
Okay, so what is on the screen, what are we looking at? Here we are defining axis two by pulling the first three legend handles, which correspond to 20,000 to 30,000, 15,000 to 20,000, and 10,000 to 15,000. We're setting custom coordinates for the labels' positioning in the subplot, that's the loc argument. We are ensuring there's no box frame around the legend. We're setting the vertical spacing between each handle with that labelspacing=3.2. Yes, we tried many variations of that. And we are assigning the font that we manually selected earlier, and setting the text color to gray.
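A hedged reconstruction of that legend step: the handles list is rebuilt here with circle markers via Line2D, and the colors, marker sizes, and coordinates are guesses for illustration, not the team's exact values.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Hypothetical "handles" list: one bordered circle per population bracket.
handles = [
    Line2D([], [], marker="o", linestyle="none", markersize=11,
           markerfacecolor=color, markeredgecolor="black",
           markeredgewidth=0.4, label=label)
    for color, label in [("#cf3a3a", "20,000 - 30,000"),
                         ("#e0b040", "15,000 - 20,000"),
                         ("#d9b38c", "10,000 - 15,000")]
]

fig, ax = plt.subplots()
ax.set_axis_off()  # hide the subplot's ticks and spines

leg = ax.legend(handles=handles[:3],
                loc=(0.25, 0.2),    # manual coordinates inside the subplot
                frameon=False,      # no box frame around the legend
                labelspacing=3.2,   # generous vertical gap between entries
                labelcolor="gray")  # gray label text
```

A custom font would slot in via the prop argument of ax.legend once it's registered with font_manager.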
All right, finishing touches, not so bad. We're gonna add the overall figure title in that first line of code. Again, you're gonna hear me say this all the time, we set the position of the title manually; that y=0.96 means put the title almost at the very top of the page. We're also setting font specifications here. Similarly for the subplot titles, we're again setting those titles manually, using those x and y coordinates, and setting font specifications.
So a neat trick that a team member implemented here was to use capital I instead of the number one to better match the original visual. And then at the bottom, you'll just see that we further specify the subplot dimensions. Okay, that's it for Python, it's a beautiful, gorgeous, how could any other team compete with this visualization?
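Those finishing touches can be sketched like this; the title wording is abbreviated and the exact coordinates are placeholders, but the manual y=0.96 placement and the capital-I trick are as described in the talk.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)

# Overall figure title pinned near the very top of the page (y=0.96).
fig.suptitle("NEGRO POPULATION OF GEORGIA BY COUNTIES.", y=0.96, fontsize=13)

# Subplot titles positioned manually; note the capital "I" standing in
# for the digit 1 to better match Du Bois's hand lettering.
ax1.set_title("I870.", y=1.02)
ax2.set_title("I880.", y=1.02)

# Further specify the subplot dimensions, as in the final step.
fig.subplots_adjust(top=0.88, wspace=0.35)
```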
R code walkthrough
So, R walkthrough. Here we use the tidyverse, you probably haven't heard of it, ggplot2, grid, and extrafont.
So in Python, we started by building that overall figure. Here, we started by creating our two Georgia graphs using everyone's favorite, ggplot2. We remove missing data, we plot the counties with a dark gray outline, we overlay those transparent census tracts, we set manual themes, and we use similar code to create the 1880 graph.
So here's where the fun starts. The R team utilized viewports, which allow a user to arrange multiple plots on one page, giving them the ability to precisely position and size them. So the code on the left starts by setting the background image, and then we create our very first viewport, which is that black box you're seeing on the screen. It's set to 70% of the width of the figure and 100% of the height. grid.text then defines the title and font specifications and places it within that viewport.
Okay, so let's put those maps on here. So starting one map at a time, we're gonna define a new viewport, that's our vp1 up there. We manually set the position as well as the size of the viewport itself, and we're gonna use pushViewport to activate the viewport. That tells R, hey, any code that follows this should go into viewport one. We place our graph in there, and then we use upViewport to take a step back to the main page.
Similarly, we're adding custom legends, putting those in, again, new viewports, so viewport two. grid.circle places a circle at the specified coordinates and styles it using our custom colors. We add the label that goes with it using grid.text, and then we repeat this for each label. Yeah, we could have written a function. I totally agree with you. We didn't.
For your own sanity, you're welcome. We separated out the maps for this talk so that you could see it, but realistically, this is what you would see, with all of the boxes surrounding the viewports. You should be able to see five of them. I don't know where the title went, if I'm being honest, but it does come back. So after the final green label, we use upViewport one last time, and we get our R visual.
Comparing the results
So let's go back to the first slide. Anybody wanna change their answer from before? I showed you guys the final products. I feel like you should know the answer now, right?
Well, R is on the left, Python is on the right, and there are slight differences. Team R 100% had the better font. Team R's circles in the legend are also slightly bigger and have a black outline instead of a dark gray. The spacing between the legends and the maps is a little bit different. Overall, these figures are the same, and the differences that I just named, we could actually control for to make them the exact same. It would just cost me all of my sanity.
Key takeaways
All right, key takeaways here. So again, the output is essentially the same. What mattered in this challenge was bringing strong data and design principles to apply to the visualization, leading to, again, tool choice can be secondary.
This also means, though, that you can and should explore using a new tool to expand your skillset. Your output isn't going to suffer. Your sanity might.
And finally, collaboration strengthens outcomes. I was not responsible for the shading technique or the use of that capital I instead of the one in the subplot title. Our Team R lead was also not responsible for those ideas. We maybe would have come up with them. Who knows?
So that's it. Here's my contact information. We have our blog post, our public repo of all of the data and the code if you want to recreate it for yourself. And happy to talk about other ways we could have done this. Thanks.
Let's take a few questions from Slido. Go easy on me.
Okay, this is from anonymous, and I get why. Weird flex preferring matplotlib to ggplot2, but why matplotlib and not seaborn or seaborn.objects? I don't really have a reason for that. It was, you know, we gave ourselves a week to do this challenge, and we did what was familiar to us. I'm sure it's possible in seaborn. And, you know, again, I'm not here to say one package is better or one tool is better than the other. If you love ggplot, that makes you just as great a data scientist as one using matplotlib or seaborn.
