
Same Data, Different Tools: Visualizing with R and Python (Olivia Hebner, Summit)
Speaker(s): Olivia Hebner

Abstract: In 2024, our team participated in a data challenge to recreate a visualization from W.E.B. Du Bois’s 1900 Paris Exposition using modern tools. We split into two groups—one using R and the other Python—to compare their strengths and limitations. Both teams used census and geographic data to map county-level populations for 1870 and 1880. Team R used ggplot2 and grid for precise layout control, while Team Python used matplotlib’s subplot system for structuring. This challenge pushed us beyond more traditional data science visualizations, requiring creative approaches to mimic Du Bois’s design. Attendees will gain insights into data wrangling, visualization techniques, and layout design to guide their own projects.

Materials: https://github.com/summitllc/Du-Bois-Challenge-2024

posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Thanks for hanging until the end of the session, I really appreciate it. And thank you to Posit for giving me a microphone, that's not something that most people who know me would do, I'm naturally loud and a yapper, but too late now, no take-backs.
So I am here today to talk about recreating a historical visualization in both R and Python. And that might sound a little similar to the first session of today, so I just want to give a quick thank you to our first presenter for giving all of the background information, I'm going to speed through that a bit more, but I echo all of her sentiments there.
Okay, so I'm starting the presentation a bit unconventionally, and I'm going to throw this slide at you with pretty much no context. One of these visuals was produced in R, and the other was produced in Python. So, some audience participation, get your blood flowing: can you raise your hand if you think the visual on the left was produced in R?
All right, okay, now raise your hand if you think the visual on the left was actually produced in Python? Okay, I see a lot of hesitancy, that's good, that proves my point later. Cool, I'm actually not going to tell you until the end.
So going into our agenda, back to our regularly scheduled programming here, I'll do a very quick introduction of the challenge, I'll walk through common elements that both teams use to create these visuals, we'll do a quick overview of the Python code walkthrough, a quick R code walkthrough, and key takeaways.
The Du Bois challenge
Okay, so each year, the Data Visualization Society, and if you haven't heard of them, I highly recommend you check them out, they publish a challenge to celebrate Black History Month, and the goal of this challenge is to recreate visuals from Du Bois' 1900 Paris exposition using modern tools.
So we at Summit, that's where I work, we approached the challenge a bit differently, because, like, what are rules? You should break them. First, we only selected one visual to recreate; as you saw, there are, I think, eight in each challenge. We all have full-time jobs, and, weirdly, our clients expect us to deliver things on time, even if we're busy with fun passion projects. I don't know.
The second thing we did was we separated into teams of three, instead of taking this on individually, the main reason behind this was I was leading a team, my colleague was leading a team, and then it was comprised of junior data scientists, and we wanted to give them a chance to learn how to collaborate together, how to share ideas together, how to share code on GitHub together and avoid merge conflicts, that did not happen, so also how to resolve merge conflicts. Great learning opportunity.
And then the third thing that we did a bit differently was we limited each team to one technology, so we said, hey, your team R, your team Python, figure out how to do it in there.
Okay, so I have to give a quick disclaimer here. I was the lead of team Python. I have, like, no bias about it. It's really chill. You probably won't notice anything, but the lead of team R is in the audience. I have encouraged her to throw things at me if I start to show my bias too much. I think she grabbed some tomatoes from the salad bar, so, like, look out for that, but you probably won't notice. I mean, I'm very, very unbiased here.
I spoiled the surprise with my first slide, but this was the visual that we decided to recreate, and I do want to quickly highlight that this was created using watercolors, which gives the counties in those maps that almost shaded effect. Also, the lettering and numbering here is extremely detailed, so Du Bois and his team accounted for the height and density of each letter and number that you see here.
Purpose of the talk
We're, like, six slides in, like, four minutes into the presentation at this point, and you may be thinking to yourself, this is a really, really cool visualization, but why are you on stage talking about it?
There are two purposes to my talk today. The first is I want to compare end-to-end R and Python workflows for a historical visualization. As data scientists, we're often tasked with creating, let's say, more traditional visuals. I've heard people at this conference call them boring. I am not saying that. More traditional visuals such as bar graphs, line graphs, pie charts. But this challenge allowed us, and actually forced us, to practice techniques that we often don't use in our day-to-day work.
The second purpose is to show that strong data and design principles produce the same high-quality results, whether you're in R or Python, leading to the conclusion in the yellow box that tool choice can be secondary, and I probably should have added an asterisk here and that said for visualizations. I don't know anything about machine learning. Maybe Python's better. Please don't come at me. I'm sure that there's a case for each, but for visualizations, I stand by this.
Common elements across both teams
Common elements that both the Python team and the R team used. For colors, we both used a free online hex color code identifier. For fonts, we took a very manual approach to identifying a font that closely matched the visual. That was actually pretty difficult and involved one of our junior data scientists going a little crazy, but it was very appreciated: literally putting up the visual on the left and scrolling through fonts on the right, trying to identify what most looked like handwriting.
For the background, we just used a free photo editor to create a similar paper-like effect. Now, shading. Both teams used shading in our maps to try and mimic that natural effect produced by the watercolors. One of our junior data scientists thought to overlay census tracts on top of the original map. So the original map is on the left, and I mean, it's fine. You can definitely tell it was created online and not hand-drawn. But using tigris or pygris, both teams imported Georgia census tract data, and we randomly assigned each tract a transparency level, an alpha level, and we put that map on top of our original map so that you could mimic that shading effect from Du Bois.
So again, the one on the left is without shading. The one on the right is with the shading. I thought this was a really cool idea.
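The random-transparency overlay can be sketched roughly like this in Python. The tract polygons and the brown hex color below are invented stand-ins for illustration; the real project pulled Georgia tract geometries with pygris (or tigris in R).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, fine for scripting
import matplotlib.pyplot as plt

rng = np.random.default_rng(2024)

# Hypothetical stand-ins for census-tract polygons as (x-coords, y-coords);
# the real code used actual Georgia tract geometries.
tracts = [
    ([0, 1, 1, 0], [0, 0, 1, 1]),
    ([1, 2, 2, 1], [0, 0, 1, 1]),
    ([0, 1, 1, 0], [1, 1, 2, 2]),
]

fig, ax = plt.subplots()

# One random alpha per tract mimics the uneven watercolor wash.
alphas = rng.uniform(0.05, 0.35, size=len(tracts))
for (xs, ys), a in zip(tracts, alphas):
    # No edge color, so the tract overlay adds texture without borders.
    ax.fill(xs, ys, facecolor="#654321", alpha=a, edgecolor="none")
```

Drawing this layer on top of the solid county fills is what produces the mottled, hand-painted look.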
And then finally, tilting. We were all really excited to get started. The Data Viz Society provides all of the data you need. We opened up the shapefile. We plotted it. It's tilted. So we just transformed the coordinate reference system here to ensure that the maps appeared straight.
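Straightening a tilted shapefile comes down to reprojecting it. Assuming geopandas, a minimal sketch might look like the following; the toy point and the UTM zone 16N EPSG code are illustrative choices for Georgia, not necessarily what either team used.

```python
import geopandas as gpd
from shapely.geometry import Point

# Toy GeoDataFrame in plain lat/lon; a real workflow would load the
# challenge shapefile with gpd.read_file(...) instead.
gdf = gpd.GeoDataFrame(geometry=[Point(-84.39, 33.75)], crs="EPSG:4326")

# Reproject to a planar CRS appropriate for Georgia (UTM zone 16N here,
# purely as an example) so the plotted map no longer appears tilted.
straight = gdf.to_crs(epsg=26916)
```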
Python code walkthrough
All right. Python. It's weird that it's first. That just happened to be first. I don't know who made these slides, but let's dive into it. So in Python, of course, we use pandas and NumPy for the data cleaning. For the visual, we use matplotlib, specifically its pyplot, lines, and font_manager modules. And again, I'm going to do a high-level walkthrough here. There's a lot of detail that went into this. All of our code is available on GitHub, which is linked at the end.
Okay, we're going to get into some fun stuff now, finally. Here we're plotting our graphs, so we're putting these in axis one and axis four. We're taking our data, which we very originally named shapefile, and we're filling in each county with the hex codes we defined earlier in the file, and we're also adding a very thin black border around each county.
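In spirit, that step looks like the matplotlib sketch below: a 2x2 grid with maps in two of the axes, and each county filled with its bracket color plus a very thin black border. The polygons and hex codes here are invented placeholders; the real code plots a GeoDataFrame (named shapefile) rather than hand-listed coordinates.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# 2x2 grid: maps go in ax1/ax4, the custom legends later go in ax2/ax3.
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(8, 10))

# Invented county polygons keyed by hypothetical population-bracket colors.
counties = {
    "#cf3a3a": ([0, 1, 1, 0], [0, 0, 1, 1]),
    "#e0b040": ([1, 2, 2, 1], [0, 0, 1, 1]),
}
for color, (xs, ys) in counties.items():
    # Fill each county and outline it with a very thin black border.
    ax1.fill(xs, ys, facecolor=color, edgecolor="black", linewidth=0.3)
```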
Now we're adding the shading that I talked about. So again, we're going to plot the Georgia census tracts over the current maps in those axes, using those randomized alpha levels to assign transparency, and ensuring that the edge color is set to none, so there are no visible borders around the census tracts. There are only borders around the counties.
Okay, stay with me here. So now we're going to build our custom legends, and those are going to go in our remaining two subplots, axis two and axis three. I left out a couple of code snippets from this slide. The first one is me getting really annoyed by the tick marks of the subplots and just turning those off, using set_axis_off. And I'm also not showing how we defined our custom legend, but we did create a list that we call handles, and that list defines the symbol, in this case circles, the color of the symbols, the labels that correspond to them, the width of the border around the circle, and things like that. So it was very meticulous.
Okay, so what is on the screen, what are we looking at? Here we are defining axis two by pulling the first three legend handles, which correspond to 20,000 to 30,000, 15,000 to 20,000, and 10,000 to 15,000. We're setting custom coordinates for the labels' positioning in the subplot, that's the loc argument. We are ensuring there's no box frame around the legend. We're setting the vertical spacing between each handle with that labelspacing=3.2. Yes, we tried many variations of that. And we are assigning the font that we manually selected earlier, and setting the text color to gray.
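A hedged reconstruction of that legend step: the handles list is rebuilt here with circle markers via Line2D, and the colors, marker sizes, and coordinates are guesses for illustration, not the team's exact values.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

# Hypothetical "handles" list: one bordered circle per population bracket.
handles = [
    Line2D([], [], marker="o", linestyle="none", markersize=11,
           markerfacecolor=color, markeredgecolor="black",
           markeredgewidth=0.4, label=label)
    for color, label in [("#cf3a3a", "20,000 - 30,000"),
                         ("#e0b040", "15,000 - 20,000"),
                         ("#d9b38c", "10,000 - 15,000")]
]

fig, ax = plt.subplots()
ax.set_axis_off()  # hide the subplot's ticks and spines

leg = ax.legend(handles=handles[:3],
                loc=(0.25, 0.2),    # manual coordinates inside the subplot
                frameon=False,      # no box frame around the legend
                labelspacing=3.2,   # generous vertical gap between entries
                labelcolor="gray")  # gray label text
```

A custom font would slot in via the prop argument of ax.legend once it's registered with font_manager.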
All right, finishing touches, not so bad. We're gonna add the overall figure title in that first line of code. Again, you're gonna hear me say this all the time, we set the position of the title manually; that y=0.96 means put the title almost at the very top of the page. We're also setting font specifications here. Similarly for the subplot titles, we're again setting those titles manually, using those x and y coordinates, and setting font specifications.
So a neat trick that a team member implemented here was to use capital I instead of the number one to better match the original visual. And then at the bottom, you'll just see that we further specify the subplot dimensions. Okay, that's it for Python, it's a beautiful, gorgeous, how could any other team compete with this visualization?
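Those finishing touches can be sketched like this; the title wording is abbreviated and the exact coordinates are placeholders, but the manual y=0.96 placement and the capital-I trick are as described in the talk.

```python
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2)

# Overall figure title pinned near the very top of the page (y=0.96).
fig.suptitle("NEGRO POPULATION OF GEORGIA BY COUNTIES.", y=0.96, fontsize=13)

# Subplot titles positioned manually; note the capital "I" standing in
# for the digit 1 to better match Du Bois's hand lettering.
ax1.set_title("I870.", y=1.02)
ax2.set_title("I880.", y=1.02)

# Further specify the subplot dimensions, as in the final step.
fig.subplots_adjust(top=0.88, wspace=0.35)
```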
R code walkthrough
So, R walkthrough. Here we use the tidyverse, you probably haven't heard of it, ggplot2, grid, and extrafont.
So in Python, we started by building that overall figure. Here, we started by creating our two Georgia graphs using everyone's favorite, ggplot2. We remove missing data, we plot the counties with a dark gray outline, we overlay those transparent census tracts, we set manual themes, and we use similar code to create the 1880 graph.
So here's where the fun starts. The R team utilized viewports, which allow a user to arrange multiple plots on one page, giving them the ability to precisely position and size them. So the code on the left starts by setting the background image, and then we create our very first viewport, which is that black box you're seeing on the screen. It's set to 70% of the width of the figure and 100% of the height. grid.text then defines the title and font specifications and places it within that viewport.
Okay, so let's put those maps on here. So starting one map at a time, we're gonna define a new viewport, that's our vp1 up there. We manually set the position as well as the size of the viewport itself, and we're gonna use pushViewport to activate the viewport. That tells R, hey, any code that follows this should go into viewport one. We place our graph in there, and then we use upViewport to take a step back to the main page.
Similarly, we're adding custom legends, putting those in, again, new viewports, so viewport two. grid.circle places a circle at the specified coordinates and styles it using our custom colors. We add the label that goes with it using grid.text, and then we repeat this for each label. Yeah, we could have written a function. I totally agree with you. We didn't.
For your own sanity, you're welcome. We separated out the maps for this talk so that you could see it, but realistically, this is what you would see, with all of the boxes surrounding the viewports. You should be able to see five of them. I don't know where the title went, if I'm being honest, but it does come back. So after the final green label, we use upViewport one last time, and we get our R visual.
Comparing the results
So let's go back to the first slide. Anybody wanna change their answer from before? I showed you guys the final products. I feel like you should know the answer now, right?
Well, R is on the left, Python is on the right, and there are slight differences. Team R 100% had the better font. Team R's circles in the legend are also slightly bigger and have a black outline instead of a dark gray. The spacing between the legends and the maps is a little bit different. Overall, these figures are the same, and the differences that I just named, we could actually control for to make them the exact same. It would just cost me all of my sanity.
Key takeaways
All right, key takeaways here. So again, the output is essentially the same. What mattered in this challenge was bringing strong data and design principles to apply to the visualization, leading to, again, tool choice can be secondary.
This also means, though, that you can and should explore using a new tool to expand your skillset. Your output isn't going to suffer. Your sanity might.
And finally, collaboration strengthens outcomes. I was not responsible for the shading technique or the use of that capital I instead of the one in the subplot title. Our Team R lead was also not responsible for those ideas. We maybe would have come up with them. Who knows?
So that's it. Here's my contact information. We have our blog post, our public repo of all of the data and the code if you want to recreate it for yourself. And happy to talk about other ways we could have done this. Thanks.
Let's take a few questions from Slido. Go easy on me.
Okay, this is from anonymous, and I get why. Weird flex preferring matplotlib to ggplot2, but why matplotlib and not seaborn or seaborn.objects? I don't really have a reason for that. It was, you know, we gave ourselves a week to do this challenge, and we did what was familiar to us. I'm sure it's possible in seaborn. And, you know, again, I'm not here to say one package is better or one tool is better than the other. If you love ggplot, that makes you just as great a data scientist as one using matplotlib or seaborn.
