Resources

TidyTuesday + Posit | PydyTuesday | Weekly Community Python Data Project

Posit software engineer Isabel Zimmerman discusses the TidyTuesday project and the Posit PydyTuesday Initiative. Learn how to participate in weekly TidyTuesday projects, watch Isabel explore Central Park squirrel data, and discover how to deploy your work to Posit Connect Cloud.

Find her code here: https://github.com/isabelizimm/pydy-tuesday

Check out these repositories to join TidyTuesday and the Posit PydyTuesday Initiative:
TidyTuesday repo with datasets: https://github.com/rfordatascience/tidytuesday
Posit PydyTuesday repo: https://github.com/posit-dev/python-tidytuesday

Learn more about Quarto and Connect Cloud:
Quarto website: https://quarto.org/
Posit Connect Cloud: https://connect.posit.cloud/

Other videos in this Posit PydyTuesday playlist: https://www.youtube.com/playlist?list=PL9HYL-VRX0oSDQjicFMLIIdcLv5NuvDp9

#pythoncontent

Transcript

This transcript was generated automatically and may contain errors.

Hi there, I'm Isabel. I'm a software engineer at Posit, where I build open-source data science tools. Today I'm here to talk to you all about the TidyTuesday project, which is a weekly social data project hosted by the Data Science Learning Community.

This is a fantastic project. I have loved it for many years. It's actually how I got started in R and Python, being able to explore these different data sets. Every week, the Data Science Learning Community posts a new data set. This could be anything from squirrel sightings in New York City to chocolate bar ratings or even Taylor Swift lyrics. It's really a wide variety of data sets, and it's a great opportunity to see types of data that maybe you wouldn't see in your daily life.

It's also a great way to get involved with real-world data and a real-world data community. And this comes in because once you've analyzed your data, you've made some sort of app or dashboard or even just a really cool visualization you want to share with others, you can post it on social media with the hashtag TidyTuesday. If you go on your favorite social media and look this up, you will see probably many, many, many examples of this vibrant community of people sharing out just cool data things. It's really fun to see what other people have done with the same data that you saw.

At Posit, we have loved the TidyTuesday project for a very long time, and it's such a great community. We wanted to find a way to get Python users involved to really learn these critical data science skills and explore data in kind of a fun community-centered way. So we've decided to host a challenge alongside the data set that gets released for TidyTuesday. You can solve the TidyTuesday, of course, in R or Python, but every week we'll be sharing some sort of Python solution to the challenge we pose to people.

Exploring Central Park squirrel data

I'll show you all an example, so let's get started. So I'm in the Positron IDE. We're going to be doing a little bit of data analysis for Central Park squirrel sightings. This is a TidyTuesday data set that was released in 2023 on May 23rd. I think it's a lot of fun, and I think it'll be a good way to show everyone just kind of the beauty of the TidyTuesday world.

So we're going to start by loading our data in using pandas. Later on, I'll also be using Matplotlib and Seaborn, so I've imported those as well. I find the easiest way to load data is by using just the raw URL given by GitHub. So I'll get the raw URL, paste it in here, and call pandas read_csv to get a data frame.
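A minimal sketch of the loading step described above. The raw URL and the column names are assumptions (the video does not spell them out), so check the TidyTuesday repo for the current path. The inline rows are made-up stand-ins so the sketch runs without network access.

```python
import io

import pandas as pd

# Assumed raw URL for the 2023-05-23 dataset; confirm it in the TidyTuesday repo.
SQUIRREL_CSV_URL = (
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/"
    "main/data/2023/2023-05-23/squirrel_data.csv"
)


def load_squirrels(source=SQUIRREL_CSV_URL):
    """Read the squirrel CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)


# Made-up rows for illustration only; column names are assumed, not confirmed.
SAMPLE_CSV = """X,Y,Unique Squirrel ID,Primary Fur Color
-73.95,40.79,demo-1,Gray
-73.96,40.78,demo-2,Cinnamon
-73.97,40.77,demo-3,
"""

preview = load_squirrels(io.StringIO(SAMPLE_CSV))
print(preview.head())
```

Calling `load_squirrels()` with no argument would fetch the file from the assumed URL instead of the inline sample.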

We can run this cell, and I'm going to look over at this variables pane, maybe just peek at the data here. We can collapse these little sparklines and see what sort of data is available. This is also on the TidyTuesday website. Each data set comes with a data dictionary, so you know what each column is going to be all about, but we can look for ourselves. So we have x, y, I think that's the latitude and longitude of each squirrel, unique squirrel ID, things like primary fur color, highlight fur color, some notes, whether they've been running, chasing, climbing, eating, foraging, and maybe what sounds they're making and whether their tail is twitching.

All right, so we know a little bit about our data, and I think there's a lot of missing values, so we're just going to fill those in with unknown to give us a little bit prettier plot. We're also going to be dropping some columns that have a lot of missing values that we don't want for our analysis, so I can run this cell as well.
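One way the cleanup might look, as a sketch rather than the exact code from the video. The 50% threshold, the "Unknown" label, and the column names are hypothetical choices. This version drops the sparse columns first so that filling does not hide how empty they were.

```python
import pandas as pd

# Hypothetical settings; the video does not show the exact values used.
MAX_MISSING_SHARE = 0.5
FILL_VALUE = "Unknown"


def clean_squirrels(df):
    """Drop mostly-empty columns, then label the remaining gaps in text columns."""
    missing_share = df.isna().mean()
    sparse_cols = missing_share[missing_share > MAX_MISSING_SHARE].index
    kept = df.drop(columns=sparse_cols)

    # Fill only non-numeric columns so coordinates and counts keep their dtype.
    text_cols = [
        col for col in kept.columns
        if not pd.api.types.is_numeric_dtype(kept[col])
    ]
    return kept.fillna({col: FILL_VALUE for col in text_cols})


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "Primary Fur Color": ["Gray", None, "Black", "Gray"],
        "Color notes": [None, None, None, "white tail"],
        "X": [-73.95, -73.96, -73.97, -73.98],
    }
)
cleaned = clean_squirrels(sample)
print(cleaned)
```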

So the first question I think I'm interested in answering is maybe what color of squirrels are sighted the most. So our first plot, we'll maybe explore this. I'll look at it using matplotlib and seaborn. We'll start by making a matplotlib figure. Then we'll be using a seaborn countplot and wanting to think about the primary fur color. If we look at maybe all of the different options for what we have, I would think the primary fur color is more important than the highlights of the squirrel. We also want to add in the order so we can see these fur colors in order from maybe greatest to least. So we can see here we have a pretty good idea of mostly gray squirrels next to cinnamon, black, or unknown.
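The fur color plot could be sketched like this, assuming the column is named "Primary Fur Color". The figure size, title, and axis labels are illustrative choices. The key step is passing `order=` to `sns.countplot` so the bars run from most to least sighted.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

FUR_COLUMN = "Primary Fur Color"  # assumed column name


def fur_color_order(df):
    """Return fur colors sorted from most to least sighted."""
    return df[FUR_COLUMN].value_counts().index.tolist()


def plot_fur_colors(df):
    """Draw a count plot of primary fur color, ordered by frequency."""
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.countplot(data=df, x=FUR_COLUMN, order=fur_color_order(df), ax=ax)
    # Put the question being answered in the title, for later readers.
    ax.set_title("Which fur colors are sighted most often?")
    ax.set_xlabel("Primary fur color")
    ax.set_ylabel("Number of sightings")
    return ax


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {FUR_COLUMN: ["Gray", "Gray", "Gray", "Cinnamon", "Cinnamon", "Black"]}
)
ax = plot_fur_colors(sample)
```

In a notebook or Quarto document the figure renders automatically; in a plain script, `plt.show()` would display it.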

And just for later us, let's add in what question are we trying to answer with this. Maybe another question is what are the squirrels doing when they're being sighted? So let's make a second chart to see maybe what are the different activities that are occurring. We can take a peek at how many activities are being done. Looks like foraging is happening the most, but let's throw in a plot as well so we can share this more easily with others.
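A sketch of the activity count and bar chart, assuming the five activity columns hold True/False values and use the names below. The video does not show which plotting call was used, so plain Matplotlib bars stand in here.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed names of the boolean activity columns.
ACTIVITY_COLUMNS = ["Running", "Chasing", "Climbing", "Eating", "Foraging"]


def count_activities(df):
    """Count sightings per activity, most common first."""
    # Summing a boolean column counts its True values.
    return df[ACTIVITY_COLUMNS].sum().sort_values(ascending=False)


def plot_activities(df):
    """Draw a bar chart of how often each activity was observed."""
    counts = count_activities(df)
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(counts.index.tolist(), counts.to_numpy())
    ax.set_title("What are the squirrels doing when they are sighted?")
    ax.set_xlabel("Activity")
    ax.set_ylabel("Number of sightings")
    return ax


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "Running": [True, False, False, False],
        "Chasing": [False, False, False, False],
        "Climbing": [False, False, False, False],
        "Eating": [False, False, False, False],
        "Foraging": [True, True, True, False],
    }
)
counts = count_activities(sample)
ax = plot_activities(sample)
print(counts)
```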

All right, so we also have a little bar chart of the different activities that are happening. I'm going to add a little bit more pizzazz onto all of these plots and we'll come back to see what the end result looks like. This looks fantastic. Let's go figure out how to put it on Posit Connect Cloud.

Publishing to Posit Connect Cloud

So now that I have my Quarto markdown file created, I pushed it up to GitHub so everyone far and wide can see it. My source code is available for others. And this looks great. You know, I have all of my Python code, we can see all of my plots, but the one thing is that it's not rendered here. And if I want someone to not see the source code, but maybe the beautiful plots that I've generated, what I really want is this HTML file or this HTML rendered output somewhere available that doesn't quite look like this. I want to see that beautiful webpage that Quarto creates. So we're going to use Posit's Connect Cloud to render this file and be able to share it far and wide with the TidyTuesday community.

One thing that I will need before we publish is a requirements.txt file. This is going to tell Connect Cloud what needs to be installed to be able to run my code. So we know that we've used Matplotlib, Pandas, Seaborn, and then in order to run Quarto itself, we'll need Jupyter as well.
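Based on the packages named above, a requirements.txt for this project might look like the following. The versions are left unpinned here as an assumption; pin them if you need reproducible builds.

```text
jupyter
matplotlib
pandas
seaborn
```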

All right. So let's go to Connect Cloud. So I was able to just sign in with my GitHub account. It was super easy setup, and let's start publishing things. So I know I made a Quarto file, so we'll select our framework to be Quarto. It's this TidyTuesday repository. I'm on the main branch, and the primary file that I would like to be published is this squirrels.qmd. There's some advanced settings if you want to change your Python version. 3.11 should be fine, and I'll toggle this setting to publish on push as well. So let's make this public, and let's publish it.

So one thing that's really nice about Connect Cloud is number one, it's super fast. You can see we're installing dependencies. Already done. We're rendering the document. It shouldn't take more than a second, and we are publishing for everybody to see, and this is all live, and there you go. That was exactly as long as it took to publish something from a GitHub repository to Posit's Connect Cloud.

I'm able to share this with others. I can copy my link here, post it on LinkedIn or Bluesky or wherever you're active, and really connect with other people in the TidyTuesday community, maybe show your friends and family all you've learned about Central Park squirrel sightings or whatever other datasets are available through TidyTuesday. That has been a very quick example of what a TidyTuesday dataset looks like and how you might publish an example of your work on Posit Connect Cloud. We hope you enjoyed this little walkthrough, and hope to see you online on the TidyTuesday hashtag. This has been a great community, and it is only because of people like you who really engage and share the joy of data with others. So thank you so much for watching, and we'll see you online.