
Untangling Nested JSON With Wes McKinney | PydyTuesday #3

Join Wes McKinney (principal architect at Posit) and Jeroen Janssens (developer relations engineer at Posit) as they dive into deeply nested JSON. Using Python, Positron, Polars, and Altair, they look at scraped GitHub data and create some sort of pulse signal for various GitHub repos. As always, true to the "PydyTuesday Uncut" title, this video is completely unedited. Every typo, mistake, web search, and "aha!" moment is left in so you can see exactly how we approach a new dataset from scratch.

Things mentioned during the session and related resources:

- Posit Conf: https://posit.co/conference/
- The Test Set: https://thetestset.co
- Positron: https://positron.posit.co
- Python for Data Analysis, 3E: https://wesmckinney.com/book/
- Python Polars: The Definitive Guide: https://polarsguide.com
- Code produced during the session: https://github.com/jeroenjanssens/pydytuesday-uncut/blob/main/session-03/github.py

#polars #pydytuesday #datascience

Jul 21, 2025
1h 14min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to PydyTuesday Uncut, where we show you how datasets are truly analyzed, warts, dead ends, and messy reality included. Today, I'm joined by a very special guest, Wes McKinney, creator of pandas, Apache Arrow, Ibis, among other things. And yeah, he's been watching data scientists struggle with messy datasets for, well, close to two decades now.

I am super excited that today he is jumping into the trenches with us. Wes, welcome to the show.

Thanks for having me on. It's going to be fun.

Yeah, so Wes, very quickly, I mean, we can talk about those projects for the next hour, but we actually have some work to do. We have to analyze some data, but very quickly, you're my colleague at Posit. So you're a principal architect. And you know, what we'll do is we'll include some links to some great interviews that you've done recently, including the one with Michael Chow on the test set. And I believe that tomorrow, the second part is going to come out. And yeah, so we'll make sure to include links to those interviews.

Before we start, it's actually a good idea to mention that we are both going to be at Posit Conf happening this September, where we're both going to be giving a talk, but we're also going to be doing a book signing session. You, of course, for the third edition of Python for Data Analysis. And I'll be signing Python Polars, the Definitive Guide. So lots of interesting things happening there. And yeah, of course, looking forward to seeing you there in person.

For now, it's remote. Yeah, it's in Atlanta. So, you know, it'd definitely be great to see more folks there and look forward to do the book signing. I'm giving a talk about some of the work that I've been involved with in Positron. And maybe we'll have a chance to take a look at that today while we're analyzing datasets also.

Yeah, that's a good idea. Talk about Positron. And I'll be doing a talk or a workshop about Polars. Maybe we get to combine the two today. And yeah, just like before, I'm the driver. And this time you're the navigator, which means that I do all the typing so that you can focus on the hard part, which is all the thinking. And in other words, you're in the lead, Wes.

Okay. Sounds good. I mean, this is fun because, you know, often people describe me as a data scientist. And it's true that I started out doing a lot of data analysis. But then over the last 10 or 15 years, I've mostly been doing engineering on data tools. And so it's actually surprisingly infrequent that I actually sit down and work with a dataset. So I enjoy when I actually get to sit down and do data analysis. Since more often than not, I'm doing engineering on the tools and less of like the day-to-day, like in the driver's seat as a data scientist.

About the dataset

But yeah, so we've got some data to look at that I put together last night. I actually haven't looked at these datasets, but I thought that it would be cool to take a look at the 50 most popular Python projects on GitHub, at least ranked by GitHub stars. And so what I did was I asked Claude Code to build some data scraping scripts, which I can publish on GitHub so other people can get them and use them to scrape the data themselves. You need to put in a GitHub API key to get around rate limits and so forth. But basically, these are – this is raw data straight out of the GitHub API. It includes commit data and issue and pull request data.

And so I thought that by analyzing this data, we could get some sense of like what's going on in these projects, maybe like what people are talking about. I don't know. If you've got an Anthropic API key, maybe we could install chatlas and get some like LLM summaries of activity in these projects. So who knows what we'll get into. But we've got some raw data and so I think I sent you the tarball of all these datasets. So we can start by untarring the datasets. And there's some other datasets that I prepared that we're not going to look at, but I think we'll just start to look at the GitHub data. And since you did write, you know, Python Polars: The Definitive Guide, we'll use Polars today. And we'll do the work inside Positron. So maybe we'll get to use the Data Explorer a little bit that I've worked on.

And we'll see what we find. Excellent. So I have the file that you sent me, this tarball. Yeah. I'm currently in a terminal, but we can immediately do this in Positron too, if you like. Yeah, why don't you just flip to Positron and you can pull up the terminal there.

So just tar xf that. There we go.

All right. So we have a data directory here. I didn't sanitize the GitHub repos for bad language. I guess there's a popular Python project that has a bad word in it. So I apologize. I can always delete that one, I guess. But yeah, so the directory that we're looking at is the GitHub repos. Let's maybe, I don't know, let's open up like the Ansible file and just see what's going on in there.

All right. Okay. So we've got, looks like a big nested JSON file. Lots of data. Yeah. With issues and all kinds of stuff. So we've got issues and recent commits. Could you open up the issues and see what the, let's see what the structure of the issues looks like.

Okay. So we've got issues, issue, issue labels. Are there, yeah, there's probably comments in there. So yeah, there's an indicator if it's a pull request or not. If the issue is closed, there's the issue body. Right. So a pull request is also internally stored as an issue.

Yeah. So in GitHub, like issues and pull requests are the same like entity type, like in their storage models. So there's just a flag, whether something's a pull request or not. Scroll down a little bit. I was curious to see like what the comments look like. If there's reactions, comments, kind of getting a sense of like maybe, oh, I guess people's reactions.

So we got, we got all the comments. These are all like comments that people have left. It's quite a lot of data. Interesting.

Loading and exploring the data

Maybe what we'll do is like, let's start and we can like, we can load one of these files with Polars and then we can decide like what kind of an initial set of fields that we want to extract from the file for analysis. And then I'm trying to think like what, you know, what type of analysis we would like to do with this.

Yeah, so we've got, you know, we've got issue, you know, we've got all the issues and pull requests. We've got issue comments. I think, you know, to get the, like, you know, issue labels, the comment texts, the comment date, like the issue creation date, that will enable us to start getting like a sense of, like, if we wanted to start computing, let's say like the pulse for each project, like what kinds of issues are getting a lot of activity. So like issue date, the issue labels. So that will give you a sense of like what kind of issue or pull requests it is. Like we want to know whether it's a pull request or an issue.

So we can look at the pull requests field to determine like whether it's, I think for an issue, the pull request field is just blank, but we can, you know, determine that by looking at the data. And so maybe like if we start working with just a single file and get it working for one file, then we can work on doing it for, you know, doing it for all of the JSON files. So we can start to maybe, you know, get a sense across all of the repos of like, you know, like where is the, like what should, which are the most active repositories, like in terms of like, you know, issues, pull requests, comments, to get a sense of, because what I would, what, the reason I was interested in looking at this was to get a sense of like the community pulse across all these projects.

Because there's so many fast-growing projects and they get a lot of stars on GitHub, but sometimes the projects that get a lot of stars on GitHub don't actually have very active communities. And so it might be interesting to see like, you know, there's especially now there's a lot of AI projects that are, because everything AI, almost everything in AI is written in Python. So there's plenty of AI projects on GitHub that have tons of stars, but some of those projects don't have very active communities. And so I thought it might be a little interesting to like get a, to compute some kind of like a project pulse. And then maybe we could visualize that versus like the GitHub, like number of stars in the repository and things like that, to get a sense of like where each project falls on the like hype, basically the hype scale, like community activity relative to hype, which is measured by GitHub stars.


I don't know how that seems to you as like maybe a starting point of one analysis we could do. Sounds like I have my work cut out for me. This is, but we'll, like you say, we'll see how far we can get. Yeah. Exactly.

So I just created a new Python file and we can just keep it like this for now. First I need to set up a virtual environment.

Yeah. And Positron has, we've been doing a lot of work on UV support in Positron. So there's still rough edges. I mean, Positron is still a relatively new project, but if you're a UV user, we are, you know, we want UV to just work. And since it, you know, everyone's, I'm using UV in a lot of personal projects. And so even though historically I've been more of a Conda user, but it's a great tool. It is just so fast.

It seems like Positron has picked up on this virtual environment. There's only one way to find out. Yeah, maybe you can go over to the, yeah, maybe you can flick over to the console here to see the, there we go. Of course. Yeah. Okay. That works.

Let me see. That looks like you've got multiple sessions. So that little window to the right of the console is the session selector. So like Positron, you can have a, see, there's the UV from your virtual environment. I guess you had another one from before. I think you could probably just kill that other session. But one cool thing about Positron is you can have like several Python sessions going at once. And you could have like one session that's attached that you're doing one set of work in, and then another one, or even you could be testing like multiple versions of Python at the same time and switch between them in the session selector.

But you can also use R and Python in the same Positron session and switch between like an R session and a Python session, which is pretty cool. Anyway. That is really cool. My talk at Conf is actually going to be about polyglot data science or how to use multiple languages at the same time. So this is very exciting.

Awesome. Okay. And then I guess let's start by reading some JSON. And see GitHub repos.

This is going to be interesting because I think it's like one big JSON object. And so I'm curious like what Polars does with that. So it's one, it's a single row. Yeah, exactly. It's a single object in JSON.

Yeah. So it's one, it's one massive struct. So how are your skills at unpacking? Well, I guess we can do struct field selection with Polars. So I guess if we look at the, yeah, let's look at the struct type.

There are two tools that you can use. You know what, and I'll just admit that right before this session, I quickly had to look it up. So I have here my book, which of course I didn't write by myself. I co-authored with Thijs Nieuwdorp. But sometimes you just have to reread what you've written. So there are two. So in this particular example, also a single object in the JSON, Pokedex.json file. And there are two tools that we can use, two methods. One is explode, and one is unnest. So explode, which is used to turn every item in a list into a new row, and unnest.

Yeah. So it sounds like probably our first step is that we want to unnest, since this is like a big JSON blob with one row. So we want to unnest it, right?

Exactly. So we can unnest, and that takes in the column name, which is called metadata.

Let me see here. The output is now shown in the console. There's a lot going on. We go all the way up. Or maybe if I save it to a variable, then we at least get it into, we can get it into the Data Explorer, which might be easier to look at.

Oh, so it's still one row. It's still one object. But now we have 80 more columns. These were all part of the metadata key inside this JSON. Yeah, because it's still just all one, all one repo.

Okay. So, all right. So we got all these URLs. We got the project description. And so whenever we have a list.

So this is the project. So this is the project metadata. Let me see. I thought that we also had the number of, oh, we have a forks count. Is that something that you could use as part of the pulse calculation? For stargazers, this is the number of stars. Stargazers count. Yeah. So that's something that we want to select.

Yeah, I'd like that. I think if we go back to the JSON file, so if we go up a little bit, so maybe collapse the metadata field if you can. Yeah. Although I guess full name would also be interesting, right? Yeah. The metadata full name.

Yeah. So then it looks like the issues are separate from the metadata field. Exactly. So if we collapse this, issues is another top level key over here. Yep. All right. That's an array, right? And so if we, yeah, because we want to flatten that into. Exactly.

Well, the issues, I think the issues are actually, they're a sibling of metadata. Are they? The metadata is currently collapsed. Yeah. So issues is a sibling of metadata. And meaning it's still, well, it's still, I called the data frame, it's in DF, yeah. It's also still in DF metadata. So it is. Yeah. So my name here is not really correct.

Oh, I see. Okay. So when you do the unnest operation in Polars, it like, it adds it to basically, like, if you scroll all the way down to the bottom is issues down in there. Oh, there it is. Okay. Yeah.

Cool. Oh, that's nice. I just double clicked on issues and then it, you actually worked on the data explorer. Is that right? That's right. Yeah. Yeah. So like, yeah. So this side panel view is the, like, all the columns in the data set, there's summary histograms, like, so when you get like a more traditional, a very not nested data set, you would get histograms and value count histograms in the summary pane.

But it is nice that when you have like a ton of columns that you can just double click on a column and it will jump to that column in the data set. Yeah. Very nice.

Since this is a list, I think we now want to use that other method to turn them into individual rows. That would be explode issues.

I'm not really good at naming unless you, feel free to suggest. No, that's totally fine. That's totally fine.

I trust you on this. This is also, this is also fun for, you know, a good, good use case for Polars because, you know, pandas doesn't have, you know, as many built-in capabilities for, for working with highly nested data. And Polars, Polars uses Arrow natively. And so when we were, you know, designing Arrow, you know, 10 years ago, like this was one of the use cases where having a, you know, next-generation data frame library with really good nested data handling was something that we really wanted. So I'm happy that we have it now. So it was always the goal.

Okay. This is currently running. Now we have issues, 50 rows. Okay. And anything that's, that's not part of the thing that's being exploded is being repeated.

Okay. And so it looks like my data parsing script, my data scraper only downloaded the data for the 50 most recent, like most recently updated issues. So if I wanted to get more comprehensive dumps of like every issue in the Ansible repository, and all of these repositories, I would have to have the data scraper paginate. And with, you know, if you have API rate limits, it would probably take, my guess would be several days to scrape all of the issue and pull request data for the 50 most popular repositories. But at least with this small snapshot of data, we should be able to get at least a flavor of a pulse of what's going on in this repository.

Exactly. Exactly. So we'll, we'll figure something out.

Yeah. So maybe, so maybe the next step here, so we've got the issues. So, okay. So, so what's going on in this, so now you're going to un-nest the issues field.

Okay. It's got lots of columns now. Look back there. Okay. So looking more interesting. Very handy that un-nest feature. Lots of URLs. And the cool thing about, so one of the cool things about Arrow and Polars is that these un-nest operations, they do very little actual data copying. They reuse the arrays that represent the data. They just move them around. So these operations in theory should be fairly efficient.

Oh, that's fantastic. Yeah. Okay. So now we've got a table of issues, all the issue columns. And so we've got people who left the comment, the issues. I assume we have a field later on in here with the issue comments. So maybe, I think for the comments, like probably we just want to count the comments and see like how many comments are on the issues to get some sense of like a good measure of issue activity. Oh, wow. We have the comment count. Okay. That's great.

It looks like these issue numbers are all out of order. I wonder how the data scraper, if the issues are in like random order, or I wonder what it selected. I'll have to go back and improve the data scraping script.

Okay. Oh, and then, oh, and there's the pull request field at the end. So like, what does that look like? So pull requests. Okay. So now we have some that are null. And so those are just issues versus PRs.

Yeah. Okay. Let me just make a note out of that. So it's called pull request. Yep. And if it's a null, that means it's an issue. Just an issue. Yep.

Selecting and wrangling columns

Do we want to make a selection of the columns inside of an issue? Oh, so like just to subset some of them or have fewer columns? Yeah, there's a lot.

We could, so we have, sorry, it's, I don't have too much real estate at the moment just to make things legible. Yeah. I think just the, maybe the, is there an updated at on the issue? Or I guess we could do like just take. Yeah, there's an updated, currently is. Okay. Okay. So updated at or closed, which.

Same. I guess we could just do maybe, maybe create, you know, we could take created at and updated at and the number of comments and the description, I guess.

Okay. We'll do another select. Yeah. I think actually, if you, so in the variables pane, if you expand the DF underscore issues on the lower right, you'll be able to, we'll be able to take a look at the column names here to kind of go through them a little bit. Yeah. Oh yeah. But it's convenient. Yeah. So we want the, I guess the repository. And we need some of the repo metadata as well, like the full name, stargazers counts.

I guess I could move this down to here. Sure. Yeah. Yeah. Now I'm using two different ways of selecting columns. I guess if we're not doing anything special and we don't have to turn them into expressions, we can just use the column. Yeah. That's a good use.

Okay. Yeah. Created at, so created at, updated at, and then I guess the pull requests, I guess we could do like, just do the, let's do the, let's do the pull request. The labels earlier you mentioned, is that something? Yeah. Yeah. Yeah. The labels. Yeah. Let's get the labels and we can wrangle those a bit after that. It's like an array of labels, I guess.

Yeah. So that in labels, we're going to want to extract the, um, the name, like the label names. So like convert the labels field to, I think there's a way in Polars to basically apply a map to select a field out of a, an array, a list of structs.

Yes, there is. Yeah. But, um, trying to think of how I would do this. I would, I think I would, uh, stick to, let me first try, because the idea is we want to convert that labels field to like a list of labels. So you just have to, cause we're not going to flatten that. We're just going to leave that as a list, a list type in Polars, but we want like a list of string for the labels. So we want to like, yeah, just project out the, um, yeah, let me just grab the labels column and see if this code works and then continue with that.

Yeah. Let's see what else. So we want the labels. What else do we want to get here? Issues is invalid because, ah, of course, because when you un-nest something, it's no longer a column. And I was, I was, uh, I copy pasted some of that, that wasn't that. Yep. Okay.

Exploding column issues is invalid. Oh, wait. Of course. Yeah. You got to get rid of that select. Yeah. Yeah. Yeah. Nothing was, nothing was selected. Yeah. Nope. And now we're still not there. Oh yeah. This is a common issue as well, is that you can have, um, if you un-nest things, you can have columns. They clobber each other. Yeah.

Yeah. So there's, there's a collision here. And now I think if there's a URL column on. Yeah. So it looks like we do need to, so we do need to move up like the selection, we need to have two selects then.

Yeah. Let me quickly check if there is, because I'm, uh, it wasn't metadata there. And I guess, yeah, there's URL over here. So what I could do at this point is say, um, I don't want select, but there is, would it be remove? Uh, yeah. Like remove URL. So that's, um, uh, I want to, I want to unselect. Maybe drop. Oh, of course there's drop. Yeah. Thank you.

Yeah. Yeah. Drop URL. So I think in this case. That's inherited from pandas maybe. Still not working, but yeah, I just want to say that yes, pandas definitely has a lot of, uh, good influences there. Now what's going on here? Couldn't, oh, wait. Labels URL also duplicate.

How, let me just first double check. So maybe it is. Safer to do a select first and then repurpose ourselves. Yeah. So select, select full name, uh, stargazers count and, and issues. Yeah. Not, I guess we need like, uh, issues. You're right. I think, I think this is it. Right. One more time. Okay. That's looking better.

Yes. All right. Okay. Much better. And so, oh, that's cool. So you can see, uh, like the little histogram next to the, um, number of comments, uh, there. Yeah. So that's nice. Yeah. If you expand the, um, you expand the caret, just like people are familiar with, you get like a bigger histogram and get some summary statistics there. So, okay. Yeah, that's great. All right. Progress.

Yeah. And so I guess maybe the only other thing we might want to do is to, um, maybe, um, to flatten the, uh, flatten the labels, like the labels field. So just pull out the label names.

Yeah. Let me see. There is, there is a way to do this. It's, it's relatively new. Um, yeah. See if Claude knows. Yeah. You know what? I mean, in the meantime, I'm just going to go to, uh.

Okay. Yeah. So you do, uh, so you do list dot eval and then pl dot element dot struct dot field, uh, name. All right. All right. All right. Yeah. List. Eval. Sorry, that was element. Or what did you say? I'm going to go back to the, go back to the zoom window here. Um, so list dot eval. And then inside there you do a pl dot element function call. And then you do dot struct. Dot field. Um, and then, uh, name. Like the string name. Yeah. Sorry.

There we go. And then maybe you have to. Does that, like, overwrite? Okay. I, I, I, when you do, when you do with, when you do with_columns here, what does that, what does that end up? Does, what does that, the result of that do here? I guess we'll, we'll see what it does, but.

Yeah. So it creates or, or, um, it allows you to create an additional column, but if you don't give it, um, is it going to overwrite the labels column? Exactly. Yeah. Okay, cool. All right. Let's take a look. Let's take a look then. Perfect. It's exactly what I wanted. Thanks a lot.

All right. So that's cool. Okay. Um, and then, uh, do you know on Polars, like if it's possible to, um, uh, if it's possible to, when you have, um, do you know how to aggregate on, on the values inside like this, uh, like labels column? So for example, if you wanted to get like a count of issues by label name, I assume that that's possible maybe.

Um, and you want it per issue or per column? Per issue or, because eventually we're going to apply this to all of our JSON files, right? So that we have. That's, that's right. That's right. Yeah. Yeah. And every project has different labels. And so like, you know, the data will be kind of messy. I guess we could apply some heuristics to try to group together similar sounding issues or use an LLM to like, you know, group the issues together.

So, um, one thing that comes to mind now is to first, um, uh, explode this so that each label becomes its own row and then do a group by, um, but there might also be another way of, you know, uh, doing this. There might be a shortcut.

Yeah. Okay. Well, I guess we can maybe we'll see if we have time for that later, but, um, yeah, maybe, uh, since we've got this like template for processing a single file, maybe now we can, um, we can do this for all of our files and make one big, um, make one big Polars data frame. Yeah. Append them all together. Yeah. So we'll leave the aggregation for now.

Processing all files

Yeah, let's, let's maybe just focus on, uh, importing all the, all the data files. And, um, I guess maybe after, after this, uh, after this video, I will, uh, I can, you know, take a half hour, an hour with Claude Code and work on, uh, you know, getting all of the data for all of these repositories. So I could, um, I can publish a more comprehensive data scraping script in case people want to download, uh, you know, all of it. A project like Ansible has had, uh, tens of thousands of pull requests. So you can imagine the size of, uh, you know, that type of dataset. It would be many gigabytes worth of data for all the issues and all the pull requests and, like, every comment and pull request comment and code review comment that's ever been left. It would be pretty, pretty insane.

Um, I guess maybe the only other thing we might want to add is like a flag, um, whether the, uh, pull request field is null. Yes. Or not. Okay. So you can add that to like, if you add that to the select, like pull request, and then maybe, uh, you can add that to the with_columns expression.

Uh, yeah. So that would be, I think it's just called pull underscore request. Yeah. So is pull request, is that what you want to use as a column name? Or we can start with, we can start with the column pull request, right? Yep. Is. Null. Yeah. And I guess. Yep. It's an issue. Yep.

Okay. Just cleaned up the code a little bit as well. Now we just have one data frame. Um, do you have your, do you have, do you have your Positron? Uh, what's that? Yeah. So I now have two data frames, two variables that I, that we no longer need, right? DF issues and DF metadata. Um, can I get rid of those using this, uh, this pane or not the variables pane, but if you go to the console and you del them, yeah, just run that and it should, uh, yeah, it disappeared. Yeah. Yeah. Yeah. Yep.

Okay. Um, and I think that this is. Okay. Fantastic. Fantastic. Yeah. So 10% is just an issue. Is it? Yeah. Yeah. From this, uh, sample of 50, 50 issues or pull requests from, from Ansible.

Yeah. Okay. Um, do we want to collapse this back to a single row? Um, or do you first want to try getting in all the data? Getting in all the data. Uh, I think if it's a single row, it'll be a little bit like more unwieldy, like in terms of doing group bys and stuff. Cause I, cause ultimately we're going to want to group by group by full name, I guess. Yep. Okay. Yeah.

You know what? I think, uh, the most obvious thing to try is, uh, just to use globbing and, uh, put an asterisk here. I'm not sure if there's. Wouldn't that be nice if it just worked? Let's find out. Exactly. And there's only one way to find out and that's, uh, to run this. And we're getting an error. And, uh, oh, so that's interesting. Read JSON doesn't support, um, uh, globbing.

Now. Yeah. I'm sure if you, I'm sure if you mention it to Ritchie, he'll, uh, I mean, uh, the Polars project might, might, uh, um, they might do it. So. Yeah. It's interesting. And, uh, and, uh, I do, I have reported things in the past and then it usually, uh, takes a few hours before it's fixed. Um, yeah, the development team is, uh, uh, yeah, they're working very hard.

Yeah, so I guess we may, so we may have to glob the, glob the directory ourselves and, um, and concatenate all of the expressions. Yeah. Yeah. No, no, I mean, in Python, if we do like, uh, if we do like, um, you know, we use the glob module, so glob.glob with that pattern that will give us the list of file paths. And then we can, uh, just run that expression in a loop and append them all together.

Yes. Okay. Yeah. That works. That works too. The other, uh, thing I was thinking of, just to, uh, mention this, is to concatenate all the JSON files using the cat command line tool and then read that in using read_ndjson, so newline-delimited JSON. But since we're working in Python, might as well stick with Python. This might be a pattern that others come across.

Right. So, uh, okay. So I believe that would be something like this. Yep. I've got to run the import. Oh, of course. That's yeah. Okay. So we got to figure out the fully, um, and we could turn this into a function. Yep. It's like wrangle JSON. Yes. Oh, let's do that. Yep.

Okay. read_github_json. That's fine. Yeah. Yeah.

Fantastic. Oh, all right. That's file name. Yeah.

Oh, no, wait. I should run this first before it autocompletes. Um, let me try out the function first with, uh, with one file name: data, github repos, principal, principal JSON.

Why is that? It's because, uh, the function is defined below where it's, uh, being used. Oh yeah. Yeah. Cause that, that, that function isn't defined yet if you were to run the script. Yeah. There you go. Yeah, that's, that's absolutely right. Okay. So this works.

But it would be cool. read_github_json, uh, uh, of f. Yep. Uh, for f in file names and, uh, yes, we can. Yeah. Cool.

Struct field not found. Oh, yeah. Oh, you know what? It's probably because, uh, some, um, my guess is that some issues have no labels. And so that's going to error in that, uh, with_columns expression. Yeah. Okay. Then let's leave this out for now. Rerun this. Yeah.

Another interesting one. That's quirky. Ah, I remember there is the pl.concat. Was something with, um, I think the error is actually happening in the, in the data processing, like the read, the read, uh, JSON function. Could be, or it's, it's getting, because it was, what was it erroring on? It's because, yeah, it's not finding the pull request field. So maybe like, yeah, these are things that we can do later. Uh, but yeah, because there's evidently some data quality issues.

So we've got 50 repos times 50, uh, samples for each one. So that, that looks about right. Yeah. These are still strings, by the way. Oh yeah. So we'll probably want to convert those to proper dates. Although, you know, since it's a random selection of issues, it seems that the dates maybe are not that, not that meaningful. Um, since it's not as much of a measure of, uh, oh yeah. See, there's a bunch of issues with no, uh, labels.

Um, so this is crazy. Like this auto GPT project I've never even heard of, but it's got 177,000 GitHub stars. So yeah. Yeah. There's a lot. It's, it's, uh, it goes to, it goes to show what I know. Hard to keep up. Yep.

Um, I'm actually curious. Can you pop open that, like the histogram of the stargazers counts just to like, see the range on these projects. So like, uh, max is what? 363 K. Can you go up to the stargazers count column? Just double click on that. And then, um, yeah. And then, and then click on the dropdown and sort descending.