Resources

Exploring Web APIs | PydyTuesday Uncut #1

Join Michael Chow (open source developer at Posit) and Jeroen Janssens (developer relations engineer at Posit) as they dive into this week's #PydyTuesday dataset about Web APIs. Tools include uv, Positron, Polars, Plotnine, Great Tables, and the Unix command line. True to the "PydyTuesday Uncut" title, this video is completely unedited. Every typo, mistake, web search, and "aha!" moment is left in so you can see exactly how others approach a new dataset from scratch.

Things mentioned during the session and related resources:

- Code produced during the session: https://github.com/jeroenjanssens/pydytuesday-uncut/blob/main/2025-06-17/01-start.py
- PydyTuesday: https://github.com/posit-dev/pydytuesday
- TidyTuesday: https://github.com/rfordatascience/tidytuesday
- Getting Data from the TidyTuesday Repo with Python: https://www.youtube.com/watch?v=ol2FrSL5gVU
- Positron IDE: https://positron.posit.co
- Data Science at the Command Line: https://jeroenjanssens.com/dsatcl/
- Python Polars: The Definitive Guide: https://polarsguide.com
- Polars: https://pola.rs
- Plotnine: https://plotnine.org
- Great Tables: https://posit-dev.github.io/great-tables/
- The Big Year: https://www.imdb.com/title/tt1053810/

Chapters:

00:00 Introduction
02:46 Getting the data with uv
13:18 Positron IDE
17:42 Importing Polars
23:17 Plotting a bar chart with Plotnine
33:55 Inspecting duplicates
46:30 Handling missing values
58:56 Crafting a great table
1:38:48 Reflection

Jun 23, 2025
1h 41min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey, Michael. Jeroen, always a delight. How's it going? So good.

I think it's relevant to say we're like six hours apart, so I'm having a beautiful morning and you're... I'm about to have a beautiful weekend. It's Friday afternoon here.

And we're about to analyze some data, is that... That's the goal. That's the plan.

I just want to introduce you very quickly so that people know who you are for the very few who don't yet know who you are. You're Michael Chow. You're an open source developer at Posit. Known, for example, for the Great Tables package, among other things.

To also send it back, you're Jeroen. Author of Data Science at the Command Line, which I find to be one of the most intriguing data science books about how you can use the command line to slice and dice data. And more recently, Python Polars. I always call it just the Python Polars book. The official title is Python Polars: The Definitive Guide.

Yeah, I'm such a big fan of your Polars work. I know we've done like a plotting workshop together with plotnine. So I'm excited to be able to do a little bit like a low-key data analysis with you.

Yeah, I'm excited too. I'm also a little bit terrified because people will find out that it's one thing to write a book about Polars and another thing to be actually working with it and to put it to practice.

Introducing the dataset

Yeah, can you tell us a little bit about what kind of data we're using? Yeah, so I think for background, we're getting into the PydyTuesday data. This is a dataset on API specs.

And maybe for some background, PydyTuesday is a thing where every week they upload data from a different dataset for people to analyze. And it's, I guess, an opportunity for people in the Python community to sort of see what it looks like to analyze real data and to share out basically what they did. So you can see what people did with real data and see how people are analyzing data, or try it yourself and share out results.

It's been going on for a while under the name Tidy Tuesday for the R community, right? And tidy, I guess, well, on the one hand, because well, we'll see how tidy the data actually is or whether we actually need to tidy it up. But what I think is that it comes from the tidyverse. And I guess it sounds like kind of cool when you combine it with Tuesday.

I think this will be great, though, because I think what I love about it is like I feel like the Python Polars book is such a great kind of like lesson on Polars. But I feel like now it's like it's as if we read like a book on birds, you know, like how to identify birds. And now we're going on like a bird walk with you. The correct term for that is birding. We're birding.

So, no, I'm currently on the GitHub repo for TidyTuesday. And it has some instructions on how to get the data. And you already mentioned, we know this week it's a dataset about APIs, right? Very meta. But we haven't looked at the data yet whatsoever. We're both going into this, yeah, fresh, right?

Getting the data with uv

So, I'll do uv pip install 'polars[all]'. Want to do plotnine as well? Might as well start with that.

So, it's just an empty directory right here. I didn't clone this repo or anything. We can just start using the tool from the command line. Now, for this, we can use uv. There are other ways to do this, to invoke this tool. But this way, you don't even have to install it. So, the tool is called pydytuesday. It has a subcommand, tt-download, and it takes one argument, which is the date. And I know that the last Tuesday was June 17th, right?

All I have is, I have installed uv, and that's it. It installed some packages. uv is blazingly fast, just like yours.

Okay, we've got a bunch of CSV files, a PNG file, and a YAML file. Now, let's put those in a subdirectory, right? That's perhaps nicer. So, I'll create a subdirectory called data, and then I'll move everything into the data directory.

Exploring the data dictionary

And now, the question is, how do we proceed? We might open, like, an IDE, and then maybe peek at the PydyTuesday README on GitHub, which usually has some nice descriptions of the data, just to get a feel for kind of what's there.

So, this week we're exploring Web APIs. The lead volunteer for PydyTuesday, Jon Harmon, is writing a book about working with Web APIs with R, as well as a series of R packages to make it easier to create API-wrapping R packages, right?

So, here are some questions that we may ask the data, right? What API specs are provided? How many different APIs do providers provide? What licenses? Are there any APIs listed more than once in a data set?

So, those are some good questions to start off with, and I'm sure that we'll be able to come up with others.

Yeah, I do feel like something like the first question, just like what API specs are provided, is kind of a nice starting point. It looks like there's tables describing each CSV, so it could be nice just to try to answer that question to help us kind of route through the data a little bit.

Five CSV files, and I believe what they all have in common is the name apisguru, right? Because this comes from apis.guru. It's kind of a meta API.

Maybe we could try cracking it open and counting the format in API origins? I think that's a good start, just to get a little bit warmed up.

Opening Positron and loading data

It just opened up Positron, right? A new IDE created by Posit, and it supports both R and Python. So, we're obviously going to use Python for this.

I cheated a little bit before we started recording, and I just wanted to make sure that everything was working. It's a hidden directory, .venv, where your virtual environment is defined.

So let's do, let's just start. Let's start by importing Polars. You see, I just pressed Cmd+Enter on my Mac, and then it sends that line down here.

Now, you may see these squigglies every now and then and suggestions, and that's just the IDE trying to be helpful.

Let's read api_origins and try counting the format column. It's in data/api... tab... origins. Let's save this to a variable right there. Okay. No errors. That's good. That's already half the battle.

So this data is already quite clean, I can imagine. Now, okay, so here in Positron, here on the right, we see that the origins data frame is now, we can immediately see this being listed here. And if we click on this icon, we can even immediately explore this data frame.

Okay, so four columns. None of them have missing values. Yeah, and it looks like it kind of counted the format column for us with that little bar chart. Oh, wow. We don't have to do anything.

Discussing AI assistants

I mean, of course, the next step would be to use an assistant, right, an AI assistant, but let's stay away from those for now, maybe in a later session.

I do find them really helpful for, like, where we looked up what the columns in api_origins are. I do feel that's a step assistants are really good at, like the scanning, the data descriptions. But I definitely think it's, like, a future thing for sure. It's kind of nice to just get in there manually.

And I have to say, I've been playing around a little bit with these AI assistants when it comes to doing data analysis, right? And sometimes I'm just so really impressed with what they can do and how they can correct themselves and come up with interesting solutions. And because they produce code, right, you can always see what they're doing, what they're – you can always adjust that. It's not like – so that part is not hidden from you. So I like that. But let's do it manually. You're my AI assistant at this moment.

Plotting the format column with plotnine

Yeah, I think – okay, so what I'd note is, like, this bar chart's really great. Like, we see right out of the gate that Swagger, there's, like, a thousand counts. We can also do some checks to figure out what that means, like a thousand. But the one thing I'd note is, like, this bar chart's nice for, like, a quick glance. I would call this, like – this bar chart's, like, in the spirit of, like, a spark line. So, like, notice that you have to basically mouse over to get the labels. It's basically, like, this bar chart is a compromise between a graph and, like, a table entry. So it's, like, really stripped down. You don't see the labels, but you can kind of mouse over. Like, if anything interests you, you can kind of mouse over to get more details.

I would say one thing that could be nice is using plotnine to do the count just to see the chart, like, kind of a full chart with the labels.

What I always do when I work with plotnine is I do from plotnine import *, right? And I know that's being frowned upon in the Python community, not so in the R community. The reason I do that is because then you can use all those functions immediately. The only thing is that when you do that, it will clutter up your global namespace. So you will see all those functions that plotnine provides.

So this is, of course, complaining here because, like I said, this is frowned upon. But we got ourselves a data frame, right? We want to recreate this bar chart. So ggplot(origins), you know, first argument is the data frame itself. Then we'll do an aesthetics mapping, and that's the format column, right? geom_bar.

Yeah, this is nice. So now things are just labeled a little bit, so this plot can breathe a little more than in the table. Yeah, we see Swagger with, like, 1,000 and OpenAPI with 800. I do feel like the thing I'd be curious about is: what is this a count of?

So, like, it seems like maybe these are the API endpoints or – Or, yeah, we could have a look at one.

No, I think this really goes to the origin, description of the actual API. What are the parameters that you can give? Yeah, so we should look at – we should try to find one that we like. Is the – for example, the Star Wars API, is that in here?

I saw Transavia, a Dutch flight operator. XKCD has an API.

So this is a YAML file, and I guess the format is OpenAPI. And I've never looked at any of this. I mean, this is brand new to me. Yeah, no, I mean, I guess this API is probably not that big, right? Because this is all about the XKCD comics. But we have to give it a comic ID, and then it returns things, I guess. Yeah, interesting. So it could be – yeah, I guess the big question here is kind of, like,

what might someone be interested in about API data? Is there something we could learn about APIs in general from this data? Or – that is a good point.

Like, we could also pull out – so in this case, we're pulling out, like, some APIs to get kind of, like, a closer glimpse at – What this data is made up of. I'm just trying out things. I mean, it's good to do a little bit of a deep dive in order to get some sense of the data. I mean, APIs are not, you know, too complicated, but still.

We can't do this for every API, but might as well. So this is one endpoint right here. And this is what it returns, some JSON, which, again, links to a PNG file.

Okay, so coming back to this, this is a specification of one API, the XKCD one. And this particular YAML format appears to be the OpenAPI format. And that's what it says here as well, OpenAPI. OpenAPI seems to be quite popular. It's the second one after Swagger. We could also have a quick look at what a Swagger specification looks like.

Let's do Zoom. Yeah, this is nice that you can kind of, like, pull up APIs and, like, peek at some of this different stuff.

Yeah, I think, I mean, if the text is too small, just let me know, right? Yeah, it appears that Zoom has a couple of API endpoints. We won't be able to do anything. This most definitely needs some authentication, but still. So this is not YAML. This, again, is JSON.

I guess there is some way to dive even deeper into these specific endpoints, right? See, this is a somewhat more involved API endpoint as part of the Zoom API.

Exploring the data structure

I wonder, I'm curious, like, what's the relationship between API info and API origins? Like, what all we have in this data?

It feels like maybe, well, I think here's the number of rows almost, just to start. So they have pretty similar numbers of rows, which makes me wonder if it's like there could be duplicates in here, in origins.

That was also one of the questions we saw in the README. It might be that a single API has specifications in multiple formats.

Yeah, let's take a crack at that. So that would be, that's origins. I'll group by name. Concatenate the formats into a list and then check which lists have more than one item.

Group by name. Aggregate the format. Let's check here. So here at the bottom, we can see the first five and the last five rows. Now I'll filter this: filter pl.col("format").list.len() greater than one.

There are, indeed. And so there are seven rows more in origins than in info, and this is the reason: there are seven APIs that have multiple formats.

Is there, I can't remember, does Polars have some kind of method to mark duplicates? Like, it has a duplicated method? On the expression, or the... you know what, I can't quite remember. Let's look it up. Oh, it's a method of the data frame, of course: is_duplicated.

So that works too. Let's try that. So origins.is_duplicated(). Oh, but this gives you back a single series. You might need to do it on just the two columns, basically; we want to know if name and format are duplicated.

I guess just name. So can we do this? So let's add a new column: pl.col("name").is_duplicated(). Yeah, gotcha.

Oh, wait. So now, name, of course, this should be is_duplicated. So now we have a new column that indicates whether it's duplicated. Now we could say, okay, I only want the duplicated ones. And this gives us back 14 rows, right, because now we haven't concatenated the format.

Yeah, and this is kind of nice because then you can kind of see some of the other columns, like version or URL. So like bbc.com, they have this NA format and OpenAPI, and it looks like they go to kind of different URLs. It could just be, I don't know if this is like a quirk, like it's just kind of misentered, because NA goes to what seems like a BBC... Yeah, maybe they have their own format or maybe it's just some missing data.

So those were two different ways of figuring out which rows are duplicated. So we don't really want to use this data frame method. It's much better to use the method on the column, right, on the series.

And it seems like my guess is because this table, this duplicate table is 14 rows and there's a seven row difference between info and origins. That is like if there are two of each kind in this duplicate table, maybe it's like there are seven sites that were duplicated. Yeah, that's absolutely the case. Right here you can see everything. All of them have two formats. Which is a coincidence, I guess.

Maybe it's like they migrated formats or something like that. I don't know. An API wizard might know, but.

Loading the remaining CSV files

Hey, so, okay, so what's next, right? We've had a brief look at two of the five CSV files. Why don't I load in the other three and you can have a quick look.

So, we have here, I don't think that logos is all that interesting. Categories is interesting. Licenses? Seems pretty juicy. Licenses. And which file is that? I think it's in api_info. Okay. Yeah, we have that.

So, the logos, I guess we can also just double-click on this file right here and then inspect it. We don't have to load it in as a Polars data frame at first. Now I'm actually curious: what does the XKCD logo look like? Let's take a peek. Static, terrible, small logo.

I mean, I guess one very data way could be filtering XKCD out, printing it in the console, and copying it from there. Maybe that's nice.

So, which is part of logos, which I don't have yet. Logos, api_logos. So: logos.filter(pl.col("name").str.contains("XKCD")), right? So, the contains method is part of the str namespace, as they say.

And I will say, I love contains. I actually feel like a lot of times I see early data scientists doing a lot of equals equals, like, is exactly this name. But I do feel like in a data analysis, contains is nice. You just shoot it and you see if what you were looking for comes back. I mean, I wasn't expecting any other API to have the characters XKCD in them. Or at least for exploring, it's nice to just see if it gives you what you were looking for.

Whoa, whoa, whoa. What's going on here? Do you want Positron to open up the external website? Let's do it. Yeah, nice. That's a terribly small logo. Okay, great. Now we know that the logos data frame is there.

Planning a Great Tables visualization

I feel like there's a real sweet Great Tables opportunity here. Oh, yes. Oh, do you want to explain what Great Tables is? So Great Tables is a tool to... well, if you do Polars .style, it lets you style your data frame a little bit as a table for presentation. But what could be neat is, if we look at the licenses a bit, Great Tables can display logos in the table.

That could be fun to maybe pull out a few examples of APIs with different licenses. And then show there like logos or even just a few APIs we like. Maybe we can just try.

So let's first have a look at the data before we jump to Great Tables straight off the bat. We'll have to build it up, build up the suspense a little bit. Where did you say license was? I think it's in api_info. license_name and license_url. Let's have a look here at this. Nice, license. There are no missing values reported, but I think the problem is it's all the string "NA".

We could. I think Polars has the null_values argument, if you want to. So, Apache 2.0 license. Oh, there are some other ones. Nice, there's some Creative Commons. There is the Auckland Museum license. But there are also a lot of missing ones.

So then, I guess, actually, you know what, I want to use the command line for this real quick. Do I have xsv installed? No, I don't. Let's just go here. api_info. Okay, this looks horrible. This is not a good way. What are you trying to do? Well, I just want to verify, like, am I now looking at a value that's actually an "NA"?

Well, yeah, it should be, because it's not being reported as missing. We have to pass null_values to Polars. I think in read_csv. Yes.

So Positron is pretty helpful here again. Because otherwise, if you didn't have this, it's like, okay, you would have to know that Polars uses the word null, N-U-L-L. And even then, it could be that an actual string would be called "null" and that it's still not interpreted as a missing value. But I guess you're right. We should have, is it null_values? Yes.

And I guess we need the same for the others here as well. And this is an interesting Polars thing, too, with null_values. Without null_values, Polars tends to read a lot more things as strings, because imagine you have a CSV column of numbers like 1, 2, 3 and NA. Polars might interpret that as a string column, because 1, 2, 3 and NA looks like a column of strings, basically.

And I think that's actually what's happening right now. I'm doing origins here. Let's have a look at origins. Origins probably has a lot of NA values. So here is also a version. So version is currently being loaded as a string, even now that I've said NA should be interpreted as a null value. Here are some null values over here.

So there's probably some value in here which cannot be interpreted as a float; something like "v1" appears more towards the bottom. So, to fix this, well, we can do several things, and Polars is quite helpful here, suggesting the options that we have. The easiest thing we can do is to set this infer_schema_length value, which basically says: look at some more data before you try to infer the right data type. I think that's what we're going to do here.

I often just set it to None for data like this. Is it None, or should it be minus 1? None also works. Yeah, it's a good question, though. I've never figured it out from the Polars documentation, but I've kind of assumed that None does a full scan.

Let's get the full read_csv. Uh, infer_schema_length. It can be None. Yeah, the docs aren't totally clear, because it says the full data may be scanned. So, slow. Yeah, but we're not dealing with that much data.

Oh, and infer_schema=False, I guess, means you're not doing any inference; then it's all text. No, that's not as good. None is what we want. Nice, because it's not that much data, right? It's about 2,500 lines.

Oh, okay. Nice, the data is up to date. Can we still create a bar chart? Yes, we can. And now it actually prints NaN over here, which is because, under the hood, there's some pandas being used by plotnine. This works. And now we also have a nice-looking null over here. Before, it said NA.

Okay. But now we're back. And I believe that we were trying to have a look at the. Uh, info. The license. The licenses.

Maybe we just pull some with license info. It seemed like there were a lot. Even just glancing at the ones with licenses, or just knowing how many have licenses, might be useful.

Yeah, I guess we can sort this. It's the classic dilemma: it shows the nulls first.

So, info.drop_nulls(). Boom. How does that work for a full data frame? Does it have to be a fully null row? No, it will drop a row as soon as any value is null. Which is perhaps a little bit too eager. With drop_nulls, we can also specify the columns here, I bet: license_name, or subset. Yeah, it was license_name.

Okay, info_not_null. So now we have another data frame: 886 rows. Let's explore those.

I guess what would be nice here is also to... and maybe it is possible, and maybe we're missing something. Oh, wow. Okay, this is advanced. For now, we have here our info data frame where all the rows, well, all the APIs, do have a license.

So there are also some variations over here. Right. API.

So I'm almost curious to scan the list of names, like the first name column, just to see who are some of the people that show up. So, lots of AWS. And for some reason, a person named Mike is responsible for all the AWS. That's a lot of responsibility. Load bearing.

Who is Mike? He works for Postman. Oh, wow. Whereas with Google, it's just Google. Google is responsible. I love AWS. So they're like, you got to get in touch with Mike. You know, wait, I got it. You got to pull up Mike's GitHub, just for 30 seconds.

I don't know how the contacts happen, but I love this. They're like, don't get in touch with Mike, reach out. You know, it seems like they have their ducks in a row when it comes to APIs? Because it seems pretty relaxed.

Designing the Great Tables output

All right. Back to business. Back to business. All right. So, so many ideas we have. We can still create a Great Tables table if we wanted to. You wanted to have a look at some licenses as well.

So maybe pull out, like, could we see which of these have a logo? I don't think that the logo of the license is in here, but maybe join, like inner join, logos with this. Logos is related to the API. This is not the logo of the license. These are the logos of the APIs themselves.

But I wonder, if we join on the name, inner join on name, if that will give us the things with logos and a license specified. Oh, so in the great table that you have in mind, you wanted to use the logo of the API. Yeah.

Maybe a logo and a license column, so we can kind of show a few APIs. Okay. Yeah, we can do that. But before we dive into that, let's first do a little bit of an outline. Okay. What is it that we want to accomplish? I think that creating a Great Tables table would be a wonderful outcome of this analysis.

So we're going to create a Great Tables table. Exclamation mark. Love it. We need a name. Do we want to use the name of the actual API or of the organization? Because it seems that the name column can sometimes be split on the colon.

So I would almost need to know, like, the relationship between logos and info, I think. Say, I'd have to see what their relationship is. I'm guessing name is probably shared across them, is my guess. Yeah. Here, if you look at this long list of AWS APIs, they all have the same logo.

So do we want to work on the level of the organization? That's probably what I'm asking. We could even... is there an organization column? No, but it's embedded in the name column. So we would have to split the name column in two.

Yeah. Right. Yeah, I'm into that. Okay. So that's our first step. So: to do, split name column on colon. Right. To do: join with... well, the info data frame has both the name and the URL, but you also want it to have the logo. So we need to split this name for both data frames.

Well, can you join on name? So we can just see, because if we can see the shared logo and license, then we can always split the name. Okay. But we might just want to check to see if we even have any juice to work with. Yeah, because if there aren't a lot of logos and licenses, we'll kind of have to switch gears, I think.

Yeah. So one thing we need to think about is: when we're creating a great table, there's only so much we can show. Yeah, which is very different from when you're creating a visualization. We could easily add in 2,500 points, but with a table, right, we need to come up with some sort of a top list.

Have a look at the number of APIs each organization has. So AWS has a lot. Azure. GitHub. Google, obviously. And those also tend to have licenses, you know, they tend to have that figured out, and also logos.

And then this is going to be your job: to create the actual Great Tables table. Okay. Does this sound like a good plan? Yeah, it seems great. All right.

Joining and exploring the data

We cannot really assume that they're in the same order. Otherwise we could do a concatenation, but I think it's safe to do a join on the name. Let's do info.join. We're going to join with logos. You see, I thought that we weren't going to use logos, and now it turns out we are going to use logos. That's life, you know, that is life. I'm going to join on name. And I guess it doesn't really matter how; the how argument is, by default, "inner", an inner join, which sounds good to me.

So this should work. What are we going to call this? For old time's sake... Maybe we want to narrow down the columns that we're actually going to use, just to make things easier.

Yeah, so, like, this is 2,529 rows, and info was 2,529. So we know that, and logos as well, so we got kind of the whole thing. But that's surprising. Like, I'd imagine some of the logo rows are... So maybe we still need to filter out. I bet the logos table has some that don't have logos.

If we could filter those out, that'll get us to, like... The URL column has no missing values. And I bet you there are 519 organizations in here. That's what I'm thinking: there are 519 unique URLs here. Yeah. Okay. That's super helpful for me, because in my head, I was thinking some of these wouldn't have logos.

Your point is so helpful to hear: actually, a lot of these have logos, but it's a smaller set of organizations, so we can cut it down. Yeah, thanks. That's so helpful for kind of switching my... like, I had a misconception about the data. Yeah, no, it's good to do these sanity checks.

Selecting columns and preparing the table

Yeah, let's first, you know, trim down some of the columns here, because there's a bit much to look at. So we obviously want name. URL refers to the actual logo. Ooh, some different formats. So we want to do a filter on that. I don't know whether Great Tables handles those. We could, I mean, we could try it, like, on the first 10 rows and just see.

license_name. Right, because you wanted to show the license. Yeah. Background color might be nice too, for styling purposes. Yeah, I'm into it. Okay. So we're going to select name, license_name, the URL for the logo, background color. Anything else? That's great.

Boom. Now this gets updated. I like that. Oh, that's wild. That's a cool feature. It just fits. Okay. So not all of them have a background color. We can set that to white if we wanted to. But now, I guess, one of the things that we need to do is split that name column. And yeah, what I also noticed with license_name is that sometimes it has the word "license".

And some of them have nothing. Obviously we need to filter for that. But so here: "Apache 2.0 license", and others are missing that last word. So maybe we need to strip the word "license". Maybe, but maybe it doesn't even come up if we look for, say, the 10 organizations with the most APIs.

They might also have, I guess, a trick, too, is they might have different licenses for different APIs. That is theoretically possible, although in practice I would think this is unlikely. Yeah, but there is a way. Yeah, we could also hand-pick a few, just to kind of test the process and try to get to the kind of end result.

Splitting the name column and grouping by org

Shall I make an attempt at splitting that name column? pl.col("name"), right, in order to start an expression. It's a string method, so let's do split on the colon. Nice. What does this do? It gives us a list. And I guess we want the first element of this list: list.first(). Oh, nice. I didn't realize that was a thing. All right. That was too easy. So this is the name. Well, I guess we want to call this org. Yeah, cool.

Now I guess we want to do some grouping. Yeah, some aggregation. Sounds right. Yes. Group by org. We want to aggregate this. If we're going to be diligent about this, then we're going to check for duplicates.

Oh, like all rows, do they all have the same value? Yeah. Right, because... Oh, and I guess what we also want to do is get a count. Count equals pl dot... I do think there's kind of a duplicated, an is_duplicated, that should... Because that's like, since it's grouped by org.

Well, I guess what we could do is get a unique, and if we don't still end up with... So name obviously is not unique, right, because these are all the different APIs within a single organization. Yeah. Okay. Just to double-check: right now with this group by, you're trying to see, are there any, like, row duplicates, is that right? Like rows that are duplicated.

Well, there are definitely duplicated rows when an organization has multiple APIs. Yeah. And that's right. So that's what we're hoping is that... They could have, like, a different URL, I guess. A different URL, but... What you also mentioned a moment ago is that they can potentially have different licenses. So Google may have an API with one license and then another API with another license. Yeah, potentially.

Do you mind saying what you're trying to do with the group by? Just recapping. Well, it depends on what we want to put in our great table. What do we want it to be? Can you just help me understand the calculation? Like the group by, what's it trying to do? Okay. So let me first get rid of this line so that I can show you what we have here.

This is our joined data frame with, you know, fewer columns. But what we did is we've created a new column called org, which contains only the part before the colon. Yeah, we've called that org, the organization. Yeah. So there are a couple of organizations that have more than one API. Okay. Amazon AWS.

Am I right in understanding what you're trying to do with it, is you're trying to say, for each org, you're trying to look at each unique kind of entry for each column, is that right? So you want to see, like... Yeah. Say for 1password.com, you're trying to list out each URL that's unique. So let's verify, let's have a look. It always helps.

Because this is still... it's not a lot of data, but it still doesn't fit on your screen. So let's again do that thing where we're just going to have a look at just a couple of instances. Org contains... contains AWS. Yeah, that's the string... that's a string method, yes. So these are all AWS APIs, 272 of them. Right.

Well, let's then be a little bit more specific: amazonaws.com. Yes. Let me save this to a separate variable so that we can inspect it here easily. Yeah, so these are all AWS APIs, and they all have the same org. Now we could check how many different licenses they have, and they all have the same license, in this particular case. Yeah. It is still theoretically possible that some other organization uses multiple licenses.

But this group by, just to go back... Yeah, so this is kind of where you're trying to go, and the group by is just listing out the unique entries, so you can kind of see, like, oh, there's only one thing listed, so it must be the same for all. Exactly, exactly. Yeah, nice. And your pl.all() is kind of... you're doing it for each column, so you can check, if you wanted to check something else, like background color. Yeah.

Well, now I'm... Yeah, and I'm doing unique. So I'm calling... So these are all the columns except for the column on which we are grouping by, which is org. For all the other columns, we're going to take all their values, keep the unique ones, so that we don't have duplicated values in there, and then they're going to end up in a list. Yeah. And we're adding another column called count. Yep. Which is the number of rows in this group, right? The number of APIs that this organization has. Yeah.

So if you fired this off, I'd be curious to take a look at this data frame, even with the filter on. Yeah, even with the filter. All right. No, that's always useful. So this should be only one row. Nice. And just to recap: pl.all().unique() is column-wise. It does, for each column, get the unique values for that column. So it's not doing unique across the rows; it just lets you glance at each column. Exactly. This is done on a per-column basis, which is not always what you want, but in this case, yes. Yeah, that's a cool move.

Planning the Great Tables output

So, yeah, again, just to make sure that we're on the same page here, right, thinking about the great table that we want to produce: I was thinking we would have the top 10 or so organizations. We could list out how many APIs they have, what license or licenses they use, their logo. Right. I mean, it's not rocket science, but we're going to make it look pretty. Right. So, yeah. No, we got this. I feel like this is really shaping up. We're nearly there.

So we have here this aggregated... so let's call this orgs, with a G. We don't need this filter anymore. I'm going to move this over here with a comment for now. So what we have here is called orgs. Let's have a quick look. These are all lists. And if we're going to be... So there are some that don't have a license name. I guess we would need to filter on those first. Shall we do that?

There we just want only the rows that have data. So that just drops, like, all null rows, right? Like, no, whenever there is... Well, we can check: any column with a null, because license name here is null still, is that right? Or is it? No, now we only have rows that have no missing values at all. Oh, cool. And we're down to 43. So... Yeah, we're cooking.

What's this? Got a list, expected a string, for series named background color. Why is this... string, list... transparent. I think we might just want to drop, like, where license is null or something, because this will also drop rows if the background color is null. Is that right? Right, and that's not what we want. This is a little bit too aggressive. Yeah. It's good to know about dropping those. I actually had no idea, though.

I wanted to check how many values... Let me see. Oh, there are some. Yeah, these are the bigger ones. Who's the winner? Who's got 281? Google. Yeah. Okay, that makes sense. Right. And just by glancing at this... Whoa, no, wait. So there are different... So this one, the EPA, and also the governments of British Columbia and Canada, they use different licenses. Yeah.

Yeah. It's a fun... Honestly, I'd never do this move: the unique, like, grab all columns, unique. It's kind of a cool move. It produces a lot of ragged lists, basically, but it lets you kind of swoop across, which is... Yeah. I don't know, I'm just making this up on the fly. Yeah, I feel like I'll definitely use it again.

Building the Great Tables output

So I guess for the actual... do we... I guess we're just interested in the logo, right? No, and also in the license name, do you want to... Yeah, let's try it. Maybe we just try the top 5 and we try just the name and the, like, org and the logo, just to see if Great Tables can do it.

Yeah, then, you know what, let's... Now, I don't understand why this doesn't work. I'm taking the unique one. Yeah. Wait, I don't know. Okay, that's interesting. Could it be because org itself is a string? Or does... doesn't... I guess, because it's grouped by org, it'll ignore it. Yeah, but this returns a list, so I would expect... You see, there are still many things I don't understand about... I would expect that because this returns a list, I would need to use the list namespace.

The really intriguing thing is that it says: I got a list, I got a string, for series with empty name. Okay, so I'm kind of cheating here, but maybe in the interest of time that's okay. Yeah. What I'm doing is I'm just... if there are multiple values, I'm just grabbing the first one. Yeah, you would say that for this first iteration, that's okay.

Yeah, and then, I mean, I think, try like the first 5 orgs and a logo, and then just kind of iterate. Yeah. So now you want to get the top ones. So let's sort on count. Nice. Descending equals true. Cool.

So we got our license name. This is it. Yeah, we shouldn't be... Oh, background color is all... is not saying a lot here, so we might as well make that white for all logos. But I guess now, what's this? Yeah, we might need to do a couple, but let's just... maybe we just try to fire up the great table and see what comes out. Yeah, because we should see the problems when we try to create the table. So we should be able to just... Yeah. Okay.

So now it's your turn. We have the orgs data frame. How do we turn this into a table? Yeah. So if we do orgs dot style, we might need to install Great Tables, but let's just run it and see. I installed all the optional dependencies of Polars as well. Yeah. Oh, nice. Okay. So this is a great table. Yes, which is a formatted... And then if we do dot fmt underscore image, and then the name of the column we want to try, so URL.

What's going to... just running this. Boom. I can't even believe it. I can't believe I didn't have the faith in Great Tables. This is more... I think this is like a feather in the cap of Rich. Oh, yes. Great Tables, who every day astounds me that this just ended up working. So it just works. No credit to me.
You definitely deserve credit here. But the data, on the other hand, is missing some... There are two URLs that are not working here. That's... So we could filter those out. We could say... Yeah. Yeah, I do think maybe, in the interest of time, we just punt on them. Yeah.

So, I would say, before we do the sort over here, we're going to filter: pl dot col, URL... does it... contains... any... Oh, nice. Wait: "Filter rows, retaining those that match the given predicate expressions." Does that mean this can just take a list? Oh, I should turn it into a list, I guess. I'm not sure. Yeah, I don't know. We'll just figure it out.

Did it? Oh, yeah. Oh, maybe. Oh, and... Wait, so I should do this before I do the group by. Although then... I would just put it even at the... You can even do it when you're generating the great table, which is kind of to say, like, we need to maybe go back and clean up, but we're kind of just dropping it for now. Yeah, but I need to see if this works, because I'm doing this before the grouping, because group by only takes... Or I would need to apply this filter on the list level inside the aggregation here. That's the neatest.

Yeah, this seems fine. This looks nice. But I guess we could apply a little bit more styling to this great table, let it shine. Um, so some nice things we could do... Do we want to take name out of the table? Yeah. Nice. Tell me. So I think we'll need to do it before style, so it's kind of like we'll put all the Polars stuff before. Let's take... or we could drop cols. Yeah, you could do... cols_hide. Sorry. Yep. You're schooling me on Great Tables. Hey, man, I'm just making this up.

What I like is that there's this separation of the actual data and how it's being represented. What you sometimes see people do is that they change the actual data in order to make it look right.

Um, add the percentage sign to a number, but then all of a sudden it's a string and you cannot do any other calculations with it anymore. So, yeah, it's a good point too, because I forgot that with Great Tables, one of the big advantages is you can use a column for, like, calculation, or, your point, like, formatting or styling. Like, you could set a background color and then you could hide the background color column. Yes, so that it's used to style, but then the data... it's used to style, like, another column, and then you hide it, basically.

Refining the Great Tables output

Oh, wait, this should then... I think it's a list. Okay, I'm going to file an issue for that. I mean, that's legit. Yeah. All right.

I guess, do we want to have the logo in front of the organization, but then not have the header name URL? How would you do that? So I would move the... cols_move, URL. I think that will actually just move it to the front, because it's default 0, or after 0, maybe... missing "after", and "after" should be... I think there's a cols_move_to_start, maybe, the method. You're right, it says so in the help. Sorry, cols_move_to_start. Yeah, nice. Now we're cooking.

And then we could try applying the background color, I don't know, to the... You could do, like, tab_style if you want to try applying the background color. Um, well, the background color was either white or missing. Yeah, so... Okay, I see. So less critical.

Could we add a bar chart? Could we do a bar chart instead of the... Yeah. Fmt_nanoplot, I think. You know all the methods. I've got to close the door; we have a cat that blasted the door open. I'll be so fast. That's life.

Here I am. Um, yeah, I think fmt_nanoplot should... I have to admit I'm not super familiar with... Yes, you are. It's incredible. And I think you can make it a bar if you want.

I remember there's also the possibility to still show the numbers, but then have a gradient background, like, the higher the number, the darker the background color. Data color... data underscore color. There you go. This is so helpful for me, because I actually... well, having coded the thing... Is it fmt...? What is it? I think it's data underscore color. It's got a special name, data_color. On count. And I think it'll try to do it automatically for you.

So, I'm not so happy with this... We might have to pull up the data_color docs. I'm a little rusty on my... I just feel alive. All right, let's see. Next time you're driving. Oh, it's got all these palettes. Wow. So, what's that? That's a sequential palette. I'm in the data_color docs.

It just works great, but... Try greens. What happens if you put greens in? I'm guessing everything... Greens, greens. Yeah, I think so. Doesn't exist. Oh, maybe uppercase Greens, like, the G is... sorry, the G is silent. Yeah, that's right. The 'Reens. Oh, yeah, that's nice. That's subtle. I like that, to have a single color.

Wow. What if we get rid of those? Actually, I'm... What do you think? Yeah, I'm not too... It's kind of nice. You can see the extremes. They are so extreme. So it has to be less than 30. Oh, what does it mean? It should be... sorry. Oh, wrong method. That's nicer. Oh, yeah, it's great.

Wait. Okay, so one final thing that I think we can improve on this table is the column names. Right now they're the actual column names in the data frame. I think we can make it a little bit nicer. So we'll do dot cols_label. Let's do that up here, after... cols_label. Is this a dictionary? Or it can just be keyword arguments, I think. Oh, all right. Yeah, I guess then... so it becomes: org is organization, for example.

Okay, nice. Yeah. And then license name is license, and count is, I don't know, number of... Yeah, can we hide URL? Can we say URL equals none? Maybe a blank, empty string. That works. Nice. Or can you set the URL column to this... what do you call it... index? The stub, that's it. If you do tab_stub, I think.

Oh, I'm honestly so scared every time you do something. I just... I worry Great Tables will explode. So this is also... It is, yeah. But it's okay, this is a safe space. And so far, both you and Great Tables have demonstrated that you're up for it. Up to it, I should say. You're up for it and you're up to it.

Wrapping up

Hey, man. I feel that we have, also looking at the time... I mean, we could go on forever, you and I, but I'm pretty happy. Yeah, thanks for really pulling us through. I feel like... it's crazy, we got the table out in the last, I feel like, 10, 20 minutes. We really wanted it out, you know, everything up to that... This is great.

Yeah, let me just... you know what we'll do: we'll put this online. We'll put the link to this code in the description. And, um, I don't know, I really enjoyed this. Yes, maybe we'll do this more often. We'll see. Now that we've got a great table cooking, I'm like, I'll do anything, you know, so...

We did this. Yeah, we'll clean this up a little bit, not too much. And there are still a lot of things to do. I haven't even looked at what others have done with this data. I didn't want to be... What other great tables they might have made. Oh, yeah. Or visualizations, or, I don't know, yeah, some machine learning classifiers.

There's a lot you can do with all these data sets. And I do feel like we went through a lot of, like, classic data exploration topics, even, like, the little bumps you hit, like the null values. Yep. The schema. These are such classic things, so I think it was helpful to kind of... Yeah, and just looking up documentation. Yeah, that's all part of it. And, uh...

We were also surprised, I was, a couple of times, by what Positron offers. I mean, we were looking at tables in the table viewer nonstop. I feel like the table viewer was actually, like, load-bearing. So that was... to see everything kind of there. But, um, yeah. Hey, this was a lot of fun. I hope others will find this helpful as well. But in any case, it was good chatting with you, good doing data analysis with you, and I'll see you next time. Yeah, thanks. This is a real treat, so thanks for having me. Next... next Friday... Tuesday. All right. Yeah. Take care. Bye. See you. See you.