Resources

Exploring Web APIs | PydyTuesday Uncut #1

Join Michael Chow (open source developer at Posit) and Jeroen Janssens (developer relations engineer at Posit) as they dive into this week's #PydyTuesday dataset about Web APIs. Tools include uv, Positron, Polars, Plotnine, Great Tables, and the Unix command line. True to the "PydyTuesday Uncut" title, this video is completely unedited. Every typo, mistake, web search, and "aha!" moment is left in so you can see exactly how others approach a new dataset from scratch.

Things mentioned during the session and related resources:

- Code produced during the session: https://github.com/jeroenjanssens/pydytuesday-uncut/blob/main/2025-06-17/01-start.py
- PydyTuesday: https://github.com/posit-dev/pydytuesday
- TidyTuesday: https://github.com/rfordatascience/tidytuesday
- Getting Data from the TidyTuesday Repo with Python: https://www.youtube.com/watch?v=ol2FrSL5gVU
- Positron IDE: https://positron.posit.co
- Data Science at the Command Line: https://jeroenjanssens.com/dsatcl/
- Python Polars: The Definitive Guide: https://polarsguide.com
- Polars: https://pola.rs
- Plotnine: https://plotnine.org
- Great Tables: https://posit-dev.github.io/great-tables/
- The Big Year: https://www.imdb.com/title/tt1053810/

Chapters:

00:00 Introduction
02:46 Getting the data with uv
13:18 Positron IDE
17:42 Importing Polars
23:17 Plotting a bar chart with Plotnine
33:55 Inspecting duplicates
46:30 Handling missing values
58:56 Crafting a great table
1:38:48 Reflection

Jun 23, 2025
1h 41min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey, Michael. Jeroen, always a delight. How's it going? So good.

I think it's relevant to say we're like six hours apart, so I'm having a beautiful morning and you're... I'm about to have a beautiful weekend. It's Friday afternoon here.

And we're about to analyze some data, is that... That's the goal. That's the plan.

I just want to introduce you very quickly so that people know who you are for the very few who don't yet know who you are. You're Michael Chow. You're an open source developer at Posit. Known, for example, for the Great Tables package, among other things.

To also send it back, you're Jeroen. Author of Data Science at the Command Line, which I find to be one of the most intriguing data science books about how you can use the command line to slice and dice data. And more recently, Python Polars. I always call it just the Python Polars book. The official title is Python Polars: The Definitive Guide.

Yeah, I'm such a big fan of your Polars work. I know we've done like a plotting workshop together with plotnine. So I'm excited to be able to do a little bit like a low-key data analysis with you.

Yeah, I'm excited too. I'm also a little bit terrified because people will find out that it's one thing to write a book about Polars and another thing to be actually working with it and to put it to practice.

Introducing the dataset

Yeah, can you tell us a little bit about what kind of data we're using? Yeah, so I think for background, we're getting into the PydyTuesday data. This is a dataset on API specs.

And maybe for some background, PydyTuesday is a thing where every week they upload data from a different dataset for people to analyze. And it's, I guess, an opportunity for people in the Python community to sort of see what it looks like to analyze real data and to share out basically what they did. So you can see what people did with real data and see how people are analyzing data, or try it yourself and share out results.

It's been going on for a while under the name Tidy Tuesday for the R community, right? And tidy, I guess, well, on the one hand, because well, we'll see how tidy the data actually is or whether we actually need to tidy it up. But what I think is that it comes from the tidyverse. And I guess it sounds like kind of cool when you combine it with Tuesday.

I think this will be great, though, because I think what I love about it is like I feel like the Python Polars book is such a great kind of like lesson on Polars. But I feel like now it's like it's as if we read like a book on birds, you know, like how to identify birds. And now we're going on like a bird walk with you. The correct term for that is birding. We're birding.

So, no, I'm currently on the GitHub repo for TidyTuesday. And it has some instructions on how to get the data. And you already mentioned, we know this week it's a dataset about APIs, right? Very meta. But we haven't looked at the data yet whatsoever. We're both going into this, yeah, fresh, right?

Getting the data with uv

So, I'll do uv pip install 'polars[all]'. Want to do plotnine as well? Might as well start with that.

So, it's just an empty directory right here. I didn't clone this repo or anything. We can just start using the tool from the command line. Now, for this, we can use uv. There are other ways to do this, to invoke this tool. But this way, you don't even have to install it. So, the tool is called pydytuesday. It has a subcommand, tt-download, and it takes one argument, which is the date. And I know that the last Tuesday was June 17th, right?

All I have is, I have installed uv, and that's it. It installed some packages. uv is blazingly fast, just like yours.

Okay, we've got a bunch of CSV files, a PNG file, and a YAML file. Now, let's put those in a subdirectory, right? That's perhaps nicer. So, I'll create a subdirectory called data, and then I'll move everything into the data directory.

Exploring the data dictionary

And now, the question is, how do we proceed? We might open, like, an IDE, and then maybe peek at the PydyTuesday README on GitHub, which usually has some nice descriptions of the data, just to get a feel for kind of what's there.

So, this week we're exploring Web APIs. The lead volunteer for PydyTuesday, Jon Harmon, is writing a book about working with Web APIs with R, as well as a series of R packages to make it easier to create API-wrapping R packages, right?

So, here are some questions that we may ask the data, right? What API specs are provided? How many different APIs do providers provide? What licenses? Are there any APIs listed more than once in a data set?

So, those are some good questions to start off with, and I'm sure that we'll be able to come up with others.

Yeah, I do feel like something like the first question, just like what API specs are provided, is kind of a nice starting point. It looks like there's tables describing each CSV, so it could be nice just to try to answer that question to help us kind of route through the data a little bit.

Five CSV files, and I believe what they all have in common is the name apisguru, right? Because this comes from apis.guru. It's kind of a meta API.

Maybe we could try cracking it open and counting the format in API origins? I think that's a good start, just to get a little bit warmed up.

Opening Positron and loading data

It just opened up Positron, right? A new IDE created by Posit, and it supports both R and Python. So, we're obviously going to use Python for this.

I cheated a little bit before we started recording, and I just wanted to make sure that everything was working. It's a hidden directory, .venv, where your virtual environment is defined.

So let's do, let's just start. Let's start by importing Polars. You see, I just pressed Cmd+Enter on my Mac, and then it sends that line down here.

Now, you may see these squigglies every now and then and suggestions, and that's just the IDE trying to be helpful.

Let's read api_origins and try counting the format column. It's in data/api... tab... origins. Let's save this to a variable right there. Okay. No errors. That's good. That's already half the battle.

So this data is already quite clean, I can imagine. Now, okay, so here in Positron, here on the right, we see that the origins data frame is now, we can immediately see this being listed here. And if we click on this icon, we can even immediately explore this data frame.

Okay, so four columns. None of them have missing values. Yeah, and it looks like it kind of counted the format column for us with that little bar chart. Oh, wow. We don't have to do anything.

Discussing AI assistants

I mean, of course, the next step would be to use an assistant, right, an AI assistant, but let's stay away from those for now, maybe in a later session.

I do find them really helpful for, like, where we looked up what the columns in api_origins are. I do feel that's a step assistants are really good at, like the scanning, the data descriptions. But I definitely think it's, like, a future thing for sure. It's kind of nice to just get in there manually.

And I have to say, I've been playing around a little bit with these AI assistants when it comes to doing data analysis, right? And sometimes I'm just so really impressed with what they can do and how they can correct themselves and come up with interesting solutions. And because they produce code, right, you can always see what they're doing, what they're – you can always adjust that. It's not like – so that part is not hidden from you. So I like that. But let's do it manually. You're my AI assistant at this moment.

Plotting the format column with plotnine

Yeah, I think – okay, so what I'd note is, like, this bar chart's really great. Like, we see right out of the gate that Swagger, there's, like, a thousand counts. We can also do some checks to figure out what that means, like a thousand. But the one thing I'd note is, like, this bar chart's nice for, like, a quick glance. I would call this, like – this bar chart's, like, in the spirit of, like, a spark line. So, like, notice that you have to basically mouse over to get the labels. It's basically, like, this bar chart is a compromise between a graph and, like, a table entry. So it's, like, really stripped down. You don't see the labels, but you can kind of mouse over. Like, if anything interests you, you can kind of mouse over to get more details.

I would say one thing that could be nice is using plotnine to do the count just to see the chart, like, kind of a full chart with the labels.

What I always do when I work with plotnine is I do from plotnine import *, right? And I know that's being frowned upon in the Python community, not so in the R community. The reason I do that is because then you can use all those functions immediately. The only thing is that when you do that, it will clutter up your global namespace. So you will see all those functions that plotnine provides.

So this is, of course, complaining here because, like I said, this is frowned upon. But we got ourselves a data frame, right? We want to recreate this bar chart. So ggplot(origins), you know, first argument is the data frame itself. Then we'll do an aesthetics mapping, and that's the format column, right? geom_bar.

Yeah, this is nice. So now things are just labeled a little bit, so this plot can breathe a little more than in the table. Yeah, we see Swagger with, like, 1,000 and OpenAPI with 800. I do feel like the thing I'd be curious about is: what is this a count of?

So, like, it seems like maybe these are the API endpoints or – Or, yeah, we could have a look at one.

No, I think this really goes to the origin, description of the actual API. What are the parameters that you can give? Yeah, so we should look at – we should try to find one that we like. Is the – for example, the Star Wars API, is that in here?

I saw Transavia, a Dutch flight operator. XKCD has an API.

So this is a YAML file, and I guess the format is OpenAPI. And I've never looked at any of this. I mean, this is brand new to me. Yeah, no, I mean, I guess this API is probably not that big, right? Because this is all about the XKCD comics. But we have to give it a comic ID, and then it returns things, I guess. Yeah, interesting. So it could be – yeah, I guess the big question here is kind of, like,

what might someone be interested in about API data? Is there something we could learn about APIs in general from this data? Or – that is a good point.

Like, we could also pull out – so in this case, we're pulling out, like, some APIs to get kind of, like, a closer glimpse at – What this data is made up of. I'm just trying out things. I mean, it's good to do a little bit of a deep dive in order to get some sense of the data. I mean, APIs are not, you know, too complicated, but still.

We can't do this for every API, but might as well. So this is one endpoint right here. And this is what it returns, some JSON, which, again, links to a PNG file.

Okay, so coming back to this, this is a specification of one API, the XKCD one. And this particular YAML format appears to be the OpenAPI format. And that's what it says here as well, OpenAPI. OpenAPI seems to be quite popular. It's the second one after Swagger. We could also have a quick look at what a Swagger specification looks like.

Let's do Zoom. Yeah, this is nice that you can kind of, like, pull up APIs and, like, peek at some of this different stuff.

Yeah, I think, I mean, if the text is too small, just let me know, right? Yeah, it appears that Zoom has a couple of API endpoints. We won't be able to do anything. This most definitely needs some authentication, but still. So this is not YAML. This, again, is JSON.

I guess there is some way to dive even deeper into these specific endpoints, right? See, this is a somewhat more involved API endpoint as part of the Zoom API.

Exploring the data structure

I wonder, I'm curious, like, what's the relationship between API info and API origins? Like, what all we have in this data?

It feels like maybe, well, I think here's the number of rows almost, just to start. So they have pretty similar numbers of rows, which makes me wonder if it's like there could be duplicates in here, in origins.

That was also one of the questions we saw in the README. It might be that a single API has specifications in multiple formats.

Yeah, let's take a crack at that. So that would be, that's origins. I'll group by name. Concatenate the formats into a list and then check which lists have more than one item.

Group by name. Aggregate the format. Let's check here. So here at the bottom, we can see the first five and the last five rows. Now I'll filter this: filter pl.col("format").list.len() greater than one.

There are, indeed. And so there are seven rows more in origins than in info, and this is the reason: there are seven APIs that have multiple formats.

Is there, I can't remember, does Polars have some kind of method to mark duplicates? Like, it has a duplicated method? On the expression, or the... you know what, I can't quite remember. Let's look it up. Oh, it's a method of the data frame, of course: is_duplicated.

So that works too. Let's try that. So origins.is_duplicated(). Oh, but this gives you back a single series. You might need to do it on just the two columns, basically; we want to know if name and format are duplicated.

I guess just name. So can we do this? So let's add a new column: pl.col("name").is_duplicated(). Yeah, gotcha.

Oh, wait. So now, name, of course, this should be is_duplicated. So now we have a new column that indicates whether it's duplicated. Now we could say, okay, I only want the duplicated ones. And this gives us back 14 rows, right, because now we haven't concatenated the format.

Yeah, and this is kind of nice because then you can kind of see some of the other columns, like version or URL. So like bbc.com, they have this NA format and OpenAPI, and it looks like they go to kind of different URLs. It could just be, I don't know if this is like a quirk, like it's just kind of misentered, because NA goes to what seems like a BBC... Yeah, maybe they have their own format or maybe it's just some missing data.

So those were two different ways of figuring out which rows are duplicated. So we don't really want to use this data frame method. It's much better to use the method on the column, right, on the series.

And it seems like my guess is because this table, this duplicate table is 14 rows and there's a seven row difference between info and origins. That is like if there are two of each kind in this duplicate table, maybe it's like there are seven sites that were duplicated. Yeah, that's absolutely the case. Right here you can see everything. All of them have two formats. Which is a coincidence, I guess.

Maybe it's like they migrated formats or something like that. I don't know. An API wizard might know, but.

Loading the remaining CSV files

Hey, so, okay, so what's next, right? We've had a brief look at two of the five CSV files. Why don't I load in the other three and you can have a quick look.

So, we have here, I don't think that logos is all that interesting. Categories is interesting. Licenses? Seems pretty juicy. Licenses. And which file is that? I think it's in api_info. Okay. Yeah, we have that.

So, the logos, I guess we can also just double-click on this file right here and then inspect it. We don't have to load it in as a Polars data frame at first. Now I'm actually curious: what does the XKCD logo look like? Let's take a peek. Static, terrible, small logo.

I mean, I guess one very data way could be filtering XKCD out, printing it in the console, and copying it from there. Maybe that's nice.

So, which is part of logos, which I don't have yet. Logos, api_logos. So: logos.filter(pl.col("name").str.contains("XKCD")), right? So, the contains method is part of the str namespace, as they say.

And I will say, I love contains. I actually feel like a lot of times I see early data scientists doing a lot of equals equals, like, is exactly this name. But I do feel like in a data analysis, contains is nice. You just shoot it and you see if what you were looking for comes back. I mean, I wasn't expecting any other API to have the characters XKCD in them. Or at least for exploring, it's nice to just see if it gives you what you were looking for.

Whoa, whoa, whoa. What's going on here? Do you want Positron to open up the external website? Let's do it. Yeah, nice. That's a terribly small logo. Okay, great. Now we know that the logos data frame is there.

Planning a Great Tables visualization

I feel like there's a real sweet Great Tables opportunity here. Oh, yes. Oh, do you want to explain what Great Tables is? So Great Tables is a tool to... well, if you do Polars .style, it lets you style your data frame a little bit as a table for presentation. But what could be neat is, if we look at the licenses a bit, Great Tables can display logos in the table.

That could be fun to maybe pull out a few examples of APIs with different licenses. And then show there like logos or even just a few APIs we like. Maybe we can just try.

So let's first have a look at the data before we jump to Great Tables straight off the bat. We'll have to build it up, build up the suspense a little bit. Where did you say license was? I think it's in api_info. license_name and license_url. Let's have a look here at this. Nice, license. There are no missing values reported, but I think the problem is it's all the string "NA".

We could. I think Polars has the null_values argument, if you want to. So, Apache 2.0 license. Oh, there are some other ones. Nice, there's some Creative Commons. There is the Auckland Museum license. But there are also a lot of missing ones.

So then, I guess, actually, you know what, I want to use the command line for this real quick. Do I have xsv installed? No, I don't. Let's just go here. api_info. Okay, this looks horrible. This is not a good way. What are you trying to do? Well, I just want to verify, like, am I now looking at a value that's actually an "NA"?

Well, yeah, it should be, because it's not being reported as missing. We have to pass null_values to Polars. I think in read_csv. Yes.

So Positron is pretty helpful here again. Because otherwise, if you didn't have this, it's like, okay, you would have to know that Polars uses the word null, N-U-L-L. And even then, it could be that an actual string would be called "null" and that it's still not interpreted as a missing value. But I guess you're right. We should have, is it null_values? Yes.

And I guess we need the same for the others here as well. And this is an interesting Polars thing, too, with null_values. Without null_values, Polars tends to read a lot more things as strings, because imagine you have a CSV column of numbers like 1, 2, 3 and NA. Polars might interpret that as a string column, because 1, 2, 3 and NA looks like a column of strings, basically.

And I think that's actually what's happening right now. I'm doing origins here. Let's have a look at origins. Origins probably has a lot of NA values. So here is also a version. So version is currently being loaded as a string, even now that I've said NA should be interpreted as a null value. Here are some null values over here.

So there's probably some value in here which cannot be interpreted as a float; something like "v1" appears more towards the bottom. So, to fix this, well, we can do several things, and Polars is quite helpful here, suggesting the options that we have. The easiest thing we can do is to set this infer_schema_length value, which basically says: look at some more data before you try to infer the right data type. I think that's what we're going to do here.

I often just set it to None for data like this. Is it None, or should it be minus 1? None also works. Yeah, it's a good question, though. I've never figured it out from the Polars documentation, but I've kind of assumed that None does a full scan.

Let's get the full read_csv. Uh, infer_schema_length. It can be None. Yeah, the docs aren't totally clear, because it says the full data may be scanned. So, slow. Yeah, but we're not dealing with that much data.

Oh, and infer_schema=False, I guess, means you're not doing any inference; then it's all text. No, that's not as good. None is what we want. Nice, because it's not that much data, right? It's about 2,500 lines.

Oh, okay. Nice, the data is up to date. Can we still create a bar chart? Yes, we can. And now it actually prints NaN over here, which is because, under the hood, there's some pandas being used by plotnine. This works. And now we also have a nice-looking null over here. Before, it said NA.

Okay. But now we're back. And I believe that we were trying to have a look at the. Uh, info. The license. The licenses.

Maybe we just pull some with license info. It seemed like there were a lot. Even just glancing at the ones with licenses, or just knowing how many have licenses, might be useful.

Yeah, I guess we can sort this. It's the classic dilemma: it shows the nulls first.

So, info.drop_nulls(). Boom. How does that work for a full data frame? Does it have to be a fully null row? No, it will drop a row as soon as any value is null. Which is perhaps a little bit too eager. With drop_nulls, we can also specify the columns here, I bet: license_name, or subset. Yeah, it was license_name.

Okay, info_not_null. So now we have another data frame: 886 rows. Let's explore those.

I guess what would be nice here is also to... and maybe it is possible, and maybe we're missing something. Oh, wow. Okay, this is advanced. For now, we have here our info data frame where all the rows, well, all the APIs, do have a license.

So there are also some variations over here. Right. API.

So I'm almost curious to scan the list of names, like the first name column, just to see who are some of the people that show up. So, lots of AWS. And for some reason, a person named Mike is responsible for all the AWS. That's a lot of responsibility. Load bearing.

Who is Mike? He works for Postman. Oh, wow. Whereas with Google, it's just Google. Google is responsible. I love AWS. So they're like, you got to get in touch with Mike. You know, wait, I got it. You got to pull up Mike's GitHub, just for 30 seconds.

I don't know how the contacts happen, but I love this. They're like, don't get in touch with Mike, reach out. You know, it seems like they have their ducks in a row when it comes to APIs? Because it seems pretty relaxed.

Designing the Great Tables output

All right. Back to business. Back to business. All right. So, so many ideas we have. We can still create a Great Tables table if we wanted to. You wanted to have a look at some licenses as well.

So maybe pull out, like, could we see which of these have a logo? I don't think that the logo of the license is in here, but maybe join, like inner join, logos with this. Logos is related to the API. This is not the logo of the license. These are the logos of the APIs themselves.

But I wonder, if we join on the name, inner join on name, if that will give us the things with logos and a license specified. Oh, so in the great table that you have in mind, you wanted to use the logo of the API. Yeah.

Maybe a logo and a license column, so we can kind of show a few APIs. Okay. Yeah, we can do that. But before we dive into that, let's first do a little bit of an outline. Okay. What is it that we want to accomplish? I think that creating a Great Tables table would be a wonderful outcome of this analysis.

So we're going to create a Great Tables table. Exclamation mark. Love it. We need a name. Do we want to use the name of the actual API or of the organization? Because it seems that the name column can sometimes be split on the colon.

So I would almost need to know, like, the relationship between logos and info, I think. Say, I'd have to see what their relationship is. I'm guessing name is probably shared across them, is my guess. Yeah. Here, if you look at this long list of AWS APIs, they all have the same logo.

So do we want to work on the level of the organization? That's probably what I'm asking. We could even... is there an organization column? No, but it's embedded in the name column. So we would have to split the name column in two.

Yeah. Right. Yeah, I'm into that. Okay. So that's our first step. So: to do, split name column on colon. Right. To do: join with... well, the info data frame has both the name and the URL, but you also want it to have the logo. So we need to split this name for both data frames.

Well, can you join on name? So we can just see, because if we can see the shared logo and license, then we can always split the name. Okay. But we might just want to check to see if we even have any juice to work with. Yeah, because if there aren't a lot of logos and licenses, we'll kind of have to switch gears, I think.

Yeah. So one thing we need to think about is: when we're creating a great table, there's only so much we can show. Yeah, which is very different from when you're creating a visualization. We could easily add in 2,500 points, but with a table, right, we need to come up with some sort of a top list.

Have a look at the number of APIs each organization has. So AWS has a lot. Azure. GitHub. Google, obviously. And those also tend to have licenses, you know, they tend to have that figured out, and also logos.

And then this is going to be your job: to create the actual Great Tables table. Okay. Does this sound like a good plan? Yeah, it seems great. All right.

Joining and exploring the data

We cannot really assume that they're in the same order. Otherwise we could do a concatenation, but I think it's safe to do a join on the name. Let's do info.join. We're going to join with logos. You see, I thought that we weren't going to use logos, and now it turns out we are going to use logos. That's life, you know, that is life. I'm going to join on name. And I guess it doesn't really matter how; the how argument is, by default, "inner", an inner join, which sounds good to me.

So this should work. What are we going to call this? For old time's sake... Maybe we want to narrow down the columns that we're actually going to use, just to make things easier.

Yeah, so, like, this is 2,529 rows, and info was 2,529. So we know that, and logos as well, so we got kind of the whole thing. But that's surprising. Like, I'd imagine some of the logo rows are... So maybe we still need to filter out. I bet the logos table has some that don't have logos.

If we could filter those out, that'll get us to, like... The URL column has no missing values. And I bet you there are 519 organizations in here. That's what I'm thinking: there are 519 unique URLs here. Yeah. Okay. That's super helpful for me, because in my head, I was thinking some of these wouldn't have logos.

Your point is so helpful to hear: actually, a lot of these have logos, but it's a smaller set of organizations, so we can cut it down. Yeah, thanks. That's so helpful for kind of switching my... like, I had a misconception about the data. Yeah, no, it's good to do these sanity checks.

Selecting columns and preparing the table

Yeah, let's first, you know, trim down some of the columns here, because there's a bit much to look at. So we obviously want name. URL refers to the actual logo. Ooh, some different formats. So we want to do a filter on that. I don't know whether Great Tables handles those. We could, I mean, we could try it, like, on the first 10 rows and just see.

license_name. Right, because you wanted to show the license. Yeah. Background color might be nice too, for styling purposes. Yeah, I'm into it. Okay. So we're going to select name, license_name, the URL for the logo, background color. Anything else? That's great.

Boom. Now this gets updated. I like that. Oh, that's wild. That's a cool feature. It just fits. Okay. So not all of them have a background color. We can set that to white if we wanted to. But now, I guess, one of the things that we need to do is split that name column. And yeah, what I also noticed with license_name is that sometimes it has the word "license".

And some of them have nothing. Obviously we need to filter for that. But so here: "Apache 2.0 license", and others are missing that last word. So maybe we need to strip the word "license". Maybe, but maybe it doesn't even come up if we look for, say, the 10 organizations with the most APIs.

They might also have, I guess, a trick, too, is they might have different licenses for different APIs. That is theoretically possible, although in practice I would think this is unlikely. Yeah, but there is a way. Yeah, we could also hand-pick a few, just to kind of test the process and try to get to the kind of end result.

Splitting the name column and grouping by org

Shall I make an attempt at splitting that name column? pl.col("name"), right, in order to start an expression. It's a string method, so let's do split on the colon. Nice. What does this do? It gives us a list. And I guess we want the first element of this list: list.first(). Oh, nice. I didn't realize that was a thing. All right. That was too easy. So this is the name. Well, I guess we want to call this org. Yeah, cool.

Now I guess we want to do some grouping. Yeah, some aggregation. Sounds right. Yes. Group by org. We want to aggregate this. If we're going to be diligent about this, then we're going to check for duplicates.

Oh, like all rows, do they all have the same value? Yeah. Right, because... Oh, and I guess what we also want to do is get a count. Count equals pl dot... I do think there's kind of a duplicated, an is_duplicated, that should... Because that's like, since it's grouped by org.

Well, I guess what we could do is get a unique, and if we don't still end up with... So name obviously is not unique, right, because these are all the different APIs within a single organization. Yeah. Okay. Just to double-check: right now with this group by, you're trying to see, are there any, like, row duplicates, is that right? Like rows that are duplicated.

Well, there are definitely duplicated rows when an organization has multiple APIs. Yeah. And that's right. So that's what we're hoping is that... They could have, like, a different URL, I guess. A different URL, but... What you also mentioned a moment ago is that they can potentially have different licenses. So Google may have an API with one license and then another API with another license. Yeah, potentially.

Do you mind saying what you're trying to do with the group by? Just recapping. Well, it depends on what we want to put in our great table. What do we want it to be? Can you just help me understand the calculation? Like the group by, what's it trying to do? Okay. So let me first get rid of this line so that I can show you what we have here.

This is our joined data frame with, you know, fewer columns. But what we did is we've created a new column called org, which contains only the part before the colon. Yeah, we've called that org, the organization. Yeah. So there are a couple of organizations that have more than one API. Okay. Amazon AWS.

Am I right in understanding what you're trying to do with it, is you're trying to say, for each org, you're trying to look at each unique kind of entry for each column, is that right? So you want to see, like... Yeah. Say for 1password.com, you're trying to list out each URL that's unique. So let's verify, let's have a look. It always helps.

Because this is still... it's not a lot of data, but it still doesn't fit on your screen. So let's again do that thing where we're just going to have a look at just a couple of instances. Org contains... contains AWS. Yeah, that's the string... that's a string method, yes. So these are all AWS APIs, 272 of them. Right.

Well, let's then be a little bit more specific: amazonaws.com. Yes. Let me save this to a separate variable so that we can inspect it here easily. Yeah, so these are all AWS APIs, and they all have the same org. Now we could check how many different licenses they have, and they all have the same license, in this particular case. Yeah. It is still theoretically possible that some other organization uses multiple licenses.

But this group by, just to go back... Yeah, so this is kind of where you're trying to go, and the group by is just listing out the unique entries, so you can kind of see, like, oh, there's only one thing listed, so it must be the same for all. Exactly, exactly. Yeah, nice. And your pl.all() is kind of... you're doing it for each column, so you can check, if you wanted to check something else, like background color. Yeah.

Well, now I'm... Yeah, and I'm doing unique. So I'm calling... So these are all the columns except for the column on which we are grouping by, which is org. For all the other columns, we're going to take all their values, keep the unique ones, so that we don't have duplicated values in there, and then they're going to end up in a list. Yeah. And we're adding another column called count. Yep. Which is the number of rows in this group, right? The number of APIs that this organization has. Yeah.

So if you fired this off, I'd be curious to take a look at this data frame, even with the filter on. Yeah, even with the filter. All right. No, that's always useful. So this should be only one row. Nice. And just to recap: pl.all().unique() is column-wise. It does, for each column, get the unique values for that column. So it's not doing unique across the rows; it just lets you glance at each column. Exactly. This is done on a per-column basis, which is not always what you want, but in this case, yes. Yeah, that's a cool move.

Planning the Great Tables output

So, yeah, again, just to make sure that we're on the same page here, right, thinking about the great table that we want to produce: I was thinking we would have the top 10 or so organizations. We could list out how many APIs they have, what license or licenses they use, their logo. Right. I mean, it's not rocket science, but we're going to make it look pretty. Right. So, yeah. No, we got this. I feel like this is really shaping up. We're nearly there.

So we have here this aggregated... so let's call this orgs, with a G. We don't need this filter anymore. I'm going to move this over here with a comment for now. So what we have here is called orgs. Let's have a quick look. These are all lists. And if we're going to be... So there are some that don't have a license name. I guess we would need to filter on those first. Shall we do that?

There we just want only the rows that have data. So that just drops, like, all null rows, right? Like, no, whenever there is... Well, we can check: any column with a null, because license name here is null still, is that right? Or is it? No, now we only have rows that have no missing values at all. Oh, cool. And we're down to 43. So... Yeah, we're cooking.

What's this? Got a list, expected a string, for series named background color. Why is this... string, list... transparent. I think we might just want to drop, like, where license is null or something, because this will also drop rows if the background color is null. Is that right? Right, and that's not what we want. This is a little bit too aggressive. Yeah. It's good to know about dropping those. I actually had no idea, though.

I wanted to check how many values... Let me see. Oh, there are some. Yeah, these are the bigger ones. Who's the winner? Who's got 281? Google. Yeah. Okay, that makes sense. Right. And just by glancing at this... Whoa, no, wait. So there are different... So this one, the EPA, and also the governments of British Columbia and Canada, they use different licenses. Yeah.

Yeah. It's a fun... Honestly, I'd never do this move: the unique, like, grab all columns, unique. It's kind of a cool move. It produces a lot of ragged lists, basically, but it lets you kind of swoop across, which is... Yeah. I don't know, I'm just making this up on the fly. Yeah, I feel like I'll definitely use it again.

Building the Great Tables output

So I guess for the actual... do we... I guess we're just interested in the logo, right? No, and also in the license name, do you want to... Yeah, let's try it. Maybe we just try the top 5 and we try just the name and the, like, org and the logo, just to see if Great Tables can do it.

Yeah, then, you know what, let's... Now, I don't understand why this doesn't work. I'm taking the unique one. Yeah. Wait, I don't know. Okay, that's interesting. Could it be because org itself is a string? Or does... doesn't... I guess, because it's grouped by org, it'll ignore it. Yeah, but this returns a list, so I would expect... You see, there are still many things I don't understand about... I would expect that because this returns a list, I would need to use the list namespace.

The really intriguing thing is that it says: I got a list, I got a string, for series with empty name. Okay, so I'm kind of cheating here, but maybe in the interest of time that's okay. Yeah. What I'm doing is I'm just... if there are multiple values, I'm just grabbing the first one. Yeah, you would say that for this first iteration, that's okay.

Yeah, and then, I mean, I think, try like the first 5 orgs and a logo, and then just kind of iterate. Yeah. So now you want to get the top ones. So let's sort on count. Nice. Descending equals true. Cool.

So we got our license name. This is it. Yeah, we shouldn't be... Oh, background color is all... is not saying a lot here, so we might as well make that white for all logos. But I guess now, what's this? Yeah, we might need to do a couple, but let's just... maybe we just try to fire up the great table and see what comes out. Yeah, because we should see the problems when we try to create the table. So we should be able to just... Yeah. Okay.

So now it's your turn. We have the orgs data frame. How do we turn this into a table? Yeah. So if we do orgs dot style, we might need to install Great Tables, but let's just run it and see. I installed all the optional dependencies of Polars as well. Yeah. Oh, nice. Okay. So this is a great table. Yes, which is a formatted... And then if we do dot fmt underscore image, and then the name of the column we want to try, so URL.

What's going to... just running this. Boom. I can't even believe it. I can't believe I didn't have the faith in Great Tables. This is more... I think this is like a feather in the cap of Rich. Oh, yes. Great Tables, who every day astounds me that this just ended up working. So it just works. No credit to me.
You definitely deserve credit here. But the data, on the other hand, is missing some... There are two URLs that are not working here. That's... So we could filter those out. We could say... Yeah. Yeah, I do think maybe, in the interest of time, we just punt on them. Yeah.

So, I would say, before we do the sort over here, we're going to filter: pl dot col, URL... does it... contains... any... Oh, nice. Wait: "Filter rows, retaining those that match the given predicate expressions." Does that mean this can just take a list? Oh, I should turn it into a list, I guess. I'm not sure. Yeah, I don't know. We'll just figure it out.

Did it? Oh, yeah. Oh, maybe. Oh, and... Wait, so I should do this before I do the group by. Although then... I would just put it even at the... You can even do it when you're generating the great table, which is kind of to say, like, we need to maybe go back and clean up, but we're kind of just dropping it for now. Yeah, but I need to see if this works, because I'm doing this before the grouping, because group by only takes... Or I would need to apply this filter on the list level inside the aggregation here. That's the neatest.

Yeah, this seems fine. This looks nice. But I guess we could apply a little bit more styling to this great table, let it shine. Um, so some nice things we could do... Do we want to take name out of the table? Yeah. Nice. Tell me. So I think we'll need to do it before style, so it's kind of like we'll put all the Polars stuff before. Let's take... or we could drop cols. Yeah, you could do... cols_hide. Sorry. Yep. You're schooling me on Great Tables. Hey, man, I'm just making this up.

What I like is that there's this separation of the actual data and how it's being represented. What you sometimes see people do is that they change the actual data in order to make it look right.

Um, add the percentage sign to a number, but then all of a sudden it's a string and you cannot do any other calculations with it anymore. So, yeah, it's a good point too, because I forgot that with Great Tables, one of the big advantages is you can use a column for, like, calculation, or, your point, like, formatting or styling. Like, you could set a background color and then you could hide the background color column. Yes, so that it's used to style, but then the data... it's used to style, like, another column, and then you hide it, basically.

Refining the Great Tables output

Oh, wait, this should then... I think it's a list. Okay, I'm going to file an issue for that. I mean, that's legit. Yeah. All right.

I guess, do we want to have the logo in front of the organization, but then not have the header name URL? How would you do that? So I would move the... cols_move, URL. I think that will actually just move it to the front, because it's default 0, or after 0, maybe... missing "after", and "after" should be... I think there's a cols_move_to_start, maybe, the method. You're right, it says so in the help. Sorry, cols_move_to_start. Yeah, nice. Now we're cooking.

And then we could try applying the background color, I don't know, to the... You could do, like, tab_style if you want to try applying the background color. Um, well, the background color was either white or missing. Yeah, so... Okay, I see. So less critical.

Could we add a bar chart? Could we do a bar chart instead of the... Yeah. Fmt_nanoplot, I think. You know all the methods. I've got to close the door; we have a cat that blasted the door open. I'll be so fast. That's life.

Here I am. Um, yeah, I think fmt_nanoplot should... I have to admit I'm not super familiar with... Yes, you are. It's incredible. And I think you can make it a bar if you want.

I remember there's also the possibility to still show the numbers, but then have a gradient background, like, the higher the number, the darker the background color. Data color... data underscore color. There you go. This is so helpful for me, because I actually... well, having coded the thing... Is it fmt...? What is it? I think it's data underscore color. It's got a special name, data_color. On count. And I think it'll try to do it automatically for you.

So, I'm not so happy with this... We might have to pull up the data_color docs. I'm a little rusty on my... I just feel alive. All right, let's see. Next time you're driving. Oh, it's got all these palettes. Wow. So, what's that? That's a sequential palette. I'm in the data_color docs.

It just works great, but... Try greens. What happens if you put greens in? I'm guessing everything... Greens, greens. Yeah, I think so. Doesn't exist. Oh, maybe uppercase Greens, like, the G is... sorry, the G is silent. Yeah, that's right. The 'Reens. Oh, yeah, that's nice. That's subtle. I like that, to have a single color.

Wow. What if we get rid of those? Actually, I'm... What do you think? Yeah, I'm not too... It's kind of nice. You can see the extremes. They are so extreme. So it has to be less than 30. Oh, what does it mean? It should be... sorry. Oh, wrong method. That's nicer. Oh, yeah, it's great.

Wait. Okay, so one final thing that I think we can improve on this table is the column names. Right now they're the actual column names in the data frame. I think we can make it a little bit nicer. So we'll do dot cols_label. Let's do that up here, after... cols_label. Is this a dictionary? Or it can just be keyword arguments, I think. Oh, all right. Yeah, I guess then... so it becomes: org is organization, for example.

Okay, nice. Yeah. And then license name is license, and count is, I don't know, number of... Yeah, can we hide URL? Can we say URL equals none? Maybe a blank, empty string. That works. Nice. Or can you set the URL column to this... what do you call it... index? The stub, that's it. If you do tab_stub, I think.

Oh, I'm honestly so scared every time you do something. I just... I worry Great Tables will explode. So this is also... It is, yeah. But it's okay, this is a safe space. And so far, both you and Great Tables have demonstrated that you're up for it. Up to it, I should say. You're up for it and you're up to it.

Wrapping up

Hey, man. I feel that we have, also looking at the time... I mean, we could go on forever, you and I, but I'm pretty happy. Yeah, thanks for really pulling us through. I feel like... it's crazy, we got the table out in the last, I feel like, 10, 20 minutes. We really wanted it out, you know, everything up to that... This is great.

Yeah, let me just... you know what we'll do: we'll put this online. We'll put the link to this code in the description. And, um, I don't know, I really enjoyed this. Yes, maybe we'll do this more often. We'll see. Now that we've got a great table cooking, I'm like, I'll do anything, you know, so...

We did this. Yeah, we'll clean this up a little bit, not too much. And there are still a lot of things to do. I haven't even looked at what others have done with this data. I didn't want to be... What other great tables they might have made. Oh, yeah. Or visualizations, or, I don't know, yeah, some machine learning classifiers.

There's a lot you can do with all these data sets. And I do feel like we went through a lot of, like, classic data exploration topics, even, like, the little bumps you hit, like the null values. Yep. The schema. These are such classic things, so I think it was helpful to kind of... Yeah, and just looking up documentation. Yeah, that's all part of it. And, uh...

We were also surprised, I was, a couple of times, by what Positron offers. I mean, we were looking at tables in the table viewer nonstop. I feel like the table viewer was actually, like, load-bearing. So that was... to see everything kind of there. But, um, yeah. Hey, this was a lot of fun. I hope others will find this helpful as well. But in any case, it was good chatting with you, good doing data analysis with you, and I'll see you next time. Yeah, thanks. This is a real treat, so thanks for having me. Next... next Friday... Tuesday. All right. Yeah. Take care. Bye. See you. See you.