Resources

Inspecting websites to find JSON data APIs | Marcos Huerta | Data Science Lab

The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We'd love to have you! The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.

On this call, Libby Heeren is joined by Marcos Huerta, a Data Science Manager at CarMax, as he walks us through the guts of websites looking for data we can play with. He shows us how to find hidden REST/JSON APIs by using the web inspector in Safari/Firefox, and then how to get what's necessary to pull the same data programmatically in Python or R.

Hosting crew from Posit: Libby Heeren, Isabella Velasquez, Daniel Chen

Marcos's URLs:
Website: https://marcoshuerta.com
GitHub: https://github.com/astrowonk/

Resources from the hosts and from participants in the Discord chat:
Postman: https://www.postman.com/
Insomnia (open source alternative to Postman): https://insomnia.rest/
Baseball Savant website Marcos is using: https://baseballsavant.mlb.com/gamefeed/?gamePk=777076
Isabella Velasquez's blog on using the {polite} R package to help scrape Wikipedia: https://ivelasq.rbind.io/blog/politely-scraping/
Festivitas Mac app Marcos used to add the lights to his desktop: https://festivitas.app/
Ted Laderas's blog post on parsing JSON in R: https://laderast.github.io/intro_apis_json_cascadia/#/how-does-r-translate-json
New rvest read_html_live() function: https://rvest.tidyverse.org/reference/read_html_live.html
yyjsonr R package: https://github.com/coolbutuseless/yyjsonr
tuber R package: https://github.com/gojiplus/tuber
WikipediaR R package: https://www.quantargo.com/help/r/latest/packages/WikipediaR/1.1/WikipediaR-package
rookiepy Python package: https://pypi.org/project/rookiepy/

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for learning with us!

Timestamps
00:00 Introduction
03:05 Web scraping vs. API calls
04:12 Server-side rendering vs. client-side JSON
06:12 Warning: Rate limits and business ethics (ahem)
08:39 Demo: Baseball Savant website
08:57 Using browser Developer Tools and the Network tab
12:15 "What is curl?"
13:30 Importing curl into Postman
16:03 Generating Python code from Postman
16:50 "Are there open source alternatives to Postman?"
17:50 Using the generated code in Python/Jupyter
22:28 R packages for JSON (jsonlite, yyjsonr)
25:09 Demo: Massachusetts Lottery website
28:17 Example: scripts Marcos automated with cron jobs
30:17 Handling logins and cookies with rookiepy
32:19 Demo: CNN election data
34:26 Inspecting ESPN's website
36:58 "Can you scrape YouTube?"
38:19 Finding hidden JSON in CardsMania history
45:00 Benefits of API inspection over Beautiful Soup
46:59 New rvest function: read_html_live
50:40 Inspecting LinkedIn and finding GraphQL
53:58 Encouragement on handling API pagination

Jan 26, 2026
54 min


Transcript

This transcript was generated automatically and may contain errors.

We are going to go ahead and get started. I'm so excited to have everybody hanging out with us today for the data science lab. My name is Libby. I'm a data community manager here at Posit, and I am joined by Isabella and Daniel. Hi, everyone. I'm Isabella. I'm on the DevRel team at Posit. Hello, everyone. I'm also on the DevRel team at Posit, and I guess I'll see everyone in the Discord chat.

Yeah, Dan's going to be our Discord gremlin today. And I'm joined by Marcos Huerta. Hello, everybody. I'm Marcos. Happy to join Libby and everyone today.

Yeah, we are getting together every Tuesday at the data science lab to be a much more open, transparent, screen-sharing, messy place to hang out with your friends and code. This will be like a, hey, my friend wants to show me how he does something, and we're going to code together. What this means is this is a place for you to stop and ask questions, to say, I can't see your screen. Your text is too small. Wait, slow down. Do that again. This is not a presentation or a talk. This is us like hanging out and coding together.

And if you are someone who knows a lot about the topic that we're covering, I hope that you will share your knowledge and your resources and extra tidbits about what we're talking about in the chat, because that is what this whole thing is about. This is a space for everybody. It doesn't matter what your years of experience are. It doesn't matter what background you come from. You are in the right place, and the Discord server in the data science lab channel is the place to be.

Introduction to finding hidden JSON APIs

What we're going to talk about today is the guts of websites. So when we want to get data from websites, we are often looking at web scraping using, like, Beautiful Soup or rvest, maybe rvest plus Selenium. By the way, rvest has a new read_html_live() function, which means you don't have to use Selenium all the time, which is cool. But there's also an alternative to this: an API call to an API that's not official, right? There are some websites that have official APIs. Most websites aren't going to have an API that's structured for the public to use, right? But that doesn't mean that that website's not receiving data through API calls.

So, like, opening statements: yeah, Libby said it very well. I think 10, 20 years ago, most websites, or a lot of websites, rendered server side, right? They would do a bunch of stuff on the server and then your computer would retrieve an HTML table or whatever from the server. And that's when you could scrape that with Beautiful Soup, right? You can do a requests.get, or httr in R. You can basically pull that in, and then you've got this giant string of HTML, and various tools like Beautiful Soup can parse those tables and extract the data, right?

I don't know when the trend started, but now a lot of times what happens is the server is not sending you HTML. The server is sending you data, and JavaScript running on your computer is taking that data and building the pieces of the website for you. So the table you see on a website is not there in the source: if you went and, like, curled that from the command line, or did a requests.get or whatever, programmatically, with some retrieval code that does not process JavaScript, you would just get a bunch of JavaScript tags and you wouldn't actually have a table, right?
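The difference is easy to see offline. Below is a minimal sketch (the HTML and JSON payload are invented stand-ins, not from a real site): a naive parser finds no table in a client-side-rendered page, while the JSON the page's JavaScript would fetch parses directly.

```python
import json
from html.parser import HTMLParser

# A client-side-rendered page: the server ships an empty div and a
# script bundle, not a table.
RENDERED_BY_JS = """
<html><body>
  <div id="app"></div>
  <script src="/static/bundle.js"></script>
</body></html>
"""

# Meanwhile the data the JavaScript fetches is plain JSON.
JSON_PAYLOAD = '{"rows": [{"team": "BOS", "runs": 4}, {"team": "NYY", "runs": 2}]}'

class TableFinder(HTMLParser):
    """Collects every <table> start tag, the way a naive scraper would."""
    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append(attrs)

finder = TableFinder()
finder.feed(RENDERED_BY_JS)
print(len(finder.tables))                            # 0 -- nothing to scrape
print(json.loads(JSON_PAYLOAD)["rows"][0]["team"])   # BOS -- the data is right there
```

A real scraper would use Beautiful Soup instead of this tiny parser, but the point stands: the table never exists in the HTML the server sends.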

So there's two reasons to use these private APIs. One is because it's cleaner and better than trying to do Beautiful Soup. And two is sometimes Beautiful Soup won't work, right? Because you'd have to, like, maybe save the rendered source from a web browser or something to get that generated table. So what we're going to do is we're going to look at a couple of websites. We will use the developer tools. I'm going to probably mostly be showing Safari, but I'll also show off Firefox and Chrome briefly.

You're going to be using those developer tools, looking for what is your browser actually going to get and seeing if any of that is JSON. And if it is JSON, then we will figure out — you can obviously save that data directly if you just want to play with it. But if you want to get it the next time, we'll show you some tricks to maybe getting some code that will let you kind of hit that over and over again.

Another big caveat is, like, this is all fun if you're just trying to do some analysis. Like, hey, I really want this data from this baseball game or from this government website or whatever. Do not build a business around, like, the tools. Because eventually if you're hitting this thing, like, if you hit it once every now and then, no one's ever going to notice. They're just going to think you're a web browser. If you're, like, banging some private API, just pounding away at it, like, with, like, a, you know, cron job every minute, like, they're eventually going to figure that out and probably block your IP address. So, like, don't do that.


A lot of these sites, well, not all of them, but a lot of them will have rate limits built into the API that are kind of designed for what a human would do clicking around the website. So, if you start hitting it with, you know, requests.get or curl, you know, a gazillion times, you might start getting, like, you know, non-200 responses. You'll get a 400. You'll get a 500. But you'll get some HTTP code that's basically, like, you've hit the rate limit.
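If you do hit a rate limit, the polite pattern is to check the status code and back off before retrying. Here's a minimal sketch; the `fetch` callable is injectable (it would be `requests.get` in real use) so the example runs offline against a fake server.

```python
import time

def get_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Retry a request when the server signals a rate limit (429) or a
    transient server error (5xx), backing off exponentially between tries.

    `fetch(url)` is any callable returning an object with a .status_code
    attribute -- e.g. requests.get, swapped out here so this runs offline."""
    for attempt in range(max_tries):
        resp = fetch(url)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (429, 500, 502, 503):
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()                # anything else is a real error
    raise RuntimeError(f"gave up on {url} after {max_tries} tries")

# Fake server: rate-limits the first two hits, then answers.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code
    def raise_for_status(self):
        pass

calls = []
def fake_fetch(url):
    calls.append(url)
    return FakeResponse(429 if len(calls) < 3 else 200)

resp = get_with_backoff(fake_fetch, "https://example.com/api", sleep=lambda s: None)
print(resp.status_code, len(calls))  # 200 3
```

Even with backoff, the advice above stands: keep your request volume close to what a human clicking around would generate.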

Inspecting Baseball Savant in Safari

So here we are at Baseball Savant. I'm in Safari. Feel free to follow along on your own web browser. And so, you know, it's got this nice, cool site. This is, like, a particular game, right? We've got pitches and exit velocities, all this cool data. So these are all kind of tables. This is, you know, this is all getting constructed with JavaScript. So the first thing you do is you want to inspect element.

But what we actually want, in Safari anyway, is the Network tab. So one thing you will usually have to do is refresh, because you want to get fresh requests. Now, in Safari, sometimes you might see what's called an XHR request. That's frequently a thing you want to look for. But sometimes they don't show up as XHR. They show up as just fetches. So we click on these. And, oh, look, what is this? It's JSON.

So here we see trending players. We see some JSON data. We've got a schedule here with a bunch of stuff in it. We've got this GF, which probably stands for game feed. And that has a ton of data. Look at all this JSON, right? So that's pretty exciting. And, you know, these actually are telling you what they are: pitcher yearly averages, right? So this is the start of everything.

Now, the first thing you can do is just look at the game feed. And in Safari, you can just right-click on this and save it. I think Safari is the only one where you can right-click and save a file, though Libby was telling me that when she's on Windows using Chrome, she can right-click and save the object.

Look at this, 100,000 lines of JSON. So here's this giant JSON file, right? And this is a whole other dark art: trying to figure out, in these nested JSON files, how to get out what you want. Like, for example, it looks like we've got the scoreboard as a key, and then we've got a list (sorry, a dictionary). But then within that dictionary, we've got a list of dictionaries. We've got some win probabilities by inning. This is all pretty cool, but you can see that it's kind of buried in here.

The point is there's a ton of data here, and we want to play with this data, right? So the first thing you could do is just save it and load it into a Pandas or Polars DataFrame or whatever.
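That save-and-load step is one call with the standard library. The miniature file below stands in for the real saved game feed; its key names (`scoreboard`, `home_win_prob`) are illustrative, not the actual Baseball Savant schema.

```python
import json
import os
import tempfile

# A miniature stand-in for the saved game-feed file -- the real one is
# ~100,000 lines; the key names here are just illustrative.
sample = {
    "scoreboard": {"teams": {"home": "BOS", "away": "NYY"}},
    "home_win_prob": [
        {"inning": 1, "homeTeamWinProb": 0.52},
        {"inning": 2, "homeTeamWinProb": 0.61},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "gf.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Loading it back is one call; then you explore the keys top-down.
with open(path) as f:
    feed = json.load(f)

print(list(feed))                           # ['scoreboard', 'home_win_prob']
print(feed["scoreboard"]["teams"]["home"])  # BOS
```

From there, `pandas.read_json(path)` or `pandas.DataFrame(...)` on the sub-structure you care about gets you to a table.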

Using cURL and Postman to generate code

I think what I want to do first is show, like, maybe you just want to get this data directly into your machine as opposed to saving it as a file like I did. So, again, in Safari (and this is definitely in all the other browsers), one thing we can do is right-click and Copy as cURL, right? So what is cURL? cURL is just a command line program on Linux or Unix, and it basically will download stuff, right? It will download the text, or it will stream it. If you typed this into, like, a terminal on the Mac, it would just spit out all of the JSON, because we're hitting the actual API here.

So this cURL command is hitting the API, and these are a bunch of cookies and a bunch of various headers that you may or may not need. So you could just paste this into your terminal and get some stuff out. But this brings us to an app called Postman. Postman will really try to make you have an account. I always ignore this and use it without an account.

So I pasted in that cURL. See, it says "Paste cURL to import," and then you paste it in. And now it has translated all of that: the params are just this question mark thing, which is some kind of identifier for the game, and the headers are all those things we saw. And if we click Send, we should get back a bunch of JSON. And here's the JSON that we saw, right?
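What Postman reconstructs from a copied cURL command is really just three pieces: a URL, a query string, and a pile of headers. Here's that anatomy in Python's standard library; the endpoint path and header values are placeholders, not the real Baseball Savant request, and the request is built but never sent, so it runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and params modeled loosely on the gamefeed URL;
# the real request copied from the browser has many more headers.
base = "https://baseballsavant.mlb.com/gf"
params = {"game_pk": "777076"}          # the "question mark thing"
headers = {
    "User-Agent": "Mozilla/5.0",        # pretend to be a browser
    "Referer": "https://baseballsavant.mlb.com/",
}

url = f"{base}?{urlencode(params)}"
req = Request(url, headers=headers)     # built, not sent

print(req.full_url)
print(req.get_header("User-agent"))     # urllib capitalizes header keys this way
```

Sending it would be one more line with `urllib.request.urlopen(req)` or, more comfortably, `requests.get(url, headers=headers)`.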

So now we've got all this JSON data here. One thing you can do, one thing I like to do, is start unchecking things: does it really need this cookie, or is it going to get mad at me? Will the request work without the cookie? No, the request still works without the cookie. Do I really need this user agent string with all this Apple stuff, you know, pretending we're Safari? Yeah, it still works. Do we really need the referer? Probably. Let's see. Do we need it? No, we don't. So you can kind of start turning things off and seeing if the request still works or if it doesn't, because sometimes you really need that cookie.

Sometimes, you know, the cookie can be really important. For example, I have a little system by which I get some data off of the Kia website for my car, my electric car. The data is all there. The website only shows you like four columns, but if you do what I'm doing here, you can see like the real data structure. But I can't access that without a cookie, so I have to get the cookies out if I want to download it.

Anyway, this little button here, which is very hard to see, is the code button, and this is where it will turn this request that you imported into code. Python Requests is what I'm going to copy, because that's what I'm going to show off in Jupyter here in a second, but you've also got everything. You've got PHP. You can go with non-Requests Python if you want; there's base Python. There's httr. There's RCurl, right, and it will just generate the code for you. It's not that complicated, because it's really just a set of headers and a link. It's basically headers and a URL, but nevertheless, if you need code for different things, you can get it.

Are there open source alternatives to Postman? That's a great question. The other app that I've played with before is RapidAPI, which I don't know if it's open source or not. So RapidAPI is a similar app. I think there might be some Mac ones that are at least not Postman. Adam is saying Insomnia? There you go. There are plenty of them. Thanks, Adam. To me, the trick is: can it generate the code for me? Because it's just very convenient to have that ability to generate the code.

Running the request in Python

And it seems to be working. There we go. I think I was using an old version of Python; that was the problem. So now here we have a bunch of JSON. If you're using Requests in Python, it will basically parse the JSON for you, and then this is a giant dictionary. response.text will just give you the raw string, and response.json() will basically, essentially, import the JSON library and turn that into a giant dictionary for you. So now we have a giant dictionary called game_feed.
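That relationship between `.text` and `.json()` is just the standard library's `json.loads` under the hood, as this offline sketch shows (the JSON string here is an invented stand-in for a response body):

```python
import json

# requests' response.json() is essentially json.loads(response.text):
raw_text = '{"gamePk": 777076, "status": "Final"}'   # what .text would give you
game_feed = json.loads(raw_text)                     # what .json() would give you

print(type(game_feed).__name__)  # dict
print(game_feed["status"])       # Final
```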

And now we can look at the keys and we see all these things. And maybe we can look at home team data and see what's in there. And then you can start playing around with it and trying to figure out what you want. To me, what's interesting is the probabilities. I've looked at this data before.

So at some point you're going to have to, depending on what data you get, you know, it might be a very simple structure, which is just a nice list of something that you want. It might be more complicated where, for example, what I really find interesting in this data is the stuff that's making these probability graphs, right? I think that's kind of fun, right, because you have this probability of who's going to win going up and down. And so that's kind of what I would be interested in grabbing, right?

So now we have a list of dictionaries, and that's what you need to do something like turn it into a data frame, right? So now we have a data frame, right? Home team, away team, win probability. Basically it's a real-time estimate from the MLB computers as to how likely one team is to win the game or not, based on the score and the inning, et cetera, and the situation usually.
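That list-of-dictionaries shape is exactly what data frame constructors want. The rows below use invented key names, not the real feed's schema, and the pivot-to-columns step is done with the standard library so the sketch runs anywhere.

```python
# Win-probability data arrives as a list of dictionaries, one per game
# state -- key names invented here for illustration.
rows = [
    {"inning": 1, "half": "top", "home_win_prob": 0.50},
    {"inning": 1, "half": "bot", "home_win_prob": 0.55},
    {"inning": 2, "half": "top", "home_win_prob": 0.48},
]

# Pivot the list of row-dicts into named columns, which is all a data
# frame really is underneath.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns["inning"])              # [1, 1, 2]
print(max(columns["home_win_prob"]))  # 0.55
```

With pandas or Polars installed, `pandas.DataFrame(rows)` or `polars.DataFrame(rows)` gives you the same table in one call.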

So that is essentially the technique, right? And we can now apply this technique to a variety of websites and see what data is out there, right? To be honest with you, the getting to this, getting the data, getting the JSON is probably, once you've figured it out, it's a pretty easy step. Then there's like traversing the JSON structure, which is like dictionaries and lists of dictionaries and lists of dictionaries.


Exploring more websites

Here's the Massachusetts lottery. Powerball's big, if anyone is excited about Powerball. So we're still getting fetches (we're not getting XHRs, in Safari anyway), but we're getting interesting things like hot and cold numbers. And you can see that a lot of the stuff you're seeing on the screen here is actually a bunch of JSON. We have upcoming draw dates, draw schedules, the numbers of the latest draw, things like that.

So in the same deal, right, you could copy as curl, go over to Postman or RapidAPI, your favorite app, generate the code, switch to a code base, play around with it, automate it. So that is kind of the recipe.

This is how, if you've ever seen it, in the last couple of years I created an ICS file of the posit::conf schedule. Schedules are usually JSON objects somewhere, and you can go scrape them and do stuff with them. I was always like, oh, I wish there was a Google Calendar file that I could just upload and have all of the events separated and all of the talks separated, with all of their abstracts and stuff in there, so that I can choose between them. And so this method is how I did that for the past couple of years.
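As a sketch of that trick, the snippet below turns an invented mini schedule (the field names are assumptions, not the real conference feed) into a minimal ICS calendar using only the standard library.

```python
import json

# Invented mini schedule in the spirit of a conference's hidden JSON feed.
schedule = json.loads("""[
  {"title": "Keynote",  "start": "20250917T090000", "end": "20250917T100000"},
  {"title": "Workshop", "start": "20250917T103000", "end": "20250917T120000"}
]""")

def to_ics(events):
    """Render a list of event dicts as a bare-bones iCalendar string.
    Real ICS files also want UID/DTSTAMP fields and timezone handling."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0"]
    for ev in events:
        lines += [
            "BEGIN:VEVENT",
            f"SUMMARY:{ev['title']}",
            f"DTSTART:{ev['start']}",
            f"DTEND:{ev['end']}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)   # ICS lines are CRLF-terminated

ics = to_ics(schedule)
print("SUMMARY:Keynote" in ics)  # True
```

Write that string to a `.ics` file and most calendar apps will import it.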

And then the important lesson that I learned was I went back the next year to do it and the JSON structure was completely different. So I could not completely reuse all of my code, even though it was the same website. So there are always going to be challenges. And this is definitely not something that you're going to set up and, like Marcos said, build a thriving business on, something that will just be plug and play forever.

The only thing I actually automate with this is this site. For the Virginia lottery, a few years ago, I built out this kind of manual site that's a bunch of Jinja templates that I wrote myself. And it actually mostly does Beautiful Soup stuff and scrapes various scratch-off web pages for the lottery. But for the Virginia scratch-off things, I needed to know what all the games were. Cause, like, the URL was predictable based on this number that I need to scrape, but I didn't know what the list of active games was. And sure enough, somewhere on the Virginia lottery website is a tiny little JSON list when you go to look at the list of scratchers, much like the Massachusetts site I was showing. And so I hit that list once a day, right, to make sure I have any new games.

But what I'm doing is I'm looking at the ratio of unclaimed to claimed tickets and trying to see which games are the best deal. None of them are a good deal, but, like, which one is the best deal; every now and then you do get positive expected value.

So I have a DigitalOcean server that hosts all of my websites, all my Dash apps, formerly a Shiny server. And it also has cron. And sometimes I run these on my own Mac, cause my Mac mini never turns off. So sometimes, if it's more computationally intensive, I'll just spin up a cron job on my Mac at home.

But I was going to say, there's an advanced advanced, if you really want to get advanced about this. For example, I mentioned that the Kia website has stuff that I want for myself. It's my own data, right? And so that is obviously not accessible unless you're logged in. Well, the way it knows if you're logged in is a session cookie. And if you have something like rookiepy (I don't know if this, or anything like it, exists for R), then if you're logged in in Firefox, rookiepy can see all the cookies for a site from Firefox. So I can log into the website in Firefox, and then I can run my script. rookiepy goes and grabs the cookies I need, and then does all this Requests stuff that I was showing you and gets the data I want, the way I want it.
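As a sketch of that workflow: rookiepy returns browser cookies as a list of dicts with `name` and `value` fields (as far as I know; check its docs), and you fold them into a `Cookie` header for your request. The actual rookiepy call needs Firefox and a login, so it's shown in a comment and faked here so the example runs anywhere; the domain and cookie values are invented.

```python
def cookie_header(cookies):
    """Collapse a list of {"name": ..., "value": ...} cookie dicts --
    the shape rookiepy appears to return -- into a single Cookie:
    header value you can attach to a request."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

# With rookiepy installed and a live Firefox session, the real call
# would be something like (untested here, see the rookiepy docs):
#   import rookiepy
#   cookies = rookiepy.firefox(["example.com"])
# We fake the result so this sketch runs offline.
cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "csrf", "value": "xyz", "domain": ".example.com"},
]

headers = {"Cookie": cookie_header(cookies)}
print(headers["Cookie"])  # session_id=abc123; csrf=xyz
```

From there, `requests.get(url, headers=headers)` sends the logged-in session's cookies along with the request.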

Looking at Chrome, CNN elections, and other sites

So if you do this in Chrome: Inspect, refresh, and here we go. It's Network, and then Fetch. So there it is. So here's the same thing, right? And you just have to kind of be comfortable poking around. Once you do it once with whatever browser you prefer, you'll get used to looking for it. So here's the same data once again. And you can right-click and Copy as cURL.

Another one that I've used before is the CNN election data. Right. So here you actually see it under XHR. Right. And so, you know, here's some results from Virginia (we had elections in November), and you can see all these different hits; every race is a different link. Basically, Attorney General, Lieutenant Governor, and I guess governor. And so it's a similar thing and you can get it. I think this is, like, full data by county. Right. So you could probably download this by county, but this is a nice JSON version. Same deal. Right. You can copy as cURL. You can bring it over to Postman or Insomnia or your favorite thing.

So let's see if I can find it. What are the benefits of using this over going straight to something like Beautiful Soup? Well, one is that if you do it this way, the data is structured, right? You're not trying to, you know, find every TD cell. But fundamentally the main thing is that if you take a link to a website you can read and try to download it directly with Requests or httr or whatever, there won't be anything there, because you don't have JavaScript running to ingest the data and assemble it. So you have to do something like Selenium or whatever, which is essentially running a web browser for you.

So why go from data to HTML back to data when you can just go straight from data to data? That's kind of my philosophy. Again, there's some sites where there is no data stream where the server is sending you the table, in which case you do have to kind of scrape it away.


I did want to point out one more time, Ryan had put it in the chat, that if you are using R, the rvest package did not have this initially, but it is new and exciting: it's the read_html_live() function, which does what Marcos just described Selenium doing. When you run this, it opens a headless browser under the hood for you, pretending it's Chrome (using chromote, I think). And that way it can actively look at something that before you wouldn't be able to, right? You would be trying to get something static, but you were looking at something live. You can now do it in just rvest, which is really exciting. So definitely go check that out.

Cards game JSON and wrapping up

I mean, a fun one that I like, and this is something I do with my friends a lot, is we play a game. We play cards on CardsMania; let's see if this is going to work or not. So I realized a long time ago, this is a game I played with my friends a while back, that there is a hidden JSON structure and a game history. So this ridiculous JSON structure is basically the entire history of this game of Oh Hell that we play. And so I wrote a bunch of code to parse lists, cause this is, like, telling you what order you play the cards in: this is the queen of hearts, this is the king of spades, or whatever. And then I built a whole website around it for my colleagues and me to analyze how well, or not so well, we did when we were playing against each other in cards. So that might've been the start of my whole journey of JSON inspecting, trying to find hidden data structures.

There are lots of questions about, like, YouTube. Could we scrape YouTube? YouTube has its own API. You can just sign up for an API key and go look through stuff. There is also a transcript package for Python, but YouTube is going to figure out that you're not a browser pretty quickly and it's going to block your IP, and you have to, like, use a proxy to get around that. There are instructions for it, but it can be really complicated. I really suggest just going through the legit API.

I have found that the bigger the website, the more stuff is crammed in there. And so sometimes you will have more luck with a smaller, simpler website. But the logging-in thing is not always a barrier. I recommend just trying this: like, I'm interested in this data, it could be data about you, you might need to log in. Try it.

So sometimes you will find an API endpoint and copy it, and it will only show you 50 results. And you're like, that's weird. But if you start fiddling with the request and the payload, you might be able to work around that limit, right? It might have a setting, you know, a max or a limit that you can change.
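Fiddling with those parameters is usually just rewriting the query string. Here's a small helper for that using only the standard library; the URL and the `limit` parameter name are hypothetical, since every API names these differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def with_param(url, **overrides):
    """Return `url` with query-string parameters replaced or added --
    handy for experimenting with max/limit-style parameters a hidden
    API might honor. Parameter names here are hypothetical."""
    parts = urlsplit(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query.update({k: str(v) for k, v in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))

url = "https://example.com/api/results?state=VA&limit=50"
print(with_param(url, limit=500))
# https://example.com/api/results?state=VA&limit=500
```

Bump the value, resend, and see whether the server honors it or clamps it back down.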

GraphQL is this kind of quasi-SQL-like language that some APIs use. And so instead of there being, like, a JSON payload or a bunch of question marks in the URL, there's this kind of quasi-SQL JSON language you have to write. And I really don't like it.

But if you do hit something paginated (and YouTube's API is a great example; it's, like, 50 per page or whatever), it's going to give you a token that corresponds to the next page. And then when you write your loops or whatever you're doing, you just have it take that token and insert it into the next-page-token field of your next API call. And you can page through it that way. Don't let that stop you. Use ChatGPT to help you navigate the pagination stuff and understand it. That really, really slowed me down when I was first hitting APIs.
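The token-pagination loop described above can be sketched like this. The `nextPageToken` field name is borrowed from the YouTube Data API's style but varies by API, and the three-page fake server lets the sketch run offline.

```python
def fetch_all(fetch_page, first_token=None):
    """Walk a token-paginated API: each response carries that page's
    items plus a next-page token that you feed into the next request,
    until no token comes back."""
    items, token = [], first_token
    while True:
        page = fetch_page(token)
        items.extend(page["items"])
        token = page.get("nextPageToken")
        if not token:
            return items

# Fake three-page API standing in for something like the YouTube API.
PAGES = {
    None: {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3, 4], "nextPageToken": "p3"},
    "p3": {"items": [5]},   # no token -> last page
}

result = fetch_all(lambda token: PAGES[token])
print(result)  # [1, 2, 3, 4, 5]
```

In real use, `fetch_page` would issue the HTTP request with the token as a query parameter; the loop shape stays the same.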