Resources

Inspecting websites to find JSON data APIs | Marcos Huerta | Data Science Lab

The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We'd love to have you! The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.

On this call, Libby Heeren is joined by Marcos Huerta, a Data Science Manager at CarMax, as he walks us through the guts of websites looking for data we can play with. He shows us how to find hidden REST/JSON APIs by using the web inspector in Safari/Firefox, and then how to get what's necessary to pull the same data programmatically in Python or R.

Hosting crew from Posit: Libby Heeren, Isabella Velasquez, Daniel Chen

Marcos's URLs:
Website: https://marcoshuerta.com
GitHub: https://github.com/astrowonk/

Resources from the hosts and from participants in the Discord chat:
Postman: https://www.postman.com/
Insomnia (open source alternative to Postman): https://insomnia.rest/
Baseball Savant website Marcos is using: https://baseballsavant.mlb.com/gamefeed/?gamePk=777076
Isabella Velasquez's blog on using the {polite} R package to help scrape Wikipedia: https://ivelasq.rbind.io/blog/politely-scraping/
Festivitas Mac app Marcos used to add the lights to his desktop: https://festivitas.app/
Ted Laderas's blog post on parsing JSON in R: https://laderast.github.io/intro_apis_json_cascadia/#/how-does-r-translate-json
New rvest read_html_live() function: https://rvest.tidyverse.org/reference/read_html_live.html
yyjsonr R package: https://github.com/coolbutuseless/yyjsonr
tuber R package: https://github.com/gojiplus/tuber
WikipediaR R package: https://www.quantargo.com/help/r/latest/packages/WikipediaR/1.1/WikipediaR-package
rookiepy Python package: https://pypi.org/project/rookiepy/

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for learning with us!

Timestamps
00:00 Introduction
03:05 Web scraping vs. API calls
04:12 Server-side rendering vs. client-side JSON
06:12 Warning: Rate limits and business ethics (ahem)
08:39 Demo: Baseball Savant website
08:57 Using browser Developer Tools and the Network tab
12:15 "What is curl?"
13:30 Importing curl into Postman
16:03 Generating Python code from Postman
16:50 "Are there open source alternatives to Postman?"
17:50 Using the generated code in Python/Jupyter
22:28 R packages for JSON (jsonlite, yyjsonr)
25:09 Demo: Massachusetts Lottery website
28:17 Example: scripts Marcos automated with cron jobs
30:17 Handling logins and cookies with rookiepy
32:19 Demo: CNN election data
34:26 Inspecting ESPN's website
36:58 "Can you scrape YouTube?"
38:19 Finding hidden JSON in CardsMania history
45:00 Benefits of API inspection over Beautiful Soup
46:59 New rvest function: read_html_live
50:40 Inspecting LinkedIn and finding GraphQL
53:58 Encouragement on handling API pagination

Jan 26, 2026
54 min


Transcript

This transcript was generated automatically and may contain errors.

We are going to go ahead and get started. I'm so excited to have everybody hanging out with us today for the data science lab. My name is Libby. I'm a data community manager here at Posit, and I am joined by Isabella and Daniel. Hi, everyone. I'm Isabella. I'm on the DevRel team at Posit. Hello, everyone. I'm also on the DevRel team at Posit, and I guess I'll see everyone in the Discord chat.

Yeah, Dan's going to be our Discord gremlin today. And I'm joined by Marcos Huerta. Hello, everybody. I'm Marcos. Happy to join Libby and everyone today.

Yeah, we are getting together every Tuesday at the data science lab to be a much more open, transparent, screen-sharing, messy place to hang out with your friends and code. This will be like a, hey, my friend wants to show me how he does something, and we're going to code together. What this means is this is a place for you to stop and ask questions, to say, I can't see your screen. Your text is too small. Wait, slow down. Do that again. This is not a presentation or a talk. This is us like hanging out and coding together.

And if you are someone who knows a lot about the topic that we're covering, I hope that you will share your knowledge and your resources and extra tidbits about what we're talking about in the chat, because that is what this whole thing is about. This is a space for everybody. It doesn't matter what your years of experience are. It doesn't matter what background you come from. You are in the right place, and the Discord server in the data science lab channel is the place to be.

Introduction to finding hidden JSON APIs

What we're going to talk about today is the guts of websites. So when we want to get data from websites, we are often looking at web scraping using, like, Beautiful Soup or rvest, maybe rvest plus Selenium. By the way, rvest has a new read_html_live() function, which means you don't have to use Selenium all the time, which is cool. But there's also an alternative to this: an API call to an API that's not official, right? There are some websites that have official APIs. Most websites aren't going to have an API that's structured for the public to use, right? But that doesn't mean that that website's not receiving data through API calls.

So, like, opening statements: yeah, Libby said it very well. I think 10, 20 years ago, most websites, or a lot of websites, rendered server side, right? They would do a bunch of stuff on the server and then your computer would retrieve an HTML table or whatever from the server. And that's when you could scrape that with Beautiful Soup, right? You can do a requests.get, or httr in R. You can basically pull that in, and then you've got this giant string of HTML, and various tools like Beautiful Soup can parse those tables and extract the data, right?

I don't know when the trend started, but now a lot of times what happens is the server is not sending you HTML. The server is sending you data, and JavaScript running on your computer is taking that data and building the pieces of the website for you. So the table you see on a website is not there in the source: if you went and, like, curled that from the command line, or did a requests.get or whatever, programmatically, with some retrieval code that does not process JavaScript, you would just get a bunch of JavaScript tags and you wouldn't actually have a table, right?
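The difference is easy to see offline. Below is a minimal sketch (the HTML and JSON payload are invented stand-ins, not from a real site): a naive parser finds no table in a client-side-rendered page, while the JSON the page's JavaScript would fetch parses directly.

```python
import json
from html.parser import HTMLParser

# A client-side-rendered page: the server ships an empty div and a
# script bundle, not a table.
RENDERED_BY_JS = """
<html><body>
  <div id="app"></div>
  <script src="/static/bundle.js"></script>
</body></html>
"""

# Meanwhile the data the JavaScript fetches is plain JSON.
JSON_PAYLOAD = '{"rows": [{"team": "BOS", "runs": 4}, {"team": "NYY", "runs": 2}]}'

class TableFinder(HTMLParser):
    """Collects every <table> start tag, the way a naive scraper would."""
    def __init__(self):
        super().__init__()
        self.tables = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append(attrs)

finder = TableFinder()
finder.feed(RENDERED_BY_JS)
print(len(finder.tables))                            # 0 -- nothing to scrape
print(json.loads(JSON_PAYLOAD)["rows"][0]["team"])   # BOS -- the data is right there
```

A real scraper would use Beautiful Soup instead of this tiny parser, but the point stands: the table never exists in the HTML the server sends.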

So there's two reasons to use these private APIs. One is because it's cleaner and better than trying to do Beautiful Soup. And two is sometimes Beautiful Soup won't work, right? Because you'd have to, like, maybe save the rendered source from a web browser or something to get that generated table. So what we're going to do is we're going to look at a couple of websites. We will use the developer tools. I'm going to probably mostly be showing Safari, but I'll also show off Firefox and Chrome briefly.

You're going to be using those developer tools, looking for what is your browser actually going to get and seeing if any of that is JSON. And if it is JSON, then we will figure out — you can obviously save that data directly if you just want to play with it. But if you want to get it the next time, we'll show you some tricks to maybe getting some code that will let you kind of hit that over and over again.

Another big caveat is, like, this is all fun if you're just trying to do some analysis. Like, hey, I really want this data from this baseball game or from this government website or whatever. Do not build a business around, like, the tools. Because eventually if you're hitting this thing, like, if you hit it once every now and then, no one's ever going to notice. They're just going to think you're a web browser. If you're, like, banging some private API, just pounding away at it, like, with, like, a, you know, cron job every minute, like, they're eventually going to figure that out and probably block your IP address. So, like, don't do that.


A lot of these sites, well, not all of them, but a lot of them will have rate limits built into the API that are kind of designed for what a human would do clicking around the website. So, if you start hitting it with, you know, requests.get or curl, you know, a gazillion times, you might start getting, like, you know, non-200 responses. You'll get a 400. You'll get a 500. But you'll get some HTTP code that's basically, like, you've hit the rate limit.
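If you do hit a rate limit, the polite pattern is to check the status code and back off before retrying. Here's a minimal sketch; the `fetch` callable is injectable (it would be `requests.get` in real use) so the example runs offline against a fake server.

```python
import time

def get_with_backoff(fetch, url, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Retry a request when the server signals a rate limit (429) or a
    transient server error (5xx), backing off exponentially between tries.

    `fetch(url)` is any callable returning an object with a .status_code
    attribute -- e.g. requests.get, swapped out here so this runs offline."""
    for attempt in range(max_tries):
        resp = fetch(url)
        if resp.status_code == 200:
            return resp
        if resp.status_code in (429, 500, 502, 503):
            sleep(base_delay * 2 ** attempt)   # 1s, 2s, 4s, ...
            continue
        resp.raise_for_status()                # anything else is a real error
    raise RuntimeError(f"gave up on {url} after {max_tries} tries")

# Fake server: rate-limits the first two hits, then answers.
class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code
    def raise_for_status(self):
        pass

calls = []
def fake_fetch(url):
    calls.append(url)
    return FakeResponse(429 if len(calls) < 3 else 200)

resp = get_with_backoff(fake_fetch, "https://example.com/api", sleep=lambda s: None)
print(resp.status_code, len(calls))  # 200 3
```

Even with backoff, the advice above stands: keep your request volume close to what a human clicking around would generate.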

Inspecting Baseball Savant in Safari

So here we are at Baseball Savant. I'm in Safari. Feel free to follow along on your own web browser. And so, you know, it's got this nice, cool site. This is, like, a particular game, right? We've got pitches and exit velocities, all this cool data. So these are all kind of tables. This is, you know, this is all getting constructed with JavaScript. So the first thing you do is you want to inspect element.

But what we actually want, in Safari anyway, is the Network tab. So one thing you will usually have to do is refresh, because you want to get fresh requests. Now, in Safari, sometimes you might see what's called an XHR request. That's frequently a thing you want to look for. But sometimes they don't show up as XHR. They show up as just fetches. So we click on these. And, oh, look, what is this? It's JSON.

So here we see trending players. We see some JSON data. We've got a schedule here with a bunch of stuff in it. We've got this GF, which probably stands for game feed. And that has a ton of data. Look at all this JSON, right? So that's pretty exciting. And, you know, these actually are telling you what they are: pitcher yearly averages, right? So this is the start of everything.

Now, the first thing you can do is just look at the game feed. And in Safari, you can just right-click on this and save it. I think Safari is the only one where you can right-click and save a file, though Libby was telling me that when she's on Windows using Chrome, she can right-click and save the object.

Look at this, 100,000 lines of JSON. So here's this giant JSON file, right? And this is a whole other dark art: trying to figure out, in these nested JSON files, how to get out what you want. Like, for example, it looks like we've got the scoreboard as a key, and then we've got a list (sorry, a dictionary). But then within that dictionary, we've got a list of dictionaries. We've got some win probabilities by inning. This is all pretty cool, but you can see that it's kind of buried in here.

The point is there's a ton of data here, and we want to play with this data, right? So the first thing you could do is just save it and load it into a Pandas or Polars DataFrame or whatever.
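That save-and-load step is one call with the standard library. The miniature file below stands in for the real saved game feed; its key names (`scoreboard`, `home_win_prob`) are illustrative, not the actual Baseball Savant schema.

```python
import json
import os
import tempfile

# A miniature stand-in for the saved game-feed file -- the real one is
# ~100,000 lines; the key names here are just illustrative.
sample = {
    "scoreboard": {"teams": {"home": "BOS", "away": "NYY"}},
    "home_win_prob": [
        {"inning": 1, "homeTeamWinProb": 0.52},
        {"inning": 2, "homeTeamWinProb": 0.61},
    ],
}

path = os.path.join(tempfile.mkdtemp(), "gf.json")
with open(path, "w") as f:
    json.dump(sample, f)

# Loading it back is one call; then you explore the keys top-down.
with open(path) as f:
    feed = json.load(f)

print(list(feed))                           # ['scoreboard', 'home_win_prob']
print(feed["scoreboard"]["teams"]["home"])  # BOS
```

From there, `pandas.read_json(path)` or `pandas.DataFrame(...)` on the sub-structure you care about gets you to a table.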

Using cURL and Postman to generate code

I think what I want to do first is show, like, maybe you just want to get this data directly into your machine as opposed to saving it as a file like I did. So, again, in Safari (and this is definitely in all the other browsers), one thing we can do is right-click and Copy as cURL, right? So what is cURL? cURL is just a command line program on Linux or Unix, and it basically will download stuff, right? It will download the text, or it will stream it. If you typed this into, like, a terminal on the Mac, it would just spit out all of the JSON, because we're hitting the actual API here.

So this cURL command is hitting the API, and these are a bunch of cookies and a bunch of various headers that you may or may not need. So you could just paste this into your terminal and get some stuff out. But this brings us to an app called Postman. Postman will really try to make you have an account. I always ignore this and use it without an account.

So I pasted in that cURL. See, it says "Paste cURL to import," and then you paste it in. And now it has translated all of that: the params are just this question mark thing, which is some kind of identifier for the game, and the headers are all those things we saw. And if we click Send, we should get back a bunch of JSON. And here's the JSON that we saw, right?
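What Postman reconstructs from a copied cURL command is really just three pieces: a URL, a query string, and a pile of headers. Here's that anatomy in Python's standard library; the endpoint path and header values are placeholders, not the real Baseball Savant request, and the request is built but never sent, so it runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder endpoint and params modeled loosely on the gamefeed URL;
# the real request copied from the browser has many more headers.
base = "https://baseballsavant.mlb.com/gf"
params = {"game_pk": "777076"}          # the "question mark thing"
headers = {
    "User-Agent": "Mozilla/5.0",        # pretend to be a browser
    "Referer": "https://baseballsavant.mlb.com/",
}

url = f"{base}?{urlencode(params)}"
req = Request(url, headers=headers)     # built, not sent

print(req.full_url)
print(req.get_header("User-agent"))     # urllib capitalizes header keys this way
```

Sending it would be one more line with `urllib.request.urlopen(req)` or, more comfortably, `requests.get(url, headers=headers)`.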

So now we've got all this JSON data here. One thing you can do, one thing I like to do, is start unchecking things: does it really need this cookie, or is it going to get mad at me? Will the request work without the cookie? No, the request still works without the cookie. Do I really need this user agent string with all this Apple stuff, you know, pretending we're Safari? Yeah, it still works. Do we really need the referer? Probably. Let's see. Do we need it? No, we don't. So you can kind of start turning things off and seeing if the request still works or if it doesn't, because sometimes you really need that cookie.

Sometimes, you know, the cookie can be really important. For example, I have a little system by which I get some data off of the Kia website for my car, my electric car. The data is all there. The website only shows you like four columns, but if you do what I'm doing here, you can see like the real data structure. But I can't access that without a cookie, so I have to get the cookies out if I want to download it.

Anyway, this little button here, which is very hard to see, is the code button, and this is where it will turn this request that you imported into code. Python Requests is what I'm going to copy, because that's what I'm going to show off in Jupyter here in a second, but you've also got everything. You've got PHP. You can go with non-Requests Python if you want; there's base Python. There's httr. There's RCurl, right, and it will just generate the code for you. It's not that complicated, because it's really just a set of headers and a link. It's basically headers and a URL, but nevertheless, if you need code for different things, you can get it.

Are there open source alternatives to Postman? That's a great question. The other app that I've played with before is RapidAPI, which I don't know if it's open source or not. So RapidAPI is a similar app. I think there might be some Mac ones that are at least not Postman. Adam is saying Insomnia? There you go. There are plenty of them. Thanks, Adam. To me, the trick is: can it generate the code for me? Because it's just very convenient to have that ability to generate the code.

Running the request in Python

And it seems to be working. There we go. I think I was using an old version of Python; that was the problem. So now here we have a bunch of JSON. If you're using Requests in Python, it will basically parse the JSON for you, and then this is a giant dictionary. response.text will just give you the raw string, and response.json() will basically, essentially, import the JSON library and turn that into a giant dictionary for you. So now we have a giant dictionary called game_feed.
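That relationship between `.text` and `.json()` is just the standard library's `json.loads` under the hood, as this offline sketch shows (the JSON string here is an invented stand-in for a response body):

```python
import json

# requests' response.json() is essentially json.loads(response.text):
raw_text = '{"gamePk": 777076, "status": "Final"}'   # what .text would give you
game_feed = json.loads(raw_text)                     # what .json() would give you

print(type(game_feed).__name__)  # dict
print(game_feed["status"])       # Final
```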

And now we can look at the keys and we see all these things. And maybe we can look at home team data and see what's in there. And then you can start playing around with it and trying to figure out what you want. To me, what's interesting is the probabilities. I've looked at this data before.

So at some point you're going to have to, depending on what data you get, you know, it might be a very simple structure, which is just a nice list of something that you want. It might be more complicated where, for example, what I really find interesting in this data is the stuff that's making these probability graphs, right? I think that's kind of fun, right, because you have this probability of who's going to win going up and down. And so that's kind of what I would be interested in grabbing, right?

So now we have a list of dictionaries, and that's what you need to do something like turn it into a data frame, right? So now we have a data frame, right? Home team, away team, win probability. Basically it's a real-time estimate from the MLB computers as to how likely one team is to win the game or not, based on the score and the inning, et cetera, and the situation usually.
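That list-of-dictionaries shape is exactly what data frame constructors want. The rows below use invented key names, not the real feed's schema, and the pivot-to-columns step is done with the standard library so the sketch runs anywhere.

```python
# Win-probability data arrives as a list of dictionaries, one per game
# state -- key names invented here for illustration.
rows = [
    {"inning": 1, "half": "top", "home_win_prob": 0.50},
    {"inning": 1, "half": "bot", "home_win_prob": 0.55},
    {"inning": 2, "half": "top", "home_win_prob": 0.48},
]

# Pivot the list of row-dicts into named columns, which is all a data
# frame really is underneath.
columns = {key: [row[key] for row in rows] for key in rows[0]}

print(columns["inning"])              # [1, 1, 2]
print(max(columns["home_win_prob"]))  # 0.55
```

With pandas or Polars installed, `pandas.DataFrame(rows)` or `polars.DataFrame(rows)` gives you the same table in one call.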

So that is essentially the technique, right? And we can now apply this technique to a variety of websites and see what data is out there, right? To be honest with you, the getting to this, getting the data, getting the JSON is probably, once you've figured it out, it's a pretty easy step. Then there's like traversing the JSON structure, which is like dictionaries and lists of dictionaries and lists of dictionaries.


Exploring more websites

Here's the Massachusetts lottery. Powerball's big, if anyone is excited about Powerball. So we're still getting fetches (we're not getting XHRs, in Safari anyway), but we're getting interesting things like hot and cold numbers. And you can see that a lot of the stuff you're seeing on the screen here is actually a bunch of JSON. We have upcoming draw dates, draw schedules, the numbers of the latest draw, things like that.

So in the same deal, right, you could copy as curl, go over to Postman or RapidAPI, your favorite app, generate the code, switch to a code base, play around with it, automate it. So that is kind of the recipe.

This is how, if you've ever seen it, in the last couple of years I created an ICS file of the posit::conf schedule. Schedules are usually JSON objects somewhere, and you can go scrape them and do stuff with them. I was always like, oh, I wish there was a Google Calendar file that I could just upload and have all of the events separated and all of the talks separated, with all of their abstracts and stuff in there, so that I can choose between them. And so this method is how I did that for the past couple of years.
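As a sketch of that trick, the snippet below turns an invented mini schedule (the field names are assumptions, not the real conference feed) into a minimal ICS calendar using only the standard library.

```python
import json

# Invented mini schedule in the spirit of a conference's hidden JSON feed.
schedule = json.loads("""[
  {"title": "Keynote",  "start": "20250917T090000", "end": "20250917T100000"},
  {"title": "Workshop", "start": "20250917T103000", "end": "20250917T120000"}
]""")

def to_ics(events):
    """Render a list of event dicts as a bare-bones iCalendar string.
    Real ICS files also want UID/DTSTAMP fields and timezone handling."""
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0"]
    for ev in events:
        lines += [
            "BEGIN:VEVENT",
            f"SUMMARY:{ev['title']}",
            f"DTSTART:{ev['start']}",
            f"DTEND:{ev['end']}",
            "END:VEVENT",
        ]
    lines.append("END:VCALENDAR")
    return "\r\n".join(lines)   # ICS lines are CRLF-terminated

ics = to_ics(schedule)
print("SUMMARY:Keynote" in ics)  # True
```

Write that string to a `.ics` file and most calendar apps will import it.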

And then the important lesson that I learned was I went back the next year to do it and the JSON structure was completely different. So I could not completely reuse all of my code, even though it was the same website. So there are always going to be challenges. And this is definitely not something that you're going to set up and, like Marcos said, build a thriving business on, something that will just be plug and play forever.

The only thing I actually automate with this is this site. For the Virginia lottery, a few years ago, I built out this kind of manual site that's a bunch of Jinja templates that I wrote myself. And it actually mostly does Beautiful Soup stuff and scrapes various scratch-off web pages for the lottery. But for the Virginia scratch-off things, I needed to know what all the games were. Cause, like, the URL was predictable based on this number that I need to scrape, but I didn't know what the list of active games was. And sure enough, somewhere on the Virginia lottery website is a tiny little JSON list when you go to look at the list of scratchers, much like the Massachusetts site I was showing. And so I hit that list once a day, right, to make sure I have any new games.

But what I'm doing is I'm looking at the ratio of unclaimed to claimed tickets and trying to see which games are the best deal. None of them are a good deal, but, like, which one is the best deal; every now and then you do get positive expected value.

So I have a DigitalOcean server that hosts all of my websites, all my Dash apps, formerly a Shiny server. And it also has cron. And sometimes I run these on my own Mac, cause my Mac mini never turns off. So sometimes, if it's more computationally intensive, I'll just spin up a cron job on my Mac at home.

But I was going to say, there's an advanced advanced, if you really want to get advanced about this. For example, I mentioned that the Kia website has stuff that I want for myself. It's my own data, right? And so that is obviously not accessible unless you're logged in. Well, the way it knows if you're logged in is a session cookie. And if you have something like rookiepy (I don't know if this, or anything like it, exists for R), then if you're logged in in Firefox, rookiepy can see all the cookies for a site from Firefox. So I can log into the website in Firefox, and then I can run my script. rookiepy goes and grabs the cookies I need, and then does all this Requests stuff that I was showing you and gets the data I want, the way I want it.
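As a sketch of that workflow: rookiepy returns browser cookies as a list of dicts with `name` and `value` fields (as far as I know; check its docs), and you fold them into a `Cookie` header for your request. The actual rookiepy call needs Firefox and a login, so it's shown in a comment and faked here so the example runs anywhere; the domain and cookie values are invented.

```python
def cookie_header(cookies):
    """Collapse a list of {"name": ..., "value": ...} cookie dicts --
    the shape rookiepy appears to return -- into a single Cookie:
    header value you can attach to a request."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

# With rookiepy installed and a live Firefox session, the real call
# would be something like (untested here, see the rookiepy docs):
#   import rookiepy
#   cookies = rookiepy.firefox(["example.com"])
# We fake the result so this sketch runs offline.
cookies = [
    {"name": "session_id", "value": "abc123", "domain": ".example.com"},
    {"name": "csrf", "value": "xyz", "domain": ".example.com"},
]

headers = {"Cookie": cookie_header(cookies)}
print(headers["Cookie"])  # session_id=abc123; csrf=xyz
```

From there, `requests.get(url, headers=headers)` sends the logged-in session's cookies along with the request.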

Looking at Chrome, CNN elections, and other sites

So if you do this in Chrome: Inspect, refresh, and here we go. It's Network, and then Fetch. So there it is. So here's the same thing, right? And you just have to kind of be comfortable poking around. Once you do it once with whatever browser you prefer, you'll get used to looking for it. So here's the same data once again. And you can right-click and Copy as cURL.

Another one that I've used before is the CNN election data. Right. So here you actually see it under XHR. Right. And so, you know, here's some results from Virginia (we had elections in November), and you can see all these different hits; every race is a different link. Basically, Attorney General, Lieutenant Governor, and I guess governor. And so it's a similar thing and you can get it. I think this is, like, full data by county. Right. So you could probably download this by county, but this is a nice JSON version. Same deal. Right. You can copy as cURL. You can bring it over to Postman or Insomnia or your favorite thing.

So let's see if I can find it. What are the benefits of using this over going straight to something like Beautiful Soup? Well, one is that if you do it this way, the data is structured, right? You're not trying to, you know, find every TD cell. But fundamentally the main thing is that if you take a link to a website you can read and try to download it directly with Requests or httr or whatever, there won't be anything there, because you don't have JavaScript running to ingest the data and assemble it. So you have to do something like Selenium or whatever, which is essentially running a web browser for you.

So why go from data to HTML back to data when you can just go straight from data to data? That's kind of my philosophy. Again, there's some sites where there is no data stream where the server is sending you the table, in which case you do have to kind of scrape it away.


I did want to point out one more time, Ryan had put it in the chat, that if you are using R, the rvest package did not have this initially, but it is new and exciting: it's the read_html_live() function, which does what Marcos just described Selenium doing. When you run this, it opens a headless browser under the hood for you, pretending it's Chrome (using chromote, I think). And that way it can actively look at something that before you wouldn't be able to, right? You would be trying to get something static, but you were looking at something live. You can now do it in just rvest, which is really exciting. So definitely go check that out.

Cards game JSON and wrapping up

I mean, a fun one that I like, and this is something I do with my friends a lot, is we play a game. We play cards on CardsMania; let's see if this is going to work or not. So I realized a long time ago, this is a game I played with my friends a while back, that there is a hidden JSON structure and a game history. So this ridiculous JSON structure is basically the entire history of this game of Oh Hell that we play. And so I wrote a bunch of code to parse lists, cause this is, like, telling you what order you play the cards in: this is the queen of hearts, this is the king of spades, or whatever. And then I built a whole website around it for my colleagues and me to analyze how well, or not so well, we did when we were playing against each other in cards. So that might've been the start of my whole journey of JSON inspecting, trying to find hidden data structures.

There are lots of questions about, like, YouTube. Could we scrape YouTube? YouTube has its own API. You can just sign up for an API key and go look through stuff. There is also a transcript package for Python, but YouTube is going to figure out that you're not a browser pretty quickly and it's going to block your IP, and you have to, like, use a proxy to get around that. There are instructions for it, but it can be really complicated. I really suggest just going through the legit API.

I have found that the bigger the website, the more stuff is crammed in there. And so sometimes you will have more luck with a smaller, simpler website. But the logging-in thing is not always a barrier. I recommend just trying this: like, I'm interested in this data, it could be data about you, you might need to log in. Try it.

So sometimes you will find an API endpoint and copy it, and it will only show you 50 results. And you're like, that's weird. But if you start fiddling with the request and the payload, you might be able to work around that limit, right? It might have a setting, you know, a max or a limit that you can change.
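Fiddling with those parameters is usually just rewriting the query string. Here's a small helper for that using only the standard library; the URL and the `limit` parameter name are hypothetical, since every API names these differently.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def with_param(url, **overrides):
    """Return `url` with query-string parameters replaced or added --
    handy for experimenting with max/limit-style parameters a hidden
    API might honor. Parameter names here are hypothetical."""
    parts = urlsplit(url)
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    query.update({k: str(v) for k, v in overrides.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))

url = "https://example.com/api/results?state=VA&limit=50"
print(with_param(url, limit=500))
# https://example.com/api/results?state=VA&limit=500
```

Bump the value, resend, and see whether the server honors it or clamps it back down.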

GraphQL is this kind of quasi-SQL-like language that some APIs use. And so instead of there being, like, a JSON payload or a bunch of question marks in the URL, there's this kind of quasi-SQL JSON language you have to write. And I really don't like it.

But if you do hit something paginated (and YouTube's API is a great example; it's, like, 50 per page or whatever), it's going to give you a token that corresponds to the next page. And then when you write your loops or whatever you're doing, you just have it take that token and insert it into the next-page-token field of your next API call. And you can page through it that way. Don't let that stop you. Use ChatGPT to help you navigate the pagination stuff and understand it. That really, really slowed me down when I was first hitting APIs.
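The token-pagination loop described above can be sketched like this. The `nextPageToken` field name is borrowed from the YouTube Data API's style but varies by API, and the three-page fake server lets the sketch run offline.

```python
def fetch_all(fetch_page, first_token=None):
    """Walk a token-paginated API: each response carries that page's
    items plus a next-page token that you feed into the next request,
    until no token comes back."""
    items, token = [], first_token
    while True:
        page = fetch_page(token)
        items.extend(page["items"])
        token = page.get("nextPageToken")
        if not token:
            return items

# Fake three-page API standing in for something like the YouTube API.
PAGES = {
    None: {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3, 4], "nextPageToken": "p3"},
    "p3": {"items": [5]},   # no token -> last page
}

result = fetch_all(lambda token: PAGES[token])
print(result)  # [1, 2, 3, 4, 5]
```

In real use, `fetch_page` would issue the HTTP request with the token as a query parameter; the loop shape stays the same.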