Extracting Data From the Web: Part 1 | RStudio Webinar

Transcript

This transcript was generated automatically and may contain errors.

This is the first in a two-part series of webinars I'm going to present on extracting data from the web. And what we're going to look at today is how to get data that's been already packaged for you in a web API. A lot of times when people want to share their data through the web, they go through the trouble to make an API, which makes it easier for you to get that data.

On November 30th, we'll follow this up with a web scraping webinar, which will show you how to just go out and scrape the data that you want right off the web pages when it's not been prepackaged into an API.

The topic that we're going to cover today touches on something very deep, and that is actually the topic of web APIs and web development. And I'm not going to go so deep into that. That's not my expertise. What I will be teaching you is a package called HTTR, which is an R package that allows you to use HTTP web APIs with R. The HTTR package is actually very simple to use.

So if you are a web developer and you are familiar with HTTP, I suggest that maybe you just look at the vignette, the quick start vignette for HTTR. It's about 1,500 words, and it will probably take you less than 10 minutes to read, which is shorter than this webinar. And I put the link up here in case you're watching this in the future and you think this might be a better way for you to ramp up.

Because what I'm going to do is focus on what HTTP is, what web APIs are. And I'm going to assume that the people who need this webinar are people who have a background like me, where you're a data scientist or statistician, you've identified the web as a great place to collect data, but you don't have that background in APIs or web development. So you don't necessarily have the context to use the HTTR package, which like I said is a fairly easy package, as you'll see when we get to that.

So before we start explaining what web APIs are and whatnot, I do want to acknowledge that these two webinars both come from a tutorial that I taught with Scott Chamberlain and Karthik Ram at the last UseR! conference. Now the webinars are more concise and self-contained, and there are no exercises for you to do. The contents of these webinars we will put on the RStudio webinar GitHub, like we always do. But if you wanted access to the original content, it is available through the rOpenSci GitHub repository, which is the link right here.

So you can then, you know, raise an error or a warning, and HTTR has two functions that can help you do that. warn_for_status() will check the response object to see if the status code was something other than a success code like 200, and if so, it will raise a warning that reports the status message. stop_for_status() does almost the same thing, but actually stops with an error.
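A minimal sketch of the two status helpers follows. To keep it runnable offline, it passes plain numeric status codes in place of a live response, which is an assumption made for illustration; with a real request you would pass the object returned by GET().

```r
library(httr)

# Numeric status codes stand in for live responses so this runs offline;
# with a real request you would pass the object returned by GET() instead.
ok_code <- 200L
missing_code <- 404L

http_error(ok_code)       # FALSE: the request succeeded
http_error(missing_code)  # TRUE: 4xx and 5xx codes count as errors

# warn_for_status() raises a warning and lets the script continue.
warned <- tryCatch(
  {
    warn_for_status(missing_code)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)

# stop_for_status() raises an error, which halts the script unless caught.
stopped <- tryCatch(
  {
    stop_for_status(missing_code)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
```

In a script, stop_for_status() is usually the safer choice right after a request, because later steps never run on a failed response.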

Putting it all together

So that is all the basics of using HTTR. It might have seemed really simple, but really all you are doing is making requests to a server, and then collecting responses from the server. The magic happens with what you do with those requests and those responses.

Here I focus on getting information back from the server, because normally I think the use case for us will be that there's data on the server at a particular URL, and we just want to get it so we can access it and manipulate it. However, you could use this technology to manage a webpage from R: you could post new pages with POST, you could update pages with PUT, and you could delete pages with DELETE. And then there are other HTTP verbs that have an equivalent function in the HTTR package, and you could look at the help page for that package to find them all.
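As a rough illustration of those other verbs, the sketch below wraps POST(), PUT(), and DELETE() in a hypothetical helper. The helper name, its arguments, and the JSON encoding are assumptions; nothing is sent until you call it with the URL of a server that actually accepts these verbs.

```r
library(httr)

# Hypothetical helper, defined but not called here.
manage_page <- function(page_url, new_body, updated_body) {
  # POST creates a new resource from a body (a named list, sent as JSON).
  created <- POST(page_url, body = new_body, encode = "json")
  # PUT replaces the resource with an updated body.
  updated <- PUT(page_url, body = updated_body, encode = "json")
  # DELETE removes the resource.
  removed <- DELETE(page_url)
  list(created = created, updated = updated, removed = removed)
}
```

Each call returns the same kind of response object as GET(), so the status checks work the same way for every verb.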

Live demo: OMDB API

Let's move out of the slides and into RStudio and take a look at how we might use this to do a simple data collection task. I'm going to use a website called OMDB API. If you've ever been to IMDB, the Internet Movie Database, it contains information about movies and when they came out, who the actors were, who the characters were, what the plot was. Well, OMDB contains that same sort of information, but it's an open movie database. It is an API on the web with data about movies that exists for people to connect to to collect that data and then use it.

This is a very nice API. They want to be as helpful as possible, so the front page actually talks about how you can use this API and some of its unique features to get stuff. And then down here, we could just type in Frozen. If we wanted, for example, to collect a movie Frozen, search for it, it finds a movie Frozen and says, look, if you want to request the data I have about the movie Frozen, it's all stored at this URL here. So I'm going to copy that to use. I'm going to go to RStudio.

I have a script here that already has this because I copied it earlier, but that's just the URL I got from OMDB API. I'm loading the HTTR package, I'm saving this URL, and now I'm going to use the GET command to collect the resources at the URL.

Now what I've saved in Frozen is the response I got back from this GET request. If I just print it, the status is 200, that's good. That means this worked. I got some JSON data. That's about one kilobyte of data. If I wanted to look at the content, I could; there's quite a bit in here. This is the sort of stuff that you find on IMDB, that's what OMDB recreates.

Let's save this to details, and if I was particularly interested in the year, I could then just look it up in this R list. It's 2013. It's an R list, so I can use it in R as I like. As long as I know what URLs to look for, or at least the structure of the URLs, I can implement all this in a program, maybe have a for loop or a map function iterate over a set of URLs, collect all their data, extract the year component, and then I could have a set of years for different movies, and so on. This is how you use HTTR to access data that's been provided over a web API.
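The demo steps can be sketched roughly as follows. The wrapper name get_movie, the `t` (title) parameter, and the `apikey` parameter are assumptions about the OMDb API rather than something shown in the transcript, and the sample body is made up so that the extraction step runs without a network connection.

```r
library(httr)
library(jsonlite)

# Hypothetical wrapper around the demo's request; defined but not called here.
# The `t` and `apikey` query parameters are assumptions about the OMDb API.
get_movie <- function(title, apikey) {
  response <- GET("http://www.omdbapi.com/",
                  query = list(t = title, apikey = apikey))
  stop_for_status(response)  # halt if the request failed
  content(response, as = "parsed", type = "application/json")
}

# Made-up sample body, shaped like the demo's JSON, so this part runs offline.
sample_body <- '{"Title": "Frozen", "Year": "2013", "Response": "True"}'

# A JSON object parses into a named R list.
details <- fromJSON(sample_body)
details$Year  # "2013"
```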

Q&A

All right, so that's about half an hour, and that was all the material I've prepared about HTTR, so now let's take some questions. We have some time for that.

Can I give another example? I could if I knew another URL. It really is this simple. I mean, basically I'm just teaching you the syntax for one function and how to interpret the results, and you could apply it to the other functions. Something that might be a little elucidating is if we actually do look at the help for the HTTR package. This opens up a list of all the help pages in this package, and one neat thing about seeing this list is you kind of get a feel for what functions are in the package, because every function has a help page, at least in a well-documented package. Here we could see what's there, and the capitalized function names are the different verbs you can use, you know, PATCH, POST, PUT, RETRY, and I guess VERB is just a generic help page for an arbitrary verb.

Yes, the GET method can be used with a query argument. I don't know if this was in response to a conversation you had, but yes, there are other arguments to all of these functions. So if you want to learn a more complete way of using everything, depending on what your needs are, then the easiest way would be to look up the HTTR package on CRAN. This view that CRAN provides will show you, for any package, all the vignettes that it has. So if you want to get started with HTTR, here's the quick start vignette that goes into it a little more deeply, and then here's some best practices for writing API packages yourself. These were both written by Hadley Wickham, the package's author, and they contain much more information than I plan to go into today.
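As a small sketch of the query argument: the parameter names `t` and `y` below are assumptions about the OMDb API, and modify_url() is used only to show the URL that such a query list produces, without sending a request.

```r
library(httr)

base_url <- "http://www.omdbapi.com/"

# Parameter names `t` (title) and `y` (year) are assumptions about the OMDb API.
params <- list(t = "Frozen", y = "2013")

# modify_url() builds the URL the query list would produce; nothing is sent.
full_url <- modify_url(base_url, query = params)

# Hypothetical helper, not called here: GET() accepts the same list via `query`.
fetch_movie <- function() GET(base_url, query = params)
```

Passing a list through `query` avoids pasting strings together by hand, and the package takes care of URL-encoding the values.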

What if you want data on hundreds of movies at once, asks Steven Spitz. Steven, this would be a case where you have to assemble the GET function components into a for loop, or what I would suggest is a map function from the purrr package, which is a little more efficient. Either way, you have to write a program or a script that will tell R to send a request for every movie, and then tell R what to do with the response that comes back for every movie. That shouldn't take too much time, and it's completely possible to do.
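A minimal sketch of that iteration with purrr follows. To keep it runnable offline, a stand-in function with a small canned lookup replaces the real request; the name fake_fetch and the canned list are illustrative assumptions, and in practice the function would wrap GET() and content().

```r
library(purrr)

titles <- c("Frozen", "Up", "Brave")

# Canned results standing in for parsed API responses (illustration only).
canned <- list(
  Frozen = list(Title = "Frozen", Year = "2013"),
  Up     = list(Title = "Up",     Year = "2009"),
  Brave  = list(Title = "Brave",  Year = "2012")
)

# Stand-in for a real request; in practice this would call GET() and content().
fake_fetch <- function(title) canned[[title]]

# One "request" per title, then pull the Year element out of each result.
movies <- map(titles, fake_fetch)
years <- map_chr(movies, "Year")
```

Swapping fake_fetch for a function that performs the real request leaves the rest of the script unchanged.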

And then Peter's question is similar: how to instruct R to get the full list of movies. Again, you would have to know what the exhaustive list was. If you had that list of movies, then it'd be very easy to iterate over those movies, or if you had some sense of how OMDB structures its URLs, so you could programmatically change the components until you're satisfied you've gone through them all, then you could do that too. OMDB, I haven't worked with it too much, because I haven't done anything with movie database data that I'd want to spend time on, but they might actually, given the care that they've taken to create their database, make it easy to access the entire list.

Can we use this for a website like Yelp? Whether or not you could use this technology for a website will depend on whether the website provides an API for accessing the data. Not every website does, but many websites do. I don't know if Yelp has an API, but if I wanted to find out, I would start by Googling it, and it looks like it does. So yes, you could use this, and then if I was going to do that, I'd read the documentation for the Yelp API. Web APIs that you find will be based on HTTP, the GET, the POST, those verbs, but they might have deeper features that you could take advantage of, and their documentation will tell you how. Normally you take advantage of those features by adding things to the URL that you pass to your request.

How do you learn what parameters a web API will accept? Well, Yelp is a good case for this. When someone creates an API for someone else to use, if they add any features to it, they need to document it themselves. The creators of the API, so the Yelp developers here, have complete control over what they want to put into the API and what they want to make available, and there's no real standard other than using HTTP that they'll feel obliged to stick to. So there's no universal rule. You have to learn about the API you're trying to access.

What package could we use to take the GET input and convert it into a data frame? The input is just function arguments, but I suspect, Jordan, that you're talking about the response that you get back from GET. And the package that I would use to do that is most likely jsonlite. It depends on whether the content of that body actually comes back as JSON, but that would be the most common way for the data to come back. And then with the jsonlite package, you could use fromJSON to turn that right into a list, which you can then make a data frame with as.data.frame. If the content comes back as something different, it's either going to be text, in which case you'll just have to use text manipulation tools (there probably is a package for that, but I haven't worked with it), or it'll be XML, and then you can use the XML package, or even better, the xml2 package, which serves the same role as jsonlite does for JSON. It makes it easy to work with XML data in R by changing the XML data format into R data formats, like lists and data frames.
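A short sketch of that conversion with jsonlite follows. The JSON text here is made up for illustration and stands in for the body of a real response, for example what content(response, as = "text") would return.

```r
library(jsonlite)

# Made-up JSON text standing in for the body of a real response.
body_text <- '[{"Title": "Frozen", "Year": "2013"},
               {"Title": "Up", "Year": "2009"}]'

# A JSON array of records simplifies straight to a data frame.
movies <- fromJSON(body_text)

# A single JSON object becomes a named list, which as.data.frame() converts.
one_movie <- fromJSON('{"Title": "Frozen", "Year": "2013"}')
one_row <- as.data.frame(one_movie, stringsAsFactors = FALSE)
```

Whether fromJSON() returns a list or a data frame depends on the shape of the JSON, so it is worth checking the result with str() before relying on it.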

Do you ever need a special code to access an API, Steven Spitz? Yes, you almost certainly do. Can you use OAuth 2.0 authentication with the HTTR package request? So yes, a lot of APIs do actually want to protect their data, or at least know who you are if you're coming to use the data, so they require some sort of authentication. You can use authentication with the HTTR package. Here we're going beyond my abilities as a statistician, but if you want to set HTTP authentication, I'm in the help page for GET. It has a link right here to authenticate, which is the helper function in HTTR that will help you do that, and then there are some examples down here about how you could go about that. So if you wanted to send a GET request to, for example, httpbin, which we were using earlier, you could send it right to that URL. But if you need to authenticate with a user and a password, you would add authenticate to the request, and it would add your user and password to the HTTP message that it sends.
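A minimal sketch of adding basic authentication to a request follows. The credentials are placeholders, the helper name get_with_login is hypothetical, and the httpbin.org path is assumed to be a test endpoint that accepts exactly those credentials. OAuth 2.0 is a separate, more involved flow that this sketch does not cover.

```r
library(httr)

# authenticate() only builds a configuration object; nothing is sent yet.
# "user" and "passwd" are placeholder credentials.
auth <- authenticate("user", "passwd")

# Hypothetical helper, not called here. The URL is assumed to be a test
# endpoint that accepts the placeholder credentials above.
get_with_login <- function() {
  GET("http://httpbin.org/basic-auth/user/passwd", auth)
}
```

Keeping real credentials out of the script, for instance by reading them with Sys.getenv(), is a common precaution.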

If my job has a firewall, how do I ask IT to let me through so I can get data? If you have a firewall, that's obviously a difficulty. This is a place where I'm going to have to punt. I myself would have to ask IT, and I would be very dependent on what they tell me back. So the only thing I can tell you is that you should ask IT, and they, doing what IT does on a day-to-day basis, should be familiar with how to give you permission to go through a firewall.

How do you make sure you don't overload the web server with requests? Well, if you are automating your work with a map function or a for loop, there will be a lot of requests, and depending on what sort of defensive measures that website has to prevent bots or whatnot, you might have to have the system sleep between each iteration so the requests don't happen one right after another. If you're working with a web API, they are likely expecting lots of requests. This is a way to transfer data. So if you're working with something like OMDB, which is purposely sending data across, you might not run into the overload that you expect. If you're working with a basic website, which just thinks users are visiting it through a web browser, well, if you hit that frequently, yeah, you might trigger some security that says, you know, too many requests in a certain time frame.
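Sleeping between iterations can be sketched as below. The helper name fetch_politely and the one-second default are illustrative assumptions; the appropriate delay depends on the limits that the API documents. The demonstration uses a stand-in fetcher and no delay so that it runs offline.

```r
# Hypothetical helper: `fetch` is any function that takes a URL and returns
# a result, for example a wrapper around httr::GET().
fetch_politely <- function(urls, fetch, delay_seconds = 1) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[[i]])
    # Pause between requests, but not after the last one.
    if (i < length(urls)) Sys.sleep(delay_seconds)
  }
  results
}

# Offline demonstration with a stand-in fetcher and no delay.
demo <- fetch_politely(c("a", "b", "c"), toupper, delay_seconds = 0)
```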

HTTR vs. rvest

John has a very insightful question. How can you contrast HTTR with rvest? So rvest is a package that's designed to help you scrape data off the web, and rvest will be the topic of the November 30th webinar. Now, rvest is really where I have more experience. Like I said a few times, or it's come out a few times, I don't work with web APIs that often, but I do scrape data off the web. For me, I'm mostly interested in just, like, idiosyncratic data that probably isn't shared over an API, but it is on a web page.

So an example of this would be if we didn't go to OMDB, but we went to IMDB, which is the Internet Movie Database. You know, there's all sorts of data here. Let's just pick a movie, The Lego Batman Movie. Down here, you know, there's a table of actors and actresses and what roles they play. There's comments, metadata, trivia. All these things are data that's in this web page. If we wanted to look at the source of the web page, you know, we will see a lot of HTML, but in that HTML, there's the data that we're looking at. It hasn't been put in a place that's easy for us to access, so if we wanted it, we'd basically have to take down this HTML and do character processing, string processing on it to get the data we want back from it. That's what I'm calling web scraping, and that's what the rvest package is designed to do.

It's a down and dirty way to just go in there and take what you want, and probably cause legal trouble if you try to sell it or something, but it gives you full freedom to get anything you see on the internet, and that's what we'll talk about next time. HTTR is a much more mannerly package. It's for data that the people who put it on the internet expected others to come get, and they've organized it in a nice, neat way, and they've arranged it so that when you send them GET requests, you can get back a response with the data in it.

So if you're contemplating web scraping, and I'll say this on November 30th too, the first thing you should always do is check to see if the data's available through some sort of API, because you will save a lot of time if you can use HTTR. With rvest, you get messy data that you really have to clean up. So the difference between HTTR and rvest is the difference between APIs and web scraping.

This is a convenient place to stop. I apologize if I haven't answered some of your questions. My take-home message to you is that this technology is fairly simple at the level we're using it, and you can just build on it from there. If you want to go further, start with the vignettes for HTTR.