Resources

Web API Updates for R | RStudio Webinar - 2017

This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/webinars/ . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message.

Transcript

This transcript was generated automatically and may contain errors.

Thank you, Anne, for that intro. I appreciate it. As she mentioned, I'm an engineer at RStudio, and this is going to be an expanded version of the talk I gave at RStudio Conf, so I can provide more detail on working with APIs in R. We will begin with an overview, a quick review of the Web API basics, move on to tools for accessing those APIs. I have a couple of examples prepared and a practical application that we have built in-house using these techniques, and at the end, I'll give you some resources for finding out more information if you're interested in expanding your knowledge in this area.

Web API basics

So, API basics. There is a lot of data accessible over the Internet these days, and a lot of it is made available via APIs, which are application programming interfaces. These allow programs to talk to each other based on a communication contract: if you ask me a question in a way I understand, I will give you an answer. There are two important parts to HTTP communication over the Web: the request, which is data sent to the server from the client, and the response, which is the data that the server sends back to that client.

Now, this is a very oversimplified view of the client-server communication, but the basics of it are ask a question and get a response, and this is predicated on the fact that the client and server need to understand each other's language, so to speak. The client can only send a request to the server that the server knows how to handle, and then the server will send that response back to the client, and the client will need to know how to parse that response to get the data out based on the question that was asked.

Web APIs usually provide read and/or write access to data stores, and they allow you to access data on an ongoing basis. This is significant since it allows you to write scripts that poll an API to get the most up-to-date data. Not only will your analyses be current, but when the data changes, your scripts don't have to. Depending on how you design that script, data frames and plots will update automatically when your data changes without you having to download a new CSV or import data manually.

So as an example of this request-response, I've put down here a curl request. curl is a utility in the Unix world that allows you to make HTTP requests. In this case, I'm calling the OMDb API, the Online Movie Database API, and I'm giving it a request with two query parameters, t for title and r for response format. So in this case, I'm saying: OMDb, give me the data for the movie named Clue, and send it back to me in JSON format. The server understands this request because you've put it into the format that it expects, and so it sends a response that contains the data you asked for: the title, the year, the rating, et cetera. Again, this is a very simplistic example, but it gives you an idea of how simple it is to ask a question, receive an answer, and pull that data into the scripts that you're writing every day.
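The same request can be sketched from R using httr, which is introduced later in the talk. Note one assumption: the OMDb API now requires an API key, so the apikey parameter and its placeholder value are not part of the original slide.

```r
library(httr)

# Equivalent of: curl "http://www.omdbapi.com/?t=Clue&r=json"
# OMDb now requires a (free) API key; "YOUR_KEY" is a placeholder.
resp <- GET("http://www.omdbapi.com/",
            query = list(t = "Clue", r = "json", apikey = "YOUR_KEY"))
content(resp, as = "parsed")  # a list with Title, Year, Rated, and so on
```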

Web APIs are organized into resources that are often in a particular hierarchy. In this example, there's a map that if you follow down from the top, you can get information about accounts, a particular account by its ID, the lists that a particular account has, the ID of those lists, the campaigns, the subscribers, the web forms, et cetera. So these are often organized in maps or webs of information, and if you know how to get to those different levels of information, you can construct a URL and ask for very specific things all the way down at the bottom, like clicks and tracked events in this case.
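To make that hierarchy concrete, here is a small sketch of how you might build URLs for deeper and deeper resources. The host and resource names are hypothetical, standing in for the map on the slide:

```r
# Hypothetical API: each level of the hierarchy extends the URL path.
base    <- "https://api.example.com"
account <- paste(base, "accounts", "42", sep = "/")   # one account, by ID
lists   <- paste(account, "lists", sep = "/")         # that account's lists
clicks  <- paste(lists, "7", "clicks", sep = "/")     # clicks for list 7
clicks  # "https://api.example.com/accounts/42/lists/7/clicks"
```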

Now, that presumes that you understand a little bit about the API that you're trying to use. Fortunately, most of them have decent documentation. This example is from the Star Wars API. You can see across the left the kinds of information you can learn from this documentation. There's a getting started guide here on the first page, but the things that you need to understand are the resources towards the bottom, so that you know what you can get and how to get it. For example, this being the Star Wars API, you can get information about people, films, starships, planets, vehicles, et cetera. This particular documentation is very good: it'll tell you exactly what URL to use to get to that information, and it'll give you an example response as well. So it will tell you how to ask the question and what format the answer will be in, so that you can then parse it appropriately in your script.

Tools for accessing APIs in R

So to access these APIs and pull this information into R, you have a couple of different options. In some cases, people have already written packages that wrap those API calls for a given service. The examples I've provided here are aws.s3, for S3 buckets in the Amazon Web Services universe; RGoogleAnalytics, which will get you information about Google Analytics based on the ID code that you have for your website; acs, for American Community Survey data; et cetera. So some people have already wrapped these APIs into packages that are very easy to use. If you can use them, by all means do; they make your life a lot easier. You do need to be aware that the people writing them may not have the same goals as you do: they've written a package to solve their problem and then made it public. So if you have other questions that you need to ask, or if there are particular endpoints that aren't covered in that package, you may need to write that yourself. Or there could be no package for the data source that you're trying to reach, in which case you'll write it yourself with some other tools.

I recommend httr (pronounced "hitter") for making the requests, and then jsonlite or xml2 for parsing the response. The most common response formats for API calls are JSON and XML, and these two packages are extremely good at parsing that data for you. Once you have the data in your script, you can wrangle it the way you normally do, using the various packages in the tidyverse to get your data into an orderly rectangle so that you can use it for your analyses and for your plots.

So httr's request functions wrap the HTTP verbs, and fortunately the names match, so they're very easy to figure out. GET is the one you'll probably be using most commonly if you're interested in getting information from the web into your scripts. POST is to write back to a data store of some kind. PATCH and PUT are for updating. DELETE is fairly obvious. And HEAD is identical to GET but without the body: you'd just be getting the metadata around the request without getting the body data. This can be useful as you're developing, to understand everything around the data itself in a much cheaper call.

So to make a request, you first load httr with the library() function, then call GET() with the URL. In this case I'm using the Star Wars API, calling planets with ID one. So what I'm trying to do here is get information about planet number one in the Star Wars API database. This will give you a response object, and if you print that out, you get some really useful information: the actual URL that was used after any redirects, the HTTP status, the file or content type, and the size of that response.
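In code, that first request looks something like this. The talk predates the Star Wars API moving hosts, so the swapi.dev URL here is an assumption:

```r
library(httr)

# Ask the Star Wars API for planet number one.
resp <- GET("https://swapi.dev/api/planets/1/")
resp  # printing shows the final URL, status, content type, and size
```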

So you can use various helpers in httr to dig into the response object. For example, the status_code() helper will pull out just the code itself. 200 is what you want to see; that means OK, successful. Anything else may be a problem that you would need to handle in your code. Using the headers() helper will get you everything, all the metadata that came through in the header of the response: date, content type, connection, transfer encoding, et cetera. And then what most people are interested in is the body itself. This is the content of the response, and you would use the content() function from httr to pull that out. Here I'm just looking at the structure of it, so you can see that it is a list and the various amounts of data that come back. You'll see some of it is nested, and some of it is URLs, so you're going to need to understand all this in order to handle it appropriately in your scripts.
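Those helpers look like this in practice, again against the planet-one endpoint (host assumed, as above):

```r
library(httr)

resp <- GET("https://swapi.dev/api/planets/1/")

status_code(resp)   # just the numeric HTTP status; 200 means success
headers(resp)       # all response metadata: date, content-type, and so on
str(content(resp))  # the body, parsed into a (partly nested) list
```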

Handling the HTTP response

So there are three main parts that you'll probably be most interested in in your HTTP response. The first is status. You can get the full status information with the http_status() function; it will tell you everything that it possibly can about the status. In most cases, though, you're only interested in the code, and the status_code() helper will come back with just the number, so you can make decisions in your code based on that number. You can automatically throw a warning or raise an error if the request did not succeed, if the status was not 200, with the warn_for_status() and stop_for_status() helpers in the httr package. I highly recommend these. They convert HTTP errors into R errors, making them much easier to handle in your script, and they'll either warn or stop, just as their names imply.
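Here is a sketch of how those status helpers fit into a script. The nonexistent-resource URL is just a convenient way to provoke an error, and the host is assumed:

```r
library(httr)

resp <- GET("https://swapi.dev/api/planets/999999/")  # likely a 404

# stop_for_status() turns an HTTP error into a regular R error,
# which you can then handle with the usual tools.
tryCatch(
  stop_for_status(resp),
  error = function(e) message("Request failed: ", conditionMessage(e))
)

warn_for_status(resp)  # same check, but a warning instead of an error
```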

The second of the components that you'll want to handle are the headers themselves. Date, content type, connection, allow, things like that. There can be custom headers in here. It really depends on the API that you're working with. The one that you'll probably be using most often is the content type. Most APIs have a pretty consistent content type in their response. In this case, they're using JSON. So you may only have to worry about this once. But if you're calling an unfamiliar API or you're calling an API that changes frequently, it may behoove you to take a look at that content type to make sure that what you're parsing in your script matches what you're being sent.

You can also get cookies if you're interested; it really depends on your API and what you're trying to get out of it. But the thing, again, that most people are interested in is the body. You use the content() function in httr to get the response body: you pass it your response object and a modifier. In this case I want a character vector, so I'm going to pass in the word "text", and it'll give me back all of that information that we just saw earlier in the structure output. If you have non-text responses, you can get the raw content with the "raw" modifier. Or you can give it "parsed": httr has some default parsers for common file types, including JSON, XML, and a couple of others. This can be really useful if you want a quick and dirty look at the data, but honestly, I tend to want to do the parsing explicitly, so I don't use this very often. It's there if you want it, though.
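The three modifiers side by side, as a sketch (host assumed):

```r
library(httr)

resp <- GET("https://swapi.dev/api/planets/1/")

content(resp, as = "text")    # the body as a character vector
content(resp, as = "raw")     # the body as raw bytes, for non-text responses
content(resp, as = "parsed")  # httr's built-in parser (JSON here)
```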

Star Wars API examples

So let's turn now to some examples from the Star Wars universe. I'm going to be working with the Star Wars API in these examples, and we're just going to walk through some of the basics of how to deal with APIs in R. The first thing I'm going to do is pull in httr; jsonlite, because I know that the response is going to be in JSON; and magrittr, for piping, to make the code easier to read as I'm passing this data around. And our first goal is to get the data for the planet Alderaan.

So the components of this request in httr are the same as they are in HTTP. You need a verb or a method, in this case GET; a URL endpoint to hit, in this case planets; and a parameter, in this case search. Not all API endpoints need parameters, but here I'm explicitly going to do a search, and in this API the key-value pair is search plus the text that you want to search on. So if I run this, you'll see that the httr method GET is in use. It's very straightforward and very readable, and I can take a look at that in a second. But I also wanted to point out here that there are different ways to do your parameters. If you have many parameters, say six, this URL gets crazy long and hard to read. So there's another format you can use with GET() where you pass in a query list and give it your key-value pairs in that list. We don't need to do that here since I've only got one, but it's worth pointing out, especially if you're looking for very detailed data that takes a lot of parameters in the search query or in the set of parameters you're sending to the API.
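Both parameter styles, sketched against the (assumed) swapi.dev host:

```r
library(httr)

# One parameter inline in the URL...
resp <- GET("https://swapi.dev/api/planets/?search=Alderaan")

# ...or as a query list, which stays readable with many parameters.
resp <- GET("https://swapi.dev/api/planets/",
            query = list(search = "Alderaan"))
```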

So now that I have that response, I'm going to take a look at its names. I've got URL, status code, headers, cookies, the content itself, the request, and things like that. So I'm getting back a lot of information from this API, which is great. I'm going to take a look specifically at the status code: it's a 200. Excellent. That means it succeeded. And then I'm going to take a look at the headers, specifically the content type, and it's JSON, which is what we expect. Again, if you're looking at an API that you don't know very well, these fields can help you quite a bit in terms of deciding how to handle the data that comes back.

Now, if I want the text of the response, I'm going to use the content() function in httr, pass it the object I just created, tell it to give me text, and specify the encoding. httr will default to UTF-8, but it throws a little warning, so I usually just go ahead and put it in there explicitly. If you take a look at the resulting object, it gives you a count, a next, a previous, and the results themselves, and this contains the data that you're interested in for the most part. But it's not very easy to read, and it's not very easy to get that data back out; it's not very R-formatted at the moment. So we're going to need to parse this.
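Pulling the body out as text, with the encoding given explicitly (host assumed, as before):

```r
library(httr)

resp <- GET("https://swapi.dev/api/planets/",
            query = list(search = "Alderaan"))

# Explicit encoding avoids httr's "defaulting to UTF-8" message.
json_text <- content(resp, as = "text", encoding = "UTF-8")
```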

You can do so with httr, as I mentioned earlier, by passing that into the content() function with the "parsed" modifier. I'm going to go ahead and run that and then take a look at it. It's easier to read than the raw text, but it's not necessarily easy to get to some of the information in the results themselves. The count is pretty straightforward, and the results, if you take a look at the structure, are what we would expect; it all looks fine. You've got some nested elements here under residents and films, and those are actual links, which makes it really easy to get that data programmatically. So it's all there, and it's all usable, and it's great. But the code to access this stuff is a little bit ugly. It's not the end of the world, and I've seen worse, but it's not always that easy to figure out how that structure works.

So instead, I use the jsonlite package to parse this. I'm going to take that text content that we had up here, which you can see down here in the console, and pass it to the fromJSON() function in jsonlite. I just ran that code, and then we're going to look at the object. It, again, gives you the count and the next and the previous, and we'll talk about that in a second. But the results are much easier to handle, in my opinion, certainly much easier than the nested list.
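jsonlite's fromJSON() simplifies the results into a data frame by default. Here is a sketch using a trimmed-down literal standing in for the text SWAPI returns, so it runs without a network call:

```r
library(jsonlite)

# A trimmed-down stand-in for the JSON text the Alderaan search returns.
json_text <- '{"count": 1, "next": null, "previous": null,
               "results": [{"name": "Alderaan", "climate": "temperate"}]}'

parsed <- fromJSON(json_text)
parsed$count          # 1
parsed$results        # a data frame, one row per matching planet
parsed$results$name   # "Alderaan"
```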

So now that you've got that parsed JSON content, I'm going to look specifically at the results and get the names. And now it's a standard rectangular object that we can deal with a lot more easily. When you want to pull that data out, this code is a lot easier to work with, and a lot easier to read, in my opinion, than the nested-list version. So I do tend to use jsonlite just to make my life easier.