
Extracting Data From the Web: Part 2 | RStudio Webinar - 2016
This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/web... . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message.
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Well, as you know, this is the second webinar on how to extract data from the web. This is part two of the series, and it will cover web scraping. If you missed the last webinar, on November 9th, we looked at how to gather data off the web that's provided through an API.
And that webinar has been recorded, and it's already available at www.rstudio.com/resources/webinars, the same place where this webinar will end up itself. If you'd like to review APIs, you can go there and watch that webinar. What we'll cover today is how to scrape data off a webpage. So this is data that exists in the webpage, but isn't necessarily easy to access and certainly hasn't been prepackaged in an API.
As with the previous webinar, this will be an introductory level webinar. So I'll cover some basic ideas behind web scraping, I'll review as quickly as I can the important points of HTML and CSS that we're going to rely on, I'm going to introduce an R package designed to make it easier to scrape data off the web named rvest, and I'll cover a tool called Selector Gadget that works very well with web scraping.
Together these tools form a basic set of tools that you can use to collect data off almost any webpage that you can imagine, especially basic webpages. So they're very useful, but they are somewhat simple. If you already are familiar with HTML and CSS, or have even done web scraping already and you just want to learn more about the rvest package, you might be able to save time by simply going to the rvest Selector Gadget vignette, which is at this address, and you can read that in about five minutes and skip over all the review that I'm going to do.
Now this vignette is not as comprehensive as the vignette I suggested for the httr package in the last webinar; however, I think there's enough information there that you can really get a quick overview of the topic and get started right away.
Web scraping in context
So let's talk about web scraping, and let's put it in context. We learned in the last webinar that many websites that provide data provide it through an API, which is just a systematic way for a piece of software to interact with that website to collect the data. If your data is available through an API, you're in luck because it's very easy to write code that accesses the API and gets the data in an automated way.
Today we're going to look at a website that provides a similar kind of data, but in a very different way. That is IMDb, which stands for the Internet Movie Database, and it provides the same type of information. You have information about movies, such as who starred in which movie, what rating the movie got, what year it came out, and so on. There's all sorts of useful information there, but IMDb does not provide it through an API.
Now in this situation, if I were trying to actually collect movie data, I would first check and see if there was an API, which would make my life easier, and in this case there is, so I would use it. But I'm going to use the IMDb site here as a convenient example of how to scrape web data.
How webpages work
And the way we're going to do this is with a strategy that relies on the nature of webpages. So just to quickly review how webpages work, every website that you visit is stored as a set of instructions on a web server somewhere. You visit it through a web browser on your computer, and the first thing that happens when you try to access that website is your computer sends a request to the server that hosts the website.
When the server gets that request, normally, it sends back an HTML document that contains all the instructions that your web browser needs to build the webpage you're trying to visit. Once your web browser has that document, then it uses those instructions to put together the webpage, and you see it there on your computer.
If you were to look at the HTML document that your web browser receives, it's just text. Text instructions. It's formatted a particular way, but everything that appears on that webpage will be mentioned somewhere here in the instructions, and this is the kernel of our strategy for web scraping.
Let's say we wanted to look at the cast of this movie, so IMDb displays the cast very prominently in this table here. We can see the star was Kristen Bell, and then Idina Menzel, and so on. If we wanted to get that information, we could certainly come here and look at it and write it down ourselves, but that's not efficient at all.
Instead, we could look at the HTML source for the webpage. Now, Chrome makes it very easy to look at the source for a webpage. You don't need to do this to scrape data, but I just want to show you what it looks like. Here's the source for that webpage. It's quite complicated, but if we were trying to find Kristen Bell, we should be able to use our search feature, and we see that every mention of her in the webpage is also somewhere here in this text.
If we wanted to find Kristen Bell in this webpage, or any actor in the webpage, we just need to find the source and extract it from the correct location in the source. That's going to be our basic strategy for web scraping. We're pulling things out of the source of the webpage.
You could look at that source, and it probably occurred to you that it's just one giant character string. You could use regular expressions and character string manipulations to extract pieces of it, and perhaps you'd feel comfortable doing that. But that's not a very efficient way to go about extracting information from an HTML document, because HTML is meant to have a structure. It's built around a structure, and if you understand that structure, you can extract information from the document in a more precise and targeted way.
If there are any beginners out there who are seeing HTML now and starting to get afraid, don't worry. By the end of this webinar, you'll see that you don't have to touch the HTML. We'll use our tools to do that, but it is important to have at least some idea of what context this web scraping activity is occurring in.
HTML structure
Let's look at the structure of HTML very quickly. HTML organizes content by placing tags around the content. Here's an example tag. This piece of HTML would create a link to github.com, and the link would appear in a document as just the word github. If you open this in a web browser, that word would probably be blue or something, so you know that there's a link and you can click on it.
The tag here in this piece of code is a. It stands for anchor; it's for web links. The tag has a name, and it starts right after the less-than sign. This tag also has an attribute. In this case, the attribute is named href, and the attribute has a value, which is, in this case, the address of the website we want to link to, and then the word github itself is the content of the tag.
Within any webpage, the tags are organized in a hierarchy. This webpage here begins with an HTML tag, and then the next level of the hierarchy is a head tag and a body tag, which sort of divides the webpage into two, so we can visualize this as a tree. Under the head tag, we have some other tags, title, links, scripts, and under the body tag, we have some tags. Those tags also have their own sub-tags, for example, p and span here, and beneath p, we have a b tag, and then we have some tables with their own items.
Now if you wanted to find a piece of content in this simple webpage, for example, if you want to find the word "here", all you would need to know is that it occurs inside the b tag, and the b tag's in this location, and then you can search for that b tag and extract this content, and you'd have the word "here". This is sort of what we're going to do when we scrape data out of our webpages.
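To make that concrete, here's a minimal sketch in rvest of the idea just described: parse a small, made-up page (not the actual slide content) and pull the word "here" out of its b tag.

```r
library(rvest)  # read_html() comes via the xml2 package, which rvest loads

# A minimal, hypothetical page with the head/body hierarchy described above
doc <- read_html('
  <html>
    <head><title>Example page</title></head>
    <body>
      <p>Click <b>here</b> to visit <a href="https://github.com">github</a>.</p>
    </body>
  </html>')

# Finding the word "here" means finding the b tag and taking its content
html_text(html_nodes(doc, "b"))  # "here"
```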
If we want to find Kristen Bell's name, she's the lead actress here, we would need to figure out which HTML tag surrounds her name in the code. In this case, since I have the source code up, I'll search for her name, and I found it. Here, she's the star of the show, so her name's mentioned in several places, but I'm most interested in this piece here, because if I can extract Kristen Bell in this table, then I should be able to extract every other actor or actress in the table, so we're looking at the cast table.
I've already searched through this webpage a little bit, and you can see here's the cast, so this is the table, and here's Kristen Bell. This is referring to her picture, which appears in the table. Here's her actual name in the table, and we can see that it's contained in a span tag, so we know which tag surrounds Kristen Bell. But in this particular webpage, there are many, many span tags that are used for things other than Kristen Bell's name.
In fact, if we were to search the webpage, we'd see that there are about 600 span tags, so we're off to a good start narrowing in on the content we want to get, but we're not narrow enough. We don't want to pull out 600 values and then have to search through those manually to get Kristen Bell. And just as a reminder, the reason we're doing all this, trying to find these tags and understand the structure and how our information is embedded in it, is that at the end of the day, we want to largely automate this process. We don't want to do things manually. We want to have our computer save us time by doing it for us.
CSS selectors
Let's look at how we could zoom in or be a little more targeted when we search for Kristen Bell. We're going to use more than just the HTML tag to describe and locate her name in this document, and we're going to do that by taking advantage of another technology that's used with webpages, and that's CSS.
So if you're not familiar with CSS, here's a very brief introduction. If you have an HTML document with those tags you're looking at, you'll be able to create a page that has text and links and paragraphs and stuff, but it's going to look very plain because HTML doesn't really contain all that much.
So if you consider a webpage you might have visited, like shiny.rstudio.com, if you look just at the HTML of the webpage, it would build something that looks like this, a plain white document. And if you've ever tried to load a webpage with a really slow connection, you might have actually seen a webpage that looked like this. Your computer was just trying to help you out and show you what it can before the rest of the information related to the webpage showed up. That information would be CSS.
You combine CSS with an HTML document, you get a styled webpage. So this is how the shiny.rstudio.com webpage looks if you visit it, and you'll notice it has the same components, it has the same text and the same web links and stuff, but they all look different. They have a different style, and it's that style that really makes the visual experience that you want to have when you visit the webpage.
If we understand how CSS works, we can use it to scrape data as well as style webpages. So if you look at an example CSS file, you'd see something that looks a little bit like this. These are sets of instructions. Between the brackets here we have actual pieces of styling.
The other part of the CSS, the part that we'll rely on, are the selectors. So where it says span.num or table.data, this tells your web browser which parts of the HTML document to apply the styling to. For example, that color #FFFFFF is only going to be applied to span tags. We've already seen some span tags. So any span tag in that hypothetical webpage will have a certain color.
But these selectors are a way to specify specific elements in the webpage to then apply styling to. But this idea of specifying a specific element in a webpage is how you're going to extract specific elements as data. We use the same system. So let's look at the basics of this system.
Here is an HTML tag. It's a span tag. Notice this tag has a class and an ID attribute, and it has some content. The content here is the word shiny. Here's our class, here's our ID.
And then here's a selector that we can use in CSS to describe this tag. The word span in the selector will refer to every span tag that exists in your HTML document. And since this is a span tag, it will refer to this as well. So any styling we group under the span tag will be applied here.
But we could also be more specific in the way we apply our styling. We could apply our styling based on class. So this span tag has the class bigname. If we want to create a CSS selector that says find all the elements that have the class bigname, we do that by writing a period and then bigname: .bigname. That period prefix tells your web browser that this selector is a class selector. It matches everything that has the class bigname.
And if we put that period selector together with span, as span.bigname, now we're saying we want to match everything that is both a span element and has the bigname class. So it's even more specific. And then we could also be very, very precise. If we set an ID for a tag, then we can refer to that tag by its ID with a hash. So #shiny would refer to anything that has the ID shiny. But normally when you write a webpage, only one thing will have that ID.
So those selectors we saw there refer to different things, and the way they refer to them is with the prefix that's put in front of them. The prefixes that you can use are: no prefix, which refers to the tag name; a period, which refers to a class name; and a hash, which refers to an ID.
Now will there be a quiz on this later? Yeah, kind of. You don't actually need to remember all this. I'm going to show you a tool that does it for you. But you will need to be familiar with this if you want to understand what's going on.
So which CSS identifiers are associated with Kristen Bell's name? Well, if we go back over here, we see it's a span that has the class itemprop (and itemprop also appears here as the name of another attribute). So we're looking at span and class itemprop, which gives us the selector span.itemprop. That should narrow things down a little bit for us.
In fact, if I do a search for this, you can see now there are 32 elements that have this combination. And since we're looking at a whole table of names, and the next element here is the next name in the table, that might be okay. We're trying, at the end of the day, to collect every name in that table, so this might help us get there.
The Rvest package
Well, this is a webinar on R, and now we're finally going to get to an R package, and R code that you can use with that knowledge you just gained. The package is called rvest. It's an R package that makes it easy to extract info from a webpage, and you can install it straight from CRAN with install.packages("rvest"). This is also going to install some packages that rvest depends on, particularly xml2, which some of the functions I'll be using come from. But when you install rvest, you get that as well.
The basic workflow for working with rvest is actually very simple. You can narrow it down to three functions, but maybe more, depending on what you're trying to find. The first thing you do is you download the HTML document for a webpage with the read_html() function. Then R will have all the information that's sent to your web browser, but now it's inside R where you can start to work on it.
The second thing you do is you extract specific nodes with html_nodes(). So read_html() is going to turn that HTML into XML, which is just another way to organize that information. Sort of like when you save things in R, you normally save them as a list; a list is a way of organizing information. It's easier for rvest to use the XML format.
In XML, each tag is called a node, and html_nodes() will extract the different nodes in your HTML document. And then finally, once you have the specific nodes you want, like the span nodes, for example, you can extract the content of the nodes using one of these helper functions, such as html_text(). Or alternatively, you can extract the name of the tag with html_name(), or the attributes of the tag, such as class equals this and itemprop equals that, with html_attrs(). You can also extract the tags that are children of that tag with html_children(), and you can extract tables in a specific way with html_table().
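The three-step workflow just described can be sketched offline. The snippet below is a hypothetical stand-in for the IMDb cast table (the class and itemprop attributes mimic the page discussed in the webinar); in real use, step 1 would be read_html(url):

```r
library(rvest)

# Step 1: download the HTML (here, an inline stand-in for the real page)
page <- read_html('
  <table class="cast_list">
    <tr><td><span class="itemprop" itemprop="name">Kristen Bell</span></td></tr>
    <tr><td><span class="itemprop" itemprop="name">Idina Menzel</span></td></tr>
  </table>')

# Step 2: extract the nodes that match a CSS selector
cast <- html_nodes(page, "span.itemprop")

# Step 3: pull specific pieces out of those nodes
html_text(cast)   # the contents: "Kristen Bell" "Idina Menzel"
html_name(cast)   # the tag names: "span" "span"
html_attrs(cast)  # the attributes of each tag (class, itemprop)
```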
So I'm going to cover all of these things, and I'm going to do this in RStudio with some live code.
Live demo in RStudio
When we post this webinar material, there will be two scripts that come with it, and the scripts are already right here. The first one is called frozen.r, and I'll cover the second one when we get to it. So this is a script that uses Rvest functions to extract data from IMDb.
To rely on those functions, we need to first run library(rvest), which loads the rvest package. And then the name of the website that we're looking at, the website for Frozen, is right here. This is the URL. So I'm going to save that into R. I will use the read_html() function, and I will give it the URL. So this is actually very simple. You just read in the URL.
I'm going to save that to an object called frozen, and if I were to look at frozen, you can see it has its own print method. It's an XML document, and here we see it's an HTML document, and it has a head node and a body node. This should look familiar. And beneath each of those are other nodes, and what we're looking at is the top of the HTML tree for this document.
Now we figured out that we wanted to extract everything that had the CSS selector span.itemprop. These are all of the tags, or nodes if you will, that are spans and have the itemprop class. And the way we're going to extract those is with the html_nodes() function from rvest. We give it the webpage, which I saved up here as frozen, and we tell it which CSS selector to use to extract nodes. Here's our CSS selector. I'm going to save all those nodes as cast. The cast is what I'm trying to get to.
If I look at cast, I can see here we have many HTML tags or XML nodes. They're all spans. They're all class itemprop, and they all contain a little bit of content. Down here we see Kristen Bell. We see Idina Menzel and Jonathan Groff. These are the actors and actresses of Frozen.
However, we do also see some things here like animation and adventure. These aren't actors and actresses, and if we're just trying to select all the names in the table for this movie and another movie and any movie on this webpage, we probably want to do a little more work to make sure we're not extracting things like adventure and animation. I'll come back to that in a second, but let's finish this third step here.
I've extracted these nodes as cast. If I want to get to the content of the nodes, if I want to drop all the span tags and class attributes, which I probably do, I can use html_text() to extract the content of each node as text. So html_text(cast) now gives me the content of each of those nodes, and in here is the list of all the actors and actresses for this movie, along with some extra stuff that we're going to have to filter out.
If I want just the name of each of these nodes, I can get that with html_name(). They're all span, which is kind of predetermined because I used the span CSS selector up here, but this is how you get the names. If I want the attributes for each of these tags, I can get those with html_attrs(), and here we see some information that might help us narrow things down. Many of these are class itemprop and then have a name value for the itemprop attribute, but some of the other ones are class itemprop and then have a genre value, and further down below we saw a keyword value.
Now I'm pretty sure the ones that have the genre value are things like animation and adventure, which we probably want to drop, and the keyword ones probably have something other than an actor or actress's name. So this is a way we can start to differentiate. Then finally, if we wanted to see if there were any children tags, or nodes, for these things that we've extracted, we can use html_children(). In this case, there aren't any children.
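That differentiation can be done with html_attr() (singular), which pulls one named attribute from each node. Here's a minimal sketch on a hypothetical snippet that mixes the name, genre, and keywords values the way the cast page does:

```r
library(rvest)

# Hypothetical snippet: the same selector matches names, genres, and keywords
page <- read_html('
  <span class="itemprop" itemprop="name">Kristen Bell</span>
  <span class="itemprop" itemprop="genre">Animation</span>
  <span class="itemprop" itemprop="keywords">snowman</span>')

nodes <- html_nodes(page, "span.itemprop")

# Keep only the nodes whose itemprop attribute is "name"
names_only <- nodes[html_attr(nodes, "itemprop") == "name"]
html_text(names_only)  # "Kristen Bell"
```

You could also tighten the CSS selector itself with an attribute selector such as "span[itemprop=name]", which rvest passes straight through to the parser.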
Extracting HTML tables
So just to review, we download the HTML with read_html(). The function read_html() just takes the URL of the webpage you want to download. If you're connected to the internet, and it's a valid URL, R will go fetch the HTML for that webpage. Next, we extract the specific nodes with html_nodes(). You give html_nodes() the object you saved the downloaded website to, and then a CSS selector that helps it select specific nodes from that webpage. Then finally, you use one of the helper functions to extract specific types of content from all the nodes that you just collected.
We did this, and we got this list of content, but we scraped too much information, because we have things here beyond the actors and actresses' names. This is pretty common when you are scraping data from a webpage.
Before I come back and fix that, I want to show you how to use html_table(), because oftentimes, if you're trying to get data from a webpage, it appears in the webpage as a table, and if that's the case, then you can extract that whole table all together into a data frame in R, which is terribly convenient. Let's look at another webpage that has some table information. This is bestplaces.net. It's a website run by Sperling's, and you can look up any city, including where you live, and find out information about your place of residence, or a place you're considering moving to.
I pulled up the webpage for Orlando. I live near Orlando, and Orlando is also the venue for the upcoming RStudio conference, so I thought this would be appropriate. Since the conference is in January, maybe we should look at the weather in Orlando and see what it's like. The average January low is 49 degrees, which actually is pretty low for around here. People here consider that freezing. It doesn't normally get that cold, but you can see that this climate information is laid out in a table.
Let's try to extract this whole table at once into a data frame, and I've written a script that can help us do that. Recall this is the URL of the page that has the table, on bestplaces.net. We have a script. It's three steps. It's running off the page, but here's our URL. The first thing I'm going to do is read that URL in and save it as Orlando.
Next, I'm going to look for any nodes that match the CSS selector table, so anything that's a table I want to extract. I'm going to extract them into tables, using our html_nodes() function to do that. If I look at tables, you see I got some tables in here, and then finally, I'm going to use the html_table() function to extract the tables.
If I extract all these tables (we have four tables here), what I'll get is a list with four tables in it, and you can see that some of the text in that web page was organized as a table. For example, the climate overview: this is a table, even though it might not look like one. This is pretty common, because HTML tables are a way to orient things on the page, even if you don't think of them as data tables. But the second table we collected actually is a data table, and I can drill down into that using list subset notation, so element two of the html_table() result now is this table, and this table is a data frame.
That's how easy it was to read that data into R as a data frame, so now I can save it to its own object and start working with it to do whatever I want. This code here will work for any Sperling's web page that's organized in the same way, so I can create a for loop, or use map, and change the URL each time I run this code, but do the same thing to extract these tables for every location I'm interested in. And that's the power of web scraping: the automation.
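The table step looks like this in miniature. The inline table below is a made-up stand-in for the Sperling's climate table (invented numbers, for illustration only); on the real page you'd start with read_html(url) instead:

```r
library(rvest)

# A small inline table standing in for the Sperling's climate table
page <- read_html('
  <table>
    <tr><th>Month</th><th>AvgLow</th></tr>
    <tr><td>January</td><td>49</td></tr>
    <tr><td>July</td><td>73</td></tr>
  </table>')

tables  <- html_nodes(page, "table")  # every table node in the page
climate <- html_table(tables)[[1]]    # first table, now a data frame

climate[climate$Month == "January", ]  # pull out the January row
```

Because html_table() returns a list with one data frame per table node, list subsetting with [[ ]] is how you drill down to the table you actually want.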
Selector Gadget
That was a short aside about tables, but let's come back to our problem: we haven't really found the best way to collect actors' and actresses' names from IMDb. The way we're going to solve this problem is with a tool called Selector Gadget. Selector Gadget is an actual tool that you download to your web browser, and you use it to zero in on specific parts of a web page. It adds an overlay to your web page that looks like this, and you use it in a way that's GUI-driven, so I'll just show you how to use it.
The best way to acquire this tool is to open R and run vignette("selectorgadget"). That's the vignette for rvest that I suggested people read earlier. You'll want to have rvest loaded when you do this, because the vignette comes in the rvest package.
You get this Selector Gadget vignette, and right here at the top it says: to install Selector Gadget, open this page in your browser, and then drag the following link to your bookmark bar. I know this works for Google Chrome. I imagine there are some browsers out there that you might have trouble with, but Google Chrome is free, so you can always get Google Chrome. And then you take Selector Gadget and drop it into your bookmarks bar. It just so happens I already have it here because I prepped for this webinar.
Let's go to the Frozen IMDb web page, and the way we'd use Selector Gadget is, when we're on the web page, we open this bookmark and it loads Selector Gadget. Selector Gadget's now at the bottom of the page, as you can see, and now when I hover over different elements in the page, they change colors.
The first step is to highlight what you want, and then you just start trimming it back. Snowman is highlighted, and I don't want Snowman, so I'll click that, and Selector Gadget will come up with a way to still collect what I clicked before, but not collect this. So now it's suggesting this CSS selector. And then if I wanted something else (in this case I don't, I think I've narrowed it down), if I also wanted to collect these, then I could click on what I want to collect too, and again it'll modify the selector. So basically, you click highlighted things you don't want, to get rid of them; you click things that aren't highlighted that you do want, to add them; and Selector Gadget will come up with a CSS path for you to use.
And when it gets to this point, there's no single correct CSS selector to use; oftentimes you can get the same thing in multiple ways, so maybe based on how you click things, Selector Gadget gives you this path, and it provides the same information. One thing that you might want to keep in mind is that if you do this activity one day, and you come back six months later and want to redo it, websites do change, so the CSS that you want to use might itself change as well. I recommend in every case that you use Selector Gadget to quickly zero in on the CSS path that gives you what you want.
Recap
The rvest package lets you enact this strategy. You download the data using read_html(); this collects the entire website. You then pull out just the tags that you want, using html_nodes() and CSS selectors. You can use Selector Gadget to get those selectors, but now I'm getting ahead of myself. Finally, once you have the nodes, you extract the content with a helper function, which is normally html_text(). If your node is a table, you can extract the whole table as an R-friendly data frame with html_table(). And then you can use the Selector Gadget tool to find useful selector combinations to get just the information that you want. This is the best way to go about it, because there's not necessarily going to be much rhyme or reason, or intuition to follow, about which CSS selectors best describe what you want.
Q&A
So thank you very much for that introduction to web scraping. We have some time, so I'll try to answer some questions.
How will rvest handle badly formed pages or missing ending tags? I think rvest can handle missing ending tags fairly well, but rvest does expect some cooperation from the page designer. It's relying on the structure of the web page, and if the person who wrote the HTML really screwed that up, then expect surprises and trouble with rvest.
What if the web page is accessed through a username and password? That's a great question, Travis. Yes, you can access such web pages with rvest, and the best way to get into that would probably be to look at the help page for read_html(). Basically, you just need to include extra arguments that include your password and username.
Regarding changing site structures, do you have recommendations or best practices for future-proofing production scripts that rely on scraped data? Well, in this case, I don't. I have seen web pages change, and change fast, especially if you are working with a web page that you're coming back to because the data's being updated on a regular basis. If you're relying on CSS selectors or the structure of the web page, those things may change. The best practice here is, once your script has scraped the data, to check that data to make sure it's what you expected to get, or at least that it's the right type and not an error, or NULL, or NA. Then, if the data is not what you expected, you'll have to raise an error. If the web page has changed, you can't necessarily predict how it's going to change, so the best you can do is notice that it's changed and then go back and fix the process.
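One way to implement that check is to wrap the scrape in a function that validates its own output and stops with an informative error when the page no longer looks as expected. This is a hypothetical helper, not code from the webinar scripts, tested here on an inline snippet:

```r
library(rvest)

# Hypothetical helper: scrape cast names, failing loudly if the page changed
scrape_cast <- function(html) {
  nodes <- html_nodes(html, "span.itemprop")
  cast  <- html_text(nodes[html_attr(nodes, "itemprop") == "name"])

  # Sanity checks: right type, not empty, no NAs. If the site's structure
  # changed, stop with an informative error instead of returning junk.
  if (!is.character(cast) || length(cast) == 0 || anyNA(cast)) {
    stop("Scraped cast data is not what was expected; the page may have changed")
  }
  cast
}

page <- read_html('<span class="itemprop" itemprop="name">Kristen Bell</span>')
scrape_cast(page)  # "Kristen Bell"
```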
Which packages are required again? Just the rvest package, that's r-v-e-s-t, and it sounds like harvest. That does load some other packages like xml2, too, but all you need to do is install rvest.
What are the legal issues with scraping? Well, I'm glad you brought that up, Mirdad. It's very ambiguous, so I cannot provide legal advice on this topic.
Can you use rvest with HTTPS? Yes, and I think you might get into the authentication stuff that way as well.
Do functions like html_nodes() return data as lists? html_nodes() returns data as a specific S3 object. So here, cast was the result of html_nodes(), and you can see it's an XML node set of 32, and you use these helper functions to extract that into a more friendly format. Once we did html_text() on cast, then we just have a character vector. What html_nodes() returns is a slightly exotic object, but it's easier to work with the XML data that's stored in it using that S3 class. However, that class is based on a list, so it does inherit some list methods.
What are the best techniques to navigate through a number of web pages, i.e. to click next? (asked by Robert Allen) Robert, the httr package, which we used in the previous webinar, addresses this a little better than the rvest package does. It includes some functions that help you navigate web pages. I'm not sure if these are still experimental or not. I don't use them myself, so maybe they were experimental.
Also, if you study the URL for the web pages that you're using, you can often discover the structure or the organization of the web page, and then you can create the URLs yourself. I guess that's not foolproof, but it is something I kind of hack around doing.
How would you extract the same type of information across multiple pages? For example, if I wanted the cast for all Disney movies from 2005 to 2015. Once you figure out how to extract the data from one web page, you can use that same sequence of functions to extract it from every web page that has the same structure, and every web page on the same site should have the same structure. You just need to supply the URL for each of those web pages to your script. So if you can get those URLs, you can use a for loop to automate everything that follows.
How you come up with those URLs depends on the web page. If you can figure out the httr way, the automated way, to navigate through links on web pages, which I didn't explain very well at all, and I apologize, then perhaps you can use that. On the other hand, it might be easier to just study the structure of the URLs and change the names at the end of the URL, if that's how it works. But you will need those input URLs, and you'll need to figure out a way to get them somehow. After you do that, everything else you can automate with the three-step sequence we covered.
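Studying the URL structure often comes down to a sprintf() template. Here's a minimal sketch; the title IDs are just examples (in real use you'd need to collect the IDs yourself), and the scraping loop is only sketched in comments because it needs a network connection:

```r
# Example IMDb title IDs -- you'd collect the real set of IDs yourself
ids <- c("tt2294629", "tt1323594")

# Build one URL per movie from a common template
urls <- sprintf("http://www.imdb.com/title/%s/", ids)
urls[1]  # "http://www.imdb.com/title/tt2294629/"

# Then the same three-step scrape, automated over every URL
# (network-dependent, so only sketched here):
# casts <- lapply(urls, function(u) {
#   page <- read_html(u)
#   html_text(html_nodes(page, "span.itemprop"))
# })
```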
Can you extract data points from a graph on a website? (asked by James Brophy) James, I do not think so. I'm not optimistic about that at all, because most graphs are just going to be images on the web page.
I think we have time for one more question, so I'll choose one that I can actually answer. That might be a little more satisfying. Pat Schloss asks, is there a way to get a specific itemprop value, e.g. genre, from our list? This is important. You can see when you download these things that one of the attributes of the tags is itemprop. With your data, it might be something else of relevance, and you might want to get this value instead of the content of the actual tag. The way you can do that is with html_attrs(). When I run that on cast, I do get a list, and for each of the nodes in that list, I get the itemprop value. The name of that value is itemprop, so this is how you would extract that. Thank you all very much for attending the webinar.
