
Extracting Data From the Web: Part 2 | RStudio Webinar - 2016
This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/web... . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message.
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Well, as you know, this is the second webinar on how to extract data from the web. This is part two of the series, and it will cover web scraping. If you missed the last webinar, on November 9th, we looked at how to gather data off the web that's provided through an API.
And that webinar has been recorded, and it's already available at www.rstudio.com/resources/webinars, the same place where this webinar will end up itself. If you'd like to review APIs, you can go there and watch that webinar. What we'll cover today is how to scrape data off a webpage. So this is data that exists in the webpage, but isn't necessarily easy to access and certainly hasn't been prepackaged in an API.
As with the previous webinar, this will be an introductory level webinar. So I'll cover some basic ideas behind web scraping, I'll review as quickly as I can the important points of HTML and CSS that we're going to rely on, I'm going to introduce an R package designed to make it easier to scrape data off the web named rvest, and I'll cover a tool called Selector Gadget that works very well with web scraping.
Together these tools form a basic set of tools that you can use to collect data off almost any webpage that you can imagine, especially basic webpages. So they're very useful, but they are somewhat simple. If you already are familiar with HTML and CSS, or have even done web scraping already and you just want to learn more about the rvest package, you might be able to save time by simply going to the rvest Selector Gadget vignette, which is at this address, and you can read that in about five minutes and skip over all the review that I'm going to do.
Now this vignette is not as comprehensive as the vignette I suggested for the httr package in the last webinar; however, I think there's enough information there that you can really get a quick overview of the topic and get started right away.
Web scraping in context
So let's talk about web scraping, and let's put it in context. We learned in the last webinar that many websites that provide data provide it through an API, which is just a systematic way for a piece of software to interact with that website to collect the data. If your data is available through an API, you're in luck because it's very easy to write code that accesses the API and gets the data in an automated way.
Today we're going to look at a website that provides a similar kind of data, but in a very different way. That is IMDb, which stands for the Internet Movie Database, and it provides the same type of information. You have information about movies, such as who starred in which movie, what rating the movie got, what year it came out, and so on. There's all sorts of useful information there, but IMDb does not provide it through an API.
Now in this situation, if I were trying to actually collect movie data, I would first check and see if there was an API, which would make my life easier, and in this case there is, so I would use it. But I'm going to use the IMDb site here as a convenient example of how to scrape web data.
How webpages work
And the way we're going to do this is with a strategy that relies on the nature of webpages. So just to quickly review how webpages work, every website that you visit is stored as a set of instructions on a web server somewhere. You visit it through a web browser on your computer, and the first thing that happens when you try to access that website is your computer sends a request to the server that hosts the website.
When the server gets that request, normally, it sends back an HTML document that contains all the instructions that your web browser needs to build the webpage you're trying to visit. Once your web browser has that document, then it uses those instructions to put together the webpage, and you see it there on your computer.
If you were to look at the HTML document that your web browser receives, it's just text. Text instructions. It's formatted a particular way, but everything that appears on that webpage will be mentioned somewhere here in the instructions, and this is the kernel of our strategy for web scraping.
Let's say we wanted to look at the cast of this movie, so IMDb displays the cast very prominently in this table here. We can see the star was Kristen Bell, and then Idina Menzel, and so on. If we wanted to get that information, we could certainly come here and look at it and write it down ourselves, but that's not efficient at all.
Instead, we could look at the HTML source for the webpage. Now, Chrome makes it very easy to look at the source for a webpage. You don't need to do this to scrape data, but I just want to show you what it looks like. Here's the source for that webpage. It's quite complicated, but if we were trying to find Kristen Bell, we should be able to use our search feature, and we see that every mention of her in the webpage is also somewhere here in this text.
If we wanted to find Kristen Bell in this webpage, or any actor in the webpage, we just need to find the source and extract it from the correct location in the source. That's going to be our basic strategy for web scraping. We're pulling things out of the source of the webpage.
You could look at that source, and it probably occurred to you that it's just one giant character string. You could use regular expressions and character string manipulations to extract pieces of it, and perhaps you'd feel comfortable doing that. But that's not a very efficient way to go about extracting information from an HTML document, because HTML is meant to have a structure. It's built around a structure, and if you understand that structure, you can extract information from the document in a more precise and targeted way.
If there are any beginners out there who are seeing HTML now and starting to get afraid, don't worry. By the end of this webinar, you'll see that you don't have to touch the HTML. We'll use our tools to do that, but it is important to have at least some idea of what context this web scraping activity is occurring in.
HTML structure
Let's look at the structure of HTML very quickly. HTML organizes content by placing tags around the content. Here's an example tag. This piece of HTML would create a link to github.com, and the link would appear in a document as just the word github. If you open this in a web browser, that word would probably be blue or something, so you know that there's a link and you can click on it.
The tag here in this piece of code is a. It stands for anchor; it's for web links. The tag has a name, and it starts right after the less-than sign. This tag also has an attribute. In this case, the attribute is named href, and the attribute has a value, which is, in this case, the address of the website we want to link to, and then the word github itself is the content of the tag.
Within any webpage, the tags are organized in a hierarchy. This webpage here begins with an HTML tag, and then the next level of the hierarchy is a head tag and a body tag, which sort of divides the webpage into two, so we can visualize this as a tree. Under the head tag, we have some other tags, title, links, scripts, and under the body tag, we have some tags. Those tags also have their own sub-tags, for example, p and span here, and beneath p, we have a b tag, and then we have some tables with their own items.
Now if you wanted to find a piece of content in this simple webpage, for example, if you want to find the word "here", all you would need to know is that it occurs inside the b tag, and the b tag's in this location, and then you can search for that b tag and extract this content, and you'd have the word "here". This is sort of what we're going to do when we scrape data out of our webpages.
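To make that concrete, here's a minimal sketch in rvest of the idea just described: parse a small, made-up page (not the actual slide content) and pull the word "here" out of its b tag.

```r
library(rvest)  # read_html() comes via the xml2 package, which rvest loads

# A minimal, hypothetical page with the head/body hierarchy described above
doc <- read_html('
  <html>
    <head><title>Example page</title></head>
    <body>
      <p>Click <b>here</b> to visit <a href="https://github.com">github</a>.</p>
    </body>
  </html>')

# Finding the word "here" means finding the b tag and taking its content
html_text(html_nodes(doc, "b"))  # "here"
```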
If we want to find Kristen Bell's name, she's the lead actress here, we would need to figure out which HTML tag surrounds her name in the code. In this case, since I have the source code up, I'll search for her name, and I found it. Here, she's the star of the show, so her name's mentioned in several places, but I'm most interested in this piece here, because if I can extract Kristen Bell in this table, then I should be able to extract every other actor or actress in the table, so we're looking at the cast table.
I've already searched through this webpage a little bit, and you can see here's the cast, so this is the table, and here's Kristen Bell. This is referring to her picture, which appears in the table. Here's her actual name in the table, and we can see that it's contained in a span tag, so we know which tag surrounds Kristen Bell. But in this particular webpage, there are many, many span tags that are used for things other than Kristen Bell's name.
In fact, if we were to search the webpage, we'd see that there are about 600 span tags, so we're off to a good start narrowing in on the content we want to get, but we're not narrow enough. We don't want to pull out 600 values and then have to search through those manually to get Kristen Bell. And just as a reminder, the reason we're doing all this, trying to find these tags and understand the structure and how our information is embedded in it, is that at the end of the day, we want to largely automate this process. We don't want to do things manually. We want to have our computer save us time by doing it for us.
CSS selectors
Let's look at how we could zoom in or be a little more targeted when we search for Kristen Bell. We're going to use more than just the HTML tag to describe and locate her name in this document, and we're going to do that by taking advantage of another technology that's used with webpages, and that's CSS.
So if you're not familiar with CSS, here's a very brief introduction. If you have an HTML document with those tags you're looking at, you'll be able to create a page that has text and links and paragraphs and stuff, but it's going to look very plain because HTML doesn't really contain all that much.
So if you consider a webpage you might have visited, like shiny.rstudio.com, if you look just at the HTML of the webpage, it would build something that looks like this, a plain white document. And if you've ever tried to load a webpage with a really slow connection, you might have actually seen a webpage that looked like this. Your computer was just trying to help you out and show you what it can before the rest of the information related to the webpage showed up. That information would be CSS.
You combine CSS with an HTML document, you get a styled webpage. So this is how the shiny.rstudio.com webpage looks if you visit it, and you'll notice it has the same components, it has the same text and the same web links and stuff, but they all look different. They have a different style, and it's that style that really makes the visual experience that you want to have when you visit the webpage.
If we understand how CSS works, we can use it to scrape data as well as style webpages. So if you look at an example CSS file, you'd see something that looks a little bit like this. These are sets of instructions. Between the brackets here we have actual pieces of styling.
The other part of the CSS, the part that we'll rely on, are the selectors. So where it says span.num or table.data, this tells your web browser which parts of the HTML document to apply the styling to. For example, that color #FFFFFF is only going to be applied to span tags. We've already seen some span tags. So any span tag in that hypothetical webpage will have a certain color.
But these selectors are a way to specify specific elements in the webpage to then apply styling to. But this idea of specifying a specific element in a webpage is how you're going to extract specific elements as data. We use the same system. So let's look at the basics of this system.
Here is an HTML tag. It's a span tag. Notice this tag has a class and an ID attribute, and it has some content. The content here is the word shiny. Here's our class, here's our ID.
And then here's a selector that we can use in CSS to describe this tag. The word span in the selector will refer to every span tag that exists in your HTML document. And since this is a span tag, it will refer to this as well. So any styling we group under the span tag will be applied here.
But we could also be more specific in the way we apply our styling. We could apply our styling based on class. So this span tag has the class bigname. If we want to create a CSS selector that says find all the elements that have the class bigname, we do that by writing a period and then bigname: .bigname. That period prefix tells your web browser that this selector is a class selector. It matches everything that has the class bigname.
And if we put that period selector together with span, as span.bigname, now we're saying we want to match everything that is both a span element and has the bigname class. So it's even more specific. And then we could also be very, very precise. If we set an ID for a tag, then we can refer to that tag by its ID with a hash. So #shiny would refer to anything that has the ID shiny. But normally when you write a webpage, only one thing will have that ID.
So those selectors we saw there refer to different things, and the way they refer to them is with the prefix that's put in front of them. The prefixes that you can use are: no prefix, which refers to the tag name; a period, which refers to a class name; and a hash, which refers to an ID.
Now will there be a quiz on this later? Yeah, kind of. You don't actually need to remember all this. I'm going to show you a tool that does it for you. But you will need to be familiar with this if you want to understand what's going on.
So which CSS identifiers are associated with Kristen Bell's name? Well, if we go back over here, we see it's a span that has the class itemprop (and itemprop also appears here as the name of another attribute). So we're looking at span and class itemprop, which gives us the selector span.itemprop. That should narrow things down a little bit for us.
In fact, if I do a search for this, you can see now there are 32 elements that have this combination. And since we're looking at a whole table of names, and the next element here is the next name in the table, that might be okay. We're trying, at the end of the day, to collect every name in that table, so this might help us get there.
The Rvest package
Well, this is a webinar on R, and now we're finally going to get to an R package, and R code that you can use with that knowledge you just gained. The package is called rvest. It's an R package that makes it easy to extract info from a webpage, and you can install it straight from CRAN with install.packages("rvest"). This is also going to install some packages that rvest depends on, particularly xml2, which some of the functions I'll be using come from. But when you install rvest, you get that as well.
The basic workflow for working with rvest is actually very simple. You can narrow it down to three functions, but maybe more, depending on what you're trying to find. The first thing you do is you download the HTML document for a webpage with the read_html() function. Then R will have all the information that's sent to your web browser, but now it's inside R where you can start to work on it.
The second thing you do is you extract specific nodes with html_nodes(). So read_html() is going to turn that HTML into XML, which is just another way to organize that information. Sort of like when you save things in R, you normally save them as a list; a list is a way of organizing information. It's easier for rvest to use the XML format.
In XML, each tag is called a node, and html_nodes() will extract the different nodes in your HTML document. And then finally, once you have the specific nodes you want, like the span nodes, for example, you can extract the content of the nodes using one of these helper functions, such as html_text(). Or alternatively, you can extract the name of the tag with html_name(), or the attributes of the tag, such as class equals this and itemprop equals that, with html_attrs(). You can also extract the tags that are children of that tag with html_children(), and you can extract tables in a specific way with html_table().
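The three-step workflow just described can be sketched offline. The snippet below is a hypothetical stand-in for the IMDb cast table (the class and itemprop attributes mimic the page discussed in the webinar); in real use, step 1 would be read_html(url):

```r
library(rvest)

# Step 1: download the HTML (here, an inline stand-in for the real page)
page <- read_html('
  <table class="cast_list">
    <tr><td><span class="itemprop" itemprop="name">Kristen Bell</span></td></tr>
    <tr><td><span class="itemprop" itemprop="name">Idina Menzel</span></td></tr>
  </table>')

# Step 2: extract the nodes that match a CSS selector
cast <- html_nodes(page, "span.itemprop")

# Step 3: pull specific pieces out of those nodes
html_text(cast)   # the contents: "Kristen Bell" "Idina Menzel"
html_name(cast)   # the tag names: "span" "span"
html_attrs(cast)  # the attributes of each tag (class, itemprop)
```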
So I'm going to cover all of these things, and I'm going to do this in RStudio with some live code.
Live demo in RStudio
When we post this webinar material, there will be two scripts that come with it, and the scripts are already right here. The first one is called frozen.r, and I'll cover the second one when we get to it. So this is a script that uses Rvest functions to extract data from IMDb.
To rely on those functions, we need to first run library(rvest), which loads the rvest package. And then the name of the website that we're looking at, the website for Frozen, is right here. This is the URL. So I'm going to save that into R. I will use the read_html() function, and I will give it the URL. So this is actually very simple. You just read in the URL.
I'm going to save that to an object called frozen, and if I were to look at frozen, you can see it has its own print method. It's an XML document, and here we see it's an HTML document, and it has a head node and a body node. This should look familiar. And beneath each of those are other nodes, and what we're looking at is the top of the HTML tree for this document.
Now we figured out that we wanted to extract everything that had the CSS selector span.itemprop. These are all of the tags, or nodes if you will, that are spans and have the itemprop class. And the way we're going to extract those is with the html_nodes() function from rvest. We give it the webpage, which I saved up here as frozen, and we tell it which CSS selector to use to extract nodes. Here's our CSS selector. I'm going to save all those nodes as cast. The cast is what I'm trying to get to.
If I look at cast, I can see here we have many HTML tags or XML nodes. They're all spans. They're all class itemprop, and they all contain a little bit of content. Down here we see Kristen Bell. We see Idina Menzel and Jonathan Groff. These are the actors and actresses of Frozen.
However, we do also see some things here like animation and adventure. These aren't actors and actresses, and if we're just trying to select all the names in the table for this movie and another movie and any movie on this webpage, we probably want to do a little more work to make sure we're not extracting things like adventure and animation. I'll come back to that in a second, but let's finish this third step here.
I've extracted these nodes as cast. If I want to get to the content of the nodes, if I want to drop all the span tags and class attributes, which I probably do, I can use html_text() to extract the content of each node as text. So html_text(cast) now gives me the content of each of those nodes, and in here is the list of all the actors and actresses for this movie, along with some extra stuff that we're going to have to filter out.
If I want just the name of each of these nodes, I can get that with html_name(). They're all span, which is kind of predetermined because I used the span CSS selector up here, but this is how you get the names. If I want the attributes for each of these tags, I can get those with html_attrs(), and here we see some information that might help us narrow things down. Many of these are class itemprop and then have a name value for the itemprop attribute, but some of the other ones are class itemprop and then have a genre value, and further down below we saw a keyword value.
Now I'm pretty sure the ones that have the genre value are things like animation and adventure, which we probably want to drop, and the keyword ones probably have something other than an actor or actress's name. So this is a way we can start to differentiate. Then finally, if we wanted to see if there were any children tags, or nodes, for these things that we've extracted, we can use html_children(). In this case, there aren't any children.
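That differentiation can be done with html_attr() (singular), which pulls one named attribute from each node. Here's a minimal sketch on a hypothetical snippet that mixes the name, genre, and keywords values the way the cast page does:

```r
library(rvest)

# Hypothetical snippet: the same selector matches names, genres, and keywords
page <- read_html('
  <span class="itemprop" itemprop="name">Kristen Bell</span>
  <span class="itemprop" itemprop="genre">Animation</span>
  <span class="itemprop" itemprop="keywords">snowman</span>')

nodes <- html_nodes(page, "span.itemprop")

# Keep only the nodes whose itemprop attribute is "name"
names_only <- nodes[html_attr(nodes, "itemprop") == "name"]
html_text(names_only)  # "Kristen Bell"
```

You could also tighten the CSS selector itself with an attribute selector such as "span[itemprop=name]", which rvest passes straight through to the parser.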
Extracting HTML tables
So just to review, we download the HTML with read_html(). The function read_html() just takes the URL of the webpage you want to download. If you're connected to the internet, and it's a valid URL, R will go fetch the HTML for that webpage. Next, we extract the specific nodes with html_nodes(). You give html_nodes() the object you saved the downloaded website to, and then a CSS selector that helps it select specific nodes from that webpage. Then finally, you use one of the helper functions to extract specific types of content from all the nodes that you just collected.
We did this, and we got this list of content, but we scraped too much information, because we have things here beyond the actors and actresses' names. This is pretty common when you are scraping data from a webpage.
Before I come back and fix that, I want to show you how to use html_table(), because oftentimes, if you're trying to get data from a webpage, it appears in the webpage as a table, and if that's the case, then you can extract that whole table all together into a data frame in R, which is terribly convenient. Let's look at another webpage that has some table information. This is bestplaces.net. It's a website run by Sperling's, and you can look up any city, including where you live, and find out information about your place of residence, or a place you're considering moving to.
I pulled up the webpage for Orlando. I live near Orlando, and Orlando is also the venue for the upcoming RStudio conference, so I thought this would be appropriate. Since the conference is in January, maybe we should look at the weather in Orlando and see what it's like. The average January low is 49 degrees, which actually is pretty low for around here. People here consider that freezing. It doesn't normally get that cold, but you can see that this climate information is laid out in a table.
Let's try to extract this whole table at once into a data frame, and I've written a script that can help us do that. Recall this is the URL of the page that has the table, on bestplaces.net. We have a script. It's three steps. It's running off the page, but here's our URL. The first thing I'm going to do is read that URL in and save it as Orlando.
Next, I'm going to look for any nodes that match the CSS selector table, so anything that's a table I want to extract. I'm going to extract them into tables, using our html_nodes() function to do that. If I look at tables, you see I got some tables in here, and then finally, I'm going to use the html_table() function to extract the tables.
If I extract all these tables (we have four tables here), what I'll get is a list with four tables in it, and you can see that some of the text in that web page was organized as a table. For example, the climate overview: this is a table, even though it might not look like one. This is pretty common, because HTML tables are a way to orient things on the page, even if you don't think of them as data tables. But the second table we collected actually is a data table, and I can drill down into that using list subset notation, so element two of the html_table() result now is this table, and this table is a data frame.
That's how easy it was to read that data into R as a data frame, so now I can save it to its own object and start working with it to do whatever I want. This code here will work for any Sperling's web page that's organized in the same way, so I can create a for loop, or use map, and change the URL each time I run this code, but do the same thing to extract these tables for every location I'm interested in. And that's the power of web scraping: the automation.
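The table step looks like this in miniature. The inline table below is a made-up stand-in for the Sperling's climate table (invented numbers, for illustration only); on the real page you'd start with read_html(url) instead:

```r
library(rvest)

# A small inline table standing in for the Sperling's climate table
page <- read_html('
  <table>
    <tr><th>Month</th><th>AvgLow</th></tr>
    <tr><td>January</td><td>49</td></tr>
    <tr><td>July</td><td>73</td></tr>
  </table>')

tables  <- html_nodes(page, "table")  # every table node in the page
climate <- html_table(tables)[[1]]    # first table, now a data frame

climate[climate$Month == "January", ]  # pull out the January row
```

Because html_table() returns a list with one data frame per table node, list subsetting with [[ ]] is how you drill down to the table you actually want.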
Selector Gadget
That was a short aside about tables, but let's come back to our problem: we haven't really found the best way to collect actors' and actresses' names from IMDb. The way we're going to solve this problem is with a tool called Selector Gadget. Selector Gadget is an actual tool that you download to your web browser, and you use it to zero in on specific parts of a web page. It adds an overlay to your web page that looks like this, and you use it in a way that's GUI-driven, so I'll just show you how to use it.
The best way to acquire this tool is to open R and run vignette("selectorgadget"). That's the vignette for rvest that I suggested people read earlier. You'll want to have rvest loaded when you do this, because the vignette comes in the rvest package.
You get this Selector Gadget vignette, and right here at the top it says: to install Selector Gadget, open this page in your browser, and then drag the following link to your bookmark bar. I know this works for Google Chrome. I imagine there are some browsers out there that you might have trouble with, but Google Chrome is free, so you can always get Google Chrome. And then you take Selector Gadget and drop it into your bookmarks bar. It just so happens I already have it here because I prepped for this webinar.
Let's go to the Frozen IMDb web page, and the way we'd use Selector Gadget is, when we're on the web page, we open this bookmark and it loads Selector Gadget. Selector Gadget's now at the bottom of the page, as you can see, and now when I hover over different elements in the page, they change colors.
The first step is to highlight what you want, and then you just start trimming it back. Snowman is highlighted, and I don't want Snowman, so I'll click that, and Selector Gadget will come up with a way to still collect what I clicked before, but not collect this. So now it's suggesting this CSS selector. And then if I wanted something else (in this case I don't, I think I've narrowed it down), if I also wanted to collect these, then I could click on what I want to collect too, and again it'll modify the selector. So basically, you click highlighted things you don't want, to get rid of them; you click things that aren't highlighted that you do want, to add them; and Selector Gadget will come up with a CSS path for you to use.
And when it gets to this point, there's no single correct CSS selector to use; oftentimes you can get the same thing in multiple ways, so maybe based on how you click things, Selector Gadget gives you this path, and it provides the same information. One thing that you might want to keep in mind is that if you do this activity one day, and you come back six months later and want to redo it, websites do change, so the CSS that you want to use might itself change as well. I recommend in every case that you use Selector Gadget to quickly zero in on the CSS path that gives you what you want.
Recap
The rvest package lets you enact this strategy. You download the data using read_html(); this collects the entire website. You then pull out just the tags that you want, using html_nodes() and CSS selectors. You can use Selector Gadget to get those selectors, but now I'm getting ahead of myself. Finally, once you have the nodes, you extract the content with a helper function, which is normally html_text(). If your node is a table, you can extract the whole table as an R-friendly data frame with html_table(). And then you can use the Selector Gadget tool to find useful selector combinations to get just the information that you want. This is the best way to go about it, because there's not necessarily going to be much rhyme or reason, or intuition to follow, about which CSS selectors best describe what you want.
Q&A
So thank you very much for that introduction to web scraping. We have some time, so I'll try to answer some questions.
How will rvest handle badly formed pages or missing ending tags? I think rvest can handle missing ending tags fairly well, but rvest does expect some cooperation from the page designer. It's relying on the structure of the web page, and if the person who wrote the HTML really screwed that up, then expect surprises and trouble with rvest.
What if the web page is accessed through a username and password? That's a great question, Travis. Yes, you can access such web pages with rvest, and the best way to get into that would probably be to look at the help page for read_html(). Basically, you just need to include extra arguments that include your password and username.
Regarding changing site structures, do you have recommendations or best practices for future-proofing production scripts that rely on scraped data? Well, in this case, I don't. I have seen web pages change, and change fast, especially if you are working with a web page that you're coming back to because the data's being updated on a regular basis. If you're relying on CSS selectors or the structure of the web page, those things may change. The best practice here is, once your script has scraped the data, to check that data to make sure it's what you expected to get, or at least that it's the right type and not an error, or NULL, or NA. Then, if the data is not what you expected, you'll have to raise an error. If the web page has changed, you can't necessarily predict how it's going to change, so the best you can do is notice that it's changed and then go back and fix the process.
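One way to implement that check is to wrap the scrape in a function that validates its own output and stops with an informative error when the page no longer looks as expected. This is a hypothetical helper, not code from the webinar scripts, tested here on an inline snippet:

```r
library(rvest)

# Hypothetical helper: scrape cast names, failing loudly if the page changed
scrape_cast <- function(html) {
  nodes <- html_nodes(html, "span.itemprop")
  cast  <- html_text(nodes[html_attr(nodes, "itemprop") == "name"])

  # Sanity checks: right type, not empty, no NAs. If the site's structure
  # changed, stop with an informative error instead of returning junk.
  if (!is.character(cast) || length(cast) == 0 || anyNA(cast)) {
    stop("Scraped cast data is not what was expected; the page may have changed")
  }
  cast
}

page <- read_html('<span class="itemprop" itemprop="name">Kristen Bell</span>')
scrape_cast(page)  # "Kristen Bell"
```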
Which packages are required again? Just the rvest package, that's r-v-e-s-t, and it sounds like harvest. That does load some other packages like xml2, too, but all you need to do is install rvest.
What are the legal issues with scraping? Well, I'm glad you brought that up, Mirdad. It's very ambiguous, so I cannot provide legal advice on this topic.
Can you use rvest with HTTPS? Yes, and I think you might get into the authentication stuff that way as well.
Do functions like html_nodes() return data as lists? html_nodes() returns data as a specific S3 object. So here, cast was the result of html_nodes(), and you can see it's an XML node set of 32, and you use these helper functions to extract that into a more friendly format. Once we did html_text() on cast, then we just have a character vector. What html_nodes() returns is a slightly exotic object, but it's easier to work with the XML data that's stored in it using that S3 class. However, that class is based on a list, so it does inherit some list methods.
What are the best techniques to navigate through a number of web pages, i.e. to click next? (asked by Robert Allen) Robert, the httr package, which we used in the previous webinar, addresses this a little better than the rvest package does. It includes some functions that help you navigate web pages. I'm not sure if these are still experimental or not. I don't use them myself, so maybe they were experimental.
Also, if you study the URL for the web pages that you're using, you can often discover the structure or the organization of the web page, and then you can create the URLs yourself. I guess that's not foolproof, but it is something I kind of hack around doing.
How would you extract the same type of information across multiple pages? For example, if I wanted the cast for all Disney movies from 2005 to 2015. Once you figure out how to extract the data from one web page, you can use that same sequence of functions to extract it from every web page that has the same structure, and every web page on the same site should have the same structure. You just need to supply the URL for each of those web pages to your script. So if you can get those URLs, you can use a for loop to automate everything that follows.
How you come up with those URLs depends on the web page. If you can figure out the httr way, the automated way, to navigate through links on web pages, which I didn't explain very well at all, and I apologize, then perhaps you can use that. On the other hand, it might be easier to just study the structure of the URLs and change the names at the end of the URL, if that's how it works. But you will need those input URLs, and you'll need to figure out a way to get them somehow. After you do that, everything else you can automate with the three-step sequence we covered.
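Studying the URL structure often comes down to a sprintf() template. Here's a minimal sketch; the title IDs are just examples (in real use you'd need to collect the IDs yourself), and the scraping loop is only sketched in comments because it needs a network connection:

```r
# Example IMDb title IDs -- you'd collect the real set of IDs yourself
ids <- c("tt2294629", "tt1323594")

# Build one URL per movie from a common template
urls <- sprintf("http://www.imdb.com/title/%s/", ids)
urls[1]  # "http://www.imdb.com/title/tt2294629/"

# Then the same three-step scrape, automated over every URL
# (network-dependent, so only sketched here):
# casts <- lapply(urls, function(u) {
#   page <- read_html(u)
#   html_text(html_nodes(page, "span.itemprop"))
# })
```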
Can you extract data points from a graph on a website? (asked by James Brophy) James, I do not think so. I'm not optimistic about that at all, because most graphs are just going to be images on the web page.
I think we have time for one more question, so I'll choose one that I can actually answer. That might be a little more satisfying. Pat Schloss asks, is there a way to get a specific itemprop value, e.g. genre, from our list? This is important. You can see when you download these things that one of the attributes of the tags is itemprop. With your data, it might be something else of relevance, and you might want to get this value instead of the content of the actual tag. The way you can do that is with html_attrs(). When I run that on cast, I do get a list, and for each of the nodes in that list, I get the itemprop value. The name of that value is itemprop, so this is how you would extract that. Thank you all very much for attending the webinar.
