rvest is a new package that makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with magrittr so that you can express complex operations as elegant pipelines composed of simple, easily understood pieces. Install it with install.packages("rvest").
rvest in action
To see rvest in action, imagine we’d like to scrape some information about The Lego Movie
from IMDB. We start by downloading and parsing the file with html():
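As a hedged sketch of this step, the block below parses a small inline page instead of downloading from IMDB. The markup is made up for illustration, and read_html() is assumed here as the name recent rvest releases use for the parsing step that the text calls html().

```r
library(rvest)

# Made-up markup standing in for the downloaded page (an assumption for
# illustration); a URL string can be passed to read_html() the same way.
page_source <- "<html><head><title>The Lego Movie</title></head><body><p>Everything is awesome</p></body></html>"

# Parse the text into a document that the selector functions accept.
lego_movie <- read_html(page_source)

# Quick check that parsing worked: pull the page title back out.
page_title <- lego_movie %>%
  html_node("title") %>%
  html_text()
```

The parsed document is the starting point for every pipeline that follows.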
To extract the rating, we start with selectorgadget to figure out which CSS selector matches the data we want: strong span. (If you haven’t heard of selectorgadget, make sure to read vignette("selectorgadget"); it’s the easiest way to determine which selector extracts the data that you’re interested in.) We use html_node() to find the first node that matches that selector, extract its contents with html_text(), and convert it to numeric with as.numeric():
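The pipeline described above might look like the following sketch. It runs against made-up inline markup rather than the live page, so the rating value is a placeholder, and read_html() is assumed as the parsing function in recent rvest releases.

```r
library(rvest)

# Made-up page with two nodes matching "strong span" (placeholder values).
page <- read_html("<html><body>
  <div><strong><span>7.8</span></strong></div>
  <div><strong><span>10</span></strong></div>
</body></html>")

rating <- page %>%
  html_node("strong span") %>%  # first node matching the selector
  html_text() %>%               # its text content, as a string
  as.numeric()                  # converted to a number
```

Because html_node() returns only the first match, the second matching node is ignored.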
We use a similar process to extract the cast, using html_nodes() to find all nodes that match the selector:
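A minimal sketch of the multi-node case, again on made-up markup: the id, the class name, and the actor names are hypothetical placeholders and not the selectors used on the real page.

```r
library(rvest)

# Made-up cast list; "#cast" and ".actor" are hypothetical selector targets.
page <- read_html("<html><body>
  <div id='cast'>
    <span class='actor'>Actor A</span>
    <span class='actor'>Actor B</span>
    <span class='actor'>Actor C</span>
  </div>
</body></html>")

cast <- page %>%
  html_nodes("#cast .actor") %>%  # every node matching the selector
  html_text()                     # one string per matched node
```

Here html_text() is vectorised over the matched nodes, so the result is a character vector with one entry per cast member.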
The titles and authors of recent message board postings are stored in the third table on the page. We can use html_nodes() and [[ to find it, then coerce it to a data frame with html_table():
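The table step can be sketched as below. The three tables and their contents are made up for illustration, and the sketch assumes all tables are selected first so that [[ can pick the third one.

```r
library(rvest)

# Made-up page with three tables; only the third holds the postings.
page <- read_html("<html><body>
  <table><tr><td>layout</td></tr></table>
  <table><tr><td>navigation</td></tr></table>
  <table>
    <tr><th>title</th><th>author</th></tr>
    <tr><td>First post</td><td>user_a</td></tr>
    <tr><td>Second post</td><td>user_b</td></tr>
  </table>
</body></html>")

boards <- page %>%
  html_nodes("table") %>%  # all tables on the page
  .[[3]] %>%               # keep the third one
  html_table()             # coerce it to a data frame
```

The header row made of th cells supplies the column names, leaving one data frame row per posting.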
Other important functions
- If you prefer, you can use XPath selectors instead of CSS: html_nodes(doc, xpath = "//table//td").
- Extract the tag names with html_tag(), text with html_text(), a single attribute with html_attr(), or all attributes with html_attrs().
- Detect and repair text encoding problems with guess_encoding() and repair_encoding().
- Navigate around a website as if you’re in a browser with html_session(), jump_to(), follow_link(), back(), and forward(). Extract, modify and submit forms with html_form(), set_values() and submit_form(). (This is still a work in progress, so I’d love your feedback.)
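As a rough illustration of the XPath and attribute helpers listed above, the sketch below works on made-up markup. Two assumptions: read_html() is used for parsing, and html_name() stands in for the tag-name helper, since that is the name recent rvest releases export.

```r
library(rvest)

# Made-up table with two links (hypothetical hrefs and titles).
doc <- read_html("<html><body><table><tr>
  <td><a href='/a' title='First'>A</a></td>
  <td><a href='/b' title='Second'>B</a></td>
</tr></table></body></html>")

# XPath selection instead of CSS.
cells <- html_nodes(doc, xpath = "//table//td")
links <- html_nodes(doc, xpath = "//table//td/a")

cell_text  <- html_text(cells)          # text of each cell
link_urls  <- html_attr(links, "href")  # a single attribute per node
link_attrs <- html_attrs(links)         # all attributes, one named vector per node
tag_names  <- html_name(links)          # tag name of each node
```

html_attr() returns a character vector, while html_attrs() returns a list with one named character vector per node.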
To see these functions in action, check out the package demos with demo(package = "rvest").

