rvest 0.3.0

I’m pleased to announce that rvest 0.3.0 is now available on CRAN. rvest makes it easy to scrape (or harvest) data from HTML web pages, inspired by libraries like Beautiful Soup. It is designed to work with pipes so that you can express complex operations by composing simple pieces. Install it with:

install.packages("rvest")
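
To illustrate the pipe-friendly design, here is a minimal sketch that runs the same extraction twice: once as nested calls and once as a pipeline. The inline HTML fragment and the variable names are made up for this example. It assumes `read_html()` accepts a literal HTML string, which it does in the xml2 versions I am aware of.

```r
library(rvest)

# A tiny inline document, invented for illustration only.
doc <- read_html("<ul><li>apple</li><li>pear</li></ul>")

# Nested calls have to be read from the inside out.
nested <- toupper(html_text(html_nodes(doc, "li")))

# The piped form reads left to right, one simple step at a time.
piped <- doc %>%
  html_nodes("li") %>%
  html_text() %>%
  toupper()

# Both forms produce the same character vector.
identical(nested, piped)
```

Each step in the pipeline takes the result of the previous one as its first argument, so steps can be added or removed without restructuring the whole expression.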

What’s new

The biggest change in this version is that rvest now uses the xml2 package instead of XML. This makes rvest much simpler, eliminates memory leaks, and should improve performance a little.

A number of functions have changed names to improve consistency with other packages: most importantly html() is now read_html(), and html_tag() is now html_name(). The old versions still work, but are deprecated and will be removed in rvest 0.4.0.
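
As a rough sketch of the renamed functions in use, the snippet below parses a small inline document with `read_html()` and lists element names with `html_name()`. The HTML string and the variable names are hypothetical, chosen only to keep the example self-contained.

```r
library(rvest)

# Inline HTML invented for illustration; a URL or file path also works.
page <- read_html(
  "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
)

# read_html() replaces the deprecated html().
# html_name() replaces the deprecated html_tag().
tags <- page %>%
  html_nodes("h1, p") %>%
  html_name()

tags
```

Code written for an earlier release should only need the two function names swapped. The rest of the pipeline stays the same.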

html_node() now throws an error if there are no matches, and a warning if there’s more than one match. I think this should make it more likely to fail clearly when the structure of the page changes. If you don’t want this behaviour, use html_nodes().
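
The sketch below covers only the `html_nodes()` side of that distinction: it returns every match and gives back an empty set when nothing matches, so you can check the length yourself. The inline HTML and variable names are assumptions made for the example. `html_node()` is left out because its response to a missing match depends on the installed rvest version.

```r
library(rvest)

# Inline HTML invented for illustration: two paragraphs and no tables.
page <- read_html("<html><body><p>one</p><p>two</p></body></html>")

# More than one match: html_nodes() returns all of them.
paragraphs <- page %>% html_nodes("p")
length(paragraphs)

# No match: html_nodes() returns an empty set instead of failing.
tables <- page %>% html_nodes("table")
length(tables)

# An explicit check lets you decide how to react when the page changes.
if (length(tables) == 0) {
  message("No <table> found; the page structure may have changed.")
}
```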

There were a number of other bug fixes and minor improvements as described in the release notes.
