I’m excited to announce that vroom 1.0.0 is now available on CRAN!
vroom reads rectangular data, such as comma-separated (csv), tab-separated (tsv) or fixed-width files (fwf), into R. It fills a similar role to functions like readr::read_csv(), data.table::fread() or read.csv(), but for many datasets vroom::vroom() can read them much, much faster (hence the name).
The main reason vroom can be faster is that character data is read from the file lazily; you only pay for the data you use. This lazy access is done automatically, so no changes to your R data-manipulation code are needed.
vroom also provides efficient, multi-threaded writing that is multiple times
faster on most inputs than the readr::write_*() functions.
Install vroom with:
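Since the release is on CRAN, the standard installation call applies:

```r
# Install the released version of vroom from CRAN.
install.packages("vroom")
```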
The best way to get acquainted with the package is the getting started vignette.
vroom vs readr#
What does the release of vroom mean for readr? For now we plan to let the two packages evolve separately, but we will likely unite them in the future. One disadvantage of vroom’s lazy reading is that certain data problems can’t be reported up front, so how best to unify the packages requires some thought.
Reading delimited files#
Compared to readr, the first difference you may notice is that you use only one function to read the files, vroom(). This is because vroom() guesses the delimiter of the file automatically based on the first few lines (this feature is inspired by a similar feature in data.table::fread()). This works well most of the time, but may fail to guess properly in some cases. The delim argument can be used to specify the delimiter of the file explicitly.
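As a minimal sketch of both modes, the example below writes a small tab-separated file to a temporary location (the data and variable names are made up for illustration), reads it once with delimiter guessing and once with delim set explicitly:

```r
library(vroom)

# Illustrative data: a small tab-separated file in a temporary location.
path <- tempfile(fileext = ".tsv")
writeLines(c("model\tmpg\tcyl", "RX4\t21.0\t6", "Datsun\t22.8\t4"), path)

# Let vroom() guess the delimiter from the first few lines.
guessed <- vroom(path)

# Or name the delimiter explicitly when guessing might go wrong.
explicit <- vroom(path, delim = "\t")
```

Both calls should return the same three columns here; the explicit form is mainly useful for files whose first lines are ambiguous.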
The summary message after reading also differs from readr. We hope this output gives a more informative indication as to whether the types of your columns are being guessed properly. However, you can still retrieve and print the full column specification with spec().
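For instance, using the mtcars.csv example file bundled with vroom (the variable names here are arbitrary), the specification can be pulled out of the result like this:

```r
library(vroom)

# mtcars.csv is an example file shipped with the vroom package.
mt <- vroom(vroom_example("mtcars.csv"))

# Retrieve the full column specification that was used for the read.
mt_spec <- spec(mt)
print(mt_spec)
```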
The message will be disabled if you supply a column specification to col_types when reading.
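A small sketch of this, assuming a made-up three-column file and the compact one-letter-per-column string form of col_types:

```r
library(vroom)

# Illustrative data written to a temporary CSV file.
path <- tempfile(fileext = ".csv")
writeLines(c("model,mpg,cyl", "RX4,21.0,6", "Datsun,22.8,4"), path)

# One letter per column: character, double, integer.
# Because the types are fully specified, nothing needs to be guessed.
mt <- vroom(path, delim = ",", col_types = "cdi")
```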
Reading multiple files#
One feature new to vroom is built-in support for reading sets of files with the
same columns into one table. Just pass the filenames to be read directly to
vroom(). Imagine we have a directory of files containing the flights data, where
each file corresponds to a single airline.
Then, we can efficiently read all of the files into one tibble by passing a
vector of the filenames directly to vroom().
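The sketch below imitates that layout with two tiny per-airline files in a temporary directory (the file names and values are invented, not the real flights data) and reads them in a single call:

```r
library(vroom)

# Create a temporary directory holding one small file per airline.
dir <- tempfile("flights_")
dir.create(dir)
writeLines(c("carrier,flight", "AA,1", "AA,2"), file.path(dir, "AA.csv"))
writeLines(c("carrier,flight", "UA,10"), file.path(dir, "UA.csv"))

# Collect the file paths and pass the whole vector to vroom().
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
flights <- vroom(files, delim = ",")
```

The rows of all files end up in one tibble, in the order the files were given.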
Reading and writing compressed files#
Just like readr, vroom automatically reads and writes zip, gzip, bz2 and xz compressed files with the standard file extensions.
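A short round trip illustrates this, assuming a gzip target in a temporary location and the built-in mtcars data frame as input:

```r
library(vroom)

# The .gz extension tells vroom_write() to gzip the output.
gz_path <- tempfile(fileext = ".tsv.gz")
vroom_write(mtcars, gz_path)

# vroom() decompresses on the fly based on the same extension.
mt <- vroom(gz_path)
```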
Reading remote files#
vroom can also read files from the internet if you pass the URL of the file to vroom().
It can even read gzipped files from the internet (although currently not the other compressed formats).
Reading and writing from pipe connections#
vroom provides efficient input and output from pipe() connections.
This is useful for doing things like pre-filtering large inputs with command line tools like grep.
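As a rough sketch, assuming a Unix-like system with grep on the PATH and a made-up flights file, the filtered stream can be read straight from the pipe. Because grep also drops the header line, the column names are supplied by hand:

```r
library(vroom)

# Illustrative tab-separated input in a temporary file.
path <- tempfile(fileext = ".tsv")
writeLines(
  c("carrier\tflight", "AA\t1", "UA\t10", "AA\t2", "UA\t20"),
  path
)

# grep keeps only lines containing the word UA; the header is filtered
# out as well, so col_names is given explicitly.
cmd <- paste("grep -w UA", shQuote(path))
ua <- vroom(pipe(cmd), delim = "\t", col_names = c("carrier", "flight"))
```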
Or using multi-threaded compression programs like pigz, which can greatly reduce the time to write compressed files.
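A hedged sketch of the writing side: pigz is an external program that may not be installed, so the example only attempts the write when it is found on the PATH. The output location is a temporary file chosen for illustration:

```r
library(vroom)

out <- tempfile(fileext = ".tsv.gz")

# pigz is an external tool; skip the write if it is not available.
if (nzchar(Sys.which("pigz"))) {
  # Stream the table through pigz, which compresses using several threads.
  vroom_write(mtcars, pipe(paste("pigz >", shQuote(out))))
}
```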
Column selection#
vroom introduces a new argument, col_select, which makes selecting columns to
keep (or omit) more straightforward.
col_select uses the same interface as dplyr::select(), so you can do flexible selection operations.
- Select with the column names
- Drop columns by name
- Use the selection helpers
- Or rename columns
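The four selection styles listed above can be sketched on a small made-up file (the column names model, mpg, cyl and hp are chosen only for illustration):

```r
library(vroom)

# Illustrative data written to a temporary CSV file.
path <- tempfile(fileext = ".csv")
writeLines(c("model,mpg,cyl,hp", "RX4,21.0,6,110", "Datsun,22.8,4,93"), path)

# Select with the column names.
by_name <- vroom(path, delim = ",", col_select = c(model, cyl))

# Drop columns by name.
dropped <- vroom(path, delim = ",", col_select = c(-hp))

# Use the selection helpers.
helper <- vroom(path, delim = ",", col_select = starts_with("m"))

# Or rename columns while keeping everything else.
renamed <- vroom(path, delim = ",", col_select = list(car = model, everything()))
```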
Name repair#
Often the names of columns in the original dataset are not ideal to work with.
vroom() uses the same .name_repair argument as tibble, so you can use one of the default name repair strategies or provide a custom function. A great approach is to use the janitor::make_clean_names() function as the input.
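The sketch below shows the three options on a made-up file with awkward headers. The to_snake() helper is a hypothetical stand-in written for this example, and the janitor call only runs if that package happens to be installed:

```r
library(vroom)

# Illustrative file whose column names contain spaces and capitals.
path <- tempfile(fileext = ".csv")
writeLines(c("Model Name,Miles Per Gallon", "RX4,21.0", "Datsun,22.8"), path)

# A built-in strategy: make the names syntactic.
universal <- vroom(path, delim = ",", .name_repair = "universal")

# A custom function (hypothetical helper): lower-case the names and
# turn runs of non-alphanumeric characters into underscores.
to_snake <- function(x) tolower(gsub("[^A-Za-z0-9]+", "_", x))
snake <- vroom(path, delim = ",", .name_repair = to_snake)

# With janitor installed, its cleaner can be passed the same way.
if (requireNamespace("janitor", quietly = TRUE)) {
  clean <- vroom(path, delim = ",", .name_repair = janitor::make_clean_names)
}
```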
Column types#
Like readr, vroom guesses the data types of columns as they are read. Whereas readr simply used the first n rows of data, vroom uses an improved heuristic that looks at data throughout the file, which should improve guessing accuracy. However, if the guessing fails, it can be necessary to change the type of one or more columns.
The available specifications are (with single-letter abbreviations in quotes):
- col_logical() ‘l’, containing only T, F, TRUE, FALSE, 1 or 0.
- col_integer() ‘i’, integer values.
- col_double() ‘d’, floating point values.
- col_number() ‘n’, numbers containing the grouping_mark.
- col_date(format = "") ‘D’, with the locale’s date_format.
- col_time(format = "") ‘t’, with the locale’s time_format.
- col_datetime(format = "") ‘T’, ISO8601 date times.
- col_factor(levels, ordered) ‘f’, a fixed set of values.
- col_character() ‘c’, everything else.
- col_skip() ‘_’ or ‘-’, don’t import this column.
- col_guess() ‘?’, parse using the “best” type based on the input.
You can tell vroom what column types to use with the col_types argument in a number of ways.
If you only need to override a single column, the most concise way is to use a named vector.
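For example, with the mtcars.csv file bundled with vroom, a single column can be overridden while the rest are still guessed:

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Read only the hp column as an integer; other columns are guessed.
mt <- vroom(file, col_types = c(hp = "i"))
```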
However, you can also use the col_*() functions in a list.
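A sketch of the list form on the same bundled mtcars.csv file, overriding one column, skipping another and reading a third as a factor:

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Named list of col_*() specifications; unnamed columns are guessed.
mt <- vroom(
  file,
  col_types = list(
    hp = col_integer(),   # read as integer
    cyl = col_skip(),     # do not import this column
    gear = col_factor()   # factor with levels taken from the data
  )
)
```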
This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.
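As an illustration, the gear column of the bundled mtcars.csv file only takes the values 3, 4 and 5, so its levels can be stated up front:

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Declare the full set of factor levels instead of inferring them.
mt <- vroom(
  file,
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
```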
Speed#
vroom is fast, but how fast? We benchmarked vroom using a real-world dataset of taxi-trip data, with 14.8 million rows and 11 columns. It contains a mix of numeric and text data, and has a total file size of 1.55 GB.
#> Observations: 14,776,615
#> Variables: 11
#> $ medallion <chr> "89D227B655E5C82AECF13C3F540D4CF4", "0BD7C8F5B...
#> $ hack_license <chr> "BA96DE419E711691B9445D6A6307C170", "9FD8F69F0...
#> $ vendor_id <chr> "CMT", "CMT", "CMT", "CMT", "CMT", "CMT", "CMT...
#> $ pickup_datetime <chr> "2013-01-01 15:11:48", "2013-01-06 00:18:35", ...
#> $ payment_type <chr> "CSH", "CSH", "CSH", "CSH", "CSH", "CSH", "CSH...
#> $ fare_amount <dbl> 6.5, 6.0, 5.5, 5.0, 9.5, 9.5, 6.0, 34.0, 5.5, ...
#> $ surcharge <dbl> 0.0, 0.5, 1.0, 0.5, 0.5, 0.0, 0.0, 0.0, 1.0, 0...
#> $ mta_tax <dbl> 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0...
#> $ tip_amount <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ tolls_amount <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.8, 0.0, 0...
#> $ total_amount <dbl> 7.0, 7.0, 7.0, 6.0, 10.5, 10.0, 6.5, 39.3, 7.0...
We performed a series of simple manipulations with each approach.
- Reading the data
- print()
- head()
- tail()
- Sampling 100 random rows
- Filtering for the “UNK” payment type, which matches 6434 rows (0.0435% of total).
- Summarizing the mean fare amount per payment type.
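To make the steps concrete, here is a rough sketch of the same sequence on a tiny made-up stand-in for the taxi data. It is not the benchmark script: the values are invented, and base R subsetting and tapply() are used here only to keep the example free of extra dependencies.

```r
library(vroom)

# Tiny stand-in for the taxi data (made-up values, not the benchmark file).
trips <- data.frame(
  payment_type = c("CSH", "CRD", "UNK", "CSH"),
  fare_amount = c(6.5, 10, 5, 7.5),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".tsv")
vroom_write(trips, path)

x <- vroom(path, delim = "\t")   # reading the data
print(x)
head(x)
tail(x)

set.seed(1)
smp <- x[sample(nrow(x), 2), ]   # sampling rows (100 in the benchmark)

# Filtering touches the lazily read character column for the first time.
unk <- x[x$payment_type == "UNK", ]

# Mean fare amount per payment type.
mean_fare <- tapply(x$fare_amount, x$payment_type, mean)
```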
| package | read | print | head | tail | sample | filter | summarise | total |
|---|---|---|---|---|---|---|---|---|
| read.delim | 1m 21.5s | 6ms | 1ms | 1ms | 1ms | 315ms | 764ms | 1m 22.6s |
| readr | 33.1s | 90ms | 1ms | 1ms | 2ms | 202ms | 825ms | 34.2s |
| data.table | 15.7s | 13ms | 1ms | 1ms | 1ms | 129ms | 394ms | 16.3s |
| vroom | 3.6s | 86ms | 1ms | 1ms | 2ms | 1.4s | 1.9s | 7s |
Some things to note in the results. The initial read is much faster in vroom than with any other method (3.6s, against 15.7s for data.table, the next fastest), and most of the manipulations, such as print(), head(), tail() and sample(), are equally fast, so fast they can’t be seen in the plots. However, because the character data is read lazily, operations such as filter and summarise, which need character values, require additional time (1.4s and 1.9s here). This cost is paid only once: after the values have been read, they are stored in memory, and subsequent accesses are as fast as with the other packages.
For more details on how the benchmarks were performed and additional benchmarks with other types of data see the benchmark vignette.
Feedback welcome!#
vroom is a new package and, like any newborn, may fall down a few times before learning to run. If you do run into a bug or think of a new feature that would work well in vroom please open an issue so we can discuss it!
Acknowledgements#
Even though this is a new release, a number of people have been testing out pre-release versions on their datasets and opening issues, which has been a huge help in making the package more robust.
A big thanks to @alex-gable, @andrie, @dan-reznik, @Evgeniy-, @ginolhac, @ibarraespinosa, @KasperSkytte, @ldecicco-USGS, @LuisQ95, @matthieu-haudiquet, @md0u80c9, @mkiang, @R3myG, @randomgambit, @slowkow, @telaroz, @thierrygosselin, and @xiaodaigh!
Also this package would not be possible without the following significant contributions to the R ecosystem.
- Gabe Becker, Luke Tierney and Tomas Kalibera for conceiving, implementing and maintaining the Altrep framework used extensively in vroom.
- Romain François, whose Altrepisode package and related blog posts were a great guide for creating new Altrep objects in C++.
- Matt Dowle and the rest of the Rdatatable team; data.table::fread() is blazing fast and a great motivator to think about how to read delimited files fast!
