I’m very pleased to announce that dplyr 0.5.0 is now available from CRAN. Get the latest version with:
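A minimal sketch of the install step, assuming a CRAN mirror is already configured for your R session:

```r
# Installs (or upgrades to) the current CRAN release of dplyr.
install.packages("dplyr")
```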
dplyr 0.5.0 is a big release with a heap of new features, a whole bunch of minor improvements, and many bug fixes, both from me and from the broader dplyr community. In this blog post, I’ll highlight the most important changes:
- Some breaking changes to single table verbs.
- New tibble and dtplyr packages.
- New vector functions.
- Replacements for summarise_each() and mutate_each().
- Improvements to SQL translation.
To see the complete list, please read the release notes.
Breaking changes
arrange() once again ignores grouping, reverting to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as further changes would just cause more confusion.
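A small sketch of what this means in practice, using the built-in mtcars data (the variable name sorted is only illustrative). The grouped data is ordered by mpg across the whole table, not within each cyl group:

```r
library(dplyr)

# Sort a grouped data frame. arrange() ignores the grouping,
# so rows are ordered by mpg across the whole table.
sorted <- mtcars %>%
  group_by(cyl) %>%
  arrange(desc(mpg))

head(sorted$mpg)
```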
If you give distinct() a list of variables, it now only keeps those variables (instead of, as previously, keeping the first value from the other variables). To preserve the previous behaviour, use .keep_all = TRUE:
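As an illustration, here is a sketch on a made-up five-row data frame (df and its columns are assumptions for the example):

```r
library(dplyr)

df <- data.frame(x = c(1, 1, 1, 2, 2), y = 1:5)

# Only the variables passed to distinct() are kept.
only_x <- df %>% distinct(x)

# .keep_all = TRUE keeps the other columns, taking the first row for each x.
first_rows <- df %>% distinct(x, .keep_all = TRUE)
```

Here only_x has a single column x, while first_rows also carries the y value from the first row of each distinct x.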
The select() helper functions starts_with(), ends_with(), etc. are now real exported functions. This means that they have better documentation, and there’s an extension mechanism if you want to write your own helpers.
Tibble and dtplyr packages
Functions related to the creation and coercion of tbl_dfs (“tibbles” for short) now live in their own package: tibble. See vignette("tibble") for more details.
Similarly, all code related to the data.table dplyr backend has been separated out into a new dtplyr package. This decouples the development of the data.table interface from the development of the dplyr package, and I hope it will spur improvements to the backend. If both data.table and dplyr are loaded, you’ll get a message reminding you to load dtplyr.
Vector functions
This version of dplyr gains a number of vector functions inspired by SQL. Two functions make it a little easier to eliminate or generate missing values:
- Given a set of vectors, coalesce() finds the first non-missing value in each position.
- The complement of coalesce() is na_if(): it replaces a specified value with an NA.
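Both functions can be sketched on small made-up vectors (the values, including the -99 sentinel, are assumptions for illustration):

```r
library(dplyr)

x <- c(1, 2, NA, 4, NA, 6)
y <- c(NA, 2, 3, 4, 5, NA)

# Take the first non-missing value at each position.
filled <- coalesce(x, y)

# Replace the missing values in x with a single default.
zeroed <- coalesce(x, 0)

# na_if() goes the other way: turn a sentinel value into NA.
cleaned <- na_if(c(1, 2, -99, 4), -99)
```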
Three functions provide convenient ways of replacing values. In order from simplest to most complicated, they are:
if_else(), a vectorised if statement, takes a logical vector (usually created with a comparison operator like ==, <, or %in%) and replaces TRUEs with one vector and FALSEs with another.
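A minimal sketch with a made-up numeric vector:

```r
library(dplyr)

x <- c(1, 5, 3, 4, 2)

# Label each element depending on the result of the comparison.
size <- if_else(x < 5, "small", "big")
size
```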
if_else() is similar to base::ifelse(), but has two useful improvements.
First, it has a fourth argument that will replace missing values:
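For example, assuming a vector that contains a missing value, the fourth argument (named missing) supplies the replacement:

```r
library(dplyr)

x <- c(NA, 1, 5)

# Where the condition is NA, the value of `missing` is used.
size <- if_else(x < 5, "small", "big", missing = "unknown")
size
```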
Secondly, it also has stricter semantics than ifelse(): the true and false arguments must be the same type. This gives a less surprising return type, and preserves S3 vectors like dates and factors:
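A sketch of the difference using dates (the dates and the cutoff are made up for illustration): base ifelse() drops the Date class, while if_else() keeps it.

```r
library(dplyr)

d <- as.Date(c("2016-01-01", "2016-06-24"))
cutoff <- as.Date("2016-03-01")

# base::ifelse() strips attributes, so the result is a plain number.
loose <- ifelse(d > cutoff, d, cutoff)

# if_else() keeps the Date class of its true and false arguments.
strict <- if_else(d > cutoff, d, cutoff)

class(loose)
class(strict)
```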
Currently, if_else() is very strict, so you’ll need to carefully match the types of true and false. This is most likely to bite you when you’re using missing values, where you’ll need to use a specific NA: NA_integer_, NA_real_, or NA_character_:
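As a sketch, assuming an integer vector, the false value below is the typed NA_integer_ so that both branches are integers:

```r
library(dplyr)

x <- c(1L, 5L, 10L)

# NA_integer_ matches the type of x, so true and false agree.
capped <- if_else(x < 5L, x, NA_integer_)
capped
```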
recode(), a vectorised switch(), takes a numeric vector, character vector, or factor, and replaces elements based on their values.
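A minimal sketch on a made-up character vector. Replacements are given as named arguments, and .default controls what happens to values that are not matched:

```r
library(dplyr)

x <- c("a", "b", "c", "a")

# Replace "a" and leave the other values unchanged.
renamed <- recode(x, a = "Apple")

# Replace "a" and turn every unmatched value into NA.
only_apple <- recode(x, a = "Apple", .default = NA_character_)
```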
case_when() is a vectorised set of if and else ifs. You provide it a set of test-result pairs as formulas: the left-hand side of the formula should return a logical vector, and the right-hand side should return either a single value, or a vector the same length as the left-hand side. All results must be the same type of vector.
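A sketch of the idea, using a fizz-buzz style labelling of a made-up integer vector. The tests are tried in order, and the final TRUE acts as the catch-all else branch. The call is made on a plain vector, outside of mutate():

```r
library(dplyr)

x <- 1:15

# Each formula is test ~ result; the first test that is TRUE wins.
labels <- case_when(
  x %% 15 == 0 ~ "fizz buzz",
  x %% 3 == 0 ~ "fizz",
  x %% 5 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
labels
```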
case_when() is still somewhat experimental and does not currently work inside mutate(). That will be fixed in a future version.
I also added one small helper for dealing with floating point comparisons: near() tests for equality with numeric tolerance (abs(x - y) < tolerance).
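A short sketch of why this helps, using the classic square-root example:

```r
library(dplyr)

x <- sqrt(2)^2

# Exact comparison fails because of floating point rounding.
exact <- x == 2

# near() compares within a small numeric tolerance.
close_enough <- near(x, 2)
```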
Predicate functions
Thanks to ideas and code from Lionel Henry, a new family of functions improves upon summarise_each() and mutate_each():
- summarise_all() and mutate_all() apply a function to all (non-grouped) columns.
- summarise_at() and mutate_at() operate on a subset of columns. You can select columns with:

  - a character vector of column names,
  - a numeric vector of column positions, or
  - a column specification with select() semantics generated with the new vars() helper.

  ```r
  mtcars %>%
    group_by(cyl) %>%
    summarise_at(c("mpg", "wt"), mean)
  #> # A tibble: 3 x 3
  #>     cyl      mpg       wt
  #>   <dbl>    <dbl>    <dbl>
  #> 1     4 26.66364 2.285727
  #> 2     6 19.74286 3.117143
  #> 3     8 15.10000 3.999214

  mtcars %>%
    group_by(cyl) %>%
    summarise_at(vars(mpg, wt), mean)
  #> # A tibble: 3 x 3
  #>     cyl      mpg       wt
  #>   <dbl>    <dbl>    <dbl>
  #> 1     4 26.66364 2.285727
  #> 2     6 19.74286 3.117143
  #> 3     8 15.10000 3.999214
  ```
- summarise_if() and mutate_if() take a predicate function (a function that returns TRUE or FALSE when given a column). This makes it easy to apply a function only to numeric columns.
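The summarise_all() and summarise_if() variants from the list above can be sketched on the built-in mtcars and iris data (the result names are only illustrative):

```r
library(dplyr)

# Apply mean() to every non-grouping column, once per cyl group.
by_cyl_means <- mtcars %>%
  group_by(cyl) %>%
  summarise_all(mean)

# Apply mean() only to the columns for which is.numeric() is TRUE,
# which drops the Species factor from the result.
numeric_means <- iris %>%
  summarise_if(is.numeric, mean)
```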
All of these functions pass ... on to the individual functions:
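For instance, in this sketch the extra trim argument is forwarded to mean() for each selected column (the choice of trim = 0.25 is arbitrary):

```r
library(dplyr)

# trim = 0.25 is passed through ... to mean() for every numeric column.
trimmed <- iris %>%
  summarise_if(is.numeric, mean, trim = 0.25)
trimmed
```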
A new select_if() allows you to pick columns with a predicate function:
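A minimal sketch on the built-in iris data, keeping only the numeric columns:

```r
library(dplyr)

# Keep the columns for which the predicate returns TRUE.
numeric_cols <- iris %>% select_if(is.numeric)
names(numeric_cols)
```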
summarise_each() and mutate_each() will be deprecated in a future release.
SQL translation
I have completely overhauled the translation of dplyr verbs into SQL statements. Previously, dplyr used a rather ad-hoc approach which tried to guess when a new subquery was needed. Unfortunately this approach was fraught with bugs, so I have now implemented a richer internal data model. In the short term, this is likely to lead to some minor performance decreases (as the generated SQL is more complex), but dplyr is much more likely to generate correct SQL. In the long term, these abstractions will make it possible to write a query optimiser/compiler in dplyr, which would make it possible to generate much more succinct queries. If you know anything about writing query optimisers or compilers and are interested in working on this problem, please let me know!

