dplyr 0.8.2

Introduction#

We’re delighted to announce the release of dplyr 0.8.2 on CRAN 🍉 !

This is a minor maintenance release in the 0.8.* series, addressing a collection of issues since the 0.8.1 and 0.8.0 versions.

top_n() and top_frac()#

top_n() has been around for a long time in dplyr , as a convenient wrapper around filter() and min_rank() , to select top (or bottom) entries in each group of a tibble.

In this release, top_n() is no longer limited to a constant number of entries per group, its n argument is now quoted to be evaluated later in the context of the group.

Here are the top half countries, i.e. n() / 2, in terms of life expectancy in 2007.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28


library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
gapminder::gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  top_n(n() / 2, lifeExp)
#> # A tibble: 70 x 6
#> # Groups:   continent [5]
#>    country   continent  year lifeExp        pop gdpPercap
#>    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 Algeria   Africa     2007    72.3   33333216     6223.
#>  2 Argentina Americas   2007    75.3   40301927    12779.
#>  3 Australia Oceania    2007    81.2   20434176    34435.
#>  4 Austria   Europe     2007    79.8    8199783    36126.
#>  5 Bahrain   Asia       2007    75.6     708573    29796.
#>  6 Belgium   Europe     2007    79.4   10392226    33693.
#>  7 Benin     Africa     2007    56.7    8078314     1441.
#>  8 Canada    Americas   2007    80.7   33390141    36319.
#>  9 Chile     Americas   2007    78.6   16284741    13172.
#> 10 China     Asia       2007    73.0 1318683096     4959.
#> # … with 60 more rows

top_frac() is new convenience shortcut for the top n percent, i.e.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


gapminder::gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  top_frac(0.5, lifeExp)
#> # A tibble: 70 x 6
#> # Groups:   continent [5]
#>    country   continent  year lifeExp        pop gdpPercap
#>    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 Algeria   Africa     2007    72.3   33333216     6223.
#>  2 Argentina Americas   2007    75.3   40301927    12779.
#>  3 Australia Oceania    2007    81.2   20434176    34435.
#>  4 Austria   Europe     2007    79.8    8199783    36126.
#>  5 Bahrain   Asia       2007    75.6     708573    29796.
#>  6 Belgium   Europe     2007    79.4   10392226    33693.
#>  7 Benin     Africa     2007    56.7    8078314     1441.
#>  8 Canada    Americas   2007    80.7   33390141    36319.
#>  9 Chile     Americas   2007    78.6   16284741    13172.
#> 10 China     Asia       2007    73.0 1318683096     4959.
#> # … with 60 more rows

tbl_vars() and group_cols()#

tbl_vars() now returns a dplyr_sel_vars object that keeps track of the grouping variables. This information powers group_cols() , which can now be used in every function that uses tidy selection of columns.

Functions in the tidyverse and beyond may start to use the tbl_vars() /group_cols() duo, starting from tidyr and this pull request

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20


# pak::pkg_install("tidyverse/tidyr#668")

iris %>%
  group_by(Species) %>% 
  tidyr::gather("flower_att", "measurement", -group_cols())
#> # A tibble: 600 x 3
#> # Groups:   Species [3]
#>    Species flower_att   measurement
#>    <fct>   <chr>              <dbl>
#>  1 setosa  Sepal.Length         5.1
#>  2 setosa  Sepal.Length         4.9
#>  3 setosa  Sepal.Length         4.7
#>  4 setosa  Sepal.Length         4.6
#>  5 setosa  Sepal.Length         5  
#>  6 setosa  Sepal.Length         5.4
#>  7 setosa  Sepal.Length         4.6
#>  8 setosa  Sepal.Length         5  
#>  9 setosa  Sepal.Length         4.4
#> 10 setosa  Sepal.Length         4.9
#> # … with 590 more rows

group_split(), group_map(), group_modify()#

group_split() always keeps a ptype attribute to track the prototype of the splits.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_split()
#> list()
#> attr(,"ptype")
#> # A tibble: 0 x 11
#> # … with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>

group_map() and group_modify() benefit from this in the edge case where there are no groups.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19


mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_map(~.x)
#> list()
#> attr(,"ptype")
#> # A tibble: 0 x 10
#> # … with 10 variables: mpg <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
#> #   wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_modify(~.x)
#> # A tibble: 0 x 11
#> # Groups:   cyl [0]
#> # … with 11 variables: cyl <dbl>, mpg <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>