We’re chuffed to announce the release of dbplyr 2.4.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.
You can install it from CRAN with:
install.packages("dbplyr")
This blog post will highlight some of the most important new features: eliminating subqueries when using multiple unions in a row, getting more control on the generated SQL, and a handful of new translations. As usual, release comes with a large number of improvements to translations for individual backends; see the full list in the release notes
SQL optimisation#
dbplyr now produces fewer subqueries when combining tables with union()
and union_all()
resulting in shorter, more readable, and, in some cases, faster SQL.
lf1 <- lazy_frame(x = 1, y = "a", .name = "lf1")
lf2 <- lazy_frame(x = 1, y = "b", .name = "lf2")
lf3 <- lazy_frame(x = 1, z = "c", .name = "lf3")
lf1 |>
union(lf2) |>
union(lf3)
#> <SQL>
#> SELECT `lf1`.*, NULL AS `z`
#> FROM `lf1`
#>
#> UNION
#>
#> SELECT `lf2`.*, NULL AS `z`
#> FROM `lf2`
#>
#> UNION
#>
#> SELECT `x`, NULL AS `y`, `z`
#> FROM `lf3`
(As usual in these blog posts, I’m using lazy_frame()
to focus on the SQL generation, without having to set up a dummy database.)
Similarly, a semi/anti_join() on a filtered table now avoids a subquery:
lf1 |>
semi_join(lf3 |> filter(z == "c"), join_by(x))
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE EXISTS (
#> SELECT 1 FROM `lf3`
#> WHERE (`lf1`.`x` = `lf3`.`x`) AND (`lf3`.`z` = 'c')
#> )
SQL generation#
The new argument sql_options for show_query()
and remote_query()
gives you more control on the generated SQL.
-
By default dbplyr uses
*to select all columns of a table, but withuse_star = FALSEall columns are selected explicitly:lf3 <- lazy_frame(x = 1, y = 2, z = 3, .name = "lf3") lf3 |> mutate(a = 4) #> <SQL> #> SELECT `lf3`.*, 4.0 AS `a` #> FROM `lf3` lf3 |> mutate(a = 4) |> show_query(sql_options = sql_options(use_star = FALSE)) #> <SQL> #> SELECT `x`, `y`, `z`, 4.0 AS `a` #> FROM `lf3` -
If you prefer common table expressions (CTE) over subqueries use
cte = TRUE:nested_query <- lf3 |> mutate(z = z + 1) |> left_join(lf2, by = join_by(x, y)) nested_query #> <SQL> #> SELECT `LHS`.* #> FROM ( #> SELECT `x`, `y`, `z` + 1.0 AS `z` #> FROM `lf3` #> ) AS `LHS` #> LEFT JOIN `lf2` #> ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`) nested_query |> show_query(sql_options = sql_options(cte = TRUE)) #> <SQL> #> WITH `q01` AS ( #> SELECT `x`, `y`, `z` + 1.0 AS `z` #> FROM `lf3` #> ) #> SELECT `LHS`.* #> FROM `q01` AS `LHS` #> LEFT JOIN `lf2` #> ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`) -
And if you want that all columns in a join are qualified with the table name and not only the ambiguous ones use
qualify_all_columns = TRUE:qualify_columns <- lf2 |> left_join(lf3, by = join_by(x, y)) qualify_columns #> <SQL> #> SELECT `lf2`.*, `z` #> FROM `lf2` #> LEFT JOIN `lf3` #> ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`) qualify_columns |> show_query(sql_options = sql_options(qualify_all_columns = TRUE)) #> <SQL> #> SELECT `lf2`.*, `lf3`.`z` AS `z` #> FROM `lf2` #> LEFT JOIN `lf3` #> ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)
New translations#
str_detect(), str_starts() and str_ends() with fixed patterns are translated to INSTR():
lf1 |>
filter(
stringr::str_detect(x, stringr::fixed("abc")),
stringr::str_starts(x, stringr::fixed("a"))
)
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE (INSTR(`x`, 'abc') > 0) AND (INSTR(`x`, 'a') = 1)
And nzchar()
and runif()
are now translated to their SQL equivalents:
lf1 |>
filter(nzchar(x)) |>
mutate(z = runif())
#> <SQL>
#> SELECT `lf1`.*, RANDOM() AS `z`
#> FROM `lf1`
#> WHERE (((`x` IS NULL) OR `x` != ''))
Acknowledgements#
The vast majority of this release (particularly the SQL optimisations) are from Maximilian Girlich ; thanks so much for continued work on this package! And a big thanks go to the 84 other folks who helped out by filing issues and contributing code: @abalter , @ablack3 , @andreassoteriadesmoj , @apalacio9502 , @avsdev-cw , @bairdj , @bastistician , @brownj31 , @But2ene , @carlganz , @catalamarti , @CEH-SLU , @chriscardillo , @DavisVaughan , @DaZaM82 , @donour , @edgararuiz , @eduardszoecs , @eipi10 , @ejneer , @erikvona , @fh-afrachioni , @fh-mthomson , @gui-salome , @hadley , @halpo , @homer3018 , @iangow , @jdlom , @jennal-datacenter , @JeremyPasco , @jiemakel , @jingydz , @johnbaums , @joshseiv , @jrandall , @khkk378 , @kmishra9 , @kongdd , @krlmlr , @krprasangdas , @KRRLP-PL , @lentinj , @lgaborini , @lhabegger , @lorenzolightsgdwarf , @lschneiderbauer , @marianschmidt , @matthewjnield , @mgirlich , @MichaelChirico , @misea , @mjbroerman , @moodymudskipper , @multimeric , @nannerhammix , @nikolasharing , @nviets , @nviraj , @oobd , @pboesu , @pepijn-devries , @rbcavanaugh , @rcepka , @robertkck , @samssann , @SayfSaid , @scottporter , @shearerpmm , @srikanthtist , @stemangiola , @stephenashton-dhsc , @stevepowell99 , @TBlackmore , @thomashulst , @thothal , @tilo-aok , @tisseuil , @tonyk7440 , @TSchiefer , @Tsemharb , @tuge98 , @vadim-cherepanov , and @wdenton .

