We’re chuffed to announce the release of dbplyr 2.4.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.

You can install it from CRAN with:

install.packages("dbplyr")

This blog post will highlight some of the most important new features: eliminating subqueries when using multiple unions in a row, getting more control on the generated SQL, and a handful of new translations. As usual, release comes with a large number of improvements to translations for individual backends; see the full list in the release notes

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

SQL optimisation#

dbplyr now produces fewer subqueries when combining tables with union() and union_all() resulting in shorter, more readable, and, in some cases, faster SQL.

lf1 <- lazy_frame(x = 1, y = "a", .name = "lf1")
lf2 <- lazy_frame(x = 1, y = "b", .name = "lf2")
lf3 <- lazy_frame(x = 1, z = "c", .name = "lf3")

lf1 |> 
  union(lf2) |>
  union(lf3)
#> <SQL>
#> SELECT `lf1`.*, NULL AS `z`
#> FROM `lf1`
#> 
#> UNION
#> 
#> SELECT `lf2`.*, NULL AS `z`
#> FROM `lf2`
#> 
#> UNION
#> 
#> SELECT `x`, NULL AS `y`, `z`
#> FROM `lf3`

(As usual in these blog posts, I’m using lazy_frame() to focus on the SQL generation, without having to set up a dummy database.)

Similarly, a semi/anti_join() on a filtered table now avoids a subquery:

lf1 |> 
  semi_join(lf3 |> filter(z == "c"), join_by(x))
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE EXISTS (
#>   SELECT 1 FROM `lf3`
#>   WHERE (`lf1`.`x` = `lf3`.`x`) AND (`lf3`.`z` = 'c')
#> )

SQL generation#

The new argument sql_options for show_query() and remote_query() gives you more control on the generated SQL.

  • By default dbplyr uses * to select all columns of a table, but with use_star = FALSE all columns are selected explicitly:

    lf3 <- lazy_frame(x = 1, y = 2, z = 3, .name = "lf3")
    lf3 |> 
      mutate(a = 4)
    #> <SQL>
    #> SELECT `lf3`.*, 4.0 AS `a`
    #> FROM `lf3`
    
    lf3 |> 
      mutate(a = 4) |> 
      show_query(sql_options = sql_options(use_star = FALSE))
    #> <SQL>
    #> SELECT `x`, `y`, `z`, 4.0 AS `a`
    #> FROM `lf3`
    
  • If you prefer common table expressions (CTE) over subqueries use cte = TRUE:

    nested_query <- lf3 |> 
      mutate(z = z + 1) |> 
      left_join(lf2, by = join_by(x, y))
    nested_query
    #> <SQL>
    #> SELECT `LHS`.*
    #> FROM (
    #>   SELECT `x`, `y`, `z` + 1.0 AS `z`
    #>   FROM `lf3`
    #> ) AS `LHS`
    #> LEFT JOIN `lf2`
    #>   ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)
    
    nested_query |> 
      show_query(sql_options = sql_options(cte = TRUE))
    #> <SQL>
    #> WITH `q01` AS (
    #>   SELECT `x`, `y`, `z` + 1.0 AS `z`
    #>   FROM `lf3`
    #> )
    #> SELECT `LHS`.*
    #> FROM `q01` AS `LHS`
    #> LEFT JOIN `lf2`
    #>   ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)
    
  • And if you want that all columns in a join are qualified with the table name and not only the ambiguous ones use qualify_all_columns = TRUE:

    qualify_columns <- lf2 |> 
      left_join(lf3, by = join_by(x, y))
    
    qualify_columns
    #> <SQL>
    #> SELECT `lf2`.*, `z`
    #> FROM `lf2`
    #> LEFT JOIN `lf3`
    #>   ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)
    
    qualify_columns |> 
       show_query(sql_options = sql_options(qualify_all_columns = TRUE))
    #> <SQL>
    #> SELECT `lf2`.*, `lf3`.`z` AS `z`
    #> FROM `lf2`
    #> LEFT JOIN `lf3`
    #>   ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)
    

New translations#

str_detect(), str_starts() and str_ends() with fixed patterns are translated to INSTR():

lf1 |> 
  filter(
    stringr::str_detect(x, stringr::fixed("abc")),
    stringr::str_starts(x, stringr::fixed("a"))
  )
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE (INSTR(`x`, 'abc') > 0) AND (INSTR(`x`, 'a') = 1)

And nzchar() and runif() are now translated to their SQL equivalents:

lf1 |> 
  filter(nzchar(x)) |> 
  mutate(z = runif())
#> <SQL>
#> SELECT `lf1`.*, RANDOM() AS `z`
#> FROM `lf1`
#> WHERE (((`x` IS NULL) OR `x` != ''))

Acknowledgements#

The vast majority of this release (particularly the SQL optimisations) are from Maximilian Girlich ; thanks so much for continued work on this package! And a big thanks go to the 84 other folks who helped out by filing issues and contributing code: @abalter , @ablack3 , @andreassoteriadesmoj , @apalacio9502 , @avsdev-cw , @bairdj , @bastistician , @brownj31 , @But2ene , @carlganz , @catalamarti , @CEH-SLU , @chriscardillo , @DavisVaughan , @DaZaM82 , @donour , @edgararuiz , @eduardszoecs , @eipi10 , @ejneer , @erikvona , @fh-afrachioni , @fh-mthomson , @gui-salome , @hadley , @halpo , @homer3018 , @iangow , @jdlom , @jennal-datacenter , @JeremyPasco , @jiemakel , @jingydz , @johnbaums , @joshseiv , @jrandall , @khkk378 , @kmishra9 , @kongdd , @krlmlr , @krprasangdas , @KRRLP-PL , @lentinj , @lgaborini , @lhabegger , @lorenzolightsgdwarf , @lschneiderbauer , @marianschmidt , @matthewjnield , @mgirlich , @MichaelChirico , @misea , @mjbroerman , @moodymudskipper , @multimeric , @nannerhammix , @nikolasharing , @nviets , @nviraj , @oobd , @pboesu , @pepijn-devries , @rbcavanaugh , @rcepka , @robertkck , @samssann , @SayfSaid , @scottporter , @shearerpmm , @srikanthtist , @stemangiola , @stephenashton-dhsc , @stevepowell99 , @TBlackmore , @thomashulst , @thothal , @tilo-aok , @tisseuil , @tonyk7440 , @TSchiefer , @Tsemharb , @tuge98 , @vadim-cherepanov , and @wdenton .