dbplyr 2.4.0

We’re chuffed to announce the release of dbplyr 2.4.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.

You can install it from CRAN with:

install.packages("dbplyr")

This blog post will highlight some of the most important new features: eliminating subqueries when using multiple unions in a row, getting more control on the generated SQL, and a handful of new translations. As usual, release comes with a large number of improvements to translations for individual backends; see the full list in the release notes

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

SQL optimisation#

dbplyr now produces fewer subqueries when combining tables with union() and union_all() resulting in shorter, more readable, and, in some cases, faster SQL.

lf1 <- lazy_frame(x = 1, y = "a", .name = "lf1")
lf2 <- lazy_frame(x = 1, y = "b", .name = "lf2")
lf3 <- lazy_frame(x = 1, z = "c", .name = "lf3")

lf1 |> 
  union(lf2) |>
  union(lf3)
#> <SQL>
#> SELECT `lf1`.*, NULL AS `z`
#> FROM `lf1`
#> 
#> UNION
#> 
#> SELECT `lf2`.*, NULL AS `z`
#> FROM `lf2`
#> 
#> UNION
#> 
#> SELECT `x`, NULL AS `y`, `z`
#> FROM `lf3`

(As usual in these blog posts, I’m using lazy_frame() to focus on the SQL generation, without having to set up a dummy database.)

Similarly, a semi/anti_join() on a filtered table now avoids a subquery:

lf1 |> 
  semi_join(lf3 |> filter(z == "c"), join_by(x))
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE EXISTS (
#>   SELECT 1 FROM `lf3`
#>   WHERE (`lf1`.`x` = `lf3`.`x`) AND (`lf3`.`z` = 'c')
#> )

SQL generation#

The new argument sql_options for show_query() and remote_query() gives you more control on the generated SQL.

By default dbplyr uses * to select all columns of a table, but with use_star = FALSE all columns are selected explicitly:

lf3 <- lazy_frame(x = 1, y = 2, z = 3, .name = "lf3")
lf3 |> 
  mutate(a = 4)
#> <SQL>
#> SELECT `lf3`.*, 4.0 AS `a`
#> FROM `lf3`

lf3 |> 
  mutate(a = 4) |> 
  show_query(sql_options = sql_options(use_star = FALSE))
#> <SQL>
#> SELECT `x`, `y`, `z`, 4.0 AS `a`
#> FROM `lf3`

If you prefer common table expressions (CTE) over subqueries use cte = TRUE:

nested_query <- lf3 |> 
  mutate(z = z + 1) |> 
  left_join(lf2, by = join_by(x, y))
nested_query
#> <SQL>
#> SELECT `LHS`.*
#> FROM (
#>   SELECT `x`, `y`, `z` + 1.0 AS `z`
#>   FROM `lf3`
#> ) AS `LHS`
#> LEFT JOIN `lf2`
#>   ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)

nested_query |> 
  show_query(sql_options = sql_options(cte = TRUE))
#> <SQL>
#> WITH `q01` AS (
#>   SELECT `x`, `y`, `z` + 1.0 AS `z`
#>   FROM `lf3`
#> )
#> SELECT `LHS`.*
#> FROM `q01` AS `LHS`
#> LEFT JOIN `lf2`
#>   ON (`LHS`.`x` = `lf2`.`x` AND `LHS`.`y` = `lf2`.`y`)

And if you want that all columns in a join are qualified with the table name and not only the ambiguous ones use qualify_all_columns = TRUE:

qualify_columns <- lf2 |> 
  left_join(lf3, by = join_by(x, y))

qualify_columns
#> <SQL>
#> SELECT `lf2`.*, `z`
#> FROM `lf2`
#> LEFT JOIN `lf3`
#>   ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)

qualify_columns |> 
   show_query(sql_options = sql_options(qualify_all_columns = TRUE))
#> <SQL>
#> SELECT `lf2`.*, `lf3`.`z` AS `z`
#> FROM `lf2`
#> LEFT JOIN `lf3`
#>   ON (`lf2`.`x` = `lf3`.`x` AND `lf2`.`y` = `lf3`.`y`)

New translations#

str_detect(), str_starts() and str_ends() with fixed patterns are translated to INSTR():

lf1 |> 
  filter(
    stringr::str_detect(x, stringr::fixed("abc")),
    stringr::str_starts(x, stringr::fixed("a"))
  )
#> <SQL>
#> SELECT `lf1`.*
#> FROM `lf1`
#> WHERE (INSTR(`x`, 'abc') > 0) AND (INSTR(`x`, 'a') = 1)

And nzchar() and runif() are now translated to their SQL equivalents:

lf1 |> 
  filter(nzchar(x)) |> 
  mutate(z = runif())
#> <SQL>
#> SELECT `lf1`.*, RANDOM() AS `z`
#> FROM `lf1`
#> WHERE (((`x` IS NULL) OR `x` != ''))