Interacting with Databases by Bárbara Borges from Shiny from useR! Brussels 2017

Transcript#

This transcript was generated automatically and may contain errors.

So, um, I am Bárbara, I'm a software developer at RStudio . Um, I am in the Shiny team and I'm gonna talk today a little bit about Shiny, but mostly about databases. How do you interact with databases from R in general and then specifically in Shiny because that brings a whole new set of concerns.

Uh, so to give you an idea of what I have planned for this talk, uh, I kinda see it as having two parts. Uh, part one is mostly what, uh, we're gonna spend this time on, uh, and it's gonna basically be about database- database best practices in R. In particular, um, I'm gonna go through a set of packages that really help you when you're interacting with databases in R and I'm gonna show how they build on top of each other.

So very quickly, DBI is a package that helps you standardise, uh, your interactions with a database from R. ODBC is a really, uh, powerful, uh, DBI backend which will make sense when I get to it. Um, dplyr , uh, is a pack- is probably the package that most of you know here. That's a package that is really great for, um, wrangling and transforming data in a tidy way and it can be used, uh, for database- for databases as well. So not only memory data but also remote data. And lastly, uh, the pool package is actually the only package that I wrote from the ones that I'm gonna talk about today and that's the one that relates most to Shiny. So pool helps you to connect to a database from Shiny using all of these other packages.

Um, so that's, uh, my main goal for today, uh, but part two of the talk is, uh, I'm gonna try to spend a few minutes on a Shiny app demo, uh, that I've created for this talk. It's a CRUD app so it means that you can create, read, update, and delete data from a database all from within a Shiny app, all from within R. In this case I chose a SQLite database, uh, which is local and really fast and doesn't require internet.

Overview of relational databases

Uh, so, databases. Um, at the risk of, uh, oversimplifying you can think of databases as being, uh, in one of two categories. Uh, there's relational databases and then there's the newer NoSQL or object oriented databases. We'll focus on the first one, uh, and that's mostly for historical reasons. They've been, uh, around for longer. Uh, the infrastructure around them in general, not just in R, is stronger.

Uh, and very quickly relational databases and their management systems store data in columns and rows. These make up tables. It looks spreadsheet-y. Uh, they use SQL which is a language for querying data. Uh, a few popular, uh, databases, uh, that are relational are, in this case open source, are MySQL, Postgres, and SQLite. Recently for distributed systems you also have things like Apache Hive and Cloudera Impala which are relational-like. Um, and most of these packages work really well, uh, with relational databases.

The DBI package

So the first foundational package, uh, that we're interested in when it comes to databases in R is called DBI and it stands for database interface, uh, and the whole idea of DBI is that it's a standard, it's an interface, uh, that defines how you should interact with the database from R. So basically it has, um, S3 generics which are, uh, functions that are implemented for particular types of connections.

So DBI itself doesn't, uh, care what type of connection you have, it just says that whatever is your connection you have to implement this method for connecting, that method for disconnecting, and a bunch of other methods. So the advantage of this is that you get, um, different databases using basically the same language and that gives you a lot of portability in your code. You can basically migrate your database to another system and be able to use, uh, a lot of your old code.

Uh, so one thing that I wanna, uh, note particularly is that, uh, as I said, DBI itself doesn't care what is the particular database you're using. Uh, all of that work is done by other packages and these, uh, packages will implement all of these methods for a particular database, uh, or a database management system. So these are the so called DBI compliant backends, um, our SQLite, our MySQL, our Postgres are all examples of DBI compliant packages, um, so you need to use both of them if you want to connect to, uh, a SQLite database.

Um, so and if this sounds like extra work to you, again, the advantage is that by having DBI, um, you have a very portable, um, language, uh, for a lot of different packages. In practice, uh, DBI looks like this. You can create a connection by calling DB connect. In this case I am creating a, um, connection to, uh, SQLite database in memory. Uh, you can write a table, you can get a query, this is SQL right here, so it, you have to be a little bit fluent in SQL and when you're done you need to disconnect your connection which is very important because if you forget to do this and you're doing a lot of queries and creating a lot of connections you will actually leak a lot of connections.

So the gist of it's pretty simple. Uh, one thing that I want to note in DBI before we move on is that it also provides really good, um, SQL injection prevention, uh, functions. So, uh, SQL injection is when, um, users, third parties can, uh, go to your app or go to your program and inject code that has basically malicious intent. So in this case even though the string here is asking for a name a user can do this, uh, and, and if you're not escaping this correctly this can cause your database to lose a whole table.

So this has been a problem ever since relational databases have existed, uh, and the whole point of here, the punchline of this comic is that you need to sanitize your database input and DBI has a lot of functions for that. So keep that in mind. That is important.

So this has been a problem ever since relational databases have existed, uh, and the whole point of here, the punchline of this comic is that you need to sanitize your database input and DBI has a lot of functions for that.

ODBC and dplyr

So moving on to ODBC. I'm going to spend very little time on this. Uh, ODBC, uh, is yet another, uh, standard, yet another specification. Uh, so ODBC, uh, unlike DBI however is independent of operating system, of programming language, and it really is just a standard API of how to interact with databases. How this relates to R is that very recently, uh, we have the ODBC package, ODBC now in lowercase, which is a DBI compliant backend.

So what this means is that if you have an ODBC driver for a package that has been written by whomever, uh, that is not in R because ODBC is not language dependent, and then you have DBI and ODBC, you don't need to write your own driver for whatever database you want to connect to if you already have the ODBC driver. So that's the real power of this, uh, and it really augments the power of DBI by extending it to so many other databases that didn't have specific drivers.

Uh, in practice it looks very similar to what we've seen already. You have DB connect, you have a series of, uh, arguments. One thing to note is that since, um, you're always connecting with ODBC, these arguments are actually the same always now. It's always, um, UID for example instead of user or username or one of the thousand variants that you can have.

Dplyr. Um, so the idea, now that you know about DBI and you can use DBI for so many things, the idea, uh, to take it one level up is to not have to worry about SQL, um, and just use dplyr syntax, use R code to talk to a database, to extract data, to insert data, uh, to transform data. Uh, so dplyr extends and wraps a lot of DBI methods and you can do most of what you can do with DBI and SQL directly with dplyr and R.

Uh, there are a few things that you still need DBI for because, um, dplyr is not magic, uh, although it seems like it sometimes. Um, and there has been actually a very recent revamp of dplyr, uh, and the introduction of a new package called Dbplyr , uh, so this is, uh, fresh off the presses. Um, the bottom line I think is if you ha- if you're used to R and if you're used to the dplyr verbs, like filter and mutate and group by and summarize, it's a very good, uh, idea to go with dplyr. It also does a bunch of optimizations for you with SQL translation and lazy evaluation so you're never loading data into R until you need it.

Uh, so if dplyr is good enough for you, it tends to be, uh, a great idea to use it and I hope this slide makes it really clear why that's so. Um, as you can see we have exactly the same standard for connection that we have with DBI, but now to query it we are just using dplyr, we're piping things, uh, it's very easy for us to read as human beings, you have a connection, you ask for the table called city and then you ask for the first two rows and that's it. It's very intuitive, there's no select asterisk from city limit to, uh, that we had to deal with before.

The pool package

So, finally, pool. Uh, so after all these packages why do we need another one? Um, the problem that pool, uh, tries to solve is how- what is the best way to interact with the database from Shiny? So Shiny brings its own set of constraints, uh, there- for every one session there's only a single R process and potentially many users. Establishing connections to a remote database can take quite a while and they can go down at any time because of network problems or, uh, a bunch of other things.

Um, so you're really in this conundrum that you don't want a fresh connection for every query that you make for every user action because that would be so slow, but you also don't want one connection per app because what if it breaks down? What if there's two users trying to make a query at the same time? So the pool package, uh, is meant to solve this problem, uh, by creating a pool of connections and it- so it fetches the connections from the remote, uh, server and then it keeps them locally and you can check them out as you need and return them and it does this kind of behind a curtain so you don't even notice.

So the pool package, uh, is meant to solve this problem, uh, by creating a pool of connections and it- so it fetches the connections from the remote, uh, server and then it keeps them locally and you can check them out as you need and return them and it does this kind of behind a curtain so you don't even notice.

It's actually very similar to how DBI looks except for the creating and the closing of the pool itself. So the idea is that it's kind of like having one connection per query except that the connections now are local, uh, and are recycled. So every time you're done with a connection it goes back to the pool and it can be used again. Uh, so it gives you, uh, usually very good performance and it takes care of connection management.

So it is mainly- pool is mainly important for Shiny apps, uh, because of this whole interactive context I was talking about, but it can be used in any other context without any problem. Uh, you might not need it necessarily, but it definitely doesn't hurt and it also integrates seamlessly with both DBI and dplyr. We'll be, uh, on CRAN really soon, uh, and to have an idea of what it looks like, as you can see we still have our connection, our disconnection, our query. The query is exactly the same as it is with dplyr or DBI, uh, and the only things really that change is that instead of, uh, calling dbconnect I'm calling dbpool and instead of calling dbdisconnect I'm calling poolclose and this takes care of making all the connections and all the disconnections that you might need.