Resources

Interacting with Databases by Bárbara Borges from Shiny from useR! Brussels 2017

Connecting to an external database from R can be challenging. This is made worse when you need to interact with a database from a live Shiny application. To demystify this process, I'll do two things. First, I'll talk about best practices when connecting to a database from Shiny. There are three important packages that help you with this and I'll weave them into this part of the talk. The DBI package does a great job of standardizing how to establish a connection, execute safe queries using SQL (goodbye SQL injections!) and close the connection. The dplyr package builds on top of this to make even easier to connect to databases and extract data, since it allows users to query the database using regular dplyr syntax in R (no SQL knowledge necessary). Yet a third package, pool, exists to help you when using databases in Shiny applications, by taking care of connection management, and often resulting in better performance. Second, I'll demo these concepts in practice by showing how we can connect to a database from Shiny to create a CRUD application. I will show the application running and point out specific parts of the code (which will be publicly available)

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

So, um, I am Bárbara, I'm a software developer at RStudio. Um, I am in the Shiny team and I'm gonna talk today a little bit about Shiny, but mostly about databases. How do you interact with databases from R in general and then specifically in Shiny because that brings a whole new set of concerns.

Uh, so to give you an idea of what I have planned for this talk, uh, I kinda see it as having two parts. Uh, part one is mostly what, uh, we're gonna spend this time on, uh, and it's gonna basically be about database- database best practices in R. In particular, um, I'm gonna go through a set of packages that really help you when you're interacting with databases in R and I'm gonna show how they build on top of each other.

So very quickly, DBI is a package that helps you standardise, uh, your interactions with a database from R. ODBC is a really, uh, powerful, uh, DBI backend which will make sense when I get to it. Um, dplyr, uh, is a pack- is probably the package that most of you know here. That's a package that is really great for, um, wrangling and transforming data in a tidy way and it can be used, uh, for database- for databases as well. So not only memory data but also remote data. And lastly, uh, the pool package is actually the only package that I wrote from the ones that I'm gonna talk about today and that's the one that relates most to Shiny. So pool helps you to connect to a database from Shiny using all of these other packages.

Um, so that's, uh, my main goal for today, uh, but part two of the talk is, uh, I'm gonna try to spend a few minutes on a Shiny app demo, uh, that I've created for this talk. It's a CRUD app so it means that you can create, read, update, and delete data from a database all from within a Shiny app, all from within R. In this case I chose a SQLite database, uh, which is local and really fast and doesn't require internet.

Overview of relational databases

Uh, so, databases. Um, at the risk of, uh, oversimplifying you can think of databases as being, uh, in one of two categories. Uh, there's relational databases and then there's the newer NoSQL or object oriented databases. We'll focus on the first one, uh, and that's mostly for historical reasons. They've been, uh, around for longer. Uh, the infrastructure around them in general, not just in R, is stronger.

Uh, and very quickly relational databases and their management systems store data in columns and rows. These make up tables. It looks spreadsheet-y. Uh, they use SQL which is a language for querying data. Uh, a few popular, uh, databases, uh, that are relational are, in this case open source, are MySQL, Postgres, and SQLite. Recently for distributed systems you also have things like Apache Hive and Cloudera Impala which are relational-like. Um, and most of these packages work really well, uh, with relational databases.

The DBI package

So the first foundational package, uh, that we're interested in when it comes to databases in R is called DBI and it stands for database interface, uh, and the whole idea of DBI is that it's a standard, it's an interface, uh, that defines how you should interact with the database from R. So basically it has, um, S3 generics which are, uh, functions that are implemented for particular types of connections.

So DBI itself doesn't, uh, care what type of connection you have, it just says that whatever is your connection you have to implement this method for connecting, that method for disconnecting, and a bunch of other methods. So the advantage of this is that you get, um, different databases using basically the same language and that gives you a lot of portability in your code. You can basically migrate your database to another system and be able to use, uh, a lot of your old code.

Uh, so one thing that I wanna, uh, note particularly is that, uh, as I said, DBI itself doesn't care what is the particular database you're using. Uh, all of that work is done by other packages and these, uh, packages will implement all of these methods for a particular database, uh, or a database management system. So these are the so called DBI compliant backends, um, our SQLite, our MySQL, our Postgres are all examples of DBI compliant packages, um, so you need to use both of them if you want to connect to, uh, a SQLite database.

Um, so and if this sounds like extra work to you, again, the advantage is that by having DBI, um, you have a very portable, um, language, uh, for a lot of different packages. In practice, uh, DBI looks like this. You can create a connection by calling DB connect. In this case I am creating a, um, connection to, uh, SQLite database in memory. Uh, you can write a table, you can get a query, this is SQL right here, so it, you have to be a little bit fluent in SQL and when you're done you need to disconnect your connection which is very important because if you forget to do this and you're doing a lot of queries and creating a lot of connections you will actually leak a lot of connections.

So the gist of it's pretty simple. Uh, one thing that I want to note in DBI before we move on is that it also provides really good, um, SQL injection prevention, uh, functions. So, uh, SQL injection is when, um, users, third parties can, uh, go to your app or go to your program and inject code that has basically malicious intent. So in this case even though the string here is asking for a name a user can do this, uh, and, and if you're not escaping this correctly this can cause your database to lose a whole table.

So this has been a problem ever since relational databases have existed, uh, and the whole point of here, the punchline of this comic is that you need to sanitize your database input and DBI has a lot of functions for that. So keep that in mind. That is important.

So this has been a problem ever since relational databases have existed, uh, and the whole point of here, the punchline of this comic is that you need to sanitize your database input and DBI has a lot of functions for that.

ODBC and dplyr

So moving on to ODBC. I'm going to spend very little time on this. Uh, ODBC, uh, is yet another, uh, standard, yet another specification. Uh, so ODBC, uh, unlike DBI however is independent of operating system, of programming language, and it really is just a standard API of how to interact with databases. How this relates to R is that very recently, uh, we have the ODBC package, ODBC now in lowercase, which is a DBI compliant backend.

So what this means is that if you have an ODBC driver for a package that has been written by whomever, uh, that is not in R because ODBC is not language dependent, and then you have DBI and ODBC, you don't need to write your own driver for whatever database you want to connect to if you already have the ODBC driver. So that's the real power of this, uh, and it really augments the power of DBI by extending it to so many other databases that didn't have specific drivers.

Uh, in practice it looks very similar to what we've seen already. You have DB connect, you have a series of, uh, arguments. One thing to note is that since, um, you're always connecting with ODBC, these arguments are actually the same always now. It's always, um, UID for example instead of user or username or one of the thousand variants that you can have.

Dplyr. Um, so the idea, now that you know about DBI and you can use DBI for so many things, the idea, uh, to take it one level up is to not have to worry about SQL, um, and just use dplyr syntax, use R code to talk to a database, to extract data, to insert data, uh, to transform data. Uh, so dplyr extends and wraps a lot of DBI methods and you can do most of what you can do with DBI and SQL directly with dplyr and R.

Uh, there are a few things that you still need DBI for because, um, dplyr is not magic, uh, although it seems like it sometimes. Um, and there has been actually a very recent revamp of dplyr, uh, and the introduction of a new package called Dbplyr, uh, so this is, uh, fresh off the presses. Um, the bottom line I think is if you ha- if you're used to R and if you're used to the dplyr verbs, like filter and mutate and group by and summarize, it's a very good, uh, idea to go with dplyr. It also does a bunch of optimizations for you with SQL translation and lazy evaluation so you're never loading data into R until you need it.

Uh, so if dplyr is good enough for you, it tends to be, uh, a great idea to use it and I hope this slide makes it really clear why that's so. Um, as you can see we have exactly the same standard for connection that we have with DBI, but now to query it we are just using dplyr, we're piping things, uh, it's very easy for us to read as human beings, you have a connection, you ask for the table called city and then you ask for the first two rows and that's it. It's very intuitive, there's no select asterisk from city limit to, uh, that we had to deal with before.

The pool package

So, finally, pool. Uh, so after all these packages why do we need another one? Um, the problem that pool, uh, tries to solve is how- what is the best way to interact with the database from Shiny? So Shiny brings its own set of constraints, uh, there- for every one session there's only a single R process and potentially many users. Establishing connections to a remote database can take quite a while and they can go down at any time because of network problems or, uh, a bunch of other things.

Um, so you're really in this conundrum that you don't want a fresh connection for every query that you make for every user action because that would be so slow, but you also don't want one connection per app because what if it breaks down? What if there's two users trying to make a query at the same time? So the pool package, uh, is meant to solve this problem, uh, by creating a pool of connections and it- so it fetches the connections from the remote, uh, server and then it keeps them locally and you can check them out as you need and return them and it does this kind of behind a curtain so you don't even notice.

So the pool package, uh, is meant to solve this problem, uh, by creating a pool of connections and it- so it fetches the connections from the remote, uh, server and then it keeps them locally and you can check them out as you need and return them and it does this kind of behind a curtain so you don't even notice.

It's actually very similar to how DBI looks except for the creating and the closing of the pool itself. So the idea is that it's kind of like having one connection per query except that the connections now are local, uh, and are recycled. So every time you're done with a connection it goes back to the pool and it can be used again. Uh, so it gives you, uh, usually very good performance and it takes care of connection management.

So it is mainly- pool is mainly important for Shiny apps, uh, because of this whole interactive context I was talking about, but it can be used in any other context without any problem. Uh, you might not need it necessarily, but it definitely doesn't hurt and it also integrates seamlessly with both DBI and dplyr. We'll be, uh, on CRAN really soon, uh, and to have an idea of what it looks like, as you can see we still have our connection, our disconnection, our query. The query is exactly the same as it is with dplyr or DBI, uh, and the only things really that change is that instead of, uh, calling dbconnect I'm calling dbpool and instead of calling dbdisconnect I'm calling poolclose and this takes care of making all the connections and all the disconnections that you might need.

CRUD app demo

So, very quickly, um, the Shiny app that I built, uh, that is a CRUD app, this is kind of what it looks like. You can see there's a few, well maybe you cannot see, but there's a few options. You can create a table, uh, you can, uh, I'm actually gonna just show it.

Um, so, um, you can create a table, uh, you can create a table and, uh, so, alright, so, sorry for the zoom, uh, the idea here is that there's an overview page that shows you the tables you already have, um, in this case I already populated this table with one, uh, this database with one table.

Okay, so you can see here that it's a very small table. I've got three rows, three columns and it's kind of an inventory for those of you who can't read at the back. There's ID and then, uh, an item and how many of those you have. So, it's kind of an inventory type thing.

Uh, so, you can create a new entry, um, so, gonna have four, um, you can create a new entry, um, and once you create entry, you can see it's immediately reflected here, there's now four rows, if you read, you'll see there's now a new row. You can update an entry, uh, so, in this case if I wanna update the one I've just done because I've got some more watches in this case, um, I can do that and again it's automatically reflected here. I can also delete rows, uh, and do all of the sender things that you'd like to do with a cred app.

So, this is all possible, uh, using Shiny, uh, DBI, dplyr and pool. All the source codes available with the repo where these slides live as well and I encourage you to look at it if you're curious, um, it covers basically, it puts into practice basically everything I talked about in here. So, thank you.

We have time for one question.

So, we sent the package earlier this month, so, in a couple of weeks, couple of weeks. Couple of weeks, yeah, it's, I, so, what, dplyr was just released a couple of weeks ago, uh, and it changed a lot of things, so I was waiting for that to update pool, uh, and get it released to CRAN without having to break compatibility, which would have happened. So, now that's done and I've actually finished it for, for useR, uh, what I wanted to do, it'll be like one more week for me to test things completely and then I'll, I'll submit it to CRAN.