Edgar Ruiz | Databases using R The latest | RStudio (2019)

Transcript#

This transcript was generated automatically and may contain errors.

Okay. All right. So thank you everyone for coming. What we want to do today is do an update on where we at when it comes to working with databases in R. So last year we introduced some concepts of how we see the world when it comes to being able to access remote data and be able to get results and analysis done in the best possible way for an R user to do. And the idea there is, as I have here in overview, is to be able to send and push all the computation back to the database and then just bring back the results, right? If you think in terms of what the SQL database does, it has the data and it also has a SQL engine and that SQL engine does most of the stuff that we do in local R today, right?

So we group data, we filter data, we select data, and all that stuff can be done in the database. So we introduced those concepts of doing that and we've been talking through different conferences and even to the class. So what I wanted to do today is kind of give an update and for you to know where more resources are, starting with this one. So db.rstudio .com, if you work with databases in R, I would absolutely suggest that you bookmark this in your browser.

Resources at db.rstudio.com

We've been, the one that we had last year had all the basic stuff such as the packages that we're going to use as well as the different back ends that you can connect to and how you can do that through ODBC and the DBI package. Ever since then, we have added more best practice articles and I think this is one of the major things that you can take from this talk is like this would be a good place to start if you have questions about specific things of how to do. Because what's going to happen is that we're going to start with simpler exercises and then we're going to get more complex and there may be some issues getting to that through that complexity.

So it has, the site has a search bar that you can use to be able to perform some, like if you're looking for schema information, it shows you all the results and you can easily go to that article. Another thing that we have here is, as you were reading through the introduction for this talk about techniques, one of the things we've seen that customers sometimes struggle with is actually publishing the content, right, from a shiny app that you have local that you wanted to run now on something like RStudio Connect, for example, or shiny server pro or something like that. How do you transfer the credentials? How do you handle the security?

We have actually three articles that handle security in here, so I would recommend that if you are getting to that point, just like what Joe was talking about, you know, as you go into production and try to scale it up, this is often times becomes a challenge and we have some resources here. Another resource, excuse me, is the RStudio community. There's a special tag for databases, so if you have questions about how to do certain things with databases and R, you can post it here. Also, as you continue to learn and get better at it, you're more than welcome to post answers because we appreciate how the community comes together and help each other out, especially as we're learning how, what things we can do better to improve these interfaces that we have been working on.

New features and the expanding ecosystem

Um, in RStudio Connect, now you're able to, uh, set environment variables and those environment variables are, um, encrypted, both in disk and also in the memory. So, if you have a situation where you need to, uh, have the credentials that that shiny app is using, um, in the published version, then you can leverage that and connect. So, that's another new one that we actually are start recommending a lot based, whenever there's certain challenges to using something like a DSN or something like Kerberos, for example.

Also, we've seen, uh, as far as, in a year's time, how folks have, that have adopted using dplyr with databases and let it do the translation for you. How they start wrapping these more complex stuff inside functions and they start using things like, um, Perl, for example, and Rlang . And so, I'm very excited about that because I was expecting us to get to this point a little bit longer time frame than where we are now. So, I think as you go and start practicing using, connecting with DBI and using ODBC and using dplyr, there's definitely techniques that we're gonna be expanding our articles in db.rstudio.com to cover this kind of thing.

Another cool thing is that the universe is expanding. I did a quick analysis of CRAN packages as far as all the packages that depend on DBI or dbplyr and also that said something about database or query inside the title or description and ended up finding 61 packages. 61 packages that essentially gives us more functionality that adds to more things that we can do with databases. So we have things here like the connections, for example, there are postgres, we have dbparser here, we have dbplyr teradata, so we have some vendor specific ones as well as ones that extend functionality such as tidy predict and dbplot for example.

Another thing I wanted to show is that we have, um, of those 61 packages, I posted here, like, when were the last update done in CRAN and I was surprised how basically at least more than half been updated since last year and 87 percent has been updated since 2017. I don't know if you have had the same situation as I have that you go and you find like a group of packages that you really like and then you find out the last time it was updated was like 2001. So, like, oh, it may not work out for me. So, this is, I'm very excited about that too.

Another cool thing is that the universe is expanding. I did a quick analysis of CRAN packages as far as all the packages that depend on DBI or dbplyr and also that said something about database or query inside the title or description and ended up finding 61 packages.

Live demo: exploring Austin 311 data

So, let's do a quick demo. I want to show some techniques for connecting, exploring, visualizing data that we're going to have in a database. We're going to use several packages I have listed here, such as dbplyr as we talked about. We're going to use leaflet , and also a little bit of purr and rlang. The data is, I figured that since we're in Austin, we could use Austin open data, we're going to do that. So, it's data that's available on BigQuery that shows where different 311 calls are. So, you have 9-1-1 calls that are for emergency police. 3-1-1 is like the non-emergency police. And we'll look at this data here.

So, the first thing I'm going to do is I'm going to connect to BigQuery. And I'm very excited about this one because it's data that's, again, it's public. Usually data that I demo is data that I have like in my laptop, on the SQL server, stuff like that. So, this one is something that, once you have the code, you can try it out yourselves. So, we're going to do a pointer to the 311 services and we're going to use glimpse. So, this is a dbplyr command that's running inside the database and bringing back essentially the same information that you would as if it was local. But it's obviously not in my laptop.

So, I'm getting all the information here. You can see we have the complaint descriptions, right? We have dead birds, loose dogs, animal service stuff. Also when it was open, when it was closed and most importantly because it's going to be what we're going to do today, it's geocoded. So, we're going to look at the longitude latitude of it. To take a look real quick, see how big this data is. It's not huge, right? It's 640,000 records. But again, I don't have any of that in my laptop. Everything is running against the BigQuery table.

Here I want to see the top ten most used complaint description. You'll notice that it's running against BigQuery and it's telling me how many records it's bringing back. So, it's only bringing back ten records. So, that's all that it went back and forth. The internet and me right now and I'm analyzing 600,000 records live here, right? And it's immediately I can see the top ten with no problems. So, that's like the coolest thing ever, right? Because I'm using R markdown here to publish all this information, all this analysis without having to preload all this data.

The internet and me right now and I'm analyzing 600,000 records live here, right? And it's immediately I can see the top ten with no problems. So, that's like the coolest thing ever, right? Because I'm using R markdown here to publish all this information, all this analysis without having to preload all this data.

I'm going to do some data cleanup real quick because some records weren't properly geocoded. So, the first one is I'm going to take just the loose dog complaints and see where they are located, right? So, I'm going to run this. I'm going to use dbplot. So, instead of doing a dot plot, I'm going to do a raster plot so I don't have to bring back, you know, however, I think it's 41,000 records back to R and then have a bunch of dots that are going to be hard for me to see where they are. Rather we can do a raster so it's a lot less results. But, as you can see here, this doesn't say anything, right? Because we have no frame of reference here where these dots are located in real life.

So, what we'll do is we'll use a function called dbcompute raster that dbplot has. And what that does is that it will run the same calculations that we did for the raster but instead of giving me a ggplot object, it's giving me a data frame, right? A table. And what's cool is that, again, I just, every time I'm just calculating against the database and I'm only bringing back 484 rows instead of the 41,000 rows, right? So, it's all that's been calculated. I want to, since I'm going to use leaflet, I want to draw the squares. As you can see here, I can't draw a square just with the two points of reference here. So, I'm just going to create a quick function. As you can see, I'm using rlang to do that. So, you don't need to have to create a huge package to start learning tidy eval. It's, tidy eval is your friend, trust me.