
Edgar Ruiz | Databases using R: The latest | RStudio (2019)
Learn about the latest packages and techniques that can help you access and analyze data found inside databases using R. Many of the techniques we will cover are based on our personal and the community's experiences of implementing concepts introduced last year, such as offloading most of the data wrangling to the database using dplyr, and using the RStudio IDE to preview the database's layout and data. Also, learn more about the most recent improvements to the RStudio products that are geared to aid developers in using R with databases effectively.

VIEW MATERIALS: https://github.com/edgararuiz/databases-w-r

About the Author
Edgar Ruiz
Edgar is the author and administrator of the https://db.rstudio.com web site, and current administrator of the sparklyr web site: https://spark.rstudio.com. Author of the Data Science in Spark with sparklyr cheatsheet, co-author of the dbplyr package, and creator of the dbplot package.
Transcript
This transcript was generated automatically and may contain errors.
Okay. All right. So thank you everyone for coming. What we want to do today is give an update on where we are when it comes to working with databases in R. Last year we introduced some concepts about how we see the world when it comes to accessing remote data and getting results and analysis done in the best possible way for an R user. The idea, as I have here in the overview, is to push all the computation back to the database and then just bring back the results, right? If you think in terms of what a SQL database does, it has the data and it also has a SQL engine, and that SQL engine does most of the stuff that we do in local R today, right?
So we group data, we filter data, we select data, and all of that can be done in the database. We introduced those concepts and we've been talking about them at different conferences and even in classes. So what I wanted to do today is give an update and point you to where more resources are, starting with this one: db.rstudio.com. If you work with databases in R, I would absolutely suggest that you bookmark this in your browser.
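A minimal sketch of that pushdown idea, using an in-memory SQLite database as a stand-in for any DBI back end (the table and column names here are made up for the example):

```r
# Group/filter/summarise happen inside the database; only results come back.
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "flights", data.frame(
  carrier = c("AA", "AA", "UA"),
  delay   = c(10, 20, 5)
))

# tbl() creates a lazy reference; these verbs build SQL, nothing runs yet
delays <- tbl(con, "flights") %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(delay, na.rm = TRUE))

show_query(delays)          # the SQL that will run inside the database
result <- collect(delays)   # only the summarised rows travel back to R
dbDisconnect(con)
```

The key design point: `delays` stays a description of a query until `collect()` forces it, so the SQL engine does the heavy lifting.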
Resources at db.rstudio.com
The version we had last year had all the basic stuff, such as the packages you're going to use, as well as the different back ends you can connect to and how you can do that through ODBC and the DBI package. Since then, we have added more best-practice articles, and I think this is one of the major things you can take from this talk: this is a good place to start if you have questions about how to do specific things. What's going to happen is that we're going to start with simpler exercises and then get more complex, and there may be some issues as you get into that complexity.
The site has a search bar you can use, so if you're looking for, say, schema information, it shows you all the results and you can easily go to that article. Another thing we have here, as you were reading in the introduction for this talk about techniques: one of the things we've seen customers struggle with is actually publishing the content, right? A shiny app that you have locally and that you now want to run on something like RStudio Connect, for example, or Shiny Server Pro. How do you transfer the credentials? How do you handle the security?
We actually have three articles that cover security here, so I would recommend, if you are getting to that point, just like what Joe was talking about, as you go into production and try to scale up, this often becomes a challenge, and we have some resources here. Another resource is RStudio Community. There's a special tag for databases, so if you have questions about how to do certain things with databases and R, you can post them here. Also, as you continue to learn and get better at it, you're more than welcome to post answers, because we appreciate how the community comes together and helps each other out, especially as we're learning what things we can do better to improve these interfaces we've been working on.
New features and the expanding ecosystem
In RStudio Connect, you're now able to set environment variables, and those environment variables are encrypted, both on disk and in memory. So if you have a situation where you need the credentials that a shiny app is using to be available in the published version, you can leverage that. That's another new one that we've started recommending a lot whenever there are challenges to using something like a DSN or Kerberos, for example.
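A sketch of what that looks like in code: the app reads its credentials from environment variables instead of hard-coding them, and Connect supplies those variables (stored encrypted) at run time. The DSN and variable names below are hypothetical, and the connection itself is only defined, not opened:

```r
# Hypothetical helper: credentials come from environment variables that
# RStudio Connect stores encrypted; nothing sensitive lives in the code.
connect_to_warehouse <- function() {
  DBI::dbConnect(
    odbc::odbc(),
    dsn = "warehouse",                  # hypothetical DSN name
    uid = Sys.getenv("WAREHOUSE_UID"),  # set in the Connect UI
    pwd = Sys.getenv("WAREHOUSE_PWD")
  )
}

# Locally, the same variables can be set in ~/.Renviron:
#   WAREHOUSE_UID=me
#   WAREHOUSE_PWD=secret
```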
Also, we've seen over the past year how folks who have adopted using dplyr with databases, letting it do the translation for you, have started wrapping the more complex stuff inside functions, using things like purrr, for example, and rlang. I'm very excited about that, because I was expecting it to take longer for us to get to this point than it has. So as you start practicing connecting with DBI, using ODBC, and using dplyr, there are definitely techniques there, and we're going to be expanding our articles on db.rstudio.com to cover this kind of thing.
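As a small illustration of that pattern, here is a sketch of wrapping a dplyr pipeline in a function with rlang's tidy eval; the helper name `top_counts` is our own, and the same function works on a local data frame or on a remote database table, where dplyr translates it to SQL:

```r
library(dplyr)
library(rlang)

# Reusable wrapper around a dplyr pipeline; callers pass a bare column name.
top_counts <- function(data, group_var, n = 10) {
  group_var <- enquo(group_var)          # capture the unquoted column
  data %>%
    count(!!group_var, sort = TRUE) %>%  # becomes GROUP BY ... ORDER BY on a DB
    head(n)                              # becomes LIMIT on a DB
}

top_counts(mtcars, cyl, n = 3)
```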
Another cool thing is that the universe is expanding. I did a quick analysis of CRAN packages, looking at all the packages that depend on DBI or dbplyr, or that say something about databases or queries in their title or description, and ended up finding 61 packages. 61 packages that essentially give us more functionality, that add to the things we can do with databases. So we have things here like connections, for example, we have RPostgres, we have dbparser, we have dbplyr.teradata, so we have some vendor-specific ones as well as ones that extend functionality, such as tidypredict and dbplot, for example.
Another thing I wanted to show: of those 61 packages, I plotted when the last update was done on CRAN, and I was surprised that more than half have been updated since last year, and 87 percent have been updated since 2017. I don't know if you've had the same situation as I have, where you find a group of packages that you really like and then find out the last time they were updated was 2001. Like, oh, that may not work out for me. So I'm very excited about that too.
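The talk doesn't show the exact query behind this scan, but something along these lines can be reproduced with the CRAN package database; the filter below is our own approximation, and the network-dependent part is left commented out:

```r
# Our approximation of the package scan: match packages whose metadata
# mentions databases/queries. Pulling the CRAN database needs internet
# access, so that step is shown but not executed here.
mentions_db <- function(x) {
  grepl("database|query", x, ignore.case = TRUE)
}

# pkgs <- tools::CRAN_package_db()
# hits <- subset(
#   pkgs,
#   grepl("\\b(DBI|dbplyr)\\b", paste(Depends, Imports)) |
#     mentions_db(Title) | mentions_db(Description)
# )
# nrow(hits)   # the scan described in the talk found 61 packages
```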
Live demo: exploring Austin 311 data
So, let's do a quick demo. I want to show some techniques for connecting to, exploring, and visualizing data that lives in a database. We're going to use several packages I have listed here, such as dbplyr, as we talked about. We're going to use leaflet, and also a little bit of purrr and rlang. For the data, I figured that since we're in Austin, we could use Austin open data. It's data available on BigQuery that shows where different 311 calls are. You have 9-1-1 calls for emergency police; 3-1-1 is like the non-emergency police line. And we'll look at that data here.
So, the first thing I'm going to do is connect to BigQuery. I'm very excited about this one because the data, again, is public. Usually the data I demo is data I have on my laptop, on a SQL Server, stuff like that. This one is something that, once you have the code, you can try out yourselves. So we're going to create a pointer to the 311 service requests and use glimpse. Through dbplyr, this runs inside the database and brings back essentially the same information you would get if the data were local. But it's obviously not on my laptop.
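The connection code itself isn't read out in the transcript; a sketch along these lines, using the public Austin 311 dataset on BigQuery via the bigrquery package, should be close. The billing project is a placeholder, and opening the connection triggers Google authentication, so the helper is only defined here, not run:

```r
# Hypothetical helper for the demo's connection. The Austin 311 data lives
# in BigQuery's public datasets; `billing_project` is whatever Google Cloud
# project you own (placeholder below).
make_bq_connection <- function(billing_project) {
  DBI::dbConnect(
    bigrquery::bigquery(),
    project = "bigquery-public-data",
    dataset = "austin_311",
    billing = billing_project
  )
}

# con     <- make_bq_connection("my-gcp-project")
# tbl_311 <- dplyr::tbl(con, "311_service_requests")
# dplyr::glimpse(tbl_311)   # column preview streams back; data stays remote
```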
So, I'm getting all the information here. You can see we have the complaint descriptions, right? We have dead birds, loose dogs, animal-service stuff. Also when each call was opened, when it was closed, and, most importantly for what we're going to do today, it's geocoded, so we can look at the longitude and latitude. Taking a quick look at how big this data is: it's not huge, right? It's 640,000 records. But again, I don't have any of that on my laptop. Everything is running against the BigQuery table.
Here I want to see the top ten most-used complaint descriptions. You'll notice that it's running against BigQuery and it's telling me how many records it's bringing back. It's only bringing back ten records, so that's all that went back and forth over the internet, and I'm analyzing 600,000 records live here, right? Immediately I can see the top ten with no problems. That's like the coolest thing ever, right? Because I'm using R Markdown here to publish all this information, all this analysis, without having to preload all this data.
I'm going to do some data cleanup real quick, because some records weren't properly geocoded. First, I'm going to take just the loose-dog complaints and see where they are located. I'm going to use dbplot. Instead of doing a dot plot, I'm going to do a raster plot, so I don't have to bring back, I think it's 41,000 records, to R and then have a bunch of dots that are hard to see. With a raster it's far fewer results. But as you can see here, this doesn't say anything, right? Because we have no frame of reference for where these dots are located in real life.
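A runnable sketch of that raster step with dbplot; random points stand in for the remote table here, but with a real database connection the same pipeline runs as SQL and only one row per filled grid square comes back:

```r
library(dplyr)
library(dbplot)

# Random points stand in for the remote 311 table in this sketch.
set.seed(1)
calls <- data.frame(
  longitude = runif(500, -97.9, -97.6),
  latitude  = runif(500,  30.1,  30.5)
)

# Bin into a grid and count per cell; on a DB back end this is computed
# remotely and returned as one row per populated square.
p <- calls %>%
  dbplot_raster(longitude, latitude, resolution = 20)
p   # a ggplot object: one shaded tile per populated grid square
```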
So, what we'll do is use a function called db_compute_raster() that dbplot has. What that does is run the same calculations that we did for the raster, but instead of giving me a ggplot object, it gives me a data frame, right? A table. What's cool is that, again, every time I'm just calculating against the database, and I'm only bringing back 484 rows instead of the 41,000 rows. That's all that's been calculated. Since I'm going to use leaflet, I want to draw the squares, and as you can see here, I can't draw a square with just the two points of reference I have. So I'm going to create a quick function, and as you can see, I'm using rlang to do that. You don't need to create a huge package to start learning tidy eval. Tidy eval is your friend, trust me.
I'm going to run the locations through the new function, and one thing before I forget about this function: what's neat is that I can take locations and pipe it directly into my new tidy eval function, say size, and my autocomplete still works, right? I can start typing latitude and it works, just like if you were using a built-in tidyverse function. It makes it easily pipeable for you, and that's one of the big advantages you gain by doing it this way. Then we're going to create our squares. Now you can see that we went from just one point for each longitude and latitude to the squares themselves, which I can draw.
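The function's code isn't shown in the transcript; a helper in that spirit, using rlang so callers can pass bare column names, might look like this (the name `as_squares` and the `size` default are our own):

```r
library(dplyr)
library(rlang)

# Given the lower-left corners that db_compute_raster() returns, add the
# opposite corners so each bin can be drawn as a rectangle on a map.
as_squares <- function(data, x, y, size = 0.01) {
  x <- enquo(x)   # capture the unquoted column names
  y <- enquo(y)
  data %>%
    mutate(
      lon2 = !!x + size,  # opposite corner, `size` degrees away
      lat2 = !!y + size
    )
}

locations <- data.frame(longitude = -97.74, latitude = 30.27, n = 12)
squares <- locations %>% as_squares(longitude, latitude, size = 0.05)
```

Because the function is built on dplyr verbs, piping and autocomplete work the same as with any tidyverse function, which is the point made above.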
Next, I'm going to use leaflet to see where they are. So, much easier. But again, now we know where they are, but we don't know where the concentration of loose-dog calls is. The dbplot function returns the number of records concentrated in each square, so what we'll do is make the squares with the most calls the reddest, and the ones with fewer calls a lighter red. That makes it visually easier to see where the calls are, right? So we're making it fancy now. On top of that, we also added some pop-ups, so I can click here and see that there are 1,200 calls in this section of Austin, and 128 over here. You see, it's a lot easier to navigate.
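A sketch of that leaflet step, shading each square by its call count; the column names and sample values below are assumptions about the shape of the computed squares:

```r
library(leaflet)

# Assumed shape: lower-left and upper-right corners plus a count per square.
squares <- data.frame(
  longitude = c(-97.74, -97.70), latitude = c(30.27, 30.30),
  lon2      = c(-97.69, -97.65), lat2     = c(30.32, 30.35),
  n = c(1200, 128)
)

m <- leaflet(squares) %>%
  addTiles() %>%
  addRectangles(
    lng1 = ~longitude, lat1 = ~latitude,
    lng2 = ~lon2,      lat2 = ~lat2,
    stroke = FALSE, fillColor = "red",
    fillOpacity = ~n / max(n),     # busiest square reads the deepest red
    popup = ~paste(n, "calls")     # click a square to see its count
  )
m
```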
This section is so red that I can't even see what area of the map it is. But it's very cool that I'm now able to do this analysis, again, against the database. I haven't brought anything back. So let's make it even fancier and add a dot for where we are, right? This is where we're at right now. Hopefully Google Maps gave me the right geolocation, so if you're a local and I'm not 100 percent correct, please don't blame me. All right, so we see here where we are and where the concentration is. We can feel safe that we're apparently not going to see too many loose dogs.
Running K-means inside the database with modelDB
With modeldb, which is a new package from 2018, you can actually run some models inside the database. One of them is k-means. So let me switch here. What it's doing right now is taking all the longitudes and latitudes and going through its iterations, checking whether the average is less than the previous run. Basically what k-means does in R, but it's doing it against the database, running SQL statements until it finds centers that are optimal for the calculations. In this case it's going to run, I think, 17 cycles, and then we'll be able to see it here. So this should be the last one. Now I have what it considers an optimal center for each group, and it gives me three centers.
That's the default. You can see the actual SQL statement, and you'll see the numbers from the calculations here. So let me run this quickly. This is the one. Now I can place the squares of my three segments against the map. Again, I'm going back and forth with the database, overlaying where my squares are. And then this last one, hopefully we have enough time. This is the one that takes a bit longer, because I'm using a purrr command to calculate the raster for each of the centers and give each a different color, so you can see, by color, where the calls for each segment are, right?
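The call itself isn't shown in the transcript; a sketch with modeldb's simple_kmeans_db(), run here on a local data frame so the example executes (against a database back end the same code issues SQL), might look like this:

```r
library(dplyr)
library(modeldb)

# Two loose clusters of points stand in for the remote table of call
# coordinates in this sketch.
set.seed(42)
pts <- data.frame(
  longitude = c(rnorm(50, -97.75, 0.01), rnorm(50, -97.65, 0.01)),
  latitude  = c(rnorm(50,  30.25, 0.01), rnorm(50,  30.35, 0.01))
)

# Iterates until the centers settle; three centers is the default the
# talk mentions, so no `centers` argument is passed here.
km <- pts %>%
  simple_kmeans_db(longitude, latitude)
```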
So everything I did, I did against the database, and I'm able to have all my results in R while all the computation, the modeling, and the preparation of the plots were done remotely. This is the kind of thing we want to continue building on, and obviously to keep improving the translations we have today. So, all right, thank you.

