Resources

Edgar Ruiz - Easing the pain of connecting to databases

Overview of the current and planned work to make it easier to connect to databases. We will review packages such as odbc and dbplyr, as well as the documentation found on our Solutions site (https://solutions.posit.co/connections/db/databases), which will soon include the best practices we find on how to connect to these vendors via Python. Talk by Edgar Ruiz

Oct 31, 2024
17 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Well, thank you everyone. I appreciate everybody being here. My name is Edgar. I am a software engineer at Posit. I've been with Posit for several years, and I'm very excited to see all of you today.

Usually I talk about work that I've been doing, but this time around I'm going to be talking about work that others have been doing inside Posit. So I'll be standing on the shoulders of giants. Some of them are here today, so I really appreciate Joe's work.

Improvements to the odbc package

The first thing I want to talk about regarding making it easier for us to work with databases is a set of improvements to the odbc package, improvements that are now available on CRAN and that you can take advantage of today.

So the first improvement that I want to talk about, other than of course this really cool logo that we just started adding, is this new thing in odbc: before, whenever you were trying to connect to Databricks through odbc, you needed about 10 arguments to do so.

This was funny because I personally had to go and try to connect so that we could document it on our Solutions site. We figured out that we needed several different arguments: first of all, the odbc() call that we would pass to DBI; then the driver, because we need to tell odbc where the driver is; the host, meaning the location of our workspace in Databricks; and also the token that we're going to use, the PAT, which stands for personal access token.

Also the SQL warehouse location inside Databricks, the port that we're going to use, the authentication mechanism, the protocol, the Thrift transport, the SSL setting, and the UID, which is weird because with this authentication mechanism the UID is literally the word token, but anyway, they want all of these. In order to connect, we had to supply all of them, and as users we just get used to it. I am sure it happens to all of us: we connect one time, and when we start a new project, we just copy the old code, put it in the new project, and continue.

And what happens after that? Well, we share that code with somebody who doesn't have our same setup. And guess what? It doesn't work. And that's where we start seeing the pain.

So now we have a new function inside odbc called databricks(). It sets sensible defaults for us, and it lets us deal with only one of the arguments we needed to pass: the SQL warehouse path inside Databricks.
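As a minimal sketch, the new call looks something like this; the warehouse path is a placeholder, and the example assumes the workspace host is picked up from the environment (for example a DATABRICKS_HOST variable), so none of these values are real:

```r
library(DBI)

# Sketch: odbc::databricks() fills in the driver, host, port,
# authentication mechanism, and the rest from sensible defaults.
# The httpPath value below is a placeholder, not a real warehouse.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>"
)
```

Compare that single argument with the roughly ten arguments the old odbc::odbc() call required.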

This is a big departure from how we have usually dealt with odbc. Usually, odbc was sort of a behind-the-curtain kind of package that we would only deal with if we really needed to. The only things it was meant to do were to talk to the driver and handle the type conversions, right? From what your database has to R: if you have a character, an integer, a double, how are you going to deal with all that? That's all inside odbc, which adapts itself based on the type of connection that you have, say Oracle, Teradata, and such.

But this time around, we're extending odbc so that now, instead of odbc(), you just call databricks(). This is work that Aaron Jacobs did, and it was just outstanding, because it makes it so much easier for us to connect.

And of course, once we figured out we could do this, that passing that function and that path would get us connected, we knew there are going to be locations on our laptops, in our systems, where drivers most commonly live. So not only does it set sensible defaults, it is actually also smart enough to look for the driver in those default locations.

The other thing that it does is look for the environment variables that you have set up on your machine, and from those it decides what kind of authentication method you are using and just uses that. So if you have a PAT, your personal access token, it'll use that. But if it detects that you're on Workbench and you have the Databricks integration set up, it'll just use that instead. You don't need to tell it anything else. It'll just connect.

It'll pass your credentials and then use those defaults for the other arguments, including the protocol, the SSL setting, and all the other stuff we were talking about. For example, the UID changes, right? It's not token anymore if you're on Workbench, for example. It does all that for you automatically. And of course, any of these arguments can be overridden by you. So let's say the driver you're testing is maybe in a different spot. Well, you still use odbc::databricks() and just override the driver location. It won't send you back to having to use odbc::odbc() again, which is amazing.
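Overriding just one of the defaults might look like this sketch; both the warehouse path and the driver path are hypothetical placeholders:

```r
library(DBI)

# Sketch: keep all the sensible defaults, but point at a custom
# driver build in a non-standard location. Both paths below are
# placeholders for illustration only.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>",
  driver   = "/opt/drivers/libsparkodbc_sb64.so"
)
```

Everything you don't override keeps its detected or default value, which is the point: you never have to fall back to spelling out the full odbc::odbc() argument list.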

This is another thing that's cool. If you attended the keynote, James touched on that particular function; that's why it sounds familiar. But one of the things that, because of time, I'm sure wasn't really mentioned is that on Mac there was actually an issue that has to do with the driver needing a specific entry in an INI file. If it's not there, it's not going to work. So as part of the documentation, we added all the steps you needed to follow, like you see right here, in order to fix that.

Well, the new odbc::databricks() function will automatically do all of that for us. It won't install unixODBC for you, of course, but everything else it'll do automatically. It'll make the entry that you need and connect you. It's like you don't even have to think about it.

That's why, if you have used this databricks() function recently and you hadn't used it before, you didn't suffer on Mac; that's because Aaron made sure it was something you didn't have to worry about. So you are one of the blessed ones.

Snowflake support in odbc

Another function that was added, by Simon, Aaron, and Hadley rather, is the snowflake() function, which, as you would imagine, helps you connect to Snowflake, which is also amazing, right? Now we've opened the door. It's not only Databricks anymore; we can expand this to other vendors.

Snowflake is the same idea: before, you needed several arguments, and now, if you have all the environment variables set up, you probably only need to call the function. You say dbConnect() and odbc::snowflake() and you're good to go. Again, it's really cool because this allows us to be very flexible in how you connect.

Because, for example, if you were connecting through the browser-based SSO that Snowflake has, and you're on your desktop, it will work for you, as well as reverting back to the more traditional username and password. And you can just adapt it, right? Again, you won't go back to odbc::odbc(); you just add, or override rather, those two entries.
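A hedged sketch of both paths; the zero-argument call assumes the relevant Snowflake environment variables are already set, and the username and password in the second call are placeholders:

```r
library(DBI)

# Sketch 1: environment variables (and, on the desktop,
# browser-based SSO) do all the work.
con <- dbConnect(odbc::snowflake())

# Sketch 2: revert to the traditional username and password by
# overriding just those two entries (placeholder values).
con <- dbConnect(
  odbc::snowflake(),
  uid = "my_user",
  pwd = "my_password"
)
```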

So snowflake() does the same thing. It does look for the driver; the automatic driver lookup works on Linux and Mac machines. It also determines the authentication that you're using and sets the right defaults that you need, so you don't even need to think about them. It passes your credentials, and as you saw earlier, you don't need to worry about having to revert back. You just use snowflake() or databricks() and can override anything.

Database connections in Positron

The other thing I want to talk about is database connections in Positron. I'm actually really excited about this. Whenever we were talking about this new IDE internally at Posit, I was like, we need the database connections pane. Please let us have it in version one, if we can. And I was really excited when that was decided on.

So as you recall, we have the RStudio Connections pane. And "the beloved" is the word I use when I type about it; it just types itself. "The beloved" just came out; like, I didn't think about it. When I saw it, I had already written it. I've been using RStudio for so long. And when it came out, it actually came out first, if I'm correct, with the sparklyr package, and when I saw that, it just blew my mind.

And then when it was expanded to the other databases through dbplyr and DBI, it was just amazing, right? Because I used to have, you know, three screens: I had RStudio, I had Microsoft SQL Studio, and I had Outlook open 24/7 when I was at work. And now I can merge it into one, right? Because I don't have to worry about going back and asking, okay, what's the name of the table again? What fields does it have? I can do it all inside the RStudio IDE.

And I just thought that was such a big productivity gain that it, like, blew my mind. And it did change my productivity; I didn't have to think about it after getting used to it. Because I'm able to navigate right through the catalogs, the schemas, and the tables that I have access to. When I make the R connection to the database, it is automatically there. And also, as you recall, you can preview the names of the fields and their types, and I can also preview data.

And, of course, that's now available in Positron, right? To me, this is great. The actual look of it is amazing; it's very well integrated with the look of VS Code and Positron. And we have the same functionality, day one, right? All this knowledge that we have gathered over time with RStudio is now being implemented rather quickly in Positron. It's available in the public beta. Same idea: we can navigate through the database structure, we can preview the field names, and, like I said, it has a nice integrated look.

Remember the preview of the first 1,000 rows in RStudio? Well, we can do the same thing in Positron. But in Positron, we get this super cool thing where it runs some analysis, some stats, on the 1,000 rows it's previewing and gives you an idea of how well populated each column is, along with, of course, all the names and the different types.

Before, you had to scroll to the side and look at the names, right? And kind of infer the types. Here, you just see the types listed. So it's just amazing. It also tells you how many NAs there are, how many null values. And you see that in the inline graph; the top one, for example, kind of hard to read, says 7%. Here's a closer look where you can see it.

So, this is where you can see it, where you can click on it and go see just the data. And the other very cool thing is that it works with Python, right? Before, if you wanted to do some Python work and some database work in RStudio, you had to go through reticulate and do all those sorts of things. But because of how Positron works, and how Daniel actually coded this, it's basically available for both.

If you connect to the database via SQLAlchemy, it will create the same type of connection. So, here you see an example where I have connected to that same Databricks cluster, or excuse me, SQL warehouse. I created an engine, you see that? And that engine object that I created with my code will be cataloged in our variables pane as an engine, and it will be given a specific icon.

Once you click on that icon, you'll see a new connection in your Connections pane that has the SQLAlchemy prefix and the type of connection that it is. That's all you have to do: you connect, you click on that icon, and now you see it, which I thought is just so great.
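The Python side might look something like this sketch; it assumes the databricks-sql-connector package, which registers a "databricks" SQLAlchemy dialect, and the hostname, token, and warehouse path are all placeholders rather than real values:

```python
from sqlalchemy import create_engine

# Sketch: build an engine against a Databricks SQL warehouse.
# All connection values below are placeholders; the "databricks"
# dialect is assumed to come from the databricks-sql-connector
# package being installed.
engine = create_engine(
    "databricks://token:<personal-access-token>@<workspace-host>"
    "?http_path=/sql/1.0/warehouses/<warehouse-id>"
)

# In Positron, this engine object shows up in the variables pane
# with its own icon; clicking the icon registers a SQLAlchemy
# entry in the Connections pane.
```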

Then you see the same idea, right? You can expand the catalogs, see the schemas, and get down to the tables. You can also click on a specific table and see the exact same things, right? So, imagine you're not even using R, you're just using Python, and you get the same goodies directly from Python.

Q&A

Yeah, so I'll start off with a couple of questions from the crowd.

Yeah, can you connect to remote databases using odbc via SSH Tunnel? My setup is that the database gives access to a static IP, which is also remote, and local user needs or has a private key to access the static IP.

So, I am not sure about that. We're talking about Positron being able to connect to a remote database using odbc through an SSH tunnel? I think that question disappeared on me; I don't think it would probably be from, like, your desktop. The idea is, I think there's a certain thing that has to do with SSH and Positron, so I may not be 100% sure on that, but we can surely test it if I can get some more details.

I mean, it might have also just been RStudio. Yeah, if it's RStudio and you have been able to connect before, you should be able to connect now. There are no changes on the RStudio side; it's just the Positron side.

Someone asked, where can I get an odbc sticker? I have some.

How are the 1,000 rows in the data preview chosen? Is it a random sample, are the first 1,000 rows retrieved, or something else? No, that's a great question. It does a call of just "send me the first 1,000," so it's whatever the database sends back, as if you were running SELECT * FROM table LIMIT 1000; that's what it will return.

Would the odbc update and the authentication methods work with tokens or passwords that change daily, CyberArk, for instance? That's a great question. It should, but you would have to get the updated token; you have to enter it yourself.

Okay, the dev work has odbc::databricks(), versus a DB-specific package like RPostgres. Can you talk about why the work went into odbc rather than incorporating the driver bindings into a dedicated package? Oh, no, that's a great question. I actually prefer it in odbc just because that way you don't have to think about another package. Plus, there wasn't really a case to create a whole new package for something that just makes it easier to pass connection arguments, unlike RPostgres, which has other goodies that cannot be coded as a simple driver.