Live Q&A following Workflow Demo - June 26th
Please join us for the live Q&A session for the June 26th Workflow Demo. This will start immediately following the demo.

Predicting Lending Rates with Databricks, tidymodels, and Posit Team
Anonymous questions: https://pos.it/demo-questions
Demo: 11 am ET [Happening here! https://youtu.be/qIzKJKcmh-s?feature=shared]
Q&A: ~11:35 am ET
Transcript
This transcript was generated automatically and may contain errors.
Hey, Nick and Garrett, thank you all so much for joining us today, and thank you to Garrett for such a great session. As a reminder, we host these Workflow Demos the last Wednesday of every month and they're all recorded, so there are actually 15 different workflows now shared to the Posit website and the Posit YouTube channel, where you all saw this video.
So from building a model annotation tool to pins, workflows, to deploying APIs, I'll share that full playlist with you here in the chat in just a second.
But while I know many people joining today are current customers, if you are new to Posit team and would like to try it out for free or want to chat more with our team about the Posit and Databricks integration, we'd love to set up some time to chat. I know some of the questions here are hard to answer through just the one question in the chat so if you want to schedule more time with us, let me share this with you as well.
But before jumping into our Q&A here, I do also want to add that if you're interested in diving deeper into using Databricks with R, Edgar from our team is leading a workshop at posit::conf(2024) in August in Seattle, and you can learn more about the workshops on our website. I'm trying to share the links as I talk as well, but here's a great post that gives more details on all the workshops too.
Introductions
So jumping back over to our Q&A, thank you so much Garrett, Isabella, Ryan, and Nick for joining us today. Do we want to go around the room and maybe do some brief introductions and say hi to everybody?
Ryan, you want to introduce yourself first? Yeah, absolutely. So hey everybody, my name is Ryan. I'm a data science advisor here at Posit so I just help folks get used to using our professional tools and a lot of times that means getting used to our open source tools as well. So I do a lot of trainings, webinars, workshops. I've hosted a few of these sessions in the past. I'll also be doing a workshop at Conference so if you want to learn more about how to use Posit's tools with a little bit more of an R focus, I'll be hosting one of those workshops as well. And I'll hand it over to Isabella.
Hi everyone, I'm Isabella Velazquez. I am on the developer relations team at Posit. I really enjoy talking about R, Python, and all the suite of Posit tools. I won't be having a workshop at Conf but I will be there so looking forward to meeting you. I'll pass it to Nick.
Hey everybody, name's Nick Pelican. I'm a senior solution architect at Posit. What I do is help our customers understand how Posit's products can work best with their enterprise data stores, things like Databricks. I've been at Posit for about a year. Before that, I spent about 10 years working with Databricks in various capacities: data engineer, data scientist. So happy to answer any of your questions there. I will also be at Conf. I am TAing Hadley's workshop and I'm also giving a talk on data contracts. So please come.
Awesome. And Garrett, I know you introduced yourself in the beginning, but you want to introduce yourself again here. Yeah, hi everyone. I'm Garrett. I'm on the developer relations team here at Posit and I co-developed that demo that we went through with Isabella. And like Ryan, I will also be teaching a workshop at Posit Conf on getting started with using Shiny for Python. So if that interests you, I hope to see you there.
Awesome. And I realized I forgot to introduce myself when I started here too. So hi, I'm Rachel Dempsey. I lead customer marketing at Posit. And so I host a variety of different community events and love to get people together. So I'll do a quick shout out for one of those other opportunities. Every Thursday at noon Eastern time, I host a data science hangout where we're joined by a different featured leader from the community to answer questions from you all. And I'd love to see you there too, if you want to join us.
Why use RStudio or VS Code instead of Databricks notebooks?
So let's jump over into our questions. Back in the YouTube chat during the demo, there were a few questions that touched on this topic: why don't you use Databricks notebooks to do all the analysis and avoid the process of connecting to Databricks from RStudio? Ryan, I don't know if you want to get started on that one.
Yeah, I can at least touch upon it, but I'll certainly defer to other folks here on the call. It's really about using the development environment that you enjoy. Now, if the notebooks built within Databricks are what you prefer to use, we certainly aren't going to tell you not to use those; go for it. But we know so many folks have learned to love the RStudio IDE and love to use Visual Studio Code, especially now that we have those packaged within Posit Workbench. We just want to make sure that you have the ability to access the data and continue those analyses within the RStudio IDE or VS Code in Posit Workbench. So it's really about being able to use the IDEs that you love while being able to pull in the data from Databricks.
Yeah, I'll 100% echo what Ryan said. I totally agree. It's all about using the tool that you feel most comfortable in. One of the great things about using Posit Tools with Databricks is that you get access to all of the power of RStudio, if you're familiar with that, or especially I'm a Python dev. So for me, VS Code, very near and dear to my heart. The VS Code on Posit Workbench, integrating with Databricks is fantastic because you have access to all of the extensions that make VS Code so great.
ODBC vs. sparklyr for remote clusters
So another question that was asked a little bit earlier in the demo was: don't you have to set the engine to sparklyr or something for running in a remote cluster? And Nick, I think you maybe want to be the one to answer this one.
Absolutely. So taking a look at that demo: what you saw Garrett using there is the odbc package. With the odbc package, you have two different options. You can run against Databricks clusters, that is, Databricks general purpose compute; what's happening there is the ODBC connector is just sending Spark SQL to the cluster, and that's it. You can also use the ODBC connector to connect to Databricks SQL warehouses, the Databricks SQL-only compute. sparklyr is really only required if you want access to the full supported Spark API, all of the things that you can do with Spark outside of Spark SQL: things like ML model fitting, accessing external data stores, all those things that Spark SQL doesn't necessarily do. That's where sparklyr comes in.
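For reference, the two routes Nick describes can be sketched roughly like this in R. This is a hedged sketch, not code from the demo: the `httpPath` and `cluster_id` values are placeholders, and it assumes the odbc, DBI, and sparklyr (with pysparklyr) packages are installed and that `DATABRICKS_HOST`/`DATABRICKS_TOKEN` are set in the environment.

```r
library(DBI)

# Route 1: ODBC. Sends Spark SQL to a cluster or SQL warehouse.
# odbc::databricks() uses the Databricks ODBC driver and reads
# DATABRICKS_HOST / DATABRICKS_TOKEN from the environment.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>"  # placeholder path
)

# Route 2: sparklyr via Databricks Connect. Needed for the wider
# Spark API (ML model fitting, external data stores, etc.).
library(sparklyr)
sc <- spark_connect(
  method     = "databricks_connect",
  cluster_id = "<cluster-id>"  # placeholder
)
```

With the first connection you work through DBI/dbplyr and everything becomes Spark SQL; with the second you get `ml_*()` functions and the rest of the Spark API.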
Data exploration with remote databases
Thank you. And I know a few people jumped over when we first opened up the Q&A, so I just want to remind everybody that you can ask questions anonymously using the Slido link I shared in the chat, but you can also just post them right into the YouTube chat here. I see Ben did that, so let me pull one of Ben's questions over: do you have any recommendations for data exploration when connecting to a database like Databricks? For example, I love using skimr, but I need to collect the data first.
Garrett, you want to take a first stab at that one? Yeah, I wish I had a better answer for that, and maybe someone else here does, but I think you just need to make a strategy and use tidyverse code sent over to make your own counts. I'm not sure if dbplyr handles the summary function, but basically you'd recreate that yourself.
Yeah, I can't say anything better than that. One thing, if you don't know what's inside the Databricks database: the Connections pane, and the fact that it shows you what's there, is a really helpful way to get started. You can use that pane to learn a little bit about the datasets, like the types of the data and the column names and that sort of thing. Another thought, as shown in the workflow: if there are any dbplyr verbs that you can apply before you collect the data, go ahead and do that first, and then continue your analysis after you've done the collect.
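A minimal sketch of that "summarize remotely, collect the result" pattern might look like the following. The table and column names here are hypothetical, and it assumes a working `odbc::databricks()` connection; the point is that dbplyr translates the pipeline to Spark SQL so only the small summary crosses the wire.

```r
library(DBI)
library(dplyr)

con <- dbConnect(odbc::databricks(), httpPath = "<http-path>")  # placeholder

# Lazy reference to a remote table (names are hypothetical).
lending <- tbl(con, dbplyr::in_catalog("samples", "default", "loans"))

lending |>
  summarise(
    n_rows   = n(),
    mean_amt = mean(loan_amount, na.rm = TRUE),
    min_amt  = min(loan_amount, na.rm = TRUE),
    max_amt  = max(loan_amount, na.rm = TRUE)
  ) |>
  collect()  # only the one-row summary is pulled into R
```

The same idea extends to `count()` per column for quick skimr-style overviews, computed on the Databricks side before collecting.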
Model training: Databricks cluster vs. Posit Workbench
And I know, Nick, you had answered SegP's question before, and apologies for mispronouncing your name, but there was a follow-up there: so then the model was trained in the Databricks cluster?
Yes. Yeah. So if you're using any of the Spark ML APIs, that trains the model in the Databricks cluster. But in this demo, we weren't training in the cluster; we were just pulling the data over to R and keeping things there. Exactly. And that's really showing you the power of Workbench, because Spark ML, if you've used it, is kind of a limited set of machine learning models. One of the great things about using Posit Workbench with Databricks is that if you need to train something that isn't supported in the Spark ML API, you can 100% do that locally on your Posit Workbench cluster.
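The "pull the data over and train in R" pattern can be sketched like this. It's a hedged example, not the demo's actual code: the table name, row limit, and model formula are all hypothetical, and `con` is assumed to be an existing `odbc::databricks()` connection.

```r
library(dplyr)
library(parsnip)  # part of tidymodels

# Pull a manageable subset from Databricks into an R data frame.
loans <- tbl(con, "loans") |>   # hypothetical table; con from odbc::databricks()
  head(10000) |>                # trim before crossing the wire
  collect()

# Fit locally on Posit Workbench; Spark is not involved from here on,
# so any model tidymodels supports is available.
model_fit <- linear_reg() |>
  set_engine("lm") |>
  fit(rate ~ amount + term, data = loans)  # hypothetical columns
```

Once the data frame is local, you're no longer limited to what Spark ML supports.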
ODBC and dbplyr: local vs. Databricks
Thank you. I'm going to jump over to Slido and see if there's some questions I missed over there. And there's one from Gabe that was, what is the difference between connecting with ODBC and dbplyr to a local database versus database on Databricks? Is it just about the connection setup?
Well, I can say from the developer perspective, it is almost 100% about the connection setup. One of the things that's new for this demo is the `odbc::databricks()` function, which uses the Databricks driver so you can connect with ODBC. That one is specifically for Databricks; for other databases you'd use something else there. Maybe Nick, with your experience with Databricks, would you know of things that we should be aware of?
Honestly, using the ODBC driver to connect to Databricks, whether that's Databricks clusters and general purpose compute or Databricks SQL warehouses, is just like connecting to something like Postgres or a DuckDB you have running locally. It's just changing some connection parameters. It's all SQL down at the bottom, and it works just like your regular SQL connection.
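To illustrate the point that only the connection changes, here is a hedged sketch: the same dplyr pipeline runs against a local DuckDB or against Databricks, with dbplyr translating it to each backend's SQL dialect. The table and column names are hypothetical and the `httpPath` is a placeholder.

```r
library(DBI)
library(dplyr)

con_local <- dbConnect(duckdb::duckdb())                          # local DuckDB
con_dbx   <- dbConnect(odbc::databricks(), httpPath = "<path>")   # Databricks

# The analysis code is backend-agnostic; only `con` differs.
grade_counts <- function(con) {
  tbl(con, "loans") |>   # hypothetical table name
    count(grade) |>
    collect()
}

# grade_counts(con_local) and grade_counts(con_dbx) run the same
# pipeline; dbplyr generates the appropriate SQL for each backend.
```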
RStudio IDE within Databricks
One other question, which goes back to like the different ways to use R or RStudio with Databricks and Shashir asks, I thought there's a way to use RStudio IDE within Databricks. Is that not the case any longer?
I can take this one. So Shashir, yes, there is still the possibility to run the RStudio IDE within a Databricks cluster. To give you the high level of that architecture: what you're doing there is taking the open source RStudio Server and deploying it onto the head node of your Databricks cluster. The issue is that to run that RStudio IDE, the Databricks cluster has to be running. So all of that Databricks compute, which as we all probably know can be pretty expensive and is really optimized for data manipulation, is now being used for coding. What that ends up meaning is that if you're just developing a Shiny app or writing some documentation in Quarto, you're incurring Databricks compute costs to do that development work. So it's really not a very cost effective solution. What we did as part of this partnership with Databricks is build connectors so that you can have Posit Workbench running in a much more cost optimized environment while connecting to Databricks compute. So you can use Posit Workbench for what it's best for, use Databricks for what it's best for, and cost optimize from there.
tidymodels support for remote clusters
Great. Thank you. One other question that just came into the chat is how is tidymodels supported for remote clusters?
I did want to mention, related to your last question, SegP: the one thing that is also available with sparklyr is the ability to create user defined functions. That gives you the ability to take R code and apply it within the clusters themselves, via what's called the `spark_apply()` function. So please keep an eye out on the Posit blog. We will have a blog post specifically using the same prediction example, but with sparklyr and user defined functions, to push that computation over to Spark as opposed to doing it locally or through a REST API.
And Isabella, I know you shared a blog post with me using the New York City taxi data. Is that the one you're talking about, or is that another example? This is a blog post that's coming out. Okay, not yet. Please keep an eye out, but I'm happy to share that blog post, which shows the setup example of using sparklyr and user defined functions, and also, as Garrett mentioned earlier, the ability to see the Databricks table within RStudio.
So if I could just clarify: until now, we would say tidymodels is R code that you run locally, and if you wanted to run your computations on the cluster, you'd use the ML functionality of Spark through sparklyr. But now there's the opportunity to create user defined functions, and we anticipate that's going to be the path going forward. So please stay tuned for that blog post that Isabella's writing, which will lay out the strategy for using tidymodels inside the remote cluster.
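A rough sketch of the `spark_apply()` idea mentioned above follows. This is an illustration under stated assumptions, not the forthcoming blog post's code: it uses `mtcars` as a stand-in dataset, a trivial `lm()` model in place of the tidymodels workflow, and assumes a sparklyr connection to Databricks (via Databricks Connect) with R available on the cluster workers.

```r
library(sparklyr)

sc <- spark_connect(
  method     = "databricks_connect",
  cluster_id = "<cluster-id>"  # placeholder
)

# Copy a small stand-in dataset to Spark.
cars_tbl <- sdf_copy_to(sc, mtcars[, c("mpg", "wt")], overwrite = TRUE)

# The function below runs inside each Spark partition, so R (and any
# packages it uses) must be available on the cluster workers.
scored <- spark_apply(
  cars_tbl,
  function(df) {
    df$pred <- predict(lm(mpg ~ wt, data = df), df)
    df
  },
  columns = "mpg double, wt double, pred double"  # output schema hint
)
```

In the real workflow, the per-partition function would carry the fitted tidymodels object and call `predict()` with it, pushing the scoring to Spark rather than collecting the data into R.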
Running models in Workbench vs. Databricks
Thank you. Okay, give me just one second to copy some questions over from Slido. Another question that just came in was: could you elaborate a bit more on the differences between running the models from within Workbench versus from within Databricks?
Well, I can paint a simple picture there. Databricks generally will have much more compute that runs much faster with much larger datasets, and the large datasets will be living in Databricks. So if you can run your compute there, you'll be paying Databricks for that compute, but you'll have a faster experience. You could also run your code in your Workbench session; that's what you'd do if you weren't connected to Databricks, and it might be a little slower. Generally, that's not something that's going to limit your work, but if you're working with R and you have the data inside of R, you are constrained to data that fits there. You can buy more server space for your Workbench, but if you think about the world generally, Databricks is going to have larger data and faster compute, and R is going to have more modest compute and more modest sized data. But the code that you write in R, people tend to think, is more intuitive and friendly. And to be honest, many modeling scenarios and use cases don't require the full dataset; people seem content to work with a smaller dataset or a sample of their data in R.
Yeah. And I'll just echo that from the Python side as well. We find that these development environments really give you the ability to iterate quickly. And because you're not constrained to what's supported by, let's say, the Spark API. Instead, you can test every model under the sun. You can do functions that may not necessarily be supported in Databricks. That's the beauty of Posit Workbench. There's no constraints. If it runs in R or Python, it'll run on Workbench. So that gives you the ability to really experiment, really fine tune down to the best candidate models. And then really, once you're ready to go to production, what I see a lot of folks doing is that's when they start training on the big data. That's when they start putting compute into Databricks. So you've got, again, the best of both worlds.
Managing packages on Posit Workbench
Thank you. I know we're a few minutes to the top of the hour here, but I see one other question we hadn't answered yet from Shashir. And it was, any thoughts on how to manage packages and package versions on Posit Workbench?
I mean, I can start a little bit, because with this question about how to manage packages on Posit Workbench: we do also have one of our other professional tools, Posit Package Manager, which simplifies a lot of the package management, exactly as its name implies. We also have a public instance of Package Manager, so you can install various packages from CRAN or PyPI. So there are methods and tools we've built out to help manage packages specifically within Posit Workbench. Now, obviously, if you're going to be running code on the Databricks cluster, there might be an added layer of complexity there, so maybe I'll defer to someone else on the call if they want to talk about that.
I know we're chatting in our private chat here about a few coming soon things. I'm not sure if we're allowed to say some of those coming soon things or not. Maybe I'll just say, stay tuned for some of the package management Databricks stuff.
Sounds good. Let me just do a quick check to see if we missed any questions here. I don't see any over on Slido. Let me do a quick check here in the chat as well. But just want to say thank you all so much for taking the time to join us today for the demo. I know sometimes when you're not in the room with us here live, it's kind of hard to ask follow-up questions. So if you do want to chat one-on-one with our team, I'm just going to share this link here again in the chat where you can just book a call with the Posit team. We'd love to chat more with you. But huge thank you to Garrett for leading the demo today and Isabella. I know you both worked on this demo together. And Ryan and Nick for jumping in for the Q&A as well. Great to get to hang out with all y'all and see you today. Have a great rest of the day.