Resources

April 30th Workflow Demo Live Q&A

Join us for a live Q&A immediately following the Workflow Demo happening on April 30th at 11am ET / 8am PT with Edgar Ruiz of Posit. The demo will be here first: https://youtu.be/ab4CIlafsbo?feature=shared The Q&A will start around 11:35 am ET / 8:35 am PT.

May 1, 2025
23 min


Transcript

This transcript was generated automatically and may contain errors.

If you just joined, I'm going to give people one more minute to jump over from the other YouTube room. Here we go. You can see some people coming over now. Let me just give people 30 more seconds here.

Okay. Numbers are starting to go up. Awesome. Let me bring Edgar here on stage as well. Hey, Edgar.

Okay. I'm just posting the Q&A in every possible place so people know where to go and find us. Awesome. Okay. Well, thank you all so much for joining us today, and thank you to Edgar for the great demo. As a reminder, we host these Workflow Demos the last Wednesday of every month, and they're all recorded. If you want to make sure that you added them all to your calendar, I can share it here in the chat with you. But as I mentioned, they're all recorded, so there's a full playlist of, I think, like 24 different workflows. I think this officially marks two years of Workflow Demos, which I can't believe. And so we'd love to have you join us the last Wednesday of every month.

And so to share a little bit about how you can ask questions today, you can put them right into the YouTube chat, but you can also ask questions on Slido. So if you want to ask anything anonymously, you can do so, and I'm writing in the chat as well, and Hannah can help me out over here on YouTube. But you can ask questions on Slido as well.

And so, Edgar, let's first introduce ourselves for everybody. I know you introduced yourself in the beginning, but I'm Rachel Dempsey. I lead customer marketing here at Posit, and so always looking for new ways to bring customers together and share different use cases and workflows with each other. I'm based in the Boston area. I'm actually at the Boston office today. Edgar, you want to introduce yourself?

Hi, yes. I'm Edgar. I'm on the open source team. I maintain the sparklyr package, and I also work on the backends for Databricks and pins, so a lot of that coding. Before that, I was a solutions engineer at Posit as well, so I got to speak with a lot of customers, so a lot of interaction like that. And before that, I was in the financial industry, so I come from that background too, and that's why I kind of used the lending data to help us with today's demonstration.

Posit Conf 2025

Well, I like to use every chance I can get to talk about Posit Conference as well, so I just want to remind everybody that Posit Conf 2025 is coming up in September, and we'd love to see you there. So, Edgar, I was curious, what are you most excited about for Posit Conf?

Oh, man, it's going to be great. Being able to see everyone, first of all, that's what always excites me the most, as well as we're going to have some great workshops, great talks. There's going to be a workshop that involves Databricks, and it's going to be led by James, so we're looking forward to that one too.

Pins versus database

So, let's dive into some questions about the demo. And, Edgar, I know you briefly mentioned this towards the beginning, but this is a common question I've heard from customers, so I was wondering, can you share a bit more about when to use pins versus database?

Yeah, of course. The main thing, like I mentioned during the demo, is to think in terms of how big your data is and how much of the data you're going to use. If it's a really large amount of data and you're only accessing part of it, say it's the sales data and you only want a specific month, a specific range of dates, not the whole thing, then it makes more sense to use a database versus a pin, where you would download everything. A pin, to me, is more for ancillary data that may later be promoted into a database.

But there are certain instances where you just want to keep the data in a pin because you've downloaded it or it changes a lot inside your team. An advantage inside the Databricks ecosystem is that it's not a big lift to choose between the two, right? Because a table is so easy to instantiate, especially if you have the rights to do so, it's really more about deciding what the best tool is for small data versus larger data, or for the kind of object, right? Because you cannot save a list object to a database unless you want to get complicated. Or a model, for example, like we showed in the demo. Those are objects that are better in a volume, which you could use pins for.
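To make that trade-off concrete, here's a minimal sketch of the two options in R. It assumes a Databricks workspace where you already have a Unity Catalog volume and a SQL warehouse; the catalog, schema, volume, and table names below are placeholders.

```r
library(pins)
library(dplyr)

# Non-tabular or small objects (a model, a list): pin them to a
# Unity Catalog volume.
board <- board_databricks(folder_url = "/Volumes/my_catalog/my_schema/my_volume")
model <- lm(mpg ~ wt, data = mtcars)
pin_write(board, model, name = "mpg-model")

# Large tables you only need a slice of: query in place over ODBC so
# the database does the filtering and only that slice travels down.
con <- DBI::dbConnect(odbc::databricks(), httpPath = Sys.getenv("DATABRICKS_SQL_PATH"))
sales_april <- tbl(con, I("my_catalog.my_schema.sales")) |>
  filter(sale_month == "2025-04") |>
  collect()
```

The `sales` table and `sale_month` column are hypothetical; the point is that the filter runs inside Databricks, while the pin downloads the whole object.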

I know this integration between pins and Databricks is pretty new, like a few months, right? Yeah. Okay, but how long have you been working on pins? Oh, man, since it started. I remember when Javier was talking about it. I know that Tareef, who's the president of the company, was talking about it too before that. So yeah, since the very beginning, I saw the first few drafts of it, and I really liked the concept of how we could use boards and pins to share data easily.

Because this has been an issue with, you know, customers or even personally, you know, when you try to share some data across teams or even across projects, you have to copy the data over and over again, and you end up with a bunch of versions of it. It's better that you have something centralized that is also versioned, like we discussed with pins. And that's what I think got me really excited about it.

Security of pins with Databricks

So I know, from a few years ago when I used to work in sales, that some customers had questions around the security of pins, and I was wondering, how is that addressed with Databricks?

Yeah, so Databricks has a really good authentication and security infrastructure that you can use. You can lock down to each object. And by object, I mean files, tables, anything that you have inside the Unity catalog. So it's really easy for you to make sure that only folks who are supposed to access it inside your environment in Databricks can access it.

Having said that, of course, any security system is going to be as secure as you make it. The best example I can give is, you could have the best lock on your door, the most secure lock. But if you forget to actually turn the key and lock it, it really doesn't do anything for you. So that's the other thing, right? The infrastructure provided by Databricks will make it secure, but it's also up to us to set up the proper permissions on top of it, so the folks who need access are able to get it, and those who aren't supposed to, can't.


Q&A: Pandas DataFrames, troubleshooting, and IDE choices

I see a question that just came in here, which was, Edgar, do Pandas DataFrames also automatically convert to R-friendly DataFrames if you go from Python to R?

I have not seen that happen. I actually avoided that when I was testing the Python side, but you can email me and I'll be able to confirm, because I know going between an R data frame and a pandas DataFrame isn't trivial. The thing is, when you use the Python version of pins, you usually have to establish what kind of file format you're going to save it as. I don't know if there's an actual binary format you would save the pandas DataFrame in, I would have to check, but typically you would say CSV or Arrow, and that is easily portable back into R.
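As a rough sketch of keeping pins portable across the two languages: when writing a pin, you can pick a language-neutral storage format rather than R's default `rds`. (This uses a temporary local board as a stand-in for a shared one.)

```r
library(pins)

board <- board_temp()

# "arrow" and "csv" are formats both the R and Python pins packages
# can read; the default "rds" is R-only.
pin_write(board, mtcars, name = "cars", type = "arrow")

# Reading it back in R returns a regular data frame; the Python pins
# package would hand the same pin back as a pandas DataFrame.
cars <- pin_read(board, "cars")
```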

Another question was, can you suggest some best practices for troubleshooting the connection to Databricks? Somebody said, when I try to pin a DataFrame over 1,000 rows to my Unity catalog, the process hangs and doesn't complete. I'm unsure if it's a limitation on my local machine or my Databricks instance.

That's a great problem to diagnose. One of the things I would say is to definitely check whether, let's say, the first 100 rows hang, and then try the next 200. We may be focusing on rows and not columns; there may also be an issue that the data is very wide, so it takes a lot of bandwidth. Another problem you may have is in the serialization of the object: there may be some bad data that isn't being converted properly, and that's what's hanging. So whenever a customer has this problem, I suggest that, first of all, you partition the data. Go 100 rows, then 200, then 300, and see where it fails. Let's say it fails at 500; then try to upload just rows 400 to 500 and see if that fails. Because if it does, that means there's some issue with your data, not with the amount. That's how I would start to diagnose it.
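Edgar's partitioning idea could be sketched like this, assuming `big_df` is the data frame that hangs and `board` is your already-configured Databricks board (both placeholders here):

```r
library(pins)

# Upload growing slices until one fails. If rows 1:400 succeed but
# 1:500 fails, retry just 401:500 -- a failure there points at bad
# values in that slice, not at the sheer amount of data.
try_slice <- function(board, df, rows, name = "diagnostic-pin") {
  tryCatch(
    { pin_write(board, df[rows, ], name = name); TRUE },
    error = function(e) FALSE
  )
}

for (end in seq(100, nrow(big_df), by = 100)) {
  if (!try_slice(board, big_df, 1:end)) {
    message("First failure within rows 1:", end)
    break
  }
}
```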

I don't know if it's an issue with anything on your local machine, as far as uploading. Unless you have a really locked-down machine by your company where you have to go to a proxy and things like that, that may be the problem. You should be able to ask the IT security team in your company to see if there's any limitations on that proxy, if you're going through that as well.

And also, as a Posit professional customer, you can also open a support ticket with our team too, so we can help dig into the issue a little bit deeper.

Okay. One other question was, in this imaginary example company, does everybody on your team use the same IDE? Usually not. That's very common. I would say, and Rachel would agree, that maybe five years ago everybody said RStudio and nothing else, right? But now we don't see that anymore. We see some folks on RStudio, some folks on Positron now, and, especially when it comes to Databricks, folks who like notebooks. So it's really rare now to see a homogeneous implementation, the same exact IDE across all teams.

That's one reason too, and not to get too commercial about it, but that's why Posit Workbench exists. And we acknowledge the fact that that doesn't happen anymore, right? Because we want to make sure that you're successful, as successful as possible. And that usually means too, giving you the best runway for you to be able to do whatever you need, which means also having an IDE that you feel comfortable in.

LLMs, AI, and new packages

So I know you've worked on a ton of different packages as well. So if people had other questions that weren't specific to Databricks and pins, what are things that you're excited to talk about?

Right now, this whole thing with LLMs and AI is pretty exciting. Not so much the coding assistant part that I think most of us have tied AI to, but more the data side. I feel like there's a lot of opportunity there. I've been working on a package called mall that lets you run LLM calls row-wise. Let's say you want to run an NLP operation like sentiment analysis: instead of spending a lot of time training a sentiment model, you can use an LLM to determine for you whether each row is positive, negative, or neutral. So it's really easy to run those analyses. I feel like that's very exciting for me.
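As a small illustration of the row-wise idea, here's roughly what a mall sentiment call looks like in R, assuming a local Ollama server with the llama3.2 model already pulled:

```r
library(mall)

# Point mall at a local LLM.
llm_use("ollama", "llama3.2", seed = 100)

reviews <- data.frame(
  review = c(
    "This has been the best TV I've ever used. Great screen!",
    "I regret buying this laptop. It is too slow."
  )
)

# Runs the LLM once per row and adds a .sentiment column
# (positive / negative / neutral) -- no model training required.
reviews |> llm_sentiment(review)
```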

We also have the lang package. It's a nascent package, but it translates R help files into a different human language, like from English to Spanish or French. Because English is my second language, it makes it easier for us to read the help in our own language. And the LLMs do a great job, especially the modern ones, so I feel like that implementation is really exciting when you look at the possibilities for it.

One of the remaining questions we had was, what is your favorite thing about working with Databricks?

Oh, honestly, what I really like is how complete the ecosystem is and how easy it is to use. Think in terms of, like, plain AWS, right? Whenever you needed a Spark cluster, you had to have an entire Hadoop cluster implementation with Spark on top, which meant an EMR cluster, and that took a while. You actually had to know how to do this thing to get it working. And then to set up a permanent table, you had to use an S3 bucket. There's a lot of work you have to do.

But with Databricks, everything is so easy to do. I feel like as a data scientist, it gives you those almost superpower-like capabilities at your fingertips, where you can have a Spark cluster and a volume running, so you have pins and a table ready for your data, within, you know, half an hour. You have everything set up without it being too complicated. Connect it to RStudio, to Positron, to your notebook, and get going analyzing large amounts of data. This is the dream we've had for years happening now, because Databricks keeps everything inside its ecosystem in a way that makes it very easy to set up.


Upcoming workflow demo and final questions

So the last Wednesday of May, May 28th, Joe Cheng, our CTO at Posit, is going to be joining us to talk about harnessing LLMs for data analysis. That will be the next workflow demo in the series. Just wanted to remind everybody that they come up the last Wednesday of every single month.

One specific question about pins was, do pins live outside version control like Git? Yeah, they do live outside of it, and I think they're two different concepts. Git is for tracking changes to your code, to the implementation. The versioning you have in pins is for tracking differences in the data itself. That's the main thing to keep in mind. If you want to use a specific version of a pin in your code, then the fact that you switched from one pin version to another is what you track in Git, while the versioning of the data itself lives in pins. So yeah, they're two different things, in my opinion.
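In code, the split Edgar describes looks something like this sketch: pins records the data versions, and the version string you pin your analysis to is the part that ends up committed in Git.

```r
library(pins)

board <- board_temp(versioned = TRUE)

pin_write(board, mtcars, name = "cars")           # data version 1
pin_write(board, head(mtcars, 10), name = "cars") # data version 2

# pins tracks the data's history...
pin_versions(board, "cars")

# ...and hard-coding a specific version string in your script is the
# change that Git then tracks as code.
v <- pin_versions(board, "cars")$version[1]
cars_fixed <- pin_read(board, "cars", version = v)
```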

And one other, I guess, not really a question, but a comment: Edgar, thanks for your work on the mall package. It was very useful when I was working in market research at a previous role. Thanks. Yeah, I'm so excited about that package. And it works in both R and Python. On the Python side, it works on Polars data frames. So you can use either one.

What's the best way to get started with that? Going to the GitHub page? Sorry, going to the website. The website itself is very complete; you can switch between the two languages, and it walks you through how to set it up. And I'm also very excited to share that, for that specific package, very soon I'm going to add an ellmer backend. So you'll be able to go beyond Ollama; you should be able to use anything that ellmer, and on the Python side chatlas, supports. That way, if you want to use, like, Claude or OpenAI to run the NLP, you'll be able to do that. So that's coming soon.

Well, we have, I think, time for two more questions here. I see one more that just popped in from Juan. And it is, does pins coexist with targets? When would one use one versus the other?

Yeah, they would coexist with targets. Let's say you take the example that we used today for the final workflow. You could orchestrate the update of the pin, excuse me, the update of the job run, like whenever you run the job, and then the update of the pin, and, let's say, an update to the notebook itself. You can orchestrate all of that with targets. So I think they're two complementary tools. I don't think they compete in any way that I can see.

Yeah, because pins acts more as a data source, or as a destination for targets, as far as saving an updated version of the pin back into it. It would have been cool to have a targets layer as part of the demo, but I just didn't have the time.
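A targets layer over the demo's workflow might look roughly like this `_targets.R` sketch, where `run_job()` and the board path are placeholders for the demo's actual job trigger and volume:

```r
# _targets.R
library(targets)

tar_option_set(packages = "pins")

list(
  # Step 1: trigger the upstream job (hypothetical helper).
  tar_target(job_result, run_job()),

  # Step 2: whenever the job output changes, write it back as a pin.
  tar_target(
    pin_update,
    pins::pin_write(
      pins::board_databricks("/Volumes/my_catalog/my_schema/my_volume"),
      job_result,
      name = "job-output"
    )
  )
)
```

Because targets tracks dependencies, `pin_update` only reruns when `job_result` actually changes, which matches the "pin as a destination" framing above.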

Okay, I think we covered all the questions. Awesome. Well, thank you so much, Edgar, for all the time that you put into preparing this demo and being here to answer questions from everybody. I'm sure we'll be sharing these resources more broadly as well across LinkedIn and follow up emails with everybody who joined us today. But if you'd like to suggest a workflow, I wanted to let you know, you can always reach out to me directly on LinkedIn. Just look up Rachel Dempsey at Posit, but you can also let us know in the comments below in YouTube too. Have a great rest of the day, everybody. Thank you so much for joining us. Bye.