Resources

April 30th Workflow Demo Live Q&A

Join us for a live Q&A immediately following the Workflow Demo happening on April 30th at 11am ET / 8am PT with Edgar Ruiz of Posit. The demo will be here first: https://youtu.be/ab4CIlafsbo?feature=shared The Q&A will start around 11:35 am ET / 8:35 am PT.

May 1, 2025
23 min


Transcript

This transcript was generated automatically and may contain errors.

If you just joined, I'm going to give people one more minute to jump over from the other YouTube room. Here we go. You can see some people coming over now. Let me just give people 30 more seconds here.

Okay. Numbers are starting to go up. Awesome. Let me bring Edgar here on stage as well. Hey, Edgar.

Okay. I'm just posting the Q&A in every possible place so people know where to go and find us. Awesome. Okay. Well, thank you all so much for joining us today, and thank you to Edgar for the great demo. As a reminder, we host these Workflow Demos the last Wednesday of every month, and they're all recorded. If you want to make sure that you added them all to your calendar, I can share it here in the chat with you. But as I mentioned, they're all recorded, so there's a full playlist of, I think, like 24 different workflows. I think this officially marks two years of Workflow Demos, which I can't believe. And so we'd love to have you join us the last Wednesday of every month.

And so to share a little bit about how you can ask questions today, you can put them right into the YouTube chat, but you can also ask questions on Slido. So if you want to ask anything anonymously, you can do so, and I'm writing in the chat as well, and Hannah can help me out over here on YouTube. But you can ask questions on Slido as well.

And so, Edgar, let's first introduce ourselves for everybody. I know you introduced yourself in the beginning, but I'm Rachel Dempsey. I lead customer marketing here at Posit, and so always looking for new ways to bring customers together and share different use cases and workflows with each other. I'm based in the Boston area. I'm actually at the Boston office today. Edgar, you want to introduce yourself?

Hi, yes. I'm Edgar. I'm on the open source team. I maintain the sparklyr package, and I also work on the backends for Databricks and pins, so a lot of that coding. Before that, I was a solutions engineer at Posit as well, so I got to speak with a lot of customers, so a lot of interaction like that. And before that, I was in the financial industry, so I come from that background too, and that's why I kind of used the lending data to help us with today's demonstration.

Posit Conf 2025

Well, I like to use every chance I can get to talk about Posit Conference as well, so I just want to remind everybody that Posit Conf 2025 is coming up in September, and we'd love to see you there. So, Edgar, I was curious, what are you most excited about for Posit Conf?

Oh, man, it's going to be great. Being able to see everyone, first of all, that's what always excites me the most, as well as we're going to have some great workshops, great talks. There's going to be a workshop that involves Databricks, and it's going to be led by James, so we're looking forward to that one too.

Pins versus database

So, let's dive into some questions about the demo. And, Edgar, I know you briefly mentioned this towards the beginning, but this is a common question I've heard from customers, so I was wondering, can you share a bit more about when to use pins versus database?

Yeah, of course. The main thing, like I mentioned during the demo, is to think in terms of how big your data is and how much of the data you're going to use. If it's a really large amount of data and you're only accessing part of it, say it's the sales data and you only want a specific month, a specific range of dates, not the whole thing, then it makes more sense to use a database versus a pin, where you would download everything. A pin, to me, is more for ancillary data that may later be promoted into a database.

But there are certain instances where you just want to keep the data in a pin because you've downloaded it or it changes a lot inside your team. An advantage inside the Databricks ecosystem is that it's not a big lift to choose between the two, right? Because a table is so easy to instantiate, especially if you have the rights to do so, it's really more about deciding what the best tool is for small data versus larger data, or for the kind of object, right? Because you cannot save a list object to a database unless you want to get complicated. Or a model, for example, like we showed in the demo. Those are objects that are better in a volume, which you could use pins for.
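To make that trade-off concrete, here's a minimal sketch of the two options in R. It assumes a Databricks workspace where you already have a Unity Catalog volume and a SQL warehouse; the catalog, schema, volume, and table names below are placeholders.

```r
library(pins)
library(dplyr)

# Non-tabular or small objects (a model, a list): pin them to a
# Unity Catalog volume.
board <- board_databricks(folder_url = "/Volumes/my_catalog/my_schema/my_volume")
model <- lm(mpg ~ wt, data = mtcars)
pin_write(board, model, name = "mpg-model")

# Large tables you only need a slice of: query in place over ODBC so
# the database does the filtering and only that slice travels down.
con <- DBI::dbConnect(odbc::databricks(), httpPath = Sys.getenv("DATABRICKS_SQL_PATH"))
sales_april <- tbl(con, I("my_catalog.my_schema.sales")) |>
  filter(sale_month == "2025-04") |>
  collect()
```

The `sales` table and `sale_month` column are hypothetical; the point is that the filter runs inside Databricks, while the pin downloads the whole object.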

I know this integration between pins and Databricks is pretty new, like a few months, right? Yeah. Okay, but how long have you been working on pins? Oh, man, since it started. I remember when Javier was talking about it. I know that Tareef, who's the president of the company, was talking about it too before that. So yeah, since the very beginning, I saw the first few drafts of it, and I really liked the concept of how we could use boards and pins to share data easily.

Because this has been an issue with, you know, customers or even personally, you know, when you try to share some data across teams or even across projects, you have to copy the data over and over again, and you end up with a bunch of versions of it. It's better that you have something centralized that is also versioned, like we discussed with pins. And that's what I think got me really excited about it.

Security of pins with Databricks

So I know, from a few years ago when I used to work in sales, that some customers had questions around the security of pins, and I was wondering, how is that addressed with Databricks?

Yeah, so Databricks has a really good authentication and security infrastructure that you can use. You can lock down to each object. And by object, I mean files, tables, anything that you have inside the Unity catalog. So it's really easy for you to make sure that only folks who are supposed to access it inside your environment in Databricks can access it.

Having said that, of course, any security system is going to be as secure as you make it. The best example I can give is, you could have the best lock on your door, the most secure lock. But if you forget to actually turn the key and lock it, it really doesn't do anything for you. So that's the other thing, right? The infrastructure provided by Databricks will make it secure, but it's also up to us to set up the proper permissions on top of it, so the folks who need access are able to get it, and those who aren't supposed to, can't.


Q&A: Pandas DataFrames, troubleshooting, and IDE choices

I see a question that just came in here, which was, Edgar, do Pandas DataFrames also automatically convert to R-friendly DataFrames if you go from Python to R?

I have not seen that happen. I actually avoided that when I was testing the Python side, but you can email me and I'll be able to confirm, because I know going between an R data frame and a pandas DataFrame isn't trivial. The thing is, when you use the Python version of pins, you usually have to establish what kind of file format you're going to save it as. I don't know if there's an actual binary format you would save the pandas DataFrame in, I would have to check, but typically you would say CSV or Arrow, and that is easily portable back into R.
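As a rough sketch of keeping pins portable across the two languages: when writing a pin, you can pick a language-neutral storage format rather than R's default `rds`. (This uses a temporary local board as a stand-in for a shared one.)

```r
library(pins)

board <- board_temp()

# "arrow" and "csv" are formats both the R and Python pins packages
# can read; the default "rds" is R-only.
pin_write(board, mtcars, name = "cars", type = "arrow")

# Reading it back in R returns a regular data frame; the Python pins
# package would hand the same pin back as a pandas DataFrame.
cars <- pin_read(board, "cars")
```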

Another question was, can you suggest some best practices for troubleshooting the connection to Databricks? Somebody said, when I try to pin a DataFrame over 1,000 rows to my Unity catalog, the process hangs and doesn't complete. I'm unsure if it's a limitation on my local machine or my Databricks instance.

That's a great problem to diagnose. One of the things I would say is to definitely check whether, let's say, the first 100 rows hang, and then try the next 200. We may be focusing on rows and not columns; there may also be an issue that the data is very wide, so it takes a lot of bandwidth. Another problem you may have is in the serialization of the object: there may be some bad data that isn't being converted properly, and that's what's hanging. So whenever a customer has this problem, I suggest that, first of all, you partition the data. Go 100 rows, then 200, then 300, and see where it fails. Let's say it fails at 500; then try to upload just rows 400 to 500 and see if that fails. Because if it does, that means there's some issue with your data, not with the amount. That's how I would start to diagnose it.
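Edgar's partitioning idea could be sketched like this, assuming `big_df` is the data frame that hangs and `board` is your already-configured Databricks board (both placeholders here):

```r
library(pins)

# Upload growing slices until one fails. If rows 1:400 succeed but
# 1:500 fails, retry just 401:500 -- a failure there points at bad
# values in that slice, not at the sheer amount of data.
try_slice <- function(board, df, rows, name = "diagnostic-pin") {
  tryCatch(
    { pin_write(board, df[rows, ], name = name); TRUE },
    error = function(e) FALSE
  )
}

for (end in seq(100, nrow(big_df), by = 100)) {
  if (!try_slice(board, big_df, 1:end)) {
    message("First failure within rows 1:", end)
    break
  }
}
```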

I don't know if it's an issue with anything on your local machine, as far as uploading. Unless you have a really locked-down machine by your company where you have to go to a proxy and things like that, that may be the problem. You should be able to ask the IT security team in your company to see if there's any limitations on that proxy, if you're going through that as well.

And also, as a Posit professional customer, you can also open a support ticket with our team too, so we can help dig into the issue a little bit deeper.

Okay. One other question was, in this imaginary example company, does everybody on your team use the same IDE? Usually not. That's very common. I would say, and Rachel would agree, that maybe five years ago everybody said RStudio and nothing else, right? But now we don't see that anymore. We see some folks on RStudio, some folks on Positron now, and, especially when it comes to Databricks, folks who like notebooks. So it's really rare now to see a homogeneous implementation, the same exact IDE across all teams.

That's one reason too, and not to get too commercial about it, but that's why Posit Workbench exists. And we acknowledge the fact that that doesn't happen anymore, right? Because we want to make sure that you're successful, as successful as possible. And that usually means too, giving you the best runway for you to be able to do whatever you need, which means also having an IDE that you feel comfortable in.

LLMs, AI, and new packages

So I know you've worked on a ton of different packages as well. So if people had other questions that weren't specific to Databricks and pins, what are things that you're excited to talk about?

Right now, this whole thing with LLMs and AI is pretty exciting. Not so much the coding assistant part that I think most of us have tied AI to, but more the data side. I feel like there's a lot of opportunity there. I've been working on a package called mall that lets you run LLM calls row-wise. Let's say you want to run an NLP operation like sentiment analysis: instead of spending a lot of time training a sentiment model, you can use an LLM to determine for you whether each row is positive, negative, or neutral. So it's really easy to run those analyses. I feel like that's very exciting for me.
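As a small illustration of the row-wise idea, here's roughly what a mall sentiment call looks like in R, assuming a local Ollama server with the llama3.2 model already pulled:

```r
library(mall)

# Point mall at a local LLM.
llm_use("ollama", "llama3.2", seed = 100)

reviews <- data.frame(
  review = c(
    "This has been the best TV I've ever used. Great screen!",
    "I regret buying this laptop. It is too slow."
  )
)

# Runs the LLM once per row and adds a .sentiment column
# (positive / negative / neutral) -- no model training required.
reviews |> llm_sentiment(review)
```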

We also have the lang package. It's a nascent package, but it translates R help files into a different human language, like from English to Spanish or French. Because English is my second language, it makes it easier for us to read the help in our own language. And the LLMs do a great job, especially the modern ones, so I feel like that implementation is really exciting when you look at the possibilities for it.

One of the remaining questions we had was, what is your favorite thing about working with Databricks?

Oh, honestly, what I really like is how complete the ecosystem is and how easy it is to use. Think in terms of, like, plain AWS, right? Whenever you needed a Spark cluster, you had to have an entire Hadoop cluster implementation with Spark on top, which meant an EMR cluster, and that took a while. You actually had to know how to do this thing to get it working. And then to set up a permanent table, you had to use an S3 bucket. There's a lot of work you have to do.

But with Databricks, everything is so easy to do. I feel like as a data scientist, it gives you those almost superpower-like capabilities at your fingertips, where you can have a Spark cluster and a volume running, so you have pins and a table ready for your data, within, you know, half an hour. You have everything set up without it being too complicated. Connect it to RStudio, to Positron, to your notebook, and get going analyzing large amounts of data. This is the dream we've had for years happening now, because Databricks keeps everything inside its ecosystem in a way that makes it very easy to set up.


Upcoming workflow demo and final questions

So the last Wednesday of May, May 28th, Joe Cheng, our CTO at Posit, is going to be joining us to talk about harnessing LLMs for data analysis. That will be the next workflow demo in the series. Just wanted to remind everybody that they come up the last Wednesday of every single month.

One specific question about pins was, do pins live outside version control like Git? Yeah, they do live outside of it, and I think they're two different concepts. Git is for tracking changes to your code, to the implementation. The versioning you have in pins is for tracking differences in the data itself. That's the main thing to keep in mind. If you want to use a specific version of a pin in your code, then the fact that you switched from one pin version to another is what you track in Git, while the versioning of the data itself lives in pins. So yeah, they're two different things, in my opinion.
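In code, the split Edgar describes looks something like this sketch: pins records the data versions, and the version string you pin your analysis to is the part that ends up committed in Git.

```r
library(pins)

board <- board_temp(versioned = TRUE)

pin_write(board, mtcars, name = "cars")           # data version 1
pin_write(board, head(mtcars, 10), name = "cars") # data version 2

# pins tracks the data's history...
pin_versions(board, "cars")

# ...and hard-coding a specific version string in your script is the
# change that Git then tracks as code.
v <- pin_versions(board, "cars")$version[1]
cars_fixed <- pin_read(board, "cars", version = v)
```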

And one other, I guess, not really a question, but a comment: Edgar, thanks for your work on the mall package. It was very useful when I was working in market research at a previous role. Thanks. Yeah, I'm so excited about that package. And it works in both R and Python. On the Python side, it works on Polars data frames. So you can use either one.

What's the best way to get started with that? Going to the GitHub page? Sorry, going to the website. The website itself is very complete; you can switch between the two languages, and it walks you through how to set it up. And I'm also very excited to share that, for that specific package, very soon I'm going to add an ellmer backend. So you'll be able to go beyond Ollama; you should be able to use anything that ellmer, and on the Python side chatlas, supports. That way, if you want to use, like, Claude or OpenAI to run the NLP, you'll be able to do that. So that's coming soon.

Well, we have, I think, time for two more questions here. I see one more that just popped in from Juan. And it is, does pins coexist with targets? When would one use one versus the other?

Yeah, they would coexist with targets. Let's say you take the example that we used today for the final workflow. You could orchestrate the update of the pin, excuse me, the update of the job run, like whenever you run the job, and then the update of the pin, and, let's say, an update to the notebook itself. You can orchestrate all of that with targets. So I think they're two complementary tools. I don't think they compete in any way that I can see.

Yeah, because pins acts more as a data source, or as a destination for targets, as far as saving an updated version of the pin back into it. It would have been cool to have a targets layer as part of the demo, but I just didn't have the time.
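A targets layer over the demo's workflow might look roughly like this `_targets.R` sketch, where `run_job()` and the board path are placeholders for the demo's actual job trigger and volume:

```r
# _targets.R
library(targets)

tar_option_set(packages = "pins")

list(
  # Step 1: trigger the upstream job (hypothetical helper).
  tar_target(job_result, run_job()),

  # Step 2: whenever the job output changes, write it back as a pin.
  tar_target(
    pin_update,
    pins::pin_write(
      pins::board_databricks("/Volumes/my_catalog/my_schema/my_volume"),
      job_result,
      name = "job-output"
    )
  )
)
```

Because targets tracks dependencies, `pin_update` only reruns when `job_result` actually changes, which matches the "pin as a destination" framing above.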

Okay, I think we covered all the questions. Awesome. Well, thank you so much, Edgar, for all the time that you put into preparing this demo and being here to answer questions from everybody. I'm sure we'll be sharing these resources more broadly as well across LinkedIn and follow up emails with everybody who joined us today. But if you'd like to suggest a workflow, I wanted to let you know, you can always reach out to me directly on LinkedIn. Just look up Rachel Dempsey at Posit, but you can also let us know in the comments below in YouTube too. Have a great rest of the day, everybody. Thank you so much for joining us. Bye.