Resources

AI-Powered Data Engineering Workflows: Positron for Databricks Users

Modern data engineering demands more. Move beyond the limitations of hosted notebooks and embrace the power of an IDE for modern data engineering. We’ll show you how Positron combines the best of local development with the scale of Databricks. In this session led by James Blair at Posit, we’ll cover:

1. AI-Driven Development: Moving from concept to code faster with built-in AI.
2. Seamless Exploration: Navigating data effortlessly with the Catalog Explorer.
3. Production-Grade Pipelines: Building, testing, and deploying robust workflows.

Helpful Links:

- Download Positron: https://positron.posit.co/
- Databricks Asset Bundles: http://docs.databricks.com/dev-tools/bundles
- GitHub Repo for Demo: https://github.com/blairj09/databricks-data-engineering
- Lakeflow Declarative Pipelines: http://docs.databricks.com/aws/en/ldp
- Talk to us about integrating Posit & Databricks: https://posit.co/schedule-a-call/


Transcript

This transcript was generated automatically and may contain errors.

Hello and welcome, everyone. My name is James Blair. I'm a senior product manager at Posit, focused on cloud partnerships, primarily Databricks. Today, we're going to spend some time talking about data engineering using Positron, a new developer tool that we've built here at Posit, alongside Databricks, and how these two tools work together to complement one another.

To set the stage, we're going to talk about a business problem that we can use to anchor our discussion. I'll provide a little bit of an introduction to Positron and how we plan to use it alongside Databricks. Then the bulk of our conversation today will be spent in a live demonstration of using Positron alongside Databricks to build data engineering workflows and pipelines. We'll have a short comparison at the end of what it means to use a true integrated developer tool like Positron versus native features like Databricks notebooks, and when you might use one over the other. And then we'll have some time for Q&A at the end.

The business problem: weather and NYC taxi trips

Let's say that a topic of interest for us is understanding how weather impacts New York City taxi trips. The distance, the fares, how does weather impact some of this? Now, some of you may recognize that there's some convenience to this question: this New York taxi dataset is built into every Databricks environment, so if you have access to a Databricks environment, you automatically have access to this data about taxi trips in New York City.

However, this taxi data does not contain any weather information. It has information like the trip distance and the fare amount and the pickup locations, but it does not contain any information about the weather during the trip itself. So, what we want to be able to do is we want to be able to supplement this existing data inside of Databricks with some new data that provides context around weather, and with that new data, we can begin to answer additional questions that we have not been able to access or answer previously, questions like how does temperature impact the trip distance or how does weather impact the fare of the trip, things like this.

In order to succeed here, we need some sort of historical weather data that we can integrate and join with the taxi data that's available inside of Databricks, and ideally, we'd like this to be as granular as we can get it. Hourly feels like a pretty good level of granularity, so that we could match the hour that the trip occurred with the weather in New York City at that particular hour. And if we can capture additional characteristics, not just temperature but precipitation and other weather patterns, we have greater ability to understand this interaction between weather and these taxi trips in New York.

Imagine an organization comes to me and says, look, we have all this taxi data, but we want to understand the impact of weather. And this could be anything else, right? It could be: we have a bunch of demographic data and we'd like to add X, Y, and Z; we have a bunch of loan data and we'd like to supplement it with this. I think this is a fairly common request: we have our existing data inside of our data lake or inside of a warehouse somewhere, and now we've decided that we want to supplement it with something new.

Introducing Positron

Positron is a next generation developer tool that we've built here at Posit. We're very excited about what Positron offers. We've been working on it for a long time, and over the past couple of years we've been very excited to really start bringing it into the spotlight and showcasing the hard work that's gone on behind the scenes.

The thing that makes Positron unique is that it's built on top of the open source components of VS Code, like many developer tools are today. But we've taken everything that we've learned as a company and organization in our experience of building RStudio, the premier developer environment for R developers, and applied all of those lessons to this multilingual developer tool that's very, very focused on data science and analytics.

What that means is, while VS Code provides a very robust tool for software engineering, people will sometimes refer to VS Code and similar tools as very fancy or complex text editors, which they are. They're supplemented with extensions and all these other things. But at the root of what we're doing, we're manipulating text and writing code, and VS Code provides a lot of useful functionality around working with lots of files, mapping dependencies between them, and so on.

Data science is a little bit different and working with data is a little bit different, because in many cases, you're not just writing and then compiling something static. Instead, you're working in sort of an interactive environment, whether it's Python or whether it's R, I have some sort of running process. And in that running process, I'm loading libraries, I'm adding data, I'm exploring, I'm generating plots, I'm making visualizations. So there's this very sort of rapid feedback loop that's happening when I'm working with data.

One of the things that made RStudio so successful for such a long time was this focus on interactive data work. And for those of you that are users of RStudio, this is not a signal that we are abandoning the RStudio project or deprecating it in any way. We will continue to maintain RStudio. This is not a change of direction; rather, this is really an expansion of the tools that we provide.

Positron is, again, built on VS Code. It has native support for R and Python, and there are other things that we plan on adding support for in the future, SQL among them. There's also native support for the whole ecosystem of VS Code extensions. So, the Databricks extension works inside of Positron. Other extensions that you might know and love from using VS Code work inside of Positron as well. So, you get the benefit of this vast ecosystem of tools inside of VS Code with the added benefit of native R and native Python execution that's built into the product itself.

Demo: exploring data in Positron

This is what Positron looks like. We're not going to spend a ton of time orienting ourselves here. But again, if you've used VS Code previously, you'll recognize that this looks and feels a lot like VS Code. One of the things that's a little bit unique is that I have this console down here that currently has a running Python session. If I wanted to start another Python session, I could. If I wanted to start an R session down here, I could do that as well, and then I'd have both Python and R running concurrently.

One of the things that's new to Positron that we'll highlight here is this catalog explorer. So, over here on the left-hand side, I can choose this top icon, which allows me to view my file system. But down here at the bottom, there's this catalog explorer. And in this catalog explorer, I can come in and I can take a look at the different Databricks catalogs and schemas and tables that I have access to.

I can execute this here inside of my console. This is going to open a browser window, authenticate me into Databricks, and then come back here to Positron. And I'll now have this trips variable that represents (we can see the SQL statement here) the first 1,000 rows from this taxi trips data. Over here on the right-hand side, I can pop this open and see different details about the variables that are here.

So, now, I've got the ability to come in. I can connect to Databricks. I can explore what's available there. This is the first half of our problem. We want to understand this trips data. And again, this is available to every Databricks account. So, if you have a Databricks account, there's this sample catalog in there. There's this NYC taxi schema, and you've got this trips table that's there that you can access.
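For readers following along outside Positron, a query like the one the Catalog Explorer generates can be sketched with the databricks-sql-connector package. Treat this as a hypothetical sketch rather than the demo's exact code: the connection arguments are placeholders, and in the demo authentication happens through the browser instead of a token.

```python
# Sketch: pull a sample of the built-in NYC taxi trips table into pandas.
# Connection details are placeholders; Positron's Catalog Explorer generates
# the equivalent code for you.
QUERY = "SELECT * FROM samples.nyctaxi.trips LIMIT 1000"

def fetch_trips(server_hostname: str, http_path: str, access_token: str):
    # databricks-sql-connector and pandas are imported lazily so the module
    # loads even where the connector isn't installed
    from databricks import sql
    import pandas as pd

    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as conn:
        with conn.cursor() as cursor:
            cursor.execute(QUERY)
            columns = [desc[0] for desc in cursor.description]
            return pd.DataFrame(cursor.fetchall(), columns=columns)
```

The LIMIT keeps the local sandbox small; the full table stays in Databricks.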

Fetching weather data with Positron Assistant

How do we get weather data in here? Well, the first thing that we need is some weather data, and there are many different places we might look for this. We're not going to spend time today searching for the best possible weather data. But one thing that I do know is that there's a historical API called OpenMeteo that supports historical weather reports from different places. We can use this open API; it doesn't require a key or any sort of authentication. It is rate limited, so we need to be conscious of that and be good stewards.

Historically, I could go read the API docs, then open up a new Python file, send some queries to the API, figure out which endpoint was the right one, and build out step by step how I want to access this data and what features of it I'm interested in. There's nothing wrong with that approach; that's the way we've done things for a long, long time. However, here in 2026, the landscape of tools has rapidly changed over the course of the past year, and particularly the past six months. So, we're actually going to turn to a tool inside of Positron called Positron Assistant to help us explore this API.

Positron Assistant is an AI tool built into Positron that does what you would expect modern AI agent tools to do. It gives us access to foundational models. We can see I'm using Claude Sonnet 4.5 here. This is configured to use AWS Bedrock. I could use Anthropic natively. I could use OpenAI. I've got a number of different ways I can configure this particular assistant with the model that I want to use.

The thing that makes Positron Assistant a little bit unique is that not only does it know what files I have and what the contents of those files are, it also knows information about my current session. I can see here in the context that it has information about my current runtime, Python 3.12.9, which is this session right here. So, it knows that I have this trips data loaded, and it knows that I've got a connection to Databricks, and some other things there.

The thing that makes Positron Assistant a little bit unique is that not only does it know what files I have and what the contents of those files are, it also knows information about my current session.

This request doesn't really depend on my current session; there's not a lot in my current session that's going to impact this particular request. So, even though it's adding that as context, it's just going to write some Python code that fetches this hourly data. It's now imported requests and pandas, identified the latitude and longitude coordinates for New York City, and reached out to this OpenMeteo weather API endpoint with those latitude and longitude values, a date range, and a request for a few different values from the hourly report: temperature, precipitation, and wind speed. And then we've got the time zone defined here.
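A rough sketch of what that generated code looks like: the endpoint and parameter names follow the public Open-Meteo archive API, but the exact coordinates, dates, and structure here are illustrative rather than the assistant's verbatim output.

```python
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"

def build_params(start_date: str, end_date: str) -> dict:
    """Query parameters for an hourly historical weather request for NYC."""
    return {
        "latitude": 40.7128,       # New York City
        "longitude": -74.0060,
        "start_date": start_date,  # e.g. "2016-01-01"
        "end_date": end_date,
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "America/New_York",
    }

def fetch_hourly_weather(start_date: str, end_date: str):
    # requests and pandas are imported lazily; remember the API is rate
    # limited, so cache results rather than re-fetching
    import requests
    import pandas as pd

    resp = requests.get(ARCHIVE_URL,
                        params=build_params(start_date, end_date),
                        timeout=30)
    resp.raise_for_status()
    hourly = resp.json()["hourly"]  # dict of parallel lists keyed by variable
    weather = pd.DataFrame(hourly)
    weather["time"] = pd.to_datetime(weather["time"])
    return weather
```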

This is really nice, right? I didn't have to write any of this by hand. I could have written it, and I can easily review it here, and it looks like what I probably would have done if I had gone through the documentation and put this together myself. There's nothing in here that appears to be off or egregious to me. This looks like it should work.

And I can come in and say, okay, let's actually run this code. You'll notice that this ran inside of my active console. So, I can now see that I have this weather data that has come from the API, and it's returning valid values. I've got temperature here. I've got precipitation. I've got wind speed. So, I've got what I want, or it looks like I have what I want from this dataset, which is great. This gives me something that I can start to work on.

So, now that we have that, we could spend some time exploring it a little bit further here. If we wanted to, we could say, like, help us explore with visualization. And this can generate code that's going to make a plot, and the plot's going to show up here inside of my IDE.

It says, okay, you don't seem to have Seaborn or Matplotlib. Do you want to install those? I can choose which level of permission I want to give this. We'll say this is okay. It'll pop open here. I'm using UV to manage my Python environment. So, this will use UV to make sure these packages are available here in this environment. Once this is done, Positron Assistant recognizes, okay, it looks like those packages were installed. This looks like, yep, Seaborn was installed successfully. Matplotlib was installed successfully. Now, let's run this visualization code again, and we'll get this resulting visualization that will appear here inside of our viewer pane in just a moment.

So, this becomes a really powerful way to quickly interact with and explore new data or existing data, right? We could do the same thing with the trips data inside of Databricks. Here, we're doing it with this data that we've queried from this API, but it gives us a very nice iterative way to work through our data in real time, and here's our plot here. So, we've got this nice plot that shows temperature across the month, precipitation across the month, wind speed across the month. A really nice laid-out collection of plots to understand the data that we're looking at. And I didn't have to open a separate window; all of this is rendering inside of Positron.

Joining weather data to taxi trips

Now, if we look back at our trips data, which we've got pulled open here, we can see that these pickup times and drop-off times are from 2016. So this 2023 data that we just queried isn't going to help us here; we're going to need data back from 2016.

And we've loaded the first 1,000 rows of this trips data into our Python session here inside of Positron. What that gives us is a little bit of a local sandbox, where we can say, okay, let's put together how we can join these things, see what that looks like, and then start to build a pipeline that we can push out to Databricks to do this systematically, in a very data engineering frame of mind.

We can keep track of what it's doing. We can see that we've got this trips with weather dataset that's now been created, and if we scroll through here, we can start to see how it's put this trips data together. You'll also notice, and maybe you saw it down here, there were a couple of errors that came up as this was going. It made a weird join; it had a value that didn't make sense; it threw an error here.

But because Positron Assistant has access to the console that we're working with, this Python session, it saw the error, addressed it, rewrote the query, and moved on. And we didn't have to go back and say, hey, you know, this actually created an error. Can you fix the error? It was able to do so automatically.

Because Positron Assistant has access to the console that we're working with, this Python session, it saw the error, addressed it, rewrote the query, and moved on.

Part of that is because we've enabled it to operate in agent mode, where it has a higher level of autonomy to go and do things. If we wanted to pull that back a little bit, to stay in firmer control of what's going on, we could change the level here. We could tell it to ask for permission for everything, or grant it added permissions. So we have the ability to change the behavior and the autonomy level that this tool has.

So now we've taken the pickup time and rounded it to the nearest hour, because our weather data was only hourly. So now we've got the pickup hour, and we have the temperature at pickup, the precipitation at pickup, the wind speed at pickup. And then it went above and beyond, as AI tools are destined to do at times, and created this temperature category. Is it cold? Is it cool? Is it mild? What's a categorical definition of the temperature? And a Boolean to indicate whether or not there was precipitation.
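A minimal pandas sketch of that join and the derived columns. Column names like tpep_pickup_datetime and temperature_2m are assumptions based on the datasets discussed, and the category boundaries are illustrative, not the assistant's exact output.

```python
import pandas as pd

def join_weather(trips: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Attach each trip's hourly weather reading plus two derived columns."""
    out = trips.copy()
    # Weather readings are hourly, so truncate each pickup to its hour
    out["pickup_hour"] = out["tpep_pickup_datetime"].dt.floor("h")
    out = out.merge(weather, left_on="pickup_hour", right_on="time", how="left")
    # The "above and beyond" extras: a categorical temperature label...
    out["temp_category"] = pd.cut(
        out["temperature_2m"],
        bins=[-float("inf"), 0, 10, 20, float("inf")],
        labels=["cold", "cool", "mild", "warm"],
    )
    # ...and a Boolean precipitation flag
    out["had_precipitation"] = out["precipitation"] > 0
    return out
```

A left join keeps every trip even when an hour is missing from the weather table, which mirrors the sandbox-first approach here: prove the shape of the join locally before pushing it to Databricks.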

So this kind of proves out the idea that we can do what we set out to do. We can take this taxi data and join it with this weather data that we've collected from the API. And if we go back to our original hypothesis, that allows us to start to answer questions like: how does weather impact the distance that taxis travel? How does it impact the fare, the amount that's charged? Is there a real impact there? We can start to look at some of these questions.

However, we've only done this for 1,000 rows. This trips with weather dataset is just the first 1,000 rows. And it's not in Databricks; if somebody else wanted to use this data, it's just sitting inside this Python session in Positron. So we're part of the way there. We've proved out what we can do. The next step is to define a way to do this programmatically inside of Databricks, on some sort of scheduled cadence in the event that this were a live table.

Building a Databricks asset bundle

So we want to make this up-to-date joined dataset regularly available to other users inside of my Databricks platform. One of the ways we could do this is to leverage Databricks asset bundles, or DABs. A Databricks asset bundle, at its core, is a YAML file that defines the dependencies and configuration of a bundle of resources that we want to deploy to Databricks, along with the behavior associated with that bundle.

We can lean on tools like Positron Assistant to help us create these things. So we can come in and we can say, hey, like now that I've proven that I can join this data with this data, let's create a Databricks asset bundle that helps us do this on a regularly scheduled basis.

If we come in and look at our resources, we can see that I have this weather integration job here that defines a job inside of Databricks that runs daily at 6 a.m. It has some parameters around the catalog and schema that are being used, and the start and end dates, which represent the start and end dates of the New York taxi information. And then it defines this task to go fetch the weather from this OpenMeteo API, using a notebook that contains just the Python code that we've already walked through to fetch the data from the API.
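For a sense of shape, a bundle like the one described might look roughly like this in databricks.yml. Every name, path, and parameter value below is illustrative, not the demo repo's actual file; consult the Databricks Asset Bundles documentation for the full schema.

```yaml
bundle:
  name: weather-integration      # hypothetical bundle name

resources:
  jobs:
    weather_integration_job:
      name: weather-integration
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # daily at 6 a.m.
        timezone_id: America/New_York
      parameters:
        - name: catalog
          default: samples
        - name: start_date
          default: "2016-01-01"
        - name: end_date
          default: "2016-12-31"
      tasks:
        - task_key: fetch_weather
          notebook_task:
            notebook_path: ./src/fetch_weather.py   # illustrative path
```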

Part of this as well, if we pull this open and look at the ETL transformations that we define in here, is that we're using Spark declarative pipelines (Lakeflow Spark Declarative Pipelines, I believe, is the technical term), where we're defining these DLT tables that inherently understand data dependencies. What this does is allow us to publish this job and the associated pipelines, and then when these things run, they understand what dependencies are needed, so that things run in the right sequence and order and fail gracefully if there's an error.
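As a sketch of what one of those table definitions can look like in the Python DLT style: note this only runs inside a Databricks pipeline (the runtime provides the dlt module and the spark session), and the table and column names here are assumptions based on the demo, not its exact source.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Taxi trips joined to hourly weather")
@dlt.expect_or_fail("valid_datetime", "tpep_pickup_datetime IS NOT NULL")
@dlt.expect("reasonable_temperature", "temperature_2m BETWEEN -40 AND 50")
def trips_with_weather():
    trips = spark.read.table("samples.nyctaxi.trips")
    # Reading weather_raw via dlt.read() registers the dependency, so the
    # pipeline knows to materialize weather_raw before building this table
    weather = dlt.read("weather_raw")
    return (
        trips
        .withColumn("pickup_hour", F.date_trunc("hour", "tpep_pickup_datetime"))
        .join(weather, F.col("pickup_hour") == F.col("time"), "left")
    )
```

The expectation decorators sketch the "valid date time" and "reasonable temperature" checks mentioned later: expect_or_fail stops the pipeline on violation, while expect only records it.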

Inside of our project, where we have this Databricks YAML file defined, we can simply run databricks bundle deploy from the Databricks CLI. That will take the YAML file, everything it defines, and the dependencies it contains, and deploy it all to Databricks. Once that deployment is successful, we can also execute databricks bundle run, which triggers the job to execute right now, in real time.

And if we run this, we'll see some feedback here in the terminal that gives us an indication that we've uploaded the bundle files. So that includes all the different declarative pipeline information that we have as well as the notebooks that define reaching out to the API. And then we can see that we've now started this weather integration, this weather integration job is now running.

And if we pop open our Databricks UI here and come over to jobs and pipelines, we can see that this weather integration is currently executing. And I can see down here, this particular flow right here is currently running. We can click into this and see where we're at. So currently, we're in that fetch weather data portion where we're running that fetch weather notebook. And if we click in, we can see details about where we're at in that notebook execution.

So we can see here each of the different steps: we grabbed 4,368 weather records, and now we're writing that weather to a new table, weather_raw. That table now contains this weather data that we fetched from the API. And if we come back here and look at our run, we've now moved on to the declarative pipeline.

And in here, we have a number of different dependencies that are currently being mapped out. And then we'll see a number of different inputs here. So we have this weather, the weather raw table gets read in here. We're pulling the taxi trips data. And then we're creating a number of materialized views based on this data that we've pulled in from this OpenMeteo API.

This pipeline is now bringing all those things together and then generating a number of different materialized views. There are a number of different ways people might look at these views. You might be familiar with bronze, silver, and gold layers of data; this would fit that abstraction, right? You could have a gold level of reporting, aggregated information; a silver layer of the joined data; and then your bronze layer, just the raw tables that are feeding into those joins.

So we see each of these sort of in sequence here, all defined inside of this Databricks asset bundle. And the advantage of that is if we come back into Positron for a moment, all of this is just sort of standard Python code. When I look at my weather data source Python file, this is just Python code. And because it's just a Python file and I'm working inside of Positron, I have access to debugging capabilities.

I have the ability to use version control, work efficiently with the files I'm operating on, and collaborate with others through tools like Git and GitHub, or whatever the case might be.

All the things that I would expect to have available to me, like version control integration (we can see all of my Git information in here) and the ability to run debugging, are all part of this Positron environment. I can build all the pieces here and then deploy those pieces as part of this bundle to Databricks. Once that bundle has been deployed, it can be scheduled, and it's going to run.

I can define expectations here. In fact, we can see an example of this: I have an expectation that there's a valid date time, and an expectation that there's a reasonable temperature. So I can define data expectations so that if something doesn't meet them, the pipeline will fail and I can be notified, go in, and address what's going on. And all of this can be driven from Positron, leveraging tools like Positron Assistant, and then pushing the outcome and the result out into Databricks.

Notebooks vs. IDE: when to use each

Let's talk for just a second, and I touched on it briefly, about the difference between notebooks, whether that's Jupyter notebooks or, in this case, Databricks notebooks, and a true integrated development environment. This is not comprehensive, and there are reasons you'll use one over the other; depending on what you're trying to do, there's a tool that might be better suited for the job.

But there's a well-known dialogue around a few common issues with notebooks. They're messy when working with version control, because you have a lot of JSON data attached, versus standard Python files, where your diffs are a little bit easier to understand. Sharing code across notebooks, particularly within Databricks, can be a challenge, versus Python, where you can structure things in modules and have shared imports. Code review can be a challenge with notebooks, again going back to the Git diff issues there. Unit testing can be tricky, although there are some frameworks that can help; but if you're using standard Python files, you've got access to standard testing frameworks that you can leverage to build testing around those files. And debugging in notebooks can be a challenge, while in an IDE you have a full catalog of debugging tools at your disposal.
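To make the testing point concrete, here's an illustrative pytest-style test against a plain-Python helper of the kind this pipeline might factor out. The function name and column names are hypothetical, not taken from the demo repo.

```python
import pandas as pd

def add_pickup_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Truncate pickup timestamps to the hour so trips join to hourly weather."""
    out = df.copy()
    out["pickup_hour"] = out["tpep_pickup_datetime"].dt.floor("h")
    return out

def test_add_pickup_hour():
    # Standard pytest conventions: any test runner can pick this up,
    # locally or in CI, with no notebook involved
    df = pd.DataFrame(
        {"tpep_pickup_datetime": pd.to_datetime(["2016-02-03 14:59"])}
    )
    result = add_pickup_hour(df)
    assert result.loc[0, "pickup_hour"] == pd.Timestamp("2016-02-03 14:00")
```

Because the same module is what the bundle deploys, the tested code and the production code are one and the same.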

And as we highlighted a little bit today, Positron Assistant does a really great job, if it encounters any sort of errors, of addressing those errors in real time. Notebooks are also a little bit notorious for the fact that cells can be executed in arbitrary order, so if you're not careful in the construction of the notebook, you can end up in a state where it's no longer linear. Stateless functions and the way a Python file natively executes can help prevent some of that from creeping in.

If you're looking at, for example, a code diff with a notebook versus a standard Python file, you end up with quite a lot of noise in a notebook diff versus a Python file that's a little bit cleaner to understand what's changed and what those changes might introduce. So reviewing changes through version control can be a little bit more seamless with standard Python files. You have testing that we talked about previously. You can define tests in one place in your developer environment, and those tests can then be applied anywhere those Python files run.

And the idea here really is to say, look, let's have one collection of code. We can build tests around that. We can run that code in our local environment in Positron. And then we can, using Databricks asset bundles and some of these other tools that Databricks provides, we can then move that code, push that out into Databricks, and run that as part of a pipeline or a scheduled job inside of our workspace.

Now, again, this isn't to say that we shouldn't ever use notebooks. Notebooks can be great for ad hoc data analysis, quick experiments, and sharing results; notebooks have a nice format to them. It's a lot easier to look at a notebook, with its inline plots, than at just a Python file if you're trying to walk somebody through something. And for interactive tutorials, or just interacting in real time, I think notebooks provide a really great interface. But when you're doing these production workloads, defining multiple different pieces at a time and building out these complex pipelines and jobs, using a true developer environment (and we think Positron is very well suited for this task) can start to have a distinct advantage over trying to consolidate and run everything inside a notebook or collection of notebooks.

Summary and resources

So just as a review of some of the things we've chatted through today: Positron brings a very well thought out developer environment for data work to the market. It's available on the desktop; we used Positron on the desktop throughout this entire discussion. It's also available on the commercial side through a tool we offer called Posit Workbench, where it's hosted on a server and accessed through a browser. So there are a few different avenues you can take to get access to Positron. But on the desktop side, it's a free install, available to anyone.

Databricks asset bundles allow us to treat our infrastructure as code. We can define different components, the clusters that we're using, what files we're including in the bundle, how that execution is happening. Databricks asset bundles give us a really solid framework for defining locally what we want to run in Databricks and then pushing that compute out there.

These Lakeflow declarative pipelines help us simplify our pipeline development. We're able to describe expectations and inherit dependencies, and that gives us a really nice framework for describing which data depends on what outputs and where that data goes from there. And the IDE itself provides workflows that improve our reliability, our ability to collaborate, and our ability to orchestrate and manage tests.

And at the end of the day, if you think about who we are at Posit, our goal is to support free and open source scientific research, technical communication, and data science for as long as we can. What that means is, ultimately, we want you to use the right tool for the job. We think we've built some tools that apply in certain circumstances, and there are certainly many other tools that are useful here as well. So use the tool that's best for you. We think Positron is uniquely positioned to be a tool not only for data scientists, but also for data engineers working alongside platforms like Databricks, like we've demonstrated today.

We think Positron is uniquely positioned to be a tool not only for data scientists, but also for data engineers working alongside platforms like Databricks, like we've demonstrated today.

Before we jump into questions, just really quickly, we have some resources that we can share, the GitHub repository for this, details about Positron, and then some information from the Databricks site about asset bundles and declarative pipelines. You can visit these links and learn more. The slides are all part of the GitHub repository. So if you want kind of one place you can go and then get the rest of these resources, this GitHub repository has everything that we've walked through today.

Okay. With that, let's kind of open up and see what sort of questions we have, and we'll do our best to answer those as we've got time today.