Resources

AI-Powered Data Engineering Workflows: Positron for Databricks Users

Modern data engineering demands more. Move beyond the limitations of hosted notebooks and embrace the power of an IDE for modern data engineering. We’ll show you how Positron combines the best of local development with the scale of Databricks. In this session led by James Blair at Posit, we’ll cover:

1. AI-Driven Development: Moving from concept to code faster with built-in AI.
2. Seamless Exploration: Navigating data effortlessly with the Catalog Explorer.
3. Production-Grade Pipelines: Building, testing, and deploying robust workflows.

Helpful Links:

- Download Positron: https://positron.posit.co/
- Databricks Asset Bundles: http://docs.databricks.com/dev-tools/bundles
- GitHub Repo for Demo: https://github.com/blairj09/databricks-data-engineering
- Lakeflow Declarative Pipelines: http://docs.databricks.com/aws/en/ldp
- Talk to us about integrating Posit & Databricks: https://posit.co/schedule-a-call/


Transcript

This transcript was generated automatically and may contain errors.

Hello and welcome, everyone. My name is James Blair. I'm a senior product manager at Posit, focused on cloud partnerships, primarily Databricks. Today, we're going to spend some time talking about data engineering using Positron, a new developer tool that we've built here at Posit, alongside Databricks, and how these two tools work together to complement one another.

To set the stage, we're going to talk about a business problem that we can use to anchor our discussion. I'll provide a little bit of an introduction to Positron and how we plan to use it alongside Databricks. Then the bulk of our conversation today will be spent in a live demonstration of using Positron alongside Databricks to build data engineering workflows and pipelines. We'll have a short comparison at the end of what it means to use a true integrated developer tool like Positron versus native features like Databricks notebooks, and when you might use one over the other. And then we'll have some time for Q&A at the end.

The business problem: weather and NYC taxi trips

Let's say that a topic of interest for us is understanding how weather impacts New York City taxi trips. The distance, the fares, how does weather impact some of this? Now, some of you may recognize that there's some convenience to this question: this New York taxi dataset is built into every Databricks environment, so if you have access to a Databricks environment, you automatically have access to this data about taxi trips in New York City.

However, this taxi data does not contain any weather information. It has information like the trip distance and the fare amount and the pickup locations, but it does not contain any information about the weather during the trip itself. So, what we want to be able to do is we want to be able to supplement this existing data inside of Databricks with some new data that provides context around weather, and with that new data, we can begin to answer additional questions that we have not been able to access or answer previously, questions like how does temperature impact the trip distance or how does weather impact the fare of the trip, things like this.

In order to succeed here, we need some sort of historical weather data that we can integrate and join with the taxi data that's available inside of Databricks, and ideally, we'd like this to be as granular as we can get it. Hourly feels like a pretty good level of granularity, so that we could match the hour that the trip occurred with the weather in New York City at that particular hour. And if we can capture additional characteristics, not just temperature but precipitation and other weather patterns, we have greater ability to understand this interaction between weather and these taxi trips in New York.

Imagine an organization comes to me and says, look, we have all this taxi data, but we want to understand the impact of weather. And this could be anything else, right? It could be: we have a bunch of demographic data and we'd like to add X, Y, and Z; we have a bunch of loan data and we'd like to supplement it with this. I think this is a fairly common request: we have our existing data inside of our data lake or inside of a warehouse somewhere, and now we've decided that we want to supplement it with something new.

Introducing Positron

Positron is a next generation developer tool that we've built here at Posit. We're very excited about what Positron offers. We've been working on it for a long time, and over the past couple of years we've been very excited to really start bringing it into the spotlight and showcasing the hard work that's gone on behind the scenes.

The thing that makes Positron unique is that it's built on top of the open source components of VS Code, like many developer tools are today. But we've taken everything that we've learned as a company and organization in our experience of building RStudio, the premier developer environment for R developers, and applied all of those lessons to this multilingual developer tool that's very, very focused on data science and analytics.

What that means is, while VS Code provides a very robust tool for software engineering, people will sometimes refer to VS Code and similar tools as very fancy or complex text editors, which they are. They're supplemented with extensions and all these other things. But at the root of what we're doing, we're manipulating text and writing code, and VS Code provides a lot of useful functionality around working with lots of files, mapping dependencies between them, and so on.

Data science is a little bit different and working with data is a little bit different, because in many cases, you're not just writing and then compiling something static. Instead, you're working in sort of an interactive environment, whether it's Python or whether it's R, I have some sort of running process. And in that running process, I'm loading libraries, I'm adding data, I'm exploring, I'm generating plots, I'm making visualizations. So there's this very sort of rapid feedback loop that's happening when I'm working with data.

One of the things that made RStudio so successful for such a long time was this focus on interactive data work. And for those of you that are users of RStudio, this is not a signal that we are abandoning the RStudio project or deprecating it in any way. We will continue to maintain RStudio. This is not a change of direction; rather, this is really an expansion of the tools that we provide.

Positron is, again, built on VS Code. It has native support for R and Python, and there are other things that we plan on adding support for in the future, SQL among them. There's also native support for the whole ecosystem of VS Code extensions. So, the Databricks extension works inside of Positron. Other extensions that you might know and love from using VS Code work inside of Positron as well. So, you get the benefit of this vast ecosystem of tools inside of VS Code with the added benefit of native R and native Python execution that's built into the product itself.

Demo: exploring data in Positron

This is what Positron looks like. We're not going to spend a ton of time orienting ourselves here. But again, if you've used VS Code previously, you'll recognize that this looks and feels a lot like VS Code. One of the things that's a little bit unique is that I have this console down here that currently has a running Python session. If I wanted to start another Python session, I could. If I wanted to start an R session down here, I could do that as well, and then I'd have both Python and R running concurrently.

One of the things that's new to Positron that we'll highlight here is this catalog explorer. So, over here on the left-hand side, I can choose this top icon, which allows me to view my file system. But down here at the bottom, there's this catalog explorer. And in this catalog explorer, I can come in and I can take a look at the different Databricks catalogs and schemas and tables that I have access to.

I can execute this here inside of my console. This is going to open a browser window, authenticate me into Databricks, and then come back here to Positron. And I'll now have this trips variable that represents (we can see the SQL statement here) the first 1,000 rows from this taxi trips data. Over here on the right-hand side, I can pop this open and see different details about the variables that are here.

So, now, I've got the ability to come in. I can connect to Databricks. I can explore what's available there. This is the first half of our problem. We want to understand this trips data. And again, this is available to every Databricks account. So, if you have a Databricks account, there's this sample catalog in there. There's this NYC taxi schema, and you've got this trips table that's there that you can access.
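For readers following along outside Positron, a query like the one the Catalog Explorer generates can be sketched with the databricks-sql-connector package. Treat this as a hypothetical sketch rather than the demo's exact code: the connection arguments are placeholders, and in the demo authentication happens through the browser instead of a token.

```python
# Sketch: pull a sample of the built-in NYC taxi trips table into pandas.
# Connection details are placeholders; Positron's Catalog Explorer generates
# the equivalent code for you.
QUERY = "SELECT * FROM samples.nyctaxi.trips LIMIT 1000"

def fetch_trips(server_hostname: str, http_path: str, access_token: str):
    # databricks-sql-connector and pandas are imported lazily so the module
    # loads even where the connector isn't installed
    from databricks import sql
    import pandas as pd

    with sql.connect(server_hostname=server_hostname,
                     http_path=http_path,
                     access_token=access_token) as conn:
        with conn.cursor() as cursor:
            cursor.execute(QUERY)
            columns = [desc[0] for desc in cursor.description]
            return pd.DataFrame(cursor.fetchall(), columns=columns)
```

The LIMIT keeps the local sandbox small; the full table stays in Databricks.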

Fetching weather data with Positron Assistant

How do we get weather data in here? Well, the first thing that we need is some weather data, and there are many different places we might look for this. We're not going to spend time today searching for the best possible weather data. But one thing that I do know is that there's a historical API called OpenMeteo that supports historical weather reports from different places. We can use this open API; it doesn't require a key or any sort of authentication. It is rate limited, so we need to be conscious of that and be good stewards.

Historically, I could go read the API docs, then open up a new Python file, send some queries to the API, figure out which endpoint was the right one, and build out step by step how I want to access this data and what features of it I'm interested in. There's nothing wrong with that approach; that's the way we've done things for a long, long time. However, here in 2026, the landscape of tools has rapidly changed over the course of the past year, and particularly the past six months. So, we're actually going to turn to a tool inside of Positron called Positron Assistant to help us explore this API.

Positron Assistant is an AI tool built into Positron that does what you would expect modern AI agent tools to do. It gives us access to foundational models. We can see I'm using Claude Sonnet 4.5 here. This is configured to use AWS Bedrock. I could use Anthropic natively. I could use OpenAI. I've got a number of different ways I can configure this particular assistant with the model that I want to use.

The thing that makes Positron Assistant a little bit unique is that not only does it know what files I have and what the contents of those files are, it also knows information about my current session. I can see here in the context that it has information about my current runtime, Python 3.12.9, which is this session right here. So, it knows that I have this trips data loaded, and it knows that I've got a connection to Databricks, and some other things there.

The thing that makes Positron Assistant a little bit unique is that not only does it know what files I have and what the contents of those files are, it also knows information about my current session.

This request doesn't really depend on my current session; there's not a lot in my current session that's going to impact this particular request. So, even though it's adding that as context, it's just going to write some Python code that fetches this hourly data. It's now imported requests and pandas, identified the latitude and longitude coordinates for New York City, and reached out to this OpenMeteo weather API endpoint with those latitude and longitude values, a date range, and a request for a few different values from the hourly report: temperature, precipitation, and wind speed. And then we've got the time zone defined here.
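A rough sketch of what that generated code looks like: the endpoint and parameter names follow the public Open-Meteo archive API, but the exact coordinates, dates, and structure here are illustrative rather than the assistant's verbatim output.

```python
ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"

def build_params(start_date: str, end_date: str) -> dict:
    """Query parameters for an hourly historical weather request for NYC."""
    return {
        "latitude": 40.7128,       # New York City
        "longitude": -74.0060,
        "start_date": start_date,  # e.g. "2016-01-01"
        "end_date": end_date,
        "hourly": "temperature_2m,precipitation,wind_speed_10m",
        "timezone": "America/New_York",
    }

def fetch_hourly_weather(start_date: str, end_date: str):
    # requests and pandas are imported lazily; remember the API is rate
    # limited, so cache results rather than re-fetching
    import requests
    import pandas as pd

    resp = requests.get(ARCHIVE_URL,
                        params=build_params(start_date, end_date),
                        timeout=30)
    resp.raise_for_status()
    hourly = resp.json()["hourly"]  # dict of parallel lists keyed by variable
    weather = pd.DataFrame(hourly)
    weather["time"] = pd.to_datetime(weather["time"])
    return weather
```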

This is really nice, right? I didn't have to write any of this by hand. I could have written it, and I can easily review it here, and it looks like what I probably would have done if I had gone through the documentation and put this together myself. There's nothing in here that appears to be off or egregious to me. This looks like it should work.

And I can come in and say, okay, let's actually run this code. You'll notice that this ran inside of my active console. So, I can now see that I have this weather data that has come from the API, and it's returning valid values. I've got temperature here. I've got precipitation. I've got wind speed. So, I've got what I want, or it looks like I have what I want from this dataset, which is great. This gives me something that I can start to work on.

So, now that we have that, we could spend some time exploring it a little bit further here. If we wanted to, we could say, like, help us explore with visualization. And this can generate code that's going to make a plot, and the plot's going to show up here inside of my IDE.

It says, okay, you don't seem to have Seaborn or Matplotlib. Do you want to install those? I can choose which level of permission I want to give this. We'll say this is okay. It'll pop open here. I'm using UV to manage my Python environment. So, this will use UV to make sure these packages are available here in this environment. Once this is done, Positron Assistant recognizes, okay, it looks like those packages were installed. This looks like, yep, Seaborn was installed successfully. Matplotlib was installed successfully. Now, let's run this visualization code again, and we'll get this resulting visualization that will appear here inside of our viewer pane in just a moment.

So, this becomes a really powerful way to quickly interact with and explore new data or existing data, right? We could do the same thing with the trips data inside of Databricks. Here, we're doing it with this data that we've queried from this API, but it gives us a very nice iterative way to work through our data in real time, and here's our plot here. So, we've got this nice plot that shows temperature across the month, precipitation across the month, wind speed across the month. A really nice laid-out collection of plots to understand the data that we're looking at. And I didn't have to open a separate window; all of this is rendering inside of Positron.

Joining weather data to taxi trips

Now, if we look back at our trips data, which we've got pulled open here, we can see that these pickup times and drop-off times are from 2016. So this 2023 data that we just queried isn't going to help us here; we're going to need data back from 2016.

And we've loaded the first 1,000 rows of this trips data into our Python session here inside of Positron. What that gives us is a little bit of a local sandbox, where we can say, okay, let's put together how we can join these things, see what that looks like, and then start to build a pipeline that we can push out to Databricks to do this systematically, in a very data engineering frame of mind.

We can keep track of what it's doing. We can see that we've got this trips with weather dataset that's now been created, and if we scroll through here, we can start to see how it's put this trips data together. You'll also notice, and maybe you saw it down here, there were a couple of errors that came up as this was going. It made a weird join; it had a value that didn't make sense; it threw an error here.

But because Positron Assistant has access to the console that we're working with, this Python session, it saw the error, addressed it, rewrote the query, and moved on. And we didn't have to go back and say, hey, you know, this actually created an error. Can you fix the error? It was able to do so automatically.

Because Positron Assistant has access to the console that we're working with, this Python session, it saw the error, addressed it, rewrote the query, and moved on.

Part of that is because we've enabled it to operate in agent mode, where it has a higher level of autonomy to go and do things. If we wanted to pull that back a little bit, to stay in firmer control of what's going on, we could change the level here. We could tell it to ask for permission for everything, or grant it added permissions. So we have the ability to change the behavior and the autonomy level that this tool has.

So now we've taken the pickup time and rounded it to the nearest hour, because our weather data was only hourly. So now we've got the pickup hour, and we have the temperature at pickup, the precipitation at pickup, the wind speed at pickup. And then it went above and beyond, as AI tools are destined to do at times, and created this temperature category. Is it cold? Is it cool? Is it mild? What's a categorical definition of the temperature? And a Boolean to indicate whether or not there was precipitation.
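A minimal pandas sketch of that join and the derived columns. Column names like tpep_pickup_datetime and temperature_2m are assumptions based on the datasets discussed, and the category boundaries are illustrative, not the assistant's exact output.

```python
import pandas as pd

def join_weather(trips: pd.DataFrame, weather: pd.DataFrame) -> pd.DataFrame:
    """Attach each trip's hourly weather reading plus two derived columns."""
    out = trips.copy()
    # Weather readings are hourly, so truncate each pickup to its hour
    out["pickup_hour"] = out["tpep_pickup_datetime"].dt.floor("h")
    out = out.merge(weather, left_on="pickup_hour", right_on="time", how="left")
    # The "above and beyond" extras: a categorical temperature label...
    out["temp_category"] = pd.cut(
        out["temperature_2m"],
        bins=[-float("inf"), 0, 10, 20, float("inf")],
        labels=["cold", "cool", "mild", "warm"],
    )
    # ...and a Boolean precipitation flag
    out["had_precipitation"] = out["precipitation"] > 0
    return out
```

A left join keeps every trip even when an hour is missing from the weather table, which mirrors the sandbox-first approach here: prove the shape of the join locally before pushing it to Databricks.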

So this kind of proves out the idea that we can do what we set out to do. We can take this taxi data and join it with this weather data that we've collected from the API. And if we go back to our original hypothesis, that allows us to start to answer questions like: how does weather impact the distance that taxis travel? How does it impact the fare, the amount that's charged? Is there a real impact there? We can start to look at some of these questions.

However, we've only done this for 1,000 rows. This trips with weather dataset is just the first 1,000 rows. And it's not in Databricks; if somebody else wanted to use this data, it's just sitting inside this Python session in Positron. So we're part of the way there. We've proved out what we can do. The next step is to define a way to do this programmatically inside of Databricks, on some sort of scheduled cadence in the event that this were a live table.

Building a Databricks asset bundle

So we want to make this up-to-date joined dataset regularly available to other users inside of my Databricks platform. One of the ways we could do this is to leverage Databricks asset bundles, or DABs. A Databricks asset bundle, at its core, is a YAML file that defines the dependencies and configuration of a bundle of resources that we want to deploy to Databricks, along with the behavior associated with that bundle.

We can lean on tools like Positron Assistant to help us create these things. So we can come in and we can say, hey, like now that I've proven that I can join this data with this data, let's create a Databricks asset bundle that helps us do this on a regularly scheduled basis.

If we come in and look at our resources, we can see that I have this weather integration job here that defines a job inside of Databricks that runs daily at 6 a.m. It has some parameters around the catalog and schema that are being used, and the start and end dates, which represent the start and end dates of the New York taxi information. And then it defines this task to go fetch the weather from this OpenMeteo API, using a notebook that contains just the Python code that we've already walked through to fetch the data from the API.
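For a sense of shape, a bundle like the one described might look roughly like this in databricks.yml. Every name, path, and parameter value below is illustrative, not the demo repo's actual file; consult the Databricks Asset Bundles documentation for the full schema.

```yaml
bundle:
  name: weather-integration      # hypothetical bundle name

resources:
  jobs:
    weather_integration_job:
      name: weather-integration
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"   # daily at 6 a.m.
        timezone_id: America/New_York
      parameters:
        - name: catalog
          default: samples
        - name: start_date
          default: "2016-01-01"
        - name: end_date
          default: "2016-12-31"
      tasks:
        - task_key: fetch_weather
          notebook_task:
            notebook_path: ./src/fetch_weather.py   # illustrative path
```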

Part of this as well, if we pull this open and look at the ETL transformations that we define in here, is that we're using Spark declarative pipelines (Lakeflow Spark Declarative Pipelines, I believe, is the technical term), where we're defining these DLT tables that inherently understand data dependencies. What this does is allow us to publish this job and the associated pipelines, and then when these things run, they understand what dependencies are needed, so that things run in the right sequence and order and fail gracefully if there's an error.
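As a sketch of what one of those table definitions can look like in the Python DLT style: note this only runs inside a Databricks pipeline (the runtime provides the dlt module and the spark session), and the table and column names here are assumptions based on the demo, not its exact source.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Taxi trips joined to hourly weather")
@dlt.expect_or_fail("valid_datetime", "tpep_pickup_datetime IS NOT NULL")
@dlt.expect("reasonable_temperature", "temperature_2m BETWEEN -40 AND 50")
def trips_with_weather():
    trips = spark.read.table("samples.nyctaxi.trips")
    # Reading weather_raw via dlt.read() registers the dependency, so the
    # pipeline knows to materialize weather_raw before building this table
    weather = dlt.read("weather_raw")
    return (
        trips
        .withColumn("pickup_hour", F.date_trunc("hour", "tpep_pickup_datetime"))
        .join(weather, F.col("pickup_hour") == F.col("time"), "left")
    )
```

The expectation decorators sketch the "valid date time" and "reasonable temperature" checks mentioned later: expect_or_fail stops the pipeline on violation, while expect only records it.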

Inside of our project, where we have this Databricks YAML file defined, we can simply run databricks bundle deploy from the Databricks CLI. That will take the YAML file, everything it defines, and the dependencies it contains, and deploy it all to Databricks. Once that deployment is successful, we can also execute databricks bundle run, which triggers the job to execute right now, in real time.

And if we run this, we'll see some feedback here in the terminal that gives us an indication that we've uploaded the bundle files. So that includes all the different declarative pipeline information that we have as well as the notebooks that define reaching out to the API. And then we can see that we've now started this weather integration, this weather integration job is now running.

And if we pop open our Databricks UI here and come over to jobs and pipelines, we can see that this weather integration is currently executing. And I can see down here, this particular flow right here is currently running. We can click into this and see where we're at. So currently, we're in that fetch weather data portion where we're running that fetch weather notebook. And if we click in, we can see details about where we're at in that notebook execution.

So we can see here each of the different steps: we grabbed 4,368 weather records, and now we're writing that weather to a new table, weather_raw. That table now contains this weather data that we fetched from the API. And if we come back here and look at our run, we've now moved on to the declarative pipeline.

And in here, we have a number of different dependencies that are currently being mapped out. And then we'll see a number of different inputs here. So we have this weather, the weather raw table gets read in here. We're pulling the taxi trips data. And then we're creating a number of materialized views based on this data that we've pulled in from this OpenMeteo API.

This pipeline is now bringing all those things together and then generating a number of different materialized views. There are a number of different ways people might look at these views. You might be familiar with bronze, silver, and gold layers of data; this would fit that abstraction, right? You could have a gold level of reporting, aggregated information; a silver layer of the joined data; and then your bronze layer, just the raw tables that are feeding into those joins.

So we see each of these sort of in sequence here, all defined inside of this Databricks asset bundle. And the advantage of that is if we come back into Positron for a moment, all of this is just sort of standard Python code. When I look at my weather data source Python file, this is just Python code. And because it's just a Python file and I'm working inside of Positron, I have access to debugging capabilities.

I have the ability to use version control, work efficiently with the files I'm operating on, and collaborate with others through tools like Git and GitHub, or whatever the case might be.

All the things that I would expect to have available to me, like version control integration (we can see all of my Git information in here) and the ability to run debugging, are all part of this Positron environment. I can build all the pieces here and then deploy those pieces as part of this bundle to Databricks. Once that bundle has been deployed, it can be scheduled, and it's going to run.

I can define expectations here. In fact, we can see an example of this: I have an expectation that there's a valid date time, and an expectation that there's a reasonable temperature. So I can define data expectations so that if something doesn't meet them, the pipeline will fail and I can be notified, go in, and address what's going on. And all of this can be driven from Positron, leveraging tools like Positron Assistant, and then pushing the outcome and the result out into Databricks.

Notebooks vs. IDE: when to use each

Let's talk for just a second, and I touched on it briefly, about the difference between notebooks, whether that's Jupyter notebooks or, in this case, Databricks notebooks, and a true integrated development environment. This is not comprehensive, and there are reasons you'll use one over the other; depending on what you're trying to do, there's a tool that might be better suited for the job.

But there's a well-known dialogue around a few common issues with notebooks. They're messy when working with version control, because you have a lot of JSON data attached, versus standard Python files, where your diffs are a little bit easier to understand. Sharing code across notebooks, particularly within Databricks, can be a challenge, versus Python, where you can structure things in modules and have shared imports. Code review can be a challenge with notebooks, again going back to the Git diff issues there. Unit testing can be tricky, although there are some frameworks that can help; but if you're using standard Python files, you've got access to standard testing frameworks that you can leverage to build testing around those files. And debugging in notebooks can be a challenge, while in an IDE you have a full catalog of debugging tools at your disposal.
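To make the testing point concrete, here's an illustrative pytest-style test against a plain-Python helper of the kind this pipeline might factor out. The function name and column names are hypothetical, not taken from the demo repo.

```python
import pandas as pd

def add_pickup_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Truncate pickup timestamps to the hour so trips join to hourly weather."""
    out = df.copy()
    out["pickup_hour"] = out["tpep_pickup_datetime"].dt.floor("h")
    return out

def test_add_pickup_hour():
    # Standard pytest conventions: any test runner can pick this up,
    # locally or in CI, with no notebook involved
    df = pd.DataFrame(
        {"tpep_pickup_datetime": pd.to_datetime(["2016-02-03 14:59"])}
    )
    result = add_pickup_hour(df)
    assert result.loc[0, "pickup_hour"] == pd.Timestamp("2016-02-03 14:00")
```

Because the same module is what the bundle deploys, the tested code and the production code are one and the same.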

And as we highlighted a little bit today, Positron Assistant does a really great job, if it encounters any sort of errors, of addressing those errors in real time. Notebooks are also a little bit notorious for the fact that cells can be executed in arbitrary order, so if you're not careful in the construction of the notebook, you can end up in a state where it's no longer linear. Stateless functions and the way a Python file natively executes can help prevent some of that from creeping in.

If you're looking at, for example, a code diff with a notebook versus a standard Python file, you end up with quite a lot of noise in a notebook diff versus a Python file that's a little bit cleaner to understand what's changed and what those changes might introduce. So reviewing changes through version control can be a little bit more seamless with standard Python files. You have testing that we talked about previously. You can define tests in one place in your developer environment, and those tests can then be applied anywhere those Python files run.

And the idea here really is to say, look, let's have one collection of code. We can build tests around that. We can run that code in our local environment in Positron. And then we can, using Databricks asset bundles and some of these other tools that Databricks provides, we can then move that code, push that out into Databricks, and run that as part of a pipeline or a scheduled job inside of our workspace.

Now, again, this isn't to say that we shouldn't ever use notebooks. Notebooks can be great for ad hoc data analysis, quick experiments, and sharing results; notebooks have a nice format to them. It's a lot easier to look at a notebook, with its inline plots, than at just a Python file if you're trying to walk somebody through something. And for interactive tutorials, or just interacting in real time, I think notebooks provide a really great interface. But when you're doing these production workloads, defining multiple different pieces at a time and building out these complex pipelines and jobs, using a true developer environment (and we think Positron is very well suited for this task) can start to have a distinct advantage over trying to consolidate and run everything inside a notebook or collection of notebooks.

Summary and resources

So just as a review of some of the things we've chatted through today: Positron brings a very well thought out developer environment for data work to the market. It's available on the desktop; we used Positron on the desktop throughout this entire discussion. It's also available on the commercial side through a tool we offer called Posit Workbench, where it's hosted on a server and accessed through a browser. So there are a few different avenues you can take to get access to Positron. But on the desktop side, it's a free install, available to anyone.

Databricks asset bundles allow us to treat our infrastructure as code. We can define different components, the clusters that we're using, what files we're including in the bundle, how that execution is happening. Databricks asset bundles give us a really solid framework for defining locally what we want to run in Databricks and then pushing that compute out there.

These Lakeflow declarative pipelines help us simplify our pipeline development. We're able to describe expectations and inherit dependencies, and that gives us a really nice framework for describing which data depends on what outputs and where that data goes from there. And the IDE itself provides workflows that improve our reliability, our ability to collaborate, and our ability to orchestrate and manage tests.

And at the end of the day, if you think about who we are at Posit, our goal is to support free and open source scientific research, technical communication, and data science for as long as we can. What that means is, ultimately, we want you to use the right tool for the job. We think we've built some tools that apply in certain circumstances, and there are certainly many other tools that are useful here as well. So use the tool that's best for you. We think Positron is uniquely positioned to be a tool not only for data scientists, but also for data engineers working alongside platforms like Databricks, like we've demonstrated today.

We think Positron is uniquely positioned to be a tool not only for data scientists, but also for data engineers working alongside platforms like Databricks, like we've demonstrated today.

Before we jump into questions, just really quickly, we have some resources that we can share, the GitHub repository for this, details about Positron, and then some information from the Databricks site about asset bundles and declarative pipelines. You can visit these links and learn more. The slides are all part of the GitHub repository. So if you want kind of one place you can go and then get the rest of these resources, this GitHub repository has everything that we've walked through today.

Okay. With that, let's kind of open up and see what sort of questions we have, and we'll do our best to answer those as we've got time today.