Resources

Data + AI Summit 2024 - Keynote Day 2 - Full

Speakers:
- Alexander Booth, Asst Director of Research & Development, Texas Rangers
- Ali Ghodsi, Co-Founder and CEO, Databricks
- Bilal Aslam, Sr. Director of Product Management, Databricks
- Darshana Sivakumar, Staff Product Manager, Databricks
- Hannes Mühleisen, Creator of DuckDB, DuckDB Labs
- Matei Zaharia, Chief Technology Officer and Co-Founder, Databricks
- Reynold Xin, Chief Architect and Co-Founder, Databricks
- Ryan Blue, CEO, Tabular
- Tareef Kawaf, President, Posit Software, PBC
- Yejin Choi, Sr Research Director Commonsense AI, AI2, University of Washington
- Zeashan Pappa, Staff Product Manager, Databricks

About Databricks: Databricks is the Data and AI company. More than 10,000 organizations worldwide — including Block, Comcast, Conde Nast, Rivian, and Shell, and over 60% of the Fortune 500 — rely on the Databricks Data Intelligence Platform to take control of their data and put it to work with AI. Databricks is headquartered in San Francisco, with offices around the globe, and was founded by the original creators of Lakehouse, Apache Spark™, Delta Lake and MLflow.

Connect with us:
- Website: https://databricks.com
- Twitter: https://twitter.com/databricks
- LinkedIn: https://www.linkedin.com/company/data…
- Instagram: https://www.instagram.com/databricksinc
- Facebook: https://www.facebook.com/databricksinc

Jun 14, 2024
2h 15min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey everybody! Hey, super excited day two here. We have an awesome program in front of us but I want to start by first again thanking our partners. Without them this program would not be possible so I want to thank the GSIs and the hyperscalers and all the ISVs that you see on this picture. Please go to the expo hall, check out what they're up to.

Okay, so we have a really awesome program in front of us today. You're gonna hear from the Texas Rangers, you're gonna hear from the creator of DuckDB, you're gonna hear from Matei Zaharia, who started the Spark project, he's gonna talk about UC, we're gonna hear about Apache Iceberg from the original creator of the project, Ryan Blue, and then we're gonna hear from Tareef of Posit, formerly RStudio, and then Professor Yejin Choi from UW.

And then we have lots and lots of announcements today. Before I jump in, I wanted to quickly recap yesterday. So in case you missed it, yesterday we talked about the acquisition of Tabular, which was a company started by the original creators of Apache Iceberg, and what we talked about is how we intend to bring these formats closer and closer together: Delta Lake and Apache Iceberg.

And if you want compatibility or interoperability today, we announced the GA of Uniform. So store your data with Uniform, which stands for universal format, and you will get the best of both of those formats. Okay, we've got all the original creators of both of those projects, and we're making sure that Uniform really works well with both.

So that was the first thing that we announced. Second, we talked about Gen AI. And in Gen AI we kind of talked about how lots of companies are focused on general intelligence, which is super cool. Models that are really good at anything. You can ask them about history, math, and so on. But we're focused on data intelligence. Data intelligence is not just general intelligence, it's intelligence on your data, on your custom data, on the proprietary data of your organization. Being able to do that at a reasonable cost and with privacy intact.

Okay, we talked about compound AI systems and the agent framework that we released yesterday that lets you build your own compound AI systems. Okay, and then we heard from Reynold yesterday about data warehousing, and he talked about the performance improvements that we've seen over just the last two years. So on concurrent BI workloads, we saw a 73% improvement on the BI workloads running on Databricks over the last two years. We're just tracking those over two years, and it's a massive improvement.

So check it out. And then I was very excited about AI/BI. So AI/BI was a project that we built from the ground up, with generative AI in mind, to completely disrupt how we do BI today. So that's also available in Databricks, so check it out. Okay, so that's what we did yesterday.

Okay, I will introduce the next speaker. Her name is Yejin Choi, and she's a professor at UW. She's going to be talking about SLMs. Okay, what are SLMs? Everybody's talking about large language models. These are small language models. Okay, how do they work? What makes them tick? What's the secret sauce to make SLMs work really, really well? Super excited to welcome on stage, Yejin Choi.

Small language models: impossible distillation

All right, so I'm here to share with you impossible possibilities. So last year, when Sam Altman was asked how Indian startups could create foundation models for India, he said: don't bother, it's hopeless. Whoa. First of all, I hope that Indian startups didn't give up and will not give up. Second of all, this conversation could have happened anywhere else, in the US, at any university, startup, or research institute without that much compute.

So here comes impossible distillation: how to cook your small language models in an environmentally friendly manner, so that they taste as good as the real thing. So currently, what we hear as the winning recipe is extreme-scale pre-training followed by extreme-scale post-training, such as RLHF. What if I told you I'm going to start with GPT-2, that small, low-quality model that nobody talks about, and somehow, I don't know why or how, but somehow, we're going to squeeze out a high-quality small model, and then compete against a much stronger model that may be two orders of magnitude larger?

Now, this should really sound impossible, especially when you might have heard of a paper like this, which says "the false promise of imitating proprietary large language models." Although what they report is true for the particular evaluation setup they used, please do not over-generalize and conclude that all small language models are completely out of the league. There are numerous counterexamples demonstrating that task-specific symbolic knowledge distillation can work across many different tasks and domains, some of which are from my own lab.

Today, though, let me just focus on one task, which is going to be about how to learn to abstract in language. To simplify this task, let's begin with sentence summarization as our first mission impossible. So here, the goal is to achieve this without extreme-scale pre-training, without RLHF at scale, and also without supervised data sets at scale, because these things are not always available. But wait a minute: usually you need at least some of these three, so how are we supposed to do any good against a larger model without any of them?

So the key intuition is that current AI is as good as the data that it was trained on. We have to have some advantage, we cannot have a zero advantage, so that advantage is going to come from data. By the way, we have to synthesize data, because if it already exists somewhere on the internet, OpenAI has already crawled it, that's not your advantage, they have it too. So you have to create something genuinely novel that's even better than what's out there.

So usually, distillation starts with a large model, but we're going to toss that out just to show you how we may be blinded to the hidden possibilities. So I'm going to start, just for demonstration purposes, with GPT-2, that poor, low-quality model, and then I'm going to do some innovations, which I'm going to sketch in a bit, to make a high-quality data set that can then be used to train a small model that will become a powerful model for a particular task.

The only problem, though, is that GPT-2 doesn't even understand your prompt; you cannot do prompt engineering with GPT-2. You ask it to summarize your sentence, and it generates some output that does not make any sense. So then you try again, because there's usually randomness to it; you can sample many different examples, like hundreds of examples, and we find that it's almost always no good, like less than 0.1% good.

Where there's a will, there can be a way. So we had multiple different ideas, including our NeuroLogic decoding. This is a plug-and-play inference-time algorithm that can incorporate any logical constraints into your language model output. For any off-the-shelf model, we can plug this in to guide the semantic space of the output. But because GPT-2 is so bad, even with this the success ratio was only about 1%. But this is not zero.

Now we are going somewhere, because if you over-generate a lot of samples and then filter, you can actually gain some good examples this way. And then brilliant students came up with many different ideas. I'll gloss over the technical details, but we found some ways to increase the success ratio to beyond 10%, just so that it's a little bit easier to find good examples.

So the overall framework goes something like this. You start with the poor teacher model, you over-generate a lot of data points, and then, because there's a lot of noise in your data, you have to do serious filtration. So here we use a three-layer filtration system. The details are not very important, but let me highlight the first one, the entailment filter, which was based on an off-the-shelf entailment classifier that can tell you whether a summary is logically entailed by the original text or not. This off-the-shelf model is not perfect; it's maybe about 70 to 80% good, but it's good enough when you use it aggressively to filter your data.
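As an aside, the over-generate-then-filter loop described above can be sketched in a few lines. Everything below is illustrative: the "teacher" is a stand-in for sampling from a weak model like GPT-2, and the two filters are crude proxies (a compression check and a no-new-words check) for the paper's real filters, including the entailment classifier.

```python
import random

random.seed(0)

def weak_teacher(sentence, n_samples=200):
    """Stand-in for sampling many noisy candidate summaries from a weak LM."""
    words = sentence.split()
    for _ in range(n_samples):
        k = random.randint(1, len(words))
        yield " ".join(random.sample(words, k))

def length_filter(src, cand):
    """Keep only candidates that actually compress the source."""
    return len(cand.split()) <= len(src.split()) // 2

def faithfulness_filter(src, cand):
    """Crude proxy for an entailment filter: reject any candidate that
    introduces words absent from the source (no hallucinated content)."""
    return set(cand.split()) <= set(src.split())

def distill(sentence):
    """Over-generate with the weak teacher, then filter aggressively."""
    return [c for c in weak_teacher(sentence)
            if length_filter(sentence, c) and faithfulness_filter(sentence, c)]

src = "the quick brown fox jumps over the lazy dog near the river"
good = distill(src)
print(f"kept {len(good)} of 200 candidates")
```

The surviving candidates would then become training data for a small student model, which in turn can serve as the next-generation teacher.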

Then we use that data to train a much smaller model, which can then become the teacher model for the next generation of students. So we repeat this a couple of times to make, in the end, the high-quality DimSum data set and a high-quality model. We evaluated this against GPT-3, which was the best model of the time; this was actually done before ChatGPT came out, and we were able to beat GPT-3, which was at that time the best summarization model out there.

But since ChatGPT came out, people are like, whatever, ChatGPT can do everything, including summarization, so why should we bother? So here comes Mission Impossible 2, where we are now going to compete against GPT-3.5, and make the challenge even harder for us: now we are going to summarize documents, not just sentences, and we are also going to do all of the above without relying on that off-the-shelf entailment classifier. In practice you could use it; academically, we wanted to see how far we could push the boundary against the commonly held assumptions about scale.

So our new work, InfoSumm, is an information-theoretic distillation method, where the key idea is that instead of that off-the-shelf entailment filtration system, we're going to use some equations. The equations are actually only three lines of conditional probability scores that you can compute using off-the-shelf language models. It's too early in the morning, so let's not drill into the details, but I can tell you, hand-wavily, that if you shuffle them around, you can interpret them as special cases of pointwise mutual information, which you can use for the purpose of filtering your data.
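Read as pointwise mutual information, the idea is that a good summary y should be far more probable given the document x than it is unconditionally: PMI(x; y) = log p(y|x) - log p(y). A toy sketch, with made-up log-probabilities standing in for scores from an off-the-shelf language model (the paper's actual criteria are more involved):

```python
def pmi(log_p_y_given_x, log_p_y):
    """PMI(x; y) = log p(y|x) - log p(y): how much more likely the
    candidate summary y becomes once the document x is given."""
    return log_p_y_given_x - log_p_y

# (summary, log p(y|x), log p(y)) -- illustrative numbers only.
candidates = [
    ("faithful, document-specific summary", -2.0, -9.0),
    ("generic boilerplate sentence",        -3.0, -3.5),
]

threshold = 2.0
kept = [s for s, lpyx, lpy in candidates if pmi(lpyx, lpy) > threshold]
print(kept)
```

A document-specific summary gains a lot of probability once the document is known (high PMI), while boilerplate is about equally likely everywhere (PMI near zero) and gets filtered out.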

So we use the same overall framework as before. We now use the Pythia 2.8-billion-parameter model, because we liked it a little better than GPT-2, and for the filtration, we now use the three short equations I showed you earlier. And then we do the same business. This time, though, we make the model even smaller, only a 0.5-billion-parameter model, which leads to a high-quality summarization data set as well as a model.

So how well do we do? Well, as promised, we do either as well as GPT-3.5, at least for this task, or better, depending on how you set up the evaluation challenges and benchmarks. You can check out more details in our paper. To summarize, I demonstrated how we can learn to summarize documents without relying on extreme-scale pre-trained models and many other things at scale.

The real research question underlying these two papers, though, is this idea of how we can learn to abstract. Because right now, the recipe is: let's just make models super big, the bigger the better. But humans, you and I, cannot really remember all the context, like a million tokens. Nobody can remember a million tokens of context. You just abstract away everything I told you instantaneously, but you still remember what I've said so far. That's really amazing human intelligence that we don't yet know how to build efficiently into AI models, and I believe that's possible. We're just not trying hard enough, because we're blinded by the magic of scale.

Infini-gram: N-gram models at infinite scale

Okay, so finally, Infini-gram as the third mission impossible. So switching the topic a little bit, now the mission is to make classical statistical N-gram language models somehow relevant to neural language models. How many of you even talk about N-gram models anymore? I don't know. Do you even learn this these days? Here, we're gonna make N equal infinity. We're gonna compute this over trillions of tokens, the response time should be practically instantaneous, and we're not even going to use a single GPU. Like, wow.

Let me tell you how hard it is. So hypothetically, if you're gonna index five trillion tokens in a classical N-gram language model with N equal to infinity, then you're, roughly speaking, looking at two quadrillion unique N-gram sequences that you would somehow have to enumerate, sort, count, and store somewhere, which might take maybe 32 terabytes of disk space, maybe more, who knows, but it's too much. We cannot do that. And if you look at the largest classical N-gram model anyone has ever built, it was Google's in 2007, due to Jeff Dean and others, which only scanned two trillion tokens. I mean, it was a lot back then. Up to 5-grams, which already gave them about 300 billion unique N-gram sequences that they had to enumerate, sort, count, etc. So it's too many. People didn't go much beyond that. So how on earth is it possible that we can actually blow this up to infinity?

So before I reveal what we did, I invite you to go check out the online demo, if you so desire: infini-gram.io. You can look up any token you want. Here's one example highlighted, which has 48 characters. I don't know why that word even exists, but not only does it exist, if you look it up, there are more than three thousand instances, and it shows you how many milliseconds it took: 5.5 milliseconds. It also shows you how that long word is tokenized.

You can also try multiple words to see which word comes next. So for example, "actions speak louder than" what? It's going to show you on the web what the next words are, and again, it's super fast. So what did we do? You'll be surprised to hear how simple this idea actually is. There's something called the suffix array, which I think not all algorithms classes teach, but some do. It's a data structure that we implemented very carefully. So we index the entire web corpus using this suffix array, and the truth is we don't pre-compute any of the n-gram statistics. We just have this data structure ready to go, and when you issue a particular query, we compute the answer on the fly; thanks to the data structure, we can do this super fast, especially with a C++ implementation.

I know people don't usually use that language anymore when it comes to AI research, but it's good stuff that actually runs much faster. How cheap is this? We spent only a few hundred dollars indexing the entire thing, and even for serving the APIs, you can get away with a pretty low cost, and it's really, really fast. Even without GPUs, the latency for different types of API calls is just a few tens of milliseconds.

You can do a lot of things with this. One thing I can share with you right now is that you can interpolate your neural language model with our Infini-gram to lower the perplexity, which is the metric people often use for evaluating the quality of a language model, across the board. And this is only the tip of the iceberg. I'm actually working on other stuff that I wish I could share, but I cannot yet. But we started serving this API endpoint a few weeks ago, and we've already served 60 million API calls, not counting our own access, so I'm really curious what people are doing with our Infini-gram.
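The interpolation she mentions is the classic mixture p(w) = lam * p_neural(w) + (1 - lam) * p_ngram(w). A toy sketch with made-up per-token probabilities (lam and all the numbers are illustrative): where the corpus has literally seen an n-gram the neural model is unsure about, the mixture assigns it more mass, and perplexity drops.

```python
import math

def perplexity(probs):
    """Perplexity of a sequence given its per-token probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Probabilities each model assigns to the same held-out tokens (made up).
p_neural = [0.30, 0.25, 0.02, 0.40]  # stumbles on token 3, e.g. a rare name
p_ngram  = [0.10, 0.05, 0.60, 0.05]  # the indexed corpus has seen that n-gram

lam = 0.7  # interpolation weight on the neural model
p_mix = [lam * a + (1 - lam) * b for a, b in zip(p_neural, p_ngram)]

print(round(perplexity(p_neural), 2))
print(round(perplexity(p_mix), 2))  # lower: the mixture helps
```

In practice the mixture has to be taken over the whole vocabulary at each position, with the n-gram term served by the Infini-gram API; this sketch only shows why the combined score can beat either model alone.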

Concluding remarks on data quality

So, concluding remarks. The TL;DR of my talk is that AI, at least in its current form, is as good as the data it was trained on. The past and current AI usually depends primarily on human-generated data, but it could really be that the future will rely on AI-synthesized data. I know there are a lot of concerns about this: maybe the quality is not very good, there may be bias, so you cannot do this in a vanilla way. You should do it in a more innovative way, but there is a lot of evidence piling up that this actually works.

So Segment Anything by Meta, SAM, is an example of AI-synthesized annotation for image segmentation. It helped to have human validation, but humans alone couldn't annotate that many images. Here's another example: Textbooks Are All You Need by Microsoft, the Phi models, one through three. Again, this is a case where, when you have really high-quality data, textbook-quality data, synthesized, you can actually compete against larger counterparts across many, many different tasks. Maybe it's still not as general as larger models in some capacities, but this is amazing for serving a lot of business needs, where you may not need a generalist, you may need a specialist.

And what the textbooks work also tells you is that quality is what matters. It's not just brute-force quantity, it's quality. DALL-E 3 is yet another example. Why is it better than DALL-E 2 all of a sudden? Well, in large part because of better captions. But where did the better captions come from? The previous model had already used all the good captions. Well, they synthesized the captions. That's how you get high-quality data. Of course, you have to do this with care, but there are many more examples piling up of task-specific symbolic knowledge distillation, including work from my own lab, demonstrating that this can really unlock the hidden capabilities of small models. So it's really about the quality, novelty, and diversity of your data, not just the quantity, and I'll end my talk here. Thank you.

Apache Iceberg and Delta Lake: joining forces

Awesome. Okay, so there we had it: Mission Impossible. So the secret sauce behind these SLMs, small language models, is the data. Surprise. Okay, awesome. So back to this slide. We saw it yesterday. This is the Data Intelligence Platform, and it's sort of guiding us through the different portions of the platform. We went through a bunch of them yesterday, and today the next layer we're going to go through is Delta Lake and Uniform.

So we have a talk on Delta Lake. That was our agenda a month ago when we put this together, but it turned out that, you know, we have now acquired the company Tabular. So we really, really wanted you to hear from Ryan Blue, the original creator of Apache Iceberg. So I want to welcome him on stage and bring him on here.

Hey, Ryan. Hey, good to be here. Awesome. Okay, so congratulations. Welcome to Databricks. Thank you. We are really excited to be here and also excited to get started on this new chapter in data formats. Awesome. So what's the main benefit of joining Databricks? Why join forces?

I, you know, I've never wanted people to worry about formats. Formats have always been a way for us to take on more responsibility as a platform and take responsibility away from people who, you know, worry about things. When we started this, people were worrying about whether or not things completed atomically. And so this next chapter is really about how we remove the choice and the need to stress over, you know, am I making the right choice for the next ten years? That weighs a lot on people. And I think we just want to make sure that everything is compatible. That we're all, you know, running in the same direction with a single standard, if possible. Hopefully we can get there.

Yeah, I think we're gonna get there. Actually, you had a talk, right, a while ago that said something like, I want you to not know about, you know, these formats and Iceberg. There was some title, right? Exactly. I don't want anyone thinking about, you know, table formats or file formats or anything like that. That's a massive distraction from what people actually want to get done in their jobs. So I want people focusing on getting value out of their data and not the minutiae. That's the kind of nerdy problem that, you know, I get excited about. Leave that to us.

Hey, I like it. As a nerd, I think it's awesome. We've got thousands of people to learn how to do ACID transactions and understand all the underpinnings of stuff they otherwise would not give a damn about. Okay, well, everybody wants to hear, like, origin stories. So can you tell us a little bit? How did Iceberg get started? What's the history?

Well, at Netflix, we were really grappling with a number of different problem areas. Atomicity was one, that we didn't trust transactions and what was happening to our data. We also had issues like, you know, more correctness problems. You couldn't rename a column properly and those sorts of things. And we realized that the nexus of all of the user problems was the format level. We just had too simplistic of a format with the Hive format, and we decided to do something about it.

And then I think the real turning point was actually when we open-sourced it and started working with the community, because it turns out everyone had that problem, and we could just move so much faster with the community. It's been an amazing experience. And you started, you were involved in the starting of the Parquet project before that, right? Was some of these thoughts even discussed to do this kind of atomicity and so on back there, or no?

So, part of my experience in the Parquet project informed what we did here, because there were several things that just were not file-level issues. They were this next level of, you know, really table-level concerns. Like, what's the current schema of a table? You can't tell that from just looking at all the files.

Yeah, you know, a lot of people think that this is, you know, the first time we're talking about these things, you know, you and I and others. But this isn't the first time. We're actually talking about interoperability and how to make this work, right? That's true. You know, we've been in touch over the years, you know, talking about this several times. I'm glad that we finally got to the point where it made sense. You know, I think we were always going and doing our own things, but now we've gotten to the point where both formats are good enough that we're actually duplicating effort. And the most logical thing to do is this. It is to start working together, start, you know, avoiding any duplication if possible between the two.

Yeah, that's super awesome. Okay, so, I think a lot of people here are wondering, what does this mean for the Apache Iceberg community? Well, I'm really excited because I see this as a big commitment and a pretty massive investment in the Iceberg community and the health of both Delta Lake and Iceberg in general. I'm very excited, you know, personally to, like, work on this and do a whole bunch of fun engineering problems. And that'll be really nice.

Awesome, man. So, we're super excited to partner with you, you know, collaborate on Delta, Uniform, Iceberg, all these formats, and then, you know, make it such that no one here ever needs to care about this ever again. Thanks so much. Thank you.

Delta Lake Uniform GA

So, now, as I said, originally this talk was just going to be about Delta. So now I want to welcome to the stage the CTO of data warehousing at Databricks, Shant Hovsepian, to talk about Delta and Uniform. Welcome. Thanks, Ali. A lot of us used to work together with Ryan in the past, and it's really exciting to have him here so we can work together again. So this talk is going to be very exciting.

Delta Lake. First of all, I'm going to announce the general availability of Delta Lake Uniform. What is Uniform? Really, it's just short for two words: universal format. It's our approach to full lakehouse format interoperability. See, each of these formats, Delta, Iceberg, Hudi, is essentially a collection of data files in Parquet plus a little bit of metadata on the side, and all of the formats use the same MVCC transactional techniques to keep that together.

And so we thought to ourselves, in this age of LLMs transforming language left and right, couldn't we just find a way to translate that metadata between formats so that you only need one copy of the data? That's exactly what we're doing with Uniform. The Uniform GA allows you to essentially write data as Delta and read it as Iceberg, and we've worked very closely with the Apache XTable and Hudi teams to make that possible. And we're going to be working with the Iceberg team to make that even better.
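For reference, enabling Uniform on Databricks is a table property set at creation time; roughly like the following sketch (property names as documented around the GA, so check the current Databricks docs; the table name and schema are illustrative):

```sql
-- Write the table as Delta, expose Iceberg metadata via Uniform.
CREATE TABLE sales (id BIGINT, amount DOUBLE)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

Iceberg clients can then read the same underlying data without a second copy being written.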

The great thing about Uniform is that there's barely a noticeable performance overhead. It's super fast. You get awesome features like liquid clustering. There's support for all of the different data types, from maps to lists to arrays. And best of all, it's got a production-ready catalog: with UC, Uniform has one of the only live implementations of the Iceberg REST catalog API, and that's available for everybody using Uniform.

There have been over 4 exabytes of data that have already been loaded through Uniform. We have hundreds of customers using it, and one of them, in particular, mScience, as you can see here, was very happy that they were able to have one copy of their data, which allowed them to reduce costs and have better time to value.

And it's innovations like Uniform that are really making Delta Lake the most adopted open lakehouse format. Over 9 exabytes of data were processed on Delta just yesterday, with over a billion clusters per year using it. And this is tremendous growth, 2x more than last year. And if you're like me, when you saw these numbers: I did not believe 9 exabytes. Up until yesterday, I was literally going back through the code, making sure we calculated it correctly, because it's just a tremendous amount of data going into Delta every day.

And it's adopted by a large percentage of the Fortune 500, 10,000-plus companies in production, lots of new features. But most interestingly, it's that last number: there are over 500 contributors. And best of all, according to the Linux Foundation, and this is their project analytics site, it's open, anyone can go to it today, about 66% of contributions to Delta come from companies outside of Databricks. And it's this community that makes us super excited and enables a ton of the features that are now available.

And these are time-tested, awesome, innovative features: things like change data feed, log compaction. I love the row IDs feature that just came out. But there are also things like deletion vectors. Deletion vectors allow you to do fast updates and DML on your data. In many cases, it's 10 times faster than merge used to be. So if you have dbt workloads or you're doing lots of operational changes to data, deletion vectors make your life easier. Over 100 trillion row rewrites have been saved because of the deletion vector feature. And it's enabled by default for all Databricks users.
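Conceptually, a deletion vector is a side bitmap of deleted row positions kept next to an immutable data file, so a DELETE or MERGE marks bits instead of rewriting Parquet. A toy sketch of that concept (real Delta deletion vectors use compressed RoaringBitmap-style structures tracked in the transaction log; the class below is purely illustrative):

```python
class DataFile:
    """Immutable 'Parquet file' plus a deletion vector over row positions."""

    def __init__(self, rows):
        self.rows = list(rows)   # never rewritten after creation
        self.deleted = set()     # the deletion vector: marked row positions

    def delete_where(self, pred):
        """Fast DML: mark matching rows instead of rewriting the file."""
        for i, row in enumerate(self.rows):
            if i not in self.deleted and pred(row):
                self.deleted.add(i)

    def scan(self):
        """Read path: stream rows, skipping positions in the deletion vector."""
        return [r for i, r in enumerate(self.rows) if i not in self.deleted]

f = DataFile([{"id": i, "amount": i * 10} for i in range(5)])
f.delete_where(lambda r: r["id"] % 2 == 0)
print([r["id"] for r in f.scan()])  # [1, 3]
```

The speedup comes from turning a rewrite of every touched file into a small metadata write; compaction later physically removes the marked rows.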

And so it's through these features that we've also been able to unlock access to this amazing ecosystem of tools that support Delta. And with Uniform, that's now GA, we're able to get the same access to the Hudi and Iceberg ecosystem. So if you have tools and SDKs, applications that work in there, they're all part of the Delta family now, thanks to Uniform. And there's been some great improvements to a lot of the connectors, the Trino, Rust connector, lots of awesome innovation happening here. And a lot of that is thanks to this new thing that we've developed called Delta Kernel.

Essentially, at the core of all of this, there's a small library that you can just plug and play into your applications or SDKs that contains all the logic for the Delta formats, all the version changes, new features, and it's making it so much easier for people to integrate and adopt Delta, and most importantly, stay up to date with the latest features. And we've been seeing this. The Delta Rust connector is community supported and has amazing traction. Just a few weeks ago at Google's I-O conference, I believe, BigQuery introduced complete support for Delta. And very recently, DuckDB added full support for Delta.

And the best part of this is we have Hannes here, who's the co-founder, one of the co-creators of DuckDB, CEO of DuckDB Labs, professor of computer science, who's going to talk to us a little bit about how they integrated Delta into DuckDB. Hannes, get over here.

DuckDB and Delta Lake integration

Yes, hello and very good morning. It's wonderful to see all of you here. I have to adjust my eyes a bit to the number of people. As Shant said, I'm one of the people behind DuckDB. So for those of you who do not know what DuckDB is: it's a small in-process analytical data management system, speaks SQL, has zero dependencies, and I'm having a lot of fun working on it with a growing team.

And last year, I talked about DuckDB on this very stage for the first time. And it was very exciting. But lots of things have happened since then in DuckDB land. There's been incredible growth in adoption for DuckDB. We're seeing all sorts of crazy things. As an example, the stars on GitHub have doubled within a year to almost 20,000. And in fact, we're so close to 20,000 that if you star it today, maybe we'll hit it.

But what also happened, just last week, is that we actually released DuckDB 1.0. And that was a big moment for us. It was the culmination of six years of R&D in data management systems. And what does 1.0 mean? It means that we now have a stable SQL dialect and stable APIs. And most importantly, the DuckDB storage format is going to be backwards compatible from now on.

But maybe taking a little bit of a step back, how does DuckDB fit into the general ecosystem? If you look at the world's most widely used data tool, Excel, and then at a very capable system like Spark, there's still a pretty big gap: a lot of data sets are not going to work in Excel, but they are maybe a bit too small to actually throw Spark at. So DuckDB is really perfect for this last mile of data analysis, where you may not need a whole data center to compute something.

So for example, you have already gone through your log files in Spark. And now it's time to do some last mile analysis with DuckDB, doing some plots, what have you. That's where DuckDB fits into this big picture. But now we have to somehow get the data from Spark to DuckDB. So how are we going to do that? Obviously, we're going to use the best tool for the job available, right? CSV files.

Maybe not. So typically, people use Parquet files for this. Obviously, both Spark and DuckDB can read and write Parquet files, so that works really well. But we've all heard about the issues that have appeared with updates and schema evolution, these kind of things, which is why we have Lakehouse formats. So today, we are announcing official DuckDB support for Delta Lake. It's going to be available completely out of the box with zero configuration or anything like that.
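The CSV quip lands because CSV drops all type information in transit, which is exactly what Parquet and the Lakehouse formats avoid. A tiny stdlib Python sketch (the sample values are invented for illustration) shows the problem:

```python
import csv
import io

# Write a tiny "result set" to CSV, then read it back,
# as if handing data from one engine to another.
rows = [("2024-06-14", 42, 3.14)]
buf = io.StringIO()
csv.writer(buf).writerows(rows)

buf.seek(0)
round_tripped = next(csv.reader(buf))

# Every value came back as a string: the types (date, int, float)
# were lost in transit and must be re-inferred by the reader.
print(round_tripped)  # ['2024-06-14', '42', '3.14']
```

Parquet, by contrast, carries a typed schema with the data, which is why it is the usual hand-off format, and why the update and schema-evolution problems on top of it led to formats like Delta Lake.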

But we have done a bunch of these integrations. And one thing that's really special about the Delta Lake integration is that we use this Delta kernel that Databricks is building with the community. And that's really exciting because it means that we don't have to build this from scratch like we used to, for example, with the Parquet reader, but we can actually delegate a lot of the hard work of reading Delta files to the Delta kernel while at the same time keeping our operators within the engine and so on and so forth. So it's really exciting.

We also made an extension for DuckDB that can talk to Unity Catalog. So with this extension, we can find the Delta Lake tables in the catalog and then actually interact with them from DuckDB itself. So here we can see a script that actually works if you install DuckDB now. You can install this Unity Catalog extension. You can create your secret, which is like credentials. And then you can basically just read these tables as if they were local tables. If you want to hear more about this, there's actually going to be a talk this afternoon at 1:40. Just look for DuckDB in the title.
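Under the hood, attaching a catalog like this boils down to the engine making REST calls to a catalog server to discover tables. Here's a minimal, self-contained Python sketch with a toy in-process HTTP server; the endpoint path and response shape are invented for illustration and are not the real Unity Catalog API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# A toy stand-in for a catalog server: one endpoint that lists tables.
# (The real Unity Catalog REST API and its paths are more involved.)
class ToyCatalog(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"tables": [{"name": "store_report"}]}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ToyCatalog)
threading.Thread(target=server.serve_forever, daemon=True).start()

# An engine "attaching" the catalog just issues HTTP calls like this one.
url = f"http://127.0.0.1:{server.server_port}/api/tables"
with urllib.request.urlopen(url) as resp:
    tables = json.load(resp)["tables"]
server.shutdown()

print([t["name"] for t in tables])  # ['store_report']
```

The point is that once the catalog speaks an open HTTP protocol, any engine that can issue these requests can discover and then read the tables, which is what the DuckDB extension does.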

So the Delta extension joins this growing list of DuckDB extensions. For example, there's others for Iceberg, Vector Search, Spatial, and this sort of thing. But as an open source project and a small team, we're really excited about Tabular and Databricks bringing Delta Lake and Iceberg closer together because for us, it means we don't have to maintain two different things for the same, essentially, problem. And we're really excited about that. It means less work for us, and I think everyone wins.

I just want to plug one sort of small thing that we're actually launching today. I've mentioned extensions to DuckDB. We've seen a lot of uptake in DuckDB extensions. And from now on, actually, we are launching community extensions, which means that everyone can make DuckDB extensions and basically publish them and then installing them is as easy as just typing install into a DuckDB near you. So that's all for today. Thank you very much.

Delta Lake 4.0

So how do we top that? By going forward. And forward to Delta 4.0. The branch is cut and it's available. Delta 4.0 is the biggest release in Delta's history. It's jam-packed with new features and functionality, things like coordinated commits and collations, all sorts of new functionality that makes it easier to work with various different types of data sets. We won't have time to go through all of it, so I'm going to pick a couple and dive into why these are such amazing features.

So Liquid Clustering is generally available now as part of Delta 4.0. And with Liquid Clustering, we really set out to solve this challenge that so many people have brought up. Partitioning is good for performance, but it's so complicated. You get over-partitioning, small files, you pick the wrong key, and it's a pain to resolve. Liquid solves this with a novel data layout strategy that's so easy to use that we hope all of you will say goodbye to PARTITIONED BY. You never need to write that again when you define a table.

Not only is it easy to use, we found it's up to 7 times faster for writes and 12 times faster for reads. So the performance benefits are amazing. There are about 1,500 customers actively using this. The adoption has been insane. Over 1.2 zettabytes of data have been skipped. And you don't have to take my word for it. Even Shell, when they started using it for their time series workloads, saw over an order of magnitude improvement in performance. And it was just so easy to use.
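To see why over-partitioning hurts, here's a toy Python model comparing Hive-style one-directory-per-key partitioning with a clustering-style layout that sorts by key and cuts fixed-size files. This is only a sketch of the small-files problem, not Databricks' actual Liquid Clustering algorithm:

```python
from collections import defaultdict

# 10,000 rows keyed by (date, user_id): a classic over-partitioning trap.
rows = [(f"2024-06-{d:02d}", uid) for d in range(1, 11) for uid in range(1000)]

# Hive-style partitioning: one directory (and at least one file) per distinct key.
partitions = defaultdict(list)
for date, uid in rows:
    partitions[(date, uid)].append((date, uid))

# A clustering-style layout: sort by key, then cut into fixed-size files,
# so similar rows still land together but the file count stays bounded.
TARGET_ROWS_PER_FILE = 1000
clustered = sorted(rows)
files = [clustered[i:i + TARGET_ROWS_PER_FILE]
         for i in range(0, len(clustered), TARGET_ROWS_PER_FILE)]

print(len(partitions))  # 10000 tiny "files"
print(len(files))       # 10 right-sized "files"
```

Both layouts keep similar rows close together, so data skipping still works, but the clustered layout never explodes into thousands of tiny files when the key has high cardinality.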

Next, the open variant data type. And this one's really important. That first word, open, is really exciting. What happens now, in this world of AI, is you have more and more semi-structured text data and alternative data sources coming into the lakehouse. And we wanted to come up with a way to make it easier for people to store and work with these types of data in Delta. Usually, when you're stuck with semi-structured data, most data engineers have to make a compromise.

None of us like to make compromises. But usually it's about being open, flexible, or fast, and often you can only pick two out of these three. For example, for semi-structured data, one approach is to just store everything as a string. That's open and it gives you tons of flexibility. But parsing strings is slow; why would you store a number as a string and have to re-parse it every time? Of course, there's the option to pick the fields out of your semi-structured data and make them concrete types, and then you get amazing performance. That's open and very fast to access. However, if you have sparse data, you lose a lot of the flexibility to modify the schema.

And relational databases have had special JSON or variant data types for a while. But all of those have always been proprietary. If you wanted to use them, to get a balance between not storing everything as a string and not shredding out every single column, you got locked in. So that's why we're very excited that variant hits that sweet spot in the middle. You can have your JSON data, store it with flexibility, fully open, with amazing performance. It's very easy to use, and it works even with complex JSON. Here's an example of the syntax. And we found, of course, it's eight times faster than storing your JSON data as raw strings. This is just tremendous.
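The speedup comes from parsing once at write time instead of on every access. A rough Python analogy, where a plain dict stands in for Variant's compact binary encoding:

```python
import json

raw = '{"user": {"id": 42, "tags": ["a", "b"]}, "price": 9.99}'

# Option 1: keep the string, re-parse on every access (open and flexible, but slow).
def get_as_string(doc, path):
    obj = json.loads(doc)  # pay the parse cost on every single read
    for key in path:
        obj = obj[key]
    return obj

# Option 2: a variant-like column parses once into a typed structure,
# and every subsequent access is a cheap traversal. (Real Variant uses a
# compact binary encoding; a Python dict just illustrates the parse-once idea.)
variant = json.loads(raw)

def get_from_variant(v, path):
    for key in path:
        v = v[key]
    return v

assert get_as_string(raw, ["user", "id"]) == get_from_variant(variant, ["user", "id"]) == 42
print(get_from_variant(variant, ["price"]))  # 9.99
```

Both options return the same values; the difference is that the variant-style path pays the parse cost once, which is where claims like "eight times faster than raw strings" come from.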

So if you're storing JSON in a string field today, go back to work or home and start using variant. It's available in DBR 15.3. But most importantly, all of the code for variant is already checked in to Apache Spark. There's a common subdirectory in the 4.0 branch right now that has all of the implementation details for variant and all of the operators. And there's a binary format definition and library that we've made available open source, so all the other data engines can also use variant. We really want this to be an official open format that everyone adopts, so that finally we have a non-proprietary way of storing semi-structured data reliably.

With that, I just want to summarize Delta Lake 4.0. It's interoperable. We have this amazing ecosystem of people like Hannes working together, making it better and stronger. We get amazing performance benefits. And all of this is just so much easier to use now than it ever was before.

Unity Catalog open source launch

OK, so back to the Data Intelligence Platform side. So that was awesome. We heard about Delta UniForm. We heard from Ryan Blue. It's cool that DuckDB now natively supports Delta, UniForm, and UC. And then we have Delta 4.0, so that's awesome. So next, the original creator of the Apache Spark project, Matei Zaharia, is going to tell us about Unity Catalog.

Hi, everyone. Thanks, Ali. Yes, I have the new Unity Catalog t-shirt. You'll be able to get one soon, I think, somewhere. So I have a somewhat longer session for you today, because I'm talking about governance with open source Unity Catalog, as well as data sharing. If you're familiar, we have another open source project we launched, Delta Sharing, that's really making waves in the open data collaboration space, and we have a lot of exciting announcements around that.

So I'll start by talking about what's new in Unity Catalog and what it means to open source it, why we did it, what's in there. And then, so Ali announced that we're open sourcing it yesterday, but he let me keep one more thing to announce today that I'll talk about. That's the next big direction for Unity Catalog. And then, finally, I'll switch gears to sharing and collaboration, and we'll have some cool demos of all these things, too.

So let's start with Unity Catalog. So I think everyone who works in data and AI knows that governance, whether it's for security or quality or compliance, remains a huge challenge for many applications. And it's getting even harder with generative AI. There are new regulations being written all the time about it. I heard in California alone, there are 80 bills proposed that regulate AI in some form. And also, you need to really understand where your data is coming from if you're going to create models and deploy them and run these applications.

So we hear things from our customers all the time about how they would love to use AI, but they can't really govern it with their existing frameworks. And even in the data space, it's complex enough; the rules are changing, and people are really worried about how best to do it. So we wanted to step back and think: it's 2024, and if you had to design an ideal governance solution from scratch today, what would you want it to have? We asked a bunch of CIOs, and we think you really want three things.

The first thing is what we call open connectivity. So you should be able to take any data, any source that's in your organization, and plug it into the governance solution, because no one's going to migrate everything into just one data system over time. Most organizations have hundreds or thousands of these. So you really want a governance solution that can really cover all this data wherever it lives, in any format, even in other platforms.

Then we also really think you need unified governance across data and AI. I think it's clearer than ever with generative AI that you have to manage these together. You can't be managing AI without knowing what data went into it. And also, all the output of AI, as you do serving, is going to be data about how your application is doing. It's got the same problems of quality and security. So we really need it to be unified. And then finally, we heard everyone asking for open access from any compute engine or client, because there are so many great solutions out there and they'll keep coming out. There'll be the next data processing engine, the next machine learning library, and you want them to work with your data.

So this is what we're building with Unity Catalog, especially with the open source launch today. So first of all, open connectivity. Unity Catalog and Databricks as a platform uniquely lets you connect data in other systems as well and process it together in a very efficient way through this feature we call Lakehouse Federation. So you can really connect all your data and give users a single place where you set security policies, you manage quality, and you make it available.

It's also really the only governance system in the industry that unifies governance for data and AI. Since we launched this three years ago, we've had support for tables, models, and files. And we're adding new concepts as they come out in the AI world, like tools, which we talked about yesterday with the tool catalog concept for Gen AI agents. And for all these things, you get these capabilities on top, ranging from access control to lineage to monitoring and discovery.

And finally, one of the big things that is possible today through the open API and open source project we just launched is open access to all your tables through a wide range of applications. I'll talk more about that in a bit. But the cool thing here is, again, it's not just data systems like DuckDB, but also a lot of the leading machine learning ones, like LangChain, can integrate with Unity Catalog.

So let me start with open connectivity. I'm really excited today to announce the GA of Lakehouse Federation, the ability to connect and manage external data sources in Unity. This is a feature we launched a year ago at Summit. It lets you use the same governance rules on top, and get the same experience managing quality, tracking lineage, and so on as people work on those sources, that you get with your Delta tables.

And it's been growing extremely quickly. We now have 5,000 monthly active customers, and if you look at the graph of queries on Lakehouse Federation, it's still growing exponentially. Another really cool thing we're announcing is Lakehouse Federation for Apache Hive and Glue. So if you've got an existing Hive metastore or Glue catalog with lots of tables in it, you can now connect that efficiently to Unity Catalog and manage that data as well.

So what about unified governance across data and AI? There's so much happening in this space and our team has been working hard to launch a whole range of new features here. So first, Lakehouse Monitoring is going GA. So Lakehouse Monitoring is the ability to take any table or machine learning model in Unity Catalog and automatically check for quality on it. And the other thing about this, since it's integrated into the data platform, we know exactly when the data changes or when the model is called. So it's very efficient and it does all this computation incrementally.

And it gives you these rich dashboards about quality, classification of the data discovered, and so on, and all these reports go into tables. The second thing, which we're launching a preview of soon, is attribute-based access control. We've developed a policy builder and tagging system, so you can tag your data and then propagate masking policies across everything that carries those tags. And this works easily through the UI or through SQL.

So a lot of people asked me yesterday, why are you open sourcing Unity Catalog? And really, it's because customers need it. Customers are looking to design their data platform architecture for the next few decades, and they want to build on an open foundation. And today, even though a lot of cloud data platforms claim support for open data, they're not really open.

So there are a lot of cloud data warehouses out there, for example, that can read tables in, say, Delta Lake or Iceberg, but most of them also have their native tables that are more efficient and more optimized, and they really nudge you to use those and have your data locked in. And then there are other platforms, even some of the lake platforms, where it seems like everything is in an open format, but it's not. And so customers want an open Lakehouse where no vendor owns the data, without lock-in, and where they can use any compute engine in the future.

So we've been big fans of this approach for a while. We think it's where the world is going, and that's why we design everything we do to support it. Already today in Databricks, all your data is in an open format; there's no concept of a native table. And the ecosystem of clients that only understand Apache Iceberg or Hudi can still read your tables. So the next logical step is to also provide an open source catalog.

So this is what we have in the first release of Unity Catalog. We proposed the project to the Linux Foundation, and it was accepted this morning. So it'll be there. And another cornerstone of Unity Catalog is that we're doubling down on this format, so you can connect to it from any engine that understands it, and we hope that means a lot of the ecosystem out there will work with it.

All right, so you might be asking, is this for real? When are you actually releasing it? Maybe in 90 days? Maybe 89 days, since Ali announced it yesterday? Yeah, so this is Unity Catalog on GitHub. Looks solid to me. People are working hard on it. So I'm just going to go into the settings here, scroll down to the danger zone, and make this thing public. Yep, I want to make it public. I understand. All right, and I think it's public now. So, yeah, take a look: github.com/unitycatalog. Thanks, everyone. Yeah, so that wasn't that hard.

And, of course, we invite all of you to contribute. We'll be working hard to expand the project, and we want to do it all in the open; we're not going to keep it closed for a while to build this stuff up. All right, so, yeah, it's now available. We just released version 0.1. This version supports tables, volumes for unstructured data, and AI tools and functions. So it implements the tool catalog concept we talked about yesterday, it has the Iceberg support, and if you look at our website, there's an open-source server, APIs, and clients, and these work just as well with your instance of Unity Catalog on Databricks.

We're also really excited to have a great array of launch partners, everywhere from the cloud vendors, some of which have already been contributing a lot to open standards like the Iceberg REST API, to leading companies in AI and in governance. Microsoft, AWS, and Google are all excited to see this happening, and we hope to see more of them in the future. And, of course, there's a lot more coming. We're working on bringing a lot of the nice things you have in Unity on Databricks out here, including Delta Sharing, models, MLflow, and a lot of other things that we're working on.

So that's kind of an overview of Unity Catalog, some of our launches in there. It's great to hear about them, but even better to see a demo, and for that, I'd like to invite Zeeshan Pappa, one of our product managers, to show you all of these new features.

Thanks, Matei. I'm glad to be here.

Unity Catalog offers a unified interface for applying access controls and querying your data. In some cases, only some of your data will reside in the Lakehouse. To address this, we've simplified and secured access to systems such as BigQuery, catalogs such as Glue and Hive, MySQL, Postgres, Redshift, Snowflake, and Azure SQL. All of this is powered by Lakehouse Federation.

Switching over here on the left-hand side to a SQL editor, I'll show you how to query an external data system by running some SQL that joins a store report table federated from Snowflake with a Lakehouse-native source that contains data on retail store returns.

Once this table is created, this store report table will become a Unity Catalog managed object, which means the platform now handles all of your table management challenges including automatic data layout, automatic performance, and predictive optimizations for you. But managed doesn't mean locked in. This table, or any Unity Catalog object is accessible outside of Databricks via Unity Catalog's open API.

Let me show you how easy it is to query this newly created object using DuckDB. First, I will opt this table in for external access, as I've done for other tables in this catalog. Next, I'll switch over to DuckDB, the same nightly build that you can access right now. I'll attach this catalog, AccountingProd, to DuckDB, and now I'll run a quick query to see all of the tables in this catalog.

And as you can see, that store report table that I just created is right there. Next, I'll run a quick query to select from this table.

I can do the same with other tables created in Unity Catalog and quickly query them using DuckDB's native Delta reader. This is Databricks' commitment to open source and open interoperability, here and now.

Unity Catalog governance and PII monitoring

So far, I've walked through Unity Catalog's Explorer, Lakehouse Federation, and our new open API. However, a major challenge for many organizations is ensuring consistent and scalable policy enforcement across a diverse range of data and AI assets. Let me show you how easy it is to scale your governance using tags and ABAC policies combined with proactive PII monitoring.

Let's switch over to the OnlineSalesProd catalog and take a look at this table called WebLogs. One of the features that's been enabled in Unity Catalog is Lakehouse Monitoring, which allows for simplified and proactive detection of PII and anomalies in your data and models. Within this dashboard over here on the left-hand side, you can explore columns and rows, and you can see that PII has been detected in the user input column.

Now, this is obviously a problem. Before this dataset can be actually used, this data must be masked and appropriate policies must be applied. Let's switch back to the Catalog Explorer.

Back in this Explorer, over here in the Rules tab, a new rule can be created to express policy across all data. It's now so much easier to mask all email columns across all tables with a single rule. Let's give this rule a name. Let's call it Mask Email.

And we're going to give it a quick description. Let's mask some emails. And we want to apply this to all account users. And this is a rule type of Column Mask. And we're going to select a Column Mask function that I previously created, conveniently called Mask Email.

And we want to match on a condition: when a column is detected that has a tag with the value PII email. Let's go ahead and create that rule. And that's it. Now, to validate this mask, we're going to go back to the WebLogs table, and we can observe here in the Sample Data that the User Input column, over here to the right, has now been masked.

Since this applies to the entire Catalog, let's go to a different table. As you can see, we've got an Email Address column in here as well, tagged PII email. Let's go up to Sample Data. As you can see over here as well, this table has also been masked with one rule.
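Conceptually, the tag-driven rule in this demo evaluates like the toy Python below; the table names, tag names, and rule shape are illustrative, not Unity Catalog's actual model:

```python
# A toy ABAC evaluation: one rule ("mask columns tagged pii_email")
# applied across every table, instead of per-column grants.
def mask_email(value):
    local, _, domain = value.partition("@")
    return "***@" + domain

TAGS = {  # (table, column) -> tags on that column
    ("web_logs", "user_input"): {"pii_email"},
    ("customers", "email_address"): {"pii_email"},
    ("customers", "country"): set(),
}

RULES = [{"match_tag": "pii_email", "mask": mask_email}]

def read_row(table, row):
    """Apply every rule whose tag matches the column's tags."""
    out = {}
    for col, val in row.items():
        for rule in RULES:
            if rule["match_tag"] in TAGS.get((table, col), set()):
                val = rule["mask"](val)
        out[col] = val
    return out

print(read_row("customers", {"email_address": "ada@example.com", "country": "NL"}))
# {'email_address': '***@example.com', 'country': 'NL'}
```

Because the rule matches a tag rather than a specific column, adding a new table with a tagged email column makes it masked automatically, which is exactly why one rule covered both tables in the demo.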

So Matei, I've shown how Unity Catalog enables organizations to have open access to their data seamlessly no matter where it resides while applying unified governance to ensure its integrity and security. Thank you.

Unity Catalog metrics

Thanks so much, Zeeshan. All right. So that's Unity Catalog in action. So as I said, Ali let me keep one thing to announce today which I'm really excited about.

So what we just saw is you could set up a Catalog, it's open, it's got access control, monitoring, you can get to it from any engine, you can federate stuff into it. Are you done? As an engineer, you might say this is pretty good. What will happen, unfortunately, is someone will come in and ask a business question. For example, how is my ARR trending in EMEA?

And to answer this kind of question, there's not enough information in just the Catalog, in just the table schemas and things like that. So you have to understand things like how is ARR defined? That's some kind of unique calculation for your business. Maybe there are many tables that mention ARR. Which one actually is the right one to use to get this information? How is EMEA defined? Which countries are really part of it? And so the question is, how do you bridge this gap?

So this is something that is typically done in some kind of metrics layer and we're really excited to announce that we're adding first class support for metrics as a concept in Unity Catalog.

So this is something we'll be rolling out later this year. So Unity Catalog metrics. So the idea here is that you can define metrics inside Unity Catalog and manage them together with all the other assets. So you can set governance rules on them, you can find them in search, you can get audit events, you can get lineage for them and so on.

And like the other parts of Unity, we're taking an open approach to this. We want you to be able to use the metrics in any downstream tool, so we're going to expose them to multiple BI tools; you can pick the BI tool of your choice. We'll of course integrate them with AI/BI. One of the things we're excited about is that we're designing metrics from the beginning to be AI-friendly, so that AI/BI and similar tools can really understand how to use them and give you great results.

And you'll be able to just use them through SQL, through table functions that you can compute on. And we're also partnering with dbt, Cube, and AtScale as external metrics providers, to make it easy to bring in, govern, and manage metrics from those tools inside Unity.
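The define-once, query-anywhere idea behind a metrics layer can be sketched in a few lines of Python; the metric structure and `evaluate` function here are purely illustrative, not the Unity Catalog metrics API:

```python
# A governed metric: the definition (measure + allowed dimensions)
# lives in one place, and every consumer evaluates it the same way.
web_revenue = {
    "measure": lambda row: row["price"] * row["qty"],
    "dimensions": ["country", "date"],
}

orders = [
    {"country": "DE", "date": "2024-06-14", "price": 10.0, "qty": 2},
    {"country": "DE", "date": "2024-06-14", "price": 5.0,  "qty": 1},
    {"country": "FR", "date": "2024-06-14", "price": 8.0,  "qty": 3},
]

def evaluate(metric, rows, by):
    """Aggregate the metric's measure, sliced by one declared dimension."""
    assert by in metric["dimensions"], "can only slice by a declared dimension"
    totals = {}
    for row in rows:
        key = row[by]
        totals[key] = totals.get(key, 0.0) + metric["measure"](row)
    return totals

print(evaluate(web_revenue, orders, by="country"))  # {'DE': 25.0, 'FR': 24.0}
```

Because "revenue" is defined exactly once, a dashboard, a notebook, and a natural-language tool slicing by country all get the same 25.0 for DE, which is the consistency problem a metrics layer exists to solve.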

So with all of this stuff in action, I'd like to invite Zeeshan to the stage again for a quick demo of metrics.

Thanks, Matei. As you mentioned, metrics enhance business users' ability to ask questions, understand their organization's data, and ultimately make better decisions. Rather than sifting through all of the data, certified business metrics can be governed, discovered, and queried efficiently through Unity Catalog.

Having already discussed the Catalog Explorer, let's dive into business metrics. In this overview tab here, you'll see a list of all available sales metrics. A few of these metrics are marked certified. This indicates a higher level of trustworthiness.

We're going to select the Web Revenue metric down below here. As you can see, it's also marked certified. By clicking into this, you can see the metadata that's associated with this metric and the predefined dimensions that are associated with Web Revenue. These are used when querying the metrics such as date or location. This is like having a built-in instruction manual for your data.

On the right-hand side here, you can see the metric overview section. This is where you can see the description of the metric, who edited it, and who certified it. You can also see information about where the metric came from and where it's used, such as notebooks, dashboards, and Genie spaces. Let's click into this dashboard.

As you can see in this dashboard, over here on the right-hand side in the x-axis column, I have all of the interesting information, such as the dimensions country, city, state, and so on. This allows you to slice and dice without needing to fully understand the data model.

Let's go into a notebook as well. This metric isn't just usable in a dashboard. It's also queryable from external tools and notebooks. In this notebook, we're using the get metric function to pull all aggregated data. It's that simple.

Finally, let's go back into a Genie space. Here, the Web Revenue metric can be used to answer natural language questions. In this space, you'll also see that this visualization was created by asking about the revenue generated across states using this metric. This approach extends the reach of these metrics throughout the organization, making them accessible to business users.

As you can see, Matei, Unity Catalog metrics make it easy for any user to discover and use trusted data to make better decisions.

Sharing and cross-org collaboration

Thanks, Zeeshan. Super excited about this. So, after doing two demos this morning, I think Zeeshan can have the rest of the day off.

All right. So, for the final portion of the talk, I want to talk about what we're doing in sharing and cross-org collaboration. Depending on what industry you're in, you've probably seen that data sharing and collaboration between companies and organizations is becoming a really important part of the modern data space. It can help providers and suppliers coordinate better, and it can help streamline a lot of business processes.

Just yesterday, I met a customer who thought that they could speed up, basically, launching new drugs by a factor of two by implementing these kind of technologies. So, really powerful way for many industries to move forward.

And we started looking at this area about three years ago. We wanted to provide great support for it, and we started by talking to a lot of data providers who collaborate. And what they told us was that many of the data platforms out there support some kind of sharing between different instances, but it's always closed. You can only share within that same data platform, with other customers of that same data warehouse or whatever.

And as a provider, or as any company that wants to collaborate with a lot of partners, this is very restrictive. So Amperity, for example, which is a CDP, said that they would prefer to invest in open solutions that let them set up data collaboration once and then be able to reach anyone, regardless of what platform they're computing on.

So that's the approach we've taken with all our sharing and collaboration infrastructure: creating an open collaboration ecosystem based on open standards. And the core of that is Delta Sharing, a feature of Delta Lake that allows you to securely share tables across clouds and across data platforms. And then we've built on that with Databricks Marketplace and Databricks Clean Rooms.

So, if you're not familiar with Delta Sharing, this is a core part of the Delta Lake project, where if you have a table, and increasingly other kinds of assets as well, you can run a server with an open protocol and serve out just parts of your table to other parties that are authorized to access them.

And because the protocol is open, and it's a very simple one based on Parquet files that recipients are given access to, it's really easy to implement a lot of consumers. Of course, you can use Databricks to access these shares, but you can also just use Pandas or Apache Spark, and even BI products like Tableau and Power BI let you load the data right in. And it makes a lot of sense: if you're a data provider and you want to publish something, why should the other party even need to install a data warehouse in the first place? Why not deliver that data straight to Tableau or straight to Excel or something like that?
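In the Delta Sharing protocol, a recipient addresses a shared table as share.schema.table against a provider endpoint taken from a profile file, and the server hands back access to the underlying Parquet files. The helper below sketches how a client might compose the table's query path; the endpoint URL and share name are made up for illustration, so consult the actual protocol spec for the real details:

```python
# Compose the REST path a Delta Sharing client would hit for a table.
# Coordinate format: "<share>.<schema>.<table>".
def table_url(endpoint, coordinate):
    share, schema, table = coordinate.split(".")
    return f"{endpoint}/shares/{share}/schemas/{schema}/tables/{table}/query"

# Hypothetical endpoint and share, purely for illustration.
url = table_url("https://sharing.example.com/delta-sharing",
                "sales_share.retail.store_returns")
print(url)
```

The simplicity is the point: any client that can issue HTTP requests and read Parquet can be a consumer, which is why Pandas, Spark, and BI tools could all implement it.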

Today, 40% of Delta Sharing recipients are not on Databricks. So this idea of cross-platform collaboration is real, and our customers are able to deliver data and have real-time data exchange with anyone, regardless of what data platform they're using. So I'm super excited about the growth of that this year.

We're continuing to expand Delta Sharing, and one exciting announcement is that we're hooking together two of the best features of the platform, Lakehouse Federation and sharing, to let you share data automatically from other data sources as well.

So we talk to a lot of companies who have some data in a data warehouse and they want to collaborate. And since we built this federation technology that can efficiently query this data, push down filters, get it out and deliver it, we are just connecting that to Delta sharing to let you seamlessly do this. So now you can really share data from, you know, any data warehouse, any database with any app that understands the Delta sharing capability that we provide.

Data marketplace and cleanrooms

So one of the things that builds on Delta Sharing is the data marketplace. This is something we launched about two years ago, and it's also been going extremely well, and we are very excited about it. It was the first data marketplace of its kind available on any cloud and any platform.

Our team has been adding a whole bunch of new functionality there that providers are asking for, like private exchanges, sharing of non-data assets like models and volumes, usage analytics, and even support for non-Databricks clients. If you put data in there, you can reach these other platforms, and it's a lot of fun.

And we are also very excited to welcome 12 new partners to the sharing and marketplace ecosystems. Some of these announcements went out last week, but industry leaders in many different domains, from Axiom to Amperity to Atlassian, are now connecting to this ecosystem and making data available to users on Databricks, and we are very excited to see how this will continue to grow.

The final thing I want to talk about is that we are soon launching public preview of Databricks cleanrooms.

So what is a clean room? It's a place where you can bring some tables, some code, some unstructured data, some AI models, any kind of asset you can have on the Databricks platform, and you can agree with someone else on a computation that you run and then send the results to just one recipient. So, for example, it could be as simple as you each have some tables and you want to figure out how many records you both have in common, and it can be as complicated as someone has a machine learning model, you have your own model, and you want to get the predictions or the differences between the two models.
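
The "records in common" case can be sketched in a few lines. This is just a toy illustration of the idea, not how Databricks clean rooms are actually implemented: each party contributes only salted hashes of its join keys, and only the aggregate count comes out of the computation.

```python
import hashlib

SHARED_SALT = b"agreed-upon-salt"  # hypothetical value both parties agree on

def hashed_keys(emails):
    """Each party hashes its own keys before they enter the computation."""
    return {hashlib.sha256(SHARED_SALT + e.encode()).hexdigest() for e in emails}

retailer_emails = ["ana@example.com", "bo@example.com", "cy@example.com"]
supplier_emails = ["bo@example.com", "cy@example.com", "di@example.com"]

# Only this aggregate leaves the "room"; neither side sees the other's rows.
overlap = len(hashed_keys(retailer_emails) & hashed_keys(supplier_emails))
print(overlap)  # 2
```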

So two things really distinguish Databricks clean rooms from other clean room solutions out there. The first is that because you have this complete data and AI platform, you can really run any computation: machine learning, SQL, Python, R, and so on, versus just SQL in many other clean room solutions. And the second is that because it's built on the open sharing foundation, you can do cross-cloud and even cross-platform collaboration.

If someone's primary data store is not Databricks, they can still seamlessly connect it to the cleanroom and do work on that. So this is going to go into public preview just a little bit later this summer. And we've already seen some really awesome use cases.

One company we've been working closely with is Mastercard, which does a lot of data-driven work with clean rooms and has a range of different partners. You can imagine the different kinds of things they can do with their data, and they're looking for the best way to apply private versions of state-of-the-art algorithms and techniques to work with it.

So I want to show you all this collaboration work in action. And for that, we have our third demo.

Thanks, Matei. Picture this. I'm part of a media team at a large retailer. We are teaming up with a supplier to run a joint advertising campaign to grow our sales. And to do that, we need to collaborate on joint customer data.

I, as a retailer, have data on my customers and their shopping behavior. My supplier has their customer loyalty data. However, we have some challenges. First, we cannot share any sensitive information about our customers. Second, we want to leverage machine learning and Python for our analysis, not just SQL.

Databricks clean rooms can help with all of this in a privacy safe environment. Let's see how.

I create the clean room with East US as my cloud region. And what's amazing is that it doesn't matter that my supplier and I are on different regions and clouds. I then go ahead and specify my supplier as a collaborator. And once the clean room is created, I bring in the data.

Now, my clean room is ready for my supplier to come join. So, let me flip hats. I'm now the supplier, hence dark mode. And I join the clean room that my retail counterpart added me to. Now, I can see the metadata associated with the table, but not the actual data. This is perfect context for good collaboration, while ensuring that I'm not privy to any sensitive information.

Now it's my turn to bring in my customer data, but my customer data is in a Snowflake warehouse outside Databricks. And I don't want to create a custom ETL pipeline to bring this data in. And I don't have to, because lucky for me, I can directly specify Lakehouse Federation tables as sources for this clean room. With no copies and no ETL, these clean rooms truly scale for cross-platform collaboration.

And now, my favorite part. I inspect the notebook. The code looks good. And I run it. So the job run has successfully started. And in a few seconds, it's done. And I'm presented with delightful visual results to help me understand that we can target 1.2 million households for our campaign based on factors such as customer age, income bracket, and household size. Thank you.

So let's go back to our slides to summarize what we just saw. Our retailer and supplier were able to bring their respective customer data to a privacy-safe environment, the clean room, and collaborate without sharing any sensitive information with each other. It didn't matter that they were in different clouds, regions, or data platforms. They could collaborate on more than just structured data, and they were able to use Python for machine learning. Thank you all so much. And back to you, Matei.

Awesome demo. So super excited about clean rooms, and especially cross-platform clean rooms. I think it's really going to transform a lot of industries. It just makes sense to be able to collaborate on data and AI in real time in a secure fashion.

So I think, overall, I've given you a good sense of our approach. We really believe that picking the right governance and sharing foundation for your company is essential for the future, and we think it needs to be an open and cross-platform approach. We've been thrilled to see both Unity Catalog and Delta Sharing go from just an idea to being used by virtually all our customers in a few years, and we're excited that both of these are open. We're excited about the partners, and we invite you to join the open ecosystem.

Texas Rangers: data and AI in baseball

That was a lot of tech in this keynote, but the exciting thing is what you can do with the tech. For that, I'm super thrilled to invite our next speaker, an actual sports star for the first time on the Data and AI Summit stage, Alexander Booth from the Texas Rangers.

That was a huge moment for us as a baseball organization. Moving from the bottom of the league to winning our first ever World Series. All credit must go to the players and coaches that made this happen. This was also a huge moment for our community as over 500,000 people attended our World Series parade, and for me, growing up a lifelong Rangers fan, it was a dream come true.

However, this was also a win for the data team that I lead at the Rangers, and I'm here today to talk to you about how we use data intelligence to drive competitive advantage and transform how the modern game is played.

Most of you may know this, but baseball has always been a data-driven sport, from comparing statistics on the back of baseball cards to the modern age of Moneyball. However, how data is used in decision making has changed dramatically in this modern age of AI. Data used to be descriptive, evaluating past performance. Now, data is predictive, optimizing our understanding of future player performance.

One example of this is how we're using data and AI for biometric insights. We build predictive models on how the body's motion affects how a ball is thrown, leading to designed pitches guided by AI that are personalized for each unique pitcher. Further, with a better understanding of how players move when swinging the bat, we can provide biomechanical recommendations to optimize for specific types of hits.

With these insights, we can advise our players. You're trying to hit for power? Get those legs and try to get it out of the ballpark. If you want to just hit for contact, just square up the ball. In Little League, my coach would always tell me to choke up on the bat and bend your knees. We now measure that at high frame rate, 300 frames a second. This pose tracking gives us further insight into injuries and workload management, too.
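
As a toy version of what high-frame-rate pose tracking enables, the sketch below differentiates hypothetical bat-head positions captured at 300 frames a second to get swing speed. The coordinates are invented for illustration, not Rangers data, and a real biomechanics pipeline would work in 3D with smoothing.

```python
import math

FPS = 300  # capture rate mentioned in the talk
# Hypothetical (x, y) bat-head positions in meters over four consecutive frames.
frames = [(0.00, 0.0), (0.05, 0.0), (0.12, 0.0), (0.22, 0.0)]

def frame_speeds(points, fps):
    """Finite-difference speed (m/s) between consecutive frames."""
    return [math.hypot(x1 - x0, y1 - y0) * fps
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

speeds = frame_speeds(frames, FPS)
print(round(max(speeds), 1))  # peak bat speed across the captured frames
```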

Data and AI helped make the most of our players' athletic talents, leading to those incredible clutch hits that maybe you all saw during the World Series.

Pose tracking isn't our only new data source. We also track every player's position continuously at 30 frames a second for every Major League game. This gives us unprecedented ways of measuring defensive capabilities. By understanding tendencies, reaction times, and the way that our fielders move when trying to catch that fly ball, we can optimize our defensive placement using AI to maximize the likelihood of a player making that out. And yeah, maybe we got a little bit too good at that, and Major League Baseball changed the rules a couple years ago, but we still use it to this day.
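
A drastically simplified model of the positioning question, with all numbers invented (the Rangers' actual models are far richer): a fly ball is catchable if the fielder's sprint speed, times the time left after reacting, covers the distance to the landing spot.

```python
def is_catchable(distance_m, hang_time_s, sprint_speed_mps=8.0, reaction_s=0.4):
    """Toy catch model: can the fielder cover the distance before the ball lands?"""
    run_time = max(hang_time_s - reaction_s, 0.0)
    return sprint_speed_mps * run_time >= distance_m

# With a 4-second hang time the fielder covers 8.0 * 3.6 = 28.8 m after reacting.
print(is_catchable(25.0, 4.0))  # True
print(is_catchable(35.0, 4.0))  # False
```

Optimizing placement then amounts to choosing starting positions that maximize this kind of catch likelihood over the batter's observed spray tendencies.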

This culminated in a playoff run where we went 11-0 on the road, highlighted by impressive defensive plays such as home run robbing catches and clutch double plays.

Data modernization challenges

How did we change this data and AI game? It wasn't always like this. Getting to this point where we could realize these successes was not easy. There were so many challenges that we faced just a few years ago when we began our data modernization journey. Stop me if any of this sounds familiar.

Our on-prem stack could not scale to these new data sources. Rising IT costs and the maintenance of these on-prem servers led to an untenable ROI on AI investments. Further, as our data team grew, supporting minor league operations all around the country as well as scouting initiatives all around the world, governance and permissions became difficult to manage.

We lacked governance and ran into fragmented silos. Our data teams were split between minor league player development, amateur analytics, and international and advanced scouting teams. This slow and disjointed processing within silos led to delays in the reports our players and coaches needed. In some cases, we weren't delivering reports until the next day, well after the game had already finished.

And while we don't have a live link to the dugout, perhaps caused by a certain trashcan banging incident a few years ago, it is still imperative that our players receive that information post-game to prepare themselves for what happened and how to be successful tomorrow. With 162 games, baseball is a marathon, and quick feedback is a necessity for our players.

To solve these problems, we have unified and simplified our data and AI stack on the lakehouse with Databricks.

Unity Catalog unites our data silos under one roof. We have a variety of data with sensitive information, such as player addresses and financial contract information. Further, biomechanical and medical records should not be widely accessible throughout the org. Unity Catalog allows us to have that single shared platform with appropriate permissions in place to comply with both internal and external regulations such as FERPA and HIPAA.

Unity Catalog also gives us the ability to manage clusters, ETL pipelines, all within the JSON metadata. Once our data is loaded, the data intelligence platform is also able to comment and provide AI summaries around what that data actually is. This democratizes use for our analysts who sometimes struggle to figure out where the correct data source lies.

Finally, we've also built hundreds of ML and AI models on this data. The ML registry governed by Unity Catalog gives us a great platform to organize and search those models. And Unity Catalog also allows us to govern which users and data teams can access models and features from the feature store for their own projects.

Data lineage across all of this gives us great insight into how the data flows from source to modeling to the final BI reports that our players need. Transparency builds trust. And of course, data sharing allows us to connect with other data verticals and vendors inside of our department, including the ballpark and concessions, as well as sharing live data on how fans are engaging with the team, everything with the appropriate permissions in place.

The net result? We now have four times more data ingested and used for AI at the same cost as our legacy systems. We have hundreds of users scattered around the country and the globe with secure and governed access to these data and ML KPIs. We also have ten times faster data insights after games and workouts, quickly getting into our players' hands the reports they need to be the best that they can be.

And of course, all of this contributed to our first ever World Series win. I tried to have a spotlight on my ring the whole time, but they just said no to that. But Databricks is really helping our organization win by empowering our team with data intelligence.

AI BI Genie demo

However, we're just getting started here. With the rise of generative AI, we have invested time and effort to find innovation in this new space. I actually have a quick demo where I will be using the Databricks AI BI Genie to provide a natural language interface into our data. With the trade deadline coming up, as well as being in San Francisco, I thought it would be fun to see if there are any players on the San Francisco Giants that might have future trade value.

In this application, we are using public data from Baseball Savant. Notice that these tables, as well as the application, are both governed through Unity Catalog. Users need the correct permissions to access both. Comments and summaries describe and help teach the Genie application what these internal KPIs, which mean something to me but maybe not to you, actually are. And of course, all of this needs to be shared and governed within the workspace.

Analysts can ask broad questions of this data. You're going to see here that we're going to be looking for just who on the Giants has any trade value.

I know, I type super slow, I guess. They're like, do you want to do this on the computer? And I'm like, yeah, it's fine. You can pretend that I'm typing that out. The Genie doesn't know how to answer this question of what is trade value. It just brings back statistics about players on the Giants. So what we can do now is instruct the Genie what I care about with trade value. I want to look at the difference between expected and observed performance to look for undervalued players.

Notice that the Genie application quickly sees that Luis Matos, as well as Matt Chapman, have both had significant underperformance this season on the Giants, but maybe they'll have better performance for the rest of the season, if the Giants ever call Luis Matos back up from AAA. But that's a side note. Anyway, we can give this a thumbs up, as well as save it as an instruction for easy access later on. And we can also visualize this data for quick consumption.

Since we saved this as an instruction, it's trivial now to do the same analysis for other baseball clubs. Here I'm asking it do the same analysis for the Chicago White Sox.

After some time thinking, that's my double fast forward click, there it goes, we see that Martin Maldonado as well as Andrew Benintendi have been underperforming for the Chicago White Sox. What this has allowed us to do is democratize access, giving our analysts, SQL developers, and less technical stakeholders unprecedented access to the raw data in our database. This allows them to ask the questions they need and creates an efficient starting point for targeted further decision making leading into the trade deadline.
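
The metric Genie is being taught here, expected minus observed performance, reduces to a one-liner. The players and numbers below are fabricated placeholders, not real stats:

```python
# Hypothetical players with expected (xwOBA) vs. observed (wOBA) performance.
players = [
    {"name": "Player A", "xwoba": 0.350, "woba": 0.290},
    {"name": "Player B", "xwoba": 0.310, "woba": 0.305},
    {"name": "Player C", "xwoba": 0.330, "woba": 0.260},
]

# "Trade value" as described: how far observed results lag expectations.
for p in players:
    p["underperformance"] = p["xwoba"] - p["woba"]

# Biggest positive gap first => most undervalued trade candidates.
ranked = sorted(players, key=lambda p: p["underperformance"], reverse=True)
print([p["name"] for p in ranked])  # ['Player C', 'Player A', 'Player B']
```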

Thank you so much to the Databricks team that supports us. Michelle, Hussein, Chris, who onboarded us up here. Thank you for the opportunity to speak with you all this morning. Finally, we're always looking to continue pushing the boundaries of data and AI in sport. If interested, please reach out. Baseball is a team sport, after all. And I will say we do a lot of our hiring on LinkedIn, so best of luck if you use that QR code, but you can always find me on LinkedIn, and I'm happy to talk about this further. Thanks so much to everybody.

Apache Spark update

Wow, that's so cool. Did you guys see the ring that he was wearing? It was like gigantic. That's so awesome. I call it Moneyball 2.0. You got to go check out their booth in the hall. They can actually analyze your swing and everything. They'll collect all the data and they'll give you a score and you can improve it, so check that out.

Okay, so I'm going to introduce next my co-founder. I actually said backstage, they said, hey, he's the number one committer on Apache Spark, but we looked, and he's actually no longer. He's number three, but for seven years he was the number one committer on the Apache Spark project, and he's going to tell us about Spark. And one of the cool things is this project is now more than 10 years old, so you think, okay, we know what Spark is, but actually it has dramatically changed in just the last two or three years. The project has been completely transformed, and he's going to tell us how, what the changes are, and how the community pulled off this transformation. So let's welcome Reynold Xin to the stage.

Alright, thank you, Ali, for that number three speech. Good morning again. So, as many of you know, this conference actually started previously named as the Spark Summit or Spark and AI Summit, and this talk will be going back to the roots of the original conference, Apache Spark.

The reason I'm here today is to talk about how Apache Spark has changed. So three years ago at this conference, we polled about 100 of you and asked: what were the biggest challenges you had with Apache Spark? And here's what the 100 of you told us.

By far the number one was: hey, I have a bunch of Scala users, they're in love with Spark, it's great, but I also have a whole bunch of Python users, as a matter of fact there are more of them, and Spark really doesn't work well for them. Spark's kind of clunky, it's difficult to use in Python, it's not a great tool for them. And number two, everybody else would say: hey, I love Spark, I've been using it, I'm using Scala, but dependency management for my Spark applications is a nightmare, and version upgrades take six months, one year, three years, you name it.

And then there was a consensus among the language framework developers out there, not a huge population, but a very important part of the Apache Spark community, who would tell us: hey, because of the tightly coupled JVM nature of Spark, it's very, very difficult to interact with Spark from outside the JVM as a framework developer, not just as an end user. So we got to work.

So let's talk about the first one: my team loves Scala, but my users are mostly on Python. If you've been to this conference in the past, you know this is not the first time we're talking about Python. But I found this video from about three years ago, just the other day as I was preparing this talk, and it's from Zach Wilson, who used to be a data engineer at Airbnb. And here's what Zach has to say.

Three years ago at this conference, I think it might have still been named Spark and AI Summit back then, and the theme of all the slides was a white background instead of a dark background, we talked about the Project Zen initiative by the Apache Spark community. It really focused on a holistic approach to making Python a first-class citizen. And this includes API changes, better error messages, debuggability, performance improvements, you name it. It covers almost every single aspect of the development experience.

In 2022, two years ago, we gave a progress report and talked about all the different improvements made in those two Spark releases. And last year, we showed a concrete example of how much autocomplete has improved, just out of the box, from Spark 2 all the way to Spark 3.4.

So this slide summarizes a lot of the key features for PySpark in Spark 3 and Spark 4. And if you look at them, it really tells you that Python is no longer just a bolt-on to Spark, but rather a first-class language. There are actually many features that are Python-native and Python-idiomatic that are not even available in Scala. For example, you can define Python user-defined table functions these days and use them to connect to arbitrary data sources, and that's actually a much harder thing to do in Scala.
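
The table-function idea can be shown without a cluster. In real PySpark (3.5 and later) this is the `@udtf` decorator on a class whose `eval()` yields rows; the pure-Python generator below only mimics the shape — one input row fanning out into many output rows — and is not actual PySpark code.

```python
def split_words(line_id, text):
    """Toy 'table function': yields (line_id, word) rows for one input row."""
    for word in text.split():
        yield (line_id, word)

input_rows = [(1, "open data"), (2, "lakehouse")]
output_rows = [row for args in input_rows for row in split_words(*args)]
print(output_rows)  # [(1, 'open'), (1, 'data'), (2, 'lakehouse')]
```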

At this conference alone this year, we had more than eight talks on various features of just PySpark itself. So a lot of work has gone into it, but how much benefit are users seeing? This is one of those moments where I could talk nonstop about it, but it's best if you try it out yourself. It feels like a completely different language.

Looking at the last 12 months alone, PySpark has been downloaded in over 200 countries and regions in the world, just according to PyPI stats. I was doing some analysis the other day, and I was really surprised to find this number: just on Databricks alone, for Spark versions 3.3 and above, so not including any of the earlier Spark versions, and there are a lot of them out there, our customers have run more than 5 billion PySpark queries every day. To give you a sense of that scale, I think the leading cloud data warehouse runs about 5 billion SQL queries a day. This is matching that number, and it's only a small portion of the overall PySpark workloads.

But the coolest thing was, after I found the earlier video from Zach, in which he said Scala is the native way of doing it, I found another video he published just about three months ago. By the way, I had never met Zach until last week, when I reached out and asked, hey, would it be okay for me to show your video? So let me play you this video from this year by Zach.

But things have changed in the data engineering space. The Spark community has gotten a lot better about supporting Python, so if you are using Spark 3, the differences between PySpark and Scala Spark in Spark 3, there really isn't very much difference at all.

So thank you for the endorsement from Zach. So if your impression of Spark was, hey, Spark is written natively in Scala, that's still true. We love Scala. But if your impression is, hey, if I'm really using Python, I would get super crazy JVM stack traces, I would get terrible error messages, the API is not very idiomatic, try it out again. It's very different from three years ago, right? And of course, the job's never done. We'll continue working on improving Python for Spark, but I think it's fairly reasonable to declare, hey, Python is a first class language of Spark.

Spark Connect and Spark 4.0

So now let me talk about the other two problems: version upgrades and dependency management, and the JVM-only nature. Let me dive a little deeper into why these problems exist. The way Spark is designed, every Spark application you write, your ETL pipelines, your data science analysis tools, your notebook logic, runs in a single monolithic process called the driver, which includes all the core server-side components of Spark as well. So applications don't actually run on independent clients or servers of their own. They run in the same monolithic server process.

And this is really the essence of the problem. Because they all run in the same process, the applications have to share the same dependencies, and not only the same dependencies as each other, but the same dependencies as Spark itself. Debugging is difficult, because in order to attach a debugger, you have to attach to the very process that runs all of those things. And last but not least, if you want to upgrade Spark, you have to upgrade the server and every single application running on it in one shot. It's all or nothing. And this is a very difficult thing to do when they're all tightly coupled.

So two years ago at this very conference, Martin and I introduced Spark Connect. The idea of Spark Connect is, again, very simple at a high level. We wanted to take the DataFrame and SQL API of Spark, which was Python- or Scala-centric, and create a language-agnostic binding for it based on gRPC and Apache Arrow. That sounds like a very small change, because it's just introducing a new language-agnostic API binding, but really it's the largest architectural change to Spark since the introduction of the DataFrame APIs themselves.

And with this language-agnostic API, now everything else runs as clients connecting to the language-agnostic API. So we're breaking down that monolith into, you can think of it as microservices running everywhere. And how does that impact end-to-end applications? Well, different applications now will actually run as clients connecting to the server, but they are really clients. They're running in their own sort of isolated environment.

And this makes upgrade super easy, because the language binding is designed to be language-agnostic and forward and backward compatible from API perspective. So you could actually upgrade the Spark server side, say from Spark 3.5 to Spark 4.0, without upgrading any of the individual applications themselves. And then you can upgrade the applications one by one as you like at your own pace. Same thing with debuggability. Now you can attach the debugger to that individual application that runs in a separate process anywhere you want without impacting the server, without impacting the rest of the applications.
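
The decoupling described above can be caricatured in a few lines: the client builds a serializable plan and the server interprets it, so the two sides only share a wire format and can be upgraded independently. The real protocol is gRPC plus Apache Arrow; JSON stands in for it here, and none of this is actual Spark code.

```python
import json

def client_build_plan():
    """'Client side': describe the computation declaratively, don't run it."""
    plan = {"op": "filter", "greater_than": 10,
            "child": {"op": "scan", "table": "events"}}
    return json.dumps(plan)  # what would travel over the wire

def server_execute(wire_plan, tables):
    """'Server side': interpret the plan against data it owns."""
    plan = json.loads(wire_plan)
    rows = tables[plan["child"]["table"]]
    return [r for r in rows if r > plan["greater_than"]]

result = server_execute(client_build_plan(), {"events": [3, 11, 42, 7]})
print(result)  # [11, 42]
```

In real PySpark 3.4+, the client-side switch is roughly `SparkSession.builder.remote("sc://host:port").getOrCreate()`; everything after that looks like ordinary DataFrame code.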

Now, for all of the language developers out there, this language-agnostic API makes it substantially easier to build new language bindings. Just in the last few months alone, we have seen community projects that build Go bindings, Rust bindings, and C# bindings, and they can be built entirely outside the project with their own release cadences.

So, two of the most popular programming languages for data science are R and Python. Spark has built-in Python support, and there's also built-in R support called SparkR. But actually, the most popular R library for Spark is not the built-in SparkR. It's a separate project called sparklyr. And sparklyr is made by this company called Posit. I was talking to the Posit folks backstage, and I told them, hey, I think Posit is the coolest open source company you've never heard of. And the reason you haven't heard of them is that they renamed themselves to Posit fairly recently.

But the people at Posit created some of the most foundational open source projects: for example, dplyr, the very project that defined the grammar of data frames we're all enjoying today; ggplot2, the grammar of graphics; and RStudio, the most popular R IDE. Wes McKinney, who created pandas and co-created Apache Arrow, works at Posit. So I would like to welcome Tareef, president of Posit, onto the stage to talk to you more about sparklyr.

Posit and Sparklyr

Good morning, everyone. Thank you very much for the introduction, it's very kind of you. I'm very excited to be here, and thank you, Databricks, for giving us the opportunity to speak to this audience. We're probably a company you don't know; you'd never heard of us until Reynold gave you a little bit of an update. But we are a public benefit corporation. We've been around for about 15 years. Our focus is very much code-first data science. Our governance structure is one that allows us to think about things for the very long term, so our ambition is to be around for the long haul and to continue investing in these open source tools.

We support hundreds of R packages, and we also support the RStudio IDE. And if you've been watching us for a while, you may have noticed that over the last five years we've added a lot of capabilities to the Python ecosystem. In some cases these are multilingual solutions, things like Quarto, Shiny for Python, and Great Tables, all of these are examples of projects that we have, and we have more coming out over the coming years.

In 2016, we released a package called sparklyr. The reason we released it is that we wanted an idiomatic implementation for R users, one more aligned with the tidyverse. And for those of you who don't know what the tidyverse is, it's a philosophy of how you write packages and the patterns that go along with that. The original design of Spark meant that for users, in corporations in particular, to be able to use it, they would have to run RStudio and R on the servers themselves.

So you can imagine when Spark Connect became available last year, we were very, very excited because it finally solved one of the key problems that we saw, which is like, how do you make it so that the end user through a client does not have to get into a JVM and can just access it directly?

And so, happy to say, we started last year, and by the end of last year we had added support for Spark Connect and Unity Catalog. We worked with the Databricks team to make sure that sparklyr and the IDE had clean support for that. And one of the most interesting things is we added support for R user-defined functions, which is actually a really big deal, because now the R users in your organizations can actually participate in using Spark to solve the really hard problems, and they can collaborate with other people in the Spark ecosystem. So we're very excited about that, and we're interested in getting people's feedback if you get a chance to try it out.

So this is very anticlimactic. For those of you who were there yesterday for the demo, you saw Casey; the world stopped. We decided to make life easy, it's hard to demo some of these things. But this is the open source desktop IDE, and you can see this is the one-line change you have to make to connect over Spark Connect. And now this user on the desktop can go ahead and access the Spark cluster and leverage its full capabilities. This is one of the key things that we think will make a big difference in people's ability to contribute to and adopt Spark.

So you've probably noticed, over the last year, we've been announcing all kinds of things with Databricks. One of the key things, obviously, was sparklyr and Spark Connect and support for that. But we have also been making changes to our commercial products. The first commercial product supporting this is something called Posit Workbench, which gives you a server-based authoring environment that supports RStudio, Jupyter Notebook, JupyterLab, and VS Code, and ties into the authentication and authorization of those systems. So you basically get the full power of the governance that you have in Databricks, surfaced to the data scientists. You can expect that over the coming year there will be more commercial products and open source tools with those tighter integrations with the Databricks stack.

If you're at all curious or interested, feel free to check out any of these links to learn about how we're working more with Spark and Spark Connect and how we're working with Databricks. Thank you very much.

Alright. Thank you, Tareef. The reason I'm so excited about Spark Connect is that it makes frameworks like Sparklyr possible. It makes them easy to use, easy to adopt, easy to upgrade, easy to build. And this ultimately benefits all the developers, all the data scientists, all the data engineers out there, because now they can use whatever language they're most comfortable with. It doesn't require all of those to be built into Spark. You get idiomatic R on Spark.

Now, Spark Connect is really trying to solve these last two problems: version upgrades and managing Spark, and making it easier to build non-JVM language bindings. With that, it brings us to Spark 4.0. This is actually not a conference at which we'll announce Spark 4.0's release today; it's an upstream open source project working at its own pace, but it is coming later this year. To give you a preview of some of the features: similar to previous major version releases of Apache Spark, there will be thousands of features I can't possibly go into today. But Spark Connect will GA and become the standard in Spark 4.0. ANSI SQL will become the standard in Spark 4.0. There are a lot of other features that we're looking forward to.

But one thing I'm particularly excited about, definitely at this conference, is the opportunity for the different open source communities to collaborate with each other, especially when it comes to compute and storage. Many, many features actually require co-designing the compute stack, which is where Apache Spark comes in, as well as the storage stack, which is where Linux Foundation Delta Lake and Apache Iceberg come in. As a matter of fact, many of the features you've heard about at this conference, in session talks and keynotes, like collations, row tracking, MERGE performance, the variant data type Shawn talked to you about, and type widening, are not just features in Delta or features in Iceberg or features in Spark. They actually require thinking about all three projects together for them to work. And this is really the spirit of open source and the spirit of collaboration in open source.

So even though Spark 4.0 is not officially released yet, last week the Apache Spark community officially released the Spark 4.0 preview. It's not the final release, but it gives you a glimpse into what Spark 4.0 will look like. Please go to the website, check it out, download it, give it a spin, and let us know your feedback. Thank you very much.

Awesome. Super excited about Spark 4.0. I've got to say, you've got to check it out. PySpark is amazing these days. And also, all that version management, installing it, managing Spark, it's just so much simpler these days. I just tried it a week ago: you can go to any terminal and just say pip install pyspark, and that's it. It'll install the whole thing. It just works. It's hugely different from, let's say, 10 years ago, where you would have to set up the servers and the daemons and all of that, configure it, and use it in local mode. Just pip install pyspark.

Databricks Lakeflow

So back to our data intelligence platform roadmap. We're now reaching towards the end, but this is the most exciting thing for me. For this project that we're going to talk about next, a couple years ago we asked all the top CIOs that use Databricks: what's the number one thing you want Databricks to do? And it was a really surprising answer, something we didn't expect them to say. And since then, we've been super focused on nailing this problem. So I'm very, very excited to welcome on stage Bilal Aslam, who is going to take us through what we've done there. Please welcome him.

Alright. Good morning. So I'll get started. As it turns out, there are five Bilals at Databricks. I asked all five to give me a little cheer. But that was more than five. So thank you.

So thank you, Ali, for the introduction. So we've heard about machine learning. We've heard about BI. We've heard about analytics and all these amazing things. And I'm here to tell you that every single one of them, everything, starts with good data.

Alright. How do you get to good data? Well, there are three steps you have to follow. And every single one of us, including me, has traditionally been cobbling together lots of different tools in an ever-growing tool chain that gets more and more expensive. Let's go through that real quick.

So Spark, and especially Databricks, is already very good at big data. As Reynold was telling you, this is the world's biggest big data processing framework. But as it turns out, a lot of your really valuable data is sometimes in smaller data. So for example, you may have MySQL, Oracle, Postgres, all these different databases. They're incredibly valuable. So you might be setting up Debezium and Kafka and a monitoring stack and a cost management stack just to get the changes from these systems into Databricks. Or, I'm actually pretty confident that every single one of us is using a CRM of some kind. Maybe you're using Salesforce or NetSuite; maybe you use HRMs like Workday. Tons of valuable data in there, just waiting to get into Databricks so you can start using it.

And then once your data is in a data platform like Databricks, the next step, the very next step, is to transform it. As it turns out, newly ingested data is almost never ready for use by the business. You have to filter it, you have to aggregate it, you have to join it and clean it. Lots of technology choices: dbt, a great open source project; or Delta Live Tables and PySpark, which Reynold was telling you how popular it is. Which one of these do you use? And again, how do you monitor it? How do you maintain it?
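The filter, aggregate, and clean steps described here can be sketched without any particular engine. The record shape and field names below are hypothetical, purely to make the transform step concrete; a real pipeline would express this in SQL or PySpark:

```python
# A minimal, framework-free sketch of the transform step described above:
# filter out bad records, clean fields, and aggregate revenue per product.
# The order-record shape here is invented for illustration only.

from collections import defaultdict

raw_orders = [
    {"product": "  Sugar Cookie ", "qty": 3, "price": 2.50},
    {"product": "Ginger Snap", "qty": 0, "price": 1.75},   # no quantity: drop
    {"product": "sugar cookie", "qty": 2, "price": 2.50},
    {"product": None, "qty": 1, "price": 3.00},            # missing product: drop
]

def clean(order):
    # Normalize the product name so variants aggregate together.
    return {**order, "product": order["product"].strip().lower()}

def transform(orders):
    valid = (clean(o) for o in orders if o["product"] and o["qty"] > 0)  # filter
    revenue = defaultdict(float)
    for o in valid:                                                      # aggregate
        revenue[o["product"]] += o["qty"] * o["price"]
    return dict(revenue)

print(transform(raw_orders))  # → {'sugar cookie': 12.5}
```

Even in this toy form you can see why the speaker calls it a chain: each step (filter, clean, aggregate) feeds the next, and every one of them has to be monitored and maintained.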

And once your data is transformed, that's really not even half the battle. You get the value out of data by actually running your production pipelines in production. I don't like waking up at 2 in the morning with an alert, so now you have to orchestrate. So you might be using Airflow. Great. Now your tool chain is just expanding just a little bit more. You're responsible for managing Airflow and its authentication stack and so forth. And then, of course, you might have to monitor all these things in CloudWatch. This is unnecessarily complex, and this is inefficient, and it's actually very expensive.

Which is why I am extremely proud to unveil what we're calling Databricks Lakeflow. This is a new product that's built on the foundation of Databricks Workflows and Delta Live Tables, with a little bit of magic sauce added on. I'm actually going to start with the magic sauce. It gives you one simple, intelligent product for ingestion, transformation, and orchestration.

The very first of these three components is something we call Lakeflow Connect. Lakeflow Connect is native to the lakehouse: native, high-performance, simple connectors for all these different enterprise applications and databases.

If you're in the audience today using SQL Server, Postgres, a legacy data warehouse, or these enterprise applications, we're on a mission to make it really simple to get all of this data into Databricks. And this is actually powered by technology from Arcion, a company we acquired last year.

So I'll give you a quick demo in a moment, but I actually want to talk about one of our customers called Insulet. Insulet manufactures a very innovative insulin management system called the Omnipod, and they had a lot of customer support data locked up in Salesforce. They're one of our happy customers of Lakeflow Connect. With it, they used to spend days getting to insights; now they have it down to minutes. It's super exciting.

Lakeflow Connect demo

So actions speak louder than words, so let's take a look. So you're in Lakeflow here, and what I'm going to do is click on ingest. You see it's point and click, which is pretty awesome, and it's designed for everybody. I'm going to click on Salesforce, and my friend Eric Orgren has set up a connection. By the way, everything in Lakeflow is governed and secured by Unity Catalog, so you can manage it very easily and govern it.

And there are three steps. And now, OK, great. So now I see these objects from Salesforce, and I'm going to choose orders. I actually work for Casey. I don't know if you remember her cookie company; I'm building the data pipeline, she's my CEO. So I'm going to bring in some order information for our ever-growing cookie business into this catalog and schema. And within seconds, data should show up in our lakehouse. Excellent.

All right. That's it. That's all it took. There are no more steps.

Behind the curtain: change data capture

So I want to do something here and kind of give you a peek behind the curtain. We're all engineers here. And there's actually something pretty magical that's happening inside of Lakeflow Connect. You might think, gosh, how hard could it be to connect to these APIs and these databases? Can't you run a SQL query?

It turns out that what you actually want to do is only obtain the new data, the change data capture, from these source systems, from these databases and enterprise applications. And as it turns out, this is a really, really hard problem. You don't want to tip over the source database. You don't want to exhaust API limits. The data has to arrive in order, and it has to be applied in order. Things go wrong; it's the real world, and you're coupling systems together, so you have to be able to recover. All of this is undifferentiated heavy lifting.
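As a rough illustration of why ordering and recovery matter, here is a toy apply loop over a change feed. This is a sketch only, not how Arcion or Lakeflow Connect actually work, and the event shape is invented:

```python
# Toy change-data-capture apply loop: events must be applied in log order,
# and a retry must not double-apply events, or the target table drifts
# from the source. Event and field names are invented for illustration.

changes = [
    {"seq": 1, "op": "insert", "id": "o1", "row": {"id": "o1", "status": "new"}},
    {"seq": 2, "op": "update", "id": "o1", "row": {"id": "o1", "status": "paid"}},
    {"seq": 3, "op": "delete", "id": "o1", "row": None},
    {"seq": 4, "op": "insert", "id": "o2", "row": {"id": "o2", "status": "new"}},
]

def apply_changes(table, events, last_applied=0):
    # Sort by sequence number and skip anything already applied, so a
    # retry after a failure is idempotent rather than double-applying.
    for e in sorted(events, key=lambda e: e["seq"]):
        if e["seq"] <= last_applied:
            continue
        if e["op"] == "delete":
            table.pop(e["id"], None)
        else:  # insert and update are both upserts here
            table[e["id"]] = e["row"]
        last_applied = e["seq"]
    return table, last_applied

table, cursor = apply_changes({}, changes)
print(table, cursor)  # → {'o2': {'id': 'o2', 'status': 'new'}} 4
```

Apply the delete before the update, or replay without the cursor, and you get a different (wrong) table. That, plus rate limits and backpressure against the source, is the heavy lifting the speaker is describing.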

And I'm really glad that we're doing it, because with Arcion's technology, CDC is no longer a three-letter word. It's point and click. It's automated operations. And it's consistent and reliable.

Lakeflow Pipelines: transformations

Let's go to the second part of this product, the second component. What happens once you bring in data? You're now able to load data from these databases and enterprise applications, and the very next thing you have to do is transform it, which is to prepare it. Remember: you have to filter, aggregate, join, and clean it. Typically, this involves writing really complicated programs that have to deal with a lot of error conditions.

Lakeflow Pipelines is built on the foundation of, and is the evolution of, Delta Live Tables. It lets you write plain SQL to express both batch and streaming, and the magic trick is that we turn that into an incremental, efficient, and cost-effective pipeline.

Let's go ahead and create a little bit of an aggregation out of that. Okay, so let me show you how simple that is within Lakeflow. One of my favorite features here, by the way, is that it's one single unified canvas. So this little DAG at the bottom, you always see it. You can hide it if you want, but I'm going to click here on Salesforce and write a transformation. Okay, that's simple.

Now, this is an intelligent application. It's built on the data intelligence platform, so I might just go ahead and ask the assistant what it thinks I should join. Okay, it comes up with a pretty reasonable join. It says you can join these tables, and I'm just going to let it figure out how to join them, figure out the key for me, and that's pretty awesome. Okay, that looks about right. It found the customer ID key. I'm going to go ahead and accept that, and let me just run this transformation real quick. I don't have to deploy it. I can run it in development, and it'll actually give me the ability to debug it real quick. Okay, perfect. So I can see that I have orders, dates, products, customers. All of this came together really nicely. I have a nice little sample of data. Perfect. All right, so we can go back to slides now.

So again, I'm going to give you a little peek behind the curtain here: why is this pretty amazing? Notice that there was no cluster, no compute. I didn't have to set up infrastructure. I didn't have to write a long script. I just wrote SQL, and this is the power of declarative transformations.

This here is my valuable transformation. Without Lakeflow Pipelines, you instead have to do table management, you have to do incrementalization many, many times, and you even have to deal with schema evolution. I've spoken with some of our customers; they've written entire frameworks to do schema evolution and schema management. Again, that's undifferentiated heavy lifting. Why should you spend time on that, right?

This beast just grows and grows. Lakeflow Pipelines are powered by something called materialized views, and they're magical because they automatically handle the bookkeeping for you. They handle schema evolution. They handle failures and retries and backfills. And they magically choose the right way to incrementalize your data, so it's pretty awesome.
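To make "incrementalize" concrete, here is a toy sketch, in plain Python rather than anything Lakeflow actually exposes, of a materialized aggregate that refreshes from only the rows added since the last refresh instead of rescanning the whole table:

```python
# Toy incremental materialized view: a revenue-per-product aggregate that
# is refreshed from only the rows appended since the last refresh.
# Class and field names are illustrative, not Lakeflow APIs.

class MaterializedSum:
    def __init__(self):
        self.totals = {}      # the materialized result
        self.rows_seen = 0    # high-water mark into the source table

    def refresh(self, source_rows):
        # Only process rows appended since the last refresh; an engine
        # would also handle updates/deletes, schema changes, and retries.
        for product, amount in source_rows[self.rows_seen:]:
            self.totals[product] = self.totals.get(product, 0) + amount
        self.rows_seen = len(source_rows)
        return self.totals

sales = [("sugar", 10), ("ginger", 5)]
mv = MaterializedSum()
mv.refresh(sales)                 # full compute the first time
sales.append(("sugar", 7))
print(mv.refresh(sales))          # incremental: only the new row is read
# → {'sugar': 17, 'ginger': 5}
```

The point of the sketch is the high-water mark: the second refresh reads one row, not three. Doing this correctly in general, across updates, deletes, joins, and schema changes, is exactly the bookkeeping the product handles for you.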

Adding streaming to the pipeline

Okay, now, you might be thinking, hey, that's pretty great, but you just wrote a couple of lines of SQL; my pipelines are more complicated. So in my world, my cookie CEO is really demanding. Our e-commerce website is just really taking off, and now we need to be able to do real-time actions on our website. So from this pipeline, which looks like batch, I'm going to add some streaming. I'm going to go ahead and write some joined and enriched records into Kafka, so let me show you how easy that is.

Great, so remember, this is my pipeline here. I did this materialized view, a transformation. So again, from this unified canvas, I just add a pipeline step, and I'm going to go live here and write some code. Okay, so I'm going to create something called a sink; think of a sink as a destination. I'm just going to call it Kafka because I'm going to write to Kafka, and all I have to do, and this is kind of cool because all I'm doing is writing SQL here, is point this at kafka.databricks.com, and that should be enough to create a sink. All the credentials are coming through Unity Catalog, so this is, again, governed. And then I'm going to create something called a flow; think of that as an edge that writes changes into Kafka.

I'm going to target Kafka, and I'm going to select from the sales table that I just created, using the table_changes table-valued function. Okay, something's not right here; I need to do dev. Great, okay, so this looks good. And remember, this is what looks like a batch pipeline, and I'm going to turn it into streaming. There we go, and just like that, our data's in Kafka. Let's go to slides.

And this is something super exciting: there is no code change here. I didn't have to make a change, I didn't have to choose another stack; everything just works together. One of the coolest things we're doing is something called real-time mode for streaming. You can think of real-time mode as making streaming go really, really fast, and the magic trick here is that it's not fast once or twice; it's consistently fast. So if you have an operational streaming use case where you have to deliver data and insights, just turn it on, and this pipeline will go really, really fast. And we have talks about it; Ryan Nienhuis is doing a talk on it, so please go check it out.

Orchestration and dashboards

Perfect, so now I have ingested data from SQL Server and Salesforce, and I have very quickly built a pipeline that delivers batch and streaming results. It's always fresh, and I didn't have to do manual orchestration. But now my CEO is very demanding. The cookie business continues to grow, and Casey wants insights: a dashboard she can use to figure out how her business is doing. And this is where orchestration comes in. Orchestration is really how I do all the other things that are not loading data and transforming data, such as building a dashboard and running it, or refreshing a dashboard.

One of my favorite capabilities in Databricks is something called Databricks Workflows, and we've evolved it into the next generation. Workflows is a complete, drop-in orchestrator. No need to use Airflow; Airflow is great, but Workflows is completely included in Databricks. And this is just a list of the innovations; it has lots of capabilities that you might be used to in traditional orchestrators.

Okay, so what I'm gonna do here now is walk over and start building a dashboard, and I'm going to run it after my pipeline is done. Okay, let's take a look. So remember, I have data going into Kafka, I have all this, and I'm going to just add another step. I love this unified canvas; it gives really nice context on where I am. And this is super cool: the Assistant suggested a dashboard. That's pretty cool, actually useful. Remedy and Product Insights, I like that. That's what I would have wanted, and there it is. That's our dashboard.

So hey, good news: our cookie business continues to grow. We're not all the way done with the business, and this is super cool: we actually have a really interesting insight here that sugar cookies tend to sell in the month of December. So that's it; you don't have to do anything else. Let's get back to slides.

Triggers and unified monitoring

So I'm gonna wrap up really quickly. I'm super excited about one innovation that I think will make our lives as data teams and data engineers much, much better. Look, it's great to create DAGs, things that run one after another. It's great to have schedules: when should something run? But as your organization grows, what you really want are triggers. Think of triggers as work happening when new data arrives or data is updated. This is actually what allows us to do another magic trick, which is run your pipeline exactly when it's needed: when upstream tables are changed, when upstream files are ready. This is super cool. It's completely available in the product, and it's actually a foundational block of Lakeflow jobs.
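The trigger idea can be sketched in a few lines. The polling shape, version counter, and function names here are invented for illustration; they are not the Lakeflow jobs API:

```python
# Toy table-update trigger: instead of a fixed schedule, the pipeline runs
# only when an upstream version counter has advanced. The version source
# and run function are hypothetical stand-ins.

def make_trigger(get_version, run_pipeline):
    state = {"last": get_version()}

    def poll():
        current = get_version()
        if current != state["last"]:   # upstream data changed
            state["last"] = current
            run_pipeline()
            return True
        return False                   # nothing new: no wasted run

    return poll

version = {"v": 0}
runs = []
poll = make_trigger(lambda: version["v"], lambda: runs.append("run"))

print(poll())        # False: no new data yet, pipeline stays idle
version["v"] += 1    # upstream table updated
print(poll())        # True: pipeline runs exactly when it's needed
print(runs)          # ['run']
```

Contrast this with a schedule: a 5-minute cron either wastes runs when nothing changed or adds up to 5 minutes of staleness, while a trigger fires once per actual change.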

Perfect, so now everything is running. I've ingested data, I have transformed it, I've built a dashboard, and my pipeline's running in production. Like I said, I hate waking up in the middle of the night, and typically I have to glue together a lot of different tools to see cost and performance. Lakeflow includes unified monitoring for data health, data freshness, cost, and runs, and you can debug to your heart's content, but it has that single pane of glass so you don't have to if you don't want to.

Lakeflow is built on the Databricks Data Intelligence Platform; it's native to it. This gives us a bunch of superpowers. You get full lineage through Unity Catalog. That includes ingested data, so for all the data upstream from Salesforce or Workday or MySQL, we already captured the lineage. It includes federated data. It includes dashboards, even ML models, with not a single line of code needed.

It's built on top of serverless compute, which frees you up from managing clusters and instances: how many executors, which type of instance should I use? It's serverless, it's secure, and it's completely connected to Unity, so it frees you up from that hassle. But what's also really cool is that we did this benchmark, with real data for streaming ingest: it's three and a half times faster, and it's 30% more cost effective. So that's, you know, have your cake and eat it too. It's super exciting.

Data intelligence and wrap-up

Data intelligence is not just a buzzword. As you have seen in the last couple of days, it's actually foundational to Databricks, and it's also foundational to Lakeflow. Lakeflow includes a complete integration with DatabricksIQ and the Assistant, so every time you're writing code, building a DAG, or ingesting data, we're here to help you author, monitor, and diagnose.

And one last thing: this is actually an evolution of our existing products, so you can confidently keep using Delta Live Tables and Workflows. We'll make sure that everything is backwards compatible; all your jobs and pipelines will continue to work, and you can start enjoying Lakeflow. So, Lakeflow is here. We're actually doing a talk; Elise and Peter are doing a talk on Lakeflow Connect, I think, very soon. Lakeflow Connect is in preview. Please join us, give us feedback on what connectors you want; we're very excited about it. Pipelines and jobs are coming soon.

All right, I think that's it, thank you.

That was awesome, Bilal the Fourth. Sounds like a king. That was super, super awesome. What I really loved about that is, I don't know if you noticed it, but this is actually a big deal. Spark has a micro-batch architecture, so when you're trying to stream things, it takes a couple of seconds, sometimes five or six seconds. What he showed you, real-time mode, which we now have, gets it down to 10 to 20 milliseconds, so it's like a 100x improvement. The p99 latency is around 100 milliseconds, so it's kind of a game-changer.

And then, of course, we saw Connect: you can get your data in there, you can do incrementalization, and you don't have to worry about getting the logic right; it'll just do it for you. So, super exciting. Okay, awesome, so I just wanna wrap this up quickly. On the top row there, you see the announcements from yesterday; I'm not gonna bore you and go through those again. On the bottom row, you can see what we did today. We just heard data engineering, so you saw that. Unity Catalog, open sourced live on stage by Matei, that was super cool. But also metrics; I'm excited about metrics. Every company has KPIs. How do we have certified KPIs that we can rely on, that we know are semantically correct, and that we know how to compute? That's also a big deal. And then we heard about Delta Lake 4.0, and UniForm going GA. So, lots and lots of great stuff, and that's it for today. Hope you enjoy your lunch, and then please go to the sessions; they're super, super awesome. Thank you, everyone, thanks.