
The Future Roadmap for the Composable Data Stack
Discover cutting-edge advancements in data processing stacks. Listen in as Wes McKinney dives into pivotal projects like Parquet and Arrow, alongside essential interface libraries like Ibis and the tidyverse. Wes navigates the current state of these projects, highlighting areas for further innovation and growth.

Sign up for our "No BS" Newsletter to get the latest technical data & AI content: https://hubs.li/Q02vz6xC0

ABOUT THE SPEAKER: Wes McKinney, Principal Architect, Posit PBC (co-founder of Voltron Data)

ABOUT DATA COUNCIL: Data Council brings together the brightest minds in data to share industry knowledge, technical architectures and best practices in building cutting edge data & AI systems and tools.

FIND US:
Twitter: https://twitter.com/datacouncilai
LinkedIn: https://www.linkedin.com/company/datacouncil-ai/
Website: https://www.datacouncil.ai/
Transcript
This transcript was generated automatically and may contain errors.
So, this talk is a bit of an overview of some work that's been going on in the open source ecosystem over the last decade, my thoughts about it, and some ideas about where I think things are going.
But firstly, I'll tell you a little bit about my background. Most people are familiar with pandas, but I've been working on lots of projects over the last 16 or 17 years. I'll tell you about my new role back at Posit and what I'm doing there, why I work for Posit, why you should pay more attention to Posit as a company, give you an overview of where this whole concept of composable data systems came from, tell you about what's active and what I think is exciting right now, and where I think things are headed, or at least where I'm personally focusing my energy to help things move forward.
Background and current roles
So, as Sean mentioned, my full-time job is as an architect at Posit, formerly RStudio. I co-founded Voltron Data, which just had a big week at the GTC conference last week; I'll talk a little bit about that. I did just launch a venture fund, which I'll tell you about. Many of you probably have my book, Python for Data Analysis; it's in its third edition and now freely available on the internet, so if you go to wesmckinney.com/book, you can read it for free. That's handy if you don't have the book on you and want to look something up.
I've been working on a bunch of open source projects. I'm also helping out LanceDB; my former co-founder Chang is here with some folks from LanceDB, and I'm very excited about what they're doing there.
Over the last several years, up until very recently, I was 100% focused on Voltron Data, whose mission is unlocking the potential of GPU acceleration in large-scale analytics workloads. We also built a large team to support Apache Arrow development and to partner with companies building on the Arrow ecosystem, so we've created partnerships with Meta, Snowflake, and others, and raised a lot of money, so very, very exciting.
I just launched a micro-venture fund, so micro refers to the size of the checks, so I have some LPs, but I've been investing in data and data infrastructure companies for the last five years and I wanted to be able to do more investing, in particular to invest with a particular thesis around what we can do to help accelerate growth in companies that are building on this new stack of composable open source technologies.
This was just a couple of weeks ago, I'm a speaker here, so I didn't have to buy a VC ticket for this conference, maybe next year, but I am a part-time VC, not a full-time VC, so my goal is to not invest as a full-time job, but I do enjoy helping founders and being involved in startups.
About Posit
Many of you may know Posit by its former name, RStudio, which is a 15-year-old company. It's now 300 people and a remote-first company: the headquarters is in Boston, but there are people all over the world. Its mission is building open source data science software for code-first developers, so it has steered clear of building low-code tools for data science.
The company is also very passionate about technical communication. JJ Allaire, who founded Posit, goes all the way back to ColdFusion, which he created in the 1990s. Building tools to help publish websites and communicate on the internet has been a passion of his for 30 years, and that continues at Posit.
It is a certified B corporation, no plans to go public or to be acquired. It's designing itself for long-term resiliency with the intent to be a 100-year company.
300 people is a lot of mouths to feed, so you might be wondering: how does Posit make money with all this open-source software? Their main business is making open source work in the enterprise, and that comes in a few ways. Workbench, which used to be known as RStudio Server, added support for JupyterLab, Jupyter Notebooks, and VS Code, in addition to the RStudio IDE, enabling you to do development in a hosted environment. Connect allows you to publish Streamlit, Dash, Shiny, and all kinds of data applications, as well as your Jupyter Notebooks and Quarto documents; it's a very helpful product if you need to get your work into the hands of the people you work with. There's also a package manager product, so there's a lot of stuff going on there.
I've actually been involved with this company for a long time. I knew Hadley Wickham and the RStudio folks well before 2016, but we started working together more actively in 2016 when we started the Arrow project. In 2018, I formally partnered with them to create Ursa Labs to do full-time development on Apache Arrow. They helped me incubate Ursa Computing, which turned into Voltron Data. After several years of working on that, I decided to come back to Posit to help them with their journey to become a polyglot data science and computing company.
Going back to 2018, the way I described why I wanted to work with Posit was that Hadley and I think the language wars are stupid. We feel the enmity between the R and Python communities is counterproductive; we are in the business of creating tools to make humans more productive working with data, and we share a passion for the long-term vision of empowering data scientists and creating a positive relationship with the open-source user communities.
Composable data systems
Last year at VLDB, I collaborated with fine folks at Meta to write a paper called The Composable Data Management System Manifesto. It's a great paper. In this talk I want to present some of its key ideas and why they matter, and I do recommend you check out the paper itself; I think it's really well written.
What is a composable data system? The way that I think about this is that we are building systems based on open standards and protocols. We're designing for modularity, reuse, and interoperability with other systems that share those common interfaces in open standards and protocols. In particular, we're resisting this idea of vertical integration where every layer of your system is bespoke and you build it yourself. It's this homespun thing where you build everything top to bottom.
It does create more work in some ways, because you have to deal with coordination and open source diplomacy. You have to think about moving along these different pieces, sharing governance and ownership with other developers working on different layers of the stack. The hope is that over time each of the component pieces becomes so much better, easier to use, faster, and more interoperable, and that you end up with a solution that is far more future-proof and delivers better results over time.
It's a little bit painful, because you have to go through the growing pains of creating these new systems and making them all work together, as well as negotiating with all of the open source developers involved in these projects. As the paper puts it: we envision that by decomposing data management systems into a more modular stack of reusable components, the development of engines can be streamlined while reducing maintenance costs and ultimately providing a more consistent user experience.
By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators as the underlying hardware evolves. That was one of the reasons why we founded Voltron Data. We observed that it's hard for people to take advantage of GPUs to accelerate their systems. We also saw that there's many new hardware companies creating new types of hardware. We need to make a corresponding investment in the software layer to make it easier to take advantage of new developments in hardware, whether that's faster computing, faster networking, faster storage.
Ideally, from the standpoint of you as a user, you just want to be able to write Python code and as the hardware improves, all your software gets faster and you don't have to think about the messy details of, well, how do I take advantage of this bleeding edge development in computing hardware?
The hope is that by relying on a modular stack that reuses execution engines and language front ends, data systems can provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads. That sounds very nice in principle.
Why composability is happening now
One question people have had is, why is this happening now? Why is it happening in the 2020s? Why didn't this happen in the 2010s or even earlier? It helps to go back and think about the generational eras of big data technology going back to the original Google MapReduce paper in the mid-2000s.
Google MapReduce and the Google File System were cloned as Apache Hadoop and HDFS; the first release of Hadoop, out of Yahoo, was in 2006. The MapReduce paper popularized the idea of disaggregated storage and compute: you have a big distributed file system, all of your data sets live as files in that file system, and a bunch of compute engines process those files, reading files in and writing files back out into the distributed file system. But what this created was a collection of monolithic, vertically integrated systems that made up the Hadoop ecosystem.
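The MapReduce model itself is simple enough to sketch in a few lines of Python. This toy word count illustrates the map, shuffle, and reduce phases that Hadoop ran across a cluster of machines; here everything runs in one process, so it is only an illustration of the programming model, not of the distributed system:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for each word in an input "file"
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across mapper outputs
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data systems"]
counts = reduce_phase(shuffle(map_phase(d) for d in documents))
# counts == {"big": 2, "data": 2, "is": 1, "systems": 1}
```

In the real Hadoop, each phase ran on different machines and the "shuffle" moved data over the network, which is exactly where the monolithic, vertically integrated design choices crept in.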
Just getting these components to a place where they were viable and you could work with them productively was hard enough. In hierarchy-of-needs terms, thinking about how we could modularize components of those systems and reuse large pieces of software between them was pie-in-the-sky thinking back in the late 2000s and early 2010s.
But there was a big shift in the 2010s, when a particular large software vendor shifted from selling proprietary, vertically integrated software components to delivering services in a cloud environment. That also led to the emergence of open source standards and file formats like Parquet, and there was major progress in computing, networking, and storage. Disk drives got a lot faster, networking got a lot faster, and so did processors. CPUs improved a little, but GPU performance has continued to go off to the moon, as Moore's Law has carried on in more specialized processor chips.
Through that period, as open standards began to emerge, we started thinking about how we could create technologies to better take advantage of innovation in the hardware layer. That has led us, in the current era, to think about re-architecting all of our systems on interoperable standards that will enable much better performance and efficiency across a heterogeneous application stack doing a mixture of ETL, feature engineering, machine learning training, and AI inference.
We've had this emergence of new standards for composability, which I will talk about. We're seeing simultaneously an emergence of new systems which are built to be composable from the ground up, as well as retrofitting existing pieces of systems with new components which are created with this new composable mindset.
Apache Arrow and the new stack
A lot of my involvement in this was helping create the Apache Arrow project. We realized in the mid-2010s that we needed a cross-language fast memory format that could be used for fast in-memory computing as well as fast data interchange across disparate systems. We started imagining: what if we had a shared data science runtime, a set of libraries and computational systems that we could use interchangeably across programming languages?
It wouldn't matter which language front end you were using. At the same time, in the database world, there was gnashing of teeth about how inefficient it is to import and export data from databases. Even if you're using just a single database or data warehouse, getting data out is like trying to drink a thick milkshake through a straw that's too small, if you've ever had that experience.
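To make the columnar idea behind Arrow concrete, here is a pure-Python sketch (no Arrow library involved) contrasting a row-oriented layout with a column-oriented one. The data and field names are invented for illustration:

```python
from array import array

# Row-oriented: each record stored together, the way a row-at-a-time
# database or a list of Python dicts holds it.
rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 4.50},
    {"id": 3, "price": 12.00},
]

# Column-oriented (the Arrow idea): each column is one contiguous
# buffer. Analytics can scan a single column without touching the
# others, and two systems that agree on the buffer layout can share
# data without serializing it row by row.
columns = {
    "id": array("q", [r["id"] for r in rows]),        # int64 buffer
    "price": array("d", [r["price"] for r in rows]),  # float64 buffer
}

total = sum(columns["price"])  # scans one contiguous buffer
```

Arrow's actual format adds validity bitmaps, nested types, and a precise memory specification on top of this basic idea, which is what lets disparate systems agree on it.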
What we've done is create technologies that provide an open standard for Arrow-native APIs for interacting with databases, using a standard wire protocol out of the box rather than a proprietary or system-specific wire protocol whose data has to be serialized on the way out. Our hope is that, firstly, these can be layered onto existing systems, but at the same time vendors can build support for the new wire protocol directly into the database, so they have their legacy protocol as well as the new open standard protocol, and that yields significantly better performance.
DuckDB is one example of a now very successful project, which puts a super-fast columnar analytical database pretty much anywhere that can run C++ code or WASM, including the browser.
Architecture of composable systems
If we think about the architecture of these new composable systems, there is a diagram from the paper: a modular execution engine, plus a runtime that provides distributed computing and coordination. That could be Spark, Ray, or another distributed runtime that makes use of one of these modular engines like DuckDB or Velox.
There are now several modular execution engines written in different programming languages, such as Apache DataFusion, a Rust-based modular query engine. It has a query optimizer on the front end as well, but it can also be used purely as an execution engine in Rust.
One of the things we've developed in this community to make connecting these engines to language front ends easier is an intermediate representation for queries, called Substrait. The idea is that your language front end, whether that's a DataFrame API or a SQL API, translates into a Substrait plan; we can optimize that Substrait plan, and then the execution engine takes the plan and turns it into a physical execution plan that is sent off to the cluster to be executed.
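The intermediate-representation idea can be illustrated with a toy plan interpreter. Substrait itself is a Protobuf-based format with a rich type system and operator catalog; everything below (the plan shape, the operator and field names, the data) is made up purely to show the pattern of "front end emits a plan as data, engine walks the plan":

```python
# A front end (DataFrame or SQL layer) lowers the user's query into a
# plan expressed as plain data; any engine that understands the plan
# format can execute it.
plan = {
    "op": "aggregate",
    "group_by": "city",
    "measure": ("sum", "sales"),
    "input": {
        "op": "filter",
        "predicate": ("gt", "sales", 100),
        "input": {"op": "scan", "table": "orders"},
    },
}

def execute(node, tables):
    """A toy 'engine' that walks the plan tree bottom-up."""
    op = node["op"]
    if op == "scan":
        return tables[node["table"]]
    if op == "filter":
        rows = execute(node["input"], tables)
        kind, col, value = node["predicate"]
        assert kind == "gt"  # only one comparison in this toy
        return [r for r in rows if r[col] > value]
    if op == "aggregate":
        rows = execute(node["input"], tables)
        agg, col = node["measure"]
        assert agg == "sum"  # only one aggregate in this toy
        out = {}
        for r in rows:
            key = r[node["group_by"]]
            out[key] = out.get(key, 0) + r[col]
        return out
    raise ValueError(f"unknown operator: {op}")

tables = {"orders": [
    {"city": "NYC", "sales": 150},
    {"city": "NYC", "sales": 50},
    {"city": "SF", "sales": 200},
]}
result = execute(plan, tables)  # {"NYC": 150, "SF": 200}
```

The point is that the plan, not the front-end code, is the contract: any engine that understands the same plan vocabulary can execute queries produced by any front end.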
We are also retrofitting existing popular query engines with these new modular accelerators. There are three you can look at or contribute to right now. One is Prestissimo, which retrofits PrestoDB (not Trino, I should clarify) with Velox. There's Comet, just donated to the DataFusion project by Apple, which accelerates Spark with DataFusion. And there's Apache Gluten, part of the Apache Incubator, which accelerates Spark with Velox.
Eventually we'll have DataFusion and Presto and probably every combination of modular accelerators in our query engine projects. All of this begins to pose the question of why as a user should I be locked into using a particular full-stack query engine like Spark or Presto? The cost, performance, and latency of these different systems varies greatly.
We can do a lot with DuckDB. But when you have a problem that's genuinely too big for DuckDB, you want to be able to transition gracefully to a larger, scalable execution engine. Of course, last year at Data Council we had Jordan Tigani on the stage in a duck costume talking about how big data is dead. But big data does still exist. I believe we should be able to transition gracefully from working on a single machine, even a large single machine, to a large-scale data processing engine without as much pain and suffering.
SQL portability and the Ibis project
This isn't that easy, though, in part because every engine advertises "just standard SQL." While SQL is a standard, in practice SQL dialects are non-portable and feature a wide spectrum of differences. There's a tool called SQLGlot, which is amazing, that helps with transpiling from one SQL dialect to another, but this remains a real problem: if you have a mountain of existing SQL queries and want to migrate them to a different data processing engine, you have to rewrite all those queries and deal with the quirky edge cases of type coercion, conversion, and whatnot. And knowing which engine to use in every scenario is non-trivial and may not always be obvious.
A few years ago, in CIDR 2021, folks at Microsoft approached this problem from the standpoint of the Azure data platform: could we create a unified API for interacting with all the different database engines available in Azure, and intelligently choose which one to use based on the workload, such as how big the data set is, what statistics we have about the data, performance, and total cost of ownership for the query? They described a research project called Magpie, with a pandas-like data frame API, a compiler and optimizer in the middle, and a common data layer based on Arrow that provides a distributed data fabric to coordinate the different execution engines within the Azure platform.
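The Magpie-style routing idea can be sketched in a few lines. This is a toy, with made-up engine names, thresholds, and cost figures; a real router would use data statistics, cost models, and workload history rather than a single size estimate:

```python
def choose_engine(bytes_scanned, engines):
    # Pick the cheapest registered engine that can handle the
    # estimated scan size. All engine entries here are illustrative.
    candidates = [e for e in engines if bytes_scanned <= e["max_bytes"]]
    return min(candidates, key=lambda e: e["cost_per_gb"])["name"]

engines = [
    # A local engine: free to run, but only up to ~100 GiB.
    {"name": "duckdb-local", "max_bytes": 100 * 2**30, "cost_per_gb": 0.0},
    # A cloud warehouse: handles anything, but costs money.
    {"name": "warehouse", "max_bytes": 10**15, "cost_per_gb": 0.5},
]

choose_engine(2**30, engines)        # small query -> "duckdb-local"
choose_engine(5 * 10**12, engines)   # 5 TB query  -> "warehouse"
```

The interesting part of Magpie is everything this sketch omits: a shared API and a common Arrow data layer, so that whichever engine is chosen, the user's code and data representation stay the same.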
I love this idea, and I would really like to help it become more mainstream and a reality for data engineers everywhere. A project that I started almost 10 years ago, and that we've been putting a lot of resources into, is the Ibis project in Python, an effort to harmonize the best features of modern SQL with Python's fluent data frame APIs. We also wanted Ibis to fix shortcomings in the pandas API and make things that are hard to do with pandas a lot easier, so we made a deliberate choice to depart from the pandas API.
We wanted to make it easy to build complex, large analytical SQL queries using the benefits of a modern programming language: functions, parameters, things that don't exist in SQL. Over the years, we've been really focused on portability and compatibility across a wide spectrum of engines; Ibis now supports over 20 backends, providing a unified API that could be the basis of a multi-engine data stack.
Ibis code like this produces a pretty complex SQL query, and everything is based on pipe-like sequences of methods that compose with each other to create complex expressions.
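The pattern can be illustrated with a toy builder in the same spirit: methods compose an expression tree lazily, and SQL is generated only when you ask for it. This is a sketch of the idea, not the real Ibis API, and the table and column names are invented:

```python
class Table:
    # A toy fluent query builder: each method returns a new
    # expression rather than executing anything, and SQL is only
    # rendered at the end by to_sql().
    def __init__(self, name, where=None, cols=None):
        self.name, self.where, self.cols = name, where, cols

    def filter(self, predicate):
        return Table(self.name, predicate, self.cols)

    def select(self, *cols):
        return Table(self.name, self.where, cols)

    def to_sql(self):
        cols = ", ".join(self.cols) if self.cols else "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql

expr = Table("events").filter("ts >= '2024-01-01'").select("user_id", "ts")
# expr.to_sql() -> "SELECT user_id, ts FROM events WHERE ts >= '2024-01-01'"
```

Because the expression is data until the end, the same `expr` could in principle be rendered into different SQL dialects or handed to different engines, which is exactly the portability Ibis is after.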
Dev to prod and GPU-scale workloads
One of the workflows we hear a lot of pain points about, from people wanting to work in this ecosystem and create a composable data stack, is going from dev to prod. Nobody wants to burn a ton of Snowflake credits while they're developing. The ideal scenario is to do all of your development and model building locally on your laptop with DuckDB, and then, whenever you're ready to go to prod on Spark SQL or BigQuery or Snowflake, it's about as easy as flipping a switch: run this workflow, but in Snowflake, without rewriting all of your queries. That's one of the key things we're trying to facilitate in this project.
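That flip-the-switch workflow is ultimately a configuration concern. A minimal sketch, assuming a hypothetical `PIPELINE_TARGET` environment variable and made-up engine names and connection details (nothing here is an Ibis or vendor API):

```python
import os

def get_backend():
    # Hypothetical dev-to-prod flip: the same pipeline code runs
    # against a local engine by default and a warehouse in prod,
    # selected by a single environment variable.
    target = os.environ.get("PIPELINE_TARGET", "duckdb")
    if target == "duckdb":
        return {"engine": "duckdb", "dsn": "local.db"}
    if target == "snowflake":
        return {"engine": "snowflake", "dsn": os.environ["SNOWFLAKE_DSN"]}
    raise ValueError(f"unknown target: {target}")
```

The prerequisite for this working at all is that the queries themselves are expressed in a portable form, which is what the Ibis layer above provides.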
One of the reasons this is so important shows up in truly large-scale workloads. Here are benchmarks that were just published by Voltron Data last week at GTC, comparing Spark SQL on EMR with an 80-node, 80-GPU on-prem cluster with InfiniBand and NVLink and all the bells and whistles. If you have a truly large-scale workload at 10 terabytes and up, and you're willing to invest in hardware or rent hardware to work at that scale, you can get some truly amazing performance with the hardware that's available today.
We want to facilitate taking advantage of hardware acceleration in really large-scale workloads. Today, working at small scale and working on very large-scale queries require completely different ways of working, and that's very painful for the user. And if you're a large company with a GPU cluster that you want to reserve for your large-scale workloads, you don't want to clog it with a bunch of small-scale queries.
I see all of these technologies building up to make that possible. I really aspire to help us create the reality of a multi-engine data stack: execution engines tailored for different data scales, so that below a terabyte you can use DuckDB, DataFusion, things that run on your laptop, and at larger scales you can transition to Spark or more specialized tools much more gracefully, and, as a user, much more productively.
I'm pretty excited about these new portable language front-end projects. I didn't have time to cover it in this talk, but something really exciting that I encourage you to check out is PRQL, the Pipelined Relational Query Language, a whole new programming language for analytical queries and another effort to build a better SQL. Particularly as the execution components standardize, I think a lot of the work in data systems is going to be making it easier to orchestrate systems, write queries, and execute them portably across these different environments.
Now, almost eight years into Arrow, almost nine years since we started conceiving and putting together the project, it's great that Arrow is becoming the de facto standard for how we build APIs and how we connect systems together. I think that bodes well for even better performance.
So with that, I'm sure you're all hungry, ready for lunch, but I think I have time for a few questions, and I appreciate your attention. Thanks.
