
The Future Roadmap for the Composable Data Stack
Discover cutting-edge advancements in data processing stacks. Listen in as Wes McKinney dives into pivotal projects like Parquet and Arrow, alongside essential interface libraries like Ibis and the tidyverse. Wes navigates the current state of these projects, highlighting areas for further innovation and growth.

Sign up for our "No BS" Newsletter to get the latest technical data & AI content: https://hubs.li/Q02vz6xC0

ABOUT THE SPEAKER: Wes McKinney, Principal Architect, Posit PBC (co-founder of Voltron Data)

ABOUT DATA COUNCIL: Data Council brings together the brightest minds in data to share industry knowledge, technical architectures and best practices in building cutting edge data & AI systems and tools.

FIND US:
Twitter: https://twitter.com/datacouncilai
LinkedIn: https://www.linkedin.com/company/datacouncil-ai/
Website: https://www.datacouncil.ai/
Transcript
This transcript was generated automatically and may contain errors.
So, this talk is a bit of an overview of some work that's been going on in the open source ecosystem over the last decade, my thoughts about it, and some ideas about where I think things are going.
But firstly, I'll tell you a little bit about my background. Most people are familiar with pandas, but I've been working on lots of projects over the last 16 or 17 years. I'll tell you about my new role back at Posit and what I'm doing there, why I work for Posit, why you should pay more attention to Posit as a company, give you an overview of where this whole concept of composable data systems came from, tell you about what's active and what I think is exciting right now, and where I think things are headed, or at least where I'm personally focusing my energy to help things move forward.
Background and current roles
So, as Sean mentioned, my full-time job is as an architect at Posit, formerly RStudio. I co-founded Voltron Data, which just had a big week at the GTC conference last week; I'll talk a little bit about that. I did just launch a venture fund, which I'll tell you about. Many of you probably have my book, Python for Data Analysis; it's in its third edition and now freely available on the internet, so if you go to wesmckinney.com/book, you can read it for free. That's handy if you don't have the book on you and want to look something up.
I've been working on a bunch of open source projects. I'm also helping out LanceDB; my former co-founder Chang is here with some folks from LanceDB, and I'm very excited about what they're doing there.
Over the last several years, up until very recently, I was 100% focused on Voltron Data, whose mission is unlocking the potential of GPU acceleration in large-scale analytics workloads. We also built a large team to support Apache Arrow development and to partner with companies building on the Arrow ecosystem, so we've created partnerships with Meta, Snowflake, and others, and raised a lot of money, so very, very exciting.
I just launched a micro-venture fund, so micro refers to the size of the checks, so I have some LPs, but I've been investing in data and data infrastructure companies for the last five years and I wanted to be able to do more investing, in particular to invest with a particular thesis around what we can do to help accelerate growth in companies that are building on this new stack of composable open source technologies.
This was just a couple of weeks ago, I'm a speaker here, so I didn't have to buy a VC ticket for this conference, maybe next year, but I am a part-time VC, not a full-time VC, so my goal is to not invest as a full-time job, but I do enjoy helping founders and being involved in startups.
About Posit
Many of you may know Posit by its former name, RStudio, which is a 15-year-old company. It's now 300 people and a remote-first company: the headquarters is in Boston, but there are people all over the world. Its mission is building open source data science software for code-first developers, so it has steered clear of building low-code tools for data science.
The company is also very passionate about technical communication. JJ Allaire, who founded Posit, goes all the way back to ColdFusion, which he created in the 1990s. Building tools to help publish websites and communicate on the internet has been a passion of his for 30 years, and that continues at Posit.
It is a certified B corporation, no plans to go public or to be acquired. It's designing itself for long-term resiliency with the intent to be a 100-year company.
300 people is a lot of mouths to feed, so you might be wondering: how does Posit make money with all this open-source software? Their main business is making open source work in the enterprise, and that comes in a few ways. Workbench, which used to be known as RStudio Server, added support for JupyterLab, Jupyter Notebooks, and VS Code, in addition to the RStudio IDE, enabling you to do development in a hosted environment. Connect allows you to publish Streamlit, Dash, Shiny, and all kinds of data applications, as well as your Jupyter Notebooks and Quarto documents; it's a very helpful product if you need to get your work into the hands of the people you work with. There's also a package manager product, so there's a lot of stuff going on there.
I've actually been involved with this company for a long time. I knew Hadley Wickham and the RStudio folks well before 2016, but we started working together more actively in 2016 when we started the Arrow project. In 2018, I formally partnered with them to create Ursa Labs to do full-time development on Apache Arrow. They helped me incubate Ursa Computing, which turned into Voltron Data. After several years of working on that, I decided to come back to Posit to help them with their journey to become a polyglot data science and computing company.
Going back to 2018, the way I described why I wanted to work with Posit was that Hadley and I think the language wars are stupid. We feel the enmity between the R and Python communities is counterproductive; we are in the business of creating tools to make humans more productive working with data, and we share a passion for the long-term vision of empowering data scientists and creating a positive relationship with the open-source user communities.
Composable data systems
Last year at VLDB, I collaborated with fine folks at Meta to write a paper called The Composable Data Management System Manifesto. It's a great paper. In this talk I want to present some of its key ideas and why they matter, and I do recommend you check out the paper itself; I think it's really well written.
What is a composable data system? The way that I think about this is that we are building systems based on open standards and protocols. We're designing for modularity, reuse, and interoperability with other systems that share those common interfaces in open standards and protocols. In particular, we're resisting this idea of vertical integration where every layer of your system is bespoke and you build it yourself. It's this homespun thing where you build everything top to bottom.
It does create more work in some ways, because you have to deal with coordination and open source diplomacy. You have to think about moving along these different pieces, sharing governance and ownership with other developers working on different layers of the stack. The hope is that over time each of the component pieces becomes so much better, easier to use, faster, and more interoperable, and that you end up with a solution that is far more future-proof and delivers better results over time.
It's a little bit painful, because you have to go through the growing pains of creating these new systems and making them all work together, as well as negotiating with all of the open source developers involved in these projects. As the paper puts it: we envision that by decomposing data management systems into a more modular stack of reusable components, the development of engines can be streamlined while reducing maintenance costs and ultimately providing a more consistent user experience.
By clearly outlining APIs and encapsulating responsibilities, data management software could more easily be adapted, for example, to leverage novel devices and accelerators as the underlying hardware evolves. That was one of the reasons why we founded Voltron Data. We observed that it's hard for people to take advantage of GPUs to accelerate their systems. We also saw that there's many new hardware companies creating new types of hardware. We need to make a corresponding investment in the software layer to make it easier to take advantage of new developments in hardware, whether that's faster computing, faster networking, faster storage.
Ideally, from the standpoint of you as a user, you just want to be able to write Python code and as the hardware improves, all your software gets faster and you don't have to think about the messy details of, well, how do I take advantage of this bleeding edge development in computing hardware?
The hope is that by relying on a modular stack that reuses execution engines and language front ends, data systems can provide a more consistent experience and semantics to users, from transactional to analytic systems, from stream processing to machine learning workloads. That sounds very nice in principle.
Why composability is happening now
One question people have had is, why is this happening now? Why is it happening in the 2020s? Why didn't this happen in the 2010s or even earlier? It helps to go back and think about the generational eras of big data technology going back to the original Google MapReduce paper in the mid-2000s.
Google MapReduce and the Google File System were cloned as Apache Hadoop and HDFS; the first release of Hadoop, out of Yahoo, was in 2006. The MapReduce paper popularized the idea of disaggregated storage and compute: you have a big distributed file system, all of your data sets live as files in that file system, and a bunch of compute engines process those files, reading files in and writing files back out into the distributed file system. But what this created was a collection of monolithic, vertically integrated systems that made up the Hadoop ecosystem.
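The MapReduce model itself is simple enough to sketch in a few lines of Python. This toy word count illustrates the map, shuffle, and reduce phases that Hadoop ran across a cluster of machines; here everything runs in one process, so it is only an illustration of the programming model, not of the distributed system:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (word, 1) pairs for each word in an input "file"
    return [(word, 1) for word in document.split()]

def shuffle(mapped):
    # Shuffle: group all emitted values by key across mapper outputs
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's list of values
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data is big", "data systems"]
counts = reduce_phase(shuffle(map_phase(d) for d in documents))
# counts == {"big": 2, "data": 2, "is": 1, "systems": 1}
```

In the real Hadoop, each phase ran on different machines and the "shuffle" moved data over the network, which is exactly where the monolithic, vertically integrated design choices crept in.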
Just getting these components to a place where they were viable and you could work with them productively was hard enough. In hierarchy-of-needs terms, thinking about how we could modularize components of those systems and reuse large pieces of software between them was pie-in-the-sky thinking back in the late 2000s and early 2010s.
But there was a big shift in the 2010s, when a particular large software vendor shifted from selling proprietary, vertically integrated software components to delivering services in a cloud environment. That also led to the emergence of open source standards and file formats like Parquet, and there was major progress in computing, networking, and storage. Disk drives got a lot faster, networking got a lot faster, and so did processors. CPUs improved a little, but GPU performance has continued to go off to the moon, as Moore's Law has carried on in more specialized processor chips.
Through that period, as open standards began to emerge, we started thinking about how we could create technologies to better take advantage of innovation in the hardware layer. That has led us, in the current era, to think about re-architecting all of our systems on interoperable standards that will enable much better performance and efficiency across a heterogeneous application stack doing a mixture of ETL, feature engineering, machine learning training, and AI inference.
We've had this emergence of new standards for composability, which I will talk about. We're seeing simultaneously an emergence of new systems which are built to be composable from the ground up, as well as retrofitting existing pieces of systems with new components which are created with this new composable mindset.
Apache Arrow and the new stack
A lot of my involvement in this was helping create the Apache Arrow project. We realized in the mid-2010s that we needed a cross-language fast memory format that could be used for fast in-memory computing as well as fast data interchange across disparate systems. We started imagining: what if we had a shared data science runtime, a set of libraries and computational systems that we could use interchangeably across programming languages?
It wouldn't matter which language front end you were using. At the same time, in the database world, there was gnashing of teeth about how inefficient it is to import and export data from databases. Even if you're using just a single database or data warehouse, getting data out is like trying to drink a thick milkshake through a straw that's too small, if you've ever had that experience.
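To make the columnar idea behind Arrow concrete, here is a pure-Python sketch (no Arrow library involved) contrasting a row-oriented layout with a column-oriented one. The data and field names are invented for illustration:

```python
from array import array

# Row-oriented: each record stored together, the way a row-at-a-time
# database or a list of Python dicts holds it.
rows = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": 4.50},
    {"id": 3, "price": 12.00},
]

# Column-oriented (the Arrow idea): each column is one contiguous
# buffer. Analytics can scan a single column without touching the
# others, and two systems that agree on the buffer layout can share
# data without serializing it row by row.
columns = {
    "id": array("q", [r["id"] for r in rows]),        # int64 buffer
    "price": array("d", [r["price"] for r in rows]),  # float64 buffer
}

total = sum(columns["price"])  # scans one contiguous buffer
```

Arrow's actual format adds validity bitmaps, nested types, and a precise memory specification on top of this basic idea, which is what lets disparate systems agree on it.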
What we've done is create technologies that provide an open standard for Arrow-native APIs for interacting with databases, using a standard wire protocol out of the box rather than a proprietary or system-specific wire protocol whose data has to be serialized on the way out. Our hope is that, firstly, these can be layered onto existing systems, but at the same time vendors can build support for the new wire protocol directly into the database, so they have their legacy protocol as well as the new open standard protocol, and that yields significantly better performance.
DuckDB is one example of a now very successful project, which puts a super-fast columnar analytical database pretty much anywhere that can run C++ code or WASM, including the browser.
Architecture of composable systems
If we think about the architecture of these new composable systems, there is a diagram from the paper: a modular execution engine, plus a runtime that provides distributed computing and coordination. That could be Spark, Ray, or another distributed runtime that makes use of one of these modular engines like DuckDB or Velox.
There are now several modular execution engines written in different programming languages, such as Apache DataFusion, a Rust-based modular query engine. It has a query optimizer on the front end as well, but it can also be used purely as an execution engine in Rust.
One of the things we've developed in this community to make connecting these engines to language front ends easier is an intermediate representation for queries, called Substrait. The idea is that your language front end, whether that's a DataFrame API or a SQL API, translates into a Substrait plan; we can optimize that Substrait plan, and then the execution engine takes the plan and turns it into a physical execution plan that is sent off to the cluster to be executed.
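The intermediate-representation idea can be illustrated with a toy plan interpreter. Substrait itself is a Protobuf-based format with a rich type system and operator catalog; everything below (the plan shape, the operator and field names, the data) is made up purely to show the pattern of "front end emits a plan as data, engine walks the plan":

```python
# A front end (DataFrame or SQL layer) lowers the user's query into a
# plan expressed as plain data; any engine that understands the plan
# format can execute it.
plan = {
    "op": "aggregate",
    "group_by": "city",
    "measure": ("sum", "sales"),
    "input": {
        "op": "filter",
        "predicate": ("gt", "sales", 100),
        "input": {"op": "scan", "table": "orders"},
    },
}

def execute(node, tables):
    """A toy 'engine' that walks the plan tree bottom-up."""
    op = node["op"]
    if op == "scan":
        return tables[node["table"]]
    if op == "filter":
        rows = execute(node["input"], tables)
        kind, col, value = node["predicate"]
        assert kind == "gt"  # only one comparison in this toy
        return [r for r in rows if r[col] > value]
    if op == "aggregate":
        rows = execute(node["input"], tables)
        agg, col = node["measure"]
        assert agg == "sum"  # only one aggregate in this toy
        out = {}
        for r in rows:
            key = r[node["group_by"]]
            out[key] = out.get(key, 0) + r[col]
        return out
    raise ValueError(f"unknown operator: {op}")

tables = {"orders": [
    {"city": "NYC", "sales": 150},
    {"city": "NYC", "sales": 50},
    {"city": "SF", "sales": 200},
]}
result = execute(plan, tables)  # {"NYC": 150, "SF": 200}
```

The point is that the plan, not the front-end code, is the contract: any engine that understands the same plan vocabulary can execute queries produced by any front end.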
We are also retrofitting existing popular query engines with these new modular accelerators. There are three you can look at or contribute to right now. One is Prestissimo, which retrofits PrestoDB (not Trino, I should clarify) with Velox. There's Comet, just donated to the DataFusion project by Apple, which accelerates Spark with DataFusion. And there's Apache Gluten, part of the Apache Incubator, which accelerates Spark with Velox.
Eventually we'll have DataFusion and Presto and probably every combination of modular accelerators in our query engine projects. All of this begins to pose the question of why as a user should I be locked into using a particular full-stack query engine like Spark or Presto? The cost, performance, and latency of these different systems varies greatly.
We can do a lot with DuckDB. But when you have a problem that's genuinely too big for DuckDB, you want to be able to transition gracefully to a larger, scalable execution engine. Of course, last year at Data Council we had Jordan Tigani on the stage in a duck costume talking about how big data is dead. But big data does still exist. I believe we should be able to transition gracefully from working on a single machine, even a large single machine, to a large-scale data processing engine without as much pain and suffering.
SQL portability and the Ibis project
This isn't that easy, though, in part because every engine advertises "just standard SQL." While SQL is a standard, in practice SQL dialects are non-portable and feature a wide spectrum of differences. There's a tool called SQLGlot, which is amazing, that helps with transpiling from one SQL dialect to another, but this remains a real problem: if you have a mountain of existing SQL queries and want to migrate them to a different data processing engine, you have to rewrite all those queries and deal with the quirky edge cases of type coercion, conversion, and whatnot. And knowing which engine to use in every scenario is non-trivial and may not always be obvious.
A few years ago, in CIDR 2021, folks at Microsoft approached this problem from the standpoint of the Azure data platform: could we create a unified API for interacting with all the different database engines available in Azure, and intelligently choose which one to use based on the workload, such as how big the data set is, what statistics we have about the data, performance, and total cost of ownership for the query? They described a research project called Magpie, with a pandas-like data frame API, a compiler and optimizer in the middle, and a common data layer based on Arrow that provides a distributed data fabric to coordinate the different execution engines within the Azure platform.
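The Magpie-style routing idea can be sketched in a few lines. This is a toy, with made-up engine names, thresholds, and cost figures; a real router would use data statistics, cost models, and workload history rather than a single size estimate:

```python
def choose_engine(bytes_scanned, engines):
    # Pick the cheapest registered engine that can handle the
    # estimated scan size. All engine entries here are illustrative.
    candidates = [e for e in engines if bytes_scanned <= e["max_bytes"]]
    return min(candidates, key=lambda e: e["cost_per_gb"])["name"]

engines = [
    # A local engine: free to run, but only up to ~100 GiB.
    {"name": "duckdb-local", "max_bytes": 100 * 2**30, "cost_per_gb": 0.0},
    # A cloud warehouse: handles anything, but costs money.
    {"name": "warehouse", "max_bytes": 10**15, "cost_per_gb": 0.5},
]

choose_engine(2**30, engines)        # small query -> "duckdb-local"
choose_engine(5 * 10**12, engines)   # 5 TB query  -> "warehouse"
```

The interesting part of Magpie is everything this sketch omits: a shared API and a common Arrow data layer, so that whichever engine is chosen, the user's code and data representation stay the same.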
I love this idea, and I would really like to help it become more mainstream and a reality for data engineers everywhere. A project that I started almost 10 years ago, and that we've been putting a lot of resources into, is the Ibis project in Python, an effort to harmonize the best features of modern SQL with Python's fluent data frame APIs. We also wanted Ibis to fix shortcomings in the pandas API and make things that are hard to do with pandas a lot easier, so we made a deliberate choice to depart from the pandas API.
We wanted to make it easy to build complex, large analytical SQL queries using the benefits of a modern programming language: functions, parameters, things that don't exist in SQL. Over the years, we've been really focused on portability and compatibility across a wide spectrum of engines; Ibis now supports over 20 backends, providing a unified API that could be the basis of a multi-engine data stack.
Ibis code like this produces a pretty complex SQL query, and everything is based on pipe-like sequences of methods that compose with each other to create complex expressions.
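The pattern can be illustrated with a toy builder in the same spirit: methods compose an expression tree lazily, and SQL is generated only when you ask for it. This is a sketch of the idea, not the real Ibis API, and the table and column names are invented:

```python
class Table:
    # A toy fluent query builder: each method returns a new
    # expression rather than executing anything, and SQL is only
    # rendered at the end by to_sql().
    def __init__(self, name, where=None, cols=None):
        self.name, self.where, self.cols = name, where, cols

    def filter(self, predicate):
        return Table(self.name, predicate, self.cols)

    def select(self, *cols):
        return Table(self.name, self.where, cols)

    def to_sql(self):
        cols = ", ".join(self.cols) if self.cols else "*"
        sql = f"SELECT {cols} FROM {self.name}"
        if self.where:
            sql += f" WHERE {self.where}"
        return sql

expr = Table("events").filter("ts >= '2024-01-01'").select("user_id", "ts")
# expr.to_sql() -> "SELECT user_id, ts FROM events WHERE ts >= '2024-01-01'"
```

Because the expression is data until the end, the same `expr` could in principle be rendered into different SQL dialects or handed to different engines, which is exactly the portability Ibis is after.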
Dev to prod and GPU-scale workloads
One of the workflows we hear a lot of pain points about, from people wanting to work in this ecosystem and create a composable data stack, is going from dev to prod. Nobody wants to burn a ton of Snowflake credits while they're developing. The ideal scenario is to do all of your development and model building locally on your laptop with DuckDB, and then, whenever you're ready to go to prod on Spark SQL or BigQuery or Snowflake, it's about as easy as flipping a switch: run this workflow, but in Snowflake, without rewriting all of your queries. That's one of the key things we're trying to facilitate in this project.
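That flip-the-switch workflow is ultimately a configuration concern. A minimal sketch, assuming a hypothetical `PIPELINE_TARGET` environment variable and made-up engine names and connection details (nothing here is an Ibis or vendor API):

```python
import os

def get_backend():
    # Hypothetical dev-to-prod flip: the same pipeline code runs
    # against a local engine by default and a warehouse in prod,
    # selected by a single environment variable.
    target = os.environ.get("PIPELINE_TARGET", "duckdb")
    if target == "duckdb":
        return {"engine": "duckdb", "dsn": "local.db"}
    if target == "snowflake":
        return {"engine": "snowflake", "dsn": os.environ["SNOWFLAKE_DSN"]}
    raise ValueError(f"unknown target: {target}")
```

The prerequisite for this working at all is that the queries themselves are expressed in a portable form, which is what the Ibis layer above provides.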
One of the reasons this is so important shows up in truly large-scale workloads. Here are benchmarks that were just published by Voltron Data last week at GTC, comparing Spark SQL on EMR with an 80-node, 80-GPU on-prem cluster with InfiniBand and NVLink and all the bells and whistles. If you have a truly large-scale workload at 10 terabytes and up, and you're willing to invest in hardware or rent hardware to work at that scale, you can get some truly amazing performance with the hardware that's available today.
We want to facilitate taking advantage of hardware acceleration in really large-scale workloads. Today, working at small scale and working on very large-scale queries require completely different ways of working, and that's very painful for the user. And if you're a large company with a GPU cluster that you want to reserve for your large-scale workloads, you don't want to clog it with a bunch of small-scale queries.
I see all of these technologies building up to make that possible. I really aspire to help us create the reality of a multi-engine data stack: execution engines tailored for different data scales, so that below a terabyte you can use DuckDB, DataFusion, things that run on your laptop, and at larger scales you can transition to Spark or more specialized tools much more gracefully, and, as a user, much more productively.
I'm pretty excited about these new portable language front-end projects. I didn't have time to cover it in this talk, but something really exciting that I encourage you to check out is PRQL, the Pipelined Relational Query Language, a whole new programming language for analytical queries and another effort to build a better SQL. Particularly as the execution components standardize, I think a lot of the work in data systems is going to be making it easier to orchestrate systems, write queries, and execute them portably across these different environments.
Now, almost eight years into Arrow, almost nine years since we started conceiving and putting together the project, it's great that Arrow is becoming the de facto standard for how we build APIs and how we connect systems together. I think that bodes well for even better performance.
So with that, I'm sure you're all hungry, ready for lunch, but I think I have time for a few questions, and I appreciate your attention. Thanks.
