Wes McKinney - Retooling for a Smaller Data Era | PyData Global 2024

Transcript

This transcript was generated automatically and may contain errors.

So with that small detour, I'll give a slightly shorter version of my talk. Let's figure out how to control this.

So this talk is about work that's been done in an open source ecosystem over the last 10 years, which has had a lot of impact on the PyData ecosystem, but is more broadly about data science, data processing, database, and data science systems and how they work together.

I assume most people in the audience are familiar with me from my work on the pandas project and my book Python for Data Analysis. In recent years, I've also been shifting my effort to some other projects like Apache Arrow, Apache Parquet, as well as the Ibis project, and we'll talk a little bit about those in this talk.

My present day job is that I'm a principal architect at Posit, formerly known as RStudio, a data science platform company. I co-founded Voltron Data, and I'm a part-time investor through my firm Composed Ventures.

A lot of what I've been involved with at Posit, if you're interested, is doing some work on the new Positron data science IDE. So it is a kind of brand new polyglot IDE experience that's been built on top of the open source VS Code platform. So we created the classic four-pane data science layout with the code editor, console, variables pane, and plots pane. I've been building a fast, scalable, interactive data explorer for looking at data frames and database tables. So you can get that through the public betas of the Positron IDE, so check it out.

Data size is relative

One of the things that we encounter over and over is that data size is relative, and what we think of as big data or small data has changed a lot over the last 20 years. So what used to be big data can now fit on your laptop. We used to think that a gigabyte or 100 gigabytes or a terabyte of data was big data. But now a terabyte of data may compress down to a set of Parquet files that are not that big and can easily fit on your laptop and be queried very effectively with many of the tools that we have today.

And so if you go back and think about the original big data paper from Google, the MapReduce paper from 2004, I don't know if anybody knows how many processing cores, CPU cores, were standard in top-of-the-line servers in 2004. So the answer is one. So in 2004, all servers and data centers had a single processor core. So maybe they had hyper-threading in that era so you could run two threads simultaneously, but the processing capabilities 20 years ago were much more modest than they are today.

And if you look at the servers that you can buy and rent now in AWS or in other cloud services, you can get servers that have 96 physical cores. And so if you have hyper-threading, that's 192 concurrent threads of processing on a single CPU socket. And servers can have two of these or even sometimes more than two sockets. And so you could have a dual socket server in the cloud with 384 simultaneous threads of processing. And so that in a single server is bigger than many of the big distributed clusters of machines that Google folks were talking about in their MapReduce paper 20 years ago. So this is a massive change.
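
The thread counts above are simple multiplication. A minimal sketch of that arithmetic, using the figures quoted in the talk (actual cloud instance types vary, and the variable names here are purely illustrative):

```python
# Figures quoted in the talk; real server configurations differ.
physical_cores_per_socket = 96
threads_per_core = 2   # hyper-threading runs two threads per physical core
sockets = 2            # a dual-socket server

threads_per_socket = physical_cores_per_socket * threads_per_core
total_threads = threads_per_socket * sockets

print(threads_per_socket)  # 192
print(total_threads)       # 384
```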

We've also seen similar evolution in disk drive performance. So 20 years ago, everything was spinning rust hard disk drives with high seek latency. We moved on to solid state drives and got massive improvement in seek performance and much higher read and write bandwidth. And then more recently, we've moved on to NVMe (Non-Volatile Memory Express) drives, which have brought read and write bandwidth close to 10 gigabytes a second. And I expect that disk will continue to get faster and faster.

And we've seen similar trends in networking performance as well. So this chart shows basically state-of-the-art networking performance over time. And so we've gone from gigabit Ethernet to terabit Ethernet in less than 20 years, which is pretty impressive.

For a long time, people were talking about how Moore's law is dead. So Moore's law is the idea that every 18 months, the transistor density in CPUs doubles, and effectively, the processing capability of CPUs goes up by a factor of two. And that started to plateau in CPUs at some point, but we've continued to see core counts go up, especially in GPUs and specialized silicon, which has enabled us to continue to have that exponential increase in processing capabilities.

Ergonomics and the cost of big data systems

But one of the challenges in thinking about the development of big data systems compared with our nice, friendly Python tools, PyData tools, is that people thought about the ergonomics, the usability, the user experience of big data systems in a very different way. Whereas in Python, we really value our library ergonomics, the code that we write, that it should be very easy to read, that it should be easy to type, very fluent and natural.

And so that emphasis on developer productivity, user productivity, was comparatively less important in the big data ecosystem, where it was really all about how do we make it feasible to process terabytes and petabytes of data. And so, not just the usability and the ergonomics, but also the overhead and the cost of processing data at scale were often an afterthought.

This was highlighted in this famous paper from 2015 by some former Microsoft Research developers who've gone on to have illustrious careers working on TensorFlow and Materialize and other projects. They pointed out that while many of the big data systems had achieved scalability and the ability to process large data sets, they also introduced a lot of overhead that makes the cost of processing each terabyte or each gigabyte significantly higher than what you could do on smaller scale data sets on a single machine.

So the way that they described it in the paper is these systems have achieved impressive scalability, but to what extent are they truly improving performance as opposed to parallelizing the overhead that they introduce into the process?
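
One way to make that question concrete is a toy cost model. The sketch below is only an illustration under stated assumptions: the distributed system pays a fixed startup overhead, processes each row some factor slower per core than a tuned single-threaded program, and otherwise scales perfectly with core count. The function names and every number are hypothetical and are not taken from the paper.

```python
from typing import Optional


def single_thread_seconds(rows: int, rows_per_second: float) -> float:
    """Run time of a tuned single-threaded program."""
    return rows / rows_per_second


def distributed_seconds(rows: int, cores: int, rows_per_second: float,
                        slowdown: float, startup_seconds: float) -> float:
    """Run time of a system that is `slowdown` times slower per core,
    pays a fixed startup cost, and scales perfectly (an optimistic assumption)."""
    per_core_rate = rows_per_second / slowdown
    return startup_seconds + rows / (per_core_rate * cores)


def cores_to_break_even(rows: int, rows_per_second: float, slowdown: float,
                        startup_seconds: float, max_cores: int = 1024) -> Optional[int]:
    """Smallest core count at which the distributed system beats one thread,
    or None if it never does within `max_cores`."""
    baseline = single_thread_seconds(rows, rows_per_second)
    for cores in range(1, max_cores + 1):
        if distributed_seconds(rows, cores, rows_per_second,
                               slowdown, startup_seconds) < baseline:
            return cores
    return None


# Hypothetical workload: 1 billion rows, 10 million rows/s on one thread,
# 8x per-core overhead and 20 s of startup in the distributed system.
print(cores_to_break_even(1_000_000_000, 10_000_000.0, 8.0, 20.0))  # 11
# A small job never amortizes the 20 s startup cost.
print(cores_to_break_even(10_000_000, 10_000_000.0, 8.0, 20.0))     # None
```

In this toy setup the first ten cores are spent purely on cancelling out the system's own overhead, which is the sense in which scalability and performance are different things.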

And so, to make what we're talking about a little bit more concrete, think about a SQL query that aggregates a very large table. Maybe we're grouping by one column, computing the average of another column. We may very well write similar code in pandas or similar code in R. And these are all basically doing the same thing, but under the hood, the architecture of the systems that evaluate this code, depending on the scale of the data and the underlying execution engine, is very different.

So in pandas and R, we're loading the whole data set into memory, whereas in the SQL engine, it might be some distributed MapReduce job that's running on a big cluster of machines if you have a massive data set that can't possibly fit on a single node.
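
To illustrate how the same group-by average can be evaluated in very different ways, here is a small sketch using only the Python standard library. SQLite stands in for a SQL engine, a plain dictionary-based function stands in for the in-memory approach (in pandas this would be `df.groupby("key")["value"].mean()`), and a pair of map and reduce functions imitates a distributed job by combining per-partition partial sums and counts. The table name, column names, and sample rows are made up for illustration.

```python
import sqlite3
from collections import defaultdict

rows = [("a", 1.0), ("b", 4.0), ("a", 3.0), ("b", 6.0), ("c", 5.0)]


def mean_by_key_sql(rows):
    """Let a SQL engine (SQLite here, as a stand-in) plan and run the aggregation."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (key TEXT, value REAL)")
    con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = dict(con.execute("SELECT key, AVG(value) FROM t GROUP BY key").fetchall())
    con.close()
    return result


def mean_by_key_in_memory(rows):
    """Hold the whole data set in memory and aggregate in a single process."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def map_partition(partition):
    """Map step: each worker emits a partial (sum, count) per key for its partition."""
    partial = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        partial[key][0] += value
        partial[key][1] += 1
    return partial


def reduce_partials(partials):
    """Reduce step: merge partial sums and counts, then divide once at the end."""
    total = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (s, n) in partial.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


def mean_by_key_mapreduce(rows, n_partitions=2):
    """Split rows into partitions as if they lived on different machines."""
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    return reduce_partials(map_partition(p) for p in partitions)


print(mean_by_key_sql(rows))
print(mean_by_key_in_memory(rows))
print(mean_by_key_mapreduce(rows))
```

All three return the same averages per key. Note that the map step carries sums and counts rather than averages, because averages of partitions cannot be combined correctly without the counts.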

And so this has led to this sort of hierarchy of needs when thinking about building these systems, when looking at things from the big data perspective, where the ability to scale to say, I can process a terabyte of data or 10 terabytes or a petabyte of data, that starts out as being the primary concern. And only after that point can you begin to think about, okay, how do we make it faster? Just wall clock time. We want it to take less time from start to finish to complete the job.

After you've made it fast, you can start thinking about efficiency, both in terms of resources, like the amount of hardware you use, and increasingly in terms of power utilization. And so how many kilowatt hours and thus how much money does it cost to execute a particular workload? And when you're paying for compute time by the core hour in the cloud, this starts to become a big deal because the queries that you're running are converting into a bill that you're getting from AWS or from Google Cloud.
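
The difference between "faster" and "more efficient" shows up directly in core-hour billing. The sketch below assumes a flat price per core hour; the price, the core counts, and the run times are all hypothetical numbers chosen for illustration, not real cloud pricing.

```python
def query_cost_usd(cores: int, wall_clock_seconds: float,
                   usd_per_core_hour: float) -> float:
    """Cost of a job billed per core hour (assumes a flat, hypothetical rate)."""
    core_hours = cores * wall_clock_seconds / 3600
    return core_hours * usd_per_core_hour


PRICE = 0.05  # hypothetical USD per core hour

# A 400-core cluster finishes in 90 seconds: 10 core hours.
cluster_cost = query_cost_usd(cores=400, wall_clock_seconds=90, usd_per_core_hour=PRICE)

# A 16-core single machine takes 450 seconds: 2 core hours.
single_node_cost = query_cost_usd(cores=16, wall_clock_seconds=450, usd_per_core_hour=PRICE)

print(f"cluster:     ${cluster_cost:.2f}")      # cluster:     $0.50
print(f"single node: ${single_node_cost:.2f}")  # single node: $0.10
```

In this made-up scenario the cluster is five times faster in wall clock time but also five times more expensive per query.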

And then beyond these performance and efficiency considerations, as we start building more heterogeneous data systems that are doing raw data processing, feature engineering, as well as AI training and model serving, we have a set of components solving many different problems that can't all be solved within a single system. And so how those systems fit together, how they are composed with each other, and how efficient it is to compose them together is another concern which has come more to the forefront in recent years, especially with the growth and adoption of AI.

So the hope of all of this is to work towards a future multi-engine data stack where we can choose the execution engine that makes the most sense for the size of the data to get good performance and good efficiency, but with freedom of choice when it comes to the language front end.
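
As a rough sketch of what choosing an engine by data size could look like, the snippet below picks the smallest tier whose assumed capacity covers the estimated input size. The tier names and thresholds are hypothetical placeholders, not recommendations or real product limits, and a real multi-engine stack would also route the same query expression to whichever engine is chosen.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class Engine:
    name: str
    max_bytes: float  # largest input this tier is assumed to handle comfortably


# Hypothetical tiers and thresholds, for illustration only.
ENGINES = [
    Engine("in-memory dataframe", 8 * GIB),
    Engine("single-node out-of-core engine", 2048 * GIB),
    Engine("distributed cluster", float("inf")),
]


def choose_engine(estimated_bytes: int, engines=ENGINES) -> Engine:
    """Return the smallest tier whose assumed capacity covers the input."""
    for engine in sorted(engines, key=lambda e: e.max_bytes):
        if estimated_bytes <= engine.max_bytes:
            return engine
    raise ValueError("no engine can handle this input size")


print(choose_engine(1 * GIB).name)    # in-memory dataframe
print(choose_engine(500 * GIB).name)  # single-node out-of-core engine
print(choose_engine(10 ** 16).name)   # distributed cluster
```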

So sorry about the AV issues, but I appreciate everybody coming and look forward to engaging with you all in the open source world and see you next time.
