
Retooling for a smaller data era
I had the pleasure of interviewing Wes McKinney, creator of pandas, a name well known in the data world through his work on the pandas project and his book, Python for Data Analysis. Wes is now at Posit PBC, and during our conversation at Small Data SF we covered several key topics around the evolving data landscape. Wes shared his thoughts on the significance of small data, why it's a compelling topic right now, and what "Retooling for a Smaller Data Era" means for the industry. We also dove into the challenges and potential benefits of shifting from big data to small data, and discussed whether this trend represents the next big movement in data. Curious about Apache Arrow and what's next for Wes? Check out our interview, where Wes gives some great insights into the future of data tooling. #data #ai #smalldatasf2024 #theravitshow
Transcript
This transcript was generated automatically and may contain errors.
All right, welcome to the Ravit Show. We are here at Small Data SF, and look who I have with me: Wes McKinney. I'm so happy to have you on the Ravit Show. After years of following you and your content, and everything you've obviously done for the community out there, I'm excited to finally have you on. I'm excited to chat about small data and about Posit. You co-founded Voltron Data, you created pandas, you co-created Apache Arrow, and there are many more projects that you work on, so I'm excited to chat with you today. Yeah, thanks for having me. Awesome.
Since we're at Small Data SF, I'm curious to know where you're coming from: why are you interested in this topic of small data? Well, I started out building tools for small data, to make working with data sets that fit in memory, on your laptop or your desktop, easier. When I got out of school and found myself doing data analysis in Excel, with SQL, and with R, I really felt the desire to make the tools a lot better, to make people more productive. And I found that I really enjoy building and designing tools for humans to use: not only making systems more efficient and more scalable, but also making the code easier to write, to make the whole process more pleasant for the people in the loop.
What "small data" means today
I think what's interesting is that what was once big data is not big data anymore. People used to think of 100 gigabytes, or a terabyte, as big data, but that data fits on your laptop now. The semiconductor industry has succeeded in packing a lot of very powerful processor cores even into mobile phones and laptops, and disk speed has gotten really, really fast. At the same time there's been a lot of engineering work to build high-performance execution engines like DuckDB that can be used anywhere and can run on your laptop. So the kinds of capabilities that we have that can run locally, right on your laptop or on your mobile phone, are really impressive, and they far exceed anything we had 10 or 15 years ago.
So what was once thought of as big data is no longer big data, and that's partly what we mean when we say small data, or making small data cool again. The way we like to say it is that big data is what happens when you collect a lot of small data over time in a business: collecting sales data, or analytics from users of web applications and mobile applications, things like that. But there have been studies showing that more than 99% of business analytics workloads only end up touching a small fraction of the total data set.
big data is what happens when you collect a lot of small data over time in a business
So the idea is that we want to build tools that can operate efficiently at that scale, while of course keeping the option to use an actual big data system when you need to operate on the full data set. But there's no sense in using heavy artillery for these problems. A hundred gigabytes and on down is what I would qualify as small data, where there's no sense in using a distributed system, running Spark or something where you need to spin up a cluster, when you can rent a single node in AWS that has 24 terabytes of memory. If your data fits in memory, then it's not big data anymore. Just keep it easy, keep it less complex.
Retooling for a smaller data era
What I mean by retooling is that over the last 10 years, there was a realization in the mid-2010s that we needed to rethink many of the different layers of the data stack: not only the user interfaces but the middle layers, the execution engines, the query optimization, the storage layers, to make everything a great deal simpler and more efficient. So my talk was about the open standards and protocols that we've been developing to modularize and standardize those different layers of the stack, to make things more interoperable and more efficient, as well as making the individual people writing code a great deal more productive.
The idea, in thinking about the smaller data era, is that now and in the future, most work can take place on a single node. So we need to optimize around that single-node development experience, to make people really productive and able to prototype and work at that small scale. Maybe 99% of the time they're working on a single node, either in the cloud or on their desktop or laptop. But when they do actually encounter big data, it shouldn't be a great hardship where you have to say, oh, we need to use a totally different system to work with this data. Instead we can say: here's my code, I developed it at small scale, but I need to run it at large scale, and we have the ability to take that user's code, without rewriting it, and run it at large scale on a distributed cluster.
The small data mindset and interactivity
I think it really comes down to a user experience question. Small data is not just about the data size; it's also about the mindset. When we think about small data, we also think about immediacy, speed, and interactivity. So we want to design our user interfaces and our tools around that interactivity: fast feedback cycles, responsiveness, low-latency interactions.
There's a bunch of projects happening there. There's a new project called Mosaic, which is a DuckDB-powered data visualization and cross-filtering framework out of the University of Washington and CMU. It's really about how we build a framework that enables these low-latency, interactive data dashboards. Maybe it's a small data set, maybe it's not, but all of that is in the background, and we can think about how to build a highly interactive, responsive data application where we can get answers to our questions really fast, and the users don't have to think about the infrastructure, how to deploy it and how to run it. So I think that small data mindset really helps inform our user experience design.
Apache Arrow
Apache Arrow is not a project that gets used directly by a lot of end-user data scientists and data engineers, but it's one of those open standards and protocols I was talking about. It provides a high-bandwidth, high-performance data connectivity layer. We recognized that in order to build next-generation, more efficient systems, we needed to be able to transport large quantities of data from system to system at very high bandwidth, across networks, and from storage into memory. And we needed that data format to also be very efficient for query processing, for actually executing and processing data. We started developing Arrow in the mid-2010s, in 2015 and 2016, so it's now almost a nine-year-old project. There are Arrow libraries for all kinds of programming languages and different types of systems, and it has really served as one of the new pieces of connectivity fabric that has enabled these next-generation accelerated systems that are built to power us for the next 20 years.
Ibis and scale-independent computing
Over the last several years we've put a lot of emphasis on a Python project called Ibis, which is basically a portable data frame API aiming to achieve that scale-independent computing for Python developers. It incorporates what we learned from building pandas to create something designed for the local experience, but whenever you need to point to a ClickHouse database, or BigQuery, or Spark SQL, you can go from working locally with DuckDB at a certain scale, call it one terabyte and down on your laptop, to working at the hundred-terabyte or the one-petabyte scale without having to rewrite your code.
You can go from working locally with DuckDB, call it one terabyte and down on your laptop, to working at the hundred-terabyte or the one-petabyte scale without having to rewrite your code.
The Positron project and staying connected
I'm at Posit now, and I'm involved with the Positron project, a new data science IDE. I would encourage you to follow along with that. Ibis, Arrow, and pandas, of course, are still projects that I created or co-created, and they still have large communities around them. So if you're interested in getting involved with these communities, they are always in need of new developers, and your feedback from the user standpoint is also very helpful. There are many ways for you to get involved, and I look forward to seeing everyone around.
Python for data analysis — what's next
We came out with the third edition of Python for Data Analysis in 2022, which was updated for pandas 2.0, and I still keep it up to date. Right now we're in a bit of a transitional period: there's Polars, which is a new data frame library, and people have asked me about having a Polars version of the book. So I'm in a little bit of a wait-and-see mode, trying to conceive of what the fourth edition of the book might look like and what sort of new content I might like to add. But I think the topics treated in the book are really about learning skills, learning how to think about working with data in Python, so a lot of it is really evergreen content. Even if people are using pandas less and less in the future because they're using Polars or something else, you can still take the skills you learn in the book and put them to work, and they will make it that much easier to learn other tools in the ecosystem and become productive.
