
Retooling for a smaller data era
I had the pleasure of interviewing Wes McKinney, creator of pandas, a name well known in the data world through his work on the pandas project and his book, Python for Data Analysis. Wes is now at Posit PBC, and during our conversation at Small Data SF we covered several key topics around the evolving data landscape. Wes shared his thoughts on the significance of small data, why it's a compelling topic right now, and what "Retooling for a Smaller Data Era" means for the industry. We also dove into the challenges and potential benefits of shifting from big data to small data, and discussed whether this trend represents the next big movement in data. Curious about Apache Arrow and what's next for Wes? Check out our interview, where Wes gives some great insights into the future of data tooling. #data #ai #smalldatasf2024 #theravitshow
Transcript
This transcript was generated automatically and may contain errors.
All right, welcome to the Ravit Show. We are here at Small Data SF, and look who I have with me: Wes McKinney. I'm so happy to have you on the Ravit Show. After years of following you and your content, and everything you've obviously done for the community out there, I'm excited to finally have you on. I'm excited to chat about small data and about Posit. You co-founded Voltron Data, you created pandas, you co-created Apache Arrow, and there are many more projects that you work on, so I'm excited to chat with you today. Yeah, thanks for having me. Awesome.
Since we're at Small Data SF, I'm curious to know where you're coming from: why are you interested in this topic of small data? Well, I started out building tools for small data, to make working with data sets that fit in memory, on your laptop or your desktop, easier. When I got out of school and found myself doing data analysis in Excel, with SQL, and with R, I really felt the desire to make the tools a lot better, to make people more productive. And I found that I really enjoy building and designing tools for humans to use: not only making systems more efficient and more scalable, but also making the code easier to write, to make the whole process more pleasant for the people in the loop.
What "small data" means today
I think what's interesting is that what was once big data is not big data anymore. People used to think of 100 gigabytes, or a terabyte, as big data, but that data fits on your laptop now. The semiconductor industry has succeeded in packing a lot of very powerful processor cores even into mobile phones and laptops, and disk speed has gotten really, really fast. At the same time there's been a lot of engineering work to build high-performance execution engines like DuckDB that can be used anywhere and can run on your laptop. So the kinds of capabilities that we have that can run locally, right on your laptop or on your mobile phone, are really impressive, and they far exceed anything we had 10 or 15 years ago.
So what was once thought of as big data is no longer big data, and that's partly what we mean when we say small data, or making small data cool again. The way we like to say it is that big data is what happens when you collect a lot of small data over time in a business: collecting sales data, or analytics from users of web applications and mobile applications, things like that. But there have been studies showing that more than 99% of business analytics workloads only end up touching a small fraction of the total data set.
big data is what happens when you collect a lot of small data over time in a business
So the idea is that we want to build tools that can operate efficiently at that scale, while of course keeping the option to use an actual big data system when you need to operate on the full data set. But there's no sense in using heavy artillery for these problems. A hundred gigabytes and on down is what I would qualify as small data, where there's no sense in using a distributed system, running Spark or something where you need to spin up a cluster, when you can rent a single node in AWS that has 24 terabytes of memory. If your data fits in memory, then it's not big data anymore. Just keep it easy, keep it less complex.
Retooling for a smaller data era
What I mean by retooling is that over the last 10 years, there was a realization in the mid-2010s that we needed to rethink many of the different layers of the data stack: not only the user interfaces but the middle layers, the execution engines, the query optimization, the storage layers, to make everything a great deal simpler and more efficient. So my talk was about the open standards and protocols that we've been developing to modularize and standardize those different layers of the stack, to make things more interoperable and more efficient, as well as making the individual people writing code a great deal more productive.
The idea, in thinking about the smaller data era, is that now and in the future, most work can take place on a single node. So we need to optimize around that single-node development experience, to make people really productive and able to prototype and work at that small scale. Maybe 99% of the time they're working on a single node, either in the cloud or on their desktop or laptop. But when they do actually encounter big data, it shouldn't be a great hardship where you have to say, oh, we need to use a totally different system to work with this data. Instead we can say: here's my code, I developed it at small scale, but I need to run it at large scale, and we have the ability to take that user's code, without rewriting it, and run it at large scale on a distributed cluster.
The small data mindset and interactivity
I think it really comes down to a user experience question. Small data is not just about the data size; it's also about the mindset. When we think about small data, we also think about immediacy, speed, and interactivity. So we want to design our user interfaces and our tools around that interactivity: fast feedback cycles, responsiveness, low-latency interactions.
There's a bunch of projects happening there. There's a new project called Mosaic, which is a DuckDB-powered data visualization and cross-filtering framework out of the University of Washington and CMU. It's really about how we build a framework that enables these low-latency, interactive data dashboards. Maybe it's a small data set, maybe it's not, but all of that is in the background, and we can think about how to build a highly interactive, responsive data application where we can get answers to our questions really fast, and the users don't have to think about the infrastructure, how to deploy it and how to run it. So I think that small data mindset really helps inform our user experience design.
Apache Arrow
Apache Arrow is not a project that gets used directly by a lot of end-user data scientists and data engineers, but it's one of those open standards and protocols I was talking about. It provides a high-bandwidth, high-performance data connectivity layer. We recognized that in order to build next-generation, more efficient systems, we needed to be able to transport large quantities of data from system to system at very high bandwidth, across networks, and from storage into memory. And we needed that data format to also be very efficient for query processing, for actually executing and processing data. We started developing Arrow in the mid-2010s, in 2015 and 2016, so it's now almost a nine-year-old project. There are Arrow libraries for all kinds of programming languages and different types of systems, and it has really served as one of the new pieces of connectivity fabric that has enabled these next-generation accelerated systems that are built to power us for the next 20 years.
Ibis and scale-independent computing
Over the last several years we've put a lot of emphasis on a Python project called Ibis, which is basically a portable data frame API aiming to achieve that scale-independent computing for Python developers. It incorporates what we learned from building pandas to create something designed for the local experience, but whenever you need to point to a ClickHouse database, or BigQuery, or Spark SQL, you can go from working locally with DuckDB at a certain scale, call it one terabyte and down on your laptop, to working at the hundred-terabyte or the one-petabyte scale without having to rewrite your code.
You can go from working locally with DuckDB, call it one terabyte and down on your laptop, to working at the hundred-terabyte or the one-petabyte scale without having to rewrite your code.
The Positron project and staying connected
I'm at Posit now, and I'm involved with the Positron project, a new data science IDE. I would encourage you to follow along with that. Ibis, Arrow, and pandas, of course, are still projects that I created or co-created, and they still have large communities around them. So if you're interested in getting involved with these communities, they are always in need of new developers, and your feedback from the user standpoint is also very helpful. There are many ways for you to get involved, and I look forward to seeing everyone around.
Python for data analysis — what's next
We came out with the third edition of Python for Data Analysis in 2022, which was updated for pandas 2.0, and I still keep it up to date. Right now we're in a bit of a transitional period: there's Polars, which is a new data frame library, and people have asked me about having a Polars version of the book. So I'm in a little bit of a wait-and-see mode, trying to conceive of what the fourth edition of the book might look like and what sort of new content I might like to add. But I think the topics treated in the book are really about learning skills, learning how to think about working with data in Python, so a lot of it is really evergreen content. Even if people are using pandas less and less in the future because they're using Polars or something else, you can still take the skills you learn in the book and put them to work, and they will make it that much easier to learn other tools in the ecosystem and become productive.
