Resources

779: The Tidyverse of Essential R Libraries and their Python Analogues — with Dr. Hadley Wickham

Tidyverse #RProgramming #RLibraries Tidyverse, ggplot2, and the secret to a tech company’s longevity: Hadley Wickham talks to @JonKrohnLearns about Posit’s rebrand, Tidyverse and why it needs to be in every data scientist’s toolkit, and why getting your hands dirty with open-source projects can be so lucrative for your career. This episode is brought to you by Intel and HPE Ezmeral Software (https://bit.ly/hpeintel). Interested in sponsoring a SuperDataScience Podcast episode? Visit https://passionfroot.me/superdatascience for sponsorship information. In this episode you will learn: • [00:00:00] Introduction • [00:02:55] All about the Tidyverse • [00:15:19] Hadley’s favorite R libraries • [00:28:39] The goal of Posit • [00:34:12] On bringing multiple programming languages together • [00:50:19] The principles for a long-lasting tech company • [00:53:34] How Hadley developed ggplot2 • [01:03:52] How to contribute to the open-source community Additional materials: https://www.superdatascience.com/779

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Do you personally encourage teams to be multilingual? Or do you often write in multiple different languages, maybe even within the same project?

I think most people tend to be like, you know, like 90% R or 90% Python. But in general, it's just better to be pragmatic. And I think that's one area where I think generative AI is like really interesting. It's really easy to translate between them. Maybe this means the barriers between them are going to erode a little bit more.

If we were going to have like a metric that me and my team are going to try and optimize, like I think the metric that I want to optimize is like the amount of time you spend in like flow states.

Dr. Hadley, welcome to the Super Data Science Podcast. It is a surreal experience for me to have you here. I have seen you in person, in fact, we have actually seen each other in person, I'm sure of this. So let me tell you these stories. I didn't tell you this before we started recording, so you're just getting this on air.

Circa 2014, in New York at an O'Reilly Strata + Hadoop World conference, you did some kind of hands-on training, it might have been a half-day training, using the airline flight times dataset. Does that all track? Yeah. I was in the audience for that. And then a couple of years later, in 2016, at the Joint Statistical Meetings in Chicago,

there was an announcement from RStudio that Hadley Wickham would be at the RStudio booth during a certain window of time. And I walked by a couple of times. And we made eye contact and you gave a friendly smile. But I was too nervous to talk to you. I didn't know what to say.

You were at that time, and still are today, one of the most iconic people in data science to me. And I was just like, well, you know, what do I do? What do I say? How do I introduce myself? And so now I finally know what to say. I've got a question to ask you. Hadley, welcome to the Super Data Science Podcast. Where in the world are you calling in from? I'm calling in from Houston, Texas. Nice. It is truly such an honor to have you on the show. You were on the show in the past.

So four years ago, you were on the program. And that was specifically episode number 337. But at that time, our host was Kirill Eremenko. And the timestamp on this is pretty interesting, because that episode was published in February 2020. So it was a pre-pandemic world.

A very different, different time. Exactly. So all of that has passed and we're almost back to normal, except that so many data scientists are working from home. All right. So straight into the technical content. Hadley, if there's one word that is most associated with you, it's got to be tidy. In 2014, you wrote a highly cited paper called Tidy Data. And you're also an author of the popular tidyverse, a collection of packages that share a high level design philosophy.

Last but not least, you have been writing a book called Tidy Data Principles, which is to be completed next year in 2025. What does tidy mean in the context of data programming? And yeah, what's the guiding principle? That's definitely not a question that ChatGPT would have asked you.

What "tidy" means

Tidy to me is about having things that are kind of like well organized and like well broken down into kind of like little pieces that you can then like reassemble, like Legos. I think that's been a motivation for a lot of my work is like, how do you take some like big, maybe kind of vaguely ill-defined problem and then break it down into like concrete pieces that you can actually get stuck into?

And experiment with and play around with and iterate towards a final solution. All right. And yeah, so in your tidy data paper, you draw parallels between tidy data and the principles of relational databases, specifically Codd's relational algebra. What is Codd's relational algebra? Could you elaborate on how database design can benefit statisticians and data analysts in their work?

Yeah, so you can go and look up on Wikipedia or somewhere what Codd's relational algebra actually is, but I can never remember it.

Yeah, it's one of those kind of very precise definitions where every word makes sense individually, but when you string them together in a sentence, it's very hard to understand what it means.

And I sort of got into relational data because my dad had done a lot of database design for capturing data about cows in particular, for cattle breeding.

And this idea of making sure that each unique fact is recorded once in a dataset, rather than having it either split across multiple places or recorded in multiple different ways in different places.

And so I think the ideas of Codd's relational algebra are really important. Like, you want to make sure you don't have inconsistencies in your data, but it's really difficult for folks who are not trained in databases and computer science to get the idea of the algebra.
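The "each fact recorded once" idea is database normalization in miniature. A hedged sketch in pandas, with made-up cattle data in the spirit of the anecdote (none of these column names come from the episode):

```python
import pandas as pd

# Denormalized: the breed's origin is repeated on every row,
# so a typo in one row silently creates an inconsistency.
flat = pd.DataFrame({
    "cow_id": [1, 2, 3],
    "breed": ["Angus", "Angus", "Jersey"],
    "breed_origin": ["Scotland", "Scotland", "Jersey Isle"],
    "weight_kg": [540, 560, 410],
})

# Normalized: each fact about a breed lives in exactly one place.
cows = flat[["cow_id", "breed", "weight_kg"]]
breeds = flat[["breed", "breed_origin"]].drop_duplicates()

# A join recovers the original rectangle on demand.
rebuilt = cows.merge(breeds, on="breed")
print(rebuilt)
```

To update a breed's origin you now edit one row of `breeds`, not every matching row of a big flat table.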

And so a lot of the idea of tidy data was, how can I frame this in a way that makes more sense to statisticians and data scientists and people working with data? And to me, it's just: you've got a rectangle, and that's all tidy data really is.

You put the variables in the columns and you put the observations in the rows. And you kind of wonder, like, that legitimately took me like eight years to figure out. It seems so simple in hindsight, but it's just one of those things that once you figure it out and explain it to other people, it makes a lot of sense, but it takes a while to get there.

It takes a while. Even for me, my first time using the tidy data principles, which must have been many years ago now, probably a decade ago or more, that first time wrapping your head around shaping the data in this tidy way, it is so different from the way that we're typically taught in university.

And so that first time doing things in a tidy way, exactly as you described, where each piece of information is only recorded once, as opposed to having a giant table where, let's say you have a binary outcome, every single row repeats the outcome label zero or one.

It's so wasteful, especially if that's a string. It's incredibly inefficient. And so popping over to the tidy principles, where that goes away, is transformative. But I remember the first time trying to melt data.

And I'm like, what? Even when I saw it the first time, it took me a while to figure out, and I was kind of like, this isn't right. I felt like I needed to change it back into the way that I'm used to, to even work with it.
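For readers who haven't seen it, the reshaping Jon describes, melting a wide table into a tidy one with variables in columns and observations in rows, looks like this in pandas (toy data made up for illustration):

```python
import pandas as pd

# Wide: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["NZ", "US"],
    "2019": [4.9, 328.2],
    "2020": [5.0, 331.5],
})

# Tidy: every column is a variable (country, year, population)
# and every row is one observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="population")
print(tidy)
```

The tidyverse equivalent is `tidyr::pivot_longer()` (formerly `gather()`, and before that `reshape2::melt()`), which is the lineage behind the word "melt" here.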

Yeah, it's really interesting. At the moment, I'm sort of on the program committee for posit::conf. So we decide on what talks we're going to have and how we're going to arrange them. And the program committee is mostly data scientists.

And so we do a lot of organizing data in Google Sheets, because it's so easy to collaboratively edit, but it's always in a tidy format.

And what that typically means is you can't really understand the shape of the whole program without joining three different things together.

And it's interesting, when I share it with my colleagues in marketing who have to then turn this into the website, the way they want to put that in a Google Sheet is just so totally different. There are merged cells, there are colors. I get it, but you could not compute on data in that form. It's just so much easier to look at as a human.

When you're in a data science company like Posit, are you able to kind of dictate, you know, marketing, you've got to do things our way, the data scientist way? Or is that an impossible ask?

It's impossible. I mean, I don't think I could dictate it. I could certainly do more to try and persuade people, or help them do their jobs better, but it's a lot of work, and you're fighting

the fact that none of their tools think about data in this way. And I don't really want to have to go and create the tidyverse for marketing and the tidyverse for finance and every other field.

I think we certainly have more penetration of Quarto and those sorts of document-generation tools, but there's still quite a high bar: if you're not used to using Git and GitHub, collaborating in this way is pretty tough.

Even though that final product can be pretty nice. And that's one of the things I sort of hope for the future of Quarto: these tools for scientific documents that let you mingle text and code, but that can also work like Google Docs, where multiple people can be contributing to them.

You can comment on them. You can share them with non-technical folks.

Yeah. Quarto is something that we also have as a whole topic area later for discussion. It is a great tool that I think anybody can be using within data science or amongst data analysts, particularly.

Have there been persistent challenges that you've faced, like technical hurdles, adoption resistance, or conceptual misunderstandings? Has there been any of that over the years?

With tidy data in particular, sorry.

Tidy data. Not too much. I think there's definitely some areas where there's just such strong conventions for the field of having things that I would think want to be in a single column spread across multiple columns.

There are some types of data where arranging things in non-tidy forms is just much, much more memory efficient.

But by and large, I think most people have looked at the tidy data framework, and even if they don't agree with all the tools or don't use them, they'll still find that framing really, really useful.

And I think that thought of, let's separate that out into a separate step, rather than trying to make every other tool do a little bit of tidying, just makes life so, so much easier.

100%. And for people out there who haven't had the tidy experience, it is absolutely worth wrapping your brain around it. Because once you do get used to it, everything becomes so much easier.

And all of the tools in your tidyverse work so seamlessly together. It's like, I don't know, when I was a kid, you'd have some computer video games that were just garbage and buggy.

And you'd constantly, you know, be able to walk through walls or whatever. And then, for me, it was like having a Super Nintendo for the first time. Nothing ever crashes and everything just works.

And the tidyverse is kind of like that: you get the data formatted in that way, and it's just smooth sailing.

Flow states and the tidyverse goal

Yeah. I mean, one of the things we've sort of wondered about is, if we were going to have a metric that me and my team are going to try and optimize, I think the metric that I want to optimize is the amount of time you spend in flow states, where you're just thinking of stuff you want to do with data and the code just kind of flows out of your fingers.

And it all just works. And, you know, we're certainly still some distance from that perfect utopia. But over time, it really feels like the amount of time you can stay in that flow state has grown: staying in the questions you actually care about asking of the data, not, how do I get this thing into this other function so I can actually just do what I want to do.

On that note, in a recent article about strategies and function design, you have discussed the importance of making strategic choices explicit to users.

So hopefully this is ringing a bell.

Do you think that this concept of providing a clear presentation of strategies in the way that you're designing a tool or function actually enhances the user's experience, maybe allowing them to be in that flow state for more of the time that they're programming, because they understand the thinking behind the strategic choices in the tools?

Yeah, I think so. And framing that even a little more broadly, one of the things that me and my teams think and talk about quite a lot is: how much do we want to force you to learn some new concept, one that might really be a better mental model that's going to help you in the long run?

Because until you learn what that thing is, the code is going to be a bit of a mystery for you. And there's that balance. Sometimes we want you to learn some new ideas, like this idea of tidy data.

There's pretty clearly a big payoff to getting that concept into your head. Versus other times, are we just teaching you some kind of technical jargon that's really useful for us, but maybe is just more junk to fill your brain up with?

So that's one of the things we think about a lot: how much do we want to accommodate your existing mental model, versus how much do we want to give you a new and better mental model, possibly kind of against your will?

Favorite libraries

Amongst all of the libraries that you've developed in the tidyverse, so there's things like reshape and plyr for shaping the data and being able to have pipelines of data within this tidy framework.

Is there any particular library that is near and dear to your heart that you feel like this was one that either I don't know, conceptually developing it like you must kind of have favorites, right?

Like they're not equal to you, right?

Yeah, I have to say one of my favorites is dbplyr, which allows you to write R code, dplyr code, and then translates it automatically to SQL.

And one of the things I kind of love about it is this sort of combination of like this really kind of deep technical knowledge of R that you need to make this work in terms of like translating the R code.

And there's also kind of like there's no way to do it perfectly like you cannot translate perfectly every piece of R code to equivalent SQL.

There's also this kind of like how do I carve out like the biggest benefit with the smallest amount of work and I think the combination of those two things is something I really enjoy.
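The core trick Hadley describes, capturing high-level data verbs and emitting SQL instead of executing them, can be sketched in a few lines. This is a toy illustration of the translation concept only, not dbplyr's actual machinery, and all names here are made up:

```python
# Toy "verbs to SQL" translator: each verb records an intent,
# and the final step assembles a single SQL string.
class Query:
    def __init__(self, table):
        self.table = table
        self.filters = []
        self.columns = ["*"]

    def select(self, *cols):
        self.columns = list(cols)
        return self  # returning self lets verbs chain like a pipe

    def filter(self, condition):
        self.filters.append(condition)
        return self

    def to_sql(self):
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql

q = Query("flights").select("carrier", "dep_delay").filter("dep_delay > 60")
print(q.to_sql())  # SELECT carrier, dep_delay FROM flights WHERE dep_delay > 60
```

The hard part dbplyr actually solves, translating arbitrary R expressions into each database's SQL dialect, is exactly the "can't be done perfectly" problem discussed above.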

Kind of related to that, there are bits of the testthat package, which is a package for doing unit testing in R.

And one of the things I worked on a couple of years ago was this package called waldo, which is all about concisely describing the difference between two objects.

And that's kind of a similar problem. You've got this deep technical understanding of the language and all the objects, and you're writing C code to iterate through them.

And then how do you like present that to the user in like a way that helps them see the differences as easily as possible.

So that kind of tension there, between programming and human psychology, I just find really interesting and fun to explore.
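The "concisely describe the difference" problem waldo solves has a small stdlib cousin in Python's difflib. A hedged sketch of the same goal, reporting only what changed rather than dumping both objects in full:

```python
import difflib

old = ["name: Hadley", "role: Chief Scientist", "language: R"]
new = ["name: Hadley", "role: Chief Scientist", "language: R and Python"]

# unified_diff reports only the changed lines plus minimal context,
# the same spirit as waldo's concise object diffs in R.
diff = [line for line in difflib.unified_diff(old, new, lineterm="")
        if line.startswith(("-", "+")) and not line.startswith(("---", "+++"))]
print(diff)
```

A real object differ, like waldo, has to recurse through nested structures and decide how much context a human needs, which is where the psychology comes in.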

Yeah, it must be interesting to think yourself. Well, there's clearly this need to be addressed and it's a it's an impossible problem to solve perfectly.

So how can we get most of the way there in a way that will satisfy most people?

And those are two libraries that I definitely need to spend more time with. I don't think I've used dbplyr or testthat, actually.

dbplyr is also just one of those things where I'm like, it seems like a miracle that it works so well.

Like, I think the thing that's fascinating to me is that it really reveals that for the core stuff you do to vectors in data science, whether you're summarizing them or filtering them,

it's basically the same code in R or Python or SQL or JavaScript. You can express pretty much the same things in every single language, in a way that is surprising and interesting.

Shiny and reactive programming

Beyond the libraries that we just mentioned in the tidyverse, one favorite of mine is Shiny, because it allows you to so rapidly build interactive web applications for data analysis, especially compared to any other web development framework that I've tried.

And I don't have much experience developing any kind of web tools, but I can very easily use Shiny to get a web application up and running for people to have a self-service dashboard that they can click around in.

Do you want to talk a bit about Shiny? And actually, I think something notable about it, and we're going to talk about Posit soon and the change from RStudio, is that it now works across programming languages.

Yeah, exactly. So there's now Shiny for R and Shiny for Python. And, you know, they're completely separate code bases. But the idea that really unifies them is this idea of reactive programming.

And the idea of reactive programming, I think, at its heart is pretty simple. You've got like a bunch of inputs to your app, things that people can change. And you've got a bunch of outputs.

And what reactive programming does is just like automatically figures out what's the minimal amount of work to do when you change one of the inputs to update the needed outputs.

And again, that's one of these ideas, like tidy data, that takes a little while to get your head around. It's quite possibly an idea you've never encountered before in programming.

It works a little bit differently to things you might have encountered. But like once you get that idea, it just gives you this incredible tool set to create apps that like where things just work.

And you don't have to worry about things either updating too often, doing a bunch of needless work and making your app too slow, or failing to update,

so that you've got these mysterious bugs in your app where things don't change when you expect them to, which is like one of the most frustrating things to try and debug, when something doesn't happen.
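A minimal sketch of the reactive idea (hypothetical names, not Shiny's actual API): outputs declare which inputs they depend on, and changing an input recomputes only the outputs that need it:

```python
class Input:
    def __init__(self, value):
        self._value = value
        self._listeners = []  # outputs that depend on this input

    def set(self, value):
        self._value = value
        for output in self._listeners:  # invalidate only the dependents
            output.recompute()

    def get(self):
        return self._value


class Output:
    def __init__(self, compute, inputs):
        self.compute = compute
        self.recomputes = 0
        for inp in inputs:
            inp._listeners.append(self)
        self.recompute()

    def recompute(self):
        self.value = self.compute()
        self.recomputes += 1


n = Input(10)
label = Input("rows")

# This output depends only on n; changing `label` never touches it.
summary = Output(lambda: n.get() * 2, inputs=[n])

n.set(21)         # summary recomputes, value is now 42
label.set("obs")  # no needless recompute of summary
```

Real Shiny discovers the dependency graph automatically while a reactive expression runs, rather than taking an explicit `inputs` list, but the minimal-work property is the same.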

That that is fun. Yeah. So, yeah, shiny. Really, really, really cool.

It allows you to spin up basically that Super Nintendo game that I was just describing. It just kind of works like you think it should. People don't walk through walls accidentally as they're using the dashboard that you developed in literally minutes.

Yeah, it's funny. I remember talking to Joe Cheng, who wrote Shiny, very early on.

And I was like, Joe, you think people will use this to make websites? You can use Ruby for that, use PHP for that. Why on earth would a data scientist want to make a website?

And now it's like so obvious because you don't want to give decision makers in the organization just like a PDF. You want to give them like a little interactive app. And there's just been so many examples of people just like really impressing their bosses with shiny.

Because you can like whip up something in a couple of hours that looks like a polished app. It does exactly what you want. I remember a very early phone call from a shiny user saying like we saved him a quarter of a million dollars.

Because instead of going and like finding a like a contractor to implement a web app and a dashboard, he just did it himself over a weekend.

And not only is that a cost and time benefit, but if you as a data scientist can do it yourself, you don't have to try and communicate to someone else exactly what you want.

That is tough.

It is. Well, and this also allows you to make changes yourself. You know, if you notice an issue or a user complains to you, you can just go in and fix it, as opposed to needing a middleman, a middle person, I guess.

Yeah, I think one of the interesting things about dashboards is like if your dashboard is successful, like people are going to demand changes to it like very, very quickly.

But if you have a really, really good dashboard, that means like there's going to be like two or three execs in your company who now want to like make a bunch of tweaks to it.

And if that's like some weeks long process where you've got to figure it out and communicate to some like web engineering team that just like kills the whole thing.

And think about how often executives think they want a dashboard, relative to how often they actually use it. That is another strong point for using Shiny, because you're not wasting weeks or months developing a dashboard.

You're spending hours or days. I mean, just in general, the more you can do to increase your iteration speed, the more effective it makes you.

Because, again, like it's so hard to predict in advance, like what's the thing that's going to be valuable? There's definitely a lot to be said to just like trying out a ton of things and seeing what sticks rather than like doing a bunch of upfront planning and just hoping desperately that you've got a really good mental model of the world and your idea works.

Why use R? R vs Python

So we are going to talk about, as I already mentioned, the Posit name change, and we'll end up talking about Python a bit. For our listeners who don't already use R, why should they be using it?

For me, I can actually give one example, which is data visualization. I still find I can do things way more quickly, have much more fun making visualizations in R, and get exactly what I want.

There had been in the past attempts to create a ggplot2 style Python library, but the one that I had been using became deprecated and harder and harder to use.

It never had all the functionality of your ggplot2 anyway. Anyway, so that's like my big example. I don't know if you have big examples of why people might want to use R still today.

On the topic of ggplot2 specifically, I think the best Python equivalent is plotnine. That's actually by a developer, Hassan Kibirige, that we've been sponsoring at RStudio.

I think that's the best possible realization of ggplot2 you can get in Python.

But I think there's things about the design of the R language that just make certain tasks much easier and more natural to express in R code than you'll ever be able to do in Python.

I think that comes down to at the heart of it, R is more of a special purpose programming language. It's designed from the ground up to support statistics and data science.

I think that has a lot of benefits, particularly if you've never programmed before. I think you can get up and running in R using R to do data science. You can do that without learning a ton of programming. You can get up and running pretty quickly.

Then there's just things about the language that other languages look at R and they're like, oh my god, that's a terrible idea or that makes me want to throw up in my mouth.

There's just so many things that are so well placed to support interactive data science where you really want that fast and fluid cycle where you're trying things out.

That obviously leads to maybe a little bit of weakness when I'm like, now I've got this thing, and I just want to do the same thing again and again and again.

R tends to be a little bit magical. It tries to guess a little bit more of what you want and that's great when you're working interactively and it guesses correctly. It's not so great when you're working on a server somewhere else and it guesses the wrong thing.

Everything about R I think makes it such a fluid environment for really exploring your data, digging into it, figuring out what's going on.

Posit's rebrand and the R/Python ecosystem

Speaking of differences between R and Python, I seem to remember, and you can correct me if I'm wrong about this, but I feel like you have a famous tweet from years ago,

and it must have been a famous poster themselves that you responded to, I can't remember, it might have been someone like Wes McKinney, saying that one of the advantages of Python is that it's faster than R.

And then you have this super famous reply along the lines of, is that so? I will make it faster. Do you know what I'm talking about?

I don't, but I know I've heard things like that in the past.

Yeah, it's kind of a misperception because Python isn't actually that fast itself. I mean, languages like Julia have come up to be faster than Python.

Yeah, I think often you have the worst arguments with your family, not with strangers. With people who are so similar to you, you tend to have more friction than with people who are really different.

I think because R and Python are actually really close together in the spectrum of programming languages. It's so easy to see all of the little things that look weird to you as opposed to looking at some programming language that's miles away, and it just looks totally different.

I just think there's something to that. Because we're close, you can see all these little differences.

Certainly, when I see things in Python that people are like, wow, that's really cool. I'm like, challenge accepted. I will make that better in R.

Yeah, exactly. Let's dig into that a bit now. For 11 years, you've been the chief scientist at Posit, makers of open-source software for data science, scientific research, and technical communication.

Many R users will know Posit as the makers of RStudio, a full-featured integrated development environment (IDE) for R, which I myself have been using for as long as I can remember.

Basically, as long as I have been typing, I have been using RStudio.

RStudio, as you actually kind of let slip earlier in this episode when you were talking about Joe Cheng, I think. Oh, no, no, no. You were talking about plotnine and how you said RStudio is supporting it. Wait, no, Posit. Two years ago, the company name changed to Posit.

From a distance, I mean, I don't even think it's from a distance. I think this is explicitly related to how Posit is now supporting more than just R. Is that right?

Yeah, yeah. I mean, the goal of RStudio and now Posit has always been to be this kind of like a company with a long-term vision. Like we talk internally about this sort of idea of a hundred-year company.

And when you think about a company like that, obviously no programming language is going to be around in a hundred years' time. We started with R, and that's something that's near and dear to many of our hearts. It always will be.

But we also want it in the name of our company to like embrace that there are now other languages and there will be even more languages in the future.

And I kind of think about this as like the Burlington Coat Factory problem. I don't know if you know Burlington Coat Factory, but we have a lot of ads for them on television. But for a long time, they were like, no, it's like Burlington Coat Factory. It's not just coats.

And for us to go into customers and say like, buy RStudio. It's not just R. It's hard to make that story.

So I really, really wanted to say, hey, for a long time now, our products have supported not just R but Python and Julia and other tools. We don't want to lock ourselves into being R forever, regardless of what happens with the rest of the world.

So renaming to Posit, rebranding to Posit, was really about saying we're in this for the long haul. And we care about data scientists regardless of what tool they're using.

Piping in R and Python

One of my favorite things, which you can do really well thanks to the dplyr library that you led development of, is piping. You can extremely easily pass output along a chain of functions, just like Unix pipes, where the output from one function goes in as the input to the next function.

And prior to me discovering dplyr, which was probably around 2010. Does that make sense? Prior to that, I would have so many variables in my workspace. It was just such a pain to keep them all straight.

And you just end up in these weird situations where like should I be investing time thinking about the name of this intermediate variable? Am I going to use this later?

Or should I just name it like intermediate variable 15 and have really ugly code?

So piping gets rid of all that. The flow reads like a sentence: okay, this preprocessing step happens, then this next one, and you can see it so easily. It makes the code so elegant to read.

Do you think we'll get to a point where Python catches up? I have used some kinds of piping attempts in Python, though I guess it's been a few years since I've tried, but my experience is that it's never been as smooth or as easy as with R. And maybe that's related to what you were talking about earlier with data visualization.

Yeah. So the native equivalent of piping in Python is method chaining. You know, if you're using pandas, you do something dot something dot something.

But the big difference between like method chaining and the pipe is in method chaining, like all of those methods have to come from the same class. They have to live in the same library, the same package. Whereas with piping, they can come from any package.

And I think the thing that's really interesting is that this has meant Python has tended to have these fewer, bigger packages, like pandas, scikit-learn, matplotlib, because in order to work with method chaining, everything has to be glommed into one giant package.

Whereas with R, because you can combine things from different packages, the equivalent of pandas is kind of like dplyr and tidyr and readr and a bunch of other things.

It's way easier to add extensions to ggplot2 than to matplotlib, extensions that work exactly the same way as the original, because you can just combine the different pieces.

So I think this is one of those interesting, subtle differences in language design that leads to fairly big impacts on the user experience, and even on how the community works together and forms.

Yeah, it makes perfect sense. And your explanation is so simple for how that's happened. It had kind of escaped my attention as to why it works so well in R.

Marrying R and Python: Arrow, DuckDB, and beyond

When you were last on this podcast four years ago, you said that you wanted to marry the Python and R languages. Four years on, how do you assess the progress made in achieving this dream, especially through projects like Apache Arrow?

Yeah, I think we've come a long way. And Arrow has made a big difference in just being able to seamlessly move data from one platform, one programming language, to another.

And coupled with that, the other technology that's really, really interesting is DuckDB. You can use DuckDB from R, you can use it from Python, and you don't even have to have a database file; you can just have a directory full of Parquet files.

And it means that like, you know, people are using the same kinds of tools, just expressing them, you know, in the language that they feel most comfortable with.

Another, sort of similar thing is Keras and a lot of the machine learning toolkits in Python. The reason that they are fast is not because Python is fast. It's because you express those high-level ideas in Python, and then they get compiled down to some low-level machine code.

And that's why packages like the Keras package for R, which is maintained by one of my colleagues at Posit, Tomasz Kalinowski, do the same thing. You express these ideas in R rather than Python, but then they get compiled down to machine code using exactly the same toolkit.

So I think we're just going to continue to see more and more of that. R is not fast. Python is not particularly fast. What is fast is carefully written code in languages like Rust and C. And then you write a more user-friendly interface on top of that, in the programming languages that data scientists use every day.
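A toy illustration of that layering, using Python's standard-library ctypes to wrap the system C math library (the wrapper name fast_sqrt is made up for this sketch; real packages build far more elaborate bindings on the same principle):

```python
import ctypes
import ctypes.util

# Load the system C math library: the fast code lives in compiled C,
# while the user-facing interface is plain Python.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

def fast_sqrt(x: float) -> float:
    """Friendly Python wrapper around the C implementation."""
    return libm.sqrt(x)

print(fast_sqrt(9.0))  # 3.0
```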

And those libraries that you mentioned there are super cool, in addition to the Arrow that I mentioned. And speaking of Wes McKinney, we actually have a whole episode about that.

So back in episode number 523, we had Wes McKinney on, and he talks about the Apache Arrow project at length. Really cool one. And the other projects you mentioned there, DuckDB as well as Keras for R: super cool, invaluable packages that people should be trying out, for sure.

Arrow's also been top of mind for me lately because I've been working on some enhancements for the bigrquery package, which allows you to get data from BigQuery.

And previously, the way that the bigrquery package got data was by downloading it as JSON.

And JSON is a great interchange format, but it is horrendously inefficient if you're sending data frames of data.
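A quick stdlib-only sketch of why: encoding a numeric column as JSON text takes far more bytes than a packed binary layout. (Arrow's actual format is more involved; this just compares decimal text against fixed-width binary.)

```python
import json
import struct

# A column of 1,000 floats, standing in for one data-frame column.
# Dividing by 7 means most values need full double-precision text.
column = [i / 7 for i in range(1000)]

# JSON encodes every value as decimal text, plus brackets and commas.
json_bytes = len(json.dumps(column).encode("utf-8"))

# A binary columnar layout stores each float64 in exactly 8 bytes.
binary_bytes = len(struct.pack(f"{len(column)}d", *column))

print(json_bytes, binary_bytes)
```

On top of the size difference, a binary columnar format also avoids the cost of parsing decimal text back into floats on the receiving end.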

And so, thanks to some folks who have been working on the bigrquerystorage package, it now talks to the Google API using Arrow instead of JSON.

You can download data an order, or sometimes two orders, of magnitude faster, because you're using a data format specifically designed for the type of data that data scientists care about.

And it's really nice to see that in practice: the dream of Arrow, that you've got data over here and you want to get it over there, so let's make it as easy as possible.

Multilingual programming and generative AI

Speaking of multiple languages and interchanging between them, do you personally encourage teams to be multilingual or do you often write in multiple different languages, maybe even within the same project?

How do you think about that in your own work?

I'm 100% R, and as far as I can tell, I probably always will be. That's my job. If there's something that I could do better in Python, I will write an R package instead.

But, you know, but that's not the reality of like most people's lives and most people's jobs. I think most people tend to be like, you know, like 90% R or 90% Python.

But in general, it's better to be pragmatic. If there is something that's way easier to do in another language, you can learn the basics of R, or the basics of Python, pretty quickly, so that you can use that tool.

And I think that's one area where I think generative AI is like really interesting, like just being able to like generate code in another language quickly.

And I've been using it quite a bit to generate JavaScript, because I do the occasional web thing. And I really like it because I know enough about JavaScript that I can look at what it produces and say, that looks right. But for me to figure it out manually would just take so much longer.

And so I think that's really, I think it's really interesting to kind of think about how that's going to affect kind of programming languages. If it's really easy to translate between them, maybe this means the barriers between them are going to erode a little bit more.

Yeah, I think that's right. It's amazing that we've gotten this far in the episode without talking about generative AI yet, which is kind of refreshing. And actually, looking at all the topics we have lined up, I don't think any of them touch on gen AI, which is kind of crazy today.

But as you were talking earlier about how Posit is aiming to be a hundred-year company, and we think about what programming languages will look like a hundred years from now, that was bouncing around in my head as you were speaking. I wonder whether, a hundred years from now, anybody will be programming at all.

Because I wonder if just natural-language expression of things will be so powerful. Or I wonder if we'll be working at all, or it's just going to be a Mad Max hellscape.

I think the other thing that's really interesting to me, though, is like if people are really going to be using like generative AI for programming a lot. Like what does that mean for new programming languages, which are not going to have any training data available for them? Like that seems like that's going to kind of raise the barrier to new languages even further.

Yeah, it's also just, what happens to Stack Overflow? There's this idea of poisoning the well: people are stopping using Stack Overflow, which is kind of fine in the short term. But where's all the training data going to come from in the future?

It's exciting and scary. Yeah, it is exciting and scary.

I have this perhaps completely unfounded intuition that somehow it's going to be fine. And there are people way smarter than me who have spent a lot of time thinking about this, who could easily crush what I'm about to say. You might even do it right now.

But somehow, based on how quickly issues like hallucinations have been stamped out, like the jump between GPT-3.5 and GPT-4 in how much less it hallucinates, I have this completely unscientific, uneducated intuition that we're going to be fine on this front, that we're not going to end up having a complete... there's a specific term for this.

Yeah, I think the thing that makes me less optimistic is my Tesla and, you know, this promise of self-driving cars, which just doesn't seem to be getting any closer. Like, I don't know, 50% of the time that I pull into our garage, it thinks the random collection of tools on the wall is a semi.

That's about it, so I'm just like, meh. And that's clearly something that's been very much hyped, and a bunch of money has been put into it. It's just going to be interesting to see. We're clearly in this explosive growth phase, and is it going to flatten off, or is it going to keep going, or is it going to get steeper? Who knows?

S7 and object-oriented programming in R

So maybe we won't be going down the Excel route. But another big innovation for R that has actually happened recently is R7.

Which I heard about for the first time while doing the research for this episode, reading our researchers' research. Yeah. Do you want to tell us about R7 and the problems that it's aimed at addressing?

So we actually renamed it to S7 relatively recently. Oh, really? It's called S7.

So it's called S7 because there are two, and this is a lot of historical minutiae, but the language that came before R was called S, and S introduced object-oriented programming, in S version 3 and S version 4.

So in R there are two types of object-oriented programming, S3 and S4. The two chief types.

And the idea of S7, or R7, was to try to add those two things together and get the best of both worlds.

So S3 is really just a very lightweight set of conventions. It's not like object-oriented programming in any other language; it's very, very lightweight. S4 kind of swings too far in the other direction: it's very formal, there's a lot of boilerplate, it's quite complicated, and things can go wrong in weird ways.

So the idea of S7 was really to try and find like the sweet spot in between them. Like to take the nice features that S4 had, add them on top of S3 in a backward-compatible way so that we can hopefully switch.

Hopefully, we're not just adding another object-oriented programming style to R, but we're actually supplanting S3 and S4 over time. Because everything you can do in those two, you can do in S7 and you can do it more easily. And there's better documentation and tooling and that kind of stuff.
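For readers coming from Python, S3's generic-function dispatch is loosely analogous to functools.singledispatch, where a generic function picks an implementation based on the class of its first argument. (This is a Python analogy, not R's actual mechanism; the generic name describe is made up for the sketch.)

```python
from functools import singledispatch

# In R's S3 system, a generic like print() dispatches on the class of
# its first argument. singledispatch gives Python the same flavor of
# generic-function OOP.

@singledispatch
def describe(x):
    # Fallback, like a default S3 method.
    return f"some object: {x!r}"

@describe.register
def _(x: int):
    return f"an integer: {x}"

@describe.register
def _(x: list):
    return f"a list of {len(x)} elements"

print(describe(42))      # an integer: 42
print(describe([1, 2]))  # a list of 2 elements
```

Note the methods can be registered anywhere, even in other modules, which mirrors how R packages extend each other's generics.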

For our listeners who don't have a computer science background, what does it mean for a language to be object-oriented? And how can that come in these kinds of grades, from lightweight, like you were describing with S3, to more formal?

Yeah, I don't know. I don't know. You have objects and you program with them. I don't know.

And it's especially weird in R because when you're using R, you're not really aware that you're using object-oriented programming. Unlike in Python, where I think you're much more aware that you have objects and you call methods on those objects.

Object-oriented programming is much, much less important in R for a data scientist. I think you benefit from it because packages use it.

So I think the main benefit to you as a data scientist is not that you're going to be writing S7 code. But the packages that you use are going to and they're going to be able to write code faster and more correctly from the get-go.

So hopefully more of a general uplift of developer productivity in R. Probably not going to affect data scientists day-to-day that much.

Posit's mission and the hundred-year company

You've already mentioned on the show how Posit has an ambition to build a company, a suite of tools that could last 100 years. What kinds of principles or philosophies do you think are critical to creating a legacy that lasts that long in technology?

Yeah, that's a good question. I don't think we know for sure.

But one of the things that make us different as a company is that we are a public benefit corp, a PBC or a B corp, rather than being an LLC.

And what that means is kind of fundamentally baked into our charter is that our sole goal is not to optimize shareholder revenue, which is kind of the classic LLC model.

We explicitly consider other stakeholders, like the community and our employees, in what we're trying to do as a company.

And I think that is pretty special, because we can legitimately say: we don't want to make products that lock you in. We want to help you, and we're going to sell you products that are going to help you do your job.

And you're going to hopefully pay us money for those because they save you time. They allow you to do things that you couldn't otherwise do. But we're not just about that money. Like we really care about your kind of life as a data scientist. We want to build tools for you.

We want to build open-source tools for people that don't have a bunch of money. We want to improve academia. So part of the mission is that we're not optimizing for short-term profit. We can say we're going to take a longer view.

And, you know, part of that is also that we're not a VC-driven company. We don't have to explode, in either a good way or a bad way, in three years' time, when our investors want to get their money back, so that we can