
Hadley Wickham on R vs Python
Learn about tidyverse, ggplot2, and the secret to a tech company’s longevity as Hadley Wickham joins @JonKrohnLearns in this episode. He talks about Posit’s rebrand, why tidyverse needs to be in every data scientist’s toolkit, and why getting your hands dirty with open-source projects can be so lucrative for your career. Watch the full interview “779: The Tidyverse of Essential R Libraries and their Python Analogues — with Dr. Hadley Wickham” here: https://www.superdatascience.com/779
Transcript
This transcript was generated automatically and may contain errors.
Speaking of differences between R and Python, I seem to remember, and you can correct me if I'm wrong about this, but I feel like you have a famous tweet from years ago where you say, somebody says something like, and it must have been a famous poster themself that you responded to, and I can't remember, it might have been like Wes McKinney or somebody like that, saying that one of the advantages of Python is that it's faster than R, and then you have this super famous reply of, what is that, and I will make it faster. Do you know what I'm talking about?
I don't, but I know I've said things like that in the past. No, that's a, it's a, yeah, it is, it's kind of a, it's a misperception, because Python isn't actually that fast itself, I mean, whole languages like Julia have come up to be faster than Python.
Yeah, I think, I think one of the reasons, like, you know, often, like, the biggest, you know, you have the worst arguments with your family and not with strangers, like, with people who are, like, so, like, so similar to you, you tend to have more friction than the people who are, like, really different, I think, because R and Python are actually really close together in the spectrum of programming languages. It's so easy to see, like, all of the, like, the little things that look weird to you, as opposed to looking at some programming language that's miles away, and it just looks, you know, totally different, you can't, I just think that, I don't know, I think there's a, there's something to that, it's, it's because we're close, you can see these, like, little noises.
Certainly, like, when I see things in Python that people are like, wow, that's really cool, I'm like, challenge accepted, I will make that better now.
Piping in R vs method chaining in Python
One of my favorite things that you can do really well, thanks to the dplyr library that you led development of, is piping. And so you can extremely easily have functions pass into, I mean, just like if people are familiar with Unix programming, pipes there, where the output from one function goes to the input of the next function, and prior to me discovering dplyr, which was probably around 2010, if that makes sense. Prior to that, I would have so many variables in my workspace, it was just such a pain to keep them all straight, and you just end up in these weird situations where, like, should I be investing time thinking about the name of this intermediate variable, am I going to use this later, or should I just name it, like, intermediate variable 15, and have really ugly code. And so piping gets rid of all that, where you can read the flow like a sentence. You're like, okay, this preprocessing step happens, then this next, and you can just see it so easily, it makes it so elegant to read.
Do you think we'll get to a point where, and I have used some kinds of piping attempts in Python, but my experience of that has never been, and I guess it's been a few years since I've tried, but it seems like it's never been as smooth or as easy as with R, and maybe that's related to what you were talking about earlier with data visualization.
So one, kind of the native equivalent of piping in Python is like method chaining, you know, like if you're using pandas, you do something dot something, dot something, dot something. But the big difference between method chaining and the pipe is in method chaining, all of those methods have to come from the same class, they have to live in the same library, the same package. Whereas with piping, they can come from any package.
And I think the thing that's really interesting about that is that that has meant, like, Python has tended to have, like, these fewer, bigger packages, like pandas and scikit-learn, matplotlib, like kind of everything, in order to work with method chaining, like everything has to be glommed into this one giant package, where with R, you know, because you can combine things from different packages, like the equivalent of pandas is kind of like dplyr and tidyr and readr and like a bunch of other things, like with, it's way easier to add extensions to ggplot2 than matplotlib that work exactly the same way, because you don't have this, you can just combine them from different pieces. So I think that's just one of these, like, sort of interesting, like, subtle differences in language design, that leads to, you know, fairly big impacts on the user experience and kind of almost even how the community has to work together and form.
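The contrast described above can be sketched in a few lines of plain Python. This is only an illustration, not code from the interview: the `Table` class, the `pipe` helper, and the `keep`, `sort_by`, and `names` functions are all hypothetical names made up for the example, and the data is invented. With method chaining, every step has to be a method on `Table`; with a pipe, each step is a free function that could live in any package.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass value through each step in order, like a Unix pipeline."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Method chaining: every step must be a method defined on the same class.
class Table:
    def __init__(self, rows):
        self.rows = list(rows)

    def keep(self, predicate):
        return Table(r for r in self.rows if predicate(r))

    def sort_by(self, key):
        return Table(sorted(self.rows, key=lambda r: r[key]))


# Piping: each step is a free function, so it can come from any module.
def keep(predicate):
    return lambda rows: [r for r in rows if predicate(r)]


def sort_by(key):
    return lambda rows: sorted(rows, key=lambda r: r[key])


def names(rows):
    # Imagine this step ships in a separate package; Table needs no changes.
    return [r["name"] for r in rows]


# Made-up example data.
rows = [
    {"name": "setosa", "length": 5.1},
    {"name": "virginica", "length": 6.3},
    {"name": "versicolor", "length": 5.9},
]

# No intermediate variables in either style; both read top to bottom.
chained = Table(rows).keep(lambda r: r["length"] > 5.5).sort_by("length").rows
piped = pipe(rows, keep(lambda r: r["length"] > 5.5), sort_by("length"), names)
```

Adding `names` to the chained version would mean editing or subclassing `Table`, whereas the piped version simply accepts one more function. pandas offers a `DataFrame.pipe` method that passes the frame to an arbitrary function in a similar spirit.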

