Hadley Wickham @ Posit | Giving benefit to people using what you build | Data Science Hangout

Transcript

This transcript was generated automatically and may contain errors.

Happy Thursday, everybody. And welcome back to the Data Science Hangout. And let's see if we remember how to do this today. We were on a little bit of a break for July, but so, so excited to see everybody back here, hope everyone's having a great week.

If this is your first time joining us today, I know this is a pretty big group here today. Uh, so it's so nice to meet you. Thank you for spending your Thursday with all of us. If we haven't met yet, I'm Rachel Dempsey. I lead our pro community here at Posit.

And the Data Science Hangout is our open space to chat about data science leadership, questions you're facing and getting to hear about what's going on in the world of data across different industries. And so every Thursday we feature a different, uh, data leader from the community and together we're all dedicated to making this a welcoming environment for everybody. So we love hearing from everyone in this environment, no matter your level of experience or area of work.

This is a casual conversation with us all here, but it's totally okay to just listen in too. But there's always three ways that you could jump in and ask questions or provide your own perspective too. So you can jump in by raising your hand on Zoom. You can put questions in the Zoom chat and feel free to always put a little star next to it if you wanted me to read it instead, maybe you're in a coffee shop or something, or I could call on you to ask your question and introduce yourself. And then lastly, we also have a Slido link where you can ask questions anonymously and Hannah or someone from the team will share that in the chat.

Well, I am so excited to be joined by my co-host today for the Data Science Hangout, Hadley Wickham. Hadley is chief scientist at Posit and builds tools to make data science easier, faster, and more fun. You might know Hadley's work from packages in the tidyverse, including ggplot2 and dplyr, and much more. But Hadley, I'd love to have you introduce yourself. If you could tell us something you like to do outside of work too, and a little bit about what it means to be chief scientist here at Posit.

Sure thing. Thanks for having me, Rachel, for finally inviting me. It was starting to get a little hurtful, but I'm glad I could finally make it.

Outside of work, I'd say, I have a dog. That's Lola, a 13-year-old Shar-Pei. And that's pretty much what she does these days: sleep.

I write books. This is one of my books that I wrote recently, a cocktail book, because I'm also really into cocktails. And because I'm me, all the data for the cocktails is stored in a YAML file, and then it uses Quarto to turn it into a book.

So it has a table of contents, and the primary organization of the book is by spirit. But because I'm extra, it also has one index that lists all of the primary ingredients in the cocktails, so you can find, say, something to do with orange liqueur. And then finally it has an index by name, so if you remember the name of the cocktail, you can find it. So it's fixing the things that annoy me about cocktail books: you can never find the cocktail that you're looking for, because they're ordered in some arbitrary way and the indexes are so bad. So thanks to the power of Quarto, LaTeX, and makeindex, it was really easy to make a super highly indexed book, if you ask me.

I have to ask you, what's your favorite from that book?

My favorite cocktail? The cocktail I'm enjoying the most at the moment is not actually in that book, because I've only discovered it recently. And I can't even remember what it's called, but it's an equal parts cocktail: Rum Fire, Aperol, yellow Chartreuse, and lime juice. Rum Fire is this kind of crazy overproof Jamaican rum.

Again, it's like, how do you like, you've got to give some benefit to the people using this. And you've got to like either remove pain or add pleasure in some way, because if you can't do that and you're not someone's like direct supervisor with a lot of direct power over them, it's hard to get people to change.

And I think if you're doing, like, your company's first package, the easy pain points are things like making themes for your corporate style guide: make a ggplot2 theme, make an R Markdown or Quarto theme, make a Shiny theme that people can just use to get something that's reasonably close to whatever your corporate style guide dictates. And that just feels like an easy win for people because it makes them look good inside the corporation. And because you've put in all this hard work, it's like three seconds for them to type the right function and get the right theme.
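As a rough illustration of the ggplot2 part of that advice, here is a minimal sketch of a corporate theme function. The name `theme_acme()`, the colours, and the other settings are hypothetical placeholders, not anything described in the conversation or taken from a real style guide.

```r
library(ggplot2)

# Hypothetical corporate theme: the name, colours, and settings below are
# placeholders standing in for whatever a real style guide dictates.
theme_acme <- function(base_size = 12, base_family = "") {
  theme_minimal(base_size = base_size, base_family = base_family) +
    theme(
      plot.title = element_text(face = "bold", colour = "#1B365D"),
      axis.title = element_text(colour = "#4A4A4A"),
      panel.grid.minor = element_blank(),
      legend.position = "bottom"
    )
}

# Usage: one extra line on an otherwise ordinary plot.
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(colour = "#1B365D") +
  labs(title = "Fuel economy by weight") +
  theme_acme()
```

Building on an existing complete theme such as `theme_minimal()` keeps the function short, since only the settings that differ from it need to be written down.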

And I think the other bit is: make it easy to get access to data. So set up some wrappers around DBI connections to the most important data sources, and provide some conventions around authentication so that stuff just works and people aren't struggling with: What package do I need to install? What's the password? Where's the path I need? Just give them something like a list of the top ten most common data sources, and people will love you, by and large.
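A minimal sketch of what such a wrapper might look like, assuming the DBI and RSQLite packages are installed. The function names `connect_to()` and `get_credential()` and the environment-variable convention are hypothetical, and the only data source defined here is an in-memory SQLite database standing in for a real company database.

```r
library(DBI)

# Hypothetical helper: read a credential from an environment variable so that
# passwords never live in scripts. A real branch of connect_to() would pass
# the result to dbConnect().
get_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Set the environment variable ", name, " before connecting.",
         call. = FALSE)
  }
  value
}

# Hypothetical wrapper: users name a data source and get back a connection.
# Only a demo source is defined; a real package would add one branch per
# company database.
connect_to <- function(source = c("demo")) {
  source <- match.arg(source)
  switch(source,
    demo = dbConnect(RSQLite::SQLite(), ":memory:")
  )
}

# Usage: connect, load a table, query it, disconnect.
con <- connect_to("demo")
dbWriteTable(con, "cars", mtcars)
n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM cars")$n
dbDisconnect(con)
```

Using `match.arg()` means a mistyped source name fails immediately with a message listing the valid choices, which doubles as the list of supported data sources.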

Developing packages and managing change

So when you are thinking about sort of developing maybe that first package for your company or something of that nature, once you've identified the things that you think would be useful for people, do you have like a philosophy or a way in which you approach kind of putting things together? Because I know one of the pain points is as you're developing, you realize, oh, what I did originally doesn't quite work and not to tear it down and kind of build it back up again. So I don't know if you have lessons learned from developing so many packages that we could learn from.

Yeah, I think that's a tough one. As a team, we have got vastly better at dealing with deprecations and breakages, and I was reminded of this recently because I had to rebuild the first edition of R for Data Science and it just works. Five years later, all the code in that book still works, which certainly wasn't always true: five years ago, if you had tried to do that with five-year-old tidyverse code, it wouldn't have worked.

And, you know, I think that that that's obviously great to have that, but it's also like a huge amount of extra work to do that. And like we can kind of afford to spend that work. And now we're a team of like, you know, at least my team is like six plus full time people who can spend our time thinking about this, because I think there's really like when you're in an environment of like scarcity, when you've only got like so much time that you can take out of your kind of like everyday job to invest in writing a package. Like it's really tough to balance. Like, how do I like add new stuff versus making sure the old stuff continues to work?

So I don't have any great advice there. I think, again, some of it's about building up trust. So give people some wins so that when you do inevitably break stuff, accidentally or on purpose, you've got some kind of cushion, and people aren't going to get really angry with you right away. They're going to be like, OK, well, there's a little bit of suffering now, but this person saved me so much time that it's not so bad.

But yeah, it's really hard. And particularly as you're starting out, you're going to make mistakes, that's inevitable, and you're going to do things that, when you look back a year later, you're like, why on earth did I do it that way? And now you want to rip out the whole thing and write it from scratch. And I think that feels horrible, but you have to remember that it's great, because it means you've grown immensely as a programmer. And certainly if you have my kind of mindset, you have to resist the temptation to rip things out and redo them as much as possible, and just focus on making the next generation better rather than breaking what people already have. So I don't have any great answers there, but I think you've just got to think about those tensions: how do I keep my forward velocity up while getting better as a programmer and evolving over time?

ChatGPT and AI tools

Ianson, what is your opinion on how it will evolve the data scientist's role in an organization?

I will caveat my remarks: I would say 99% of my ChatGPT use is entirely frivolous. I use it to make things rhyme. I use it to make things sound like pirates. I use it to make things sound like southern grandmas who use a lot of country folk sayings. I use it to make acrostics of people's names. I use it now with the finance person I work with. I submit requests exclusively in the form of office screenplays.

So I don't know, I think it's incredibly fun for stuff like that. When I try to use it for programming, it's been hit or miss: sometimes it's great, and sometimes it spits out something that looks totally plausible but is utterly wrong. And it's hard for me to tell. Is that just ChatGPT today, and in a year's time it's going to be much better? Or is it going to be like self-driving cars, where they're always coming a year in the future and we never quite get there?

Self-driving cars is something I think about a lot because I have a Tesla, and whenever I park it in the garage, it thinks the random collection of tools hanging on the wall is a semi about to crash into us. And I'm like, it can't even handle this, and people say we're going to have self-driving cars. So I don't know, it's hard for me to say. I'd say overall, I'm kind of a skeptic. I think it is going to be useful. In some ways, it's going to be really useful when you're learning, because you can take the words in your head and spit them out and say, how do I rotate the x-axis labels on a ggplot, and get something pretty reasonable, which makes it super useful for learning. But at the same time, as you're learning, it's going to spit out stuff that looks totally plausible but is totally wrong, and you're in the worst possible situation to try and figure out what's going on. So I don't know. We'll see. I think it's super exciting. I have a lot of fun with it, but I do not use it at all professionally.

R on cloud platforms and ggplot2's future

So my question is, what's the implementation plan of R on cloud computing platforms? So my team works in Azure. And how do we say using R there is very inconvenient at best. So previously, before we moved to Azure, the work of my team, I would say 70% in R, 30% in Python. After moving to Azure, I quickly changed to 70% in Python, 30% in R. So Python is just a very natural option for modeling analysis there. So how do you plan to change this in the future?

Yeah, I mean, I don't know if change is quite the right phrase, because I don't know if we have the power to do that. But certainly, you know, what we are trying to do is make R as easy as possible to use everywhere you can possibly use it. And, you know, I think thinking about like as more and more people move to cloud computing, how can we make it easy to get set up and get configured and make things just work is really important. I know like one example at the moment is we're working on a partnership with Databricks to make using R inside Databricks clusters much, much easier. Certainly, like this seems like there's a lot of like just challenges getting things set up there that, you know, folks have kind of solved on the Python side and no one has put the time into solving on the R side. So I'm hoping that, you know, as a company, we'll continue to invest in those opportunities, just make it easy to use for everyone.

Lisa, I see you had a question earlier on ggplot2, do you want to jump in?

Sure. I think I have two questions about ggplot2, but I'll go with the first one, which is: how do you feel about having so many people contributing to that package, and all the add-ons that have come from it? Are there any things that have amazed you, and then on the opposite side, any major frustrations?

Yeah, I mean, I think the thing that kind of blows my mind about ggplot2 is that it's going to come up to its 18th birthday fairly soon. So I sort of joke that it will be time to emancipate it, and it's just going to have to survive by itself from now on. And realistically, I'm not that involved in the day-to-day management of ggplot2 anymore. That torch has been passed to Thomas Lin Pedersen. I guess four years ago, I was sick of maintaining ggplot2, I had long been sick of maintaining it, and I hired Thomas. And now Thomas is getting sick of maintaining it. So we've got to figure out what to do next.

But yeah, maintaining something that is so popular is both a blessing and a curse. It's incredible to know that literally millions of people have used it and it's helped them make graphs and made their lives easier. But it's also immensely challenging to make any changes, because so many people rely on it now and are used to the defaults.

So there's something, I don't know if I can quickly pull this up, but one thing I think about from time to time is just how many people use it. If I find the most popular page, yeah, the most popular page on the ggplot2 website is the geom_bar documentation, which 15,000 people have read in the last month. There are not many ways you can have more of an impact on people's lives than by looking at that page about bar charts and thinking, how can we make that better?

But at the same time, those changes tend to be kind of small and incremental. So it's also fun to embark on completely new projects that no one uses, so you're free to do whatever you want. And in some ways, I think the ecosystem of packages takes a combination of those strengths and those weaknesses, because ggplot2 can stay the same and stable, whereas you can continue to build on top of it with these packages that fewer people use but that are also much easier to change and more flexible.

Focus time and managing your day

Something that's stood out in my mind that I've learned from you on internal company calls is how you handle focus time and the way you manage your own time. And I thought it might be helpful just to chat a little bit about that and to share how you think about that.

Yeah, I once sort of one funny anecdote, I gave a talk about this idea of focus time and a couple of years ago, probably now at the company. And I was like, I feel like the truest expression of my philosophy would be to say, no, I'm not going to give a talk about focus time because that's going to interrupt my work for the day.

But I really try to create a lot of time for uninterrupted work, because I think programming is fundamentally creative work, and finding the time to process and think things through is really important. So what I do currently is, every day, as I said, first thing, I get up and write for a couple of hours, or an hour at least. That's the time of the day where I am most productive, and so I want to spend it on the things I think are most important, not necessarily the things that are most urgent.

And then, you know, I have a team, so I have to talk to them, and I have to talk to my colleagues, who, of course, I love talking to. But it means I can't do other work. So my compromise is I've got two days a week blocked off on my calendar, Tuesdays and Thursdays, where I try really, really hard not to do these things. Sometimes, if there's a bigger meeting with 15 people and it's impossible to find the time otherwise, I'll do it. But I'm generally pretty good about keeping those days clear so I can tackle bigger problems. And I think that's really important.

It's, of course, challenging, because when you've got a big four-hour block of time, you still have to train yourself to resist the temptations of TikTok and Mastodon. Certainly the temptation of Twitter has been totally removed from me these days, so I guess that's good. But you have to train yourself to stay focused on one thing and not give in to all the other more fun distractions that you have around you. But giving yourself that space, I think, is really important.

Book recommendations and looking ahead

Are there any books you'd recommend to this audience here or resources you'd recommend that have been useful for you in your career?

This is one of the books that has inspired the tidyverse design book. It's called A Pattern Language. It's actually a book about architecture and how you design houses. It's a very thick book, but it's designed to be kind of skimmable. I think it's just fun and interesting to read. And this system of describing all of these components, the different levels of hierarchy, and how you can join them together, I think is very much connected to programming.

I don't know, I think that's the book I've been thinking about the most lately. Some of the other ones, I guess I've gotten rid of, because I have limited book space and I kind of like keeping these. I also have this collection of fairly historical books. This one is from 1923. When you look back at what people were doing 100 years ago, I think it's really revealing that in many cases the quality of the best visualizations has not improved a huge amount in the last 100 years. But the ease of making them has: now you can whip out 15 variations in ggplot in 15 minutes, whereas before you had to painstakingly draw them all out with pen and ink. So I think it's just interesting to reflect on that a little bit. What's really changed in visualization in the last 100 years?

I know we're at the end here, but maybe one inspirational question to ask you: what's something that you're most excited about as you think about the tidyverse and Posit going forward?

And I am legitimately really excited about us embracing Python more. You know, I'm still using Python, but I went to a Python conference recently and talked to a bunch of people using Python. And I'm just excited about helping out all these Python-using data scientists, because it feels like there's a huge, huge scope for this Posit "it just works" type stuff. And I'm excited to help spread the Posit way of doing things, the Posit kind of community, how we think about people, and try and spread that further and further in the world.

Thank you so much, Hadley. Is that our new slogan? It just works. I think it should be. We're certainly not there everywhere, but I just think that's the best experience you can deliver. You just try something out and it works. And people stop remarking on that because it just works. They don't have to think about that.

And I will say, I think some of the best packages that me and my team have created are almost invisible, because they just make problems go away and you never need to think about them. You never need to know about them, because that's just some pain you never need to experience. And then you can just listen to all the old folks cranking on about how terrible it was back in the day and how they had to trudge barefoot to school in three feet of snow, and just listen to their stories and be happy you don't live in that environment anymore.

Thank you so much again for joining us today, Hadley. And thank you all so much for joining the hangout today and making this space what it is. Pretty impressive to have over 200 people on a Zoom call and not have to worry about what's happening. And thank you all for monitoring the space, too.

If this was your first hangout, we would absolutely love to have you join us again. So again, they happen every Thursday, same time, same place. Next week, we'll be joined by Mike Lopez, senior director of football data and analytics at the NFL. So he'll be our featured leader next week.

Thank you all so much for joining today and have a great, great rest of the day, great rest of the week. Bye, everybody. Thanks, Rachel. Thanks, everyone. Bye.