
Why RStudio is now Posit (J.J. Allaire | Posit CEO) - KNN Ep. 158
Today, I had the pleasure of interviewing J.J. Allaire. J.J. is the founder of RStudio and the creator of the RStudio IDE. He is an author of several packages in the R Markdown publishing ecosystem including rmarkdown, flexdashboard, learnr, and distill, and also worked extensively on the R interfaces to Python, Spark, and TensorFlow. J.J. is now leading the Quarto project, which is a new Jupyter-based scientific and technical publishing system. In this episode, we learn about why RStudio has now repositioned itself as Posit, how it maximizes its open-source nature as a B Corp, and how J.J. as an open-source advocate views the private nature of many LLMs. I really enjoyed this conversation, and I hope you will as well!

Posit - https://posit.co/

Podcast Sponsors, Affiliates, and Partners:
- Pathrise - http://pathrise.com/KenJee | Career mentorship for job applicants (free until you land a job)
- Taro - http://jointaro.com/r/kenj308 (20% discount) | Career mentorship if you already have a job
- 365 Data Science - https://365datascience.pxf.io/P0jbBY (57% discount) | Learn data science today
- Interview Query - https://www.interviewquery.com/?ref=kenjee (10% discount) | Interview prep questions

Listen to Ken's Nearest Neighbors on all the main podcast platforms!
- Apple Podcasts: https://podcasts.apple.com/us/podcast/kens-nearest-neighbors/id1538368692 (Please rate if you enjoy it!)
- Spotify: https://open.spotify.com/show/7fJsuxiZl4TS1hqPUmDFbl
- Google: https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5idXp6c3Byb3V0LmNvbS8xNDMwMDQxLnJzcw?sa=X&ved=0CAMQ4aUDahcKEwjQ2bGBhfbsAhUAAAAAHQAAAAAQAQ

MORE DATA SCIENCE CONTENT HERE:
- My Twitter - https://twitter.com/KenJee_DS
- LinkedIn - https://www.linkedin.com/in/kenjee/
- Kaggle - https://www.kaggle.com/kenjee
- Medium Articles - https://medium.com/@kenneth.b.jee
- GitHub - https://github.com/PlayingNumbers
- My Sports Blog - https://www.playingnumbers.com
- 66DaysOfData Discord Server - https://discord.com/invite/4p37sy5muZ
Transcript
This transcript was generated automatically and may contain errors.
You get in a lot of these, like there's like language wars and oh, we used to use this and now we use this, or this is so much better than that. Any programming language that's popular has like a really good reason for being.
It's funny, I was listening to another interview with John Carmack, who did Doom and Quake and all the id stuff, and more recently worked on the Oculus. He's legendary for getting every single ounce out of the hardware, doing crazy stuff with video buffers and assembly, a total maniac. And in the interview, he said he'd come to realize that sometimes Python is great. There are a lot of cases where it just doesn't matter. Generally, the best programming language is the one that works, the one you're currently using.
I'm very agnostic in how I think about all these tools. I think any tool that gets adoption, I have respect for. And I'm kind of curious, like, well, why is this getting adopted? And why do people use this and care about it?
This episode of Ken's Nearest Neighbors is powered by Z by HP, HP's high-compute, workstation-grade line of products and solutions. Today, I had the pleasure of interviewing JJ Allaire. JJ is the founder of RStudio and the creator of the RStudio IDE. He's an author of several packages in the R Markdown publishing ecosystem, including rmarkdown, flexdashboard, learnr, and distill. And he also worked extensively on the R interfaces to Python, Spark, and TensorFlow. JJ is now leading the Quarto project, which is a new Jupyter-based scientific and technical publishing system, which I'll be doing a project on shortly.
In this episode, we learn about why RStudio has now repositioned itself as Posit, how it maximizes its open-source nature as a B Corp, and how JJ, as an open-source advocate, views the private nature of many LLMs. I really enjoyed this conversation, and I hope you will as well.
JJ's origin story: data and baseball
JJ, thank you so much for coming on the Ken's Nearest Neighbors podcast. You bet, happy to be here. Yeah, you've done such awesome stuff in the open-source communities, and obviously working with R, and now transitioning into Posit.
First of all, welcome. Second, I'm interested in how you first got interested in data. It might have been a series of careers ago, but to me, that is one of the kernel questions.
Absolutely, it was a long time ago, actually. I was 13 years old, and I was a huge baseball fan. Like a lot of 13-year-olds, I watched a ton of baseball on TV, read all about it in magazines, and I got exposed to a writer named Bill James, who many listeners have probably heard of, but not necessarily everybody. So he wrote this series of books called the Bill James Baseball Abstract, where he used data analysis to understand the game a lot better.
And one of the things he did was expose the fact that a lot of the conventional wisdom about baseball, how players were valued, how games were won and lost, was actually wrong. A lot of the reasoning was basically based on people's intuitions and subjective experience, not on data. And that was just a huge revelation. When he started, Bill James was a heretic; he had a 28-page, hand-Xeroxed newsletter. But ultimately he revolutionized all of baseball. If people have seen the movie Moneyball, it's based on Bill James' work. Now sabermetrics is an entire discipline, and all the sports are very quantitative. And that was in large part due to Bill James.
Anyway, he got me to see: wow, if all these experts are running around saying things that are not true and are so easily debunked using data, that's a really fundamental problem that is clearly going on outside of sports too. And my main interest was really political science and politics and economics, and I went to study that in college. I was very attracted to the empirical research side of it because it was the same thing: wow, we're making these really consequential public policy decisions. Are we informing those decisions with data?
And so in college, I really focused a lot on quantitative analysis. I ended up using a bunch of tools, some of which are still in wide use today, like Excel, and tools like SAS and Stata and JMP and SPSS. I was really fascinated by it. I really enjoyed that work.
And then my career took a detour where I got into computing, and I spent 10 or 15 years building other sorts of development tools, programming tools, web servers, authoring tools, et cetera. When I was done with all that, I was very keen on working in open-source software. And I heard about R, which was an open-source statistical computing system, and that was very, very compelling to me. Because that was where my heart started in my professional life: how can we use data to understand the world better and make better decisions?
And R at the time was early. And it was designed by statisticians. So it's extremely well suited to its purpose, but it also needed a lot of tooling. And so I started working on open source tooling for R. So that's kind of how I got into this.
Why open source matters
Well, part of it is that in science in particular, it's very important that people are able to share and reproduce results with each other and innovate on methods, and open source does those things very, very well. If I do some analysis and I want you to evaluate it, you don't need to purchase any software. If I want to reproduce the analysis in 10 years, I don't need to make sure I still have a license or that the software still runs. If a group of people, like the sports analytics community, wants to do a bunch of innovation on methods, they don't have to lobby a proprietary software vendor and say, oh, hey, pay attention to sports analytics, it's really important.
So a lot of things specifically about science and inquiry and developing pools of knowledge in communities are really, really well served by open source. I think that's one of the reasons why it's so important in data science and science. But for me personally, I also like the idea of making contributions that are durable, contributions that can be around for several decades and aren't tied to the fate or the whims of a single company. And so that's very attractive.
A reason I wanted to work in open source is to make these durable contributions. But it turns out that for data science and science, it's also a pretty fundamental requirement.
Yeah, something you've mentioned in some of the talks you've given is that your goal isn't just to be around this year or next year, it's to be able to produce something that's still working in a hundred years. Yeah, that's right.
If you look at the history of proprietary software, there are typically transition points where software is phased out and formats are discarded. Customers are forced to migrate from one thing to another. Customers get abused because they have a dependency on the software and really have no choice but to keep using it. So I think it's really important, again, for this concept of durability and trust, that we're trying to create something you can rely on for a very long time.
That hundred-year idea is that this is trustworthy and durable. But it's also the idea that if we can create a company with a synergistic relationship between building a business, being able to invest at scale, and creating open-source software, that company is worth having around for a hundred years. Traditionally, software companies don't work that way. There are definitely examples of open-source software companies, but even the classic example people point to as the paragon of a successful open-source software company is Red Hat. And I just saw a news item yesterday that Red Hat is going to make things much, much harder for non-Red Hat distributions like Oracle Linux or Amazon Linux, the various flavors derived from Red Hat.
Posit as a B Corp and the virtuous cycle
No, no, we talk about it as a virtuous cycle. Our mission is to create open-source software that's usable by anyone regardless of economic means and extremely useful to a broad range of people. So we do that: our packages, our IDEs, our web frameworks, reporting tools, all the different things we create. By making them open source, they tend to get very widely adopted. And what happens is that people in large enterprises adopt them and then want to do more with them: integrate them with their authentication systems, their compute grids, their audit requirements. There are lots of things that come up in a larger setting when you're scaling. So what we've done is sell products to larger enterprises who are starting to take these platforms seriously. That builds a business for us, and that enables us to reinvest in open source and ultimately create more opportunities for other folks to adopt the tools. So that has worked pretty well for us.
I notice that Posit is a B Corp. Can you explain a little what that means, and also how that ties into what we're talking about? Yeah, a B Corp is a benefit corporation. In very short, the nature of corporations in the United States is that their exclusive ultimate responsibility is to their shareholders. So they have a fairly narrow definition of what's ultimately important, and that can lead to companies, as I said, selling themselves or becoming abusive. That's problematic because companies really have a bunch of stakeholders: their customers, the broader community, their employees. What a benefit corp does is broaden the idea of what a corporation is for, and broaden the accountability, so that the company is accountable to multiple groups of stakeholders.
Further, public benefit corporations define a public benefit, a specific mission, that they're also obliged to pursue. In our case, it's open-source software for science. So it's a new type of company that has been developed really in the last 10 or 15 years. It didn't actually exist in a robust form when we started RStudio, but we made the transition, I think three or four years ago, to being a benefit corp, partly because we wanted that idea of durability, and of who we're accountable to, written into our charter. We wanted people to understand that we were trustworthy and we meant what we said.
The founding of RStudio and the road to Posit
Yeah, so RStudio started, I think, 12 or 13 years ago. Originally it wasn't intended to be a company. It was just, hey, let's make an IDE for R. I had done programming tools, I'd done some IDEs, I'd done writing tools, and so I said, I think R could really benefit from having a nice IDE. I ended up working on that with Joe Cheng, who was the co-founder; he's the CTO now. We worked on RStudio for two or three years, released it, and it started to get momentum. We ended up meeting Hadley Wickham, who became another co-founder and who's now our chief scientist. He had been working on ggplot and dplyr and a bunch of really popular R packages for doing data analysis and visualization. And so we said, hey, let's all work together.
And then we teamed up with Tareef Kawaf, who's now our president; he's been here for over 10 years too. He came on with the responsibility of figuring out how we could create this virtuous cycle: how could we create and grow a company that supports open source? So that was the genesis of RStudio. Joe then came up with Shiny, an open-source web development framework for R for building data-centric apps. So we had RStudio, we had Shiny, and then we started working on enhanced commercial versions of these things to sell. And that ended up working out.
I would say about three years ago, we had sold our products to lots of customers, but we observed that every data science team we were selling into was using both R and Python. We had started off with these very R-centric groups, and maybe there'd be some exclusively R groups and some exclusively Python groups, but the vast majority of teams were using both. We were selling enterprise software to help those teams be productive and deploy things, so it didn't make sense to sell them an R-only thing. It just didn't compute at all.
So we added support for Python to all of our products. That was well received, but it's quite difficult for customers to get their heads around buying Python solutions from a company called RStudio. It was like, wow, how am I going to explain this to the CIO? And so we decided a couple of years ago that we really needed to change the name of the company so that people understand the bigger breadth of what we're doing, and at the same time change some of the focus of our open-source work. We had primarily been doing open source in the R ecosystem, and we decided that when we do new projects, we need to make them multi-language. So we ended up doing Shiny for Python; that was announced about a year ago. Quarto, the project I'm working on, supported Python from the beginning. We're essentially evolving the company into one that creates open-source data science software for both Python and R, and sells commercial products for both Python and R. And so we renamed the company Posit, about a year ago, to reflect that.
R vs. Python: who uses what and why
I mean, some of it is lineage, I would say. Certainly, if you look at people in an undergraduate or graduate setting: if you're studying software engineering, you're going to gravitate to Python. If you learned about machine learning in a computer science department, you're absolutely using Python. If you're in a company doing data engineering, you're using Python. If you're in an undergraduate or graduate statistics program, you're very likely using R.
So there's a lineage of individuals and subfields where if you're coming from a heavy statistics orientation, you're likely using R. And conversely, if you're coming more from a computer science data pipeline, data engineering, you're using Python. So that's one explanation of where that comes from. The other thing though is that many people who do work with data do not principally identify as a software engineer or software developer. They identify as like, I'm a biologist, I'm an economist, I'm a sports analytics person. And so a lot of them, it gets driven by like, what's the package ecosystem look like for doing the work that I care about?
And so you will see it by subfield. My guess is in the NFL, some of it's historical, but there are probably packages that are extremely useful and would be really annoying to do without, and that makes R more sticky there. Conversely, in deep learning, all the libraries are in Python, so you're just in Python if you're doing neural nets. So I think some of it goes by field. We see a huge preponderance of R in insurance and in pharma; actuarial work and research in a clinical setting are very statistics-intensive, so you see R more in those settings.
LLMs, language agnosticism, and accessibility
Yeah, I definitely agree that using LLMs lets you bootstrap into a new language in a way that's really pretty revolutionary. You would normally spend two or three days fumbling around trying to figure out which way is up with a new language. But now you can say, hey, I want to do a PCA, I want to visualize it this way, and it'll give you roughly what you need to do, and then you'll probably have to tweak it and fix some problems. But in terms of the accessibility of different programming ecosystems, I think it's pretty profound.
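To make the point above concrete, here is a minimal sketch of the kind of starter code an LLM might hand you for "do a PCA and visualize it" in an unfamiliar language. The data here is entirely made up for illustration, and only NumPy is used so it runs anywhere:

```python
# Sketch: PCA via SVD, the kind of starter code an LLM might generate.
# The dataset below is synthetic (hypothetical), just to make it runnable.
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 observations of 5 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of variance per component
scores = Xc @ Vt.T[:, :2]         # project data onto the first two PCs

print(f"First two PCs explain {explained[:2].sum():.0%} of the variance")
# From here you would typically scatter-plot `scores` with matplotlib.
```

This is exactly the "roughly what you need" stage JJ describes: the skeleton is right, and the tweaking (plot styling, scaling choices, handling real data) is what's left for you.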
That said, a lot of the skill developed by data scientists and programmers ends up being fairly deep knowledge of a language and its ecosystem, its vagaries and workarounds. So the investment people make in a given language is still pretty sticky. I don't know that people are going to be flitting around between languages, because you give up a lot when you switch.
So we'll see. I definitely feel like people feel empowered to try things they don't understand that well, because they can get going really quickly with LLMs. Maybe people who currently use SPSS or Excel or something less code-oriented will feel more empowered to say, hey, wow, I got started, I created a notebook that does kind of what I was doing in Excel, it all works and it's really flexible, and now I'm off and running.
Quarto: communicating data with narrative
Yeah, so the open-source project I've been working on for the last few years is called Quarto. It's a framework for publishing reports and research that have data at their center. It's for creating reproducible documents, for creating artifacts from data analysis that go beyond a single metric: a suite of data visualizations, narrative that accompanies those visualizations, notebooks you created to dig into a topic, made available as a website or inside a presentation. So Quarto is really about communicating.
And I wanted to give a little of the motivation for this. One of the fundamental capabilities of Quarto is that it can take Jupyter notebooks and publish them in a huge variety of formats. You can create entire books, a blog, a PowerPoint presentation, a Word document, a really attractive searchable website. So you can take Jupyter notebooks and actually bring them to a lot of people.
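As a small illustration of that multi-format idea, a Quarto document declares its output formats in YAML metadata. This is only a sketch (the title and format options are hypothetical, not from the interview), but it shows the shape of the thing:

```yaml
# Sketch of Quarto document metadata declaring several output formats
# for the same source notebook or .qmd file.
title: "Quarterly Analysis"   # hypothetical document title
format:
  html:
    toc: true                 # a searchable web page with a table of contents
  docx: default               # the same content as a Word document
  revealjs: default           # and as an HTML slide presentation
```

Running `quarto render` on a source file with metadata like this produces each configured format from the single notebook.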
That's one of the fundamental capabilities. And if you go back to what's interesting about Jupyter: we were talking a little earlier about Bill James and sabermetrics, and there was an interview he did, I think about a month ago, with Michael Lewis. He said, and I had not heard him say this before, that he has become disillusioned with the contemporary sports analytics movement, because there's so much focus on getting to one number, like wins above replacement, that represents a player's value. Bill James's work always had context and narrative: but consider this, consider that. And he says he's become disillusioned because now it's always about reducing everything to a single number.
And he said there's just a lot more going on when you're thinking about players and teams and careers. So I think what Jupyter notebooks allow you to do as a publishing medium, and by extension Quarto, is marry narrative to visualizations, to those numeric reductions. Oftentimes there's a lot of context, assumptions, qualifications, and other branches to consider in a rigorous analysis, and part of the focus of Quarto is helping you communicate those.
That's so interesting to me, because I see it so much in my own work: every data scientist wants to qualify their work. We say, this holds true when X, Y, Z. Right, right. And every business person just wants, hey, what is the number that tells us how to act? Yeah, and there's so much storytelling, so much background in that, which is very difficult for us to convey, because data scientists and engineers tell the story from the beginning, and business people basically just want the end, the conclusion. And you can't make great decisions with just the end. You need to understand the context.
I recommend listening to that interview if you can. It's really fascinating, him reflecting on his career and where things have gone. There's another worthwhile thing to read: Edward Tufte, whose books people are probably familiar with, like The Visual Display of Quantitative Information, wrote a pamphlet called The Cognitive Style of PowerPoint: Pitching Out Corrupts Within. He basically describes how trying to reduce decisions and analysis to a single graph, three bullets, or a single number oftentimes leads to terrible consequences.
Yeah, one of his examples was the ill-fated space shuttle mission. There's a PowerPoint deck where they were basically discussing, should we do this or not? And there's a slide with just four bullets, and a bunch of other detail that needed to be explored ended up compressed onto that one slide. A really important thing was just glossed over in the drive to reduce everything down. Yeah, and the whole Wall Street value-at-risk thing: it's like, we have our value-at-risk metric and that's going to tell us how exposed we are. But there are qualifications on value at risk. There are tail events, and then that's not actually your exposure. Your value at risk is not $10 million, it's actually more like $200 million for the tail events.
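The value-at-risk point above can be made concrete with a small simulation. This is a sketch with entirely made-up numbers, not a real risk model: a portfolio whose daily results are mostly mild but occasionally crash shows how the headline 95% VaR number hides much larger tail losses:

```python
# Sketch: a single 95% VaR figure can look small while the tail losses
# it ignores are far larger. All figures are synthetic (hypothetical).
import numpy as np

rng = np.random.default_rng(0)

n_days = 100_000
normal_days = rng.normal(loc=0.0, scale=1.0, size=n_days)        # $MM P&L
crash_days = -np.abs(rng.standard_t(df=2, size=n_days)) * 20.0   # heavy-tailed losses
is_crash = rng.random(n_days) < 0.002                            # rare: 0.2% of days
pnl = np.where(is_crash, crash_days, normal_days)

losses = -pnl
var_95 = np.quantile(losses, 0.95)       # the headline "risk" number
tail = losses[losses > var_95]
expected_shortfall = tail.mean()         # average loss *beyond* the VaR cutoff
worst = losses.max()

print(f"95% VaR:             ${var_95:.1f}MM")
print(f"Expected shortfall:  ${expected_shortfall:.1f}MM")
print(f"Worst simulated day: ${worst:.1f}MM")
```

The expected shortfall and worst-day figures come out far larger than the VaR cutoff, which is exactly the "your value at risk is not $10 million" problem: the single number masks the tail.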
Open sourcing LLMs and the dangers of disinformation
Well, OpenAI has not open-sourced their models, and I think Sam Altman would say they don't believe the models should be open-sourced, because in some cases they're kind of like weapons-grade, you know? The ability of a large language model to produce disinformation at scale is a weapon. And I think that changes the discussion about open source. We don't have open-source nuclear power plant designs. There are things that are dangerous and need to be used with responsibility, maybe even with some oversight. And if GPT-4 isn't that, future iterations of it probably will be.
I think the most dangerous thing about LLMs right now is their ability to produce disinformation at scale, to sow disharmony in political and cultural dialogue on social media and elsewhere. That's the worst thing they can do right now. I don't think the other, more physically frightening scenarios are in play or even close to in play, but we do need to think about them. The fact that Sam Altman said we need to make sure they can't self-replicate, right? That is looking toward a future where, huh, they're going to self-replicate? What?
This is my personal take: I think a world where we can't trust any news is an inevitability. Yeah, I think that's probably right. And I don't know if that's necessarily the worst thing, because right now we're in this weird limbo state where the models are good enough to trick most of us, but on the other hand there's some news we can still vaguely trust. If we live in a world where we can't trust any news, that changes our relationship with media. Maybe we go to specific people; maybe we go back to what journalism once was.
My optimistic hope is that distrust online means that communities and in-person things see a resurgence in value. Yeah. Because there's something inherent about being able to reach out and talk to someone. Absolutely, yeah, I think you may be right. That may be where we end up, and that would not be a bad thing.
So JJ, those are all the questions I had. Do you have any final thoughts, any way people can learn more about you, anything along those lines? Yeah, I think you can put in the URL for the Quarto project. If you're a data scientist, it's absolutely worth checking out. That's the main thing I'm focused on now and that I'd want people to go take a look at.
This was amazing. We touched on so many fun and exciting topics. So I'm glad to be able to share this with the audience. Absolutely. Thanks very much for having me on.
