
Why RStudio is now Posit (J.J. Allaire | Posit CEO) - KNN Ep. 158
Today, I had the pleasure of interviewing J.J. Allaire. J.J. is the founder of RStudio and the creator of the RStudio IDE. He is an author of several packages in the R Markdown publishing ecosystem including rmarkdown, flexdashboard, learnr, and distill, and also worked extensively on the R interfaces to Python, Spark, and TensorFlow. J.J. is now leading the Quarto project, which is a new Jupyter-based scientific and technical publishing system. In this episode, we learn about why RStudio has now repositioned itself as Posit, how it maximizes its open-source nature as a B Corp, and how J.J. as an open-source advocate views the private nature of many LLMs. I really enjoyed this conversation, and I hope you will as well!

Posit - https://posit.co/

Podcast Sponsors, Affiliates, and Partners:
- Pathrise - http://pathrise.com/KenJee | Career mentorship for job applicants (free until you land a job)
- Taro - http://jointaro.com/r/kenj308 (20% discount) | Career mentorship if you already have a job
- 365 Data Science - https://365datascience.pxf.io/P0jbBY (57% discount) | Learn data science today
- Interview Query - https://www.interviewquery.com/?ref=kenjee (10% discount) | Interview prep questions

Listen to Ken's Nearest Neighbors on all the main podcast platforms!
- Apple Podcasts: https://podcasts.apple.com/us/podcast/kens-nearest-neighbors/id1538368692 (Please rate if you enjoy it!)
- Spotify: https://open.spotify.com/show/7fJsuxiZl4TS1hqPUmDFbl
- Google: https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5idXp6c3Byb3V0LmNvbS8xNDMwMDQxLnJzcw?sa=X&ved=0CAMQ4aUDahcKEwjQ2bGBhfbsAhUAAAAAHQAAAAAQAQ

MORE DATA SCIENCE CONTENT HERE:
- My Twitter - https://twitter.com/KenJee_DS
- LinkedIn - https://www.linkedin.com/in/kenjee/
- Kaggle - https://www.kaggle.com/kenjee
- Medium Articles - https://medium.com/@kenneth.b.jee
- GitHub - https://github.com/PlayingNumbers
- My Sports Blog - https://www.playingnumbers.com
- 66DaysOfData Discord Server - https://discord.com/invite/4p37sy5muZ
Transcript
This transcript was generated automatically and may contain errors.
You get in a lot of these, like there's like language wars and oh, we used to use this and now we use this, or this is so much better than that. Any programming language that's popular has like a really good reason for being.
It's funny, I was listening to another interview with John Carmack, who did Doom and Quake and all the id stuff, and more recently worked on the Oculus. He's legendary for getting every single ounce out of the hardware, doing crazy stuff with video buffers and assembly, a total maniac. And in the interview, he said he'd come to realize that sometimes Python is great. There are a lot of cases where it just doesn't matter. Generally, the best programming language is the one that works, the one you're currently using.
I'm very agnostic in how I think about all these tools. I think any tool that gets adoption, I have respect for. And I'm kind of curious, like, well, why is this getting adopted? And why do people use this and care about it?
This episode of Ken's Nearest Neighbors is powered by Z by HP, HP's high-compute, workstation-grade line of products and solutions. Today, I had the pleasure of interviewing JJ Allaire. JJ is the founder of RStudio and the creator of the RStudio IDE. He's an author of several packages in the R Markdown publishing ecosystem, including rmarkdown, flexdashboard, learnr, and distill. And he also worked extensively on the R interfaces to Python, Spark, and TensorFlow. JJ is now leading the Quarto project, which is a new Jupyter-based scientific and technical publishing system, which I'll be doing a project on shortly.
In this episode, we learn about why RStudio has now repositioned itself as Posit, how it maximizes its open-source nature as a B Corp, and how JJ, as an open-source advocate, views the private nature of many LLMs. I really enjoyed this conversation, and I hope you will as well.
JJ's origin story: data and baseball
JJ, thank you so much for coming on the Ken's Nearest Neighbors podcast. You bet, happy to be here. Yeah, you've done such awesome stuff in the open-source communities, and obviously working with R, and now transitioning into Posit.
First of all, welcome. Second, I'm interested in how you first got interested in data. It might have been a series of careers ago, but to me, that is one of the kernel questions.
Absolutely, it was a long time ago, actually. I was 13 years old, and I was a huge baseball fan. Like a lot of 13-year-olds, I watched a ton of baseball on TV, read all about it in magazines, and I got exposed to a writer named Bill James, who many listeners have probably heard of, but not necessarily everybody. So he wrote this series of books called the Bill James Baseball Abstract, where he used data analysis to understand the game a lot better.
And one of the things he did was expose the fact that a lot of the conventional wisdom about baseball, how players were valued, how games were won and lost, was actually wrong. A lot of the reasoning was basically based on people's intuitions and subjective experience, not on data. And that was just a huge revelation. When he started, Bill James was a heretic; he had a 28-page, hand-Xeroxed newsletter. But ultimately he revolutionized all of baseball. If people have seen the movie Moneyball, it's based on Bill James' work. Now sabermetrics is an entire discipline, and all the sports are very quantitative. And that was in large part due to Bill James.
Anyway, he got me to see: wow, if all these experts are running around saying things that are not true and are so easily debunked using data, that's a really fundamental problem that is clearly going on outside of sports too. And my main interest was really political science and politics and economics, and I went to study that in college. I was very attracted to the empirical research side of it because it was the same thing: wow, we're making these really consequential public policy decisions. Are we informing those decisions with data?
And so in college, I really focused a lot on quantitative analysis. I ended up using a bunch of tools, some of which are still in wide use today, like Excel, and tools like SAS and Stata and JMP and SPSS. I was really fascinated by it. I really enjoyed that work.
And then my career took a detour where I got into computing, and I spent 10 or 15 years building other sorts of development tools, programming tools, web servers, authoring tools, et cetera. When I was done with all that, I was very keen on working in open-source software. And I heard about R, which was an open-source statistical computing system, and that was very, very compelling to me. Because that was where my heart started in my professional life: how can we use data to understand the world better and make better decisions?
And R at the time was early. And it was designed by statisticians. So it's extremely well suited to its purpose, but it also needed a lot of tooling. And so I started working on open source tooling for R. So that's kind of how I got into this.
Why open source matters
Well, part of it is that in science in particular, it's very important that people are able to share and reproduce results with each other and innovate on methods, and open source does those things very, very well. If I do some analysis and I want you to evaluate it, you don't need to purchase any software. If I want to reproduce the analysis in 10 years, I don't need to make sure I still have a license or that the software still runs. If a group of people, like the sports analytics community, wants to do a bunch of innovation on methods, they don't have to lobby a proprietary software vendor and say, oh, hey, pay attention to sports analytics, it's really important.
So a lot of things specifically about science and inquiry and developing pools of knowledge in communities are really, really well served by open source. I think that's one of the reasons why it's so important in data science and science. But for me personally, I also like the idea of making contributions that are durable, contributions that can be around for several decades and aren't tied to the fate or the whims of a single company. And so that's very attractive.
A reason I wanted to work in open source is to make these durable contributions. But it turns out that for data science and science, it's also a pretty fundamental requirement.
Yeah, something you've mentioned in some of the talks you've given is that your goal isn't just to be around this year or next year, it's to be able to produce something that's still working in a hundred years. Yeah, that's right.
If you look at the history of proprietary software, there are typically transition points where software is phased out and formats are discarded. Customers are forced to migrate from one thing to another. Customers get abused because they have a dependency on the software and really have no choice but to keep using it. So I think it's really important, again, for this concept of durability and trust, that we're trying to create something you can rely on for a very long time.
That hundred-year idea is that this is trustworthy and durable. But it's also the idea that if we can create a company with a synergistic relationship between building a business, being able to invest at scale, and creating open-source software, that company is worth having around for a hundred years. Traditionally, software companies don't work that way. There are definitely examples of open-source software companies, but even the classic example people point to as the paragon of a successful open-source software company is Red Hat. And I just saw a news item yesterday that Red Hat is going to make things much, much harder for non-Red Hat distributions like Oracle Linux or Amazon Linux, the various flavors derived from Red Hat.
Posit as a B Corp and the virtuous cycle
No, no, we talk about it as a virtuous cycle. Our mission is to create open-source software that's usable by anyone regardless of economic means and extremely useful to a broad range of people. So we do that: our packages, our IDEs, our web frameworks, reporting tools, all the different things we create. By making them open source, they tend to get very widely adopted. And what happens is that people in large enterprises adopt them and then want to do more with them: integrate them with their authentication systems, their compute grids, their audit requirements. There are lots of things that come up in a larger setting when you're scaling. So what we've done is sell products to larger enterprises who are starting to take these platforms seriously. That builds a business for us, and that enables us to reinvest in open source and ultimately create more opportunities for other folks to adopt the tools. So that has worked pretty well for us.
I notice that Posit is a B Corp. Can you explain a little what that means, and also how that ties into what we're talking about? Yeah, a B Corp is a benefit corporation. In very short, the nature of corporations in the United States is that their exclusive ultimate responsibility is to their shareholders. So they have a fairly narrow definition of what's ultimately important, and that can lead to companies, as I said, selling themselves or becoming abusive. That's problematic because companies really have a bunch of stakeholders: their customers, the broader community, their employees. What a benefit corp does is broaden the idea of what a corporation is for, and broaden the accountability, so that the company is accountable to multiple groups of stakeholders.
Further, public benefit corporations define a public benefit, a specific mission, that they're also obliged to pursue. In our case, it's open-source software for science. So it's a new type of company that has been developed really in the last 10 or 15 years. It didn't actually exist in a robust form when we started RStudio, but we made the transition, I think three or four years ago, to being a benefit corp, partly because we wanted that idea of durability, and of who we're accountable to, written into our charter. We wanted people to understand that we were trustworthy and we meant what we said.
The founding of RStudio and the road to Posit
Yeah, so RStudio started, I think, 12 or 13 years ago. Originally it wasn't intended to be a company. It was just, hey, let's make an IDE for R. I had done programming tools, I'd done some IDEs, I'd done writing tools, and so I said, I think R could really benefit from having a nice IDE. I ended up working on that with Joe Cheng, who was the co-founder; he's the CTO now. We worked on RStudio for two or three years, released it, and it started to get momentum. We ended up meeting Hadley Wickham, who became another co-founder and who's now our chief scientist. He had been working on ggplot and dplyr and a bunch of really popular R packages for doing data analysis and visualization. And so we said, hey, let's all work together.
And then we teamed up with Tareef Kawaf, who's now our president; he's been here for over 10 years too. He came on with the responsibility of figuring out how we could create this virtuous cycle: how could we create and grow a company that supports open source? So that was the genesis of RStudio. Joe then came up with Shiny, an open-source web development framework for R for building data-centric apps. So we had RStudio, we had Shiny, and then we started working on enhanced commercial versions of these things to sell. And that ended up working out.
I would say about three years ago, we had sold our products to lots of customers, but we observed that every data science team we were selling into was using both R and Python. We had started off with these very R-centric groups, and maybe there'd be some exclusively R groups and some exclusively Python groups, but the vast majority of teams were using both. We were selling enterprise software to help those teams be productive and deploy things, so it didn't make sense to sell them an R-only thing. It just didn't compute at all.
So we added support for Python to all of our products. That was well received, but it's quite difficult for customers to get their heads around buying Python solutions from a company called RStudio. It was like, wow, how am I going to explain this to the CIO? And so we decided a couple of years ago that we really needed to change the name of the company so that people understand the bigger breadth of what we're doing, and at the same time change some of the focus of our open-source work. We had primarily been doing open source in the R ecosystem, and we decided that when we do new projects, we need to make them multi-language. So we ended up doing Shiny for Python; that was announced about a year ago. Quarto, the project I'm working on, supported Python from the beginning. We're essentially evolving the company into one that creates open-source data science software for both Python and R, and sells commercial products for both Python and R. And so we renamed the company Posit, about a year ago, to reflect that.
R vs. Python: who uses what and why
I mean, some of it is lineage, I would say. Certainly, if you look at people in an undergraduate or graduate setting: if you're studying software engineering, you're going to gravitate to Python. If you learned about machine learning in a computer science department, you're absolutely using Python. If you're in a company doing data engineering, you're using Python. If you're in an undergraduate or graduate statistics program, you're very likely using R.
So there's a lineage of individuals and subfields where if you're coming from a heavy statistics orientation, you're likely using R. And conversely, if you're coming more from a computer science data pipeline, data engineering, you're using Python. So that's one explanation of where that comes from. The other thing though is that many people who do work with data do not principally identify as a software engineer or software developer. They identify as like, I'm a biologist, I'm an economist, I'm a sports analytics person. And so a lot of them, it gets driven by like, what's the package ecosystem look like for doing the work that I care about?
And so you will see it by subfield. My guess is in the NFL, some of it's historical, but there are probably packages that are extremely useful and would be really annoying to do without, and that makes R more sticky there. Conversely, in deep learning, all the libraries are in Python, so you're just in Python if you're doing neural nets. So I think some of it goes by field. We see a huge preponderance of R in insurance and in pharma; actuarial work and research in a clinical setting are very statistics-intensive, so you see R more in those settings.
LLMs, language agnosticism, and accessibility
Yeah, I definitely agree that using LLMs lets you bootstrap into a new language in a way that's really pretty revolutionary. You would normally spend two or three days fumbling around trying to figure out which way is up with a new language. But now you can say, hey, I want to do a PCA, I want to visualize it this way, and it'll give you roughly what you need to do, and then you'll probably have to tweak it and fix some problems. But in terms of the accessibility of different programming ecosystems, I think it's pretty profound.
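To make the point above concrete, here is a minimal sketch of the kind of starter code an LLM might hand you for "do a PCA and visualize it" in an unfamiliar language. The data here is entirely made up for illustration, and only NumPy is used so it runs anywhere:

```python
# Sketch: PCA via SVD, the kind of starter code an LLM might generate.
# The dataset below is synthetic (hypothetical), just to make it runnable.
import numpy as np

rng = np.random.default_rng(42)

# Toy data: 200 observations of 5 features driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # fraction of variance per component
scores = Xc @ Vt.T[:, :2]         # project data onto the first two PCs

print(f"First two PCs explain {explained[:2].sum():.0%} of the variance")
# From here you would typically scatter-plot `scores` with matplotlib.
```

This is exactly the "roughly what you need" stage JJ describes: the skeleton is right, and the tweaking (plot styling, scaling choices, handling real data) is what's left for you.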
That said, a lot of the skill developed by data scientists and programmers ends up being fairly deep knowledge of a language and its ecosystem, its vagaries and workarounds. So the investment people make in a given language is still pretty sticky. I don't know that people are going to be flitting around between languages, because you give up a lot when you switch.
So we'll see. I definitely feel like people feel empowered to try things they don't understand that well, because they can get going really quickly with LLMs. Maybe people who currently use SPSS or Excel or something less code-oriented will feel more empowered to say, hey, wow, I got started, I created a notebook that does kind of what I was doing in Excel, it all works and it's really flexible, and now I'm off and running.
Quarto: communicating data with narrative
Yeah, so the open-source project I've been working on for the last few years is called Quarto. It's a framework for publishing reports and research that have data at their center. It's for creating reproducible documents, for creating artifacts from data analysis that go beyond a single metric: a suite of data visualizations, narrative that accompanies those visualizations, notebooks you created to dig into a topic, made available as a website or inside a presentation. So Quarto is really about communicating.
And I wanted to give a little of the motivation for this. One of the fundamental capabilities of Quarto is that it can take Jupyter notebooks and publish them in a huge variety of formats. You can create entire books, a blog, a PowerPoint presentation, a Word document, a really attractive searchable website. So you can take Jupyter notebooks and actually bring them to a lot of people.
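As a small illustration of that multi-format idea, a Quarto document declares its output formats in YAML metadata. This is only a sketch (the title and format options are hypothetical, not from the interview), but it shows the shape of the thing:

```yaml
# Sketch of Quarto document metadata declaring several output formats
# for the same source notebook or .qmd file.
title: "Quarterly Analysis"   # hypothetical document title
format:
  html:
    toc: true                 # a searchable web page with a table of contents
  docx: default               # the same content as a Word document
  revealjs: default           # and as an HTML slide presentation
```

Running `quarto render` on a source file with metadata like this produces each configured format from the single notebook.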
That's one of the fundamental capabilities. And if you go back to what's interesting about Jupyter: we were talking a little earlier about Bill James and sabermetrics, and there was an interview he did, I think about a month ago, with Michael Lewis. He said, and I had not heard him say this before, that he has become disillusioned with the contemporary sports analytics movement, because there's so much focus on getting to one number, like wins above replacement, that represents a player's value. Bill James's work always had context and narrative: but consider this, consider that. And he says he's become disillusioned because now it's always about reducing everything to a single number.
And he said there's just a lot more going on when you're thinking about players and teams and careers. So I think what Jupyter notebooks allow you to do as a publishing medium, and by extension Quarto, is marry narrative to visualizations, to those numeric reductions. Oftentimes there's a lot of context, assumptions, qualifications, and other branches to consider in a rigorous analysis, and part of the focus of Quarto is helping you communicate those.
That's so interesting to me, because I see it so much in my own work: every data scientist wants to qualify their work. We say, this holds true when X, Y, Z. Right, right. And every business person just wants, hey, what is the number that tells us how to act? Yeah, and there's so much storytelling, so much background in that, which is very difficult for us to convey, because data scientists and engineers tell the story from the beginning, and business people basically just want the end, the conclusion. And you can't make great decisions with just the end. You need to understand the context.
I recommend listening to that interview if you can. It's really fascinating, him reflecting on his career and where things have gone. There's another worthwhile thing to read: Edward Tufte, whose books people are probably familiar with, like The Visual Display of Quantitative Information, wrote a pamphlet called The Cognitive Style of PowerPoint: Pitching Out Corrupts Within. He basically describes how trying to reduce decisions and analysis to a single graph, three bullets, or a single number oftentimes leads to terrible consequences.
Yeah, one of his examples was the ill-fated space shuttle mission. There's a PowerPoint deck where they were basically discussing, should we do this or not? And there's a slide with just four bullets, and a bunch of other detail that needed to be explored ended up compressed onto that one slide. A really important thing was just glossed over in the drive to reduce everything down. Yeah, and the whole Wall Street value-at-risk thing: it's like, we have our value-at-risk metric and that's going to tell us how exposed we are. But there are qualifications on value at risk. There are tail events, and then that's not actually your exposure. Your value at risk is not $10 million, it's actually more like $200 million for the tail events.
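The value-at-risk point above can be made concrete with a small simulation. This is a sketch with entirely made-up numbers, not a real risk model: a portfolio whose daily results are mostly mild but occasionally crash shows how the headline 95% VaR number hides much larger tail losses:

```python
# Sketch: a single 95% VaR figure can look small while the tail losses
# it ignores are far larger. All figures are synthetic (hypothetical).
import numpy as np

rng = np.random.default_rng(0)

n_days = 100_000
normal_days = rng.normal(loc=0.0, scale=1.0, size=n_days)        # $MM P&L
crash_days = -np.abs(rng.standard_t(df=2, size=n_days)) * 20.0   # heavy-tailed losses
is_crash = rng.random(n_days) < 0.002                            # rare: 0.2% of days
pnl = np.where(is_crash, crash_days, normal_days)

losses = -pnl
var_95 = np.quantile(losses, 0.95)       # the headline "risk" number
tail = losses[losses > var_95]
expected_shortfall = tail.mean()         # average loss *beyond* the VaR cutoff
worst = losses.max()

print(f"95% VaR:             ${var_95:.1f}MM")
print(f"Expected shortfall:  ${expected_shortfall:.1f}MM")
print(f"Worst simulated day: ${worst:.1f}MM")
```

The expected shortfall and worst-day figures come out far larger than the VaR cutoff, which is exactly the "your value at risk is not $10 million" problem: the single number masks the tail.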
Open sourcing LLMs and the dangers of disinformation
Well, OpenAI has not open-sourced their models, and I think Sam Altman would say they don't believe the models should be open-sourced, because in some cases they're kind of like weapons-grade, you know? The ability of a large language model to produce disinformation at scale is a weapon. And I think that changes the discussion about open source. We don't have open-source nuclear power plant designs. There are things that are dangerous and need to be used with responsibility, maybe even with some oversight. And if GPT-4 isn't that, future iterations of it probably will be.
I think the most dangerous thing about LLMs right now is their ability to produce disinformation at scale, to sow disharmony in political and cultural dialogue on social media and elsewhere. That's the worst thing they can do right now. I don't think the other, more physically frightening scenarios are in play or even close to in play, but we do need to think about them. The fact that Sam Altman said we need to make sure they can't self-replicate, right? That is looking toward a future where, huh, they're going to self-replicate? What?
This is my personal take: I think a world where we can't trust any news is an inevitability. Yeah, I think that's probably right. And I don't know if that's necessarily the worst thing, because right now we're in this weird limbo state where the models are good enough to trick most of us, but on the other hand there's some news we can still vaguely trust. If we live in a world where we can't trust any news, that changes our relationship with media. Maybe we go to specific people; maybe we go back to what journalism once was.
My optimistic hope is that distrust online means that communities and in-person things see a resurgence in value. Yeah. Because there's something inherent about being able to reach out and talk to someone. Absolutely, yeah, I think you may be right. That may be where we end up, and that would not be a bad thing.
So JJ, those are all the questions I had. Do you have any final thoughts, any way people can learn more about you, anything along those lines? Yeah, I think you can put in the URL for the Quarto project. If you're a data scientist, it's absolutely worth checking out. That's the main thing I'm focused on now and that I'd want people to go take a look at.
This was amazing. We touched on so many fun and exciting topics. So I'm glad to be able to share this with the audience. Absolutely. Thanks very much for having me on.
