J.J. Allaire | Open Source Software for Data Science | RStudio (2020)

Transcript

This transcript was generated automatically and may contain errors.

Thank you very much, Hadley. Today I'm going to talk about free and open source software for data science. The lens I'm going to use to talk about it is kind of the how and why of RStudio: a little bit about how the company got started and where our mission of creating free and open source software derives from. But then I want to get into the ways that tools, both open source and proprietary, for scientific and technical computing are developed. How are they financially supported? How are they sustained? How can you come to trust them? And get into a little bit about the nature of corporations as stewards of this sort of software. And I'll ask the question, are corporations kind of inherently sketchy as stewards of this sort of software? Spoiler, they are. So then I want to talk a little bit about what we can do about that.

Origins of the company

So I'll start with a little background about how the company got started so you can get a flavor for where I was coming from when starting the company. The seeds of the company go all the way back to 1983. How many people here have heard of Bill James? People have heard of Bill James? Okay. So Bill James was a math teacher in Kansas City. And back in 1983, I was an avid follower of baseball. And I absorbed all of the, you know, analysis about baseball from writers and sportscasters and players and coaches and experts. And I came to believe a lot of the things I was hearing about how baseball teams win games, how you assess the value of players and strategies.

So what Bill James did, interestingly, you can see back in 1977, his first baseball abstract was this little pamphlet. He went from that pamphlet over ten years to a New York Times bestselling book where he used data, empirical data analysis, to systematically debunk many of the assumptions that people had about kind of how baseball works. And for me, as a 14-year-old, that was really striking that all these people who were, you know, who had spent their lives in the game of baseball, who had developed kind of conclusions and intuitions about how it worked, that all of that could be debunked by systematically using data. It was shocking to me. And that stuck with me.

When I went to college, I didn't go to college for baseball studies. I went to college for political science. And when you talk about political science, you get into public policy. And public policy decisions are made that affect the well-being of many, many hundreds of millions of people. And so the same thing occurred to me: we've got lots of experts, and we've got lots of people who've been absorbed in different fields and observing phenomena for years, making these decisions and possibly, probably as it turned out, not actually using data to inform those decisions. And as I've become part of the R community, I've realized that the same phenomenon repeats in fields like medicine and business: people are making highly consequential decisions and they're not availing themselves of the tools they have to really understand how the world works.

So this, to me, through college, struck me as kind of the fundamental problem to solve. And it seemed to me that software was an important part of the answer. This is some of the software that I used in college and graduate school. And say what you might about some of this software, I had the experience of software providing leverage. So my mind and what I could absorb was amplified by the fact that I could use the software to better understand things.

So from there, I went to graduate school at the University of Wisconsin-Madison studying political science. And this is going to be an aside that this group, I think, will find interesting. It's not really in the main thread of the talk. But I worked on a study there that was studying the effectiveness of vouchers and public school performance in Milwaukee. Milwaukee was one of the first, if not the first, cities to implement a school voucher program. And there was a study going on at Madison that was going to assess how effective school vouchers were. And I worked on that study.

So at the same time, there was a group at Harvard that was very convinced that the conclusions that would be drawn by the folks at Madison were going to be wrong. So they were very keen to reproduce and criticize the results of the study.

So they asked for the data. They said, can we please have the data so we can do our own analysis? So we had the data in Paradox databases that were machine-readable, super easy to send to them. But what we did instead, true story, was we printed out all the data and shipped them crates of paper so that they could reenter the data on their own.

So that's kind of where we were in reproducibility and academia back then. So that's a true story.

So when I got to Madison, I was really excited about data analysis and computation. And I was kind of like, could I kind of specialize in software for data analysis? And unfortunately, in political science at that time, software was definitely not something you could specialize in. And at Madison, even data analysis was kind of barely something you could specialize in.

So I was kind of in the wrong place. So when I was supposed to be doing my graduate school work, I was teaching myself how to program in C++ and teaching myself how to program on the Mac. And that inevitably led shortly to me dropping out of the program and saying, I want to be a software engineer.

And what I alluded to just a little bit before, what fascinated me so much about being a software engineer, is summed up well by this quote from Steve Jobs, where he says, what a computer is to me, it's the most remarkable tool that we've ever come up with. It's the equivalent of a bicycle for our minds. And the other thing he cited when he recounted that quote was a Scientific American study that looked at the sort of efficiency of different modalities of motion, sort of the cost of transport, calories per gram per kilometer as it relates to body weight.

So they looked at, you know, organisms, salmon were super efficient, horses were super efficient. They looked at different modalities of human transportation and they found that a person on a bicycle was by far the most efficient. And that's, I think, what computers are and what software is.

So that became my fascination and I decided I wanted to become a software engineer and I wanted to build software tools. So I did that for quite a number of years. I built programming tools, I built research tools, I built some writing tools. And the software I worked on had a couple of characteristics. One was it was proprietary software. Two was I was working in startup companies. And proprietary software worked on in startup companies has sort of the seeds of its demise built in from the beginning. Because startup companies are built to be sold, and usually when they're sold they're in some form or fashion kind of destroyed or warped. And the proprietary software is often very bound up in the fortunes and fate of the companies that sponsor its development.

So I really enjoyed working on software tools but I found that proprietary software in startups was not something I wanted to do anymore after that. So I was sort of searching for what I'd like to do next. And I came to the conclusion that one of the things I wanted was to build tools that were durable, that could outlast a given company. And I wanted to build tools that were accessible to everyone, that anyone could use irrespective of cost. And that led me to the idea of working on open source software. So I knew I wanted to work on open source software. I knew that I didn't want to do software startups. And at that time I found out about R, which kind of took me back to the work that I had done in data analysis as an undergrad and in graduate school. And I don't know how long it took me, it was definitely less than 24 hours, to conclude this is what I want to work on. At the time I felt like maybe for the next 10 years, now I feel like for the rest of my career.

So I was very lucky to discover R. And it also felt to me like I could offer something to the community, because I had worked on programming tools and tools to make people more productive with complex software. So I also didn't want to do a startup. And I thought, well, you know, this is fine. Because I think one or two people could actually make a significant contribution. We wouldn't need to have a startup. We could just make a contribution. And so that's when I started working on the RStudio IDE. And I worked on it initially by myself. And then Joe Cheng, who I had worked with at a couple of previous companies, joined me a few months later. And together we set out to build the RStudio IDE. And the general mission was to try to make a contribution to open source software for statistical computing.

Why free and open source software matters for science

So I want to now say, you know, why is free and open source software for science and data science so important? At the time I was focused on this idea of durability and accessibility. But as I've joined the R community, I've come to realize there's lots of other good reasons to prefer open source software for data science. So I want to get into a little bit of that. And I first want to make a distinction between different senses of the word free. There's for free, gratis. And there's with little or no restriction, libre. And both are relevant, obviously, with open source software. Famously Richard Stallman summarized the difference and the nature of libre as think free as in free speech, not free beer. We sometimes with free software focus too much on the fact that the software has no cost. But the most important thing is that it comes without restrictions.

And actually GNU summarizes the four essential freedoms of free and open source software. And they mostly have to do with being able to do with the program what you wish, including inspect it, modify it, create your own derivative works from it. And the fact that you are not dependent upon the original purveyor of the software to continue using or evolving the software. The fact that it comes without cost is important. But for science especially, those freedoms are actually even more important.

So what are some of the reasons why we want to prefer free and open source software? Well, I don't need to speak at any length about this with this group. But it's worth reflecting on the fact that if I use proprietary software to do science or data science and someone else wants to reproduce my results, at best that person needs to buy a license for the software that I used. But at worst, and this often happens, I can't even reproduce my own work in the future because maybe the vendor who created the software has gone out of business or they've made older versions of their products inaccessible. So long-term reproducibility is really only assured by using free and open source software.

I want to point out briefly just how important the R community has been in this movement toward reproducibility. There was an article written in Nature in 2012 making this case, and they cited two systems known at the time to enable the packaging of code, data, and text. One of those two was Sweave, which the R community actually came up with in 2002. So we've been at this for 18 years, certainly longer than any other programming language community.

There's another consideration, which is resiliency. As I said before, software products and companies come and go. We don't want our research, our ability to reproduce the research, tied to the fate of a specific product or vendor. Now, a variation on this theme is that software doesn't come and go. It actually stays and becomes really, really important to customers, and then the vendor decides, oh, this is an opportunity for us to dramatically raise prices and extract more value from our customers. So notably, the four essential freedoms that I talked about ensure that this cannot happen with free software.

There's a great example of this from the database world, where MySQL, which is a GPL open source database, was acquired by Sun, and then subsequently Sun was acquired by Oracle. So Oracle, a proprietary database vendor, now owned all the copyrights to MySQL, and they were gearing up to try to do all kinds of things to try to exploit that position. And even though Oracle actually owned all the copyrights for MySQL, the community took the code from MySQL, forked it, and created another product called MariaDB and continued on with development. So only the fact that that software was free and open source kind of ultimately protected the community from a vendor that was going to be abusive.

And we can think about this in the R community: RStudio and many other vendors have provided offerings around R, and the commitments of vendors can vary over time. Companies can get acquired, they can shift strategies. If, as a user, your primary investment is in open source R code that will run irrespective of any vendor's products, then you're protected from that. You have that resiliency.

There's another piece, which is participation. I think maybe 10 or 15 years ago, many people might have believed, well, a proprietary software vendor can kind of enumerate and account for all the methods that are important in a field and then provide that. I think now, with the explosion of innovation in statistical methodology, nobody believes that a single vendor could be the filter who decides what methods are supported and available and easy to use. So the fact that we have an open source ecosystem around R enables what you've seen with CRAN, where there's this huge long tail of innovation. There can be many, many different approaches to analysis, there can be innovation in methods, and it's all supported by the software. So participation is another fundamentally important thing.

And finally, what I talked about at the beginning, accessibility. Data literacy is becoming fundamentally important for organizations, and it's also fundamentally important for individuals. And open source software allows everyone to participate and use these tools, again, without regard for cost.

we are going to dedicate a substantial portion of our profits to philanthropic causes that relate to our mission of open source software and open science.

So thank you all very much for helping us to build this company and build this community. It's been an incredible experience, far exceeding anything I could have ever hoped for, and I'm excited for what the future holds. Thank you.

Featured software

rstudio