Resources

Tareef Kawaf | Welcome and the Posit Vision | Posit (2019)

From Posit president Tareef Kawaf. About Tareef: Tareef Kawaf is a software startup executive and current president of RStudio, Inc., a Massachusetts-based company that develops both open-source and commercial software for the R statistical programming language. Prior to joining RStudio, Mr. Kawaf served as senior vice president of engineering and operations at Brightcove, Inc. Over eight years he helped Brightcove build and operate the second-largest online video platform, helping it grow from $0 to $92M in revenue and complete its initial public offering (IPO). Mr. Kawaf jointly holds a patent for the “Method and System for Dynamic Pricing,” issued in 2001, which is a core component of Oracle’s ATG Commerce solutions and helps retailers define sophisticated rules for couponing, discounting, and personalized commerce. Mr. Kawaf received his B.S. degree in Computer Science with a minor in Mathematics from the University of Massachusetts Amherst in 1994. He and his family currently reside outside of Boston, MA.


Transcript

This transcript was generated automatically and may contain errors.

I'm usually the shiny-headed guy you see in some people's photos, like when they take photos with Hadley and you see there's a bald, shiny-headed guy next to him, that's usually me, and they're like, why does this guy keep showing up in these photos?

If you attended my talk last year at RStudio Conf, you know that I am regrettably not a data scientist, contrary to what Hadley said. Yet. One of these days I will get there. My background is in software engineering and mathematics, but I do enjoy playing with R, and I love to explore how you can use it to understand the world more deeply and, hopefully, make better data-driven decisions.

And that is precisely the question: understanding the world and making better data-driven decisions. This is going to be so easy for me, I can tell.

You know, it's funny, I was talking to my wife and my seven-year-old said, you know, would dad get fired if he screws up on this presentation? I said, I don't think I will be fired for this, but all right.

So six years ago I got into R and eventually I met JJ, and at the time I didn't fully understand or appreciate just how incredible the R community was, and continues to be, and the positive impact you all have on this world. I also hadn't gotten a full picture of what was possible with this remarkable tool chain.

You'll hear a lot from speakers this week about the power of the open-source packages that are created in R, and I don't need to tell you guys about all the incredible things that are coming down the pipe. My goal in this talk is to give you a deeper look at the RStudio story, what we believe in, and how organizations have adopted R and the R ecosystem to solve real-world problems.

So given the rate of innovation and change, it is often difficult to stay on top of what is possible and how the puzzle pieces fit together. I happen to know this because last year I was lucky enough to visit 30 different customers all over the world.

And my goal was to sort of talk to them about the new products we're building, where we're going, and to see whether the problems that we're solving are the same problems that they are trying to solve for themselves. And it became really obvious that most people don't have the full story, and it's really hard. And it occurred to me that the reason for that is because we've never told that story publicly.

So let's make sure that you guys walk out of here with three things today. What is the way of RStudio? What do we believe in? Why do we do the work that we do? How can you adopt R in production? And finally, how does R sustain itself? You'd be surprised how many people still ask me that question.

The printing press and the prime directive

To answer that, I'm going to take you back to 1439. Does anybody know what was created in 1439, what was invented? The printing press. I knew somebody in the audience would know the answer to that question, so it wasn't so hard.

At about that time, there was a goldsmith by the name of Johannes Gutenberg, and he invented the movable type-based printing press. Prior to its arrival, the process for creating content was one where people would write manuscripts, usually monks, because they were the only ones that could read and write, and the process was really, really slow and very expensive, which meant that knowledge got centralized into the hands of the few, and being able to validate the authenticity and correctness of the work was really, really difficult.

Developing a press that could accelerate the creation of perfect-fidelity copies of a book, most notably the Good Book at the time, radically changed who could have access to knowledge and increased the confidence that everybody was seeing the same information. Many believe, including Wikipedia, that the arrival of the printing press helped start and fuel the scientific revolution in the 16th century.

I'm particularly attracted to the few concepts in here about ushering in the era of mass communication, permanently altering the structure of society.

Now 500 years later, we have the prime directive. John Chambers wrote about this in his book, and I think it's so important and it captures who we are and what we do so well that it bears reading out loud.

Science, business, and many other areas of society continually rely on understanding data, and that understanding frequently involves large and complicated data processes. Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the prime directive.

Why RStudio believes in code

So for us, code is reproducibility. Reproducibility is good science. If you believe in code and reproducibility, you can get reuse, you can get automation, you can get scheduling, you can get parameterization. Good science for us is good business.

So I don't want to rush over this too quickly. If you can reproduce your analysis, you can recreate it and repeat it, in the same context or a new one. And you can always reproduce your analysis if you record it in code.

So last November, I was at a conference in Barcelona, the Gartner conference, and there was this big movement that was starting up called the no code movement. I don't know how many of you guys are familiar with it. But I sat there and I listened to what folks were saying and I saw the CIOs sort of like oh, this is really exciting, this is going to be the new world, I don't have to hire data scientists, I don't have to code, you know, and it occurred to me that we are basically the opposite. I'm like, why did I show up to this show?

So I figured it would be good to just remind people of what we believe. There are four things, four reasons that we love code, right? The repeatability I suspect everybody here would agree with, right? It's important for your analysis to be repeatable and reproducible down the road. It's one of the key tenets, if you will, of the scientific process.

But there are other elements that are also important. Inspectable analysis. So inspectable analysis speaks to the ability to review the work, point out flaws in the assumptions or suggest improvements. It ultimately helps one understand how the results were achieved in a transparent manner.

Reusable analysis addresses the ability to leverage that same work again for future work, or to build on and extend it. If you want to solve something new, you might Google for it and discover that someone has already solved the same or a similar problem. It serves as a foundation for sharing.

And then finally, diffable analysis. I tried this phrase out for a couple of days, and most people said, I don't know what diffable analysis is. This may seem like a variant of inspectability, but it's important in its own right. What we're talking about with diffable analysis is being able to compare the changes made to your analysis, or somebody else's analysis, over time. It makes it easier for you to understand why decisions were made and, if there were mistakes, where the errors were introduced along the way.

So ultimately, we believe complex problems will require code. Code is communication, and communication is critical. Code is also what gives you leverage: a single data scientist creates something that a whole bunch of other people in the organization can use to solve real problems. With code, you can inspect how a problem is solved and either adopt it or figure out how to improve it yourself. And finally, with code, the answer is always yes.

Open source packages and commercial products

We maintain and support 174 packages for the R community. We're excited that over 50% of our engineering team works on free and open-source software, and our company is more than 50% engineering.

So just because you have really fantastic open-source tools doesn't mean that you can get them into your production environments or get your company to adopt them. It turns out that there are multiple constituencies that you have to deal with. Data scientists are one of them. Business users are another. And then there are IT admins, DevOps, and data engineers, a whole class of folks who have a vested interest in how you operate and how you use your data in production.

Everybody has their own set of activities and their own set of concerns that they need to worry about. So what we've been trying to do is figure out how to make that easier, how to get that adoption to happen in the organization: how do you get people to move from a world of point-and-click, drag-and-drop solutions to one that's grounded in code? To that end, we've added our commercial products.

And if you take a look, our commercial products are meant to address all of those concerns on the right-hand side.

RStudio Server Pro and the launcher

So I mentioned that I'm going to show you some of the things that we talked about with customers over the last year, and I'm going to go back to the IDE. If you can go back in time six or seven years, most people were using the IDE on a desktop. Some organizations started using RStudio Server to help centralize the workloads and better manage capacity and access to the data.

About six years ago, we started hearing feedback from the IT teams within organizations that were being asked to support the growing use of R. They asked for features: better control over access to resources, usage auditing, high availability, multiple versions of R, the ability to start multiple sessions, collaborative editing, et cetera. So we added all these capabilities into our premium product line, and that has helped us get to over 1,000 enterprise customers today.

In the past two years, we've seen IT organizations experiment with the use of Docker, and some have even started playing with distributed computing platforms. The pattern they use is to bake RStudio Server and Shiny Server directly into a Docker container so that each session runs a full server process. We've been working on a refactoring of our professional products to change the execution model, which we are calling the Launcher. In this new model, we have separated the execution of the R process from the server that is responding to the user, which will allow customers to start an R process on a remote machine.

So let's actually take a look and see what this looks like.

The heart of all of our work is our R packages, right? For us, everything starts from R and open source and expands up. So let's start with the IDE. I'm going to start a new session, and this one happens to be using Kubernetes. I'll name the session, and I can control the CPU resources that are used, the memory that is used, and the base image that I'm leveraging.

And once I start the session, I should get an IDE that looks identical to what you see on your desktops today. Not all that impressive, but if you run the plot, you should be able to see the work. Now I can go to the Kubernetes dashboard and see the jobs that are running. This is an R process running inside of a Docker container on Kubernetes, and you can see that it has the same metadata that I entered earlier.

Now going back to the IDE, I can take a look, and, oh, sorry, this session has started with that user's context and the right level of permissions. If I go to the packages pane, I can see that this particular base image has a very limited set of packages, and if I search for Shiny, you'll notice that Shiny's not in there.

If I want to start a new session, this time maybe in a validated environment, the organization says, okay, here's the set of packages that we're okay with you using. They're preconfigured and baked into another base image. I start that project, and here, if I search for Shiny, you'll see that the Shiny package is there.

RStudio Package Manager

Now this comes from Package Manager, the new product that we introduced in October called RStudio Package Manager, which was created to help the IT team manage what versions of packages you have in production. You can create different repos, each of which can contain a subset of packages that have been approved by somebody if you want, and then any R process pointing to that particular repo will get exactly the same versions of the R packages. This speaks, again, to getting a reproducible version of the R code that you're using.
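The repo-pinning idea described above can be sketched in a couple of lines of R. The hostname and repository path here are hypothetical placeholders, not a real Package Manager endpoint:

```r
# Point this R session at a curated Package Manager repository
# instead of a public CRAN mirror. The URL below is a placeholder.
options(repos = c(VALIDATED = "https://packages.example.com/validated/latest"))

# Subsequent installs now resolve only against the approved set,
# so every process pointed at this repo gets identical versions.
install.packages("shiny")
```

Setting this in a site-wide `Rprofile.site` is one common way an IT team makes the curated repo the default for everyone.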

RStudio Package Manager was started because one of the key problems we've seen in these IT organizations is that there's a big spectrum of R usage. Some have a Wild West model: you can use any package that you want anywhere, and obviously there's a concern there about reproducibility, unless you're using something like Packrat. Others lock down their R packages so that you can only use what has been pre-approved and blessed, and that's a very draconian process; you have to get special approvals to get through.

What we've done with RStudio Package Manager is made it easier for you to cater to either end of that spectrum. On one side, you may say, I want my R&D team to have access to the latest and greatest version of anything on CRAN or any of our internal packages. On the other side, you could say, you know what, when we get closer to production environments, or if we want a validated environment, here is a pre-selected set of packages that you can use. Hopefully that will make it radically easier for organizations to manage this.

Shiny, R Markdown, and RStudio Connect

So enough about package management. Let's switch to something that everybody here is more comfortable with, Shiny. Here's a Shiny application that happens to show you a stock portfolio over time, and as you can see, this stock portfolio is indicating that the market's not doing so well. Hopefully most of you guys have noticed that.

But what's powerful about Shiny is that you can give people the ability to interact, explore questions, and get the answers for themselves. So for the sake of the example, let's say I show up in a meeting with a bunch of stakeholders. I pull up this application and we see something that is interesting to us, right? And people say, oh, that looks fantastic. How do I get a report of this on a weekly basis, right?

And again, what's nice here is that this Shiny application has all the code that it took to run it, so in theory, it's completely reproducible. I can hand it to somebody else, and when they run it, they are running exactly the same code that I created.

So they say, okay, I want this on a weekly basis. Or some other stakeholder says, you know, that's nice, but I want a slightly different set of variables, and I want it on a monthly or quarterly basis. What do you do when that happens? What we're proposing is that you take a look at R Markdown.

So here's an R Markdown doc that has the same logic and visualizations that you saw in the Shiny application. R Markdown, if you haven't seen it, is a text-based literate-programming content format that allows you to mix prose, code, and output in a single document. You can even use LaTeX equations. Similar to notebooks, you can have the outputs inline, but one of its key powers is that you can reproducibly knit the document into a variety of output formats.
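A minimal sketch of what such a document looks like; the title, file name, and chunk contents here are invented for illustration:

````markdown
---
title: "Portfolio Report"
output: pdf_document
---

Prose, code, and output live side by side, so the rendered
report is always regenerated from the code that produced it.

```{r portfolio-plot}
returns <- cumsum(rnorm(100))  # stand-in for real portfolio data
plot(returns, type = "l")
```
````

Knitting it to a PDF like the one in the demo is then a single call: `rmarkdown::render("portfolio.Rmd")`.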

For those of you who heard my talk last year, you know that I use R Markdown with IOSlides to create our quarterly board deck. In this example, we'll use PDF. So if you take a look at this, this is exactly the same set of visualizations and data that we saw in the Shiny application.

Once I'm happy with this, I can save my work. And how do I save it? I check it in. When we check it in, this is the diffable story we were talking about, right? I can actually see the exact changes that were introduced.

That transparency makes this much easier to review than any spreadsheet or any point-and-click solution. Once I have what I want, I publish it and share it with other people. And what I'm proposing here is that you publish it to Connect, because Connect is the easiest way for you to publish your data products to your organization. Connect will detect all the dependencies and push the full code up to the server.

And the server can now recreate that analysis in a secure, sandboxed environment, producing the same PDF report we were looking at earlier. What's nice about Connect is that, because all the data is up there, I can rerun that analysis at any moment in time. I can also control who has access to the analysis, right?

I can also email the report out. And this particular report allows you, through code, to customize the subject line and the body of the email, including the various graphs that we had, as well as the attachments that are included. The attachments here are not only the PDF doc that we looked at earlier, but any content that you created programmatically in the R Markdown doc. In this case, we have both a CSV file and a PowerPoint presentation, so if somebody looks at it and says, okay, this is really great, they can go ahead and send it to their boss, right?
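On Connect, the kind of email customization described above is typically done by setting output metadata from inside the document itself. This is a hedged sketch, the subject and file names are invented, and the `rsc_email_*` field names follow the Connect conventions of the time:

```r
# Run inside a chunk of the scheduled R Markdown report on Connect.
# Connect reads these metadata fields when it sends the email.
rmarkdown::output_metadata$set(
  rsc_email_subject     = "Weekly portfolio summary",  # custom subject line
  rsc_email_attachments = c("portfolio.csv",           # files the report
                            "portfolio.pptx")          # generated itself
)
```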

The other element I did not get a chance to show you is that, because we did all of this with R Markdown, you have complete reproducibility of the work that you've done. And one more thing we haven't shown is the parameterization of R Markdown. You can take that same report I showed you earlier, expose the same kind of input parameters that you saw in Shiny, and allow users to interact with them. And for each of these parameterized versions, you can send that out to your constituency.
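One way those parameterized versions might be produced, assuming a hypothetical `portfolio.Rmd` that declares `strategy` and `period` params in its YAML header:

```r
# portfolio.Rmd would declare, in its YAML header:
#   params:
#     strategy: "aggressive"
#     period: "weekly"
# and its chunks would refer to params$strategy / params$period.
# Each stakeholder's variant is then just a render call with
# different parameter values.
rmarkdown::render(
  "portfolio.Rmd",
  params      = list(strategy = "conservative", period = "quarterly"),
  output_file = "portfolio-conservative-quarterly.pdf"
)
```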

So one member of your constituency can look at it and say, okay, I really like the aggressive version of the portfolio, and I want to receive that on a weekly basis, because that really matters to me. Another member of the organization can say, I want this on a quarterly basis; I have a longer-term view of how I think about portfolios.

So this is one of the key tenets that we see. If you think about the printing press, this is a way for you to recreate the printing press, but all through code.

R as a glue language

The next thing we talk a lot about is that Connect, RStudio Server Pro, and Package Manager make it easier for you to put R into production. Now, most organizations don't have just R users, right? They have a variety of tools and services and connections that they need to interoperate with. So the next thing I want to spend time on is showing you the next superpower of R: R as a glue language. Many of you may have experienced that firsthand.

We'll start by showing how you can connect R to Spark, since Spark is still the new hotness, I think. In the IDE, and this is available in the open-source version, you can connect to a Spark cluster. For those of you who have never seen this, I can create a notebook from it, pull it back into the R Markdown doc, and start writing the dplyr code that I'm familiar with. What's really powerful is that even though I'm writing what looks like regular dplyr code that runs in R, when this code runs, it's actually running in Spark. So let's take a look and see what that looks like.

Here's the Spark SQL that this would actually run at that moment in time, and if I change the code, I can run it and get the summary data back. So this could be working on a dataset that is very, very large, and similarly, I can leverage any of the machine learning algorithms that are in Spark. This is all built on the sparklyr open-source package, which you can play with if you're interested.
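The dplyr-to-Spark translation described above looks roughly like this. A sketch only, assuming a local Spark installation and the nycflights13 data package as a stand-in dataset:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance; a cluster master URL works the same way
sc <- spark_connect(master = "local")

# Copy a demo table up; in practice the data would already live in Spark
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# Ordinary-looking dplyr, but it is translated to Spark SQL
# and executed in Spark, not in R
delays <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE))

show_query(delays)  # print the Spark SQL this pipeline generates
collect(delays)     # pull only the summarised result back into R

spark_disconnect(sc)
```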

The next thing we've invested more time in is connecting to databases. Some of you may have seen this: you can connect through an ODBC driver in the connections pane and explore the schema of a database, a SQL Server database in this particular case. What's new in 1.2 is that you can edit the SQL, manage it, and run it locally, and then, once you're happy with it, extract it and pull it into R Markdown.

Now, most people don't know that R Markdown can include different kinds of code chunks. We've shown you R, and you're now seeing SQL. We've also added more support for Python in the IDE: you've always been able to have Python code, but now the output of the plots comes inline just as it would with R, and we've improved the cross-chunk data pass-through.
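A sketch of what mixed chunks in one R Markdown document can look like, with cross-chunk data pass-through via the reticulate package's `py` proxy object; the chunk contents are invented for illustration:

````markdown
```{python}
# A Python chunk: its plots and values render inline like R output
scores = {"great beer": 0.88, "debugging": 0.149}
```

```{r}
# An R chunk reading the Python object through reticulate's `py` proxy
py$scores[["great beer"]]
```
````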

Another language we added support for is D3, through the r2d3 package. So if you're really into D3, it's now much easier for you to create these types of JavaScript visualizations and include them.

Finally, some of you may know about Plumber, an open-source package that you can use to expose REST-based API endpoints from your code. In this particular case, we have a sentiment analysis tool that is exposed through an annotation on the predict function. In the 1.2 release of the IDE, it's easier for you to run those APIs locally, test them out, and make sure they're working the way you expect. Once you're happy, you can publish them to Connect, just as you published the R Markdown doc I showed you earlier, or as you publish Shiny applications.
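A minimal Plumber sketch of the kind of annotated endpoint described above; the scoring logic is a placeholder, not the actual sentiment model from the demo:

```r
# plumber.R -- the #* annotations define the API surface
library(plumber)

#* Return a sentiment score for the supplied text
#* @param text The text to score
#* @post /predict
function(text = "") {
  # Placeholder standing in for the real sentiment model
  list(text = text, score = runif(1))
}
```

Running it locally for testing is then `plumber::plumb("plumber.R")$run(port = 8000)`.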

And again, we detect all the dependencies. Some of you may not have caught it, but this code is actually using Python, so not only are we detecting the R dependencies, we're also detecting the Python dependencies, and we send those up to Connect. Connect recreates all of that in its sandboxed environment, and then I can go ahead and play with it if I want.

So again, this is a real model, so let's try out some things. If we type in debugging, we see what it comes back with, and, not entirely surprisingly, it comes in at 0.149, which is not a particularly high score. Now, you know, we're in Austin, so some people will get a chance to do this: if we type in great beer, it comes back with a 0.88 score, which is much, much better.

Finally, let's just for yucks and giggles, type in R users. Amazingly enough, it comes in at 0.90. I swear to you, we did not play with the model. This is really what it comes back with for the Python-based sentiment analysis model.

The RStudio virtuous cycle

All right, so just to recap: RStudio's goal is to create an organization that can make a durable contribution to this world by creating very high-quality open-source software that is accessible to anyone, anywhere, regardless of their economic means. In order to make sure the business would be sustainable longer-term, we decided to build a software-only business that sells commercial versions of our open-source server products.

These products help enterprises adopt R and open science tools in production environments. The differentiation between our open-source products and our commercial products lies in capabilities around security, authentication, monitoring, tuning, scaling, metrics, collaboration, and sharing, and you get a commercial license and premium support.

Our hope is that we can sustain this virtuous cycle, which allows us to then invest in the areas of the greatest needs for the R community. So far, things have been going really well. The open-source packages make R more and more accessible, which in turn drives the adoption of R in enterprises, which hopefully creates more demand for our commercial products, which then allows us to fund even more open-source work.

Put another way, our open-source products create value for everyone, and our commercial products help businesses leverage that value. We believe this benefits the open-source community, because you have talented, paid open-source developers creating a growing number of free, high-quality R packages that can be consistently maintained.

Commercial customers benefit from a financially viable and sustainable ecosystem of these open-source tools and packages, along with products that meet their enterprise IT standards for deployment and production. A thriving ecosystem of education, training, and services can bring energy, new ideas, and talent to create and share data products at a fraction of the cost and hopefully improve our societies and our world.

Thank you so much for your attention and for taking the time to attend the conference. We are very excited to have the opportunity to spend a few days together. I picked out a few of the talks that will dive into greater detail on the topics that I could not talk very deeply about. If you are interested in hearing more, feel free to take a picture of the slide. If you want to see any live demos or ask any deeper technical questions, please stop by our professionals lounge if you go out the back and towards the left there. You can ask our engineers, and I will also hang out there and answer any questions. Thank you, everyone, and have a wonderful, fun-filled day.