
RStudio Team Demo | Build & Share Data Products Like The World’s Leading Companies
You probably know that RStudio makes a free, open-source development environment for data scientists. It’s made with love and used by millions of people around the world. What you might not know is that we also make a professional platform, called RStudio Team.

Learn how RStudio Team can:
- Help you scale your data science work
- Seamlessly manage open-source data science environments
- Automate repetitive tasks
- Rapidly and securely share key insights and data science products with your entire organization

Timecodes:
0:00 - Intro
4:18 - Hard truth of data science
10:22 - Serious Data Science
16:46 - Model management with R and Python
18:48 - Live Demo / RStudio Workbench
23:09 - RStudio support for Jupyter Notebooks
24:40 - Live Demo / RStudio Connect
28:01 - RStudio support for VS Code
30:05 - R and Python within RStudio
32:33 - Scale and share data science results
36:55 - Sharing previous versions of presentations
38:16 - Data Science team knowledge sharing
40:36 - Scheduling and emailing data science content
43:55 - Live Demo / RStudio Package Manager
48:09 - Data Science stories
49:37 - RStudio Team
52:59 - What makes RStudio different?
55:12 - Q/A

Learn more: Leading organizations like NASA, Janssen Pharmaceuticals, the World Health Organization, financial institutions, government agencies, and insurance organizations around the globe use RStudio’s professional products to tackle world-changing problems, and we’re inviting you to learn how. You’ll learn how RStudio Team gives professional data science teams superpowers, with all of the bells and whistles that enterprises need.

You can try RStudio Team free here: https://www.rstudio.com/products/team/evaluation2/

If you'd like to access presentation slides, sign up for future events, provide feedback, and/or ask additional questions, we've bundled everything together for you here: https://docs.google.com/document/d/1HGt7LSohhyxpCvETvVEFHugrdaSnTcZaXbI0jV5g9ok/edit?usp=sharing
Transcript
This transcript was generated automatically and may contain errors.
Hey, everybody. Thanks for joining the RStudio YouTube Live for today. We're going to get started here in just a few minutes. I'm going to be going through quite a few different slides and we'll be having a lot of fun today. So we'll give some folks a couple of minutes.
Thank you so much for joining me today. My name is Tom Mock. I'm going to be your host for today. I'll be sharing some links and some comments in the chat. So if you do have any questions, feel free to post them there. Let me know how things are going and if you have any issues with the live stream.
So for today, we're going to be going over RStudio Team, which is one of the professional products and kind of the overall software suite that RStudio provides. You probably know RStudio from the self-named RStudio IDE or integrated development environment. And that's one of the core things we produce in addition to all the free open source software that we provide. Now we have to make money to kind of give away a lot of the things we create. So we also sell professional products that enhance some of the open source work that we produce and give away.
You may be familiar with the RStudio IDE. This just looks like any old RStudio IDE session, the difference being that, for what I'm doing today, I'm actually running RStudio Workbench. So you'll notice that I'm working through a browser as opposed to my desktop.
So let's knit some slides so we can create a quick slide presentation that I can actually talk about today. I'll take this file and confirm that I'm working on the latest one. Okay, this was created about 20 seconds ago. Great. Let's take that file and deploy it to RStudio Connect.
So in about five seconds, I went from working in my environment to having this hosted on RStudio Connect. And now I have a URL that anyone can access anywhere in the world. So these slides are now up at Colorado.RStudio.com.
The hard truth of data science
Part of what we're covering today is the hard truth that data science is really hard. It can be challenging to get value from the work you're doing, or to get executive buy-in or business value from it. In many cases, data science teams fail to live up to what they promise or what they want to deliver because of these challenges, this inertia they have to fight against.
So first off, data science teams find it difficult to create insights that impact decision-making, or find it difficult to create, maintain, and improve their data products, their models, their APIs, their applications over time. This shows up in a lot of different ways: it could be difficult to recruit data scientists because you're using a very specific proprietary tool, or you don't have all your data connected. For decision-making, maybe your stakeholders just don't understand the insights, or there's a long delay between a "what if this happens?" question and the actual answer.
It's also often difficult for teams to build on previous work for new use cases. They basically have to reinvent the wheel every time rather than kind of building off previous work. Also, they might have analysis constraints or delays imposed by some of those proprietary tools. This can lead to slow iteration, which hinders alignment between the business unit and the data science team, or even irrelevant or outdated insights because things have been slowed down.
Additionally, this can lead to insights becoming obsolete or difficult to reproduce. Again, reproducibility, a core tenet of RStudio, matters in both academia and business settings. Siloed teams often lead to redundant work and a lack of collaboration, with too much time spent just maintaining tools or wrestling with data science environments. Lastly, teams often can't self-deploy, so they have to go ask IT, "Can you take this and run with it?" or hand it off to another team. There's a disconnect between what you're actually creating and getting it into decision-making or into production.
Serious data science: open source, code-first, centralized
Here at RStudio, we try to solve these problems, the difficulties of creating insights, impacting decision-making, and maintaining and improving work over time, by focusing on our three core tenets: open-source work, a code-first mentality, and centralized or cloud-based data science environments and production.
So open source basically means widely used. It eases recruiting, retention and training. You have an entire massive community in R and Python creating content that you can learn from, as well as libraries and different code examples that you can adapt. Code-first meaning it's flexible. You can actually craft it into exactly what you want. There's not this black box mentality where you just kind of plug something in and get something back. You understand the process. And then centralized or cloud-based by reducing the necessary work and enhancing collaboration, again, I'm working from a browser so I can access my work from anywhere and collaborate with my colleagues even though we're all remote in this time.
Open source also means things like comprehensive. So based on community contributions, there's tens of thousands of R packages on CRAN and tens of thousands of Python libraries across PyPI or other locations. Code-first also means you can iterate quickly and update quickly and you can adapt the code and understand by just changing small portions. And then for centralized work, your deployment provides stakeholders self-service access. So you can actually say here's the data product I've created or the model that we're using and have them actually interact with that as well as have it work against, say, your website in production.
And then lastly, open source means interoperable. So you can break down some of these analytic silos. You're not stuck with a specific vendor. You can move between clouds or even move between products because you own the kind of data science code that you've written there in terms of you're able to adapt it. Code-first meaning reusable and fully extensible. So again, you can diff it. You can put into version control and check all your changes and see what did we do two years ago versus what we're doing today. And lastly, package management is something that we're very keen on to support, again, reproducibility and easing the administration of open source data science in your enterprise.
So, summarizing this down: this serious data science concept is open source, code-first, and centralized computation or a cloud environment. The leading languages we're focused on are R and Python. That's what we see most data science teams using today, and that's where we've built a lot of our tooling.
And that's where RStudio Team comes into play. RStudio Team is a suite of all three of our professional products: RStudio Workbench, which is basically the RStudio IDE running on a server, along with VS Code and Jupyter for Python users; RStudio Connect, for sharing insights with decision makers via web apps, emails, APIs working in production, and other assets; and RStudio Package Manager, which is really the backbone between those two products, controlling and managing the packages that data scientists need to create and share their insights.
Bridging the gap between data science work and outcomes
What we often see is this massive chasm or this gap between the hard work that the data science teams are doing and the actual outcomes of that work. So either influencing decision makers or informing them or making decisions, as well as automated decisions or things working, models in production, things working on your website, things making decisions automatically within your organization or informing your organization.
Data science teams are creating insights or sharing insights to impact decision-making, whether through interactive web applications, reports, or APIs and other assets that make automated decisions through machine learning.
These data science teams are probably wanting to use a data science workbench where they can choose to work in their primary language. Maybe it's R, maybe it's Python, maybe it's both. They want to have that flexibility of having all this available, ready to go in one environment. So they're building things in R and Python. These assets need to live somewhere. There's tailored applications that actually have a front end so the decision makers can go and interact with it. There's reports that need to go out kind of automatically.
There's APIs that kind of, again, can work against the website or other, you know, software languages like Java or Scala or anything else. So these need to be published to a deployment server, basically somewhere to host all these data science products you're interacting with or creating. Things that exist on this deployment server can then be delivered, again, by web apps and emails and reports that inform decision makers when they need it. And they can actually have self-service applications so they can kind of understand what you're doing in your data science role.
The last part of this puzzle is package management in terms of R and Python or open source libraries, which is their strength. But you also need to kind of manage those different environments because the open source landscape is changing over time. So package management is important for easing maintenance as well as reproducibility of your analyses over time.
Now, if we overlay exactly kind of what we provide here at RStudio, this is how it fits into the world. We do have a workbench environment in RStudio Workbench that allows you to analyze data in R and in Python. We have RStudio Connect that can host all the different R products as well as many Python products. So APIs in R and in Python via Plumber and Flask. Pins for datasets in R. Shiny applications with interactive front ends. R Markdown for reports, websites, or even entire emails and other things. As well as Jupyter, Streamlit, Dash, FastAPI, Bokeh, and Flask.
Real-life end-to-end example
Let's dive real quickly into a real-life example. This is an example that lives at solutions.rstudio.com, called Bike Predict, and it's basically an end-to-end example of the exact workflow I showed before, but with all those little images replaced by actual data science products.
So the first part of your data science workflow might be data import, cleaning, fitting a model, training a model, all the different things you're doing in your day-to-day work. This can be done inside R Markdown. And with RStudio Connect, you can host that R Markdown so you can reference it in the future, or even schedule it to be re-executed periodically. So you might do nightly model training, or batch training that runs against a database and then saves the model back somewhere.
You can also save existing data sets. So say you do have a subset that you want to pull into an application. You can also extract specific data sets and save them as a pin or a data set living on RStudio Connect. RStudio Connect can connect to your databases. So you can have a live database connection against either a production database or one specifically tailored just for your web applications.
The models will be served up as an API. So your model is trained, it's saved, and then it's actually put into production as an API via the Plumber package. So this will then serve both a front end that's interactive in terms of your business users or your decision makers can go to a Shiny application and interact with it and see how does the model work and they can change different parameters and see how it changes. But it's also working in production in terms of doing the predictions for you on your site.
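The demo's model is served by Plumber, and callers (a Shiny front end, or a production website) just make HTTP requests to it. As a rough sketch of what such a caller might look like, here is a stdlib-only Python snippet; the URL and input field names are invented for illustration and are not the demo's actual API:

```python
import json
from urllib import request

# Hypothetical Connect-hosted model endpoint (placeholder URL).
API_URL = "https://connect.example.com/bike-predict/predict"

def build_prediction_request(station_id: int, hour: int) -> request.Request:
    """Package the model inputs as a JSON POST request, the way a Shiny
    app or a website back end might call the deployed API."""
    body = json.dumps({"station_id": station_id, "hour": hour}).encode()
    return request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"}
    )

req = build_prediction_request(station_id=42, hour=8)
# Actually sending it would be: request.urlopen(req)  (not run here).
```

The point is that once the model lives behind an API, any HTTP-capable client, in any language, can consume the same predictions.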
RStudio Workbench demo
So the kind of the takeaway from the initial part of the presentation was I can take these slides that I needed to rebuild and send them essentially instantly to RStudio Connect. So that ability to quickly iterate is very powerful. You might notice that in this RStudio environment, I've got a couple new things that aren't available in RStudio Desktop because I'm running in the cloud and because I'm running RStudio Workbench as opposed to the open source desktop version.
Number one, I'm authenticated. So for your IT team or for security reasons, I'm authenticated through single sign-on. So I have enterprise-grade authentication, basically making sure that any of the data that I'm working with is secure and that all the different proprietary things I'm working on are protected. I also have the ability to change my version of R. So I can switch from R version 4.0 to the latest at 4.1.1.
The other benefit of working in the cloud, outside my desktop, is that I'm no longer tied to a specific compute environment. Sure, I've got a powerful laptop, but my cloud environment is almost always going to be much, much more powerful.
So specifically, you might have, say, like an AWS instance or an Azure or, you know, Google Cloud, whatever different kind of cloud provider you want to use is supported, and you can actually have support for things like Kubernetes for additional scaling. What that basically means is that I can actually, from within my session, run an additional background script against the Kubernetes cluster and have that scale up as needed.
You might think that's kind of silly: okay, I don't want to run an upload script as a background job. But what about a grid search or model tuning? This is a quick example of a long end-to-end grid search that runs across a bunch of different hyperparameters, fitting a model and saving it out. Of course, I could run this in my existing environment, but that would lock up my console, and I wouldn't be able to keep working. So what I can do with RStudio Workbench is start a launcher job.
It automatically pulls in the script I'm working on. So this tuning script that's probably going to run for a while. I'll set it to the maximum of three CPUs and eight gigs of memory, and then I'll start that. And now this is going to run in a background environment, and it's going to return back examples as it goes along. I'm still able to use my primary environment, so I can still do my math or do any exploratory data analysis or plotting, but I have this launcher job that's going to background.
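A Workbench launcher job is configured through the IDE, but the underlying idea, pushing a long grid search into the background so the interactive session stays free, can be sketched in plain Python. This illustrates the pattern only, with a made-up scoring function; it is not the launcher API itself:

```python
import itertools
from concurrent.futures import ThreadPoolExecutor

def fit_one(params):
    """Stand-in for fitting one model in a grid search; a real tuning
    run would train a model here and return its validation score."""
    score = params["depth"] * params["learning_rate"]  # fake score
    return params, score

# A small hyperparameter grid (values invented for illustration).
grid = [
    {"depth": d, "learning_rate": lr}
    for d, lr in itertools.product([2, 4, 8], [0.1, 0.3])
]

# Run the whole grid on background workers; the "main session"
# could keep doing exploratory work in the meantime.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fit_one, grid))

best_params, best_score = max(results, key=lambda r: r[1])
```

With the launcher, the same idea scales out to a Kubernetes cluster instead of local workers, with CPU and memory limits set per job.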
Jupyter notebooks and VS Code in RStudio Workbench
So here's the RStudio IDE, where I just was. I backed out to the home page or landing page, which has a few different projects I'm working on, as well as a Jupyter Notebook and a VS Code session. Again, because we're trying to support both R and Python workflows, or a mix of the two, we also have support for things like Jupyter Notebooks.
So I can open up a Jupyter Notebook inside RStudio Workbench with the exact same authentication. I still have my home directory that is embedded here. So all the files that are available to me in RStudio and R are also available to me in Jupyter or VS Code.
Again, I didn't have to set anything else up. I just am working from inside RStudio Workbench. I now have a Jupyter Notebook, and I'm running Python code in here. So here is the output from just reading in a quick data set and getting a summary of the different columns. I can create a quick graphic in Matplotlib and look at that. And then I can get the raw data as a table here, all this in kind of an interactive notebook that I wanted to use, all behind my authentication.
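The notebook in the demo uses pandas and Matplotlib; as a self-contained sketch of that same read-then-summarize step, here is a stdlib-only version, with a tiny invented dataset standing in for the real one:

```python
import csv
import io
import statistics

# Invented stand-in for the dataset loaded in the notebook.
raw = io.StringIO(
    "date,price\n"
    "2021-01-01,10.0\n"
    "2021-01-02,12.0\n"
    "2021-01-03,11.0\n"
)

rows = list(csv.DictReader(raw))
prices = [float(row["price"]) for row in rows]

# A per-column summary, much like df.describe() in pandas.
summary = {
    "count": len(prices),
    "mean": statistics.mean(prices),
    "min": min(prices),
    "max": max(prices),
}
```

In the hosted notebook the same steps run against the real file in your home directory, with the plot rendered inline.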
Now, when I mentioned authentication, I said single sign-on, but we support a lot more than that: LDAP, SAML, and Active Directory or Azure Active Directory, among others. So most of the enterprise-grade authentication methods you might use in your organization are supported here, and I just have to log in one time to get access to all these different environments.
I can also take this Jupyter Notebook and publish it to RStudio Connect. So importantly, you can publish the finished document or the source document. So both in R Markdown and in Jupyter, you can publish these documents and actually have them rerun on a schedule or just post the finished document so that you can actually interact with them and rerun them in the future.
If I go to RStudio Connect, that's where all my content lives. We can let that pop up in a second; we've got a lot of content in RStudio Connect, so it'll take a little bit to load, and then we can look at the content I own. Right here are the actual slides I was using for today, and then we can see the Jupyter Notebook I was working on, published to RStudio Connect as well. In just under a minute, I was able to get the notebook I was working on interactively onto RStudio Connect, and now I can share it with specific users in my enterprise.
While we're here inside RStudio Connect looking at a notebook, you'll notice that it defaults to opening the access panel. With regard to authentication, once I'm authenticated I can do things, but I also want to control who else has access, whether that's specific users or my entire company. It defaults to only I can see the content I've published, which is a good idea; maybe you don't want to release it to the whole company because it has sensitive data. I can add specific people, like my colleague Alex, who can see this document because I need his review, or the entire solutions team, so they can look at it and give me feedback.
Of course, if you just wanted to have it available to anyone in the org, you could always save it and make it where it's available to all users, login required. Basically, as long as they authenticate, they can access it. Or for what I did with the slides so that y'all could actually see it, I made it where anyone in the world could see it. So I made it anyone can see it, no login required.
So we showed a quick Jupyter Notebook. It's interactive. We published it to Connect, and we were able to kind of control the authentication around it. But let's go one step further. So let's say that we want to, you know, we're not wanting to work inside Jupyter Notebook. We actually want to work in, say, VS Code, as we're a Python developer, and we want kind of a full-blown IDE as opposed to a specific notebook.
So I've logged into VS Code. Again, it's using my existing authentication. So I've logged into RStudio Workbench, and then I can hop directly into VS Code. It's got, again, my whole home directory. So all the files that are available in R are available in VS Code here as well.
While that's loading and connecting the kernel, it's connecting to Python 3.7.5. It's basically making a REPL, a read-eval-print loop, meaning I can interact with Python live. It has to do that the first time the environment is set up; once that's done, it'll go pretty quickly.
And because we've embedded things like Jupyter and JupyterLab and VS Code inside of here, you can fully customize it. So you can extend it with the extensions you'd like, or other kind of integrations that you want to show. And I can always go back to RStudio if I want to get out of here and go back to R.
Using R and Python together
Still within RStudio, we've published some Python products and some R products. Let's show a quick example of using R and Python together inside RStudio. You may not know this, but you can actually mix R and Python chunks inside RStudio. This uses a library called reticulate, a play on the reticulated python, which allows you to call Python code from R.
So I'll run this, and it's got my Python environment running. Inside an R Markdown notebook, I can get output similar to what we were seeing in the Jupyter notebook. We have the exact same kind of table that we built with year, and again, I can change this from year to 365 days and re-execute the whole thing.
Inside RStudio, you can actually see both R and Python objects. The R pane is just showing a connection to a Postgres database, but Python actually has the prices data frame, and you can explore that here.
Now, this might all feel a little overwhelming, with all these different things going on. Again, RStudio Workbench is providing the front end to all these environments. It provides RStudio, the IDE you know and love and are comfortable in. If you're a Python user, maybe you prefer Jupyter notebooks or VS Code, and that's fine; we support those as well.
Now, the data products you create in terms of R Markdown reports, Jupyter notebooks, Flask APIs, Plumber APIs, Shiny apps, Dash apps, Bokeh, Streamlit, all these different things you're creating in R and Python, those all get published to RStudio Connect. So you're doing all of your coding initially in Workbench. Once you've created something, then you host and publish it on RStudio Connect.
RStudio Connect: sharing, scheduling, and scaling
Let's open up my portfolio dashboard. This is a Shiny application running on RStudio Connect, and it might look like most Shiny apps you've seen. It's got some parameters on the side, and some built-in interactivity through, I believe, the dygraphs R package. Or no, this is actually Plotly. So you can hover on different bars, and when you change parameters, the graphics update in place based on the changes you're making in Shiny.
All these interactive web apps, Shiny in R or Dash and Streamlit in Python, are server-based products: they have an R or a Python runtime behind them. As a data scientist, you probably don't want to figure out how to run a Linux box, how to scale things automatically, or how to set up parallel work across multiple R or Python processes. That's not where you want to spend your time, and that's what RStudio Connect does for you.
It provides the ability to scale up the actual server, the actual compute environment, as more users come to visit. If we read this, it says this Shiny application can spin up three separate R processes that can each serve 20 connections, so just out of the gate it can serve 60 people. I could change these parameters and make it serve 100 or 1,000, or in some cases we have examples where tens of thousands of users could potentially be supported.
That's going to be based around how big your server is, but you as a data scientist don't have to worry about how does this actually scale in terms of figuring out all the different components. You have the ability to just publish your Shiny application and then change in these parameters if you need to.
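The capacity math behind those runtime settings is simple to check. Using the numbers quoted above (treated here as illustrative defaults):

```python
# Runtime settings as described for the demo app (illustrative numbers).
max_processes = 3               # R processes Connect may start for the app
max_connections_per_proc = 20   # concurrent users each process can serve

# Out-of-the-box concurrent capacity for this Shiny app.
capacity = max_processes * max_connections_per_proc

# Raising either setting raises capacity, bounded by server resources.
scaled = 50 * max_connections_per_proc  # e.g. 50 processes
```

The real limit is the server's CPU and memory, but the point stands: the publisher tunes two numbers rather than administering infrastructure.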
Let's go to a quick Dash application, this one from my colleague David. It's one of the examples from the Dash team, one of their richer hello-world examples. And again, I have the ability to change these parameters as I see fit. As a data scientist, I can publish my application and change different things, and the graphic updates on its own, but I don't have to worry about how Linux works, how authentication works, or how scaling an environment works. I just get to create the data science content I want to put out there.
Here's a Streamlit example. Streamlit's very similar to like a Shiny-based R Markdown in terms of they're kind of like a notebook style setup or a linear flow, but the same idea in that you change a parameter and then the code is re-executed to regenerate something. In this case, showing an output or a graphic, but this is running Python on the back end as well.
The other benefit relates to reproducibility. Let's go back to one of mine: I have a presentation on production databases, and you can see I actually have multiple versions of it. I published an older one at the very end of September, and I can go back to that previous version. So I can maintain history, or I can overwrite. Say I have a presentation, an R Markdown report, or a Jupyter report I'm building: Connect will handle the history of that report, so I can always look at the latest version, but I can also roll back to an older one if I want to show it to someone else.
This is an R Markdown report hosted on RStudio Connect that aggregates the pieces of an entire project. If we scroll down, this is that end-to-end workflow I showed in the slides earlier: bringing some data in, creating datasets that are saved to RStudio Connect as pins, saving a model onto RStudio Connect, and then serving that model up as an API that interacts with Shiny applications or a website.
So this aggregates all those different components. So as a data scientist, sure, I'm probably going to have some of this on version control and that's where I'm going to be like actually writing code. But if I want to just go interact with the things in production, this shows me all the different components of this very complex data science task.
RStudio Connect has support for email servers. So if a condition is met, or not met, it can actually send you an email and embed a ggplot or a table inline, or attach a CSV, a presentation, or a report. Whatever you want to produce in R, you can build it into the email and send it along.
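On Connect this conditional emailing is done from R (for example, inside R Markdown), but the send-only-when-a-condition-is-met logic is easy to sketch. Here is a hedged Python illustration using the stdlib's email module; the metric and threshold values are invented:

```python
from email.message import EmailMessage

def build_alert(metric_value: float, threshold: float):
    """Return an alert email only when the metric crosses the threshold;
    return None otherwise, meaning nothing gets sent."""
    if metric_value <= threshold:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"Alert: metric {metric_value:.2f} above {threshold:.2f}"
    msg.set_content("See the full report on RStudio Connect.")
    return msg

# Below threshold: no email is produced.
quiet = build_alert(0.50, 0.90)
# Above threshold: an email is built (sending it would use smtplib).
alert = build_alert(0.95, 0.90)
```

Pairing a check like this with a schedule is what turns a report into an automated monitor: it runs daily and only emails you when something is worth looking at.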
Let's dive real quick though in terms of one of the things that resonates for me and what I use a lot is scheduling of content. I often have tasks where, you know, it's something I have to do all the time and I want it to rerun on a schedule. I don't want to have to manually pull from the database, click run, and then publish to connect. Like that sounds like a lot of me clicking a lot of buttons.
I can actually schedule this to run every single day. Built into Connect, I can say: rerun this exact code at this time every day, and publish the output. If we look at the history, there are many, many versions of this report, because it's been running every single day, doing some kind of cleaning or automating a model training step that I'd otherwise have to do manually. So now I can spend my time on things that are a better use of it than clicking run every day.
And importantly, because it's on RStudio Connect, the package environment is stable. In my dev environment I can install new packages and change my version of R, but once I publish something to RStudio Connect, it lives in its own little controlled environment, with a very specific version of R and very specific versions of all its packages, static and controlled. This report has been running for, I think, over a year, and I have other reports that have been running on RStudio Connect since I joined RStudio over three years ago. So it's nice to be able to put something in production and just let it run, even though my dev environment has changed, or the outside world has changed a bit.
RStudio Package Manager
I could talk about RStudio Connect basically all day, but I do want to talk a little bit about RStudio Package Manager, because it's the unsung hero behind all of this. The amazing part of the R and Python ecosystems is not only the core languages themselves but the package ecosystems that extend their capabilities. Again, there are tens of thousands of packages available, each of which can have many different versions. From an IT perspective, that sometimes makes people nervous when they hear open source: "I can't control it. How do I manage this?"
Package Manager allows you to store an entire copy of CRAN behind your firewall, air-gapped in your environment. Take ggplot2, for example, which I pulled up here: the instance of Package Manager we're using internally has the latest version as well as archived versions going back to 2007. So if I had a very old script, I could install a pre-compiled binary version of this library from 2007.
That binary component is really important. On Linux, R packages normally have to be compiled from source, but we provide pre-compiled binaries. So rather than taking, say, an hour to install a package, it's almost instantaneous, because you're just downloading and installing the package rather than compiling it. That's another benefit Package Manager provides.
You can also see here on the left that we have different repositories, or collections of packages. This means we're able to curate specific sets of packages. For example, I can have an internal-only repository for packages we've written inside RStudio that aren't public, or, for your organization, packages you're writing to do very specific things that you don't want to make fully available and are just using internally.
And then Package Manager also supports PyPI. So you can make a copy of PyPI, and let's see if we can get pandas. And we can: we can get the latest version, 1.3.3, as well as older versions going back through older releases. So again, both R and Python are supported.
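To make the PyPI mirror concrete, here is a minimal configuration sketch pointing pip at a Package Manager server. The hostname and path are placeholders, not a real endpoint; substitute your own server's PyPI repository URL.

```ini
# pip.conf (pip.ini on Windows); hypothetical sketch, replace the URL
# with your own Package Manager server's PyPI endpoint
[global]
index-url = https://packagemanager.example.com/pypi/latest/simple
```

With that in place, a command like `pip install pandas==1.3.3` would resolve the pinned version from your internal mirror rather than from the public PyPI.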
Wrapping up: customer stories and RStudio's mission
So while I'm really excited about RStudio Team and all the amazing things it can do, you often want to hear what your peers think. If you go to our customer stories page, which is linked here, or go to rstudio.com slash about slash customer stories, you can read about what other data science teams have done with our products: how they've used RStudio to harness the full capabilities of their data science team, streamline their deployment process, or merge software engineering and data science without making it painful for everyone involved.
People talk about using a mix of R and Python on RStudio, or about how RStudio products work for both their technical teams and their business team members; integrating both languages helps them be productive. And a lot of people use RStudio Connect as their one-stop shop for publishing: documents, apps, APIs, with a clean professional interface where you can work with all your different clients and stakeholders internally, along with your authentication and your professional database products.
So if we think about RStudio Team, it's really a modular platform. You can always choose just one of these if that's all you need. Some teams only have Workbench, some teams only have Connect, some have Connect and Package Manager. If you get them all together, there's actually a bundle discount, but you can always purchase them individually if that's the only product you need.
But importantly, this complements your existing analytic investments. Every organization has different needs, and all sorts of different data products or integrations they want to bring to the table. So your corporate data, all the different databases you're using, things like Spark or distributed computing: basically any data source you can connect to through ODBC from R or Python, you can pull into RStudio Workbench or RStudio Connect.
Now, to close out this part of the presentation, here's something I'm very passionate about: what makes RStudio different as a company? Our mission, our purpose for existing, is creating free and open-source software for data science, scientific research, and technical communication. Basically, we want to make sure that anyone with access to a computer can participate freely in a data-centric global economy. We want to enhance the production and consumption of knowledge, and we want to facilitate collaboration and reproducible research in science, education, and industry.
This is why we create so much open-source software, and why more than 50% of all our engineering resources go to free and open-source software: things like the tidyverse and the R packages we release, and some of the new support we're doing for Python, contributing either financially or with code to Python projects. Overall, revenue from our professional products goes into funding our open-source work so that we can achieve this mission.
Additionally, we're a public benefit corporation. This was announced at the 2020 RStudio Conference, and if you haven't seen that video, I highly recommend watching the keynote. JJ Allaire, our CEO and founder, converted RStudio from a standard corporation to a B Corp, a public benefit corporation. This means that our open-source mission is codified into our company charter: any decisions we make must balance the best interests of our community, customers, employees, and shareholders.
So I'll stick around for some more questions. But if you just want to follow up on your own, you can read all about serious data science and RStudio Team with these links. Again, this presentation is available at colorado.rstudio.com slash RSC slash RStudio dash team press. If you want to know what actual RStudio Team users think, you can read our reviews on TrustRadius; those are all public reviews of our products from customers. If you want specific questions answered, or you want to talk about something you didn't feel comfortable asking in a group setting, you can always book a live meeting with us, and we can connect you with one of our internal experts and talk about whatever you need. Again, no strings attached; let's just have a conversation.
And then lastly, maybe you don't even want a meeting; you just want to evaluate the product. You think, okay, this looked cool, but I want to actually try it out myself. You can go and evaluate RStudio Team today in a hosted environment: request access and try it out in the cloud. No strings attached, no purchase necessary.
Does RStudio Connect still require root-level privileges to run? If you're talking about running Connect inside a Docker container, it still requires root-level privileges. But if you want to talk further about that, please do reach out; we have some very exciting work happening in that area. So if you're interested in going a bit deeper on root-level privileges, reach out to us in chat and we can talk about it.
So there's a question, and I don't want to steal the thunder, but we've had a couple of questions about Tableau, so let's talk a little bit about that. Yes, in short, we do have integrations. There's the shinytableau package, which is a way of embedding Shiny apps into Tableau dashboards. And the plumbertableau package for R and the fastapitableau package for Python let you create RESTful APIs in R or Python that connect directly with Tableau via the Tableau analytics extensions. So R and Python become first-class citizens running within Tableau. And again, this is about merging those ideas of data science teams and business intelligence teams working happily together, as opposed to being at odds and unable to interact cleanly.
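To give a feel for what those packages are wrapping, here is a rough sketch of the JSON request/response contract a Tableau analytics-extension endpoint handles. This is a deliberate simplification under assumptions: the `_arg1` field name follows the TabPy-style convention, the `evaluate` function and the doubling "model" are placeholders, and the real plumbertableau and fastapitableau packages handle all of this plumbing (and the HTTP serving) for you.

```python
# Simplified sketch of a Tableau analytics-extension evaluation.
# (Hypothetical stand-in; plumbertableau / fastapitableau wrap this.)
import json

def evaluate(body: str) -> str:
    """Take a Tableau-style JSON request body, return a JSON array."""
    payload = json.loads(body)
    values = payload["data"]["_arg1"]   # TabPy-style argument naming
    scores = [v * 2 for v in values]    # placeholder for a real model
    return json.dumps(scores)

print(evaluate('{"script": "score", "data": {"_arg1": [1, 2, 3]}}'))
# prints [2, 4, 6]
```

Tableau sends a column of data as a JSON array and expects an array of the same length back, which is what lets a calculated field in a dashboard call out to an R or Python model hosted on Connect.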
Thank you again so, so much for joining me today. It's always a pleasure to connect with the community; thanks for taking the hour out of your day to hang out with me for a while. If you want to stick around for other things, our data science hangouts are happening, and those are great for more informal discussions about data science and being a data science practitioner. But thank you so much, and if you have further questions, please do reach out to us here at RStudio; we're always happy to chat. Other than that, have a great weekend, stay safe, and thank you for your time.
