
Katie Masiello | Professional Case Studies | RStudio (2020)
The path to becoming a world-class, data-driven organization is daunting. The challenges you will likely face along the way can be thorny, and in some cases seem outright impossible to overcome. How do you get teams that traditionally butt heads, such as IT and data science, to complement each other and work in unison? How can you efficiently scale the scope and reach of your data products as requirements change? How do you ensure your time is spent doing truly valuable work instead of updating charts and reports? How do you prevent the support structure behind your platform from toppling like a house of cards? Despite these challenges, we think the end result is worth it: an organization that is equipped to make important decisions, with confidence, using data analysis that comes from a sustainable environment. We see this outcome every day.
Transcript
This transcript was generated automatically and may contain errors.
J.J. articulated our mission so clearly, and it's really meaningful for me to have this codified now into the fiber of our organization and to be recognized as a B Corp. But what I really want to emphasize is that our contributions to open source are funded by professional customers and their commitment to our professional tooling. That's why the successes of our professional customers are so important. Their wins are wins for all of us in the entire community.
So today I'm going to talk about ways that I've seen our professional customers overcome common pain points that can trip up a team and limit their ability to become efficient data-driven organizations.
So a year ago I sat in this audience, like many of you maybe, as an engineer and a data scientist working in industry. And I was trying to figure out on my own how to make my analysis more powerful, more efficient, more impactful. And it's just funny how things turn out. Not too long after conf, I had the opportunity to move to RStudio, and now I'm working with professional customers who are asking those same questions.
So in the time that I've been at RStudio, I've had a lot of customer conversations. So far, well over 200. Right now I'm averaging about 8 to 10 meetings a week. And our customers span every industry you can imagine. In my role, I get a unique insight into the various stages of a data science team's growth and maturation and the challenges they face. Sometimes the team I'm talking to is literally a team of one. And some of these teams span the globe, with enterprise-level architectures and data products in place. But I do see common elements among these teams, across all industries. I've gotten a firsthand view into some of the successes that take a team from good to great.
Phases of data science team maturity
So I'm going to share two stories from teams that are at different maturity levels, but they've succeeded in overcoming significant pain points.
So we can take broad strokes and talk about phases of maturity in a data science team. Now our first phase, right, these are our fun, excited newbie teams. They're establishing buy-in for their analytics. They're establishing credibility for what they're doing. Maybe they're coming off of Excel and spreadsheet-based work, but there are a lot of big, easy wins for them as they start to explore everything they can do with code-based analytics. So this group I love. They're just dazzled by all the amazing things that they can do with code.
Our phase two teams are a little further along in the path, right? At this point, they've established buy-in, they've got credibility. Often they're growing in size. They've got more people on the team. Maybe they're adding more data products to their portfolio. But definitely no matter what's going on, their data products have been recognized as being more business critical, right? They've got cred behind what they're doing. And that's great, but along with that comes greater expectations and more demand. And so now these teams are in this state of figuring out, what the hell have we done? And how are we going to keep this up? We're not sure what we've gotten ourselves into. And there's a lot of growing pains associated with the phase two teams.
And then lastly, there's that upper-echelon phase three team, right? They are hitting on all cylinders. They've got it all figured out. And I think there's definitely an asymptotic path that phase two teams take towards becoming a phase three team. But really, no one actually gets it all figured out in the end.
So the two teams I'm going to talk about today, team one is a phase one team. They're moving from an Excel world. They've got really inefficient reporting, emailing things around. But they've discovered Shiny. And so they're enthusiastic about this, but they're really struggling to figure out how to get their data products into their consumers' hands from here. Team two is going to be a phase two team. And they're trying to maintain an efficient workflow as they grow. But they're finding some issues because they're becoming a little too cavalier in their approach.
So when teams are tangled up looking for answers to these types of questions, it can take weeks or months or sometimes years to overcome these issues. I get the benefit of 20/20 hindsight and reflection, and I can see these struggles and have visibility into their success strategies. I want you to walk away from this talk with enough information and vision that you can see there's a forward path for your own team. And I invite you to come to the lounge at some point this week and have a deeper and more specific conversation about how to move your own team to the next level.
RStudio professional products overview
Since we're talking about professional customers, we need a quick orientation on our RStudio professional products. RStudio offers three professional products: RStudio Server Pro, Connect, and Package Manager. These products provide a modular platform, which provides the shortest and most efficient path to lasting value for data science teams of all sizes. So RStudio Server Pro is your place for R and Python data analysis, now supporting Jupyter sessions as well as the traditional RStudio session. Connect is a publication platform. This is where all the data products that are created go to live, and where stakeholders can consume and interact with those data products. And RStudio Package Manager provides a managed and controlled environment for your packages. But at the core of all this are the open source packages that we're dedicated to.
Phase one team: from Excel to Shiny
Now let me introduce you to our phase one team. So they had a handful of analyses all tied up in Excel files. They used to email them around. It bogged things down tremendously. But they've discovered Shiny, and they have jumped wholeheartedly on board.
But the biggest issue facing this team was needing a way to get their Shiny app into the hands of their stakeholders so it could be used. For a little while, they were simply publishing to localhost, and it quickly became apparent that that was not a stable or scalable solution. But they also weren't sure whether they would need to rely on an outside group to help get their apps published.
So for them, a win is being able to have their work visible and discoverable by their stakeholders. And really, success in this realm would open the door for this group to reach their broader goals of securing buy-in for what they're doing, so they can expand their scope and their impact in the organization. They can get off the treadmill of updating the same quarterly report every quarter and get into more forward-thinking, proactive analysis.
So my phase one team found that RStudio Connect provided the best home for their data products. What was really appealing to them was the ease of deployment. Connect permits push-button deployment from the IDE, right from this little blue button that's really small on your screen here. You push this button, and all of your package and R version information is bundled up behind the scenes, shipped off to the Connect server, and unpacked. So deployment is as easy as pushing a button, and two minutes later you're off and running.
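For teams that would rather script that same step, here is a minimal sketch using the rsconnect package, which is essentially what the publish button drives under the hood. The server URL, account name, API key variable, and app names are all hypothetical placeholders.

```r
library(rsconnect)

# One-time setup: register the Connect server and an API key
# (the URL, account name, and key here are placeholders)
addConnectServer("https://connect.example.com", name = "connect")
connectApiUser(account = "katie", server = "connect",
               apiKey = Sys.getenv("CONNECT_API_KEY"))

# Bundle the app's code, package list, and R version information,
# then ship it to the Connect server, just like the blue button does
deployApp(appDir = "inventory-app", appName = "inventory-dashboard",
          server = "connect")
```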
Now, once this team's Shiny app was available on Connect, stakeholders could interact with all this information in a self-service manner, and it didn't require iterating back to the data scientists time and time again whenever they wanted to see a different view or a different variable.
But what was really great to see with this team was that they really began to leverage the multiple data types supported by Connect. So for them, Connect opened the door to new gains. In addition to the original Shiny app, they discovered they could tackle all sorts of other inefficiencies in their processes: scheduled and ad hoc reporting in R Markdown and Jupyter Notebooks, sending customized emails, and using pins as part of an ETL process.
So I want to show what this would look like as an example. In Connect, for report scheduling, when you upload an R Markdown document or a Jupyter Notebook, you have the ability to specify both the report frequency and the audience you want to send that report to. And it can be stored just on the Connect server, or it can be emailed out automatically.
Now this team has also integrated condition-based triggers. So some reports will only send an email out if certain criteria are met, say a low-inventory condition or a performance threshold being exceeded.
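To give a flavor of how that can work, here is a minimal sketch of a chunk inside a scheduled R Markdown report on Connect, using the rsc_email_suppress_scheduled and rsc_email_subject output metadata fields that Connect honors. The inventory data frame and its on_hand and reorder_point columns are hypothetical, assumed to be built earlier in the report.

```r
# Flag items that have fallen below their reorder point
low_stock <- inventory[inventory$on_hand < inventory$reorder_point, ]

if (nrow(low_stock) == 0) {
  # Stock is healthy: tell Connect to skip this run's scheduled email
  rmarkdown::output_metadata$set(rsc_email_suppress_scheduled = TRUE)
} else {
  # Alert condition met: customize the subject line of the email
  rmarkdown::output_metadata$set(
    rsc_email_subject = sprintf("Low inventory: %d items below reorder point",
                                nrow(low_stock))
  )
}
```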
Another workflow they've fully taken advantage of is using scheduled R Markdown reports to write ETL data to a pin. So this data lives right in Connect, and it's accessed by their scripts. They can control who has access to the data on the pin. And because it's ephemeral data, it didn't really make a lot of sense to take this little chunk of data and put it into a database. But they didn't want to be saving and importing CSV files. So for them, a pin was the perfect solution. This data is then accessed by their Shiny app, and now they have a regular update cycle behind their main data product, so the information is always fresh and relevant.
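As a rough sketch of that handoff, using the pins API as it stood around the time of this talk (pins prior to 1.0): the scheduled report publishes a pin to Connect, and the Shiny app reads it back. The server URL, API key variable, pin name, database connection con, and query are all hypothetical.

```r
library(pins)

# In the scheduled ETL report: register the Connect board, then
# publish the freshly pulled data as a pin
board_register("rsconnect", server = "https://connect.example.com",
               key = Sys.getenv("CONNECT_API_KEY"))
daily_sales <- DBI::dbGetQuery(con, "SELECT * FROM sales WHERE day = CURRENT_DATE")
pin(daily_sales, name = "daily_sales", board = "rsconnect")

# In the Shiny app: read the latest version of the pin at startup
daily_sales <- pin_get("katie/daily_sales", board = "rsconnect")
```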
So my team one channeled their enthusiasm for moving to a code-based solution, and they found tremendous efficiency gains and value add by getting their workflow streamlined and their products right into the hands of their customers.
Phase two team: production readiness and DevOps
Now it's time to meet our phase two team. So this is a fun bunch. They've grown in their data offerings. They've established credibility in the organization for what they do. But they're in an awkward spot, because as their data products become more business critical, they're finding that their phase one way of doing things is not sustainable. They've been burned a few times by pushing things straight into production, and things have gone awry. They know that they need to start adopting more process rigor and being more diligent about version control. But right now it's just hard to let go of those freewheeling days of phase one, and it all seems like a jumble of pieces with no clear order to where they need to fall.
This team's biggest challenges include tool- and process-based issues, but a big part of it is also philosophy. Because of the high visibility of their past fumbles in pushing straight to production, they're now butting heads with IT over deployment strategies. IT was really uneasy with the push-button deployment feature of Connect; it was a little too easy to send things out to production. That's what made Connect such a great sell in the phase one days, but now it's the source of friction.
So in this case, IT put out a nine-page list of instructions to follow in order to deploy to production, and they figured, hey, if anyone's diligent enough to get through all nine pages, they'll have passed the test and can be trusted to publish. Clearly this did not go over well with the data science team, and they just felt this overall turmoil of, what have we done and how are we going to keep this thing going?
Conversations between the data scientists and IT felt like data scientists are from Mars and IT is from Venus. Words like Jenkins and DevOps were swirling around in their heads, and it was just overwhelming.
So a win for this team was establishing a managed deployment workflow that was still efficient, but respectful and cognizant of a more formalized understanding of what production means. And if they could be successful in this realm, it would help them achieve higher goals: building a foundation of best practices and a collaborative relationship with their IT team.
What does "production" really mean?
But I want to tease out this philosophy issue a little bit, because it really feeds into the phase two team's pain points. So what is production? It can kind of feel like that summer family road trip: are we there yet? Where are we going? Frankly, how do you know when you're actually there?
So right now you might be having a hard time formalizing what it means to be in production. And if this idea of production sounds kind of nebulous, you're not alone. Look around, and if you know clearly in your head what production means, raise your hand. It's a quiet audience right now. Guaranteed, if you ask ten different groups what production and production readiness mean to them, you're going to get ten different answers.
So let's talk about what production might mean. An incomplete concept of production considers only whether the data product is accurate and ready to be used for informing business decisions. This is where my phase two team got bit. They made the mistake of saying, it works fine on my desktop, it's ready to go. And if that doesn't get your spidey senses all tingly, what if I tell you it was a Friday afternoon when they said that? And it was right before an organizational deadline.
So in my customer conversations, I've seen a number of in-house solutions, yet no one really knows what it means to be in production. And as demand for products grows, it becomes increasingly important, and increasingly painful, to address scaling, security, stability, and availability. These in-house solutions tend to crumble as higher demand is placed on them.
So a more formally defined state of production not only ensures that the data product is correct, but it's in a stable environment, it's safe, it's secure, and it scales and responds appropriately. These are the infrastructure things. But there's also a state of mind associated with production, and this is important. In production, you're thinking ahead and expecting that every change you make has the potential to disrupt. So you're designing, you're testing, and you're architecting to prevent that.
Now, my phase two team was already working with RStudio Connect, and they saw how it could provide that stable, secure, scalable environment for production. There's built-in authentication, access management, and app-by-app performance tuning. They were also able to stand up a staging environment for QA, which was great. But taking advantage of these infrastructure controls was one thing. The transformation happened as the team cultivated their production-readiness mindset and incorporated best practices, with version control, all in line in an automated dev-test-prod workflow.
And that was a mouthful. So I want to illustrate what this workflow looks like, because it's very versatile and useful for many teams. What this means is that team two has linked their master repo in GitHub to the production instance of Connect, and Connect automatically watches that repo and redeploys if changes occur. However, all development, feature additions and whatnot, is worked off a branch of that master. The branch is deployed to a staging environment. This way, the data scientists can still see and touch and feel and play with their deployed content without disrupting the production instance. When the content in the branch is approved and ready to be moved to production, a pull request is made to merge into master, and Connect automatically redeploys the updated version.
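One supporting detail for Git-backed content on Connect: the repository needs a manifest.json file that records the content's package dependencies and R version, so Connect knows how to rebuild it. A minimal sketch, run before committing; the app directory name here is hypothetical.

```r
library(rsconnect)

# Snapshot the app's package dependencies and R version into
# manifest.json; commit this file to the repo alongside the app code
writeManifest(appDir = "inventory-app")
```

From there, the production instance of Connect can be pointed at master and the staging instance at the development branch, and each redeploys when its branch changes.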
So this automated dev-test-prod workflow met the data scientists' needs for an efficient workflow, and it met the requirements of IT for a managed strategy that allowed for QA and a production mindset. This philosophy change and automated workflow have been a tremendous value add for team two.
Just to show what it looks like in Connect: there was a lot of clicking, so I didn't feel like I could show it live today. But you can see there are multiple versions of an app. So in GitHub, we switch to the development branch for the asset and initiate a pull request. When the pull request is approved and merged, we move to Connect. And I know that this content is Git-backed, because the info panel on Connect shows me the branch and the repository it's watching. It will periodically check for updates, or in this case, I can force an update, because I know there are changes coming. So Connect sees those changes and redeploys. And now we can see that the changes we made and approved are available and ready for us in production.
Closing thoughts
So I want to close with a few thoughts. I do not want imposter syndrome to get you. There are plenty of sexy and intimidating words out there when we talk data science in the enterprise: Hadoop and Docker, Kubernetes, Spark and Slurm. It's okay. You don't have to know or understand it all.
Because successful enterprise data analytics teams are just that. They're a team. So you do what you do best. But build up, count on, communicate, and learn with your team. You need an R admin, you need IT and DevOps, and you need to be listening to your stakeholders.
So to recap, what I want you to see today is that you can have your own success story, and I want you to benefit from seeing some solutions to common pain points that might confront you along the way. It's important that you find an efficient way to share your data products. Work smart: use automation and reproducibility to make efficient workflows. Know where you're going on this journey to production, and bring the whole team along. And above all, I want you to think forward, ask questions, and come visit us at the lounge for a more detailed conversation.
Q&A
So I think we just have time for one question before the next speaker. There were a couple of questions that came in, but one of them was: what do you think are some ways that data science teams can accelerate their learning, and especially adopt newer open source packages they may not have used before, like Plumber or TensorFlow?
It sounds so fundamental, but I find that the best learning happens when folks go to the main landing pages for each of these packages. People may not realize it, but the homepages for Shiny and R Markdown have getting-started sections with tremendous tutorials. I'll say I built my first Shiny app by copying and pasting code from Stack Overflow, probably like your first Shiny app. And after watching just one tutorial from the Shiny page, I started a Shiny app from a blank page, and I knew every element and where it needed to go. So I think the resources are out there, and fundamentally, a lot of the building blocks of knowledge that people can benefit from are right on those main pages for the packages.
