
Katie Masiello | Professional Case Studies | RStudio (2020)
The path to becoming a world-class, data-driven organization is daunting. The challenges you will likely face along the way can be thorny, and in some cases seem outright impossible to overcome. How do you get teams that traditionally butt heads, such as IT and data science, to complement each other and work in unison? How can you efficiently scale the scope and reach of your data products as requirements change? How do you ensure your time is spent doing truly valuable work instead of updating charts and reports? How do you prevent the support structure behind your platform from toppling like a house of cards? Despite these challenges, we think the end result is worth it: an organization that is equipped to make important decisions, with confidence, using data analysis that comes from a sustainable environment. We see this outcome every day.
Transcript
This transcript was generated automatically and may contain errors.
J.J. articulated our mission so clearly, and it's really meaningful for me to have this codified now into the fiber of our organization and to be recognized as a B Corp. But what I really want to emphasize is that our contributions to open source are funded by professional customers and their commitment to our professional tooling. That's why the successes of our professional customers are so important. Their wins are wins for all of us in the entire community.
So today I'm going to talk about ways that I've seen our professional customers overcome common pain points that can trip up a team and limit their ability to become efficient data-driven organizations.
So a year ago I sat in this audience, like many of you maybe, as an engineer and a data scientist working in industry. And I was trying to figure out on my own how to make my analysis more powerful, more efficient, more impactful. And it's just funny how things turn out. Not too long after conf, I had the opportunity to move to RStudio, and now I'm working with professional customers who are asking those same questions.
So in the time that I've been at RStudio, I've had a lot of customer conversations. So far, well over 200. Right now I'm averaging about 8 to 10 meetings a week. And our customers span every industry you can imagine. In my role, I get a unique insight into the various stages of a data science team's growth and maturation and the challenges they face. Sometimes the team I'm talking to is literally a team of one. And some of these teams span the globe, with enterprise-level architectures and data products in place. But I do see common elements among these teams, across all industries. I've gotten a firsthand view into some of the successes that take a team from good to great.
Phases of data science team maturity
So I'm going to share two stories from teams that are at different maturity levels, but they've succeeded in overcoming significant pain points.
So we can take broad strokes and talk about phases of maturity in a data science team. Now our first phase, right, these are our fun, excited newbie teams. They're establishing buy-in for their analytics. They're establishing credibility for what they're doing. Maybe they're coming off of Excel and spreadsheet-based work, but there are a lot of big, easy wins for them as they start to explore everything they can do with code-based analytics. So this group I love. They're just dazzled by all the amazing things that they can do with code.
Our phase two teams are a little further along in the path, right? At this point, they've established buy-in, they've got credibility. Often they're growing in size. They've got more people on the team. Maybe they're adding more data products to their portfolio. But definitely no matter what's going on, their data products have been recognized as being more business critical, right? They've got cred behind what they're doing. And that's great, but along with that comes greater expectations and more demand. And so now these teams are in this state of figuring out, what the hell have we done? And how are we going to keep this up? We're not sure what we've gotten ourselves into. And there's a lot of growing pains associated with the phase two teams.
And then lastly, there's that upper-echelon phase three team, right? They are hitting on all cylinders. They've got it all figured out. And I think there's definitely an asymptotic path that phase two teams take towards becoming a phase three team. But really, no one actually gets it all figured out in the end.
So the two teams I'm going to talk about today, team one is a phase one team. They're moving from an Excel world. They've got really inefficient reporting, emailing things around. But they've discovered Shiny. And so they're enthusiastic about this, but they're really struggling to figure out how to get their data products into their consumers' hands from here. Team two is going to be a phase two team. And they're trying to maintain an efficient workflow as they grow. But they're finding some issues because they're becoming a little too cavalier in their approach.
So when teams are tangled up looking for answers to these types of questions, it can take weeks or months or sometimes years to overcome these issues. I get the benefit of 20/20 hindsight and reflection, and I can see these struggles and have visibility into their success strategies. I want you to walk away from this talk with enough information and vision that you can see there's a forward path for your own team. And I invite you to come to the lounge at some point this week and have a deeper and more specific conversation about how to move your own team to the next level.
RStudio professional products overview
Since we're talking about professional customers, we need a quick orientation on our RStudio professional products. RStudio offers three professional products: RStudio Server Pro, Connect, and Package Manager. These products provide a modular platform, which provides the shortest and most efficient path to lasting value for data science teams of all sizes. So RStudio Server Pro is your place for R and Python data analysis, now supporting Jupyter sessions as well as the traditional RStudio session. Connect is a publication platform. This is where all the data products that are created go to live, and where stakeholders can consume and interact with those data products. And RStudio Package Manager provides a managed and controlled environment for your packages. But at the core of all this are the open source packages that we're dedicated to.
Phase one team: from Excel to Shiny
Now let me introduce you to our phase one team. So they had a handful of analyses all tied up in Excel files. They used to email them around. It bogged things down tremendously. But they've discovered Shiny, and they have jumped wholeheartedly on board.
But the biggest issue facing this team was needing a way to get their Shiny app into the hands of their stakeholders so it could be used. For a little while, they were simply publishing to localhost, and it quickly became apparent that that was not a stable or scalable solution. But they also weren't sure whether they would need to rely on an outside group to help get their apps published.
So for them, a win is being able to have their work visible and discoverable by their stakeholders. And really, success in this realm would open the door for this group to reach their broader goals of securing buy-in for what they're doing, so they can expand their scope and their impact in the organization. They can get off the treadmill of updating the same quarterly report every quarter and get into more forward-thinking, proactive analysis.
So my phase one team found that RStudio Connect provided the best home for their data products. What was really appealing to them was the ease of deployment. Connect permits push-button deployment from the IDE, right from this little blue button that's really small on your screen here. You push this button, and all of your package and R version information is bundled up behind the scenes, shipped off to the Connect server, and unpacked. So deployment is as easy as pushing a button, and two minutes later you're off and running.
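For teams that would rather script that same step, here is a minimal sketch using the rsconnect package, which is essentially what the publish button drives under the hood. The server URL, account name, API key variable, and app names are all hypothetical placeholders.

```r
library(rsconnect)

# One-time setup: register the Connect server and an API key
# (the URL, account name, and key here are placeholders)
addConnectServer("https://connect.example.com", name = "connect")
connectApiUser(account = "katie", server = "connect",
               apiKey = Sys.getenv("CONNECT_API_KEY"))

# Bundle the app's code, package list, and R version information,
# then ship it to the Connect server, just like the blue button does
deployApp(appDir = "inventory-app", appName = "inventory-dashboard",
          server = "connect")
```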
Now, once this team's Shiny app was available on Connect, stakeholders could interact with all this information in a self-service manner, and it didn't require iterating back to the data scientists time and time again whenever they wanted to see a different view or a different variable.
But what was really great to see with this team was that they really began to leverage the multiple data types supported by Connect. So for them, Connect opened the door to new gains. In addition to the original Shiny app, they discovered they could tackle all sorts of other inefficiencies in their processes: scheduled and ad hoc reporting in R Markdown and Jupyter Notebooks, sending customized emails, and using pins as part of an ETL process.
So I want to show what this would look like as an example. In Connect, for report scheduling, when you upload an R Markdown document or a Jupyter Notebook, you have the ability to specify both the report frequency and the audience you want to send that report to. And it can be stored just on the Connect server, or it can be emailed out automatically.
Now this team has also integrated condition-based triggers. So some reports will only send an email out if certain criteria are met, say a low-inventory condition or a performance threshold being exceeded.
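To give a flavor of how that can work, here is a minimal sketch of a chunk inside a scheduled R Markdown report on Connect, using the rsc_email_suppress_scheduled and rsc_email_subject output metadata fields that Connect honors. The inventory data frame and its on_hand and reorder_point columns are hypothetical, assumed to be built earlier in the report.

```r
# Flag items that have fallen below their reorder point
low_stock <- inventory[inventory$on_hand < inventory$reorder_point, ]

if (nrow(low_stock) == 0) {
  # Stock is healthy: tell Connect to skip this run's scheduled email
  rmarkdown::output_metadata$set(rsc_email_suppress_scheduled = TRUE)
} else {
  # Alert condition met: customize the subject line of the email
  rmarkdown::output_metadata$set(
    rsc_email_subject = sprintf("Low inventory: %d items below reorder point",
                                nrow(low_stock))
  )
}
```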
Another workflow they've fully taken advantage of is using scheduled R Markdown reports to write ETL data to a pin. So this data lives right in Connect, and it's accessed by their scripts. They can control who has access to the data on the pin. And because it's ephemeral data, it didn't really make a lot of sense to take this little chunk of data and put it into a database. But they didn't want to be saving and importing CSV files. So for them, a pin was the perfect solution. This data is then accessed by their Shiny app, and now they have a regular update cycle behind their main data product, so the information is always fresh and relevant.
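As a rough sketch of that handoff, using the pins API as it stood around the time of this talk (pins prior to 1.0): the scheduled report publishes a pin to Connect, and the Shiny app reads it back. The server URL, API key variable, pin name, database connection con, and query are all hypothetical.

```r
library(pins)

# In the scheduled ETL report: register the Connect board, then
# publish the freshly pulled data as a pin
board_register("rsconnect", server = "https://connect.example.com",
               key = Sys.getenv("CONNECT_API_KEY"))
daily_sales <- DBI::dbGetQuery(con, "SELECT * FROM sales WHERE day = CURRENT_DATE")
pin(daily_sales, name = "daily_sales", board = "rsconnect")

# In the Shiny app: read the latest version of the pin at startup
daily_sales <- pin_get("katie/daily_sales", board = "rsconnect")
```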
So my team one channeled their enthusiasm for moving to a code-based solution, and they found tremendous efficiency gains and value add by getting their workflow streamlined and their products right into the hands of their customers.
Phase two team: production readiness and DevOps
Now it's time to meet our phase two team. So this is a fun bunch. They've grown in their data offerings. They've established credibility in the organization for what they do. But they're in an awkward spot, because as their data products become more business critical, they're finding that their phase one way of doing things is not sustainable. They've been burned a few times by pushing things straight into production, and things have gone awry. They know that they need to start adopting more process rigor and being more diligent about version control. But right now it's just hard to let go of those freewheeling days of phase one, and it all seems like a jumble of pieces with no clear order to where they need to fall.
This team's biggest challenges include tool- and process-based issues, but a big part of it is also philosophy. Because of the high visibility of their past fumbles in pushing straight to production, they're now butting heads with IT over deployment strategies. IT was really uneasy with the push-button deployment feature of Connect; it was a little too easy to send things out to production. That's what made Connect such a great sell in the phase one days, but now it's the source of friction.
So in this case, IT put out a nine-page list of instructions to follow in order to deploy to production, and they figured, hey, if anyone's diligent enough to get through all nine pages, they'll have passed the test and can be trusted to publish. Clearly this did not go over well with the data science team, and they just felt this overall turmoil of, what have we done and how are we going to keep this thing going?
Conversations between the data scientists and IT felt like data scientists are from Mars and IT is from Venus. Words like Jenkins and DevOps were swirling around in their heads, and it was just overwhelming.
So a win for this team was establishing a managed deployment workflow that was still efficient, but respectful and cognizant of a more formalized understanding of what production means. And if they could be successful in this realm, it would help them achieve higher goals: building a foundation of best practices and a collaborative relationship with their IT team.
What does "production" really mean?
But I want to tease out this philosophy issue a little bit, because it really feeds into the phase two team's pain points. So what is production? It can kind of feel like that summer family road trip: are we there yet? Where are we going? Frankly, how do you know when you're actually there?
So right now you might be having a hard time formalizing what it means to be in production. And if this idea of production sounds kind of nebulous, you're not alone. Look around, and if you know clearly in your head what production means, raise your hand. It's a quiet audience right now. Guaranteed, if you ask ten different groups what production and production readiness mean to them, you're going to get ten different answers.
So let's talk about what production might mean. An incomplete concept of production considers only whether the data product is accurate and ready to be used for informing business decisions. This is where my phase two team got bit. They made the mistake of saying, it works fine on my desktop, it's ready to go. And if that doesn't get your spidey senses all tingly, what if I tell you it was a Friday afternoon when they said that? And it was right before an organizational deadline.
So in my customer conversations, I've seen a number of in-house solutions, yet no one really knows what it means to be in production. And as demand for products grows, it becomes increasingly important, and increasingly painful, to address scaling, security, stability, and availability. These in-house solutions tend to crumble as higher demand is placed on them.
So a more formally defined state of production not only ensures that the data product is correct, but it's in a stable environment, it's safe, it's secure, and it scales and responds appropriately. These are the infrastructure things. But there's also a state of mind associated with production, and this is important. In production, you're thinking ahead and expecting that every change you make has the potential to disrupt. So you're designing, you're testing, and you're architecting to prevent that.
Now, my phase two team was already working with RStudio Connect, and they saw how it could provide that stable, secure, scalable environment for production. There's built-in authentication, access management, and app-by-app performance tuning. They were also able to stand up a staging environment for QA, which was great. But taking advantage of these infrastructure controls was one thing. The transformation happened as the team cultivated their production-readiness mindset and incorporated best practices, with version control, all in line in an automated dev-test-prod workflow.
And that was a mouthful. So I want to illustrate what this workflow looks like, because it's very versatile and useful for many teams. What this means is that team two has linked their master repo in GitHub to the production instance of Connect, and Connect automatically watches that repo and redeploys if changes occur. However, all development, feature additions and whatnot, is worked off a branch of that master. The branch is deployed to a staging environment. This way, the data scientists can still see and touch and feel and play with their deployed content without disrupting the production instance. When the content in the branch is approved and ready to be moved to production, a pull request is made to merge into master, and Connect automatically redeploys the updated version.
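One supporting detail for Git-backed content on Connect: the repository needs a manifest.json file that records the content's package dependencies and R version, so Connect knows how to rebuild it. A minimal sketch, run before committing; the app directory name here is hypothetical.

```r
library(rsconnect)

# Snapshot the app's package dependencies and R version into
# manifest.json; commit this file to the repo alongside the app code
writeManifest(appDir = "inventory-app")
```

From there, the production instance of Connect can be pointed at master and the staging instance at the development branch, and each redeploys when its branch changes.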
So this automated dev-test-prod workflow met the data scientists' needs for an efficient workflow, and it met the requirements of IT for a managed strategy that allowed for QA and a production mindset. This philosophy change and automated workflow have been a tremendous value add for team two.
Just to show what it looks like in Connect: there was a lot of clicking, so I didn't feel like I could show it live today. But you can see there are multiple versions of an app. So in GitHub, we switch to the development branch for the asset and initiate a pull request. When the pull request is approved and merged, we move to Connect. And I know that this content is Git-backed, because the info panel on Connect shows me the branch and the repository it's watching. It will periodically check for updates, or in this case, I can force an update, because I know there are changes coming. So Connect sees those changes and redeploys. And now we can see that the changes we made and approved are available and ready for us in production.
Closing thoughts
So I want to close with a few thoughts. I do not want imposter syndrome to get you. There are plenty of sexy and intimidating words out there when we talk data science in the enterprise: Hadoop and Docker, Kubernetes, Spark and Slurm. It's okay. You don't have to know or understand it all.
Because successful enterprise data analytics teams are just that. They're a team. So you do what you do best. But build up, count on, communicate, and learn with your team. You need an R admin, you need IT and DevOps, and you need to be listening to your stakeholders.
So to recap, what I want you to see today is that you can have your own success story, and I want you to benefit from seeing some solutions to common pain points that might confront you along the way. It's important that you find an efficient way to share your data products. Work smart: use automation and reproducibility to make efficient workflows. Know where you're going on this journey to production, and bring the whole team along. And above all, I want you to think forward, ask questions, and come visit us at the lounge for a more detailed conversation.
Q&A
So I think we just have time for one question before the next speaker. There were a couple of questions that came in, but one of them was: what do you think are some ways that data science teams can accelerate their learning, and especially adopt newer open source packages they may not have used before, like Plumber or TensorFlow?
It sounds so fundamental, but I find that the best learning happens when folks go to the main landing pages for each of these packages. People may not realize it, but the homepages for Shiny and R Markdown have getting-started sections with tremendous tutorials. I'll say I built my first Shiny app by copying and pasting code from Stack Overflow, probably like your first Shiny app. And after watching just one tutorial from the Shiny page, I started a Shiny app from a blank page, and I knew every element and where it needed to go. So I think the resources are out there, and fundamentally, a lot of the building blocks of knowledge that people can benefit from are right on those main pages for the packages.
