Resources

Mark Sellors | R in production | RStudio (2019)

With the increase in people using R for data science comes an associated increase in the number of people and organisations wanting to put models or other analytic code into "production". We often hear it said that R isn't suitable for production workloads, but is that true? In this talk, Mark will look at some of the misinformation around the idea of what "putting something into production" actually means, as well as provide tips on overcoming the obstacles put in your path.

Materials: https://rinprod.com/

About the Author

Mark Sellors is the Head of Data Engineering at Mango Solutions as well as the author of the 'Field Guide to the R Ecosystem'. He has more than a decade’s experience working with analytical computing environments, DevOps and Unix/Linux. He uses his experience to help Mango’s customers transform their analytic capabilities to ensure they can make the most of their data. Mark and his team are at the forefront of the data engineering field, deploying high performance analytical environments using a wide range of tools, such as R, Python, Spark, and cloud computing. He is experienced in the complete product life-cycle from initial ideas and proofs of concept through to development, test, release and production.

Transcript

This transcript was generated automatically and may contain errors.

There isn't a one-word trick that we use to get R into production. I'm also not going to talk specifically about any code that might help you run R in production. That's not really the area that I work in.

So where I work, a lot of the time we're helping organisations of varying sizes, from very small businesses up to massive government agencies, to adopt the R language, which isn't always an easy conversation to have.

So this is kind of my starting point. I genuinely think that all of the technical barriers to running R in production are easy to overcome. I say easy, I mean comparatively easy, right? There are always challenges. But it's the cultural barriers that slow us down.

So there's some overlap here with what Joe Cheng was talking about this morning. I like to present this as a kind of inspirational quote, but actually it was just something I said on Twitter one time. But I do genuinely believe it.

Defining production

So much like Joe, who was kind of, you know, stealing my thunder this morning, we need to start out with a working definition of what production actually means. And mine's kind of similar to Joe's; there's a lot of overlap.

But basically, when people say production, they tend to sort of have this idea in their head of what production means. And production is the same as massive, right? That's kind of the basis for all these production kind of conversations. And when people historically have said, oh, well, you know, you can't run R in production, quite often what they mean is you can't run R at scale. And even that's not true anymore.

So these are, I'm not 100% sure because I stole the image from Google, but I think these are Google racks, actually, in their data center. And I think they're Google's tensor processing units. But this is the kind of thing that people think of when we talk about production. But actually, production and scale, not the same thing at all.

So for me, production is anything that's run repeatedly, and that the business relies on. So hopefully, I'm kind of lining up with Joe Cheng on that one. And the important bits of that are that it's repeated or continuous, kind of an extension of the same idea, and that it's relied upon.

Now, in most of the organizations that I work with, when you're implementing R in production, the people, the kind of target audience for those R environments are anywhere from 10, 20 people up to a few hundred people. So those people will rely on the information that's being presented to them, whether it's information in a Shiny application, or whether they're running R on a HPC cluster, or something like that. They're actually relying on that information that they're getting back to kind of inform decision-making within the business, or to bring a new drug to market, or whatever it may be.

Why R is great in production

So actually, R is great in production. There are a number of reasons for this. R doesn't have a lot of the baggage that other programming languages do. And I'm thinking specifically of things like the Python 2/3 schism. There's still an awful lot of stuff in Python 2, which is slowing adoption of Python 3. And that's because it's used so heavily on Linux systems where vendors are perhaps a little slower to upgrade things.

Obviously, R doesn't have that baggage, so you can install whatever version or versions of R you want to use. And I've found overall that performance is exceptional. On top of that, you've obviously got a fantastic package ecosystem and things like that that we can draw on to build great things like we saw from the previous talk.

Getting R into production

So what I really want to talk about is actually how we get there. How do we get R into production? So maybe you're working at an organization now. You've perhaps got R on your desktop, something like that. But you're not running anything R-based in production. There are traditionally a number of reasons for this.

And they mostly revolve around the way that businesses have evolved in terms of their kind of technology adoption. And so what we've found when we look at the market and when we talk to our customers and things like that is that there is interest in adopting new technologies and bringing new things into the business, but they're perhaps not sure how to go about it.

So this tension, I guess, a lot of people will look at their business and say, well, IT won't let me run R in production. But actually, often it's not IT, it's the business as a whole, which is obviously what informs IT and kind of provides all the kind of basis for their security infrastructure and the network infrastructure and all that sort of thing.

Two paths to adoption

So there are two kind of distinct approaches that have emerged in the work that I've done and the businesses that we've talked to over the last, so I've been doing this sort of work for about six years. There are two paths that I've seen people be really successful with for getting R into their organizations. And I want to talk a little bit about those now.

So the left-hand path is the path of magic, essentially. So if you think about the great talk that we saw a moment ago, part of what happened was that a fantastic shiny app was presented to somebody important, and then that drove the kind of decision-making over whether to use R or something like that downwards in the organization.

So the path of magic is a great path. It can be a very short path, but it can also be sort of fraught with difficulty, I guess, because what can happen is it can backfire a little bit. So this path, I think, is brilliant and, like I say, can be a really good shortcut, especially if you've got something really powerful that you can demonstrate. But you can get kicked back later on when we come to talk to IT teams about actually getting that stuff into the production ecosystem within the business.

So that's one option, and hopefully that's kind of fairly easy to understand. The right-hand path, though, is... So the right-hand path has, I guess, I don't want to say guarantees, because there are never any guarantees, but what you're looking for with the right-hand path is for the business to have confidence in the work that you're doing. And historically, that's the area where getting R into production kind of falls down.

So the IT team might look at the work that a data scientist is doing and go, well, I'm not confident in what you're doing because where's the test team? Where's your requirements document and things like that? All these kind of historical software engineering things that have been sort of adopted by businesses over the last 50, 60 years, whatever it may be.

Data science vs. software engineering

And so we need to kind of address that elephant in the room, which is essentially to come to the understanding and to sort of agree that data science isn't software engineering. We know it isn't software engineering, and that's absolutely fine.

In terms of the kind of rigor that goes into these different disciplines, the rigor goes into a different place in data science. And so what I usually see from data science teams is that they're very good at ensuring their code does what they think it does. That's obviously extremely important. But also that they're kind of methodologically sound, I guess. And then data science will often stop there.

Software engineering, on the other hand, obviously has a different set of drivers and a different set of goals. And so software engineering can very quickly mushroom into a much bigger thing where you have things like QA departments and requirements documents and all those sorts of things, which you hopefully try and avoid being too onerous in the data science world.

So there's this friction between our starting point, which is the data science, and then software engineering, which is kind of the IT and the business.

So in terms of tackling things head on, which is obviously what we're kind of aiming to do, and this was just my really quick thoughts, and there might be different things in your environments, but these are some of the things that, as a data scientist, you might be pushed into understanding in order to get stuff into production.

I want to kind of just have a little sidebar on that, though, in that I don't think every data scientist wants to learn this stuff, and I don't think every data scientist should have to learn this stuff. There are a group of data scientists who already know this stuff or who are excited by these things, but also it's about building bridges with the IT team and getting them to kind of meet you halfway in these kind of areas.

So you might find that you've got a particularly restrictive IT team who don't want you to run new things in production. People are always concerned about new things. They might break the old things. They might cause problems. There are QA teams that have to be satisfied. Who's going to write the test scripts and things like that that you might need to take an application from your desktop to a production environment?

There's automated testing. Joe did a great talk this morning, obviously. It included shinyloadtest and shinytest, which are great ways of testing Shiny applications, but I can pretty much guarantee that your internal IT team, if you have one, won't be familiar with those tools. So we need to build those kind of bridges. Staging servers, integration environments, continuous integration, continuous delivery, these are all concepts that come from that software engineering world rather than from data science, but they're all the sorts of things that people need to take on board and not necessarily do themselves.

Like I say, I genuinely don't think that all data scientists should also be software engineers in terms of getting things to production, but these are the sorts of things that you'll encounter.

A real-world example

I worked on a project a little while ago for a business that wanted to adopt RStudio Connect so that they could deploy Shiny applications and things like that, and I talked to them initially about how they might go about doing that. They didn't have any Linux knowledge in-house. Connect only runs on Linux. They didn't have any kind of R administration knowledge in-house, so they didn't know how they were going to work things like CRAN and stuff like that, and I had a meeting with their security board.

They had a board that met every two weeks to discuss any new projects that were coming through from a security perspective, and I said to them, oh, yeah, it's a programming language like any other. There's not really anything to get excited about, and by the way, CRAN's got 13,000 packages on it, and they were like, no, it's not going to happen.

So unfortunately for this customer, this was slightly before RStudio's package manager came out, so I ended up building them an internal CRAN to get around this very strict requirement that they've got of not bringing 13,000 packages of, I don't want to say unchecked code, but you know what I mean, kind of community-contributed code, into their business.

So there are a lot of these kind of barriers that will often need to be overcome in these sorts of organizations. My background is actually more on that kind of engineering side. I was an infrastructure support engineer for many years, and my role now is basically to try and bridge that gap.

So I get into conversations about doing things like, well, we'd like an RStudio Connect deployment in our business, and that's great. I can do that in half an hour, maybe an hour, if it's a complicated installation. But then people want all the help around that. I can install the tools really, really quickly, but most of what I do is actually hearts and minds. I write documentation for the Linux team so that the Linux team know how to support R, how to compile it from source and where it should be put on the systems and things like that.

So it can very quickly mushroom into a thing that's got almost nothing to do with data science at all, even though it's got everything to do with R. Production is very much a team sport, which is why I talk a lot about building bridges between these different disparate groups within the organization.

As I said before, I don't want every data scientist to have to learn all of the kind of software engineering best practice, and I don't think that there are necessarily many data scientists who are interested in that sort of thing, which is why it's important to build bridges with these different teams within your organization. But obviously in those circumstances, it's about building a kind of common language and sort of trying to understand a little bit about what they do.

Production readiness checklist

So I've put together, and again, this list is by no means exhaustive, but I've put together a kind of production list or a production readiness checklist for some of this stuff. So when R goes into production, you need to think about the target environment. What is the target environment? Is it Windows? Is it Linux? Which versions and things like that?

You also need to think about the release process. Can you just take code off your desktop and shove it onto a production server? Probably not. So a lot of organizations will have a formal release process that you'll need to either get on board with or somehow persuade somebody to adapt for your specific needs.

You'll probably need a testing strategy, so it's worth writing down what that is so that when whoever it is from the QA team says, well, you haven't done any testing, you can say, well, actually, we've done a lot of testing, and our testing strategy is really good and clear. Some organizations have a change management process. So a really good example for this one is retail businesses who often don't want to change anything in their infrastructure over the Christmas period because they're extremely concerned that any changes might damage their infrastructure in some way, and so there'll be a gap, two, three months, whatever it is, where literally no changes can be pushed to production at all.
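As a rough illustration of the change-freeze idea (not something shown in the talk), a deployment script could refuse to run inside a configured freeze window. The dates, the `FREEZE_WINDOWS` name and both function names are assumptions made up for this sketch; the talk only says the gap can be two or three months around Christmas.

```python
from datetime import date

# Hypothetical freeze window for a retail business. The exact dates are an
# assumption for illustration; the talk only mentions "two, three months".
FREEZE_WINDOWS = [
    (date(2019, 11, 1), date(2020, 1, 15)),
]


def in_change_freeze(day, windows=FREEZE_WINDOWS):
    """Return True if `day` falls inside any (start, end) window, inclusive."""
    return any(start <= day <= end for start, end in windows)


def can_deploy(day, windows=FREEZE_WINDOWS):
    """A release is only allowed when no freeze window covers `day`."""
    return not in_change_freeze(day, windows)
```

A release pipeline could call `can_deploy(date.today())` as its first step and stop early if it returns `False`, so nobody finds out about the freeze after the work is already queued.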

You might need a security review. Infrastructure is kind of similar to the target environment, but also covers things like, do you need access to databases and things like that? What is your deployment process? Will there be an automated process? Is it a manual process? Will you push straight from your desktop? Is it going to come out of Git or wherever? Who's going to support your application? If you're taking an R application into production, do you want the 3 a.m. support call when it falls over? Probably not. Who's going to do the support? Are the support team prepared to support this application? Do they have the information they need?

With a lot of data science projects, you do the project and then you move on to the next one, but in production, we need to think about maintenance. Who's maintaining this thing when the initial drop is done? Where's version two going to come from? Who's going to do the bug fixes? Who signs off on releases? Which version of R are you going to use? Which packages, which versions? There's a lot of things to think about. This checklist is kind of a starting point. Yours will be different. Feel free to copy it or take a picture or whatever and add your own things onto that.
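One way to make a checklist like this concrete is to keep it as data, so it is obvious which questions still have no written answer. The sketch below is only an illustration: the topics are taken from the talk, but the `ChecklistItem` class, the question wording and the helper names are assumptions, not part of the speaker's materials.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    topic: str
    question: str
    answer: str = ""  # stays empty until someone writes the answer down

    @property
    def done(self):
        # An item counts as done once it has a non-blank answer.
        return bool(self.answer.strip())


def default_checklist():
    """Build the starting checklist; extend it with your own items."""
    items = [
        ("Target environment", "Windows or Linux? Which versions?"),
        ("Release process", "Is there a formal release process to follow?"),
        ("Testing strategy", "Is the testing strategy written down?"),
        ("Change management", "Are there freeze periods or approval steps?"),
        ("Security review", "Is a security review required?"),
        ("Infrastructure", "Is access to databases or other systems needed?"),
        ("Deployment process", "Automated or manual? From Git or a desktop?"),
        ("Support", "Who takes the call when it falls over?"),
        ("Maintenance", "Who does bug fixes and version two?"),
        ("Sign-off", "Who signs off on releases?"),
        ("R version", "Which version of R will be used?"),
        ("Packages", "Which packages, and which versions?"),
    ]
    return [ChecklistItem(topic, question) for topic, question in items]


def outstanding(checklist):
    """Return the topics that still have no written answer."""
    return [item.topic for item in checklist if not item.done]
```

Filling in `answer` for each item and printing `outstanding(...)` before a release meeting gives a quick view of which conversations still need to happen with IT, QA or security.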

Once you've done that, you can be happy up a mountain enjoying having R in production. The last thing I want to encourage you all to do is if you do have R in production currently or if you do manage to successfully get R into production in your organisation, I really want you to share what you've done somehow, anyway, because it helps lift the rest of the R in production community up. It's really good to get these stories out there to encourage others to do the same and so that we can learn from each other.

Last thing, that's me, that's the company's website. The last thing is the Field Guide to the R Ecosystem. It has been quite useful for people who don't know R, so people with ops and management backgrounds and things like that. It's a very short, high-level overview of the R ecosystem. With that, I will stop talking.

Q&A

Thank you, Mark. We have time for one question. If anybody has any question that they would like to pose to Mark, we've got microphones. It looks like we have one right over here.

You mentioned building an internal CRAN for a company, and installing and running unvetted code is kind of a big concern in my organisation. Were the packages somehow vetted when you were including them in the repository, or how did that work?

They were vetted, yes. Basically, all I did was take... They gave me a list of these are the 80 packages that we use. I got those packages from CRAN. I got all the dependencies from CRAN. I ran their kind of corporate antivirus, their corporate anti-malware stuff over those packages, and then I built a small repository to host them. That was enough to get their security team on board effectively.
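The "get the packages, then get all the dependencies" step described in this answer amounts to computing a transitive closure over a dependency index. The sketch below shows that step only, on made-up data: the package names and `TOY_INDEX` are hypothetical, and in practice the index would be built from the repository's own package metadata. The malware scanning and repository hosting steps are not shown.

```python
from collections import deque


def dependency_closure(requested, index):
    """Return the requested packages plus all transitive dependencies.

    `index` maps a package name to its direct dependencies. Packages
    missing from the index are treated as having no dependencies.
    """
    seen = set()
    queue = deque(requested)
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue  # already handled; this also guards against cycles
        seen.add(pkg)
        queue.extend(index.get(pkg, []))
    return sorted(seen)


# Hypothetical dependency index, for illustration only.
TOY_INDEX = {
    "appPkg": ["plotPkg", "dataPkg"],
    "plotPkg": ["utilPkg"],
    "dataPkg": ["utilPkg"],
    "utilPkg": [],
    "unusedPkg": ["utilPkg"],
}

to_mirror = dependency_closure(["appPkg"], TOY_INDEX)
```

Only the packages reachable from the approved list end up in `to_mirror`, which is what keeps the internal repository small enough to scan and review, rather than mirroring everything.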