
Kelly O'Briant | Configuration management tools for the R admin | RStudio (2019)

This talk will feature an introduction to configuration management tools for the Analytic Administrator. An analytic admin is someone who is invested in continually improving analytic infrastructure, advocates for best practices in data product deployments, and acts to adopt DataOps philosophies in their organization. One of the biggest challenges for an analytic admin can be figuring out how to help IT groups develop core competencies around the management of R tooling. When IT groups are unfamiliar with R, they might lean heavily on the analytic admin for guidance or resist adoption entirely. Data science teams that rely on delivering results through integrated R-based solutions can get blocked when they lack the full support of IT. I’ll present a roadmap for how analytic admins can create custom teaching tools for introducing the R toolchain. Using these strategies, a dash of creativity, and a little bit of empathy, I hope you can get the IT buy-in you’ll need to make R a fully legitimate part of your organization.

About the Author

Kelly O’Briant

Kelly is a Solutions Engineer for RStudio and also an organizer of the Washington DC chapter of R-Ladies Global. It’s an R users group for lady-folk and friends.


Transcript

This transcript was generated automatically and may contain errors.

Hi, welcome. My name is Kelly O'Briant. I work at RStudio as a solutions engineer, and my talk today is on showing you some configuration management tools if you happen to be an analytic administrator, or as we kind of like to use the term, an R admin.

So I'm really happy that R admin as a concept, as a role, is becoming more recognized in different organizations, and people are starting to identify as this personality in different organizations. But in case you don't know, an R admin is usually, as we define it, a data scientist who has crawled into doing more IT-esque type work. So they take on the jobs of onboarding new tools, deploying solutions, and supporting existing standards in their organizations. They work closely with their IT groups to maintain, upgrade, and scale their analytic environments. They influence others. They train. They try to make things more effective in their organizations. And in general, overall, they are really passionate about making R a legitimate analytic standard in their enterprise.

So if you're not familiar with the R admin, you think you might be one, you want to read more about it, our Director of Solutions Engineering, Nathan Stephens, put out a post a while ago on our R Views blog called Analytics Administration for R, so definitely check that out if you are interested in reading more.

The R admin's role in legitimizing R

As part of this overarching set of goals that we have for R admins, a big part of doing the work of an R admin is making R legitimate in your organizations. The R admin looks at their enterprise analytic infrastructure on a wider scale than maybe a traditional data scientist would. They are aware of all of the different integration points and all of the places where R can be assisting in helping manage that infrastructure.

So we kind of think about these things in three different parts. Your first step is to get R recognized as legitimate. If you are in the back room and IT is giving you the secret server and you've installed R on it and you're kind of doing things in a shady way and giving people access to different things, like, we understand that that's going on, but we're trying to give you a framework to move out of the shadows and bring R into the light. So in order to do that, we say it's important for you, the R admin, to develop competencies in using our professional stack. So in order to do that, you need to get hands-on experience with understanding and managing R tooling. And finally, you need to take the step to communicate with your IT department and get R adopted and finally, hopefully, help others in your organization so that they all get to the point where various teams are all relying on integrated R-based solutions.

So that's the ideal. But we understand that this is kind of an open-ended framework, and it can be kind of frustrating and difficult to navigate because we don't have a lot of actionable steps listed here. One of my friends, Adam, recently, in the past couple of months, started this thread on Twitter that was really, really interesting where he reached out to the #rstats community and said, hey, does anyone have any resources for starting conversations with IT? You know, I'm struggling to make R an accepted part of our software kit. And I know Adam, and he is a very talented, capable R admin, but he's feeling stuck.

Building sandboxes

And so a lot of people had some really interesting ideas for him. You can go and check those out. But our standard advice as a place to start is that, number one, think about how, if at all, you can get yourself a sandbox. What is a sandbox? A sandbox is any kind of server or VM that you, the R admin, have complete control over, and where you can practice getting yourself those competencies.

So we have this resource available on our GitHub, which is the sol-eng GitHub. And this one is called the Data Science Lab. And it very basically takes you through the setup, assuming you have a clean AWS EC2 instance to work with, but that's not a hard requirement. These commands will work on other servers as well. And it takes you through all of the parts of installing and configuring a very basic pro stack. So those include system dependencies and building R from source and installing packages and installing RStudio products and all of the various integrations that we consider basic ones that are nice to have.

Once you've done that manually, we would then suggest that you might be ready to explore the fire hose, which is my favorite place, docs.rstudio.com. And on docs.rstudio.com, we have all of the quick links to get to all of our admin documentation. So if you're interested in focusing on one or many of these products, this is the place to go. Our admin docs are wonderful and they're full of many different side quests to explore in your sandboxes. So I highly recommend this site for getting started with the art of the possible.

So I hope at this point I've sort of explained why I'm excited about sandboxes, why I think other people should be excited about sandboxes. Unfortunately, even though we give this advice quite a lot, we don't see a lot of people actually taking us up on it: actually building their own sandboxes and showing us what they've done and getting hands-on experience. So for my talk today, I wanted to think about, like, what kind of talk could I give that would have people leaving this room being like, yeah, sandboxes! Like super excited about them.

Configuration management with Ansible

So that's what I've done. Instead of talking about the basic sandbox, which I've sort of already done, I want to talk about ultimate sandboxes. And this is kind of how I develop sandboxes for my daily work as a solutions engineer. I have the privilege to be embedded on the RStudio Connect team and help them out with testing new features and figuring out how things work so that we can deliver the correct documentation to our users.

So I do a lot of creating quick sandboxes that live for very short periods of time. But I like to save all of my work. So I like to create the ability to have reproducible, you know, custom environments that I can go back to and visit through my GitHub if I ever needed them again. And the way I do that is through configuration management tools. In particular, I mean, this is a bunch of the top players in that space, but there are very many. And it's easy to get overwhelmed by the sheer number of tools available, but my favorite of the configuration management tools is Ansible.

Take this with a grain of salt, because I'm going to spend the next several minutes talking a bunch about Ansible, but I do not believe that you should use Ansible. If Ansible speaks to you, more power to you. I think that's wonderful. I think it's a wonderful tool. Go out and learn about it. But I would suggest first checking with your IT team to see if they use a configuration management tool. And instead of investing your time in learning the one that I like, learn the one that they use, especially if you intend on staying in your organization for any period of the next several months, because that is going to really grease the wheels for helping you communicate with your IT. If you speak the same language as they do, if you can show that you know how to administer the R stack using the set of tools that they're comfortable with, that is a hugely powerful thing.


So some things I like about Ansible are, of course, that I can stand up these servers very quickly in custom ways on demand. But I also really love the rich module ecosystem that it comes with, especially the modules that they have around your basic big players in cloud compute space. So they have great modules for AWS, Azure, GCP, and they have a lot of other modules, some of which I'll talk about today, that are also nice to have and helpful. The other great thing about Ansible is that you write playbooks in YAML, and that's awesome because it is super human readable and super machine readable, and I love that. It's also super easy to install. This is how you install it on a Mac, and then you're up and running.

So I'm going to go through kind of what creating an Ansible project looks like, but as a high level overview before I move on to the next slide, I'll say that what I'm going to talk about is kind of a nested directory of YAML. So at the very container-ish level, you have your playbooks, and then within playbooks, you're going to have roles, and then within roles, you're going to have tasks. So I'm going to go through all of that, first with the playbook role structure.

Ansible playbooks and roles

So this is a project that I actually did a couple weeks ago to help check out the new feature that just got announced a couple minutes ago in RStudio Connect 1.7, which is support for these new content management APIs that allow for programmatic deployment of content to Connect. So this is what my playbook role structure looks like, and I'm showing right here the create sandbox playbook, and this playbook contains two roles, one for provisioning my cloud infrastructure and a second one for installing RStudio Connect on it. So I have this playbook, create me a sandbox, and then within that playbook, different things happen and I connect to it in different ways, but I have two roles, one provision, one install and configure.
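The slide itself is not part of the transcript, so here is a minimal sketch of what a two-role playbook like the one described might look like. The file name, the role names (`provision`, `install_connect`), and the `sandbox` host group are all hypothetical, not taken from the talk.

```yaml
# create_sandbox.yml: hypothetical file name, for illustration only
---
# First play: runs on the control machine and creates the cloud server.
- name: Provision cloud infrastructure for the sandbox
  hosts: localhost
  connection: local
  gather_facts: false
  roles:
    - provision          # hypothetical role name

# Second play: connects to the new server and sets it up.
# Assumes the provision role added the new host to a "sandbox" group.
- name: Install and configure RStudio Connect
  hosts: sandbox         # hypothetical inventory group
  become: true
  roles:
    - install_connect    # hypothetical role name
```

Splitting the work into two plays is one way to reflect the point about connecting in different ways: the first play talks to the cloud provider from the local machine, and the second talks to the server it just created.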

This again is about the content management APIs for programmatic deployment. It's kind of a two-for-one in this talk, because I was very excited about this feature that's coming into Connect, and so if you are working with your IT department and you've ever heard these two questions, hey, we can't allow push button publishing or how do we implement a dev test prod setup, I believe that our new stuff around programmatic deployment will be really helpful in having those types of conversations. Here are the key resources: the RStudio user guide has just been updated with a cookbook for various server API recipes, and we also have a GitHub repo that has scripts with examples of how to do programmatic deployment. The scripts that are in this GitHub repo are going to be really useful, and those are the ones that I used in this Ansible project.

So again, back to roles, as I mentioned before, these roles have tasks, and the basic anatomy of a task is that it's a good idea to name it, and that name should in general be relevant to what the task is doing, and then you also provide the module that you want to use. And then finally, if it's applicable, you'll then provide parameters for that module and plug in your variables.
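As an illustration of that anatomy (name, module, parameters, variables), here is a small sketch of a single task. The variable `r_system_packages` is hypothetical, and the module is written with its fully qualified name, which newer Ansible releases accept; older releases would use the short name `apt`.

```yaml
# One task: a descriptive name, the module to use, and that module's parameters.
- name: Install system dependencies needed to build R
  ansible.builtin.apt:                # the module
    name: "{{ r_system_packages }}"   # hypothetical variable holding a list of package names
    state: present                    # module parameter: make sure the packages are installed
    update_cache: true                # module parameter: refresh the package index first
```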

There are a couple of really cool things that are happening in this particular group of tasks, and this is the entire role that this Ansible project uses to create and upload and deploy the content. So this is programmatic deployment as defined in an Ansible task listing. The first cool thing happening here is that I am leveraging the script module, because if you thought, like, this can't possibly be all that programmatic deployment is, that's correct. I'm leveraging some shell scripts, because I didn't want to take the time to transfer all of the commands in those scripts into Ansible code. So the scripts that are provided in that GitHub repo, I grabbed those out, I put them in a scripts directory, I edited them slightly to make them do what I wanted them to do, and then I'm just calling those scripts. It's a really cool way to take baby steps into moving any of the configuration scripts that you currently have into a more reproducible configuration management type workflow.

So that's the number one cool thing I'm doing. Number two is that you can see after I've run that first task, I'm registering an output object of what came out of that first script, and I'm plugging it in to the second and the third task. The final cool thing is that I remembered to use a debug statement, which is also great.
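The three ideas just described (calling existing shell scripts with the script module, registering the output of the first task for use in later ones, and adding a debug statement) can be sketched roughly as follows. This is not the task listing from the slide: the script paths, the variable names, and the assumption that the first script prints an identifier on standard output are all hypothetical.

```yaml
# Hypothetical tasks file for a content deployment role.
- name: Create a content item on the server
  ansible.builtin.script:
    cmd: "scripts/create_content.sh {{ content_name }}"   # hypothetical script and variable
  register: create_result        # keep the script's output for later tasks

- name: Upload the content bundle
  ansible.builtin.script:
    # Assumes the first script printed an identifier to stdout.
    cmd: "scripts/upload_bundle.sh {{ create_result.stdout | trim }}"
  register: upload_result

- name: Deploy the uploaded bundle
  ansible.builtin.script:
    cmd: "scripts/deploy_bundle.sh {{ create_result.stdout | trim }}"
  register: deploy_result

- name: Show what the deploy script reported
  ansible.builtin.debug:
    var: deploy_result.stdout
```

Wrapping existing scripts this way leaves their logic untouched, so each one can be translated into native Ansible tasks later, one at a time.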

I'll start moving faster. So I talked about how roles have tasks. This is the task for the deploy content role, and finally when you're ready to get up and running with your playbooks, this is kind of how you run playbooks one by one. So I usually have two or maybe three playbooks that I'll run in any given Ansible project, sometimes more, but usually two or three, and the first will be the playbook that sets up my infrastructure and installs whatever I need on it, and the last one will be the playbook that tears everything down.

This is my favorite part of Ansible sandboxing. Once you have created this thing, you now have the power to write a good README, check it all into version control, and then burn it all to the ground, because you have the ability to reproduce this environment at any time within minutes, which is awesome.
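A teardown playbook can be very short. The sketch below assumes the sandbox is an AWS EC2 instance found by its Name tag and that the `amazon.aws` collection is installed; the talk does not say which cloud provider this project used, and the file names and variables are hypothetical.

```yaml
# teardown.yml: hypothetical file name, for illustration only
#
# Playbooks are run one at a time with the ansible-playbook command, e.g.:
#   ansible-playbook create_sandbox.yml
#   ansible-playbook teardown.yml
---
- name: Tear down the sandbox
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Terminate every instance tagged as this sandbox
      amazon.aws.ec2_instance:
        region: "{{ aws_region }}"            # hypothetical variable
        state: absent
        filters:
          "tag:Name": "{{ sandbox_name }}"    # hypothetical variable
```

Because the filter decides what gets terminated, it is worth keeping it as narrow as possible, for example by giving each sandbox a unique tag value.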


Interoperability and legitimizing R

So that's a little bit about how I do daily work as a solutions engineer who works with engineering teams and how I stand up these environments very quickly whenever I want, but obviously you would do things differently. You have different needs for sandboxes, and so this slide kind of shows what is available at a high level through docs.rstudio.com and all of the various configurations and integrations that you could possibly use.

This is a little resource that I have available on our sol-eng GitHub, and it covers, if you aren't ready for the cloud yet but you want to start using Ansible and creating sandboxes on demand, how you might do that with VirtualBox and Vagrant. And if you're really interested in seeing just the task structure of Ansible and how I have done the data lab sandbox inside Ansible task structure, that's available to look through here.

So finally, I want to kind of end on thinking about, like, why did my talk get put in the interoperability section, which kind of seems like an odd fit, but I also really like that it's in this section as well, because going back to the idea of how we get R legitimized in an organization, I often hear the stock answer, like, show the value of R, like, build a bunch of cool Shiny apps, and I love Shiny, I do. I'm writing a book about Shiny in production, apparently, but it isn't actionable if you don't have, like, awesome killer Shiny app ideas.

So my idea for you, a better way to frame this sort of advice, is to think about linking the tools that you love, R, to the tools that you know other people love in your organization. Powerful sandboxes really leverage interoperability, and we've done a lot of work at RStudio to make cool integrations that are turning into true interoperability opportunities for R. Throughout my career, I have had, like, a lot of success helping get R legitimized by putting it in terms of other people's favorite tools, and so I'll leave you with that. Thank you.