Resources

Emily Riederer | oRganization | RStudio

Many case studies demonstrate the benefits of organizations developing internal R packages. But how do you move your organization from individual internal packages to a coherent internal ecosystem? This talk applies the jobs-to-be-done framework to consider the different roles that internal tools can play, from unblocking IT challenges to democratizing tribal knowledge. Beyond technical functionality, we will explore design principles and practices that make internal packages good teammates and consider how these deviate from open-source standards. Finally, we will consider how to exploit the unique challenges and opportunities of developing within an organization to make packages that collaborate well -- both with other packages and their human teammates. About Emily: Emily Riederer is an Analytics Manager at Capital One. Emily leads a team that is focused on building internal analytical tools and data products, including a suite of R packages and Shiny apps, and cultivating an innersource community of practice for analysts. Emily is an active member of the R community. In 2019, she co-organized satRday Chicago and the Chicago R unconference. You can find her {projmgr} R package on CRAN and her blog at emilyriederer.netlify.com. Previously, Emily earned degrees in Mathematics and Statistics at UNC Chapel Hill and worked as a research assistant in emergency department simulation and optimization.


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Emily Riederer, and today I'm here to talk about how to design packages for your organization. Think about the last time that you joined a new organization. There was so much you needed to learn before you could start contributing. Remember the frustration when you couldn't figure out how to access data, the lost hours trying to answer the wrong question before you built up your intuition, and the awkwardness of figuring out team norms through trial and error.

When we join a new organization, we must learn these things only once before we can hit the ground running. However, the off-the-shelf tools that we use can't preserve this context. Every day is like their first day at work. Unlike open-source packages, internal packages can embrace institutional knowledge more like a colleague in two important ways. First and foremost, they can aim to solve far more domain-specific and concrete problems than open-source packages. Second, although the scope of the problem they solve is narrower, their insight into our organization allows them to be more holistic in the solutions they propose, internalizing more steps of the workflow needed to answer that question, covering areas like pulling data and communicating with colleagues.

Because of these two factors, internal packages can add a lot of value by interfacing internal utilities, streamlining the analysis process, and empowering data scientists to create high-quality results. But, more than the tasks they can do, today I want to talk about how to design such internal packages to be as rich in context as possible.

To do this, I like to think about the popular jobs-to-be-done framework for product development. This asserts that we hire a product to do a job that helps us make progress towards the goal. And, to me, it's that notion of progress, and truly knowing the ins and outs of what sort of progress our users need to make that sets internal packages apart. Additionally, these jobs can have functional, social, and emotional aspects.

For the rest of this discussion, I'll tweak this framework just slightly. Let's build a team of packages to do the jobs that help our org answer impactful questions with efficient workflows. So what sort of teammates should our packages be? Let's meet a few.

The IT guy: abstracting infrastructure

First, I'd like to introduce you to the IT guy. Think of a helpful or friendly IT or DevOps colleague that you may know. They're fantastic at handling the quirks of infrastructure and abstracting away all the sorts of things that data scientists don't want to spend their time thinking about, like proxies, connections, servers, etc. In that abstraction process, they also take on additional responsibilities to promote good practices like data security and credential management. Ideally, they can save us a lot of time and frustration in navigating organization-specific roadblocks that no amount of searching on Stack Overflow can help.

So how can we encode these characteristics in a package? Writing utility functions can help achieve that same type of abstraction. And in these functions, we can take an opinionated stance and craft helpful error messages. Let's take a look at an example. Suppose we want to write a function to connect ourselves to a database. First, we might start out with a rather boilerplate piece of code using the DBI package. We take in the username and password, hardcode the driver name, server location, port, and return the connection object.
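A minimal sketch of that boilerplate connection helper, assuming an odbc-backed DBI connection; the driver name, server location, and port below are hypothetical placeholders, not values from the talk:

```r
# Boilerplate connection helper: take in username and password, hardcode
# the organization-specific details, and return the connection object.
# Driver, server, and port are illustrative placeholders.
db_connect <- function(username, password) {
  DBI::dbConnect(
    odbc::odbc(),
    driver = "PostgreSQL",           # hardcoded driver name
    server = "db.example.internal",  # hardcoded server location
    port   = 5432,                   # hardcoded port
    uid    = username,
    pwd    = password
  )
}
```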

Now let's suppose our organization has strict rules against putting secure credentials in plain text, as well they should. In an open-source package, I wouldn't presume to force users' hands to use one specific system setup. However, in this case, we can make strong assumptions based on our knowledge of an organization's rules and norms. And this sort of function can be great leverage to incentivize users to do the right thing, like storing their credentials in environment variables, because it's the only way they can get the function to work.

Of course, that's only helpful if we provide descriptive error messages when the user does not have their credentials set up this way. Otherwise, they'll get an error that dbPaths is missing, find nothing useful online to help them troubleshoot, since this is an internal choice that we made. So we can enhance this function with an extremely custom and prescriptive error message explaining what went wrong, and either how to fix it, or where we can get more information.
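One way to sketch that prescriptive error in base R, assuming credentials live in environment variables; the variable names `DB_USER` and `DB_PASS` and the wiki reference are hypothetical stand-ins for whatever convention your organization documents:

```r
# Fail fast with a custom, prescriptive message when credentials are not
# stored in environment variables, since users will find nothing useful
# online about this internal choice.
get_db_credentials <- function() {
  username <- Sys.getenv("DB_USER")
  password <- Sys.getenv("DB_PASS")
  if (username == "" || password == "") {
    stop(
      "Database credentials not found.\n",
      "Please set the DB_USER and DB_PASS environment variables, ",
      "e.g. in your .Renviron file. ",
      "See the internal wiki page on credential management for instructions.",
      call. = FALSE
    )
  }
  list(username = username, password = password)
}
```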


Of course, even better than explaining errors is preventing them from occurring at all. We might also know at our specific organization that non-alphanumeric characters are required in passwords, and that dbConnect doesn't natively encode these correctly when passing them to the database. Instead of troubling the user and telling them how to pick their password, we can instead handle this silently inside of the function.
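A sketch of that silent handling using base R's `utils::URLencode`; the helper name `encode_password` is illustrative, and the assumption that percent-encoding is the right fix for your database driver is exactly the kind of organization-specific knowledge the function would encode:

```r
# Silently percent-encode special characters in the password before it is
# passed to the connection call, so users never have to think about it.
# reserved = TRUE encodes non-alphanumeric characters like "@" and "!".
encode_password <- function(password) {
  utils::URLencode(password, reserved = TRUE)
}
```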

The junior analyst: proactive yet responsive functions

However, strong opinions and complete independence don't always make a great colleague. Other times, we might want someone more like a junior analyst. They know a good bit about the organization, and we can trust them to execute calculations correctly and make reasonable assumptions. At the same time, we want them to be responsive to our feedback and willing to try things out in more ways than one.

So how can we capture these jobs in an internal package? We can build in proactive yet responsive functions with default arguments, reserved keywords, and the ellipsis. To illustrate, imagine a basic visualization function that wraps ggplot2 code but allows users to input their preferred x, y, and grouping variables to draw cohort groups. This function is fine, but we can probably draw on institutional knowledge to push our junior analyst to be a little bit more proactive.

If we relied on the same opinionated design as the IT guy, we might consider hardcoding some of the variables inside of the function. Here, though, that isn't a great approach. It isn't the junior analyst's job to give edicts. We might know what the desired x-axis will be 80% of the time, but hardcoding here is too strong an assumption and decreases the usefulness of the function for the other 20%. Instead, we can put our best-guess, 80%-of-the-time-right names as the default arguments in the function header, ordered by decreasing likelihood of the need to override. This means when users do not provide their own value, a default is used. That's the junior analyst's best guess, but users retain complete control to change it as they see fit.
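A sketch of such a function header, with hypothetical default variable names (`month`, `spend`, `cohort`) standing in for whatever names are most common at your organization; the tidy-evaluation embrace operator `{{ }}` forwards the user's (or default) column names into ggplot2:

```r
# ggplot2 wrapper with institutional best guesses as defaults, ordered by
# decreasing likelihood that the user will need to override them.
viz_trend <- function(data, x = month, y = spend, group = cohort) {
  ggplot2::ggplot(data) +
    ggplot2::aes(x = {{ x }}, y = {{ y }}, color = {{ group }}) +
    ggplot2::geom_line()
}
```

Called as `viz_trend(df)`, the defaults kick in; called as `viz_trend(df, x = week)`, the user overrides only what they need to.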

This approach becomes even more powerful if we can abstract out a small set of incredibly common, frequently occurring, assumed variable names or other values. We can define and document this set of keywords or special variable names that span all of our internal packages. If these are well-known and well-documented, users will get into the habit of shaping their data so it plays nice with the package ecosystem and saves a lot of manual typing. This type of incentive to standardize variable names can have other convenient consequences in making data extracts more shareable and code more readable.

Finally, one other nice trick in making our functions responsive to feedback is the ellipsis, or passing the dots. This allows users to provide any number of arbitrary additional arguments beyond what was specified by the developer, and plugs them in at a designated place in the function body. This way, users can extend functions based on needs that the developer could not have anticipated, like customizing the color, size, and line type.
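A sketch of passing the dots; the wrapper name `viz_line` and its default column names are hypothetical, and the `...` is plugged into the line layer so users can forward arguments the developer never anticipated:

```r
# The ellipsis lets users supply any number of additional arguments, which
# are plugged in at a designated place in the function body -- here, the
# geom_line() call -- e.g. color, size, or linetype.
viz_line <- function(data, x = month, y = spend, ...) {
  ggplot2::ggplot(data) +
    ggplot2::aes(x = {{ x }}, y = {{ y }}) +
    ggplot2::geom_line(...)
}

# e.g. viz_line(df, linetype = "dashed", size = 1.2)
```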

The tech lead: vignettes and templates

So far, we've mostly focused on that first dimension, making our package teammates targeted to solving specific internal problems. But there's just as much value in that second dimension: using internal packages as a way to ship not just calculations but workflows, and to share an understanding of how the broader organization operates. To illustrate this, consider our intrepid tech lead. We value this type of teammate because they can draw from a breadth of past experience and institutional knowledge to help you weigh tradeoffs, learn from collected wisdom, and inspire you to do your best work.

So that's a pretty tall order to put in a package, but conscientious use of vignettes and templates can help us towards this goal. Vignettes often help introduce basic functionality of a package with a toy example, as found in the dplyr vignette. Or, less commonly, they may discuss a statistical method that's implemented, as done in the survival package. Vignettes of an internal R package can do more diverse and numerous jobs.

These vignettes can accumulate the hard-won experience and domain knowledge like an experienced tech lead's photographic memory, and they hand-deliver these insights to anyone currently working on a related analysis. Just as a few examples, you can consider having completely code-free vignettes that conceptually introduce you to a common problem that your package solves, explain the workflow and key questions you should be asking yourself along the way, and even potentially get into procedural weeds like what you need to do to add features or deploy a model.

Then after aligning with your users on a conceptual framework, you may introduce the package's functionality and explain how those two overlap. When your package contains functions for many different ways to do a task, you can also compare pros and cons and explain different situations where different ones have proven more or less effective. Finally, vignettes can also include lessons learned, reflections from past users, and references to past examples to help analysts learn about similar projects.

In fact, all of that context may be so helpful even to people that are not direct users of your package, they may want to seek this mentorship. In this case, you can use the pkgdown package to automatically create a package website to share these vignettes with anyone who needs to learn more about a specific problem space. With a single function call, your package can share its wisdom more broadly. And, unlike their human counterparts, the package tech lead can always find time for another meeting.

Similar to vignettes, embedded templates take on a more important and distinctive role for internal packages. In open-source packages, R Markdown templates provide a pre-populated file instead of the default. This is most commonly used to demonstrate proper formatting syntax. For example, the flexdashboard package uses a template to show users how to set up the YAML metadata and section headers.

Instead, internal packages can use templates to coach users through workflows because they understand the problems users are facing and the progress they hope to achieve. Internal packages can mentor users and structure their work in two different ways. Package walkthroughs can serve as interactive notebooks that coach users through common analyses. As an example, if a type of analysis requires manual data cleaning and curation, a template notebook can guide users to ask the right questions of their data and generate common views they need to analyze.

We can also include full end-to-end analysis outlines which include placeholder text, commentary, and code if the type of analysis that a package supports usually results in a specific type of report. Similarly, our package can include a project template, not just an R Markdown template. These templates can pre-define a standard file structure and a boilerplate set of files for a new project to give users a helping hand and drive the kind of consistency across projects that any tech lead dreams of when doing a code review.

The project manager: modularizing workflows

Speaking of collaboration brings us to the last teammate whose traits we want our package to evoke: the project manager. One of the biggest differences between task-doing and problem-solving packages is understanding the whole workflow and helping coordinate the project across different components. When writing open-source packages, we rightly assume that our intended audience is R users, but on a true cross-functional internal team, not everyone will be. So we can intentionally modularize the workflow and think about how to augment RStudio's IDE to make sure our tools work well with all of our colleagues.

One way to do this is with modularizing the parts of the workflow that really do or do not require R code. For example, in the templates we just discussed, we could actually make separate template components for the parts that require R code and for the parts that are just English text and commentary. The commentary files can be plain vanilla markdown files that any collaborator could edit without even having to have R installed, and the main R markdown can pull in this plain text output using child documents.
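A sketch of that setup, assuming hypothetical file names: the main R Markdown document pulls in a plain-markdown commentary file through the `child` chunk option, so collaborators can edit the narrative without having R installed.

````markdown
<!-- In the main file (e.g. analysis.Rmd), a child chunk pulls in
     commentary.md, a plain vanilla markdown file that any collaborator
     can edit without R: -->

```{r commentary, child = "commentary.md"}
```
````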

This approach is now made even easier with advances in the RStudio IDE. The visual markdown editor provides a great graphical user interface to support word processing in markdown for those plain text narrative documents that we just discussed. We can also use RStudio add-ins to extend the RStudio interface and ship interactive widgets in our internal packages. To illustrate, I've shown the point-and-click plot coding assistant from the open-source Esquisse package. Add-ins require more investment up front, but they're much easier to maintain than a full application, and they can help convert teammates to R users. Besides, a good project manager is willing to go that extra mile to support their team.

Building a coherent package ecosystem

We've now talked about what can make an individual package a good teammate. Another major opportunity when building a suite of internal tools is that we can think about how multiple packages on our team can best work together. We want teammates that are clear communicators, have defined responsibilities, and keep their promises. We can help packages be these sorts of good teammates with naming conventions, clearly defined scopes, and careful attention to dependencies in testing.

Clear function naming conventions and consistent method signatures help packages effectively communicate with both package and human collaborators. Internally, we can give our suite of internal packages a richer language by defining a consistent set of function-name prefixes that indicate how each function is to be used. One approach I like is that each function prefix can denote the type of object that that function will return. For example, using viz prefixes for functions that return ggplot2 objects. This way, past experience working with one internal package seamlessly translates to the next.

Another aspect of good team communication is having clearly defined roles and responsibilities. Again, since we own our whole internal stack of packages, we have more freedom in how we can choose to divide up functionality and responsibilities across packages. Take for example the data science workflow as described in the R for Data Science book. Open-source packages inevitably have overlapping functionality, which forces users to compare alternatives and decide which is best. Internally, we can use some amount of central planning to ensure each package teammate has a clearly defined role, whether that be to provide a horizontal utility or to enable progress on a specific workstream. And just like one team ought to work well with another, that central planning can include the curation and promotion of open-source tools along with our internal ones. After all, no one team can do it all alone.

When assigning these roles to our team of packages, we should consider how to manage the dependencies between them when different functionality needs to be shared across packages. Packages often have direct dependencies where a function in one package calls a function in another. This is not necessarily bad, but especially with internal packages, which might sometimes have a shorter shelf life and fewer developers, this can sometimes create a domino effect. If one package is deprecated or decides to retire or take a vacation, we don't want the rest of our ecosystem affected.

Alternatively, we can use the fact that both packages A and B are under our control to see if we can eliminate the explicit dependency by promoting a clean handoff. We can see if a function in A can produce an output that B can consume instead of calling A's function directly. Additionally, because we own the full stack, we may also consider if there are shared needs in A and B that should be extracted into a common building block package C. For instance, a set of common visualization function primitives. This way, we at least have a clear hierarchy of dependencies instead of a spider web and can identify a small number of truly essential ones.

Regardless of the type of dependencies we end up with, we can use tests to make sure our packages are reliable teammates who do what they promised. Typically, if I write package B that depends on package A, I can only control package B's tests, so I could write tests to see if A continues to perform as B is expecting. This is a good safeguard, but it means that we will only detect problems after they've already been introduced in A. There's nothing in place to actually stop package A from getting distracted in the first place.

Instead, we prefer that both A and B be conscientious of the promises they've made and stay committed to working towards their shared goal. We can formalize that shared vision with integration tests. That is, we can add tests to both the upstream and the downstream packages to ensure they continue to check in with each other and inquire if any of the changes they're planning could be disruptive. Now, just imagine having such rigorous and ongoing communication and prioritization with your own teammates.
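A sketch of what such an integration test might look like with testthat; the package names `pkgA` and `pkgB`, the function `get_cohort()`, and the column names are all hypothetical. A test like this could live in both the upstream and downstream packages, so either one breaking the contract is caught before release:

```r
# Integration test pinning the contract between packages: pkgA's output
# must keep the columns that pkgB consumes. Run in both packages' suites.
testthat::test_that("pkgA::get_cohort() output satisfies pkgB's contract", {
  result <- pkgA::get_cohort()
  testthat::expect_s3_class(result, "data.frame")
  testthat::expect_true(all(c("month", "spend", "cohort") %in% names(result)))
})
```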


In summary, we all know the joy of working with a great team. And given that you're here right now, I suspect you know the pleasure of cracking open a new R package. By taking advantage of the unique opportunities of designing internal packages, we can achieve the best of both worlds. We can share the fun of working with good tools with the teammates we care about. And we can elevate those tools to full-fledged teammates by giving them the skills they need to succeed. Thank you.