Resources

Emily Riederer | oRganization | RStudio

Many case studies demonstrate the benefits of organizations developing internal R packages. But how do you move your organization from individual internal packages to a coherent internal ecosystem? This talk applies the jobs-to-be-done framework to consider the different roles that internal tools can play, from unblocking IT challenges to democratizing tribal knowledge. Beyond technical functionality, we will explore design principles and practices that make internal packages good teammates and consider how these deviate from open-source standards. Finally, we will consider how to exploit the unique challenges and opportunities of developing within an organization to make packages that collaborate well -- both with other packages and their human teammates. About Emily: Emily Riederer is an Analytics Manager at Capital One. Emily leads a team that is focused on building internal analytical tools and data products, including a suite of R packages and Shiny apps, and cultivating an innersource community of practice for analysts. Emily is an active member of the R community. In 2019, she co-organized satRday Chicago and the Chicago R unconference. You can find her {projmgr} R package on CRAN and her blog at emilyriederer.netlify.com. Previously, Emily earned degrees in Mathematics and Statistics at UNC Chapel Hill and worked as a research assistant in emergency department simulation and optimization.


Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Emily Riederer, and today I'm here to talk about how to design packages for your organization. Think about the last time that you joined a new organization. There was so much you needed to learn before you could start contributing. Remember the frustration when you couldn't figure out how to access data, the lost hours trying to answer the wrong question before you built up your intuition, and the awkwardness of figuring out team norms through trial and error.

When we join a new organization, we must learn these things only once before we can hit the ground running. However, the off-the-shelf tools that we use can't preserve this context. Every day is like their first day at work. Unlike open-source packages, internal packages can embrace institutional knowledge more like a colleague in two important ways. First and foremost, they can aim to solve far more domain-specific and concrete problems than open-source packages. Second, although the scope of the problem they solve is narrower, their insight into our organization allows them to be more holistic in the solutions they propose, internalizing more steps of the workflow needed to answer that question, covering areas like pulling data and communicating with colleagues.

Because of these two factors, internal packages can add a lot of value by interfacing internal utilities, streamlining the analysis process, and empowering data scientists to create high-quality results. But, more than the tasks they can do, today I want to talk about how to design such internal packages to be as rich in context as possible.

To do this, I like to think about the popular jobs-to-be-done framework for product development. This asserts that we hire a product to do a job that helps us make progress towards the goal. And, to me, it's that notion of progress, and truly knowing the ins and outs of what sort of progress our users need to make that sets internal packages apart. Additionally, these jobs can have functional, social, and emotional aspects.

For the rest of this discussion, I'll tweak this framework just slightly. Let's build a team of packages to do the jobs that help our org answer impactful questions with efficient workflows. So what sort of teammates should our packages be? Let's meet a few.

The IT guy: abstracting infrastructure

First, I'd like to introduce you to the IT guy. Think of a helpful or friendly IT or DevOps colleague that you may know. They're fantastic at handling the quirks of infrastructure and abstracting away all the sorts of things that data scientists don't want to spend their time thinking about, like proxies, connections, servers, etc. In that abstraction process, they also take on additional responsibilities to promote good practices like data security and credential management. Ideally, they can save us a lot of time and frustration in navigating organization-specific roadblocks that no amount of searching on Stack Overflow can help.

So how can we encode these characteristics in a package? Writing utility functions can help achieve that same type of abstraction. And in these functions, we can take an opinionated stance and craft helpful error messages. Let's take a look at an example. Suppose we want to write a function to connect ourselves to a database. First, we might start out with a rather boilerplate piece of code using the DBI package. We take in the username and password, hardcode the driver name, server location, port, and return the connection object.
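A minimal sketch of that boilerplate connection helper, assuming an odbc-backed DBI connection; the driver name, server location, and port below are hypothetical placeholders, not values from the talk:

```r
# Boilerplate connection helper: take in username and password, hardcode
# the organization-specific details, and return the connection object.
# Driver, server, and port are illustrative placeholders.
db_connect <- function(username, password) {
  DBI::dbConnect(
    odbc::odbc(),
    driver = "PostgreSQL",           # hardcoded driver name
    server = "db.example.internal",  # hardcoded server location
    port   = 5432,                   # hardcoded port
    uid    = username,
    pwd    = password
  )
}
```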

Now let's suppose our organization has strict rules against putting secure credentials in plain text, as well they should. In an open-source package, I wouldn't presume to force users' hands to use one specific system setup. However, in this case, we can make strong assumptions based on our knowledge of an organization's rules and norms. And this sort of function can be great leverage to incentivize users to do the right thing, like storing their credentials in environment variables, because it's the only way they can get the function to work.

Of course, that's only helpful if we provide descriptive error messages when the user does not have their credentials set up this way. Otherwise, they'll get an error that dbPaths is missing, find nothing useful online to help them troubleshoot, since this is an internal choice that we made. So we can enhance this function with an extremely custom and prescriptive error message explaining what went wrong, and either how to fix it, or where we can get more information.
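One way to sketch that prescriptive error in base R, assuming credentials live in environment variables; the variable names `DB_USER` and `DB_PASS` and the wiki reference are hypothetical stand-ins for whatever convention your organization documents:

```r
# Fail fast with a custom, prescriptive message when credentials are not
# stored in environment variables, since users will find nothing useful
# online about this internal choice.
get_db_credentials <- function() {
  username <- Sys.getenv("DB_USER")
  password <- Sys.getenv("DB_PASS")
  if (username == "" || password == "") {
    stop(
      "Database credentials not found.\n",
      "Please set the DB_USER and DB_PASS environment variables, ",
      "e.g. in your .Renviron file. ",
      "See the internal wiki page on credential management for instructions.",
      call. = FALSE
    )
  }
  list(username = username, password = password)
}
```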


Of course, even better than explaining errors is preventing them from occurring at all. We might also know at our specific organization that non-alphanumeric characters are required in passwords, and that dbConnect doesn't natively encode these correctly when passing them to the database. Instead of troubling the user and telling them how to pick their password, we can instead handle this silently inside of the function.
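A sketch of that silent handling using base R's `utils::URLencode`; the helper name `encode_password` is illustrative, and the assumption that percent-encoding is the right fix for your database driver is exactly the kind of organization-specific knowledge the function would encode:

```r
# Silently percent-encode special characters in the password before it is
# passed to the connection call, so users never have to think about it.
# reserved = TRUE encodes non-alphanumeric characters like "@" and "!".
encode_password <- function(password) {
  utils::URLencode(password, reserved = TRUE)
}
```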

The junior analyst: proactive yet responsive functions

However, strong opinions and complete independence don't always make a great colleague. Other times, we might want someone more like a junior analyst. They know a good bit about the organization, and we can trust them to execute calculations correctly and make reasonable assumptions. At the same time, we want them to be responsive to our feedback and willing to try things out in more ways than one.

So how can we capture these jobs in an internal package? We can build in proactive yet responsive functions with default arguments, reserved keywords, and the ellipsis. To illustrate, imagine a basic visualization function that wraps ggplot2 code but allows users to input their preferred x, y, and grouping variables to draw cohort groups. This function is fine, but we can probably draw on institutional knowledge to push our junior analyst to be a little bit more proactive.

If we relied on the same opinionated design as the IT guy, we might consider hardcoding some of the variables inside of the function. Here, though, that isn't a great approach. It isn't the junior analyst's job to give edicts. We might know what the desired x-axis will be 80% of the time, but hardcoding here is too strong an assumption and decreases the usefulness of the function for the other 20%. Instead, we can put our best-guess, 80%-of-the-time-right names as the default arguments in the function header, ordered by decreasing likelihood of the need to override. This means when users do not provide their own value, a default is used. That's the junior analyst's best guess, but users retain complete control to change it as they see fit.
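A sketch of such a function header, with hypothetical default variable names (`month`, `spend`, `cohort`) standing in for whatever names are most common at your organization; the tidy-evaluation embrace operator `{{ }}` forwards the user's (or default) column names into ggplot2:

```r
# ggplot2 wrapper with institutional best guesses as defaults, ordered by
# decreasing likelihood that the user will need to override them.
viz_trend <- function(data, x = month, y = spend, group = cohort) {
  ggplot2::ggplot(data) +
    ggplot2::aes(x = {{ x }}, y = {{ y }}, color = {{ group }}) +
    ggplot2::geom_line()
}
```

Called as `viz_trend(df)`, the defaults kick in; called as `viz_trend(df, x = week)`, the user overrides only what they need to.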

This approach becomes even more powerful if we can abstract out a small set of incredibly common, frequently occurring, assumed variable names or other values. We can define and document this set of keywords or special variable names that span all of our internal packages. If these are well-known and well-documented, users will get into the habit of shaping their data so it plays nice with the package ecosystem and saves a lot of manual typing. This type of incentive to standardize variable names can have other convenient consequences in making data extracts more shareable and code more readable.

Finally, one other nice trick in making our functions responsive to feedback is the ellipsis, or passing the dots. This allows users to provide any number of arbitrary additional arguments beyond what was specified by the developer, and plugs them in at a designated place in the function body. This way, users can extend functions based on needs that the developer could not have anticipated, like customizing the color, size, and line type.
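A sketch of passing the dots; the wrapper name `viz_line` and its default column names are hypothetical, and the `...` is plugged into the line layer so users can forward arguments the developer never anticipated:

```r
# The ellipsis lets users supply any number of additional arguments, which
# are plugged in at a designated place in the function body -- here, the
# geom_line() call -- e.g. color, size, or linetype.
viz_line <- function(data, x = month, y = spend, ...) {
  ggplot2::ggplot(data) +
    ggplot2::aes(x = {{ x }}, y = {{ y }}) +
    ggplot2::geom_line(...)
}

# e.g. viz_line(df, linetype = "dashed", size = 1.2)
```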

The tech lead: vignettes and templates

So far, we've mostly focused on that first dimension, making our package teammates targeted to solving specific internal problems. But there's just as much value in that second dimension: using internal packages as a way to ship not just calculations but workflows, and to share an understanding of how the broader organization operates. To illustrate this, consider our intrepid tech lead. We value this type of teammate because they can draw from a breadth of past experience and institutional knowledge to help you weigh tradeoffs, learn from collected wisdom, and inspire you to do your best work.

So that's a pretty tall order to put in a package, but conscientious use of vignettes and templates can help us towards this goal. Vignettes often help introduce basic functionality of a package with a toy example, as found in the dplyr vignette. Or, less commonly, they may discuss a statistical method that's implemented, as done in the survival package. Vignettes of an internal R package can do more diverse and numerous jobs.

These vignettes can accumulate the hard-won experience and domain knowledge like an experienced tech lead's photographic memory, and they hand-deliver these insights to anyone currently working on a related analysis. Just as a few examples, you can consider having completely code-free vignettes that conceptually introduce you to a common problem that your package solves, explain the workflow and key questions you should be asking yourself along the way, and even potentially get into procedural weeds like what you need to do to add features or deploy a model.

Then after aligning with your users on a conceptual framework, you may introduce the package's functionality and explain how those two overlap. When your package contains functions for many different ways to do a task, you can also compare pros and cons and explain different situations where different ones have proven more or less effective. Finally, vignettes can also include lessons learned, reflections from past users, and references to past examples to help analysts learn about similar projects.

In fact, all of that context may be so helpful even to people that are not direct users of your package, they may want to seek this mentorship. In this case, you can use the pkgdown package to automatically create a package website to share these vignettes with anyone who needs to learn more about a specific problem space. With a single function call, your package can share its wisdom more broadly. And, unlike their human counterparts, the package tech lead can always find time for another meeting.

Similar to vignettes, embedded templates take on a more important and distinctive role for internal packages. In open-source packages, R Markdown templates provide a pre-populated file instead of the default. This is most commonly used to demonstrate proper formatting syntax. For example, the flexdashboard package uses a template to show users how to set up the YAML metadata and section headers.

Instead, internal packages can use templates to coach users through workflows because they understand the problems users are facing and the progress they hope to achieve. Internal packages can mentor users and structure their work in two different ways. Package walkthroughs can serve as interactive notebooks that coach users through common analyses. As an example, if a type of analysis requires manual data cleaning and curation, a template notebook can guide users to ask the right questions of their data and generate common views they need to analyze.

We can also include full end-to-end analysis outlines which include placeholder text, commentary, and code if the type of analysis that a package supports usually results in a specific type of report. Similarly, our package can include a project template, not just an R Markdown template. These templates can pre-define a standard file structure and a boilerplate set of files for a new project to give users a helping hand and drive the kind of consistency across projects that any tech lead dreams of when doing a code review.

The project manager: modularizing workflows

Speaking of collaboration brings us to the last teammate whose traits we want our package to evoke: the project manager. One of the biggest differences between task-doing and problem-solving packages is understanding the whole workflow and helping coordinate the project across different components. When writing open-source packages, we rightly assume that our intended audience is R users, but on a true cross-functional internal team, not everyone will be. So we can intentionally modularize the workflow and think about how to augment RStudio's IDE to make sure our tools work well with all of our colleagues.

One way to do this is with modularizing the parts of the workflow that really do or do not require R code. For example, in the templates we just discussed, we could actually make separate template components for the parts that require R code and for the parts that are just English text and commentary. The commentary files can be plain vanilla markdown files that any collaborator could edit without even having to have R installed, and the main R markdown can pull in this plain text output using child documents.
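A sketch of that setup, assuming hypothetical file names: the main R Markdown document pulls in a plain-markdown commentary file through the `child` chunk option, so collaborators can edit the narrative without having R installed.

````markdown
<!-- In the main file (e.g. analysis.Rmd), a child chunk pulls in
     commentary.md, a plain vanilla markdown file that any collaborator
     can edit without R: -->

```{r commentary, child = "commentary.md"}
```
````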

This approach is now made even easier with advances in the RStudio IDE. The visual markdown editor provides a great graphical user interface to support word processing in markdown for those plain text narrative documents that we just discussed. We can also use RStudio add-ins to extend the RStudio interface and ship interactive widgets in our internal packages. To illustrate, I've shown the point-and-click plot coding assistant from the open-source Esquisse package. Add-ins require more investment up front, but they're much easier to maintain than a full application, and they can help convert teammates to R users. Besides, a good project manager is willing to go that extra mile to support their team.

Building a coherent package ecosystem

We've now talked about what can make an individual package a good teammate. Another major opportunity when building a suite of internal tools is that we can think about how multiple packages on our team can best work together. We want teammates that are clear communicators, have defined responsibilities, and keep their promises. We can help packages be these sorts of good teammates with naming conventions, clearly defined scopes, and careful attention to dependencies in testing.

Clear function naming conventions and consistent method signatures help packages effectively communicate with both package and human collaborators. Internally, we can give our suite of internal packages a richer language by defining a consistent set of function-name prefixes that indicate how each function is to be used. One approach I like is that each function prefix can denote the type of object that that function will return. For example, using viz prefixes for functions that return ggplot2 objects. This way, past experience working with one internal package seamlessly translates to the next.

Another aspect of good team communication is having clearly defined roles and responsibilities. Again, since we own our whole internal stack of packages, we have more freedom in how we can choose to divide up functionality and responsibilities across packages. Take for example the data science workflow as described in the R for Data Science book. Open-source packages inevitably have overlapping functionality, which forces users to compare alternatives and decide which is best. Internally, we can use some amount of central planning to ensure each package teammate has a clearly defined role, whether that be to provide a horizontal utility or to enable progress on a specific workstream. And just like one team ought to work well with another, that central planning can include the curation and promotion of open-source tools along with our internal ones. After all, no one team can do it all alone.

When assigning these roles to our team of packages, we should consider how to manage the dependencies between them when different functionality needs to be shared across packages. Packages often have direct dependencies where a function in one package calls a function in another. This is not necessarily bad, but especially with internal packages, which might sometimes have a shorter shelf life and fewer developers, this can sometimes create a domino effect. If one package is deprecated or decides to retire or take a vacation, we don't want the rest of our ecosystem affected.

Alternatively, we can use the fact that both packages A and B are under our control to see if we can eliminate the explicit dependency by promoting a clean handoff. We can see if a function in A can produce an output that B can consume instead of calling A's function directly. Additionally, because we own the full stack, we may also consider if there are shared needs in A and B that should be extracted into a common building block package C. For instance, a set of common visualization function primitives. This way, we at least have a clear hierarchy of dependencies instead of a spider web and can identify a small number of truly essential ones.

Regardless of the type of dependencies we end up with, we can use tests to make sure our packages are reliable teammates who do what they promised. Typically, if I write package B that depends on package A, I can only control package B's tests, so I could write tests to see if A continues to perform as B is expecting. This is a good safeguard, but it means that we will only detect problems after they've already been introduced in A. There's nothing in place to actually stop package A from getting distracted in the first place.

Instead, we prefer that both A and B be conscientious of the promises they've made and stay committed to working towards their shared goal. We can formalize that shared vision with integration tests. That is, we can add tests to both the upstream and the downstream packages to ensure they continue to check in with each other and inquire if any of the changes they're planning could be disruptive. Now, just imagine having such rigorous and ongoing communication and prioritization with your own teammates.
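A sketch of what such an integration test might look like with testthat; the package names `pkgA` and `pkgB`, the function `get_cohort()`, and the column names are all hypothetical. A test like this could live in both the upstream and downstream packages, so either one breaking the contract is caught before release:

```r
# Integration test pinning the contract between packages: pkgA's output
# must keep the columns that pkgB consumes. Run in both packages' suites.
testthat::test_that("pkgA::get_cohort() output satisfies pkgB's contract", {
  result <- pkgA::get_cohort()
  testthat::expect_s3_class(result, "data.frame")
  testthat::expect_true(all(c("month", "spend", "cohort") %in% names(result)))
})
```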


In summary, we all know the joy of working with a great team. And given that you're here right now, I suspect you know the pleasure of cracking open a new R package. By taking advantage of the unique opportunities of designing internal packages, we can achieve the best of both worlds. We can share the fun of working with good tools with the teammates we care about. And we can elevate those tools to full-fledged teammates by giving them the skills they need to succeed. Thank you.