Resources

Data-as-a-product: A framework for collaborative data wrangling (Clara Amorosi, BMS) | posit::conf

Speaker: Clara Amorosi

Abstract: Data preparation requires substantial time and subject matter expertise but is often tailored to a single-use deadline rather than encouraging reusable workflows across a team. We developed a framework that acknowledges the time and expertise invested in data preparation and maximizes its value. Our data-as-a-product suite of R packages promotes joint code and data version control, standardizes metadata capture, tracks R package versioning, and encourages best practices such as adherence to functional programming. I'm excited to share my experience onboarding collaborators to this reproducible research framework, highlighting key challenges and lessons learned from advocating for good development practices in a dynamic research environment.

GitHub repo: https://github.com/amashadihossein/daapr

posit::conf(2025) | Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Hi, thanks everyone. So I'm a data scientist at Bristol-Myers Squibb, and I'm here to talk about a framework for reproducible and collaborative data wrangling.

But first, I'd like to ask if you've ever heard of type one versus type two fun. If you haven't, type one fun is an activity that's enjoyable while it's happening. So this could be sitting on a sunny beach on vacation. Type two fun, on the other hand, is an activity that's not as enjoyable while it's happening. You might even be miserable. But then, when the activity is over, you look back on it fondly.

And so, classic examples of type two fun include exercise, or my personal favorite, jumping into an icy alpine lake. And so, today I'll talk to you about how, just like type two fun, taking the plunge into reproducible research is worth it.

So I often think about reproducible research in the context of ad hoc analysis requests. I work with a lot of clinical trial data in a large organization, and I might get a request from an investigator such as this one. So maybe they ask me to look at baseline albumin levels across two different clinical trials. And albumin is a relatively common lab measurement, so I'll say, okay, I'll take a look.

But then, when I actually start looking into the data, I realize that there's a whole bunch of questions that I'll need to answer in order to actually go back to the investigator with the solution. So maybe I need to figure out which specimen type I'm interested in, or maybe there's different units used, so I need to perform a unit conversion step. And inevitably, most of the work goes into finding, cleaning, and structuring the data.

And this is really the classic 80-20 rule of data science, where most of the work goes into this data wrangling step, and then you only have the last 20% where you do the actual analysis and essentially get the answer.

And sometimes, this can be disposable data work. Maybe you just need to get to the answer as quickly as possible, so you essentially turn in your homework and go on with your day. But fairly often, you're asked to return to this analysis six months later, or maybe you want to share your analysis with somebody else.

And in this case, having disposable data work isn't a really good solution here. So one way to make this data wrangling more reusable is to package it into a data product.

What is a data product?

And so what do I mean by a data product? In this case, a data product is an analysis-ready data set, or a group of data sets that's versioned, reproducible, and shareable. And this is how you can essentially package your data wrangling work and collaborate with others on creating this data product object.

The only small wrinkle here is that this is more work. This is definitely the type two track here, but I'd like to talk through a framework for creating these data products and hope to convince you why it can be worth it to create these data products.

Introducing daapr

So at BMS, we've developed a suite of packages called daapr, or data-as-a-product in R, that provides a framework for creating these reproducible data products. These were developed by a group of us at BMS and are used across different types of scientific analysis questions, mostly in the exploratory space.

The idea behind daapr is that it allows multiple users to pool their data wrangling efforts and create this data product object that they can collaborate on. The overall goal is to be able to build these analysis-ready data products that follow sound data engineering principles. So I'll talk through what I think is the core principle behind daapr, and then I'll go through how you would actually use the daapr framework to build a data product.

So the key principle behind daapr is pretty simple. It's just to version all the things. This comes in three different flavors: you version your code, you version your environment, and you version your data. I've arbitrarily ranked these by how hard I think they are, but daapr attempts to make it easier for the user to do all these things, and it also requires them to version all of these things.


So in terms of code, daapr uses Git and GitHub, so users version control, commit, and push their data wrangling code. In terms of versioning your environment, it uses the renv package, so users track and install all their required package dependencies. This is also very useful when you're sharing code with somebody else and you need them to start from the same point as you.

And then in terms of versioning your data, daapr tracks the metadata of all of the input files that you're using. Under the hood, it uses the pins package to do this, a wonderful package for publishing and sharing datasets. One of the really nice things it does is create versioned datasets, and daapr uses this to track which versions of all of your input datasets you're using.
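As a small illustration of that versioning behavior (not shown in the talk), here is a minimal pins sketch. A temporary local board stands in for daapr's remote board, and the pin name is made up:

```r
library(pins)

# A temporary, versioned local board stands in for the remote board
# (in practice this would be cloud storage, e.g. board_s3()).
board <- board_temp(versioned = TRUE)

# Writing the "same" pin twice with different contents records
# two distinct, retrievable versions.
pin_write(board, head(mtcars, 3), name = "labs_raw")
pin_write(board, head(mtcars, 5), name = "labs_raw")

versions <- pin_versions(board, "labs_raw")
nrow(versions)  # two versions on record
```

daapr builds on exactly this mechanism to record which version of each input file a data product was built from.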

The daapr workflow

Okay, so that's the key principle behind daapr, but how does it actually work? How do you build a data product?

This is the daapr workflow, and I'll build out these steps as I walk through them. But the general idea is that you're going to initialize your product, create an input data snapshot, do the actual build step where you add derivation code, then do the version-everything step, and finally deploy and share your data product.

So I'll jump into the first step here, which is how do you actually initialize a data product? So you'll need a few things in order to start. The first is you need to specify the remote board that will be used to store your data. This could be cloud storage, or a shared server, or a networked device. And you'll basically just need to gather the credentials or configuration details needed to connect to this remote board.

You can then use one daapr command to initialize your local build environment. This will initialize Git locally, start renv for you to track all your package dependencies, and save all the configuration details you need to connect to your remote.

At this step, you can create an input data snapshot. So you would gather the input files that you're going to process. In my world, this is a lot of clinical datasets, but this could be any type of input file as long as it's in rectangular format.

daapr then pins and syncs these input files to your remote board and saves the versions of all of the files that you're working with.

Okay, we're now on the build step. This is where you actually build the data product by adding derivation code, and there's quite a lot of flexibility in how you approach this. daapr provides scaffolding code for how you might go about it, but often the way we approach this is to create derivation functions, where each function creates a dataset that you might use for analysis. I'm showing an example here with a deriveAlbumin function.

And within the function, it will take a link to the input data that's snapshotted on your remote board, read that in, and do all of the data processing work. Each of these derivation functions might reference one or more input files, and then you might have one or more derivation functions that all build on top of each other.
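A derivation function in that style might look like the following sketch; the function and pin names are illustrative, and a temporary board stands in for the remote:

```r
library(pins)

board <- board_temp(versioned = TRUE)

# Snapshot a toy input file to the board, as the snapshot step would.
labs_raw <- data.frame(
  subject = c("S1", "S2", "S2"),
  test    = c("ALB", "ALB", "CRP"),
  value   = c(4.1, 3.8, 2.0),
  unit    = c("g/dL", "g/dL", "mg/L")
)
pin_write(board, labs_raw, name = "labs_raw")

# Derivation function: reads the pinned input by name and returns
# an analysis-ready dataset (here, albumin records only).
derive_albumin <- function(board, input_name = "labs_raw") {
  labs <- pin_read(board, input_name)
  labs[labs$test == "ALB", c("subject", "value", "unit")]
}

albumin <- derive_albumin(board)
```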

And I should mention here that we often use the targets package to help with this. This is a really cool data pipeline package. It's make-like, and it lets you set up all of the dependencies among your derivation functions.
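A minimal targets pipeline in that spirit might look like this (target names are made up; each tar_target() depends on the targets it references, and tar_make() rebuilds only what changed):

```r
library(targets)

# Write a _targets.R pipeline where the derived dataset depends on
# the raw one, make-style.
tar_script({
  list(
    tar_target(labs_raw,
               data.frame(test = c("ALB", "CRP"), value = c(4.1, 2.0))),
    tar_target(albumin,
               labs_raw[labs_raw$test == "ALB", , drop = FALSE])
  )
})

tar_make()                   # builds labs_raw, then albumin
albumin <- tar_read(albumin) # retrieve the derived dataset
```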

So as I said before, there's quite a lot of flexibility in how you actually build your data product. I'm showing here an example data product structure. So there's a lot going on here. The main thing I want you to focus on is that the data product object itself is structured into three main pieces. So you have your inputs, your outputs, and your metadata.

The inputs are all of your input files that are already synced to your remote board, and then the outputs are all of the data sets that you've created from derivation functions. And then you can optionally have metadata, such as a data dictionary, to define what your column names are, for example.

The other thing I'll note here is that we attempt to make these data products a little more lightweight by having the inputs return links to the data, not the data themselves. This is helpful because, at least in the clinical trial world, we're working with medium-sized data, not truly big data, but it's still helpful not to have to read in whole datasets at the same time.
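As a sketch of that link idea (all names illustrative): store only the pin name and version, and materialize the data on demand:

```r
library(pins)

board <- board_temp(versioned = TRUE)
pin_write(board, mtcars, name = "labs_raw")

# The data product stores a link (name + version), not the data itself.
input_link <- list(
  name    = "labs_raw",
  version = pin_versions(board, "labs_raw")$version[[1]]
)

# Metadata is available without reading the full dataset...
meta <- pin_meta(board, input_link$name, version = input_link$version)

# ...and the data is only read in when a derivation needs it.
labs <- pin_read(board, input_link$name, version = input_link$version)
```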

Okay, so at this point, you've structured your data product however you want and added a bunch of derivation code. Now you get back to the version-everything step. daapr will help you commit and push all of your code and other versionable things to your GitHub repo: the derivation code and all of the data product structuring logic, your renv files that track your package dependencies, and all of the metadata around your input and output files.

Okay, and then one final step, which is that now you're ready to deploy it. So, at this point, you can push your data product to the remote board, and it's ready to share, to use for your own analysis, to share with somebody else to use for their own analysis, and you're ready to go here.

There's actually a sixth step here, which is to iterate. At this point, you can go back to step two and add new input data, or you could add more derivation code, like a new function you want to add, and for each of these, you'd go back through the cycle and deploy a new version of the data product.

The other helpful thing here is that all of this is versioned, so if you have an analysis script that references the data product, you can refer to it by version so that it doesn't break when a new version is published. This is also where you can bring in collaborators to work on the data product with you. It doesn't have to be just you going through this iterative process; you can bring in somebody else to do this as well.
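With pins, for example, an analysis script can pin itself to a specific version (pin names illustrative, temporary board standing in for the remote):

```r
library(pins)

board <- board_temp(versioned = TRUE)

# First release of a data product output.
pin_write(board, data.frame(x = 1:3), name = "adlb")
v1 <- pin_versions(board, "adlb")$version[[1]]

# A collaborator later publishes a new version...
pin_write(board, data.frame(x = 1:100), name = "adlb")

# ...but an analysis script that requested v1 is unaffected.
stable <- pin_read(board, "adlb", version = v1)
nrow(stable)  # still 3 rows
```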

Collaboration in practice

Just to show an example of how this might work, I'm showing a plot of data product collaboration over time in the form of commits to selected data product repositories. So, here, each line is an individual data product repo, and then each dot is a commit to that repo, and then the color of each dot indicates a different collaborator identity.

So, it's hard to tell exactly what's going on, but I hope you can see that each of the data products has a different group of collaborators, and each of them have different bursts of activity over time.

The other really cool thing is that some of these data products go back as far as 2022, which is quite exciting for me. Some of these represent kind of scientific analysis questions that have been persistent and interesting for many years, and the data product has been useful for those analyses for that whole time.

Getting people onboarded

So, I hope this shows how daapr can be used as a collaborative framework. I do want to switch tracks a little bit now and address the elephant in the room, which is: how do you actually get people to try this? I walked through, admittedly, a number of different steps to create a data product, and people trying this out might need to learn new tools that they're not as comfortable with. So, how do you actually get people onboarded to a framework like this?

So, I have a few tips here. I will note that most of them are educational or focused on reducing barriers to entry. We're not forcing anyone to use this framework.

The first tip I have is to provide comprehensive training. This might be hands-on demos with real data products; I would say it's been very effective to have one-on-one demos, so you can look over somebody's shoulder and see when they run into issues. We also have asynchronous learning materials, things like video tutorials and vignettes, and for all of these, being really clear and providing dedicated debugging support has really helped.

The second tip I have is to let people ease into the process. We do this by providing multiple daapr roles. The first role is a reader: people can interact with the data product in read-only mode and use it for their analysis, but they don't have to fully learn the daapr framework. They can then graduate to a writer role, where they learn the full suite of daapr packages and tools and contribute derivation code.

And then the third role is an owner. So, at this point, this is more of an admin role. They would maybe manage access to the remote and do some provisioning activity, and we don't have as many people getting fully to the owner role, which is fine.

The third tip I have is really to show off the results, like show off what you can get by working within this framework. It's not very convincing to show, for example, just the number of lines of derivation code you've added to a repo. That's not going to convince anyone. Instead, sharing things like maybe you have a Quarto report that's linked to the data product that has key figures that get shown in stakeholder meetings, or maybe you have a Shiny app that lets you explore the data product and which derivations are involved in it.

And then finally, showing off the results of the collaboration. So, if you find somebody who's not already a reproducibility nerd, but has tried out the framework and has had success with it, and then they can kind of champion the framework to other people, that's really key as well.

Common roadblocks

So, despite all of these tips, we do run into a lot of common roadblocks getting people onboarded to this framework, and I think a lot of these will be familiar to many folks here as well; they're not specific to daapr. So, I just wanted to walk through how we approach some of these.

So, the first common roadblock is that managing credentials is annoying. Maybe you have a Git token, maybe you have S3 keys, maybe you have another data platform credential you need to provide, and all of these probably expire at different times, which you can never really control. We've had good success using the keyring package to manage all types of credentials, but there are a lot of different credential management packages out there.
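As one hedged sketch of the keyring approach (the service name is made up, and the environment-variable backend is chosen here only so the example doesn't depend on an OS credential store):

```r
library(keyring)

# Explicitly pick the env-var backend for a portable illustration;
# in practice the default OS keychain backend is usually what you want.
kb <- backend_env$new()

# Store a credential once (e.g. an S3 key)...
kb$set_with_value(service = "daap_s3_key", password = "not-a-real-key")

# ...and retrieve it wherever the pipeline needs it.
secret <- kb$get(service = "daap_s3_key")
```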

The second common roadblock is how you really impart Git best practices. There are a lot of resources out there for getting started with Git; I just want to shout out Jenny Bryan's Happy Git with R, which is an excellent resource.

But I think the thing I want to focus on here is that we've all been at the learning stage for, you know, learning how to use Git and GitHub. We've all accidentally created a merge conflict or committed a file that we shouldn't. And so, I think for all of these, you know, leading with solutions when you're teaching people how to work through these issues, that's really key because we were all learning this once.

And then the third common roadblock kind of relates to the corporate enterprise environment. So, maybe you have a corporate firewall or maybe you have to configure your internal package manager to install packages. And for these, I would say it's been most helpful just to provide example setup code, especially across different operating systems. And I would say the key point for walking through these common roadblocks is that they will happen, and the goal is really to try to empower people to be able to work through them when they encounter them the next time. So, that's really what we try to focus on when we have people come to us with these issues.

Resources and closing thoughts

So, I'd just like to end with some daapr resources. If you're interested in trying this out, daapr is available on GitHub. I will note that, despite having some data products that are over three years old, daapr is still maturing. It currently only supports RDS and qs file types, though we are thinking about some Parquet development, and it only supports three different types of remote boards.

And then the final thought that I'll leave you with: I hope what I said about making your analysis pipelines more reproducible resonated with you. If you're trying to get away from some of the disposable data work, but maybe you're not fully ready to take the plunge into the type two world of daapr, I would encourage you to try out some of the tools that daapr uses: renv, pins, targets. These are all wonderful packages for making your pipelines more reproducible.

With that, I just need to acknowledge the other members of the daapr dev team. Big thanks to Afshin Mashadi-Hossein, who was the original developer and creator of the daapr suite of packages. And then thanks to Leslie, Barack, and Mandeep, who are the BMS daapr dev team. And I couldn't resist putting in a photo of my cats, who have listened to many versions of this talk. Thank you.

Q&A

Apologies if people did submit questions; I think we've had some technical difficulties with them coming through Slido, so we're just going to have to wing some questions. So, what's next for daapr? Is it going to be made public? Do you plan to do more than Parquet support? What else?

Yeah, that's a great question. So, it is on GitHub. We would love to get it on CRAN. I would say we're thinking about some refactoring work. So, yeah, definitely Parquet development, I think, is one of our key first next steps.

And then, I'm also curious, what was your thought process about deciding to build? It does a lot of things, but one of the things it does is wrap a bunch of other tools. So, what's the trade-off on building another tool versus training people on the existing tools?

Yeah, that's a great question as well. It's definitely a trade-off. I will say some of these tools have a lot of common pain points, and I think the wrapper approach forces people to buy into all of the tools at once, not just one of them. We already had people using the daapr framework, so as people were interested, we could say, hey, learn these reproducible tools as you're also learning the framework. But daapr is not always the right solution for every analysis question, so it's definitely a trade-off there.

Okay, thank you, Clara. Just one more round of applause.