Hitting the Target(s) of Data Orchestration - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

I'm a huge huge fan of Lord of the Rings. And every time I see this image, part of my brain thinks that this is the Eye of Sauron.

But actually, this is the first image of a black hole. It represents an extreme case of a data pipeline, where petabytes of data were shipped across four continents to a supercomputer, processed, and synthesized into this image of a few kilobytes.

My name is Alexandros Kouretsis. My background is in astrophysics and cosmology, from where I entered the fascinating world of data science, and now I'm working as an R Shiny developer at Appsilon.

In this presentation, I'm going to walk you through some real examples and try to pass on some hands-on experience on how to build effective, efficient, and secure data science workflows using the vast and mature ecosystem of R and Posit Connect, augmented by the game-changing package targets.

So, here, we are trapped in a Groundhog Day, right? Where you have to live the same day over and over again.

And this gets even worse for the development process, because even if I make the slightest change in the pipeline, I have to recompute everything. And this turns the joyful process of building a data science workflow into a Sisyphean process where I have to recompute everything over and over again.

Directed acyclic graphs and the desired behavior

So, let's see what went wrong here and see what options we have to get out of this trap. This is a concept from ETL pipelines. It's a very easy concept. It's called a directed acyclic graph. So, consider two inputs, two tables, called Data1 and Data2. I read this data, I process this data, and I join this data. And you can see the flow of the logic from left to right, and this is a directed acyclic graph.

I can now go even further and generalize this notion, where A, B, C, D can be any kind of steps. And here, in our case, B can be that I load some big genomic files to the S3 bucket, and then I compute some correlations and regressions in the database for genes, which can take hours, et cetera.

So, what I have created here is an all-or-nothing scenario. And this means that whenever I trigger this pipeline, everything is invalidated. Everything is processed again. And what is the desired behavior? So, consider here this G node, which is a piece of logic that just reads a small file from the file system and can take seconds. And when this file changes, this node is invalidated. So, what I want to do is just recompute the nodes that are related to this part of my workflow.

And omit the other steps that can be like big computations that can take hours.
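The desired behavior can be sketched in a few lines of base R. This is only an illustration of the idea, not how Targets is implemented: the graph shape (a heavy chain A, B, C feeding D, plus the small file-reading node G feeding D) and the helper name `invalidated()` are assumptions made for the example.

```r
# Hypothetical DAG: each entry lists the direct downstream nodes of a step.
dag <- list(
  A = "B",           # A feeds B
  B = "C",           # B is the expensive step (hours)
  C = "D",
  G = "D",           # G reads a small file (seconds)
  D = character(0)   # final step, nothing downstream
)

# Collect a changed node plus everything reachable downstream of it.
invalidated <- function(dag, changed) {
  seen <- character(0)
  queue <- changed
  while (length(queue) > 0) {
    node <- queue[[1]]
    queue <- queue[-1]
    if (node %in% seen) next
    seen <- c(seen, node)
    queue <- c(queue, dag[[node]])
  }
  seen
}

to_rerun <- invalidated(dag, "G")           # G and its downstream node D
to_skip <- setdiff(names(dag), to_rerun)    # the expensive chain A, B, C
```

When the file behind G changes, only G and D need to run again, while A, B, and C keep their previous results.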

So, going back to my previous experience from research and data science, what we usually tend to do is introduce some kind of abstraction that we call a job or a chapter or something like that. And what you tend to do is break this down into some kind of steps: you load some data, you process some data, and then you save some data. And then you continue on if you want to extend your project.

You introduce a new job, you load some data again, process data, save data. And this goes on and on, and you start creating something like a higher-order graph where you do things manually. Here you have somehow broken out of this Sisyphean cycle for the development process, but you still have to do some custom scripting to declare the sequence of jobs. You have to do some custom scripting to skip steps, and you also have to do some custom scripting, for example, to share workflow data.

So, you should be careful. For example, if you are on a cloud worker, you have to push all the saved files to an external file storage that is persistent. And this is just part of a real pipeline in the wild. This is a small pipeline, and these are the components of the pipeline and how they are connected together. I think you will see a bigger view of this in the next presentation.

And do I really want to take on all this complexity? It's prone to errors, and it's much more difficult to collaborate with other team members. This is the chaos I was talking about.

Targets to the rescue

So, here is where Targets comes into play. Targets is an opinionated framework for structuring your data science workflows and your data pipelines. And if you follow its conventions and the way it suggests you build your workflows, then you get many benefits out of the box.

Targets will automatically infer the directed acyclic graph for you. It will strategically skip steps in the pipeline for you. It will also keep strong evidence of reproducibility for you. And it has great facilities to integrate with cloud storage. And of course, it has great facilities for distributed computing.

So, how does Targets achieve this? Targets introduces a simple abstraction that is called a target. And a target is just a function that outputs an object. And by convention, this object is pushed to your persistent file storage.

So, let's see a Targets script in action. With these first three lines, we just set the scene. So, you load Targets, you set your options, and with tar_source(), you can load your custom functions. And this list is where all the action takes place. So, here is where I define my targets.

And in this line of code, tar_target() declares that this line of code is a target, and then you go to get_data(), the function that outputs an object, the data object. So, here we have this duality: the get_data() function outputs the data object. And then get_data() takes as input the file object that is generated by a previous target, and this goes on and on. And Targets will directly infer the directed acyclic graph for you, and it will put all these objects in the correct order. So, file now comes before data, and they will be computed correctly.
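The script described above is not reproduced in the transcript, so what follows is a minimal sketch under assumptions rather than the speaker's actual code. The helpers `get_data()` and `get_stats()`, the file `data.csv`, and the independent target `other` are hypothetical. It writes a small `_targets.R` into a temporary directory, runs it, and then checks which targets are outdated after the input file changes.

```r
library(targets)

# Work in a throwaway project directory so the demo leaves nothing behind.
project <- tempfile("targets-demo-")
dir.create(project)
old_wd <- setwd(project)

writeLines(c("x,y", "1,2", "2,4", "3,7"), "data.csv")

# tar_script() writes the code below to _targets.R, the pipeline definition.
tar_script({
  library(targets)
  # In a real project these helpers would live in R/ and be loaded with
  # tar_source(). They are hypothetical stand-ins for illustration.
  get_data <- function(file) read.csv(file)
  get_stats <- function(data) colMeans(data)
  list(
    tar_target(file, "data.csv", format = "file"),  # track the input file
    tar_target(data, get_data(file)),               # depends on `file`
    tar_target(stats, get_stats(data)),             # depends on `data`
    tar_target(other, sum(1:10))                    # independent of the file
  )
}, ask = FALSE)

tar_make()                             # first run builds every target
outdated_after_run <- tar_outdated()   # nothing should be left to do

# Change the input file: only targets downstream of it become outdated.
writeLines(c("x,y", "1,2", "2,4", "3,7", "4,9"), "data.csv")
outdated_after_change <- tar_outdated()

setwd(old_wd)
```

Under these assumptions, a second `tar_make()` would rebuild `file`, `data`, and `stats` and skip `other`, because the dependency graph is inferred from the symbols each target's command mentions.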

Cloud storage integration and distributed computing

Also, Targets has what we call Metaflow integration. And by Metaflow integration, we mean that Targets will send all the metadata and intermediate files that have been computed to a persistent file storage. Here we use an S3 bucket, so you can just set the options and declare your S3 bucket.
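As a hedged sketch of what "set the options and declare your S3 bucket" can look like in `_targets.R`: the bucket name and prefix below are hypothetical placeholders, and AWS credentials are assumed to be configured in the environment already.

```r
# _targets.R (fragment): store target outputs in an S3 bucket.
library(targets)

tar_option_set(
  repository = "aws",                  # send target data to AWS S3
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-pipeline-bucket",   # hypothetical bucket name
      prefix = "my-project"            # hypothetical key prefix in the bucket
    )
  )
)
```

With this in place, targets declared in the list are written to the bucket instead of only to the local `_targets/` data store.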

And now you have many benefits. Some of them are that you can now scale up to petabytes for your storage. You can just switch on data version control. And, of course, you have portability. And by portability, we mean that you can run this pipeline on different machines, and developers can share cached results between them. So, if I have run the pipeline and have pre-computed something, my colleague can share the cache and save hours in their development process.

Targets also has great facilities for distributed computing. It uses crew. And it's very easy. You just go to the options. You set the number of workers. Here we set two workers, for example. And Targets will start these two workers and check if a worker is available. As soon as one is available, it will send the next target there.

So here, let's consider a very simple case. I have some data, and I fit two different models with this data. And, again, each model fit can take some time. And Targets will check if a worker is available and will send these computations there, so they are computed in parallel. And it also has nice facilities to go even further to more classic distributed computing, like using Slurm systems, et cetera.
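A minimal `_targets.R` sketch of this two-model case, assuming the crew package is installed. The model functions and the toy data frame are hypothetical and only stand in for the slow model fits mentioned in the talk.

```r
# _targets.R (sketch): two independent model fits on two local workers.
library(targets)
library(crew)

tar_option_set(
  controller = crew_controller_local(workers = 2)  # two parallel R workers
)

# Hypothetical stand-ins for slow model fits.
fit_model_a <- function(data) lm(y ~ x, data = data)
fit_model_b <- function(data) lm(y ~ poly(x, 2), data = data)

list(
  tar_target(data, data.frame(x = 1:10, y = (1:10)^2)),
  tar_target(model_a, fit_model_a(data)),  # needs only `data`
  tar_target(model_b, fit_model_b(data))   # needs only `data`
)
```

Because `model_a` and `model_b` depend only on `data` and not on each other, both can be dispatched to workers as soon as `data` is ready. For cluster schedulers such as Slurm, a different controller would be swapped in here.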

Deployment with Posit Connect

For deployment now, it is also very easy. You just put one line of code in the Quarto document, targets::tar_make(), and this will trigger the pipeline for you and will do all these nice things. So you can automate it easily on Posit Connect. And you can also easily read any previous object that has been computed and display it in your report.

For example, here we summarize the data object that we computed in the pipeline. And now everything is inspectable. You can create your code books. You can create your validation reports. You can extend your report that is generated during the pipeline.
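As a rough sketch, the R chunk inside such a Quarto document could look like this. The target name `data` follows the talk; everything else about the report is assumed, and the chunk expects a `_targets.R` file in the same project.

```r
# Inside an R code chunk of the Quarto report (sketch).
targets::tar_make()               # run the pipeline, skipping up-to-date targets

data <- targets::tar_read(data)   # read a stored target by name
summary(data)                     # display a summary in the rendered report
```

Scheduling this report on Posit Connect then re-runs the pipeline on each render while only rebuilding what is outdated.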

So, going back to our case, where we have deployed this pipeline on Posit Connect: moving, refactoring, and reframing our processes into a Targets project completes the picture. And with minimum effort now, we have a case of maximum efficiency and scalability.


And yeah, I couldn't resist sharing this meme from my social media. So now we have a nice pipeline. It is also open source, written in R. This means that it is extensible. Also, it is resilient. You can integrate it with any system that you want. And you can move on to building your Barbieland-perfect dashboards and share them with your organization so that everyone benefits from this process.

So the next time you have a project that consists of some logical steps, a workflow, a data pipeline, you can just use Targets and reframe it into a Targets project. Even if it is medium-sized or small, you will get all these benefits. And it will bring you out of the mindset of creating ad hoc, manual scripts for your workflows and linking them together.

So if you want to, yeah, please reach out to me on LinkedIn. This is my LinkedIn profile here. And yeah, use Targets. Thank you.

Q&A

So a quick question. How was the experience of integrating Targets with Connect for the team?

It was quite nice. The thing is that the biggest bottleneck, one of the things that you have to solve, is that you want to store this data somehow. So you will need a persistent file storage, or if you have a friendly IT team, you can request, let's say, write access on Posit Connect to an absolute directory. So this is the main thing.

Thank you again, Alexandros. Thank you. Appreciate it.

