
Data Science in Production Has Never Been So Easy | Feat: Posit Connect (Adam Wang)
Speaker: Adam Wang (NMDP)

Abstract: Data science is most impactful when it's in production. However, there is often a disconnect between local development and the production system. I'll show how to leverage **Posit Connect** to reduce the friction between development and production, automate and reproduce your data science at scale, and empower decision makers. We'll uncover the production architecture that powers data science at NMDP.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Thank you everyone for coming. I'm really excited to follow up on the spotlight video we saw yesterday morning where we got to see Pierce's story and his battle against blood cancer.
And so every year there are tens of thousands of patients like Pierce whose best hope for a cure is often an unrelated donor transplant, from someone who at that point in time is essentially a total stranger. And so it is our job as data scientists to use our data superpowers to help identify these donors and better serve patients like Pierce. But one of the challenges I find data scientists typically run into is: how do we get these insights into production, into the hands of decision makers (in our case, physicians who are selecting donors), and how do we get the insights to them when they need them?
And so over a couple of years we've been on a journey to make this process much easier and reduce the friction between local development and production environments. It all starts around five years ago: we were a young data team doing lots of ad hoc analyses in R. And it was kind of a cautionary tale of "be careful what you wish for," because we did a lot of great analyses. They were so great that stakeholders wanted them refreshed every morning. So I would set an 8 o'clock alarm, run a series of 20 reports, make sure everything was working, and rinse and repeat.
And so that got very, very tedious, and so, as one did back in the day, we were feeling lucky and searched the web for how to automate the running of these R scripts. The first result was Windows Task Scheduler. And it actually made a lot of sense conceptually: it replaced me and my alarm clock with a Windows server managed by our IT department, running a task scheduler. But it had quite a few pain points, one being that we didn't always know when reports failed, which is a big problem to have. And as we put more and more infrastructure and reports on this server, syncing files became quite difficult. Pushing from our laptops to our Git repository was straightforward, but then we had to build out a whole process on the Windows server to sync with Git, and inevitably there were fire drills where people were hotfixing on the server itself, so you had to sync in the other direction as well, which is quite a pain to manage.
And so we were wondering: there's got to be a better way to do this, right? That's when we realized we had reinvented the wheel, only our wheel was a little more squarish than we would have liked, whereas software like Posit Connect just works, plug and play, and you get high-quality, enterprise-grade software. And part of the reason I'm giving this talk is that even in 2025, with AI, the very first search result for automating R scripts is still Windows Task Scheduler, so this is my contribution to fighting back against its monopoly on our automation.
Why the team loves Posit Connect
So why does my team love Posit Connect? The first thing we like is that scheduling and monitoring come out of the box; we don't have to worry about that. So no more waking up early and running reports by hand, and more importantly, we can have proactive alerting. We built an automation status report on top of the Posit Connect API, which every single day summarizes our entire catalog of reports and flags which ones failed, and there's even conditional logic to alert specific people if certain reports fail.
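The conditional alerting at the end of that status report can be sketched in a few lines. This assumes the per-report statuses have already been fetched (for example, via the Posit Connect server API); the report names, owner mapping, and default team below are purely illustrative:

```python
def failed_reports(statuses):
    """Return the names of reports whose last run failed."""
    return [name for name, ok in statuses.items() if not ok]

def alerts(statuses, owners, default="data-science-team"):
    """Map each failed report to the person (or team) to notify.

    Reports without an explicit owner fall back to the default team.
    """
    return {name: owners.get(name, default) for name in failed_reports(statuses)}

# Example: one of three nightly reports failed, and it has a named owner.
statuses = {"donor-readiness": True, "storefront": False, "match-rates": True}
owners = {"storefront": "adam"}
print(alerts(statuses, owners))  # {'storefront': 'adam'}
```

The same pattern scales to "page this channel if any finance report fails" by keying the owner map on groups of reports instead of individual ones.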
Secondly, and I think this is my favorite aspect, we don't have to change how we work. We can develop in our favorite IDE of choice, whether that's RStudio, VS Code, or Positron, as long as we all agree to push back to our Git repository every so often. And we get to use our favorite code-first tools: R and Python, Shiny applications, Jupyter Notebooks, Quarto, just to name a few.
And to really appreciate why this is such a good benefit to have, I think it's insightful to compare alternative approaches to deploying something into production. A very common framework is Docker containers, which package all your code and your application into one container that, in theory, you can ship anywhere and run anywhere. But in practice there's a big learning curve. If you understand everything in this Dockerfile, props to you. But again, we're trying to make the friction as low as possible from local development to insights in production, and with Docker in particular it's hard to inspect inside the container unless you really know how to probe around in there; it can be quite a learning curve.
A second approach is to try some cloud infrastructure, like serverless AWS Lambda functions. Again, they have their time and place, but when you're talking about reducing the friction from local code to production code, Lambda functions have to be structured in a particular way. They all have a handler function that takes two arguments, event and context, and those are runtime-dependent: wherever the function runs in production, those events come in from the runtime, so it's a little hard to mock up your local environment to have exactly those events and context. Of course you can do it, but then you're increasing the differences between your development code locally and your production code deployed to AWS.
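That handler shape looks roughly like this (a minimal sketch; the event fields are whatever the triggering service sends, which is exactly the local-mocking problem described above):

```python
def handler(event, context):
    """AWS Lambda-style entry point.

    In production, `event` and `context` are supplied by the AWS runtime,
    and the event's structure depends on the trigger (S3, API Gateway, ...).
    The `record_id` field here is illustrative.
    """
    record_id = event.get("record_id")
    return {"statusCode": 200, "body": f"scored record {record_id}"}

# Locally you end up fabricating the runtime inputs by hand:
fake_event = {"record_id": 42}
print(handler(fake_event, context=None))
```

That fabrication step is the gap between development and production that the talk is arguing against.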
And a third benefit of Posit Connect is something we didn't ask for, and didn't realize we really enjoyed, until we saw it: it gives you an entire user interface to show your reports. So far we've been thinking entirely in terms of scheduling and automating compute somewhere off my laptop. But it turns out it's really nice when you can automate that compute and, at the same time, show a really high-quality report. In this case, we have a screenshot of our data science storefront, a Quarto document that acts as a kind of meta-report highlighting all the other content we want to feature, and we can build a whole bunch of dashboards and Shiny applications for people to use right within the same tool.
Architecture and the data science lifecycle
So those are some reasons we really enjoy Posit Connect, but how does it fit into our architecture, or more generally the data science lifecycle? This is probably not a surprise, but it almost always starts with an idea or a problem you're trying to solve. For us, that was trying to predict the likelihood that a donor is willing and able to donate; we call that the donor readiness score. And once we have that idea, we do what we do best as data scientists: exploratory data analysis locally, we train a model, we run some validation to make sure the model is performing as we would expect, and we do it all in our favorite IDEs of choice.
But once we get to the step where the code looks like "take our model, take some new data, calculate some new predictions," that's when you want to start thinking about deploying to production, so you don't have to run this four times a day, every single day, for a long time. And that's where Posit Connect fits into our architecture. It's hooked up to our GitLab repository through a Git integration, a one-time setup that gives the Posit Connect server read access to everything we push to Git. So things like our data science storefront, which is a Quarto document in our GitLab repository, are accessible to Posit Connect, along with the usual open-source packages.
And the one change you have to make to your workflow with Posit Connect, unlike Docker or Lambdas, is to create a single manifest.json file. You can do that automatically with one command, depending on whether you're using Python or R (`rsconnect write-manifest` from the rsconnect-python CLI, or `rsconnect::writeManifest()` in R). The manifest is basically an instruction manual for the server: which version of Python or R you're using, which packages, and their versions, so it can set up that environment. And once you have that manifest.json file, you go to the Posit Connect UI, find the file or the folder the manifest is in, click the blue button to deploy your content, and then you have your awesome report that you can link to and share right away.
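For reference, a generated manifest looks roughly like this (abridged and illustrative; the exact fields vary by content type and rsconnect version):

```json
{
  "version": 1,
  "metadata": {
    "appmode": "jupyter-static",
    "entrypoint": "report.ipynb"
  },
  "python": {
    "version": "3.11.4",
    "package_manager": {
      "name": "pip",
      "package_file": "requirements.txt"
    }
  },
  "files": {
    "report.ipynb": { "checksum": "..." }
  }
}
```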
And it's really helpful for maintenance, too, because Posit Connect automatically checks all your content with a manifest every 15 minutes to see if there are updates, so we don't have to implement any extra CI/CD on our end to make sure the changes we push get reflected in the latest version of the report. That's all done under the hood; all that complexity is abstracted away.
Connecting to the database
So the next critical piece of our architecture is our database, because you can work with CSVs locally, and you can work with pins for a time, but eventually you want to connect to your database so you have a live connection in production. So let's zoom in there a little bit. Connecting the database to Posit Connect is, again, a one-time setup, and the included Posit professional drivers support most common databases, around 15 the last time I checked. Once you have this two-way connection, where you can read data from Snowflake and also write data back to Snowflake, you can do things like score millions of records per day and schedule that to run as often as you need.
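The shape of that scheduled scoring job is simple. Here's a minimal sketch with the warehouse I/O replaced by in-memory lists (the real version reads from and writes back to Snowflake; the toy model and field names are hypothetical):

```python
def score(record, model_weight=0.8, model_bias=0.1):
    """Toy stand-in for the donor readiness model."""
    return model_weight * record["feature"] + model_bias

def score_batch(records):
    """Read-score-write loop.

    In production the records come from a warehouse query and the scores
    are written back to a warehouse table instead of returned in memory.
    """
    return [{"id": r["id"], "readiness": score(r)} for r in records]

# New data arriving since the last scheduled run:
new_data = [{"id": 1, "feature": 0.5}, {"id": 2, "feature": 0.9}]
print(score_batch(new_data))
```

Posit Connect's scheduler then runs this script as often as needed, with no handler signature or container packaging required.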
And one interesting pattern we'd recommend for your database of choice is to leverage the database's compute for the occasional job that doesn't fit into memory on your Posit Connect server. Databases are a great place to do that, because they deal with big data all the time and your data already lives there, so your security department will be really happy that there's one less transfer of data moving around. For us, that's Snowflake and its Snowpark functionality, so we can write Python user-defined functions that run natively in the warehouse. When we occasionally have a huge compute job, we just scale up the warehouse, use some Python, and there you go. For databases that don't support Python or Python-like languages, you can get more creative by converting your model objects into SQL; if you're interested in that, look into Orbital. There are creative ways to leverage your warehouse compute when your server isn't up to the job.
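To make the "convert your model into SQL" idea concrete, here is a toy translation of a fitted linear model into a SQL expression. This illustrates the general technique only, not Orbital's actual API; the table and column names are made up:

```python
def linear_model_to_sql(coefs, intercept, table="donors"):
    """Render a fitted linear model as a SQL SELECT so the warehouse
    can score rows itself, with no Python runtime involved."""
    terms = " + ".join(f"{w} * {col}" for col, w in coefs.items())
    return f"SELECT id, {terms} + {intercept} AS score FROM {table}"

# Hypothetical fitted coefficients for two features:
sql = linear_model_to_sql({"age": -0.02, "contact_recency": 0.5}, 0.3)
print(sql)
# SELECT id, -0.02 * age + 0.5 * contact_recency + 0.3 AS score FROM donors
```

Real tools handle much richer models (trees, pipelines, preprocessing), but the payoff is the same: the scoring runs where the data lives.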
Internal packages and connecting data holistically
Okay, so I think those are the key components of production workflows that aren't too different from your local development, and I think that's a great place to start. But once you have ten or more pieces of content on Posit Connect, you might start noticing that you're copy-pasting a lot of code; for example, connecting to your database is the same 20 lines of boilerplate every time. That's a good opportunity to think about internal packages. We personally use Posit Package Manager for two main reasons. One, it automatically builds new package versions whenever we push to Git, so again, we don't have to do any CI/CD to update package versions; that complexity is all abstracted away from us. And from a user perspective, it's also super nice to install internal, completely custom-made packages with the same commands you would use for any other package.
Okay. And there's one more piece of the diagram that's missing, and it's going to depend a little on your specific use cases and organization. When we talk about data and machine learning, our models are only as good as the data that comes in, and if no one has access to our models, they're not useful at all. So how do you connect data in a holistic way? The general advice I would give is to keep everything within one central database as much as possible.
So, for example, we have a lot of input data, such as from our donor registry, and a lot of other data sources, but we make sure that everything eventually flows into Snowflake, our central database, somehow. We also work closely with our engineering and integration teams to make sure there's a pathway for the data we send back to Snowflake to connect to downstream applications. For us, our physicians use an application called MatchSource that we built in-house, which is what they use to select donors for their patients. So we have a process where all the data we produce ends up in Snowflake and has a clear path into that application.
And so, with that, we're able to put all the pieces together into this architecture diagram that we use for data science, one that really focuses on the development experience of the data scientist, so we can take all the awesome insights we develop locally and push them into production through a really clean process, without compromising on features. We're able to generate useful insights that people can use, running not on our laptops but in an automated fashion. And with that, I'm happy to take a couple of questions. You can read more about the slides and about what we do at these links, and I'm always happy to connect and chat more. So, thank you.
Q&A
All right. Thank you, Adam. We have time for a few questions. So, the first question for you: could you elaborate more on Git-backed pushing to Posit Connect? Does the report or app update on commit? Yes, the short answer is yes. It updates on commit to the branch within 15 minutes. You can think of Posit Connect as running a check every 15 minutes to see whether there have been changes to any piece of content on the server. If it picks up any new changes, it folds those in and automatically deploys a new version without you having to go in and click buttons yourself. If you want it to be immediate, you can go in and click a button to do it faster, but otherwise we generally have a workflow of push to Git, review the changes, merge, and once they're merged into the main branch, we know the content will update in Posit Connect.
Okay. Another question: what features are missing from Posit Connect? So, I have one nitpicky one that comes to mind. When you schedule reports on Posit Connect, the common cases are covered: every day, every month, every 15th of the month. But sometimes we get odd requests, like "can you run this on the 19th business day of the month," or at a handful of times per day that don't fall on a nice periodic interval. There are ways to work around that with a parameterized report, but it would be nice if you could just point and click and say, this is the schedule I want. But that's a nitpick. Anything broader, I'll have to think about; connect with me and we'll talk more. Okay. That's all the questions. Thank you so much. Adam, another round of applause, please.
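The workaround mentioned above amounts to scheduling the report daily and letting the report itself decide whether today matches the odd schedule. A minimal sketch of an "Nth business day of the month" check (holidays ignored for simplicity):

```python
import datetime

def is_nth_business_day(day: datetime.date, n: int) -> bool:
    """True if `day` is the n-th weekday (Mon-Fri) of its month.

    Holidays are ignored; a real version would subtract a holiday calendar.
    """
    count = sum(
        1
        for d in range(1, day.day + 1)
        if datetime.date(day.year, day.month, d).weekday() < 5
    )
    return day.weekday() < 5 and count == n

# A daily scheduled report can simply exit early unless today qualifies:
print(is_nth_business_day(datetime.date(2025, 9, 25), 19))  # True
```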
