Advancing Epigenetic Predictors with Scalable Machine Learning (Varun Dwaraka) | posit::conf(2025)

Advancing Epigenetic Predictors with Scalable Machine Learning: A Biologist’s Perspective on Efficient Model Development

Speaker(s): Varun Dwaraka

Abstract: TruDiagnostic develops precision health tools using DNA methylation-based diagnostics. We integrate bioinformatics (R) and machine learning (Python) with Posit’s ecosystem in AWS to enable high-throughput model development. Posit enhances workflows by streamlining preprocessing, feature selection, and deep learning with PyTorch. Leveraging Posit with AWS parallelization and sharding accelerates model training, reducing computation from weeks to hours. This talk highlights how Posit is vital for advancing research in health predictors, driving innovation in precision medicine and revolutionizing healthcare.

posit::conf(2025)

Subscribe to posit::conf updates: https://posit.co/about/subscription-management/

Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Varun, and this is going to be very cathartic, because I think when I was creating these slides, I was thinking about where we were as a company to now, and so it might be a little bit of a vent session, if you will, but I also just want to start off by saying I love Posit for everything that you guys do, and use this as a use case for a sheep in the wolf's den of how we can actually be able to bring a lot of this scalable machine learning, especially as a biologist.

So TruDiagnostic is the company that I work for. I am here as a representative. I used to be the director of bioinformatics; now I've decided that research made more sense to me, and so I'm the director of research and principal investigator.

What TruDiagnostic does

Okay, you might be wondering what we do. So at TruDiagnostic, what we try to do, and have been doing for the last couple of years, is leverage epigenetics to assess health. What we're quantifying is essentially just a blood test, which you can put onto a blood spot card, and then we run it through these Illumina methylation arrays. What we're trying to do is quantify the methylation signatures of the human body.

Through that, we do bioinformatics, which obviously to this crew, I don't think I need to explain much more about, but this is where the machine learning aspect comes into play. We're taking a lot of the data, which is essentially in a matrix format, with samples across the columns, and then the 850,000 to about a million CpG sites as rows. So you can just think of it as essentially a CSV file with a beta matrix.
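As an illustration of the layout described here, the following is a minimal sketch, not the speaker's actual pipeline. It assumes a tiny CSV with made-up CpG and sample IDs, reads it with the Python standard library, and transposes it into the samples-by-features shape that most machine learning code expects. The function names `read_beta_matrix` and `to_samples_by_features` are hypothetical.

```python
import csv
import io

# Toy beta matrix: rows are CpG sites, columns are samples.
# All IDs and values are made up for illustration.
BETA_CSV = """cpg_id,sample_A,sample_B,sample_C
cg_toy_0001,0.91,0.88,0.93
cg_toy_0002,0.12,0.15,0.10
"""


def read_beta_matrix(text):
    """Parse a CpG-by-sample CSV into (cpg_ids, sample_ids, values)."""
    rows = list(csv.reader(io.StringIO(text)))
    sample_ids = rows[0][1:]                 # header row, minus the ID column
    cpg_ids = [row[0] for row in rows[1:]]   # first column of each data row
    values = [[float(v) for v in row[1:]] for row in rows[1:]]
    return cpg_ids, sample_ids, values


def to_samples_by_features(values):
    """Transpose CpG-by-sample rows into one feature vector per sample."""
    return [list(column) for column in zip(*values)]


cpg_ids, sample_ids, betas = read_beta_matrix(BETA_CSV)
X = to_samples_by_features(betas)  # X[i] holds every CpG value for sample i
```

A real array has roughly a million rows, so in practice a dedicated data-frame or array library would replace the plain lists used here.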

From that, we have two different wings. Commercially, we can send this out to people. If you want to buy our test, you can actually just buy it. It's very much like a 23andMe style, a direct-to-consumer perspective, but where I've kind of fallen in love with, and I think what really sets us apart, is the research. We are actively publishing in this area. We're trying to see how much these epigenetic tests can quantify and what they cannot quantify.

So if you're kind of wondering how this kind of works, you can think of it from the 23andMe model, where you're essentially doing a genetic test. You're looking at the different single-nucleotide polymorphisms, or SNPs, and seeing how that makes up for you.

It's a great model, and roses to them, even though RIP, but people that do the 23andMe, they really need one test per lifetime for that genetic information. The reason why is because your genetics aren't really changing, nor should they. And so there are few interventions that really change those genetic risks. It's great to understand a baseline, but what do you do with that, right? If you're screwed, you're screwed, kind of, from there.

And you're inheriting a lot of this from your parents. And so when we thought about this, it's like, okay, is this an actual business model that goes forward? 23andMe did a great job with it, but there's an entire area of epigenetics that really hasn't been commercialized. So for that reason, we actually went into epigenetic testing.

The reason why this works is because of the science that goes into it. The methylation markers are always changing. And this next slide is going to explain why. When you think about epigenetic testing, the epigenome, or the specific flags on the DNA (that's what we're looking at right now, DNA methylation), is actually changing based on all different types of environmental cues that are caused by changes in your nutrition, stress, toxicants, pathogens, and other different factors. When all of these things are changing, the epigenome changes, and then that regulates genes, phenotypic changes, and then also is causing a lot of the fluctuations in the expression of genes, and so on and so forth.

And so for us, at TruDiagnostic, we've decided to go that route and complement what 23andMe and many other genetic testing companies have really looked into. And so this leads to a variety of predictors that you can estimate from a single blood test. So again, when you're thinking about epigenetic testing, we are not looking at just the genome, per se. We're looking at the flags on the genome, these DNA methylation marks that are physically on specific cytosine sites of the genome.

And so given the environment and exposure effects on DNA methylation, you can actually calculate a wide variety of health measures that are directly based on biology you can control because of that environmental aspect. And for example, on the right side, as you're looking at this, that's actually a test that we can provide for you. Not only can we give you something like your biological age, but we could break it down into what is the imputed, let's say, or supposed biological age for different organs: kidneys, brain, heart, et cetera.

And so this is really useful because this is something that you really can't calculate. I know Whoop came out with biological age. Every other company has come out with a biological age. But a lot of that is based on static information versus something like an epigenetic test, which will give you a little bit more of those malleable changes. And so it's really changes that you can actually impact versus something that just says, hey, you're probably going to die in the next two years.

Thank you. And so with this in mind, we've brought a lot of products to commercial, health care, and academic markets. And the reason why bioinformatics is really helpful is because a lot of this is just based on machine learning. We've captured data from more than 50,000 individuals. We have entire epigenomes for these individuals. And that really comes back to developing a lot of these markets. So this includes drug development, companion diagnostics. We're actually working with Eli Lilly on that one. Disease diagnostics, omic research, and so on and so forth.

Media appearances and research

And you don't have to just hear from me. I want to introduce somebody that has a wealth of knowledge. "This is the best test I've ever taken in my fucking life. This was so cool." We were on the Kardashians, by the way. But I love this. "This is the best test I've ever taken in my fucking life. This was so cool." Thanks, Khloé.

So I don't really need to say much else. So this has been featured on the Kardashians. We've been able to work with HBO on a documentary for Natalia Grace, actually figuring out what her chronological age was. And then on Netflix, we were pretty much looking at what diet impact has on different interventions and stuff. So if you ever wondered how I look with a goatee, check that out. It's not great. I'll be very honest. Even my mom was like, what are you thinking?

But where I think, for me, my heart is really set on is actually the research. With a lot of the data that we've been able to gather, we've gathered a lot of data. We've been able to utilize this and collaborate with a lot of academic partners to really publish a lot of these studies, specifically 30 or more in the last four years. We have plenty of university collaborators. And then that three plus is actually supposed to be patents. But the reason why I'm saying all of this is because it really cannot be possible without Posit. Posit Workbench has been such a godsend for us.

How Posit Workbench fits into the workflow

So when we started as a small startup, we had one data scientist, me. And this was kind of the everyday life for us. I was using R. Shout out to the next speaker. R was one of those things where, as a biologist, everything is just much more packaged in Bioconductor. And for us, it made sense for one person.

How this kind of worked, though, when we were implementing a lot of this was I came from the land of HPCs. And so I was using a lot of the AWS systems as an HPC, which I wouldn't recommend. And so it would be very serial. We were using schedulers, fixed compute nodes. Of course, you can use EC2 in a way to increase or decrease the type of RAM that you need or compute. But it was just a hassle.

And then I started getting more people into the department. A lot of very smart technical people, probably smarter than I am. But they all brought with them a wealth of their own knowledge. And so as the team grew, we needed more scaling in different languages as well. The majority of our team codes in R. But then especially with ML and the next generation of all of our ML products, we needed to use something like PyTorch. And so Safe, I just want to give a shout out to Safe: SageMaker, he was the one that brought that in for MLOps.

And so we turned to our engineering team. There was no real way to expand this. And so this conversation actually happened. I'm not going to go through it, but hopefully you find it funny because I thought it was hilarious. But the problem really was the customizable solutions, we had ways to utilize EC2 or S3 in a manner that was able to scale up. But it needed somebody from the engineering team to kind of manage. And so with that said, we were trying to figure out what is that solution? How can we kind of take not that much time from the engineering team, but then have the bioinformatics team be able to scale up and scale down as needed? That's where Posit Workbench came into play.

From the R side, we're still in R. We have to use a lot of those bioinformatics workflows because for us as a company, we need to be able to utilize published and validated workflows that are out there. So this includes things like, for example, for DNA methylation, SeSAMe, minfi, all of these different packages that are already there, tuned, and kept well-organized. This includes the quantification of the beta values, essentially the percentages of methylation at those 1 million sites, and the epigenetic clocks, which actually go to our customers and also to our research individuals, the academic partners.
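To make "beta values" and "epigenetic clocks" concrete, here is a small Python sketch under stated assumptions. It uses the common definition of a beta value as methylated signal over total signal plus a stabilising offset (100 is a frequently cited choice for Illumina arrays, treated here as an assumption), and it models a clock as a weighted sum of beta values, which is the general form of many published clocks. The talk does not say which form TruDiagnostic's clocks take, and the CpG IDs, weights, and intercept below are invented for illustration.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta = M / (M + U + offset): the fraction of signal that is methylated."""
    return methylated / (methylated + unmethylated + offset)


def linear_clock(betas, weights, intercept):
    """Toy clock: intercept plus a weighted sum of beta values."""
    return intercept + sum(weights[cpg] * betas[cpg] for cpg in weights)


# Made-up (methylated, unmethylated) probe intensities for two CpG sites.
intensities = {
    "cg_toy_0001": (9000.0, 900.0),
    "cg_toy_0002": (500.0, 9400.0),
}
betas = {cpg: beta_value(m, u) for cpg, (m, u) in intensities.items()}

# Hypothetical coefficients; a real clock learns these from training data.
toy_weights = {"cg_toy_0001": 10.0, "cg_toy_0002": -20.0}
toy_intercept = 40.0

estimated_age = linear_clock(betas, toy_weights, toy_intercept)
```

In the workflow described in the talk, this step is handled by validated R packages rather than hand-written code; the sketch only shows what the numbers represent.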

But then we needed to figure out a way to include Python so that we could use the MLOps that are there because parallel processing was huge for us. We're not just training one or two models. We were literally training 5,000 different models serially the first time. And that took months, if not. And I'll explain that use case in a bit. So we needed efficient, scalable tooling, and Posit really was the answer, especially with Posit Workbench.
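The move from serial to parallel training can be sketched as follows. This is an assumption-laden illustration, not the team's code: each target gets its own independent model (here a one-predictor least-squares fit on toy data), the list of targets is split into shards, and the shards are handed to a worker pool. A thread pool keeps the example self-contained; for CPU-bound training a process pool or separate cloud instances would be the more realistic choice. The names `make_shards`, `train_shard`, and `train_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def make_shards(items, n_shards):
    """Split items into n_shards interleaved, roughly equal groups."""
    return [items[i::n_shards] for i in range(n_shards)]


def train_shard(shard, features, targets):
    """Fit one independent model per target name in this shard."""
    return {name: fit_line(features, targets[name]) for name in shard}


def train_all(features, targets, n_shards=4):
    """Train every target's model, one shard per worker."""
    shards = make_shards(sorted(targets), n_shards)
    models = {}
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        results = pool.map(lambda s: train_shard(s, features, targets), shards)
        for shard_models in results:
            models.update(shard_models)
    return models


# Toy data: one predictor and ten targets that are exact lines in it.
features = [0.1, 0.2, 0.4, 0.8]
targets = {f"target_{k}": [k + 2.0 * x for x in features] for k in range(10)}
models = train_all(features, targets)
```

Because the models do not depend on each other, the shards can run in any order and on any number of workers, which is what makes this kind of workload straightforward to scale out.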

And so how it fits within our current process is essentially we have an Amazon EFS server. So we have our own dedicated lab at Lexington, Kentucky. Shout out to any Kentuckians that are here. We have all of that data being generated from the lab and it automatically goes up into an S3 bucket. And so for us, we essentially just use our credentials for AWS. And then we essentially create an EC2 within the Posit system. And then we could just essentially have different RStudio IDE or Positron sessions. Well, RStudio for now, Positron in the future. Get those sessions up.

And this is really great for us because as the team grew, we have Docker, GitHub, Jupyter, especially for Safe. But Natalia, Laura, Kirsten, me, we're all in RStudio. So it really has everything unified.

I think you guys already know, but for the biologists or people that are not using Posit Workbench, it's super easy to start. This is how we're using it. So we're using it within SageMaker. It's super easy. I'm not even going to go through this too much, but let me know if you want more details here.

But it's also a really good method about just scaling those EC2s. Before, we had to work with our engineers to just have, let's say, a large EC2 instance started. Now we could just, literally for us, we could just start it ourselves. So something like this, I already have a few of them at the bottom, like four of them. So just for me to get started, the test, that was super easy. And it really provided familiar tooling on the cloud with the flexibility between R, Python, and Linux.

The reason why that was huge is because for us, even though we do most of our tooling in R, many of the bioinformatics workflows that we have actually exist as packages that you need Linux to start. And so just like any other IDE, having this available was super helpful for us. And it's on the cloud. So for example, now I can go around with my MacBook Air, which doesn't really have a lot of memory, but then be able to work everything there as well.

Three use cases

So I'm going to go through three use cases. The first use case, how this has really impacted us, is expediting clinical trials and publications. Having close to 30-plus or so publications, it's a lot. But the reason why we were able to get there is because of this workflow. So we essentially are able to put all of our pipelines onto RStudio, be able to run it pretty much as just a general run. And then essentially, we use R Markdown, where it's kind of shown there how we can deliver a lot of this to our academic partners. From those HTMLs that they have, they can literally choose, OK, these are interesting that we could publish on, create the figures, and then essentially it goes back into just publishing. So those are actually three papers that we got, I think, the last few months or so.

Use case number two, it saved us almost a year of product development. And the reason why is because a lot of our methodologies, I talked only about biological age, but we can actually use DNA methylation to look at essentially your lab values, CRP, HbA1c, a whole slew of metabolites, a whole slew of proteins that are estimated from DNA methylation. For those of you that are interested, please talk to me after. I'd be happy to explain how that's working.

But essentially, if you think about it from an ML perspective, you are training an individual model for each and every one of those metabolites, proteins, and clinical values. And so for that, we actually, this was part of a preprint that we published earlier, and now we're submitting to Nature Medicine. But essentially, it's about 1,700 models that we were able to whittle down from a total of 5,000 after validation, clinical validation. And now we're able to sell that to the consumer with almost a $90, for $90, essentially. So you're able to get all of this.

Now, we're not Theranos. I swear to God. The reason why, again, we're not Theranos is because that academic publication is where we really want to make sure that what can we say and what cannot we say. And that's where it really allows us for, especially for use case number one, where that's been very helpful there. And so we can offer these as reports, essentially.

And use case number three is, within the organization, we have a lot of, let's just say, non-engineers and non-bioinformaticians who are part of the company. But we need to empower them to be able to essentially show and sell these products. Because again, because of the state of biomedical research, it's really hard to educate the masses on this. And so for them, they need to have a quick way of showing the associations to disease, how you could use this, what is the information that you could really leverage, especially with this new science. So that's been really helpful. So a shout out to Laura on our team, who's actually developed a Shiny tool for data visualization. Essentially taking all of our data from the over 50,000 individuals who have taken our tests and be able to show that to the rest of the, when explaining our science.

What we're excited about

So what are we excited about? I've got to really shout out Nick, Posit Marketing, and Tom, Posit Engineering, and the two engineers I talked to. Databot and Posit Workbench, I'm super excited about. The reason why this is really important is because on the left side, these are fixed HTMLs that need a lot of human interaction to create, a.k.a., they don't really let me out much. I'm usually in the basement coding away. But instead of doing that, why not use something like Anthropic's Claude Code and also just try to use some AI tools to kind of glean a little bit more from the data. So that's what we're really excited about.

That's already sent a couple of messages to our bioinformatics team and engineering team for that. And so, yeah, I think with that said, thank you, especially to everyone, especially the Posit team, for everything you guys do. You've really made our lives a bunch easier. And for anyone else who's interested in the science aspect, we'd love to collaborate. Especially right now, we're looking for ways to create R packages that we can give out to the rest of the community, mainly because this is the Wild West, especially for epigenetics. So if you have any interest in that or helping us out, my LinkedIn and email, I use my email like I'm texting. So yeah, I'm always on there. So thank you, and happy to take questions.

Q&A

All right, thank you, Varun. Great presentation. Yeah, thanks. So we have time for maybe one or two questions. The first question, I don't understand. It is, DNA methylation is highly tissue dependent. How does the blood test correlate with the tissue test for assessing tissue age?

Perfect. Yeah, that's a great question. So yeah, it is very much tissue dependent. The way that we're looking at it is a blood test, obviously. And blood is heterogeneous in the cell mixtures. The way that we've been able to leverage kidney health or, let's say, brain health and all of that is, if you think about blood as a systemic organ, it's capturing a lot of the signals that are part of many of your organs within your body. And so what we're doing is using machine learning, essentially, as a way to pick out those signals. And it's very much a signal-to-noise ratio.

So there's a paper that came out, actually, in Nature Aging about two days ago called Systems Age from our collaborators at Yale. And it actually goes through exactly how we're able to deconvolute from the blood down to individual organs. Amazing stuff. Cool, yeah. All right. Thank you. That's all the time we have. Yeah, thank you. Round of applause for Varun.