
Precision Medicine for All: Using Tidymodels to Validate PRS in Brazil (Flávia Rius) | posit::conf
Precision Medicine for All: Using Tidymodels to Validate Breast Cancer PRS in Brazil
Speaker(s): Flávia E. Rius
Abstract: Polygenic risk scores (PRS) are a powerful way to measure someone’s risk for common diseases, such as diabetes, cardiovascular disease, and cancer. However, most PRS are developed using data from European populations, making it challenging to generalize results to other ancestries. In this talk, I’ll show how I used tidymodels tools—like yardstick, recipes, and workflows—to calculate metrics and validate a breast cancer PRS in the highly admixed Brazilian population. You will learn how to leverage tidymodels to make precision medicine more inclusive.
posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
Can you hear me well? Okay, my name is Flávia. I'm a data scientist at Mendelics, a company based in São Paulo, Brazil, that diagnoses genetic diseases, and I'm here today to talk to you about how I used tidymodels to validate a breast cancer PRS in Brazil.
So breast cancer is the most common type of cancer for women in 85% of all countries as you can see in this map represented in pink. You can see that both Brazil and the US are among them.
The main risk factors for breast cancer are sex, age, lifestyle factors such as alcohol consumption and sedentarism, hormonal factors such as contraceptive pill usage, family history and genetics.
The genetics of breast cancer became widely known after Angelina Jolie revealed that she carried a BRCA1 mutation and decided to undergo a preventive double mastectomy.
BRCA1 is an example of a gene in which mutations can dramatically increase the risk of breast cancer up to about 80% over a woman's lifetime. That's roughly six times the general population risk.
Besides high-risk mutations such as in the BRCA1 gene, there are also moderate risk mutations, which increase the risk by about two to threefold and they both occur in single genes.
Apart from those, there are also other mutations, which we'll call genetic variants, that have very small effects individually but can be added together to compose a polygenic risk score, or PRS. These variants are spread all over our genome.
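The idea of adding small effects together can be sketched in R. This is a toy illustration, not the speaker's code: the variant names, weights, and genotypes are made up, and it assumes the common convention that a PRS is the sum of each variant's effect size multiplied by the number of risk alleles a person carries (0, 1, or 2).

```r
# Hypothetical per-variant effect sizes (for example, log odds ratios from a GWAS).
weights <- c(rs_a = 0.05, rs_b = -0.02, rs_c = 0.10)

# Hypothetical genotypes: risk-allele counts (0, 1, or 2) per person and variant.
genotypes <- matrix(
  c(0, 1, 2,
    2, 0, 1),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("person_1", "person_2"), names(weights))
)

# The score is the weighted sum of allele counts across all variants.
prs <- drop(genotypes %*% weights)
prs
```

Each individual weight is tiny, which is why real scores add up hundreds of variants.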
Understanding polygenic risk scores
So the polygenic risk score follows a normal distribution in which the majority of us will have an intermediate score, which means the same risk as the general population, but some of us will have a high score from carrying a lot of these small-effect variants, and so will have an increased risk for breast cancer due to the polygenic risk score.
These variants with small effects are discovered through very large studies with hundreds of thousands to millions of people. These studies are called genome-wide association studies, or GWAS. They are generally conducted with individuals of a single genetic ancestry because that facilitates the discovery of these variants.
As you can see in this plot, 90% of these studies so far have been conducted with individuals of European ancestry. The big problem with that is that not all PRSs discovered in European populations can be generalized to other ancestries, especially because there are differences in the frequency and effect of variants among ancestries.
Brazil's diverse population
Brazil has one of the most diverse populations in the world. 75% of our individuals have at least two continental ancestries, mainly European, African, Native American and East Asian. Since building our own PRS from scratch was not on the radar, we decided to test whether some European-developed PRSs could be applied to our population.
Once I had gathered all my data, all my case and control data, I started to have a lot of questions and uncertainties, and the feeling reminded me of a very specific situation I went through in the spring of 2017, when I moved to New York. Let me tell you a story about it.
So during my PhD I got a scholarship to spend one year in a lab at Weill Cornell Medicine to do part of my project there. I was very excited; it was my first time in New York. Once I got there and settled into the apartment with my roommates, I noticed that I needed to buy some things like laundry detergent, shampoo and yogurt.
So I made a list with 20 items and went to the grocery store. When I got to the refrigerated section to buy my yogurt, I simply got stuck. There were so many options that I could not decide which one to pick.
The cheapest maybe? Oh, the list of ingredients is too long. I don't know. What about the most expensive one? It must be good. Too expensive. I don't know if it's worth it. Which brand is this? Should I buy Greek or regular yogurt? What is skyr?
So, oh my god, I need some help. So that day I understood the meaning of the word overwhelmed. I was so overwhelmed that I left the grocery store with only half of my list checked off and a headache.
So, as many of you here are from the U.S., I believe you might never have experienced the feeling of being overwhelmed by going to the grocery store, but I'm sure you have experienced it doing a data science project, right? You have to pick so many things, and it was no different for me in my PRS project.
So first I had to pick where to get the PRSs to test: should I get them from papers, from a PRS database, from a GWAS database? What about the method of analysis: should I trust the Nature paper or the Science paper? Should I use metric X or Y, statistic K or Z?
Operationally, what about the IDE? Should I stick with the one I know or try the newest, fastest one available? Should I try the new tool they've shared at posit::conf or stick with the ones I know? So as these and many more questions were going through my head, I was overwhelmed in my project.
But once I got past all that and reached the final part of my project, where I needed to answer the question "Can European-based PRSs be used in Brazil to identify high breast cancer risk?", I knew that tidymodels was the solution. It would not be overwhelming at all.
Why tidymodels
So tidymodels is a set of tools developed by Posit engineers to do all operations related to machine learning, from pre-processing to modeling and gathering results. Its tools are integrated, meaning that the tools from its multiple packages talk to each other in a pipeable way.
It is compatible with the tidyverse, so you can use your favorite packages such as dplyr, ggplot2, tidyr and so on. And it has what I call a concept of reusable building blocks, where each tool is a part of the process. It works like a musical orchestra where each tool is an instrument, and if you want to change your symphony, that is, the result of your models, you can change something like a chord on one instrument.
Step-by-step workflow
Let me show you in a practical way, going through all of my steps. So first I want to give you a glance at my data. I read it with readr, and here you can see I have ID, status, sex, the PRSs and principal components 1 through 10, all raw.
Then I started with my first tidymodels tool, which is rsample to split my data easily into training and testing. Here I've used status as the strata because I wanted to balance cases and controls.
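A minimal sketch of a stratified split with rsample. The data are simulated and the column names (`id`, `status`, `prs_313`) and the 80/20 proportion are assumptions for illustration; the talk does not state the proportion used.

```r
library(tidymodels)

set.seed(2025)
n <- 400

# Simulated stand-in for the case/control table (hypothetical column names).
prs_data <- tibble(
  id      = sprintf("S%03d", seq_len(n)),
  status  = factor(rep(c("control", "case"), times = c(300, 100)),
                   levels = c("control", "case")),
  prs_313 = rnorm(n)
)

# strata = status keeps the case/control ratio similar in both partitions.
data_split <- initial_split(prs_data, prop = 0.8, strata = status)
train_data <- training(data_split)
test_data  <- testing(data_split)

count(train_data, status)
count(test_data, status)
```

Without `strata`, a random split of a small or unbalanced sample could leave one partition with noticeably fewer cases than the other.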
Then, prior to the next step, I normalized the PRS values by the mean and standard deviation of controls only, because I didn't want them to be biased by the selected patients.
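One way this control-based standardization could look with dplyr. The data and column names (`prs_raw`, `prs_z`) are hypothetical, and the talk does not say whether the reference controls came from the full data or only the training partition, so this sketch simply uses every control in the table.

```r
library(dplyr)

set.seed(2025)

# Simulated stand-in data; cases are given a slightly higher raw score.
prs_data <- tibble(
  status  = rep(c("control", "case"), times = c(300, 100)),
  prs_raw = rnorm(400, mean = ifelse(status == "case", 1.3, 1.0), sd = 0.5)
)

# Reference mean and standard deviation come from controls only.
control_stats <- prs_data |>
  filter(status == "control") |>
  summarise(mean = mean(prs_raw), sd = sd(prs_raw))

# Every individual, case or control, is scaled against that reference.
prs_data <- prs_data |>
  mutate(prs_z = (prs_raw - control_stats$mean) / control_stats$sd)

prs_data |>
  group_by(status) |>
  summarise(mean_z = mean(prs_z), sd_z = sd(prs_z))
```

After this step a value of 1 means "one standard deviation above the average control", which is what makes an odds ratio per standard deviation interpretable.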
Then with recipes I could easily set a formula for my model; here I'm setting status as the outcome variable. Then with step_rm I could remove all of the predictors I did not want to use, and I could easily center and scale my principal components with step_normalize.
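A sketch of such a recipe on simulated data. The column names and the choice of which columns to drop are assumptions; only `recipe()`, `step_rm()`, and `step_normalize()` are named in the talk.

```r
library(tidymodels)

set.seed(2025)
n <- 200

# Simulated training data with hypothetical column names.
train_data <- tibble(
  id      = sprintf("S%03d", seq_len(n)),
  status  = factor(rep(c("control", "case"), each = n / 2),
                   levels = c("control", "case")),
  sex     = "F",
  prs_313 = rnorm(n),
  prs_380 = rnorm(n),
  PC1     = rnorm(n, mean = 5, sd = 3),
  PC2     = rnorm(n, mean = -2, sd = 0.5)
)

prs_recipe <- recipe(status ~ ., data = train_data) |>  # status is the outcome
  step_rm(id, sex, prs_380) |>                           # drop unwanted columns
  step_normalize(starts_with("PC"))                      # center and scale the PCs

# prep() estimates the PC means and SDs; bake() returns the processed data.
baked <- prs_recipe |> prep() |> bake(new_data = NULL)
glimpse(baked)
```

Testing a different score would then only require changing which PRS column `step_rm()` removes.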
Then I set my model with parsnip. It's a simple logistic regression. I started with the logistic_reg function, then I set my engine to glm and the mode to classification, because I wanted to classify individuals as having the disease or not.
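The model specification described here is short enough to write out in full. The object name `logit_spec` is an arbitrary choice for this sketch.

```r
library(tidymodels)

# A specification only declares what to fit; no data are involved yet.
logit_spec <- logistic_reg() |>
  set_engine("glm") |>            # fitting is delegated to stats::glm()
  set_mode("classification")      # the outcome is a class: case or control

logit_spec
```

Swapping the engine later means changing one line and leaving the rest of the pipeline untouched.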
Then workflows served as the orchestra conductor where it joined the recipe and the engine and it could be easily piped to a fit function and fit my data to get the results.
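A minimal sketch of bundling a recipe and a model into a workflow and fitting it. The data are simulated, and the column and object names are illustrative assumptions.

```r
library(tidymodels)

set.seed(2025)
n <- 400

# Simulated training data; cases get a higher standardized PRS on average.
train_data <- tibble(
  status  = factor(rep(c("control", "case"), each = n / 2),
                   levels = c("control", "case")),
  prs_313 = rnorm(n, mean = ifelse(status == "case", 0.8, 0)),
  PC1     = rnorm(n),
  PC2     = rnorm(n)
)

prs_recipe <- recipe(status ~ ., data = train_data) |>
  step_normalize(starts_with("PC"))

logit_spec <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

# The workflow carries the preprocessing and the model together.
prs_wf <- workflow() |>
  add_recipe(prs_recipe) |>
  add_model(logit_spec)

# fit() preps the recipe on the data and then fits the model.
prs_fit <- fit(prs_wf, data = train_data)
prs_fit
```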
I could extract the statistics with the versatile tidy function from the broom package. Here I set exponentiate to TRUE to get the odds ratio, and I used some dplyr functions to keep only the PRS results. That's what I was interested in.
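A sketch of pulling the odds ratio out of a fitted workflow. The data are simulated and the names are assumptions; `conf.int = TRUE` is an addition not mentioned in the talk, included here because the results discussed later refer to confidence intervals.

```r
library(tidymodels)

set.seed(2025)
n <- 400

# glm models the probability of the second factor level,
# so "case" is listed second to get odds ratios for being a case.
train_data <- tibble(
  status  = factor(rep(c("control", "case"), each = n / 2),
                   levels = c("control", "case")),
  prs_313 = rnorm(n, mean = ifelse(status == "case", 0.8, 0)),
  PC1     = rnorm(n),
  PC2     = rnorm(n)
)

prs_fit <- workflow() |>
  add_recipe(recipe(status ~ ., data = train_data)) |>
  add_model(logistic_reg() |> set_engine("glm") |> set_mode("classification")) |>
  fit(data = train_data)

# exponentiate = TRUE turns log odds coefficients into odds ratios.
prs_or <- tidy(prs_fit, exponentiate = TRUE, conf.int = TRUE) |>
  filter(term == "prs_313") |>
  select(term, estimate, conf.low, conf.high, p.value)

prs_or
```

If the PRS column is standardized beforehand, `estimate` reads as the odds ratio per standard deviation of the score.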
Then I did a new recipe without the principal components to get the AUC metric for only PRS and I could easily update the recipe with update_recipe from workflows.
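A sketch of swapping the recipe inside an existing workflow with `update_recipe()`. Data and names are simulated assumptions; the point is that the model specification is reused unchanged.

```r
library(tidymodels)

set.seed(2025)
n <- 400

train_data <- tibble(
  status  = factor(rep(c("control", "case"), each = n / 2),
                   levels = c("control", "case")),
  prs_313 = rnorm(n, mean = ifelse(status == "case", 0.8, 0)),
  PC1     = rnorm(n),
  PC2     = rnorm(n)
)

# Original workflow: PRS plus principal components.
prs_wf <- workflow() |>
  add_recipe(recipe(status ~ ., data = train_data) |>
               step_normalize(starts_with("PC"))) |>
  add_model(logistic_reg() |> set_engine("glm") |> set_mode("classification"))

# New recipe with the PRS as the only predictor.
prs_only_recipe <- recipe(status ~ prs_313, data = train_data)

# Replace the recipe; the model part of the workflow stays as it was.
prs_only_wf  <- prs_wf |> update_recipe(prs_only_recipe)
prs_only_fit <- fit(prs_only_wf, data = train_data)

tidy(prs_only_fit)$term
```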
Then I predicted on the data and extracted the AUC metric with yardstick's roc_auc function. So that's it, this was my workflow: very easy, all tools integrated, no need to shape and reshape the data, all simple, no overwhelm at all.
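A sketch of the prediction and AUC step on simulated data. Names are assumptions, and evaluating on the held-out test partition is also an assumption, since the talk does not say which partition was scored.

```r
library(tidymodels)

set.seed(2025)
n <- 400

prs_data <- tibble(
  status  = factor(rep(c("control", "case"), each = n / 2),
                   levels = c("control", "case")),
  prs_313 = rnorm(n, mean = ifelse(status == "case", 0.8, 0))
)

data_split <- initial_split(prs_data, prop = 0.8, strata = status)

prs_fit <- workflow() |>
  add_recipe(recipe(status ~ prs_313, data = training(data_split))) |>
  add_model(logistic_reg() |> set_engine("glm") |> set_mode("classification")) |>
  fit(data = training(data_split))

# type = "prob" returns one probability column per class: .pred_control, .pred_case.
test_pred <- predict(prs_fit, new_data = testing(data_split), type = "prob") |>
  bind_cols(testing(data_split) |> select(status))

# "case" is the second factor level, so yardstick is told to treat it as the event.
auc <- roc_auc(test_pred, truth = status, .pred_case, event_level = "second")
auc
```

By default yardstick treats the first factor level as the event, so omitting `event_level = "second"` here would report the AUC for predicting controls from the case probability, which is one minus the intended value.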
Results
So to get my final results, I just repeated that with all the other PRSs. I split my data into PRS deciles in order to evaluate whether higher PRS values would identify a higher risk for the disease. And due to the ancestry heterogeneity of our sample, I split it into different groups to evaluate the PRSs across ancestry compositions, which I'm going to show you next.
So this is the first main result we had. You can see that there is a positive correlation between the odds ratio on the y-axis and the PRS deciles on the x-axis. This is an indication that higher values of PRS better predict the disease, and it shows us that the three PRSs here generalized well to our population.
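One possible way to get an odds ratio per PRS decile, sketched on simulated data. The talk does not say which group served as the reference, so using the lowest decile as the baseline is an assumption here, as are the column names.

```r
library(tidymodels)

set.seed(2025)
n <- 2000

prs_data <- tibble(
  status = factor(rep(c("control", "case"), each = n / 2),
                  levels = c("control", "case")),
  prs_z  = rnorm(n, mean = ifelse(status == "case", 0.5, 0))
) |>
  # ntile() assigns each individual to one of 10 equally sized bins.
  mutate(decile = factor(ntile(prs_z, 10)))

# Logistic regression with decile as a categorical predictor;
# the first factor level (decile 1) is the reference group.
decile_fit <- logistic_reg() |>
  set_engine("glm") |>
  fit(status ~ decile, data = prs_data)

decile_or <- tidy(decile_fit, exponentiate = TRUE) |>
  filter(term != "(Intercept)") |>
  select(term, estimate, p.value)

decile_or
```

Plotting `estimate` against the decile number gives the kind of rising curve described in the talk when the score carries signal.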
Then I did the split by ancestry composition. The first group was an East Asian majority, with over 50% East Asian ancestry; the second group a non-European majority, with less than 50% European ancestry; and the third group a European majority, with 50% or more European ancestry.
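The grouping rule can be sketched with `case_when()`. The ancestry proportions and column names below are invented, and treating the three groups as mutually exclusive, with the East Asian rule checked first, is an assumption; the talk does not spell that out.

```r
library(dplyr)

# Hypothetical ancestry proportions per individual (each row sums to 1).
ancestry <- tibble(
  id  = c("S1", "S2", "S3", "S4"),
  eur = c(0.80, 0.30, 0.15, 0.50),
  afr = c(0.10, 0.45, 0.05, 0.30),
  nat = c(0.08, 0.20, 0.05, 0.15),
  eas = c(0.02, 0.05, 0.75, 0.05)
)

# case_when() uses the first matching rule for each row.
grouped <- ancestry |>
  mutate(group = case_when(
    eas > 0.5  ~ "East Asian majority",    # over 50% East Asian
    eur >= 0.5 ~ "European majority",      # 50% or more European
    TRUE       ~ "Non-European majority"   # less than 50% European
  ))

grouped
```

The per-group odds ratios would then come from refitting the same workflow within each `group`.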
The overall results were good, very similar to the overall population in terms of odds ratios (these are odds ratios per standard deviation). But particularly for the East Asian group, one of the PRSs, the broad PRS, exhibited a wide confidence interval that included odds ratio values below one, and a p-value higher than our threshold, so we could not proceed with it as a valid PRS for our population. So we validated both the 380 and 313 PRSs (by the way, their names are the number of variants they have), with 380 exhibiting the best results.
Remember the moderate risk variants? So in this plot we compare some genes that are very important for breast cancer risk, and we could show that the top PRS decile has an odds ratio comparable to those of ATM and CHEK2, which are genes in which moderate risk variants reside. So this result is very important because it not only shows that the PRS is generalizable to our population, but also that it has clinical use.
So earlier this year we published a paper with those results, and we also developed a genetic test at Mendelics which includes the top 100 most important genes for hereditary cancer in general plus the breast cancer PRS. If a woman has a PRS value in the top decile, the report shows that she has a moderate risk for breast cancer.
Key takeaways
So the key takeaways of my talk are that data science projects are inherently overwhelming. We have so many things to pick, so many data formats. It's like going to the grocery store in a foreign country.
But if you have a machine learning part in your project, I recommend that you use tidymodels, because it is very well organized and all steps are covered. And of course the tools I've shown here are not everything; there is much more. So it worked fluidly, it was not an overwhelming part of my project, and it enabled the PRS validation in Brazil, which is a step forward in making precision medicine available to more people and more ancestries. I'm sure other ancestries also lack this access, and they can now do the same analysis I did.
Thank you very much for your attention. If you want to find out more about the breast cancer PRS validation project, you can scan this QR code, which will lead you to a GitHub repository with the code, slides and the link to the paper. Thank you very much.
Excellent. Let's log any questions on Slido. While we wait, I should tell you that the original name of the workflows package was orchestra, really, so I texted Davis that and he thinks it's a great analogy.
One question: did you have any pain points in the analysis process? Was there anything that was exceedingly difficult?
Well, actually, the whole bioinformatics part. For the principal component analysis I had to do everything kind of from scratch, because I was using exome data and there was no established process for doing this; I had to use an external database. It was the most complicated part, so not so much data science related, more bioinformatics related.
