Precision Medicine for All: Using Tidymodels to Validate PRS in Brazil (Flávia Rius)

Transcript

This transcript was generated automatically and may contain errors.

Can you hear me well? Okay, my name is Flávia. I'm a data scientist at Mendelics, a company based in São Paulo, Brazil, that diagnoses genetic diseases, and I'm here today to talk to you about how I used tidymodels to validate a breast cancer PRS in Brazil.

So breast cancer is the most common type of cancer among women in 85% of all countries, represented in pink on this map. You can see that both Brazil and the US are among them.

The main risk factors for breast cancer are sex; age; lifestyle factors, such as alcohol consumption and a sedentary lifestyle; hormonal factors, such as contraceptive pill usage; family history; and genetics.

The genetics of breast cancer became widely known after Angelina Jolie revealed that she carried a BRCA1 mutation and decided to undergo a preventive double mastectomy.

BRCA1 is an example of a gene in which mutations can dramatically increase the risk of breast cancer up to about 80% over a woman's lifetime. That's roughly six times the general population risk.

Besides high-risk mutations such as those in the BRCA1 gene, there are also moderate-risk mutations, which increase the risk about two- to threefold. Both kinds occur in single genes.

Apart from them, there are also other mutations, which we'll call genetic variants, that have very small effects individually but can be added together into what is called a polygenic risk score, or PRS. These variants are spread all over our genome.
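As a rough illustration of the "added together" idea (this is not the speaker's code): a PRS is commonly computed as a weighted sum of risk-allele counts, where each weight is an effect size estimated in a large association study. The variant names, weights, and genotype below are made up for the example.

```python
# Hypothetical effect sizes (log odds ratios) for three made-up variants.
# Real scores use many more variants, with weights taken from published studies.
effect_sizes = {"variant_a": 0.05, "variant_b": 0.02, "variant_c": -0.03}


def polygenic_risk_score(allele_counts, weights):
    """Weighted sum of risk-allele counts (0, 1 or 2 copies per variant)."""
    return sum(weight * allele_counts.get(variant, 0)
               for variant, weight in weights.items())


# One hypothetical person: two copies of variant_a, one of variant_b, none of variant_c.
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
score = polygenic_risk_score(person, effect_sizes)
print(round(score, 2))
```

Each variant contributes very little on its own; only the sum over many variants separates people into lower and higher scores.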

Understanding polygenic risk scores

So the polygenic risk score follows a normal distribution: the majority of us will have an intermediate score, which corresponds to the same risk as the general population, but some of us will have a high score because we carry a lot of these small-effect variants, and so we'll have an increased risk of breast cancer.

These variants with small effects are discovered through very large studies with hundreds of thousands to millions of people. These studies are called genome-wide association studies, or GWAS. They are generally conducted with individuals of a single genetic ancestry because that facilitates the discovery of these variants.

As you can see in this plot, the majority of these studies so far, 90% of them, have been conducted with individuals of European ancestry. The big problem with that is that not every PRS discovered in European populations can be generalized to other ancestries, mainly because there are differences in the frequency and effect of variants among ancestries.

So this result is very important because it shows not only that the PRS is generalizable to our population, but also that it has clinical use.

So earlier this year we published a paper with those results, and we also developed a genetic test at Mendelics. It includes the top 100 most important genes for hereditary cancer in general plus the breast cancer PRS: if a woman has a PRS value in the top decile, the report shows that she has a moderate risk of breast cancer.
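The top-decile rule described here can be sketched as follows. This is only an illustration under stated assumptions, not the logic of the actual test: the reference scores are simulated from a normal distribution, and the "average risk" label for everyone below the cutoff is a placeholder that does not come from the talk.

```python
import random
import statistics

random.seed(0)

# Simulated reference PRS values; a real cutoff would come from a validated cohort.
reference_scores = [random.gauss(0.0, 1.0) for _ in range(1000)]

# statistics.quantiles(..., n=10) returns the 9 cut points between deciles;
# the last cut point marks the start of the top decile.
top_decile_cutoff = statistics.quantiles(reference_scores, n=10)[-1]


def risk_category(score, cutoff=top_decile_cutoff):
    """Report 'moderate risk' for scores in the top decile of the reference."""
    # "average risk" is a hypothetical label for all other scores.
    return "moderate risk" if score >= cutoff else "average risk"


flagged = sum(risk_category(s) == "moderate risk" for s in reference_scores)
print(f"cutoff={top_decile_cutoff:.2f}, flagged={flagged} of {len(reference_scores)}")
```

By construction about one in ten reference scores falls above the cutoff, which is what "top decile" means.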

Key takeaways

So the key takeaways of my talk are that projects in data science are inherently overwhelming. We have so many things to pick from, so many data formats. It's like going to the grocery store in a foreign country.

But if you have a machine learning part in your project, I recommend that you use tidymodels, because it is very well organized and all the steps are covered. And of course the tools I've shown here are not everything; they have much more. So it worked fluidly, it was not an overwhelming part of my project, and it enabled the PRS validation in Brazil, which is a step forward in making precision medicine available to more people and more ancestries. I'm sure other ancestries also do not have this access, and they can now do the same analysis I did.

Thank you very much for your attention. If you want to find out more about the breast cancer PRS validation project, you can point your phone at this QR code, which will lead you to a GitHub repository with the code, the slides, and the link to the paper. Thank you very much.

Excellent. Let's log any questions on Slido. While we wait, I should tell you that the original name of the workflows package was orchestra, really, so I texted Davis that and he thinks it's a great analogy.

One question: did you have any pain points in the analysis process? Was there anything that was exceedingly difficult?

Well, actually the whole bioinformatics part. For the principal components analysis I had to do everything kind of from scratch, because I was using exome data and there was no established process for doing this, so I had to use an external database. It was the most complicated part, so not so much data science related, more bioinformatics related.

Okay. All right. Thank you very much.

Precision Medicine for All: Using Tidymodels to Validate PRS in Brazil (Flávia Rius) | posit::conf

Featured software

tidymodels

yardstick