Migrating to Open Source & the Future of Biostatistics | Beth Atkinson | Data Science Hangout
ADD THE DATA SCIENCE HANGOUT TO YOUR CALENDAR HERE: https://pos.it/dsh - All are welcome! We'd love to see you!

We were recently joined by Beth Atkinson, principal biostatistician at Mayo Clinic, to chat about the challenges of migrating from SAS to R, working with diverse and noisy data types (including wearable data and omics projects), foundational tooling like RMarkdown and Quarto, and maintaining statistical fundamentals amidst the hype cycle of new tools like AI.

In this Hangout, we explore the challenges of working with complex, high-volume data, like the data derived from wearable devices and medical charts. A challenge with wearable device data is that it can be super noisy, with issues like computers not syncing up, people forgetting to wear the device, or someone else wearing it. Medical chart data can be inconsistent; some things are recorded, and some are not. She also talks about the R/Medicine conference, the future of modern biostatistics, and the journey of compassionately helping an organization move from proprietary tools like SAS to open source tools like R. Beth also works on omics projects, including genomics (looking at DNA), metabolomics, exposomics (chemical exposures), and multiomics, which involves looking at all of this information together in a holistic way.

We hope you'll come along with us if you're interested in learning about the biomedical world of data!

Resources mentioned in the video and Zoom chat:
R/Medicine Conference website → https://rconsortium.github.io/RMedicine_website/
arsenal R package (MayoVerse) → https://mayoverse.github.io/arsenal/ (The arsenal package was created to help encourage transition from SAS to R by providing equivalent functionality for summary reporting macros that people relied on.)
2025 Posit Table and Plotnine Contests → https://posit.co/blog/announcing-the-2025-table-and-plotnine-contests/

If you didn’t join live, one great discussion you missed from the Zoom chat was about the general dislike of regular expressions (regex) and the tendency to rely on tools like ChatGPT to write complex regex syntax. Many in the chat agreed that regex is difficult to commit to memory but acknowledged the power of the tool. So... do you use an LLM to help with regex?

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for hanging out with us!

Timestamps
00:00 Introduction
03:25 "You do a lot of things that end in -omics. What are those things?"
05:02 "What are the types of data that you work with and some of the challenges that you face with those data?"
09:37 "What was your favorite new feature that made your work easier?"
11:42 "What is your favorite data science tool or R package that you find helpful in health research as a biostatistician?"
14:04 "I wanted to see if that's consistent with your experience [that 80% of workflow is data prep]"
17:07 "Does it scare you to hand off data to be cleaned by someone else?"
18:11 "What have you noticed that we still need to adhere to [regarding statistics fundamentals]?"
22:48 "Do you also produce reporting products as part of your role, and is your audience primarily internal and narrow, or do you communicate with a broader external audience as well?"
26:35 "Can you talk about a little bit of your personal SAS experience as well as the bigger organizational change maybe that Mayo is doing?"
30:05 "What are some of the roadblocks that are faced in a SAS-to-R journey and how can we find compassion for the people that we are helping to transition?"
33:55 "What is the community aspect internally at Mayo Clinic around R?"
35:43 "How do you store and manage all of that [data]?"
40:41 "What tools and skill sets should we focus on if we want to get into biostats today? Do you think it's important for people to still learn SAS if they're coming in fresh? And how about the future of biostatistics as a role separate from data science?"
45:48 "Is it possible for someone with a nontraditional background to make these transitions [into computational epidemiology]?"
48:10 "What's the source of most of these innovations?"
50:05 "Could you talk a little bit about R/Medicine conference?"
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
All right, I am so excited to introduce our featured leader today, Beth Atkinson, Principal Biostatistician at Mayo Clinic. Beth, welcome. I'm so glad to have you here. Could you please introduce yourself for us? Tell us a little bit about what you do. And I will also ask you about something you like to do for fun.
Sure. I work at the Mayo Clinic. I'm a biostatistician there. I've been there for 35 years. When I started off, almost everybody used SAS on the mainframe computer. So you wrote a job and you submitted it and you waited for things to run in the queue, and everybody took a break together because it was all queue-based. Pretty soon after, we got a Unix system and people started transitioning, and we had S-PLUS. I got involved with S-PLUS and working with Terry Therneau on the survival package and rpart. And I kept asking him questions of, well, can we do this with rpart? And he finally just said, you do it. And that's how I got involved.
And so I have been involved with the infrastructure of S-PLUS and then migrating to R almost from the beginning. So I work on a variety of different types of projects. I work on clinical trials. I work on genetics, omics projects. I've done things with epidemiology. I try to look at a number of different things. A lot of the projects I work on are related to aging and understanding what is aging and how do you measure it and what are the diseases that change with aging?
Working with omics and complex data types
So genomics is looking at the DNA. Then you've got things looking at the RNA. You've got metabolomics, so there are measures of the metabolome. There is exposomics, which looks at the blood and other samples, trying to understand the chemical exposures that you may have had. Everybody's looking at different aspects of the body, looking at your gut biome and understanding what it is that you've eaten and how that might influence your health.
And multi-omics is hot now. So people started off by looking at just the genomics or just the proteomics or just the microbiome. And now people want to see how do we look at all of that information together, which is a huge challenge.
So right now I'm starting to look at data from wearable devices. So wearable devices, I've got my Fitbit on my watch, got something looking at the Oura ring. So you get measurements every day for each person. And sometimes computers don't sync up or people forget to wear their device or somebody else wears it. Or it's very noisy, and how do you collapse it into something that's explainable? That can be a challenge.
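As a rough illustration of the "collapse it into something explainable" step, here is a minimal Python sketch. Everything in it is an assumption made for the example: the participant IDs, the step counts, the use of `None` for a day with no synced data, and both thresholds. It is not a description of how these data are actually processed at Mayo.

```python
from statistics import median

# Hypothetical daily step counts per participant; None marks a day with no
# synced data (device not worn, sync failure, and so on).
daily_steps = {
    "P01": [8200, 7900, None, 8100, 150, 9000, 8400],
    "P02": [None, None, 4000, None, None, 4200, None],
}

MIN_VALID_DAYS = 4         # assumed threshold for a usable week
MIN_PLAUSIBLE_STEPS = 500  # assumed cutoff: below this, treat the day as non-wear


def summarize(steps, min_valid_days=MIN_VALID_DAYS, min_steps=MIN_PLAUSIBLE_STEPS):
    """Collapse one person's noisy daily series into a small summary."""
    valid = [s for s in steps if s is not None and s >= min_steps]
    return {
        "days_recorded": len(steps),
        "valid_days": len(valid),
        # The median is less sensitive than the mean to a few odd days.
        "median_steps": median(valid) if len(valid) >= min_valid_days else None,
    }


summaries = {pid: summarize(steps) for pid, steps in daily_steps.items()}
for pid, row in summaries.items():
    print(pid, row)
```

Reporting the number of valid days next to the summary value keeps the missingness visible instead of hiding it inside a single number.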
A lot of the data is also what's available in the medical charts. So some things are recorded, some things aren't recorded. Say you want to understand BMI. You want to capture the BMI from the medical records, and some people have lots of measurements and some people don't come in as often, or they don't get it measured when they come in. And you have to estimate, or sometimes it's written in the wrong units, especially historically, or the person was pregnant at the time, so the measurement isn't really capturing who they are. So you have to pay attention to that.
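The kind of checks described here (unit harmonization, excluding measurements taken during pregnancy, flagging implausible values) might look something like the following Python sketch. The record layout, the unit labels, and the plausibility range are all hypothetical choices for illustration, not clinical guidance.

```python
def to_kg(value, unit):
    """Convert a recorded weight to kilograms; unit labels are assumed."""
    return value * 0.45359237 if unit == "lb" else value


def to_m(value, unit):
    """Convert a recorded height to metres; unit labels are assumed."""
    if unit == "in":
        return value * 0.0254
    if unit == "cm":
        return value / 100
    return value


def clean_bmi(record, low=12.0, high=70.0):
    """Return BMI in kg/m^2, or None when the record should not be used.

    The plausibility range (low, high) is an illustrative assumption.
    """
    if record.get("pregnant"):
        return None  # weight during pregnancy does not reflect usual BMI
    weight_kg = to_kg(record["weight"], record["weight_unit"])
    height_m = to_m(record["height"], record["height_unit"])
    bmi = weight_kg / height_m ** 2
    return round(bmi, 1) if low <= bmi <= high else None


# Hypothetical chart entries, including one with a mislabeled height unit.
records = [
    {"weight": 70, "weight_unit": "kg", "height": 175, "height_unit": "cm"},
    {"weight": 154, "weight_unit": "lb", "height": 69, "height_unit": "in"},
    {"weight": 70, "weight_unit": "kg", "height": 69, "height_unit": "cm"},
    {"weight": 80, "weight_unit": "kg", "height": 165, "height_unit": "cm",
     "pregnant": True},
]
for rec in records:
    print(clean_bmi(rec))
```

A range check like this only catches gross unit errors. A weight recorded in pounds but labeled as kilograms can still land inside the plausible range, so comparing each value against the same person's other measurements would be a natural next step.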
RMarkdown as a game changer
RMarkdown was a game changer, hands down. It totally changed the way that I do my job because I'm writing a report partly just for my own sanity of recording what it is that I'm doing and I can document kind of here's my thought process. So sometimes I'll just write a version of it for myself and then a smaller version that I'm going to share with the investigative team because I have a lot of projects that I do something and then I don't hear back from them for maybe five years and trying to keep track of what in the world did I do and what was I thinking about five years ago.
I feel like Quarto is equally magical as a transition. It took me a little bit to hop over to Quarto from RMarkdown, but I made that jump and I love it.
Favorite tools for health research
Yeah, there is definitely a lot of overlap because I do a lot of data mining, using a lot of data mining tools and other types of things. So, yeah, we have discussions about what does it mean to be a data scientist and what does it mean to be a biostatistician? And it is not always clear.
Tools for working with larger datasets, I think, address one of the challenges that I have. I had one project where I looked at all of the medical records, or all of the diagnostic codes, for everybody in Olmsted County over a 10-year period, which is just lots of records. And so using more efficient ways to merge data and subset and understand that was important. So, general tools for working with large data, I think, is...
Data cleaning and the hype cycle
Cleaning data, organizing data, pulling it together, understanding it definitely takes up a huge amount of time. Then communicating results sometimes in working with the investigators. For some of it, I am fortunate that I can delegate to other people and say, can you create a clean version of the data for me? And, you know, some projects I do it, but for others I can get other people involved, and that gives me kind of the bandwidth to think about all the other aspects.
I think it's also just being suspicious of things that are too good, results that are too good. We face a lot of hype of, oh, this is the answer. You know, when I first started doing genetics, GWAS was going to solve everything. We were going to know all of the answers by doing GWAS, and it was, if you had 100,000 variables, that was great, and then it was 500,000, then it was a million, then it was, oh, exome sequencing, that's going to solve everything, and now it's artificial intelligence that's going to be the answer.
AI and statistical fundamentals
I'm seeing a lot of presentations like this: I was at a conference yesterday and people had 20 versus 30 in a group. They ran all of these models and did all of these fancy things and optimized all their settings and said, oh, our model's great. I'm a little suspicious about how well that's going to do in practice. So it's using the tools that are available, exploring things, but questioning, thinking back through, are the results too good? Why is it causing that? Being suspicious.
I have to be careful. It's a fine line between being the naysayer because everybody wants to use these tools and thinks they're magical and can do everything. So I don't want to be too negative, but at the same time, it's just another tool. You have to think in terms of, you've got a million variables or attributes that you're looking at, but you have very small sample size. Is it really representative? Is it going to really be helpful? I guess that's the balance that I'm struggling with, especially coming from the statistics.
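The worry about many variables and a very small sample can be made concrete with a small simulation. This is a sketch in plain Python under made-up settings (20 samples, 2,000 features, all pure random noise), not an analysis from the talk. It searches the noise features for the one most correlated with a random outcome.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def best_noise_correlation(n_samples=20, n_features=2000, seed=0):
    """Largest |r| between a random outcome and purely random features."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


best = best_noise_correlation()
print(f"Strongest correlation found in pure noise: {best:.2f}")
```

None of these features carries any real signal, yet the best one will typically look like a strong predictor. That is the sense in which a result from a tiny sample and a huge feature set can be "too good", and why validation on independent data matters before trusting it.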
Yeah. I mean, ChatGPT, when I have to do regular expressions, is great because I can never remember all the specific syntax for doing a specific task. But there are other things that some of these tools aren't as helpful for, and sometimes just a simple penalized model is more than sufficient.
Reporting and tailoring to different audiences
I'm on a number of different projects. Some of them are big grants, multi-year grants, and some of them are just short-term. They have something, an analysis, that they're looking at. Most of the teams I'm working with are internal. Sometimes there are also external team members on the projects. I find it helpful to write a general report using RMarkdown or Quarto of, here's what was done. It makes sure everybody's on the same page in terms of, here are the results, here's what we started with, here are the methods that were used. I've got some teams where the people know R, and so then I turn on the option to share the code if that's something that's helpful for them.
Well, and I will do parameterized RMarkdown reports. So I've got regular DSMB meetings. So clinical trials, you have to do reports on, here's the recruitment and here's the adverse events that are seen. And so for the big team, you have to do it so that they don't see the randomization. They don't see the differences between the groups because they need to be blinded. But then for the advisory committee that aren't part of the regular team, they need to see that detail. So then I've got a version that's for them and a version that's anonymous.
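The idea behind a parameterized report is that one source produces both versions, with a parameter deciding how much detail is shown. In R Markdown that switch would be an entry under `params` in the YAML header. The sketch below shows the same idea in plain Python, with made-up participants, arm labels, and event names. It illustrates the pattern and is not the actual DSMB report code.

```python
from collections import Counter

# Hypothetical trial records; arm labels and event names are made up.
participants = [
    {"id": 1, "arm": "A", "adverse_event": "nausea"},
    {"id": 2, "arm": "B", "adverse_event": None},
    {"id": 3, "arm": "A", "adverse_event": None},
    {"id": 4, "arm": "B", "adverse_event": "headache"},
    {"id": 5, "arm": "B", "adverse_event": "nausea"},
]


def build_report(records, blinded=True):
    """Return report lines; the `blinded` parameter controls arm-level detail."""
    lines = [f"Enrolled: {len(records)}"]
    if blinded:
        # Pool everyone so the randomization is not revealed.
        groups = {"All participants": records}
    else:
        groups = {}
        for r in records:
            groups.setdefault(f"Arm {r['arm']}", []).append(r)
    for label in sorted(groups):
        events = Counter(
            r["adverse_event"] for r in groups[label] if r["adverse_event"]
        )
        summary = ", ".join(f"{name}: {count}" for name, count in sorted(events.items()))
        lines.append(f"{label} (n={len(groups[label])}): {summary or 'no events'}")
    return lines


blinded_report = build_report(participants, blinded=True)
unblinded_report = build_report(participants, blinded=False)
print("\n".join(blinded_report))
print("\n".join(unblinded_report))
```

Because both versions come from the same function and the same data, the blinded and unblinded reports cannot drift apart in their totals. Only the grouping changes.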
SAS to R migration
So Mayo is still using SAS, especially in the clinical trials area. But a lot more people have migrated to R, and that's because, coming out of school, that's what people know. They don't get trained in SAS, but there's all this legacy code. When you're doing so many studies, you build up a lot of legacy code, and that makes it really challenging to migrate. We've got one particular area of data that we need to pull from and that's still only available in SAS. So we have to use SAS for some of those things. So we're trying to use the best tools for the situation.
Well, and it's hard because we're all busy and we don't have time to learn something new and taking time out of your day to struggle with something when you know how to do it in the other language. I know how to do this in SAS, it's going to take me 10 minutes, but if I have to do it in R, it's going to take a lot more time. And where do I charge my time to do that? Because we have to account for the time that we're spending.
So trying to find, here's example documentation to say, okay, if you have a project like this, here's kind of some of the workflow that you might want to have, and here's some code that you can steal. Having places to start with is always helpful as opposed to, here's a blank page and you have to start from scratch.
And, you know, ChatGPT and tools like that are also helpful now in that you can at least stick something in saying, here's my SAS code. Can it come up with something? What would you do in R? And it may not be the best R code. It's kind of weird sometimes, but it gets you started.
And when I'm working with team members, I will give them, here is some example code of how I would do this in R. You can use either SAS or R, but I'll give you a start in R.
Building an R community at Mayo
We've got R stewards. So I'm one of the R stewards, and Jason Sinnwell's on here. He's another one of the R stewards. So we've got a small team and we try to organize training and regular seminars. We're now trying to break into small groups and look at areas where we need documentation that's specific to Mayo or where people are struggling, and ask, can we improve on that?
You know, we had a group that was looking at REDCap. So REDCap is a tool for capturing data, for entering data, that's used a lot in medical institutions. And getting that to work well with R at Mayo was something we created some documentation around. So we're trying to add value, because when you just Google, it can be overwhelming because there's so much out there, and what do I trust and what not.
Data storage and management
We have three different systems for studies to enter their data. And so those get stored in different locations. We've got the electronic health record, and it's now in the cloud so we can use SQL code to pull from that. And we get things from surveys, we get wearable devices, we get genomic data from different data types. So part of what I and others do is putting that all together for a particular research project. And then we store that in our Linux system and/or in the Google Cloud environment. And then we have an archival system. So every project has a tracking number, and we can find it in the metadata surrounding those numbers.
Exploring new data types
I'm Googling a lot to see what's out there. Lately, with the Oura ring, I was asking ChatGPT, can you show me some example data and how I might look at it and publications that people have used? Because that's kind of my latest data type that I'm trying to work with. Or if there are other people in our area that know something about it, you know, trying to learn from them. Because, yeah, you're right. Every data type has its own challenge and you can't be an expert in everything.
Also, I try and kind of have a central place where I keep tips on this is what I had to do in order to figure this out. Because I figured this out for one project that's stored somewhere, and I can't remember where I stored that project. That's often my challenge.
Career advice and the future of biostatistics
I guess I've been successful in part because I've been flexible in learning new tools. So, whatever I learned in school was a good starting point. But being open to learning new tools in general is important. R is great for some things, but I've got a couple of projects where I need to use Python now. So, I'm teaching myself some Python. And it's constantly been that. In the GWAS world, there were all sorts of separate packages and I had to learn how to use those. So, it's never going to be just one tool, but I think knowing R is a great starting point to get the basics and the structure and how you do things. And then you change the syntax, you change the programming language, but some of the concepts are still the same.
If you're taking over a project where it's all in SAS, then it would be. But for most areas, it's probably not going to be as important.
Well, and it also comes into pay, and data scientists tend to have higher salaries that are offered, at least for right now. I still think of data scientists as working with really big data and databases and focusing more on does this algorithm work versus how do I interpret the individual variables and what are the implications of that. So, it still seems like there's a difference, but yeah, the labels are going to be what they are at different institutions.
Sources of innovation in medicine
And, you know, for me, a lot of it is the doctors that are working with the patients. They're making observations, and sometimes they're collaborating with industry to say, gee, I really noticed that these people were having voice issues before they officially developed Parkinson's. Can we capture that somehow? And then partnering with somebody in industry who says, oh, we can measure this, or there are these wearable devices, or whatever. So it's that collaboration between different groups. We're really trying to be patient driven: what are the needs of the patients and how do we improve that?
R/Medicine conference
Yeah. So R/Medicine, we started off as an in-person conference for a few years focusing on aspects of how R is being used within the medical field. But we've really wanted to bring in the doctors that are using R and, you know, R that's being used in labs and get everybody to talk to each other and think again, that it's not just a, here's a bunch of data scientists talking about R in isolation. It's really looking at the issues with R and, you know, how it's helping within the general medical area. And now it's a virtual conference and we're really wanting to make sure that things are accessible to people around the world. You know, so people who can't attend regular conferences, how can they get the support that they need? You know, what are people doing to teach R to people who don't have access to somebody with training, official training?
All right, everybody. This has been a fantastic get together. Beth, thank you so much for joining us and sharing all of your wisdom with us. That's great. Thanks for inviting me. Oh, absolutely. This was really fun.