Migrating to Open Source & the Future of Biostatistics | Beth Atkinson | Data Science Hangout

ADD THE DATA SCIENCE HANGOUT TO YOUR CALENDAR HERE: https://pos.it/dsh - All are welcome! We'd love to see you!

We were recently joined by Beth Atkinson, principal biostatistician at Mayo Clinic, to chat about the challenges of migrating from SAS to R, working with diverse and noisy data types (including wearable data and omics projects), foundational tooling like RMarkdown and Quarto, and maintaining statistical fundamentals amidst the hype cycle of new tools like AI.

In this Hangout, we explore the challenges of working with complex, high-volume data, like the data derived from wearable devices and medical charts. A challenge with wearable device data is that it can be super noisy, with issues like computers not syncing up, people forgetting to wear the device, or someone else wearing it. Medical chart data can be inconsistent; some things are recorded, and some are not. She also talks about the R/Medicine conference, the future of modern biostatistics, and the journey of compassionately helping an organization move from proprietary tools like SAS to open source tools like R. Beth also works on omics projects, including genomics (looking at DNA), metabolomics, exposomics (chemical exposures), and multiomics, which involves looking at all of this information together in a holistic way. We hope you'll come along with us if you're interested in learning about the biomedical world of data!

Resources mentioned in the video and zoom chat:
R/Medicine Conference website → https://rconsortium.github.io/RMedicine_website/
arsenal R package (MayoVerse) → https://mayoverse.github.io/arsenal/ (The arsenal package was created to help encourage transition from SAS to R by providing equivalent functionality for summary reporting macros that people relied on.)
2025 Posit Table and Plotnine Contests → https://posit.co/blog/announcing-the-2025-table-and-plotnine-contests/

If you didn’t join live, one great discussion you missed from the zoom chat was about the general dislike of regular expressions (regex) and the tendency to rely on tools like ChatGPT to write complex regex syntax. Many in the chat agreed that regex is difficult to commit to memory but acknowledged the power of the tool. So... do you use an LLM to help with regex?

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!

Timestamps
00:00 Introduction
03:25 "You do a lot of things that end in -omics. What are those things?"
05:02 "What are the types of data that you work with and some of the challenges that you face with those data?"
09:37 "What was your favorite new feature that made your work easier?"
11:42 "What is your favorite data science tool or R package that you find helpful in health research as a biostatistician?"
14:04 "I wanted to see if that's consistent with your experience [that 80% of workflow is data prep]"
17:07 "Does it scare you to hand off data to be cleaned by someone else?"
18:11 "What have you noticed that we still need to adhere to [regarding statistics fundamentals]?"
22:48 "Do you also produce reporting products as part of your role, and is your audience primarily internal and narrow, or do you communicate with a broader external audience as well?"
26:35 "Can you talk about a little bit of your personal SAS experience as well as the bigger organizational change maybe that Mayo is is doing?"
30:05 "What are some of the roadblocks that are faced in a SAS-to-R journey and and how can we find compassion for the people that we are helping to transition?"
33:55 "What is the community aspect internally at Mayo Clinic around R?"
35:43 "How do you store and manage all of that [data]?"
40:41 "What tools and skill sets should we focus on if we want to get into biostats today? Do you think it's important for people to still learn SAS if they're coming in fresh? And how about the future of biostatistics as a role separate from data science?"
45:48 "Is it possible for someone with a nontraditional background to make these transitions [into computational epidemiology]?"
48:10 "What's the source of most of these innovations?"
50:05 "Could you talk a little bit about R/Medicine conference?"

Oct 9, 2025

54 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12pm US Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.

All right, I am so excited to introduce our featured leader today, Beth Atkinson, Principal Biostatistician at Mayo Clinic. Beth, welcome. I'm so glad to have you here. Could you please introduce yourself for us? Tell us a little bit about what you do. And I will also ask you about something you like to do for fun.

Sure. I work at the Mayo Clinic. I'm a biostatistician there. I've been there for 35 years. When I started off, almost everybody used SAS on the mainframe computer. So you wrote a job and you submitted it and you waited for things to run in the queue and took a break together because it was all queue-based. And pretty soon after, we got a Unix system and people started transitioning and we had S-PLUS and I got involved with S-PLUS and working with Terry Therneau on the survival package and rpart. And I kept asking him questions of, well, can we do this with rpart? And he finally just said, you do it. And that's how I got involved.

And so I have been involved with the infrastructure of S-PLUS and then migrating to R almost from the beginning. So I work on a variety of different types of projects. I work on clinical trials. I work on genetics, omics projects. I've done things with epidemiology. I try to look at a number of different things. A lot of the projects I work on are related to aging and understanding what is aging and how do you measure it and what are the diseases that change with aging?

Working with omics and complex data types

So like genomics is looking at the DNA. So then you've got things looking at the RNA. You've got metabolomics. So there are measures of the metabolome. There are exposomics that they're looking at in the blood and other samples, trying to understand the chemical exposures that you may have had. Everybody's looking at different aspects of the body, looking at your gut biome and understanding what it is that you've eaten and how that might influence your health.

And multi-omics is hot now. So people started off by looking at just the genomics or just the proteomics or just the microbiome. And now people want to see how do we look at all of that information together, which is a huge challenge.

So right now I'm starting to look at data from wearable devices. So wearable devices, I've got my Fitbit on my watch, got something looking at the Oura Ring. So you get measurements every day for each person. And sometimes computers don't sync up or people forget to wear their device or somebody else wears it. Or it's very noisy and how do you collapse it into something that's explainable? And that can be a challenge.
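
Editor's illustration: one common way to collapse noisy daily wearable readings into something explainable is to drop days that look unreliable and summarize the rest per person. The sketch below is only an assumption about how that might be done, not the approach described in the Hangout. The record keys (person_id, steps, wear_minutes), the 600-minute wear threshold, and the function name are all hypothetical.

```python
from statistics import median

# Hypothetical threshold: count a day as valid only with >= 10 hours of wear.
MIN_WEAR_MINUTES = 600


def summarize_wearable(days, min_wear=MIN_WEAR_MINUTES):
    """Collapse per-day records into one summary row per person.

    `days` is a list of dicts with hypothetical keys:
    person_id, steps (None when the device did not sync), wear_minutes.
    """
    by_person = {}
    for day in days:
        by_person.setdefault(day["person_id"], []).append(day)

    summary = {}
    for person_id, records in by_person.items():
        # Keep only days that synced and were worn long enough.
        valid = [
            r["steps"]
            for r in records
            if r["steps"] is not None and r["wear_minutes"] >= min_wear
        ]
        summary[person_id] = {
            "n_days": len(records),
            "n_valid_days": len(valid),
            # The median is less sensitive to a few odd days than the mean.
            "median_steps": median(valid) if valid else None,
        }
    return summary


example = [
    {"person_id": "A", "steps": 8000, "wear_minutes": 900},
    {"person_id": "A", "steps": None, "wear_minutes": 0},    # sync failure
    {"person_id": "A", "steps": 300, "wear_minutes": 45},    # barely worn
    {"person_id": "A", "steps": 10000, "wear_minutes": 840},
    {"person_id": "B", "steps": None, "wear_minutes": 0},    # no usable days
]
result = summarize_wearable(example)
```

Reporting the number of valid days next to the summary keeps the amount of missing data visible. This sketch cannot detect the case where somebody else wears the device.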

So a lot of the data also is what's available in the medical charts. So some things are recorded, some things aren't recorded. So you want to understand BMI. You want to capture the BMI from the medical records and some people have lots of measurements and some people don't come in as often. They don't get it measured when they come in. And you have to estimate or sometimes it's written in the wrong units, especially historically, or people who are pregnant. And so that's not really capturing who they are and what that might do. So you have to pay attention to that.
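
Editor's illustration: the BMI issues mentioned above (wrong units, measurements taken during pregnancy, people with few or many measurements) can be sketched as a small cleaning step. This is a minimal sketch under stated assumptions, not the method used at Mayo Clinic. The unit-guessing ranges, the plausibility bounds on BMI, the record keys (height, weight_kg, pregnant), and the function names are all hypothetical.

```python
from statistics import median


def height_in_meters(height):
    """Guess the unit of a recorded height (a heuristic, not a rule)."""
    if height is None:
        return None
    if 1.0 <= height <= 2.5:     # plausible meters
        return height
    if 100 <= height <= 250:     # plausible centimeters
        return height / 100
    if 39 <= height <= 98:       # plausible inches
        return height * 0.0254
    return None                  # cannot tell; leave for manual review


def person_bmi(records):
    """Median BMI for one person, skipping unusable measurements."""
    values = []
    for r in records:
        if r.get("pregnant"):
            continue             # does not reflect the person's usual BMI
        h = height_in_meters(r.get("height"))
        w = r.get("weight_kg")
        if h is None or w is None:
            continue
        bmi = w / h ** 2
        if 10 <= bmi <= 80:      # hypothetical plausibility bounds
            values.append(bmi)
    return median(values) if values else None


chart = [
    {"height": 1.70, "weight_kg": 72.25},
    {"height": 170, "weight_kg": 72.25},                  # height in cm
    {"height": 170, "weight_kg": 90, "pregnant": True},   # excluded
]
bmi = person_bmi(chart)
```

Returning None rather than a guess, both for an unrecognized height and for a person with no usable measurements, leaves those cases for someone to review by hand.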

RMarkdown as a game changer

RMarkdown was a game changer, hands down. It totally changed the way that I do my job because I'm writing a report partly just for my own sanity of recording what it is that I'm doing and I can document kind of here's my thought process. So sometimes I'll just write a version of it for myself and then a smaller version that I'm going to share with the investigative team because I have a lot of projects that I do something and then I don't hear back from them for maybe five years and trying to keep track of what in the world did I do and what was I thinking about five years ago.

I feel like Quarto is equally magical as a transition. It took me a little bit to hop over to Quarto from RMarkdown, but I made that jump and I love it.

I still think of data scientists as working with really big data and databases and focusing more on does this algorithm work versus how do I interpret the individual variables and what are the implications of that.