SAS to R, data harmonization, & a career in pharma | Dony Unardi | Data Science Hangout
ADD THE DATA SCIENCE HANGOUT TO YOUR CALENDAR HERE: https://pos.it/dsh - All are welcome! We'd love to see you!
We were recently joined by Dony Unardi, Principal Data Scientist at Genentech, to chat about the Teal framework for Shiny, data cleaning and harmonization in the pharmaceutical industry, and career transitions from SAS (and web development!) to R. In this Hangout, we explore Teal, an open-source framework built around Shiny to accelerate the creation of clinical trial analysis apps. Dony explains that Teal provides out-of-the-box features like advanced filtering and, most importantly, code reproducibility. This allows users to generate the exact R code needed to reproduce a visualization or analysis from the app in their local environment, a critical feature for validation and submission in the pharma industry.
Resources mentioned in the video and Zoom chat:
Teal Gallery → https://insightsengineering.github.io/teal.gallery/
Pharmaverse → https://pharmaverse.org/
SAS to R: A Data Snack from the Fred Hutch Data Science Lab → https://hutchdatascience.org/data_snacks/r_snacks/sas2r.html
A SAS to R Success Story (Posit Video) → https://posit.co/resources/videos/a-sas-to-r-success-story/
R Consortium Submissions Working Group → https://rconsortium.github.io/submissions-wg/
If you didn’t join live, one great discussion you missed from the Zoom chat was about making the transition from SAS to R. Many community members shared personal experiences, valuable resources, and advice, emphasizing the importance of learning R's idioms rather than attempting a direct, one-to-one translation of SAS code. Let us know below if you’d like to hear more about this topic!
Timestamps
00:00 Introduction
06:34 "What is the Teal Shiny app framework?"
12:10 "Does being in the Bay Area make a difference for data careers?"
16:44 "What kind of sass are you talking about? (SAS!)"
18:07 "How do you approach technical interviews?"
20:22 "Do you have any tips for a SAS programmer being told to use R?"
26:10 "What are some of the data quality challenges that you find in pharma data?"
29:43 "How is reproducibility defined in the pharma industry?"
32:37 "What does Teal offer for reproducibility that a Git repo does not?"
36:48 "What are the pharma job titles for roles focused on data cleaning and QAQC?"
39:11 "What do you mean by harmonizing data?"
42:09 "How do you prioritize projects and collaborate with your teammates?"
44:36 "Should personal data science projects be for learning or for a resume?"
47:58 "How did you get leadership support to contribute to the open-source community?"
Transcript
This transcript was generated automatically and may contain errors.
Hey there, welcome to the Posit Data Science Hangout. I'm Libby Heeren, and this is a recording of our weekly community call that happens every Thursday at 12 p.m. U.S. Eastern Time. If you are not joining us live, you miss out on the amazing chat that's going on. So find the link in the description where you can add our call to your calendar and come hang out with the most supportive, friendly, and funny data community you'll ever experience.
Can't wait to see you there. With that, I would love to introduce our featured leader today, Dony Unardi, Principal Data Scientist at Genentech. Dony, it is so nice to have you here today. Can you tell us a little bit about yourself, what you do, and something you like to do for fun?
Yeah, for sure. First of all, thank you so much for having me here and letting me talk to you and all the wonderful people here. We have such great participants here, great numbers. So thank you so much for having me. So my name is Dony. I've been at Genentech for eight years, no, nine years, going on a decade now. I've been leading the Teal framework for about three years now. And the reason Libby was saying that I'm a morning person by choice is that I have a global team, so my standup starts at 7 a.m. every day. It used to be a lot earlier, but I'm like, nah, let's just make it 7. I can't do it. Otherwise, I'll be like a zombie in the meeting.
My background is in computing. Prior to Teal, I worked closely with study data. I was a data manager, and then I switched to a role where I did a lot of R development, trying to harmonize data for a molecule. Depending on whether it's recent study data or older study data, my job was to harmonize it and make it ready for analysis. I also have a background as a web programmer, I used to do freelance web programming, and as a SAS programmer as well. So it's been such an interesting career. I was sharing with Libby that at some point I thought I wanted to do web programming, and I was about to make a career switch to the web, to the front end. I would say I'm so glad I didn't, because data turns out to be such an amazing career path with a lot of things to learn. And with a product like R, we can make things very sophisticated; I apply so many different object-oriented techniques in the product. It's such a versatile and flexible tool for making great products.
Yeah. The funniest thing that Dony said to me was, yeah, I was doing web development and yeah, I was considering making a switch, but I didn't think data was going to go anywhere. I didn't think anything was going to come of this data thing.
Yeah. I was really almost at what felt like a career ceiling for me. And I was really deep into web too. I was doing Angular. I was learning Vue.js. I was doing a LAMP stack, all these different stacks. I was trying to learn it. I did a little freelancing, helping my wife. She used to have a logistics business, so I built the front and back end of her business. And it was fun, but I'm glad to be where I am right now. Definitely, data science is such an interesting and fun topic.
Introducing the Teal framework
It is. And speaking of topics, I would love to give everybody a sort of rundown of things that I think it would be fun to talk with Dony about, which can help inform some of your questions. So one is a little bit more background, which I will ask Dony for in just a second, around the Teal framework. What is that? How does it work? Why is it important? Another one is just the sort of switch that happened from SAS programmer and web developer to data, and what that sort of skill set and background gave to him and his ability to do stuff. Also being a technical leader. We've had a lot of conversations lately in the community about being a technical leader without being a people leader as well.
I know that Dony also does technical interviews. So he has people who are on this team with him. And while he does not do the behavioral part of the interview process, he does do technical interviews. So if you have questions about that, that might be interesting. And then also I loved the technical topics: working on Teal, and working in frameworks that I've never worked with in R, like R6, S3, S4. All of that feels so above my head as far as R programming goes, but if you have questions about that, I know Dony loves talking about those frameworks.
He's also an R champion coach. He onboards study teams and really helps with R adoption inside of Genentech. Those are great topics. I also loved something that Dony said, which was, hey, not everybody in AI and ML is doing the actual model building, because he has always been more on the data prep, data cleaning, and QC side, which is quality control, validation, things like that. So if you have questions about that, or if you're like, hey, I'd love to have a career in data, but maybe I don't want to be a modeler, Dony can talk about those topics too.
Hello. Sorry, I was in front of the stove because I am cooking my post-, not run, emphasize not run, meal, because I do Strava, but I go at a leisurely pace. But regarding the question, what is Teal? I'm unfortunately somewhat ignorant about it. Why use it? What does it do? What's so great about it? Is it just a really good color?
Fantastic question, Noor. Thank you. All right, Dony, take it away. Tell us about Teal.
It does have a good color. But anyway, hopefully it's more than that. So a Teal app is essentially a Shiny app, but we built a framework around Shiny to add a lot of features out of the box. It uses Shiny components and Shiny techniques to make a Shiny app, but with what we like to call Teal flavors. So what are these flavors? What we're trying to achieve is to accelerate how people create Shiny apps. While you can still make a traditional Shiny app, we want to provide a more functional programming way where, with a small amount of code, you can make your Shiny app.
Another thing that we're trying to achieve with Teal is to add components that are important in a clinical trial setting, but that we know are important in other settings as well. One of our key features that we always try to market is that it has code reproducibility. What that means is this: say you create a Teal app using the prebuilt modules that we have, you run your analysis with these modules, and you see something that you like. Now, Teal is positioned as an exploratory tool. So in order for you to run it through a more validated environment, Teal will provide you the code that can reproduce what you see in your Teal app or visualization. That's very important for us to achieve because in pharma, this is something that we always want to do. We want to be able to reproduce our analysis.
Another thing that we wanted to promote out of the box was our exploratory features. That means subsetting, filtering, and so on. By just running very simple Teal code to make a Teal app, the filtering comes out of the box. We encapsulate this, and this is where the object-oriented part comes in. We provide an API for you to control how the filters behave or to define predefined filters, but the intricacies of it, the details, we just encapsulate. So by default, when you make an app, you will have filtering ability. And we kept improving on this. We have predefined modules: line plots, bar plots, graphs that we know are quite standard. We provide maybe 50-plus modules that people can look around. The idea is that you don't need to know how to build a custom Teal module. We already know the standard modules that could be beneficial for your analysis, so why don't you just take one, plug it into your code, and run it?
There's a lot more that I can think of. We have a website about this. We also have Shinylive sessions on our website, throughout the documentation, where we try to show the capability of Teal. We also have what we call the Teal Gallery. You can Google it, teal.gallery. It's a website that shows you a couple of finished Teal apps using synthetic data, so these are all fake data. And this is what I like about our open-source Teal. Like I said, it's very flexible. Everything is just a Shiny module. And we're looking at several other categories that can be beneficial to other departments. Right now, I know Teal is being used very heavily in our study trials, but we also want it to be used for other purposes, like medical data review and so on. So maybe this is a long answer to your question, but Teal is a product that lets you easily make a Shiny app with the analysis that you need.
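To make the description above concrete, here is a minimal sketch of what a small Teal app might look like. It assumes a recent release of the `teal` package is installed; the datasets are R's built-in `iris` and `mtcars`, and the predefined filter on `Species` is purely illustrative. `example_module()` is the demo module that ships with `teal`; a real app would plug in prebuilt analysis modules instead.

```r
# Minimal sketch of a Teal app; assumes the teal package is installed.
library(teal)

# Wrap ordinary data frames in a teal_data object (built-in datasets here).
app_data <- teal_data(iris = iris, mtcars = mtcars)

app <- init(
  data = app_data,
  # example_module() is a demo module shipped with teal.
  modules = modules(example_module(label = "Example")),
  # Optional predefined filter (illustrative): start with one species selected.
  filter = teal_slices(
    teal_slice(dataset_name = "iris", varname = "Species", selected = "setosa")
  )
)

# Launch only in an interactive session; the filter panel comes by default.
if (interactive()) {
  shiny::shinyApp(app$ui, app$server)
}
```

The point of the sketch is how little code is involved: the data, a list of modules, and an optional filter specification, with the filtering UI handled by the framework.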
Bay Area, biotech, and networking
Since you've been in the Bay Area for a while, does being there make a difference for data careers, like either job opportunities or other kinds of networking or ambient knowledge you just pick up from being around that community? Do you feel like being in the Bay Area has given you a boost in data?
I think, well, let me start with opportunities first. In the Bay Area, South San Francisco is the biotech area. South San Francisco is a little bit south of San Francisco, and there are a couple of biotech companies here. And in biotech, I think data is quite crucial. Like I said before, it is the asset for these companies. So from an opportunity perspective, with these types of companies being in the Bay Area, I would say yes, it does have more opportunities, especially if you're local. I know that since the pandemic, things have become more remote. But I would say yes, being in data in the Bay Area could increase the opportunities.
From the data perspective, though, I've seen job postings regardless of the Bay Area. I think it's more about what industry you want to be in. Any category of company should have a data aspect to it. But do you want to work with clinical trial data? Do you want to work with finance data? I think that's what makes a difference, the category of data that you want to work with. If you want to be in biotech, I think the Bay Area does help.
SAS to R: tips for the transition
Do you have any tips for a SAS programmer that wants to or is being told to use R?
So I think the most important thing is that you have to embrace it. You have to acknowledge that it is a different language. As a SAS programmer, and like I said, I used to be a SAS programmer, I think having an understanding of functional programming and the syntax around it is very important. I think you just have to learn and train yourself. I remember that at the beginning of my career, switching to R, I just practiced a lot. And the easiest way to practice is the data step stuff, cleaning things up and making the data ready. That, to me, was an easy entry point.
So if you ask me what my favorite R package is, it has to be dplyr, because that was the first package that I used a long time ago, and I still know how to use it, even though I do more Shiny stuff now. But then you will level up. After dplyr, you learn, okay, how do I do this in base? Because if you're making an R package, you want to ask, does it really make sense to bring the whole of dplyr in, or can I get away with some base stuff? And you will slowly level up. So practice a lot. Pick some use case.
I mean, as much as I want to advise taking some online training, having a hands-on project is the thing that really makes a difference. So make a small project for yourself, create something out of it, enhance it, level it up. You made the data, great. Now, can you do a Shiny app? Can you level that up? Can you do it in a different way? If you don't want to do Shiny, that's fine. Can you do ggplot2? Can you make a static plot out of this? Just keep on increasing that level of learning, because then you'll get familiar with how SAS does the syntax versus how R does the syntax.
The plus point is, if you know how to do R and you're kind of familiar with the way it works, then if you want to switch to another language, like Python, for example, or Rust, you kind of see the similarities. SAS is just different. Personally, I felt, like I said, it's a proprietary language. Great software. I think they have great warning messages and error messages, where R sometimes can be very ambiguous. You just have to understand it and Google it. Now, I'm glad we have AI, so that we can ask AI what this error means, and I do that a lot as well.
So hopefully that helps. I think the key is to pick a project, work on it, level it up, and then you get familiar.
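As one hedged illustration of the "data step" entry point described above, the sketch below does typical DATA step work (drop rows, derive a variable, keep a few columns) first with dplyr and then with base R. The data frame and its column names are made up for the example, and it assumes dplyr is installed and R is version 4.1 or later (for the native pipe).

```r
library(dplyr)

# Made-up example data; column names are illustrative only.
vitals <- data.frame(
  subject   = c("001", "002", "003", "004"),
  weight_kg = c(70, 82, NA, 55),
  height_cm = c(175, 180, 165, 160)
)

# dplyr version: roughly what a DATA step with WHERE, an assignment,
# and KEEP would do.
bmi_dplyr <- vitals |>
  filter(!is.na(weight_kg)) |>
  mutate(bmi = weight_kg / (height_cm / 100)^2) |>
  select(subject, bmi)

# Base R version of the same steps, with no package dependency.
kept <- vitals[!is.na(vitals$weight_kg), ]
kept$bmi <- kept$weight_kg / (kept$height_cm / 100)^2
bmi_base <- kept[, c("subject", "bmi")]
rownames(bmi_base) <- NULL
```

Writing the same small task both ways is one way to practice the progression mentioned above, from dplyr first to base R when a package should avoid extra dependencies.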
Technical interviews
Libby, in her introduction, had mentioned that you're in charge of looking after technical interviews, and as someone at Pfizer who occasionally has to do the same, I kind of have antibodies to the prove-it kind of technical interview, and I'm much more interested in how you think around problems. Can you demonstrate that you can think around a scenario? I'm just interested in how you view that.
Yes. So we do ask these kinds of questions too. I'm not worried. I mean, the candidates that come in are smart individuals. I'm not worried if they don't know R6, for example, or S4. They will learn this quickly. So I'm definitely not worried from that perspective, but what you just said is something that we're trying to do as well. There are a couple of steps that I usually follow. I'm more concerned about their base R techniques, and then it builds up from there. If this is the problem, how would you do it? Have you done Shiny module programming? Do you understand certain techniques in Shiny that we're using? So it's not about the syntax having to be correct. It's more about just talking through it and then actually reviewing the code together. What I did was make a Quarto website, just for myself, and I present some examples, and then we talk about them. Sometimes we give hints. We don't grade them on accuracy. It's more about communication and train of thought.
Data quality and harmonization in clinical trials
Yeah, it's a pretty straightforward question. I too have never worked in that industry, and I haven't done my homework either. So my first question, before the one I wrote there, was, where do you get your data? Where does it come from? I'm guessing from randomized controlled trials. But once you get the data, what are some of the data quality challenges that you find? And what are some potential remedies for that?
Sure. So yes, you answered your own first question. It does come from study trials that are run by a CRO. We provide the molecule, and these are companies that help us conduct the trial. At this stage, it has to be a human trial, and it varies between different phases. We get the data from the vendor that we use to collect the data. So the CRO is the one that finds the hospitals and runs the trials, and the data comes into some software. We get it from them. And you're right, it comes in many variations. It kind of depends on the requirements as well. Sometimes throughout the trial, they may want to see more biomarkers or get more data points, which means we have to alter the software so that we can get those data points. So the data could be dirty.
And yes, this is the type of thing that I would say we clean at this point. Some of it could be very trivial. It's just about converting some characters into something else if it's from a different language. You know, if it's from Germany, you have an umlaut or stuff like that. It depends on which of these non-ASCII characters we want to keep and which we don't. So from there, we try to clean it up. We try to make it ready for the next step, which is when we come to a specific data model. In clinical trials, we use the data models from CDISC. CDISC provides guidance on how to build a data model from our raw data all the way to the data model that we can use for submission. What submission means is that we give the data to the FDA and they can analyze it. If they think the trial is successful, then they give their approval.
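The character cleanup mentioned above can be sketched in base R. This is only an illustration: the site names are invented, and the transliteration table (for example "ü" to "ue") is a hypothetical convention; as noted above, which non-ASCII characters to keep or convert is a project decision.

```r
# Flag strings that contain any character outside the ASCII range.
# (Missing values are not handled in this sketch.)
has_non_ascii <- function(x) {
  vapply(x, function(s) any(utf8ToInt(s) > 127L), logical(1), USE.NAMES = FALSE)
}

# Hypothetical transliteration table for German umlauts and sharp s.
from <- c("\u00e4", "\u00f6", "\u00fc", "\u00c4", "\u00d6", "\u00dc", "\u00df")
to   <- c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss")

# Replace each listed character with its agreed ASCII spelling.
to_ascii <- function(x) {
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

# Made-up site names: "München", "Berlin", "Köln".
sites   <- c("M\u00fcnchen", "Berlin", "K\u00f6ln")
flagged <- has_non_ascii(sites)
cleaned <- to_ascii(sites)
```

Flagging first and converting second keeps the decision visible: the flagged rows can be reviewed before any characters are changed.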
Reproducibility in Teal and pharma
Fantastic. Awesome. Thank you so much for the great question, Nick. And next I wanted to call in Cecilia.
Yeah. Hi. Thank you for your time today. I'm in academia, so I think I have an idea of what reproducibility means for me, but I'm not sure it's necessarily the same as what it means in industry, mostly in pharma, where reproducibility is mandatory. Like, you have to be able to know where the data comes from and how to reproduce it. So I think my question was more general. Does it mean having clean code and pipelines that are repeatable, but also making sure the analysis can be independently validated? What does reproducibility mean for you?
Okay. Maybe not validated. For code reproducibility, the context here is using a tool. Let's say you add a filter to get some visualization, you subset some data, you add some encoding, and by encoding I mean the parameter tuning that controls the aesthetics of the visualization, the custom modifications that you've made to your visualization while using the app. Reproducibility means that we have a way in Teal where, if you click a button, it will show you the code to reproduce what you see in Teal, but in your local environment. So if you take this code and you run it in your own R session, provided that you have installed all the dependencies needed, you will get the same result. That's the context here. So is it validated? I don't think so, unless you run that code in your validation pipeline.
When I say validation pipeline, usually this means an environment that is very strict, where they control what's in the environment, right? Either approved packages or things that have been sanctioned by the validation team. And that means that everything they put into this environment has some validation documentation. So that's what I meant by code reproducibility.
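The idea of an app handing back the code behind what you see can be illustrated in plain base R. The sketch below is not how Teal implements it internally; it is only a conceptual example, using the built-in `iris` data, of recording analysis steps as unevaluated expressions, exporting them as code text, and replaying that text in a fresh environment.

```r
# Record each user choice as an unevaluated expression instead of running it
# directly, so the exact steps can be replayed later.
steps <- list(
  quote(dat <- subset(iris, Species == "setosa")),
  quote(result <- aggregate(Sepal.Length ~ Species, data = dat, FUN = mean))
)

# 1. What the "app" does: evaluate the steps in its own environment.
app_env <- new.env()
for (s in steps) eval(s, envir = app_env)

# 2. What the user receives: the same steps as plain R code text.
code_text <- vapply(
  steps,
  function(s) paste(deparse(s), collapse = "\n"),
  character(1)
)
cat(code_text, sep = "\n")

# 3. Replaying that text in a fresh environment reproduces the result.
local_env <- new.env()
eval(parse(text = code_text), envir = local_env)
same_result <- identical(app_env$result, local_env$result)
```

As the answer above stresses, reproducing a result this way is separate from validation: the replayed code is only validated if it is run inside a controlled, documented environment.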
Data harmonization explained
And what exactly do you mean by harmonizing data? I had the same question when I was talking with Dony. He mentioned putting data together from over a large time span. Is that what we're talking about?
That's correct. So if you recall, I mentioned the CDISC data model earlier, right? I'm just going to use this as an example. This data model is something that we follow in order to prep our data, and they have a different publication every couple of months, if not every couple of weeks, right? There's always a new publication coming in, because data changes and technology changes, so there's always new guidance on how to handle these changes. Sometimes it's new stuff; sometimes it's just an improvement, where now we have a better way to do something. So imagine you have data that spans 20 years. Depending on when the data came out and when it was modeled to a specific guidance, it will differ across those 20 years, right? You may end up having data that uses five different versions of the guidance. Now, it's hard to combine, it's hard to just stack data when they follow five different models. Sometimes even the variable names are not the same. So that's what I meant by harmonizing. What we need to do in this process is come to some kind of agreement on how we harmonize this, on which guidance we're going to follow, in order to fit all the data into the same model so that we can easily stack them together and make them easy to analyze.
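A toy version of that harmonization step can be sketched in base R. Everything here is invented for illustration: the two studies, their values, and the legacy names `PATNUM` and `AGEYRS`. The target names follow CDISC-style variable naming, but the mapping itself is a hypothetical agreement, not an actual standard migration.

```r
# Two made-up studies modeled under different (hypothetical) conventions.
study_old <- data.frame(
  PATNUM = c("A-01", "A-02"),
  SEX    = c("M", "F"),
  AGEYRS = c(54, 61)
)
study_new <- data.frame(
  USUBJID = c("B-01", "B-02"),
  SEX     = c("F", "M"),
  AGE     = c(47, 66)
)

# Agreed target model, plus one rename map per study (target = source).
target_cols <- c("USUBJID", "SEX", "AGE")
rename_maps <- list(
  old = c(USUBJID = "PATNUM", SEX = "SEX", AGE = "AGEYRS"),
  new = c(USUBJID = "USUBJID", SEX = "SEX", AGE = "AGE")
)

# Select the source columns in target order, rename them, and tag the study.
harmonize <- function(df, map, study_id) {
  out <- df[, unname(map[target_cols]), drop = FALSE]
  names(out) <- target_cols
  out$STUDYID <- study_id
  out
}

# Once both studies share one model, stacking them is a plain rbind().
pooled <- rbind(
  harmonize(study_old, rename_maps$old, "STUDY_OLD"),
  harmonize(study_new, rename_maps$new, "STUDY_NEW")
)
```

Real harmonization also has to reconcile units, coded values, and derivations, but the shape is the same: agree on one target model, map each source to it, then stack.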
Prioritizing work and collaborating globally
How do you prioritize and choose what projects you work on? How do you collaborate with your teammates?
Yes. So for me, it always comes back to our stakeholders. We have target stakeholders that we always reach out and listen to, because I make the product, but the product is being used in these study trials. So if it's not working, if something needs improvement, then I need to know, and that is our priority. From there, we bring it to the team, and we use agile methodology. We have to define the definition of done. After we define this, we just start working on it. So this is how we prioritize. It always comes back to the stakeholder. Our increments usually run eight weeks, so we need to deliver something in eight weeks. And every two weeks is a sprint, where we always look back at our board. We use GitHub, actually. GitHub has been really helpful for us to plan, to estimate how much work there is, to see where we are, and to estimate whether or not we'll reach the goal. So I guess the short answer is we prioritize by listening to our stakeholders.
Standing out with personal projects
I see that a lot of people are doing personal data science and coding practice. How do you think people can stand out from the crowd? I'm guessing that this is with like applications. Do you think it's more ideal to treat these projects as a learning exercise or something that you should put on your resume?
I would say both. Do you look at projects and stuff when you look at applications?
Yes, I do. I do actually. I usually ask people what their most recent project is, or the project they're most proud of. Because again, it could be small, it could be big. What I do is, if I have a project, like my side project, I just want it to do this one thing. And yes, it's a good learning experience, right? I recall I have a package that I made because I wanted to streamline my work when I have a lot of different dependencies. If you work on a product that has a lot of dependencies, then sometimes it can be quite hard to switch between different branches, different feature branches, and so on. So I made a package from there. Then I thought it was a good product, so I pushed it to GitHub to see if anybody else was interested. And yes, that would be a good addition to your resume because now it's out there. And if you want to take it to the next level, push it to CRAN, and then it can really become public. But I would say both. You can use it for learning, and as you use it for learning, hopefully the ideas will come. All right, this looks like an interesting technique for a certain problem that may be closer to your field. And if you see that opportunity, pursue it. Just try. Make something.
That's the great thing about R too. There's a lot of documentation about how to make an R package. If you're not into R packages and you're more like a web guy, deploy Shiny apps. shinyapps.io, I think Posit, you guys are the owner, right? shinyapps.io, if I'm not mistaken. Yeah, I think they have a free tier. And then if it becomes bigger and you want to scale it, I think there are some fees. But for a starter, it's free. And I do this too. I just deploy stuff to shinyapps.io.
Community, open source, and career advice
Sure. I was going to ask you a career question because we always ask about career advice. But I was thinking of maybe pointing it a certain way. Because, Dony, I've seen you're such a big part of the community. And you've given a bunch of different workshops. And I was wondering if that comes naturally to you? Did you have to convince your leadership to let you share what you're doing with others in the community, too?
Yes. I mean, with this product, fortunately for me, there is support from leadership for me to be able to do this. So there's definitely that aspect of it. If you don't know, we have some pilots. Actually, no longer a pilot. We do have a successful R submission already with the FDA. But we also want to explore what Shiny would mean in an FDA context, and Teal could be one of those gateways. So that's why I'm getting the support, to see if others can see this as a potential solution that we can all agree on and all be on the same principle. Because all the FDA cares about is that it has to be something that is agreed upon by the industry. So the FDA really doesn't mind whether it's SAS or whatever it is; it's really up to the industry to decide what it is that they want, what they can see.
So yeah, fortunately for me, it does come quite naturally because of the support. And I'm very fortunate to be here. I've met so many incredible people, and the project really does open up a lot of avenues, just connecting with different people from different companies. I would encourage our participants today, if you want to be in this space, to look into some of the open source code out there. I think Pharmaverse is a great one. And these are all data scientists, and we are all very friendly. I mean, I know tech could be competitive, but I would say data scientists are some of the warmest, kindest people that I've met so far. We always like to help each other. And then sometimes, if you go to PHUSE or another open forum, you say, ah, you're the guy that I always chat with on GitHub. And it's just so nice to be connected in that space. So yeah, check out Pharmaverse. If you don't know where to start, look at Pharmaverse. I mean, Teal is not in Pharmaverse, but we are part of Pharmaverse. So if you look in Pharmaverse, you'll probably see some Teal there. I put all of my workshops in the Pharmaverse organization on GitHub. So if you want to check out some of the older workshops, you can, but we have a big announcement soon. So look out for that.
That sounds amazing. There were so many wonderful questions that we couldn't get to. There were questions about, you know, career limitations for growth. We had questions about clinical trials. We had questions about the influence of AI in the research space. Everybody asked such amazing things, but we only have this one hour with Dony. So Dony, thank you so much for hanging out with us. This was so much fun.
Yeah. Thanks for having me, Libby. And thanks to everybody for your questions. I really enjoyed it. I hope you had a great time. I mean, yeah, feel free to connect with me if you want, if you have any questions.
Thank you. Go find Dony on LinkedIn, Dony Unardi. Also, I realized as we're talking about announcements, I was trying not to spill the beans because I didn't know if it was out yet, but Isabella has informed me that it is out. So if you are registered for the Posit Conference, in-person, virtual, doesn't matter, the Discord server is open. Yay! So there's a link there. There's a little button. You can go join the Discord server. You better find me, tag me, say hello. I will right now go make a Posit Data Science Hangout channel for us. We can all pile in there. And yeah, really get talking. My recommendation to you, make your name your real name. If you have a weird Discord server name, I'm not going to know who you are. Make it your real name so I know who you are, okay?
If you want to save the chat, because it's full of amazing resources, go click the three dots in the upper right-hand corner of the chat. If you are not able to save the chat and that distresses you, fill out the Zoom survey that comes after this. Put your email on it. Rachel and I can help you get the chat. Find us on LinkedIn. Find me on Bluesky, where I spend most of my time. And we will see you next week, where we have Lisa Elkin. She's a Senior Principal Computational Toxicologist at Pfizer. Lisa is a delight. Please come hang out with Rachel and me and Lisa. All right. Have fun on the Discord server. I will see you there, and I'll see you next week, everybody. Bye. Bye, Dony.