Shannon Pileggi @ The Prostate Cancer Clinical Trials Consortium | Data Science Hangout
We were recently joined by Shannon Pileggi, Lead Data Scientist at The Prostate Cancer Clinical Trials Consortium, to chat about building reporting workflows for multi-year projects, creating analysis code review processes, and fostering global communities. Speaker bio: Shannon Pileggi (she/her) is a Lead Data Scientist at The Prostate Cancer Clinical Trials Consortium, an occasional blogger, a frequent workshop instructor, and a member of the R-Ladies Leadership team. She enjoys automating data wrangling and data outputs, and making both data insights and learning new material digestible. The Hangout is a gathering place for the whole data science community to chat about data science leadership and the questions you're all facing. It happens every Thursday at 12 ET. To join future Data Science Hangouts, add them to your calendar here: https://pos.it/dsh We'd love to have you join us in the conversation live! Thanks for hanging out with us!
Transcript
This transcript was generated automatically and may contain errors.
Hi, everybody. Welcome back to the Data Science Hangout. I'm Rachel Dempsey. I lead Customer Marketing at Posit. If you are joining a Hangout without knowing who we are, Posit is the open source data science company building tools for the individual, team, and enterprise. Thanks so much for taking the time out of your day to hang out with us. The Hangout is our open space to hear what's going on in the world of data across different industries and to connect with others who are facing similar things as you. And we get together here every Thursday at the same time, same place.
We do have one week we're taking off, August 15th, for the week of Posit Conf. And so I know I've asked this many times before, but if anyone from the Hangout will be there in person, please let me know in the chat. I'm trying to get some people together one of the days of the conference as well. But if you want to join us virtually, you can as well. And let me, bad at multitasking here, let me put this into the chat, if you want to register to join the conference virtually too. And I always like to add, if you're watching this Hangout as a recording and you want to join us live in the future, we'd love to have you join us. Just make sure it's added to your calendar for 12 Eastern time. I know people really enjoy connecting with other attendees here in the Hangout in the chat. So if you want to use this time now to introduce yourself, say hello, maybe share where you're based or something you like to do for fun, feel free.
I love to see people connecting there and making friends from the Hangout. We're all dedicated to keeping this the friendly and welcoming space that you all have made it and love hearing from you no matter your years of experience, titles, the languages you work in, or the industry that you work in. Another quick note, if you're hiring, you can also share those roles in the chat too. But today, if you want to jump in and ask questions or provide your own perspective, you can raise your hand on Zoom. I'll call on you to jump in. You could put questions in the Zoom chat and feel free to put a little asterisk or star next to it if you want me to read it. And then lastly, we do have a Slido link where you could ask a question anonymously too.
But thank you all again for joining us. I'm so excited to be joined by my co-host today, Shannon Pileggi, Lead Data Scientist at the Prostate Cancer Clinical Trials Consortium. And Shannon, I'd love to have you kick us off with introducing yourself, share a little bit about your role today, but also something you do outside of work too.
Sure. So my name is Shannon Pileggi. My background is in biostatistics. I was a college professor for a few years, and I worked in market research for a few years. And more recently, I've been in clinical trials for the past three years, and it's my happy place now. Outside of data science, I recently got into a gym in my local town. It's a community-owned gym. I go six days a week. I'm really into weightlifting. And I was thinking about it this morning, and I feel like it's not entirely unlike what we do in data science, in that you're constantly pushing yourself to failure and iterating and figuring out what works and moving on and getting stronger.
Journey into data science
Yeah, sure. I mean, being a college professor, I knew about statistics, and I knew about programming, and it eventually got a little frustrating that I was sending people out in the world to tackle cool programming problems, but not having any experience doing so myself. And so I was ready to make a change. And so my first industry position was more statistics-oriented. And as I started doing that more, I realized I had more fun learning about the programming side. And so in my current role, I am definitely more programming-focused and less statistics-focused. But I would say I've been using R since about 2005.
Collaborating with Excel users
Something I was curious to ask you about, because I saw one of your recent blog posts around working with Excel and R, and I wanted to learn a little more from you on the collaboration between data science teams and business stakeholders who prefer Excel. And I was wondering if you had any tips or lessons learned for us.
Yeah. I mean, a lot of what we do is we have to work with other people to put together reports. And in an ideal world, we could stitch together our reproducible tables and their narrative. But we're not going to onboard our partners to GitHub and teach them how to use Markdown. I mean, obviously, they could learn that, but that's just not realistic for where we're at. So we often do produce tables in Excel for them to paste into the reports as they need to. It's just what works right now.
Yeah, I think the post I saw was about exporting a table.
Yeah. I mean, Excel is a pretty gnarly beast to tackle, especially when you get used to so much really pretty formatting from gt tables. You're doing all these beautiful things in HTML, and then you try to dump it into Excel and none of the beautiful things come out right. And so that recent post was just about some tips you can use, like intermediate workflows and steps, to make the formatting come out as close as possible to what you specified. But it's still not perfect.
Clinical trials work
Yeah. And so I work on clinical trials and we deal with a large amount of data. And so part of the work that we do is to ensure that we're getting data in a timely manner and that we're getting good quality data. So we might be making monitoring reports, looking like holistically at all of the sites that are enrolling in clinical trials, like are they doing a good job and entering certain fields? And does the data look like of good quality? And obviously we have different ways to assess that. Other types of like standard reports we do are safety reports for different monitoring agencies where we're looking at adverse event reporting and stuff like that.
Yeah, 2005, that was back in graduate school. So we definitely had graduate school courses in R and that was my first exposure. I also learned SAS when I was a college professor. I taught both R and SAS programming. And then when I moved into my first industry position, our team was mixed. So we were R programmers and SPSS programmers and also people who are very good at writing like Excel macros and like doing crazy things in Excel that I've never done before. And so I would say, you know, that job more and more started to encourage and support the use of R, especially as they saw like the powerful things we could do in Shiny and Connect.
Community involvement and R-Ladies
Yeah, so getting involved in the community was definitely baby steps. You have to start somewhere. I think it was a little easier when Twitter existed or people were active on Twitter. And so one of the first things I did was start blogging, which helped me to get to know people as they gave me feedback on the blog posts. Another really cool program that I enjoyed, when we were on Twitter, was serving as R-Ladies curator for Twitter, where they had a rotating person each week showcase things that they did in R. And that helped me get to know more people in the community. And through that role, I eventually joined the R-Ladies global team to support the Twitter curation program. And unfortunately, we had to retire that as we decided to leave Twitter.
And then fairly soon after that, there was a call for volunteers to step up to join the leadership team. And so I said, yes, and so now I've been on the leadership team for about a year and a half now. And what that means is that for the R-Ladies organization, we are a nonprofit and we have a board of directors. So I am one of five individuals on the board of directors.
Blogging and staying motivated
Yeah, you said the word as I was typing the question, but good to speak with you. So I'm just curious about just kind of a question was how you got into blogging, which I think you partially answered, but also how do you come up with the ideas for your blog posts and kind of like what keeps you going? Because I know that a blog is something that I've thought about doing. I'm just like, I don't even know, like, how am I going to get into it? Am I going to have enough steam to keep going?
Yeah, I mean, like I explained, that market research job had so many mixed skills in terms of programming that not everyone appreciated the cool things I was doing in R. And I thought they were cool. And I kind of felt like they were just sitting on a computer, going to waste. And I was like, this is the thought. So I wanted to get it out into the world. And so I just had to think carefully about what I could do that was shareable or frame it in a different context that was shareable. And so that's what got me started.
As far as like what keeps it going, I think having an infrastructure that just works keeps it going. You know, you see a lot of people talk about like, oh, I went to go like make a blog post and all of a sudden I figured out I had to like update things on my website and everything broke and I couldn't get my post out there. And that's like not a happy place. I think a lot of like the technology has changed so that you don't have to live in that unhappy place anymore.
I'm like super pleased because my blog is built in Distill, which predated Quarto. And I always think about updating my blog to Quarto. But, you know, every time I go back to blog, it still works. So like why? The infrastructure is there and it's easy. And now I can put my energy into writing a new blog post when I have the time instead of updating my actual blog infrastructure.
As far as like the motivation to keep blogging goes, like I just kind of like writing. And if I don't, if I go too long without writing, I get a little itch to start writing again. I was putting like pressure on myself when I started to have goals to like blog once a month. And now with assuming other responsibilities, especially with our leadership team, those goals are like not sustainable and not realistic. And so now I'm just happy when I get one out.
Advocating for data science and R
Yeah, I think you got to get someone in the door. I don't know how you — I think when you advocate for things in your company, like sometimes you can start small and you can start grassroots and you're like you get all the people in your level and you agree that this is like the best way to move forward. And like it just never happens. And then at some time, some point you just have to take it straight to the top. And if that leadership doesn't agree, then like it's just not going to happen. But it's sometimes that's like the best way to get it moving instead of grassroots.
Yeah, I mean, I think sometimes you have to ask like depending on where you are and do this like very carefully, but like sometimes you have to ask for forgiveness instead of permission, like no one's going to — people might not be like, yeah, go use R and do this different thing. But if you can show them something really cool that you did in R and some capabilities that you don't have in SAS, then that's what is going to get people excited about it.
Keeping up with the R community
So R Weekly still runs. They used to actually have a mailing service where you would get the weekly emails, and that mailing service is down and has been down. But the curators are actually still working every week to curate content and aggregate all of those posts. So it's still a really good resource if you remember to go check it out yourself. And then the other thing I do is I listen to the R Weekly podcast, because the hosts talk about the top three highlights from the R Weekly posts, and they add a really nice layer of context and depth and conversation around those posts about their own personal experiences. And I think that's a really nice way to stay on top of something.
Career advice
I can't think necessarily about received, but we do have an intern right now at our group, and she's very lovely to work with, and I believe she is on this call. She is going to be graduating from college in the next year, and she's a little frazzled about what she's going to do for her first job. And I keep telling her to relax. It is your first job, and it's not going to be your forever job, right? Like, your career is about iteration. You're going to take what you get in your first job, and hopefully it's a good one. But you're going to learn what you like to do, and you're going to learn what you don't like to do, and you're going to iterate and move on from there. And just because it's your first job, it doesn't necessarily lock you into that career path forever.
Building inclusive communities
Bringing people together and getting a diverse range of views is really about creating a very safe place for people to talk, like a place like this. So, you need a safe environment, and that comes in a lot of different ways. There's quite a bit of infrastructure and thought that goes into creating that safe environment. So, for example, with R-Ladies, some things that we have are our code of conduct that explains what sorts of behaviors will not be tolerated in our events. And another thing that you can do to provide that infrastructure is provide suggestions or ways that people can communicate in that environment, which Rachel does so well here, right? She tells you, you can drop questions in the chat, you can post questions anonymously, you can star them if you want to read them yourself, right? It's really explicit how to interact in this environment. And the person on the other end of the call doesn't have to guess about, like, what should I do? Am I going to go on camera? Like, what's the process here, right? There's no guesswork for that person. So, that tension and trepidation that a person might feel about interacting in that environment is removed because you know exactly what to do. So, I think when you are trying to build diverse communities, you need to set up that infrastructure to make people safe, but also to guide people on how to interact.
Switching sectors and open source in pharma
Hi, Shannon, I'm also a fellow co-organizer of R-Ladies, so just the Amsterdam chapter. It's a bit later in the day for me. But I was actually interested in asking you a bit about something you said much earlier on about how you're now on clinical trials and that's your happy place and you were a professor for a while and then you were in market research. And I'm just kind of wondering how did you, one, decide to switch between those very different sectors and then also how you navigated it?
Yeah, I mean, when I left teaching, it was like a two-factor decision. One, I had to relocate to be closer to my family. And I actually really loved the department I was teaching in, and I feel like the best place in the world is an undergraduate statistics program that truly cares deeply about teaching and teaching well. And I had so much fun there. But I also explained earlier that I was a little frustrated that I wasn't doing my own data science problems. I wasn't advancing my own data science knowledge. I felt like I was further behind than my students in programming and R.
And so I decided when it was time to move, if I couldn't teach in what I thought was the best place in the world, I was not going to teach anymore. And I needed to advance my own skill set. And so the first position I took in market research was a really nice fit for me to learn more about industry and how things operate outside of the academic world. It was different, you know, but it wasn't quite my favorite flavor of science. And I realized I kind of missed that more rigorous environment. So that's why I'm so glad I ended up in clinical trials, because it blends what I like, pushing the edge of what you can do in data science, but also has that nice, rigorous environment to it as well.
So then I also recently saw that at Posit Conf there's going to be this summit about open source and pharmaceutical companies. And I always thought that those two may not go hand in hand. So I am curious about how you use open source tools like R and R packages and things like that in an environment that needs to be very closed and secure, and how those two are balanced.
Yeah. I mean, I'm super lucky that the team I came into like it's a brand new team and didn't exist before, which means that there was like no legacy code existing like to protect. So we were a brand new team. We were writing code from scratch, which meant like we could choose our own tool set and our tool set of choice was R and that was clear, you know, that that's the direction that we were going. As far as like industry standards, like it is accepted now that you can do R programming in clinical trials. But that's not the pushback or the barrier as much — the pushback or the barrier is that like a large portion of the industry is trained in SAS and a large portion of the pharmaceutical companies have a ton of existing framework built upon SAS macros. So that is like really hard inertia to change both the training and the technical infrastructure. But you'll see that a lot of larger pharmaceutical companies have their own entire software engineering teams now where their jobs are to build R packages for internal processes to make that technical transition easier. And so a lot of the pharmaceutical companies are moving in that direction towards R or at least trying to move the needle.
Moving into a managerial role and code review
So, I was more at first like an individual contributor, just writing my own code, and then our team grew, and then I had one direct report, and it grew again, and I had two direct reports, and it grew again, and I had three direct reports. And then all of a sudden, I was in a place where I found that I might have been giving the same feedback individually to the three direct reports instead of having a cohesive place where I documented my expectations and made those clear upfront. So, that was something that I had to move towards, and it probably wasn't something I could have done in the beginning because I didn't know what my expectations were. And specifically, this is around code review and code submission.
So, part of my managerial role is that the new contributors will submit code and submit pull requests, and I'm looking for two different things in that code. I'm first looking like does the code actually do what it needs to do, which is one question. But sometimes part of the harder question is did the code think about the data correctly? And in that sense, what I mean is that when you work with clinical trials data, like our data comes to us in a relational database type format with like 90 different data sets. And so, there's a lot of nuance in understanding how the data capture works, and that takes experience to get used to. So, it's not just like did your code work right, but like did you bring in the appropriate data from the appropriate data sets to make this conclusion?
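Editor's note: the question "did you bring in the appropriate data from the appropriate data sets" can be partly checked mechanically. The following is a minimal sketch only, written in Python with the standard library rather than R so it stays self-contained. The data set names, field names, and values are all hypothetical and are not taken from the speaker's team or their public notes. It looks for subject IDs that appear in one data set but not the other before any conclusion is drawn from the combined data.

```python
# Hypothetical extracts from two of the many related data sets.
enrollment = [
    {"subject_id": "001", "site": "A"},
    {"subject_id": "002", "site": "A"},
    {"subject_id": "003", "site": "B"},
]
lab_results = [
    {"subject_id": "001", "test": "PSA", "value": 4.1},
    {"subject_id": "003", "test": "PSA", "value": 6.3},
    {"subject_id": "009", "test": "PSA", "value": 2.0},
]


def unmatched_keys(left, right, key):
    """Return key values present in `left` but absent from `right`, sorted."""
    right_keys = {row[key] for row in right}
    return sorted({row[key] for row in left} - right_keys)


# Subjects who are enrolled but have no lab rows, and lab rows whose
# subject is not in the enrollment data; both deserve a closer look.
enrolled_without_labs = unmatched_keys(enrollment, lab_results, "subject_id")
labs_without_enrollment = unmatched_keys(lab_results, enrollment, "subject_id")

print(enrolled_without_labs)
print(labs_without_enrollment)
```

A check like this does not replace understanding how the data capture works. It only surfaces rows a reviewer might want to ask about.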
And so, I'm reviewing a lot of code, and I was giving feedback individually, and then I realized I'm giving the same feedback over and over again, and I've got to document this to make it clear. And I think once I did that, people were like oh my gosh, you told me, and now I understand. Like now that I've seen it all written out, like this makes sense. Thank you.
Yeah, I think code review is like such an interesting thing, because there's like this very traditional context of code review in software engineering, where you're like writing functions for packages, you're doing unit tests, you're making sure the code does exactly what it says it's going to do, but that's like not the type of code review that I do. Like we're doing code review for data analysis, and I think that's a much different flavor. I actually made my thoughts on it public on GitHub. If one of my friends on the call right now wants to go find that link and drop it in, that would be great. And you're welcome to read it, look through my thoughts, contribute anything back. And I made it public because I think it's — some of the things are kind of like still kind of internal to our team, but some of the things I think can more broadly apply to other teams. And I put it in a GitHub repo and not a blog post because I feel like it's going to be changing and evolving as I continue to think about it, and I also would love like other people's feedback on it.
But in terms of like tips for code review, maybe I can work on that more. I mean, one of my goals over the next year is going to be for me to do less code review and for my direct reports to be doing more code review of each other and me so that they flip the role, and I think as we do that conversation, we'll learn more about like processes about what works for everyone in terms of it. But it's a pretty painstaking process, to be honest, to go through everything line by line and validate the results as well to make sure that the tables are coming out right, that the numbers are lining up, but it has to be done.
Community participation and career growth
Libby, I see you asked a question a bit earlier around the R community. Do you want to jump in?
Sure. I lost it, but it was definitely like how has being active in the community changed your trajectory, do you think, or opened up opportunities for you? I'm often talking to people who are very, very hesitant to start interacting in the community or start posting things or just start talking about what they know, and I wish that I could give them an example of an experience like, look, this is how expansive it could be for you.
Previously, before I started blogging, or maybe even initially when I started blogging, I felt like I had to figure everything out by myself, and I would just spend hours googling things and being kind of unhappy and chasing dead ends and not getting the answer to my questions and being frustrated. And then when you're part of a community, you can just ask for help and skip that eight hours, and you can get help pretty fast. So there's huge time savings there, and it's fun, and you get to learn what other people's areas of expertise are.
Data scientist vs. statistician vs. data analyst
Yes. The data analyst one's kind of hard, right? Data scientist, statistician, I can parse out a little bit more easily that I used to be a statistician, and now I'm a data scientist. And part of that is I no longer want to compute p-values as part of my work, so I don't do that anymore. I think it was great training for me in that I understand how to think about data, and I know how data should be organized, but I'm not actually interested in doing the analysis or the heavy lifting of the analysis. So if you really like doing the analysis and finding the results, then statistics is a great way to go.
I think data science is obviously a very broad field and can mean a lot of things, but maybe to further delineate, you can do software engineering flavors of data science, and you can do more data analyst flavors of data science, and those are two different skill sets. So the people that I know that are software engineers, they have less opportunity to work with data and real-world problems. They're writing functions and tools to do things, and I do do some of that. I do do some package writing, but I would really miss working with the data. I actually like working with data. I like working with real data, and I like finding the messes and helping find the solutions to the problems. So I think that's more of a data analyst role, and it might not have as fancy of a title or connotation, but I think it's still really fun and valuable work.
Making the case for code review
So, you've just spoken about how code review is really beneficial, because, one, it, you know, validates that things are doing what they're supposed to be doing, and, two, people learn from each other, but how do you make the argument to invest the time to do code review? So, if your manager thinks, no, it's too much time, we have too much work, I don't have time to have two people looking at the same body of code, what's the argument to convince them?
I mean, I think it's about the quality of the product that you deliver, because there's just so many places that even, like, I can go wrong, right? Like, I feel like I'm very well-versed in the data, I'm very well-versed in what we need to do, and I'll send it to my direct report and ask her to code review me, make sure I didn't miss anything, and she found something. Like, she found, oh, my gosh, there was this one data frame you actually forgot to account for when you drew this conclusion, right? And maybe that logic would have been missed when I just presented a table of numbers, but it would have been incorrect. And so I think it's about increasing the quality of our output and increasing our internal knowledge base at the same time.
And does that ever become problematic if you're faced with really tight deadlines?
It's challenging, but it's, like, not an excuse not to do it. It's, like, it's non-negotiable for me. Like, this deliverable can't go out without a code review. I think there are different levels of review you can also employ for quality checks, right? Like, if you don't have time, especially if you're tabling numbers, there are some shortcuts you can take in terms of, like, I don't have time to look through the code, but I'm going to actually look through some raw data listings. I'm going to spot check some numbers. I'm going to make sure that that five matches these five lines right here. And, like, if you're in a super pinch, but it's, yeah, for me, it's still non-negotiable. It doesn't go out without a review.
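Editor's note: the spot check described above, confirming that a number in a table matches the rows in a raw listing, can be sketched in a few lines. This is an illustration under stated assumptions, not the speaker's workflow: it uses Python's standard library rather than R, and the listing rows, field names, and reported counts are all hypothetical.

```python
from collections import Counter

# Hypothetical raw listing: one row per reported adverse event.
raw_listing = [
    {"subject_id": "001", "site": "A", "event": "fatigue"},
    {"subject_id": "002", "site": "A", "event": "nausea"},
    {"subject_id": "003", "site": "B", "event": "fatigue"},
    {"subject_id": "003", "site": "B", "event": "headache"},
    {"subject_id": "004", "site": "B", "event": "fatigue"},
]

# Hypothetical counts as they appear in the table being reviewed.
reported_counts = {"A": 2, "B": 3}


def spot_check(listing, reported, key):
    """Return {group: (reported, recounted)} for every group that disagrees."""
    recounted = Counter(row[key] for row in listing)
    groups = set(reported) | set(recounted)
    return {
        group: (reported.get(group, 0), recounted.get(group, 0))
        for group in sorted(groups)
        if reported.get(group, 0) != recounted.get(group, 0)
    }


# An empty result means every reported count matches the raw listing.
mismatches = spot_check(raw_listing, reported_counts, key="site")
print(mismatches)
```

Recounting from the listing is independent of the code that built the table, which is what makes it useful as a quick check when there is no time to read that code line by line.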
Online training tips
Yeah, I mean, I've certainly been a participant in, like, async online courses before. I've never really run one myself. I think those are harder to make. I mean, they're good. Definitely, like, a lot of great, good Coursera courses and stuff, like, out there, but I think they're harder to, like, make meaningful, make stick, make, like, long-term impressions in your brain, at least for me.
But what I do do more of is online training in terms of interactive workshops, especially for R-Ladies. I do really enjoy that because I still, like, really love teaching. It's super fun, and I'm glad I have the opportunity to do that every now and then. In terms of tips for making those run well, like, you have to, like, let there be, like, awkward pauses. You have to stop and let people's brains, like, catch up so that you give them the opportunity to ask questions. You don't just say, does anyone have questions, and you move on. Like, you say, does anyone have questions, and you stop, and you awkwardly stare at the screen and let it be quiet, and just when you think you're about ready to move on, someone's going to pipe in with a question. So you really have to give the people the time, and it's even harder when it's online, and you, like, now, like, people's, you got black squares, right? And, like, so when you're in a room, you can read people's faces. Like, you can see, like, I, and especially if, like, you've been working with them for a while. Like, you're, like, you have your confused face on. We need to stop right now.
Working with clinical trials data
Yeah, I just have a question regarding clinical trials data. So earlier you mentioned that clinical trials data tends to come to you in a lot of different data frames. And I've used real-world data in the form of public health data sets, but I'm really trying to learn how to use electronic medical records. Do you have any advice for someone who's starting this field and interested in working with clinical trials data, especially things that are really awkward and messy like nested variables?
Honestly, I don't deal with electronic medical records directly, so we design our own data capture systems. And so the way it comes out is prescribed by that data capture system. So I'd be curious to learn more specifically about the types of data you're dealing with and that nesting structure before I could really offer you some sound advice.
But I think one thing that kind of helps is if you can find a sweet spot with things that mimic that structure but aren't quite as Wild West. And so, for example, I don't know if you're talking about something like API access or whatever, but kind of taking baby steps to learn about those workflows maybe in a different context that's like lower stakes or maybe where the content matter is more familiar to you so you understand how to navigate that process. So, for example, like the first time I ever used API tokens was like on a personal project to create a library of books that I checked out during COVID for my kids when I wanted to create this like catalog of books and learned how to like use APIs that way. And it was a cool learning experience that like transferred over knowledge to like API work that I need to tackle now.
Survival analysis packages and Posit Conf
Sure, the link is good. There's a lot in the Pharmaverse more broadly, for those who are interested in this space. The Pharmaverse is a collective, I guess I'll call it, that has a pretty broad selection of packages that are useful for regulatory submissions. And this is one of them that I posted, from this Insights Engineering GitHub org, that just has a very cool list of tables, listings, and graphs or figures. And yeah, it comes with R code and then the rendered version of that R code in the HTML document. It's a really beautifully rendered page. So it's nice, not only for programmers, but for stakeholders to look through and see what it is that they would like to see with their TLGs.
Connecting and upcoming talks
Sure. I'm on LinkedIn. I'm on the R-Ladies community Slack. I'm on Mastodon. And so you can find me in a lot of different places. And if you message me and I don't reply, you're welcome to message me again, because sometimes I forget where all those messages are, like which platform they were actually on.
Sure. And so at Posit Conf, I will be co-teaching What They Forgot to Teach You About R with David Aja. We've taught this together for three years. Super fun material. This year, we are teaching a half day on debugging and a half day on personal R administration. So if you want to get into the weeds of how your computer works with your R installation, that's David's section. If you want to understand how to more efficiently debug functions, like if you've never used the debugger, if you've never used a browser statement, any of that, like I have a million videos up on YouTube because I give it at so many different workshops. You can look for that content or you can come join us live in the workshop. It's always more fun in person.
And I will also be giving a talk. My talk is called Context is King. I'll be in the Making Great Tables session. And we're going to be talking about variable labels in R, which I think is a not very well known feature. And I think there's a lot of advantages to using this feature that I'm going to talk about.
Well, luckily for Posit Conf, I have well-tested teaching material. I didn't actually try to give a talk the first year I was teaching a workshop. That would have been crazy. I know people that do that, but that's not for me. So at least the workshop's pretty well set.
Awesome. Well, I'm so excited to see you there. I did just want to let everybody know that, so you've probably heard, when people get accepted for talks at Posit Conf, we have them go through speaker training, if they want, with this organization called Articulation. And Articulation is going to actually join the Data Science Hangout next week. So if you want to learn from them without yet being a Posit Conf speaker, I encourage you to join next week's Data Science Hangout too. Blythe and Acacia from Articulation will be here.
Awesome. Well, thank you so much, Shannon. I really appreciate you taking the time to join us and sharing your experience. And thank you to everybody here in the chat with all these amazing links and everything that was shared here too. Thank you all. Have a great rest of the day, everybody.