Data Science Hangout | Javier Orraca-Deatcu, Centene | Excel to data science to lead ML engineer
We were joined by Javier Orraca-Deatcu, Lead Machine Learning Engineer at Centene. Among many topics covered, Javier shared how his background in finance and consulting led to his interest in data science as a way to automate some of his work, and how he helped bring other data scientists together in his organization.

(26:31) How did you organize and recruit people for the data science community group at Centene?

I sort of piggybacked off a general data science community chat that we had at the company. There were several hundred people on it, of varying backgrounds and expertise levels, so there was a lot of conversation happening. There was already a Python group that was meeting, I think every other month. So three weeks after I started, I got really excited about the possibility of creating something similar for R users.

1. It started by trying to figure out who owned that already existing data science chat and seeing if they could help support the idea of creating an R user group, something to meet once a month or once every two months. At larger companies especially, getting that type of top-level executive stamp of approval and support can go a long way, especially if that individual is part of the already existing IT or data science function.
2. At the time, I created a blogdown site. For those of you who are familiar with R Markdown, blogdown is a package that allows you to create static websites and blogs with R Markdown. Now with Quarto you can do the same thing and create websites. I love the syntax of Quarto.
3. We had partnerships with Posit, so we were able to get some people to come in and do workshops as well.
4. We also had reticulate sessions: co-branded Python and R workshops where we looked at ways teams working in different languages could communicate a lot more easily. I had a great experience with it.
Everyone was so collaborative, and it was such a great way to see the excitement around what you could do with both R and Python. What started as 13 users the first month jumped to about 100 to 125 monthly users on this monthly meetup.

(49:10) And on the journey to machine learning engineer, what was the hardest part?

Because of SQL, I had a really good understanding of at least how tabular data could be joined and the different transformations that could be done to these data objects. I think I would have really struggled without that basic understanding. But having said that, the part where I really struggled at first was function writing. Function writing was not intuitive to me. Basic function writing was, but in general I found it to be very complicated, and it took a solid three to six months of practice to feel actually comfortable with it.

Even when I started building Shiny apps: basic Shiny is quite easy, but large functions underpin the entirety of a Shiny app. Everything you do within Shiny is effectively writing functions. The process of learning Shiny and becoming more comfortable with it was very difficult and something that just took a lot of repetitions, but it all played together. While people may think of Shiny as more of a frontend system, it made me a much better programmer in the way I thought about functions and function writing.

Another thing that I found hard, and looking back I'm sort of embarrassed to say this, was reproducibility of machine learning: being able to rerun a code set and get the exact same predictions every time. I wasn't quite sure why this wasn't working, or how to fix the randomness, setting a seed or whatever you need to do to ensure that someone else downstream could replicate your study or analysis and get the exact same findings themselves.
► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu Follow Us Here: Website: https://www.posit.co LinkedIn: https://www.linkedin.com/company/posit-software Twitter: https://twitter.com/posit_pbc To join future data science hangouts, add to your calendar here: pos.it/dsh (All are welcome! We'd love to see you!)
Transcript
This transcript was generated automatically and may contain errors.
Hi, everybody, welcome back to the Data Science Hangout. Hope everyone's having a great week. For anybody joining us for their very first time today, welcome. This is actually our second to last Data Science Hangout for 2022. Thank you to everybody who's been with us this whole year as well.
This is an open space for the whole data science community to connect and chat about data science leadership, questions you're facing, and getting to learn about what's going on in the world of data science, with a view into different industries and use cases. We share the recordings of these sessions to our Posit YouTube, so you can always go back and rewatch or find helpful resources. And sorry, I'm three weeks behind on those uploads, but I will catch up and update them this week.
Together, we're all dedicated to creating a welcoming environment for everybody. No matter what industry or background or experience you have, we want to hear from everybody. So there's always three ways that you can ask questions. And also to provide your own perspective, it doesn't have to be just a question. You can jump in by raising your hand on Zoom, and I'll be on the lookout there. You can put questions into the Zoom chat. And feel free to put a little star next to it if you want me to read it out loud instead. Maybe you're in a coffee shop or something. And then third, we also have a Slido link where you can ask questions anonymously.
I will mention too, if you want to connect with people after the fact, we do have our LinkedIn group for the Hangout. I know right now, not too much conversation goes on in there, but we'd love to create that space where you can easily find each other. You do have to manually turn on notifications in that group, so that might be part of it.
But I am so happy to be joined by my co-host for today and our friend from the Data Science Hangout, Javier Orraca-Deatcu, Lead Machine Learning Engineer at Centene now. Congrats on the new role. Javier, I'd love to have you reintroduce yourself, because I know you've met a lot of us from past weeks as well, and share a little bit about your role and the company now, and maybe something you like to do in your free time too.
Thanks, Rachel. Yeah, it's great to be here. I've told many people this, and I really do mean it: this has been my favorite standing meeting for the last year and a half or so. You've been doing an awesome job, Rachel, and I love this kind of platform for knowledge sharing.
But yeah, my name is Javier Orraca-Deatcu. I've been around the finance, corporate finance, and data science world for 15-plus years. I spent a lot of my career in consulting, doing different types of financial modeling around valuation work and different types of economic studies, like economic obsolescence studies, functional obsolescence studies, and some tax optimization work. And in those days, I was mostly using Excel.
We would do some SQL work, but it was pretty minimal. Whenever we were doing that, I was actually using Microsoft Access. I feel like that's not really a tool used much nowadays, but at least at the time, when we were working with hundreds of thousands or millions of records, it was much easier to do the same types of group-bys and summations that we were doing in Excel using Access for the bigger data.
So I decided to go back to grad school. I really wanted to get into data science. It was kind of a buzzword; I was reading all about predictive analytics without really knowing what it was or what it meant, but it excited me. I wanted to really get my hands on that and understand how I could take the forecasting and modeling skills I had to the next level, so that I could start automating some of my work and some of the reporting I was doing.
So I went back to grad school, and yeah, when I got out of grad school, I joined Centene. I was with them for about two years. I quit to go work for an e-commerce startup for about a year, and I just recently returned to Centene. I'm on week two in a new org, so bear with me if I can't answer all the questions about my work, but yeah, here at Centene, I'm a lead machine learning engineer. I'm part of a scrum team that is a joint effort with different data scientists and data engineers, and we're partnering with our business stakeholders and really taking these, like, high ROI, you know, predictive modeling concepts and putting them into production.
Yeah, something I like to do for fun, I love playing board games. I feel like I spend way too much of my free time reading about developments in the R world or, you know, Python developments. I mean, I enjoy it. A lot of times I'm just reading about the stuff, not really even, you know, applying it through code or anything, just trying to keep up with all the trends that are happening. And yeah, that's a little bit about me.
I have to ask, what's your favorite board game?
In terms of being able to explain it to friends or play it quickly, I love Splendor. It's a card game, two to four players. Especially if the people you're playing it with have had a few repetitions, it's a really fun, quick game.
When is machine learning appropriate?
Somebody asked anonymously, when is machine learning appropriate or necessary?
Okay, so I come from a heavy Excel background. So I will sort of caveat my response by saying, you know, my views might be a little more nuanced than someone maybe coming from like a CS background or someone, especially nowadays, I mean, I feel like people graduating from undergrad and going into master's programs are like diving straight into machine learning, which is awesome.
But I will say there is a lot of reporting, KPI creation, and KPI evaluation that doesn't require machine learning, and that is also high value, because it helps the business track whether or not some product or program is successful. Machine learning itself can be used for inference or prediction, and I feel like both sides of that equation add a lot of value for different reasons.
When it gets to prediction, I'd say this is where you really want to make an intervention of some sort, or you want to focus on a subset of your overall customers or consumers and either give value to them or help the business, maybe by mitigating the risk of churn. There are just a lot of reasons. I'm not giving a great answer here, and I'm happy to give more detail; if you're willing to submit another anonymous question or something, I'm happy to dive further into that.
Excel, VBA, and moving to open source tools
Bill asks, how big is your team at your new job and maybe break down by roles? Or do people still try to develop new or run old models using Excel?
So I asked because I work at a biotech company, and some guy was giving a presentation of this fancy model about drug pricing prediction and all this kind of stuff. And I said, oh, well, do you use SAS's add-on product that costs mega dollars, or what's this done in? And he said VBA.
So I mean, that's a really good question. And again, my Excel experience is nuanced. I would say 90% of people using Excel are using it as a big calculator: everything ad hoc, just sort of thrown together. Maybe they're using pivot tables, but there's no real modeling workflow to their Excel work.
Then you have maybe 10% of people that are actually using Excel in a more advanced way, where they have an inputs and assumptions tab, which is just straight-up text explaining the purpose of the model and what's going to happen in the subsequent tabs. And you kind of go from your records, where maybe one tab is just straight tabular data, names or identifiers on the left and a bunch of different calculations on the right, and then, going left to right, you get to summaries that are actually consumable for leaders in the business.
And then there's probably 0.1% of Excel users that are using it in a pretty advanced way, whether it's logistic regressions or different types of linear regressions. You can do a lot of Monte Carlo simulations; there's a lot you can do. When you start feeling like the GUI itself isn't sufficient, VBA, or Visual Basic for Applications, is a programming language that plays hand in hand with Excel, and you can actually do a lot with it: automating the workflow, the sequence of calculations, group-bys, whatever. I will say that is not typical of Excel users.
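For readers coming from Excel, the Monte Carlo idea mentioned above translates directly to an open source language. Here is a minimal, hypothetical sketch in Python; the revenue model, its distributions, and all the numbers are made up purely for illustration:

```python
import random
import statistics

def simulate_revenue(n_sims=10_000, seed=42):
    """Monte Carlo sketch: revenue = units * price, both uncertain.

    A seeded generator keeps the simulation reproducible.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(n_sims):
        units = rng.gauss(mu=1_000, sigma=150)  # uncertain demand (illustrative)
        price = rng.uniform(9.0, 11.0)          # uncertain price (illustrative)
        outcomes.append(units * price)
    outcomes.sort()
    return {
        "mean": statistics.mean(outcomes),
        "p5": outcomes[int(0.05 * n_sims)],    # 5th percentile outcome
        "p95": outcomes[int(0.95 * n_sims)],   # 95th percentile outcome
    }

result = simulate_revenue()
print(result)
```

The same kind of simulation is possible in VBA, but in Python or R the distribution functions, summary statistics, and plotting libraries come ready-made.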
It's also very limiting, because with open source programming languages, not only the languages themselves but all the libraries available for them, there's just so much development happening on a daily basis. And for what you described, while it could potentially be done in VBA, if you wanted to scale something that is consumable for the business organization, or for other data analysts or business analysts at your company, you probably want to take the results of those findings and push them into some data warehouse, something that, again, is consumable to a larger organization.
And in order to do that, you need to figure out: OK, where is this code going to live? When is it going to be refreshed? Is the model going to be retrained, and how? What systems are going to help us take this overall logic and put it into a production setting, so that a data analyst or a business leader can access a table on a data warehouse, see the results they want very quickly, and know the results are updated at any point in time?
Centene's data science culture and community
So there are 70-plus thousand people at Centene. It's a massive company; I think they're like a Fortune 26 company. You really don't know the name because, by design, compared to the other health insurance carriers, they wanted to be kind of a portfolio management company, and each state that we operate in has its own brand.
With that, the majority of our analysts, and probably our data scientists and biostatisticians, work for the brands doing plan-specific analyses. Where I'm at now is the corporate group, and we partner with stakeholders in the business at different operating units, or with idea generators that have a business use case that can be scaled for the entirety of our business, not just one plan.
From that perspective, I think there are some differences. I roll up into IT now, and I definitely sympathize with you, because when I was previously at Centene, I was actually working for HealthNet, which is one of these state plans. There's a lot that I wanted to put into production, but there's so much sensitive data in production that corporate helps manage the whole process of getting a model into an actual production setting, whether it's a model or a Shiny app or something like this.
So you can do a lot as a data scientist in a business org, in the development and test environments that are out there. But when it gets to production, there is somewhat of a handoff of ownership, or responsibility, for getting that code into a production setting. That's where there's a kind of handshake or pass-off between the business data scientists and the corporate data science team.
And so out of 70,000 people, we have an ongoing chat of about 500 data analysts and data scientists. There are also clinical informatics people and biostatisticians, and there's a lot of very interesting niche work going on. But the one thing that I really love about this organization is the knowledge sharing; a lot of that happens organically. I love that there's not a lot of shame in asking questions. It's encouraged: if you don't know something, go ahead and ask it.
Prioritizing ML projects and ROI
I see, Eric, you had asked a question a bit earlier in the chat. Do you want to jump in?
It's great to have you on the Hangout, Javier. In my organization and the group I'm in, I would say we're growing into getting AI and ML more entrenched in some of our work. But leadership will hear all the positive buzz from various tech sectors and say, we want to use ML. What I'm wanting your perspective on, if you could share any insights, is how do you approach finding the best use cases for these algorithms without going down potential dead ends, and how do you make sure that leadership understands that ML is not going to magically be the perfect solution for every use case? I don't know if that makes sense.
Yeah, for sure. I'm new to this process, because again, I was sort of a one-off data scientist in a business org prior to joining this corporate team, so I'm still learning about all our systems and everything that's available. But one aspect of this team's work, which I think could be applicable at other companies, is to have a really structured intake process for people that have new ideas. Everyone's excited about ML and AI, but there should be some sort of detailed, educated guesstimate of ROI for the projects that people want to put in. And that's really hard to do.
Even for experienced developers and data scientists. So I would say, if you don't already have some type of intake process where business stakeholders that might have ideas can come in and request some type of company-wide project, then as part of that submission, or as part of a conversation that your team has with them, really try to quantitatively determine what the benefit would be.
I think that is going to go a really long way. I'm still learning about our whole intake process, but a lot of my HealthNet peers, now that they know I'm back, are asking, hey, can we partner on this? And I'm like, we are trying to prioritize the highest-ROI projects, and there is an official intake program. So I'm pretty much just sending them information: hey, here are the forms, please fill them out. And yeah, that seems to have helped this team a lot in scaling out their models and prioritizing which models to actually put into production.
And there's actually a follow-up anonymous question to that too. The question was, what does an ROI timeline look like for most projects? Are you working on things that come to life years down the line?
I actually do not know yet, but I would say a lot of these probably come to life within the same year; I don't think the planning is so far ahead. There are larger IT goals and things in motion that could be multi-year transitions, but I just don't work on those types of projects. The focus of my work is more data science specific.
Starting the R user group at Centene
So I hope I'm not remembering incorrectly, but I'm pretty sure you had a hand in starting the R user group at Centene when you were there previously. And so I was wondering if you could talk a little bit about that, because that's a difficult task: you have to recruit people and get people to meet up and help each other and stuff like that. It's really challenging. So I was hoping you could give us a little bit of a rundown on how that happened.
Yeah, thanks Libby, and great to see you. So I sort of piggybacked off a general data science community chat that we had at the company. There were several hundred people on it, of varying backgrounds and expertise levels, so there was a lot of conversation happening. There was already a Python group that was meeting, I think every other month.
So, me coming in, like three weeks after I started, I got really excited about the possibility of creating something similar for R users. And I think it started by just trying to figure out who owned that already existing data science chat and seeing if they could help support the idea of creating an R user group, something to meet once a month or once every two months. Because I think at larger companies, getting that type of top-level executive stamp of approval and support can go a long way, especially if that individual is part of the already existing IT or data science function.
And so, yeah, at the time I created a blogdown site. For those of you who are familiar with R Markdown, blogdown is a package that allows you to create static websites and static blogs with R Markdown. distill is a very similar concept, and now with Quarto you can do the same thing and create websites. I love the syntax of Quarto. But anyway, what started was, I think, 13 users the first month.
One of them is at RStudio now. I don't know if he's on this chat. Dave Grunwald, he might be here.
Hey, how are you? So, Dave taught me Shiny. I tell him this often: I owe my current career to Dave, because Shiny itself made me such a better overall programmer in the way I think about functions and recycling code. Seriously, thank you, Dave.
But yeah, what started with about 13 users jumped within a few months to about 100 to 125 monthly users on this R-specific monthly meetup. So we had a really great time. Through the partnerships we had with Posit, we were able to get some people to come in and do workshops as well. Everyone was so collaborative, and it was such a great way to see the excitement around what you could do with all things R, and even Python: how can we tap into these robust Python libraries? We had reticulate sessions, co-branded Python and R workshops, where we were looking at ways in which teams working in different languages can actually communicate a lot more easily. And so anyway, I had a great experience with it.
Navigating stakeholder uncertainty and project valuation
I don't know about traditional engineering projects, to be honest. I don't feel like I've been surrounded by that world enough to speak to it. When it comes to the data science projects, at least at the point where we are now, a really careful, objective, unbiased review of what the ask is has been very helpful.
In whatever way you want to do this, and it doesn't need to be just a dollar value like the savings or benefits we're going to have, you can score the different potential projects coming in. Then, project to project, you can prioritize which one is going to have the most impact for the business. Impact can even be measured in different ways: which one is going to have the most short-term impact, and what's the best project for long-term impact? All of these could be different weights in your scoring of these projects, to understand what your team should focus on. But yeah, I would say trying to develop some system or framework where you can prioritize or rate the importance of these new projects is really helpful.
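The weighted-scoring idea described above can be sketched in a few lines of Python. Everything here is hypothetical: the criteria names, weights, project names, and ratings are illustrative, not an actual intake rubric:

```python
# Hypothetical weights: how much each criterion counts toward the final score.
WEIGHTS = {"estimated_roi": 0.5, "short_term_impact": 0.3, "long_term_impact": 0.2}

def score_project(ratings):
    """Weighted score from 1-5 ratings on each criterion."""
    return sum(WEIGHTS[k] * ratings[k] for k in WEIGHTS)

# Made-up project proposals with 1-5 ratings per criterion.
proposals = {
    "churn-model": {"estimated_roi": 5, "short_term_impact": 4, "long_term_impact": 3},
    "kpi-dashboard": {"estimated_roi": 3, "short_term_impact": 5, "long_term_impact": 2},
}

# Rank proposals by score, highest first.
ranked = sorted(proposals, key=lambda name: score_project(proposals[name]), reverse=True)
print(ranked)  # → ['churn-model', 'kpi-dashboard']
```

The point is not the arithmetic, which is trivial, but agreeing on the criteria and weights up front so intake decisions are comparable across proposals.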
Shiny in the workplace and the branded interview app
Did you create a branded test Shiny app in applying for your return to Centene?
No, I didn't. Luckily, I had several apps in production. We were doing some really neat things when I was at HealthNet, so we had several apps in production. It got to the point where managing the pipelines for the different apps was becoming a recurring, more time-consuming process. The more apps we had, the more we found ourselves needing to streamline data intake and transformation for these different apps. And so we created a web app updater that was itself a Shiny app, but was also capable of updating other Shiny apps, or at least the code from other Shiny apps.
That kind of stuff is not typical to see with Shiny apps, at least not the Shiny apps I've seen. So anyway, where I was going with this: the production team I am now part of, the corporate data science team, at least had some samples to chew on of the type of R work, or Shiny work, that I could do. For organizations not familiar with Plotly Dash apps, Streamlit apps, or Shiny apps, being able to show them the speed of a web application like Shiny, and how clean it can look with an advanced UI, definitely goes a long way, in my opinion, toward impressing the people you're interviewing with.
For anybody who didn't see it before and didn't know what that question was referencing: Javier has shared before that in an interview with Bloomreach, he created an interactive Shiny app that used their branding and color scheme. And so I just put that into the chat too, so everybody can see the blog post he made about it.
Yeah. Travis Gerke, I don't know if he's on here today, but he had asked if he could reference this as a cover letter accessory in one of his RConf talks. And I was like, of course, yeah, please do. But at the time, my GitHub repository for it just had a super high-level README: oh, this is a Shiny app that I styled with the Bloomreach theme. So I wrote this blog post in an effort to help people that are less familiar with Shiny, or maybe with R. I tried to write it in a way where people with a basic GitHub understanding could go in there, clone the repo, and try to tweak the app to their liking, to their company.
I was just going to mention Appsilon as well. They're a data science consulting firm, and they make some incredible Shiny apps; they have some beautiful Shiny app examples in the gallery on their public-facing website. The Shiny website itself, and I don't know what the new Posit link for it is, also has a bunch of gallery examples. You can go in there and launch each Shiny app, and you can also see the source code behind each and every one of them. So that's a good way to learn too.
Grad school, function writing, and the journey to ML engineer
So, short and sweet: just wanted to get your opinion on whether grad school was the way to go for opening up the doors to more data science and machine learning type roles, or whether having a one-off background works. For me, I come from kind of a hybrid: my background is more in public health epi, but then I do more programming and clinical-type work, and having that heavy medical background has always been an asset. So I didn't know, from your background, if you felt grad school was the way to get your feet in the door. Was it a benefit, or do you feel like you could have gotten there without it, I guess, is the short question.
Grad school allowed me the time to get where I wanted with the basics of data science and programming. Even if I hadn't had a full-time job throughout that time, I think grad school just gave me a set routine for learning this stuff. I don't actually think you need a graduate degree to get into this type of work, but I lack the discipline to learn all these topics and concepts just on my own.
And for me, I found that grad school really did help push me to learn not just Python and R specifically, but the math that underpins a lot of the data science we're doing now, and how to apply these different algorithms to business problems. Grad school sort of force-feeds that information to you; I think the alternative, learning it on your own, would just be really hard. I'm sure there are some great resources out there, I just don't know of any single resource that would give you such a robust, holistic understanding of data science.
Yeah, for me, I kind of knew the kind of data science work I wanted to get into. I wasn't calling it this, but I was doing time series forecasting; that was sort of my bread and butter in financial modeling. And I knew all these different time series techniques were possible, I just didn't know where to start. And so grad school really helped open the doors to what's possible with code for different types of time series problems.
So it is, you went from Excel to lead machine learning engineer. Can you tell us about the journey? Anything you found surprisingly hard or easy?
Because of SQL, I had a really good understanding of at least how tabular data could be joined and the different transformations that could be done to these data objects. I think I would have really struggled without that basic understanding. But having said that, I think the part where I really struggled at first was function writing. Function writing was not intuitive to me. Basic function writing was, but in general, I found it to be very complicated, and it took a solid three to six months of practice to feel actually comfortable writing functions.
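The kind of SQL foundation credited above, joins and group-bys on tabular data, can be shown with a minimal, self-contained example. The table names and values below are made up for illustration, using Python's built-in sqlite3 module:

```python
import sqlite3

# Two hypothetical tables: members of a plan, and their claims.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE members (member_id INTEGER, plan TEXT);
    CREATE TABLE claims  (member_id INTEGER, amount REAL);
    INSERT INTO members VALUES (1, 'HealthNet'), (2, 'HealthNet'), (3, 'Other');
    INSERT INTO claims  VALUES (1, 100.0), (1, 50.0), (2, 200.0);
""")

# Join claims to members, then aggregate per plan; member 3 has no
# claims, so the inner join drops that row entirely.
rows = con.execute("""
    SELECT m.plan, SUM(c.amount) AS total_claims
    FROM members m
    JOIN claims c ON c.member_id = m.member_id
    GROUP BY m.plan
    ORDER BY m.plan
""").fetchall()
print(rows)  # → [('HealthNet', 350.0)]
```

The same join-then-aggregate pattern carries over directly to dplyr in R or pandas in Python, which is why SQL fluency transfers so well into data science work.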
Even when I started building Shiny apps: basic Shiny is quite easy, but large functions underpin the entirety of a Shiny app. Everything you do within Shiny is effectively writing functions. So the process of learning Shiny and becoming more comfortable with it was very difficult and something that just took a lot of repetitions. But it all played together, because while people think of Shiny as more of a front-end system, it did make me a much better programmer in the way I thought about functions and function writing.
Other things that I found hard? Looking back, I'm sort of embarrassed to say this, but reproducibility of machine learning was not super intuitive either – being able to reproduce a code set and get the exact same predictions every time. I wasn't quite sure why it wasn't working, or how to create these fixed views: setting a seed, or whatever you need to do to ensure that someone else downstream could replicate your study or analysis and get the exact same findings themselves.
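The fix being alluded to – pinning a random seed so a pipeline gives identical results on every rerun – can be sketched like this. A hypothetical example in Python rather than R (where the equivalent starting point is `set.seed()`):

```python
import random

def reproducible_split(data, seed=42):
    """Deterministic 80/20 shuffle-and-split: the fixed seed is what lets
    someone downstream rerun the pipeline and get identical results."""
    rng = random.Random(seed)   # local RNG, independent of global state
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

train1, test1 = reproducible_split(range(100))
train2, test2 = reproducible_split(range(100))
assert (train1, test1) == (train2, test2)  # identical on every run
```

Without the fixed seed (or with any other hidden source of randomness, such as unordered data or parallel execution), two runs of the same code can produce different splits and therefore different predictions.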
IDEs, what's next, and bridging Excel and data science
And somebody asked anonymously on Slido: what IDE, if any, are you and your colleagues using for development work? RStudio, Emacs, Vim, VS Code, Notepad?
I think all of the above, honestly. I use RStudio every day, so I'm in the RStudio IDE. I also seem to be in the terminal a lot these days, in the shell writing bash commands and whatnot. But yeah, the RStudio IDE is definitely where I spend most of my time.
I know that you're only two weeks into this new role, but if you think of the next year ahead, what are you most excited about? Or what made you most excited about this role?
For me, I'm really excited about the challenge that's going to come with becoming a better overall software engineer – becoming better at programming at large, not just R specifically. I'm constantly humbled by everyone I work with and their breadth of knowledge across all these different systems. I've touched a lot of these systems for MLOps or ML engineering, but being able to really dive deeper into some of these platforms to get production jobs out in any language – I'm really excited about that challenge and the growth and learning opportunity.
I just happened to see a question: is anyone using VS Code for R and Shiny? I've tried this, and I still feel like RStudio is the gem for coding in R. But I do really like VS Code for Python.
Thanks, Javier. This is Daniel – I asked that question. I write a lot of R in my position as well, including a lot of Shiny applications, and I've started to think about VS Code because I write my Python and Postgres in VS Code and would like all of that to be together. But as you mentioned, VS Code is not ideal for R – RStudio really is the best place for writing R code these days – so I was wondering where others are on that.
And in the few minutes we have left: I know there were a lot of questions and comments at the beginning about Excel and data science. Circling back to that conversation, what have you found most effective for bridging the gap between those two sets of users? Some people are probably always going to stay in Excel, but you might need to work with them as well.
Data extraction. If you're working at a company that's large enough – it doesn't need to be a large company, but at larger companies you're tapping into databases, not just operating in a world of Excel or CSV files. Showing how you can stay within the same R notebook or framework from data collection onward – pulling the data straight into your environment, manipulating it there, and writing it out or summarizing it as an HTML file, like knitting an R Markdown document to HTML, or a flexdashboard, something that's still static but interactive – that has gone a long way. And the speed of everything – the speed of data manipulation and handling for millions or even hundreds of millions of rows – always shocks people. If you've got a lot of columns in Excel, after about three or four hundred thousand rows Excel is crawling and eating up your entire available RAM, whereas with something like Python or R that's definitely not the case.
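The pull-transform-summarize pattern described above can be sketched with Python's standard-library `sqlite3` standing in for a company database (the table and column names here are invented for illustration):

```python
import sqlite3

# Stand-in for a corporate database: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (member_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [(1, 120.0), (1, 80.0), (2, 50.0)])

# Pull and summarize in one step, straight into the analysis
# environment, with no Excel/CSV export in between.
totals = conn.execute(
    "SELECT member_id, SUM(amount) FROM claims "
    "GROUP BY member_id ORDER BY member_id"
).fetchall()
print(totals)  # [(1, 200.0), (2, 50.0)]
```

The same end-to-end flow – connect, query, summarize, render a report – is what an R notebook gives you with packages like DBI and dplyr, and the summarization happens in the database or in memory rather than in a spreadsheet.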
Resources for staying current in data science
But, Javier, before you go: at the beginning you mentioned that one of the things you like to do for fun is keeping up to date on everything going on in the data science space. I was curious if there are resources, or people you follow, that you'd like to share with us all.
Let's see. The Tidyverse blog, for staying up on news related to tidymodels or the Tidyverse – that's one of my favorite resources. The RStudio AI blog, now the Posit AI blog, is another good one, for following the incremental developments with torch – native Torch for R – and luz, which is sort of like the Keras for torch. That stuff really excites me.
Twitter has been amazing. Keeping up with a lot of the R-related hashtags – #rstats, #shiny, and others like that – has been a great resource. Then there's R-bloggers, and R Weekly – R Weekly is a great publication; you can subscribe to their RSS feed and get a weekly download of new packages, updates to existing packages, and new tutorials. Eric Nantz is one of the co-hosts of the R Weekly podcast. There's just a plethora of resources out there.
Awesome. Thank you so much, Javier, for joining us and sharing your insights this week – and in other weeks as well, when you're on the audience side. I did want to let everybody know, since this comes up sometimes when there are great comments and resources in the chat: you can save the chat. If you press the three dots in the right-hand corner, you can save the chat. I will also try to group the resources to share with the recording when it goes up on the site.
Thanks, Rachel. Yeah, if anyone wants to contact me, feel free to reach out on LinkedIn – that's probably the best place. Twitter as well, but I don't use Twitter as much.
Awesome. I know you have to jump, so no worries if you have to leave us here. I'm going to share your LinkedIn in the chat as well. All right, great – thank you so much.
Yeah, thank you, and I hope to see everybody back next week. We will be joined by J.J. Allaire, CEO and founder of RStudio, now Posit – maybe he heard that as he's walking by – and that will be our last Hangout for 2022. It was so nice to see you all this week. Have a great rest of the day.