Resources

Emily Riederer @ Capital One | Explicit design at the start of a project | Data Science Hangout

We were recently joined by Emily Riederer, Senior Manager - Customer Management Data Science & Analytics at Capital One. We discussed how a strong foundation in high-quality data infrastructure and reproducible tools sets the stage for innovation in modeling, causal inference, analytics, and so much more.

Diving into a question asked at (42:38): What is your thought process for solving a problem that you don't know how to solve immediately?

One thing that I think is a really undervalued part of that process is thinking about how you will know a good solution when you find one. Also, how would you know if there was a good solution staring you in the face and you already had it? The more unstructured and complicated a problem is, the more deceptive it can be about what's good, which can have one of two bad outcomes: you find a good solution, but you don't realize it's good, so you keep going; or you spend a lot of time chasing after an outcome, and only then do you realize, I solved the problem I was trying to solve, but it wasn't the problem I wanted to solve.

Something I've really been experimenting with in my own work is having a lot more of an explicit design stage at the beginning of a project and thinking, how can you do a pilot? If I'm trying to predict some target, can I take the true values of that target and plug them into the downstream problem I actually thought I was going to solve, and make sure that's actually what I want to solve? It's almost like front-loading model evaluation, where even a fake solution is the first step versus the last step.

Then I'll tack on one other point. I think the other aspect of that, going back to that level of abstraction, is figuring out how to take the context out of my problem to make it something more Googleable.
So I mean thinking, not being like, "oh, this experiment, the random seeds were wrong, so I don't have a control population, what do I do?" but backing that into more of a general question: "how do you sample a synthetic control from observational data?", which is something you can Google and then find a ton of resources about. I think it's pushing myself on what I want, and then finding the right framing at which to ask for help.

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu

Follow Us Here:
Website: https://www.posit.co
LinkedIn: https://www.linkedin.com/company/posit-software
Twitter: https://twitter.com/posit_pbc

To join future data science hangouts, add to your calendar here: pos.it/dsh (All are welcome! We'd love to see you!)

Apr 28, 2023
59 min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the Data Science Hangout. Hope you're all having a great week. I'm Rachel Dempsey. If we haven't had a chance to meet yet, I lead our pro community at Posit. And so the Data Science Hangout, if you've never been before, is our open space to chat about data science leadership, questions you're facing, and getting to hear about what's going on in the world of data across different industries. And so it happens every Thursday at the same exact time, same place. So if you're watching this recording sometime in the future on YouTube later, the link to add it to your calendar will be in the details below.

Together, we're all dedicated to making this Hangout a welcoming environment for everybody. And so we love hearing from everyone, no matter your level of experience or area of work. And there are always a few ways you can jump in and ask questions or provide your own perspective too. So you can jump in by raising your hand on Zoom. You can put questions in the Zoom chat. And feel free to put like a little star next to it if you want me to read it for you instead.

I will say, Emily, I have wanted you to join us here for so long. And we've also had a number of requests to have you on. Emily is Senior Manager of Customer Management, Data Science and Analytics at Capital One. Now, Emily, let's maybe get started by just having you introduce a little bit about your role and what you do and maybe something you like to do outside of work too.

Emily's background and career arc

Yeah, definitely. And thank you so much, Rachel. I've been at Capital One for about the past seven years. Over that time, I've held a number of different roles. But fortunately, both the data science toolbox and specifically R have really been, like, a through line for me throughout whatever I've been doing. I spent some time in the more strategic analytics and modeling spaces, solving kind of very specific business problems related to our card business.

I'd say overall, I think my career has pulled me kind of, like, progressively further and further up the data stack. No matter what downstream problem I'm really motivated by solving, I always realize there's some roadblock just one level higher. And I get really excited about, like, well, how do I just solve that one pain point, and that one? So after starting out more in that first-line data science space, I moved a little bit further upstream when I got really passionate about inner source tooling, and bringing the best of what I saw from the outside R community into my company.

And building internal packages to help us solve common business problems like customer lifetime value modeling, setting and forecasting KPIs, and even connecting to just different enterprise systems. And I really enjoyed kind of, like, building out both those packages, but also really figuring out that you needed to also build out an inner source community around them. Then, kind of continuing that trajectory upstream, I moved progressively into also adding in elements of analytics engineering and data engineering, to tackle kind of standardizing, harmonizing, and breaking silos in that raw data input that would then flow through the tools and the ultimate data products.

Awesome. And what's something you like to do outside of work too? Big runner and big reader.

Inner source at Capital One

I love that you just started off with talking about inner source there, and that's something that actually came up on a Hangout, maybe that was last year, with Zach Garland at MasterCard. And I'm curious for all of us here too, like, what does inner source really mean? And can you explain a bit about what that community looks like at Capital One?

Yeah, definitely. So in some ways, it's almost a funny thing to explain, because I think to the open source community, it sounds so natural. It's almost like, why does this concept require its own name? But I think sometimes, like, in large corporations or just in different environments, you don't just have that nice synergy where someone sitting halfway across the world from you has happened to already solve your problem and put the answer up on GitHub. And so there can be a lot of, I think, redundant work, both in one person's work, like moving from project to project, or different teams solving, like, very similar problems, maybe, like, sometimes reinventing the wheel on even the framework for solving those problems.

So inner source is just really the idea of kind of both building kind of tools and an internal code base that multiple teams can contribute to over time. And as I mentioned, like that only also works if you like try to build up that same sort of community and enthusiasm and inclusivity and knowledge sharing around it to support it.

Yeah, definitely. So yeah, a lot of, I think you can, like, think about the definition pretty expansively. But I think the place I've definitely focused on it most is in the R package space. So hopefully, like, one package for one thing, and then everybody across Capital One is using it for that.

Yeah, that's always been, like, the goal. And I think the other really cool thing about that is really digging into, like, the levels of abstraction. Now, you obviously never want to restrict different teams from, like, solving problems in different ways, because different parts of the business inherently should be customizing and tailoring. But it becomes a really fun problem thinking, you know, what should be standardized, like, not everyone wants to write the glue code to, like, connect to a database and get through a proxy. But where are the right degrees of freedom, where, like, one optimization method might be truly better for one type of problem than another?

And then it also kind of, I think, helps you aggregate those business use cases, and create that kind of body of knowledge of, like, these are how different teams have, like, solved these different problems, why they needed to, like, layer in different tools. And, like, it becomes kind of a knowledge store, I think, for some of the complexity of any given problem.

Yeah, no, I mean, I think definitely, as best you can. I think, like, a lot of companies do use, like, a version control system, like GitLab or GitHub. So, like, obviously, the first-step minimum bar is just, like, getting it out there. But I think there's a great little O'Reilly book on InnerSource, where one of the chapter headlines is literally, like, just because you use GitHub doesn't mean you're doing InnerSource. I think in open source, we are sometimes spoiled by the extent to which it is, if you build it, they will come. They will not always come internally.

I think in open source, we are sometimes spoiled by the extent to which it is, if you build it, they will come. They will not always come internally.

I think, like, corporate incentives are different. Sometimes the culture is people just have no expectation that if they look, they might find something. So I think definitely, there's a little bit more of shoe-leather, beat-the-pavement work in terms of talking to other teams, trying to go present at internal forums. But also, again, building up that community with things like, I leaned very heavily on, like, Slack channels to kind of, like, bootstrap an internal version of, like, Twitter, to try to get those just, like, serendipitous, async conversations going.

Current projects and data infrastructure

Yeah, I think a pretty wide range of things. Something that's really been top of mind for me lately is just, like, how you can really shape that kind of, like, raw data layer, and make it, like, the most usable for kind of, like, the last-mile analysis. So, one broad area of interest that I'll say has been really on my mind is thinking about how to structure raw data into, like, really meaningful... you could almost call it, like, a metrics mart or a feature mart, although it's not quite the same thing, but just, like, how you can embed kind of, like, more, like, context about a business problem in individual data elements, to have, like, more of those off the shelf for use across metrics tracking, and across, like, modeling and kind of, like, feature engineering.

Communicating inner source packages across a large org

Yeah, I mean, I think definitely, in an enterprise setting, I think I have also leaned more on those, like, personal relationships to get the word out than you would necessarily in the open source context. So both in terms of, I think, kind of, like, leaning on leadership to, like, help identify people that may be working on similar projects or patterns kind of throughout the company. Or ideally, I think, having a forum where you're intending to kind of, like, share updates, not on an individual project, because it's like, how do you get people to come to that forum initially? But if you can get people interested in the idea of, like, technical knowledge sharing at a higher level, aggregating kind of, like, many types of information in that one space... I mean, again, to, like, make an open source analogy back, kind of like an internal rOpenSci analog, or an internal R Weekly-type newsletter. I think it can be easier to, like, sell and socialize the broader concept than maybe an individual, like, artifact thereof.

Unexpected parts of the job

Oh, my, like, I, if anything, I feel like that's truly the story of my career arc. Went from, like, math to statistics, because I wanted to, like, do math for the real world, but then found, like, the shift from, like, deductive to inductive reasoning, like, not at all like math. Then went from stats to data science, because I wanted to, like, use the statistical toolbox to, you know, like, solve these, like, beautiful probabilistic-type problems, but then found, like, all of my challenges were more in the, like, data and tooling layer.

And so, I mean, I think it's a, this may be a kind of generic answer, but I think, like, being obsessed with, like, not only like data quality, but even upstream of that, like, how is data getting collected? How do the source systems work that are processing the customer records that are then spitting out the data? I think, like, I definitely never expected myself to, like, keep going, like, further upstream and further back into more of a like, technical stack, as opposed to, like, proving a nice little theorem, which is where I started out.

I think it's definitely something where a lot of my mindshare goes, but that I find is, like, super helpful for the last mile. Because even if you know some, like, really weird quirk in your system of, like, oh, we were never able to, like, email these customers for x reason, it's like, suddenly, then you also know, like, a population that, like, something happened to for a systemic reason. It could fuel some, like, really interesting analysis.

So, I think the, both going upstream, but then being able to, like, round trip back downstream are probably two things that I never would have anticipated.

Yeah, cool. It's funny you share that. I recently was told that data scientists were snobs, like, we're only happy with the tail end of the data stream. I quickly had to kind of, like, snap my head around and say, well, I think, if anything, we're interested in the holistic health of data, like, from start to finish, literally nurturing it from its inception all the way to the end.

Not at all. And yeah, I love that. I mean, to me, I think that's also part of what really puts the, like, science in data science: seeing the whole system. I forget, someone on the more ecology side of our stats Twitter once commented on something I wrote, like, oh, like, the data generating process, like, I have all these same concerns as you do. It's just, for me, it's when I fell out of a boat and I dropped my notepad. Very different, like, cause of missing data, but potentially similar outcome.

Contributing to the open source community

Oh, my goodness, so, like, so very many. I think, first and foremost, obviously, it's just, like, all of the amazing, like, relationships you form, people you get to meet, and being a fly on the wall of, like, conversations that, like, I'd never get to hear. I'd never get to hear, like, epi professors talk, or biostats professors talk, or econ professors talk in my day-to-day life, but just being able to continue to, like, pull, and I think this is especially unique to R, from all these, like, really rich interdisciplinary communities and steal the best ideas, and then take them back to my day job, and then I look like I, like, came up with a new idea, and I didn't. I'm just stealing like an artist.

In terms of, like, contributions, I think something that I found a very surprising benefit is the way that it's made me have to think a lot more critically and a lot more formally about what I was doing anyway and otherwise. I think a lot of the things that I found that I can, like, really lean in to talk about, it forces me to, like, think about the problems I'm facing at that higher level of abstraction and think about, like, what have I learned from this that's not just true to my last project but is true to my next project and is true to other people's projects? So I think it's definitely a fun intellectual challenge, almost like taking a final exam in school to kind of, like, force you to formalize your own knowledge.

dbt and integrating with R

Yeah, no, I'm still, and, like, maybe I'll just take a step back and, like, explain for a minute, like, kind of what dbt is for those that aren't familiar. dbt is a framework for building data pipelines in SQL, and what it kind of nets you is a lot of the same benefits that I think we all value with, like, reproducibility in R, which tend to be harder with databases. And with dbt, you can do things like dev/prod deployments, you can do better testing, it helps you modularize your code better, so you can, you know, have easier version control, easier code reviews. And with some Jinja templating, it makes SQL more like a higher-level kind of programming language with legitimate control flow. Definitely, if you work on databases, I'd recommend checking it out, because I think there's, like, a lot of spiritual similarity with just things this community, like, tends to value and get excited by.
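For readers who haven't seen it, here is a tiny standalone sketch of the Jinja-over-SQL pattern that dbt is built on. dbt itself adds much more (dependency graphs, testing, materializations); this just shows the "control flow in SQL" idea using Python's jinja2 library, with made-up table and column names:

```python
# A minimal sketch of Jinja-templated SQL, the core mechanic behind dbt.
# The table ("transactions") and metric columns are hypothetical examples.
from jinja2 import Template

sql_template = Template("""
select
    customer_id,
    {%- for m in metrics %}
    sum({{ m }}) as total_{{ m }}{{ "," if not loop.last }}
    {%- endfor %}
from transactions
group by customer_id
""")

# Looping over a list of metric names generates one aggregation per metric,
# with loop.last suppressing the trailing comma -- real control flow in SQL.
rendered = sql_template.render(metrics=["spend", "payments", "fees"])
print(rendered)
```

Templating like this is what lets one parameterized model stamp out many similar queries instead of copy-pasting near-identical SQL.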

I continue to be, like, kind of a big fan of using both tools. I'd have to say I've never really found that sweet spot of exactly how to integrate them. I think I still tend to have, like, dbt pipelines running more in the database layer, but kind of switching more into R mode for, like, kind of analysis, reporting, modeling, etc. I've always felt like there had to be some, like, linkage, especially with something like dbplyr, but have not gotten there yet, and would be, like, fascinated if anyone else has, like, played with both and has any takes there.

What has kept you at Capital One for seven years?

That's a great question. I know, especially in this day and age, I think it's, like, the less common route to stay at a, like, single company for a long time. I think there are definitely, like, pros and cons to it. I think definitely the thing that's kept me motivated is being able to, like, continue to, like, expand my scope, expand my skill set, both in terms of, in such a large company, having different areas of business to move around to, so I'm still kind of, like, thinking about different, like, strategic issues throughout my career, somewhat similar to, like, changing companies, as well as kind of growing out my technical skills in different dimensions.

So, I think I'm definitely somebody that, like, gets very edgy if I don't feel like I'm continuing to learn, but I think I've found ways that, like, I really feel like I continue to be able to do that in my career. And secondly, this is a very, like, silly reason, definitely not a career-defining one, but going back to InnerSource, I will warn you, like, packages are kind of like children, and there have been times in my career, even where I've been offered a different role internally or something, and I've been like, oh, but then... I don't know how to say goodbye.

AI and the future of data science

Yeah, no, that is such an interesting issue right now, and I feel like definitely, I know a lot of very strong opinions on both sides of the spectrum of whether it's going to change everything or nothing. I think through the type of work I do, and the type of, like, kind of values, per se, that I have about data science, I think there are, like, large parts of it that I don't expect it to change a ton, or I almost hope it doesn't change a ton.

I know, like, I think even there are a lot of companies and tools right now that are working on, like, automated natural-language query generation, which, on one hand, like, seems really cool. Yeah, I'd love to be able to let anyone ask a question and get an answer, but then there's, like, I think the data quality, like, part of my head where the fire alarm goes off. Like, there may be an AI that's, like, smart enough to, like, write a sonnet far sooner than there is one that's smart enough to think, based on the kind of conditional probability this person wanted, should I cast the nulls in this table to a zero, or should I leave them as nulls, and what should be in the denominator?

And, you know, I mean, I think, if anything, like, data quality and data bias will save us, because I think those are so hard for humans to ever be precise enough in language about what they mean. And I feel pretty confident that's going to remain a, like, solidly, like, human problem for a much longer time. And then similarly, on the last mile, like I mentioned before, I think I'm very passionate about kind of, like, thinking about feature engineering, curating, like, both automated solutions, but also, like, kind of understanding and bringing, like, context into data science, whether you think about cost-sensitive loss functions or, like, model evaluation beyond traditional metrics.
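The nulls-and-denominators dilemma Emily describes can be made concrete with a toy example (all data made up): whether a missing value means "the event truly didn't happen" or "we couldn't observe it" changes both the numerator and the denominator of a simple rate, and only business context can say which reading is right.

```python
# Hypothetical per-customer click counts; None = value not observed.
clicks = [3, 0, None, 5, None, 1]

# Interpretation 1: missing means "no clicks happened" -> cast nulls to zero.
as_zero = [c if c is not None else 0 for c in clicks]
rate_if_zero = sum(as_zero) / len(as_zero)       # 9 / 6 = 1.5

# Interpretation 2: missing means "not observed" -> drop those rows entirely.
observed = [c for c in clicks if c is not None]
rate_if_dropped = sum(observed) / len(observed)  # 9 / 4 = 2.25

# Same raw data, two defensible answers -- 50% apart.
print(rate_if_zero, rate_if_dropped)
```

Neither number is wrong in the abstract; which one answers the stakeholder's question is exactly the kind of context a query generator can't infer from the table alone.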

I don't even know if, like, data science will always be a career, so much as, like, data science is, like, kind of an expectation of, like, something professionals should be able to apply to the problems they have at hand, like manipulating an Excel spreadsheet. And similarly with AI, I think there are probably some things, like, in a legal setting, like, parsing a ton of documents, generating some hypotheses, like, helping maybe structure a preliminary screen for, like, legal discovery or something. Like, I think there are some kind of assistive things AI can help with, but I, like, I personally, I think, like, don't love the idea of a world where it's, like, AI making all the final decisions, so I think there's plenty of space at the table for both.

Solving problems you don't know how to solve

That's a really good question and a really, like, prudent question lately, I think, for a couple of different reasons. One thing that I think is a really undervalued part of that process is thinking about, like, how will you know a good solution when you find one, and how would you know if it was a good solution, like, staring you in the face and you already had it? I think the more unstructured and complicated a problem can be, the more deceptive it can almost be about what's good, which can have, like, one of two kind of, like, bad outcomes: either you find a good solution, but you don't realize it's good, so you keep going; or you spend a lot of time, like, chasing after an outcome, and then you get there, and only then do you realize, like, I solved the problem I was trying to solve, but it was not the problem I wanted to solve.

So, something I've really been experimenting with in my own work is having a lot more of an explicit, like, design stage at the beginning of a project, and thinking, like, how can you do a pilot? How can you, like, you know... if I'm trying to predict some target, can I take those, like, true values of that target and plug them into a downstream problem I, like, actually thought that I was going to solve, and make sure that's actually what I want to solve? And almost kind of, like, front-loading model evaluation, where even, like, a fake solution is the first step versus the last step.
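As a rough illustration of that "pilot with a fake solution" idea, here is a hedged sketch: the payoff numbers, the decision rule, and the threshold are all hypothetical, but the shape of the check, plugging the true target values into the downstream decision before building any model, is the point.

```python
# Oracle pilot: before modeling, pretend predictions are perfect and see
# whether the downstream decision would actually improve on a naive baseline.
# All values and the decision rule below are made up for illustration.

def downstream_decision(predicted_value, threshold=100):
    """The action a prediction would drive: contact high-value customers."""
    return predicted_value >= threshold

true_values = [250, 40, 130, 90, 300]  # hypothetical true targets
payoffs     = [50, -5, 20, -5, 80]     # payoff of contacting each customer

# Plug the TRUE targets (an oracle model) into the downstream decision.
oracle_payoff = sum(
    p for v, p in zip(true_values, payoffs) if downstream_decision(v)
)
# Naive baseline: skip modeling entirely and contact everyone.
contact_all_payoff = sum(payoffs)

print(oracle_payoff, contact_all_payoff)  # 150 vs 140
```

If even a perfect predictor barely beats "contact everyone," then predicting this target will not solve the problem you actually care about, and you've learned that before building anything.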

And then I'll tack on one other point, but I think the other aspect of that, just, again, going back to those levels of abstraction, is figuring out how to take the context out of my problem to make it something more Googleable. So, you know, I mean, thinking, like, not being, like, oh, this experiment, the random seeds were wrong, so I don't have a control population. What do I do? But then, like, backing that into more of a general question: how do you sample a synthetic control from observational data? Which is something you can Google and then find a ton of resources about. So, I think it's pushing myself on what I want, and then finding the right framing at which to ask for help.
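As one illustration of where that Googleable reframing leads, the sketch below implements the simplest answer one would find: nearest-neighbor matching on a single covariate, with made-up data. Real synthetic control methods (for example, weighted combinations of donor units) go considerably further; this is just the entry point.

```python
# Simplest form of sampling a control group from observational data:
# nearest-neighbor matching on one covariate, without replacement.
# All units and covariate values are hypothetical.

treated   = [("t1", 0.30), ("t2", 0.72)]                  # (id, covariate)
untreated = [("u1", 0.10), ("u2", 0.35), ("u3", 0.70), ("u4", 0.95)]

def match_controls(treated, pool):
    """For each treated unit, pick the unused pool unit closest in covariate."""
    remaining = list(pool)
    matches = {}
    for tid, x in treated:
        best = min(remaining, key=lambda u: abs(u[1] - x))
        matches[tid] = best[0]
        remaining.remove(best)  # match without replacement
    return matches

matches = match_controls(treated, untreated)
print(matches)  # {'t1': 'u2', 't2': 'u3'}
```

Searching the general phrasing surfaces the whole family of techniques (propensity scores, Mahalanobis matching, Abadie-style synthetic controls) that the seed-specific framing never would.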

Something I've really been experimenting with in my own work is having a lot more of an explicit, like, design stage at the beginning of a project. And almost kind of, like, front-loading model evaluation, where even, like, a fake solution is the first step versus the last step.

Communicating with non-technical stakeholders

Yeah, no, that's always an enduring hot topic, and for very, very good reasons, I think. I think, like, a couple of different approaches that I tend to use there are, A, like, leading with the solution versus kind of, like, leading with impact versus leading with that, like, kind of process. And I think as many people know, I'm a, like, huge R Markdown fan, and maybe now I should say Quarto, but I've actually, like, come to the realization that I kind of need to even change my workflow for working with a tool like that because, like, inherently, like, an R Markdown workflow is kind of very linear of, like, I cleaned my data. I did the analysis. I got the outcome. But then from a storytelling perspective, I think you can really, like, get the hook and get the buy-in to, like, keep that conversation going if you lead more with the outcome of, like, this is why what I'm about to tell you is important.

And then some people want to really dig deeper and understand the mechanics. Some people won't, and I think it's also, as a, like, data person, learning to accept that that's okay that they don't. For all of us, it can really, like, be about the journey versus the outcome, but accepting that for, like, leaders, reasonably, it is often about the outcome. But then when I want to go into those details, I try to, like, lean heavily on, like, metaphors and also, like, diagrams or one really compelling plot, just anything to make it more tangible versus, like, purely conceptual.

Yeah, no, I think that's a great point of tools versus analyses, because tools are so inherently abstract, which can make it harder. But I do think, probably, how I've translated it: leaders inherently, I think, care a lot about the people on their team, hopefully from the genuinely caring perspective, but at least from the managing capacity perspective, bare minimum. And I think kind of leading with the case, I think most tools, you can also bring a human dimension into that story, whether it's such a pain point that people are doing this manual work every single month, and that's inhibiting both them doing more interesting projects and their career development, but there's a way to automate this thing.

And there's a broader framework that I really like for thinking about this called the jobs to be done framework that kind of comes from the product management world, and essentially what that says is you can think about a tool as someone you're hiring to do a job. And I've always liked that, kind of framework for, like, thinking about the interview, thinking about, like, why should I, like, bother to, like, hire and onboard you, and is this tool more, like, almost going back to the AI discussion, is this, like, an intern that I only want to use with a very high level of supervision, or if you're thinking about CICD and process automation, it's, like, am I hiring an executive or a contractor to just go run this process for me in an abstracted way.

So I, like, definitely feel really lucky to be in an environment that's, like, pretty much always been very, very friendly to open source, and I think that is, like, a trend I've seen across the industry. I've had the opportunity also to do some consulting projects with other people, more in the biostats and pharma fields, taking that leap from proprietary options to the R world. And especially, I mean, I can only imagine, I think, the intersection of data tooling with the current, like, economic conditions, I have to imagine can only, like, kind of accelerate that trend, when it's becoming an increasingly, like, easy-to-hire skill set and a much more attractive option for the bottom line.

Yeah, no, I mean, I think there's three aspects I'd say to that. I think first, definitely, like, can be really helpful to understand, like, a company's tech stack when you're interviewing and the amount of data they have available that you don't want to be, like, hired to be a data scientist and then show up on day one and have them be, like, oh yeah, here's a Google Drive with some CSVs in it, like, go wild. You want to be sure they have, like, enough data to support analysis and that either you'll be empowered to have the tools to build out the data you need yourself, or maybe they also have, like, data engineering or other job functions.

Secondly, I think it's helpful to be crisp about what you want to learn and grow on in a role. Like, data science can be such a nebulous job title these days. You know, I mean, I think someplace, like, it can mean anything from BI, experimentation, modeling, machine learning, and I think really just clarifying your interests and then just being able to, like, articulate them. Like, companies, like, do not have any incentive to, like, hire you for a role you don't want, so I think, like, kind of being able to, like, share what you're looking for really just helps the matchmaking.

But finally, I think it's also really good to recognize, like, you can learn a ton in pretty much any role. Spending a lot of time on data processing, taking that just as an example: that is sometimes, like, the hard part, the complex part, the part that, like, still requires a lot of the data science skill set of understanding, because I do understand the stats, because I do understand the algorithm, how do I, like, structure this problem in a way that the algorithm can understand the problem? And, like, at the end of the day, I think it's a funny conceit that in school we spend most of our time learning, like, hyperparameter tuning and typing that, like, model.fit, model.predict. That often isn't the hard part or sometimes even, like, the most interesting part, so I think being open with that, like, kind of growth mindset of, like, whatever job you end up in, there's going to be, like, a ton to learn and a ton of really interesting work to be done.

I think it's a funny conceit that in school we spend most of our time learning, like, hyperparameter tuning and typing that, like, model.fit, model.predict. That often isn't the hard part or sometimes even, like, the most interesting part.

I know we just got to the end of the hour here, and I'm sorry if we didn't get to answer everybody's questions. Emily, what is the best way for people to stay in touch with you? Is it through your website or LinkedIn or GitHub? Honestly, I think wherever people are these days. I'm still, like, for now, I'm still on Twitter. I'm still spending far more time there than I should. LinkedIn, GitHub, my website has my contact information, same with my email. And, like, I have the fortune of having, like, an unusual enough last name that my handle on, like, LinkedIn, GitHub, Twitter, my Gmail, everything is Emily Riederer at wherever. So, yeah, please, like, don't hesitate to get in touch. Like, I just love getting to meet more of the community.

But thank you so much, Emily, for joining us today and for sharing your insights and experience with all of us. This was awesome. Oh, thank you. This was a lot of fun.