Resources

Emily Riederer: Column selectors, data quality, and learning in public

Emily Riederer writes Python with an R accent, and we're all comfortable with it. In this episode, Emily reflects on her journey through R, Python, and SQL — from lessons learned averaging default values (oops, we're not all rich!) to discovering that column selectors are way cooler than they sound. She weighs in on the delicate art of learning in public, why frustration often makes the best teacher, and how to find your niche by solving the boring problems. Oh, and the crew casually drops that she's keynoting posit::conf(2026)! Emily's had a wild ride through modeling, data engineering, machine learning, and back again, and she knows a thing or three about the evolution of SQL tooling (from nightmare multi-page scripts to the dbt renaissance). She reveals how building internal packages became her gateway to making work enjoyable. Plus: the surprising Stata origins of column selectors, the eternal struggle of naming packages across R and Python, and why watching people code teaches you more than any tutorial ever could. The conversation gets real about imposter syndrome and the magic of tacit knowledge.

IN THIS EPISODE

• Why real-world data is chaos, not truth
• The path from modeling to data engineering (and back)
• What a data pipeline really is (extract, load, transform) and why organization matters
• How dbt changed the SQL game
• Learning by watching: tacit knowledge and coding over the shoulder
• Imposter syndrome and learning in public
• Building internal tools to escape busywork
• posit::conf(2026) keynote preview

Jan 27, 2026
1h 1min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the test set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we're joined by Emily Riederer, who I think has the distinction of living in that sweet, sweet overlap between Python, SQL, and R harder than anyone else I've ever seen, and is a data science manager at Capital One, and apparently listens to the test set, but is not planning to listen to this episode. Welcome to the test set. I'm Michael Chow, and I'm joined by my co-hosts, Wes McKinney, who's a principal architect at Posit, and Hadley Wickham, who's a chief scientist at Posit. And I'm so excited to be here with Emily Riederer, who is a data science manager at Capital One, and I think a sort of like icon in the R, Python, SQL community for just putting out so much interesting work in this intersection of Python, SQL, and R.

So like, recently the article Python Rgonomics, and some talks around that, of how to have a workflow that R users love in Python with things like polars; and dbtplyr, which was a plugin for the SQL framework dbt, which was a really interesting sort of cross-section of ideas. So Emily, thanks for coming on. So happy to have you.

Oh yeah, thank you so much for having me. I've loved the pod so much so far. I have not missed a single episode. Yo, I'm so glad. Honestly, the biggest downside of being on a podcast is I was thinking about it, and I'm like, oh, the next one that comes out, I'm not going to want to listen to. Yeah, are you going to listen to your own? I feel like, oh my goodness, you can't break the streak, you know? Never, never, never, never.

Emily's journey into SQL

So I think people will be really interested to hear sort of your journey through languages like R, Python, SQL, and the things you've put out. But one interesting question I thought might be good to open up with is maybe just on the role of SQL kind of over the years. Since I know you started doing a lot of R work, and folks here have done like a lot of work in SQL, whether it's like dbtplyr, which translates like dplyr and R to SQL, or ibis, which translates Python to SQL. Emily, I'm really curious to hear like, what was your journey into SQL like?

Very abrupt in some ways, because I think if I think about my educational background, I was in a relatively like theoretical stats program, so could prove like asymptotic convergence of a lot of things, had very little experience with real world data, until I get one of my first internships, I'm asked to make a customer profile, I take the average of customer incomes, and I'm like, we're all rich, but we were not rich, I was just averaging in all the 9999 encoded default values. So that was my first introduction, both to databases, to SQL, and just like the vagaries of real world data. And I think that had been like, was also kind of jarring, because I was under this whole illusion of like, oh, data is this like, ground truth, where we can like figure things out about the world. And then I entered into this world where like, data is this like, kind of source of chaos that you need to control.

But then, kind of coming into industry, I think SQL was one of the main tools of the trade, the only way to access your data before you could get in there with something like R or Python. And pulling from just kind of a single brush I'd had in college with database design and data modeling and normal forms, I think just something about that, combined with the tidyverse, just kind of clicked in my mind of, there's actually a real art to how you set this thing up, how you get the data out, that can really set you up for success.

Yeah, it feels so real, moving into a business, going to work, and hitting SQL, and realizing that there's a process to get the data out. I mean, I will say I've been in an active, well, I think calling it a war against SQL is maybe putting it a little bit strongly, but I've definitely engaged heavily, and I think as has Hadley, in building tools so that humans write a lot less SQL. And I've often felt that SQL as a language, it's clearly not going anywhere, it's the lingua franca of databases. But it's also a little bit like assembly code or Fortran code; it doesn't have a lot of the modern niceties of a real programming language, you know, like functions, like reusable code, like code that you can refactor and reuse.

And so, early on in my career, I was a bit scarred by hundreds and thousands of lines of copy-and-pasted SQL queries where, you know, there's no tests, and so you're dealing with these highly brittle, very complex, many-pages-long, 10-page-long SQL queries. And, you know, the reality is SQL is really alluring in the sense that it's declarative, and writing simple SQL queries is easy, but complex business logic, especially in a financial setting, ends up being rather subtle, with a lot of complexities. And so I found myself making the same kinds of mistakes over and over again, and seeing other people make mistakes. And so I felt like if we could essentially abstract away the unpleasantness of SQL and make it easier for humans to essentially author SQL indirectly, and to avoid many of those common errors, then we would be doing humanity a great service.
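One concrete way to read the point about authoring SQL indirectly: generate the repetitive parts of a query from data instead of copy-pasting them. This is a minimal, hypothetical sketch (the `avg_query` helper and the column names are invented for illustration, not taken from any tool discussed here):

```python
def avg_query(table: str, columns: list[str]) -> str:
    """Build a SELECT that averages each column, so the repeated
    AVG(...) lines are generated rather than copy-pasted by hand."""
    exprs = ",\n  ".join(f"AVG({c}) AS avg_{c}" for c in columns)
    return f"SELECT\n  {exprs}\nFROM {table}"

print(avg_query("customers", ["income", "balance", "tenure"]))
```

Tools like dbplyr and ibis take this idea much further by compiling whole expression trees to SQL; the point here is only that repetition becomes a loop, so a fix to the business logic lands in one place.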

I guess I actually started with databases, like I did SQL before R. So, my dad, a lot of his work involved databases. So, we had like, you know, we had like dinner table conversations about like relational data and Codd's third normal form. And when I was in high school, I guess starting from like age, I don't know, like 15, I like made Access databases as like my part-time job. I did some like database documentation as a part-time job, which is kind of like crazy looking back at it now. Like I was, I mean, that's really how I learned to program was in like Visual Basic for Applications. Like that was my first real exposure to programming, real exposure to SQL.

And kind of interestingly, I've been working on dbplyr lately, which translates dplyr code into SQL. And dbplyr has an Access backend, and people file issues that it doesn't work. The fact that, even in 2025, people are writing R code to connect to a Microsoft Access database and work with that data, that, I don't know, that's just kind of mind-blowing to me.

Becoming a data scientist

I'm so curious, like, so your, Emily, like your journey into data science and how you encountered a lot of these things. Could you, do you mind explaining a little bit just how you became a data scientist? Like what did that journey look like for you? I mean, in some ways, like I think I have like a very boring or traditional data science background in some ways, but in like, I think there's a different read where it's like kind of funny because at every step of the way, I knew what I wanted to do. I probably didn't know the right reason why I wanted to do it. So kind of started out like in college or in high school, took my first stats class, just like had been a math, like kind of math kid figuring out what can I do with math? And just had this like idea of this being like this amazing truth-seeking like applied way to like do math in the real world, I think.

I mean, definitely going in, I didn't actually know probably like so much of the things that were true about it. The data science is like so much more of an art than a science, that it requires so much more engineering skills and you'll like never once again in your life feel the certainty of math, which is probably the thing I liked about math in the first place.

In my time at Capital One, my current employer, I've really worn like three very different hats, which also like maybe kind of mirrors some of the different tools in the data space. Started out working a lot more on problems of like measurement, causal inference, understanding, you know, the values of different like levers you could pull and customer lifetime values. And that's like a lot more exploratory type work, you know, a lot of more like visualization and modeling and just like being a lot more intimate with kind of like both the data and the business.

Then kind of like took probably a sharp right turn to move upstream and spent like a number of years in more of the tools and data stack. So I'm thinking about building out data pipelines or in Python tooling for kind of a broader community to use. And really just like spent a lot more time thinking about how does good coding practices, how does automation, how does good engineering really enable better analytics before moving back into the like core traditional like machine learning modeling type space.

Moving into data engineering

I feel like the switch from the modeling to the data engineering, it's so intriguing. Like what was that like? What were the sort of tools you switched into and things you used? In some ways I fell into that one pretty organically, even though the actual roles seem, on paper, very different. I think in part for what we were talking about, of, you join a company and suddenly you find out, like, in school the data sets you were working with were penguins, or, I'm old enough, I'll date myself and say iris. It's all embedded in some nice little toy data set. But, you know, so much of your work, even if you want to do that really exploratory, deep analytical work, is around getting the data.

And I think I both had kind of Wes's reaction of, I'm working on this huge, long, multi-page SQL script, and there are all these subqueries and nested tables, and I'm trying to draw it out on a board like the Always Sunny in Philadelphia map. But at the same time, there was something to me that was so interesting about it, and I think I was feeling all the pain points so closely, like, I can't focus on the thing I want until we get the data right. So starting to get kind of obsessed with column names, data quality checks, how can you actually do testing and macros and all the things Wes called out that aren't native to SQL. So I was spending a lot of my time outside of work trying to understand and build out that part, as well as trying to truly understand the data.

We talk about, from a statistics perspective, understanding the data generating process, but there's a separate data generating process, the data pipeline process, that is the number one thing that predicts what the data errors will be, what the failure cases are. So I think I just kind of fell in love, in service of understanding the problem in the last mile, with thinking about the data.

The thing I was working on, without going into too much detail, I found myself kind of repeating for many different use cases, in a way that just felt so anathema to how I saw people in the RStats Twitter world outside of work building packages, sharing code; like, it was easier to collaborate with someone in Australia than collaborate with someone in the same company sometimes. And it was like, that seems wrong, which is how I started getting into internal package development, tool building. So I kind of thought I was doing it all in service of the analytical space, but more and more I found my time and energy gravitated towards these upstream problems.

And if you had to break down for yourself, at that time, what a data pipeline is, and what are kind of the key pieces involved, how would you break that down for someone? I mean, the most classic go-to paradigm that you can think of with that is something like extract, load, transform. And even probably before extract there's the step we don't talk about a lot, that's like logging or encoding: somehow, something that happens in the real world has to be turned into some digital signal. Then extract being, someone has to go be capturing that signal and getting all the different data sets into one sort of centralized place, in some sort of format.

Depending on your field, maybe that's hitting up APIs, maybe that's working with vendors, maybe that's even being a field scientist and doing manual work in a notebook and then punching it into a computer. But then, once that's loaded into a data lake or a data warehouse, you still need to impose some level of organization, which at a high level you can think about as maybe being more like organizing the files in your file system. But there are actually different conventions around that if you're using blob storage like S3, which is more like literally organizing files on your hard drive, or doing it in a database, where in some ways there are a lot more rules and constraints, but that helps you get a lot more stuff for free, like discoverability and some level of constraints and checks on internal integrity.
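The extract/load/transform breakdown above can be sketched in a few lines of Python. Everything here is a toy stand-in (the records, the 9999999 missing-value code, and the function names are invented for illustration), but it shows the shape: land the raw data unmodified, then impose organization downstream:

```python
import json

def extract() -> list[str]:
    # Stand-in for hitting an API, a vendor feed, or a field notebook:
    # capture the raw signal in whatever format it arrives in.
    return ['{"account_id": 1, "income": 50000}',
            '{"account_id": 2, "income": 9999999}']  # encoded default value

def load(raw_records: list[str]) -> list[dict]:
    # Land the records, unmodified, in a central store
    # (a list here, standing in for a lake or warehouse table).
    return [json.loads(r) for r in raw_records]

def transform(rows: list[dict], missing_code: int = 9999999) -> list[dict]:
    # Impose organization: decode known sentinel values so a naive
    # average no longer concludes that "we're all rich".
    return [{**r, "income": None if r["income"] == missing_code else r["income"]}
            for r in rows]

cleaned = transform(load(extract()))
```

Keeping the load step dumb and pushing the cleanup into transform is exactly what makes the pipeline auditable: the raw landing zone still shows what actually arrived.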

Discovering dbt

Down the road, I think a long time, but I think dbt was still really in its infancy at the point I started doing a lot of this. But this is definitely the point where I started feeling very, very acutely, on a daily basis, the exact sort of pain points that dbt solves for. That I could build kind of my own workflows for, oh, if I want to test some SQL code, how do I set up test cases and unit tests when there's no native way to do that. I had some crazy workflows, like pulling data down into an R Markdown notebook from my database and kind of doing a lot more round trips than necessary.

Figuring out that a big part of data pipelines is orchestration, which I don't think you run into so much in analysis, but if you have a lot of long-running things, especially ones that depend on systems that aren't in your control, thinking about how do you make sure everything happens in the right order. Which is not a hard problem, there are a lot of great open source tools for doing that, but maybe not always ones analysts have at their disposal and fingertips to spin up the right infrastructure for.

So I think I just had my elevator pitch for the seven or ten things that were the biggest pain points of SQL, and then one day I was just at the gym listening to a data engineering podcast and heard one of the first (or I shouldn't say first, one of the first-to-me) interviews about dbt, and I was just like, oh wow, these people happen to be interested in solving some of the exact same problems I've been thinking of. Which I think, for a certain time in the data space, was a weirdly frequent feeling, just that sense of serendipity where you'd be working on something or thinking about something, and then within a week somebody put out a dbt package, put out an R package, asked the exact same question on Stack Overflow. It was definitely just, I kind of serendipitously fell into it, that I was having the same problem as a lot of other people.

Yeah, I found it took me a while to understand what dbt was, like why were people so excited about it, until I kind of realized it's like an IDE for SQL, plus the ability to write functions, plus the ability to use git. And I was like, oh wow, you didn't have that already for SQL? That's crazy, and obviously people are going to love it, because how do you do software engineering without functions and version control?

Yeah, I've realized it's for that reason a very hard thing to make the elevator pitch for, because, I mean, if you tell somebody that just has a couple of queries they're running, hey, what if you broke this into like 27 queries and added some yaml files in there, they're like, oh my goodness, that sounds terrible. But I think it's super easy to sell to people that have lived in the old world. And then people coming into it now, like I know there's a lot of talk of dbt versus sqlglot, and anything that you take as the base case, you can see, oh wow, there's still a lot of room for improvement here.

SQL as a language and the spec problem

And I think that's the number one weirdest thing of SQL as a language: it's not one thing, but there's a standard that everyone adheres to until, um, they don't. And of course introducing things like column selectors, which I'm a very big fan of, largely thanks to dplyr. But at the same time, that is a tough trade-off too, because every time you deviate from the standard, it's like, yes, but it's also slightly less interoperable.

SQL is really interesting to me because it's sort of this pre-internet technology in many ways. Like, SQL is a spec, there's an ANSI SQL spec, but you cannot actually get the spec without paying for it. So I actually have one of the few books I still keep: I have SQL-99 Complete, Really, and this is the only resource I've been able to find that actually explains how SQL is supposed to work, not how some specific database implements it.

And I think that's, I don't know, I find that kind of bizarre. I mean, luckily, I think recently Claude is filling a lot of that gap for me. Claude seems to have this knowledge of SQL that, through googling, I could never find the right website, but it must be on the internet, because Claude seems to understand it now. But I don't know, it's just really interesting to me, as someone who's been writing this translation layer from R to SQL, to try and figure out, what is SQL, what's the official way you're supposed to do this, and what do the databases actually support.

That is fascinating about the spec I didn't realize that was so not accessible like there's so many like fascinating topics in open source governance but I've never heard of that like truth rules but you can't see them. I mean you can get them but it costs like three thousand dollars or something. And I mean maybe it's easier maybe I just never had the right searches but that just yeah I found that like mystifying like and kind of even just this idea that like to me like if I want people to like do something to like follow something I've written like to me it's like obvious you want to give that away for free.

Column selectors

I think just to circle back and flesh out something you brought up, Emily: you mentioned selectors, and I wonder if it would be useful to explain what a selector is, because it has been added recently to a lot of SQL implementations, but maybe you could talk us through, just what is a selector?

Absolutely, and I think that tees me up for a question I've always wanted to ask Hadley, so, um, appreciate that. But, um, if you think about, I mean, I'll pander to the R crowd here to start out, but as for selectors: you have a data frame, you have a bunch of columns, and if you have in fact named your columns well, in a standardized way, or if you think about data types, there are a lot of different kinds of identifying information you could use to grab out and act on a set of variables. Um, so if you want to do some sort of mass data wrangling process, say in SQL maybe you want to take the average of every boolean variable you have in your table, if you're doing that in SQL you're going to be there a while, because you're going to be typing average variable one, average variable two, average variable three.

But in some of the more modern, um, kind of programming languages with more flexible APIs, you can have these really nice selectors, where you might say, for all boolean values, apply the same transformation, or, um, if you get a little clever with naming columns in a standardized way, for all of my variables related to this entity, or representing an indicator, do this operation. Um, so for kind of large-scale wrangling, it can just be a way to write a lot cleaner code and avoid a lot of typos or copy-paste errors, by consolidating your business logic and just applying or mapping it over many different variables.
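A language-agnostic way to picture what's described here: a selector is just a predicate over column metadata (name, type), and the transformation is mapped over whatever matches. This is a hypothetical stdlib-Python sketch with a dict-of-lists standing in for a data frame; none of it is dplyr's or polars' actual API:

```python
# Toy "data frame": column name -> list of values.
table = {
    "is_active":     [True, False, True],
    "is_delinquent": [False, False, True],
    "income":        [50000, 72000, 61000],
}

def select(table, predicate):
    """A selector: pick columns by a property of their name or
    contents, instead of listing each column by hand."""
    return [name for name, values in table.items() if predicate(name, values)]

# "Take the average of every boolean variable" in one pass:
bool_cols = select(table, lambda name, vals: isinstance(vals[0], bool))
averages = {c: sum(table[c]) / len(table[c]) for c in bool_cols}
```

The same pattern covers name-based selection, e.g. `lambda name, vals: name.startswith("is_")`, which is where well-designed column naming conventions start paying off.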

And I know the first place I saw this was in dplyr. I know, um, Spark has it, polars has it, a number of different SQL variants have it, pandas has it. But, um, Hadley, I've always wondered: to the best of your knowledge, was dplyr the first tool that had implemented those selectors?

I don't know, my recollection of this is vague. I kind of remember, around this time, learning that, oh, one of the problems people have is that they have a data frame with like 800 columns, and just selecting the correct columns is a pain. My vague recollection, and I did a little googling that kind of supports this, is I think Stata had some tools for variable selection, and I think that's where I learned about it from, from Stata users. Um, looking at the documentation for Stata, it looks like you can say, here's the start of the range, here's the end of the range, or you could select all the variables with a prefix. So I think that's probably it; I have a very vague recollection that maybe SAS has something similar as well.

But yeah, I think it came from statistical software, which is kind of surprising, because it does feel like this is something you feel the need for in SQL all the time, because you're just typing out... I was just looking at pandas to see if it had it, and I don't think it does. I mean, I haven't been super active in pandas in a long time, but, um, it doesn't look like it has selectors quite in the same way that dplyr does, or that, you know, DuckDB does now, for example. I think the ibis team implemented it, um, and I guess polars and ibis have selectors similar to dplyr now.

I knew I'd done something similar in polars, or in Python with pandas, of just grabbing out the columns and doing some list comprehensions and throwing that back in, but I didn't want to say that was the best way to go. But I'm fascinated if it came from Stata, because my main recollection of Stata from college is that you can't have two data frames in memory at the same time, so I did not think of these tools as bastions of user experience.

So interestingly, I'm looking at the docs for this on the Stata web page, and one of the things it mentions: there's specifically my kind of favorite selector in dplyr, that literally no one uses, called num_range. You give it a prefix, you give it a starting number and an ending number, and it will generate the names, so you can say, select me x1 through x50, really easily. And the Stata docs specifically say that there's no way to do that easily in Stata. So, I don't know, I think that's kind of evidence for, okay, I was like, well, Stata can't do this, I think that's something useful, and I'm gonna do that.

You mean that's like spite, like a response? Evidence of a response. Yeah, but it's also one of those functions where I'm like, I think this is cool, I would have thought people would use this, and basically no one does. Yeah, it's so fun. It's such a small behavior in a way, but it's so ubiquitous that it does make a big difference. When I've had to type SQL manually, this idea that you just type the same calculation over and over again and just change out column names, um, is kind of mind-blowing.
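For readers who haven't met it, the num_range() behavior described above is easy to mimic; this is a hypothetical Python equivalent of the idea, not dplyr's implementation:

```python
def num_range(prefix: str, start: int, stop: int) -> list[str]:
    """Generate column names prefix+start .. prefix+stop, so
    'select x1 through x50' is one call instead of 50 typed names."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]

cols = num_range("x", 1, 50)  # ['x1', 'x2', ..., 'x50']
```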

So it is funny, selection does seem like a really simple topic, but it's kind of crazy how much it shapes the quality of life. And I think, even for me personally, I always wish, I feel like if that existed in databases natively, people would think more, as they build their database, like names are something that can do a thing, and then think about them more carefully, like you would think about designing an API. I think, you know, when you get into kind of industry or production databases, you get into all these things where there are like 10 different ways to abbreviate account id, so you'll end up with like 10 different versions of that and be mentally tracking which belongs to which table. But I think for me it was just a big aha moment of, oh, my column names can actually do something, if I actually think of them as part of the software.

dbtplyr and cross-community pollination

And this, I think this is very much your, um, Column Names as Contracts post, is that right? Yeah, yeah, indeed, indeed. Yeah, nice. And I feel like dbtplyr, your dbt plugin, kind of built on that concept a bit. Could you explain a little bit about that?

Yeah, exactly. So, um, dbt, um, plugins, packages, whatever you want to call them, are the answer to what was called out about, um, SQL not really having a native function interface. Again, I think that's something that's wildly database-specific, some do and some don't, but even the ones that do, it's not really great to use them, because then you've just loaded code to a database; it's not really version controlled, you can't really see what it does, can't access it that easily. But, um, dbt, just, I guess, taking a step back, is essentially a collection of SQL scripts, macros, and other files, organized in a very specific way so it can kind of infer the dependency graph and execute.

Um, similar to how an R package is maybe just a lot of R files organized in a smart way so the computer knows what to do with it, they abstracted that a step forward, where, um, you can have dbt packages, which are kind of like database-agnostic chunks of SQL code that you can then import and call as macros; they can be at the function level, they can be at the table level, they can exist at a lot of different levels of granularity. Um, so dbtplyr was kind of an early-ish, um, dbt package that I put out there that was essentially stealing, um, a number of things I really liked about the tidyverse, but specifically around column selectors, and trying to port that API, um, into SQL, to solve this exact kind of problem of how do I grab out a set of columns, and how can I then apply transformations on those in bulk.
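The idea described here (select columns by a naming convention, then expand one SQL snippet per match) can be sketched outside dbt too. This hypothetical Python mirrors the spirit only; dbtplyr's real interface is Jinja macros inside dbt, and the names below (starts_with, across, the ind_ columns) are invented for illustration:

```python
def starts_with(columns: list[str], prefix: str) -> list[str]:
    # Selector over a table's column list (dbt knows these names
    # at compile time via the warehouse's information schema).
    return [c for c in columns if c.startswith(prefix)]

def across(columns: list[str], template: str) -> str:
    # Expand one SQL expression per selected column.
    return ",\n  ".join(template.format(col=c) for c in columns)

cols = ["ind_late", "ind_closed", "amt_balance"]
sql = ("SELECT\n  "
       + across(starts_with(cols, "ind_"), "MAX({col}) AS {col}_ever")
       + "\nFROM accounts")
```

This is why the column-names-as-contracts idea matters: the `ind_` prefix is doing real work, because the macro can only find the indicator columns if the naming convention holds.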

What was the reception like? Because when I went to dbt's conference, Coalesce, in 2022 to give a talk, they were kind of surprised that a person from RStudio came through. They were excited, but the one thing I remember is they kept saying, Emily is our representative from the R community, and they pointed to dbtplyr a lot as an example of a really interesting cross-pollination of ideas. What was the reception like? What was it like kind of going into that community?

I think it was interesting because I think there were people kind of going back to our the point about it being like a simple idea but that has legs of like there are people that like got it probably that had seen it before. I know there is one um dbt labs engineer at the time that um really was like knew a lot about R had been part of that community really kind of like like latched on to it. You know I mean I think people that had like seen how it worked somewhere else um kind of really got it. I think the thing I realized in retrospect was probably calling it like dbt selectors um would have probably been like far more useful and informative for like discoverability purposes um since like dbtplyr like kind of a shibboleth of like pre-limiting yourself to only having like the R part of the community having like any earthly idea of what you're talking about or what to expect from it.

Naming packages across languages

That makes me think of a question we've been struggling with that you might have some insight on, and that is: as we're creating more, uh, packages where we have an R version and a Python version, what do we name them? Like, I think with orbital and pins, we're like, okay, we're going to call the R package and the Python package exactly the same thing. Or do we do it like great tables, where we've got gt on one hand and great tables on the other. And I think there's another, I don't know if you remember, Michael, but I think there's another case where we're like, these are totally different names. Do you have any sense of what you think we should do when we're doing one package, the same idea, but implemented in two places, in two languages?

That's a really interesting question, because I'm sure you're also bound by different namespace availability in both languages. Yes. On one hand, like with orbital, I find it very satisfying, of, if I've heard about it in one language, then it's trivially easy for me to be like, oh yeah, I know they have that thing in the other language. But I do think, I mean, even with gt versus great tables, in a weird way there's a nice mental namespacing to the fact that I know these do not aspire to be at parity; the APIs within them aren't 100% the same, and I feel like it kind of fits my expectations right, that these are aiming at the same goal but they may not 100% get there in the exact same way.

Yeah, that's interesting, because orbital is a case where the API is simple enough that it's basically the same for R and Python. But obviously, the more complicated the package, the more it has to diverge. You want a package that feels R-like and a package that feels Pythonic; you don't want to feel like you're writing R code in Python or Python code in R. That's something I've so loved about Michael and Rich's work on both Great Tables and pointblank, and honestly even plotnine manages to get that distinctly more Pythonic feel in the Python version.

And something I think a lot about as I switch between languages: I definitely write Python code with an R accent, but at the same time, it feels like you're going to get so much further, and have a lot less pain, if you lean into the conventions of the language you're in.

I definitely write Python code with an R accent, but at the same time, you're going to get so much further, and have a lot less pain, if you lean into the conventions of the language you're in.

So it's been fun from the outside looking in to see where the languages converge and where they diverge. I think R has had a lot of impact on Python as well. If you look at the API of Polars or the API of Ibis, it's a much more pipe-centric, fluent API. A lot of people, myself included, looked enviously from the Python world into the R world and thought, wouldn't it be nice to be able to refer to column names inside of expressions, and not need to index into a data frame to get a reference to a column in order to do things?

Stuff like that, the non-standard evaluation you have in R, comes from its Lisp heritage. People in Python have tried to replicate some of the flavor, the niceties of that, with things like the underscore operator. But it's still hard to shoehorn that type of API into Python, which is in many ways the opposite of non-standard evaluation: everything must be as explicit and obvious as possible, and trying to do things magically is considered unpythonic.

Yeah, and that feels like a spectrum: "do we give these things the same name" is the very first question, and then there are the deeper questions of how we even architect it so it feels Pythonic, rather than having an R sort of smell to it, and how we actually lean into Python as a language. It seems so tricky to strike that balance on all those fronts.

I will say, for Great Tables: when we started working on it as a port of the R library gt to Python, the name wasn't available on PyPI, so right away that kicked off the need for a new name. What I appreciate about Rich is that he was open to it. Few people knew that gt stood for "grammar of tables," so I have to hand it to him for being willing to retroactively rewrite history and pretend the acronym stood for "great tables." I think few developers would be willing to do that.

I'd actually forgotten that; in my head I'd already translated it, yeah, gt stands for great tables. I'd forgotten it was "grammar of tables" inspired.

Learning in public and imposter syndrome

I'm curious, Emily, how do you choose the things you get into? Digging into column names as contracts, addressing the dbt community, it's so interesting. How do you choose which threads you want to pull?

I think I've always gravitated first toward things that might otherwise frustrate me, with so much curiosity about whether they could be better: don't we maybe need more internal packages, or surely there must be a better way to write this SQL code? Being drawn to the little paper-cut everyday problems has left me with a lot of curiosity and energy to explore things that would otherwise just drain you.

And that's coupled with the fact that I love learning about new tools, new algorithms, just squirreling away information that it doesn't feel like I'll ever need, saving it like a nut for winter. That was so easy especially early in my career, during the renaissance of R Stats Twitter, where you could be a fly on the wall for non-stop conversations among people so much smarter than you doing fascinating things. And then, very luckily, being able to do a lot of that pattern matching: oh, I have this problem, and I heard about dbt; or, I'd really love to never copy-paste this again, and I think there's a thing called R Markdown that exists.

So it's kind of: I'm curious about the things that frustrate me, and I have all these tools I want to try, and sometimes I just get lucky pulling the right tool from the toolbox.

Yeah, it's so cool, and I do think it's so interesting how much of this shines in your blog, where you're often stitching together multiple tools in really creative ways.

And something that came up when preparing for this is imposter syndrome, because with blogs, or when people put themselves out there like that, it can be a really big challenge. I'm really curious what that looks like for you, and what advice you'd have for people who are blogging or putting tools out and might feel a little imposter syndrome.

Absolutely. I mean, even preparing to join this podcast was like, what have I done? I could just be having a nice afternoon, and instead I'm talking to three luminaries of this field. But I think in some ways it's something you really have to get comfortable with, because if you aren't in rooms with people you feel you have a ton to learn from, that's kind of sad; you're probably in the wrong rooms. Going back to the last question: the way I learned was getting to absorb all this amazing content from other people out there, and if you weren't part of the conversation, you weren't going to get to learn and encounter those things.

So I've somehow always managed to get myself into a lot of places I really had no business being, even switching between the analyst and engineering hats, and getting comfortable jumping in and learning as you go. And the same with, obviously I'm not a PhD political scientist or economist, but the more I could peer over into those spaces and see the different ways they talk about causal inference, the more I learned. To a large extent, I think it's very good to feel like you aren't the smartest one there, because that tells you you're in the right place, where you're learning.

Sharing on the internet is always hard and scary, and I think it's to the great credit of the R community that it never felt that way. I've thought about what would have happened if I'd come into the field either five years sooner or five years later, both of which, rudely, were when the internet seems to have been a more hostile place, and whether I'd have felt confident just putting stuff out there. But I think Emily Robinson maybe said it best at some point in a conference talk: write for the person who is struggling with the things you were struggling with 6 or 12 months ago, and try to help them learn what you needed to know. That's a good way to remember there are other people out there working on the same stuff, and that you genuinely have something to contribute.

Write for the person who is struggling with the things you were struggling with 6 or 12 months ago, and try to help them learn what you needed to know.

Yeah, it's such a neat way to frame it, thinking back as a way to really emphasize your own growth, but also, it feels nice to be able to take that difference and write it down.

Yeah, and I think it's the best way to crystallize your own thoughts. Nothing I write will ever be "hey, this is exactly what I did at work today," because I'm pretty sure my employer would prefer I not go sharing IP all over the internet. But thinking about how I would help somebody else do this thing forces you to think at the right level of abstraction. To take the trivial example of column names: it's not "at my job it works better if I do this." It almost forces you to generate a theory behind things that is, in itself, a little bit more reusable.

Advice to your past self and Claude Code

This is maybe a tough exercise, but I'm curious what folks would say to themselves from 6 to 12 months ago. I'm almost curious to hear what people would tell themselves.

I think I would tell myself to try Claude Code earlier, though maybe I was already trying to use it six months ago. It does feel like, as you get further along in your career, your growth definitely slows. There's that period when you're first starting your job and there's this firehose of information and everything is, oh my god, this is amazing, and you look back at code you wrote six months ago and think, this is a heap of shit, why would I ever write that?

But it certainly feels like you plateau as you get further in your career. I guess that is more my worry now, that I am plateauing: when I look back at my code from a year ago, I think, oh, that's pretty good. And that feels kind of bad. I should be thinking, oh, that's kind of shitty, Hadley, you could do way better now.

For me, I started using coding agents basically as soon as I learned about Claude Code, I think in March, or maybe it was released in February; it was definitely the beginning of this year, so it took a little time for me to find out about it and start using it. But for a long time I held off, for no particularly good reason, on building bespoke tools for myself: identifying the things where, in the past, you'd ask whether it was worth building a solution to a problem. I think there's even an xkcd comic about it, a grid of how much time a task takes versus how often you do it, which tells you whether automating it is worth your while.

And now, with coding agents, that whole chart needs to be completely redone, because of how little time it takes, especially for something that's not very hard to build but is just for you and makes your life a little bit better, saving you five minutes a day, ten minutes a day, maybe an hour now and then. I've built things in the last three months where I think, I should have done this six months or a year ago. And seeing those successes has given me a sense of boldness, more of a willingness to dive into things, or to set an agent to work building something. Maybe it's only going to save me ten minutes twice a month, but whenever I get those ten minutes back, it's going to be super satisfying.

Yeah, I hate to pile onto Claude Code, but I'm similar. I think it was on a podcast with James Blair that I realized I was the only one not using Claude Code, which was a good realization. Now I'm using it a lot, but the most interesting and surprising thing to me has been that when I watch my friends use Claude Code, I always learn new things. So I would tell myself both to use Claude Code sooner and also to watch people use it, because with Claude Code it's so easy to just go off on your own and do stuff, and I'm endlessly surprised at how differently people approach things.

That reminds me of how, if you're a data scientist working with someone non-technical, the things they think are going to be really hard are often really easy, and the things they think are going to be easy are often really hard. We're in the same state with Claude Code and these tools: we don't actually know what is hard and easy for them. It's so easy to get stuck in your own assumptions; you have to see other people do it, and then you think, oh, I just assumed Claude Code could never do that, it'd be too hard for it, and it turns out to be an easy problem for it, for whatever reason.

I feel like that abstracts to so many things: the ability to learn from watching other people do things, which is such a hallmark of R and Python. That's something I've thought about a lot, also going back to SQL: there's so much less open source SQL code on the internet, because it's mostly code people wrote for business applications. I'm always fascinated by these little spaces where we are less able to learn from one another.

Yeah, and it's also just hard to share. This is one of the things I remember from trying to teach SQL: to teach it, you've got to spin up a database. How do you simulate that environment? That was definitely one of the things that always motivated me in the design of R packages: how do you shorten the time from downloading something to experiencing a win? If you have to go away and install Postgres on your computer first, that's a multi-day journey that involves a lot of pain and suffering.

And with R teaching, you always drop people directly into the analysis. I remember the first time I read R for Data Science, I thought, oh, this is cool, because this book isn't spending forever trying to teach me what a for loop is. It's fine, I get it. It's more like, hey, here's the diamonds data set, let's jump between some wrangling and some visualization.

One of the phrases I've found super useful is this idea of tacit knowledge, which I learned from Bill Behrman, who taught a fantastic data science course at Stanford that I got to help out with. It's the idea of all the things you know how to do but would never think to write down, and when you watch someone working you think, oh my god, you can do that? It's fascinating. Even within the tidyverse team, where we've worked very closely with each other for a long time, still, when someone shares their screen and does something, you're like, what? You can right-click on Positron in the dock and it lists your recent projects? Mind blown. You'd never think to write that down, because you assume everyone knows it.

Oh yeah, and that almost gets us back to imposter syndrome: anything anybody puts out in the world can be a net-new aha moment for someone. I've always felt that, to the extent I have a niche, it's writing about things that were too boring or ordinary or table-stakes for anyone else to bother putting pen to paper on. It's really satisfying when you find that tiny little thing that was too unimportant to say, but that actually feels so good.

Yeah, and that ties perfectly back to column selectors, which feel like this trivial little thing, until you remember that people have thousands of columns in their databases and this is actually a really painful problem. Let's just create one boring little helper that gets rid of all that pain.

Closing and posit::conf(2026)

Yeah, and selectors are the kind of thing you appreciate when you watch someone coding and typing: this is so fast. You kind of have to see people do it to really get just how powerful it is.

But I think they're kicking us out; we're getting long in the tooth here. Emily, I appreciate so much you coming on. It's been so neat to talk about all these things: how you've spanned communities, and I appreciate you surfacing the Stata origins of selectors, and this really important advice for people with imposter syndrome, how to get into different areas, embrace discomfort, and write for yourself from 6 to 12 months ago.

I also have to ask: are we allowed to talk about posit::conf?

Yeah, sure, if Emily's okay with it.

A little bird told me you might be the keynote at the next posit::conf. Is that right?

Yeah, I am both honored and, since Jenny emailed me, still speechless, which is probably not a good trait in a keynote speaker, but I'm so excited.

We're super excited to have you.

Thank you. I'm honored, and, going back to imposter syndrome, a little bit horrified, but honored. I'm looking forward to the journey, because I learn so much in the process of boiling my thoughts down into words. In the process of writing talks, you learn so much more about whatever it is you thought you were enough of an expert to talk about.

Yeah, 100%. We're so lucky to have you at posit::conf, and I'm so excited for the talk. Really appreciate you coming on.

Truly, thank you all, and thanks for the great show. I'm looking forward to being back as a listener next week.

Awesome, thanks.

The Test Set is a production of Posit, PBC, an open source and enterprise data science software company. This episode was produced in collaboration with creative studio AGI. For more episodes, visit thetestset.co or find us on your favorite podcast platform.