Resources

Emily Riederer: Column selectors, data quality, and learning in public

Emily Riederer writes Python with an R accent, and we're all comfortable with it. In this episode, Emily reflects on her journey through R, Python, and SQL — from lessons learned averaging default values (oops, we're not all rich!) to discovering that column selectors are way cooler than they sound. She weighs in on the delicate art of learning in public, why frustration often makes the best teacher, and how to find your niche by solving the boring problems. Oh, and the crew casually drops that she's keynoting posit::conf(2026)! Emily's had a wild ride through modeling, data engineering, machine learning, and back again, and she knows a thing or three about the evolution of SQL tooling (from nightmare multi-page scripts to the dbt renaissance). She reveals how building internal packages became her gateway to making work enjoyable. Plus: the surprising Stata origins of column selectors, the eternal struggle of naming packages across R and Python, and why watching people code teaches you more than any tutorial ever could. The conversation gets real about imposter syndrome and the magic of tacit knowledge.

IN THIS EPISODE

• Why real-world data is chaos, not truth
• The path from modeling to data engineering (and back)
• What a data pipeline really is (extract, load, transform) and why organization matters
• How dbt changed the SQL game
• Learning by watching: tacit knowledge and coding over the shoulder
• Imposter syndrome and learning in public
• Building internal tools to escape busywork
• posit::conf(2026) keynote preview

Jan 27, 2026
1h 1min

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Welcome to the test set. Here we talk with some of the brightest thinkers and tinkerers in statistical analysis, scientific computing, and machine learning. Digging into what makes them tick, plus the insights, experiments, and OMG moments that shape the field.

On this episode, we're joined by Emily Riederer, who I think has the distinction of living in that sweet, sweet overlap between Python, SQL, and R harder than anyone else I've ever seen, and is a data science manager at Capital One, and apparently listens to the test set, but is not planning to listen to this episode. Welcome to the test set. I'm Michael Chow, and I'm joined by my co-hosts, Wes McKinney, who's a principal architect at Posit, and Hadley Wickham, who's a chief scientist at Posit. And I'm so excited to be here with Emily Riederer, who is a data science manager at Capital One, and I think a sort of like icon in the R, Python, SQL community for just putting out so much interesting work in this intersection of Python, SQL, and R.

So like, recently the article Python Rgonomics, and some talks around that, of how to have a workflow that R users love in Python with things like polars; and dbtplyr, which was a plugin for the SQL framework dbt, which was a really interesting sort of cross-section of ideas. So Emily, thanks for coming on. So happy to have you.

Oh yeah, thank you so much for having me. I've loved the pod so much so far. I have not missed a single episode. Yo, I'm so glad. Honestly, the biggest downside of being on a podcast is I was thinking about it, and I'm like, oh, the next one that comes out, I'm not going to want to listen to. Yeah, are you going to listen to your own? I feel like, oh my goodness, you can't break the streak, you know? Never, never, never, never.

Emily's journey into SQL

So I think people will be really interested to hear sort of your journey through languages like R, Python, SQL, and the things you've put out. But one interesting question I thought might be good to open up with is maybe just on the role of SQL kind of over the years. Since I know you started doing a lot of R work, and folks here have done like a lot of work in SQL, whether it's like dbtplyr, which translates like dplyr and R to SQL, or ibis, which translates Python to SQL. Emily, I'm really curious to hear like, what was your journey into SQL like?

Very abrupt in some ways, because I think if I think about my educational background, I was in a relatively like theoretical stats program, so could prove like asymptotic convergence of a lot of things, had very little experience with real world data, until I get one of my first internships, I'm asked to make a customer profile, I take the average of customer incomes, and I'm like, we're all rich, but we were not rich, I was just averaging in all the 9999 encoded default values. So that was my first introduction, both to databases, to SQL, and just like the vagaries of real world data. And I think that had been like, was also kind of jarring, because I was under this whole illusion of like, oh, data is this like, ground truth, where we can like figure things out about the world. And then I entered into this world where like, data is this like, kind of source of chaos that you need to control.

But then, kind of coming into industry, I think SQL was one of the main tools of the trade, the only way to access your data before you could get in there with something like R or Python. And pulling from just kind of a single brush I'd had in college with database design and data modeling and normal forms, I think just something about that, combined with the tidyverse, just kind of clicked in my mind of, there's actually a real art to how you set this thing up, how you get the data out, that can really set you up for success.

Yeah, it feels so real, moving into a business, going to work, and hitting SQL, and realizing that there's a process to get the data out. I mean, I will say I've been in an active, well, I think calling it a war against SQL is maybe putting it a little bit strongly, but I've definitely engaged heavily, and I think as has Hadley, in building tools so that humans write a lot less SQL. And I've often felt that SQL as a language, it's clearly not going anywhere, it's the lingua franca of databases. But it's also a little bit like assembly code or Fortran code; it doesn't have a lot of the modern niceties of a real programming language, you know, like functions, like reusable code, like code that you can refactor and reuse.

And so, early on in my career, I was a bit scarred by hundreds and thousands of lines of copy-and-pasted SQL queries where, you know, there's no tests, and so you're dealing with these highly brittle, very complex, many-pages-long, 10-page-long SQL queries. And, you know, the reality is SQL is really alluring in the sense that it's declarative, and writing simple SQL queries is easy, but complex business logic, especially in a financial setting, ends up being rather subtle, with a lot of complexities. And so I found myself making the same kinds of mistakes over and over again, and seeing other people make mistakes. And so I felt like if we could essentially abstract away the unpleasantness of SQL and make it easier for humans to essentially author SQL indirectly, and to avoid many of those common errors, then we would be doing humanity a great service.
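One concrete way to read the point about authoring SQL indirectly: generate the repetitive parts of a query from data instead of copy-pasting them. This is a minimal, hypothetical sketch (the `avg_query` helper and the column names are invented for illustration, not taken from any tool discussed here):

```python
def avg_query(table: str, columns: list[str]) -> str:
    """Build a SELECT that averages each column, so the repeated
    AVG(...) lines are generated rather than copy-pasted by hand."""
    exprs = ",\n  ".join(f"AVG({c}) AS avg_{c}" for c in columns)
    return f"SELECT\n  {exprs}\nFROM {table}"

print(avg_query("customers", ["income", "balance", "tenure"]))
```

Tools like dbplyr and ibis take this idea much further by compiling whole expression trees to SQL; the point here is only that repetition becomes a loop, so a fix to the business logic lands in one place.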

I guess I actually started with databases, like I did SQL before R. So, my dad, a lot of his work involved databases. So, we had like, you know, we had like dinner table conversations about like relational data and Codd's third normal form. And when I was in high school, I guess starting from like age, I don't know, like 15, I like made Access databases as like my part-time job. I did some like database documentation as a part-time job, which is kind of like crazy looking back at it now. Like I was, I mean, that's really how I learned to program was in like Visual Basic for Applications. Like that was my first real exposure to programming, real exposure to SQL.

And kind of interestingly, I've been working on dbplyr lately, which translates dplyr code into SQL. And dbplyr has an Access backend, and people file issues that it doesn't work. The fact that, even in 2025, people are writing R code to connect to a Microsoft Access database and work with that data, that, I don't know, that's just kind of mind-blowing to me.

Becoming a data scientist

I'm so curious, like, so your, Emily, like your journey into data science and how you encountered a lot of these things. Could you, do you mind explaining a little bit just how you became a data scientist? Like what did that journey look like for you? I mean, in some ways, like I think I have like a very boring or traditional data science background in some ways, but in like, I think there's a different read where it's like kind of funny because at every step of the way, I knew what I wanted to do. I probably didn't know the right reason why I wanted to do it. So kind of started out like in college or in high school, took my first stats class, just like had been a math, like kind of math kid figuring out what can I do with math? And just had this like idea of this being like this amazing truth-seeking like applied way to like do math in the real world, I think.

I mean, definitely going in, I didn't actually know probably like so much of the things that were true about it. The data science is like so much more of an art than a science, that it requires so much more engineering skills and you'll like never once again in your life feel the certainty of math, which is probably the thing I liked about math in the first place.

In my time at Capital One, my current employer, I've really worn like three very different hats, which also like maybe kind of mirrors some of the different tools in the data space. Started out working a lot more on problems of like measurement, causal inference, understanding, you know, the values of different like levers you could pull and customer lifetime values. And that's like a lot more exploratory type work, you know, a lot of more like visualization and modeling and just like being a lot more intimate with kind of like both the data and the business.

Then kind of like took probably a sharp right turn to move upstream and spent like a number of years in more of the tools and data stack. So I'm thinking about building out data pipelines or in Python tooling for kind of a broader community to use. And really just like spent a lot more time thinking about how does good coding practices, how does automation, how does good engineering really enable better analytics before moving back into the like core traditional like machine learning modeling type space.

Moving into data engineering

I feel like the switch from the modeling to the data engineering, it's so intriguing. Like what was that like? What were the sort of tools you switched into and things you used? In some ways I fell into that one pretty organically, even though the actual roles seem, on paper, very different. I think in part for what we were talking about, of, you join a company and suddenly you find out, like, in school the data sets you were working with were penguins, or, I'm old enough, I'll date myself and say iris. It's all embedded in some nice little toy data set. But, you know, so much of your work, even if you want to do that really exploratory, deep analytical work, is around getting the data.

And I think I both had kind of Wes's reaction of, I'm working on this huge, long, multi-page SQL script, and there are all these subqueries and nested tables, and I'm trying to draw it out on a board like the Always Sunny in Philadelphia map. But at the same time, there was something to me that was so interesting about it, and I think I was feeling all the pain points so closely, like, I can't focus on the thing I want until we get the data right. So starting to get kind of obsessed with column names, data quality checks, how can you actually do testing and macros and all the things Wes called out that aren't native to SQL. So I was spending a lot of my time outside of work trying to understand and build out that part, as well as trying to truly understand the data.

We talk about, from a statistics perspective, understanding the data generating process, but there's a separate data generating process, the data pipeline process, that is the number one thing that predicts what the data errors will be, what the failure cases are. So I think I just kind of fell in love, in service of understanding the problem in the last mile, with thinking about the data.

The thing I was working on, without going into too much detail, I found myself kind of repeating for many different use cases, in a way that just felt so anathema to how I saw people in the RStats Twitter world outside of work building packages, sharing code; like, it was easier to collaborate with someone in Australia than collaborate with someone in the same company sometimes. And it was like, that seems wrong, which is how I started getting into internal package development, tool building. So I kind of thought I was doing it all in service of the analytical space, but more and more I found my time and energy gravitated towards these upstream problems.

And if you had to break down for yourself, at that time, what a data pipeline is, and what are kind of the key pieces involved, how would you break that down for someone? I mean, the most classic go-to paradigm that you can think of with that is something like extract, load, transform. And even probably before extract there's the step we don't talk about a lot, that's like logging or encoding: somehow, something that happens in the real world has to be turned into some digital signal. Then extract being, someone has to go be capturing that signal and getting all the different data sets into one sort of centralized place, in some sort of format.

Depending on your field, maybe that's hitting up APIs, maybe that's working with vendors, maybe that's even being a field scientist and doing manual work in a notebook and then punching it into a computer. But then, once that's loaded into a data lake or a data warehouse, you still need to impose some level of organization, which at a high level you can think about as maybe being more like organizing the files in your file system. But there are actually different conventions around that if you're using blob storage like S3, which is more like literally organizing files on your hard drive, or doing it in a database, where in some ways there are a lot more rules and constraints, but that helps you get a lot more stuff for free, like discoverability and some level of constraints and checks on internal integrity.
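The extract/load/transform breakdown above can be sketched in a few lines of Python. Everything here is a toy stand-in (the records, the 9999999 missing-value code, and the function names are invented for illustration), but it shows the shape: land the raw data unmodified, then impose organization downstream:

```python
import json

def extract() -> list[str]:
    # Stand-in for hitting an API, a vendor feed, or a field notebook:
    # capture the raw signal in whatever format it arrives in.
    return ['{"account_id": 1, "income": 50000}',
            '{"account_id": 2, "income": 9999999}']  # encoded default value

def load(raw_records: list[str]) -> list[dict]:
    # Land the records, unmodified, in a central store
    # (a list here, standing in for a lake or warehouse table).
    return [json.loads(r) for r in raw_records]

def transform(rows: list[dict], missing_code: int = 9999999) -> list[dict]:
    # Impose organization: decode known sentinel values so a naive
    # average no longer concludes that "we're all rich".
    return [{**r, "income": None if r["income"] == missing_code else r["income"]}
            for r in rows]

cleaned = transform(load(extract()))
```

Keeping the load step dumb and pushing the cleanup into transform is exactly what makes the pipeline auditable: the raw landing zone still shows what actually arrived.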

Discovering dbt

Down the road, I think a long time, but I think dbt was still really in its infancy at the point I started doing a lot of this. But this is definitely the point where I started feeling very, very acutely, on a daily basis, the exact sort of pain points that dbt solves for. That I could build kind of my own workflows for, oh, if I want to test some SQL code, how do I set up test cases and unit tests when there's no native way to do that. I had some crazy workflows, like pulling data down into an R Markdown notebook from my database and kind of doing a lot more round trips than necessary.

Figuring out that a big part of data pipelines is orchestration, which I don't think you run into so much in analysis, but if you have a lot of long-running things, especially ones that depend on systems that aren't in your control, thinking about how do you make sure everything happens in the right order. Which is not a hard problem, there are a lot of great open source tools for doing that, but maybe not always ones analysts have at their disposal and fingertips to spin up the right infrastructure for.

So I think I just had my elevator pitch for the seven or ten things that were the biggest pain points of SQL, and then one day I was just at the gym listening to a data engineering podcast and heard one of the first (or I shouldn't say first, one of the first-to-me) interviews about dbt, and I was just like, oh wow, these people happen to be interested in solving some of the exact same problems I've been thinking of. Which I think, for a certain time in the data space, was a weirdly frequent feeling, just that sense of serendipity where you'd be working on something or thinking about something, and then within a week somebody put out a dbt package, put out an R package, asked the exact same question on Stack Overflow. It was definitely just, I kind of serendipitously fell into it, that I was having the same problem as a lot of other people.

Yeah, I found it took me a while to understand what dbt was, like why were people so excited about it, until I kind of realized it's like an IDE for SQL, plus the ability to write functions, plus the ability to use git. And I was like, oh wow, you didn't have that already for SQL? That's crazy, and obviously people are going to love it, because how do you do software engineering without functions and version control?

Yeah, I've realized it's for that reason a very hard thing to make the elevator pitch for, because, I mean, if you tell somebody that just has a couple of queries they're running, hey, what if you broke this into like 27 queries and added some yaml files in there, they're like, oh my goodness, that sounds terrible. But I think it's super easy to sell to people that have lived in the old world. And then people coming into it now, like I know there's a lot of talk of dbt versus sqlglot, and anything that you take as the base case, you can see, oh wow, there's still a lot of room for improvement here.

SQL as a language and the spec problem

And I think that's the number one weirdest thing of SQL as a language: it's not one thing, but there's a standard that everyone adheres to until, um, they don't. And of course introducing things like column selectors, which I'm a very big fan of, largely thanks to dplyr. But at the same time, that is a tough trade-off too, because every time you deviate from the standard, it's like, yes, but it's also slightly less interoperable.

SQL is really interesting to me because it's sort of this pre-internet technology in many ways. Like, SQL is a spec, there's an ANSI SQL spec, but you cannot actually get the spec without paying for it. So I actually have one of the few books I still keep: I have SQL-99 Complete, Really, and this is the only resource I've been able to find that actually explains how SQL is supposed to work, not how some specific database implements it.

And I think that's, I don't know, I find that kind of bizarre. I mean, luckily, I think recently Claude is filling a lot of that gap for me. Claude seems to have this knowledge of SQL that, through googling, I could never find the right website, but it must be on the internet, because Claude seems to understand it now. But I don't know, it's just really interesting to me, as someone who's been writing this translation layer from R to SQL, to try and figure out, what is SQL, what's the official way you're supposed to do this, and what do the databases actually support.

That is fascinating about the spec I didn't realize that was so not accessible like there's so many like fascinating topics in open source governance but I've never heard of that like truth rules but you can't see them. I mean you can get them but it costs like three thousand dollars or something. And I mean maybe it's easier maybe I just never had the right searches but that just yeah I found that like mystifying like and kind of even just this idea that like to me like if I want people to like do something to like follow something I've written like to me it's like obvious you want to give that away for free.

Column selectors

I think just to circle back and flesh out something you brought up, Emily: you mentioned selectors, and I wonder if it would be useful to explain what a selector is, because it has been added recently to a lot of SQL implementations, but maybe you could talk us through, just what is a selector?

Absolutely, and I think that tees me up for a question I've always wanted to ask Hadley, so, um, appreciate that. But, um, if you think about, I mean, I'll pander to the R crowd here to start out, but as for selectors: you have a data frame, you have a bunch of columns, and if you have in fact named your columns well, in a standardized way, or if you think about data types, there are a lot of different kinds of identifying information you could use to grab out and act on a set of variables. Um, so if you want to do some sort of mass data wrangling process, say in SQL maybe you want to take the average of every boolean variable you have in your table, if you're doing that in SQL you're going to be there a while, because you're going to be typing average variable one, average variable two, average variable three.

But in some of the more modern, um, kind of programming languages with more flexible APIs, you can have these really nice selectors, where you might say, for all boolean values, apply the same transformation, or, um, if you get a little clever with naming columns in a standardized way, for all of my variables related to this entity, or representing an indicator, do this operation. Um, so for kind of large-scale wrangling, it can just be a way to write a lot cleaner code and avoid a lot of typos or copy-paste errors, by consolidating your business logic and just applying or mapping it over many different variables.
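A language-agnostic way to picture what's described here: a selector is just a predicate over column metadata (name, type), and the transformation is mapped over whatever matches. This is a hypothetical stdlib-Python sketch with a dict-of-lists standing in for a data frame; none of it is dplyr's or polars' actual API:

```python
# Toy "data frame": column name -> list of values.
table = {
    "is_active":     [True, False, True],
    "is_delinquent": [False, False, True],
    "income":        [50000, 72000, 61000],
}

def select(table, predicate):
    """A selector: pick columns by a property of their name or
    contents, instead of listing each column by hand."""
    return [name for name, values in table.items() if predicate(name, values)]

# "Take the average of every boolean variable" in one pass:
bool_cols = select(table, lambda name, vals: isinstance(vals[0], bool))
averages = {c: sum(table[c]) / len(table[c]) for c in bool_cols}
```

The same pattern covers name-based selection, e.g. `lambda name, vals: name.startswith("is_")`, which is where well-designed column naming conventions start paying off.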

And I know the first place I saw this was in dplyr. I know, um, Spark has it, polars has it, a number of different SQL variants have it, pandas has it. But, um, Hadley, I've always wondered: to the best of your knowledge, was dplyr the first tool that had implemented those selectors?

I don't know, my recollection of this is vague. I kind of remember, around this time, learning that, oh, one of the problems people have is that they have a data frame with like 800 columns, and just selecting the correct columns is a pain. My vague recollection, and I did a little googling that kind of supports this, is I think Stata had some tools for variable selection, and I think that's where I learned about it from, from Stata users. Um, looking at the documentation for Stata, it looks like you can say, here's the start of the range, here's the end of the range, or you could select all the variables with a prefix. So I think that's probably it; I have a very vague recollection that maybe SAS has something similar as well.

But yeah, I think it came from statistical software, which is kind of surprising, because it does feel like this is something you feel the need for in SQL all the time, because you're just typing out... I was just looking at pandas to see if it had it, and I don't think it does. I mean, I haven't been super active in pandas in a long time, but, um, it doesn't look like it has selectors quite in the same way that dplyr does, or that, you know, DuckDB does now, for example. I think the ibis team implemented it, um, and I guess polars and ibis have selectors similar to dplyr now.

I knew I'd done something similar in polars, or in Python with pandas, of just grabbing out the columns and doing some list comprehensions and throwing that back in, but I didn't want to say that was the best way to go. But I'm fascinated if it came from Stata, because my main recollection of Stata from college is that you can't have two data frames in memory at the same time, so I did not think of these tools as bastions of user experience.

So interestingly, I'm looking at the docs for this on the Stata web page, and one of the things it mentions: there's specifically my kind of favorite selector in dplyr, that literally no one uses, called num_range. You give it a prefix, you give it a starting number and an ending number, and it will generate the names, so you can say, select me x1 through x50, really easily. And the Stata docs specifically say that there's no way to do that easily in Stata. So, I don't know, I think that's kind of evidence for, okay, I was like, well, Stata can't do this, I think that's something useful, and I'm gonna do that.

You mean that's like spite, like a response? Evidence of a response. Yeah, but it's also one of those functions where I'm like, I think this is cool, I would have thought people would use this, and basically no one does. Yeah, it's so fun. It's such a small behavior in a way, but it's so ubiquitous that it does make a big difference. When I've had to type SQL manually, this idea that you just type the same calculation over and over again and just change out column names, um, is kind of mind-blowing.
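For readers who haven't met it, the num_range() behavior described above is easy to mimic; this is a hypothetical Python equivalent of the idea, not dplyr's implementation:

```python
def num_range(prefix: str, start: int, stop: int) -> list[str]:
    """Generate column names prefix+start .. prefix+stop, so
    'select x1 through x50' is one call instead of 50 typed names."""
    return [f"{prefix}{i}" for i in range(start, stop + 1)]

cols = num_range("x", 1, 50)  # ['x1', 'x2', ..., 'x50']
```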

So it is funny, selection does seem like a really simple topic, but it's kind of crazy how much it shapes the quality of life. And I think, even for me personally, I always wish, I feel like if that existed in databases natively, people would think more, as they build their database, like names are something that can do a thing, and then think about them more carefully, like you would think about designing an API. I think, you know, when you get into kind of industry or production databases, you get into all these things where there are like 10 different ways to abbreviate account id, so you'll end up with like 10 different versions of that and be mentally tracking which belongs to which table. But I think for me it was just a big aha moment of, oh, my column names can actually do something, if I actually think of them as part of the software.

dbtplyr and cross-community pollination

And this, I think this is very much your, um, Column Names as Contracts post, is that right? Yeah, yeah, indeed, indeed. Yeah, nice. And I feel like dbtplyr, your dbt plugin, kind of built on that concept a bit. Could you explain a little bit about that?

Yeah, exactly. So, um, dbt, um, plugins, packages, whatever you want to call them, are the answer to what was called out about, um, SQL not really having a native function interface. Again, I think that's something that's wildly database-specific, some do and some don't, but even the ones that do, it's not really great to use them, because then you've just loaded code to a database; it's not really version controlled, you can't really see what it does, can't access it that easily. But, um, dbt, just, I guess, taking a step back, is essentially a collection of SQL scripts, macros, and other files, organized in a very specific way so it can kind of infer the dependency graph and execute.

Um, similar to how an R package is maybe just a lot of R files organized in a smart way so the computer knows what to do with it, they abstracted that a step forward, where, um, you can have dbt packages, which are kind of like database-agnostic chunks of SQL code that you can then import and call as macros; they can be at the function level, they can be at the table level, they can exist at a lot of different levels of granularity. Um, so dbtplyr was kind of an early-ish, um, dbt package that I put out there that was essentially stealing, um, a number of things I really liked about the tidyverse, but specifically around column selectors, and trying to port that API, um, into SQL, to solve this exact kind of problem of how do I grab out a set of columns, and how can I then apply transformations on those in bulk.
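The idea described here (select columns by a naming convention, then expand one SQL snippet per match) can be sketched outside dbt too. This hypothetical Python mirrors the spirit only; dbtplyr's real interface is Jinja macros inside dbt, and the names below (starts_with, across, the ind_ columns) are invented for illustration:

```python
def starts_with(columns: list[str], prefix: str) -> list[str]:
    # Selector over a table's column list (dbt knows these names
    # at compile time via the warehouse's information schema).
    return [c for c in columns if c.startswith(prefix)]

def across(columns: list[str], template: str) -> str:
    # Expand one SQL expression per selected column.
    return ",\n  ".join(template.format(col=c) for c in columns)

cols = ["ind_late", "ind_closed", "amt_balance"]
sql = ("SELECT\n  "
       + across(starts_with(cols, "ind_"), "MAX({col}) AS {col}_ever")
       + "\nFROM accounts")
```

This is why the column-names-as-contracts idea matters: the `ind_` prefix is doing real work, because the macro can only find the indicator columns if the naming convention holds.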

What was the reception like? Because when I went to dbt's conference, Coalesce, in 2022 to give a talk, they were kind of surprised that a person from RStudio came through. They were excited, but the one thing I remember is they kept saying, Emily is our representative from the R community, and they pointed to dbtplyr a lot as an example of a really interesting cross-pollination of ideas. What was the reception like? What was it like kind of going into that community?

I think it was interesting because I think there were people kind of going back to our the point about it being like a simple idea but that has legs of like there are people that like got it probably that had seen it before. I know there is one um dbt labs engineer at the time that um really was like knew a lot about R had been part of that community really kind of like like latched on to it. You know I mean I think people that had like seen how it worked somewhere else um kind of really got it. I think the thing I realized in retrospect was probably calling it like dbt selectors um would have probably been like far more useful and informative for like discoverability purposes um since like dbtplyr like kind of a shibboleth of like pre-limiting yourself to only having like the R part of the community having like any earthly idea of what you're talking about or what to expect from it.

Naming packages across languages

That makes me think of a question we've been struggling with that you might have some insight on, and that is: as we're creating more, uh, packages where we have an R version and a Python version, what do we name them? Like, I think with orbital and pins, we're like, okay, we're going to call the R package and the Python package exactly the same thing. Or do we do it like great tables, where we've got gt on one hand and great tables on the other. And I think there's another, I don't know if you remember, Michael, but I think there's another case where we're like, these are totally different names. Do you have any sense of what you think we should do when we're doing one package, the same idea, but implemented in two places, in two languages?

That's a really interesting question, because I'm sure you're also bound by different namespace availability in both languages. Yes. On one hand, like with orbital, I find it very satisfying, of, if I've heard about it in one language, then it's trivially easy for me to be like, oh yeah, I know they have that thing in the other language. But I do think, I mean, even with gt versus great tables, in a weird way there's a nice mental namespacing to the fact that I know these do not aspire to be at parity; the APIs within them aren't 100% the same, and I feel like it kind of fits my expectations right, that these are aiming at the same goal but they may not 100% get there in the exact same way.

Yeah, that's interesting, because orbital is a case where the API is simple enough that it's basically the same for R and Python. But obviously, the more complicated the package, the more it has to diverge. You want a package that feels R-like and a package that feels Pythonic; you don't want to feel like you're writing R code in Python or Python code in R. That's something I've so loved about Michael and Rich's work on both Great Tables and pointblank, and honestly even plotnine manages to get that distinctly more Pythonic feel in the Python version.

And something I think a lot about as I switch between languages: I definitely write Python code with an R accent, but at the same time, it feels like you're going to get so much further, and have a lot less pain, if you lean into the conventions of the language you're in.

I definitely write Python code with an R accent, but at the same time, you're going to get so much further, and have a lot less pain, if you lean into the conventions of the language you're in.

So it's been fun from the outside looking in to see where the languages converge and where they diverge. I think R has had a lot of impact on Python as well. If you look at the API of Polars or the API of Ibis, it's a much more pipe-centric, fluent API. A lot of people, myself included, looked enviously from the Python world into the R world and thought, wouldn't it be nice to be able to refer to column names inside of expressions, and not need to index into a data frame to get a reference to a column in order to do things?

Stuff like that, the non-standard evaluation you have in R, comes from its Lisp heritage. People in Python have tried to replicate some of the flavor, the niceties of that, with things like the underscore operator. But it's still hard to shoehorn that type of API into Python, which is in many ways the opposite of non-standard evaluation: everything must be as explicit and obvious as possible, and trying to do things magically is considered unpythonic.

Yeah, and that feels like a spectrum: "do we give these things the same name" is the very first question, and then there are the deeper questions of how we even architect it so it feels Pythonic, rather than having an R sort of smell to it, and how we actually lean into Python as a language. It seems so tricky to strike that balance on all those fronts.

I will say, for Great Tables: when we started working on it as a port of the R library gt to Python, the name wasn't available on PyPI, so right away that kicked off the need for a new name. What I appreciate about Rich is that he was open to it. Few people knew that gt stood for "grammar of tables," so I have to hand it to him for being willing to retroactively rewrite history and pretend the acronym stood for "great tables." I think few developers would be willing to do that.

I'd actually forgotten that; in my head I'd already translated it, yeah, gt stands for great tables. I'd forgotten it was "grammar of tables" inspired.

Learning in public and imposter syndrome

I'm curious, Emily, how do you choose the things you get into? Digging into column names as contracts, addressing the dbt community, it's so interesting. How do you choose which threads you want to pull?

I think I've always gravitated first toward things that might otherwise frustrate me, with so much curiosity about whether they could be better: don't we maybe need more internal packages, or surely there must be a better way to write this SQL code? Being drawn to the little paper-cut everyday problems has left me with a lot of curiosity and energy to explore things that would otherwise just drain you.

And that's coupled with the fact that I love learning about new tools, new algorithms, just squirreling away information that it doesn't feel like I'll ever need, saving it like a nut for winter. That was so easy especially early in my career, during the renaissance of R Stats Twitter, where you could be a fly on the wall for non-stop conversations among people so much smarter than you doing fascinating things. And then, very luckily, being able to do a lot of that pattern matching: oh, I have this problem, and I heard about dbt; or, I'd really love to never copy-paste this again, and I think there's a thing called R Markdown that exists.

So it's kind of: I'm curious about the things that frustrate me, and I have all these tools I want to try, and sometimes I just get lucky pulling the right tool from the toolbox.

Yeah, it's so cool, and I do think it's so interesting how much of this shines in your blog, where you're often stitching together multiple tools in really creative ways.

And something that came up when preparing for this is imposter syndrome, because with blogs, or when people put themselves out there like that, it can be a really big challenge. I'm really curious what that looks like for you, and what advice you'd have for people who are blogging or putting tools out and might feel a little imposter syndrome.

Absolutely. I mean, even preparing to join this podcast was like, what have I done? I could just be having a nice afternoon, and instead I'm talking to three luminaries of this field. But I think in some ways it's something you really have to get comfortable with, because if you aren't in rooms with people you feel you have a ton to learn from, that's kind of sad; you're probably in the wrong rooms. Going back to the last question: the way I learned was getting to absorb all this amazing content from other people out there, and if you weren't part of the conversation, you weren't going to get to learn and encounter those things.

So I've somehow always managed to get myself into a lot of places I really had no business being, even switching between the analyst and engineering hats, and getting comfortable jumping in and learning as you go. And the same with, obviously I'm not a PhD political scientist or economist, but the more I could peer over into those spaces and see the different ways they talk about causal inference, the more I learned. To a large extent, I think it's very good to feel like you aren't the smartest one there, because that tells you you're in the right place, where you're learning.

Sharing on the internet is always hard and scary, and I think it's to the great credit of the R community that it never felt that way. I've thought about what would have happened if I'd come into the field either five years sooner or five years later, both of which, rudely, were when the internet seems to have been a more hostile place, and whether I'd have felt confident just putting stuff out there. But I think Emily Robinson maybe said it best at some point in a conference talk: write for the person who is struggling with the things you were struggling with 6 or 12 months ago, and try to help them learn what you needed to know. That's a good way to remember there are other people out there working on the same stuff, and that you genuinely have something to contribute.

Write for the person who is struggling with the things you were struggling with 6 or 12 months ago, and try to help them learn what you needed to know.

Yeah, it's such a neat way to frame it, thinking back as a way to really emphasize your own growth, but also, it feels nice to be able to take that difference and write it down.

Yeah, and I think it's the best way to crystallize your own thoughts. Nothing I write will ever be "hey, this is exactly what I did at work today," because I'm pretty sure my employer would prefer I not go sharing IP all over the internet. But thinking about how I would help somebody else do this thing forces you to think at the right level of abstraction. To take the trivial example of column names: it's not "at my job it works better if I do this." It almost forces you to generate a theory behind things that is, in itself, a little bit more reusable.

Advice to your past self and Claude Code

This is maybe a tough exercise, but I'm curious what folks would say to themselves from 6 to 12 months ago. I'm almost curious to hear what people would tell themselves.

I think I would tell myself to try Claude Code earlier, though maybe I was already trying to use it six months ago. It does feel like, as you get further along in your career, your growth definitely slows. There's that period when you're first starting your job and there's this firehose of information and everything is, oh my god, this is amazing, and you look back at code you wrote six months ago and think, this is a heap of shit, why would I ever write that?

But it certainly feels like you plateau as you get further in your career. I guess that is more my worry now, that I am plateauing: when I look back at my code from a year ago, I think, oh, that's pretty good. And that feels kind of bad. I should be thinking, oh, that's kind of shitty, Hadley, you could do way better now.

For me, I started using coding agents basically as soon as I learned about Claude Code, I think in March, or maybe it was released in February; it was definitely the beginning of this year, so it took a little time for me to find out about it and start using it. But for a long time I held off, for no particularly good reason, on building bespoke tools for myself: identifying the things where, in the past, you'd ask whether it was worth building a solution to a problem. I think there's even an xkcd comic about it, a grid of how much time a task takes versus how often you do it, which tells you whether automating it is worth your while.

And now, with coding agents, that whole chart needs to be completely redone, because of how little time it takes, especially for something that's not very hard to build but is just for you and makes your life a little bit better, saving you five minutes a day, ten minutes a day, maybe an hour now and then. I've built things in the last three months where I think, I should have done this six months or a year ago. And seeing those successes has given me a sense of boldness, more of a willingness to dive into things, or to set an agent to work building something. Maybe it's only going to save me ten minutes twice a month, but whenever I get those ten minutes back, it's going to be super satisfying.

Yeah, I hate to pile onto Claude Code, but I'm similar. I think it was on a podcast with James Blair that I realized I was the only one not using Claude Code, which was a good realization. Now I'm using it a lot, but the most interesting and surprising thing to me has been that when I watch my friends use Claude Code, I always learn new things. So I would tell myself both to use Claude Code sooner and also to watch people use it, because with Claude Code it's so easy to just go off on your own and do stuff, and I'm endlessly surprised at how differently people approach things.

That reminds me of how, if you're a data scientist working with someone non-technical, the things they think are going to be really hard are often really easy, and the things they think are going to be easy are often really hard. We're in the same state with Claude Code and these tools: we don't actually know what is hard and easy for them. It's so easy to get stuck in your own assumptions; you have to see other people do it, and then you think, oh, I just assumed Claude Code could never do that, it'd be too hard for it, and it turns out to be an easy problem for it, for whatever reason.

I feel like that abstracts to so many things: the ability to learn from watching other people do things, which is such a hallmark of R and Python. That's something I've thought about a lot, also going back to SQL: there's so much less open source SQL code on the internet, because it's mostly code people wrote for business applications. I'm always fascinated by these little spaces where we are less able to learn from one another.

Yeah, and it's also just hard to share. This is one of the things I remember from trying to teach SQL: to teach it, you've got to spin up a database. How do you simulate that environment? That was definitely one of the things that always motivated me in the design of R packages: how do you shorten the time from downloading something to experiencing a win? If you have to go away and install Postgres on your computer first, that's a multi-day journey that involves a lot of pain and suffering.

And with R teaching, you always drop people directly into the analysis. I remember the first time I read R for Data Science, I thought, oh, this is cool, because this book isn't spending forever trying to teach me what a for loop is. It's fine, I get it. It's more like, hey, here's the diamonds data set, let's jump between some wrangling and some visualization.

One of the phrases I've found super useful is this idea of tacit knowledge, which I learned from Bill Behrman, who taught a fantastic data science course at Stanford that I got to help out with. It's the idea of all the things you know how to do but would never think to write down, and when you watch someone working you think, oh my god, you can do that? It's fascinating. Even within the tidyverse team, where we've worked very closely with each other for a long time, still, when someone shares their screen and does something, you're like, what? You can right-click on Positron in the dock and it lists your recent projects? Mind blown. You'd never think to write that down, because you assume everyone knows it.

Oh yeah, and that almost gets us back to imposter syndrome: anything anybody puts out in the world can be a net-new aha moment for someone. I've always felt that, to the extent I have a niche, it's writing about things that were too boring or ordinary or table-stakes for anyone else to bother putting pen to paper on. It's really satisfying when you find that tiny little thing that was too unimportant to say, but that actually feels so good.

Yeah, and that ties perfectly back to column selectors, which feel like this trivial little thing, until you remember that people have thousands of columns in their databases and this is actually a really painful problem. Let's just create one boring little helper that gets rid of all that pain.

Closing and posit::conf(2026)

Yeah, and selectors are the kind of thing you appreciate when you watch someone coding and typing: this is so fast. You kind of have to see people do it to really get just how powerful it is.

But I think they're kicking us out; we're getting long in the tooth here. Emily, I appreciate so much you coming on. It's been so neat to talk about all these things: how you've spanned communities, and I appreciate you surfacing the Stata origins of selectors, and this really important advice for people with imposter syndrome, how to get into different areas, embrace discomfort, and write for yourself from 6 to 12 months ago.

I also have to ask: are we allowed to talk about posit::conf?

Yeah, sure, if Emily's okay with it.

A little bird told me you might be the keynote at the next posit::conf. Is that right?

Yeah, I am both honored and, since Jenny emailed me, still speechless, which is probably not a good trait in a keynote speaker, but I'm so excited.

We're super excited to have you.

Thank you. I'm honored, and, going back to imposter syndrome, a little bit horrified, but honored. I'm looking forward to the journey, because I learn so much in the process of boiling my thoughts down into words. In the process of writing talks, you learn so much more about whatever it is you thought you were enough of an expert to talk about.

Yeah, 100%. We're so lucky to have you at posit::conf, and I'm so excited for the talk. Really appreciate you coming on.

Truly, thank you all, and thanks for the great show. I'm looking forward to being back as a listener next week.

Awesome, thanks.

The Test Set is a production of Posit, PBC, an open source and enterprise data science software company. This episode was produced in collaboration with creative studio AGI. For more episodes, visit thetestset.co or find us on your favorite podcast platform.