
The accidental analytics engineer

There’s a good chance you’re an analytics engineer who just sort of landed in an analytics engineering career. Or made a murky transition from data science/data engineering/software engineering to full-time analytics person. When did you realize you fell into the wild world of analytics engineering? In this session, Michael Chow (RStudio) draws upon his experience building open source data science tools and working with the data science community to discuss the early signs of a budding analytics engineer, and the small steps these folks can take to keep the best parts of Python and R, all while moving towards engineering best practices. Check the slides here: https://docs.google.com/presentation/d/1H2fVa-I4D8ibanlqLutIrwPOVypIlXVzEITDUNzzPpU/edit?usp=sharing Coalesce 2023 is coming! Register for free at https://coalesce.getdbt.com/


Transcript

This transcript was generated automatically and may contain errors.

Hello, hello, hello, everyone, and welcome to our next Coalesce talk, the accidental analytics engineer, presented by the one and only Michael Chow. If you will join us in the Slack channel, Coalesce the Accidental, then we can keep all discussions, questions, and memes going and flowing through there. Really happy to be going into this talk, and normally when we're introing a talk, I like to ask the speaker, what kind of vibes do we want going into your talk? And Michael's answer to me today was, listen to your heart and be guided by your heart. And I think a talk about becoming an accidental analytics engineer is a great one to be guided by your heart going into. Many of us didn't know originally that we were going to end up in data or in analytics engineering, yet via some mysterious meandering path, we found our way there and here. So I want to thank you all for joining us today and to Michael for telling us about the accidental analytics engineer.

Yo, thanks for having me. I'm so excited to be at Coalesce. I feel like it's freaky to just be immersed in a culture of analytics engineering where this is second nature to people. But I think what's really interesting about analytics engineering, as something that's emerged over the last several years, is that I'd imagine a lot of people in this room just sort of one day realized they had fallen into analytics engineering, or crossed into it from, say, data engineering, data science, or software engineering. I'm really interested in this idea of how people end up picking up skills, and what happens at that crossover when you pick up a new skill.

So part of my inspiration for the talk, the accidental analytics engineer, was this series of books about what to do when you realize you need to do something else. One book I really like is The User Experience Team of One. It threads the needle between being short enough that you can read it in a frenzy overnight and having enough content to fill in the basic jobs of UX. Another one is Project Management for the Unofficial Project Manager. I think this one is interesting: you've joined a team or a project, and you realize that in order for it to be successful, you might have to step up and be the project manager. And these books follow a really interesting format. They often spend the first third of the book just describing the gap you might see: how do you even spot the UX gap in a project? And then the next parts of the book usually cover how to fill it. What do you do once you recognize it? What are the basic things you need to patch that gap?

So in this talk, I want to discuss a little bit about helping data scientists spot that analytics engineering gap. And then give three quick survival tips for the accidental analytics engineer. So the person who found themselves maybe even near the edge of that hole or like who tumbled into it a little bit.

Background and how I fell into the pit

Just for some background, so my name is Michael Chow. I did a PhD in cognitive psychology studying human memory. I work full time on open source data tools at RStudio, which has about 40 engineers dedicated to open source full time. The weird thing is I work on all Python tools. So I spend a lot of time sort of bouncing between the Python and the R world. As a bonus, I have two beautiful cats, Bandit and Moondog. And one of them is especially sassy. I'll let you decide which it is. But as a cognitive psychologist working on tools, I'm really interested in sort of the idea of how skills, strategy, and tools interact to solve data problems.

So I think one big question is how did I realize that analytics engineering is a big deal? The short answer is that I didn't. The longer answer is that I sort of ramped up to really diving into this hole. In 2018, I was working as a data scientist at an education company on adaptive tests of data science skill. These are tests where, after each question, they try to estimate your skill level and then show you a good next question on the test. And this work was interesting because it was maybe 30% modeling, 70% user and product analytics. But the really weird thing was that it also was about data science. So I interacted with a lot of data scientists in R and Python and got to watch them work and tease out problems. And the one weird thing I spotted there was that R users were weirdly fast at data analysis.

And so in 2020, I worked on siuba, a port of the R library dplyr to Python, partially because I noticed this weird sentiment. Like, one person said, dplyr has forever ruined pandas for me. And that really caught my attention. I think pandas is a great library and does really important jobs. But I was really interested in what this tool could be doing for that person, and why they love it so much. So I ended up working on siuba with the goals of hitting fast interactive analysis and having a tool that could generate SQL, because R users can use the same dplyr code to query a database.
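For readers who haven't seen the dplyr style, here is a rough sketch in plain pandas of the kind of pipe-style group-and-summarize analysis that dplyr (and siuba, on the Python side) is built around. The data and column names are made up for illustration; siuba's contribution is letting the same verb chain also compile to SQL.

```python
import pandas as pd

# Toy orders table, made up for illustration
orders = pd.DataFrame({
    "status": ["pending", "shipped", "shipped", "pending", "shipped"],
    "amount": [10.0, 25.0, 5.0, 40.0, 20.0],
})

# dplyr style: orders |> group_by(status) |> summarize(n = n(), total = sum(amount))
# A pandas method chain expressing the same pipeline:
summary = (
    orders
    .groupby("status", as_index=False)
    .agg(n=("amount", "size"), total=("amount", "sum"))
)

print(summary)
```

The appeal people describe is that each verb in the chain is one small, composable step, which makes the fast interactive loop the talk mentions possible.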

So I'm feeling good. Flash forward to 2021, which is where I hit the hole. I switched to working on warehousing and analyzing transit data in California. I was feeling good. I'd hit analytics, I'd done some modeling. And so there, I made a natural choice: I did a lot of SQL transformations, and the tool I reached for was Airflow. And it worked pretty well. It was pretty sane. But then, as we scaled up the team, I started to notice that this workflow was pretty crazy. Like, if you bring in an analytics engineer, well, dbt has a community of like 40 or 50,000 people in its Slack, and they're maybe not gunning to take over this Airflow DAG. So I realized that I had basically fallen into the pit of analytics engineering.

And I wasn't totally unaware of this. I had a good friend, Rika Jackson, who I think is in the Slack, who kind of warned me about this with little gentle nudges: have you tried ELT, dbt, or BI tools? And so she really helped me see this bigger picture. But as I looked around the hole, I noticed a lot of other people. As I went to conferences like PyCon and RStudio conf, I started asking people, have you heard of dbt, or ELT? And I realize it sounds absolutely crazy here, because we're just immersed in it. But there, a lot of people had heard of dbt, but there was kind of a lack of familiarity. So I felt at least like I was among friends in this pit. But honestly, dbt has blown my mind. And so now I can't stop thinking about why I fell in that pit, and why people fall in that pit in general.

Two cultures of data work

I think the answer is kind of weird. I think it goes back to a paper called Statistical Modeling: The Two Cultures. This is a great paper that basically talks about how classical statistics and machine learning are two separate strands of statistics, two cultures. And what I love about it is that it's an attempt to identify the hole, the machine learning hole, that you'll miss and fall into if you cling to the one classical world. And I realize it sounds crazy here. But having gone to PyCon and RStudio conf, I can't help feeling like there are these two cultures out there. I think a year from now you can delete this talk; there will be one culture, analytics engineering, in the world. Or maybe it's already in the world.

So in terms of spotting the gap, I want to dig a little bit into these two cultures, just so we can really define the gap, and how I think data scientists learn about analytics engineering by getting too close to the hole. I'm going to talk about two worldviews: I'll call one the tidyverse worldview, and the other the modern data stack worldview. I think both of these have really strong virtues and values that they focus on, but they hit different dimensions of data work. I'll then loop around to filling the gap and focus on three areas that I think would have been really helpful for me: thoughts on learning dbt, like a first encounter; what is data modeling (I found that term very confusing, to be honest, but that was mostly on me); and then a concept that really struck me as important for, say, data science people to know, this idea of how people track data in and data out using dbt.

All right, so let's talk about the data science worldview, according to the tidyverse. This world is laid out in the book R for Data Science, by Hadley Wickham and Garrett Grolemund. Basically, the tidyverse is an ecosystem of tools for doing data science in R. I think they have a really important definition of data science that helps us zoom in on the kinds of things they focus on. The book mentions early on that data science is an exciting discipline that allows you to turn raw data into understanding, insight, and knowledge. And I think the key here is this idea of taking raw data and producing understanding, some kind of insight for people to latch on to.

So they have this nice workflow diagram that shows the flow through a data science workflow. You start with importing the data. That could be something in a bucket, it could be a CSV, it could be data in the warehouse that you're querying. Then you tidy it up, so you get it into a nice consistent format for analysis. And then you engage this arc of understanding: you transform the data, you visualize it, and you model it. And you go around the cycle until you hit a useful insight. Or, in other words, you produce, say, 100 bad plots until you find your one good one. Visualizing is like 80% bar charts, and the bar chart of modeling, I would say, is logistic regression. So you go around the loop to hit an insight, and then you work on communicating that insight out.

So this is a really powerful workflow that focuses end to end. There are a lot of really cool things that surface in the tidyverse around this. One of them is Tidy Tuesday, where every week a person named Tom Mock uploads a data set to a GitHub repo and tweets it out. And then a bunch of people analyze the data and share out their analyses and results. So this is a pretty cool way to see this data science in action and compare your approaches with other people. The other thing is, there's a person named Dave Robinson, who took it to another extreme and live streams himself analyzing this data once a week for an hour. So you can actually go on YouTube and see what it looks like to go from never having heard of the data to reports and dashboards, basically. So it's a deep culture of analyzing and sharing.

The other secret weapon they have is dplyr. This is a tool to analyze data frames, or to hit, say, SQL databases. And this is the backbone of what a lot of R users, people in the tidyverse, use to analyze data. The last piece, for communicating out, is Quarto. R users love writing their analyses in text files, where you have a config up top, and then you can mix in markdown for narrative and code cells for producing outputs. So, a text notebook. And the result looks like this. But the kind of freaky thing is that things like Quarto can also build websites. The Quarto website itself is built in Quarto, and lots of people have their personal websites or blogs in Quarto. So it's a very end-to-end R tool chain.
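To make the "config up top, markdown narrative, code cells" structure concrete, here is a minimal sketch of what a Quarto (`.qmd`) file looks like. The title, chunk contents, and `orders` data are hypothetical; the YAML header and the `{r}` cell syntax are the real Quarto conventions.

````markdown
---
title: "Weekly orders report"
format: html
---

Some markdown narrative explaining the analysis goes here.

```{r}
# An R code cell; its output is rendered into the report
library(dplyr)
orders |> count(status)
```
````

Rendering the same file with `quarto render` can produce an HTML report, slides, or a page in a website, which is what makes it an end-to-end tool chain.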

So just to circle back: tidyverse users love to turn raw data into understanding. There's this culture of sharing work. There's this tool called dplyr that a lot of people use to analyze the data, and then Quarto to communicate it out, either as a report or a website or slides. So that's the tidyverse worldview: I would say, raw data to insights.

On the other hand, you have the modern data stack. And I think in the dbt docs, I think by Claire Carroll, there's a nice definition: analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their questions. And I think that's huge. If the tidyverse is about raw data to insights, this idea of cleaning data and giving it to someone is, I think, the crux of where these frameworks start to differ.


But I think there's a lot of overlap. If you squint your eyes, you can sort of see the modern data stack in this workflow. In a sense, the data engineer ensures that import, or ingest, extract and load, happens cleanly. You have an analytics engineer focused on tidying and transforming. And then, say, you have analysts at the end doing a lot of visualization, modeling, and communicating. And you sometimes see this tension, or opportunity, of analysts wanting to do the transforming themselves. To take it one step further, I think you can superimpose the modern data stack diagram: the data loaders, transforming data in the warehouse, and at the end BI tools and other data consumers.

So I think the value of this framework is, I don't mean these are simply equivalent. I think the big value of this perspective, and the specialization of these jobs, is that oftentimes people analyzing data see this workflow as a one-off: I need to take this data, I need to produce this insight. Analytics engineers know that this isn't a one-off, that actually there's a whole bunch of this happening in an org. And if you don't have someone maintaining sanity in the tidy-transform layer, things are going to spiral out of control real fast. The other thing is that there might be an army of analysts at the end, in which case it would be really bad if the tools weren't very friendly. So that was one thing that really opened my eyes: seeing the sheer amount of thought that's gone into BI tools, and that analytics engineers put into both transforming the data and being sure it plays well with BI tools.

So just to lay out the two cultures: I think tidyverse people are very interested in pulling raw data and producing understanding. They're people who have their homes on their backs, right: they have a single tool chain, they can do it on their laptop, they can put a website up on GitHub super fast. I think modern data stack people are so good at scaling up this function, and they know how to split responsibilities. And I think the tell is, if you're really focused on serving data to end users, that split is going to become really important. So I think those are the two worlds.

Filling the gap: survival tips for the accidental analytics engineer

So I think that's the gap. And what I realized is basically that problems turn into needing to serve data so fast that, if you don't have an analytics engineer in your organization, you're probably in big trouble. For filling the gap, I think there were a few things that would have made it a little easier to get started. And these are things that, at conferences, I found really helpful to run past people who hadn't heard of dbt.

So first, learning dbt. To me, a big difference between these two groups is: are you a run-locally person or a cloud person? If you're a cloud person, you're set. The dbt docs have you covered; you can set up a BigQuery account and fly through the docs. If you're a run-locally person, this repo, jaffle_shop_duckdb, is super helpful. Mach speed, no explanation needed. It's all about using DuckDB to run the dbt demos locally. Honestly, I think for the people at PyCon and RStudio conf, this is manna from heaven for hitting the docs.
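For the run-locally crowd, the reason DuckDB makes this so fast is that the warehouse is just a file on disk: no account, no credentials service. A minimal `profiles.yml` for the dbt-duckdb adapter looks roughly like this (the profile name and file path are hypothetical; the `type: duckdb` and `path` keys are the adapter's documented configuration):

```yaml
# profiles.yml: a minimal dbt-duckdb profile
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb            # use the dbt-duckdb adapter
      path: jaffle_shop.duckdb  # the whole warehouse is this local file
```

With that in place, `dbt build` runs the whole project against the local file, which is what lets you fly through the tutorial without cloud setup.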

For the second part, what is data modeling? As a cognitive psychologist who has done some psychological models, and a person who does statistical models, I was honestly very confused about what exactly data modeling is. But you hear it all the time. And now that dbt has Python models, it's a little more confusing, because now you just don't know. But a really nice introduction to data modeling, for me, was snapshots. And this is more zoomed in; I know data modeling covers a broad set of things. But I think snapshots are a really nice, slightly scary introduction to data modeling that tells people they need to take it seriously. It also highlights the freaky amount of thought analytics engineers have put into these problems.

And so, just to go over the gist of snapshots. The idea is that on day one, in an app database, you might have this record. And then on day two, it might get updated and the status field might change: it was pending, now it's shipped. And in analytics land, wouldn't it be so nice if we just had both these records in our warehouse? I think the concept is really simple, but the docs break it down really nicely. And I think it's a nice introduction to this language of slowly changing dimensions, which I was really surprised not a lot of people at other conferences had heard of. It's probably close to a lot of people's hearts here. But I feel like it's a great thing to introduce to, say, classical data scientists or ML people.
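As a sketch of what this looks like in a dbt project, a snapshot is a `select` wrapped in a snapshot block, with a config saying how to detect changes. The source, table, and column names here are hypothetical; the block syntax, the `timestamp` strategy, and the `unique_key`/`updated_at` options are dbt's documented snapshot interface.

```sql
-- snapshots/orders_snapshot.sql: sketch of a dbt snapshot
{% snapshot orders_snapshot %}

{{
    config(
      target_schema='snapshots',
      unique_key='order_id',        -- which rows are "the same" record
      strategy='timestamp',         -- detect changes via an updated-at column
      updated_at='updated_at'
    )
}}

select * from {{ source('app', 'orders') }}

{% endsnapshot %}
```

Each time `dbt snapshot` runs, changed records get a new row, and dbt maintains `dbt_valid_from` / `dbt_valid_to` columns marking when each version was current, which is exactly the pending-then-shipped history described above.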

I love snapshots because it's like, but wait, there's more, it gets crazier. There's this whole thing: have you tried to join a snapshot, though? And dbt has a great blog post on this. There's all this weird stuff, like, should we future-proof the dates somehow? Would there be value in adding a date from the year 9999? Look at this crazy join statement you're going to have to use. And then finally, a macro that just restores sanity. As a person who wasn't too familiar with analytics engineering, this is nice, because you can grasp it, but it also goes freaky deep. And it has weird problems in it.
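To give a flavor of the crazy join, here is a sketch of joining two snapshots on overlapping validity windows. The table and column names are hypothetical; `dbt_valid_from` / `dbt_valid_to` are the columns dbt snapshots actually maintain, and coalescing the open-ended `dbt_valid_to` to a far-future date (the year-9999 trick mentioned above) makes the range comparison uniform for still-current rows.

```sql
-- Sketch: which customer version was current for each order version?
select
    o.order_id,
    o.status,
    c.plan
from orders_snapshot o
join customers_snapshot c
  on o.customer_id = c.customer_id
  -- the two validity windows must overlap
 and o.dbt_valid_from < coalesce(c.dbt_valid_to, cast('9999-12-31' as date))
 and c.dbt_valid_from < coalesce(o.dbt_valid_to, cast('9999-12-31' as date))
```

This is the kind of logic the macro in the dbt blog post exists to hide behind a sane interface.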

Tracking data in and out: freshness and exposures

The last thing that I'll hit on quickly is some simple concepts that end up being really powerful for people who are less familiar with analytics engineering to keep in mind. These relate to how data comes into and goes out of a DAG. The first is freshness: when did data last come in? The second is exposures: where is data going to go? If you're like me, I've been in a lot of orgs where people actually didn't know the answers to these questions. And at conferences, these concepts are very foreign. I don't think that's a bad thing; I think people focused on analytic work are sometimes more focused on communicating out than on the big-time data tracking that analytics engineers do.
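Both of these are small YAML declarations in a dbt project. The source, table, and dashboard names below are hypothetical; the `freshness` / `loaded_at_field` keys and the exposure fields are dbt's documented configuration.

```yaml
version: 2

# Freshness: when did data last come in?
sources:
  - name: app
    loaded_at_field: _etl_loaded_at   # timestamp column set by the loader
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders

# Exposures: where is data going to go?
exposures:
  - name: weekly_kpi_dashboard
    type: dashboard
    owner:
      name: Analytics team
    depends_on:
      - ref('fct_orders')
```

With these in place, `dbt source freshness` answers the first question, and the lineage graph in dbt docs shows exactly which downstream dashboards break if a model changes, which is the second.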

So this is just to show it. Again, I probably don't need to sell you on it. But it's so simple to do, in some ways, and so convenient to have when you have dozens of analysts and their notebooks, and they're emailing them to people, and you want to know what will break if you change these.


So just to recap, we talked a little bit about spotting this idea of two cultures of data science, and then about filling in the analytics engineering gap, some small things that would have helped, or did help me, in getting started.

The last thing I want to say is, if you know someone who's freaky fast at analytics, or analytics engineering, I would love to get in touch with them. As a person who's really interested in R users and Python users analyzing data they've, say, never seen before, I'd love to see, Tidy Tuesday style: how fast can analysts go? How do the different BI strategies let people know what's going on? How do they let people move quickly? So feel free to reach out on Twitter at chowthedog. I can send my calendar. Because I think there's actually a lot to surface there about what analysts do and what strategies really work for them.

So thanks for listening. And thanks to a bunch of people for contributing to this talk. Especially the MLOps team at RStudio that I sit on; Chris Adams, who did a ton of design for it; and Chris Cardillo at Astronomer, who provided a ton of feedback. Thanks.