What even is dbt? An Analytics engineer explains | Laurie Merrell & Michael Chow

Transcript#

This transcript was generated automatically and may contain errors.

Welcome to the Data Science Lab, y'all. I'm Libby. I'm a Data Community Manager here at Posit. My co-host is Isabella. Isabella, you want to say hi?

Yes. Hi, everyone. Thank you so much for joining us. Happy New Year.

Yeah. Happy New Year, everybody. Usually, Dan is with us, Dan Chen. I think he's teaching right now. We'll say hi to Dan in spirit. If you are new around here, hello, welcome. The Data Science Lab is new. It's weekly. We get together. We talk about more technical stuff, and it's very unscripted and fun. It's like pair coding with your friends, right? We're all going to learn something new today. Today, we are talking about dbt, a tool that I have feared for a long time because I don't know anything about it. I'm a data scientist. I'm a stats person. I am not an engineering person or a computer science person, so it's been really intimidating. Thankfully, we have Michael Chow and Laurie Merrill here to help us learn about it and navigate.

I will have you two introduce yourselves. Michael, would you like to tell us a little bit about who you are? Sure. Hey, I'm Michael Chow. I work as a software engineer at Posit. I'm so excited to talk about dbt, I think, because prior to this, Laurie will introduce herself, but I worked helping build out a data team to work on some transit pipelines. I have to say, I did a really bad job with a lot of the orchestration things, and then Laurie came in and really, like, sponged her on to gold.

I'm so excited for Laurie to be here. I'm the barbarian side of dbt, where I feel like I have the appreciation. I have a healthy appreciation of dbt, but Laurie actually has substantive expertise, so Laurie, do you want to introduce yourself?

Yeah. Hi, everybody. I'm Laurie Merrill. I'm lead analytics engineer at Jarvis Innovations, and I have been working with dbt since about 2022, and I actually learned dbt to refactor the code base that Michael kind of handed off to me, and so I definitely think dbt is something that it's, I think it's hard to learn in the abstract. It's very much like a kind of practical tool that you, I think, sort of have to learn by doing, and I primarily work on public transit data, as Michael mentioned, so we do modern data infrastructure for public transit organizations, so that's kind of my background, yeah.

Well, it's so good to have you both here, and Laurie, you have some fans in the chat. They're like, I wasn't going to miss hearing Laurie talk about dbt. I'm here for Laurie, so we're so excited to have you. Let's go ahead and get started, but first, I would love for everybody to, like, we've done this a couple of times in other places, and it makes me smile so much. If you happen to have a rubber duck, like a debugging duck, with you on your desk, please feel free at some point to share a picture of it in the Discord or hold it up and say hi. Yeah, James has Figment with him, his debugging duck. This is Rocket. I also have Desk Goose with me here from Untitled Goose Game, who has a hair clip in his mouth.

All right, everybody. Michael, do you have any debugging ducks?

Laurie has one. I have, I mean, I have this baguette. Could it be a debugging? A debaggetting? Yes. Debaggetting, you know? We will allow it. Laurie, your duck has a little space helmet. Yeah, I believe that this is Jarvis swag from some point. I don't know, Michael, if you ever got one, but we got space ducks at some point. I think they saved the swag only for people who were DBT users, not DBT appreciators. You know, that wasn't enough to get the joke.

What is dbt and why does it exist?

All right, well, let's go ahead and get started. Michael and Laurie, can you please let us know what we are doing today, what we're learning?

Yeah, why don't, maybe I'll crack open, so Jenna shared a link to the talk, which was literally, after I joined Posit and Laurie came into Jarvis, which is where I went. So I went from Jarvis to Posit. Laurie came in, fixed all my mistakes. The talk Jenna shared was about sort of like how I realized how important DBT was. So that's a good one. I'm just going to share, maybe I could just share like one slide that I think is really important and was aimed at our users from that talk. And Laurie, I think has some stuff to add to it. And then I'll go into running like a really simple DBT project. And just looking around and just seeing, because if you've tried to run DBT, you may have seen like there's a lot of parts. And sometimes it's hard even to just fire up the basics.

Yeah, I told Michael, I just shared a beginner project that Michael is going to be using a GitHub repo in the chat. I told him I opened it and I cloned it to my machine and opened it in Positron . And then I was like, crap, I don't know where to go from here. I don't know what I'm supposed to be doing. Totally. Yeah. So this is, just to foreshadow, this is the project we'll open.

So we'll go into this Jaffa Shop, DuckDB, just a basic look at DBT and how you can see it running and what the different parts are. What I'm going to do is I'm going to run it in a code space. So I clicked code and then I clicked create a code space. So that might be good to do if you want to follow along. You could fire that code space up right now. It takes like a minute or two to start.

Yeah. So the one thing I just want to explain about why dbt really fast comes from this talk, the accidental analytics engineer. And what I'll just say is like, if you're coming from R or even doing data analysis in Python, you may think of data analysis from this perspective. So this is the data science sort of workflow that Hadley Wickham introduced in R for data science. And so you might think of data analysis like this, where you sort of start from some raw data and you tidy it up and transform it a bit, and then go into this cycle of understanding. And once you've sort of visualized and modeled the data, done this sort of exploratory data analysis and found something interesting, maybe you go on to communicate it through, say, like a Quarto document or a dashboard, you know, ship the results out. This is a very end-to-end workflow.

And I think a lot of people on smaller teams get a lot out of this workflow, sort of going from raw data to insights. What I'll say is that in a lot of the context, I think where dbt is helpful and to try to make sense of why or to use dbt, I think that big orgs tend to be more like this. So you chop it up. So you might have a data engineer for most what you think of as importing. So like extracting and loading data into, say, like a database or like a bucket. You have an analytics engineer who then is doing a lot of the tidying and transforming. They call this sometimes like data modeling. I should have put it in air quotes, but that's like structuring the data in a way that's really friendly for analysis.

And then you have the analysts here that might be in like a BI tool or in notebooks somewhere pulling that tidy and transform data and analyzing it, say, pulling it from a database. So I think enterprises and bigger teams tend to see the process like this. There's more sort of like handoff between each of these stages. And this is a figure from a dbt article about the modern data stack, which sort of shows how they view it in this way. So they have like data loaders, then they have dbt here and then analysts in BI tools. I'll also say I think they also see it like this. Like there's a lot of these analyses floating around. There's a lot of data being extracted, a lot of data being transformed and a lot of data being analyzed. So once you go to that workflow, you tend to get like a lot more need to really like engineer and test and work out processes around data transformation. Laurie, do you have anything?

Yeah. Yeah, I think the way that I think about it is one, I think it's important to have this context in mind because dbt isn't necessarily a tool that is super intuitive if you're kind of tinkering with it on your own, because it's really designed to facilitate kind of like organizational collaboration and sort of production use and orchestration of like production data pipelines. And so it's not a tool like you can learn how to use dbt on your own, but like it may not really a lot of its features are really geared towards like kind of running and being deployed in a kind of production environment and specifically collaboration on a project. And so I think that's one of the things that can be confusing for people trying to like learn dbt is that it's really designed to be used together to maintain a data warehouse in an organizational context. And it's not necessarily something that like a lot of the tools may not like a lot of its features may not make a lot of sense if you're just like working on it locally. And the other way I frame it sometimes is thinking about from an analyst perspective dbt is sort of the tool that is like preparing it's kind of moving up the floor of the analyst workflow. So instead of like me as an analyst having to go like grab you know super raw messy data from some kind of source it's like raising the floor of that data. So instead of going to that source and doing all of the kind of like really nitty-gritty cleaning like the analytics engineer and the data engineer are trying to give you as the analyst cleaner more prepared data so that you're focusing more on the like analysis and insights and less on the data cleaning and transformation.

So instead of like me as an analyst having to go like grab you know super raw messy data from some kind of source it's like raising the floor of that data. So instead of going to that source and doing all of the kind of like really nitty-gritty cleaning like the analytics engineer and the data engineer are trying to give you as the analyst cleaner more prepared data so that you're focusing more on the like analysis and insights and less on the data cleaning and transformation.

Running the demo project

So maybe with that in mind I'm gonna crack open this project so we can just see dbt in action because I think to Lori's point like it's a big machine. There's a lot of if you're just coming at it from the perspective of like doing a single data analysis it feels like a lot of boilerplate. So it can be pretty overwhelming to like run. But I honestly I think once you have someone crack it open and kind of like point out the different pieces I feel like it's way easier to just understand what you're getting into.

So hopefully we'll see I'm gonna be asking so many questions. I'm gonna have like a note stack open and please yeah I mean I'm gonna be asking questions too and Lori I feel like and this is also like everybody in discord this is the time to stop Michael and Lori and ask questions to be like what did you just click? What was that? What does that file a folder mean? Like this is the time so don't feel shy about asking. Yeah I would I'd be really curious too to hear in like discord like what people are interested in dbt like what are people curious about? Like are you curious about using dbt? Are there people in your org using it?

DBT stands for data build tool and it's all lowercase always. I don't know why but yeah. It's a framework. Yeah great question. Yeah I have some slides when I talk about this where it's like dbt is like literally what is it is I think for this we're going to be talking about dbt core which is just a python like what it literally is is a python package that can execute like sql. It basically provides a framework for like executing a bunch of sql scripts in a specific order against like a data warehouse environment and so it it's a python package that provides like a wrapper around sql code that you're running in an environment. The company dbt labs also there's like a sass product which is dbt cloud which is like a hosted version. They just last year released dbt fusion which is sort of like a new in-between like layer which you know folks can go look at the product information for that but dbt core kind of like the basic is a python package.

Yeah I think too I find one tricky thing with when people start talking about data tools is sometimes when they answer what something is they get very abstract like in in my mind like dbt is what it does and this I think joffleshop this example joffleshop.db is useful because it just shows what dbt does so you can just see like what goes in what comes out. Yeah Michael walked through like a little bit of this for me and I just like just see what it did I'm like oh now I know why dbt is.

Yeah I mean let's let's run the demo and I think that'll contextualize a lot like you'll see once it runs like kind of where python comes in which is not really at all. Yeah I was gonna say this is not I think once we get through I'm a python power user and I'm using this no no no and honestly all these questions make so much sense. Yeah they do I mostly say that because I think often people are wondering whether dbt is like a language of some kind or also if it's some kind of data storage like dbt does not store data itself it's not any kind of like database itself it's like it's a framework for storing your code basically it's like a way to organize your data transformation code.

I've been calling it a framework not a language. Yeah yeah and all right let's let's run this before we talk anymore because we're 18. Yeah yeah I think we'll just get more questions yeah I had all these questions too so so I'm glad people are asking but okay so I just went into the code space I created one this is one that's already run a little bit and installed and so it's really fast like this tutorial I think where does it have that little drop down that's like how okay so if you go down here running this project this is the fastest possible explanation they have so they're like in these bits of code you can run this project in our code space we're already we've already done these two things and I think we maybe even did the virtualenv so if I do it has done it for you uh let me see or are you saying that you did it before you spun this before you shared your screen with us uh I I think that it did it for us but why don't I just run it again this step so I'm gonna run this part okay why don't I just run this I'll do this step by step so you can see it and as a reminder for everybody he's in the terminal this is cli he's not in a python console or python rebel yeah yep okay so that ran and now we have our virtual environment and then I'm gonna install these requirements this might take a sec

okay so while that installs I'm gonna grab the next command uh it's weird because we already did that so that's not necessary but I'm just gonna do it anyways because I'm a good direction follower so sometimes there's a reason and they're just not telling you what it is yeah

all right and then this is where I think this will hopefully make sense so just to show you like right now we have um this folder which is models and notice why don't I just open it on github so you can see it there so in models notice there are these sql files um and basically we're gonna get tables for every sql file so if we look in there's a bunch of sql and there's there's a final sort of query down here so every sql file we want to basically produce a table from for this java shop data so when I run dbt uh what's the command lori dbt run build dbt build this is where my idiocy is going to show so dbt build all right let's see

okay so it ran and then the key is it has created a database called jaffa shop duck db and that has tables for each sql file basically in this project so if you go into models you'll notice there's like customers there's orders let's just look at that so what I'm going to do is I'm going to install this um duck db uh cli so I can see it a little more closely so I already ran this but um why don't I just run it again in the chat oh wait Michael did oh yeah I dropped it into discord let me drop it good job zoom too this is so we can kind of get to discord yeah this is so we can poke around a little bit notice that one problem is they say you need to run this command to do duck db so I'm going to click I'm going to paste this in um and run it I already did so all right and now when I type duck db I have this little command line where I can see the um stuff so now I'm going to do show tables

uh-oh oh I need to connect yep yeah yeah wait wait wait sorry duck db and then do I just I put the name of the database in right yeah I think that's why it's suggesting you can auto come yeah

yeah oh very cool okay all right this is why we pair all right so notice now we have a customers table we have an orders table and that matches these sql files and models um so that that shows us like dbt did something to create these tables from these sql files um you might notice we have all these other tables and um so there's a way to find these the staging tables are in here so notice there's a sql file for stage customers stage orders stage payments and then raw is a little bit tricky it corresponds to this data in this seeds folder so my whole point is like there's there's already kind of like a lot going on but I just want to point out like we have a table for everything in seeds these raw customers say maps to here we have a table for every sql file inside models slash staging so stage customers is here and we have a sql file we have a table for all the other sql files so like orders.sql and customers.sql so let's just like stop and appreciate like already so much has happened but the key is like these things in seeds and the sql files and models now each have a table so just I would just think about that like dbt build produced a table in the database for these different files some of them were csvs those are raw data some were sql with sql queries

um and yeah laura do you yeah I think this is where the some of the sort of challenge like the the challenge with like using a sandbox environment like this is that the reason that we have those csv files and the duck db file locally are so that this is like a self-contained local example like in a kind of organizational context your raw data would probably be coming from some upstream you know your apis or your buckets or whatever like your that's your raw data um so those csv files here are taking the place of that so that this is a self-contained example and then that duck db file that is local that we're looking at this cli you know in reality that might be your snowflake or your data bricks or your big query or such um so like I think this is one of the challenges is sometimes to make it work locally you kind of have to add these things that also make it more confusing um because like it's not that dbt itself you have to have duck db locally or you have to have like your raw data in csvs locally like that's part of making this a sandbox but um it I think it sometimes makes it harder to understand the concepts

because um you have to do some stuff to make it work I personally find it easier because before the duck db stuff existed you had to connect to snow you had to connect to big query like all their docs and their tutorials were like sign up for big query so you had to like connect to a warehouse so I'd almost flip it and I'd say like if you have these csvs and duck db you can just run everything locally it's true you won't get quite the same production environment because you won't be like hitting snowflake or big query um but you can kind of like open it up and tinker it at least no it definitely makes it more accessible to learn to be able to have the sandbox but I think it can be confusing about how the sandbox concepts translate to like real life examples of using dbt that's more what I mean is that like the way that you would use a seed in a kind of real dbt project is not all your raw data is csvs in the repo yeah like um and so I think it's it's more like the the conceptual mapping is like a little unclear yeah but yes yeah it's not very nice that you can have duck db and actually do this without having to figure out connecting to a cloud yeah

Questions from the audience

um yeah that's a great point pause really quickly and answer some questions in the discord that are popping up that might give us some clarity yeah um so if I can just if you don't mind just reiterating what Lori said like I feel like one really important thing she just said is that seeds like Lori pointed out like seeds is really useful here for a tutorial like a local tutorial but you're not gonna at work you're probably not gonna find a seeds folder because your data will say like already be in a warehouse or somewhere else it's pretty they're not going to check all your data in to your like dbt git repository which I think is important to flag yeah

okay Hubert had asked I assume there's some sort of co-currency or dependency capability thinking of like oracle job chains that have dependencies maybe yeah tackle that yeah so that's kind of like that's one of the kind of superpowers of dbt so Michael can you scroll up to where you did the dbt build and show like what it actually printed um so what dbt did is it's not just that it created all those tables but it created them in order based on dependencies that it is kind of um like inferring from like how the those sql files are specified so you can see it runs like each one like it has this ordered list of run this then run this then run this um and that is happening because I don't know Michael can we pull up the um yeah like maybe staging costs well no those are gonna be weird because they're gonna pull from source or I guess if they're seeds it might be okay um here maybe we can bail out if it gets yeah but um oh you're right there's even a note here yeah so the way that it's inferring like okay I need to run like raw customers before I run staging customers is because of that line on line 7 where it says select star from ref raw customers that like ref macro is saying like this staging customers model depends on that raw customers model so I need raw customers to exist before I can run staging customers and so the the ref macro here is like one of the really important parts of dbt because it allows you to like it will infer your dependency tree from the use of this macro um the other thing that's important about this is you can see it's just saying ref raw customers it's not specifying like a full table name in the in the database it's not saying from you know duck db dot schema dot table name it's just giving the name of the model like within the dbt project and so this is doing two things it's like making the dependency tree but it also means that dbt can run sort of against different environments like you're not fully specifying storage locations within a specific warehouse so this helps you have things like maybe a test warehouse environment and a production warehouse environment and your dbt project you can write the code in a way that's like environment agnostic and then dbt resolves those references for you um so the ref macro is like one of the like really really important um like features of dbt that helps with both of those things of the dependency tree and then also making it sort of environment agnostic and letting you can run the same dbt project against different warehouse instances and it will resolve those it will make raw customers into that like schema dot table name according to like the settings of the connection between the dbt project and the environment if that makes sense

and so the the ref macro here is like one of the really important parts of dbt because it allows you to like it will infer your dependency tree from the use of this macro um the other thing that's important about this is you can see it's just saying ref raw customers it's not specifying like a full table name in the in the database it's not saying from you know duck db dot schema dot table name it's just giving the name of the model like within the dbt project and so this is doing two things it's like making the dependency tree but it also means that dbt can run sort of against different environments like you're not fully specifying storage locations within a specific warehouse

adrian had said i've written some python files it's more getting an orchestrator i think that's the term to run code hopefully in the cloud and not on my laptop in a somewhat automated fashion and i'm wondering if dbt has any of that execution um so kind of um dbt is not a generic orchestrator in that sense like your airflow dagster prefect dbt is sort of like half of an orchestrator and more specifically for sequel um and so dbt can have some python um code in it but it's not dbt is not gonna like run an arbitrary python script on a schedule so normally with dbt project if you're using dbt core you would have a separate orchestrator like an airflow dagster prefect that's running that's like kicking off your dbt project like dbt core you can't like schedule it itself it it does the dependency and the ordering stuff but it doesn't have like the scheduling and it doesn't have like an execution environment on its own so you would have like an airflow that kicks off your dbt project run if that makes sense um so it has some of the features but it yeah not all

okay good to know um aaron asked if we're starting this project from scratch it would just start with the sql code pulling into tables yes that very first thing that he ran that dbt build it looked at a couple of different places in order to make tables so it looked for the models be like dot sql files but it also looked for our csv files that we had in our seed folder um and as michael and laurie said those csv files are sort of conceptually taking the place of database connections where we would call database connections we wouldn't have these like flat files locally but we're trying to use dbt locally when it's really about to be like a system that's plugged into a larger system

um and we did have a question a couple of questions from saraj and marcos that are kind of connected so i'd love to tackle those um both saraj and marcos had asked like how is this better than writing python functions or python script to do this automation what are the benefits of using dbt versus just a python script so dbt has a ton of features baked in for things that you want if what you're trying to do is specifically i on a regular cadence want to pull in data and then run a bunch of uh transformations on it using sql like that's kind of like volume size of the data is part of the difference um it's that it gives you a lot of additional features so it has things like the ref macro like i've i've looked at a project that i think where people kind of home grew a lot of the features of dbt and like manually orchestrating your sql ordering and like or your execution of like run this and run this like dbt kind of gives you that out of the box dbt also gives you um again a kind of a framework for how you connect to the different database environments like out of the box that you might not want to write yourself of like okay here's how i manage different connections dbt also or uh can we look at the yaml file michael schema.yaml yeah um so dbt also gives you out of the box these things like the a framework for testing so um you can just specify in yaml files like different tests that you want to run as you're executing your uh transformations um and so it gives you like it kind of just gives you a lot of things that you would likely want to do that you don't then don't have to like manually um implement yourself basically um and dbt is also definitely kind of i think it comes from the idea that you have a cloud warehouse where you're running this and so there is also the the question of it's like assuming that you're running in kind of like a sort of bigger data cloud warehouse instance

yeah um and then on the dbt docs home page um so the way you like the way you interact with it on the left is you can navigate it through either the project which is like the folder view and that will show things as they're organized in the dbt project like the the directory structure within your repo basically if you go to the database view it will show how things are organized within the database so it'll show by like um by like schema basically um and so those are the two ways to kind of navigate and then each model has its own page in the doc site so we can click on any of those and see um and it'll have at the top it has like kind of the summary information it has the columns um and like each of those descriptions that you see there you can specify those in the yaml file like within your dbt project you can write docs kind of inline and then they'll render here depending on your settings they can also be persisted into your warehouse um so that you have kind of the same like the same column descriptions can appear in multiple places um and that yeah that's and that's really um so that's really about the benefits of dbt this is a huge one because otherwise all of this would be either manual or through another docs framework

so just to map it back to so like this is we're looking at the customers table and we can see like yeah these are descriptions on the columns and where that comes from is so notice in the customers folder this is a bit tricky but there's a schema.yaml file i don't laurie maybe in a sec you can talk about how people like to write the dot yaml files because i know there's like some conventions around that but um here notice it's like these are our models so in dbt a model is a i'm afraid to say it's a table or the the output is a table so each sql file produces a table and this models is a list of tables and notice yeah i think the reason that the reason that it's model rather than just table is because part of i think part of what the sort of dbt philosophy is the idea that your data transformations and a lot of this metadata should live in version controlled code right and that the materialization like whether something is a table or a view is like an implementation detail and the important part is the business logic that's encapsulated in that model so it's like i mean for folks who've worked you know sometimes you have like underscore v like each each thing in the database is like oh is it a table or is it a view and that's kind of often like how things have been organized and dbt is like whether it's a table or view it's just an implementation detail related to like performance and storage and from your perspective you should care that it's the customers thing and like when i need to get customers i go there and i as a downstream consumer shouldn't care if it's a table or view maybe i need to change it later based on performance considerations that arise over time but that's why model is used as like a generic like it's it's you're building your your uh data model like conceptually more than right

and this house yeah it's a good call out for a bunch of data scientists yes yeah it's very confusing model is data model not statistical model or machine learning model we're talking about a data model

dbt docs and lineage

yeah so so um dbt docs so part of again this idea of dbt as a framework so when dbt runs there are slightly different run modes but basically dbt is compiling like you know is compiling all of the references and it's like inferring that dependency tree and it outputs um some like special json files that basically kind of have all that compiled information and then that allows you to do different things kind of like to parse your project and one of the things that dbt gives out of the box is the docs um and is it target is target that has all yeah it's it's in target and it's um for for docs it's catalog.json um oh yeah here so this is kind of like the internal yeah how the sausage gets made yeah so these are huge really gnarly json files that contain like a lot like all the metadata and like all of the resolved information that dbt does um but then dbt provides kind of like a user interface on top of that through the doc site um where you can see uh information about each of the models um and so it'll give you wait where's the graph i can't remember oh and bottom right the little yeah so it'll show you your um your dependency lineage as a and you can if you want to well i guess this one's really basic but if you click the to get here um so yeah go back to the home page michael of the doc site maybe the doc site let me restart the whole because it's that's what we should share

it's kind of like an if you know you know like these things are clickable and this actually expands but also how you gen do we want to reiterate how you generated it so on in the terminal dbt docs generate and dbt docs serve yeah so did uh dbt docs generate okay it's like oh i found a bunch of models and seeds okay and it tells me where the catalog is and then dbt doc serve this is a bit tricky in code spaces because it'll ask permission to like open um like notice it's on right it's open like a special link it's not like a link that you can share it's going to open locally on your machine if you're running it locally yeah so you can click here to if you command click it kind of knows to open it in a special way um so that's one kind of code space tip great thank you didn't mean to interrupt but that's no no i'm glad yours and then i'm sorry that that command was run in python or somewhere uh this is all this isn't the terminal oh okay it's a terminal command through the dbt command line interface like so dbt has its own command line interface cool

dbt versus python, and the analytics engineer role

yeah um and then on the dbt docs home page um so the way you like the way you interact with it on the left is you can navigate it through either the project i will say uh there are like there is this interesting space in dbt which is like anything that can generate sql can be made into a dbt adapter so there are like for example ibis is similar to db plier but for python and it can generate sql and um ibis dbt adapter where are you here like there is this funny space of like people who have integrated like cool that can generate sql into dbt um but it's this space isn't like super common so you're kind of like a bit off the like beaten path is the problem

i think to jenna's point like yeah this this quote like if you choose not to use dbt you'll probably waste time building a less fully featured buggy implementation of it yourself that's real um and the reality is just like r and python are way less integrated into dbt so if you're used to like generating sql and db plier there's just not like really common well trod paths to move that in and i think i think part of it's also that analytics engineers don't want to worry about all these like bespoke tools when a lot of them are very like sql savvy

all right cool let's hop over to question from saraj which said does it make sense to say that the data source detection is dynamic with dbt unlike snowflake big query etc where we have to manually set sources and the sources can be different formats um you uh i'm not i'm not 100 sure how to answer that you you do have to specify in dbt what your sources are um like you on two levels so one you connect dbt to the warehouse environment where it's going to execute so like it only has a dbt project can only execute against a warehouse that has been configured to connect to like there's you know there's an aspect of like credential like you have to be running it through a connection in a warehouse um and then the other thing is you do specify where your sources live um so this is where the sandbox is a little odd i don't think they actually have a sources like they don't have anything configured as sources because they're using all the seeds i guess yeah but normally the way it would work is that you would have in in a yaml file like this you would specify like say that my upstream source is um like a an external table um that like data is landing in a bucket and i've got an external table i would tell dbt about that external table and like it needs to know where to find yeah you can see external location so like dbt needs to know where your source uh lives and like they have to be named so that dbt can like resolve the references the dynamic aspect is more in the reference resolution like within the dbt project um in terms of like that the reference macro that the ref macro that we looked at um but you do have to kind of give it defined sources

yeah i will say this is a this tool dbt duck db is a fun way to kind of beef up these tutorials which is like rather than so here we generated a dot duck db database basically but what dbt duck db adapter lets you do is your database could be a file a folder of parquet files so essentially like your whole database could just be reading and writing from a folder of parquet um and that's that's what's shown here in this source is like my external location is this s3 parquet bucket um so duck db is good with that kind of stuff

uh i like that that definitely fills in that gap okay cool um there was a non-question sort of a question it was a pseudo question let me go find it yeah i know i saw jenna also mentioned running the docs dbt yes doc should we do that or dbt docs generate and then dbt docs serve and if we could talk a little bit about the importance of those and jenna if you would like to hop in feel free we can send you a little like thing that lets you unmute

okay yeah so so um dbt docs so part of again this idea of dbt as a framework so when dbt runs there are slightly different run modes but basically dbt is compiling like you know is compiling all of the references and it's like inferring that dependency tree and it outputs um some like special json files that basically kind of have all that compiled information and then that allows you to do different things kind of like to parse your project and one of the things that dbt gives out of the box is the docs

yeah i do i do think to hadley's point like what made sense to me was like i've seen a lot of people build dbt like things to jenna's point too like actually there are so many ways to make it work and if you're used to tools and r you've actually probably built really reasonable you could have built like really reasonable systems that automatically detect dependencies honestly it's not a really hard tooling problem but i think that um i think i think the problem dbt solved is it created a job like the analytics engineer role like you know what an analytics engineer should learn and it's mostly sequel uh and these and orchestration with things like dbt and the fact that this group of people have converged on like practices and language means you don't have this like tower of babel where everybody's doing their own yeah special orchestration tool with it's like yeah i think it's why we love tidyverse we're like we have a syntax and everyone's going to use that same syntax and we're all going to speak the same language and we're all going to understand each other and that means we can work together even better

but i think that um i think i think the problem dbt solved is it created a job like the analytics engineer role like you know what an analytics engineer should learn and it's mostly sequel uh and these and orchestration with things like dbt and the fact that this group of people have converged on like practices and language means you don't have this like tower of babel where everybody's doing their own yeah special orchestration tool

yeah yeah like i think dbt basically took a lot of things that again that you would want to do if what you're doing is i have i have this amount of data that needs to flow through like every day or every hour and i'm always needing to do the same steps um and i'm doing that in the context of like a data warehouse dbt just bundles together a lot of features that make your life easier if that's what you're doing um and so it's the like taking away a lot of boilerplate code it's adding data testing it's giving you this kind of docs site for free um and the so the dependency management is really important but it's like it's part of this kind of coherent whole and it's the idea that like managing my data warehouse is a it's like a collective thing i think it's also shifting to the idea that the data warehouse is like is kind of like a product and like that you need that people should be kind of curating and building it collectively i think often um at organizations that have big data warehouses sometimes there's kind of like a free-for-all and like different teams are managing different schemas and like people are kind of like using data um like whatever they can scrounge up what they can find who they know and it's trying to kind of like make a more coherent whole and kind of help you talk precisely and collaboratively about your data warehouse

um right i do think what maybe the one last one of the most interesting features of dbt which is hard to show off because you kind of need like a big environment to really appreciate this pain is like branching that like if you're developing if you have like six developers working on a warehouse and there you're working on like some set of tables and you're like okay i know i know what i'm doing starts here at these tables and i just want to like test out changing these tables down here like how do you have like a development environment for your work like how do you just test putting up a few tables in a warehouse that used to be really hard like it used to be like you either had a development or a staging warehouse and and you still might but like dbt is really good at helping you like fork out kind of like your development tables without having to rename everything or use like funny kind of like environment variable magic it can just handle like your like lori's development tables for this pr um which used to be a kind of a crazy problem uh to solve