
What even is dbt? An Analytics engineer explains | Laurie Merrell & Michael Chow | Data Science Lab
The Data Science Lab is a live weekly call. Register at pos.it/dslab! Discord invites go out each week on live calls. We'd love to have you! The Lab is an open, messy space for learning and asking questions. Think of it like pair coding with a friend or two. Learn something new, and share what you know to help others grow.

On this call, Libby Heeren is joined by Jarvis Innovations Lead Analytics Engineer Laurie Merrell and Posit Principal Software Engineer Michael Chow as they walk us through a beginner dbt project and let us ask as many questions as we like (and we do, we ask all the questions, including, WHAT EVEN IS dbt??). This is a super friendly, MESSY, collaborative, and curious peek at dbt. It's a tool that's often mysterious to data scientists, and it's a big enough framework that it can feel tough to get started with. Walking through the basics makes it way easier to get into!

Hosting crew from Posit: Libby Heeren, Isabella Velasquez

Laurie's LinkedIn: https://www.linkedin.com/in/laurie-merrell/
Michael's socials and URLs:
LinkedIn: https://www.linkedin.com/in/michael-a-chow/
Bluesky: https://bsky.app/profile/mchow.com
GitHub: https://github.com/machow

Resources from the hosts and chat:
Michael Chow's talk about dbt at the Coalesce Conference in 2022: https://www.youtube.com/watch?v=EYdb1x1cO9U
Beginner dbt project Michael is using: https://github.com/dbt-labs/jaffle_shop_duckdb
Laurie's Coalesce talk with Ian and Jenna: https://www.youtube.com/watch?v=6aX7tAfMmIM
Link to installation page for the DuckDB CLI: https://duckdb.org/install/?platform=macos&environment=cli
"Why is dbt so important" shared by Jenna in the chat: https://highgrowthengineering.substack.com/p/why-is-dbt-so-important-
dbtplyr: https://hub.getdbt.com/emilyriederer/dbtplyr/latest/
Parquet: https://parquet.apache.org/
From stored procedures to dbt: A modern migration playbook: https://www.getdbt.com/blog/stored-procedures-dbt-migration-playbook
How we structure our dbt projects: https://docs.getdbt.com/best-practices/how-we-structure/1-guide-overview
Jenna Jordan's blog on dbt mesh: https://jennajordan.me/blog/data-mesh-dbt

► Subscribe to Our Channel Here: https://bit.ly/2TzgcOu
Follow Us Here:
Website: https://www.posit.co
The Lab: https://pos.it/dslab
Hangout: https://pos.it/dsh
LinkedIn: https://www.linkedin.com/company/posit-software
Bluesky: https://bsky.app/profile/posit.co
Thanks for learning with us!

Timestamps
00:00 Introduction
01:09 Guest introductions: Michael Chow and Laurie Merrell
04:15 Overview of today's session
05:51 Setting up the GitHub Codespace
07:00 The data science workflow vs. organizational needs
10:06 Why dbt is hard to learn in the abstract
13:34 "Could we back up and explain what dbt is again?"
19:12 Running 'dbt build'
20:00 Inspecting the database with DuckDB CLI
26:21 "Does dbt have concurrency or dependency capabilities?"
27:37 Understanding the 'ref' macro
29:52 "Is dbt an orchestrator?"
31:14 "Starting a project from scratch with just SQL?"
32:04 "How is this better than writing Python scripts?"
35:46 "Is data source detection dynamic with dbt?"
38:36 Generating and serving dbt docs
46:51 "Is dbt an IDE like RStudio, but for SQL?"
52:32 Branching and development environments
53:57 "Where would you begin on a brand new project?"
56:38 "How would you validate dependencies and downstream impacts?"
57:48 Defining a view versus a table
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Welcome to the Data Science Lab, y'all. I'm Libby. I'm a Data Community Manager here at Posit. My co-host is Isabella. Isabella, you want to say hi?
Yes. Hi, everyone. Thank you so much for joining us. Happy New Year.
Yeah. Happy New Year, everybody. Usually, Dan is with us, Dan Chen. I think he's teaching right now. We'll say hi to Dan in spirit. If you are new around here, hello, welcome. The Data Science Lab is new. It's weekly. We get together. We talk about more technical stuff, and it's very unscripted and fun. It's like pair coding with your friends, right? We're all going to learn something new today. Today, we are talking about dbt, a tool that I have feared for a long time because I don't know anything about it. I'm a data scientist. I'm a stats person. I am not an engineering person or a computer science person, so it's been really intimidating. Thankfully, we have Michael Chow and Laurie Merrell here to help us learn about it and navigate.
I will have you two introduce yourselves. Michael, would you like to tell us a little bit about who you are? Sure. Hey, I'm Michael Chow. I work as a software engineer at Posit. I'm so excited to talk about dbt because, prior to this, and Laurie will introduce herself, I worked helping build out a data team to work on some transit pipelines. I have to say, I did a really bad job with a lot of the orchestration things, and then Laurie came in and really, like, spun it into gold.
Yeah. Hi, everybody. I'm Laurie Merrell. I'm lead analytics engineer at Jarvis Innovations, and I have been working with dbt since about 2022. I actually learned dbt to refactor the code base that Michael kind of handed off to me, and so I definitely think dbt is something that's hard to learn in the abstract. It's very much a practical tool that you sort of have to learn by doing. I primarily work on public transit data, as Michael mentioned; we do modern data infrastructure for public transit organizations. So that's kind of my background, yeah.
Well, it's so good to have you both here, and Laurie, you have some fans in the chat. They're like, I wasn't going to miss hearing Laurie talk about dbt. I'm here for Laurie, so we're so excited to have you. Let's go ahead and get started, but first, I would love for everybody to, like, we've done this a couple of times in other places, and it makes me smile so much. If you happen to have a rubber duck, like a debugging duck, with you on your desk, please feel free at some point to share a picture of it in the Discord or hold it up and say hi. Yeah, James has Figment with him, his debugging duck. This is Rocket. I also have Desk Goose with me here from Untitled Goose Game, who has a hair clip in his mouth.
All right, everybody. Michael, do you have any debugging ducks?
Laurie has one. I have, I mean, I have this baguette. Could it be a debugging? A debaguetting? Yes. Debaguetting, you know? We will allow it. Laurie, your duck has a little space helmet. Yeah, I believe that this is Jarvis swag from some point. I don't know, Michael, if you ever got one, but we got space ducks at some point. I think they saved the swag only for people who were dbt users, not dbt appreciators. You know, that wasn't enough to get the duck.
What is dbt and why does it exist?
Yeah, maybe I'll crack it open. So Jenna shared a link to the talk, which was literally right after I joined Posit and Laurie came into Jarvis, which is where I had been. So I went from Jarvis to Posit, and Laurie came in and fixed all my mistakes. The talk Jenna shared was about how I realized how important dbt was, so that's a good one. I'm just going to share maybe one slide from that talk that I think is really important and was aimed at our users, and Laurie, I think, has some stuff to add to it. And then I'll go into running a really simple dbt project and just looking around and seeing the pieces, because if you've tried to run dbt, you may have seen there are a lot of parts, and sometimes it's hard even to just fire up the basics.
Yeah, I just shared the beginner project that Michael is going to be using, a GitHub repo, in the chat. I told Michael I opened it and cloned it to my machine and opened it in Positron, and then I was like, crap, I don't know where to go from here. I don't know what I'm supposed to be doing. Totally. Yeah. So this is, just to foreshadow, the project we'll open.
So we'll go into this jaffle_shop_duckdb project, just a basic look at dbt, how you can see it running, and what the different parts are. What I'm going to do is run it in a codespace. So I clicked Code and then I clicked Create a codespace. That might be good to do if you want to follow along. You could fire that codespace up right now; it takes a minute or two to start.
Yeah. So the one thing I just want to explain about why dbt really fast comes from this talk, the accidental analytics engineer. And what I'll just say is like, if you're coming from R or even doing data analysis in Python, you may think of data analysis from this perspective. So this is the data science sort of workflow that Hadley Wickham introduced in R for data science. And so you might think of data analysis like this, where you sort of start from some raw data and you tidy it up and transform it a bit, and then go into this cycle of understanding. And once you've sort of visualized and modeled the data, done this sort of exploratory data analysis and found something interesting, maybe you go on to communicate it through, say, like a Quarto document or a dashboard, you know, ship the results out. This is a very end-to-end workflow.
And I think a lot of people on smaller teams get a lot out of this workflow, sort of going from raw data to insights. What I'll say is that, in a lot of the contexts where dbt is helpful, and to make sense of why you'd use dbt, big orgs tend to be more like this. You chop it up. You might have a data engineer for most of what you think of as importing, so extracting and loading data into, say, a database or a bucket. You have an analytics engineer who then is doing a lot of the tidying and transforming. They sometimes call this "data modeling", I should have put it in air quotes, but that's structuring the data in a way that's really friendly for analysis.
And then you have the analysts, who might be in a BI tool or in notebooks somewhere, pulling that tidied and transformed data and analyzing it, say, pulling it from a database. So I think enterprises and bigger teams tend to see the process like this; there's more handoff between each of these stages. And this is a figure from a dbt article about the modern data stack, which shows how they view it in this way. They have data loaders, then they have dbt here, and then analysts in BI tools. I'll also say I think they also see it like this: there are a lot of these analyses floating around, a lot of data being extracted, a lot of data being transformed, and a lot of data being analyzed. Once you go to that workflow, you tend to get a lot more need to really engineer and test and work out processes around data transformation. Laurie, do you have anything?
Yeah. Yeah, I think the way that I think about it is, one, it's important to have this context in mind, because dbt isn't necessarily a tool that's super intuitive if you're kind of tinkering with it on your own. It's really designed to facilitate organizational collaboration and production use, orchestration of production data pipelines. You can learn how to use dbt on your own, but a lot of its features are really geared towards being deployed in a production environment, and specifically towards collaboration on a project. I think that's one of the things that can be confusing for people trying to learn dbt: it's really designed to be used together, to maintain a data warehouse in an organizational context, and a lot of its features may not make much sense if you're just working on it locally. The other way I frame it sometimes is from an analyst perspective: dbt is sort of raising the floor of the analyst workflow. Instead of me as an analyst having to go grab super raw, messy data from some source and do all of the really nitty-gritty cleaning, the analytics engineer and the data engineer are trying to give me, as the analyst, cleaner, more prepared data, so that I'm focusing more on analysis and insights and less on data cleaning and transformation.
Running the demo project
So maybe with that in mind, I'm going to crack open this project so we can just see dbt in action, because, to Laurie's point, it's a big machine. If you're coming at it from the perspective of doing a single data analysis, it feels like a lot of boilerplate, so it can be pretty overwhelming to run. But honestly, I think once you have someone crack it open and kind of point out the different pieces, it's way easier to understand what you're getting into.
So hopefully we'll see. I'm going to be asking so many questions; I'm going to have a notes tab open. Please, yeah, I'm going to be asking questions too, and this goes for everybody in Discord: this is the time to stop Michael and Laurie and ask questions. What did you just click? What was that? What does that file or folder mean? This is the time, so don't feel shy about asking. Yeah, I'd be really curious too to hear in Discord what people are interested in with dbt. What are people curious about? Are you curious about using dbt? Are there people in your org using it?
dbt stands for data build tool, and it's all lowercase, always. I don't know why, but yeah. It's a framework. Yeah, great question. I have some slides where I talk about this, like, what is dbt, literally? For this we're going to be talking about dbt Core, and what it literally is, is a Python package that can execute SQL. It basically provides a framework for executing a bunch of SQL scripts in a specific order against a data warehouse environment, so it's a Python package that provides a wrapper around SQL code that you're running in an environment. The company dbt Labs also has a SaaS product, dbt Cloud, which is a hosted version, and last year they released dbt Fusion, which is sort of a new in-between layer; folks can go look at the product information for that. But dbt Core, the basic thing, is a Python package.
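As a rough sketch of what "a Python package that executes SQL scripts in a specific order against a warehouse" means, here's a toy version in Python. This is not dbt's actual implementation; it uses the stdlib sqlite3 module as a stand-in for a warehouse like DuckDB or Snowflake, and the model names and SQL are invented for illustration:

```python
import sqlite3

# Hypothetical "models": name -> SQL, like the .sql files in a dbt project.
models = {
    "stg_customers": "SELECT 1 AS customer_id, 'Ada' AS name",
    "customers": "SELECT customer_id, name FROM stg_customers",
}

con = sqlite3.connect(":memory:")  # stand-in for your warehouse connection
for name, sql in models.items():   # dbt would order these by dependency
    con.execute(f"CREATE TABLE {name} AS {sql}")

print(con.execute("SELECT name FROM customers").fetchone()[0])  # Ada
```

dbt Core's real value is everything around this loop (inferring the order, managing connections, testing, docs), but the core idea really is "run SQL files against a database, in order."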
Yeah, I find one tricky thing when people start talking about data tools is that, when they answer what something is, they sometimes get very abstract. In my mind, dbt is what it does, and this example, jaffle_shop_duckdb, is useful because it just shows what dbt does, so you can see what goes in and what comes out. Yeah, Michael walked through a little bit of this for me, and once I saw what it did, I was like, oh, now I know what dbt is.
Yeah, let's run the demo, and I think that'll contextualize a lot. You'll see once it runs where Python comes in, which is not really at all. Yeah, I was going to say, I think once we get through it, it's not like "I'm a Python power user and I'm using this." No, no, no. And honestly, all these questions make so much sense. I mostly say that because I think often people are wondering whether dbt is a language of some kind, or also whether it's some kind of data storage. dbt does not store data itself; it's not any kind of database itself. It's a framework for storing your code, basically, a way to organize your data transformation code.
I've been calling it a framework, not a language. Yeah, yeah. And all right, let's run this before we talk any more, because we're at 18 minutes. Yeah, I think we'll just get more questions; I had all these questions too, so I'm glad people are asking. Okay, so I just went into the codespace I created. This is one that's already run a little bit and installed, so it's really fast. This tutorial, where does it have that little dropdown? Okay, so if you go down here, "Running this project," this is the fastest possible explanation they have. In these bits of code, you can run this project. In our codespace we've already done these two things, and I think we maybe even did the virtualenv. Let me see. Or are you saying that you did it before you spun this up, before you shared your screen with us? I think that it did it for us, but why don't I just run it again, step by step, so you can see it. And as a reminder for everybody, he's in the terminal; this is the CLI. He's not in a Python console or Python REPL. Yep. Okay, so that ran, and now we have our virtual environment. Then I'm going to install these requirements; this might take a sec.
All right, and then this is where I think this will hopefully make sense. Just to show you, right now we have this folder called models. Why don't I just open it on GitHub so you can see it there. In models, notice there are these SQL files, and basically we're going to get tables for every SQL file. If we look inside one, there's a bunch of SQL, and there's a final sort of query down here. So for every SQL file, we want to basically produce a table for this jaffle shop data. So when I run dbt... what's the command, Laurie? dbt run? build? dbt build. This is where my idiocy is going to show. So, dbt build. All right, let's see.
Okay, so it ran, and the key is it has created a database called jaffle_shop.duckdb, and that has tables for each SQL file in this project. If you go into models, you'll notice there's customers, there's orders. Let's just look at that. What I'm going to do is install this DuckDB CLI so I can see it a little more closely. I already ran this, but why don't I just run it again. In the chat... oh wait, Michael did. Oh yeah, I dropped it into Discord; let me drop it. Good job. Zoom too, so we can kind of get to Discord. This is so we can poke around a little bit. Notice that one problem is they say you need to run this command to get DuckDB, so I'm going to paste this in and run it. I already did. All right, and now when I type duckdb, I have this little command line where I can see the stuff. So now I'm going to do SHOW TABLES.
Yeah, oh very cool. Okay, this is why we pair. All right, so notice now we have a customers table and an orders table, and that matches these SQL files in models. That shows us dbt did something to create these tables from these SQL files. You might notice we have all these other tables too. The staging tables are in here; notice there's a SQL file for stg_customers, stg_orders, stg_payments. And then raw is a little bit tricky: it corresponds to the data in this seeds folder. My whole point is there's already kind of a lot going on, but I just want to point out that we have a table for everything in seeds (raw_customers, say, maps to here), we have a table for every SQL file inside models/staging (stg_customers is here, and we have a SQL file for it), and we have a table for all the other SQL files, like orders.sql and customers.sql. So let's just stop and appreciate that already so much has happened. The key is that these things in seeds and the SQL files in models now each have a table. I would just think about it like that: dbt build produced a table in the database for these different files. Some of them were CSVs, that's raw data; some were SQL files with SQL queries.
And yeah, Laurie, do you... Yeah, I think this is where some of the challenge with using a sandbox environment like this comes in. The reason that we have those CSV files and the DuckDB file locally is so that this is a self-contained local example. In an organizational context, your raw data would probably be coming from some upstream source, your APIs or your buckets or whatever. Those CSV files here are taking the place of that, so that this is a self-contained example. And that local DuckDB file that we're looking at with this CLI, in reality that might be your Snowflake or your Databricks or your BigQuery or such. So I think this is one of the challenges: sometimes, to make it work locally, you have to add things that also make it more confusing, because it's not that with dbt itself you have to have DuckDB locally, or have your raw data in CSVs locally. That's part of making this a sandbox, but I think it sometimes makes it harder to understand the concepts, because you have to do some stuff to make it work. I personally find it easier, because before the DuckDB stuff existed, you had to connect to Snowflake, you had to connect to BigQuery; all their docs and their tutorials were like, sign up for BigQuery. You had to connect to a warehouse. So I'd almost flip it and say that if you have these CSVs and DuckDB, you can just run everything locally. It's true you won't get quite the same production environment, because you won't be hitting Snowflake or BigQuery, but you can open it up and tinker with it at least. No, it definitely makes it more accessible to learn, to be able to have the sandbox, but I think it can be confusing how the sandbox concepts translate to real-life examples of using dbt. That's more what I mean: the way that you would use a seed in a real dbt project is not that all your raw data is CSVs in the repo. The conceptual mapping is a little unclear. Yeah, but yes. Yeah, it's very nice that you can have DuckDB and actually do this without having to figure out connecting to a cloud. Yeah.
Questions from the audience
Yeah, that's a great point. Let's pause really quickly and answer some questions in the Discord that are popping up; that might give us some clarity. Yeah, and if you don't mind, just to reiterate what Laurie said: one really important thing she just said is that seeds, like Laurie pointed out, are really useful here for a local tutorial, but at work you're probably not going to find a seeds folder, because your data will already be in a warehouse or somewhere else. They're not going to check all your data into your dbt git repository, which I think is important to flag. Yeah.
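For intuition, `dbt seed` essentially loads each CSV in the seeds folder as a table in the warehouse. Here's a toy stand-in for that idea, stdlib only, with sqlite in place of DuckDB and an inlined CSV string in place of a file like seeds/raw_customers.csv (the data is invented):

```python
import csv
import io
import sqlite3

# Inlined stand-in for a seed file such as seeds/raw_customers.csv.
raw_customers_csv = "id,first_name,last_name\n1,Michael,P.\n2,Shawn,M.\n"

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE raw_customers (id INTEGER, first_name TEXT, last_name TEXT)"
)

# Load each CSV row into the table, roughly what `dbt seed` does for you.
rows = list(csv.DictReader(io.StringIO(raw_customers_csv)))
con.executemany(
    "INSERT INTO raw_customers VALUES (:id, :first_name, :last_name)", rows
)
print(con.execute("SELECT COUNT(*) FROM raw_customers").fetchone()[0])  # 2
```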
Okay, Hubert had asked: I assume there's some sort of concurrency or dependency capability, thinking of, like, Oracle job chains that have dependencies? Yeah, so that's one of the superpowers of dbt. Michael, can you scroll up to where you did the dbt build and show what it actually printed? What dbt did is not just create all those tables; it created them in order, based on dependencies that it's inferring from how those SQL files are specified. You can see it runs each one; it has this ordered list of run this, then run this, then run this. And that is happening because... I don't know, Michael, can we pull up the... maybe staging costs... well no, those are going to be weird because they're going to pull from source, or I guess if they're seeds it might be okay. Here, maybe; we can bail out if it gets weird. Oh, you're right, there's even a note here. So the way that it's inferring, okay, I need to run raw_customers before I run stg_customers, is because of that line on line 7 where it says select star from ref('raw_customers'). That ref macro is saying: this stg_customers model depends on that raw_customers model, so raw_customers needs to exist before stg_customers can run. The ref macro is one of the really important parts of dbt, because it will infer your dependency tree from the use of this macro. The other important thing is that it's just saying ref('raw_customers'); it's not specifying a full table name in the database. It's not saying from, you know, database.schema.table_name; it's just giving the name of the model within the dbt project. So this is doing two things: it's building the dependency tree, but it also means that dbt can run against different environments, because you're not fully specifying storage locations within a specific warehouse. This helps you have things like a test warehouse environment and a production warehouse environment: in your dbt project you write the code in a way that's environment agnostic, and dbt resolves those references for you. So the ref macro is one of the really, really important features of dbt, helping with both of those things, the dependency tree and environment agnosticism. You can run the same dbt project against different warehouse instances, and it will resolve those references, turning raw_customers into that schema.table_name according to the settings of the connection between the dbt project and the environment, if that makes sense.
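To make the ref mechanics concrete, here's a toy sketch, not dbt's actual compiler, of how scanning model SQL for ref() calls yields a dependency graph and an execution order. It uses only the Python stdlib, and the model names and SQL are invented for illustration:

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Toy models using dbt-style {{ ref('...') }} placeholders (names invented).
models = {
    "raw_customers": "SELECT 1 AS id",
    "stg_customers": "SELECT * FROM {{ ref('raw_customers') }}",
    "customers": "SELECT * FROM {{ ref('stg_customers') }}",
}

# Infer each model's upstream dependencies by scanning for ref() calls,
# roughly what dbt does when it compiles a project.
deps = {
    name: set(re.findall(r"ref\('(\w+)'\)", sql))
    for name, sql in models.items()
}

# Run order: every model's dependencies come before it.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['raw_customers', 'stg_customers', 'customers']
```

In real dbt, the same compile step also swaps each ref('...') for the fully qualified schema.table name of whatever environment you're targeting.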
Adrian had said: I've written some Python files; it's more getting an orchestrator, I think that's the term, to run code, hopefully in the cloud and not on my laptop, in a somewhat automated fashion, and I'm wondering if dbt has any of that execution. So, dbt is not a generic orchestrator in that sense, like your Airflow, Dagster, or Prefect. dbt is sort of half of an orchestrator, and more specifically for SQL. dbt can have some Python code in it, but dbt is not going to run an arbitrary Python script on a schedule. Normally, with a dbt project, if you're using dbt Core, you would have a separate orchestrator like Airflow, Dagster, or Prefect that's kicking off your dbt project runs. With dbt Core you can't schedule it by itself. It does the dependency and ordering stuff, but it doesn't have the scheduling, and it doesn't have an execution environment on its own, so you would have something like Airflow that kicks off your dbt project run, if that makes sense. So it has some of the features, but yeah, not all.
Okay, good to know. Aaron asked: if we're starting this project from scratch, would it just start with the SQL code pulling into tables? Yes. That very first thing that he ran, that dbt build, looked at a couple of different places in order to make tables. It looked for the models, the .sql files, but it also looked for the CSV files that we had in our seeds folder. And as Michael and Laurie said, those CSV files are conceptually taking the place of database connections. In real life we would have database connections, not these flat files locally, but we're trying to use dbt locally when it's really meant to be a system that's plugged into a larger system.
And we did have a couple of connected questions from Saraj and Marcos, so I'd love to tackle those. Both had asked: how is this better than writing Python functions or a Python script to do this automation? What are the benefits of using dbt versus just a Python script? So, dbt has a ton of features baked in for the things you want if what you're trying to do is specifically: on a regular cadence, I want to pull in data and then run a bunch of transformations on it using SQL. The volume and size of the data is part of the difference, but mostly it's that it gives you a lot of additional features. It has things like the ref macro. I've looked at a project where people kind of home-grew a lot of the features of dbt, manually orchestrating their SQL ordering and execution, like run this and then run this; dbt gives you that out of the box. dbt also gives you a framework for how you connect to different database environments, out of the box, which you might not want to write yourself: okay, here's how I manage different connections. Also, can we look at the YAML file, Michael? schema.yml? dbt also gives you, out of the box, a framework for testing. You can just specify in YAML files different tests that you want to run as you're executing your transformations. So it gives you a lot of things that you would likely want to do and then don't have to manually implement yourself, basically. And dbt definitely comes from the idea that you have a cloud warehouse where you're running this, so there's also the question of it assuming that you're running in a bigger cloud data warehouse instance.
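dbt's built-in generic tests, like not_null and unique declared in those YAML files, compile down to SQL queries that return the failing rows; a test passes when the query returns nothing. Here's a toy sketch of that idea against sqlite (the table, columns, and data are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, None)])

# A dbt-style test passes when the "failing rows" query comes back empty.
def not_null(table, column):
    failing = con.execute(
        f"SELECT * FROM {table} WHERE {column} IS NULL"
    ).fetchall()
    return len(failing) == 0

def unique(table, column):
    failing = con.execute(
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()
    return len(failing) == 0

print(not_null("customers", "customer_id"))  # True
print(not_null("customers", "name"))         # False: row 2 has a NULL name
print(unique("customers", "customer_id"))    # True
```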
Yeah, and then on the dbt docs home page, the way you interact with it: on the left, you can navigate either through the project view, which is the folder view, and that will show things as they're organized in the dbt project, the directory structure within your repo, basically. If you go to the database view, it will show how things are organized within the database, so it'll show by schema, basically. Those are the two ways to navigate, and then each model has its own page in the doc site, so we can click on any of those and see. At the top it has the summary information, and it has the columns, and each of the descriptions that you see there, you can specify those in the YAML file within your dbt project. You can write docs kind of inline, and then they'll render here. Depending on your settings, they can also be persisted into your warehouse, so that the same column descriptions can appear in multiple places. And that's really about the benefits of dbt. This is a huge one, because otherwise all of this would be either manual or through another docs framework.
So just to map it back: we're looking at the customers table, and we can see these descriptions on the columns. Where that comes from is, notice in the customers folder, and this is a bit tricky, there's a schema.yml file. Laurie, maybe in a sec you can talk about how people like to write the .yml files, because I know there are some conventions around that. But here, notice, these are our models. So in dbt, a model is... I'm afraid to say it's a table, or the output is a table. Each SQL file produces a table, and this models key is a list of tables. And notice, yeah, I think the reason that it's called a model rather than just a table is because part of the dbt philosophy is the idea that your data transformations and a lot of this metadata should live in version-controlled code, and that the materialization, whether something is a table or a view, is an implementation detail. The important part is the business logic that's encapsulated in that model. I mean, for folks who've worked in databases, sometimes you have like an underscore-v suffix; each thing in the database is like, oh, is it a table or is it a view, and that's often how things have been organized. dbt is saying whether it's a table or a view is just an implementation detail related to performance and storage. From your perspective, you should care that it's the customers thing: when I need to get customers, I go there, and as a downstream consumer I shouldn't care if it's a table or a view. Maybe I need to change it later based on performance considerations that arise over time. That's why model is used as a generic term: you're building your data model, conceptually, more than... right.
dbt docs and lineage
Yeah, so, dbt docs. Part of this idea of dbt as a framework: when dbt runs (there are slightly different run modes), it's compiling all of the references and inferring that dependency tree, and it outputs some special JSON files that basically hold all that compiled information. That then lets you do different things to parse your project, and one of the things dbt gives you out of the box is the docs. And is it target that has it all? Yes, it's in the target folder, and for docs it's catalog.json. So this is the internal, how-the-sausage-gets-made view: these are huge, really gnarly JSON files that contain all the metadata and all of the resolved information that dbt produces. But then dbt provides a user interface on top of that through the doc site, where you can see information about each of the models. And, bottom right, the little icon will show you your dependency lineage as a graph. I guess this one's really basic, but you can click through it. Maybe go back to the home page of the doc site, Michael; that's what we should share.
It's kind of an if-you-know-you-know thing: these things are clickable, and this actually expands. But do we want to reiterate how you generated it? In the terminal: dbt docs generate, then dbt docs serve. So dbt docs generate says, okay, I found a bunch of models and seeds, and it tells me where the catalog is. Then dbt docs serve is a bit tricky in Codespaces, because it'll ask permission to open a special link; notice it's not a link you can share, it's going to open locally on your machine if you're running it locally. You can command-click it here and it knows to open it in a special way, so that's one Codespaces tip. Great, thank you; didn't mean to interrupt. No, no, I'm glad you did. And sorry, was that command run in Python or somewhere? This is all in the terminal, through the dbt command line interface; dbt has its own CLI. Cool.
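The two commands mentioned above, run from the project root:

```shell
# Compile the project and write target/manifest.json and target/catalog.json
dbt docs generate

# Serve the generated doc site locally (on port 8080 by default)
dbt docs serve
```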
Yeah, and then on the dbt docs home page, the way you interact with it on the left is you can navigate through either the project view, which is the folder view and shows things as they're organized in the dbt project (basically the directory structure within your repo), or the database view, which shows how things are organized within the database, so by schema, basically. Those are the two ways to navigate. Then each model has its own page in the doc site, so we can click on any of those and see it. At the top it has the summary information, and it has the columns. Each of those descriptions you see, you can specify in the YAML file within your dbt project; you can write docs inline and they'll render here. Depending on your settings, they can also be persisted into your warehouse, so the same column descriptions can appear in multiple places. And in terms of the benefits of dbt, this is a huge one, because otherwise all of this would be either manual or through another docs framework you're plugging in, because there are docs frameworks. So this is really cool.
dbt versus python, and the analytics engineer role
I will say, there is this interesting space in dbt, which is that anything that can generate SQL can be made into a dbt adapter. For example, Ibis is similar to dbtplyr but for Python, and it can generate SQL, and there's an Ibis dbt adapter out there somewhere. So there's this funny space of people who have integrated tools that can generate SQL into dbt. But that space isn't super common, so you're a bit off the beaten path, which is the problem.
I think, to Jenna's point, this quote is real: "if you choose not to use dbt, you'll probably waste time building a less fully featured, buggy implementation of it yourself." And the reality is just that R and Python are way less integrated into dbt, so if you're used to generating SQL with dbtplyr, there aren't really common, well-trod paths to move that in. I think part of it is also that analytics engineers don't want to worry about all these bespoke tools when a lot of them are very SQL-savvy.
All right, cool. Let's hop over to a question from Saraj, who asked: does it make sense to say that data source detection is dynamic with dbt, unlike Snowflake, BigQuery, etc., where we have to manually set sources, and the sources can be different formats? I'm not 100% sure how to answer that. You do have to specify in dbt what your sources are, on two levels. One, you connect dbt to the warehouse environment where it's going to execute; a dbt project can only execute against a warehouse it has been configured to connect to, so there's a credentials aspect: you have to be running it through a connection to a warehouse. And then the other thing is you do specify where your sources live. This is where the sandbox is a little odd; I don't think they actually have anything configured as sources, because they're using all the seeds, I guess. Yeah. But normally the way it would work is that in a YAML file like this, you would specify, say, that my upstream source is an external table, like data landing in a bucket. I would tell dbt about that external table; it needs to know where to find it. You can see external_location here, so dbt needs to know where your source lives, and sources have to be named so that dbt can resolve the references. The dynamic aspect is more in the reference resolution within the dbt project, in terms of the ref macro that we looked at. But you do have to give it defined sources.
Yeah, I will say this tool, dbt-duckdb, is a fun way to beef up these tutorials. Here we generated a .duckdb database file, basically, but what the dbt-duckdb adapter lets you do is have your database be a file, or a folder of Parquet files; essentially your whole database could just be reading and writing from a folder of Parquet. That's what's shown here in this source: my external location is this S3 Parquet bucket. DuckDB is good with that kind of stuff.
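As a sketch, a source defined for dbt-duckdb pointing at Parquet in a bucket might look like this. The bucket path is hypothetical, and external_location is a dbt-duckdb-specific config, not standard dbt:

```yaml
version: 2

sources:
  - name: raw
    meta:
      # dbt-duckdb reads each table directly from this location;
      # {name} is filled in with each table's name
      external_location: "s3://example-bucket/raw/{name}.parquet"
    tables:
      - name: raw_customers
      - name: raw_orders
```

Models then refer to these as {{ source('raw', 'raw_orders') }}, and dbt resolves the reference.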
I like that; that definitely fills in that gap. Okay, cool. There was a non-question, sort of a pseudo-question; let me go find it. Yeah, I know I saw Jenna also mentioned running the dbt docs. Should we do that? dbt docs generate and then dbt docs serve, and could we talk a little bit about the importance of those? Jenna, if you would like to hop in, feel free; we can send you a little thing that lets you unmute.
Yeah, I do think, to Hadley's point, what made sense to me is that I've seen a lot of people build dbt-like things; to Jenna's point too, there are actually so many ways to make it work. If you're used to tools in R, you've probably built, or could have built, really reasonable systems that automatically detect dependencies; honestly, it's not a really hard tooling problem. But I think the problem dbt solved is that it created a job, the analytics engineer role. You know what an analytics engineer should learn: it's mostly SQL, plus orchestration with things like dbt. The fact that this group of people has converged on shared practices and language means you don't have a tower of Babel where everybody's doing their own special orchestration tool. Yeah, it's why we love the tidyverse: we have a syntax, everyone's going to use that same syntax, we're all going to speak the same language and understand each other, and that means we can work together even better.
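That "not a really hard tooling problem" claim can be sketched in a few lines of Python: scan each model's SQL for ref() calls, build a dependency graph, and topologically sort it to get a valid build order. This is a toy illustration (the model SQL here is made up), not how dbt actually implements it:

```python
import re
from graphlib import TopologicalSorter

# Toy models: each SQL string declares its upstream models via ref()
models = {
    "stg_customers": "select * from raw_customers",
    "stg_orders": "select * from raw_orders",
    "customers": """
        select * from {{ ref('stg_customers') }}
        left join {{ ref('stg_orders') }} using (customer_id)
    """,
}

def find_refs(sql: str) -> set[str]:
    """Extract the model names passed to ref() in a model's SQL."""
    return set(re.findall(r"ref\('([^']+)'\)", sql))

# Dependency graph: model name -> set of upstream models it depends on
graph = {name: find_refs(sql) for name, sql in models.items()}

# Topological sort yields a build order with upstream models first
order = list(TopologicalSorter(graph).static_order())
print(order)  # staging models first, then customers
```

Where dbt earns its keep is everything around this core: the compilation to warehouse SQL, the materializations, the tests, and the docs, all behind one shared convention.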
Yeah, I think dbt basically took a lot of things that, again, you would want to do if what you're doing is: I have this amount of data that needs to flow through every day or every hour, I'm always doing the same steps, and I'm doing that in the context of a data warehouse. dbt just bundles together a lot of features that make your life easier if that's what you're doing. It takes away a lot of boilerplate code, it adds data testing, it gives you this docs site for free, and the dependency management is really important, but it's part of this coherent whole. It's also the idea that managing my data warehouse is a collective thing; a shift toward treating the data warehouse as a product that people should be curating and building collectively. Often at organizations with big data warehouses there's kind of a free-for-all: different teams are managing different schemas, and people are using whatever data they can scrounge up, what they can find, who they know. dbt is trying to make a more coherent whole and help you talk precisely and collaboratively about your data warehouse.
Right. I do think one of the most interesting features of dbt, which is hard to show off because you need a big environment to really appreciate the pain, is branching. If you have six developers working on a warehouse, and you're working on some set of tables, and you say, okay, I know what I'm doing starts here at these tables and I just want to test changing these tables down here: how do you have a development environment for that work? How do you test putting up a few tables in a warehouse? That used to be really hard. You either had a development or a staging warehouse, and you still might, but dbt is really good at helping you fork out your development tables without having to rename everything or use funny environment-variable magic. It can just handle, say, Laurie's development tables for this PR, which used to be kind of a crazy problem to solve.
Starting a new project from scratch
Yeah, so dbt does give you a starter: there's the dbt init command that scaffolds a project, and separately they have a starter template project (this sandbox is a different thing), so you could run dbt init or clone that starter project. Then you'll have to configure your connection to whatever warehouse you're using. If this is real, like you're actually getting started on a thing, connect to your Snowflake or your Databricks or whatever; or if you're doing it locally, connect to your DuckDB. You configure that in dbt's profiles.yml, which is where the connection lives. So: connect to your warehouse, and then I would make my first source. In a YAML file, specify, okay, what's my incoming data, whatever external data you're starting with, and configure that as a source. Then I could write my first model, a staging model on top of that source, that maybe casts data types and renames columns; something minimal. And dbt has this very large community, with a lot of documentation, blog posts, and best-practices materials out there, including a "how we structure our dbt projects" guide that talks through the components of a dbt project. So I'd walk through the steps, basically, following that.
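For the local DuckDB case, the profiles.yml entry is small. This is a sketch assuming the dbt-duckdb adapter, with the top-level profile name matching the profile key in your dbt_project.yml:

```yaml
# profiles.yml
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb  # local database file; created if missing
```

For Snowflake, Databricks, etc., the output block instead carries that adapter's type and its credentials.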
Okay, great. I love that question, by the way, because this project is fully fleshed out and all local, and it's a fantastic question: where do you start when it's completely empty? Okay, we have two minutes left, and I want to do a quick shout-out. Hadley said: so dbt gives SQL the tools that most programming languages have, like testing and documentation. Yes. Also, MVP of the chat goes to Jenna Jordan; thank you so much for hopping in, answering questions, and providing resources. It's been amazing. It also puts all of these steps into version control; you can have them in GitHub, and that's super important; that's another thing, your stuff might be all over the place if you don't do that. And then Darren said: when I've thought about dbt, the issues it solves are in part due to dependency management. For traditional SQL-focused data teams, when you make changes to SQL scripts it's hard to know whether those changes have downstream impacts; with dbt, those dependencies can be tested against. Is that somewhat correct? That is one hundred percent correct, yes. That's part of the value of exposing that lineage and having tests inline: if downstream you need something to be unique, you can specify that, and if you make a change upstream, you can list out the downstream things that are going to be affected by it. So yes, perfect.
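Listing those downstream things uses dbt's graph operators on the command line; the trailing + selects a model plus everything downstream of it (stg_orders here is just an example model name):

```shell
# Show everything that depends, directly or indirectly, on stg_orders
dbt ls --select stg_orders+

# After a change, run only that model and its downstream dependents
dbt run --select stg_orders+
```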
All right, well, thank you Michael, thank you Laurie. Yeah, thanks for having us. Yeah, dbt is a lot to wrap your head around. Hadley said in the chat, "I feel like I know less about dbt than when I came in here." Now the healing can begin, you know? It's probably because there were some preconceived notions or confusions, or you haven't gotten to apply this yourself; or, like Hadley said, the problems that dbt solves are far removed from what a data analyst or data scientist does and needs to do, and that's totally relatable for me too. The last thing: I remembered I wanted to define a view versus a table, because they are different things. The 20-second answer: a table means you have persisted the data again, in storage, in the warehouse, whereas a view is just a SQL script that's saved, and it runs when queried; it's not re-persisting the data. That's the shortest I can do it: a view is a view of the table.
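That table-versus-view distinction can be seen directly in any database; here is a sketch using Python's built-in sqlite3 module. The table copies the data when it's created, while the view re-runs its saved query every time it's read:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table raw_orders (id integer, amount integer)")
con.execute("insert into raw_orders values (1, 10), (2, 20)")

# A table materializes (copies) the data at creation time
con.execute("create table orders_table as select * from raw_orders")
# A view is just a saved query; it re-reads raw_orders on every select
con.execute("create view orders_view as select * from raw_orders")

# Add a new row after both were created
con.execute("insert into raw_orders values (3, 30)")

table_rows = con.execute("select count(*) from orders_table").fetchone()[0]
view_rows = con.execute("select count(*) from orders_view").fetchone()[0]
print(table_rows, view_rows)  # the view sees the new row; the table does not
```

This is exactly why dbt treats the choice as a materialization setting: downstream consumers query the same name either way, and only freshness and storage differ.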
Okay, everybody, thank you so much for joining us. Stick around on the Discord; this was really, really fun. Continue asking questions, and we can hop in and continue to answer them. I hope we see you on Thursday for the Data Science Hangout. Also, if you have not taken the Data Science Hangout survey: you do not have to be a Hangout attendee (although I would love you to be) to fill it out. It's open till Sunday, and we will come back on the Data Science Lab and the Data Science Hangout to share the results with you, so please, please, please do that. Have a wonderful week. See you on Thursday, or next Tuesday, when we'll be joined by Andrew Heiss; if you know him, you love him and all of his content. So join us then. Bye, everybody. See ya. Bye, y'all.

