Resources

From Data Confusion to Data Intelligence - posit::conf(2023)

Presented by Elaine McVey and David Meza. Data science teams operate in a unique environment, much different from the IT or software development life cycle. Hope from executives for the impact of data science is extremely high, but understanding of how to make data science efforts successful is very low. This creates an interesting set of organizational challenges for data and analytics teams. These are particularly clear when data science is being introduced at new companies, but they play out at organizations of all sizes. So, how do we navigate this dynamic? We will share some strategies for success.

Presented at the Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: From Data Confusion to Data Intelligence. Session Code: KEY-1060


Transcript

This transcript was generated automatically and may contain errors.

I hear we both lead data science teams. I specialize in getting data science going at early-stage startups. What do you do? Basically the same thing, but in a larger organization. I develop data science teams, but take it to the next level, helping them sustain themselves and build out their data and analytics architecture well.

You know, that sounds great. Actually, getting data started from scratch, there are so many things that are frustrating. Often we kind of just run into a lot of barriers. It must be nice to be somewhere where everything is established and running smoothly. You know, it would be nice. We have our challenges, too. Large organizations tend to be very slow, especially government organizations, and we tend to be more reactive than proactive. So we're working to change that, to be more proactive in our organizations.

You know, maybe I can share with you some of the things I've learned about getting data science off the ground, and then I would love to hear what you've learned about doing data science at scale. That'd be great. Thanks.

The gap between hope and understanding

So, I've thought a lot about what's kind of the root problem of some of these things that we face getting data science started or, you know, even the later stages, and what I've concluded is this. It's really that hope for data science is so high in terms of what we can deliver, but the understanding of how to set teams up for success is so low. And this can mean that your organization kind of accidentally makes it impossible for you to do the job they were so excited to bring you in to do in the first place.

It's really that hope for data science is so high in terms of what we can deliver, but the understanding of how to set teams up for success is so low.

So if you run into these problems, who can you turn to to help you kind of solve these organizational issues that you're facing? And I think the answer is it's you. It's you, the data scientist. It's us, the data scientists. Maybe this shouldn't be our job, but in my experience, the best chance of success is if we take the lead in helping our organizations learn how to make us successful.

So if we're going to do this, I think it's helpful to understand why we find ourselves in this situation where there's this big gap between the hope for what we can deliver and the understanding of how to help us do that. So first of all, the hope part. If you put yourself in the mind of a CEO or another executive who decided to bring a data science team into an organization, these are the kinds of messages they're getting. Data is the world's most valuable resource, right? So if you are someone who's in charge of the long-term strategic direction of an organization, you want to make sure you're not missing these big opportunities to be doing things with data.

And then I think on the flip side of that is fear. Most CEOs think their organizations have to evolve or die, whether they're small and early stage or big and established. The world is changing quickly. They need to stay on the cutting edge and be doing things that are transformative. And part of that is being smart about how they use data. So I think it's these big, high-level ideas that get people excited about what we can deliver.

A story about getting started

Years ago now, I was hired into a transit software company under these circumstances. The CEO was the one who decided that they needed to hire a data scientist. And he told me, I know we have a lot of data, and I know there's more valuable things we could be doing with it. And I'm not sure exactly what that is, but we need you to help figure it out. So I went in knowing there was this uncertainty where I would have to help find the value in the data and help the company realize it. But I also knew I had the backing of the CEO, and so this seemed like a pretty good situation.

So I arrived and started talking to the lead engineer about getting access to the data. And he said, oh, no, I can't give you access to the data. We've built our data pipelines with these particular needs for our production system, and it's not built for analytic queries. It's not safe for me to let you get the data. So immediately, this is a problem, right? And so I had to keep pestering him. I can't do anything if you can't get me the data. So eventually, he agreed, okay, we can try running a query so you can get some data.

So I wrote the query, I checked with him, he approved it, and I sent it off. And I waited and waited, and it seemed like it was taking a really long time. So eventually, I wandered down the hall to ask him if he knew what was going on. And he was not available to talk to me, because he was in the middle of an emergency situation. All the buses in our product on the maps had stopped moving. And why had that happened? Of course, it was because of my one query. So that was the end for a long time of me getting access to that data.

Three things organizations don't understand

And this kind of very basic problem is not unusual. When I talk to other data scientists and data leaders, these are the kinds of things we face a lot. So there are three buckets of things, I think, that we need that organizations don't really understand coming in.

One is analysis-ready data. And this can go wrong in several ways. One is you can't get access to the data, or it's not in a form that's safe for analytics. Another might be that the data you need for the project you're doing doesn't exist. Not really even being collected. Or maybe it is being collected, and you can get access to it, but the data management practices are so nonexistent that it's messy to the point where it's almost unusable.

Another thing we need is cross-team support. So in the situation I was in at the transit company, the engineering team could have solved my problem. They had the ability to get the data into an analytics data warehouse, right? But they weren't expecting to need to do this work. No one realized the data scientists would need any support. And they had a list of high-priority items that they were already struggling to be able to address with the capacity they had. So they weren't going to prioritize this.

And this can also happen in the other direction. Sometimes people are thrilled to see you arrive as the data person, because they think, my team has data problems. We spend way too much time reporting. Here's someone who can solve this for us. And then you can get a lot of incoming requests, which can be helpful in understanding what your organization needs. But if you can't control that, you end up spending all your time reacting to these requests, and never get to those potentially really high-impact projects, which are what people really wanted data scientists for in the first place.

And the third thing is space for uncertainty. So unlike almost everything else that organizations do, data science projects depend on the contents of the data in a way that makes it hard to know whether they're worth doing until we've kind of already done them. And so this can make it really hard to fit into the normal way that your organization plans work.

Guerrilla data science tactics

So if you find yourself dealing with these kinds of problems, what do you do? And I think in my experience, there's a set of tactics that work here that I think of as guerrilla data science tactics. So this means maybe not doing things ideally the way that we should be doing them, but doing what we have to do to make it work in the situation we're in. So being scrappy and clever and working around the obstacles in our way.

But guerrilla, of course, comes from guerrilla warfare, and warfare is not the right mindset for us to be in, right? So as frustrated as we might get with some of our colleagues, they aren't trying to make things difficult for us. We're all wanting to make it work, we just haven't figured out how to do that yet. So instead, maybe we think about this as guerrilla data science, powerful but friendly.

So what does this actually mean in practice? There's a series of steps that I have found success with in getting data science off the ground. The first step is scan for opportunity. So this means being heads up in your organization and looking for the overlap of what's valuable to the organization and what you're able to do with data.

And once you've found that opportunity, the next step is show, don't tell. So it can be tempting to explain to people what you're thinking and what you're planning to do and what you need from them and how this is going to play out to get buy-in, but in my experience, at least early on, this is usually a waste of time. What you need to do is build the thing, at least the skinny version of the thing, so that they can see the vision you have in a more concrete way and have something to react to.

Next, take the data and run. So this doesn't mean stealing the data or hacking into the data, but it means that if you can't get access in the way that you ultimately will need to, you have to do whatever it takes to get an initial data set. Maybe this means befriending an engineer who does have access and asking them to just help you get a one-time data set that you can get started with.

And then once you've used that data set to build something that delivers on that opportunity, nail the landing. And what this means is communicate. So all of these things you've done up to this point don't matter if no one knows what you did or doesn't understand why it's valuable. And I think it's almost impossible to overinvest in communication, communicating things different ways until it connects with people and they really get it, and also communicating broadly. So not just your primary stakeholder or your team, but everyone around the organization. Make friends and talk to them all the time about what you're doing.

And then once you've done this, it's time to up the ante. So once you've done an initial project that shows value and people know about it and understand it, you want to use that organizational goodwill to do something that's maybe higher risk but higher impact the next time. And so you go through this cycle again.

So when I found myself in this position of not being able to access the data that I was there to find value in, I was at the scan for opportunity stage. And the opportunity that presented itself: we had just raised money, we had a growing employee base, and people needed to understand what was going on with the company. And so we worked on business metrics. Not what I had come to do, but it was the opportunity we had at the time. And so we started building a flexdashboard, deployed it, and shared it with people so they could see what we were doing.

The data here did not come from that production system, but we were able to find scrappy ways to pull information in from around the company to feed this dashboard. And then we communicated it to everyone. We showed it in company meetings, we talked to teams about how they could use it. And that got us some goodwill to do the next step.

So now we're back to scan for opportunity. And this time, the company was getting into a new product in the area of microtransit, which is like an Uber pool for transit agencies and universities. And this runs on an algorithm. So from our data science viewpoint, we thought, well, of course we understand how the algorithm works, but that's not the same as understanding how it will play out in the real world. And we thought we could figure that out, figure out where it works well and where it doesn't in which situations by using simulation.

And so now we had this goodwill. So I thought, this time we can do it right. I went to the product manager and talked about our idea and why we thought it would be valuable and what we needed from the engineering teams to do this. And she said, oh, that's interesting. We're not going to prioritize this. No one's asking me for it. We have a long list of high-priority items, and I already can't do them all with the team that I have. So we're just not going to be able to help you with this.

So back to show, don't tell. How could we build a version of this that we could show people on our own? So in this case, because we were using simulation, we were creating our own data. So that's problem solved. But we still needed access to the algorithm. So we went to engineering with this one-time ask. Just help us get access so that we can run this algorithm offline. It won't interfere with your production system, and we'll leave you alone. And so we got that.

So then we built our Shiny app to create inputs. We scheduled all our simulations. We created reports for different scenarios. So then we had something really cool to show people. So we went back to our team and the product manager and showed them this. And within our department, people thought, you know, this is interesting, but it's kind of like a toy. It's not really, you know, part of our software. It's just kind of an interesting side project. But we kept talking about it. And one day, I showed it to the chief revenue officer. And he immediately saw the value in this because he's thinking about the customer, right? The customer is uncomfortable with this algorithm. They need to be able to see how this is going to work for them before they're willing to put this out in the world. And we very quickly became part of our marketing and sales process, and also part of how we communicated to our investors why this was important and what it actually was. So now we had gotten to that point of this high impact. This was the kind of thing that the CEO had wanted to bring the data science team in for to begin with.

Reflections on the process

So a few reflections about what's important in this process. One is to maximize speed and autonomy. So even though we're in this privileged position where people have great faith that we can do amazing things, even though they don't quite know what they are, we still have to start delivering eventually, right? So we have a limited amount of time. And in my experience, the best way to maximize speed is to maximize autonomy of the team.

And this doesn't mean working in a silo or not collaborating with other teams. But what it means to me is that you need to be able to run that iterative data science development process without hitting places where you need help from another team, because that's what's going to slow you down.

The other thing is to build foundations wherever possible along the way. So even though we're doing these scrappy things in our guerrilla data science tactics, everywhere we have a chance, we want to start building things that we can build again on later. So building data pipelines that we can schedule, building R packages that we can reuse, all of those things.

And these are the things where my teams have found Posit Connect to be really helpful. So instead of every time we need to deploy a dashboard or we need to deploy an app or we need to run a cron job, having to go ask engineering for help with getting that set up, it was just a one-time ask, help us set up Posit Connect. And then we're off and running on our own. And we can do all these things without constantly needing support.

So if you've gone through this process and you've gotten to that first really high-impact result, pat yourself on the back. You can declare victory. And now you've completed step one, because this is just the beginning to create momentum. But going forward, you can't do things this way and have a sustainable and scalable data science practice. Particularly the take the data and run step needs to evolve. And David, I would love to hear what you've learned about doing data science at scale and how you've done some of these things.

Scaling data science in a large organization

Happy to share. So, Elaine, when I was hearing you talk, it really brought back a lot of memories. I almost felt like I was in an episode of This Is My Life, because I went through so many things similar to what you just did. And I just couldn't understand why it was so hard to get the data. I mean, it took me two years, two years, to actually have the head of human capital IT tell her staff, if David and his team need the data, they get the data. There was a lot of cajoling, a lot of begging, a lot of pulling hair out.

But I think what it really came down to was showing her that we could get answers in minutes instead of days, or sometimes weeks. Let me give you a small example of our data access at that time, when I first joined this organization. We had this personnel data warehouse with information on roughly 18,000 employees, all the data about them. I had to go through a web interface to this very well-known back-end database. There are some queries on there that I can pull that maybe have 10, 15, 20 different attributes of an employee. It takes maybe a minute to two minutes for that thing to load once I access the query.

Then I have to add other attributes to that employee, depending on what I'm doing. That's another 10 to 30 seconds for each attribute. And if I got the wrong one because I don't have a data dictionary, I've got to pull that back. That's another 10 to 30 seconds. Add another one, another 10 to 30 seconds. Imagine doing that for 40, 50 different attributes. And I could only pull down 750,000 cells at one time. Not rows or observations, cells. And if I needed to do historical data 10 years back across all my centers, it could be days before I get all the data aligned. And please don't make a mistake because then I got to do it all over again.
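To make the scale of that concrete, here is a back-of-the-envelope sketch of that old workflow's cost, using the figures above. The 50-attribute analysis and single-snapshot framing are illustrative assumptions, not details from the talk:

```python
# Rough cost of the old web-interface workflow, using the figures
# mentioned above. The 50-attribute analysis is a hypothetical example.
employees = 18_000        # records in the personnel data warehouse
attributes = 50           # columns needed for one analysis (assumed)
cell_limit = 750_000      # maximum cells per download

cells_needed = employees * attributes                 # 900,000 cells
pulls_per_snapshot = -(-cells_needed // cell_limit)   # ceiling division

# Adding attributes one at a time, at 10 to 30 seconds each:
build_minutes_best = attributes * 10 / 60    # best case, in minutes
build_minutes_worst = attributes * 30 / 60   # worst case, in minutes

print(pulls_per_snapshot, build_minutes_best, build_minutes_worst)
```

And that is for a single snapshot; ten years of history across multiple centers multiplies the number of pulls, which is why one mistake could mean days of rework.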

So I kept asking: why can't I get to all this data? There had to be a better way. And there is, I think, I hope. But it takes somebody who will take what you started to the next level, advocating for everything you've done, and being able to talk to leaders and tell them why we need to do it.

When we got to the end of our journey, which is very similar to yours, I thought I had built a mansion. I had all of these different applications out there where we could showcase things. But did I really? What I had was a facade with a bunch of little bitty trailers in the back. And they were nice trailers. Don't get me wrong. They were good. They worked. They worked well. They worked well together. But they were still individual little pieces. There was no interface or any way for me to really share this across the enterprise. There wasn't any way for me to really get it out there and showcase it to the people. So we had to do something different.

Knowledge architecture and the data foundation

So what's next? Well, I started thinking about this, and it really all starts with the data, as Elaine mentioned. The data is the key to everything we do. It is the new oil, the most vital commodity we have. So you have to think about how you develop your data foundation: what you have to do to go from grabbing the data from somebody by hand to having a pipeline that automatically gets it to you. But to do that, you have to set expectations with your leaders.

Because they'll come to you. You know, 10, 15 years ago it was, can you just do it like Google? Now it's, can you just make it like ChatGPT? It takes a while to do that. I can't do that immediately because, one, I don't have access to the data. And two, I don't want to see a whole lot of hallucinations, especially when we're trying to get back to the moon and on to Mars. We want to make sure we have this right.

So I needed to create an understanding, to level set the perceptions of everybody in the organization. But another key I found was that we need to differentiate between IT and data science. IT is the pipeline that gets us the data. Data science is where we do the magic, as they like to tell me, turning that data into some kind of answers. So how do you do that?

Well, a little over a decade ago, I developed this framework that I call knowledge architecture. And let me give you a quick little story about how this all came about, so you can understand what I'm trying to get to here. Again, over a decade ago, I was leading a small team in a knowledge management group down at Johnson Space Center. And I was meeting with my taxonomist and my web developer. We were talking things over, and the taxonomist looked at the web developer and said, I need access to the metadata. And the web developer adamantly said no. They went back and forth on this for 10, 15 minutes.

I finally said, timeout guys, let's just go to our respective corners and talk about this later. I went to meet with my taxonomist separately, and come to find out, this conversation had been going on for a couple of months with no resolution. I said, first off, tell me, what's the definition of metadata? In her mind, the data on the data. Okay. Give me some examples. Author, abstract, date created, title, all information about the document she was trying to categorize and classify in her system.

I went to the web developer and had the same conversation. He gave me basically the same definition. All right, give me some examples. Field type, length, size, everything you'd use to describe a database. Again, this was level setting the organization to get them to communicate. Same definition, different examples. Once I got them talking the same way, they were able to get each other the information they needed. So it was really about understanding each other's worlds.

So that's when I started thinking, we need a liaison, we need a group. And that's where knowledge architecture came into play. It's somebody who can talk all three languages to get this information across. So what is knowledge architecture? In my mind, knowledge architecture is a combination of knowledge management, informatics, and data science.

Knowledge management is a strategy for how we identify, store, analyze, and visualize our information, our data. Informatics, again, is that pipeline: the people who get that data, our commodity, our oil, from one place through processing and out to where we can utilize it. And data science is the algorithms and methodologies we use to turn that data into some kind of actionable knowledge. By combining all of that, we start to develop our data foundation.

I've set some expectations, I'm communicating across, I've created that understanding, and I'm differentiating what's between IT and data. That gets me started to develop this framework that we can start creating those data analytical pipelines that allow us to more efficiently use our data in our organization.

Building a sustainable data pipeline

So, I'll say this again: it all starts with the data. Think back, if you've seen Raiders of the Lost Ark, to the last scene where they're taking that ark and putting it into a warehouse, and it rolls back and you don't see it again forever. That's kind of how we've done our data. NASA's been around for 60-plus years. We've stored a lot of data and it's just kind of sitting there somewhere. It first started in filing cabinets, then it went to file servers, then to SharePoint and databases, all in the same pattern. We've always collected the data just to store it.

We need to do better. I think you talked about it as analysis-ready data. We need to collect data with analysis in mind. We need curated, AI/ML-ready data sets that will support us. No more of this, it takes me 80% of my time to clean the data before I can do my analysis. I want to get to the data and start working on the analysis.

So, some of the things we need to do when we look at that: we think about our orchestration and ingestion. And this comes by working with IT and differentiating those roles. It will take somebody to lead this effort, somebody like a chief analytics officer who's going to work with the senior leaders to show the difference and why we need to do this. Too many times we put analytics into IT. Yes, data science is code, but we're not IT developers. We do our own thing. We have our own ways of doing things, and we need to differentiate between the IT and data science life cycles. So we need to make sure we get that done.

So, think about how you're going to ingest from your data sources. Take my personnel data warehouse. What I'm doing right now is taking the data from that warehouse and, based on the types of analysis we need to do, creating these curated AI/ML data sets. We're taking it through those systems, transforming it, and getting it to the point where it's ready for us to use.

But you also have to make sure somebody's talking to IT security. Again, if you're in a large organization like NASA, we've been around many years. We've been around since before there was a network. I went through the Cisco versus Novell phase of deciding who was going to end up running the network. But each center did it their own way, and we have firewalls all over the place. So if you don't talk to IT security early, you're going to be stopped in your tracks very quickly. You've got to get them involved. They can be a headache sometimes, but it's well worth it to get in there and talk to them, to make sure they understand why you need access to the data and how you'll get to it.

But here's the key, in my mind. You need to work with the data owners and the data stewards to get them to understand their role in the organization. The data owners are the people who run the business side, the people who actually own the data for that business process. The data stewards are the people who understand the processes that data goes through. They need to understand their responsibilities. Too many times, they've given those responsibilities up to IT. No, IT doesn't own the data. IT gets it to you. The data owners own it. So you've got to make sure they understand what they need to do. Get them involved. Show them the value, as Elaine said: if we use your data correctly, this is what we can do for you.

And that leads us to something that may not necessarily be our role as data scientists, but we have to take it on because nobody else is doing it right now. We have to advocate for data literacy and data governance. We have to get people to understand. Everybody in the organization should be somewhat data literate so they can help us help them. And by data literacy, I mean the ability to read, work with, analyze, and communicate with data. Whatever their role is, whether that's a project manager, an administrator, or anything else, they all need to know how to work with their data, and that helps us help them. So we've got to advocate for these things in an organization to really raise the level of what we're trying to do and make it sustainable.

We have to advocate for data literacy and data governance. We have to get people to understand. Everybody in the organization should be somewhat data literate so they can help us help them.

So, how do we create a sustainable pipeline? How do we get our data from our old data sources into our data lake, or our data mesh, or our data water fountain, or whatever we want to call it, so it's all there, able to be shared and visualized?

But before I get there, I'm going to give you another quick story. I talked earlier about how I got the data and how long it took. Well, I didn't tell you the next part of this pipeline. Once I got that query built, there was a little button up there that says download CSV file. Okay. I download the CSV file. Maybe one file, maybe 10, 20, 30 files I have to download. Then I put that on my laptop, open up RStudio, VS Code, Tableau, whatever we're using to ingest the data, and start working on our visualization and analysis. And I create these great applications. We create these great dashboards.

Then it comes time to present it. So here I go. Take my laptop, go to one meeting, plug it in, show them, this is what we're doing. Then go to the next meeting, show it again, go to the next meeting, show it again. Oh, you want a change? Hold on. There you go. That was our pipeline. That's what we had to do. No, no, we can do better than this. We have to do better than this.

So, how do we do this? Well, we do have to think about ingestion. And that ingestion requires us to take the data out of the raw source, my personnel data warehouse, using my scripts, my containers, whatever my pipeline is set up with, and turn that raw data set into a trusted data set. It comes into my data lake, and that's the trusted data set for my data scientists.

Then we start curating it. Let's say it's something on the demographics of my employees, so I've created this demographic data set. I break out the columns I need and get it ready for analysis. That data set gets updated on a regular basis, and it's what my data scientists use for any and all demographic analysis. We've created data dictionaries and data glossaries, so everybody knows how the calculations work and how to use them. We've created wikis to share that information. All of that is there so that everybody's using the same data set.

Because in the previous way, downloading from the web, I may pull some data, somebody else pulls some data, and we probably don't get the same results, because we're all pulling on different days, or maybe with different fields. This way we're saying: this is the data set you use for demographics. That goes in as a curated data set, and that curated data set can then be shared for analysis and visualization internally and externally across the organization. That all fits into our pipeline. So you've got raw, trusted, curated, shared.
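The raw to trusted to curated flow can be sketched as a pair of small transform steps. This is an illustrative sketch only, not NASA's actual pipeline, and the field names (`employee_id`, `center`, `hire_date`) are hypothetical:

```python
# Illustrative raw -> trusted -> curated flow. Field names are made up.

def raw_to_trusted(rows):
    """Validate and standardize raw records into the trusted data set."""
    trusted = []
    for row in rows:
        if not row.get("employee_id"):   # drop records that fail basic checks
            continue
        trusted.append({
            "employee_id": str(row["employee_id"]).strip(),
            "center": (row.get("center") or "UNKNOWN").upper(),
            "hire_year": int(str(row.get("hire_date", "0"))[:4] or 0),
        })
    return trusted

def trusted_to_curated(trusted):
    """Derive an analysis-ready demographics set: headcount by center."""
    counts = {}
    for row in trusted:
        counts[row["center"]] = counts.get(row["center"], 0) + 1
    return counts

raw = [
    {"employee_id": " 42 ", "center": "jsc", "hire_date": "2001-03-14"},
    {"employee_id": None, "center": "jsc"},   # fails validation, dropped
    {"employee_id": "7", "center": "ksc", "hire_date": "2015-07-01"},
]
curated = trusted_to_curated(raw_to_trusted(raw))
print(curated)   # {'JSC': 1, 'KSC': 1}
```

The point of the two-step split is that everyone's analysis starts from the same curated output, rather than from their own ad hoc pull of the raw source.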

But you have to have an understanding of your storage. How are you going to store it? What are you going to do with it? What type of systems are you going to use for storage? And then from that, you have the integration piece. How do you access that storage? APIs, direct access, ODBC, what are you going to use to access that type of information? Then how are you going to analyze it, and then eventually present it?
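As a minimal sketch of that integration layer, the analysis code below talks to the curated store only through a standard database interface. Here `sqlite3` stands in for the real ODBC or warehouse connection, and the table and column names are hypothetical:

```python
# Integration sketch: the curated data set exposed through a standard
# database interface. sqlite3 is a stand-in for the real ODBC/warehouse
# connection; the demographics table is a hypothetical example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demographics (center TEXT, headcount INTEGER)")
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?)",
    [("JSC", 11000), ("KSC", 7000)],
)

# The analysis layer only ever queries the curated table,
# never the production source directly.
total = conn.execute("SELECT SUM(headcount) FROM demographics").fetchone()[0]
print(total)
```

Swapping the connection object for an ODBC or API-backed one changes the storage, but not the analysis code, which is what makes the storage and integration layers worth separating.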

So, here's the way I usually talk to my organization or my managers: you want me to present you some information? You want me to visualize this for you? Then I have to be able to analyze it. In order to analyze it, I have to be able to integrate with the data, to connect to it, to access it with the right permissions. And in order to do that, I have to know where it's stored and how it's stored. I've got probably 30, 40, 50 different databases out there I have to keep an eye on and track, to see where we're getting the various data from.

So, there's a lot you have to do. That's why, when they say, can't you just do it like ChatGPT? This is what it takes for me to do that. In order to do that, you've got to give me the funding, the authority, and the access. Once they see that, they start to say, great, you can do that, let's go forward. And they're starting to get this information quicker.

And when we can do it on the fly, they start to see the value of what we're really trying to do, and they say, we want more of that. Once they get hooked, they really want to see that information. And one thing I probably didn't emphasize enough: you do need an advocate. I'm constantly going to meetings and talking about this. Somebody has to keep telling them why they need this, because if you don't, they'll forget. Management changes, new people come in, and you've got to go back and tell the story all over again. You've got to be able to do that consistently and keep sharing those stories with them.

So, once you have all of that, you may end up with something like this. Now, the one on the left, the little diagram there, is just something I created. It's actually animated, with dynamic links and things I can show to my managers to walk through all the different pieces that we have. But my analytic architecture starts first with the people. Understand what type of users you have: the end users, your data scientists, your business intelligence analysts, all of the different people that are going to be using the data. Then you work through, how do you create that data pipeline? What are you going to use for that?

And this is another thing I just thought of: APIs. Talking to my IT guy who creates all these APIs for me, we finally figured out we have a difference in what we need APIs for. Again, same definition, different use. They're using APIs to pull data and put it on a website, in a web browser. They need one, maybe two data sets, or two fields or observations coming through there. I need APIs that are going to pull a massive amount of data over to my storage. We're doing the same thing, but it costs us different resources. And I have slowed and broken a lot of servers over the years doing these types of things. So I really have to talk to my IT folks about how we set these APIs up and what we're trying to do.
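The bulk-pull pattern that distinguishes the data science use of an API from the web-page use looks roughly like this. The `fetch_page` endpoint here is entirely made up for illustration; a real version would be an HTTP call negotiated with IT, and paging in bounded chunks is what keeps you from slowing or breaking the server.

```python
# A sketch of paging through a whole data set in bounded chunks, rather than
# requesting one or two fields at a time (the web-browser use case).

RECORDS = [{"id": i} for i in range(250)]  # pretend server-side data

def fetch_page(offset, limit):
    """Simulated API endpoint returning one page of records."""
    return RECORDS[offset:offset + limit]

def pull_all(page_size=100):
    """Pull everything, one bounded page at a time."""
    out, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # empty page means we've reached the end
        out.extend(page)
        offset += page_size
    return out

data = pull_all()
print(len(data))  # all 250 records, pulled in three pages of at most 100
```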

Then you have to look at your data sources and understand the different types of connections. We have a lot of different databases, but we don't have all of the modules those databases may need. We don't have all of the connectivity to some of those databases. Sometimes they give us access, or we get read access; sometimes we don't. So there's a lot you have to work through to get there. It's difficult, but it's well worth somebody getting in there and working with them to get this done.

Then you can start thinking about how you're going to store that in a data lake and get your curated data sets. And then the fun really starts for a data scientist. How do we start turning that into some kind of visualization or application? It could be some kind of no-code or low-code tool like Alteryx or KNIME. Or something more high-code, with MLOps on our Posit servers. We've got vetiver, which helps us with some of our modeling, or MLflow, or Kubeflow. There are a lot of different options, and that gives us the opportunity to do so many things. My team is both Python and R, and we can do a lot of that within those ecosystems right now.

Then you've got your business intelligence. How are you going to present it? Tableau, Power BI. Again, you can do that in Shiny, or by building some kind of analytical application. But one key thing: always, always, always have a group doing some kind of research on the next technologies coming out. Because if you're not keeping up to date, or at least understanding what's going on, you're going to be two years behind by the time you decide it's time to move. So make sure you do your research and keep some momentum going, so that when your managers or executive leaders ask what's the next thing you need to do, you're ready to go, because you've done your research.

So, that should hopefully take you from data confusion, where your Lego bricks are spread all over the place, to data intelligence, where you use your algorithms and your methodologies to create some kind of actionable knowledge, something that's useful for your organization. It's a little visualization that some managers really like; they see it, they get it. They like those pictures, for some reason. And it helps them understand why it's important for us to get from one level to the other. So, as Elaine said earlier, hopefully we've stuck the landing.

Q&A

And that's my story. That was so helpful to hear. Thank you. You know, I'm still stuck on that part at the beginning about two years to get access to the data. But that's a really good reminder for me that patience and persistence can pay off in the end. That is true. And for me, it was real validation knowing that others have suffered along with me, or are suffering along with me, on the journeys they're taking, but also realizing that it is worth it in the end to persevere.

Thank you very much, everybody.

I'm Rachel Dempsey. I lead our customer marketing here at Posit and host our weekly data science hangout where you may have met Elaine or David there. We'd like to open up this opportunity to ask some questions to Elaine and David. I know we're not gonna have time for every single great question right now. So David and Elaine are also gonna join me in the lounge right next door, right after this too.

But Elaine, you mentioned it's almost impossible to over-invest in communication. And I was wondering, what has helped you most in getting people to grasp what you're saying with data?

Yeah, I think we struggle with this sometimes as data scientists because spending a lot of time figuring this out doesn't feel like the work we're supposed to be doing. But what I found really helpful is trying to get inside the head of the audience, whoever that is. So if you're presenting to the CEO, that's one perspective. If you're presenting to IT, that's a completely different perspective.

But I think usually, when we start thinking about how to present work we've done ourselves, we're really in the process; we tend to present it the way that we did it, right? So I often try to go through three rounds of: okay, this is what makes sense to me. But now let me think again about my audience, what they want to get from this, and what is the one thing I want to make sure they take away. And then I change the way I present it. And then I do that again, because it's really hard to get out of that data science mindset and into the mindset of, if I'm the CEO, I have to make a decision about this next week. I ask myself, if I were her, what would I do after seeing this? And am I getting that point across?

Absolutely, thank you. David, I loved when you were kinda walking across the stage saying you're bringing your laptop from one meeting to the next across NASA. And I was wondering, how are you doing this today?

Yeah, I mean, we've come a long way since that timeframe. We've been able to implement a lot of different capabilities. COVID kinda helped a little bit, unfortunately. I hate to say that it helped, but it really put some emphasis on the need to share data more easily across the organization, so some funding came up. We've had things for business intelligence. NASA actually had six Tableau servers across the agency at that time, so we consolidated them down to one to be able to share some business intelligence. But we also have our Posit server, which allows us now to use our Connect server to do some of the other things, whether it's a Shiny application, or books, or wikis, or presentations, all of those things that make it easier. So now we have one location, these portals we created, where they can go to one landing page and connect to all these various types of capabilities.

Thank you. I see a lot of the questions starting to come in now, and one was, how can one-person teams advocate for IT and data science differentiation?

Yeah, I think when you're a one-person team, sometimes you have to do some of everything. So it's about figuring out where to meet in the middle and helping people understand, like David was talking about, by speaking the same language. On teams I've been on, especially when we're getting started, we sometimes use tools in that gap. We've used dbt as a way to think about structuring the data; it's a little friendlier for people on the engineering side to understand how the data's coming in, and then we build out from there. So you kind of have to do whatever it takes when you're just one person, but start to show people what this might look like and what the different steps in the process are, even if you're doing more than one of them to begin with.

Did you have anything you wanted to add there? No, no, these are great answers. I mean, 20 years ago, it was probably me and a bottle of whiskey with the IT guy. You know, he'd say, let's talk.

Another question, and I'll ask this to you first, David, is what if you have colleagues that are trying to make it difficult to get data for the sake of their job security?

Unfortunately, that's common. I get a lot of that. I had this one teammate who needed some data from an organization, and she requested to go talk to the individuals. When she got to the meeting, there were 17 other folks in that meeting questioning her: why do you need access to that data? A lot of it came down to fear that they were going to lose their jobs, and you've got to find a way of showing them that it's not about taking away responsibility or taking away job duties, it's about making it easier for them. How do we make this easier together? How do we share? How do I include you in that process, in that pipeline?

But be honest: sometimes that job may become irrelevant, so what do I do to help you move into a different role? That's where you really need to work with the managers. Okay, these things may change; how do we get this person to still have a viable, meaningful job? It may not be what they're doing right now, because our pipelines have changed to supersede all of that.

Okay, one other question was, I'll ask this to you, Elaine. What advice do you have for those of us trying to hand off processes we create to our stakeholders rather than maintaining them in perpetuity?

That's an interesting question. I think it depends a little on the situation, what you're trying to hand off and why. Sometimes there can be intermediate steps where you do the handoff in different ways and maybe work your way back. One example someone was asking about yesterday was using pins to share data. I was explaining a way we've done this where, in the process of running our Markdown files and Quarto documents, we pin some things in Connect that are just CSVs, which people can then take and run their own processes with. So even though at that point the reports we're generating are still doing a lot of the analysis, we're giving people a way into the middle of that process, where they can start to participate and build on what we're doing, and then work back to where we started.
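The shape of that handoff, publishing an intermediate CSV under a stable name so others can pick it up mid-pipeline, can be sketched with just the standard library. This is only an analogue of the pins workflow: the in-memory "board" stands in for a Connect server, and the names and columns are invented.

```python
# A rough stand-in for the pins workflow: a report publishes an intermediate
# result as a named CSV "pin" that colleagues can read and build on.
import csv
import io

def pin_write(board, name, rows, fieldnames):
    """Publish rows as CSV text under a name on the 'board'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    board[name] = buf.getvalue()

def pin_read(board, name):
    """Read a published CSV pin back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(board[name])))

board = {}  # stands in for a Connect server's pin board

# The report pins its intermediate result on its way to the final output...
pin_write(board, "headcount-by-dept",
          [{"dept": "ENG", "n": 40}, {"dept": "OPS", "n": 25}],
          fieldnames=["dept", "n"])

# ...and a stakeholder picks it up to run their own process.
shared = pin_read(board, "headcount-by-dept")
print(shared)
```

Note that the CSV round trip turns the counts back into strings; the real pins package handles richer formats, but the handoff idea, a stable name in the middle of the pipeline, is the same.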

Okay, I think we have time for two more questions here. One of the questions was, we all love free stuff, but aside from hiring well, what have been the best investments you've made that cost serious money?

Well, the simple answer for me, to start off with, and we're not talking technology completely: it's the people. You've gotta have the right people and really invest heavily in your people, not only in who you're hiring, but in the training they get and the ability to learn more of what they're doing. But if we're talking technology-wise, for me, because of the best bang for the buck I've gotten so far, it's gonna have to be Posit. And I'm not saying that because I'm at a Posit conference.