Resources

From Data Confusion to Data Intelligence - posit::conf(2023)

Presented by Elaine McVey and David Meza. Data science teams operate in a unique environment, much different from the IT or software development life cycle. Hope from executives for the impact of data science is extremely high, but understanding of how to make data science efforts successful is very low. This creates an interesting set of organizational challenges for data and analytics teams. These are particularly clear when data science is being introduced at new companies, but they play out at organizations of all sizes. So, how do we navigate this dynamic? We will share some strategies for success.

Presented at the Posit Conference, September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: From Data Confusion to Data Intelligence. Session Code: KEY-1060


Transcript

This transcript was generated automatically and may contain errors.

I hear we both lead data science teams. I specialize in getting data science going at early-stage startups. What do you do? Basically the same thing, but in a larger organization. I develop data science teams, but take it to the next level, helping them sustain themselves and build out their data and analytics architecture well.

You know, that sounds great. Actually, getting data started from scratch, there are so many things that are frustrating. Often we kind of just run into a lot of barriers. It must be nice to be somewhere where everything is established and running smoothly. You know, it would be nice. We have our challenges, too. Large organizations tend to be very slow, especially government organizations, and we tend to be more reactive than proactive. So we're working to change that, to be more proactive in our organizations.

You know, maybe I can share with you some of the things I've learned about getting data science off the ground, and then I would love to hear what you've learned about doing data science at scale. That'd be great. Thanks.

The gap between hope and understanding

So, I've thought a lot about what's kind of the root problem of some of these things that we face getting data science started or, you know, even the later stages, and what I've concluded is this. It's really that hope for data science is so high in terms of what we can deliver, but the understanding of how to set teams up for success is so low. And this can mean that your organization kind of accidentally makes it impossible for you to do the job they were so excited to bring you in to do in the first place.

It's really that hope for data science is so high in terms of what we can deliver, but the understanding of how to set teams up for success is so low.

So if you run into these problems, who can you turn to to help you kind of solve these organizational issues that you're facing? And I think the answer is it's you. It's you, the data scientist. It's us, the data scientists. Maybe this shouldn't be our job, but in my experience, the best chance of success is if we take the lead in helping our organizations learn how to make us successful.

So if we're going to do this, I think it's helpful to understand why we find ourselves in this situation where there's this big gap between the hope for what we can deliver and the understanding of how to help us do that. So first of all, the hope part. If you put yourself in the mind of a CEO or another executive who decided to bring a data science team into an organization, these are the kinds of messages they're getting. Data is the world's most valuable resource, right? So if you are someone who's in charge of the long-term strategic direction of an organization, you want to make sure you're not missing these big opportunities to be doing things with data.

And then I think on the flip side of that is fear. Most CEOs think their organizations have to evolve or die, whether they're small and early stage or big and established. The world is changing quickly. They need to stay on the cutting edge and be doing things that are transformative. And part of that is being smart about how they use data. So I think it's these big, high-level ideas that get people excited about what we can deliver.

A story about getting started

Years ago now, I was hired into a transit software company under these circumstances. The CEO was the one who decided that they needed to hire a data scientist. And he told me, I know we have a lot of data, and I know there's more valuable things we could be doing with it. And I'm not sure exactly what that is, but we need you to help figure it out. So I went in knowing there was this uncertainty where I would have to help find the value in the data and help the company realize it. But I also knew I had the backing of the CEO, and so this seemed like a pretty good situation.

So I arrived and started talking to the lead engineer about getting access to the data. And he said, oh, no, I can't give you access to the data. We've built our data pipelines with these particular needs for our production system, and it's not built for analytic queries. It's not safe for me to let you get the data. So immediately, this is a problem, right? And so I had to keep pestering him. I can't do anything if you can't get me the data. So eventually, he agreed, okay, we can try running a query so you can get some data.

So I wrote the query, I checked with him, he approved it, and I sent it off. And I waited and waited, and it seemed like it was taking a really long time. So eventually, I wandered down the hall to ask him if he knew what was going on. And he was not available to talk to me, because he was in the middle of an emergency situation. All the buses in our product on the maps had stopped moving. And why had that happened? Of course, it was because of my one query. So that was the end for a long time of me getting access to that data.

Three things organizations don't understand

And this kind of very basic problem is not unusual. When I talk to other data scientists and data leaders, these are the kinds of things we face a lot. So there are three buckets of things, I think, that we need that organizations don't really understand coming in.

One is analysis-ready data. And this can go wrong in several ways. One is you can't get access to the data, or it's not in a form that's safe for analytics. Another might be that the data you need for the project you're doing doesn't exist. Not really even being collected. Or maybe it is being collected, and you can get access to it, but the data management practices are so nonexistent that it's messy to the point where it's almost unusable.

Another thing we need is cross-team support. So in the situation I was in at the transit company, the engineering team could have solved my problem. They had the ability to get the data into an analytics data warehouse, right? But they weren't expecting to need to do this work. No one realized the data scientists would need any support. And they had a list of high-priority items that they were already struggling to be able to address with the capacity they had. So they weren't going to prioritize this.

And this can also happen in the other direction. Sometimes people are thrilled to see you arrive as the data person, because they think, my team has data problems. We spend way too much time reporting. Here's someone who can solve this for us. And then you can get a lot of incoming requests, which can be helpful in understanding what your organization needs. But if you can't control that, you end up spending all your time reacting to these requests, and never get to those potentially really high-impact projects, which are what people really wanted data scientists for in the first place.

And the third thing is space for uncertainty. So unlike almost everything else that organizations do, data science projects depend on the contents of the data in a way that makes it hard to know whether they're worth doing until we've kind of already done them. And so this can make it really hard to fit into the normal way that your organization plans work.

Guerrilla data science tactics

So if you find yourself dealing with these kinds of problems, what do you do? And I think in my experience, there's a set of tactics that work here that I think of as guerrilla data science tactics. So this means maybe not doing things ideally the way that we should be doing them, but doing what we have to do to make it work in the situation we're in. So being scrappy and clever and working around the obstacles in our way.

But guerrilla, of course, comes from guerrilla warfare, and warfare is not the right mindset for us to be in, right? So as frustrated as we might get with some of our colleagues, they aren't trying to make things difficult for us. We're all wanting to make it work, we just haven't figured out how to do that yet. So instead, maybe we think about this as guerrilla data science, powerful but friendly.

So what does this actually mean in practice? There's a series of steps that I have found success with in getting data science off the ground. The first step is scan for opportunity. So this means being heads up in your organization and looking for the overlap of what's valuable to the organization and what you're able to do with data.

And once you've found that opportunity, the next step is show, don't tell. So it can be tempting to explain to people what you're thinking and what you're planning to do and what you need from them and how this is going to play out to get buy-in, but in my experience, at least early on, this is usually a waste of time. What you need to do is build the thing, at least the skinny version of the thing, so that they can see the vision you have in a more concrete way and have something to react to.

Next, take the data and run. So this doesn't mean stealing the data or hacking into the data, but it means that if you can't get access in the way that you ultimately will need to, you have to do whatever it takes to get an initial data set. Maybe this means befriending an engineer who does have access and asking them to just help you get a one-time data set that you can get started with.

And then once you've used that data set to build something that delivers on that opportunity, nail the landing. And what this means is communicate. So all of these things you've done up to this point don't matter if no one knows what you did or doesn't understand why it's valuable. And I think it's almost impossible to overinvest in communication, communicating things different ways until it connects with people and they really get it, and also communicating broadly. So not just your primary stakeholder or your team, but everyone around the organization. Make friends and talk to them all the time about what you're doing.

And then once you've done this, it's time to up the ante. So once you've done an initial project that shows value and people know about it and understand it, you want to use that organizational goodwill to do something that's maybe higher risk but higher impact the next time. And so you go through this cycle again.

So when I found myself in this position of not being able to access the data that I was there to find value in, I was at the scan for opportunity stage. And the opportunity that presented itself: we had just raised money, we had a growing employee base, and people needed to understand what was going on with the company. And so we worked on business metrics. Not what I had come to do, but it was the opportunity we had at the time. And so we started building a flexdashboard, deployed it, and shared it with people so they could see what we were doing.

The data here did not come from that production system, but we were able to find scrappy ways to pull information in from around the company to feed this dashboard. And then we communicated it to everyone. We showed it in company meetings, we talked to teams about how they could use it. And that got us some goodwill to do the next step.

So now we're back to scan for opportunity. And this time, the company was getting into a new product in the area of microtransit, which is like an Uber pool for transit agencies and universities. And this runs on an algorithm. So from our data science viewpoint, we thought, well, of course we understand how the algorithm works, but that's not the same as understanding how it will play out in the real world. And we thought we could figure that out, figure out where it works well and where it doesn't in which situations by using simulation.

And so now we had this goodwill. So I thought, this time we can do it right. I went to the product manager and talked about our idea and why we thought it would be valuable and what we needed from the engineering teams to do this. And she said, oh, that's interesting. We're not going to prioritize this. No one's asking me for it. We have a long list of high-priority items, and I already can't do them all with the team that I have. So we're just not going to be able to help you with this.

So back to show, don't tell. How could we build a version of this that we could show people on our own? So in this case, because we were using simulation, we were creating our own data. So that's problem solved. But we still needed access to the algorithm. So we went to engineering with this one-time ask. Just help us get access so that we can run this algorithm offline. It won't interfere with your production system, and we'll leave you alone. And so we got that.

So then we built our Shiny app to create inputs. We scheduled all our simulations. We created reports for different scenarios. So then we had something really cool to show people. So we went back to our team and the product manager and showed them this. And within our department, people thought, you know, this is interesting, but it's kind of like a toy. It's not really, you know, part of our software. It's just kind of an interesting side project. But we kept talking about it. And one day, I showed it to the chief revenue officer. And he immediately saw the value in this because he's thinking about the customer, right? The customer is uncomfortable with this algorithm. They need to be able to see how this is going to work for them before they're willing to put this out in the world. And we very quickly became part of our marketing and sales process, and also part of how we communicated to our investors why this was important and what it actually was. So now we had gotten to that point of this high impact. This was the kind of thing that the CEO had wanted to bring the data science team in for to begin with.

Reflections on the process

So a few reflections about what's important in this process. One is to maximize speed and autonomy. So even though we're in this privileged position where people have great faith that we can do amazing things, even though they don't quite know what they are, we still have to start delivering eventually, right? So we have a limited amount of time. And in my experience, the best way to maximize speed is to maximize autonomy of the team.

And this doesn't mean working in a silo or not collaborating with other teams. But what it means to me is that you need to be able to run that iterative data science development process without hitting places where you need help from another team, because that's what's going to slow you down.

The other thing is to build foundations wherever possible along the way. So even though we're doing these scrappy things in our guerrilla data science tactics, everywhere we have a chance, we want to start building things that we can build again on later. So building data pipelines that we can schedule, building R packages that we can reuse, all of those things.

And these are the things where my teams have found Posit Connect to be really helpful. So instead of every time we need to deploy a dashboard or we need to deploy an app or we need to run a cron job, having to go ask engineering for help with getting that set up, it was just a one-time ask, help us set up Posit Connect. And then we're off and running on our own. And we can do all these things without constantly needing support.

So if you've gone through this process and you've gotten to that first really high-impact result, pat yourself on the back. You can declare victory. And now you've completed step one, because this is just the beginning to create momentum. But going forward, you can't do things this way and have a sustainable and scalable data science practice. Particularly the take the data and run step needs to evolve. And David, I would love to hear what you've learned about doing data science at scale and how you've done some of these things.

Scaling data science in a large organization

Happy to share. So, Elaine, when I was hearing you talk, it really brought back a lot of memories. I almost felt like I was in an episode of This Is My Life, because I went through so many things similar to what you just did. And I just couldn't understand why it was so hard to get the data. I mean, it took me two years, two years, to actually have the head of human capital IT tell her staff, if David and his team need the data, they get the data. There was a lot of cajoling, a lot of begging, a lot of pulling hair out.

But I think what it really came down to was showing her that we could get answers in minutes instead of days, or sometimes weeks. Let me give you a small example of our data access at that time, when I first joined this organization. We had this personnel data warehouse with information on roughly 18,000 employees, all the data about them. I had to go through a web interface to this very well-known back-end database. There are some queries on there that I can pull that maybe have 10, 15, 20 different attributes of an employee. It takes maybe a minute to two minutes for that thing to load once I access the query.

Then I have to add other attributes to that employee, depending on what I'm doing. That's another 10 to 30 seconds for each attribute. And if I got the wrong one because I don't have a data dictionary, I've got to pull that back. That's another 10 to 30 seconds. Add another one, another 10 to 30 seconds. Imagine doing that for 40, 50 different attributes. And I could only pull down 750,000 cells at one time. Not rows or observations, cells. And if I needed to do historical data 10 years back across all my centers, it could be days before I get all the data aligned. And please don't make a mistake because then I got to do it all over again.
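To make the scale of that concrete, here is a back-of-the-envelope sketch of that old workflow's cost, using the figures above. The 50-attribute analysis and single-snapshot framing are illustrative assumptions, not details from the talk:

```python
# Rough cost of the old web-interface workflow, using the figures
# mentioned above. The 50-attribute analysis is a hypothetical example.
employees = 18_000        # records in the personnel data warehouse
attributes = 50           # columns needed for one analysis (assumed)
cell_limit = 750_000      # maximum cells per download

cells_needed = employees * attributes                 # 900,000 cells
pulls_per_snapshot = -(-cells_needed // cell_limit)   # ceiling division

# Adding attributes one at a time, at 10 to 30 seconds each:
build_minutes_best = attributes * 10 / 60    # best case, in minutes
build_minutes_worst = attributes * 30 / 60   # worst case, in minutes

print(pulls_per_snapshot, build_minutes_best, build_minutes_worst)
```

And that is for a single snapshot; ten years of history across multiple centers multiplies the number of pulls, which is why one mistake could mean days of rework.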

So I kept asking: why can't I get to all this data? There had to be a better way. And there is, I think, I hope. But it takes somebody who will take what you started to the next level, advocating for everything you've done, and being able to talk to leaders and tell them why we need to do it.

When we got to the end of our journey, which is very similar to yours, I thought I had built a mansion. I had all of these different applications out there where we could showcase things. But did I really? What I had was a facade with a bunch of little bitty trailers in the back. And they were nice trailers. Don't get me wrong. They were good. They worked. They worked well. They worked well together. But they were still individual little pieces. There was no interface or any way for me to really share this across the enterprise. There wasn't any way for me to really get it out there and showcase it to the people. So we had to do something different.

Knowledge architecture and the data foundation

So what's next? Well, I started thinking about this, and it really all starts with the data, as Elaine mentioned. The data is the key to everything we do. It is the new oil, the most vital commodity we have. So you have to think about how you develop your data foundation: what you have to do to go from grabbing the data from somebody by hand to having a pipeline that automatically gets it to you. But to do that, you have to set expectations with your leaders.

Because they'll come to you. You know, 10, 15 years ago it was, can you just do it like Google? Now it's, can you just make it like ChatGPT? It takes a while to do that. I can't do that immediately because, one, I don't have access to the data. And two, I don't want to see a whole lot of hallucinations, especially when we're trying to get back to the moon and on to Mars. We want to make sure we have this right.

So I needed to create an understanding, to level set the perceptions of everybody in the organization. But another key I found was that we need to differentiate between IT and data science. IT is the pipeline that gets us the data. Data science is where we do the magic, as they like to tell me, turning that data into some kind of answers. So how do you do that?

Well, a little over a decade ago, I developed this framework that I call knowledge architecture. And let me give you a quick little story about how this all came about, so you can understand what I'm trying to get to here. Again, over a decade ago, I was leading a small team in a knowledge management group down at Johnson Space Center. And I was meeting with my taxonomist and my web developer. We were talking things over, and the taxonomist looked at the web developer and said, I need access to the metadata. And the web developer adamantly said no. They went back and forth on this for 10, 15 minutes.

I finally said, timeout guys, let's just go to our respective corners and talk about this later. I went to meet with my taxonomist separately, and come to find out, this conversation had been going on for a couple of months with no resolution. I said, first off, tell me, what's the definition of metadata? In her mind, the data on the data. Okay. Give me some examples. Author, abstract, date created, title, all information about the document she was trying to categorize and classify in her system.

I went to the web developer and had the same conversation. He gave me basically the same definition. All right, give me some examples. Field type, length, size, everything you'd use to describe a database. Again, this was level setting the organization to get them to communicate. Same definition, different examples. Once I got them talking the same way, they were able to get each other the information they needed. So it was really about understanding each other's worlds.

So that's when I started thinking, we need a liaison, we need a group. And that's where knowledge architecture came into play. It's somebody who can talk all three languages to get this information across. So what is knowledge architecture? In my mind, knowledge architecture is a combination of knowledge management, informatics, and data science.

Knowledge management is a strategy for how we identify, store, analyze, and visualize our information, our data. Informatics, again, is that pipeline: the people who get that data, our commodity, our oil, from one place through processing and out to where we can utilize it. And data science is the algorithms and methodologies we use to turn that data into some kind of actionable knowledge. By combining all of that, we start to develop our data foundation.

I've set some expectations, I'm communicating across, I've created that understanding, and I'm differentiating what's between IT and data. That gets me started to develop this framework that we can start creating those data analytical pipelines that allow us to more efficiently use our data in our organization.

Building a sustainable data pipeline

So, I'll say this again: it all starts with the data. Think back, if you've seen Raiders of the Lost Ark, to the last scene where they're taking that ark and putting it into a warehouse, and it rolls back and you don't see it again forever. That's kind of how we've done our data. NASA's been around for 60-plus years. We've stored a lot of data and it's just kind of sitting there somewhere. It first started in filing cabinets, then it went to file servers, then to SharePoint and databases, all in the same pattern. We've always collected the data just to store it.

We need to do better. I think you talked about it as analysis-ready data. We need to collect data with analysis in mind. We need curated, AI/ML-ready data sets that will support us. No more of this, it takes me 80% of my time to clean the data before I can do my analysis. I want to get to the data and start working on the analysis.

So, some of the things we need to do when we look at that: we think about our orchestration and ingestion. And this comes by working with IT and differentiating those roles. It will take somebody to lead this effort, somebody like a chief analytics officer who's going to work with the senior leaders to show the difference and why we need to do this. Too many times we put analytics into IT. Yes, data science is code, but we're not IT developers. We do our own thing. We have our own ways of doing things, and we need to differentiate between the IT and data science life cycles. So we need to make sure we get that done.

So, think about how you're going to ingest from your data sources. Take my personnel data warehouse. What I'm doing right now is taking the data from that warehouse and, based on the types of analysis we need to do, creating these curated AI/ML data sets. We're taking it through those systems, transforming it, and getting it to the point where it's ready for us to use.

But you also have to make sure somebody's talking to IT security. Again, if you're in a large organization like NASA, we've been around many years. We've been around since before there was a network. I went through the Cisco versus Novell phase of deciding who was going to end up running the network. But each center did it their own way, and we have firewalls all over the place. So if you don't talk to IT security early, you're going to be stopped in your tracks very quickly. You've got to get them involved. They can be a headache sometimes, but it's well worth it to get in there and talk to them, to make sure they understand why you need access to the data and how you'll get to it.

But here's the key, in my mind. You need to work with the data owners and the data stewards to get them to understand their role in the organization. The data owners are the people who run the business side, the people who actually own the data for that business process. The data stewards are the people who understand the processes that data goes through. They need to understand their responsibilities. Too many times, they've given those responsibilities up to IT. No, IT doesn't own the data. IT gets it to you. The data owners own it. So you've got to make sure they understand what they need to do. Get them involved. Show them the value, as Elaine said: if we use your data correctly, this is what we can do for you.

And that leads us to something that may not necessarily be our role as data scientists, but we have to take it on because nobody else is doing it right now. We have to advocate for data literacy and data governance. We have to get people to understand. Everybody in the organization should be somewhat data literate so they can help us help them. And by data literacy, I mean the ability to read, work with, analyze, and communicate with data. Whatever their role is, whether that's a project manager, an administrator, or anything else, they all need to know how to work with their data, and that helps us help them. So we've got to advocate for these things in an organization to really raise the level of what we're trying to do and make it sustainable.

We have to advocate for data literacy and data governance. We have to get people to understand. Everybody in the organization should be somewhat data literate so they can help us help them.

So, how do we create a sustainable pipeline? How do we get our data from our old data sources into our data lake, or our data mesh, or our data water fountain, or whatever we want to call it, so it's all there, able to be shared and visualized?

But before I get there, I'm going to give you another quick story. I talked earlier about how I got the data and how long it took. Well, I didn't tell you the next part of this pipeline. Once I got that query built, there was a little button up there that says download CSV file. Okay. I download the CSV file. Maybe one file, maybe 10, 20, 30 files I have to download. Then I put that on my laptop, open up RStudio, VS Code, Tableau, whatever we're using to ingest the data, and start working on our visualization and analysis. And I create these great applications. We create these great dashboards.

Then it comes time to present it. So here I go. Take my laptop, go to one meeting, plug it in, show them, this is what we're doing. Then go to the next meeting, show it again, go to the next meeting, show it again. Oh, you want a change? Hold on. There you go. That was our pipeline. That's what we had to do. No, no, we can do better than this. We have to do better than this.

So, how do we do this? Well, we do have to think about ingestion. And that ingestion requires us to take the data out of the raw source, my personnel data warehouse, using my scripts, my containers, whatever my pipeline is set up with, and turn that raw data set into a trusted data set. It comes into my data lake, and that's the trusted data set for my data scientists.

Then we start curating it. Let's say it's something on the demographics of my employees, so I've created this demographic data set. I break out the columns I need and get it ready for analysis. That data set gets updated on a regular basis, and it's what my data scientists use for any and all demographic analysis. We've created data dictionaries and data glossaries, so everybody knows how the calculations work and how to use them. We've created wikis to share that information. All of that is there so that everybody's using the same data set.

Because in the previous way, downloading from the web, I may pull some data, somebody else pulls some data, and we probably don't get the same results, because we're all pulling on different days, or maybe with different fields. This way we're saying: this is the data set you use for demographics. That goes in as a curated data set, and that curated data set can then be shared for analysis and visualization internally and externally across the organization. That all fits into our pipeline. So you've got raw, trusted, curated, shared.
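The raw to trusted to curated flow can be sketched as a pair of small transform steps. This is an illustrative sketch only, not NASA's actual pipeline, and the field names (`employee_id`, `center`, `hire_date`) are hypothetical:

```python
# Illustrative raw -> trusted -> curated flow. Field names are made up.

def raw_to_trusted(rows):
    """Validate and standardize raw records into the trusted data set."""
    trusted = []
    for row in rows:
        if not row.get("employee_id"):   # drop records that fail basic checks
            continue
        trusted.append({
            "employee_id": str(row["employee_id"]).strip(),
            "center": (row.get("center") or "UNKNOWN").upper(),
            "hire_year": int(str(row.get("hire_date", "0"))[:4] or 0),
        })
    return trusted

def trusted_to_curated(trusted):
    """Derive an analysis-ready demographics set: headcount by center."""
    counts = {}
    for row in trusted:
        counts[row["center"]] = counts.get(row["center"], 0) + 1
    return counts

raw = [
    {"employee_id": " 42 ", "center": "jsc", "hire_date": "2001-03-14"},
    {"employee_id": None, "center": "jsc"},   # fails validation, dropped
    {"employee_id": "7", "center": "ksc", "hire_date": "2015-07-01"},
]
curated = trusted_to_curated(raw_to_trusted(raw))
print(curated)   # {'JSC': 1, 'KSC': 1}
```

The point of the two-step split is that everyone's analysis starts from the same curated output, rather than from their own ad hoc pull of the raw source.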

But you have to have an understanding of your storage. How are you going to store it? What are you going to do with it? What type of systems are you going to use for storage? And then from that, you have the integration piece. How do you access that storage? APIs, direct access, ODBC, what are you going to use to access that type of information? Then how are you going to analyze it, and then eventually present it?
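As a minimal sketch of that integration layer, the analysis code below talks to the curated store only through a standard database interface. Here `sqlite3` stands in for the real ODBC or warehouse connection, and the table and column names are hypothetical:

```python
# Integration sketch: the curated data set exposed through a standard
# database interface. sqlite3 is a stand-in for the real ODBC/warehouse
# connection; the demographics table is a hypothetical example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demographics (center TEXT, headcount INTEGER)")
conn.executemany(
    "INSERT INTO demographics VALUES (?, ?)",
    [("JSC", 11000), ("KSC", 7000)],
)

# The analysis layer only ever queries the curated table,
# never the production source directly.
total = conn.execute("SELECT SUM(headcount) FROM demographics").fetchone()[0]
print(total)
```

Swapping the connection object for an ODBC or API-backed one changes the storage, but not the analysis code, which is what makes the storage and integration layers worth separating.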

So, here's the way I usually talk to my organization or my managers: you want me to present you some information? You want me to visualize this for you? Then I have to be able to analyze it. In order to analyze it, I have to be able to integrate with the data, to connect to it, to access it with the right permissions. And in order to do that, I have to know where it's stored and how it's stored. I've got probably 30, 40, 50 different databases out there I have to keep an eye on and track, to see where we're getting the various data from.

So, there's a lot you have to do. That's why, when they say, can't you just do it like ChatGPT? This is what it takes for me to do that. In order to do that, you've got to give me the funding, the authority, and the access. Once they see that, they start to say, great, you can do that, let's go forward. And they're starting to get this information quicker.

And when we can do it on the fly, they start to see the value of what we're really trying to do, and they say, we want more of that. Once they get hooked, they really want to see that information. And one thing I probably didn't emphasize enough: you do need an advocate. I'm constantly going to meetings and talking about this. Somebody has to keep telling them why they need this, because if you don't, they'll forget. Management changes, new people come in, and you've got to go back and tell the story all over again. You've got to be able to do that consistently and keep sharing those stories with them.

So, once you have all of that, you may end up with something like this. Now, the one on the left, the little diagram there, is just something I created. It's actually animated, with dynamic links and things I can show to my managers to walk through all the different pieces that we have. But my analytic architecture starts first with the people. Understand what type of users you have: the end users, your data scientists, your business intelligence analysts, all of the different people that are going to be using the data. Then you work through, how do you create that data pipeline? What are you going to use for that?

And this is another thing I just thought of: APIs. Talking to my IT guy who creates all these APIs for me, we finally figured out we have a difference in what we need APIs for. Again, same definition, different use. They're using APIs to pull data and put it on a website, in a web browser. They need one, maybe two data sets, or two fields or observations coming through there. I need APIs that are going to pull a massive amount of data over to my storage. We're doing the same thing, but it costs us different resources. And I have slowed and broken a lot of servers over the years doing these types of things. So I really have to talk to my IT folks about how we set these APIs up and what we're trying to do.
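The bulk-pull pattern that distinguishes the data science use of an API from the web-page use looks roughly like this. The `fetch_page` endpoint here is entirely made up for illustration; a real version would be an HTTP call negotiated with IT, and paging in bounded chunks is what keeps you from slowing or breaking the server.

```python
# A sketch of paging through a whole data set in bounded chunks, rather than
# requesting one or two fields at a time (the web-browser use case).

RECORDS = [{"id": i} for i in range(250)]  # pretend server-side data

def fetch_page(offset, limit):
    """Simulated API endpoint returning one page of records."""
    return RECORDS[offset:offset + limit]

def pull_all(page_size=100):
    """Pull everything, one bounded page at a time."""
    out, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # empty page means we've reached the end
        out.extend(page)
        offset += page_size
    return out

data = pull_all()
print(len(data))  # all 250 records, pulled in three pages of at most 100
```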

Then you have to look at your data sources and understand the different types of connections. We have a lot of different databases, but we don't have all of the modules those databases may need. We don't have all of the connectivity to some of those databases. Sometimes they give us access, or we get read access; sometimes we don't. So there's a lot you have to work through to get there. It's difficult, but it's well worth somebody getting in there and working with them to get this done.

Then you can start thinking about how you're going to store that in a data lake and get your curated data sets. And then the fun really starts for a data scientist. How do we start turning that into some kind of visualization or application? It could be some kind of no-code or low-code tool like Alteryx or KNIME. Or something more high-code, with MLOps on our Posit servers. We've got vetiver, which helps us with some of our modeling, or MLflow, or Kubeflow. There are a lot of different options, and that gives us the opportunity to do so many things. My team is both Python and R, and we can do a lot of that within those ecosystems right now.

Then you've got your business intelligence. How are you going to present it? Tableau, Power BI. Again, you can do that in Shiny, or by building some kind of analytical application. But one key thing: always, always, always have a group doing some kind of research on the next technologies coming out. Because if you're not keeping up to date, or at least understanding what's going on, you're going to be two years behind by the time you decide it's time to move. So make sure you do your research and keep some momentum going, so that when your managers or executive leaders ask what's the next thing you need to do, you're ready to go, because you've done your research.

So, that should hopefully take you from data confusion, where your Lego bricks are spread all over the place, to data intelligence, where you use your algorithms and your methodologies to create some kind of actionable knowledge, something that's useful for your organization. It's a little visualization that some managers really like; they see it, they get it. They like those pictures, for some reason. And it helps them understand why it's important for us to get from one level to the other. So, as Elaine said earlier, hopefully we've stuck the landing.

Q&A

And that's my story. That was so helpful to hear. Thank you. You know, I'm still stuck on that part at the beginning about two years to get access to the data. But that's a really good reminder for me that patience and persistence can pay off in the end. That is true. And for me, it was real validation knowing that others have suffered along with me, or are suffering along with me, on the journeys they're taking, but also realizing that it is worth it in the end to persevere.

Thank you very much, everybody.

I'm Rachel Dempsey. I lead our customer marketing here at Posit and host our weekly data science hangout where you may have met Elaine or David there. We'd like to open up this opportunity to ask some questions to Elaine and David. I know we're not gonna have time for every single great question right now. So David and Elaine are also gonna join me in the lounge right next door, right after this too.

But Elaine, you mentioned it's almost impossible to over-invest in communication. And I was wondering, what has helped you most in getting people to grasp what you're saying with data?

Yeah, I think we struggle with this sometimes as data scientists because spending a lot of time figuring this out doesn't feel like the work we're supposed to be doing. But what I found really helpful is trying to get inside the head of the audience, whoever that is. So if you're presenting to the CEO, that's one perspective. If you're presenting to IT, that's a completely different perspective.

But I think usually, when we start thinking about how to present work we've done ourselves, we're really in the process; we tend to present it the way that we did it, right? So I often try to go through three rounds of: okay, this is what makes sense to me. But now let me think again about my audience, what they want to get from this, and what is the one thing I want to make sure they take away. And then I change the way I present it. And then I do that again, because it's really hard to get out of that data science mindset and into the mindset of, if I'm the CEO, I have to make a decision about this next week. I ask myself, if I were her, what would I do after seeing this? And am I getting that point across?

Absolutely, thank you. David, I loved when you were kinda walking across the stage saying you're bringing your laptop from one meeting to the next across NASA. And I was wondering, how are you doing this today?

Yeah, I mean, we've come a long way since that timeframe. We've been able to implement a lot of different capabilities. COVID kinda helped a little bit, unfortunately. I hate to say that it helped, but it really put some emphasis on the need to share data more easily across the organization, so some funding came up. We've had things for business intelligence. NASA actually had six Tableau servers across the agency at that time, so we consolidated them down to one to be able to share some business intelligence. But we also have our Posit server, which allows us now to use our Connect server to do some of the other things, whether it's a Shiny application, or books, or wikis, or presentations, all of those things that make it easier. So now we have one location, these portals we created, where they can go to one landing page and connect to all these various types of capabilities.

Thank you. I see a lot of the questions starting to come in now, and one was, how can one-person teams advocate for IT and data science differentiation?

Yeah, I think when you're a one-person team, sometimes you have to do some of everything. So it's about figuring out where to meet in the middle and helping people understand, like David was talking about, by speaking the same language. On teams I've been on, especially when we're getting started, we sometimes use tools in that gap. We've used dbt as a way to think about structuring the data; it's a little friendlier for people on the engineering side to understand how the data's coming in, and then we build out from there. So you kind of have to do whatever it takes when you're just one person, but start to show people what this might look like and what the different steps in the process are, even if you're doing more than one of them to begin with.

Did you have anything you wanted to add there? No, no, these are great answers. I mean, 20 years ago, it was probably me and a bottle of whiskey with the IT guy. You know, he'd say, let's talk.

Another question, and I'll ask this to you first, David, is what if you have colleagues that are trying to make it difficult to get data for the sake of their job security?

Unfortunately, that's common. I get a lot of that. I had this one teammate who needed some data from an organization, and she requested to go talk to the individuals. When she got to the meeting, there were 17 other folks in that meeting questioning her: why do you need access to that data? A lot of it came down to fear that they were going to lose their jobs, and you've got to find a way of showing them that it's not about taking away responsibility or taking away job duties, it's about making it easier for them. How do we make this easier together? How do we share? How do I include you in that process, in that pipeline?

But be honest: sometimes that job may become irrelevant, so what do I do to help you move into a different role? That's where you really need to work with the managers. Okay, these things may change; how do we get this person to still have a viable, meaningful job? It may not be what they're doing right now, because our pipelines have changed to supersede all of that.

Okay, one other question was, I'll ask this to you, Elaine. What advice do you have for those of us trying to hand off processes we create to our stakeholders rather than maintaining them in perpetuity?

That's an interesting question. I think it depends a little on the situation, what you're trying to hand off and why. Sometimes there can be intermediate steps where you do the handoff in different ways and maybe work your way back. One example someone was asking about yesterday was using pins to share data. I was explaining a way we've done this where, in the process of running our Markdown files and Quarto documents, we pin some things in Connect that are just CSVs, which people can then take and run their own processes with. So even though at that point the reports we're generating are still doing a lot of the analysis, we're giving people a way into the middle of that process, where they can start to participate and build on what we're doing, and then work back to where we started.
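The shape of that handoff, publishing an intermediate CSV under a stable name so others can pick it up mid-pipeline, can be sketched with just the standard library. This is only an analogue of the pins workflow: the in-memory "board" stands in for a Connect server, and the names and columns are invented.

```python
# A rough stand-in for the pins workflow: a report publishes an intermediate
# result as a named CSV "pin" that colleagues can read and build on.
import csv
import io

def pin_write(board, name, rows, fieldnames):
    """Publish rows as CSV text under a name on the 'board'."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    board[name] = buf.getvalue()

def pin_read(board, name):
    """Read a published CSV pin back into a list of dicts."""
    return list(csv.DictReader(io.StringIO(board[name])))

board = {}  # stands in for a Connect server's pin board

# The report pins its intermediate result on its way to the final output...
pin_write(board, "headcount-by-dept",
          [{"dept": "ENG", "n": 40}, {"dept": "OPS", "n": 25}],
          fieldnames=["dept", "n"])

# ...and a stakeholder picks it up to run their own process.
shared = pin_read(board, "headcount-by-dept")
print(shared)
```

Note that the CSV round trip turns the counts back into strings; the real pins package handles richer formats, but the handoff idea, a stable name in the middle of the pipeline, is the same.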

Okay, I think we have time for two more questions here. One of the questions was, we all love free stuff, but aside from hiring well, what have been the best investments you've made that cost serious money?

Well, the simple answer for me, to start off with, and we're not talking technology completely: it's the people. You've gotta have the right people and really invest heavily in your people, not only in who you're hiring, but in the training they get and the ability to learn more of what they're doing. But if we're talking technology-wise, for me, because of the best bang for the buck I've gotten so far, it's gonna have to be Posit. And I'm not saying that because I'm at a Posit conference.