Keaton Wilson - Modernizing the Data Science Toolkit of a 40-year-old Market Research Company
This presentation outlines the efforts undertaken over the past year by the Decision Sciences and Innovation (DSI) team at KS&R, which focuses on statistical consulting and end-to-end quantitative analysis, to modernize its data science toolkit. The main goals were to foster collaboration, improve the team's legacy codebase, and deliver high-quality data products. Key topics covered include teamwide adoption of version control and GitHub, building and deploying internal R packages, Quarto-based documentation, and strategies for gaining buy-in across teams and leadership. Attendees can expect practical insights and tools for instigating change in their own organizations. Talk by Keaton Wilson
Transcript
This transcript was generated automatically and may contain errors.
I want to start with a bit of a story. It's a story that happened in 1989, about 1,400 miles up the coast from where we sit now in Seattle, near Anchorage, Alaska.
The Exxon Valdez was carrying 53 million gallons of crude oil when it left port in the middle of the night. Many of you know where this is going. At around 11:25 p.m., the ship deviated from a predetermined shipping lane to try and skirt some icebergs that were in the water. There was only a single officer on duty for most of the night. This was in violation of company policy. But shifts were long, people were tired, and that's what it was.
Around midnight, this officer decided it was time to move back into the shipping lanes. He woke up a lookout, and the lookout said that there was a reef buoy on the right side of the ship instead of the left. This was bad. It meant that they weren't where they thought they were. An immediate course change was ordered, but four minutes later, the ship had run aground, resulting in one of the worst environmental disasters of the 20th century.
I wanted to start with this story because I think it illustrates a phrase that a lot of us encounter a lot and are familiar with, and that is the ship is too big to turn.
I think we hear it in the code we write, be it the languages we work in, maybe packages we're developing, frameworks we're using. Sometimes we hear it about the sector we work in, be it academia or government or industry, and often we hear it about the places we work, so the teams we work in or the companies or the organizations we're at.
And yeah, momentum is a strong force, but turning the ship in time is possible. I want to talk through some of the lessons learned, both successes and challenges, in working to modernize, or start to modernize, our data science toolkit over the last year and a half at KS&R, which is the market research company that I work at.
Every organization is different, but hopefully pieces of this are useful in thinking about kick-starting change wherever you may work.
When we were trying to figure out which pieces of this we wanted to tackle and how we wanted to turn the ship, we used three guiding principles: clear navigation; managing expectations, which is sort of understanding the ship that you're in; and finding and supporting the right crew to make the turn.
What end-to-end data science means at KS&R
I want to back up for just a second, though, and talk about what real-world impact and end-to-end data science mean at KS&R. So we're a 40-year-old market research company, and I think market research can have impacts on the real world and our day-to-day lives in a variety of ways.
I chose three examples that are food-based, because I like to eat and we're in a great food city here in Seattle. The first is price, and I think a good example of this is the recent Chipotle shrinkflation-gate that folks might be familiar with, where Chipotle was charging the same amount of money for a burrito, but offering less stuff in it. These pricing decisions by companies are often made as a result of market research.
The second is availability. So those of you that were around in the 90s might remember Altoid Sours or Choco Tacos, things that you can't get anymore, maybe if you go on eBay, I don't know. But the availability of products that companies sell is also determined often by market research.
And finally, the types of products available. So this is the reason you can't get poutine at a McDonald's here in the U.S., but you go up the road a bit in Canada, and you can get gravy on your fries.
So I work on the Decision Sciences and Innovation team at KS&R, and we touch end-to-end data science in sort of three main ways, or three pillars. The first is survey design. So this is how we collect our data, and it involves components like web development, data wrangling pipelines, the language of the web, and also experimental design.
The second, analysis, is sort of our bread and butter. We develop statistical models and simulations for our clients. We run market research-specific analyses, so things like conjoint and segmentation, and also more traditional data science analyses that folks may be familiar with, like A/B testing.
We also touch the reporting side of things at KS&R, mostly through data visualization, generating summary data for our reporting teams to use, and also a variety of web applications, mostly in Shiny, that allow our clients to interactively sort of work with and understand the data that we've collected in the surveys and the simulations and the models that we've built.
DSI also has a variety of internal initiatives that span these areas, so things like automation, codebase refactoring and updates, and AI initiatives.
Why modernize the toolkit
So back to turning the ship. I came on at KS&R about a year and a half ago, and one of the things we talked about when we were defining my job responsibilities and the plan for the next year was the areas that we wanted to turn the ship in, but also why we're trying to modernize the toolkit at KS&R. I think those whys fall into three main buckets.
The first is that statistical rigor doesn't always equal optimized code, right? KS&R has a foundation of building really strong math behind our data analyses over the 40 years that the company's been in existence, but things have changed, right? And we can now optimize code for a lot of different things, so for speed, for understandability and modularity, and so that was one of the big reasons why we wanted to try and upgrade things.
The second is that we offer data products that are software. A lot of those are the Shiny apps that we develop, but also automated reports, and we can apply some basic but modern software development practices to improve the consistency and the functionality of those products.
And finally, the evolution of the field. Things have just changed, right? Things have changed a lot in 40 years, period. Things have changed a lot in data science over the last decade and the languages that we work in. There are a lot of new tools to bring to bear that can help us generate better work.
Clear navigation
So back to our guiding principles, we'll start with clear navigation. For us, finding clear navigation was made up of three parts. The first is identifying strengths and then figuring out ways to amplify them. The second is identifying weaknesses, patching the big holes and understanding where on sort of the trade-off axis projects are. And lastly, trying to understand our audience, and often for that, it was our internal team. For us, that's an organization that uses different languages and tools to do the work that we do, and a lot of embedded processes and history that come along with that.
For each of these guiding principles, I want to provide a short case study that sort of dives in a little bit on how we applied them. The first is our migration to GitHub. So this happened right when I came on board, or shortly thereafter, and we decided this was a high-priority thing to try and work on.
So we identified some strengths on our team. The team had a great history of collaboration and communication. We had good IT support. We identified a really, I think, big hole that we wanted to patch, which is this unsustainable version control that was going on that might work great for Word documents and PowerPoints, or maybe just work okay for those, but does not work for multiple Shiny apps. But we needed to think of our audience. DSI, our team, had a lot of skills, has a lot of skills, but they were mostly unfamiliar with GitHub.
So I think this migration has ongoing challenges and successes. Setup took time. We had to write documentation, we had to coordinate with IT, and we had a boatload of legacy code to migrate. But for us, the successes have outweighed the costs. So now we have more than 200 repositories, and we have really strong engagement across the team with the project management tools. GitHub has also allowed us to take some of the burden off IT and make it easier for them to deploy client-facing tools for us.
And we're not finished, right? There's still a lot of work to do: ongoing training on more intermediate and advanced use of GitHub on our team, and integration across more teams at KS&R.
Managing expectations
On to managing expectations. So for us, this is composed of four pieces. Thinking about leadership's expectations versus your expectations, I think often those two things are different, and finding alignment is really important, or was really important for us. The second is thinking about on-the-job learning. It's something that takes time and often needs to be balanced with the business needs of the organization.
There's also frequently no sea change, right? Change may be slower than you or your colleagues or leadership would like. And finally, developing a strong business case is something that's been really successful for us. It's having multiple people in the company, on multiple levels, be able to understand the benefits of the change that you're trying to initiate.
So this is an example of something I was really excited to try and change when I first came on board, and that is an entrenched data pipeline at KS&R. So often, we'll take survey data from a platform we use, and it gets manually migrated to in-house storage. We have some folks that do some data wrangling in SPSS. We have some folks that do some more data wrangling in Excel, and eventually we come up with summarized data that gets used in reporting.
I was really excited to change this initially, and move to R, you know, an open-source language with benefits that I think most folks in this room know and understand, and also leverage some of our API use to pull the data live from our survey platform. To me, there were really clear benefits to this modification, but there were also risks and complications.
So this pipeline had a bunch of existing processes and automation attached to it, and they were really mature. They're processes that have been going on at the company for a really long time. And there's a lot of business reliance on those processes. And it's really important that they operate quickly and efficiently and correctly. So I think that the business case for this was solid, but maybe a little misguided. And I think that once we had some conversations internally, and applied some of those components of managing expectations, it became clear that this wasn't the right time to try and change this. And it's something that, you know, I think is on the docket for us for the next couple of years. So maybe the ship is a little slower to turn here.
Finding the right crew
Finally, the right crew. I think one thing I learned in the last year and a half is that networking doesn't stop when you enter an organization. I think a lot of people probably know this. It was new to me. And the allies and champions that can see the value in what you're trying to do are really golden.
Yes, learning equals growth and development. But it also introduces new skills to the mix that can help turn things on an organizational level, sometimes in unexpected ways. And I think we've also spent a lot of time thinking about our future crew. The crew is going to change. And so doing work to try and future proof our processes and documentation was really important to us.
And I just want to highlight a partnership that I think is illustrative of the right crew. When I first came on, my perception was that the circles of the Venn diagram of our team at DSI and our IT team were really separate. There wasn't a lot of overlap. Most of our communication was request-based. I need this from you, you need this from me. We spoke different languages to some extent, and there were few shared goals.
I think in the last year and a half, we've made a lot of progress in this partnership with IT. I think that language gap has been reduced quite a bit. We've improved our processes, and things feel really collaborative now. I think part of that is finding overlapping goals, and it's been really exciting to co-develop infrastructure with the support and input of IT.
I think the adoption of new tools helped spur a lot of this partnership. So when I first came on, we were just starting to use Posit Connect and Workbench, which took a lot of the burden off of our IT team. It was easier to deploy apps to our clients, and we had really stable development environments to work in. The GitHub migration was also a strong collaborative effort, and so was documentation.
I think, again, thinking about that future crew is great, but it was also super helpful from IT's perspective, from a certification standpoint, which is a core component of their responsibilities. For us, we have adopted Quarto-based documentation, which has been a game-changer in the last year. There's way more transparency, it's easy to build living documents, and the version control of documentation has been fantastic.
Impact and results
So I've talked a little bit about how we did it and things we thought about, but it's also had impact. I think a big one is more transparent collaboration and project management. A lot of this is GitHub, but implementing things like code review, project management that lives alongside code, and long-term tracking of issues has been really, really helpful for us.
We've worked to implement a lot of efficiency improvements. So again, updating that code base with some more modularity, the adoption of APIs in our work, and also building internal R packages gives folks more time to be creative in the things that they're doing and the work they're building for our clients. And finally, those same new things have also opened a lot of new work streams for us, which has been really exciting.
So we've started the turn early at KS&R, and it's had impact. For us, it's a moving target. We definitely have a long way to go, but we've tried to start to turn the ship with a couple of strategies that I've talked about here. Hopefully this has been helpful in sparking some ideas for implementing change wherever you may work.
I'm happy to connect. Please reach out if you want to chat more about data science and market research, what we do at KS&R, or really anything at all related. Thanks.
Q&A
So we have a couple of questions from Slido. The first one is, data science isn't the same as software engineering. Do you have any tips or techniques for doing test-driven development when working with data?
I think it's a really good point. I think part of it, and maybe this isn't totally answering the question, but I think part of it is making those relationships and collaboration, right? I think it's really hard to be good at everything. And so finding folks that are good at software development and are willing to work with you and sort of help implement that at your organization, I think is the best strategy. I think there's also a variety of really good tools, particularly in the R ecosystem that I'm more familiar with, for doing sort of more test-driven work. So things like testthat.
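As a rough illustration of the test-first idea raised in this question (this is not code from the talk, and the speaker's team works mainly in R), here is a minimal Python sketch. The function name `mean_rating_by_segment`, the `segment` and `rating` fields, and the 1-to-5 rating scale are all hypothetical. The two test functions state what the data step should do with missing and invalid values, and the function is written to satisfy them.

```python
from collections import defaultdict


def mean_rating_by_segment(rows):
    """Return the mean rating per segment, ignoring rows with no rating.

    Hypothetical example: each row is a dict with a "segment" label and a
    "rating" that is either None (unanswered) or a number from 1 to 5.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        rating = row["rating"]
        if rating is None:
            continue  # unanswered question: leave it out of the mean
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating!r}")
        totals[row["segment"]] += rating
        counts[row["segment"]] += 1
    return {segment: totals[segment] / counts[segment] for segment in totals}


def test_missing_ratings_are_skipped():
    rows = [
        {"segment": "A", "rating": 4},
        {"segment": "A", "rating": None},
        {"segment": "A", "rating": 2},
    ]
    assert mean_rating_by_segment(rows) == {"A": 3.0}


def test_out_of_range_rating_is_rejected():
    try:
        mean_rating_by_segment([{"segment": "A", "rating": 7}])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a rating of 7")


if __name__ == "__main__":
    test_missing_ratings_are_skipped()
    test_out_of_range_rating_is_rejected()
    print("all checks passed")
```

The point of the sketch is that the expectations about the data (unanswered questions are skipped, impossible values fail loudly) are written down as tests on small hand-made inputs, which is one common way to bring test-driven habits to data work.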
Do you have any tips for people who are migrating their teams to GitHub? Because that migration could be a pretty big lift.
I think for us, training was really huge and sort of building, making sure we were building in enough time to do that training effectively. And then I think the other piece of it is to go slow when you need to. So work slowly on things that maybe are not high stakes at first to move, so you can work out all the kinks when you get to the important stuff.
Where do you host your Quarto documents? Is it on GitHub Pages or where do you do it internally?
Yeah, we had some debate. That's a really good question. We had some debate on this and, you know, the code base for a given process document in Quarto lives on GitHub. So people can edit, use and modify it dynamically. Ultimately, the end documents that get rendered live in a team's folder in our organization. So it's somebody's job, right, when there are updates, to update that HTML file in the team's folder. And that's just following sort of, you know, years of history about where things are expected to live in the company.
How do you best recognize when to start turning the ship?
Yeah, I think it's really hard, right? And not to lean into the metaphor too hard, but, like, they didn't know it was too late to turn until it was too late to turn, right? So I think that, like, thinking about that second part about expectations is really important: understanding your ship, how agile your ship is, and how able your team is to turn is part of it.
I think the other piece is a gradual turn over a longer period of time is always going to be better than a frantic turn at the end, right? So sort of thinking long term and building in updates and modernization along the way is great if you can do it.