Hilary Parker | Cultivating creativity in data work | RStudio (2019)

Transcript

This transcript was generated automatically and may contain errors.

Awesome. Yeah, so my name is Hilary Parker. Thanks so much for coming. I'm super honored to be here. This has been a really fun conference with a ton of great energy, so that's been really cool. So yeah, I want to talk about using data effectively. People talk about data science, and they talk about how data science is an art, and I want to sort of move beyond that terminology. So I'm going to talk about that today. And just for a little background about me, I am a data scientist at Stitch Fix. I also have a podcast, Not So Standard Deviations. And I've just kind of been involved with the R community for the past decade plus.

So yeah, I want to talk about things that I've sort of learned from all of these different things and ideas I've been able to develop with my co-host Roger on the podcast. So just a little bit of context about Stitch Fix, because I'm going to use some examples of my work there later. So Stitch Fix is a personal styling service. So the idea is that you come to our website, stitchfix.com, and you sign up. So you fill out, you know, the types of clothes you want. And the idea is that you answer a bunch of questions about your style. We call it like a style profile, both during onboarding and then at other various points in time. And then from your perspective, you sign up, you fill out this information, and then some period of time later, you get a box with five things in it. And those are five pieces of clothing that we think you're going to like. And then you kind of keep what you want, and then you send back the rest.

And so those items are handpicked by a stylist. But what's happening behind the scenes, and the reason we have a data science team of 120 data scientists, is that what's happening in the background is that we have a bunch of inventory. Using the data from your style profile, we will sort that inventory and determine like what clothes do we think you're most likely to like. And then we order those. So we do this kind of machine learning, that's all those graphs and things that are happening. We create a list of algorithmic recommendations. Those go in front of the human stylist, and then that person selects what clothes they think they should send you. So it's called like a human-in-the-loop problem, or like augmented workflow. So yeah, it's a super fun problem, and I really enjoy it.

Data science as system design

So kind of zooming back out, the point of this talk, or like what I really want to talk about, is the way that we do data science. And all of us are here at this conference because we really care about doing data science well. And I love this kind of paradigm from the R for Data Science book about what it looks like for us, especially R users, to like participate in this data analysis, data science thing. So the idea is that you have a data set, you import it, you have to tidy it, and then you have this sort of like playful box of modeling the data, transforming it, visualizing it, and then being inspired or learning new things from that process, and going through and kind of cycling through, and like learning as you go. And then eventually you sort of take the exit ramp from that, and you figure out some way of communicating what you've learned to another person.

And of course, the purpose of this graphic is that a lot of the R packages in the tidyverse are meant to address different parts of this stage, right? So you have like dplyr and tidyr for sort of shaping and tidying the data. You have a bunch of things for visualizing, transforming, and even communicating with R Markdown and various other tools.

And so if you attend, like, any sort of talk about this from Hadley or others, the idea is that, oh, actually, you know, this tidy step, we don't like to talk about it, but it's actually a huge amount of the work, right? So you kind of have this little, you know, fun part that we want to talk about, which is that fun exploratory analysis and the communication afterwards, but actually tidying is this huge amount of the job that none of us like to talk about, right?

So one thing I love about this paradigm, and the way that the RStudio team and Hadley directly have talked about it, is the idea that building the framework, building the tidyverse and this sort of opinionated API layer about what the data should look like and how you should be parsing results, is really about creating fluency and expression for that sort of playful, cyclical, you know, modeling, visualizing, transforming the data, et cetera. And so this is a quote from way back in 2013 from Hadley where he says, you know, we may find that the two bottlenecks in a data analysis are what you want to do and how to tell your computer to do that. A lot of my existing work has been more about how to make it easier to express what you want. And I sort of bolded "express" because it is this idea of like a fluent language.

And so when I think about the way that we talk about doing data analysis, I was at a talk by this person, Grady Booch, who's at IBM Research, and he talked a lot about how everything is a system deep down. And I was really inspired by that because I realized, you know, when we talk about doing a data analysis, this graph really represents like a mini production system, right, where you're having data come in, you're doing something to it, but then at the end you want to have a report come out. And so really every time we're doing a data analysis, we're building a little mini production system that we want to work effectively. And we want it to work, we want to be able to build it quickly because building the system isn't exactly what we want to be doing. We want to be doing that kind of creative, fun data analysis part.
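
To make the "mini production system" idea concrete, here is a minimal sketch in Python. It is not from the talk: the data, the column names, and the function names (import_data, tidy, summarize, report) are all hypothetical and chosen only to mirror the import, tidy, transform, and communicate stages described above.

```python
import csv
import io
from statistics import mean

# Hypothetical raw input: a "wide" table with one column per week.
RAW_CSV = """client,week1,week2
a,3,5
b,2,
"""


def import_data(text):
    """Import: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def tidy(rows):
    """Tidy: reshape to one observation per row, dropping missing values."""
    observations = []
    for row in rows:
        for column, value in row.items():
            if column != "client" and value not in ("", None):
                observations.append(
                    {"client": row["client"], "week": column, "kept": int(value)}
                )
    return observations


def summarize(observations):
    """Transform: average number of items kept per client."""
    clients = sorted({o["client"] for o in observations})
    return {
        c: mean(o["kept"] for o in observations if o["client"] == c)
        for c in clients
    }


def report(summary):
    """Communicate: render the summary as plain text."""
    return "\n".join(
        f"{client}: {avg:.1f} items kept on average"
        for client, avg in summary.items()
    )


def run_pipeline(text):
    """Data comes in at one end, a report comes out at the other."""
    return report(summarize(tidy(import_data(text))))


print(run_pipeline(RAW_CSV))
```

Each stage is a plain function, so when the report looks wrong you can check the stages one at a time, which is roughly the kind of system-level thinking the talk is pointing at.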

And so I felt like I learned this especially well when I moved to a more kind of machine learning production type job because a lot of the concepts, I'd cared so much about reproducibility in this sort of data analysis production system, and I realized that all those concepts like directly translated to machine learning production systems. So I think like thinking about our work as system design is really helpful. And one of the reasons it's really helpful is because other groups have already thought a lot about system design, so there's a lot that we can learn from them.

And last year, two years ago at this conference, I gave a talk related to this about what we could learn from like specifically the DevOps world and the operations world. I have up here a book called The Field Guide to Understanding Human Error. And the whole idea there is that like when someone has an error in a data analysis or like if some sort of catastrophic error happens in a different system, especially in the operations world when they're dealing with systems like an airplane flying or some other kind of like high importance system, they have a very disciplined way about going about talking about when a failure happens in that system. And kind of the old school world was like, oh, this was due to human error, but the people who think about this all the time realize that that was sort of disincentivizing for people surfacing when a system was breaking because it seemed like people would be worried about like being fired or being judged or being bad at their job if there was a system failure, and that's like the opposite of what you want when you are trying to keep airplanes flying safely. So they developed this whole thing about blameless postmortems, and I could go on a lot about how great these are, but the idea is to kind of approach the work like if an error occurs, it's not your personal fault. It's more like the system was failing you, and you had good intentions. You wanted to run an analysis that worked, and it didn't, and so how can we change the system so that in the future it works for you?

The other thing that groups who do a lot of system design think about are sort of opinionated frameworks. So like one example would be Ruby on Rails for web frameworks where instead of just saying, hey, you can do whatever you want, and, you know, every time you're designing this system, you're starting from scratch and thinking through all the kind of micro decisions. Instead, it's like, okay, let's kind of have our opinionated way of setting up this system and then save our kind of creative energy for the things that we feel like are really worth it, and so last year or two years ago and kind of last year, I gave a lot of thought to this, and I have a paper about it, have talks. I would, you know, encourage you to go watch those if you want, but the idea here is that, you know, I think data science should absolutely start to adopt this more actively, and then I think the reason that we all are so passionate at this conference is that I actually think that RStudio and specifically the Tidyverse already are doing this where it's like, okay, we kind of have this opinion about how you should structure your data, but once you sort of do the hard work of getting things there, then everything else is easy, and you can kind of reserve your mental energy for the fun part of data science.

Thinking holistically about data collection

So kind of going back, you know, like I said, this is like the system I originally was thinking about, and then it's like, okay, actually, like, tidying is this huge part of the system when you zoom out more, but what I want to talk about now is that if you zoom out even more from that, like, sure, we're focused on that last mile, but actually, like, the big part of doing data science is actually getting data in the door, right, and you hear about this more in, like, the statistics world where it's, like, study design and, you know, collaborating with that, but in the data science world, you really don't hear about this much.

Like, people kind of talk about, like, oh, the data exists in this database, and what do you do with it then, but what I want to talk about now is, like, I think that some of the biggest wins I've seen in the data science world are when a data scientist actually thinks about this entire process, and kind of first, you know, getting QC with the graphics, and, like, well, first of all, this is, again, kind of designing a system, and then one thing that's nice about helping, you know, kind of immediate win here is that if you're more involved with the data collection process, maybe you can get the data in the door in a way that it doesn't need to be tidied so much, right, so if you help, like, develop the schemas and everything that, you know, is going into the data in the first place, like, maybe you won't have to, you know, spend 80% of your time doing this boring task of cleaning it up.

Style Shuffle: a new data source at Stitch Fix

But, you know, that's kind of, like, one practical benefit. I want to talk now about some examples from Stitch Fix where thinking holistically about this broader system, rather than that last mile, has really transformed our business. So kind of going back to Stitch Fix, I was saying, you know, you create this style profile, you get five items, and then you keep what you want and send back the rest. But one of the key challenges we are facing is that five clothing items for a person in some set amount of time, like, really isn't that much. So, you know, you get five things, and when you check out, like, when you decide what to buy or not, we'll ask you some questions about did it fit right, did you like the style, et cetera, but that's not that much data per person. And that's further exacerbated by the fact that in fashion in general, like, we're in the era of fast fashion, where an individual clothing item won't be on the racks that long. I mean, for some of the really fast fashion, it's, like, a matter of weeks, and for us it's a little longer than that, but still, you don't have one shirt usually lasting for years and years. So on top of the fact that we're not getting much data from people, the clothes are turning over fast enough that we were not able to accumulate enough data per person and per clothing item to do some of the more traditional matrix factorization and other machine learning approaches.

And so apparently my colleague, the person who developed this, Chris Moody, was talking about the fact that, like, basically every time someone new started on the Stitch Fix team, they would try to write some sort of matrix factorization algorithm to solve this problem and would always fail, and it was sort of this rite of passage for, like, joining the team. It's not exactly an effective use of people's time, but, you know, everyone wanted to fix it. And so at some point, the idea got tossed around of, like, well, what if we could get more of these style ratings per clothing item?

Obviously getting a style rating from something in person is the most effective, like, you can really see all the details, see how it fits on you, but, you know, the visual representation of an item is a lot of the style signal, so there's no reason why we couldn't get ratings on that, and so, like I said, my colleague, Chris, really picked up this problem, and essentially people were, like, well, what if we created something like Tinder for clothes where you could just see a clothing item and say do you like it or not and kind of swipe right versus left?

So my colleague spent a year doing exactly that, he built an application first via Facebook and now it's on the formal app called Style Shuffle where you essentially see a clothing item and can say whether or not you like it, and you can go through very quickly and rate a ton of stuff, and this was, I mean, to say it was transformative is an understatement, I mean, all of a sudden we could do all of the matrix factorization work that we've been wanting to do, and the amount we've learned from this brand new data source has been monumental.
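
A rough back-of-the-envelope sketch of why the extra ratings matter for matrix factorization: the method needs enough observed cells in the client-by-item matrix. The numbers below are hypothetical, not Stitch Fix figures, and the sketch assumes every rating from a client lands on a different item.

```python
def rating_density(n_clients, n_items, ratings_per_client):
    """Fraction of the client-by-item matrix that has an observed rating."""
    observed = n_clients * ratings_per_client
    return observed / (n_clients * n_items)


# Hypothetical sizes, chosen only to illustrate the order of magnitude.
N_CLIENTS = 100_000
N_ITEMS = 10_000

# One box of five items per client versus a few hundred quick swipes.
boxes_only = rating_density(N_CLIENTS, N_ITEMS, ratings_per_client=5)
with_swipes = rating_density(N_CLIENTS, N_ITEMS, ratings_per_client=500)

print(f"5 ratings per client:   {boxes_only:.2%} of cells observed")
print(f"500 ratings per client: {with_swipes:.2%} of cells observed")
```

Under these made-up numbers the matrix goes from about 0.05% observed to about 5% observed. Item turnover, which the sketch ignores, would make the boxes-only case even sparser per item.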

One of the things that we've learned is kind of like how are clothes separating, right? What sorts of styles are very polarizing from each other? And so one example, this is in a blog post which I've linked that you can't quite see there, is that kind of what you would call boho items versus preppy items are very polarized away from each other, so those are like two very different kind of vibes that you go for with your clothing.

And so again, like this is just one example of the things we've learned and we've been able to incorporate into our entire recommender system, but the other thing that was really great about this project was that it also enabled like all sorts of new lines of investigation, not just for individual items, but like other people who were wanting to zoom out and think about that entire data system were able to start to pursue their own kind of research topics.

And so I was one of those people where, you know, I saw that people were liking these different themes of clothes and kind of clusters, and so I wanted to know like, okay, individual items are clustering this way, but what about, kind of, what's the interaction effect between different items? Like when you see an entire look that is boho, what's your reaction to that versus the individual pieces?

And so this was super fun because right when I started thinking about this, the rOpenSci group, and I think I'm pronouncing it right when I say Jeroen, a postdoc with that group, created this wrapper, this R wrapper for ImageMagick, which is a way of kind of pasting images together. And so I was able to actually, using R and the kind of embeddings we got from these machine learning models, I was able to actually identify a bunch of like boho items and various different kind of aesthetics. And so this is me, like I was able to use R to like print out essentially like baseball cards of different items. And then me and my colleague Natalia like went through and actually created outfits from these items that we empirically saw were highly polarizing in these different kind of axes of the matrix factorization.

And so I was able, again, using R to create these images of outfits that I could feed right back into Style Shuffle and get reactions to them. So A, this is cool, like data scientists creating outfits that hundreds of thousands of people are rating, and B, this is all using R, so like R is impacting what customers see in Stitch Fix.

And then again, it's just like being able to creatively think about the entire system and what clients want enabled a bunch of new data science things that were much more than that last mile.

Design thinking for data science

So kind of in the last five minutes, like when I talk about this sort of designing the system, it might seem intimidating to say, hey, some of the biggest wins are just like completely creating a new app that has a completely new data source. And like once you do that, data science is great. But designing things like this is not just this kind of mythical, mysterious thing. It's actually a discipline. It's something that people can learn. And there's a lot of writing on it.

So this book is from Nigel Cross, which is, he's one of the premier researchers on design thinking in general. And I've been super inspired by reading his stuff and thinking about this in general. We've talked about it a lot on the podcast. But just some like kind of basics about design thinking. It's a very solution focused discipline. You're constructively thinking. So compared to science where you're kind of diagnosing and teasing apart problems and understanding the root of the problem, design thinking is much more about iterating toward a solution quickly. You're kind of using both your left and right brain, so your analytical mind and your more like artistic mind.

And there's some really interesting stuff around how it is its own way of knowing. It's a form of kind of rhetoric with sketching. And so it's not just like a thing. Nigel Cross talks about it being this like third way of knowing. There's sort of art, science, and design. And I saw this quote from him early on, and I just thought it was so perfect for data science work, where it's like, you know, design ability is one of these three fundamental dimensions of human intelligence. Design, science, and art form an "and", not an "or", relationship to create this incredible human cognitive ability.

So kind of like, I feel like my headline for this section is just, you know, it's not the art of data science. It's not just the art of data science or the science of data science, but there's also this huge aspect, and I think in some ways the most important aspect is design thinking for data science.

Cultivating design thinking and empathy

So you know, in terms of like what to do now, this, again, this isn't just like something you're born with. It's not magical. There are ways to hone it and cultivate it within yourself. The biggest one, like if I were to give my kind of instructions for what to do now, there's a bunch of really accessible reading on this. So in the podcast, we went through Nigel Cross, one of his design thinking books, and this was like a photo from one of the podcast listeners, which I like love so much. There's also, I participated in a design sprint at work, and that was really effective. So that's something I would recommend as well. There's also this really great book called Designing Your Life, which is essentially applying design thinking to like designing your life, for lack of a better term. So kind of thinking about how you're spending your time, and are you like happy with, you know, the way that you're, like the choices you've made in your life. So in terms of being introduced to the concept, it's like a very personally applicable book, so I totally recommend it.

I think the other one that's huge is like practicing design thinking for data science problems. So in general, even within the design field, they say the only way to get good at this is to do it a lot. So we actually on the podcast talked a little bit about this, where most data science problems are like, you know, here's the data set, do something with it. But instead, I would encourage you to think like, here's an interesting problem I want to solve, and how would I even go about like creating a data set for this and solving it? We went through one with like commute times, so understanding the variance of your commute times, and it's like, okay, how would you even measure that reliably in a way that's like low touch? So Roger actually, we did this on the podcast, and Roger wrote like an epic blog post about it, where he's kind of just describing the entire process of coming up with it. So I would totally encourage you to read that or try to think of things in your own life, and again, it might be kind of contrived, but the point is to practice.
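
As a small illustration of the commute-time exercise, here is a sketch in Python of the analysis end of the problem. It assumes the hard design question, how to log the times reliably and with little effort, has already been answered somehow. The numbers are invented, and this is not the approach from Roger's blog post.

```python
from statistics import mean, stdev

# Hypothetical door-to-door commute times in minutes, one per workday.
commute_minutes = [32, 35, 31, 48, 33, 36, 30, 52, 34, 33]


def summarize_commute(times):
    """Describe the typical commute and how much it varies."""
    return {
        "mean": mean(times),
        "sd": stdev(times),  # sample standard deviation
        "worst": max(times),
        "range": max(times) - min(times),
    }


summary = summarize_commute(commute_minutes)
print(f"typical commute: {summary['mean']:.1f} min")
print(f"day-to-day spread (sd): {summary['sd']:.1f} min")
print(f"worst day: {summary['worst']} min")
```

The design-thinking part is everything upstream of this: deciding what counts as the start and end of a commute, and how to capture it without having to remember to do anything.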

And then finally, and I wish I had a lot more time to talk about this part, but a huge part of data science, a huge part of design thinking is having empathy for the user. So being able to understand, would people want to play this kind of Style Shuffle game? And just even like for data science, understanding what does this person need, like what sort of analysis, visualizations does this person need in order to be convinced of what I think is the right conclusion from this data?

And so, you know, if I wanted you to walk away with one thing from this, it's that empathy is not some fixed character trait. It's actually something that you can cultivate, like it's a muscle you can strengthen. JD talked about it, but meditation is a huge way to do this, and that's been like personally very impactful for me for cultivating empathy. And so if you're kind of like feeling hopeless about that part, I would just encourage you to explore, like there are ways to strengthen this muscle.

So with that, you know, thank you so much. I don't think I have time for questions, but I appreciate your attention.
