
Miriah Meyer | Effective Visualizations | RStudio (2020)
Originally posted to https://rstudio.com/resources/rstudioconf-2020/effective-visualizations/
Transcript
This transcript was generated automatically and may contain errors.
Without further ado, here's Miriah Meyer talking about designing effective visualizations.
Thanks so much for being here. I'm super excited. This conference is very energizing and really awesome. I feel like I'm one of the cool kids. So yeah, let's go with that.
So my name is Miriah, and I am a professor at the University of Utah, which is in beautiful Salt Lake City. And at the U, I get to work with an amazing group of students in the visualization design lab, which I co-run with my colleague Alex Lex.
Now in our lab, what we focus on is doing deeply collaborative, design-oriented visualization research. And so what that means is that we spend a lot of time working very closely with domain experts in a wide variety of fields to really understand how they think about their data and to translate those needs and those mental models into effective design tools.
Now we work with data analysts from biology and neuroscience through cybersecurity and even poetry. And as researchers, we use these projects as test beds for us to experiment and to ask research questions.
And so the kinds of questions that we ask as researchers can kind of bucket into three main classes of things. So the first one are questions about, you know, how can we use these collaborative projects to come up with new and innovative visualization designs that can help us better understand the increasingly complex data that we have?
The second thing that we also ask questions about is about the visualization design process itself. So how can we as visualization designers and practitioners do what we do more effectively and more efficiently?
And then the third thing we think a lot about, and one that I'm increasingly excited about, is really moving beyond thinking of visualization as the end product and to instead think of it as a way to probe into how people think about their data and their relationship to technology in general. And for me, this is really about using visualizations as a tool in the sort of human side of data science.
So today, what I wanted to share with you were a couple of examples from projects about how we used ideas from the visualization research community in our very pragmatic design-oriented process that we also do as practitioners. And hopefully along the way, maybe give you guys a little bit of ideas how you can also look to some of the visualization research as a way to change and shift the way that you're approaching visualization design.
Designing new visualization techniques
All right, so to begin, let's talk a little bit about the idea of designing new and innovative visualization techniques. And here I want to talk to you about, I want to tell you about a project that I worked on with a colleague of mine, Bang Wong, who is in Boston. He's a really brilliant visual designer.
And Bang and I were working with a group of biologists who were studying yeast, or more specifically, they were interested in the process of metabolism in yeast. Now this group was trying to understand how different species evolved a similar process for metabolism because this actually has implications for our understanding of how diseases like cancer work.
Now the main kind of data that this group was working with was something called gene expression, which is really just a measurement of how much a gene is turned on or off in a cell. And the group was collecting gene expression for many different species of yeast and under different experimental conditions.
Now their main method for visualizing this data is a heat map, and this is in fact one of theirs. And I can see Will in the front sort of tilting his head, and I'm sorry, I will never do that again. So, yeah, but a heat map is the predominant way that these biologists were looking at their gene expression data.
So Bang and I began to engage these scientists in conversations around what worked well for them with this visualization, but also what was really problematic. And what they said is that they wanted to be able to make very fine-scale, nuanced decisions based upon changes in gene expression, and they found that in these heat maps it was nearly impossible to see that kind of detail in the data.
So Bang and I stepped back and we said, okay, so we need to redesign a heat map, but how in the world do we start even approaching this problem of coming up with something new? And so we decided to turn to a fundamental principle in visualization design, which is that spatial encoding is the most effective encoding channel that we have. So by spatial encoding, what I mean is by positioning marks on a common scale, such as in a scatter plot, or using the length of some sort of mark like we do in a bar chart.
So okay, so this is a foundational principle. We can ask many visualization designers about this, but as a researcher, how do we actually know that this is a good principle to go by? And part of it we know through controlled studies that have been done looking at different encoding channels, the first of which was done by the statisticians Cleveland and McGill back in the 80s, when they did a controlled lab experiment where they ran participants through a series of tasks using different encodings like color, position, angle, and so on, and they found that the spatial encoding techniques significantly outperformed other types of visual encodings.
Now this study was replicated a number of years ago by Jeff Heer and Mike Bostock, only this time they did it on Mechanical Turk, so they were reaching hundreds and hundreds of people, and the results that they got were very similar to what Cleveland and McGill had found in their studies. And so it's through studies like this that we're able to say with some confidence that certain kinds of visual encoding channels are more effective for certain kinds of tasks.
And based on this, we can think about our basic encoding channels and rank order them. So for example, here what we're looking at on the left-hand side are encoding channels that we have for encoding quantitative data, and on the right-hand side are the basic encoding channels for categorical attributes. And you'll notice at the top, both of them have spatial encoding as the highest-ranked one.
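One way to make that rank ordering concrete is as a small lookup helper that, given an attribute type and the channels still free in a design, picks the highest-ranked one. This is a toy sketch: the function name and the exact channel lists are my own simplification of the rankings the talk describes, not code from the project.

```python
# Effectiveness rankings (simplified), highest-ranked first, following the
# ordering described in the talk: spatial channels at the top for both
# quantitative and categorical attributes.
RANKINGS = {
    "quantitative": ["position", "length", "angle", "area",
                     "luminance", "saturation"],
    "categorical": ["position", "hue", "shape"],
}

def best_channel(attr_type, available):
    """Return the highest-ranked encoding channel still available."""
    for channel in RANKINGS[attr_type]:
        if channel in available:
            return channel
    raise ValueError(f"no ranked channel available for {attr_type}")
```

For example, if position is already spoken for, `best_channel("quantitative", {"hue", "length"})` falls through to `"length"`, which matches the intuition that a bar chart beats a color ramp for fine quantitative comparisons.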
So as visualization researchers, we take studies, we do studies on a variety of things like basic visual encoding channels, but we also combine it with things that we learn from design practice and years and years of anecdotal evidence to get to ideas like this.
So Bang and I decided, okay, let's take this principle and think about, well, what does that mean in terms of the data we have? And we decided, okay, well, the data we have is actually time series, so let's look at line charts.
So we went off into Illustrator, we mocked up every single variation that we could possibly think of for how to show a line chart, and we printed it out on a big poster, and what we noticed was that the one over here on the right, these little filled frame charts, really made the shape of those line charts pop out. And so we incorporated that idea into a new technique that we called a curve map, which we implemented as part of a bigger multi-view system that we deployed to our collaborators.
Now using this new curve map technique, our collaborators said that they were able to much more quickly see nuanced things that they knew about the data, but they also saw things that they had never known were there, which led to follow-up experiments and the discovery of some new scientific concepts that they weren't aware of before.
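The layout idea behind the curve map (as described in the Q&A below: rows are species, columns are genes, and each cell holds a small time curve) can be sketched as a simple faceting step. The data and names here are invented for illustration; the real tool rendered each cell as a small filled line chart.

```python
def curve_map(records):
    """Facet gene-expression time series into a species-by-gene grid.

    records: list of (species, gene, curve) tuples, where curve is a list
    of expression values measured over the life cycle.
    Returns grid[species][gene] -> curve, ready to render one small
    chart per cell.
    """
    grid = {}
    for species, gene, curve in records:
        grid.setdefault(species, {})[gene] = curve
    return grid

# Hypothetical measurements for two yeast species.
records = [
    ("S. cerevisiae", "GENE1", [0.1, 0.8, 0.4]),
    ("S. cerevisiae", "GENE2", [0.2, 0.2, 0.9]),
    ("S. pombe",      "GENE1", [0.0, 0.5, 0.7]),
]
grid = curve_map(records)
```

Faceting this way is what lets fine-scale differences pop out: every cell shares the same small axes, so the eye compares curve shapes across a row or down a column rather than decoding color in a heat map.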
So in this way, I want to stress that this idea is that we can take ideas about perceptual principles and other things and use those as a way to brainstorm and sort of a springboard for thinking about new ways to encode our own data.
The visualization design process
Okay, the second point, the visualization design process. So here is really about saying, okay, well, we have to create visualizations, but there's a whole lot of stuff we have to do before that, which is even figuring out what is it that I want to visualize in my data. And this is one of the things that I personally have spent a lot of time on and care about a lot.
So I'm going to start with a little flavor of what it's like to be a visualization researcher like myself. So we have two people, a viz person and a biologist. We might start by saying, oh, what is it you want to visualize? And the biologist might say something like, well, from patterns of conservation, we want to visualize the mechanisms that influence gene regulation. And to me, it sounds like this.
So it turns out the Ginger effect isn't just for debugging. It also occurs all the time in visualization design. And as a side note, beware of Jenny Bryan scooping your punch lines.
But I am just going to plow ahead anyway. Thanks, Jenny.
But it's true. That same experience of only really catching a few words at a time is absolutely the case when we jump into these new projects. What I'm really interested in is how we get from a semantically rich task like the one we have here to the knowledge that lets us design a visualization like this to support it.
So fundamental to this process is identifying something I like to call proxies, which are the partial and imperfect data representations of that semantically rich thing the analyst actually cares about. The high-level goals of a data analyst are rarely captured directly in the data. And if they were, you probably wouldn't need visualization anyway.
So instead, what we have to spend a lot of time doing is thinking about what are the things in my data that I can infer something from that will then actually let me answer the question I have. And so let me give you an example of what I mean by this.
Let's say that I wanted to identify good film directors because I don't know, Sundance is happening right now in Salt Lake and I'm a journalist and I need to identify some good film directors. Okay. So I go off and I scrape some data from IMDb about movies. So now I have this data set that includes lots of information about movies such as how much money they've made, their ratings and so on.
But now going back to my question, my question is really about film directors, but I have information about movies. So what do I do? And this is where I like to think about this notion of how do I select a good proxy in my data for my question.
And so my colleague Danyel Fisher and I came up with this very simple mechanism for doing this, which is that we break a task down into three things: the action, which is the thing you want to do; the object, which is the items you want to take that action on; and then finally, the measure, which is the value that you care about for those objects.
And so if we do this, let's look at our tasks. So the action is identifying. What do I want to identify? Film directors. Now this is where I know that I have a data set about movies, not film directors, but I could say if I can learn something about movies, I can infer something about directors. So I'm going to choose movies as my proxy for director. And what do I want to know? I want to know if they're good.
And now, you know, I don't have an attribute labeled good in my spreadsheet. So this is where, I think, in data science we are constantly making decisions that are inherently subjective and really specific to whatever view you have on the problem you're tackling. So in this case, good could be movies that make a lot of money. It could be movies that reach a broad demographic. But for now, I'm going to keep it simple and just say good equals movies with high IMDb ratings.
So by identifying my proxies, I can now translate my task into something concrete, identify movies with high IMDb ratings, which is actually something I can design a visualization or other sort of analysis to tackle.
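The action/object/measure translation above can be sketched in a few lines: the proxy decision is that a director's "goodness" is the mean IMDb rating of their movies, and everything downstream is mechanical. The toy data here is invented, and the rollup choice (mean rating) is exactly the kind of subjective decision the talk calls out.

```python
# Hypothetical scraped movie records: the data is about movies,
# but the question is about directors.
movies = [
    {"title": "A", "director": "Jones", "rating": 8.1},
    {"title": "B", "director": "Jones", "rating": 7.9},
    {"title": "C", "director": "Smith", "rating": 6.0},
]

def rank_directors(movies):
    """Proxy: a director's 'goodness' = mean IMDb rating of their movies."""
    totals = {}
    for m in movies:
        s, n = totals.get(m["director"], (0.0, 0))
        totals[m["director"]] = (s + m["rating"], n + 1)
    # Highest mean rating first.
    return sorted(((s / n, d) for d, (s, n) in totals.items()), reverse=True)

ranking = rank_directors(movies)  # Jones first, with mean rating 8.0
```

Swapping in a different measure (gross revenue, audience breadth) changes only the rollup line, which is the point: the proxy is a choice, not a given.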
But it turns out that actually being able to do this translation and being able to find these proxies is a really challenging process and takes a lot of time. And in our group, what we do is we spend a lot of time really immersing ourselves with the people that we're working with in order to better understand the kinds of challenges they face and the things that they want to do.
And some colleagues at the University of Calgary just this past year formalized this notion of immersion, calling it design by immersion, and talked about the sorts of benefits we get when we do transdisciplinary research and bring a big-tent approach to how we problem-solve and, in this case, do data analysis.
Visualizations as probes
And then the last thing I want to talk about here is the notion of visualizations as probes. I'm going to do this by talking about another project that I worked on that was headed up by a former student of mine, Nina McCurdy, where we were working with a group of public health experts at USAID who were studying Zika, the spread of Zika in Latin America and all the effects, notably microcephaly in babies.
And so during this project, Nina spent six months at USAID working closely with the health experts there. And it had all the makings of a straightforward visualization project. The collaborators had data. They had tried to visualize their data and failed, so they were super excited to work with us. And so Nina left and we dug in.
Now, over the course of the study, Nina did a lot of rapid prototyping and ultimately designed a tool that used best practices for how we can incorporate all sorts of tabular data with geospatial information. She evaluated this tool with various stakeholders throughout that time and the response was overwhelmingly positive, that this was a great tool for the data that they had.
And yet, when we tried to engage our immediate collaborators in incorporating the tool as part of their workflow, we noticed a lot of hesitation and we were really concerned about this. And as we probed deeper into this hesitation, we came to understand that even though the tool was a good representation of their data, the data was not a good representation of what they knew to be true about the spread of Zika on the ground.
So this first came up in a discussion of a choropleth that looked like this one showing Zika cases at a national level. So as you'll see here, Brazil is in dark red, indicating a relatively high percentage of cases, whereas Colombia is in a lighter orange, indicating a relatively lower percentage. Now, one of our collaborators, when she saw this, she noted, well, Brazil reports all cases, whereas Colombia only reports cases after a thorough investigation.
The implication of this comment was that the different ways in which these countries were reporting Zika was leading to an inaccurate and possibly misleading picture of what was actually happening. So we pivoted to focus on this problem and found that the data was littered with discrepancies like this.
In the case of the Zika data, the discrepancies came from the way the data was collected, processed, and reported. And it was this distributed, heterogeneous data generation pipeline, which played out differently in every country, that led to all these discrepancies.
And now, you know, all was not lost. So even though these problems with the data were not included in the data set, the experts working with it had deep and intimate knowledge about them. And these discrepancies, it turns out, were shaped by the individual country's political, cultural, economic, geographic, and demographic context.
And so we'd get discrepancies like the union in Region X goes on strike often and doesn't report Zika data. Or country Y recently overhauled its surveillance system, leading to a sudden increase in detected cases.
So in thinking about this, we formalized this notion into what we called implicit error, which is measurement error that is inherent to a given data set, assumed to be present and prevalent, but not explicitly defined or accounted for. And so we were able to characterize aspects of implicit error, as well as design an annotation mechanism to help experts externalize it.
Nina implemented this annotation mechanism back into our tool, which we deployed back out so that these health experts could start annotating their data in order to share information with their colleagues, as well as to start to build up a database of contextual information around that initial data set.
So although this work was grounded in Zika health data, we suspect that this type of implicit error is prevalent in many, many other domains.
Okay, so this was an example where we were able to use visualization, not as necessarily the end result, but as a conversation starter about challenges that the group was having with the data, and then ultimately use visualization as a sort of mechanism to help analysts externalize things that they knew that wasn't captured in that data set.
So these three things give you a flavor of what we think a lot about as visualization design researchers, and I think they have a lot of impact on how we do visualization in practice.
Recommended reading
And I just wanted to leave you with a quick recommended reading list. The first one maybe some of you know is Designing Data Visualizations by Noah Iliinsky. It's a great book that I think incorporates a lot of the foundational principles that we in the research community have for designing visualizations.
The next one, Visualization Analysis and Design, is the grad textbook by Tamara Munzner. If you want to geek out about how we talk about visualizations, as well as more complex visualization types, this is a great book. The next one, Making Data Visual, is a personal little plug, but this is a book that my colleague Danyel Fisher and I recently published that really looks at the process of how we figure out what we're designing for.
And then the last one is the sort of sleeper recommendation, and this is a hand-me-down recommendation from Martin Wattenberg, which is a book that will absolutely change the way that you think about visualizations, and actually nicely complements a lot of what Will was talking about earlier today. And with that, I just want to thank you all for your attention.
Q&A
Thanks so much, Miriah. Some questions. Well, since you just mentioned him, was Martin Wattenberg right when he said pie charts were underrated? I absolutely agree that they're underrated, but this is a deep philosophical debate within the viz community for sure.
Can you describe the small multiples chart you used to replace the gene expression heat map? What is shown on each axis? Did the final chart have more annotation? Yeah, so what we were showing is actually this: the rows were different species, the columns were different genes, and they had taken experimental measurements at different points during the life cycle of the species, so that's what the time curves were. And really, the curve map gives you a table layout, so you can define attributes along the rows and the columns to help facet your data, and inside we were showing time curves.
How do you bridge the knowledge gap to someone who might not understand or have ever seen a visualization like you've shown? Oh my gosh. I don't know how to answer that quickly. That's a great question. You know, what we actually do, and I think this is a really important point actually, is when we are designing a very new and different type of visualization, we don't just show up one day and say, voila, look, you're going to love it. But instead, it's a process of actually building trust with our colleagues so that they trust us with their hard-earned data. We often, the first thing we implement is similar to what they already have and slowly change things over time. We're actually designing with them, not for them. And so that collaborative process, I think that really helps people embrace new and different ways of looking at data.
How do you determine the effectiveness of various visualization methods? Any qualitative metrics? Oh, yeah. Great question. Some of the basic encoding channels I talked about are things that I think are pretty straightforward to be able to test quantitatively in a controlled experiment. It's largely based on our perceptual system. The kinds of visualizations that we create in my research group, though, are complex and part of the broader workflow, and we cannot test quantitatively. And so we use a lot of qualitative methods, such as case studies and other sorts of things, in order to understand the efficacy of the kinds of tools that we create. So, yeah, definitely we rely heavily on qualitative methods.
What is the funding model for an academic vis lab? Do you have project-specific grants, function as a core group for other labs? Is your work acknowledged well? Gosh, that's a lot of questions. Yes, largely we apply for grants on certain projects, and then that money funds our graduate students a little bit of our time. We personally, our lab is not a service lab because we do focus on research, but we do have service groups, for example, on our campus that do that kind of work.
Oh, is it well-acknowledged? Yeah. No. Especially when we put out these kinds of visualizations and people use them as part of a larger analysis pipeline, visualizations, I think, are woefully under-cited, and that has led to something of a crisis within our research community about, like, what's the value of what we do? Because a lot of times it's not really acknowledged. I think visualizations are often used for creativity and brainstorming, to get us thinking about something else we want to do, not necessarily for the final answer. And I think that part is something people don't necessarily recognize as an important thing to cite. Great. Well, thank you so much. Thanks.
