Miriah Meyer | Effective Visualizations | RStudio (2020)

So in this way, I want to stress that this idea is that we can take ideas about perceptual principles and other things and use those as a way to brainstorm and sort of a springboard for thinking about new ways to encode our own data.

So in this way, I want to stress that this idea is that we can take ideas about perceptual principles and other things and use those as a way to brainstorm and sort of a springboard for thinking about new ways to encode our own data.

The visualization design process

Okay, the second point, the visualization design process. So here is really about saying, okay, well, we have to create visualizations, but there's a whole lot of stuff we have to do before that, which is even figuring out what is it that I want to visualize in my data. And this is one of the things that I personally have spent a lot of time on and care about a lot.

So I'm going to start with a little flavor of what it's like to be a visualization researcher like myself. So we have two people, a viz person and a biologist. We might start by saying, oh, what is it you want to visualize? And the biologist might say something like, well, from patterns of conservation, we want to visualize the mechanisms that influence gene regulation. And to me, it sounds like this.

So it turns out the ginger effect isn't just for debugging. It also occurs all the time in visualization design. And as a side note, beware of Jenny Bryan scooping your punch lines.

But I am just going to plow ahead anyway. Thanks, Jenny.

So but it's true. That same experience about only really like seeing a few things at a time is absolutely the case when we jump into these new projects. What I'm really interested is in how we get from a very semantically rich like task like we have here to the knowledge that we can then design a visualization like this to support.

So fundamental to this process is identifying something I like to call proxies, which are the partial and imperfect data representation of that semantically rich thing that the analyst actually cares about. The high level goals of a data analyst is actually rarely captured directly in the data. And if it was, you probably wouldn't need visualization anyway.

So instead, what we have to spend a lot of time doing is thinking about what are the things in my data that I can infer something from that will then actually let me answer the question I have. And so let me give you an example of what I mean by this.

Let's say that I wanted to identify good film directors because I don't know, Sundance is happening right now in Salt Lake and I'm a journalist and I need to identify some good film directors. Okay. So I go off and I scrape some data from IMDb about movies. So now I have this data set that includes lots of information about movies such as how much money they've made, their ratings and so on.

But now going back to my question, my question is really about film directors, but I have information about movies. So what do I do? And this is where I like to think about this notion of how do I select a good proxy in my data for my question.

And so my colleague Danielle Fisher and I came up with this very simple mechanism for doing this, which is we break a task down into three things. Into the action, which is the thing you want to do. Into the object, which is the items you want to take that action on. And then finally, the measure, which is the value that you care about for those objects.

And so if we do this, let's look at our tasks. So the action is identifying. What do I want to identify? Film directors. Now this is where I know that I have a data set about movies, not film directors, but I could say if I can learn something about movies, I can infer something about directors. So I'm going to choose movies as my proxy for director. And what do I want to know? I want to know if they're good.

And now, you know, I don't have an attribute labeled good in my spreadsheet. So this is where I think all the time in data science, we are constantly making decisions that are inherently subjective and really specific for the one, for the, whatever the view is on the problem you're tackling. So in this case, good could be movies that make a lot of money. It could be movies that reach a broad demographic. But for now, I'm going to keep it simple and just say it's good equals movies with high IMDB ratings.

So by identifying my proxies, I can now translate my task into something, identify movies with high IMDB ratings, which is actually something I can design a visualization or other sort of analysis to tackle.

But it turns out that actually being able to do this translation and being able to find these proxies is a really challenging process and takes a lot of time. And in our group, what we do is we spend a lot of time really immersing ourselves with the people that we're working with in order to better understand the kinds of challenges they face and the things that they want to do.

And some colleagues at the University of Calgary just this past year formalized this notion of immersion by calling it design through immersion and talking about some of the sorts of benefits that we get when we do transdisciplinary kind of research and bring sort of big tent style approach to how we problem solve and in this case do data analysis.

Visualizations as probes

And then the last thing I want to talk about here is the notion of visualizations as probes. I'm going to do this by talking about another project that I worked on that was headed up by a former student of mine, Nina McCurdy, where we were working with a group of public health experts at USAID who were studying Zika, the spread of Zika in Latin America and all the effects, notably microcephaly in babies.

And so during this project, Nina spent six months at the USAID working closely with the health experts there. And it had all the makings of a straightforward visualization project. The collaborators had data. They had tried to visualize their data and failed and so they were super excited to work with us. And so Nina left and we dug in.

Now, over the course of the study, Nina did a lot of rapid prototyping and ultimately designed a tool that used best practices for how we can incorporate all sorts of tabular data with geospatial information. She evaluated this tool with various stakeholders throughout that time and the response was overwhelmingly positive, that this was a great tool for the data that they had.

And yet, when we tried to engage our immediate collaborators in incorporating the tool as part of their workflow, we noticed a lot of hesitation and we were really concerned about this. And as we probed deeper into this hesitation, we came to understand that even though the tool was a good representation of their data, the data was not a good representation of what they knew to be true about the spread of Zika on the ground.

So this first came up in a discussion of a choropleth that looked like this one showing Zika cases at a national level. So as you'll see here, Brazil is in dark red, indicating a relatively high percentage of cases, whereas Colombia is in a lighter orange, indicating a relatively lower percentage. Now, one of our collaborators, when she saw this, she noted, well, Brazil reports all cases, whereas Colombia only reports cases after a thorough investigation.

The implication of this comment was that the different ways in which these countries were reporting Zika was leading to an inaccurate and possibly misleading picture of what was actually happening. So we pivoted to focus on this problem and found that the data was littered with discrepancies like this.

And then in the case of Zika data, this came from the way that the data was collected and processed and reported. And it was this distributed heterogeneous data generation pipeline that occurred differently in every country that led to all these discrepancies.

And now, you know, all was not lost. So even though these problems with the data were not included in the data set, the experts working with it had deep and intimate knowledge about them. And these discrepancies, it turns out, were shaped by the individual country's political, cultural, economic, geographic, and demographic context.

And so we'd get discrepancies like the union in Region X goes on strike often and doesn't report Zika data. Or country Y recently overhauled its surveillance system, leading to a sudden increase in detected cases.

So in thinking about this, we formalized this notion into what we called implicit error, which is measurement error that is inherent to a given data set, assumed to be present and prevalent, but not explicitly defined or accounted for. And so we were able to characterize aspects of implicit error, as well as then design an annotation mechanism to help experts externalize.

Nina implemented this annotation mechanism back into our tool, which we deployed back out so that these health experts could start annotating their data in order to share information with their colleagues, as well as to start to build up a database of contextual information around that initial data set.

So although this work was grounded in Zika health data, we suspect that this type of implicit error is prevalent in many, many other domains.

Okay, so this was an example where we were able to use visualization, not as necessarily the end result, but as a conversation starter about challenges that the group was having with the data, and then ultimately use visualization as a sort of mechanism to help analysts externalize things that they knew that wasn't captured in that data set.

Okay, so this was an example where we were able to use visualization, not as necessarily the end result, but as a conversation starter about challenges that the group was having with the data, and then ultimately use visualization as a sort of mechanism to help analysts externalize things that they knew that wasn't captured in that data set.

So these are three things that are the flavor of things that as visualization design researchers we think a lot about, and I think has a lot of impact about how we do visualization in practice.