Resources

Trustworthy Data Visualization (Kieran Healy, Duke University) | posit::conf(2025)

Trustworthy Data Visualization
Speaker: Kieran Healy
Abstract: Visualizations are the most widespread, the most immediately accessible, and in some ways the most authoritative-looking way to present results to audiences, including when the audience is just yourself. In this talk I ask: what makes a visualization trustworthy? How do tools like R and ggplot help us achieve that goal? And what are their limits, especially now that it's easier than it has ever been to do untrustworthy things with data?
Link: Kieran Healy's website - https://kieranhealy.org/
Subscribe to posit::conf updates: https://posit.co/about/subscription-management/

Transcript

This transcript was generated automatically and may contain errors.

Good afternoon, everybody. Thank you for coming. Thank you for staying right till the end. I'm sure everyone is quite tired. The good news is this talk is mostly pictures.

Yesterday Hadley showed you all ggbot. Well, this is OG ggbot.

I am a sociologist interested in data visualization, which makes me unlike most of you. I work with quantitative data quite often, and I think visualization is a good way to help me investigate it and understand it. It's also a great way for me to take something that I've found and quickly show it to somebody else. Maybe I want to explain it to other people, or, just as often, it's a way for me to ask a question: to say, I don't understand what's happening here, or, what the hell is this?

A bit more generally, I'm also interested in how data and the tools that we have for collecting and analyzing data have transformed the world around us. In that regard, I think of data visualization as a kind of synecdoche for data science as a whole or for the use of data in society. It's part of a bigger thing that can stand for the whole thing. For me, thinking about data visualization naturally leads to these bigger questions. As researchers and data analysts, how should we think about our tools and techniques? What's the relationship between our data and the world that it's about and how does that get encapsulated by the graphs and visualizations that we draw?

All of these things, I think, have been transformed, indeed revolutionized over the past few years as the scale and scope of what we can do with data has just gotten more and more powerful. Now, data comes in many forms but speaking very generally, a common pattern is that we start from a stream of data that we get from somewhere. Maybe it's the result of our own really painstaking collection efforts or maybe more often these days it's regularly emitted by some device or some piece of infrastructure that's already in place. That might be a network of satellites or it might be a server in a giant air-conditioned shed somewhere or the phone in your pocket. And of course, these things aren't that far apart anymore.

And what we do is we take that stream of data and we construct some sort of score or sequence of scores with it. And then again, sometimes that thing is something that people habitually call a score, like a credit score. Other times it's a quantity or a coefficient that's the result of some sophisticated model, more like an estimate or a prediction of some result. Or maybe it's just a count or an average or a price, a score in the sense of a number that characterizes something. And once we have that sequence or a collection of scores, we want to use it to tell a story. And that's often the point where visualization comes in. See, look, here's what the data look like, here's what they mean, here's what we should do next.

Technical analysis and the chart men

One early stream of data that people really wanted to tell a story about was this one, the Dow Jones Industrial Average or averages. In the early days there was one for industry and one for the railroads and you can just see kind of at the bottom there one for utilities too, which actually still exists. The Dow Jones was a steady stream of data about the stock market as a whole and the desire to transmit that data far and wide in a more or less continuous stream was one of the driving forces behind the expansion of the telegraph, the development of modern communications technology generally. Deep in the bones of your terminal, of your PowerShell, of your Positron or RStudio or VS Code session, you will find the marks of the teletype and ticker tape machines that were built to transmit price data.

There was a reason that people really wanted that stream of data. They wanted to make money, and in some cases they wanted to make money by using graphs to see into the future. Charles Dow, the founder of both the Wall Street Journal and Dow Jones, was retrospectively credited with the Dow Theory, which later became technical analysis. Technical analysis started from price charts, which of course had to be drawn by hand, and combined them with a theory of what investors in the market were thinking, which was then used to predict prices. And in practice, the main technique was telling stories with graphs.

The people who did this were the chart men, and they were all men. The classic source is a book by Robert Edwards and John Magee called Technical Analysis of Stock Trends. It came out in 1948 and it is still in print. It's now in its 11th edition, I think. The idea was that if you looked at trends in stock prices, either for the market as a whole or for individual stocks, you could spot recurring patterns, patterns that told you what might happen next. Not infallibly, of course, but enough to give you an edge, to let you make an informed decision to buy or sell.

So you start with a high-low-close chart like this one, which is from Edwards and Magee, and then you start looking for the telltale patterns. What patterns, exactly? Well, it turns out there are a lot. Some of them seem more or less unobjectionable. A number will go up, a number will go down, a number will go sideways. I think this style of drawing is very nice, by the way. They're not unlike the sparklines that people put in summary tables even today.

But the chart men also thought that you should take care to look at fluctuations within the trend. They were very keen to argue that downward trends, true downward trends, had to be measured by the downward trends of the highs, whereas true upward trends had to be measured by the upward tendencies of the lows, for example. Now anyone can see a basic upward or downward trend, but at this point the theory starts to get a little richer. One idea was that after a period of steady growth, you would often see investors pull back from a stock as they lost confidence in it, and hence the pullback effect. So the trick would be to sell at the top of that third peak there.

They had a whole family of these patterns, each with a name, a distinctive visual signature, and a story associated with what to do when you spotted it. Their favorite was one called head and shoulders, which was this triple peak thing where the third rally of a stock failed and it ended up falling below its original price. And so what you really wanted to be was where the smart money was, which was selling to buyers at the top of the head there. As I say, the underlying theory here was a kind of psychology of market sentiment, which is not an unreasonable idea. John Maynard Keynes made a lot of money in the stock market in his day and once remarked that the business of the investor was to predict what average opinion thought average opinion would be.

By the way, I've often wondered if the anti-dandruff shampoo was named after this phenomenon, and I honestly think it was. Procter & Gamble introduced Head & Shoulders in 1961 and it was marketed as a medicinal cream to well-dressed men embarrassed about getting dandruff on the collars of their suits. Men who, as you can see here, looked at stock prices in the paper.

Anyway, the chart men didn't think these patterns were iron rules or laws of nature. They regularly cautioned that you can't just look at the chart and read the future, obviously, because if you could then everybody would be doing that. But they were still committed to the theory, to the story, and this led them, understandably, to develop what a philosopher of science might call auxiliary hypotheses to save the main idea. For instance, maybe some data doesn't quite fit the head and shoulders pattern. That's probably because sometimes head and shoulders patterns are complex: maybe you have two heads, or three or four shoulders. You already have two shoulders to begin with. Maybe sometimes there's a failure pattern, either at the bottom or at the top, and of course when I say a failure pattern, what I mean is that the world has failed us in some way, not the theory, because our theory knows about failure patterns. We have a name for them and everything.

So the technical analysts, the chart men, pored over charts like this and they made their recommendations. They told their stories with their data. What killed this theory in the end was the rise of the efficient markets hypothesis in its various forms: the idea that in something like the stock market, asset prices reflect all available information, or at least all publicly available information about an asset. So in the short run, prices follow a random walk. That means you can't consistently beat the market, at least not like this. The technical analysis people were just reading patterns into noise. Sure, sometimes it worked, but in the long run, you wouldn't outperform an index fund, except by luck.

I wanted to write a tiny bit of code to generate a random walk of stock prices, and I was kind of hoping it would produce one of the standard patterns. I put in my seed for today's date, and sure enough, it did on the first try.
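For readers following along, here is a minimal sketch of that sort of simulation in R with ggplot2. It is not the code from the talk; the seed and volatility are arbitrary choices, but any seed will eventually hand you a "pattern" if you go looking for one.

```r
# A random walk masquerading as a price series. Not the talk's code;
# the seed and volatility here are arbitrary.
library(ggplot2)

set.seed(20250918)
n_days <- 250
# Geometric random walk: exponentiated cumulative sum of small shocks.
price <- 100 * exp(cumsum(rnorm(n_days, mean = 0, sd = 0.01)))

walk <- data.frame(day = seq_len(n_days), price = price)

ggplot(walk, aes(x = day, y = price)) +
  geom_line() +
  labs(x = "Trading day", y = "Price",
       title = "Pure noise, ready to be read as a pattern")
```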

What makes a visualization trustworthy?

The thing about the chart men is that there's nothing untrustworthy about their data. The stream of information they used came straight from Dow Jones, and as for the score, well, the thing about technical analysis is that it wasn't that technical. Really the only score or technique they used was putting the numbers into graphs, which was a painstaking business. You had to do it by hand, and there's nothing wrong with the graphs either, considered as graphs. Some of them are really good. What's wrong, of course, is the story, which is nonsense.

Now, we don't like to think of ourselves as the sort of people who'd be taken in by something like looking for head and shoulders patterns in price charts, and certainly we don't want to think of ourselves as the sort of people who are inflicting some version of this sort of thing on our audiences. For one thing, you might say, look, I'm not trying to predict the future. I just want to describe something that happened. I just want to explore my data. Or you might say, hey, I actually have a real theory, a proper theory that can explain the thing I'm showing you. I have real evidence.

But this tendency really is pervasive. Here's an example from a few years ago. Vox published a story discussing this startling rise in the availability of fats, especially vegetable fats, in people's diets. It was pretty widely circulated. That word availability should already alert you that there might be something going on, because the availability of something in your diet isn't the same as you actually consuming it. But beyond that, the fact that something like 20 pounds of this increase happened in about a year is odd. You can probably guess how this ends, but the thing is people really, and I mean really, wanted to tell a story about this trend.

Maybe something about long-term shifts in the American diet, the insidious rise of veganism, stress eating caused by the Y2K problem, the emergence of avocado toast as a central feature in millennials' lives. That was actually one of the theories. In fact, what happened was a change in reporting requirements. A good rule of thumb is that if you see a sudden shift or discontinuity in a time series like this, your first question should be, did something about the data generating process change? You should sort of tap the instruments. Or better, try finding someone who knows where the data came from. As Tom Smith of the National Opinion Research Center likes to say, if you want to measure change, you can't change the measure.

In this case, unlike with the chart men, there really is a big signal in the data. It's just not signaling what people imagined. But in both cases, what people want to do is eyeball a graph and spin you a yarn. I don't think graphs are uniquely to blame here. People like to tell stories, and they'll tell them with whatever is at hand. But I think it's fair to say that people giving talks like the one I'm giving right now can be tempted to make quite strong claims about visualization's unique capacity to provide insights into the structure of your data. Maybe it's just the idea that visual storytelling or a strong data narrative makes good graphics uniquely compelling in a way that conveys your insights to others. And implicitly or explicitly, there's also this idea that the best data visualizations are authoritative, that they can decisively settle a question or definitively make a point.

We kind of want these things to go together, right? We want to make something that fuses an insightful, compelling, and authoritative use of data into a single, immediately accessible image. Ideally something like this, for instance, the one John Snow made with the Broad Street water pump, pinpointing the existence of a cholera outbreak before going and removing the handle from the pump. The little squares are measures of deaths in the buildings on the streets.

There really is something to the idea that a good graph can do a fantastic job here. A compelling and authoritative insight is an epiphany, a moment of truth. And a great graph is a kind of portable epiphany. The history of data visualization, or maybe I should say the history of data visualization talks, gives many examples. These often come with little stories of their own, little tiny myths, like how a particular image kick-started the field of epidemiology, or explained why the Challenger space shuttle exploded, or summed up the reality of climate change for people. But I think we need to be careful.

Don't get me wrong, some of these graphs are really good: elegantly assembled, cleanly executed, really effective. I wish I had drawn them. But we know this can't be the whole story, as it were, because the other thing data visualization talks are full of is examples of really bad graphs. Like this one, for instance. If you had to bet your life on it, what would you say those values are?

To be fair, the 3D column chart is not the default way that Excel draws bar charts. You have to make a slight effort to find the 3D option. But believe me when I say, people are willing to put the work in. We love these terrible examples, and there are so many of them. Every time you think people have finally stopped using axonometric or oblique projections to show three or four numbers, a new one comes along. But examples like this do tend to undermine this noble story of visualization as a distinctively powerful source of insightful, compelling, and authoritative narratives. They make it clear that there's a lot more going on when we put together a visualization, a lot more ways for things to go sideways, or if not sideways, maybe sort of backwards and up about 120 degrees.

And it's not just a matter of badly chosen graph types either. We have undeniable evidence that this whole business of just looking at anything is totally broken. And I don't just mean your choice of graph. I mean it's a mess like all the way down.

Now, I acknowledge that navigating a slowly cooling universe filled with blobs of condensed matter in a hellscape of electromagnetic radiation while not being eaten is a really hard problem to solve, especially on the fly. But still, whoever designed the human visual system has a lot to answer for. Visual perception is weird. The human eye is not a light meter, for example. It runs on relative contrast, not absolute brightness. Those two squares are the same shade of gray. People don't believe you, but I'm just covering it up. And methods for visualization have to deal with these things, sometimes riding around them, sometimes leveraging them.
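The two-squares demonstration is easy to recreate. Here is a hedged sketch in R with ggplot2, not the exact figure from the talk: both inner squares are drawn in the identical gray, but the one on the dark background reads as lighter.

```r
# Simultaneous contrast: two identical gray squares (#808080) on
# different backgrounds. Cover the backgrounds and the match is obvious.
library(ggplot2)

ggplot() +
  annotate("rect", xmin = 0, xmax = 1, ymin = 0, ymax = 1, fill = "grey15") +
  annotate("rect", xmin = 1, xmax = 2, ymin = 0, ymax = 1, fill = "grey90") +
  annotate("rect", xmin = 0.35, xmax = 0.65, ymin = 0.35, ymax = 0.65,
           fill = "#808080") +
  annotate("rect", xmin = 1.35, xmax = 1.65, ymin = 0.35, ymax = 0.65,
           fill = "#808080") +
  coord_fixed() +
  theme_void()
```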

But as I say, we know all this. And that's why I'm not dwelling on a parade of horrible graph crimes. We have decades of work, from William Cleveland in the 1980s, on down to the present, investigating what does and doesn't work when it comes to visually representing data. Working out how good or bad people are at reconstructing relationships in the data based on graphs that we show them. Knowing how evolved features, capacities, and limits in the human visual system fare when put to work at looking at data, a task that they did not evolve to conduct. We really know a lot, and not just the relatively simple stuff either. But super tricky things, like color perception.

And not only that, we all benefit from the many years of hard work getting to put this into practice. The work to develop a flexible and powerful conceptual language for describing graphs, like Leland Wilkinson's grammar of graphics. The work done by people in this room to develop tools to implement those ideas in software, like ggplot. The work to integrate it with other freely available tools, developed in the same spirit of openness and cooperation, in a giant collaborative enterprise that lets me sit in my kitchen and pull down, process, and visualize, say, 40 years of global sea surface temperature data, where each day has about 4 million observations. As if that was a reasonable thing to expect to do in your kitchen. It's not even a very nice kitchen. It's not even remodeled.

So this is where we are now. Version 4 of ggplot has just come out, and congratulations to everybody involved in getting it out the door. As June pointed out in the announcement, ggplot is old enough to vote now. Unlike most 18-year-olds, it is mature, well-organized, and stable. And so are other visualization packages and frameworks and toolkits, in the R community and beyond it, in Python and JavaScript and many more besides. We really are in excellent shape. Our software is reliable in the sense that it does what we expect it to do. It maps quantities and features to points, lines, shapes, and colors in a perceptually accurate way.

The way we work is also increasingly reproducible. The tools that we have for scripting, testing, and rebuilding our work are better than ever. The people who make Quarto have been doing fantastic work to make it easier than ever to produce and publish reports, articles, and books, while still keeping things based on code that we can rewrite and rerun.

And there's that last part. We want our visualizations to be trustworthy in some general sense. Are they? Can they be? And what exactly is the role of software when it comes to making our work more trustworthy? After all, when someone tells you a story, when someone shows you a graph and gives you an interpretation of it, you want to know whether you can believe them, whether you can trust it.

Here's why I think this is worth emphasizing. I think that whatever it is that makes a data visualization trustworthy, it isn't really something you can include in the graph. If you look at stock examples of bad graphs, especially older ones, it's kind of comforting because most of them look really janky. And the same goes for contemporary examples of bad marketing materials or from media outlets trying to spin some ridiculous visual summary of polling data. They don't look trustworthy. It's possible to make janky graphs with ggplot, but you kind of have to work at it. By default, ggplot makes a lot of good choices for you, and this gives it what you might call a kind of aura of plausibility. Whatever data you're showing and whatever story you're telling benefits from a kind of rhetoric of credibility that's built into the software, more or less out of the box.

Trust, reliability, and commitments

So what does that mean to trust a visualization? As you might expect, there are large social scientific and philosophical literatures about trustworthiness. And like in many social scientific literatures, if you read them, you find that empirical studies confirm that trust has many dimensions. It's also really quite hard to measure, and it's very complicated. And like in many philosophical literatures, if you read them, it turns out that trust has many senses, and it's really quite hard to give a unified account of, and it's very complicated.

Ever since the work of Annette Baier, the philosophical literature has made this distinction between trust and what gets called mere reliability. Now, in ordinary speech, trust can mean reliability too, but the idea is that our everyday way of talking is fairly loose, and it runs together something that's worth making a distinction about. And the argument is roughly that there are a lot of things, and people, that we rely on without actually trusting them. In particular, if they failed, we wouldn't say that we expected an apology from them. We wouldn't feel betrayed. We wouldn't feel that we'd been lied to.

So for example, I might rely on you having a USB cable I can borrow in an emergency if I want to give a presentation, because I know you're the kind of person, I know you're out there, who has a bag full of cables from all kinds of things, and you've always had one before when I've asked to borrow one. But if I needed one and you happened not to have your bag with you one day, I couldn't reasonably say that you had betrayed my trust. Or I might rely on my alarm clock to wake me up, but if it failed one morning, I'd be making a kind of mistake if I thought that it owed me an apology.

Another philosopher, Katherine Hawley, argues that trusting someone to do something means, first of all, believing that they have a commitment to doing a thing, and then relying on them to meet that commitment. A trustworthy person keeps their commitments. And when you mistrust someone, it's because you think they have a commitment to something, but you doubt that they can be relied on to meet it. What's interesting about Hawley's view, amongst other things, is the negative implications of that. Because being trustworthy, for her, is mostly about making sure you don't end up getting stuck with commitments that you can't, won't, or don't fulfill. Being trustworthy doesn't mean taking on as many commitments as you possibly can. Being trustworthy doesn't mean that you have an obligation to be responsive to every single demand made of you. Being trustworthy doesn't mean that you're answerable to literally anyone.

Now, in a talk like this, I can't give these ideas the room they deserve, but they're useful when we think about what we're doing when we look at data and show visualizations. Because I find that when we're talking about how we should proceed or what we should do, even in quite technical discussions, the topic moves very quickly to issues of social relations, of people's roles, and the expectations associated with those roles. Not so much the technical issues, but the context in which they're embedded: what we can reasonably expect of others, and what the limits on those expectations are.

And this is a general feature of any kind of organized knowledge production. Data visualization is a good case because a graph is such a deceptively simple and portable object. When you make a graph and tell a story with it, all kinds of norms and expectations get compacted down into this single thing, into an image meant to authoritatively convey some structured bit of information to somebody. And when you tell a story with your data, you're implicitly warranting it. You're making some commitments. You're not offering an unconditional warranty. You're not beholden to everyone for all time. But questions of trust do arise.

So when you look at this modern classic of data visualization, for example, which I made, you might ask yourself: can I trust this? Does it look like the person who made this knows what they're doing? And if you've spent any time looking at graphs, then you know the answer is absolutely not, which is too bad. I mean, look at these beautiful drop shadows, that finely textured wood grain, the tasteful use of Papyrus, the king of off-brand herbal tea fonts. What could be more warmly authentic?

No. As a seasoned consumer of visualizations, you know that you're better off relying on something more like this, something with a higher data-to-ink ratio, as Edward Tufte would say. Although you might also think we could do a little better, maybe something like this. This is a Cleveland dot plot. Here, the x-axis doesn't go to zero, it just covers the range of the data. That's ggplot's default behavior, and it's the default behavior in most modern visualization toolkits.

By the way, here's some advice. Any graph with a categorical axis and a continuous axis is really a kind of table in disguise. So put your categories on the y-axis, that is, in the rows, and the continuous value on the x-axis. No more faffing about trying to rotate your text or your head. For many audiences, a graph like this is perfectly fine, but it's easy to think of cases or audiences where the reaction might be quite different.
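In ggplot that advice is a one-liner. A minimal sketch, with data invented purely for illustration:

```r
# Categories in the rows, the continuous value on the x-axis, sorted by
# value. The countries and values here are made up for illustration.
library(ggplot2)

df <- data.frame(
  country = c("Austria", "Belgium", "Denmark", "Finland", "Ireland"),
  value   = c(42, 35, 58, 47, 31)
)

ggplot(df, aes(x = value, y = reorder(country, value))) +
  geom_point(size = 2) +
  labs(x = "Value", y = NULL)
```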

Here's a graph comparing all-cause mortality in 2020 to the years 2015 to 2019 for the United States. If I share this on social media, I guarantee you that someone will yell at me that I'm being misleading because the y-axis doesn't go to zero. They will demand that it should be changed, or they will personally insult me. I guess I should say, and they will personally insult me. I can guarantee this because it's happened several times. The people who complain insist that it should look like this instead.

This is worse. It's not wrong, but it's worse. There's no reason, there's no real need for the y-axis to go to zero. For one thing, the relevant baseline comparison is already in the graph. It's the gray lines that show what's happening for the five years prior to COVID. And second, there are over 340 million people living in the United States, and nobody reasonably thinks that there are weeks when nobody dies.
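In code, the difference between the two versions is a single line. A sketch with invented numbers, since forcing the axis to include zero is an explicit choice in ggplot rather than the default:

```r
# The default axis covers the range of the data; the zero-baseline
# version the complainers demand takes one extra line. Invented data.
library(ggplot2)

set.seed(1)
deaths <- data.frame(
  week  = 1:52,
  count = 55000 + 5000 * sin(1:52 / 8) + rnorm(52, sd = 800)
)

p <- ggplot(deaths, aes(x = week, y = count)) + geom_line()

p                         # default: axis fits the data
p + expand_limits(y = 0)  # the "worse, but not wrong" version
```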

The thing is, though, you can't do much in the graph itself to make this point to someone who is looking at it. Maybe they know and accept the point, maybe they don't. If you have the time, I suppose you could explain why any kind of iron rule that says the y-axis should go to zero would be unreasonable. Now we can look at this. For instance, here's a graph of a long time series of mean daily global sea surface temperatures with the baseline set to zero degrees Kelvin. This was calculated from real data, by the way. Over 17 million spatially weighted observations.

But you can't explain this personally to everyone who sees your COVID graph, and it probably wouldn't help to put a note at the bottom either. I do have some sympathy for the people who get upset about baselines, because, for one thing, at least they're reading the axis labels, which happens less often than you'd think, and maybe people have tried to put one over on them in the past. But my point is just that there's only so much you can do with any particular image to establish your trustworthiness. The design of the graph can't do it for you by itself.

The real problem, as always, is people. But they're also the solution. It's like Homer and beer. The pictures we make are about something. They're made for someone to look at. So we'd like to know, how much does the person know about the data that we're drawing? How much do they know about the way that I'm drawing it? Sometimes the only person looking at what you make is you, but you should still ask that question, because people have a very strong tendency to go straight to the most complicated thing that they kind of know how to do, whether they're drawing a graph or fitting a statistical model.

But in any case, someone is looking at it. And so there's a relational aspect to any informative visualization. In terms of the commitments involved, it's a two-sided relation. People may make some demands of you, but if we're making a thing, we can also have some expectations. At the level of the graph itself, there's a set of conventions of representation that your audience needs to understand in order to comprehend the graph at all. And as I've said, a lot of the basics of this get handled to a very high standard by default in good software, but the people looking at it still have to understand it, and that can be more challenging than you imagine. So it depends on what you can expect from your audience.

And beyond that, there's whatever meaning the information you're providing carries for the person looking at it. Those expectations that they have, the setting in which they encounter it, the associations they have with it, usually you have very little control over any of that.

The social life of images

These days, in fact, one of the decisive features of the social life of images is the way that their context can collapse completely, very fast. Last year, I wasn't able to see the total eclipse, and the day after it happened, I went to NOAA, the National Oceanic and Atmospheric Administration, and downloaded a sequence of images from GOES-East, the geostationary satellite that sits over the Earth and is where much US weather imagery comes from. A new GOES-East image gets posted every 10 minutes, so you can stitch them together into a little animation. So there's the eclipse from space. It's pretty nice. I posted this on my website, and the next day it ended up on the front page of Hacker News. If you don't know what that is, well, I envy you in the purity of your innocence.
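Stitching frames like that into an animation takes only a few lines. A sketch using the magick package, with a placeholder directory name standing in for wherever the downloaded frames were saved:

```r
# Stitch a folder of downloaded satellite frames into an animated GIF.
# "goes_east_frames" is a placeholder path, not NOAA's actual feed.
library(magick)

frames <- list.files("goes_east_frames", pattern = "\\.png$", full.names = TRUE)
imgs   <- image_read(sort(frames))       # one multi-frame image stack
anim   <- image_animate(imgs, fps = 10)  # 10 frames per second
image_write(anim, "eclipse_from_space.gif")
```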

So I went and looked at the comments, which is a rookie mistake, and this was one that jumped out. Now, on the one hand, leading with "is that animation fake" is a little obnoxious. On the other hand, it's kind of a reasonable question to ask of a context-free image. I mean, this person is right. You can see the clouds on the dark side of the Earth in that image if you rerun it, and you can't really see the clouds at night from space like this. So what's happening? Well, the camera on GOES-East isn't a Kodak Instamatic from 1968 or something. It's an array of image sensors that look at different parts of the spectrum. Some of those are in the visible range, but other sensors look at cloud coverage via the infrared, and so you do some image processing and you get a composite image where you can see both the light and the dark parts at once.

What commitments have I made to internet randos on Hacker News? Well, to be honest, none. I made the image for my own benefit. I put it on my website and left it at that. But then the next day, when I woke up and found to my horror that it had gotten tens of thousands of views, I did add a note explaining the image a little, providing links to the GOES satellite and its image feed. So I suppose I did feel that I owed the audience in general, if not one particularly querulous commenter, a bit. But there are limits, sharp limits, I think, to how much more I owe them.

What's at stake here, I think, in general, is what you might call a social production of trust and credibility alongside the organizational and institutional production of data. Data visualizations get made and shared. The data comes from somewhere. The image is made by someone using some pipeline of software tools. What we want to know is whether that process is reliable and whether the people behind it are trustworthy. And even though you can partially, I contend, institutionalize and formalize many aspects of that process, at its core you can't fully automate it.

Now don't get me wrong, software can help a lot, especially with reliability. I said earlier that reliability and reproducibility were related. Resources like CRAN, and tools like pak and renv and targets, can do a tremendous amount to impose order on the unruly business of analysis and reporting and visualization pipelines. But even though these tools give us a lot, their creators don't present them as some sort of panacea. They can't, and they're not meant to, eliminate the need for, or automate the process of, discovery. They can't, and they're not meant to, eliminate the need for people who know what they're doing. For one thing, you need some expertise to use these things properly. And that's why I think of these issues less in terms of transparency as a sort of abstract standard, and more in terms of webs of relations and ongoing commitments between people who know what they're doing. Transparency is no good if you don't know what you're looking at.

A second problem is that once you start working backwards along the causal chain, the requirements for full reproducibility and full automation never end. As Carl Sagan remarked, to make an apple pie from scratch, first you must create the universe.

In practice, even when you have a fantastic, fully versioned, cryptographically signed, transparently organized build process that can bootstrap and validate an analysis all the way from the bare metal to figure one, those tools still ground themselves out socially in groups of real people who know what they're doing and who aren't cheating.

Three hard questions

Although, recently, we've been facing three hard questions, questions that I'll say right away I do not have easy answers to. The first is, what happens when people have a thought that goes, hey, what if I just made up my data? What if I just pretended to do the thing I said I did?

Our chart men might have been fooling their audiences with their stories of pullbacks and head and shoulders and resistance patterns and support levels, but at least they were fooling themselves as well. There wasn't anything fraudulent about their data either, or about what they did with it. They weren't frauds, they were just wrong. That's not the world we live in now. There are very strong temptations, or, as we now politely say, incentives, to do the wrong thing. Some of those incentives are just money, some of them involve the undeniable attraction of being up on stages like this. It's very nice.

But not only that, thanks to the very same tools we use to analyze and validate real data, it's now easier than ever to generate streams and scores that are plausible looking, but fraudulent. As I say, I do not have an easy answer to this, except to say that it's a real problem. Science runs on trust, and it is not set up to detect fraud. If you had to check fully the authenticity of literally every piece of data that came across your desk, every single piece of research, not just whether it was done correctly or in line with your standards, but whether the numbers were even real, then everything we do would grind to a halt immediately. Most people meet most of their commitments most of the time. They're professionals, they believe in what they're doing, and they do the right thing. But the flip side to having a system that necessarily runs on trust is that some people take advantage of that trust and exploit it. When you do find people who are cheating, you have to punish them.

The second question goes something like, hey, what if I could just replace all those messy and difficult people with a robot? Maybe a robot running on hardware at least as powerful as my home visualization workstation here. Specifically, what if I could get a large language model to read just a tremendous quantity of stuff which I got from somewhere, and then I could ask it questions and have it fluently tell me the answer? Or at least have it emit output that the biggest matrix you've ever seen in your life says is what an answer probably looks like. Hey, I mean, if not answer, why answer-shaped?

In the language that I've been using here, LLMs, I'd say, face challenges both to reliability and trustworthiness. On the reliability side, when you want to take advantage of the tremendous leverage they undeniably offer for certain tasks, you have to proceed with a fair amount of caution just because of how they work. Everything becomes a measurement problem, which we're pretty used to in the world of data analysis. The ways these things can fail, though, can be quite unexpected and alien, and we heard about some of those challenges yesterday and how to address them maybe.

Meanwhile, on the trustworthiness side, at least in their current consumer-facing form, you have other problems. You may recall an incident from a few months ago when a vibe-coding entrepreneur had a bad experience with an AI he was using. At one point, he was so annoyed to discover that it was faking unit tests that he made it write him an apology letter. And then a bit later, when it deleted his production database, the AI explained how it had violated everyone's trust and was like really totally super sorry. It sounded very sincere, too.

Now, I think that asking an AI to write you an apology letter is a little bit like me fining my alarm clock $20 when it doesn't wake me up. But you can see how people end up here. LLMs talk to you as if they were capable of making and abiding by commitments in the way a person is. They very strongly encourage the intentional stance that people are naturally disposed to take, that disposition to interact with anything that looks like it might be conscious as if it really was. What LLMs often find very hard to do is to say: no, that's not something I can reliably do, or, I don't know the answer, or, actually I should just leave this blank. That is, to refuse to take on commitments that they can't meet, the way a trustworthy person would refuse to take on commitments that they can't meet. Instead, if a task is nominally within its scope, an LLM will often keep insisting that it has things figured out this time.

Now, you might reply: eh, skill issue. Engineer your prompt so that it doesn't do that. Fair enough, and I'm not saying it can't happen. But then you are in a strange sort of world where you're trying to evoke quasi-trustworthy behavior by writing the robot a begging letter and then sort of rolling a d20 against your charisma, while still being personally responsible for everything that it does.

Right now, it does seem like this tendency is built into the current generation of LLMs. A recent paper, just this month, by a group of OpenAI researchers is consistent with a lot of prior research on this topic. The authors note that LLMs have a really hard time saying "I don't know," and they discuss how this has its origins in pre-training and why it persists in the face of attempts to eliminate it in post-training. One of several forces pushing things in that direction is the way that benchmarking consistently rewards providing an answer over saying "I don't know." Of course, this might improve or be solved in the future, and there are ways it can be more or less mitigated. But right now, it's kind of built into AIs that they tend to overcommit in this way, and that's what makes it hard to trust them. In a way, it's like the HAL we've ended up with is untrustworthy like the old HAL, but for totally different reasons, like for the opposite reason.

Half the struggle is inductively figuring out what commitments the robot can reliably keep.

The third question is about where data comes from. If we take the perspective of, let's say, the last two centuries, there's been an absolutely mind-boggling expansion in the capacity of organizations to collect, maintain, and analyze increasingly large quantities of data. That began with the expansion of states as compulsory organizations looking to collect information about their own populations and territories. It was enhanced by other formal organizations, like the modern corporation doing similar things on a smaller scale, and it was supercharged by a series of revolutions in information technology that we're still living through. If we take the perspective of, let's say, the last two fiscal quarters, then a lot of things people have taken for granted about the state's continued capacity for the collection and analysis of data are very much in question. That word mere in the phrase mere reliability, the one that the philosophers use all the time, suddenly appears a little complacent as questions of institutional reliability and trustworthiness are on everyone's mind.

All I want to emphasize here is the tremendous degree of reliance that we have on a data collection infrastructure that consists mostly of public institutions, or the products of work funded by public institutions and done by people trained at public institutions, and on tools that are produced and contributed to voluntarily, or engineered and maintained in the public domain.

Sea surface temperatures and the bucket correction

One last example of the difficulties involved with getting a stream of data, accurately producing a score, and telling a visual story in a trustworthy way. NOAA is one of several international organizations that has collected, managed, and distributed climate data over the years. Thanks to those efforts, we have a number of carefully maintained time series of sea surface temperature measurements, some going back well into the 19th century. Some years ago, there was quite a bit of debate in the scientific literature about how this particular time series didn't quite line up with other data that we had from other sources of measurement. In particular, the sea surface data seemed to indicate a kind of pause or plateau in warming in the post-war period, beginning around 1940, when other indicators did not.

Some very careful work by various scientists, not all of them working in the same group or anything, but a lot of independent work, established what was going on. It's one of the basic but repeatedly forgotten lessons of the sociology of science that measurements come from somewhere and are eventually turned into data. The sociologist Bruno Latour liked to think that there was a moment when they passed from being a thing in the world into being a record.

How do you measure the ocean surface temperature in the 1800s? If you're part of the Royal Navy and oceans are battlefields, you do it with a wooden bucket thrown over the side of your ship. Eventually, though, sailing ships get replaced by steam and then diesel engines, and the method of getting the temperature changes. On newer ships, you measure the seawater that's pumped into the engine room to cool the engines, before it's used for cooling, and water measured that way tends to be warmer than water hoisted up in a wooden bucket, because a wooden bucket is not a good insulator. Then later, someone digitizes a lot of U.S. Navy logbooks and adds them to the time series. That makes for more warm-biased measurements, especially for years when there are suddenly a lot more American Navy ships floating around in the North Atlantic, like starting around 1940, for some reason. And so you get a spike in apparent ocean temperatures, which makes subsequent years look like they plateau for a while, and thus the so-called bucket correction is born. Our score needs to be adjusted.

Later, temperature sensing gets done in other ways that are more comprehensive, more standardized, more automated. The end result, or one end result, is me taking advantage of all this careful work, instead of remodeling my kitchen, to draw a picture on my laptop.

Actually, more than a picture, we can make an animation with this. This is an animated recreation of a graph you might have seen circulating in a few places. I'll set it running. We're just tracing changes in daily global averages, year after year. The graph employs the standard virtues of ggplot: a layered structure where we try to highlight the elements we're interested in, and where we repeat design elements in a way that lets the viewer follow the structure of the data and the graph more easily. Conceptually, it's just the same as the COVID figures I showed earlier. It's just animated, thanks to Thomas Lin Pedersen's gganimate package.
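A minimal sketch of that layered, animated structure with gganimate. This is not the code behind the talk's figure; the data frame sst, with one row per day and columns year, yday, and temp, is an assumption standing in for the processed NOAA series:

```r
# Trace each year's daily global mean SST, keeping earlier years behind
# as gray context. `sst` (columns: year, yday, temp) is assumed data.
library(ggplot2)
library(gganimate)

p <- ggplot(sst, aes(x = yday, y = temp, group = year)) +
  geom_line(color = "firebrick") +
  labs(x = "Day of year", y = "Mean global SST (°C)",
       title = "Year: {frame_time}") +
  transition_time(year) +          # one frame per year
  shadow_mark(color = "grey80")    # past years remain as context

animate(p, fps = 10)
```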

The thing is, the trustworthiness of this visualization has very little to do with the graph directly. The trustworthiness comes from a web of actors that includes government-sponsored data collection, private sector organizations, universities, scientific institutes, individual technicians, researchers, scientists, engineers, staffers, midshipmen, all making and meeting their commitments to one another. If you think all of that is fake, or a put-up job, there isn't anything I can draw, there's no mark I can put on a page or a screen that can fix that.

Closing thoughts

Data visualization is a terrific tool for learning about streams of data and making sense of the scores that you calculate from them. It's one of many tools that we have in data science. It has some special properties because of its ability to condense those streams and scores into images that you can tell a story with. But that's all the more reason for us to be wary of finding ourselves using it to spin yarns. There are a lot of ways for a story to go wrong. Our tools won't and can't save us by themselves. The important thing is not to lose sight of the collective, cooperative character of the whole enterprise, the part that Vicki Boykis calls normcore tech. That's the thing made out of all those people who just know a bunch of really specific stuff, much of it particular or local knowledge about how something works, or a specific data set, or what something is made of, or why you just broke that by mistake, and how to fix it.

They know a lot about this specific data set, or that particular method, or this weird build system. They're there in your division of labor, just knowing things. Many of you probably are that person for some other people, and we all have someone else that we rely on to be that person for us.

Data is always a bit of a mess. Our methods could always be improved. Our software could always be better. Sorry, Hadley. New technologies can be extremely weird. This collection of organized systems and loose networks, of written procedures and local knowledge, of formal organizations and social ties that sustains what we do, it isn't perfect, because of course it can't be. It's made of people. But it's where trustworthy data visualization, where trust in everyone's work, is produced and sustained. It exists not as some frozen structure, or an abstract set of standards, but as a living mesh of ongoing exchanges and commitments made and met. The challenge, I think, is keeping that trust alive in the world where we all now live and work. Thank you very much for your time.

Q&A

Thanks, Kieran. So we have time for a few questions. Remember, if you want to ask Kieran a question, you can find the Slido link in the app. I thought I would start with a personal question. I noticed you shared a number of ggplots, and none of them used the default theme. Why do you hate me?

When someone sets such a high standard, you don't want to confront it head-on. You've got to kind of ride around it.

Okay, real question. Data collection and presentation has been highly politicized and that will likely continue in the near future. How can we as data consumers survive in this environment?

You have to know where your data comes from, and who's making the visualizations, and why they're making them. I mean, I think the whole thrust of the talk is that both as data consumers and as data producers, the graph can't sustain the weight of managing the politics of truth in a polarized society. It just can't, and so we need to think about how everything else is organized too, the layers underneath. The visualization is like the tip of the spear. It's the thing that is easiest to circulate, for context to collapse around, and to become the focus of conflict. But really what we want to look at is the stuff underneath that generates the trust, or the lack thereof.

How can visualization help when segments of the population have become actively hostile to anything fact-based?

It can help in the same way that all the other tools we have can help: original data collection, careful validation, tables, models. But all of those things depend on an ongoing network of commitments that people are making to one another, and the result is that the formal products, all of those formal outputs, are not sufficient by themselves. There's no magical force that they have. It doesn't matter whether you use the default theme in ggplot or some super fancy one. That's not going to be enough to change somebody's mind. It's an important part, though. Now, it's not that it's not true: at the margin, you can change people's minds with data. But I think that in order to get to that point, you still have to be in front of people who are at least in principle receptive to the idea that their mind might be changed, or who are open to the idea of new evidence. If that is, by hypothesis in the question, fully closed off, then almost by definition there's nothing you can do. But I refuse to believe that we're quite that badly off just yet.

With AI now able to create images on its own, what does that mean? Do you think that makes reproducibility and open source even more important?

It makes it more important, and it makes it sort of weird and difficult, because it's clear that what's happening, I mean, a company like Posit, and you're representative of everybody is facing this challenge, but you think of organizations like this as really having built their reputation over the last decade on reproducibility and code. That's why you're supposed to switch to RStudio and move away from Excel and things like that. It's because you can write code that