
Uncharted Nuuk: Data Exploration in Search of the Unexpected (Emil Malta, Statistics Greenland)
Speaker(s): Emil Malta
Abstract: Good graphs are more than simple decorations. They are effective tools for challenging your intuition, and can uncover engaging narratives along the way. Using examples from official statistics, this talk shows how tools like API packages can query data from official registers, streamlining ways to import and tidy data, so your time is better spent on storytelling with data visualisation. Highlighting Nuuk, a symbol of Greenland's rapid change, I'll demonstrate how to use these tools to turn complex registers into clear, impactful stories. If you work with official statistics, as an educator, journalist or in government, this talk offers practical insights for streamlining workflows and crafting engaging visualizations.
posit::conf(2025)
Transcript
This transcript was generated automatically and may contain errors.
So hello everybody, my name is Emil Malta. I work at Statistics Greenland as a senior statistics consultant. So I'm actually going to be talking a little bit about Nuuk and Greenland today, though only a little bit. The main topic of my talk is about data exploration and the title is Data Exploration in Search of the Unexpected. So mainly what I'm going to tell you about today is some strategies that I have for making effective graphs.
So there's like 20 things that I want to talk to you about today, but I only have 20 minutes to talk. So if you come out of this room having learned at least these three things, I think we're all good. The first thing that I'm going to be talking about is the demographics of Nuuk and Greenland, just to give you a bit of context about what kind of place it is. I realize that there are not that many people from Greenland at this conference, so I'd like to give you a brief introduction to the place at least.
But the main thing that I want you to take away from here is some strategies for finding the right graph and the right story if you have a data set that you're working with. And spoiler alert, there's only going to be one important graph in this talk today. That's at least what I'm trying to make. I'm going to show you a bunch of graphs and I'm going to tell you why I think they don't work, but I am going to end on one final graph that I think is pretty good. So if your mind is completely shut off and you only see the last one, I think it's fine, and then you can see that and hopefully be very impressed by it.
About Nuuk and Statistics Greenland
The first topic that I want to talk about is Nuuk. Nuuk is the capital city of Greenland. I know maybe some people from Denmark have a loose connection to what kind of place it is. It's a very small place, even though it's the capital of a whole country. When I moved there back in the day, it was around 18,000 people. And these days it's closer to maybe 20,000 people, which is kind of small for something that's called the capital of a country. But I grew up in a small town of 1,000 people in a place that was even more isolated, so me moving to Nuuk was actually a very big experience for me. It was like moving into downtown Manhattan, is what I usually call it, even though it's just these 20,000 people.
And 20,000 people might not sound like much, but there is actually some very good data to be found in a place like Nuuk. Because of the history of the place, we piggyback on the Danish Civil Registration System whenever we do any public data collection on these people. So even though there's not that many, we know a lot about them.
And that's going to be maybe counter to what a lot of people in this room think maybe when they think of Greenland and data visualization. Because when I go on Reddit and search for data in Greenland, I usually go in with these high hopes that there's going to be an amazing data visualization that tells an engaging story about some of the awesome data that is present there. But usually what I'm met with are these world maps where all the other countries have these colorful, very nice data visualization stories. And Greenland is always no data, which upsets me, to be honest with you. Because I know that there's a lot of cool stories that you can tell also with the data that's there.
And I skipped over it on the last slide, but my job is actually at Statistics Greenland. It's a national statistics office where we do a lot of different statistics that say something about the population and the society that is there. So we do population, some economy, some education statistics, crime and health, the whole thing actually. So it's all available. It's all right there.
Accessing the data
And you could also look it up if you wanted to. You could do it in a couple of different ways. This is kind of a busy slide, but one way you could collect some data from Statistics Greenland is through our website. We use the same API that is used in a lot of the other Nordic countries. So it's the same one they use in Sweden and Iceland, Finland, and a few other countries I know.
But because I am around the office, there's also an R package that you can call where you can fetch data from the place. So I'm not going to go into too much code in this talk, but if you're interested, there's line number three in the code part right there. That's an API wrapper that I wrote so I could fetch data from the API. And then I can do some tidyverse magic on it to give me a tidy data set. And that's usually how I work also when I make a data visualization, because I use tidy tools.
So the important thing about that data set that I'm showing you right there, because I'm actually going to use it for the rest of the talk, is that it's a population census table on the city of Nuuk that I showed you before. These roughly 18 to 20,000 people every year. Every row in this data set, I've made sure, is one person counted every year on the 1st of January. And we know some things about these people in this data set that I'm working with. We know age, what part of town they live in, where they're born. That's actually very important. What gender we have them registered as. And the time of the census, when we collect this data.
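To make the "one row per person per census date" idea concrete, here is a minimal sketch in Python (the talk itself works in R with tidyverse tools). The column names and every row are made up for illustration and are not Statistics Greenland data; the only point is that, in a table shaped like this, the population for a year is simply the number of rows for that year.

```python
from collections import Counter

# Hypothetical toy rows, NOT real register data.
# Each row is one person counted on 1 January of the given year.
# The column names are assumptions made for this sketch.
rows = [
    {"year": 2023, "age": 34, "district": "A", "birthplace": "Greenland", "gender": "F"},
    {"year": 2023, "age": 47, "district": "B", "birthplace": "Outside Greenland", "gender": "M"},
    {"year": 2024, "age": 35, "district": "A", "birthplace": "Greenland", "gender": "F"},
    {"year": 2024, "age": 48, "district": "B", "birthplace": "Outside Greenland", "gender": "M"},
    {"year": 2024, "age": 0, "district": "C", "birthplace": "Greenland", "gender": "M"},
]


def population_by_year(rows):
    """Count rows per census year; because one row is one person, this is the population."""
    counts = Counter(row["year"] for row in rows)
    return dict(sorted(counts.items()))


yearly = population_by_year(rows)
for year, count in yearly.items():
    print(year, count)
```

These yearly counts are what a simple population line graph would plot, with the year on the x-axis and the count on the y-axis.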
The value of surprise in data visualization
So that's a table, right? And you could show this to someone and hopefully give them some kind of meaning. But the way that my mind works is that you need a graph to actually understand some sort of a table. Some sort of, like, information about the table. I need to see the lines and the points and the bars before I can say that I've truly understood something of the data. And then, again, I'm going to show you a couple of different graphs that I make when I'm trying to dig into the data set and see what kind of story that it has.
And the simplest one that I could start with was this line graph that you can see here. This is a line graph on population growth. And on the x-axis, you show time. And on the y-axis, you have amounts, right? So the basic story that you can get from a picture like this is that there's population growth in Nuuk, which may be a surprise to some of you. But if you live in town, that's not a huge surprise. Like, everybody knows that. We see many new buildings being made. We see a lot of new faces around town. So just the fact that there is population growth in Greenland, that comes as no surprise to anybody who knows a little bit about Greenland, at least, and a little bit about Nuuk. So there's no surprises here.
So what I'm trying to ask, then, is that if there are no surprises, is a graph like this even worth showing? And I would say that it's not, right? I would say that it's not, because a data visualization, or a graph, sorry, needs a surprise before it can be effective.
So this is a very cliched thing to do, to show a slide with a quote on it. I realize that. And what makes it even worse is that it's this quote, because it's a very overused quote when you're learning about data visualization. But let me read it to you. It's, "The greatest value of a picture is when it forces us to notice what we never expected to see."
Now, this is a very old idea. This is something that was said by John Tukey, I think, like 50-ish years ago. And you can interpret the quote in different ways, but the way that I interpret it is that the real value of data visualization is the element of surprise. When you set out to do a graph, challenging your assumptions is the whole point, I would say. When you start a data visualization, when you start a data analysis project, you're probably going to have some ideas about what you're going to find, what your intuition is going to tell you. There's probably all sorts of graphs that you already have in your mind before you see them on the screen. Challenging those assumptions is the whole point. You want to prove yourself wrong when you're doing analysis. Because graphs that prove your point have little to no value in my mind. They have some value, they can be decorative, they can grab people's attention, they're nice to look at. But if you want insight from a graph, you need it to prove you wrong. And that's something that I learned very late in my own data literacy journey.
Another thing is that visualization is also a crucial part of analysis. It's often overlooked. I mean, it looks nice also. But sometimes when you do a data analysis project, maybe your instinct is to go straight for the models and go straight for the hypothesis testing, going straight past the graphs. And I think that you're missing something valuable when you do something like that. Because the graphs are there to tell you which story is relevant, and they're there to tell you where the surprises can be.
Digging into the Nuuk population data
So I'm going to show you a couple of different graphs right here, because this is me digging into what kind of surprises there are in this data set. So this is, again, the population census in Nuuk. But I've gone to see if there's a story maybe in the population growth of the different districts of town. So every color in this area graph is one part of town. So the dark blue one is like downtown Nuuk, and the green one is another part. And this sliver, this light green sliver that appears around 2010, that's a new suburb that was built around town. For people like me, people in their mid-30s having children, to move into. And that's kind of a surprise right here, actually, because there is a bit of a story here that I would probably show to people.
The main takeaway of a graph like this is that you can see that there is population growth, but it's not like homogenous throughout town. But population growth in the new suburb actually happens at the expense of some growth in other parts of town, which, you know, there's a story there, and there's a little surprise for me when I started looking at it. So I would probably show this to somebody.
But looking even more for some surprises, I could draw something like this. This is a bit more complicated graph where, again, you have time on the x-axis, and then you have mean age on the y-axis. And then you can try and draw the geom to see if there's any kind of development in the mean ages across town. And there is maybe a little bit of a story here. But the thing is, because I already have a few different lines and a few different colors, I need to explain to people that there is a summary statistic on one of the axes. I need to explain too much for it to actually be truly effective. And the real thing is also that what this graph actually shows is a bunch of data quality issues, which are going to distract from some of the stories that I want to tell. And also the light green little line right there, that just shows that people in the suburbs are younger. So there's not that many surprises that I can find going this way. So I'm just going to not show this to people. I'm going to show this to you because I want you to understand that this is not a good graph.
But the thing that I can do afterwards is trying to do an even simpler graph. So this is another graph where I have time on the x-axis and just amount on the y-axis again. And I do that because I want to compare the population growth of men and women in town. So the red line right here, that's the women in town. And the blue line in this part is the men in town. Because I would expect them to be around 50-50, right? Especially in a place that's so small, like 18,000 people, I would expect them to be roughly the same amount of each. But I'm being proven wrong on that assumption right here. And that's actually very important. I think you immediately notice that there are about 1,000 more men than women, right? So there is something going on right here, right? So my spidey sense is tingling here because I know that there's a story that I can dig into.
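The comparison behind that two-line graph is just a count per year and gender, plus the difference between the two. A minimal Python sketch, using invented rows rather than the real census table (the talk's own code is in R and is not shown in the transcript):

```python
from collections import Counter

# Hypothetical (year, gender) pairs, one per person; NOT real data.
rows = [
    (2023, "F"), (2023, "M"), (2023, "M"),
    (2024, "F"), (2024, "F"), (2024, "M"), (2024, "M"), (2024, "M"), (2024, "M"),
]


def gender_gap_by_year(rows):
    """Return {year: men minus women}, so a positive value means a surplus of men."""
    counts = Counter(rows)  # keys are (year, gender) pairs
    years = sorted({year for year, _ in counts})
    return {year: counts[(year, "M")] - counts[(year, "F")] for year in years}


print(gender_gap_by_year(rows))
```

If the 50-50 assumption held, every value would sit near zero; a gap that stays large and positive year after year is the kind of surprise worth digging into.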
Visualizing distributions
So my next part is actually trying to draw the visualization of the distribution of the two. Visualizing distributions is how I understand the world around me. And the most detailed distribution graph that I can draw is a distribution of the ages of the men and women in town. So this is maybe like a histogram.
So this is a histogram that I could show. And histograms are cool. They're good. If you're like me, you learn about them in college. And you would think they're intuitive because you're so used to them. But what I actually find is that people don't find them that intuitive, really. Because I could show a graph like this. And because I know ggplot2 really well, I know that a histogram whose fill is split up by two categorical groups is going to show the two of them on top of each other like that. So I've had people come and ask me, are there really twice as many men in town? And that's not the point that I was making. But I can understand why people think that. Because they look at one point where the peak of the women, like the peak of the red histogram there, is only about halfway up the peak of the men.
So having learned that when I showed it to other people, I tried to draw a graph that looked like that instead. Where instead of being on top of each other, I show them in front of each other like this. So they're easier to compare. And there's a story emerging when you're comparing the distributions like this. In the 40s and 50s, there are more men than there are women in town. So that tells me a little something. The surplus of men that I see in Nuuk is explained by this age group that is present right there. So that's something interesting.
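Why does the stacked version get misread? In ggplot2, a histogram with a fill grouping stacks the groups by default, so the top of each bar is the sum of both groups rather than the count for one of them. The Python sketch below uses invented ages (not the Nuuk data) to compute both readings for each age bin: the per-group heights you see when the histograms are drawn in front of each other, and the stacked top.

```python
from collections import Counter


def bin_counts(ages, width=10):
    """Count ages per bin; each key is the lower edge of a bin."""
    return Counter((age // width) * width for age in ages)


def compare_layouts(men, women, width=10):
    """For every bin, return the two group heights and the stacked bar top."""
    men_bins, women_bins = bin_counts(men, width), bin_counts(women, width)
    result = {}
    for lower in sorted(set(men_bins) | set(women_bins)):
        m, w = men_bins[lower], women_bins[lower]
        result[lower] = {"men": m, "women": w, "stacked_top": m + w}
    return result


# Hypothetical ages, purely illustrative.
men = [25, 31, 38, 42, 44, 47, 51, 55, 63]
women = [24, 33, 36, 41, 48, 52, 66]

for lower, heights in compare_layouts(men, women).items():
    print(f"{lower}-{lower + 9}: {heights}")
```

Whenever the two groups are roughly equal in a bin, the lower segment ends about halfway up the stacked bar, which is easy to misread as "half as many". Drawing the groups in front of each other keeps each height equal to that group's own count.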
But again, there's something about showing anything other than time on the x-axis where I lose a couple of people. So when I try and compare distributions across groups, I go for some other graphs than histograms like this. And there's a couple of different options that you have to visualize distributions between groups. One way you could do it is having your two groups, in this case men and women, on the x-axis, and then draw some kind of geom that tells you something about the distribution. And again, if you're like me and have gone to college, you've probably seen a bunch of box plots. That's a very common way of visualizing distributions in statistical texts, especially academic texts. But my audience is usually more general, and they don't really have time to learn what a median or what an IQR is. They don't really have time to be told what the whiskers represent and stuff like that. So I usually try and avoid box plots, especially when I'm going for a more general audience, which is the exciting audience, I feel.
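For readers who do want to know what a box plot encodes, here is a minimal Python sketch of the numbers behind it. The ages are invented, and the quartile method is an assumption: software packages differ slightly in how they interpolate quartiles. The 1.5 × IQR limits are the common convention for how far whiskers may reach, with each whisker drawn to the most extreme data point inside them.

```python
import statistics


def box_summary(values):
    """Median, quartiles, IQR and the 1.5 * IQR limits that bound the whiskers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": q1 - 1.5 * iqr,  # whiskers do not extend below this
        "upper_limit": q3 + 1.5 * iqr,  # whiskers do not extend above this
    }


# Hypothetical ages, purely illustrative.
ages = [20, 30, 40, 50, 60]
print(box_summary(ages))
```

That is six numbers to explain before the audience can read the picture, which is the cost being weighed against plots that show the shape of the data directly.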
So if you've taken an introductory course on data visualization and in ggplot2, you've heard about these things called violins, and they're like box plots, but you capture a bit more about the richness of the distributions. And there's also a bit more intuition that people can draw on. If you compare the two violins between the men and women in this graph, you can sort of see that the men have broader shoulders. I don't mean that as any kind of pun, but you can see that there is something about the shape that gives some kind of intuitive sense, right?
So this is a bit closer to what I want, but really what I've lost here is a sense of the scale between the two, I feel. We already know that there are more men than there are women in town. We saw that in some other graphs, but I would like to show it in one graph. So what I try and do instead is go for a point diagram like this, where every point is a row in the data set, and there's something magical about points. But I like the violins a bit better because there was some intuition that I lose if I don't go for that. So instead, what I do is actually a compromise between the point diagram and the violin, where all the points are scattered within the boundaries of a violin plot. And I think those plots are pretty cool. I would definitely show this to some people, especially because it looks nice and it tells a story for me, but I know that there's another surprise right here, and that's when I draw a graph that looks like that.
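A plot of points scattered inside a violin outline is sometimes called a sina plot. The transcript does not say which function produced it, so the following is only a rough Python sketch of the idea, with two loudly stated simplifications: it uses a simple binned count where a violin would use a smooth density estimate, and the ages are invented. Each point keeps its true age on the y-axis and gets a random horizontal offset whose maximum size grows with how many people share its age bin.

```python
import random
from collections import Counter


def sina_offsets(ages, bin_width=5, max_half_width=0.4, seed=1):
    """Return (x_offset, age) pairs; points in crowded age bins may spread wider.

    Assumes `ages` is non-empty. A binned count stands in for the kernel
    density that a real violin outline would use.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    bins = Counter(age // bin_width for age in ages)
    tallest = max(bins.values())
    points = []
    for age in ages:
        half_width = max_half_width * bins[age // bin_width] / tallest
        points.append((rng.uniform(-half_width, half_width), age))
    return points


# Hypothetical ages for one group, purely illustrative.
ages = [22, 41, 42, 43, 44, 44, 47, 51, 68]
for x, age in sina_offsets(ages):
    print(f"{x:+.3f} {age}")
```

Because every person is still one point, the total amount of ink reflects the size of the group, which is the sense of scale a plain violin loses.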
So this is pretty much the same plot that I had, but I faceted on two different groups, and that tells the entire story for me at this point. So if you look at the men that are born outside of Greenland, like the men in their 40s and 50s, there's so many more of them, and that is actually what explains the difference between the men and women in town there.
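Faceting amounts to splitting the same data by one more variable and looking at each panel separately. A minimal Python sketch of that split, with invented people and assumed category labels (not the real register values), counting those in their 40s and 50s by birthplace and gender:

```python
from collections import Counter

# Hypothetical (gender, age, birthplace) rows; NOT real data.
people = [
    ("M", 45, "Outside Greenland"),
    ("M", 52, "Outside Greenland"),
    ("M", 48, "Outside Greenland"),
    ("M", 44, "Greenland"),
    ("F", 46, "Greenland"),
    ("F", 50, "Greenland"),
    ("F", 43, "Outside Greenland"),
    ("M", 30, "Outside Greenland"),  # outside the age band, so not counted
]


def facet_counts(people, youngest=40, oldest=59):
    """Count people aged youngest..oldest per (birthplace, gender) panel."""
    counts = Counter()
    for gender, age, birthplace in people:
        if youngest <= age <= oldest:
            counts[(birthplace, gender)] += 1
    return counts


for panel, count in sorted(facet_counts(people).items()):
    print(panel, count)
```

In a split like this, a surplus that shows up in only one panel tells you where the overall difference comes from.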
So yeah, I'm gonna skip a few slides, just 10 seconds. I hope you learned a little bit about Greenland and about Nuuk, and what my headspace is like during analysis, and I hope you learned a little bit about the value of surprise. Yes, that's what I want to say.
All right, thank you so much, Emil.
Q&A
Maybe we'll throw in just the one Slido question. And one of them is, where did the cute illustrations in the slide deck come from?
They come from ChatGPT. I think I've seen a bunch of different presentations that have this exact style, but I think I've done something pretty cool with it, because I tried to tell a story with the visuals. I didn't just ask it to do the illustrations for me, I just sort of tell the chat what I want, what the situation is. I have a story in my mind, there's like this little hiking guy going through town, and I see him in different situations where I tell it what situation I want. So yeah, it's a whole art getting it to draw what I want.
All right, thank you very much, Emil. Let's thank Emil again.
