Ahmadou Dicko | Humanitarian Data Science with R

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, I'm Ahmadou Dicko and I want to talk to you today about Humanitarian Data Science with R, particularly how to build reproducible humanitarian reports. I find this topic really relevant today because of the rise in the number of crises all around the world due to disasters, conflict, epidemics and so on.

And you have places in crisis for decades. What you can see in this picture is a camp of displaced people in the Central African Republic. Central African Republic is a good example. In this country of 4.7 million people, one out of four people are currently displaced due to a crisis that lasted more than 10 years. Half of the population are currently food insecure. It's a lot and there's a lot of people in need.

And as you can expect, a lot of organizations are on the ground trying to help. But at the same time, in Central African Republic, there's also a lot of incidents and attacks targeting humanitarians. Just last year, more than 300 incidents were recorded, affecting mostly humanitarians. And it's a lot. I think it's really one of the hardest places to work for a humanitarian worker. So people out there are really doing a great job and it's not really easy.

But how do I know all of this? How do I know all these numbers, all these figures? I can say it's because I care and I lived, for example, nearly 11 years in the center of Africa, not in this country, but really close. That's part of the reason. I can also say that I'm working for a humanitarian organization. It's also part of why I know this. But the real reason why I know all of this with all this precision is really because I'm reading reports.

And these humanitarian reports are more and more analytical and data-driven and can be seen really as a data product. So this humanitarian data product, I can share an example. For example, you can have this humanitarian situation document by OCHA, which is a humanitarian organization in Central African Republic. They have a really great team doing an amazing job there. And you can see, for example, in this document that 2.6 million people are currently in need. And out of this 2.6, 1.8 million are targeted by organizations.

The targeting can go really deep: with the maps they are sharing, you can even see within the country how many people are assisted and where more people are assisted than in other places. This is key information that is useful for people on the ground and also for people elsewhere, like me, who want to understand what's going on in this country. And as you can see, this product is really well crafted and well designed, and you can see a lot of data in there.

They will not see any difference, because for them it won't look really different. And that's really the key here: we don't want these documents to be different, because they already look really good, but internally we can make this change.

A worked example

Let's take an example to see how we can achieve something like that. It's a really hypothetical example. Let's say we want a one-pager on conflict-induced displacement, the number of conflicts per admin one, and the funding situation in the country. So that's usually three or four types of information we need. Let's see how we can achieve this. First, we need some data.

For all of this example, we will take data from a platform named HDX. HDX is one of the most used platforms by humanitarians to access data, because the data shared there is all about humanitarian situations. So you really have a lot of information. So Umu, our analyst, will go there, and she will use the rhdx package, which is really tidyverse friendly. In three lines of code, she can search for a data set, find it, and read it directly. So in about four lines of code, you have access to the table that you want to use.

And now that we have the table, it's easy to plug it into ggplot. So in less than 10 lines of code, you have this conflict-induced displacement time series. And we can see, for example, that in 2013 it was really worse in the country in terms of stock displacement due to conflict. Now we can use the flextable package to create a table, because Umu is not just doing charts but also tables for the report. We use the same website to get information about requirements and funding. This data set, for example, is updated on a daily basis, so the information is really up to date.
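
As a minimal sketch of the time series step, assuming the table has already been read into a data frame: the name `displacement`, its columns `year` and `idps`, and every value below are placeholders invented for illustration, not figures from HDX.

```r
library(ggplot2)

# Placeholder table standing in for the data set read from HDX.
# Column names and values are made up for illustration only.
displacement <- data.frame(
  year = 2010:2015,
  idps = c(10, 20, 30, 25, 15, 12)  # dummy values, not real figures
)

# A line chart of displaced people per year
p <- ggplot(displacement, aes(x = year, y = idps)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Conflict-induced displacement (dummy data)",
    x = NULL,
    y = "Displaced people (dummy units)"
  ) +
  theme_minimal()

p
```

Printing `p` draws the chart; the same object can be placed in an R Markdown chunk so that the figure is rebuilt each time the report is rendered.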

And you really have access to the key columns. You can use the dplyr package to select the columns you need and filter, because we want just the past four years, and then use flextable. In a few lines of code, again, we have our table. So that's really a change in the way of working, because in just a few lines of code you have most of the product you need, right?
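
The select, filter, and table steps could be sketched roughly as follows. The data frame `funding`, its column names, and all of its values are hypothetical stand-ins for the requirements and funding data set, which is not shown in the talk.

```r
library(dplyr)
library(flextable)

# Placeholder table; column names and amounts are made up for illustration only.
funding <- data.frame(
  year             = 2015:2020,
  requirements_usd = c(100, 110, 120, 130, 140, 150),  # dummy values
  funding_usd      = c(50, 60, 60, 70, 80, 90),        # dummy values
  source_note      = "placeholder"                     # a column we do not need
)

# Keep only the needed columns and the four most recent years
recent <- funding %>%
  select(year, requirements_usd, funding_usd) %>%
  filter(year > max(year) - 4) %>%
  mutate(year = as.character(year))  # avoid a thousands separator in the year

# Turn the data frame into a formatted table for the report
ft <- flextable(recent) %>%
  set_header_labels(
    year = "Year",
    requirements_usd = "Requirements (USD)",
    funding_usd = "Funding (USD)"
  ) %>%
  autofit()

ft
```

Filtering relative to `max(year)` rather than hard-coding the years means the table keeps showing the latest four years when the underlying data is refreshed.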

We will do the same process for the map. Our GIS officer, Ali, will go to HDX, pull the data set he needs, read it, and put it into ggplot. Here we can see, for example, that it is not that difficult to access a shapefile of the admin one level. What can be more tricky is for him to access the conflict data and put it on the map. By default, on HDX, the conflict data is not spatial, right? But you have longitude and latitude columns.

So what he will do is access the data from HDX, filter with dplyr the year that he wants, count the number of incidents by coordinates, and turn the data into spatial data. So in a few lines of code, he is doing something that used to take a lot of time and maybe needed to involve a lot of tools. Maybe one part would be done in Excel, another part in GIS software, and so on. So it's really important that with just a few lines using the magrittr pipe, we can do a streamlined type of analysis.
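
A minimal sketch of that pipeline, assuming the conflict table has `year`, `longitude`, and `latitude` columns: the data frame `incidents`, the output column `n_incidents`, and the coordinates are placeholders, and `st_as_sf()` from the sf package is one way to do the conversion, not necessarily the one used in the talk.

```r
library(dplyr)
library(sf)

# Placeholder incident records; years and coordinates are made up for illustration.
incidents <- data.frame(
  year      = c(2019, 2020, 2020, 2020, 2020),
  longitude = c(18.5, 18.5, 18.5, 20.7, 22.4),
  latitude  = c(4.4, 4.4, 4.4, 6.3, 5.0)
)

incident_points <- incidents %>%
  filter(year == 2020) %>%                             # keep the year of interest
  count(longitude, latitude, name = "n_incidents") %>% # incidents per coordinate pair
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)  # WGS 84 lon/lat points

incident_points
```

The result is an sf object with one point per distinct location and a count column, which is the shape ggplot2 needs to draw proportional symbols.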

And once we have the conflict data and our background map, we can combine them using the sf package. Again, it's not that difficult if you know how to use the sf package. And with ggplot2, in two or three lines of code, we have our map. This map already shows a lot: you see where you have more conflict and where you have less. And that's really what we wanted at the beginning.
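
Here is one possible sketch of the combination step. The rectangle `admin1` is a made-up polygon standing in for the admin one shapefile, the point values are dummy data, and the use of `st_join()` to attach the admin name to each point is an assumption about what "combine" means here.

```r
library(sf)
library(ggplot2)

# Made-up rectangle standing in for an admin one boundary read from a shapefile
admin1 <- st_sf(
  admin_name = "Region A",
  geometry = st_sfc(
    st_polygon(list(rbind(c(14, 2), c(28, 2), c(28, 11), c(14, 11), c(14, 2)))),
    crs = 4326
  )
)

# Dummy incident counts by location (placeholder values)
incident_points <- st_as_sf(
  data.frame(
    n_incidents = c(2, 1, 1),
    longitude   = c(18.5, 20.7, 22.4),
    latitude    = c(4.4, 6.3, 5.0)
  ),
  coords = c("longitude", "latitude"), crs = 4326
)

# Spatial join: attach the name of the containing admin unit to each point
points_with_admin <- st_join(incident_points, admin1)

# Background polygons first, then points sized by number of incidents
conflict_map <- ggplot() +
  geom_sf(data = admin1, fill = "grey95") +
  geom_sf(data = points_with_admin, aes(size = n_incidents), alpha = 0.6) +
  theme_void()

conflict_map
```

Because both layers share the same coordinate reference system, `geom_sf()` can overlay them directly.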

Now that we have all this, let's combine everything into our document and see how it looks on one page. Well, it looks okay. And I think we can already share it, so I'm going to get the information from this. But we said something really key and important, right? The designer, Javier, is the one working on a template. What we want is for our product to look the same and to follow the chart policy, the colors, everything the organization requires, because sometimes they say, okay, you have to use this color because that's our brand, and that's fine.

So using the template that Javier shared with us, it looks totally different, and I think it looks much better. The charts are better, the colors and everything. This will probably be the one that other partners will see. And they will not notice any big change compared to what was done before using other tools.

The case for reproducibility in humanitarian work

And that part, I think, is really key, and I want to emphasize it. So here we can see the situation, and to be honest, it's not really a realistic product. It's just a quick example to show you how it works. In real life, a lot of organizations out there are now using similar workflows and investing more and more in them. And in real life, reports are more complex. They can sometimes be 50 pages or even more. They can be dashboards. I didn't talk much about dashboards, but we can also shift from Power BI or Tableau to Shiny. There's really a lot of things we can do.

I was focusing on reports because I think there is a lot of potential. And as you can see, the flexibility of R Markdown, the fact that you can build the template you want, allows your team to be more effective. And I will stop there. But just before that, I want to add that humanitarian organizations are increasingly using R, and that's really a good thing. It's a good thing because R by default will probably push you to do more in terms of reproducibility and transparency. And I think for humanitarian organizations, that's a really key aspect.

If behind each number I can tell you exactly what I did, that's really important, because it brings trust between partners, and it also reduces the number of errors we can make. Because you have to keep in mind that this is life-saving information, and one error can cost a lot. So pushing for reproducibility and transparency is more than just, oh, I want to use R to do my stuff. It's also pushing to be more accurate, make fewer errors, and apply changes to my reports and my own pipeline more quickly.

So that's really something I think we should aim for, even if it's using another tool. But I think R provides this type of thing more easily. There's a lot of documentation already out there, so it's just easier to invest in R.

Talking about investment, that's the key now, because we may need to work with teams that are already familiar with some tools and teach them how to use R, if they are willing and want to do that, of course. And I think many organizations are using R today and are pushing for that. But we definitely need investment in infrastructure, like RStudio Connect, or maybe a Shiny server if they are using Shiny, and also in training. People have to be trained in the proper way to use all these tools.

And I think for us, and by us I mean humanitarian people, I'm lucky because I'm part of a humanitarian user group that is virtual and online, called the Humanitarian R User Group. We are living on Skype for the moment, and it's cool because you see other people sharing what they are doing using R, and we ask questions. It's a really friendly environment, and you can see how R is used more and more by many, many people, and that's really key.

So I will stop there, and I want to thank you again for listening to me. I hope that if there is any data scientist out there who really wants to help and work in the humanitarian sector, there is a place for you. So please, if you see an opportunity, apply, because I think we would be really lucky to have people with your skill set, and you can really make a big difference with your skills. Thank you.