Resources

Ahmadou Dicko | Humanitarian Data Science with R | RStudio

Humanitarian actors are increasingly using data to drive their decisions. Since the Haiti 2010 earthquake, the volume of data collected and used by humanitarians has been growing exponentially and organizations are now relying on data specialists to turn all this data into life-saving data products. These data products are created by teams using proprietary point-and-click software. The process from the raw data to the final data product involves a lot of clicking, copying and pasting and is usually not reproducible. Another approach to humanitarian data science is possible using R. In this talk, I will show how to seamlessly develop reproducible, reusable humanitarian data products using the tidyverse, rmarkdown and some domain-focused R packages.

About Ahmadou: Ahmadou Dicko is a statistics and data analysis officer at the United Nations High Commissioner for Refugees (UNHCR), where he uses statistics and data science to help safeguard the rights and well-being of refugees in West and Central Africa. He has extensive experience in the use of statistics and data science in development and humanitarian projects. Ahmadou was the lead of the OCHA Center for Humanitarian Data team for West and Central Africa and has worked with several humanitarian and development organizations such as IFRC, FAO, IAEA and OCHA. Ahmadou is an RStudio trainer (https://education.rstudio.com/trainers/) and he is passionate about the R community. He is currently co-organizing the Dakar R User Group (https://www.meetup.com/DakaR-R-User-Group/) and co-leading the AfricaR initiative (https://africa-r.org/).

Transcript

This transcript was generated automatically and may contain errors.

Hi everyone, I'm Ahmadou Dicko and I want to talk to you today about Humanitarian Data Science with R, particularly how to build reproducible humanitarian reports. I found this topic really relevant today because of the rise in the number of crises all around the world due to disasters, conflict, epidemics and so on.

And you have places in crisis for decades. What you can see in this picture is a camp of displaced people in the Central African Republic. Central African Republic is a good example. In this country of 4.7 million people, one out of four people are currently displaced due to a crisis that lasted more than 10 years. Half of the population are currently food insecure. It's a lot and there's a lot of people in need.

And as you can expect, a lot of organizations are on the ground trying to help. But at the same time, in Central African Republic, there's also a lot of incidents and attacks targeting humanitarians. Just last year, more than 300 incidents were recorded, affecting mostly humanitarians. And it's a lot. I think it's really one of the hardest places to work for a humanitarian worker. So people out there are really doing a great job and it's not really easy.

But how do I know all of this? How do I know all these numbers, all these figures? I can say it's because I care and I lived, for example, nearly 11 years in the center of Africa, not in this country, but really close. That's part of the reason. I can also say that I'm working for a humanitarian organization. It's also part of why I know this. But the real reason why I know all of this with all this precision is really because I'm reading reports.

And these humanitarian reports are more and more analytical and data-driven and can be seen really as a data product. So this humanitarian data product, I can share an example. For example, you can have this humanitarian situation document by OCHA, which is a humanitarian organization in Central African Republic. They have a really great team doing an amazing job there. And you can see, for example, in this document that 2.6 million people are currently in need. And out of this 2.6, 1.8 million are targeted by organizations.

The targeting can really go deep, and with the maps that they are sharing you can even see, within the country, how many people are assisted and where more people are assisted than in other places. And this is really key information that is useful for people on the ground and also for people elsewhere, like me, who want to understand what's going on in this country. And as you can see, this product is really well crafted and well designed. And you can see a lot of data in there.

The humanitarian data team

So behind each data product, you have a data team. And humanitarian data teams are as diverse as can be. But we have some typical roles in the team. We can start with the analyst, and let's call her Umu. Our analyst is the one analyzing the data, creating tables and making visuals. She's also in charge of anything related to the data, particularly tabular data. And she's using Excel and Power BI to do that.

And her colleague is a GIS officer and his name is Ali. Any spatial type of analysis, making maps and everything, is usually his duty. And he's using QGIS and ArcGIS to do that. Now that we have the core of the information we need, it cannot just be shared as it is. We need a graphic designer to put all these pieces together into a single document, but also to look at the design of the tables, charts and maps and to create infographics. So we have a graphic designer named Javier. And Javier is really good at using Adobe Illustrator and Adobe InDesign.

The typical workflow and its challenges

So they have these tools that they have mastered and use very well. And these tools allow them to create documents like the one you just saw. But let's go further and see how it works. In a typical setting, Umu will get the raw data. She will crunch the data using Excel and Power BI and create some tables and some charts. At the same time, our GIS officer Ali will do the same: he will take spatial data and create some maps.

Now that they have maps, tables and charts, this information will be shared with the designer. And he will be the one combining all this information into a document. And they can go back and forth until they're satisfied with the final product. And it can then be shared internally with other units in the organization or externally with other partners.

Now what could go wrong with this setup? Many things, but one thing that comes to my mind is timeliness. We want this document to be ready as soon as possible, because it's really life-saving information, and the sooner it can be validated and used, the better, right? But then say we have a draft that we share with external partners, NGOs, the government or a UN organization, and they all have input. The government can say, oh, the data is outdated and we have a new version, you have to change the data that you used.

And maybe the NGOs or the UN organization can say, the methodology you use is good, but I think this other one is the one validated now and we have to change the approach. So all these changes are more than just fixing a typo in the document: you have to go back to the raw data and change the whole document, right? But then doing this, you have to keep track of what you did first. And I think that's really key when you have to apply some changes. And when you are using point-and-click tools, it can be hard, not impossible, but really hard to track past changes applied to the data. And because it's a process, you can go back and forth like that for many rounds, and then it's easier to make mistakes and not spot them.

A better approach with R

So what we can ask now, and this is really the question, is: is it possible to do better? And the answer is yes. And there are many approaches to reduce the time we spend building this document and the time we spend making corrections when we have to revise it. If you look at the tools they are using, our analysts use mostly Excel and GIS, and the designer is mostly around the Adobe suite, and that's fine. But what we want to achieve is something that allows more transparency. And one thing that comes to mind is reproducibility.

And one tool that allows us to have more reproducibility is R. R is a good choice because the ecosystem is really rich. We have a lot of packages to do analysis, make maps, and create beautiful tables and charts. And with R Markdown, we have one tool to create really any type of document, and our designer, with his skills using, for example, CSS, can create a really nice-looking template, right? So we are pushing for a shift, and let's see how the workflow will change.

Because now what we'll do is move to a more R Markdown based workflow for all our reports. Our reports used to be based on point-and-click tools, and that's fine. Now they will be based on these R packages, and our inputs will not be what we used to send to the designer, the tables, the charts and the maps. Instead, the report will be a living document inside an R Markdown document.

So the tidyverse will be used to crunch the data and create summary tables, and the output can be, for example, a table from the gt or flextable package, or a ggplot2 chart for any type of graph. At the same time, any spatial data will be analyzed using the sf package or the raster package and can be visualized using ggplot2 too. But now all this analysis will live in one single R Markdown document, for example, in this case.

And it brings a lot of transparency because we can all see what is in the document, and it also helps us keep track of it because we can use other nice tools like version control, which I don't want to go into too much. But there are a lot of goodies that come with this approach. And then, when we are happy with our R Markdown document, we have our final document, built on a nice template that Javier provided. And the final output for external users will be totally the same, because with a template you can have a nice-looking document.

They will not see any difference because for them it won't look really different. And that's really the key here because we don't want this document to be different because they are already looking really good but internally we can make this change.

A worked example

Let's take an example to see how we can achieve something like that. It's a really hypothetical example. Let's say we want a one-pager on conflict-induced displacement, the number of conflicts per admin one area, and an overview of the funding situation in the country. So that's usually three or four types of information we need. Let's see how we can achieve this. First, we need some data.

So for all these examples, we will take data from a platform named HDX. HDX is one of the platforms most used by humanitarians to access data, because the data shared there is all about humanitarian work. So you have really a lot of information. So Umu, our analyst, will go there, and she will use the rhdx package, which is really tidyverse friendly. And in three lines of code, she can search for data sets, find the data set, and read it directly. So in four lines of code, you have access to the table that you want to use.

And now that we have the table, it's easy to plug it into ggplot2. So in less than 10 lines of code, you have this conflict-induced displacement time series. And we can see, for example, that 2013 was really the worst year in the country in terms of the stock of people displaced due to conflict. Now we can use the flextable package to create a table, because Umu is not just making charts but also tables for the report. And we use the same website to get information about requirements and funding. And this data set, for example, is updated on a daily basis. So the information is really up-to-date.
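The time-series step described above can be sketched roughly as follows. This is only an illustration: the data frame `displacement` and its figures are invented stand-ins for the HDX table, and the column names `year` and `idps` are assumptions, not the names used in the talk.

```r
library(ggplot2)

# Made-up figures for illustration only; not real displacement data.
displacement <- data.frame(
  year = 2010:2015,
  idps = c(100, 120, 150, 900, 450, 400) * 1000
)

# Line chart of the displaced population per year.
p <- ggplot(displacement, aes(x = year, y = idps)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(
    # Show plain numbers with thousands separators instead of 9e+05.
    labels = function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
  ) +
  labs(
    title = "Conflict-induced displacement (illustrative data)",
    x = NULL,
    y = "Internally displaced people"
  ) +
  theme_minimal()

p
```

Because the chart is built from code, rerunning the script after the source table is updated regenerates the figure without any manual steps.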

And you really have access to key columns. You can use the dplyr package to select the columns you need and filter, because we want just the past four years. And then use flextable. And in a few lines of code, again, we have our table. So that's really a change in the way of working, because in just a few lines of code, you have most of the product you need, right?
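A minimal sketch of this select-and-filter step, assuming a hypothetical `funding` table whose column names and amounts are invented for illustration. The flextable call is guarded so that the data preparation still runs when that package is not installed.

```r
library(dplyr)

# Hypothetical funding table; column names and amounts are invented.
funding <- data.frame(
  year = 2015:2020,
  requirements_usd = c(613, 532, 497, 516, 431, 554) * 1e6,
  funding_usd = c(325, 199, 204, 254, 300, 380) * 1e6,
  notes = "illustrative"
)

latest_year <- max(funding$year)

funding_table <- funding %>%
  select(year, requirements_usd, funding_usd) %>%  # keep only the columns needed
  filter(year > latest_year - 4) %>%               # keep the past four years
  mutate(percent_funded = round(100 * funding_usd / requirements_usd))

# Turn the data frame into a formatted table only if flextable is available.
if (requireNamespace("flextable", quietly = TRUE)) {
  ft <- flextable::autofit(flextable::flextable(funding_table))
}

funding_table
```

Computing `latest_year` from the data, rather than hard-coding it, means the "past four years" window moves forward on its own when a new year is added to the source.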

And we will follow the same process for the map. Our GIS officer, Ali, will go to HDX and pull the data set he needs, read the data set, and put it into ggplot2. And here we can see, for example, that it is not that difficult to access a shapefile of the admin one level. Now what can be more tricky is for him to access the conflict data and put it on the map. By default, on HDX, the conflict data is not spatial, right? But you have longitude and latitude columns.

So what he will do is access the data from HDX, filter the year that he wants using dplyr, count the number of incidents by coordinates, and just turn the data into spatial data. So in a few lines of code, he is doing something that used to take a lot of time and maybe involve a lot of tools. Maybe one part would be done in Excel, another part in GIS, and so on. So it's really, really important that in just a few lines, using the magrittr pipe, we can do a streamlined type of analysis.
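The filter, count and convert steps just described might look like the sketch below. The `incidents` records are invented, and the column names `year`, `longitude` and `latitude` are assumptions about how such a table could be laid out. The coordinates are assumed to be in WGS 84 (EPSG:4326).

```r
library(dplyr)
library(sf)

# Invented incident records standing in for the conflict data set.
incidents <- data.frame(
  year      = c(2019, 2019, 2019, 2018, 2019),
  longitude = c(18.56, 18.56, 20.67, 18.56, 22.78),
  latitude  = c(4.36, 4.36, 5.76, 4.36, 6.54)
)

incident_points <- incidents %>%
  filter(year == 2019) %>%                                     # keep one year
  count(longitude, latitude, name = "n_incidents") %>%         # incidents per location
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)    # plain table to sf points

incident_points
```

After `st_as_sf()`, the longitude and latitude columns are replaced by a geometry column, so the result can be passed directly to spatial functions and to `geom_sf()`.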

And then once we have the conflict data and our background map, we can just combine them using the sf package. Again, it's not that difficult if you know how to use the sf package. And with ggplot2, in two or three lines of code, we have our map. And this map already shows a lot. You see where you have more conflict and where you have less conflict. And that's really what we wanted at the beginning.
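As a rough sketch of combining a background layer with incident points, the example below uses two made-up rectangles in place of a real admin one shapefile. The helper `square()`, the area names and all coordinates are hypothetical.

```r
library(sf)
library(ggplot2)

# Hypothetical helper: build a closed rectangular polygon from its corners.
square <- function(xmin, ymin, xmax, ymax) {
  st_polygon(list(rbind(
    c(xmin, ymin), c(xmax, ymin), c(xmax, ymax), c(xmin, ymax), c(xmin, ymin)
  )))
}

# Two made-up "admin 1" areas standing in for a real shapefile.
admin1 <- st_sf(
  admin1_name = c("West", "East"),
  geometry = st_sfc(square(15, 3, 20, 8), square(20, 3, 25, 8), crs = 4326)
)

# Invented incident locations with a count per location.
incident_points <- st_as_sf(
  data.frame(
    n_incidents = c(2, 1, 1),
    longitude   = c(18.56, 20.67, 22.78),
    latitude    = c(4.36, 5.76, 6.54)
  ),
  coords = c("longitude", "latitude"), crs = 4326
)

# Number of incident locations falling inside each area.
admin1$n_locations <- lengths(st_intersects(admin1, incident_points))

# Background polygons first, then points sized by number of incidents.
conflict_map <- ggplot() +
  geom_sf(data = admin1, fill = "grey95", colour = "grey60") +
  geom_sf(data = incident_points, aes(size = n_incidents),
          colour = "firebrick", alpha = 0.7) +
  theme_void()

conflict_map
```

Both layers share the same coordinate reference system here. With real data, it is worth checking `st_crs()` on each layer and using `st_transform()` if they differ.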

Now that we have all this, let's combine everything into our document and see how it looks on one page. Well, it looks okay. And I think we can already share it, and you can already get the information from this. But we said something really key and important, right? The designer, Javier, is the one working on a template. And what we want is for our product to look the same and to follow the chart policy, the colors, everything the organization requires, because sometimes they say, okay, you have to use this color because that's our brand, and that's fine.

So using the template that Javier shared with us, it looks totally different, and I think it looks much better. The charts are really better, the colors and everything. And this one will probably be the one that other partners will see. And they will not notice any big change compared to what used to be produced using other tools.

The case for reproducibility in humanitarian work

And that part, I want to emphasize, is really key. So here we can see the situation, and to be honest, it's not really a realistic product. It's just a quick example to show you how it works. And in real life, we have a lot of organizations out there now using similar workflows and investing more and more to do that. In real life, reports are much more complex. They can sometimes be 50 pages or even more. The product can also be a dashboard. I didn't talk too much about dashboards, but we can also shift from Power BI or Tableau to Shiny. There are really a lot of things we can do.

I was just focusing on reports because I think there is a lot of potential. And as you can see, the flexibility of R Markdown, the fact that you can build the template you want, allows your team to be more effective. And I will stop there. But just before that, I want to add that I am noticing that humanitarian organizations are increasingly using R, and that's a really good thing. It's a good thing because R by default will probably push you to do more in terms of reproducibility and transparency. And I think for humanitarian organizations, that's a really key aspect.

If behind each number, I can tell you exactly what I did, that's really important, because it brings trust between partners, but it also reduces the number of errors we can make. Because you have to keep in mind that it's life-saving information, and one error can cost a lot. So pushing for and using reproducibility and transparency is more than just, oh, I want to use R to do my stuff. It's also pushing to be more accurate, to make fewer errors, and also to apply changes to my reports and my own pipeline more quickly.

So that's really something I think we should aim for, even if it's using another tool. But I think R provides this type of thing more easily. There's a lot of documentation already out there, so it's just easier to invest in R.

Talking about investment, that's the key now, because we may need to work with teams that are already familiar with some tools and teach them how to use R, if they are willing to do that and want to do that, of course. And I think many organizations are using R today and are pushing for that. But we definitely need investment in infrastructure, like RStudio Connect, or maybe a Shiny server if they are using Shiny, but also in training. People have to be trained in the proper way to use all these tools.

And for us, and by us I mean humanitarian people, I think I'm lucky because I'm part of a humanitarian user group that is virtual and online, called the Humanitarian R User Group. We are living on Skype for the moment, and it's cool because you see other people sharing what they are doing using R, and we are asking questions. And it's a really friendly environment, and you can see how R is used more and more by many, many people, and that's really key.

So I will stop there, and I want to thank you again for listening to me. I hope that if there is any data scientist out there who really wants to help and work in the humanitarian sector, they know there is a place for them. So please, if you see an opportunity, apply, because I think that we will be really lucky to have people with your skill set, and you can really make a lot of difference with your skills. Thank you.