Resources

Democratizing Access to Education Data - posit::conf(2023)

Presented by Erika Tyagi.

Learn how the Urban Institute is making high-quality data more accessible through the Education Data Portal. Every year, government agencies release large amounts of data on schools and colleges, but this information is scattered across various websites and is often difficult to use. To make these data more accessible, the Urban Institute built the Education Data Portal, a freely available one-stop shop for harmonized data and metadata for nearly all major federal education datasets. In this talk, we'll demonstrate how the portal works and share lessons we've learned about making data accessible to users with varying technical skills and preferred programming languages.

The Urban Institute's Education Data Portal: https://educationdata.urban.org

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: End-to-end data science with real-world impact. Session Code: TALK-1145


Transcript

This transcript was generated automatically and may contain errors.

My name is Erika, I'm a Senior Data Engineer at the Urban Institute, and the Urban Institute is a non-profit research organization that provides data and evidence to help advance upward mobility and equity. Today I'll be talking about a tool that my coworkers at the Urban Institute and I built called the Education Data Portal.

Before I get started, I have a confession to make, which is that I am very much not an education data expert. In fact, I know very little about education data. Luckily, I'm not here today to talk about the ins and outs of IPEDS or Common Core data; instead, what I'm here to talk about is how the Education Data Portal bridges the gap between data availability and data accessibility.

Specifically, there are three questions that I want to answer today. The first is what do we mean by this gap between data availability and accessibility? The second, how does the portal bridge this gap so effectively? And lastly, why does this matter?

The gap between data availability and accessibility

So every year, a handful of government agencies do the hard and important work of collecting detailed data on K-12 schools, school districts, and colleges in the U.S. The goal of the Education Data Portal is to put all of this data under a single roof by providing a freely available one-stop shop for all of the major national datasets released by these agencies.

In many cases, these agencies have been collecting this data for a couple of decades, and as you can imagine, there have been lots of changes to data formats, file structures, and classifications over the years. So our goal is to do the hard work of harmonizing that data, so that we can offload that work from the researchers, data scientists, and other folks trying to use it. When I say we, I really mean our army of research assistants who do that hard work. So much kudos to them.

Ultimately, we do this to make it easier for both technical and non-technical users to look at trends over time, combine data from these different sources, and just have that data at their fingertips a little more easily.

A concrete example: tuition data

So this is a little bit abstract, so to provide an example, suppose I'm interested in answering the question of how tuition at my alma mater has risen over the last couple of decades. In my case, I went to a tiny little liberal arts college in Northfield, Minnesota called Carleton, and like many other delightful little liberal arts colleges, I probably have a rough sense that tuition over the last couple of decades looks something like this. However, as I mentioned at the beginning, I work for the Urban Institute, and we take data and evidence very seriously, so I'm not just going to use my priors here. Suppose I want to really get into the data and try to reproduce this plot myself.

So first, suppose I live in a world without the education data portal. I'm also going to be showing a live look at my pain levels as I go through the process of answering this question without the education data portal. The very first thing I'd have to do is just figure out which agency is collecting this data. As I mentioned at the beginning, I'm not an education data expert, so I just have to do some Googling, figure out which website to go to, get to that website.

Next, I'd have to read through the data documentation, assuming that website has data documentation for me to read, to figure out what tables am I looking for, which variables. Maybe I'll type the keyword tuition into some kind of a query, and then get a couple hundred or a couple thousand tables back to figure out just where is the data that I'm looking for on this website.

Next, I'd have to download individual data files for each year. So I think for folks who are used to working with government data, agencies typically report data as CSV files, or depending on the programming language you're using, SAS or Stata files, which can be great if you use those languages, but it also means that in my case, I'd have to download 20-some individual data files, which again, not the end of the world, a little bit annoying, but you move on.

So next, I would load those files into the programming language of my choice. This is what I like to call the choose-your-own data mishap step, and I think you can imagine any number of things happening here. Maybe in the late 90s a variable definition changed, or maybe earlier on there were three different files for this data, but later two, and later just one.

Maybe there's a 99 value for tuition, and while I think it would be great if Carleton tuition cost $99, that more likely means it was a total row, missing data, or some other special encoding. So at this point, you can either cry, or you can reread the data documentation, figure out what you did wrong and what you need to redo, and update your code. Then there's the very fun part of setting yourself a calendar reminder to do the same thing a year from now: go to the government website, download that data file, and, last but certainly not least, hope that nothing has changed.

Hopefully I've made the point that this process is really tedious, really error-prone, and simply not fun.

How the Education Data Portal helps

So enter the Education Data Portal. Suppose you're an R user, like lots of folks in this room might be. Great, we built an R package for you called educationdata. You can download it from CRAN, and with just a single call to the get_education_data() function, you can get a plot that looks remarkably similar to what I had in my brain at the very beginning of this.

Suppose you're not an R user but a Python user. Great, we also built a package for you. Less great is that it's not yet publicly available, but it will be soon. Again, the syntax is very, very similar, and it makes it a lot easier to get this data.
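Even without an official client package, the portal's data is reachable over plain HTTP from any language. As a minimal sketch, assuming a paginated JSON response with a "results" list (a common REST convention, not an official schema from this talk), walking pages might look like this:

```python
def collect_results(pages):
    """Flatten a sequence of paginated API responses into one list of records.

    Assumes each page is a dict with a "results" list; check the portal's
    API documentation for the actual response shape.
    """
    records = []
    for page in pages:
        records.extend(page.get("results", []))
    return records

# Stand-ins for two pages fetched from the API (illustrative values only).
page1 = {"next": "https://educationdata.urban.org/api/v1/...?page=2",
         "results": [{"unitid": 173258, "year": 2000, "tuition": 21925}]}
page2 = {"next": None,
         "results": [{"unitid": 173258, "year": 2001, "tuition": 23010}]}

records = collect_results([page1, page2])
print(len(records))  # 2
```

In a real script you would fetch each page with an HTTP client and follow the "next" link until it is null; the fixture dicts above just stand in for those responses.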

Suppose you're not an R or Python user but a Stata user. I recognize I'm at a conference evangelizing open source technology, and Stata is very much not an open source technology, but ultimately, in the space I work in, I work with a lot of economists and a lot of policy researchers. And you can see that with even fewer lines of code than the R and Python syntax, you get similar plots, prelabeled, which is great. So again, we're just trying to make that data a little more accessible for them.

Suppose, oh gosh, suppose you're not a programmer at all. Instead, you're someone who just wants to click a couple of buttons, toggle a couple of things, specify the states or the time frame you're interested in, and get an Excel spreadsheet. This is really the non-technical user: someone who doesn't know R, doesn't know Python, and for whom it doesn't make sense to learn those languages. They just want to click a couple of buttons and get data as an Excel spreadsheet. Great, we also built a tool for them, and that's what this looks like.

So hopefully I've made clear that I personally think the Education Data Portal does a really good job of bridging this gap between data availability, that is, the data these agencies are publishing, and data accessibility, really putting that data at the fingertips of the R programmer, the Python programmer, and the Data Explorer user.

The underlying API

So the second thing I want to talk about today is how do I think the portal bridges this gap so effectively? Specifically, I think it's because it focuses on two things. The underlying API, and then data documentation. I'll talk about these in more detail.

So in contrast to these agencies, which, as I mentioned before, typically publish data as flat files like CSVs or in programming-language-specific formats, the Education Data Portal has a language-agnostic API at its foundation. This includes a couple of things. First, the actual data endpoints: around 110 to 115 endpoints that contain the data itself, so tuition data, finance data, the data that users actually use. It also includes about a dozen metadata endpoints: API endpoints about the data, with information like what years are available, what variables exist for an endpoint, or what the special values are for each of the variables.
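To make the endpoint idea concrete, here is a small sketch of building a data-endpoint URL. The level/source/topic/year path pattern reflects the portal's published URL structure, but treat the helper and the example values as illustrative rather than an official client:

```python
# Base URL of the Education Data Portal's API.
BASE = "https://educationdata.urban.org/api/v1"

def data_endpoint(level: str, source: str, topic: str, year: int) -> str:
    """Build the URL for a data endpoint (the records themselves)."""
    return f"{BASE}/{level}/{source}/{topic}/{year}/"

# For example, IPEDS directory data for 2020:
url = data_endpoint("college-university", "ipeds", "directory", 2020)
print(url)
```

The metadata endpoints the talk mentions sit alongside these data endpoints and describe them (available years, variables, value codes); their exact paths are in the portal's documentation.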

There are a lot of reasons why focusing on the underlying API is helpful, but the main one for us is that a lot of these datasets are really big. I realize that "big" is very much a spectrum, but they're big for the folks trying to do what they're trying to do with this data, both in terms of the number of rows and the number of columns. If you think about data disaggregated by race, ethnicity, sex, and institution, and maybe also by major, you can easily get into the tens or hundreds of millions of rows. Again, for a lot of people that really isn't big data, but for many users it's more data than is easy to work with in the conventional languages and frameworks they're used to.

By building an API, you don't have to download that 10-million-row CSV file. If you're only interested in data from Carleton College, or from Minnesota, or for computer science majors, you can specify exactly which rows you need through query-string filters and grab just those from our database, without having to download the full files. Similarly, you can specify the columns you need: if you only need tuition and the name of the institution, you don't have to get all 100-plus variables that a lot of these agencies typically include in the data files they start with.
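The row-filtering idea can be sketched with the standard library's query-string encoder. The fips filter name here is illustrative (FIPS codes are a common way these datasets identify states); check the portal's documentation for the filters each endpoint actually supports:

```python
from urllib.parse import urlencode

def filtered_url(endpoint: str, **filters) -> str:
    """Append query-string filters so the server returns only the rows needed."""
    return f"{endpoint}?{urlencode(filters)}" if filters else endpoint

# Request only Minnesota rows (FIPS 27) instead of downloading everything.
url = filtered_url(
    "https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2020/",
    fips=27,
)
print(url)
```

The same mechanism extends to multiple filters at once, since urlencode joins each keyword argument into the query string.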

And the last thing that I, personally and selfishly, think is the best part of having this focus on the underlying API is that all of the other tools, packages, and documentation are built directly from these endpoints. What that means is that I don't have to spend all day, every day, maintaining Stata packages, R packages, Tableau dashboards, and JavaScript sites. All I have to do is make sure that the API is fast, reliable, and accurate, and by virtue of having this single source of truth, I free up a lot of my time to not have to deal with all of those other things.


Data documentation as a first-order feature

So the second thing that I think the Education Data Portal does really well, and that provides a major value-add over the existing data, is that it treats data documentation as a first-order feature and a first-order value-add, not as an afterthought, which is where data documentation, and documentation in general, tends to fall. I was told that showing a slide with 1,400 variables wouldn't be very aesthetically pleasing, so instead I'm going to focus on two things that provide a lot of value. The first is that the data documentation is written for both humans and machines. For anyone who's seen Jenny Bryan's great talk on naming things, a lot of the same principles apply here: think about who your end user is, whether that's a human or a machine, and make your work accessible to them. That really applies to documentation too, especially data dictionaries. The second is that the documentation provides the user with details on demand, which is borrowed from a design principle. I'll talk about each of these in a bit more detail now.

So when I say the data documentation is written for both humans and machines: here's an example of a variable, degree of urbanization. I, as a human, think this is written for me. I know what it's telling me, I have the relevant information, and it's pretty user-friendly for someone looking at it. Here's that same data documentation, but provided in a machine-readable way: by making a call to the variables metadata endpoint I mentioned earlier, you get the exact same information, the label, the data type, and standardized formats of the values, written for machines. There are a couple of obvious advantages to this. Maybe you're someone trying to make hundreds of plots and you want to automate adding labels, or maybe you want to add captions, or you're building a dashboard. There are any number of things you can do that get a lot of value from having this data documentation written both for humans, on the left, and in a way that can be programmatically accessed, on the right.
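The automated-labeling use case can be sketched like this. The dictionary below is a hypothetical slice of a variables-metadata response (the real field names in the portal's metadata endpoints may differ), though the urban-centric locale codes shown are standard NCES values:

```python
# Hypothetical metadata shape; illustrates programmatic label lookup only.
METADATA = {
    "urban_centric_locale": {
        "label": "Degree of urbanization (urban-centric locale)",
        "values": {11: "City, large", 21: "Suburb, large", 41: "Rural, fringe"},
    }
}

def axis_label(var: str) -> str:
    """Fetch a human-readable label, e.g. for automated plot labeling."""
    return METADATA[var]["label"]

def decode(var: str, code: int) -> str:
    """Map a stored integer code to its value label."""
    return METADATA[var]["values"].get(code, f"unknown code {code}")

print(decode("urban_centric_locale", 21))  # Suburb, large
```

With metadata fetched once from the API, the same two helpers could label hundreds of plots or dashboard captions without hand-copying anything from a data dictionary.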

The second thing I want to talk about on the data documentation front is that it provides the user with details on demand. Here you can see that same variable: by hovering over a little information tooltip, you can see that before 2005 there were nine types of urban-centric locales, and after 2005 the variable switched. There are lots of folks for whom that's really important information, which is great, but there are also lots of folks for whom it just isn't relevant. If this were something like a PDF or an HTML table, and I'm someone just trying to quickly scan a page to figure out which table and which variables I'm looking for, having all of that information always in view really hurts my ability to scan. So providing these details on demand, as opposed to always on the page, is a huge value-add, and another reason to make documentation a first-order feature.

So I think implicit within this focus on both the underlying API and the data documentation is that the Education Data Portal really is a collaboration. There are folks with really deep education expertise, that's Erica, Jay, and Leo, as mentioned on the bottom, who have been using this data for years, in some cases decades, and who really know the ins and outs of when variables change and what the implications are. And then it's a collaboration with folks like me on the technology side, who couldn't tell you what the majority of the acronyms mean, and when I do, I usually get them wrong, but who do know how to do things like make fast APIs or build dockerized applications. More importantly, I know how to work with the folks on the education team, and vice versa, and I think by really having that mutual respect for what we each bring to the table, it just makes for a stronger and better tool.

Why bridging this gap matters

So the last and final thing I want to talk about today is why bridging this gap matters, and ultimately, I think it matters because different people ask different questions. So now I get to do what I think is my favorite part of the job, which is talking to real users and seeing what they're doing with data from the education data portal.

So the first example I want to talk about is an organization we've worked with called Reboot Representation. They're a nonprofit committed to increasing the number of Black, Latina, and Native American women who work in technology fields, and they've been using the portal to answer questions like: which colleges have large numbers of women of color graduating with computer science degrees, which of these colleges are tribal colleges or HBCUs, and how is this changing over time?

I also recently found out that a friend of mine whose mom is a middle school principal in New Jersey had been using the portal to answer questions like: how do student test scores in my school compare to other schools across the district, and how do they compare to other schools across the state? Another question I found out she'd been answering with the portal lately is how this has changed over the course of the pandemic, which I think is a question a lot of teachers and administrators are thinking about. So having this data at her fingertips has been helpful for her.

The last example I want to talk about is a particularly fun, many-layers-of-education example. A colleague of mine was picking up her kid from school a couple of months ago, wearing an Urban Institute-branded fleece, and another parent came up to her and said: oh, do you work at Urban? I actually work at PBS, and I don't know if you've heard, but Urban makes this great tool called the Education Data Portal. They'd been using it to pull geocoded lists of schools and school districts to think about how that maps to their broadcasting coverage, and they talked about ways to use data on school funding to specifically identify under-resourced schools and think about how that maps to where PBS has prioritized its broadcast coverage.

So these are just three of thousands of questions that users are answering with the portal. This graph is probably pretty tiny, but last month you can see about 4,000 unique users, where I'm using IP addresses as a proxy for users, because we don't collect that data. What I think is interesting, and I think lots of things are interesting about this: as much as I would love to believe that everyone is using the R package and the Data Explorer, ultimately those account for just a small share of the folks using the portal every month. The vast majority are either connecting directly to the API, using Python, JavaScript, or other applications and languages we don't have packages for, or building their own Power BI, Tableau, or other dashboards and user interfaces: their own tools to help the folks they're one step closer to answer their own questions.

So ultimately, for folks in this room who are data scientists or data engineers, it comes really naturally to think about building tools for other data scientists and engineers. Similarly, as R programmers or Python programmers, it might feel really natural to build tools for other R and Python programmers. But I think that by unlocking data for more people, we can allow more questions to find evidence-based answers that ultimately drive impact in the places where folks are working. Thanks so much.
