Resources

Democratizing Access to Education Data - posit::conf(2023)

Presented by Erika Tyagi.

Learn how the Urban Institute is making high-quality data more accessible through the Education Data Portal. Every year, government agencies release large amounts of data on schools and colleges, but this information is scattered across various websites and is often difficult to use. To make these data more accessible, the Urban Institute built the Education Data Portal, a freely available one-stop shop for harmonized data and metadata for nearly all major federal education datasets. In this talk, we'll demonstrate how the portal works and share lessons we've learned about making data accessible to users with varying technical skills and preferred programming languages.

The Urban Institute's Education Data Portal: https://educationdata.urban.org

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: End-to-end data science with real-world impact. Session Code: TALK-1145


Transcript

This transcript was generated automatically and may contain errors.

My name is Erika, I'm a Senior Data Engineer at the Urban Institute, and the Urban Institute is a non-profit research organization that provides data and evidence to help advance upward mobility and equity. Today I'll be talking about a tool that my coworkers at the Urban Institute and I built called the Education Data Portal.

Before I get started, I have a confession to make, which is that I am very much not an education data expert. In fact, I know very little about education data. Luckily, I'm not here today to talk about the ins and outs of IPEDS or Common Core data; instead, what I'm here to talk about is how the Education Data Portal bridges the gap between data availability and data accessibility.

Specifically, there are three questions that I want to answer today. The first is what do we mean by this gap between data availability and accessibility? The second, how does the portal bridge this gap so effectively? And lastly, why does this matter?

The gap between data availability and accessibility

So every year, a handful of government agencies do the hard and important work of collecting detailed data on K-12 schools, school districts, and colleges in the U.S. The goal of the Education Data Portal is to put all of this data under a single roof by providing a freely available one-stop shop for all of the major national datasets released by these agencies.

In many cases, these agencies have been collecting this data for a couple of decades, and as you can imagine, there have been lots of changes to data formats, file structures, and classifications over the years. So our goal is to do the hard work of harmonizing that data, so that we can offload that work from the researchers, data scientists, and other folks trying to use it. When I say we, I really mean our army of research assistants who do that hard work. So much kudos to them.

Ultimately, we do this to make it easier for both technical and non-technical users to look at trends over time, combine data from these different sources, and just have that data at their fingertips a little more easily.

A concrete example: tuition data

So this is a little bit abstract, so to provide an example, suppose I'm interested in answering the question of how tuition at my alma mater has risen over the last couple of decades. In my case, I went to a tiny little liberal arts college in Northfield, Minnesota called Carleton, and like many other delightful little liberal arts colleges, I probably have a rough sense that tuition over the last couple of decades looks something like this. However, as I mentioned at the beginning, I work for the Urban Institute, and we take data and evidence very seriously, so I'm not just going to use my priors here. Suppose I want to really get into the data and try to reproduce this plot myself.

So first, suppose I live in a world without the education data portal. I'm also going to be showing a live look at my pain levels as I go through the process of answering this question without the education data portal. The very first thing I'd have to do is just figure out which agency is collecting this data. As I mentioned at the beginning, I'm not an education data expert, so I just have to do some Googling, figure out which website to go to, get to that website.

Next, I'd have to read through the data documentation, assuming that website has data documentation for me to read, to figure out what tables am I looking for, which variables. Maybe I'll type the keyword tuition into some kind of a query, and then get a couple hundred or a couple thousand tables back to figure out just where is the data that I'm looking for on this website.

Next, I'd have to download individual data files for each year. So I think for folks who are used to working with government data, agencies typically report data as CSV files, or depending on the programming language you're using, SAS or Stata files, which can be great if you use those languages, but it also means that in my case, I'd have to download 20-some individual data files, which again, not the end of the world, a little bit annoying, but you move on.

So next, I would load those files into the programming language of my choice. This is what I like to call the choose-your-own data mishap step, and I think you can imagine any number of things happening here. Maybe in the late 90s a variable definition changed, or maybe earlier on there were three different files for this data, but later two, and later just one.

Maybe there's a 99 value for tuition, and while I think it would be great if Carleton tuition cost $99, that more likely means it was a total row, missing data, or some other special encoding. So at this point, you can either cry, or you can reread the data documentation, figure out what you did wrong and what you need to redo, and update your code. Then there's the very fun part of setting yourself a calendar reminder to do the same thing a year from now: go to the government website, download that data file, and, last but certainly not least, hope that nothing has changed.

Hopefully I've made the point that this process is really tedious, really error-prone, and simply not fun.

How the Education Data Portal helps

So enter the Education Data Portal. Suppose you're an R user, like lots of folks in this room might be. Great, we built an R package for you called educationdata. You can download it from CRAN, and with just a single call to the get_education_data() function, you can get a plot that looks remarkably similar to what I had in my brain at the very beginning of this.

Suppose you're not an R user but a Python user. Great, we also built a package for you. Less great is that it's not yet publicly available, but it will be soon. Again, the syntax is very, very similar, and it makes it a lot easier to get this data.
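Even without an official client package, the portal's data is reachable over plain HTTP from any language. As a minimal sketch, assuming a paginated JSON response with a "results" list (a common REST convention, not an official schema from this talk), walking pages might look like this:

```python
def collect_results(pages):
    """Flatten a sequence of paginated API responses into one list of records.

    Assumes each page is a dict with a "results" list; check the portal's
    API documentation for the actual response shape.
    """
    records = []
    for page in pages:
        records.extend(page.get("results", []))
    return records

# Stand-ins for two pages fetched from the API (illustrative values only).
page1 = {"next": "https://educationdata.urban.org/api/v1/...?page=2",
         "results": [{"unitid": 173258, "year": 2000, "tuition": 21925}]}
page2 = {"next": None,
         "results": [{"unitid": 173258, "year": 2001, "tuition": 23010}]}

records = collect_results([page1, page2])
print(len(records))  # 2
```

In a real script you would fetch each page with an HTTP client and follow the "next" link until it is null; the fixture dicts above just stand in for those responses.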

Suppose you're not an R or Python user but a Stata user. I recognize I'm at a conference evangelizing open source technology, and Stata is very much not an open source technology, but ultimately, in the space I work in, I work with a lot of economists and a lot of policy researchers. And you can see that with even fewer lines of code than the R and Python syntax, you get similar plots, prelabeled, which is great. So again, we're just trying to make that data a little more accessible for them.

Suppose, oh gosh, suppose you're not a programmer at all. Instead, you're someone who just wants to click a couple of buttons, toggle a couple of things, specify the states or the time frame you're interested in, and get an Excel spreadsheet. This is really the non-technical user: someone who doesn't know R, doesn't know Python, and for whom it doesn't make sense to learn those languages. They just want to click a couple of buttons and get data as an Excel spreadsheet. Great, we also built a tool for them, and that's what this looks like.

So hopefully I've made clear that I personally think the Education Data Portal does a really good job of bridging this gap between data availability, that is, the data these agencies are publishing, and data accessibility, really putting that data at the fingertips of the R programmer, the Python programmer, and the Data Explorer user.

The underlying API

So the second thing I want to talk about today is how do I think the portal bridges this gap so effectively? Specifically, I think it's because it focuses on two things. The underlying API, and then data documentation. I'll talk about these in more detail.

So in contrast to these agencies, which, as I mentioned before, typically publish data as flat files like CSVs or in programming-language-specific formats, the Education Data Portal has a language-agnostic API at its foundation. This includes a couple of things. First, the actual data endpoints: around 110 to 115 endpoints that contain the data itself, so tuition data, finance data, the data that users actually use. It also includes about a dozen metadata endpoints: API endpoints about the data, with information like what years are available, what variables exist for an endpoint, or what the special values are for each of the variables.
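To make the endpoint idea concrete, here is a small sketch of building a data-endpoint URL. The level/source/topic/year path pattern reflects the portal's published URL structure, but treat the helper and the example values as illustrative rather than an official client:

```python
# Base URL of the Education Data Portal's API.
BASE = "https://educationdata.urban.org/api/v1"

def data_endpoint(level: str, source: str, topic: str, year: int) -> str:
    """Build the URL for a data endpoint (the records themselves)."""
    return f"{BASE}/{level}/{source}/{topic}/{year}/"

# For example, IPEDS directory data for 2020:
url = data_endpoint("college-university", "ipeds", "directory", 2020)
print(url)
```

The metadata endpoints the talk mentions sit alongside these data endpoints and describe them (available years, variables, value codes); their exact paths are in the portal's documentation.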

There are a lot of reasons why focusing on the underlying API is helpful, but the main one for us is that a lot of these datasets are really big. I realize that "big" is very much a spectrum, but they're big for the folks trying to do what they're trying to do with this data, both in terms of the number of rows and the number of columns. If you think about data disaggregated by race, ethnicity, sex, and institution, and maybe also by major, you can easily get into the tens or hundreds of millions of rows. Again, for a lot of people that really isn't big data, but for many users it's more data than is easy to work with in the conventional languages and frameworks they're used to.

By building an API, you don't have to download that 10-million-row CSV file. If you're only interested in data from Carleton College, or from Minnesota, or for computer science majors, you can specify exactly which rows you need through query-string filters and grab just those from our database, without having to download the full files. Similarly, you can specify the columns you need: if you only need tuition and the name of the institution, you don't have to get all 100-plus variables that a lot of these agencies typically include in the data files they start with.
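The row-filtering idea can be sketched with the standard library's query-string encoder. The fips filter name here is illustrative (FIPS codes are a common way these datasets identify states); check the portal's documentation for the filters each endpoint actually supports:

```python
from urllib.parse import urlencode

def filtered_url(endpoint: str, **filters) -> str:
    """Append query-string filters so the server returns only the rows needed."""
    return f"{endpoint}?{urlencode(filters)}" if filters else endpoint

# Request only Minnesota rows (FIPS 27) instead of downloading everything.
url = filtered_url(
    "https://educationdata.urban.org/api/v1/college-university/ipeds/directory/2020/",
    fips=27,
)
print(url)
```

The same mechanism extends to multiple filters at once, since urlencode joins each keyword argument into the query string.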

And the last thing that I, personally and selfishly, think is the best part of having this focus on the underlying API is that all of the other tools, packages, and documentation are built directly from these endpoints. What that means is that I don't have to spend all day, every day, maintaining Stata packages, R packages, Tableau dashboards, and JavaScript sites. All I have to do is make sure that the API is fast, reliable, and accurate, and by virtue of having this single source of truth, I free up a lot of my time to not have to deal with all of those other things.


Data documentation as a first-order feature

So the second thing that I think the Education Data Portal does really well, and that provides a major value-add over the existing data, is that it treats data documentation as a first-order feature and a first-order value-add, not as an afterthought, which is where data documentation, and documentation in general, tends to fall. I was told that showing a slide with 1,400 variables wouldn't be very aesthetically pleasing, so instead I'm going to focus on two things that provide a lot of value. The first is that the data documentation is written for both humans and machines. For anyone who's seen Jenny Bryan's great talk on naming things, a lot of the same principles apply here: think about who your end user is, whether that's a human or a machine, and make your work accessible to them. That really applies to documentation too, especially data dictionaries. The second is that the documentation provides the user with details on demand, which is borrowed from a design principle. I'll talk about each of these in a bit more detail now.

So when I say the data documentation is written for both humans and machines: here's an example of a variable, degree of urbanization. I, as a human, think this is written for me. I know what it's telling me, I have the relevant information, and it's pretty user-friendly for someone looking at it. Here's that same data documentation, but provided in a machine-readable way: by making a call to the variables metadata endpoint I mentioned earlier, you get the exact same information, the label, the data type, and standardized formats of the values, written for machines. There are a couple of obvious advantages to this. Maybe you're someone trying to make hundreds of plots and you want to automate adding labels, or maybe you want to add captions, or you're building a dashboard. There are any number of things you can do that get a lot of value from having this data documentation written both for humans, on the left, and in a way that can be programmatically accessed, on the right.
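The automated-labeling use case can be sketched like this. The dictionary below is a hypothetical slice of a variables-metadata response (the real field names in the portal's metadata endpoints may differ), though the urban-centric locale codes shown are standard NCES values:

```python
# Hypothetical metadata shape; illustrates programmatic label lookup only.
METADATA = {
    "urban_centric_locale": {
        "label": "Degree of urbanization (urban-centric locale)",
        "values": {11: "City, large", 21: "Suburb, large", 41: "Rural, fringe"},
    }
}

def axis_label(var: str) -> str:
    """Fetch a human-readable label, e.g. for automated plot labeling."""
    return METADATA[var]["label"]

def decode(var: str, code: int) -> str:
    """Map a stored integer code to its value label."""
    return METADATA[var]["values"].get(code, f"unknown code {code}")

print(decode("urban_centric_locale", 21))  # Suburb, large
```

With metadata fetched once from the API, the same two helpers could label hundreds of plots or dashboard captions without hand-copying anything from a data dictionary.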

The second thing I want to talk about on the data documentation front is that it provides the user with details on demand. Here you can see that same variable: by hovering over a little information tooltip, you can see that before 2005 there were nine types of urban-centric locales, and after 2005 the variable switched. There are lots of folks for whom that's really important information, which is great, but there are also lots of folks for whom it just isn't relevant. If this were something like a PDF or an HTML table, and I'm someone just trying to quickly scan a page to figure out which table and which variables I'm looking for, having all of that information always in view really hurts my ability to scan. So providing these details on demand, as opposed to always on the page, is a huge value-add, and another reason to make documentation a first-order feature.

So I think implicit within this focus on both the underlying API and the data documentation is that the Education Data Portal really is a collaboration. There are folks with really deep education expertise, that's Erica, Jay, and Leo, as mentioned on the bottom, who have been using this data for years, in some cases decades, and who really know the ins and outs of when variables change and what the implications are. And then it's a collaboration with folks like me on the technology side, who couldn't tell you what the majority of the acronyms mean, and when I do, I usually get them wrong, but who do know how to do things like make fast APIs or build dockerized applications. More importantly, I know how to work with the folks on the education team, and vice versa, and I think by really having that mutual respect for what we each bring to the table, it just makes for a stronger and better tool.

Why bridging this gap matters

So the last and final thing I want to talk about today is why bridging this gap matters, and ultimately, I think it matters because different people ask different questions. So now I get to do what I think is my favorite part of the job, which is talking to real users and seeing what they're doing with data from the education data portal.

So the first example I want to talk about is an organization we've worked with called Reboot Representation. They're a nonprofit committed to increasing the number of Black, Latina, and Native American women who work in technology fields, and they've been using the portal to answer questions like: which colleges have large numbers of women of color graduating with computer science degrees, which of these colleges are tribal colleges or HBCUs, and how is this changing over time?

I also recently found out that a friend of mine whose mom is a middle school principal in New Jersey had been using the portal to answer questions like: how do student test scores in my school compare to other schools across the district, and how do they compare to other schools across the state? Another question I found out she'd been answering with the portal lately is how this has changed over the course of the pandemic, which I think is a question a lot of teachers and administrators are thinking about. So having this data at her fingertips has been helpful for her.

The last example I want to talk about is a particularly fun, many-layers-of-education example. A colleague of mine was picking up her kid from school a couple of months ago, wearing an Urban Institute-branded fleece, and another parent came up to her and said: oh, do you work at Urban? I actually work at PBS, and I don't know if you've heard, but Urban makes this great tool called the Education Data Portal. They'd been using it to pull geocoded lists of schools and school districts to think about how that maps to their broadcasting coverage, and they talked about ways to use data on school funding to specifically identify under-resourced schools and think about how that maps to where PBS has prioritized its broadcast coverage.

So these are just three of thousands of questions that users are answering with the portal. This graph is probably pretty tiny, but last month you can see about 4,000 unique users, where I'm using IP addresses as a proxy for users, because we don't collect that data. What I think is interesting, and I think lots of things are interesting about this: as much as I would love to believe that everyone is using the R package and the Data Explorer, ultimately those account for just a small share of the folks using the portal every month. The vast majority are either connecting directly to the API, using Python, JavaScript, or other applications and languages we don't have packages for, or building their own Power BI, Tableau, or other dashboards and user interfaces: their own tools to help the folks they're one step closer to answer their own questions.

So ultimately, for folks in this room who are data scientists or data engineers, it comes really naturally to think about building tools for other data scientists and engineers. Similarly, as R programmers or Python programmers, it might feel really natural to build tools for other R and Python programmers. But I think that by unlocking data for more people, we can allow more questions to find evidence-based answers that ultimately drive impact in the places where folks are working. Thanks so much.
