
Solving a Secure Geocoding Problem (That Hardly Anybody Has) - posit::conf(2023)
Presented by Tesla DuBois

Due to data security concerns, the strictest health researchers won't send patient addresses to remote servers for geocoding. The only existing methods for offline geocoding are expensive, cumbersome, or require working with code - all limiting factors for many researchers. So, a couple of classmates and I made a standalone desktop application using shell, Docker, PostGIS, and Python to geocode addresses through a simple GUI without ever sending them off the local machine. Come for the technical ins and outs and stay for the anecdotes about how my R background played into the daunting, frustrating, but ultimately successful task of creating a data science tool using unfamiliar technologies.

Presented at posit::conf(2023), September 19-20, 2023. Learn more at posit.co/conference.

Talk Track: Developing your skillset; building your career.
Session Code: TALK-1111
Transcript
This transcript was generated automatically and may contain errors.
Hi, everybody. I'm Tesla DuBois, and I'm really excited to be here today to tell you all about solving a secure geocoding problem that hardly anybody has. So you might think from the title of my talk that I'm someone who's overly concerned with the precise locations of things. But in reality, I place a lot more value on the journey. So today, I'm going to tell you about my journey of getting from one place in my data science learning to another place relatively further along, through the task of creating a secure, code-free geocoder.
Why build a secure, code-free geocoder
So first of all, why build a secure, code-free geocoder? Well, as my puppy is proudly telling you here, I work at Fox Chase Cancer Center, and I'm a geospatial data analyst there. So I do things like this. I characterize cancer burden by neighborhood. Then I do the same thing with cancer risk. So in this case, we're looking at smoking behavior and the density of tobacco retailers. And then I use that to inform interventions.
Basically, I'm arguing that targeted cancer prevention efforts depend on geocoded patient data. So geocoding is the process of turning a text address into geographically referenced coordinates, like a latitude and longitude. And there are many ways to do this. Most people can use one of these tools. But the thing about doing this with patient data is that you have to be really careful, because an address is an individual identifier. So we can't send that to a remote server. At least, the most conservative health researchers are not going to send that over a server, for data security reasons.
There are a few options that are secure for geocoding, but they tend to be clunky, expensive, and/or require coding. So this is a data science conference. Why is coding an issue here? Well, while I am the only geospatial analyst at Fox Chase Cancer Center, I'm not the only one interested in using geocoded data. Normally, the next step after geocoding is that we link the geocodes with some area-level measures of social determinants of health, which can be defined as the environments in which we live, work, and play.
So I'm going to give you a few examples of other folks at Fox Chase who are interested in using this type of data. We have clinical researchers who want to know whether socioeconomic status has any association with treatment outcomes in their patients. We have a multidisciplinary team right now that's working on increasing diversity in our clinical trials and wants help focusing their recruitment and education programs and deciding which community partners we should engage in this effort. We have leaders of cancer prevention and control interventions, like our mobile mammography unit and our patient navigation programs, who want to make sure we're prioritizing and serving residents in the highest-risk areas. And finally, our leaders and administrators of the Cancer Center are required to report back to the National Cancer Institute how well we're serving the residents in our designated catchment area.
So currently what happens is that these types of inquiries go to my fantastic boss, and then they're passed on to me as data requests. And on one hand, it's really awesome that I get to support this really impactful work at Fox Chase. But I also got to thinking that if I could give researchers and administrators at Fox Chase a tool to be a bit more independent in geocoding, and potentially in linking that patient data with common area-level measures, then I would be freed up to do some of the more advanced spatial analyses that I'm interested in.
So I wanted to create a code-free, secure geocoding application for researchers and administrators working with patient data. And I actually workshopped this idea quite a bit with my fantastic boss, Dr. Shannon Lynch, and her collaborator, Dr. Kevin Henry. And actually, because of those two individuals, I'm going back to school now for my Ph.D. in Geography.
Building understanding with tools you know
So I took a class last semester on geospatial application development, and it was a project-based class. We had to be in group projects. And I presented this as an idea on the first day of class. And quickly, two of my classmates joined my group. So we sat in the back of the classroom while other groups were forming, and I talked about my ideas for how we would approach this in R. And right before we left the classroom, the professor told us that we had to use Python-based solutions.
So this is where I need to tell you that I've always been a bit of a rebel. And in fact, as this newspaper clip tells you, I was the only girl in my all-girls private Catholic high school who rode a motorcycle from a different county to school every day. And what the newspaper doesn't tell you is that on my lunch break, I used to get back on my motorcycle and ride through the convent behind the school and through the paved rose gardens and wait for the nuns to come out and yell at me.
So when my professor told me, an R lover, that I had to use a Python-based solution, I immediately went home and started coding it up in R. But I will argue that this wasn't just an act of rebellion. Just as in my teenage years, I really needed to explore the world in a mode that felt freeing to me and assert myself as an independent being. Similarly, in this project, I needed to deeply understand conceptually what I was trying to do, in a language that I understood and knew well.
So the first phase of this project is to build understanding with tools that you know. This is what it looked like for me in R. I started with an address, and then I used the tigris package to pull down all of the streets for the county within which the address resided. And this is important, because this county is the only information that I sent to a remote server, and that's not individually identifiable information. From there, I used dplyr to filter down to the correct zip code, stringr to match on street name, and dplyr again to get down to the line segment that included the number of the address. And then I took the centroid of that final line segment using sf. And this is a crude geocode, but I would argue that, at least in an urban setting, it's sufficient.
Next, I can use tidycensus to pull in the census tract boundaries and link them spatially to that point, and tidycensus again to pull in area-level measures of interest. And then I have this result, which is an address with the latitude and longitude and a linked area-level measure of interest. Then I do everybody's favorite and stick it in a for loop. And this was my R geocoder. As I iterate over addresses, I'm getting a bigger data frame with all the linked data.
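The core of the R pipeline described above (filter to zip code, match the street name, find the segment whose house-number range contains the address, take the segment's centroid) can be sketched in miniature. This is a hedged illustration in Python rather than R, with a hypothetical two-segment street table standing in for the county line files; the real workflow used tigris, dplyr, stringr, and sf.

```python
# Hypothetical stand-ins for county street line files: each "segment"
# carries a street name, zip code, house-number range, and endpoints.
segments = [
    {"street": "W MAIN ST", "zip": "19111", "lo": 100, "hi": 198,
     "ends": ((-75.10, 40.05), (-75.09, 40.06))},
    {"street": "W MAIN ST", "zip": "19111", "lo": 200, "hi": 298,
     "ends": ((-75.09, 40.06), (-75.08, 40.07))},
]

def crude_geocode(number, street, zipcode, segs):
    """Filter to the zip code and exact street name, find the segment
    whose house-number range contains the address, and return the
    segment's midpoint as a crude (lon, lat) geocode."""
    for s in segs:
        if (s["zip"] == zipcode and s["street"] == street
                and s["lo"] <= number <= s["hi"]):
            (x1, y1), (x2, y2) = s["ends"]
            return ((x1 + x2) / 2, (y1 + y2) / 2)  # segment centroid
    # Exact-match only: "MAIN ST" will not match "W MAIN ST".
    return None
```

Note that `crude_geocode(150, "MAIN ST", ...)` returns `None` even though the street exists under "W MAIN ST", which is exactly the exact-match limitation discussed next.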
So while this worked, there were several problems with it. First, addresses had to be an exact match. So if there wasn't a W for West Main Street, there just wouldn't be a match at all. There was no street-name normalization built into this. Second, my R environment got way too big, way too fast. If I included multiple counties in that initial pull, I'm downloading those line files, which are really big, for every single county, and it starts to get bogged down. Third, this is definitely not code-free. It worked for me, but it wasn't something I could extend to the other researchers and administrators at Fox Chase. And fourth, it wasn't a Python-based solution. We did, for a moment, consider just translating it to Python and turning it in, but the other three issues would have still remained.
The talk-try method: puzzling new tools together
So if my teens were characterized by riding a motorcycle through the convent, then my 20s were characterized by trying to shove my triathlon bike into the back of my Mini Cooper. So in my 20s, I was really into triathlons. I'd finally gotten to four-wheeled vehicle after moving out to the East Coast, and I would spend my Sundays driving my bike out to a place where I could ride for 50 miles at a time with very little car traffic. And then I'd get back to my car, and inevitably I'd be hot and tired and dehydrated, and I'd have a hard time fitting my bike back into the Mini Cooper. So I would call my husband and tell him, first of all, like, I'm done with my ride, I survived, and then talk to him as I tried shoving the bike into the Mini Cooper. And usually just the process of me talking about what I was trying would give me an idea of what to try next, and eventually I'd get the back of the Mini Cooper shut, and I'd be on my way.
So this next phase of the building a code-free geocoder was very similar to that. We had a lot of tools that we ended up using, and on my team none of us had any experience with any of these before we started. So we started out by just like going home, each picking one, trying to become an expert on it, coming back, talking to each other about what we learned, and then slowly starting to puzzle these different things together. So that's the next phase, using the talk-try method to iteratively puzzle new tools together.
We started out with this tutorial by Michele Tobias, which I appreciated both for the technical expertise and for the validation that this was really hard, and confusing, and took a long time. And even with that tutorial, we spent many nights on Zoom looking at pages of failure messages. But eventually we did come up with something that worked, and this is how it worked.
So first, we have a shell script that starts a Docker container. You have to have Docker Desktop already installed on your computer, but it doesn't need to be open or running. The shell script will start the Docker container and put a PostgreSQL database in it. Then we have another shell script that will pull the relevant county line files and store them in that PostgreSQL database. And we have a file in there that updates every time you pull new county line files, so that the next time you run it, you don't pull those same county line files again, and the data persists in your Docker database. And again, the only information that we're sending to a remote server is the county.
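The "file that updates every time you pull new county line files" amounts to a simple set difference against a local ledger. Here is a minimal sketch of that bookkeeping in Python; the ledger file name and layout are hypothetical, and the actual downloading and loading into the PostgreSQL database is elided.

```python
from pathlib import Path

def counties_to_pull(requested, ledger=Path("loaded_counties.txt")):
    """Return only the county FIPS codes not yet recorded in the local
    ledger file, then append them so the next run skips them. These
    county codes are the only identifiers that ever leave the machine;
    the addresses themselves stay local."""
    loaded = set(ledger.read_text().split()) if ledger.exists() else set()
    new = sorted(set(requested) - loaded)
    if new:
        with ledger.open("a") as f:
            f.write("\n".join(new) + "\n")
    return new
```

On a second run with the same counties requested, the function returns an empty list, mirroring the talk's point that the line files persist in the Docker database and aren't pulled twice.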
Next, we have PostGIS functions to geocode against the database. So PostGIS has a geocoding function already, and it's normally run against a database server somewhere, but instead we're using our Docker database as the server. So we're directing the data to our local Docker database, and it's never going over the network. And we're calling this from a Python function. And finally, we created an interface with Qt Designer and linked the whole thing up with Python. And this is a very basic, basic user interface, but it does have a couple of options. You can name your input file, designate your output file, and then you can pick between having your output be a CSV or a shapefile. You can choose to have your points jittered, which means just randomly moved within certain bounds from the true latitude and longitude. And then you can add the social determinants of health data at the census tract level, or not.
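The jitter option (randomly moving a point within certain bounds of the true coordinates) can be sketched in a few lines. The 0.002-degree default bound here is a hypothetical choice for illustration, not the tool's actual setting; roughly, 0.001 degrees of latitude is about 111 meters.

```python
import random

def jitter(lon, lat, max_offset=0.002):
    """Randomly displace a point by up to +/- max_offset degrees in
    each direction, so the output no longer reveals the exact address
    location. The bound (0.002 degrees) is a hypothetical default."""
    return (lon + random.uniform(-max_offset, max_offset),
            lat + random.uniform(-max_offset, max_offset))
```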
Compromising: prioritizing functionality over flair
Okay, so if my teens were characterized by riding a motorcycle through the convent, and my 20s were characterized by shoving a bicycle into the back of the Mini Cooper, my 30s — how many people have experienced their 30s in here? So my 30s have been characterized by a bit of compromise. When I was pregnant with my second daughter, my husband and I decided it was time to give up on any idea that we would ever be cool again, and we went ahead and bought a minivan. It was very sad. We had to let go of some ego, but it was the most functional choice, you know? It would work for our family. But I believe that you should really own where you are when you have to make that type of decision. So we went ahead and got the vanity license plate: "van," spelled with five A's.
And the only time that has backfired is when my kid got a point off on a spelling test in first grade for putting too many A's in the word van. She tells me it was a joke, but the teacher obviously didn't get it. Then again, she only put four A's, so maybe she deserved the point off.
So this phase of this project also required some compromise. But in doing so, I had to prioritize functionality over flair, much like getting the minivan. But before we had to compromise, we were actually feeling quite good about where we were. So one week before the final was due, we gave a demo to our class. I'm going to walk you through that now.
So in the first demo, we're exporting a CSV of true points with the census tract variables attached. Now, this might be a little hard to see, but we're starting down here with a CSV that has two columns. One is just an address ID, and the other one has the addresses, all in one column. And we hadn't packaged this into an executable file yet, so we're pressing run right up here from the Spyder interface, and then it pops open our GUI. We can navigate to the CSV of addresses and designate our output. We want a CSV, not jittered, with the social determinants of health data at the census tract level. Press OK, and it gives us this output telling us that it's pulling those county line files. And then we have the file pop up where we designated it. We can open this up, and we see that we do indeed have a latitude and longitude, and, importantly, that rating for how good of a match it is. So I have a PO box in here as a test. It has a really poor rating, and that's what we want to see. And then we have the area-level social determinants of health measures on there.
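The rating column in that output is worth a quick illustration. In the PostGIS TIGER geocoder, lower ratings are better (0 is an exact match), so unmatched inputs like PO boxes surface with large ratings. The sample CSV and the cutoff of 20 below are hypothetical, chosen just to show how a user might screen for poor matches.

```python
import csv, io

# Hypothetical sample of the tool's output: address ID, coordinates,
# and the geocoder's match rating (lower is better; 0 is exact).
sample = """id,lon,lat,rating
1,-75.0950,40.0550,2
2,-75.1100,40.0100,5
3,,,100
"""

def poor_matches(text, threshold=20):
    """Return the IDs whose rating exceeds a (hypothetical) threshold,
    e.g. a PO box that couldn't be matched to a street segment."""
    rows = csv.DictReader(io.StringIO(text))
    return [r["id"] for r in rows if int(r["rating"]) > threshold]
```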
In our second demo, we export a shapefile of jittered points with the census tract variables attached. Again, we're going to go ahead and press run, put in our file, and designate the output. We want a shapefile. We do want it jittered this time, and again, we're attaching the census tract variables. Press run. This time, it's not pulling those county line files, because we already did it two minutes ago. And remember, the data persists in our Docker container, so it doesn't have to call it again. This time, it's telling us that it has to shorten the column names when it's exporting a shapefile, which, if you do geospatial stuff, you're familiar with. We can open up QGIS now, pull in that shapefile, and indeed, we see that the jittered point is just slightly off from the true latitude and longitude.
All right, so we were feeling pretty good. Our professor was impressed. Our classmates were impressed. We thought we were pretty awesome. But then we still had to package it up into a downloadable executable that somebody could click on their desktop to pop open our GUI, and we ran into some problems. So this is our group chat the day before the final was due. It turned out that the Python module I was using to calculate the census tract-level variables was causing an issue when we were packaging it into an executable, so I had to just comment out all of that code. Then we did get it into an executable file, but it wouldn't run when you double-clicked on it after downloading. It would only run if you went to the terminal and told it to run from there, which is not entirely code-free.
But the final was due, and we had to turn in what we had, and so that's what we did. We turned it in, and this is where being a journey person really comes in handy, right? Because even though I didn't get all the way where I wanted to get with this geocoder, I would argue that by building conceptual understanding with tools that I knew well, using the talk-try method to iteratively experiment and puzzle things together, and knowing when it's time to compromise, and by doing so, prioritizing functionality over flair, I definitely progressed in my data science learning journey.
And not only did I progress in that journey, even though it's not ready for prime time, I did create something that I can use at work. So I recently geocoded 3,000 patient addresses. 85% of them were geocoded with the Python-based solution, which is pretty good. It's about what you would expect for patient data. And then I got an extra 5% by running the ones that didn't geocode through the R geocoder.
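That fallback arrangement (run the primary geocoder, then pass its misses through a second one) makes the combined coverage easy to reason about. A quick sketch of the arithmetic, using roughly the numbers from the talk:

```python
def combined_coverage(total, primary_hits, fallback_hits):
    """Share of addresses geocoded after running the primary geocoder,
    then passing its misses through a fallback geocoder."""
    return (primary_hits + fallback_hits) / total

# Roughly the figures above: 3,000 addresses, 85% matched by the
# Python tool, and a further 5% recovered by the R geocoder.
rate = combined_coverage(3000, 2550, 150)  # 0.90 overall
```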
And the thing is, the journey continues. So I can't tell you what the next phase or lesson is going to be, but I can tell you this: on days that I drive to class or to the office, I drive a hybrid. And at work, I just got a data science undergrad intern to help me continue building this out. So I hope that this next phase feels really efficient. But, you know, either way, I'm just happy that the journey continues. Thank you.
