Data Science at the Census Bureau | Jessica Klein | Data Science Hangout
To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you! We were recently joined by Jessica Klein, Data Scientist at the Census Bureau, to chat about open-source adoption at the Census Bureau, the growth of R and Python communities within the agency, training opportunities and internal support systems, and working with complex census datasets. In this Hangout, Jessica talked about the agency's move towards the cloud and its efforts to reduce reliance on proprietary software like SAS and ArcGIS. She shared insights into the internal support system, mentioning training programs, workshops, and resources for using the Census API. She also discussed the challenges and strategies involved in transitioning from SAS to R and Python, including the importance of double coding for validation with mathematical statisticians and the increasing use of version control with Git and GitLab. The development of internal technical support groups for R and Python was a key element in fostering knowledge sharing and breaking down silos within the agency.
Resources mentioned in the video and Zoom chat:
- Jessica Klein's LinkedIn: https://www.linkedin.com/in/jessica-klein-8a5a35196/
- tidycensus package: https://walker-data.com/tidycensus/
- tigris package: https://github.com/walkerke/tigris
- fastLink package for probabilistic linking: https://github.com/kosukeimai/fastLink
- StatQuest YouTube channel: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw
- 3Blue1Brown YouTube channel: https://www.youtube.com/c/3blue1brown
- {sf} package for R spatial analysis: https://r-spatial.github.io/sf/
- Resources on spatial analysis in R: https://rspatial.org/, https://rspatialdata.github.io/index.html, and https://asdar-book.org/
- Apache Superset (open-source visualization tool): https://superset.apache.org/
- Coding it Forward Civic Digital Fellowship program: https://blog.codingitforward.com/tagged/civic-digital-fellowship

Resources shared directly by Jessica in the chat:
- Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker: https://walker-data.com/census-r/index.html
- Webinar series on using tidycensus for analyzing and mapping Census data: https://ssdan.net/events/the-2025-ssdan-webinar-series-2023-acs-data-with-r-mapping-tools-and-the-2020-census/
- A Guide to Working with US Census Data in R: https://rconsortium.github.io/censusguide/
- Using the API requires getting an API key, which you can get for free here: https://api.census.gov/data/key_signup.html
- Census Data API User Guide: https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf
- Explore the ABCs of the Census API with Census Bureau staff: https://www.youtube.com/watch?v=JbThy7GUg3k
- Tutorial on Census API Basics: Simple Steps to Better Data Access: https://www.youtube.com/watch?v=d-FJ2IyVfdk
- Tutorial on Using the API to Get All Results for an ACS Table: https://www.census.gov/library/video/2020/using-api-all-results-for-acs-table.html
- Tutorial on Using the API to Get Results for Multiple Estimates: https://www.census.gov/library/video/2020/using-api-results-for-multiple-estimates.html
- Explore ACS Data Stories: Stats in Action! to see how the public is using ACS data: https://www.census.gov/programs-surveys/acs/about/acs-data-stories.html
- Visit the Slack show-and-tell channel and show how you use Census data: https://app.slack.com/client/T6AL55003/CRY4R9R7S

If you didn't join live, one great discussion you missed from the Zoom chat was about the enthusiastic reception and usefulness of the tidycensus package for accessing and working with census data in R, with many attendees expressing their appreciation and sharing resources related to it. Let us know below if you'd like to hear more about it!

Subscribe to our channel here: https://bit.ly/2TzgcOu

Follow us here:
- Website: https://www.posit.co
- Hangout: https://pos.it/dsh
- LinkedIn: https://www.linkedin.com/company/posit-software
- Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!
Transcript
This transcript was generated automatically and may contain errors.
Alrighty, welcome back everybody to the Data Science Hangout. If we have not met, my name is Libby. I'm a data science community manager here at Posit and I'm also a really passionate data science educator. I have a background in business statistics and experience teaching Python and R. I am based in San Antonio, Texas where right now it's a lovely 65 degrees outside.
If you aren't familiar with Posit, Posit builds enterprise solutions and open source tools for people who do data science with R and Python. We are also the company formerly called RStudio, so if you know RStudio, you know us. Rachel, who started the Hangout many years ago, just popped up in here as well, so let me have Rachel hop in really quickly and introduce herself.
Sorry if I'm out of breath. I was running from an event in Boston. Hi everybody, so nice to see you. I'm Rachel. I lead customer marketing at Posit and I'll be hanging out here behind the scenes too.
Well, the Hangout is our open space to hear about what's going on in the world of data across a bunch of different industries. We chat about data science practices, data science leadership, we connect with other people who share our experiences, and we get together every Thursday, same time, same place, here on Zoom. So if you are watching this on YouTube later and you want to join us live in the future, I highly recommend it.
All right, at the Hangout we love hearing from you. It doesn't matter what your years of experience are, your titles, your industry, the languages that you use or do not use, and we really encourage you to connect with each other in the chat.
Introducing Jessica Klein
I am so excited to be joined by our featured leader today, Jessica Klein, a data scientist at the Census Bureau. Jessica, I would love it if you could introduce yourself, tell us a little bit about your role and what you do outside of work for fun.
Absolutely. So, hello, everyone. My name is Jessica Klein and I am, as Libby mentioned, a data scientist at the Census Bureau, but I didn't start out that way. I've been at the Census Bureau for, it'll be 18 years in June. I started my career in 2007, luckily the year before the job market got scarce, so it was a really, really good time to get into Census, but I came in at an entry-level position. I had a degree in sociology and a minor in statistics, and the Census Bureau does require 15 credits of statistics to join.
So, I came in just answering phone calls from the public, and I was very fortunate that I also got to start a criminal justice master's degree at that same time. It was before virtual classes, so I'd spend all day in D.C. working, go to Baltimore at night, and go to school, but it worked out really well because as soon as there was an opening in the criminal justice branch at Census, I was able to join. And that was really where my ability to dive into data, surveys, and survey responses started, and that was about maybe 2010.
So, there still wasn't an official data science position at Census. That's new as of about 2020-2021, which is when they started hiring data scientists. So, all of us survey statisticians were doing data science in our own way, and that might involve SAS. There were a few SPSS licenses, but primarily SAS was the way we were analyzing data, diving in, creating visuals, some pretty cool deliverables, but still not at the level of the things that we see today.
So, fast forward to 2017, and I switched to another criminal justice branch. So, we have two at Census, and that was where I got to really dive into the data science field. We were offering a self-led data science training program in Coursera, and that was in both R and Python. So, I started with the R program, and I have not looked back since, and it has really changed the ability to look at our data, to touch the data, to correct data, to link data across time.
A little bit of what I like to do for fun outside of work is I play the piano. I spend a lot of time outside exploring in the forest that I live next to, and I also have two kids, an elementary schooler and a teenager.
Working with census data and open source tools
So, currently, and I will talk about my experience in the Criminal Justice Branch, but you can extrapolate this experience to all different programs all across the agency. We collect data on about 300 different topics. We have surveys. We collect administrative records.
Within my Criminal Justice Branch, we collect data from facilities. So, we collect data on juveniles in residential facilities, and then particular information on inmates. This goes through a central reporter that's giving data on behalf of their population. So, it's not always, I would say, sometimes it needs a little bit of correction when it comes back, and R gives me the ability to highlight what's an outlier, what's missing, what needs to be flagged: automated workflows that we can run our data through to highlight where the problems are and how we can fix them.
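To make that concrete, here is a minimal sketch of the kind of automated flagging workflow she describes. The field names, cutoff, and z-score rule are illustrative assumptions, not the Bureau's actual edit rules, and the sketch is in Python rather than R:

```python
from statistics import mean, stdev

def flag_responses(records, field, z_cutoff=3.0):
    """Flag missing values and z-score outliers in one reported field.

    `records` is a list of dicts keyed by a hypothetical "facility_id";
    the z-score rule stands in for whatever edit checks a real
    processing system would apply.
    """
    values = [r[field] for r in records if r.get(field) is not None]
    mu, sigma = mean(values), stdev(values)
    flags = {}
    for r in records:
        v = r.get(field)
        if v is None:
            flags[r["facility_id"]] = "missing"   # central reporter left it blank
        elif sigma > 0 and abs(v - mu) / sigma > z_cutoff:
            flags[r["facility_id"]] = "outlier"   # far from the other facilities
    return flags
```

A reviewer would then follow up with the flagged facilities rather than hand-checking every row.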
Another really neat thing that we have started doing in the last few years is linking our data, not only across time, but across programs. We have four separate surveys, and before, we were really looking at snapshots in time. What did the survey data look like for a year? And then we would pass it off to our sponsor for their own analysis, and they would create the publications. So now, I've been able to now link data for 20 years for one survey, 14 years for another survey, and then my survey has actually been collected at Census since 1930.
Now, because I started in this field, all of us are starting to, in the Criminal Justice branch, we're starting our own data science learning journey. We have people learning Python, we have people learning R because both of them give us the ability to work with the data on such a deeper level than we ever have been able to. So, it's like we're trying to take the best of both and apply it to our work in the most incredible way possible.
Transitioning from SAS to open source
You know, I think that Census was one of those really incredible examples of where the people doing the work and the higher-ups came together to see that this was going to be a valuable transition for everyone. Like I said, I started at Census long before we had any of these tools. I mean, we had R, and we had people using R, but it was very limited to the geographers, the people who were mapping, the people who were using our shapefiles.
I came, and, like I said, I asked for an SPSS license. They told me that it was really SAS; it was SAS or nothing. So, I took a bunch of SAS classes. I never got the hang of it, not the way that the people who were really SAS savvy were able to.
And that was not only my experience; it was many other people's experiences. That movement of, you know, there has to be something better, there has to be a better way to work with the data, brought the staff together at the analyst level. We helped each other. We didn't need our bosses to tell us to learn this. We just knew it was going to be valuable. We were working with data that was too big and too disparate to rely on something we only had a tentative grasp on how to use.
So, after the training program, when we started creating deliverables and showing them to our managers, that's when the buy-in really started. That's when people started seeing, you know, these are things that we can show to our sponsors. These are things that we can publish on our website to show the public that we're aligning with the way the world, I guess the education system, really, is going. In college, people are teaching R and Python. They're not teaching SAS as much.
And then, once managers started saying, you know, not only do we like this, but we want to add to the skill set. We want more data scientists and we want more people taking data science training. That's when we started seeing that top-down. And then, we met in the middle. And now, it does seem at every level, there's a value of data science. You might not understand data science as a manager, but you understand how you can leverage it.
Retention and career growth at the Census Bureau
That's a fantastic question. And, you know, this wasn't my plan at the very beginning. I really wanted to work in the criminal justice space. When I was little, I had dreams of being in the FBI one day. But, you know, life has a different path. And Census, I believe, as time went on, started saying, let's try to innovate, let's not stick with these traditional tables, these traditional ways of delivering data that we were all used to, which are a little bit harder, I think, for an average analyst to just work with.
But now that we have the right tools, I think that those of us who are really, really, not only enjoying the data science space, but wanting to see how far we can push it, it's made it an attractive agency to stay at. The Census has truly changed their mindset and I feel like pioneered this modernization effort. And I think rather than considering a job, I almost think like, where are we going to be next year? Like, I'm excited to stay and see what does the next year look like? What does it look like when we're all in the cloud? What does it look like when I get to move to the cloud?
Hiring and technical skills
So I have only gotten to be part of hiring one time, and that was before I became an official data scientist because I had a data science skill set. They allowed me to kind of come in and help assess skill set. When something's new, you don't really know what you're looking for.
I believe strongly that public GitHubs with practice examples, data science pipelines, projects that look similar to the agency or the organization that you're applying to go a long way, more so than listing out the packages and the skills. Those are very important to put on your resume. But I remember from my experience, just diving in and being able to see a project from beginning to end helped me understand what contribution that employee would be able to make toward our projects.
So all government hiring goes through USAJOBS, and I do believe that that might be a little scarce right now, but please don't give up. I believe that hiring will come back soon. And that's where that first step of the resume comes in. Hiring managers, I believe, are starting to bring in people who have that data science experience to go through the resumes.
I'll use tidycensus, for example. That's one of our favorite packages here. So they might see the tidycensus package, but they might not understand, well, what does that do for me as a data scientist? I know that gives you all the census data at the tips of your fingers. If you can use tidycensus or the Census API, you probably have a very good skill set.
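As a rough illustration of what "using the Census API" involves under the hood, here is a hedged Python sketch of building an ACS query URL. The endpoint layout follows the Census Data API User Guide linked above, but treat the variable code and parameters as illustrative rather than a definitive recipe:

```python
from urllib.parse import urlencode

def acs5_url(year, variables, geography, key=None):
    """Build an ACS 5-year query URL for api.census.gov.

    `variables` are Census variable codes (e.g. B01001_001E for total
    population) and `geography` is a `for` clause like "state:*".
    The optional API key is the free one from the signup page above.
    """
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {"get": ",".join(variables), "for": geography}
    if key:
        params["key"] = key
    return f"{base}?{urlencode(params)}"

url = acs5_url(2020, ["NAME", "B01001_001E"], "state:*")
```

Packages like tidycensus essentially build and parse requests like this for you, which is why they feel like having the data at your fingertips.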
Surprising findings in the data
One of the most interesting things that we recently saw was how data can kind of conflict with the news. The data that we're being provided is not the same as the data that we're seeing reported in news articles, and we're able to now more or less sync up information with the information that we're seeing in the public. I find that to be really interesting sometimes, where it either matches, where we can say, oh, yeah, this is what was reported to us, this is what I've seen reflected, or, you know, this is what was reported to us, but that's not what I've seen reflected in this piece of information I've found.
And that, I feel, is probably one of the more fun pieces of data science. It's the what did you find that you either didn't expect or is missing or you know you can correct or should correct from your analysis.
Census data and public policy
You know, that is something that I can't answer. I think that I can answer it with a non-answer. So as Census, we really try to stay away from that policy. I work in a reimbursable area, so my responsibility is to collect the data and then pass it off to my sponsor, the Bureau of Justice Statistics, to create their own analysis that will then shape policy.
A good example, maybe, where it didn't shape policy but had that real-life impact, is during hurricane season, when some of our colleagues were working with FEMA. They were able to use census data, the community resilience estimates, to help address which communities were going to need the most help, which communities were in danger of not being able to rebound because they didn't have all the resources they needed. And I always like this, where Census can actually help in real time. We're so used to collecting data and then maybe it's another year or so before something gets to happen with it, but those community resilience estimates, those emergency situations, are when you really get to see quick thinking with census data help people.
Geospatial tools and ArcGIS
That's a fantastic question. You know, since this was an open source conversation, I had intentionally left it to open source, but I am Tableau certified and have taken a full year of ArcGIS training, so those are two tools that we use and we rely on heavily. Similar to what you just said, how do we break free from these contracts, these proprietary software, we do want to try to loosen our footprint and our reliance on these softwares.
Our geography division does use ArcGIS probably more than any other area. We also have what's called the ArcGIS community explorer; it's kind of a web-enabled place where we can build dashboards, we can give little surveys, and we can give more interactive dashboards to our teams. If you are not producing deliverables, maps, or dashboards that are going to the public, you are encouraged to use R or Python for your solutions. Keep them internal, don't use a license if you don't need it, but if you are publishing, we currently don't have a way to publish outside of ArcGIS and Tableau.
You asked about packages. tigris, which is a complement to tidycensus, will give you all of the census shapefiles. I find that to be a fantastic package. There's a nine-hour webinar that Kyle Walker does every year. I find it to be one of the most valuable resources for using tidycensus and tigris. The content doesn't change that much, but there's the practice, and when they add new data sets, he gives a nice overview of how to access that new data.
Data sharing, security, and confidentiality
That's a great question. So I personally have only done a few of those little linkage projects, but the big one that we're working on, and we actually just had a presentation on this, is the fastLink package and probabilistic linking. We've had a few demonstrations on that, and it always seems to be not only very popular, but, kind of like you said, when people are presenting, I see people light up in the chat and they're like, oh yeah, if I'm doing linkage, that's what I'm working on too.
We have this linkage process that randomizes people. It groups people into buckets, whether that's facilities or people, and that's how we can do this linkage across different spaces. So sometimes it's not one-for-one, especially because we have to protect the data within our program. You had mentioned sharing data outside; it's much harder to share data outside the agency. Even for myself, as someone who works on reimbursable surveys, there are policies on when I can share the data and who I can share it with.
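fastLink itself is an R package that estimates match probabilities from the data with an EM algorithm. As a toy illustration of the underlying idea she's describing (comparing records field by field and combining the agreements into a match score), here is a hedged Python sketch; the weights and field names are made up for illustration:

```python
def agreement(a, b):
    """1 if fields agree exactly, 0 otherwise. (fastLink also supports
    partial string agreement; exact match keeps this toy sketch simple.)"""
    return int(a is not None and a == b)

def link_score(rec_a, rec_b, weights):
    """Sum per-field weights over agreeing fields.

    In a real probabilistic linkage these weights come from estimated
    match/non-match probabilities; here they are invented constants.
    """
    return sum(w * agreement(rec_a.get(f), rec_b.get(f))
               for f, w in weights.items())

weights = {"name": 4.0, "dob": 5.0, "zip": 2.0}
a = {"name": "smith", "dob": "1990-01-01", "zip": "20233"}
b = {"name": "smith", "dob": "1990-01-01", "zip": "21201"}
score = link_score(a, b, weights)  # name and dob agree, zip does not
```

The "buckets" she mentions correspond to blocking: only pairs within the same bucket (say, the same facility or state) are ever scored, which keeps the number of comparisons manageable.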
So our Census network is incredibly locked down. Not as locked down as probably DOD or some of the other agencies, but I've heard of other statistical agencies, maybe an agency like NOAA, that don't have those same levels of restrictions. We have a very, very strong firewall. We are not allowed to hook our laptops up off of the network.
But all of that leads to the idea that if you're working within your own computer or server (I have my own criminal justice server that our team works on), we don't have to worry as much about that. We're not sharing information outside of our network. We're not sharing it with people outside of our team. We have a policy called need-to-know, and every year we get trained on it: if someone needs to know, because it's related to their work, you may work with them on it. But if not, it is illegal for you to disclose that.
So that does mean that there are things with open source R and Python that we can't do, because we don't want to risk inadvertently sharing information. We can't use AI with this information. We can't use packages that aren't from CRAN, or PyPI, I believe, is the Python equivalent. So those are some of our security procedures to make sure that this information doesn't get into the wrong hands.
Survey weights and quality control
In the most incredible coincidence, or as fate would have it, that was actually my last R project, which I finished yesterday: I had to create sample weights for the sampling file for our upcoming collection. And it was a challenge. We had to deal with some amount of double coding against the way it used to be done in SAS. So that's something that I work on; I try to convert some of our SAS programs to R.
We don't do it by region, we do it by strata. Region is actually one of them, but it could also be the size, whether it's a secure facility versus a non-secure facility, and then maybe a few other details. Are they male only, female only? So that was the challenge, right? Making sure that the survey weights fit the expected sample versus the population. And it all matched yesterday, and it was amazing.
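The weighting step she describes can be sketched like this. The strata and counts below are invented for illustration (her real stratification uses more detail), but the check at the end is the same kind of validation she mentions: the weighted sample should reproduce the expected population totals.

```python
def base_weights(population_counts, sample_counts):
    """Base sampling weight per stratum: frame count / sample count.

    Each sampled unit in stratum h represents N_h / n_h population
    units, so the weighted sample totals back to the frame.
    """
    return {h: population_counts[h] / sample_counts[h]
            for h in sample_counts}

# hypothetical strata: size x security level
N = {"large_secure": 120, "large_nonsecure": 80, "small_secure": 300}
n = {"large_secure": 40, "large_nonsecure": 20, "small_secure": 50}
w = base_weights(N, n)

# validation: weighted sample counts should sum to the population total
total = sum(w[h] * n[h] for h in n)
```

This is also the kind of figure that double coding checks: the R weights and the legacy SAS weights should agree stratum by stratum.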
But I don't know if I would have been as successful if I did not have the SAS output to compare to. It was really not me being able to validate it because I'm not a math stat. I am familiar with statistics, but I'm a social data scientist is how I look at myself. But I can use open source really well. I can understand statistics. So I understood that my output made sense, but if I wanted my sponsor to buy into it, it was still going to have to match the output from years prior.
Professional development and upskilling
I love training. Training is one of the things that I find to be the most important as someone who's trying to maintain skills and learn new skills, and, as leader of our user group at our agency, I try to promote upskilling opportunities. And not only the ability to go back to skills that we all maybe know, or don't use as often and so forget, but to be able to adopt new skills too.
And I do think that that continuing education, the continuing of showing that you are still learning, you're taking training, you are furthering your professional development, does go a long way. That being said, I've learned that in the government level, I learned this through my own experience actually when I was hiring, that some of that professional development doesn't count towards an official education requirement.
So I do, like I said, think that training is the most valuable thing that someone can do for themselves, whether that's LinkedIn Learning, Udemy, Percipio, Pluralsight, Coursera, DataCamp, or Posit Academy, which I believe you guys have; all sorts of great options. But if there are certificates, like the master's certificate in data science I'm hoping to take at our local college, I think those things go a long way toward hiring.
Survey statisticians and data science
I do believe that every survey statistician is essentially the beginning of a data scientist. You just might not have the right tools, or you might not have gotten access to the right applications to be able to take it to that next level of true data science that you can show to your managers, but we're all working with complex data. I was trying to do the best I could with data science in Excel. And I do have someone on my team, actually, who uses Excel like magic sometimes.
I do believe, like I said, the survey statisticians at that fundamental level, you're working with the survey data, you have the ability to take that next step into the data science role. Where we're limited, I would say in the government, is that you have to go from 15 credits of math to 30 credits of math, and not everyone has that, because you probably wouldn't have gotten that unless you had a specific math major.
Version control and breaking down silos
Version control: GitHub. So we are huge on GitHub and GitLab. We're actively making strides to get the whole agency enabled with GitHub. Right now, it depends on what area you're in: GitLab for some areas, GitHub for others. I use GitLab, but I can't share things with people who don't have it, so we're trying to get to the point that we can share not only within our agency but outside it; with my sponsor, I had to email her all the code files.
Last year, I converted the sampling process from SAS to R, and there were two processes that were just a little bit more complex than I was used to at that point. I was able to borrow the code for one of our larger surveys. It was already out in the field, had already been vetted and approved by many people, and it was the same sampling process. So I used the code that they had put on their GitLab, and then their sponsor knew, okay, this is not something that Jessica just made up. This is something that's already been put into practice and already shown to work well, and we're trying to do more of that. We're trying to share rather than recreate.
We're all doing, I would say, a similar process for different surveys. So if we can get away from creating silos, get away from that redundancy, and all align and modernize together with similar code, I think it's just going to make for a much better way to not only get these things out to the public, but make sure that for years to come, we have consistency.
Technical support groups and community building
That is a great question, and that's really where our technical support groups come in. We don't have user groups; we have technical support groups. So we have an R technical support group, which I've been leading since 2021, and then we have a Python technical support group. And we have membership from all over the agency. It's open to anyone who's interested: staff, contractors, people in our regional offices.
An example: not last week, but the week before, we had a presentation on tidy data. Fundamental, but so important for so many people. And a lot of people came, and they said, you know, we really need to think about getting our data into a tidy format before we move into the cloud. One of the biggest challenges of the Census API (not tidycensus, but just the Census API) is that the data is in different formats. You have to do a lot of wrangling to get it into a similar format so you can combine it.
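Part of that wrangling is simply the response layout: the raw Census API returns JSON as a header row followed by data rows, not as records. Here is a minimal Python sketch of reshaping that layout; the sample payload mimics an ACS response, but the values are illustrative, not real estimates:

```python
# Mimics a response from a query like
# https://api.census.gov/data/2020/acs/acs5?get=NAME,B01001_001E&for=state:*
# First row is the header; every following row is one geography.
sample = [
    ["NAME", "B01001_001E", "state"],
    ["Texas", "29145505", "48"],
    ["Maine", "1362359", "23"],
]

def rows_to_records(payload):
    """Convert the header-row-plus-rows layout into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(sample)
```

Once everything is in a record (or tidy) shape like this, combining data sets from different endpoints becomes much easier, which is what packages like tidycensus do for you.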
Next Thursday, we're doing our R-SNAC session, where we're going to do just the basics of reading in and writing out data, and creating synthetic data. And how do the different file formats compare: what are you losing with different file formats, and what are you gaining in space? So just those little things as we all align. And if we can help people think of better ways to do it, better ways to store their data, process their data, analyze their data, then again, I think those silos will slowly start to break down and we'll see that we've made big strides together.
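In the same spirit as that session, here is a small hedged sketch (synthetic data, standard-library formats only) of what you gain and lose across file formats: CSV is more compact than JSON because it doesn't repeat field names in every record, but it silently drops types on the way back in.

```python
import csv
import io
import json
import random

random.seed(42)
# synthetic records: fake but realistically shaped
rows = [{"id": i, "count": random.randint(0, 500)} for i in range(1000)]

# write the same records as CSV and as JSON
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
json_text = json.dumps(rows)

# JSON repeats the key names in every record, so it is larger on disk
sizes = {"csv": len(csv_text), "json": len(json_text)}

# reading CSV back loses the integer types: every field comes back a string
back = list(csv.DictReader(io.StringIO(csv_text)))
```

Binary columnar formats (parquet and the like) change this trade-off again, which is exactly the kind of comparison a basics session can walk through.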
Career advice
You know, because I did not start as a data scientist, I didn't get what I think would be the best advice for now. And again, I think it's those practical projects, the projects that you can put together, and the ability to get that domain knowledge, which I think is sometimes so much more important than the data science skills themselves. I've met a lot of people with some really great skills who can innovate and modernize, but if they don't understand what they're trying to accomplish, it's very difficult; with lots of roadblocks, you can only get so far.
So just demonstrating subject matter expertise in the field that you're trying to get into, showing those projects, and being able to talk to a recruiter or an interviewer about the data, the projects, or the data science that you're most interested in and how it applies to the space you're applying to, I think really does go a long way. I have never talked to someone who has a love of data science and not thought, you know, I really would love to add that person to my team.
Working with large datasets and census resources
My data is not big, but I feel this question all the time because I have to help other people. People are wondering, how do I work with my large data set? So we have a space called the IRE, our Integrated Research Environment. That's where the really, really large data is housed. And that's where we have a few people with tips and tricks on parallel processing, the data.table package, and different things that will speed up your analysis.
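Those tips translate across languages. As one generic illustration (Python here, not IRE code, and a toy sum standing in for real analysis), processing a large source in fixed-size chunks keeps memory bounded, and each chunk could then be handed to a worker for the parallel processing she mentions:

```python
from itertools import islice

def chunked_sum(rows, size=100_000):
    """Aggregate a large iterable in fixed-size chunks so the whole
    dataset never has to sit in memory at once.

    The chunking pattern, not the toy sum, is the point: in practice
    each chunk could go to a worker process or be appended to a
    running result on disk.
    """
    it = iter(rows)
    total = 0
    while chunk := list(islice(it, size)):
        total += sum(chunk)
    return total

result = chunked_sum(range(1_000_000), size=50_000)
```

R's data.table gets its speed a different way (in-place operations and keyed indexing), but the shared idea is avoiding needless copies of large data.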
But that being said, that is probably one of our biggest challenges. And I believe that our move to the cloud will speed up what we can do with large data. If you are interested in working in the government, the Coding it Forward Civic Digital Fellowship, I cannot recommend it enough. Some of the most impressive people I've worked with have come to us from that program and stuck with us.
I am putting this link in the chat; that's our guide to working with census data. It is confusing. And you know what, it's almost embarrassing to say, but even though I worked at Census for so long, it wasn't until 2021, when I attended Kyle Walker's webinar, that it clicked, and that's why I recommend it to everyone. That's what helped me understand census data. He has an Analyzing US Census Data book that's online and free, and then it's just practicing it.
And then if you master tidycensus and tigris for visualization, I would suggest moving to the Census API and trying, not necessarily to get different data, but to see how you have to query and call the different data sets using the Census API. And that gives you the start of an understanding of how to access census data through the Census API; all the data is in a little bit of a different format. So you have to really start to understand nuances and read documentation, but once you get the general hang of it, it does seem to fall into place.
The linear algebra course was probably one of the best that I had, because I feel like it helps you even just visualize what you're trying to accomplish. But that's honestly one of the areas where I'm frequently trying to upskill and focus my training: math for data science. On LinkedIn Learning and Udemy, I've found a few courses that are statistics specifically for data science, or math for data science, and that seems to help narrow it down.
I want to remind everybody to join us next week. Jeroen Janssens is going to be sharing his wonderful experiences with us. He's a senior developer relations engineer here at Posit. He's also the author of Python Polars, the Definitive Guide, and Data Science at the Command Line. Fantastic person and wonderful stories to share.