Data Science at the Census Bureau | Jessica Klein | Data Science Hangout
To join future data science hangouts, add it to your calendar here: https://pos.it/dsh - All are welcome! We'd love to see you! We were recently joined by Jessica Klein, Data Scientist at the Census Bureau, to chat about open-source adoption at the Census Bureau, the growth of R and Python communities within the agency, training opportunities and internal support systems, and working with complex census datasets. In this Hangout, Jessica talked about the agency's move towards the cloud and its efforts to reduce reliance on proprietary software like SAS and ArcGIS. She shared insights into the internal support system, mentioning training programs, workshops, and resources for using the Census API. She also discussed the challenges and strategies involved in transitioning from SAS to R and Python, including the importance of double coding for validation with mathematical statisticians and the increasing use of version control with Git and GitLab. The development of internal technical support groups for R and Python was a key element in fostering knowledge sharing and breaking down silos within the agency.
Resources mentioned in the video and Zoom chat:
- Jessica Klein's LinkedIn: https://www.linkedin.com/in/jessica-klein-8a5a35196/
- tidycensus package: https://walker-data.com/tidycensus/
- tigris package: https://github.com/walkerke/tigris
- fastLink package for probabilistic linking: https://github.com/kosukeimai/fastLink
- StatQuest YouTube channel: https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw
- 3Blue1Brown YouTube channel: https://www.youtube.com/c/3blue1brown
- {sf} package for R spatial analysis: https://r-spatial.github.io/sf/
- Resources on spatial analysis in R: https://rspatial.org/, https://rspatialdata.github.io/index.html, and https://asdar-book.org/
- Apache Superset (open-source visualization tool): https://superset.apache.org/
- Coding it Forward Civic Digital Fellowship program: https://blog.codingitforward.com/tagged/civic-digital-fellowship

Resources shared directly by Jessica in the chat:
- Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker: https://walker-data.com/census-r/index.html
- Webinar series on using tidycensus for analyzing and mapping Census data: https://ssdan.net/events/the-2025-ssdan-webinar-series-2023-acs-data-with-r-mapping-tools-and-the-2020-census/
- A Guide to Working with US Census Data in R: https://rconsortium.github.io/censusguide/
- Using the API requires getting an API key, which you can get for free here: https://api.census.gov/data/key_signup.html
- Census Data API User Guide: https://www.census.gov/content/dam/Census/data/developers/api-user-guide/api-guide.pdf
- Explore the ABCs of the Census API with Census Bureau staff: https://www.youtube.com/watch?v=JbThy7GUg3k
- Tutorial on Census API Basics: Simple Steps to Better Data Access: https://www.youtube.com/watch?v=d-FJ2IyVfdk
- Tutorial on Using the API to Get All Results for an ACS Table: https://www.census.gov/library/video/2020/using-api-all-results-for-acs-table.html
- Tutorial on Using the API to Get Results for Multiple Estimates: https://www.census.gov/library/video/2020/using-api-results-for-multiple-estimates.html
- Explore ACS Data Stories: Stats in Action! to see how the public is using ACS data: https://www.census.gov/programs-surveys/acs/about/acs-data-stories.html
- Visit the Slack show-and-tell channel and show how you use Census data: https://app.slack.com/client/T6AL55003/CRY4R9R7S

If you didn't join live, one great discussion you missed from the Zoom chat was about the enthusiastic reception and usefulness of the tidycensus package for accessing and working with census data in R, with many attendees expressing their appreciation and sharing resources related to it. Let us know below if you'd like to hear more about it!

Subscribe to our channel here: https://bit.ly/2TzgcOu

Follow us here:
- Website: https://www.posit.co
- Hangout: https://pos.it/dsh
- LinkedIn: https://www.linkedin.com/company/posit-software
- Bluesky: https://bsky.app/profile/posit.co

Thanks for hanging out with us!
Transcript
This transcript was generated automatically and may contain errors.
Alrighty, welcome back everybody to the Data Science Hangout. If we have not met, my name is Libby. I'm a data science community manager here at Posit and I'm also a really passionate data science educator. I have a background in business statistics and experience teaching Python and R. I am based in San Antonio, Texas where right now it's a lovely 65 degrees outside.
If you aren't familiar with Posit, Posit builds enterprise solutions and open source tools for people who do data science with R and Python. We are also the company formerly called RStudio, so if you know RStudio, you know us. Rachel, who started the Hangout many years ago, just popped up in here as well, so let me have Rachel hop in really quickly and introduce herself.
Sorry if I'm out of breath. I was running from an event in Boston. Hi everybody, so nice to see you. I'm Rachel. I lead customer marketing at Posit and I'll be hanging out here behind the scenes too.
Well, the Hangout is our open space to hear about what's going on in the world of data across a bunch of different industries. We chat about data science practices, data science leadership, we connect with other people who share our experiences, and we get together every Thursday, same time, same place, here on Zoom. So if you are watching this on YouTube later and you want to join us live in the future, I highly recommend it.
All right, at the Hangout we love hearing from you. It doesn't matter what your years of experience are, your titles, your industry, the languages that you use or do not use, and we really encourage you to connect with each other in the chat.
Introducing Jessica Klein
I am so excited to be joined by our featured leader today, Jessica Klein, a data scientist at the Census Bureau. Jessica, I would love it if you could introduce yourself, tell us a little bit about your role and what you do outside of work for fun.
Absolutely. So, hello, everyone. My name is Jessica Klein and I am, as Libby mentioned, a data scientist at the Census Bureau, but I didn't start out that way. I've been at the Census Bureau for, it'll be 18 years in June. I started my career in 2007, luckily the year before the job market got scarce, so it was a really, really good time to get into Census, but I came in at an entry-level position. I had a degree in sociology and a minor in statistics, and the Census Bureau does require 15 credits of statistics to join.
So, I came in just answering phone calls from the public, and I was very fortunate that I also got to start a criminal justice master's degree at that same time. It was before virtual classes, so I'd spend all day in D.C. working, go to Baltimore at night, and go to school, but it worked out really well because as soon as there was an opening in the criminal justice branch at Census, I was able to join. And that was really where my ability to dive into data, surveys, and survey responses started, and that was about maybe 2010.
So, there still wasn't an official data science position at Census. That's new as of about 2020-2021, which is when they started hiring data scientists. So, all of us survey statisticians were doing data science in our own way, and that might involve SAS. There were a few SPSS licenses, but primarily SAS was the way we were analyzing data, diving in, creating visuals, some pretty cool deliverables, but still not at the level of the things that we see today.
So, fast forward to 2017, and I switched to another criminal justice branch. So, we have two at Census, and that was where I got to really dive into the data science field. We were offering a self-led data science training program in Coursera, and that was in both R and Python. So, I started with the R program, and I have not looked back since, and it has really changed the ability to look at our data, to touch the data, to correct data, to link data across time.
A little bit of what I like to do for fun outside of work is I play the piano. I spend a lot of time outside exploring in the forest that I live next to, and I also have two kids, an elementary schooler and a teenager.
Working with census data and open source tools
So, currently, and I will talk about my experience in the Criminal Justice Branch, but you can extrapolate this experience to all different programs all across the agency. We collect data on about 300 different topics. We have surveys. We collect administrative records.
Within my Criminal Justice Branch, we collect data from facilities. So, we collect data on juveniles in residential facilities, and then particular information on inmates. This goes through a central reporter that's giving data on behalf of their population. So, it's not always, I would say, sometimes it needs a little bit of correction when it comes back, and R gives me the ability to highlight what's an outlier, what's missing, what needs to be flagged: automated workflows that we can run our data through to highlight where the problems are and how we can fix them.
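To make that concrete, here is a minimal sketch of the kind of automated flagging workflow she describes. The field names, cutoff, and z-score rule are illustrative assumptions, not the Bureau's actual edit rules, and the sketch is in Python rather than R:

```python
from statistics import mean, stdev

def flag_responses(records, field, z_cutoff=3.0):
    """Flag missing values and z-score outliers in one reported field.

    `records` is a list of dicts keyed by a hypothetical "facility_id";
    the z-score rule stands in for whatever edit checks a real
    processing system would apply.
    """
    values = [r[field] for r in records if r.get(field) is not None]
    mu, sigma = mean(values), stdev(values)
    flags = {}
    for r in records:
        v = r.get(field)
        if v is None:
            flags[r["facility_id"]] = "missing"   # central reporter left it blank
        elif sigma > 0 and abs(v - mu) / sigma > z_cutoff:
            flags[r["facility_id"]] = "outlier"   # far from the other facilities
    return flags
```

A reviewer would then follow up with the flagged facilities rather than hand-checking every row.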
Another really neat thing that we have started doing in the last few years is linking our data, not only across time, but across programs. We have four separate surveys, and before, we were really looking at snapshots in time. What did the survey data look like for a year? And then we would pass it off to our sponsor for their own analysis, and they would create the publications. So now, I've been able to now link data for 20 years for one survey, 14 years for another survey, and then my survey has actually been collected at Census since 1930.
Now, because I started in this field, all of us are starting to, in the Criminal Justice branch, we're starting our own data science learning journey. We have people learning Python, we have people learning R because both of them give us the ability to work with the data on such a deeper level than we ever have been able to. So, it's like we're trying to take the best of both and apply it to our work in the most incredible way possible.
Transitioning from SAS to open source
You know, I think that Census was one of those really incredible examples of where the people doing the work and the higher-ups came together to see that this was going to be a valuable transition for everyone. Like I said, I started at Census long before we had any of these tools. I mean, we had R, and we had people using R, but it was very limited to the geographers, the people who were mapping, the people who were using our shapefiles.
I came, and, like I said, I asked for an SPSS license. They told me that it was really SAS; it was SAS or nothing. So, I took a bunch of SAS classes. I never got the hang of it, not the way that the people who were really SAS savvy were able to.
And that was not only my experience; it was many other people's experiences. That movement of, you know, there has to be something better, there has to be a better way to work with the data, brought the staff together at the analyst level. We helped each other. We didn't need our bosses to tell us to learn this. We just knew it was going to be valuable. We were working with data that was too big and too disparate to rely on something we only had a tentative grasp on how to use.
So, after the training program, when we started creating deliverables and showing them to our managers, that's when the buy-in really started. That's when people started seeing, you know, these are things that we can show to our sponsors. These are things that we can publish on our website to show the public that we're aligning with the way the world, I guess the education system, really, is going. In college, people are teaching R and Python. They're not teaching SAS as much.
And then, once managers started saying, you know, not only do we like this, but we want to add to the skill set. We want more data scientists and we want more people taking data science training. That's when we started seeing that top-down. And then, we met in the middle. And now, it does seem at every level, there's a value of data science. You might not understand data science as a manager, but you understand how you can leverage it.
Retention and career growth at the Census Bureau
That's a fantastic question. And, you know, this wasn't my plan at the very beginning. I really wanted to work in the criminal justice space. When I was little, I had dreams of being in the FBI one day. But, you know, life has a different path. And Census, I believe, as time went on, started saying, let's try to innovate, let's not stick with these traditional tables, these traditional ways of delivering data that we were all used to, which are a little bit harder, I think, for an average analyst to just work with.
But now that we have the right tools, I think that those of us who are really, really, not only enjoying the data science space, but wanting to see how far we can push it, it's made it an attractive agency to stay at. The Census has truly changed their mindset and I feel like pioneered this modernization effort. And I think rather than considering a job, I almost think like, where are we going to be next year? Like, I'm excited to stay and see what does the next year look like? What does it look like when we're all in the cloud? What does it look like when I get to move to the cloud?
Hiring and technical skills
So I have only gotten to be part of hiring one time, and that was before I became an official data scientist because I had a data science skill set. They allowed me to kind of come in and help assess skill set. When something's new, you don't really know what you're looking for.
I believe strongly that public GitHubs with practice examples, data science pipelines, projects that look similar to the agency or the organization that you're applying to go a long way, more so than listing out the packages and the skills. Those are very important to put on your resume. But I remember from my experience, just diving in and being able to see a project from beginning to end helped me understand what contribution that employee would be able to make toward our projects.
So all government hiring goes through USAJOBS, and I do believe that that might be a little scarce right now, but please don't give up. I believe that hiring will come back soon. And that's where that first step of the resume comes in. Hiring managers, I believe, are starting to bring in people who have that data science experience to go through the resumes.
I'll use tidycensus, for example. That's one of our favorite packages here. So they might see the tidycensus package, but they might not understand, well, what does that do for me as a data scientist? I know that gives you all the census data at the tips of your fingers. If you can use tidycensus or the Census API, you probably have a very good skill set.
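As a rough illustration of what "using the Census API" involves under the hood, here is a hedged Python sketch of building an ACS query URL. The endpoint layout follows the Census Data API User Guide linked above, but treat the variable code and parameters as illustrative rather than a definitive recipe:

```python
from urllib.parse import urlencode

def acs5_url(year, variables, geography, key=None):
    """Build an ACS 5-year query URL for api.census.gov.

    `variables` are Census variable codes (e.g. B01001_001E for total
    population) and `geography` is a `for` clause like "state:*".
    The optional API key is the free one from the signup page above.
    """
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {"get": ",".join(variables), "for": geography}
    if key:
        params["key"] = key
    return f"{base}?{urlencode(params)}"

url = acs5_url(2020, ["NAME", "B01001_001E"], "state:*")
```

Packages like tidycensus essentially build and parse requests like this for you, which is why they feel like having the data at your fingertips.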
Surprising findings in the data
One of the most interesting things that we recently saw was how data can kind of conflict with the news. The data that we're being provided is not the same as the data that we're seeing reported in news articles, and we're able to now more or less sync up information with the information that we're seeing in the public. I find that to be really interesting sometimes, where it either matches, where we can say, oh, yeah, this is what was reported to us, this is what I've seen reflected, or, you know, this is what was reported to us, but that's not what I've seen reflected in this piece of information I've found.
And that, I feel, is probably one of the more fun pieces of data science. It's the what did you find that you either didn't expect or is missing or you know you can correct or should correct from your analysis.
Census data and public policy
You know, that is something that I can't answer. I think that I can answer it with a non-answer. So as Census, we really try to stay away from that policy. I work in a reimbursable area, so my responsibility is to collect the data and then pass it off to my sponsor, the Bureau of Justice Statistics, to create their own analysis that will then shape policy.
A good example, maybe, where it didn't shape policy but had that real-life impact, is during hurricane season, when some of our colleagues were working with FEMA. They were able to use census data, the community resilience estimates, to help address which communities were going to need the most help, which communities were in danger of not being able to rebound because they didn't have all the resources they needed. And I always like this, where Census can actually help in real time. We're so used to collecting data and then maybe it's another year or so before something gets to happen with it, but those community resilience estimates, those emergency situations, are when you really get to see quick thinking with census data help people.
Geospatial tools and ArcGIS
That's a fantastic question. You know, since this was an open source conversation, I had intentionally left it to open source, but I am Tableau certified and have taken a full year of ArcGIS training, so those are two tools that we use and we rely on heavily. Similar to what you just said, how do we break free from these contracts, these proprietary software, we do want to try to loosen our footprint and our reliance on these softwares.
Our geography division does use ArcGIS probably more than any other area. We also have what's called the ArcGIS community explorer; it's kind of a web-enabled place where we can build dashboards, we can give little surveys, and we can give more interactive dashboards to our teams. If you are not producing deliverables, maps, or dashboards that are going to the public, you are encouraged to use R or Python for your solutions. Keep them internal, don't use a license if you don't need it, but if you are publishing, we currently don't have a way to publish outside of ArcGIS and Tableau.
You asked about packages. tigris, which is a complement to tidycensus, will give you all of the census shapefiles. I find that to be a fantastic package. There's a nine-hour webinar that Kyle Walker does every year. I find it to be one of the most valuable resources for using tidycensus and tigris. The content doesn't change that much, but there's the practice, and when they add new data sets, he gives a nice overview of how to access that new data.
Data sharing, security, and confidentiality
That's a great question. So I personally have only done a few of those little linkage projects, but the big one that we're working on, and we actually just had a presentation on this, is the fastLink package and probabilistic linking. We've had a few demonstrations on that, and it always seems to be not only very popular, but, kind of like you said, when people are presenting, I see people light up in the chat and they're like, oh yeah, if I'm doing linkage, that's what I'm working on too.
We have this linkage process that randomizes people. It groups people into buckets, whether that's facilities or people, and that's how we can do this linkage across different spaces. So sometimes it's not one-for-one, especially because we have to protect the data within our program. You had mentioned sharing data outside; it's much harder to share data outside the agency. Even for myself, as someone who works on reimbursable surveys, there are policies on when I can share the data and who I can share it with.
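fastLink itself is an R package that estimates match probabilities from the data with an EM algorithm. As a toy illustration of the underlying idea she's describing (comparing records field by field and combining the agreements into a match score), here is a hedged Python sketch; the weights and field names are made up for illustration:

```python
def agreement(a, b):
    """1 if fields agree exactly, 0 otherwise. (fastLink also supports
    partial string agreement; exact match keeps this toy sketch simple.)"""
    return int(a is not None and a == b)

def link_score(rec_a, rec_b, weights):
    """Sum per-field weights over agreeing fields.

    In a real probabilistic linkage these weights come from estimated
    match/non-match probabilities; here they are invented constants.
    """
    return sum(w * agreement(rec_a.get(f), rec_b.get(f))
               for f, w in weights.items())

weights = {"name": 4.0, "dob": 5.0, "zip": 2.0}
a = {"name": "smith", "dob": "1990-01-01", "zip": "20233"}
b = {"name": "smith", "dob": "1990-01-01", "zip": "21201"}
score = link_score(a, b, weights)  # name and dob agree, zip does not
```

The "buckets" she mentions correspond to blocking: only pairs within the same bucket (say, the same facility or state) are ever scored, which keeps the number of comparisons manageable.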
So our Census network is incredibly locked down. Not as locked down as probably DOD or some of the other agencies, but I've heard of other statistical agencies, maybe an agency like NOAA, that don't have those same levels of restrictions. We have a very, very strong firewall. We are not allowed to hook our laptops up off of the network.
But all of that leads to the idea that if you're working within your own computer or server (I have my own criminal justice server that our team works on), we don't have to worry as much about that. We're not sharing information outside of our network. We're not sharing it with people outside of our team. We have a policy called need-to-know, and every year we get trained on it: if someone needs to know, because it's related to their work, you may work with them on it. But if not, it is illegal for you to disclose that.
So that does mean that there are things with open source R and Python that we can't do, because we don't want to risk inadvertently sharing information. We can't use AI with this information. We can't use packages that aren't from CRAN, or PyPI, I believe, is the Python equivalent. So those are some of our security procedures to make sure that this information doesn't get into the wrong hands.
Survey weights and quality control
In the most incredible coincidence, or as fate would have it, that was actually my last R project, which I finished yesterday: I had to create sample weights for the sampling file for our upcoming collection. And it was a challenge. We had to deal with some amount of double coding against the way it used to be done in SAS. So that's something that I work on; I try to convert some of our SAS programs to R.
We don't do it by region, we do it by strata. Region is actually one of them, but it could also be the size, whether it's a secure facility versus a non-secure facility, and then maybe a few other details. Are they male only, female only? So that was the challenge, right? Making sure that the survey weights fit the expected sample versus the population. And it all matched yesterday, and it was amazing.
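The weighting step she describes can be sketched like this. The strata and counts below are invented for illustration (her real stratification uses more detail), but the check at the end is the same kind of validation she mentions: the weighted sample should reproduce the expected population totals.

```python
def base_weights(population_counts, sample_counts):
    """Base sampling weight per stratum: frame count / sample count.

    Each sampled unit in stratum h represents N_h / n_h population
    units, so the weighted sample totals back to the frame.
    """
    return {h: population_counts[h] / sample_counts[h]
            for h in sample_counts}

# hypothetical strata: size x security level
N = {"large_secure": 120, "large_nonsecure": 80, "small_secure": 300}
n = {"large_secure": 40, "large_nonsecure": 20, "small_secure": 50}
w = base_weights(N, n)

# validation: weighted sample counts should sum to the population total
total = sum(w[h] * n[h] for h in n)
```

This is also the kind of figure that double coding checks: the R weights and the legacy SAS weights should agree stratum by stratum.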
But I don't know if I would have been as successful if I did not have the SAS output to compare to. It was really not me being able to validate it because I'm not a math stat. I am familiar with statistics, but I'm a social data scientist is how I look at myself. But I can use open source really well. I can understand statistics. So I understood that my output made sense, but if I wanted my sponsor to buy into it, it was still going to have to match the output from years prior.
Professional development and upskilling
I love training. Training is one of the things that I find to be the most important as someone who's trying to maintain skills and learn new skills, and, as leader of our user group at our agency, I try to promote upskilling opportunities. And not only the ability to go back to skills that we all maybe know, or don't use as often and so forget, but to be able to adopt new skills too.
And I do think that that continuing education, the continuing of showing that you are still learning, you're taking training, you are furthering your professional development, does go a long way. That being said, I've learned that in the government level, I learned this through my own experience actually when I was hiring, that some of that professional development doesn't count towards an official education requirement.
So I do, like I said, think that training is the most valuable thing that someone can do for themselves, whether that's LinkedIn Learning, Udemy, Percipio, Pluralsight, Coursera, DataCamp, or Posit Academy, which I believe you guys have; all sorts of great options. But if there are certificates, like the master's certificate in data science I'm hoping to take at our local college, I think those things go a long way toward hiring.
Survey statisticians and data science
I do believe that every survey statistician is essentially the beginning of a data scientist. You just might not have the right tools, or you might not have gotten access to the right applications to be able to take it to that next level of true data science that you can show to your managers, but we're all working with complex data. I was trying to do the best I could with data science in Excel. And I do have someone on my team, actually, who uses Excel like magic sometimes.
I do believe, like I said, the survey statisticians at that fundamental level, you're working with the survey data, you have the ability to take that next step into the data science role. Where we're limited, I would say in the government, is that you have to go from 15 credits of math to 30 credits of math, and not everyone has that, because you probably wouldn't have gotten that unless you had a specific math major.
Version control and breaking down silos
Version control: GitHub. So we are huge on GitHub and GitLab. We're actively making strides to get the whole agency enabled with GitHub. Right now, it depends on what area you're in: GitLab for some areas, GitHub for others. I use GitLab, but I can't share things with people who don't have it, so we're trying to get to the point that we can share not only within our agency but outside it; with my sponsor, I had to email her all the code files.
Last year, I converted the sampling process from SAS to R, and there were two processes that were just a little bit more complex than I was used to at that point. I was able to borrow the code for one of our larger surveys. It was already out in the field, had already been vetted and approved by many people, and it was the same sampling process. So I used the code that they had put on their GitLab, and then their sponsor knew, okay, this is not something that Jessica just made up. This is something that's already been put into practice and already shown to work well, and we're trying to do more of that. We're trying to share rather than recreate.
We're all doing, I would say, a similar process for different surveys. So if we can get away from creating silos, get away from that redundancy, and all align and modernize together with similar code, I think it's just going to make for a much better way to not only get these things out to the public, but make sure that for years to come, we have consistency.
Technical support groups and community building
That is a great question, and that's really where our technical support groups come in. We don't have user groups; we have technical support groups. So we have an R technical support group, which I've been leading since 2021, and then we have a Python technical support group. And we have membership from all over the agency. It's open to anyone who's interested: staff, contractors, people in our regional offices.
An example: not last week, but the week before, we had a presentation on tidy data. Fundamental, but so important for so many people. And a lot of people came, and they said, you know, we really need to think about getting our data into a tidy format before we move into the cloud. One of the biggest challenges of the Census API (not tidycensus, but just the Census API) is that the data is in different formats. You have to do a lot of wrangling to get it into a similar format so you can combine it.
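Part of that wrangling is simply the response layout: the raw Census API returns JSON as a header row followed by data rows, not as records. Here is a minimal Python sketch of reshaping that layout; the sample payload mimics an ACS response, but the values are illustrative, not real estimates:

```python
# Mimics a response from a query like
# https://api.census.gov/data/2020/acs/acs5?get=NAME,B01001_001E&for=state:*
# First row is the header; every following row is one geography.
sample = [
    ["NAME", "B01001_001E", "state"],
    ["Texas", "29145505", "48"],
    ["Maine", "1362359", "23"],
]

def rows_to_records(payload):
    """Convert the header-row-plus-rows layout into a list of dicts."""
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]

records = rows_to_records(sample)
```

Once everything is in a record (or tidy) shape like this, combining data sets from different endpoints becomes much easier, which is what packages like tidycensus do for you.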
Next Thursday, we're doing our R-SNAC session, where we're going to do just the basics of reading in and writing out data, and creating synthetic data. And how do the different file formats compare: what are you losing with different file formats, and what are you gaining in space? So just those little things as we all align. And if we can help people think of better ways to do it, better ways to store their data, process their data, analyze their data, then again, I think those silos will slowly start to break down and we'll see that we've made big strides together.
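In the same spirit as that session, here is a small hedged sketch (synthetic data, standard-library formats only) of what you gain and lose across file formats: CSV is more compact than JSON because it doesn't repeat field names in every record, but it silently drops types on the way back in.

```python
import csv
import io
import json
import random

random.seed(42)
# synthetic records: fake but realistically shaped
rows = [{"id": i, "count": random.randint(0, 500)} for i in range(1000)]

# write the same records as CSV and as JSON
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "count"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()
json_text = json.dumps(rows)

# JSON repeats the key names in every record, so it is larger on disk
sizes = {"csv": len(csv_text), "json": len(json_text)}

# reading CSV back loses the integer types: every field comes back a string
back = list(csv.DictReader(io.StringIO(csv_text)))
```

Binary columnar formats (parquet and the like) change this trade-off again, which is exactly the kind of comparison a basics session can walk through.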
Career advice
You know, because I did not start as a data scientist, I didn't get what I think would be the best advice for now. And again, I think it's those practical projects, the projects that you can put together, and the ability to get that domain knowledge, which I think is sometimes so much more important than the data science skills themselves. I've met a lot of people with some really great skills who can innovate and modernize, but if they don't understand what they're trying to accomplish, it's very difficult; with lots of roadblocks, you can only get so far.
So just demonstrating subject matter expertise in the field that you're trying to get into, showing those projects, and being able to talk to a recruiter or an interviewer about the data, the projects, or the data science that you're most interested in and how it applies to the space you're applying to, I think really does go a long way. I have never talked to someone who has a love of data science and not thought, you know, I really would love to add that person to my team.
Working with large datasets and census resources
My data is not big, but I feel this question all the time because I have to help other people. People are wondering, how do I work with my large data set? So we have a space called the IRE, our Integrated Research Environment. That's where the really, really large data is housed. And that's where we have a few people with tips and tricks on parallel processing, the data.table package, and different things that will speed up your analysis.
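Those tips translate across languages. As one generic illustration (Python here, not IRE code, and a toy sum standing in for real analysis), processing a large source in fixed-size chunks keeps memory bounded, and each chunk could then be handed to a worker for the parallel processing she mentions:

```python
from itertools import islice

def chunked_sum(rows, size=100_000):
    """Aggregate a large iterable in fixed-size chunks so the whole
    dataset never has to sit in memory at once.

    The chunking pattern, not the toy sum, is the point: in practice
    each chunk could go to a worker process or be appended to a
    running result on disk.
    """
    it = iter(rows)
    total = 0
    while chunk := list(islice(it, size)):
        total += sum(chunk)
    return total

result = chunked_sum(range(1_000_000), size=50_000)
```

R's data.table gets its speed a different way (in-place operations and keyed indexing), but the shared idea is avoiding needless copies of large data.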
But that being said, that is probably one of our biggest challenges. And I believe that our move to the cloud will speed up what we can do with large data. If you are interested in working in the government, the Coding it Forward Civic Digital Fellowship, I cannot recommend it enough. Some of the most impressive people I've worked with have come to us from that program and stuck with us.
I am putting this link in the chat; that's our guide to working with census data. It is confusing. And you know what, it's almost embarrassing to say, but even though I worked at Census for so long, it wasn't until 2021, when I attended Kyle Walker's webinar, that it clicked, and that's why I recommend it to everyone. That's what helped me understand census data. He has an Analyzing US Census Data book that's online and free, and then it's just practicing it.
And then if you master tidycensus and tigris for visualization, I would suggest moving to the Census API and trying, not necessarily to get different data, but to see how you have to query and call the different data sets using the Census API. And that gives you the start of an understanding of how to access census data through the Census API; all the data is in a little bit of a different format. So you have to really start to understand nuances and read documentation, but once you get the general hang of it, it does seem to fall into place.
The linear algebra course was probably one of the best that I had, because I feel like it helps you even just visualize what you're trying to accomplish. But that's honestly one of the areas where I'm frequently trying to upskill and focus my training: math for data science. On LinkedIn Learning and Udemy, I've found a few courses that are statistics specifically for data science, or math for data science, and that seems to help narrow it down.
I want to remind everybody to join us next week. Jeroen Janssens is going to be sharing his wonderful experiences with us. He's a senior developer relations engineer here at Posit. He's also the author of Python Polars, the Definitive Guide, and Data Science at the Command Line. Fantastic person and wonderful stories to share.