
Jeff Leek | Data science education as a public health intervention in E. Baltimore | RStudio (2020)
Originally posted at https://rstudio.com/resources/rstudioconf-2020/data-science-education-as-an-economic-and-public-health-intervention-in-east-baltimore/
Transcript
This transcript was generated automatically and may contain errors.
I'm super excited to be here. This is my very first RStudio conf. Thank you so much for the invite. I'm going to talk about a project we've been doing in Baltimore to use data science to create new economic opportunities.
We rely heavily on tools developed by RStudio, so I'm really grateful for all the help they've given us over the last couple of years. I talk super fast, so you're going to see a lot of slides. If you want to follow along or look at them later, you can find them at this website.
So, first, I wanted to thank a lot of people who took a gamble on this project, and in particular, I really wanted to thank four people, Shannon Ellis, Aboozar Hadavand, Jamie McGovern, and Ashley Johnson. I had this completely harebrained idea to do this project, and they all gambled on it. So, thank you to all of them.
So, I'm going to turn it over to the other people who gambled on me with the project, and I really appreciate their time, and then you'll hear about some of the other people that have gambled on us, and hopefully it's paid off for them. So, they're the real MVPs of the talk that I am giving here. I'm just the mouthpiece.
So, this whole project has felt a lot like this. This is a project, I'll tell you, that has a lot of moving parts, and it's felt like, you know, throughout the project, we're a train going down a track, and we're rapidly trying to get debris off the track so that we don't go off. And all the people that I just showed you a minute ago are people that helped us solve problems.
Income inequality as a public health problem
So, as you probably heard or know, or you've read about in the news recently, income inequality is a big problem in the world. And I work in a school of public health, and it turns out income inequality is also a public health problem. If you look, this is actual real data. I know the line is so straight, it looks like I made it up. This is real data.
It is the relationship between how much money you make and what your life expectancy is based on census records for the entire United States. If you make, at the 80th percentile, on average, you live between five and seven years longer than people who make at the 20th percentile. This is a correlation, of course, not causation, but this is an incredibly tight correlation, and it also relates to every other public health consequence that you might imagine.
So, at the same time that economic inequality could cause some concern for us about public health differences and disparities, economic mobility is also at a record low. So, this is a map produced by the Harvard group, Raj Chetty's group, looking at economic mobility in the United States.
And this is the neighborhood around where I work at Johns Hopkins. Johns Hopkins is a massive, multibillion-dollar institution with a lot of very well-paid doctors, but the median income for a person that grows up in the neighborhood around our hospital is $18,000 for a family at the age of 34. So, that, and not only that, but the chance of going from that income to going up to a higher level of income is very small.
This isn't just a problem in cities. I grew up in Pocatello, Idaho, which is a random little town in Idaho, and there's a similar kind of economic mobility challenge, even in rural parts of the United States.
It turns out that the best way to solve the economic mobility problem, the best intervention anybody's ever invented, is education, but it turns out that you need access to education in order to be able to take advantage of this intervention. Technical education obviously breaks this trend and gives us an opportunity to offer inexpensive education and move people up the economic mobility ladder.
This whole project is predicated on the hypothesis that talent is evenly distributed, but opportunity isn't.
Teaching data science online
So, we've been teaching data science online for a little while now. We've taught more than 5 million people on the Coursera platform in just straight data science using R. We've also taught 6 million plus across a variety of different data science modalities, and so that's something we're really excited about and we're really proud of.
But one thing that we wanted to know is, could people use these programs to improve their economic prospects? So, we did a study, and we looked at what people's income was before they took our program and after, and the dots, which you probably can't see too well, but are up in this top left-hand corner, are people who had below-poverty-level income but ended up at above the 80th percentile of income after taking a $1,000 online program that they took in their spare time.
That was a very, very small subset of our Coursera students, but it's a pretty inspiring subset of those Coursera students. So, it turns out, though, that almost everybody that takes our Coursera classes is already educated. You know, the most common user of one of our Coursera classes is an upper-middle-class white male Silicon Valley engineer, so there's probably a few around here, which is great, and we love educating those folks, but it's not reaching the broad audience.
So, even if you build it, they won't necessarily come to the massive online open courses. So, we started thinking about, from first principles, how do we take somebody who's maybe never heard of data science and turn them into a functioning data scientist who can get a job and improve their economic prospects?
Building the Cloud-Based Data Science program
So, these are the challenges that we came up with. Shannon and Aboozar and Sean, who's sitting here in the front, got back in my office a few years ago. And so, we thought an expensive computer is a challenge. It's been said that a data scientist is a statistician working on a MacBook Pro. That'll run you about $3,000 for a MacBook Pro, but if your income last year as a family was $5,000, you're not buying a MacBook Pro.
So, we wanted to build a program that could be run on any computer that has an internet connection and a web browser. So, this is where we relied heavily on RStudio. We were the initial beta testers. Tareef and Robby helped us out a lot, got us on RStudio Cloud, helped us scale this whole program. That was the only way we would be able to do that.
So, we've been doing data science on a Chromebook for a while. You can do it with Google Slides, Google Sheets, but the real hero of this story in terms of technology is RStudio Cloud. So, we ran our whole new MOOC sequence. We built an entirely new MOOC sequence entirely based on RStudio Cloud so that anybody could take it. You could go to the public library and take our data science series on the web browser.
Then you'd have to know about data science. I've talked to a lot of the people in the community where we have run this program. They've never heard of that. They don't know it's a thing. They don't know you could do something with it. They don't know you could improve your life with it.
So, we had to tell people about that, and again, this is the same problem we had before. Most people who are taking these online courses are already educated, so they hear about data science. This is not exclusive to us. If you've heard of a data science boot camp or Lambda School or any of these other things, if you look at the data on who's being educated, it's already educated folks.
So, we went out and partnered with a nonprofit in East Baltimore just off the block where we work, Historic East Baltimore Community Action Coalition, and so they're amazing, and they have a GED training program, and so they also have access to a catchment area of a lot of really talented youth who maybe had not heard about data science. So, we worked with them to identify some people and to tell them about data science.
Then we needed an appropriate program. So, the cool thing about the tools that have been built by R is that with a relatively little amount of overhead, you can go from nothing to magic, and so we can compress training programs that make people very capable down into very short periods of time. With this much code, you can build a funny little website. With this much code, you can build an interactive dashboard on the NHANES data. With this much code, you can access a gigantic database and process thousands and thousands and you can scrape data or connect to APIs, all of the things that the packages have been. We've been early adopters of all of them.
So, we built a program called Cloud-Based Data Science. Anybody can take it. It's online, it's free. Go to clouddatascience.org. We have tons of people taking it all over the world, but we also partnered with this nonprofit to run the program locally in Baltimore.
So, we had our content development starting in February 2018. We had our first meeting with Yode, the HEBCAC training program. They said, hey, this is great. We should start tomorrow. We had started content development in February 2018, so we said, maybe next month? And they said, okay, May 21st it is.
So, remember, I had to build a whole training program and we had a month to do it. Fortunately, we had this amazing team of people who developed content for us in a distributed way, and Shannon Ellis and Aboozar Hadavand deserve the lion's share of the credit in terms of content development, but what was amazing is we actually had technologies developed by Sean Kross and by John Muschelli that allowed us to develop and deploy this content very quickly.
So, then we've solved a couple of the problems, but there are still some problems out there. We still got to get them a computer, access to instruction, income security.
Sean built some really awesome technologies that allowed us to automatically synthesize the voices around our videos and automatically generate all of our videos and all of our MOOCs from our Markdown documents and from Google Slides. So we can recompile and re-release new courses very, very rapidly to be able to deploy this program. Similarly, all of our tutorials are built in an amazing program called swirl, also developed by Sean and Nick Carchedi, that allows us to write sort of Markdown-like, YAML-like tutorials that people can then interactively take in R.
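As a rough illustration of generating narrated slides from a Markdown document, here is a minimal Python sketch. It is not the tooling described in the talk. It assumes a made-up convention in which each `## ` heading starts a slide and HTML comments hold the narration script, and the function name `slides_with_narration` is hypothetical. The text-to-speech and video-rendering steps are left out.

```python
import re


def slides_with_narration(markdown_text):
    """Split a Markdown document into slides and pull out narration scripts.

    Assumed convention (hypothetical): every '## ' heading starts a new slide,
    and HTML comments inside a slide hold the text to be read aloud.
    """
    slides = []
    for chunk in re.split(r"(?m)^## ", markdown_text):
        if not chunk.strip():
            continue  # nothing before the first heading
        title, _, body = chunk.partition("\n")
        # Collect the narration from HTML comments and normalize whitespace.
        notes = re.findall(r"<!--(.*?)-->", body, flags=re.DOTALL)
        script = " ".join(" ".join(note.split()) for note in notes)
        # What the learner sees is the body with the comments removed.
        visible = re.sub(r"<!--.*?-->", "", body, flags=re.DOTALL).strip()
        slides.append({"title": title.strip(), "body": visible, "script": script})
    return slides


DOC = (
    "## Intro\nHello\n<!-- Welcome to\nthe course. -->\n"
    "## Next\n<!-- Second slide. -->\nMore text\n"
)
SLIDES = slides_with_narration(DOC)
```

Each `script` string could then be handed to a speech synthesis service and paired with an image of the slide. Keeping the narration next to the slide source is what makes it cheap to recompile a course after an edit.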
So, we wanted to go after this new cohort which comes from a different sort of level of education and from a different background, so we built a training program that has in-person office hours three hours a week, you get paid to complete the courses, you get a free Chromebook, and then we help you with job assistance. I get up very, very early in the morning because I have young kids and my post-doc goes to sleep very, very late because he doesn't have young kids, and so you get basically 24-hour Slack support when you're taking our courses.
And then we decided to run a pilot. I was going to run it with 20 people, and Joe Cheng came to visit us in Baltimore and he said that is a terrible idea, try two first. So we did, we tried it with two people. The learning started May 21st. How long should the program take? We don't know, we just made this up last week. So, we said three months, which is a dramatically short timeline to learn a lot of data science, especially for people who had never heard of data science and were already looking at us a little skeptically to start.
Surprisingly, even with these challenges of us basically being behind in content development, both of the candidates finished the program in four months. The new cohorts have finished in the three months we originally projected, because now the content is done, unlike for the first cohort, where it wasn't. So we've had multiple cohorts go through this and we've had an 80% success rate. This is an incredible success rate, given that the people that we're working with had never heard about data science beforehand and it's really a testament to the incredibly talented people that we're working with who just didn't have an opportunity to do this, and now they pick it up way faster. They're teaching me stuff now.
Employment challenges and solutions
The next challenge that we had is we wanted to use this as an economic intervention, but if you've ever seen a job ad for a data scientist, these are your actual roles and responsibilities, but these are the job requirements. And so, that's a bit of a problem when we're training people to do really powerful, useful stuff, but the job requirements look like that.
It's true for this, as I'm picking on someone in particular in the Baltimore area, but this is, you know, you need a bachelor's degree and five years of experience, and what they really needed was somebody to look into their SQL databases and figure out what was going on. So, it turns out there's a problem with us directly employing the data scientists we've trained if they don't have a bachelor's degree.
One of the people that's completed our program has had a bachelor's degree. He's actually here today somewhere. There he is. Anthony. You should go talk to him later. And Antonio, who's also here, completed our program, is getting his bachelor's degree, and he's going to be a doctor someday as fast as he possibly can.
But the Fair Labor Standards Act has, you know, these carve-outs for who can be a salaried employee. I learned a lot about HR law through this project, and it turns out data scientist doesn't meet the software engineering requirement for the Fair Labor Standards Act to be an exempt employee. I've talked to a lot of lawyers about this. So, this is a problem because now you have to hire the folks that we've trained as hourly employees.
So I was this close. I've trained incredibly talented people who are super smart, can do incredible things, but they don't have a job that they can get because the labor laws require them to be an hourly job, but they're being listed as salary jobs, and so I'm in a bit of a pickle. I work at Hopkins, and I'm a faculty member, and when you're a faculty member at a place like Hopkins, you can pitch a fit and try to get them to do things for you. And so I did that, and I have over the last year and a half pitched enough fits that they've created a new job classification just last week so that people that complete our training program can be employed at Johns Hopkins as data scientists, so very excited about that.
But in the meantime, we didn't have a place for people to be employed even though they're super talented. Well, it turns out a lot of people were emailing me to do data science consulting work because I'm on Coursera all the time, and I used to just hit delete on all those emails, sorry, but then I had an amazing outreach from some folks. This guy's from Wieden+Kennedy, and right at the time when I was launching this, he said, hey, do you want to do some stuff together?
A couple of the people from WK might be here who we work with a lot. They're amazing. And they, we, so we've started to work with companies like that and to basically do data science as a service for them. And so what we found in doing that is with a lot of companies, WK is one of the best case scenarios. In a lot of the cases we work with companies, the main problem is the data isn't organized. As everyone knows, 80% of the problem is data cleaning.
So we've kind of created a product which is tidy data as a service. We call it Streamline, but the idea is basically we go in and build a custom set of pipelines that pull everybody's data in, put it in a database every night, and so they have the clean data that they need to use. And so we've been deploying this, the team — you should meet Antonio and Anthony back, they're super talented. Hire them away from us, please, and then I'll train more.
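To make the "tidy data as a service" idea concrete, here is a minimal Python sketch of a nightly job that reshapes a messy export and reloads it into a database. It is an illustration under stated assumptions, not the Streamline product. The wide CSV layout, the `sales` table, and the functions `tidy_rows` and `load_nightly` are all hypothetical, and an in-memory SQLite database stands in for whatever database a client actually uses.

```python
import csv
import io
import sqlite3


def tidy_rows(raw_csv_text):
    """Reshape a wide CSV (one column per month) into one observation per row."""
    reader = csv.DictReader(io.StringIO(raw_csv_text))
    rows = []
    for record in reader:
        client = record["client"].strip().lower()  # normalize the identifier
        for column, value in record.items():
            if column == "client" or value is None or value.strip() == "":
                continue  # skip the id column and missing cells
            rows.append((client, column.strip(), float(value)))
    return rows


def load_nightly(conn, rows):
    """Replace the previous night's table contents with the fresh tidy rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (client TEXT, month TEXT, amount REAL)"
    )
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical messy export: stray whitespace and an empty cell.
RAW = "client,2020-01,2020-02\n Acme ,10,12.5\nBolt,,7\n"
conn = sqlite3.connect(":memory:")
load_nightly(conn, tidy_rows(RAW))
```

In practice a scheduler would run a job like this every night against each data source, so that analysts always query one clean table instead of re-cleaning raw exports.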
And so we founded this company and we decided we were going to put it right in the neighborhood where this is happening. So our company is headquartered like right up the block from Hopkins. We're hiring data scientists from this training program, and we think of that as their first data science opportunity, but they're so talented that's definitely not going to be their last data science opportunity.
So this is the project we set out to try to accomplish. We've tackled each of these bit by bit. It's still a little bit held together with sticks and mud and glue and the hard work of Anthony and Antonio and their friends, but we're taking a big swing at how do we tackle getting RStudio and R and these sort of talents, just like Carl mentioned, how do we use this as an engine to create economic opportunities for people who wouldn't otherwise have them in this area and tackle these big public health problems head on.
So if you want to read more about this, again, jtleek.com/talks, they wrote a nice little article about us. And then if you want to get involved, we would love for you to reach out to us about being a mentor, hiring a graduate of our program, becoming a customer of our company, or donating to help us train more people in data science. And thank you very much for your attention.
Q&A
Okay, so having set a new RStudio conf record for syllables per second.
Yes!
First question that comes up, where can people get training in all the stuff you had to learn that wasn't data science? You mentioned HR. I imagine there's a whole bunch of other things you had to learn.
To run this project?
To do things like this.
That's a good question. I'm not sure I'd recommend that someone go and try to learn all those things. What it is, is, you know, it's this, the reason why we call the company Problem Forward is because we think about problem forward, not solution backward. We're trying to figure out what's the problem and then solve the barriers instead of ending up with a solution and reverse engineering it. And what I found in not only this project, but every project is, if you pick a project, an end point or a project you really care about, a problem you really care about, you just have to learn a whole bunch of stuff that is unrelated. You wish it would just be R scripting, but it never is.
Okay. Next question. How has all of this changed what you teach and how you teach at the university?
Yeah, so the online courses were actually, I've been teaching this advanced data science course at Johns Hopkins for years using R and RStudio and stuff like that before Coursera. Then we did Coursera and now we've been, Coursera and online programs like LeanPub and Coursera have been kind of our like, it's like our education labs. We like run it there first, try it out, see how it works, learn all the things that go wrong when a thousand people try to learn it that way and then kind of build it back into our local in-person program. It's kind of a cool, yeah, way to do that.
Okay, and one last question and this one's from me. If you could send email back in time to the day you first thought of doing this, what would you tell yourself?
So many things, but I think the first thing that I would say is be super grateful to all of the people that took the gamble with you on this. Like I said, a lot of people kind of gambled with their careers to help me do this and I deeply appreciate that.
And then the second thing I think I would have said, and this is a true story, is JJ talked about creating an economic engine that allows him to do other things first and I didn't generate the economic engine first and that has been challenging. I wish I had done economic engine first, social mission second, but, you know, it's still working out. I'm not sure that's a hard and fast thing, but it's a nice idea. So, thanks.
