Carl Howe & Greg Wilson | Data Science Education in 2022 | RStudio (2020)

More people are learning data science every day, and there are more ways for them to learn than ever before. To understand where we are and where we might be going, this talk looks at what data science education could look like two years from now: far enough away that we can dream, but close enough that we can only dream a little. We explore the balance between automated and collaborative learning, different ways to deliver different kinds of lessons to different kinds of people, and ways in which our tools and practices could improve.

Transcript

This transcript was generated automatically and may contain errors.

All right, now it's good afternoon. So if you were coming here expecting to hear Greg Wilson, who is a well-known educator, talk about the future of data science education in 2022, you're going to be disappointed. Because he wrote the abstract for this talk, and then he talked to me and said, you're going to have to give it.

So my name is Carl Howe, I'm the Director of Education for RStudio, and I'm very honored to be here to talk to so many wonderful people. If you want to know a little bit more about me, I actually have a very old and decrepit website at carlhowe.com, but it does have my full resume.

One of the things you should know is this is my third career. I started my life as a software engineer, I have a degree in computer science. Then I became an analyst. I don't know how you feel about analysts, I was an industry analyst, and one of my jobs was to predict the future.

And I love this quote that came from a Danish parliamentarian, which is that it's difficult to make predictions, especially about the future.

But I actually thrived in that business for about 20 years before I became the Director of Education here. I actually did also travel around the world teaching big data before I came here. But the thing that drew me here was RStudio's mission. And you heard it from JJ this morning, but I think it's worth repeating. We want to equip everyone, regardless of means, to participate in a global economy that rewards data literacy.

And one of the ways we're doing that is we have become a public benefit corporation. We think this is incredibly important. We want these tools to be around for hundreds of years. And we're willing to play the long game, to work hard to do this.

RStudio's education mission

Now when I joined, I decided we needed to have a mission in our education group as well. And by the way, when I say our education group, there's a variety of people: Greg is on the education group, Alison Hill, Garrett Grolemund, who's not able to be here because he's on family leave. But we also include people like Hadley Wickham, Jenny Bryan, all the folks in the Tidyverse team. So it's a great group.

But we established a mission, which is we want to train the next million R users. Actually we did a hashtag and a talk last year about this. And you can actually use that hashtag to tweet about it.

So let me talk a little bit about some of the things we've done to get there. Greg has actually headed up a group that is also our certified RStudio instructor team. We have nearly 100 certified instructors already.

But here at rstudio::conf, I think this has been incredibly exciting for me, that we have held 19 workshops here, led by a teaching staff of over 100, and teaching 1,300 students on Monday and Tuesday this week. I believe, and I'm happy to take contrary views, that this is the largest R education event in the world.

And in addition to this, we also have other contributions. Many of you have seen our free online books, our packages for education, open source materials on GitHub, free academic licenses for RStudio Pro products. If you're an academic using our products for teaching, you get free licenses. And our annual survey of how people actually learn R.

So we're doing a lot. But I'm sad to say, it's not nearly enough. We need to do more. And the reason is that when we survey our users, we discover something important, which is that most R users are self-taught. 48%, almost half, are self-taught. Which means no matter how wonderful an instructor I am, I'm not reaching most of the world.

Three challenges for data science education

And one of the challenges I see is that data science faces serious challenges in the next two years. You know, you would think, two years from now, it's within our vision. It's right on the horizon. But I think there are some big changes coming. And in the immortal words of Yoda, if you're not afraid, you will be. So I'm going to go through three challenges to guide you through this.

The first is the data explosion. Frankly, we're drowning in data nowadays. I'm going to use just a really simple metric from IBM that said 90% of the data in the world was created in the last two years. 90%.

So if you think of data science as looking for a needle in a haystack, we have a problem. Because if we're looking for a needle in a haystack today, we're going to be looking at 10 haystacks by 2022. And by the way, by the time we get to 2024, we'll have 100 haystacks.
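
As a rough sketch of the arithmetic behind the haystack comparison: if 90% of all data is always less than two years old, the remaining 10% was the entire total two years earlier, so the volume grows tenfold every two years. The function name, the steady-state reading of the IBM figure, and the 2020 baseline are assumptions made for illustration, not something shown in the talk.

```python
def haystacks(years_from_now, new_fraction=0.9, period_years=2.0):
    """Relative data volume after `years_from_now`, with today's volume = 1.

    Assumes (hypothetically) that a constant `new_fraction` of all data was
    created within the most recent `period_years`.
    """
    # If 90% is new, the old 10% was the whole total one period ago,
    # so the total multiplies by 1 / (1 - 0.9) = 10 each period.
    factor_per_period = 1.0 / (1.0 - new_fraction)
    return factor_per_period ** (years_from_now / period_years)


# Baseline year 2020 is an assumption (the year the talk was given).
for year in (2020, 2022, 2024):
    print(year, round(haystacks(year - 2020)), "haystack(s)")
```

Under these assumptions the sketch gives one haystack today, 10 after two years, and 100 after four, matching the figures quoted above.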

Now I know most of you sort of say, oh, but computers get faster. It's no big deal. Moore's law isn't going to bail us out. This is Moore's law. Factor of two increase in computing power every 18 months. Actually originally a marketing campaign done by Intel, but it's worked out pretty well. This is the data explosion. If we look out five years, Moore's law gives us a factor of about four and a half in terms of computing power. The data explosion, we're going to have 46 times as much data in five years. So we're drowning in data, and we're not really prepared to process it.

Secondly, I want to talk about something that I really don't have to explain to you, but it's such a great topic, and it's something we really have to pay attention to. It's irreproducibility results. Sorry, I can't even say the word. Unreproducible results. We no longer trust science.

This is a real issue, and it's not because science is inherently untrustworthy. It's that we're kind of screwing up how we do it. There's a great article in Physics Today. By the way, I'm borrowing from Garrett Grolemund's slides here, because they're just so compelling. He heard about this study that was being done about supercooled water published in Physics Today. They published the paper. They were basically just cooling down water, it had to be very clean water, past the point of freezing down to very, very low temperatures. Two labs ran the same experiment, and they got different results.

After seven years of suing companies like Nature, I think it was published ... No, it's in Physics Today, of course. After suing the publisher, they finally got access to the other team's data. What they discovered was there was a bug in the code that caused their studies to diverge.

You can blame this as just one isolated institute, but there's a famous Amgen study that was done in 2012 that discovered that when they looked at landmark results, the foundations of their very science, they could reproduce only six out of 53. Six out of 53.

Let's not just blame the pharma industry. This is everywhere. If you look in the field of psychology, for example, they're only able to reproduce 36 out of 100 important psychology studies. If we look at economics, it's the same story, 29 out of 59. Look at Nature and Science, the two premier publications in the scientific world, 13 of 21.

By the way, if you want the references, they're right here. I did some very serious research on this. I worked very hard on it. I did 21 coin tosses and ended up with 12 heads. If you look at the ratios there, economics is doing less well than the coin tosses and Nature and Science are only just barely doing better.
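
The comparison with the coin tosses is easier to see as ratios. The sketch below simply divides the counts quoted in this talk; the labels and the `rate` helper are illustrative names, and the counts are taken as spoken rather than checked against the original studies.

```python
# (label, successes, attempts) exactly as quoted in the talk
results = [
    ("Amgen landmark studies", 6, 53),
    ("Psychology", 36, 100),
    ("Economics", 29, 59),
    ("Nature and Science", 13, 21),
    ("Coin tosses (heads)", 12, 21),
]


def rate(successes, attempts):
    """Fraction of attempts that succeeded."""
    return successes / attempts


rates = {label: rate(s, n) for label, s, n in results}

# Print from lowest to highest rate so the coin toss baseline is easy to spot.
for label, r in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{label:<24} {r:6.1%}")
```

Sorted this way, economics lands below the coin tosses and Nature and Science lands just above them, which is the point being made.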

Needless to say, this has caused a few headlines. It ranges from the New York Times to the Atlantic and to the Wall Street Journal.

I just want to try to put this in perspective. In the United States, so not the world, in the United States, the cost to the biomedical industry in the United States for irreproducible research is estimated to be about $28 billion. Just to put $28 billion in perspective, that's enough money to buy a latte for everyone on earth from Starbucks. Or if you like, it might be a little bit more fun if we just simply say, I can buy everyone in this room with this money your own island in the Bahamas while supplies last.

This is a serious issue. That's just in the United States and that's just in the biomedical industry. It's trillions of dollars worldwide. So this is a real thing. Unreproducible results is something we're going to have to address in data science education.

All right, the third one is I think, I'm not sure how many people are going to see this one coming. It's fake data. Data has just become another way to lie. If you look at the headlines, you see this every day.

We had the Brexit issue where they were promised there was going to be 350 million pounds for the NHS. That's not happening. Hundreds of climate skeptics have convened an international campaign to get rid of the net zero targets and there are people donating millions to anti-vaxxers.

Now I blame this really on media explosion. We get overwhelmed with media everywhere we go. It didn't used to be the case that you would get the latest headlines on your person within seconds of them being published. So the world has sped up, we're overwhelmed with media. So what do we do?

When people are overwhelmed with media, they turn to their trusted sources. The most trusted people in the United States or in the world even, they are people like Facebook, Google, and Twitter. They're relying on their friends for their facts.

Royal Society Open Science has actually published a paper on this called "Fake Science and the Knowledge Crisis: Ignorance Can Be Fatal." And the real danger was actually illustrated very well by the television series Chernobyl. The real danger is if we hear enough lies, we no longer recognize the truth when we see it.

Now I'm going to illustrate this with a true life example. And some of you probably know this, it's a very famous study. It's called the Rogoff-Reinhart model. Anybody heard of this before?

This was published in 2010, I believe, and the conclusion is very simple. Two Harvard professors published a report that said when a country owes more than 90% of its GDP, it slides into recession. And they had done studies and models and they were very proud of this. And this was the foundation for austerity in Europe for the next 10 years.

Well, it turns out that some folks at the University of Massachusetts Amherst said, well, this is a great study. It's the sort of thing I can give to a graduate student to reproduce. So a fellow by the name of Thomas Herndon took this data and discovered a small problem. He actually talked to the Harvard professors. They gave him their very detailed Excel spreadsheet. And when he looked at it, he saw this.

Over on the right hand of that table, there's a little selection there. If you notice, it doesn't cover all the countries in the list. They missed five countries at the bottom. And so they've just been dragging this column across the spreadsheet and they missed these five countries. If you include the five countries, their conclusion is completely different. There is no implied descent into recession.
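
To see how a selection that stops a few rows short can flip a conclusion, here is a minimal sketch. The ten countries and their growth figures are entirely made up for illustration; they are not the study's data, and the real spreadsheet is not reproduced here.

```python
from statistics import mean

# Hypothetical growth rates (%) for ten invented countries. NOT the study's data.
growth = {
    "A": -1.0, "B": 0.5, "C": -0.5, "D": 0.0, "E": 0.5,
    "F": 2.5, "G": 3.0, "H": 2.0, "I": 3.5, "J": 2.5,
}
values = list(growth.values())

selected_rows = 5                         # the dragged selection stops five rows short
truncated = mean(values[:selected_rows])  # average over the partial selection
full = mean(values)                       # average over every row

print(f"partial selection: {truncated:+.1f}%")
print(f"all rows:          {full:+.1f}%")

# A scripted analysis can check its own coverage, which a dragged range cannot.
if selected_rows != len(values):
    print(f"warning: only {selected_rows} of {len(values)} rows were included")
```

With these invented numbers the partial selection averages below zero while the full table averages above it. The coverage check at the end is the kind of guard that is easy to add in code and easy to miss in a spreadsheet.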

Rogoff and Reinhart's results weren't reproducible, yet they led to this. They led to the cutoff of a lot of medical care, provisions for the poor, because the claim was that Greece couldn't afford it, and Britain for that matter.

What we need to do

All right, so let me finish up here. We do see some hope. And I just want to say, one of the things about analysts is they're just somebody who reports the obvious, preferably before someone else does. So I'm just going to finish up with, here's what we need to be doing. We need to focus on teaching data science, not coding science.

So over on the left hand side, let's talk about the changes we need. And this one's going to be real obvious. We're going to embrace statistics. Data science embraces statistics on real data. Why? Because when the data gets really big, you can no longer process all of it, which means you're going to have to do sampling. That requires statistics. Why real data? Because real world stuff is just messy. If you don't know how to deal with it, you're not going to be able to do data science well.
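
As a small sketch of why sampling requires statistics: when the full data set is too large to process, a random sample gives an estimate plus a measure of how far off it might be. The synthetic population, the sample size, and the use of a normal-approximation 95% interval are all assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

random.seed(2020)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for a data set that is "too big" to process in full.
population = [random.gauss(50, 10) for _ in range(100_000)]

# Draw a simple random sample without replacement.
sample = random.sample(population, 1_000)

estimate = mean(sample)
std_error = stdev(sample) / len(sample) ** 0.5  # standard error of the mean

# Approximate 95% confidence interval (normal approximation).
low, high = estimate - 1.96 * std_error, estimate + 1.96 * std_error

print(f"sample mean: {estimate:.2f}")
print(f"approx. 95% interval: {low:.2f} to {high:.2f}")
```

The interval is what turns "I only looked at 1% of the data" into a statement about the whole, which is the statistical step that pure coding skills do not supply.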

There's a wonderful book coming out from Jenine Harris at Washington University in St. Louis called Statistics with R. I've only gotten a preview of it. But this is a wonderful narration, a wonderful story of dealing with real world data sets using R.

My second point is we're going to have to use public processes. And here, good data science relies on R Markdown computational documents and open processes. All of you are probably pretty familiar with R Markdown. We love it. But I don't think people have really understood the huge impact of being able to take our process for doing science, documenting it, and creating papers that we don't read. They're papers that we run.

But we're going to need a few other things. We're going to need curated data so that we can trust the data we work with. We're going to need computational documents. Those are pretty easy. But we also need open source publishing of sources to allow others to run our work.

I'm just going to wind up here with the fourth point, which is fake data. How do we combat fake data? Well, we need authoritative storytellers. We have to democratize data science. And here I'm going to just do a shout out to a few organizations and people.

I'm going to start with R-Ladies. Probably the most amazing organization for democratizing data science in the world. They got together and they said, let's take the data science to the experts. The women who are working in these industries. And by the way, it's not just women. It's everyone. But they found a way to really engage a new audience.

But our democratized experts must be viral storytellers. And here I'm going to ask you a question. Who would you trust more to diagnose your illness? A data scientist who learned some medicine or a doctor who learned some data science?

Anybody going for number one? Okay. Well, here I want to give a shout out to Stephan Kadauke. He's an MD and PhD at the Children's Hospital of Philadelphia. And he has created a program for doctors to teach doctors. They don't want to hear from nurses or other folks. They want to learn from doctors. And he's promoting R to do that.

Another person I'm going to give a shout out to is Desiree De Leon. And you're going to hear from her later. She's a fabulous person. And I'm not going to say anything about that because she's talking.

And the final person I want to talk to is a farmer. A farmer who is sitting in my R Markdown and Interactive Dashboards course on Monday and Tuesday. He's from Brazil. He came here to learn how to create interactive dashboards so he could help other Brazilian farmers do farming better.

So those are our challenges. We need to focus on teaching data science and not coding science. And it's really too soon to tell whether this is going to be successful. But with your help, each of you can help us overcome the new challenges in data science education and help everyone, regardless of means, participate in a global economy that rewards data literacy. Thank you so much.

Q&A

Okay, Carl, we've got time for a couple of questions. First one is, how do you see traditional schools and traditional curriculum fitting into all of this?

Yeah. Well, there's a lot of ways we could work that. But one thing I see is that data science is going to go more broadly. The tier one colleges don't need our help. Community colleges, heck yeah. We need a lot more democratization to the community colleges is one thing. I'm working with a bunch of groups, including the Concord Consortium, on pushing down statistics and specifically learning statistics with R into high schools. Big problem with high schools, by the way, is their curriculum is already full. So a lot of this is extracurricular. But a lot of new curricula are coming out of these organizations that address high schools.

And the other one is, you gave a good shout out to R-Ladies. Are there any other organizations that you think people ought to be putting their attention to that may not be as famous yet?

rOpenSci, I think, is a great one. Any others? Actually, you're probably better at that than I am. I haven't met everyone yet. I'm working on it, though. It's a big planet.

There's an organization called StatPrep, statprep.org.

Thank you very much. Another round, please, for Carl, and we'll get our next speaker up.