Resources

Carl Howe | The next million R users | RStudio (2019)

Many students believe that R is obscure, complex, and difficult to write. However, data from a new large-scale survey of R users conducted by RStudio shows that new R users are taking dramatically different learning paths from those who learned R as recently as 2 years ago, and these new learning paths are changing its perception. In this talk, we'll present this new survey data, describe how new tools and techniques for teaching R can satisfy the demands of today's R learners, and outline a vision for adding millions of new R users to our community. VIEW MATERIALS https://github.com/rstudio/learning-r-survey/blob/master/slides/Next-Million-R-Users.pdf

Transcript

This transcript was generated automatically and may contain errors.

My journey to this conference started one year and one day ago. As with many of you, it started with a tweet from Hadley Wickham. How much more could you want than this? RStudio is hiring a director of education. Please apply if you love stats and education.

At this time, I was teaching big data for a consulting company that had been acquired by a very large technology company. I was putting on about 100,000 air miles per year. At that level, you get to know the TSA folks on a first-name basis. Hey Tom, how's the wife and kids? And it seemed like staying around home might be a little bit more fun.

So I interviewed with, basically, my heroes. Garrett Grolemund, Mine Çetinkaya-Rundel, Hadley Wickham. Finally, it comes to the point where I interview with Tareef, who's the president of RStudio. And I'm a little nervous. And I get a little bit more nervous when he tells me, you know, I'm not sure you're who we're looking for. I don't think you're going to work out because we're thinking a little bit bigger than you are. We're thinking that we really want to equip everyone, regardless of means, to participate in a global economy that rewards data literacy.

So I had to go home and think about that. And we had coffee scheduled for the next day. And I came in and I met with him again, and I gave him my best shot. And as a result, I am now the director of education at RStudio. And the shot I gave him was this. This is what's called a BHAG, a Big Hairy Audacious Goal. Comes from an old book called Built to Last. And my BHAG was really simple. RStudio education's mission is going to be to train the next million R users.

Now this sounds sort of preposterous when you first think about it. But frankly, it's smaller than the population of San Jose. Not so bad. And it is a little bit larger than the number of people who own Klingon dictionaries. By the way, there are 300,000 people that own Klingon dictionaries. This particular one says, I must attend RStudio Conf.

The learning R survey

So it's easy to make a Big Hairy Audacious Goal. But how do you execute on it, right? So I'm a member of AA. No, I'm not an alcoholic. I'm actually in Analysts Anonymous. I was an industry analyst for 15 years. And when you ask an analyst, tell me how you're going to do something, they start asking questions back. And my questions were, well, who learns R? And why do they learn R? And how do they learn R? And what keeps them from learning R?

Well, again, analysts are trained to do this. I said, well, the clear way we answer these questions is we do a survey. And I brought this up to Hadley as a proposal for doing this talk. And Hadley comes back and says, gee, using data to make decisions. I think I've heard of that somewhere before.

So on December 6th, I put out this tweet of my own, which is, do you wonder how and why people learn R? And if you've learned or are planning to learn R, fill out our five-minute survey. Now, thanks to Laura Acion, who's in Argentina, we also got a Spanish language translation of the entire survey. This went up a few days later. And as a result, we actually gathered data.

So today, I'm here to announce RStudio's Learning R survey is now published. It was fielded between December 6th and December 31st. The respondents were solicited from community.rstudio.com, Twitter followers of RStudio employees and colleagues, and reddit.com data science. What I think is really terrific is we received 3,300 responses. I would have been happy with about 1,000, but we got 3,300, and about 10% are Spanish speaking. And all of this data is now at this repo. Every single bit of data we collected and all the plots I'm about to show you are available for you to peruse and the code necessary to analyze it. It's about 2,000 lines of R Markdown.

Now here's where we have to do the warning messages. First warning, significant sampling bias may be present. These results are not necessarily representative of the general R population. This is, in fact, 3,300 of RStudio's closest friends. The other thing is there's a lot of open text in this survey. And that means if you're in K through 12 and you want to teach your kids data science using this survey, you should know that this is restricted data. K through 17 requires accompanying parent or guardian.

Survey findings: who learns R

Now we're going to blast through the data. There's no way I can do in 20 minutes any real honor to this data. But let me just give you the highlights. Who learns R? Well, here's a choropleth of the respondents to our survey. We got responses from 110 different countries. These are the top 10. Again, thanks to Laura, we have Argentina as number three. She tweeted through R-Ladies there.

Most respondents consider themselves to be intermediate users. They're not experts. They're not beginners. And 3% say none. And those are really important people because those are the people we're trying to reach. Males made up about 75% of respondents. I had thought we were doing better on diversity than we are. This is an area where we have some work to do. The mean age was 35, although there's clearly a long tail. You can see where I am. I'm kind of over towards the right-hand side there. And respondents started learning R at a mean age of 30.

So basically most people in our survey had an average level of experience with R of about five years, although clearly there's big tails on both ends. Our community mostly has advanced degrees. Master's is number one, doctorate number two, bachelor number three. So these are pretty educated people. And almost half of them work in research or education. So a very educated group, but also working in the research or education community. Technology is number three, and health and medicine is number four. So that's who answered our survey.

Survey findings: why and how they learn R

Let's go on to why do they learn R. Well, okay. This is where we start getting into data that's going to make you go, duh. Most of them learn it to do statistical analysis. But what's interesting is about one in six did this because they were personally interested. They're interested in learning stats. So that's kind of cool. And about 13% said they think it'll open new career opportunities. So that's really neat.

So then we asked them, okay, what tools do you use with R? And here we get some, or what applications do you use, I'm sorry. And here, statistical analysis, number one, again, duh. Visualization number two. And in fact, there's some detail here that's interesting. In the English language version, visualization is number one. Spanish language statistics is number one. So a little bit of difference between the two populations.

So how do they learn R? This is education, so we want to know how they're learning. And respondents mostly learn R on their own. To me, this is obvious, but at the same time, I think one of the biggest takeaways from this survey. People are learning R on their own. It's a problem, actually, as we'll see. I had thought online courses would be higher, but it's not. People are using books and online materials, Stack Overflow. You can actually learn R from Stack Overflow. We don't recommend it.

If you heard Felienne's talk this morning, exploration is not the most efficient way to get there, but it's certainly one of the ways you can learn. If we think about some of our own packages, like the Tidyverse, has that had any effect? Only 45% learned using the Tidyverse, but today, do you use the Tidyverse when you use R today? 67%, two-thirds of people are using the Tidyverse. So I think Tidyverse has really moved the needle.

Now one question I'd like to ask, this is a very specific question. It's used for computing something called the net promoter score. People are really excited about R. They love R. As a matter of fact, when we ask this question, how likely are you to recommend R to a colleague, friend, or a family member, we get a net promoter score of 71%. And that is actually excellent. If it were four points higher, it would be world class. That type of loyalty cannot be bought. It's a fabulous number. It's in the level of brands like Apple and Google and things of that sort.
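
For readers unfamiliar with the metric: a net promoter score is conventionally computed from 0 to 10 "how likely are you to recommend" ratings as the percentage of promoters (ratings of 9 or 10) minus the percentage of detractors (ratings of 0 through 6). The sketch below is a minimal Python illustration under that standard definition. The function name and the sample ratings are hypothetical and are not taken from the survey repo or its data.

```python
def net_promoter_score(ratings):
    """Return the net promoter score for 0-10 'likely to recommend' ratings.

    Assumes the conventional buckets: promoters rate 9-10,
    detractors rate 0-6, and ratings of 7-8 are passive.
    """
    ratings = list(ratings)
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    # Percentage of promoters minus percentage of detractors.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical ratings for illustration only (not the survey responses).
sample = [10, 10, 9, 9, 9, 10, 8, 7, 6, 10]
print(net_promoter_score(sample))
```

Passive ratings (7 or 8) count toward the total but toward neither bucket, which is why the score can range from -100 to 100.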

Survey findings: what keeps people from learning R

The tools that people use, the IDE gets love from our friends, no surprise there. But look, Microsoft Excel, still showing up there. We have to reach Excel users. All right, the last bit of data, what keeps them from learning R? So we'll start with the R users themselves. What did they find most difficult? And number one is error messages. I don't think that's any surprise to any of us. But let's look at the non-R users. So it's a pretty small sample here. So beware, these numbers are a little squishy. But nonetheless, language syntax is number one. Getting started is number two.

So quick recap, the majority of R users are pretty sophisticated and educated researchers. They learn R by themselves and not necessarily through formal courses. And one of the biggest challenges for learners is getting started. OK, so what? What can we do?

What RStudio education is doing

So here's where I want to just give you a few activities we're doing within RStudio and within our education department. The first thing we can do is hire more teachers that can teach R. I have the privilege of working with what I believe are some of the best teachers in the world for teaching this. You've met most of them at this conference. Some of them have spoken either yesterday or today. But I wanted to alert people to the fact that we've hired a couple more. Greg Wilson, I'll talk about him in a moment. Alison Hill is actually here today handling this session. And if you were in the Advanced R Markdown session, you would have seen her teach for the entire first day of that. Fabulous people.

But we're announcing a program today called RStudio Certified Teachers. These are new programs beginning this month. It's a two-level program with both instructors and coaches. So the idea is we can certify instructors, but we can also certify other people to certify instructors. This is headed up by Greg Wilson, who is the cofounder of Software Carpentry. Also a former professor at the University of Toronto. This certifies teaching skills and knowledge of RStudio tools. We think this is going to be great for driving a whole new cadre of teachers. So that's step one, more teachers. But remember, most people are self-taught. So we need more tools as well.

So here's one that most people should know, but they probably don't. That we provide free academic licensing for all of our RStudio professional software. It's free. All you have to do is give us a course syllabus and any certified academic institution. I don't care whether you're K through 12, community college, university, doesn't matter. You get our tools for free. If you're doing research rather than teaching, you get a 50% discount. We want you to use our professional tools, and all the details are on our website under the pricing page. So that's tool number one.

Tool number two is Data Science in a Box. This is an initiative that Mine Çetinkaya-Rundel started based upon her experience at Duke. It's an entire introductory statistics course with all the slides, all the exercises, all the exams, all the answers. Well, don't show your students the answers, but nonetheless, it has everything you need to really teach a course. This is open source. Go to datasciencebox.org, and you can get all the materials and start teaching with them. So that's number two.

Number three is online books. We don't often think about this. Many of the people at RStudio are authors, and they've negotiated with their publishers so they can actually publish their books online for free. So R for Data Science, if you didn't know this, this is available online for free. You can go in and read the latest version. Advanced R is also online. blogdown: Creating Websites with R Markdown, Hands-On Programming with R. I've thrown in another one, Geocomputation with R, simply because it's a book I've been learning from lately. And what I love about it is it targets an audience that is not a data scientist. Somebody who wants to create maps, do geographical information systems. So I love this book because it's an entirely different area of investigation.

The next tool I'm going to talk about is one I'm not going to talk about. You're going to hear about RStudio Cloud from Mel Gregory, but it's probably one of the most important introductions we've done. This is going to allow you to do anything that we do on the professional tools in the cloud. That's Mel. She's coming up at 2:09 today. Right here in this room, don't go anywhere. But as part of that, we do have primers. These are tutorials for people who want to learn on their own. You can create an account at RStudio Cloud and just go in and learn R, and it's free. So this is another way that we're allowing people to self-teach to really learn how to use R on their own.

The power of community

But there's one more thing, and I just want to leave you with this one. One more piece of data that I think is really key. We've heard it a little bit in the prior talks here, but we have data to support this. So if we ask people what they love most about using R, and we actually asked it in kind of that way, what did we discover? Community is number one. The community is really what makes this work.

Now here's the sad part, also in the data. 15% of our users have no one else in their work group that knows R. They're completely on their own. If we add in one person, it's zero or one, it's one in four, 25%, have no one to work with when they're doing R. And as all of you know, it's one of the exciting things about using R. It's being able to share your results, publish them.

So I challenge each of you to go off and teach R to a colleague. This is why I felt pretty good about my BHAG. I'm pretty sure we can teach a million R users. If every one of you goes off and grabs a colleague and says, let me tell you how you can do that thing you're doing in Excel and do it repeatably and reliably. I say those words specifically.

Remember back to Tareef's talk yesterday. Seems like forever ago, doesn't it? But Tareef's talk when we opened the conference. The printing press was a new way of communicating with the rest of the world. It was a way to share knowledge. These tools are, the tools we sell, are things that allow us to create our own knowledge. We start from data. We create open processes that can be verified by other people. And then we create provably, if not correct, at least close to correct results. We are the people that are creating knowledge. And that's why we believe that by training the next million R users, we can equip everyone regardless of means to participate in a global economy that rewards data literacy. And each one of you is going to be a part of that. Thank you very much.

Q&A

Okay, we have time for a few questions. No questions. Oh, I see a few. There's one in the back over there. There's lots of data to play with. I encourage anybody who's interested in the data to go do that.

And my main question is, are you considering, do you consider public health as research regarding the tools that RStudio provide? And my second question is, when you're reaching out for teaching, are you looking for the specific area like public health? We're a little bit stuck with SAS because it was kind of funded by the federal government and people don't have anything else to use but what is available to them.

So, I'm not sure I heard the first part of the question very well. Could you say it again? And that's the most important question is, you said RStudio is providing the tool, half price I think, for research and free for education. Where public health fall, is it research or whatever you call it, to benefit from those discount?

I think maybe if I could rephrase the question, are you trying to figure out if public health as an area of study is considered research or teaching and if you get the 50% off versus the full free? Is that the question? Yeah, exactly. That's the question. The easy decision is, do you have a syllabus of teaching for students? If you have a syllabus, then you're a teacher. If you don't have a syllabus, if you're doing research, then you're a researcher. It's really that simple.

So, depending on what country you're working in, if you're sort of in the, what do they call them, the emerging countries, there's a significant discount for that too. I would encourage you to reach out if that's what your use case is. What if you're in the what? Washington State. Washington State is probably not an emerging country.