
Tom Mock | A Gentle Introduction to Tidy Statistics in R | RStudio (2019)
R is a fantastic language for statistical programming, but making the jump from point-and-click interfaces to code can be intimidating for people new to R. In this webinar I will gently cover how to get started quickly with the basics of research statistics in R, with an emphasis on reading data into R, exploratory data analysis with the tidyverse, statistical testing with ANOVAs, and finally producing a publication-ready plot in ggplot2. Use the code presented instantly on RStudio Cloud! RStudio Cloud: rstudio.cloud Webinar materials: https://rstudio.com/resources/webinars/a-gentle-introduction-to-tidy-statistics-in-r/ About Thomas: Thomas is involved in the local and global data science community, serving as Outreach Coordinator for the Dallas R User Group and as a mentor for the R for Data Science Online Learning Community, co-founding #TidyTuesday, attending various data science and R-related conferences and meetups, and participating in Startup Weekend Fort Worth as a data scientist/entrepreneur.
Transcript
This transcript was generated automatically and may contain errors.
Thank you all for joining me today. I'll be going through a live demo as well as some examples and kind of background about why R is very exciting, how you can get started and kind of an immediate plan and then a long-term plan about how to be successful using R for statistical programming.
So first off, a little bit about me. Who am I? I originally got started with statistics and statistical analysis as a master's student. I was doing exercise physiology, looking at how exercise is beneficial for brain health. Now, when I first got started using statistics, I was using mainly point-and-click interfaces, things like SPSS, Excel, Origin. So a couple different things: my statistics in SPSS, my data cleaning and organization in Excel, and then finally my plotting in Origin. So it often meant that I had to kind of change things around or flip them when I was switching between programs.
Eventually, I ended up in a PhD program where I was analyzing mouse behavioral data with a neurobiology program. And this is where I really got my first taste of R. We had an intro to statistics course in my second year, and we used R for some of the basics in that course. And once I got started with R, I was like, oh, this is really exciting. I liked the statistical programming aspect where I was doing coding and doing everything in R rather than having to switch between a lot of different programs and got interested in data science through that.
While I had that course, I did find that I needed to learn a lot more beyond what was given to me in that two-month-long course. So I actually got involved with the R4DS, or R for Data Science, online learning community, which is still around today. And basically, it's a Slack group where people are able to go in, share questions, interact with code, or just share ideas. And out of this was born the TidyTuesday project, which is a weekly data project. I help host this, where every week I upload a dataset, community members attack it with R and the tidyverse, create plots or statistical analyses, and then share the examples on Twitter.
And it's been a really good project for kind of continuous learning in terms of every week there's a new data set. So you can, you know, try things out in a different way than you're used to, rather than doing the exact same thing over and over, as well as it's a distributed community. So even if you don't have local people who are using R around you, you can learn online with an online community. And lastly, I joined RStudio about a year ago. I'm on the customer success team, as Rob mentioned. And what I do is help our professional customers better use our professional products as well as open source products to use R in their enterprise settings.
Who this talk is for
What I really want you to get out of this is that, you know, my journey might be similar to yours in terms of a lot of people get started with R, didn't start as computer scientists or didn't start as programming in other languages. This was their first experience with using statistics and programming together. And that's kind of who I'm anticipating will be a lot of the crowd today. Maybe we'll have some experienced R users and they can also get something from this. But I primarily aim this at people who are just getting started with R.
So maybe you've heard of R or you've heard of RStudio, you've heard of the tidyverse. But when you first started, you know, trying to get started with R, you faced some hurdles. And so you wanted to, you know, find some more examples about how to get started and be more powerful. And this really resonated with me. And the kind of the basis for what this presentation started as was a blog post I wrote a year ago, based off this tweet by Jesse, who helped found our R4DS online learning community. And basically, she proposed that, you know, if you'd started R and didn't end up continuing with it, what were your stumbling blocks?
And some of the common things that resonated with me were things like this: you know, I didn't have examples of code or people doing the basic things I need to do, like basic multivariate stats, and I wasn't experienced enough. For David here, he was interested in R, but for a lot of people, it was too high-level; maybe they weren't ready for full-blown enterprise-level data science, they were just trying to get started with basic statistical programming and using the tidyverse to create some plots or do some statistical analyses. And lastly, my contribution was: I know it can be frustrating. You're just like, hey, I just want to run my ANOVA in R, but getting the data in, cleaning it up, and getting everything set up can often be a hurdle for a lot of people.
Why R?
I'm borrowing this learning-curve graph from JD Long, who is a data scientist out in the world, great guy; if you have a chance, you should look at some of the presentations he gave at rstudio::conf 2018 or 2019. And basically, the premise of this graph is that there's this learning curve. As with anything, when you initially get started with learning something new, you're here at what they call the suck threshold, in terms of things are hard and you're not very powerful. With a bad learning curve, you'll actually stay here, not being able to do much that's powerful, for a long time. So the goal for today is to get us on this good learning curve, where you're able to get to the point of "I'm kicking ass, I'm doing powerful things, I'm creating something useful in R, and I feel powerful."
So first of all, why R? I think the biggest thing for me is you're able to connect with an amazing community. Often when you're doing stats or programming, it might be all by yourself in terms of you could be siloed in an academic setting, or maybe other people in your workplace are using just SQL or Excel or some other language. So you're able to connect with this larger community, ask them questions and kind of gain access to their expertise, whether it's through things like LinkedIn, or Twitter, or other different resources like Stack Overflow, or RStudio Community.
Additionally, programming itself is basically a superpower that everyone has access to. Programming can kind of level up your skill set over something like a graphic user interface and allow you to do things more efficiently. And the thing for me that I really liked was that I could do cleaning, analyzing, plotting, and finally communication around my data all in one place. So I didn't have to switch between a bunch of different software suites and pay for a bunch of different software suites. I was able to do everything in R.
Reproducibility is a big thing with statistical programming in terms of your kind of data product is based off of code. So you kind of have an example of exactly what you did rather than having to ask somebody, well, hey, what did you do to clean up this data? And they're like, oh, I don't know. It was a year ago. With, you know, code, you're able to actually reference exactly what you did and see what's going on. You can do things like automation, where if you have a repeated report, you can just rerun your code with new data, and it will update that report for you. And obviously, it's free. You know, R is an open source language. It's extremely powerful. And all the packages that add additional features to R are also free.
We actually ran a survey right before our last, RStudio conference and asked people what they like best about R. And again, the community packages, the tidyverse, ggplot2, these were some things that were very popular amongst users. So being able to be powerful quickly and engaging with the community is something that really resonates across groups.
Lastly, kind of hitting again on why you should be excited about the community is, as Kim says here, you know, because it's fun and you can learn so much without having to seek it out. So you're just able to kind of passively take information and learn a lot by what other people are doing. Ludo over here has some other examples in terms of, oh, I can actually help someone out who's six months behind me in their learning process or commiserate over something that's difficult. Thank somebody, see examples of data viz or other excellent resources.
What we're covering today
This resonates pretty well with the overall R4DS model that Hadley Wickham and Garrett Grolemund laid out in R for Data Science, their textbook. Often statistical workflows look like this: you bring your data into R, you tidy or clean it up somehow, and then you go through this cycle of visualizing, modeling or statistical testing, and transformation of data, eventually creating a data product that you can communicate, whether it's R Markdown or Shiny or just a simple plot. And all of this is wrapped around by programming, in terms of rather than having to point and click and do all these other things, you're able to write a flow and syntax in R, and it will create these things for you as you work through the workflow here.
I really want to emphasize that the other things that we're not doing today, which is covering what statistical tests to run when, we will not be doing a deep dive into statistical programming or data science in terms of scaling out to the enterprise, and I don't expect you to 100% get it the first time. This is just the start of your journey and I'm hoping to kind of, again, get you on that good learning curve and get you started down a path where you're able to be successful quickly.
R vs RStudio and packages
As far as R and kind of getting into the meat of the presentation, I wanted to emphasize the difference between R and RStudio. So if you think of R, it could be the engine of your vehicle. So it actually will do all the computation for you, it is actually the code that you're writing and executing, whereas RStudio, the IDE or integrated development environment, is just an interface to that engine. It allows you to have a nice place to work and write code, some user enhancements that I'll walk through, but R itself is executing everything.
This basic example of the RStudio Cloud interface shows you the two pieces. The console is the interface directly to R: you can write some code here, you execute it, and in this case it created a plot as its output. You can also have outputs that are more numeric or text-based; those appear in the console. But everything around it here is the RStudio interface. I'll walk through this, and we'll actually do a live coding demo, but I just wanted you to get a very brief overview of what this looks like.
The other thing I wanted to cover before we jumped into live coding was the difference between R and R packages. So you can think of R as a new phone, you know, it's powerful, it has some core features like, you know, calling, texting, email, a web browser, but you need to add additional applications to do more things, and R is exactly the same way, where you can install packages that add new features. So things like the tidyverse, whether it's dplyr or ggplot2, allow you to get new features that you can then use in R.
Live demo: RStudio basics
And with that, I'd like to swap over into the live demo. So again, let's see if I make this full screen, this will be the RStudio cloud interface. So this is actually a hosted version of RStudio, so we have an option of installing it on your desktop. You can do that after the fact, and kind of see, you know, using it locally on your own machine. But I wanted to start with RStudio cloud, because I'm actually going to share this workspace with you, and you'll be able to interact with the exact same code and data that I'm using, and you'll be able to try it out without having to install anything.
So here, again, is the RStudio interface. If we were to work directly in the console, we could do some math. So we're typing some code, and if we just do 3 plus 3, we get our output of 6. It's often more useful, rather than writing long functions in the console, to write them inside a script or inside an R Markdown document. So here you're able to write out longer things. I have about 50 lines of code here, and I can write some comments and do things, and I'm still able to execute the code that I have. So again, the 3 plus 3, I get my output of 6.
The other thing I wanted to cover real quickly is base R functions and objects. So you can do math, or do anything else, and assign it to an object. So in this case, we'll assign 3 plus 5 to x, and then we could actually call x, and we can see that it will then execute the math. Additionally, we can see that x has been stored as a value here in our environment. So when you're working with data sets, and I'll import a data set here in a little bit, you can actually find them here in the RStudio environment pane. So that can be useful if you're not used to, like, you know, finding where your data sets are. Always look up here in the top right for your environment.
You can also do some other things, like combine numbers into a vector. So here, 1 through 5 is now stored as y. And there are some functions that are just built into R, things like seq, which will generate a sequence of numbers. So 0 to 10, iterating by 2: 0, 2, 4, 6, 8, 10. These are things that are all built into R.
Additionally, you can write your own functions, which is where a lot of power comes from. I'm going to create a function here called add_pi, and this function basically says: take something that I input, and add 3.14 to it. So if I save this and apply add_pi to 3, we can see that 3 plus 3.14 gets us to 6.14. Now, you may not write functions yourself for quite a while, but a lot of libraries are actually built out of functions like these. They give you interfaces where, rather than having to type all of this out yourself, you just call add_pi.
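The function sketched above might look like this (the name add_pi and the constant 3.14 come straight from the demo):

```r
# A simple user-defined function: take an input and add 3.14 to it
add_pi <- function(x) {
  x + 3.14
}

add_pi(3)  # returns 6.14
```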
And similarly, if we use a library, which is a collection of functions, we can load it, and this now gives us a bunch of plotting functions from the ggplot2 library. This will take a dataset, in this example the mtcars dataset, and allow you to set up a graph. So if I just run the code here, it will output a graph, and you can see that I have a graph of horsepower versus miles per gallon, with a linear model fit to it, and a title.
The real big thing that I love about R, and a lot of the tidyverse functions, is that this is very word-based, or verb-based. So I can look at this and say, okay, I want my x-axis to be horsepower, and it's horsepower. I want my y-axis to be miles per gallon, and you can see it's miles per gallon. And then I want to add points to that plot. So if I just call this part right here, we can see that graph of horsepower versus miles per gallon, with the nice points here. We can then add more features to it: adding things like geom_smooth, for that linear model fit, and then changing the titles so that they look nicer. Rather than saying HP and MPG, we relabel them so they say horsepower and miles per gallon.
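A sketch of that layered plot, using the built-in mtcars dataset; the exact label text here is my own:

```r
library(ggplot2)

# Build the plot layer by layer: data and aesthetics first,
# then points, a linear-model fit, and readable labels
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Horsepower", y = "Miles per gallon",
       title = "Horsepower vs. miles per gallon")

p  # printing the object draws the plot
```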
I'm also going to load the tidyverse library, which has a lot of other functions and other libraries within it. Mainly the ones we're looking at are dplyr and tidyr, which are used for data cleaning and transformation, and ggplot2, which I already loaded, which allows you to create plots.
The last thing I want to cover related to the tidyverse is the pipe, which you can think of as "and then." So if you're reading some code: here I'm going to take the mtcars dataset, and then I'm going to mutate cylinder into a factor, and then I'm going to select the columns miles per gallon, cylinder, displacement, and horsepower. If we look at the mtcars dataset before we do anything else, you can see that it has a lot of columns, from miles per gallon all the way up to carb, and we can see that cylinder is labeled here as a double, which means it's stored as a numeric value.
When we do this transformation here, we take that dataset and then change cylinder to be a factor, and then select only these four columns, you can see that it will actually change the dataset. So now cylinder is labeled as a factor, and you only have the four columns of interest.
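The pipe-and-verbs pipeline described above could be written like this (the column names are mtcars's own abbreviated names):

```r
library(dplyr)

# "Take mtcars, AND THEN make cyl a factor,
#  AND THEN keep just four columns of interest"
car_df <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  select(mpg, cyl, disp, hp)

glimpse(car_df)  # cyl now shows as a factor; only 4 columns remain
```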
The tidyverse is based around these functions, where you have nice verbs to work with that allow you to understand what you're doing, even if you're just getting started. So yes, this may seem a little intimidating right now, but you are able to work through these functions and get a sense of what you're trying to accomplish. And then again, we can take that same dataset and create a plot with it. At this point, instead of using geom_point, I use geom_boxplot, which creates a quick box plot, and then overlay jittered points on top of the graph here.
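A minimal sketch of that box plot with jittered points, again on mtcars:

```r
library(ggplot2)
library(dplyr)

# Box plot of mpg by cylinder count, with the raw
# observations jittered on top of each box
p <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  ggplot(aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  geom_jitter(width = 0.2)

p
```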
The Alzheimer's dataset
So the origin of this whole talk was a blog post I wrote back on my blog here, and what's interesting is it covers a fake dataset about Alzheimer's disease: giving these patients a drug and seeing how it changed their cognitive function. I wrote this whole blog post in R, so all these graphs and text, this whole thing, was written in R. As an example of the power of what you can do, I can actually just rerun this whole report and create that blog post again, and you can see that in a matter of seconds I've recreated the entire thing. All the text, adding the pictures, rerunning all the code, making the plots, all of that has been recreated quickly. So this is one of the powers of code: you're able to reproduce what you've been working on.
As a basic example, we can think of Alzheimer's disease as a decline in cognitive function due to changes in the brain. So your brain actually deteriorates as you age with this disease, and in this example we're looking at a novel drug, giving it at a couple of different doses to males and females and seeing how it changes their cognitive function. The context isn't super important; it's just the dataset we'll be using, but at least it gives you a sense of what we're getting started with.
Reading data and exploratory analysis
I did want to let you know that as we get started with this second file, that this is actually an R Markdown document, and if we compared that to our R file here, you can see that the text, if we were to write it out, gives you this error because it's trying to evaluate this code. With R Markdown, there's an intent of saying, I'm going to write text here that is describing what I'm doing or talking through what the results are, and then each of these code chunks here will actually allow you to evaluate code as you would expect.
So for the first example, what we want to do is load our libraries. Again, these add new functions to R. We have the tidyverse, for plotting, cleaning, and other functions like that; the broom package, which cleans up statistical outputs and makes them a little tidier; knitr, which handles tables and R Markdown output; the readxl package, to read in Excel files; and then the here package, which allows you to point at datasets and file paths programmatically when reading them in. So we'll load these, and they've been loaded now.
So the next step would be to read in the data. So with other kind of interfaces, you might go into file and open a data set. You are still able to do that in RStudio. So you can tell RStudio either programmatically, so write out code saying, I want to save a data set known as raw df, and I want to read it in as an Excel file. So if I do this, it will add the raw df data frame in. Or if maybe you wanted to do it not programmatically, you could also import a data set. So go to file, import data set, and then go to Excel. You can then browse and then choose a data set of interest. So here we're looking at the AD treatment. Open it, and it will give you a preview of what the data set looks like, the code to actually read it in, and then you can import it that way as well.
Because I know exactly what the dataset is, I just did it programmatically. So again, I use the read_xlsx function and read in the dataset as an .xlsx file. It's stored up here, and if I want to, I can take a quick look at it using the glimpse function, which just shows you an overview: it says how many observations there are and the number of variables, and then gives you a quick peek at what the actual variables are.
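Reading an Excel file programmatically might look like this. Since the webinar's AD treatment file isn't available here, this sketch uses an example workbook that ships with readxl so the code runs anywhere:

```r
library(readxl)
library(dplyr)

# In the webinar the file is an Excel sheet in the project folder;
# readxl_example() returns the path to a bundled example workbook
path <- readxl_example("datasets.xlsx")
raw_df <- read_xlsx(path)

glimpse(raw_df)  # rows, columns, types, and a peek at the values
```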
Now that we've got it loaded into R, we can actually start using it. So again, using something like ggplot2, we can say, okay, we want to use the data raw_df, which is the dataset we just loaded, and we want to put the mini mental status exam, our measure of cognitive function, on the x-axis. I'm then going to add geom_density, which tells it to create a density plot, which is kind of similar to a histogram, and we can see the distribution of cognitive function across our patients.
So this is kind of a quick exploratory analysis. You can see it's a very small amount of code. You're really just saying what data set, what are your axes, and then what kind of plot do you want to make, and it'll quickly make this plot for you. And this shows us that we have kind of a bimodal distribution. We have a group of patients that have a low mini mental status exam, so lower cognitive function, compared to this patient population, which has a higher cognitive function, closer to about 25.
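A minimal version of that density plot; mtcars$mpg stands in here for the MMSE column of the webinar's dataset:

```r
library(ggplot2)

# Quick exploratory density plot of a continuous variable:
# just the data, one aesthetic, and one geom
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_density()

p
```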
Now that we've got an exploratory graph, maybe you want to take a look at some basic stats about it. So we can run a new function: again, taking raw_df and then summarizing it. The summarize function from dplyr says, okay, take this dataset and knock it down to just these two values. I want to look at the min and the max. You see the min here is 8.4, so that's the lowest mini mental status exam value, and the highest is about 28.
But this just shows you the raw range across the whole dataset, which isn't all that useful. So we can add a second function here: group_by. Rather than summarizing across the whole dataset, we group it by health status, which distinguishes Alzheimer's patients from healthy patients. So we run this: raw_df, group_by health status, and then for our summarize we want the min value, the median value, and the max value, where min, median, and max are all functions built into base R. We can see that the Alzheimer's patients generally have a lower range: a min of 8.4, a median of about 15, and a max of about 25. Compared to the healthy patients, where all of these are above 20 for their mini mental status exam.
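The grouped summary could be sketched like this, with mtcars columns standing in for the webinar's variables (cyl for health status, mpg for the MMSE score):

```r
library(dplyr)

# Grouped summary: min, median, and max within each group,
# instead of across the whole dataset
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(min = min(mpg),
            median = median(mpg),
            max = max(mpg))

res  # one row per group, one column per summary statistic
```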
So that's pretty interesting, but maybe we want to see something more about the dataset. So we'll look at a count, kind of a table function showing how many of each group we have. Again, taking the raw_df data frame and then grouping by both drug treatment and health status, where we have a couple of different doses of the drug, a high dose, a low dose, and a placebo, given to patients who either have Alzheimer's or are healthy. And we can see that we have a hundred in each of these groups. So each group has an equal number of subjects; it's a well set up study here.
When we're thinking more about summary statistics, it might be helpful to kind of get into some additional graphs. So rather than just doing a raw distribution, we might look at it by a few different groupings. So again, something more similar to where the drug treatment versus the health status example. So in this case, we have ggplot2, and we're loading our data set of interest, raw df. We're taking on the x-axis drug treatment, so either high dose, low dose, or placebo. And then on the y-axis, or the dependent variable, we'll look at mini mental status exam, and then we'll assign color to be drug treatment. This will basically say that there will be, you know, three different colors here, so each of the drug treatments will get to be a different color rather than all the same.
We're going to build a box plot, so it shows the different examples of the min, median, and the range. And then we're going to facet this, or create small multiples, according to health status. So if we go here, we can see for Alzheimer's versus healthy patients, we can see a nice graph of the mini mental status exam, so the high dose, the low dose, and the placebo. And then here in the healthy patients, again, the distribution around that.
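A sketch of that colored, faceted box plot, using mtcars stand-ins (cyl for drug treatment, am for health status):

```r
library(ggplot2)
library(dplyr)

# Box plots of mpg by cylinder count, colored by group, then
# faceted into small multiples by transmission type
p <- mtcars %>%
  mutate(cyl = as.factor(cyl),
         am  = as.factor(am)) %>%
  ggplot(aes(x = cyl, y = mpg, color = cyl)) +
  geom_boxplot() +
  facet_wrap(~ am)

p
```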
Something I do want to point out is that, you know, these axes aren't, you know, super pretty in terms of the drug treatment has, you know, a symbol in it. Mini mental status exam is all lowercase, and we probably want to flip this so placebo, low dose, and high dose are arranged differently. But we do have a sense of there appears to be a difference in that the healthy patients, regardless of treatment, appear to be having higher MMSE scores compared to the Alzheimer's patients where the placebo has a low MMSE, and the low dose and the high dose increase it slightly with that treatment there.
Data cleaning and summary statistics
Again, just taking a quick look at the raw data frame, we can see we have age, sex, health status, drug treatment, and then the MMSE as our dependent variable. It's important to note that these variables here are actually labeled as just text or character, and we actually want to save those as a factor, which assigns them, you know, a tier or kind of a categorical label as opposed to just raw text. These are numeric or double, so we can leave them as is, except for sex is coded as 0 or 1, where 0 is male and 1 is female. So we do want to do some data cleaning here to get everything off the ground correctly.
So we'll create a new data frame, a new object here. We're going to take the original raw data frame and then mutate it, changing sex to be a factor, where the levels are going to be 0 and 1, and then assigning labels of male and female. Basically, R will read through and say, okay, there's a 0 here, this is now male; there's a 1 here, so this is assigned female; and it will make that change and store the column as a factor. We're also going to change drug treatment into a factor: we mutate drug treatment to be a factor where the levels are placebo, low dose, and then high dose, so we're actually putting them in the order we want to see. Lastly, we're also going to mutate health status to be a factor, where the levels are healthy and then Alzheimer's.
So if we run this code and glimpse the data again, the variables all look the same, but it says male here as opposed to 0, and here's that 1, so it should be female, and we see female. And these columns are now labeled as factors: health status, drug treatment, and sex are all factors now. That's good, because it sets us up for our future analyses, whether that's statistics or plotting.
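The factor recoding described above, sketched on a small toy data frame (the 0/1 coding of the sex column matches the webinar's dataset; the toy values are made up):

```r
library(dplyr)

# A toy stand-in for raw_df with sex coded 0/1
toy_df <- data.frame(sex  = c(0, 1, 0, 1),
                     mmse = c(14, 25, 16, 27))

# Recode the 0/1 column to a labeled factor:
# level 0 becomes "male", level 1 becomes "female"
toy_df <- toy_df %>%
  mutate(sex = factor(sex, levels = c(0, 1),
                      labels = c("male", "female")))

levels(toy_df$sex)  # "male" "female"
```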
So again, we're going to create a summary table here. For our summary data frame, we want to look at means, the standard error, and the number of samples. So what we're doing is taking the data frame and assigning the result to a new name. We're going to group it by sex, health status, and drug treatment, and then summarize with our functions of interest: we'll get the mean, we'll get the standard error, calculated as the standard deviation divided by the square root of n, and then show the number of samples using the n function. Lastly, we'll also tell it to ungroup, so that it drops the grouping after everything has been calculated.
So if we run this code, we can look at sum_df. Rather than 600 observations, we've now gone down to 12 observations, one for each of the different groups: male healthy placebo, male healthy low dose, female healthy placebo, female healthy low dose, and so on, each with the mean, the standard error, and the number of samples calculated out. So this is helpful in that we've got our nice summary statistics saved here and we're ready to work with them, whether that's plotting or saving them as a table itself.
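The summary-table step might look like this; mtcars and its grouping columns (cyl and am) are stand-ins for the webinar's sex, health status, and drug treatment:

```r
library(dplyr)

# Mean, standard error (sd / sqrt(n)), and group size, per group,
# then drop the grouping once the statistics are calculated
sum_df <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg),
            se   = sd(mpg) / sqrt(n()),
            n    = n()) %>%
  ungroup()

sum_df  # one row per cyl-by-am combination
```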
If we did want to create a summary plot of that, we can take that data frame, so sum_df rather than the raw_df we've been working with previously, and again assign an x- and a y-axis. Here the x-axis is drug treatment and the y-axis is the mini mental status exam mean. We're going to assign a group of drug treatment, which groups those numbers together, and a color of drug treatment, so we get three colors instead of one. We'll increase the size of the geom_point, so the points we add to the plot are a little bigger. And lastly, rather than making one plot where everything is overlaid, we're going to facet it, or create small multiples, with facet_grid, laying it out as sex by health status. So if we rerun that code, we can see that it's now separated into four plots rather than one, with male and female here.
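A sketch of that faceted summary plot; the toy sum_df built here just mimics the shape of the webinar's summary table (the values are random):

```r
library(ggplot2)

# Toy summary table: 3 treatments x 2 sexes x 2 health statuses
set.seed(42)
sum_df <- expand.grid(treatment = c("placebo", "low", "high"),
                      sex       = c("male", "female"),
                      status    = c("healthy", "alzheimers"))
sum_df$mean <- runif(nrow(sum_df), 10, 28)

# Group means as larger points, in a grid of small multiples
p <- ggplot(sum_df, aes(x = treatment, y = mean,
                        group = treatment, color = treatment)) +
  geom_point(size = 3) +
  facet_grid(sex ~ status)

p
```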
Running the ANOVA
We're finally getting to the ANOVA, walking through it slowly and getting set up with the actual statistical analyses now. For this example, we're going to create a new data frame, a stats data frame, since we already saved the last one at the summary level. We'll start again with raw_df and do our mutate again: we'll change drug treatment to be a factor, sex to be a factor, and health status to be a factor, so that it's ready to go for the ANOVA.
The actual code for running an ANOVA, at least with a Type I sum of squares, is pretty simple: the basic setup is to use the aov function. You assign the dependent variable first, use a tilde, and then the independent variable, and you tell it what the data set of interest is. So here we're using a fake data set, data_df. If you want to do main effects or interactions, there is a slight variation there. So for just main effects, you add your dependent variable and then your independent variables, so here sex, drug treatment, and health status are all added individually as main effects, or you can do main effects and interactions by using a star instead of a plus sign.
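The two variations can be sketched like this (data_df and the column names are placeholders, as in the example just described):

```r
# Main effects only: independent variables joined with plus signs
main_aov <- aov(mmse ~ sex + drug_treatment + health_status, data = data_df)

# Main effects AND interactions: joined with a star
ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, data = data_df)
```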
There's a couple different ways of doing ANOVAs. I'm just showing one example here, and there's a whole lot of other different statistical tests that you can do in R. So again, this is just a basic example, and there's some textbooks I'll be sharing about great examples of going through choosing a different statistical test or approaching it from a statistical validity standpoint, just trying to get us started with the basic setup here.
I did want to point out that you can use the plus or the star, but we don't want to use commas here. That would actually throw an error, and that's okay. It doesn't break anything; it just says it's not able to do this because it's trying to assign things in an improper way. So always make sure to use a plus to add only main effects, or the star for the main effects and interactions. So we'll run it as the main effects and interactions, and now we can see that it saved our Alzheimer's disease ANOVA as a list object here.
So if we actually call summary on that object, ad_aov, you get a nice table of what you would expect to see: your degrees of freedom, sum of squares, mean square, your F value, and then your p-value, and it also tells you the significance levels, whether it's 0.1 or all the way down to zero. So we can see that there are significant main effects as well as an interaction of drug treatment by health status, so we'll follow up on those, but this kind of output here is not always the most useful in terms of working with it in other ways.
So we can actually tidy this up with the broom package, and if we look at the tidied ANOVA, we can see that it's in a different format now. It's a little bit cleaner, so rather than having that kind of base R output, we've got it in a nice data frame where you're able to see all the different values a little bit more cleanly, and again, you could save this to Excel or as a CSV if you wanted to export it, or you can just keep it here.
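The tidying step is roughly this, assuming an ANOVA object called ad_aov as above:

```r
library(broom)

summary(ad_aov)              # classic base R ANOVA table
tidy_ad_aov <- tidy(ad_aov)  # the same results as a tidy data frame
tidy_ad_aov

# Optionally export the tidy table, e.g. as a CSV:
# readr::write_csv(tidy_ad_aov, "ad_aov_results.csv")
```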
Post-hoc tests
Because we did see main effects and interactions, we should follow up with post-hoc tests, so we'll take a look at a couple different post-hoc test examples. The first one is just a pairwise t-test with a Bonferroni correction, and the basic setup here, if we take a look, is that we're going to save it as ad_pairwise, so we know which one this was. The pairwise t-test function gets us started, and we pass in stats_df's MMSE column as the dependent variable.
And then we've got some slightly weirder syntax here that's hard to read, but I'll walk through it. So again, we're using the stats_df data frame, and we're using the sex column as the factor. We're doing an interaction, shown here with the colon, then the next factor, which is drug treatment, another colon for the interaction, and then stats_df's health status column, and lastly we're telling it to do a Bonferroni p-value adjustment for multiple comparisons. So we run that, and we can again see that it's saved as an object.
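That call can be sketched like this, with the interaction written using colons and the column names assumed as before:

```r
# Pairwise t-tests across every sex x drug_treatment x health_status group,
# with a Bonferroni correction for multiple comparisons
ad_pairwise <- pairwise.t.test(
  stats_df$mmse,
  stats_df$sex : stats_df$drug_treatment : stats_df$health_status,
  p.adjust.method = "bonferroni"
)
```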
Again, I like to clean this up a little bit, so I'll run tidy on it. So we use the broom package to tidy it, and I'm actually going to then mutate the p-value to round it, so it's less expansive; rather than lots of significant figures, we get just five digits. So we can see the comparisons here: comparison one, male placebo Alzheimer's compared to male placebo healthy, is significantly different, whereas the low dose in the healthy male versus the placebo in the healthy male is not statistically significantly different. That's just a quick example of the Bonferroni.
If maybe you like the Tukey test better, you can also do that. So here we're using the TukeyHSD function on the ad_aov object we've already created, so that was our actual ANOVA output from above, and this syntax is a little bit cleaner: we say run a Tukey HSD where the interaction of sex, drug treatment, and health status is of interest. We're then going to tidy it to clean it up, take just the first six values with head, and create a nice table. So if I do this, we can again see that we have our term and the comparison here, so female placebo healthy versus male placebo healthy. It's got your confidence intervals, low and high, and then the adjusted p-value for the Tukey test. These are going to be pretty similar to the Bonferroni because the groups are very, very different, but you'll obviously need to choose whichever statistical test makes sense for your analysis.
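A sketch of the Tukey version (the exact spelling of the interaction term is an assumption based on the column names used above):

```r
library(broom)

# Tukey HSD on the three-way interaction from the saved ANOVA object
tukey_ad_aov <- TukeyHSD(ad_aov, which = "sex:drug_treatment:health_status")

tidy(tukey_ad_aov) %>%
  head()   # first six comparisons: estimates, confidence intervals, adjusted p-values
```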
Building the publication plot
Lastly, in the next couple of minutes, I'm going to go through a quick guide to the publication graph. So I'm actually going to create a data set, and I can create this with the tribble function, where I can assign a column and then all the different values in it. So if I run this function here, you can see that column A gets a, b, and c, a, b, and c, and column B gets 1, 2, 3, 1, 2, 3 in it. We assign the column names by adding a tilde before each one, and then each of the columns below gets filled in correctly.
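A minimal tribble() example along those lines (the values here are just for illustration):

```r
library(tibble)

# Column names get a leading tilde; values are then entered row by row
df <- tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
)
```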
Rather than just adding kind of random data, I'm going to assign these as the proper groups. So I actually looked back through the pairwise comparisons above and found the groups that were significantly different. So the low-dose Alzheimer's male and the high-dose Alzheimer's male versus the placebo, and then the low-dose Alzheimer's female and high-dose Alzheimer's female versus placebo, were significantly different. So we're creating a new data set based around that, and then reassigning our factor levels there.
We're finally getting to that publication plot, and it looks a lot wordier or longer than our plots from above. That's because we're doing some more customization. So if we look at the actual plot here, and I'll pull it up in the window so you can see it nice and big. You can see a few different things. So we're doing our basic setup where we have our data as the sum data frame. We're assigning x as the drug treatment and y as the mini mental status exam mean. We're going to fill it or assign a color according to drug treatment and group it by drug treatment, and then we're going to add error bars. So this white, gray, black is according to the fill, and then we've assigned the error bars which are really small here according to group.
I built out that data set above because I'm actually adding small asterisks to the graph, and that's done with the geom_text I've assigned here. So this is data according to the significance data frame that we set up, adding a label of an asterisk. I also changed a couple of the theme items, so changing it to be a white background versus a gray, and changing the axis text size. All these different things are kind of at your discretion. So maybe you're like, oh, I want to add these different changes or make it a little bit prettier; this would be an example.
The basic idea is that we create the plot based off of the data set. We assign the column, or sorry, the x and y-axes, and what type of graph we want to make: an error bar, a bar graph. Then we want to make small multiples, or multiple graphs, according to sex and health status, where we see sex and health status again. Lastly, we'll add our labels, so rather than kind of bland, small descriptors, we have a nice big "drug treatment" and "cognitive function" and then a figure caption here: figure one, the effect of novel drug treatment, with our groups and the total N as well as a significance indicator.
Lastly, we can take that graph that we saved as g1 and give it a nice file name, so "AD publication graph," and assign a height, a width, units for those, and then a DPI. The DPI is the dots per inch, which basically means higher quality as it increases. If you don't do that, it might be a little bit fuzzy. As you can see, this one is a little bit fuzzy, versus if we look at the publication graph here, it's very nice and sharp, and it's actually very large, and you can zoom in and it looks good.
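Pulling those pieces together, a condensed sketch of the publication plot and the export step (column names, theme choices, labels, and the file name are all illustrative, not the exact code from the webinar):

```r
library(tidyverse)

g1 <- sum_df %>%
  ggplot(aes(x = drug_treatment, y = mmse_mean,
             fill = drug_treatment, group = drug_treatment)) +
  geom_col(color = "black") +                                    # bar graph
  geom_errorbar(aes(ymin = mmse_mean - mmse_se,
                    ymax = mmse_mean + mmse_se), width = 0.2) +  # error bars
  facet_grid(sex ~ health_status) +                              # small multiples
  theme_bw() +                                                   # white background
  labs(x = "Drug Treatment", y = "Cognitive Function (MMSE)",
       caption = "Figure 1: The effect of novel drug treatment on cognitive function")

# Save at a high dpi so the exported image is sharp rather than fuzzy
ggsave("ad_publication_graph.png", g1,
       height = 6, width = 8, units = "in", dpi = 600)
```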
Next steps and resources
So for the next steps and the follow-up items that might be helpful for you, I will be sharing the slides in the RStudio Cloud instance. So you'll have all of this code available and be able to play around with this data set in RStudio Cloud without having to install anything; you just do it through your web browser. So you'll be able to walk through exactly what I did and play around with it.
Additionally, working inside RStudio Cloud, there's actually training data sets and examples there. So if you go to RStudio Cloud primers, it'll actually show you the basics of how to get started with R, how to work with data, do some visualization, tidying, iteration, which is more of high-level functional programming, and then writing functions themselves, which I covered very quickly. The good thing here is that these are interactive, so you're able to write code and get feedback immediately. They're ready to go without having to install anything and you can't break anything. You don't have to worry about installing something on a Windows versus a Mac versus a Linux box. It's just ready to go in your web browser.
There's additionally some great books if you're trying to figure out, okay, which statistical test do I need to use? Learning Statistics with R by Danielle Navarro is a great book that covers the didactic understanding of why statistics and which to use, as well as raw R code about how to do them. The R for Data Science book actually covers examples of doing data science with R as opposed to pure statistics, so doing more programming, plotting, and exploratory data analysis; this is a great textbook by Hadley Wickham and Garrett Grolemund. And if you're interested purely in data visualization, Socviz (Data Visualization) by Kieran Healy is a great resource for seeing example R code to create beautiful plots and examples like you saw today.
The last thing I wanted to go over as well was the TidyTuesday R4DS online learning community. So this is that weekly project I talked about briefly. Basically, a data set gets uploaded weekly at GitHub, so GitHub R for DS TidyTuesday. Every Monday on Twitter, we release the data set, and it looks like this. So a link to the data, an article about the data, and then some example plots. Basically, you just play around with the data, and you can share it or just observe what other people are doing on Twitter. It's a great way to learn more about using R and the tidyverse with new data sets each week.
Additionally, there's RStudio Community where you can ask questions, so whether it's questions about the IDE or if you need help with actual code, you can go in and ask there. The R4DS online learning community has a Slack group where you can also ask questions, and it's a little bit more private since it isn't a public-facing group; anyone can join, but you have to log into Slack at the very least. It sometimes feels a little bit quicker to get started there, as you're able to get immediate interaction.
And lastly, the R community is pretty active on Twitter. Somebody actually collected all the tweets using the #rstats hashtag between 2008 and 2018, and you can see there are almost 500,000, or about 430,000, tweets. There are additional hashtags you can follow, whether it's #rstats or tidyverse, ggplot2, epitwitter, or rspatial. There's a bunch of different resources there.
rOpenSci has a lot of packages that may be more specific to your type of science, so I covered one example for neuroscience. Maybe you're doing plant science or marine biology or something; there could be other packages through rOpenSci that are useful, and that's just ropensci.org. And lastly, if you're trying to get involved, the R-Ladies global community is great, and basically their mission is to promote gender diversity in the R community, so this includes under-representation of any minority gender. You're able to join remotely or at actual meetups in your city, and it's a great initiative for staying involved and contributing to this community.
And lastly, I know this was very quick, and it can seem intimidating to get started with R, and maybe you're like, oh, I'm later in my career or I'm a little bit older. Jenny Bryan, who I have a huge amount of appreciation and respect for, was talking about this with Sharla, about how everyone has their own career path, and it could be a slow career movement or a fast career movement, and that's okay. So it kind of reinforces that you don't have to get started super quickly, and you can go through it at your own pace. And then, obviously, Hadley Wickham is pretty well known in the R community, and he still Googles stuff and finds things out through the #rstats hashtag. So again, no matter how experienced you are, it's okay to make mistakes. It's okay to be learning, and we're all here to learn with you.
So with that, I'll actually end a little bit after time, but still have some time for questions. I appreciate your attendance today, and I'll take questions now. Thanks.
Q&A
Ah, so the key differences between RStudio Cloud and RStudio is RStudio Cloud is hosted, so you connect to it purely through a web browser. You don't have to install anything. And then the RStudio Desktop is actually installed on your computer, whether it's your laptop or your desktop computer. Both will look exactly the same once you're actually using them.
Okay, some people ask about the tidyverse library. It's basically packages. So packages give you those functions that allow you to, you know, do more things in R without having to write it by hand. And the tidyverse is a meta-collection of other packages that follow along in that syntax.
You can actually create websites using R Markdown. It produces HTML content, so you can look at the blogdown or the distill packages. Those actually allow you to create entire websites or blogs in R, and that's how I run my personal blog. There's some great resources for blogdown, there's actually a book about it, and distill has its own documentation website.
Yep, there we go. So this actually walks through kind of in long form exactly what we covered today. I didn't want to just read this out in front of you, but it basically gives a similar presentation of what I did, although I have changed it up a little bit since then. But this could be useful if you maybe don't like videos as much and you want to read through something. You can just search for themockup.netlify.com. It's actually the first post I have there.
The tidyverse is one package, but it's a meta-collection of packages. Someone's asking about how you know which functions are available in each package. A lot of packages will have their own website or a CRAN page. So if you think of the tidyverse, it actually has a website where you can look at all the different things; it's just tidyverse.org. You can look at the different packages that are available there, click on the individual packages, and there's everything from cheat sheets to example usage, where it is in the lifecycle, and then references to all of the different functions that are available in it.
You can also do this from within RStudio Cloud. So you can actually just type some code, and it will show you exactly what that function or package is. If you did, say, question mark dplyr, or question mark ggplot2, that will open an example about that package in the help pane. So it will give you links to external resources, as well as some information about the package. You can also use that same question mark syntax on individual functions. So let's do annotate: if I do it on just one function, it will give a description, the arguments that are available, the details, and a quick example of it.
You can label a factor as anything. So someone's asking about labeling a factor as, say, less than 24, as opposed to 24 written out as text. The label is just a character string, so you could actually type in less than 24, something like factor(x, labels = "<24"), something along those lines. That wouldn't be the full code, just an example that the factor is now labeled with a character string and can be displayed that way.
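A tiny illustration of that idea (the values here are made up):

```r
# The labels argument takes arbitrary character strings, so "<24" works fine
age_group <- factor(c(1, 2, 1, 2),
                    levels = c(1, 2),
                    labels = c("<24", "24 and over"))
```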
Any of the free R training programs, courses, MOOCs: I'm a big fan of the RStudio Cloud primers, because you're able to do them super quickly and for free. You go to the basics, click on visualization, and you're able to get started with examples, actually running code, and it will give you outputs. So you can practice inside of a workspace or just go through these primers.
