
Tom Mock | A Gentle Introduction to Tidy Statistics in R | RStudio (2019)
R is a fantastic language for statistical programming, but making the jump from point-and-click interfaces to code can be intimidating for people new to R. In this webinar I will gently cover how to get started quickly with the basics of research statistics in R, with an emphasis on reading data into R, exploratory data analysis with the tidyverse, statistical testing with ANOVAs, and finally producing a publication-ready plot in ggplot2. Use the code presented instantly on RStudio Cloud! RStudio Cloud: rstudio.cloud Webinar materials: https://rstudio.com/resources/webinars/a-gentle-introduction-to-tidy-statistics-in-r/ About Thomas: Thomas is involved in the local and global data science community, serving as Outreach Coordinator for the Dallas R User Group and as a mentor for the R for Data Science Online Learning Community, co-founding #TidyTuesday, attending various data science and R-related conferences and meetups, and participating in Startup Weekend Fort Worth as a data scientist/entrepreneur.
Transcript
This transcript was generated automatically and may contain errors.
Thank you all for joining me today. I'll be going through a live demo as well as some examples and kind of background about why R is very exciting, how you can get started and kind of an immediate plan and then a long-term plan about how to be successful using R for statistical programming.
So first off, a little bit about me. Who am I? I originally got started with statistics and statistical analysis as a master's student. I was doing exercise physiology, looking at how exercise is beneficial for brain health. Now, when I first got started using statistics, I was using mainly point-and-click interfaces, things like SPSS, Excel, Origin. So a couple different things: my statistics in SPSS, my data cleaning and organization in Excel, and then finally my plotting in Origin. So it often meant that I had to kind of change things around or flip them when I was switching between programs.
Eventually, I ended up in a PhD program where I was analyzing mouse behavioral data with a neurobiology program. And this is where I really got my first taste of R. We had an intro to statistics course in my second year, and we used R for some of the basics in that course. And once I got started with R, I was like, oh, this is really exciting. I liked the statistical programming aspect where I was doing coding and doing everything in R rather than having to switch between a lot of different programs and got interested in data science through that.
While I had that course, I did find that I needed to learn a lot more beyond what was given to me in that two-month-long course. So I actually got involved with the R4DS, or R for Data Science, online learning community, which is still around today. And basically, it's a Slack group where people are able to go in, share questions, interact with code, or just share ideas. And out of this was born the TidyTuesday project, which is a weekly data project. I help host this, where every week I upload a dataset, community members attack it with R and the tidyverse, create plots or statistical analyses, and then share the examples on Twitter.
And it's been a really good project for kind of continuous learning in terms of every week there's a new data set. So you can, you know, try things out in a different way than you're used to, rather than doing the exact same thing over and over, as well as it's a distributed community. So even if you don't have local people who are using R around you, you can learn online with an online community. And lastly, I joined RStudio about a year ago. I'm on the customer success team, as Rob mentioned. And what I do is help our professional customers better use our professional products as well as open source products to use R in their enterprise settings.
Who this talk is for
What I really want you to get out of this is that, you know, my journey might be similar to yours in terms of a lot of people get started with R, didn't start as computer scientists or didn't start as programming in other languages. This was their first experience with using statistics and programming together. And that's kind of who I'm anticipating will be a lot of the crowd today. Maybe we'll have some experienced R users and they can also get something from this. But I primarily aim this at people who are just getting started with R.
So maybe you've heard of R or you've heard of RStudio, you've heard of the tidyverse. But when you first started, you know, trying to get started with R, you faced some hurdles. And so you wanted to, you know, find some more examples about how to get started and be more powerful. And this really resonated with me. And the kind of the basis for what this presentation started as was a blog post I wrote a year ago, based off this tweet by Jesse, who helped found our R4DS online learning community. And basically, she proposed that, you know, if you'd started R and didn't end up continuing with it, what were your stumbling blocks?
And some of the common things that resonated with me were things like this: you know, I didn't have examples of code or people doing the basic things I need to do, like basic multivariate stats, and I wasn't experienced enough. For David here, he was interested in R, but for a lot of people, it was too high-level; maybe they weren't ready for full-blown enterprise-level data science, they were just trying to get started with basic statistical programming and using the tidyverse to create some plots or do some statistical analyses. And lastly, my contribution was: I know it can be frustrating. You're just like, hey, I just want to run my ANOVA in R, but getting the data in, cleaning it up, and getting everything set up can often be a hurdle for a lot of people.
Why R?
I'm borrowing this learning-curve graph from JD Long, who is a data scientist out in the world, great guy; if you have a chance, you should look at some of the presentations he gave at rstudio::conf 2018 or 2019. And basically, the premise of this graph is that there's this learning curve. As with anything, when you initially get started with learning something new, you're here at what they call the suck threshold, in terms of things are hard and you're not very powerful. With a bad learning curve, you'll actually stay here, not being able to do much that's powerful, for a long time. So the goal for today is to get us on this good learning curve, where you're able to get to the point of "I'm kicking ass, I'm doing powerful things, I'm creating something useful in R, and I feel powerful."
So first of all, why R? I think the biggest thing for me is you're able to connect with an amazing community. Often when you're doing stats or programming, it might be all by yourself in terms of you could be siloed in an academic setting, or maybe other people in your workplace are using just SQL or Excel or some other language. So you're able to connect with this larger community, ask them questions and kind of gain access to their expertise, whether it's through things like LinkedIn, or Twitter, or other different resources like Stack Overflow, or RStudio Community.
Additionally, programming itself is basically a superpower that everyone has access to. Programming can kind of level up your skill set over something like a graphic user interface and allow you to do things more efficiently. And the thing for me that I really liked was that I could do cleaning, analyzing, plotting, and finally communication around my data all in one place. So I didn't have to switch between a bunch of different software suites and pay for a bunch of different software suites. I was able to do everything in R.
Reproducibility is a big thing with statistical programming in terms of your kind of data product is based off of code. So you kind of have an example of exactly what you did rather than having to ask somebody, well, hey, what did you do to clean up this data? And they're like, oh, I don't know. It was a year ago. With, you know, code, you're able to actually reference exactly what you did and see what's going on. You can do things like automation, where if you have a repeated report, you can just rerun your code with new data, and it will update that report for you. And obviously, it's free. You know, R is an open source language. It's extremely powerful. And all the packages that add additional features to R are also free.
We actually ran a survey right before our last, RStudio conference and asked people what they like best about R. And again, the community packages, the tidyverse, ggplot2, these were some things that were very popular amongst users. So being able to be powerful quickly and engaging with the community is something that really resonates across groups.
Lastly, kind of hitting again on why you should be excited about the community is, as Kim says here, you know, because it's fun and you can learn so much without having to seek it out. So you're just able to kind of passively take information and learn a lot by what other people are doing. Ludo over here has some other examples in terms of, oh, I can actually help someone out who's six months behind me in their learning process or commiserate over something that's difficult. Thank somebody, see examples of data viz or other excellent resources.
What we're covering today
This resonates pretty well with the overall R4DS model that Hadley Wickham and Garrett Grolemund laid out in R for Data Science, their textbook. Often statistical workflows look like this: you bring your data into R, you tidy or clean it up somehow, and then you go through this cycle of visualizing, modeling or statistical testing, and transformation of data, eventually creating a data product that you can communicate, whether it's R Markdown or Shiny or just a simple plot. And all of this is wrapped around by programming, in terms of rather than having to point and click and do all these other things, you're able to write a flow and syntax in R, and it will create these things for you as you work through the workflow here.
I really want to emphasize that the other things that we're not doing today, which is covering what statistical tests to run when, we will not be doing a deep dive into statistical programming or data science in terms of scaling out to the enterprise, and I don't expect you to 100% get it the first time. This is just the start of your journey and I'm hoping to kind of, again, get you on that good learning curve and get you started down a path where you're able to be successful quickly.
R vs RStudio and packages
As far as R and kind of getting into the meat of the presentation, I wanted to emphasize the difference between R and RStudio. So if you think of R, it could be the engine of your vehicle. So it actually will do all the computation for you, it is actually the code that you're writing and executing, whereas RStudio, the IDE or integrated development environment, is just an interface to that engine. It allows you to have a nice place to work and write code, some user enhancements that I'll walk through, but R itself is executing everything.
This basic example of the RStudio Cloud interface shows you the two pieces. The console is the interface directly to R: you can write some code here, you execute it, and in this case it created a plot as its output. You can also have outputs that are more numeric or text-based; those appear in the console. But everything around it here is the RStudio interface. I'll walk through this, and we'll actually do a live coding demo, but I just wanted you to get a very brief overview of what this looks like.
The other thing I wanted to cover before we jumped into live coding was the difference between R and R packages. So you can think of R as a new phone, you know, it's powerful, it has some core features like, you know, calling, texting, email, a web browser, but you need to add additional applications to do more things, and R is exactly the same way, where you can install packages that add new features. So things like the tidyverse, whether it's dplyr or ggplot2, allow you to get new features that you can then use in R.
Live demo: RStudio basics
And with that, I'd like to swap over into the live demo. So again, let's see if I make this full screen, this will be the RStudio cloud interface. So this is actually a hosted version of RStudio, so we have an option of installing it on your desktop. You can do that after the fact, and kind of see, you know, using it locally on your own machine. But I wanted to start with RStudio cloud, because I'm actually going to share this workspace with you, and you'll be able to interact with the exact same code and data that I'm using, and you'll be able to try it out without having to install anything.
So here, again, is the RStudio interface. If we were to work directly in the console, we could do some math. So we're typing some code, and if we just do 3 plus 3, we get our output of 6. It's often more useful, rather than writing long functions in the console, to write them inside a script or inside an R Markdown document. So here you're able to write out longer things. I have about 50 lines of code here, and I can write some comments and do things, and I'm still able to execute the code that I have. So again, the 3 plus 3, I get my output of 6.
The other thing I wanted to cover real quickly is base R functions and objects. So you can do math, or do anything else, and assign it to an object. So in this case, we'll assign 3 plus 5 to x, and then we could actually call x, and we can see that it will then execute the math. Additionally, we can see that x has been stored as a value here in our environment. So when you're working with data sets, and I'll import a data set here in a little bit, you can actually find them here in the RStudio environment pane. So that can be useful if you're not used to, like, you know, finding where your data sets are. Always look up here in the top right for your environment.
You can also do some other things, like combine numbers into a vector. So here, 1 through 5 is now stored as y. And there are some functions that are just built into R, things like seq, which will generate a sequence of numbers. So 0 to 10, iterating by 2: 0, 2, 4, 6, 8, 10. These are things that are all built into R.
Additionally, you can write your own functions, which is where a lot of power comes from. I'm going to create a function here called add_pi, and this function basically says: take something that I input, and add 3.14 to it. So if I save this and apply add_pi to 3, we can see that 3 plus 3.14 gets us to 6.14. Now, you may not write functions yourself for quite a while, but a lot of libraries are actually built out of functions like these. They give you interfaces where, rather than having to type all of this out yourself, you just call add_pi.
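The function sketched above might look like this (the name add_pi and the constant 3.14 come straight from the demo):

```r
# A simple user-defined function: take an input and add 3.14 to it
add_pi <- function(x) {
  x + 3.14
}

add_pi(3)  # returns 6.14
```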
And similarly, if we use a library, which is a collection of functions, we can load it, and this now gives us a bunch of plotting functions from the ggplot2 library. This will take a dataset, in this example the mtcars dataset, and allow you to set up a graph. So if I just run the code here, it will output a graph, and you can see that I have a graph of horsepower versus miles per gallon, with a linear model fit to it, and a title.
The real big thing that I love about R, and a lot of the tidyverse functions, is that this is very word-based, or verb-based. So I can look at this and say, okay, I want my x-axis to be horsepower, and it's horsepower. I want my y-axis to be miles per gallon, and you can see it's miles per gallon. And then I want to add points to that plot. So if I just call this part right here, we can see that graph of horsepower versus miles per gallon, with the nice points here. We can then add more features to it: adding things like geom_smooth, for that linear model fit, and then changing the titles so that they look nicer. Rather than saying HP and MPG, we relabel them so they say horsepower and miles per gallon.
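A sketch of that layered plot, using the built-in mtcars dataset; the exact label text here is my own:

```r
library(ggplot2)

# Build the plot layer by layer: data and aesthetics first,
# then points, a linear-model fit, and readable labels
p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(x = "Horsepower", y = "Miles per gallon",
       title = "Horsepower vs. miles per gallon")

p  # printing the object draws the plot
```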
I'm also going to load the tidyverse library, which has a lot of other functions and other libraries within it. Mainly the ones we're looking at are dplyr and tidyr, which are used for data cleaning and transformation, and ggplot2, which I already loaded, which allows you to create plots.
The last thing I want to cover related to the tidyverse is the pipe, which you can think of as "and then." So if you're reading some code: here I'm going to take the mtcars dataset, and then I'm going to mutate cylinder into a factor, and then I'm going to select the columns miles per gallon, cylinder, displacement, and horsepower. If we look at the mtcars dataset before we do anything else, you can see that it has a lot of columns, from miles per gallon all the way up to carb, and we can see that cylinder is labeled here as a double, which means it's stored as a numeric value.
When we do this transformation here, we take that dataset and then change cylinder to be a factor, and then select only these four columns, you can see that it will actually change the dataset. So now cylinder is labeled as a factor, and you only have the four columns of interest.
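The pipe-and-verbs pipeline described above could be written like this (the column names are mtcars's own abbreviated names):

```r
library(dplyr)

# "Take mtcars, AND THEN make cyl a factor,
#  AND THEN keep just four columns of interest"
car_df <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  select(mpg, cyl, disp, hp)

glimpse(car_df)  # cyl now shows as a factor; only 4 columns remain
```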
The tidyverse is based around these functions, where you have nice verbs to work with that allow you to understand what you're doing, even if you're just getting started. So yes, this may seem a little intimidating right now, but you are able to work through these functions and get a sense of what you're trying to accomplish. And then again, we can take that same dataset and create a plot with it. At this point, instead of using geom_point, I use geom_boxplot, which creates a quick box plot, and then overlay jittered points on top of the graph here.
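A minimal sketch of that box plot with jittered points, again on mtcars:

```r
library(ggplot2)
library(dplyr)

# Box plot of mpg by cylinder count, with the raw
# observations jittered on top of each box
p <- mtcars %>%
  mutate(cyl = as.factor(cyl)) %>%
  ggplot(aes(x = cyl, y = mpg)) +
  geom_boxplot() +
  geom_jitter(width = 0.2)

p
```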
The Alzheimer's dataset
So the origin of this whole talk was a blog post I wrote back on my blog here, and what's interesting is it covers a fake dataset about Alzheimer's disease: giving these patients a drug and seeing how it changed their cognitive function. I wrote this whole blog post in R, so all these graphs and text, this whole thing, was written in R. As an example of the power of what you can do, I can actually just rerun this whole report and create that blog post again, and you can see that in a matter of seconds I've recreated the entire thing. All the text, adding the pictures, rerunning all the code, making the plots, all of that has been recreated quickly. So this is one of the powers of code: you're able to reproduce what you've been working on.
As a basic example, we can think of Alzheimer's disease as a decline in cognitive function due to changes in the brain. So your brain actually deteriorates as you age with this disease, and in this example we're looking at a novel drug, giving it at a couple of different doses to males and females and seeing how it changes their cognitive function. The context isn't super important; it's just the dataset we'll be using, but at least it gives you a sense of what we're getting started with.
Reading data and exploratory analysis
I did want to let you know that as we get started with this second file, that this is actually an R Markdown document, and if we compared that to our R file here, you can see that the text, if we were to write it out, gives you this error because it's trying to evaluate this code. With R Markdown, there's an intent of saying, I'm going to write text here that is describing what I'm doing or talking through what the results are, and then each of these code chunks here will actually allow you to evaluate code as you would expect.
So for the first example, what we want to do is load our libraries. Again, these add new functions to R. We have the tidyverse, for plotting, cleaning, and other functions like that; the broom package, which cleans up statistical outputs and makes them a little tidier; knitr, which handles tables and R Markdown output; the readxl package, to read in Excel files; and then the here package, which allows you to point at datasets and file paths programmatically when reading them in. So we'll load these, and they've been loaded now.
So the next step would be to read in the data. So with other kind of interfaces, you might go into file and open a data set. You are still able to do that in RStudio. So you can tell RStudio either programmatically, so write out code saying, I want to save a data set known as raw df, and I want to read it in as an Excel file. So if I do this, it will add the raw df data frame in. Or if maybe you wanted to do it not programmatically, you could also import a data set. So go to file, import data set, and then go to Excel. You can then browse and then choose a data set of interest. So here we're looking at the AD treatment. Open it, and it will give you a preview of what the data set looks like, the code to actually read it in, and then you can import it that way as well.
Because I know exactly what the dataset is, I just did it programmatically. So again, I use the read_xlsx function and read in the dataset as an .xlsx file. It's stored up here, and if I want to, I can take a quick look at it using the glimpse function, which just shows you an overview: it says how many observations there are and the number of variables, and then gives you a quick peek at what the actual variables are.
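Reading an Excel file programmatically might look like this. Since the webinar's AD treatment file isn't available here, this sketch uses an example workbook that ships with readxl so the code runs anywhere:

```r
library(readxl)
library(dplyr)

# In the webinar the file is an Excel sheet in the project folder;
# readxl_example() returns the path to a bundled example workbook
path <- readxl_example("datasets.xlsx")
raw_df <- read_xlsx(path)

glimpse(raw_df)  # rows, columns, types, and a peek at the values
```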
Now that we've got it loaded into R, we can actually start using it. So again, using something like ggplot2, we can say, okay, we want to use the data raw_df, which is the dataset we just loaded, and we want to put the mini mental status exam, our measure of cognitive function, on the x-axis. I'm then going to add geom_density, which tells it to create a density plot, which is kind of similar to a histogram, and we can see the distribution of cognitive function across our patients.
So this is kind of a quick exploratory analysis. You can see it's a very small amount of code. You're really just saying what data set, what are your axes, and then what kind of plot do you want to make, and it'll quickly make this plot for you. And this shows us that we have kind of a bimodal distribution. We have a group of patients that have a low mini mental status exam, so lower cognitive function, compared to this patient population, which has a higher cognitive function, closer to about 25.
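A minimal version of that density plot; mtcars$mpg stands in here for the MMSE column of the webinar's dataset:

```r
library(ggplot2)

# Quick exploratory density plot of a continuous variable:
# just the data, one aesthetic, and one geom
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_density()

p
```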
Now that we've got an exploratory graph, maybe you want to take a look at some basic stats about it. So we can run a new function: again, taking raw_df and then summarizing it. The summarize function from dplyr says, okay, take this dataset and knock it down to just these two values. I want to look at the min and the max. You see the min here is 8.4, so that's the lowest mini mental status exam value, and the highest is about 28.
But this just shows you the raw range across the whole dataset, which isn't all that useful. So we can add a second function here: group_by. Rather than summarizing across the whole dataset, we group it by health status, which distinguishes Alzheimer's patients from healthy patients. So we run this: raw_df, group_by health status, and then for our summarize we want the min value, the median value, and the max value, where min, median, and max are all functions built into base R. We can see that the Alzheimer's patients generally have a lower range: a min of 8.4, a median of about 15, and a max of about 25. Compared to the healthy patients, where all of these are above 20 for their mini mental status exam.
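The grouped summary could be sketched like this, with mtcars columns standing in for the webinar's variables (cyl for health status, mpg for the MMSE score):

```r
library(dplyr)

# Grouped summary: min, median, and max within each group,
# instead of across the whole dataset
res <- mtcars %>%
  group_by(cyl) %>%
  summarize(min = min(mpg),
            median = median(mpg),
            max = max(mpg))

res  # one row per group, one column per summary statistic
```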
So that's pretty interesting, but maybe we want to see something more about the dataset. So we'll look at a count, kind of a table function showing how many of each group we have. Again, taking the raw_df data frame and then grouping by both drug treatment and health status, where we have a couple of different doses of the drug, a high dose, a low dose, and a placebo, given to patients who either have Alzheimer's or are healthy. And we can see that we have a hundred in each of these groups. So each group has an equal number of subjects; it's a well set up study here.
When we're thinking more about summary statistics, it might be helpful to kind of get into some additional graphs. So rather than just doing a raw distribution, we might look at it by a few different groupings. So again, something more similar to where the drug treatment versus the health status example. So in this case, we have ggplot2, and we're loading our data set of interest, raw df. We're taking on the x-axis drug treatment, so either high dose, low dose, or placebo. And then on the y-axis, or the dependent variable, we'll look at mini mental status exam, and then we'll assign color to be drug treatment. This will basically say that there will be, you know, three different colors here, so each of the drug treatments will get to be a different color rather than all the same.
We're going to build a box plot, so it shows the different examples of the min, median, and the range. And then we're going to facet this, or create small multiples, according to health status. So if we go here, we can see for Alzheimer's versus healthy patients, we can see a nice graph of the mini mental status exam, so the high dose, the low dose, and the placebo. And then here in the healthy patients, again, the distribution around that.
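A sketch of that colored, faceted box plot, using mtcars stand-ins (cyl for drug treatment, am for health status):

```r
library(ggplot2)
library(dplyr)

# Box plots of mpg by cylinder count, colored by group, then
# faceted into small multiples by transmission type
p <- mtcars %>%
  mutate(cyl = as.factor(cyl),
         am  = as.factor(am)) %>%
  ggplot(aes(x = cyl, y = mpg, color = cyl)) +
  geom_boxplot() +
  facet_wrap(~ am)

p
```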
Something I do want to point out is that, you know, these axes aren't, you know, super pretty in terms of the drug treatment has, you know, a symbol in it. Mini mental status exam is all lowercase, and we probably want to flip this so placebo, low dose, and high dose are arranged differently. But we do have a sense of there appears to be a difference in that the healthy patients, regardless of treatment, appear to be having higher MMSE scores compared to the Alzheimer's patients where the placebo has a low MMSE, and the low dose and the high dose increase it slightly with that treatment there.
Data cleaning and summary statistics
Again, just taking a quick look at the raw data frame, we can see we have age, sex, health status, drug treatment, and then the MMSE as our dependent variable. It's important to note that these variables here are actually labeled as just text or character, and we actually want to save those as a factor, which assigns them, you know, a tier or kind of a categorical label as opposed to just raw text. These are numeric or double, so we can leave them as is, except for sex is coded as 0 or 1, where 0 is male and 1 is female. So we do want to do some data cleaning here to get everything off the ground correctly.
So we'll create a new data frame, a new object here. We're going to take the original raw data frame and then mutate it, changing sex to be a factor, where the levels are going to be 0 and 1, and then assigning labels of male and female. Basically, R will read through and say, okay, there's a 0 here, this is now male; there's a 1 here, so this is assigned female; and it will make that change and store the column as a factor. We're also going to change drug treatment into a factor: we mutate drug treatment to be a factor where the levels are placebo, low dose, and then high dose, so we're actually putting them in the order we want to see. Lastly, we're also going to mutate health status to be a factor, where the levels are healthy and then Alzheimer's.
So if we run this code and glimpse the data again, the variables all look the same, but it says male here as opposed to 0, and here's that 1, so it should be female, and we see female. And these columns are now labeled as factors: health status, drug treatment, and sex are all factors now. That's good, because it sets us up for our future analyses, whether that's statistics or plotting.
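The factor recoding described above, sketched on a small toy data frame (the 0/1 coding of the sex column matches the webinar's dataset; the toy values are made up):

```r
library(dplyr)

# A toy stand-in for raw_df with sex coded 0/1
toy_df <- data.frame(sex  = c(0, 1, 0, 1),
                     mmse = c(14, 25, 16, 27))

# Recode the 0/1 column to a labeled factor:
# level 0 becomes "male", level 1 becomes "female"
toy_df <- toy_df %>%
  mutate(sex = factor(sex, levels = c(0, 1),
                      labels = c("male", "female")))

levels(toy_df$sex)  # "male" "female"
```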
So again, we're going to create a summary table here. For our summary data frame, we want to look at means, the standard error, and the number of samples. So what we're doing is taking the data frame and assigning the result to a new name. We're going to group it by sex, health status, and drug treatment, and then summarize with our functions of interest: we'll get the mean, we'll get the standard error, calculated as the standard deviation divided by the square root of n, and then show the number of samples using the n function. Lastly, we'll also tell it to ungroup, so that it drops the grouping after everything has been calculated.
So if we run this code, we can look at sum_df. Rather than 600 observations, we've now gone down to 12 observations, one for each of the different groups: male healthy placebo, male healthy low dose, female healthy placebo, female healthy low dose, and so on, each with the mean, the standard error, and the number of samples calculated out. So this is helpful in that we've got our nice summary statistics saved here and we're ready to work with them, whether that's plotting or saving them as a table itself.
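The summary-table step might look like this; mtcars and its grouping columns (cyl and am) are stand-ins for the webinar's sex, health status, and drug treatment:

```r
library(dplyr)

# Mean, standard error (sd / sqrt(n)), and group size, per group,
# then drop the grouping once the statistics are calculated
sum_df <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(mean = mean(mpg),
            se   = sd(mpg) / sqrt(n()),
            n    = n()) %>%
  ungroup()

sum_df  # one row per cyl-by-am combination
```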
If we did want to create a summary plot of that, we can take that data frame, so sum_df rather than the raw_df we've been working with previously, and again assign an x- and a y-axis. Here the x-axis is drug treatment and the y-axis is the mini mental status exam mean. We're going to assign a group of drug treatment, which groups those numbers together, and a color of drug treatment, so we get three colors instead of one. We'll increase the size of the geom_point, so the points we add to the plot are a little bigger. And lastly, rather than making one plot where everything is overlaid, we're going to facet it, or create small multiples, with facet_grid, laying it out as sex by health status. So if we rerun that code, we can see that it's now separated into four plots rather than one, with male and female here.
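A sketch of that faceted summary plot; the toy sum_df built here just mimics the shape of the webinar's summary table (the values are random):

```r
library(ggplot2)

# Toy summary table: 3 treatments x 2 sexes x 2 health statuses
set.seed(42)
sum_df <- expand.grid(treatment = c("placebo", "low", "high"),
                      sex       = c("male", "female"),
                      status    = c("healthy", "alzheimers"))
sum_df$mean <- runif(nrow(sum_df), 10, 28)

# Group means as larger points, in a grid of small multiples
p <- ggplot(sum_df, aes(x = treatment, y = mean,
                        group = treatment, color = treatment)) +
  geom_point(size = 3) +
  facet_grid(sex ~ status)

p
```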
Running the ANOVA
We're finally getting to the ANOVA, walking through it slowly and getting set up with the actual statistical analyses now. For this example, we're going to create a new data frame, a stats data frame, since we already saved the last one at the summary level. We'll start again with raw_df and do our mutate again: we'll change drug treatment to be a factor, sex to be a factor, and health status to be a factor, so that it's ready to go for the ANOVA.
The actual code for running an ANOVA, at least with a Type I sum of squares, is pretty simple: the basic setup is to use the aov function. You assign the dependent variable first, use a tilde, and then the independent variable, and you tell it what the data set of interest is. So here we're using a fake data set, data_df. If you want to do main effects or interactions, there is a slight variation there. So for just main effects, you add your dependent variable and then your independent variables, so here sex, drug treatment, and health status are all added individually as main effects, or you can do main effects and interactions by using a star instead of a plus sign.
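The two variations can be sketched like this (data_df and the column names are placeholders, as in the example just described):

```r
# Main effects only: independent variables joined with plus signs
main_aov <- aov(mmse ~ sex + drug_treatment + health_status, data = data_df)

# Main effects AND interactions: joined with a star
ad_aov <- aov(mmse ~ sex * drug_treatment * health_status, data = data_df)
```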
There's a couple different ways of doing ANOVAs. I'm just showing one example here, and there's a whole lot of other different statistical tests that you can do in R. So again, this is just a basic example, and there's some textbooks I'll be sharing about great examples of going through choosing a different statistical test or approaching it from a statistical validity standpoint, just trying to get us started with the basic setup here.
I did want to point out that you can use the plus or the star, but we don't want to use commas here. That would actually throw an error, and that's okay. It doesn't break anything; it just says it's not able to do this because it's trying to assign things in an improper way. So always make sure to use a plus to add only main effects, or the star for the main effects and interactions. So we'll run it as the main effects and interactions, and now we can see that it saved our Alzheimer's disease ANOVA as a list object here.
So if we actually call summary on that object, ad_aov, you get a nice table of what you would expect to see: your degrees of freedom, sum of squares, mean square, your F value, and then your p-value, and it also tells you the significance levels, whether it's 0.1 or all the way down to zero. So we can see that there are significant main effects as well as an interaction of drug treatment by health status, so we'll follow up on those, but this kind of output here is not always the most useful in terms of working with it in other ways.
So we can actually tidy this up with the broom package, and if we look at the tidied ANOVA, we can see that it's in a different format now. It's a little bit cleaner, so rather than having that kind of base R output, we've got it in a nice data frame where you're able to see all the different values a little bit more cleanly, and again, you could save this to Excel or as a CSV if you wanted to export it, or you can just keep it here.
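The tidying step is roughly this, assuming an ANOVA object called ad_aov as above:

```r
library(broom)

summary(ad_aov)              # classic base R ANOVA table
tidy_ad_aov <- tidy(ad_aov)  # the same results as a tidy data frame
tidy_ad_aov

# Optionally export the tidy table, e.g. as a CSV:
# readr::write_csv(tidy_ad_aov, "ad_aov_results.csv")
```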
Post-hoc tests
Because we did see main effects and interactions, we should follow up with post-hoc tests, so we'll take a look at a couple different post-hoc test examples. The first one is just a pairwise t-test with a Bonferroni correction, and the basic setup here, if we take a look, is that we're going to save it as ad_pairwise, so we know which one this was. The pairwise t-test function gets us started, and we pass in stats_df's MMSE column as the dependent variable.
And then we've got some slightly weirder syntax here that's hard to read, but I'll walk through it. So again, we're using the stats_df data frame, and we're using the sex column as the factor. We're doing an interaction, shown here with the colon, then the next factor, which is drug treatment, another colon for the interaction, and then stats_df's health status column, and lastly we're telling it to do a Bonferroni p-value adjustment for multiple comparisons. So we run that, and we can again see that it's saved as an object.
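That call can be sketched like this, with the interaction written using colons and the column names assumed as before:

```r
# Pairwise t-tests across every sex x drug_treatment x health_status group,
# with a Bonferroni correction for multiple comparisons
ad_pairwise <- pairwise.t.test(
  stats_df$mmse,
  stats_df$sex : stats_df$drug_treatment : stats_df$health_status,
  p.adjust.method = "bonferroni"
)
```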
Again, I like to clean this up a little bit, so I'll run tidy on it. So we use the broom package to tidy it, and I'm actually going to then mutate the p-value to round it, so it's less expansive; rather than lots of significant figures, we get just five digits. So we can see the comparisons here: comparison one, male placebo Alzheimer's compared to male placebo healthy, is significantly different, whereas the low dose in the healthy male versus the placebo in the healthy male is not statistically significantly different. That's just a quick example of the Bonferroni.
If maybe you like the Tukey test better, you can also do that. So here we're using the TukeyHSD function on the ad_aov object we've already created, so that was our actual ANOVA output from above, and this syntax is a little bit cleaner: we say run a Tukey HSD where the interaction of sex, drug treatment, and health status is of interest. We're then going to tidy it to clean it up, take just the first six values with head, and create a nice table. So if I do this, we can again see that we have our term and the comparison here, so female placebo healthy versus male placebo healthy. It's got your confidence intervals, low and high, and then the adjusted p-value for the Tukey test. These are going to be pretty similar to the Bonferroni because the groups are very, very different, but you'll obviously need to choose whichever statistical test makes sense for your analysis.
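A sketch of the Tukey version (the exact spelling of the interaction term is an assumption based on the column names used above):

```r
library(broom)

# Tukey HSD on the three-way interaction from the saved ANOVA object
tukey_ad_aov <- TukeyHSD(ad_aov, which = "sex:drug_treatment:health_status")

tidy(tukey_ad_aov) %>%
  head()   # first six comparisons: estimates, confidence intervals, adjusted p-values
```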
Building the publication plot
Lastly, in the next couple of minutes, I'm going to go through a quick guide to the publication graph. So I'm actually going to create a data set, and I can create this with the tribble function, where I can assign a column and then all the different values in it. So if I run this function here, you can see that column A gets a, b, and c, a, b, and c, and column B gets 1, 2, 3, 1, 2, 3 in it. We assign the column names by adding a tilde before each one, and then each of the columns below gets filled in correctly.
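A minimal tribble() example along those lines (the values here are just for illustration):

```r
library(tibble)

# Column names get a leading tilde; values are then entered row by row
df <- tribble(
  ~colA, ~colB,
  "a",   1,
  "b",   2,
  "c",   3
)
```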
Rather than just adding kind of random data, I'm going to assign these as the proper groups. So I actually looked back through the pairwise comparisons above and found the groups that were significantly different. So the low-dose Alzheimer's male and the high-dose Alzheimer's male versus the placebo, and then the low-dose Alzheimer's female and high-dose Alzheimer's female versus placebo, were significantly different. So we're creating a new data set based around that, and then reassigning our factor levels there.
We're finally getting to that publication plot, and it looks a lot wordier or longer than our plots from above. That's because we're doing some more customization. So if we look at the actual plot here, and I'll pull it up in the window so you can see it nice and big. You can see a few different things. So we're doing our basic setup where we have our data as the sum data frame. We're assigning x as the drug treatment and y as the mini mental status exam mean. We're going to fill it or assign a color according to drug treatment and group it by drug treatment, and then we're going to add error bars. So this white, gray, black is according to the fill, and then we've assigned the error bars which are really small here according to group.
I built out that data set above because I'm actually adding small asterisks to the graph, and that's done with the geom_text I've assigned here. So this is data according to the significance data frame that we set up, adding a label of an asterisk. I also changed a couple of the theme items, so changing it to be a white background versus a gray, and changing the axis text size. All these different things are kind of at your discretion. So maybe you're like, oh, I want to add these different changes or make it a little bit prettier; this would be an example.
The basic idea is that we create the plot based off of the data set. We assign the column, or sorry, the x and y-axes, and what type of graph we want to make: an error bar, a bar graph. Then we want to make small multiples, or multiple graphs, according to sex and health status, where we see sex and health status again. Lastly, we'll add our labels, so rather than kind of bland, small descriptors, we have a nice big "drug treatment" and "cognitive function" and then a figure caption here: figure one, the effect of novel drug treatment, with our groups and the total N as well as a significance indicator.
Lastly, we can take that graph that we saved as g1 and give it a nice file name, so "AD publication graph," and assign a height, a width, units for those, and then a DPI. The DPI is the dots per inch, which basically means higher quality as it increases. If you don't do that, it might be a little bit fuzzy. As you can see, this one is a little bit fuzzy, versus if we look at the publication graph here, it's very nice and sharp, and it's actually very large, and you can zoom in and it looks good.
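Pulling those pieces together, a condensed sketch of the publication plot and the export step (column names, theme choices, labels, and the file name are all illustrative, not the exact code from the webinar):

```r
library(tidyverse)

g1 <- sum_df %>%
  ggplot(aes(x = drug_treatment, y = mmse_mean,
             fill = drug_treatment, group = drug_treatment)) +
  geom_col(color = "black") +                                    # bar graph
  geom_errorbar(aes(ymin = mmse_mean - mmse_se,
                    ymax = mmse_mean + mmse_se), width = 0.2) +  # error bars
  facet_grid(sex ~ health_status) +                              # small multiples
  theme_bw() +                                                   # white background
  labs(x = "Drug Treatment", y = "Cognitive Function (MMSE)",
       caption = "Figure 1: The effect of novel drug treatment on cognitive function")

# Save at a high dpi so the exported image is sharp rather than fuzzy
ggsave("ad_publication_graph.png", g1,
       height = 6, width = 8, units = "in", dpi = 600)
```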
Next steps and resources
So for the next steps and the follow-up items that might be helpful for you, I will be sharing the slides in the RStudio Cloud instance. So you'll have all of this code available and be able to play around with this data set in RStudio Cloud without having to install anything; you just do it through your web browser. So you'll be able to walk through exactly what I did and play around with it.
Additionally, working inside RStudio Cloud, there's actually training data sets and examples there. So if you go to RStudio Cloud primers, it'll actually show you the basics of how to get started with R, how to work with data, do some visualization, tidying, iteration, which is more of high-level functional programming, and then writing functions themselves, which I covered very quickly. The good thing here is that these are interactive, so you're able to write code and get feedback immediately. They're ready to go without having to install anything and you can't break anything. You don't have to worry about installing something on a Windows versus a Mac versus a Linux box. It's just ready to go in your web browser.
There's additionally some great books if you're trying to figure out, okay, which statistical test do I need to use? Learning Statistics with R by Danielle Navarro is a great book that covers the didactic understanding of why statistics and which to use, as well as raw R code about how to do them. The R for Data Science book actually covers examples of doing data science with R as opposed to pure statistics, so doing more programming, plotting, and exploratory data analysis; this is a great textbook by Hadley Wickham and Garrett Grolemund. And if you're interested purely in data visualization, Socviz (Data Visualization) by Kieran Healy is a great resource for seeing example R code to create beautiful plots and examples like you saw today.
The last thing I wanted to go over as well was the TidyTuesday R4DS online learning community. So this is that weekly project I talked about briefly. Basically, a data set gets uploaded weekly at GitHub, so GitHub R for DS TidyTuesday. Every Monday on Twitter, we release the data set, and it looks like this. So a link to the data, an article about the data, and then some example plots. Basically, you just play around with the data, and you can share it or just observe what other people are doing on Twitter. It's a great way to learn more about using R and the tidyverse with new data sets each week.
Additionally, there's RStudio Community where you can ask questions, so whether it's questions about the IDE or if you need help with actual code, you can go in and ask there. The R4DS online learning community has a Slack group where you can also ask questions, and it's a little bit more private since it isn't a public-facing group; anyone can join, but you have to log into Slack at the very least. It sometimes feels a little bit quicker to get started there, as you're able to get immediate interaction.
And lastly, the R community is pretty active on Twitter. Somebody actually collected all the tweets using the #rstats hashtag between 2008 and 2018, and you can see there are almost 500,000, or about 430,000, tweets. There are additional hashtags you can follow, whether it's #rstats or tidyverse, ggplot2, epitwitter, or rspatial. There's a bunch of different resources there.
rOpenSci has a lot of packages that may be more specific to your type of science, so I covered one example for neuroscience. Maybe you're doing plant science or marine biology or something; there could be other packages through rOpenSci that are useful, and that's just ropensci.org. And lastly, if you're trying to get involved, the R-Ladies global community is great, and basically their mission is to promote gender diversity in the R community, so this includes under-representation of any minority gender. You're able to join remotely or at actual meetups in your city, and it's a great initiative for staying involved and contributing to this community.
And lastly, I know this was very quick, and it can seem intimidating to get started with R, and maybe you're like, oh, I'm later in my career or I'm a little bit older. Jenny Bryan, who I have a huge amount of appreciation and respect for, was talking about this with Sharla, about how everyone has their own career path, and it could be a slow career movement or a fast career movement, and that's okay. So it kind of reinforces that you don't have to get started super quickly, and you can go through it at your own pace. And then, obviously, Hadley Wickham is pretty well known in the R community, and he still Googles stuff and finds things out through the #rstats hashtag. So again, no matter how experienced you are, it's okay to make mistakes. It's okay to be learning, and we're all here to learn with you.
So with that, I'll actually end a little bit after time, but still have some time for questions. I appreciate your attendance today, and I'll take questions now. Thanks.
Q&A
Ah, so the key differences between RStudio Cloud and RStudio is RStudio Cloud is hosted, so you connect to it purely through a web browser. You don't have to install anything. And then the RStudio Desktop is actually installed on your computer, whether it's your laptop or your desktop computer. Both will look exactly the same once you're actually using them.
Okay, some people ask about the tidyverse library. It's basically packages. So packages give you those functions that allow you to, you know, do more things in R without having to write it by hand. And the tidyverse is a meta-collection of other packages that follow along in that syntax.
You can actually create websites using R Markdown. It produces HTML content, so you can look at the blogdown or the distill packages. Those actually allow you to create entire websites or blogs in R, and that's how I run my personal blog. There's some great resources for blogdown, there's actually a book about it, and distill has its own documentation website.
Yep, there we go. So this actually walks through kind of in long form exactly what we covered today. I didn't want to just read this out in front of you, but it basically gives a similar presentation of what I did, although I have changed it up a little bit since then. But this could be useful if you maybe don't like videos as much and you want to read through something. You can just search for themockup.netlify.com. It's actually the first post I have there.
The tidyverse is one package, but it's a meta-collection of packages. Someone's asking about how you know which functions are available in each package. A lot of packages will have their own website or a CRAN page. So if you think of the tidyverse, it actually has a website where you can look at all the different things; it's just tidyverse.org. You can look at the different packages that are available there, click on the individual packages, and there's everything from cheat sheets to example usage, where it is in the lifecycle, and then references to all of the different functions that are available in it.
You can also do this from within RStudio Cloud. So you can actually just type some code, and it will show you exactly what that function or package is. If you did, say, question mark dplyr, or question mark ggplot2, that will open an example about that package in the help pane. So it will give you links to external resources, as well as some information about the package. You can also use that same question mark syntax on individual functions. So let's do annotate: if I do it on just one function, it will give a description, the arguments that are available, the details, and a quick example of it.
You can label a factor as anything. So someone's asking about labeling a factor as, say, less than 24, as opposed to 24 written out as text. The label is just a character string, so you could actually type in less than 24, something like factor(x, labels = "<24"), something along those lines. That wouldn't be the full code, just an example that the factor is now labeled with a character string and can be displayed that way.
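A tiny illustration of that idea (the values here are made up):

```r
# The labels argument takes arbitrary character strings, so "<24" works fine
age_group <- factor(c(1, 2, 1, 2),
                    levels = c(1, 2),
                    labels = c("<24", "24 and over"))
```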
Any of the free R training programs, courses, MOOCs: I'm a big fan of the RStudio Cloud primers, because you're able to do them super quickly and for free. You go to the basics, click on visualization, and you're able to get started with examples, actually running code, and it will give you outputs. So you can practice inside of a workspace or just go through these primers.
