
Kelly Nicole Bodwin | Intro stats with R: Easing the transition to software for beginners | RStudio
Talk from rstudio::conf(2019). In this talk, we will present our approach to incorporating R and RStudio into a 10-week introductory statistics course for non-majors at Cal Poly. Our primary contribution will be to share a series of Shiny apps, created to ease students with no statistical or coding background into the philosophy of using programming tools to explore data. Our program was recently used in 3 sections of 35 students each this Fall, during which students were surveyed regularly for their reactions to the approach. We will demonstrate our new tools, discuss our successes and failures, share student-generated output, and summarize the results of our Fall survey. Kelly teaches at Cal Poly in San Luis Obispo, California, where she takes great joy in forcing statistical thought and R skills upon unsuspecting undergraduates. Her research is in clustering methods, digital humanities, and R tools for education. In her free time, she hikes and camps, plays board games, and tweets too much about R.
Transcript
This transcript was generated automatically and may contain errors.
Wow, applause at the beginning. Thanks, Hunter. So that person who just wooed, I just want to shout out, that's Hunter Glanz. He's also at Cal Poly. We're both faculty, and he helped me with this project; he's involved in the coding and everything. We decided not to switch off talking in a short session, but if you have questions, et cetera, you can ask him as well. He's right there, making noises. Okay, so in this talk, I just want to tell you about what I've done with R in the really, really intro-level classes.
So the way I'm going to structure this, basically there's a lot of window dressing. I want to give you some background on what courses we have at Cal Poly, the advantages and challenges of using R in some of these classes, which I'm sure many of you are familiar with, and then like what my lab assignments look like, what are the students being asked to do, and then the reason for the hamburger, apart from it being mid-afternoon and lunch has worn off, is that the meat of this talk is these pre-lab exercises that Hunter and I have built in Shiny that have kind of helped us introduce R to beginners. And then I don't know, I've got some student outcome information. I've surveyed the students, so hopefully you think lettuce is also important to hamburgers.
Courses at Cal Poly
So at Cal Poly, the courses using R: we do have a dedicated R class, so that one, you would hope, involves R in some way. But we also have a lot of upper-level classes, and those will involve some coding in order to get the concepts across, et cetera. Most of the professors seem to favor R, although it's up to each professor's discretion. But then we also have these service-level classes, what we call the introductory classes for non-majors. And these are the classes where there's no stat or CS background. There are no prerequisites of any kind. And everyone agrees that these students should see software in some way.
But typically people are using applets. There's some debate over whether we should use R or something easier, an applet, or JMP or Minitab, some of these point-and-click approaches. That's debatable. There's no opposition to using R as the tool. But what is really unpopular is the idea that students in these classes should code, because these are not stat or CS majors. In my class, they're bio majors, viticulture, food science, animal science, things like that.
Advantages of coding in intro stats
So this is the class I want to talk about. It's a service-level class that I teach. And I do think there are a lot of advantages to introducing coding in this class. Pedagogically, I think that thinking through things with code forces them to not just memorize steps: what are the steps of a hypothesis test, or a t-test, or, same thing, linear regression, et cetera. They have to plan ahead, plan their output, and not memorize formulas.
Conceptually, I think we live in an age where data is all digital. So thinking of data as a digital object and how do people interact with it, it's a good way to kind of interface with how this is being done in the real world and to think of data as information and not a list of numbers that's handed to you on an exam.
And then practically speaking, of course, not all of them are going to be data scientists or creators, but they do need to understand how it's done in the real world. I don't know why we are still using these t tables and normal z tables. Nobody uses those in the real world. We're all just going to use pnorm and qnorm, or our language of choice if this were not an R conference. It makes no sense to be using things in class that are never, ever, ever going to be encountered in the real world.
I don't know why we are still using these t tables and normal z tables. Nobody uses those in the real world. It makes no sense to be using things in class that are never, ever, ever going to be encountered in the real world.
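To make the contrast concrete, here is a sketch of the table-free lookups Kelly is referring to; these are the base-R distribution functions, not code from the talk itself.

```r
# Replacing the printed z and t tables with direct function calls.

# P(Z < 1.96) for a standard normal -- the "z table" lookup:
pnorm(1.96)         # ~0.975

# The reverse lookup: the critical value for a 95% two-sided interval:
qnorm(0.975)        # ~1.96

# The same idea for a t distribution with, say, 24 degrees of freedom:
pt(2.064, df = 24)  # ~0.975
qt(0.975, df = 24)  # ~2.064
```

The same four-function pattern (`p*` for areas, `q*` for quantiles) works for every distribution students meet in an intro course.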
Challenges with new students
So some of the challenges that I've encountered with these new students, logistically, I went in the first time I used R in this class. I was gung‑ho. I had put so much thought into how to make this gentle, you know, and introduce them to coding. And what I instead dealt with for the whole class is where's my downloads folder?
These kids are 17 to 23. I thought they'd grown up with computers and it was natural. But I've learned there's kind of a sweet spot in between the people that are frightened of computers and the people that are frightened of downloads folders.
So that was a challenge. There's ways around it. And I don't want to dwell on this too much. This is very important to the class, but it's not really my contribution. So there's many better people than I to speak of this. But I just want to mention using cloud servers has really helped in this class. RStudio Cloud is awesome. Amazon Web Services, you can set up your own home cloud, which is what we've done at Cal Poly.
And then using notebooks, you know, to feed them code, that helps. Or to call your own scripts, et cetera. GitHub Classroom is really cool. I've just gotten started with that. That's a nice way to just package everything off to these students.
But anyway, the other challenges that I run into besides those logistics are pedagogical. I want to involve coding in my teaching, but I'm not teaching coding. I'm teaching statistics, and I'm teaching concepts of statistics. And so how do I do that without ending up just teaching coding? And what I get in feedback a lot is "we spent all this time learning R and it wasn't on the exam." I did put R output that they have to read on the exam, but they don't have to write code. There are no coding questions and so forth.
So this is a big question of kind of how to synthesize those ideas. And then these students are mostly students who are forced to take stat, don't want to take stat. They don't think they like math. And so it just adds one more thing for them to be scared about in this intro class.
So I did some extensive surveys of my class this fall, which ended just in December. And this is what the students had to say. I had the statement, "I feel nervous about my ability to complete the lab assignment." And over the first three lab assignments here, they're pretty nervous, right? More than half of them are saying that they strongly agree or agree that they're nervous about it. And it wasn't really getting better across the assignments. So this is something they find daunting, and I will readily admit that it's a challenge for them.
Lab assignments
So let me tell you what these assignments are that are so terrifying, that I'm using in my class. So here's an example of a data set that I use. I'm sure many of you have seen it. It's the wine data set. We've got a whole bunch of chemical measures of wine, and then we've got this measure at the bottom, quality, that's going to say if it's a good or a bad wine as rated by experts. So the idea is: what chemical properties does a wine need to have to be a good wine?
And so in my first version of these labs, and these are just snippets from the labs. They're obviously much longer and formatted differently. But I just give them this t-test code and I ask them questions about it. And their job is to look at that output and interpret it correctly. So it's not really coding, right? It's reading output, but it's not coding.
And then the next version was kind of reproduce my process. So first I would do the same thing, ask them something, give them that code. And then I'd say, well, here what I asked them about was pH level. Does the pH level of the wine matter to the quality? And then I said, well, now suppose I've made some claim about density. Can you back that claim up? And all they're going to do is copy, paste, and write density where pH is. Or ideally that's what they're going to do. You'd be surprised the number of different things they would do, such as writing t.test.density, weird stuff like that, because they're so new to this.
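The "reproduce my process" snippet might look something like the following sketch; the data frame and column names (`wine`, `pH`, `density`, `quality`) are my guesses at the lab's layout, not the actual assignment code.

```r
# Two-sample t-test: does mean pH differ between good and bad wines?
# (quality is assumed to be a two-level good/bad factor)
t.test(pH ~ quality, data = wine)

# The students' whole job is to swap in a different variable
# to check a claim about density:
t.test(density ~ quality, data = wine)
```

The formula interface is what makes the copy-paste-and-edit step a one-word change, which is exactly the move the lab is asking for.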
Version three is where I really think my students are being statisticians. This is where I have not given them almost anything. I have said, choose three chemical attributes of the wine, make me some plots, do some t-tests. This is what their lab assignment literally looked like on this, and this is the third lab assignment. So this is week four. I don't do one the first week of a 10-week course. So this is cool, right? I think it's really cool that non-majors can handle this, and they really can. But how do I get them there is the question, without having given them anything right here to work with.
Pre-lab Shiny exercises
So what we came up with, this kind of intermediate transition tool, is these pre-lab exercises in Shiny. And where this idea came from was at ICOTS, the International Conference on Teaching Statistics, in Japan last summer. Hunter and I went to two talks. I think there's a rule that you have to quote Hadley if you're going to talk at this conference. That's what I heard. But really, there were these two talks back to back that really influenced us. The main one was Amelia McNamara saying we need a bridge between learning statistics, those are applets, and doing statistics, that's being a true programmer or coder, and there's not really a bridge in between these things right now.
And this kind of matched a vision that I've had for a long time floating in my brain. And then Hadley backing up that not all statisticians need to be coders, but if you're not interacting with code, are you really doing statistics? I think the challenges and the advantages pedagogically, they all pale in comparison to the fact that if we're offering to teach statistics and we don't teach them to think about it as a digital process, we are not teaching them statistics.
if we're offering to teach statistics and we don't teach them to think about it as a digital process, we are not teaching them statistics.
I'm sure many will disagree, but I would love to argue with you. So that's the idea. So that's where this came from. So Hunter and I, we went upstairs at this conference and we sat down amongst all the Japanese business people having lunch and we coded for a couple hours and then spent the whole summer continuing to develop these things.
So the principles behind these tools: we figured they should have immediate results, right? What's nice about applets is the students go very quickly to seeing output that is exciting and interpretable. So we wanted to maintain that, to immediately see the process. But then we wanted to require them to be generating the queries, to not be selecting from a drop-down menu or clicking a checkbox and so forth, as a real coder would. And we also wanted to make really clear what's built into R versus what they're choosing, because this is not R-specific; I'm not trying to teach them R. I want them to be able to interact with another language and understand that we have to think about what variables we're using, even if the function to make a box plot or whatever has a different name.
And then I really wanted to link the concepts and questions from lecture, the big ideas, to what was happening with the code.
So I want to do a demo here. And I've got a backup plan, but I think these are working fine on shinyapps.io. I'll have these links at the end. I've got three of these hosted on shinyapps.io if you want to demo them. For the class we hosted them separately, because shinyapps.io would get pummeled and I would not be able to handle it on my free account. But these ones are here if you guys want to try them.
So the idea here is, this was their first lab exercise. It's built on learnr; someone asked a question about that earlier, and it's a really nice package for this kind of thing. And the key here is that we have the Titanic data set for all of these exercises, and they have to choose the variables. So let's say we want to look at gender, and then we want to see who survived. And what this does, and kind of the key point, is it fills in that code line there. So in that line of code, the blue is the thing that is a name that was chosen, but I chose it; they didn't just input it. And the red is the stuff that they just selected. And the rest of it is just the syntax of this particular coding language.
And so then we see, okay, they used facet grid. And that's you can kind of get an idea of what's going on, but not so great. So we can go down to stack bar charts, things like that. Make different choices. They can see the output immediately and decide which of those plots answers their question better. Which one is going to show you that men tended to die and women tended to survive on the Titanic.
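The generated line is presumably ggplot2 code along these lines; the data frame name `titanic_df` and the column names are my assumptions, not the app's actual output (the built-in `Titanic` object is a contingency table, so the app would be using a flattened data-frame version).

```r
library(ggplot2)

# Faceted bar chart: one panel of survival counts per gender
ggplot(titanic_df, aes(x = Survived)) +
  geom_bar() +
  facet_grid(. ~ Sex)

# Stacked bar chart: the same comparison in a single panel
ggplot(titanic_df, aes(x = Sex, fill = Survived)) +
  geom_bar(position = "stack")
```

Seeing both versions side by side is the point of the exercise: the students decide which plot answers the question better, not the app.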
And so additionally, we have this idea of tying it to the concepts in class. So here's the exercise for t-tests where we can say, all right, I'd like to make claims about the age on Titanic. The age of the passengers. And under the null, let's say 30. And let's say we're interested in testing whether the average age was less than 30 for some reason. And so we can see this in words that they've been seeing in class. The hypothesis is that the true mean is equal to 30 versus less than 30. We can see this in code, again, with the blue and red. And then we can see the output that they have to interpret.
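The code the app fills in for this exercise is presumably a one-sample t-test along these lines; the object and column names are placeholders, not the app's actual code.

```r
# One-sample t-test matching the exercise:
# H0: mu = 30  versus  Ha: mu < 30
t.test(titanic_df$Age, mu = 30, alternative = "less")
```

The `mu` and `alternative` arguments map directly onto the null value and the direction the student picked, which is exactly the blue-versus-red distinction the apps are trying to teach.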
And learnr is nice. It's got a good interface for including some questions. We can say, oh, look, what sample mean did I observe? Well, that was 29.547. There we go. And it will tell them right away that they got it right. So they're able to check and make sure that they've gotten the output that was expected.
And then the last one I want to show you, just to tie all these principles together, is the normal random variables in general. So we could say something like, let's say we have a normal random variable with a mean of 67 and a standard deviation of 3. And I'm interested in the area below 60. So again, I'm tying this to the way they've seen it in class, which is sort of this notation of normally distributed and what's the probability. And then we can see the code, the result. And then I made these little illustrations just so they can see what's going on, what the picture looks like there. So all these things that they see typically in lecture all in one place and kind of being dynamically generated thanks to the magic of Shiny and the coding that Hunter and I have done, I suppose.
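The code behind that last exercise is a single `pnorm` call; this is a sketch of the computation, not the app's actual source.

```r
# P(X < 60) for X ~ Normal(mean = 67, sd = 3) -- the shaded area in the picture
pnorm(60, mean = 67, sd = 3)  # ~0.0098
```

One line of code, the hypothesis-style notation from lecture, and the shaded-curve picture, all generated together from the same three inputs.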
Student outcomes
So I started these out on my summer class with only 18 students and used them with 108 students this fall. The first thing is, students definitely recognize that R is applicable. This question is, "The in-class lab activity helped me understand how statistical analyses are conducted in the real world," and they overwhelmingly agree, or at least feel neutral about that. No one is arguing with that.
They recognize the value to themselves. This one says, "The in-class activities helped me understand the concepts from the week's material." And you can see they kind of over time started to see more how this was helping them. I don't know if that was a progression as they got more comfortable, or if maybe that third lab, which had to do with the t-tests, was a particularly helpful one. I like that it was positive, and also that it improved over these three labs that I surveyed heavily on.
I really like this one. I feel proud of the work I produced on the assignment, the assignment being the actual data analysis, not these exercises. And they feel very proud of it. Almost none of them feel unproud of it. And most of them were kind of excited. And it was fun during these exercises to watch them hit knit on their file and go, oh, my gosh, it worked. This is cool, you know, and explore this data. That was very rewarding for me anyway.
I surveyed them pretty heavily on the skills they gained. For this one, they were asked to agree or disagree with "I feel confident in my ability to read and interpret a bar plot." And on that one, it looks like the lab helped. So I surveyed them before anything, after the exercise, and then after the assignment itself, the full analysis. And there seems to be some improvement happening.
I was hoping to find some improvement over concepts, and I didn't actually really find that. Mostly anything that was about identifying aspects of data or describing aspects of data. They did not self-report improving, although anecdotally, I do actually feel that they did. And then small improvements for anything to do with creating or analyzing plots and a little bit to do with calculating things.
So, some feedback. I'm hanging myself out to dry here, reading one of my least favorite sentences that I've ever received. In summer, one of my students described these as, let's see, a "complete unorganized catastrophic mess." I was having a lot of trouble figuring out how to host them. shinyapps.io couldn't handle all my students, and I had to relearn Unix to host them elsewhere. So there were some technical difficulties. I tried it on that class because there were only 18 of them, so I could help them out. But for the 108 this past fall, it went very smoothly. And here are three of the comments. People are saying that they thought these labs were helpful. There were some slightly negative comments, but nothing like this one. They were more "I don't understand why we did it" instead of "I hated it." So that was encouraging.
And lastly, I didn't survey heavily my previous classes. I didn't know I was going to be doing kind of experimenting like this. But in winter 2018, I also had three sections about the same time, same number of students. Homework was pretty much identical. I didn't tweak it too much between winter and fall. And the lab material was the same, the same data sets and questions, but it was split for the fall into this pre-lab and lab format.
And so I asked them two questions. Halfway through the quarter, I asked them how much the homework assignments helped and how much the lab assignments helped. And what we found was not a huge change in the homework from winter to fall, which you would expect. It's the same homework. But the lab assignments, if you see here, I don't know why I keep forgetting this exists. There's lab assignments here. In the winter, like three-fourths of my class hated them or didn't think they were useful. And in the fall, most of the class did think they were useful or at least felt neutral. So I think that is maybe attributable to these exercises.
Take-home messages
So just some take-home messages, what worked well. I like that the students could analyze data from scratch themselves. Anecdotally, I really think the students ended up better off in the fall class. I like seeing them get excited about results.
What didn't work? Well, first of all, these exercises, we banged them out. We are not developers. We're trying to be. We're getting there. But we're not. I'm going to hang myself out to dry again. This is kind of what it looks like when I have to both output the strings of code and also evaluate that code. In particular, and a lot of people I know will get mad at me for this: eval(parse()). I know it's a crime. But I don't know what else to do. And I wanted to do something, and something is better than nothing. So there we are. Sorry, Dave Robinson.
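The pattern being confessed to looks roughly like this: build the code as a string (so the app can display it with the student's choices highlighted), then evaluate that same string. A minimal sketch, with made-up input values, not the actual app code.

```r
# Values that would come from Shiny inputs in the real app:
chosen_var <- "Age"
null_mu    <- 30

# Build the code as text, so it can be shown to the student
# with their choices color-coded:
code_string <- paste0("t.test(titanic_df$", chosen_var,
                      ", mu = ", null_mu, ")")
code_string
#> "t.test(titanic_df$Age, mu = 30)"

# ...and then run that same text -- the eval(parse()) crime:
result <- eval(parse(text = code_string))
```

The usual objection is that string evaluation is fragile and unsafe; tidy-evaluation tools or building calls with `bquote()`/`do.call()` are the more defensible routes, but for displaying and running the identical line of code, the string approach is the path of least resistance.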
And so apart from that, and this is kind of why I'm sharing it with you. One reason is in the hopes that there's people out there who might want to help me make this more of a usable, shareable, manageable tool.
On the idea of thinking through code: students tended to type their variable names for the new data into the exercise, which, of course, wouldn't run. But then they would copy-paste, which isn't quite what I was hoping for, but maybe it's okay. And on motivating the students: they really were disturbed by the fact that they were spending class time on something that wasn't on the exam.
So my advice for you, the growing pains are worth it. There is a headache, but it's worth it. I think if you have an idea for a teaching tool, build it. It's not perfect. Mine is ugly, but it's cool and it helped. And that these kids can handle R. They really can. They grew up with computers. They can do more than they think they can. They can think about data. They can do it. And so can the people teaching them. So I hope that more classes will really engage with coding instead of just applets in the future. That's my goal.
All right. So the GitHub repository there has all the material for these exercises if you want to use or, better yet, edit them. Those three demos, the links are here. And then contact info for me or Hunter. And that's all I got.
Q&A
Hi. This is a great talk. Thank you very much. So a lot of times when I hear these discussions about sort of computing and statistics pedagogy, they're like kind of along two veins. One is I think kind of what you're doing here, which is using R and Shiny and these kinds of tools to improve the teaching of classical topics. And the other one is to kind of replace like the sort of teaching of classical topics with more computationally based stuff like permutation testing, bootstraps, stuff like that, right up front instead of normal derivations of t-tests, things like that. I was wondering if that sort of like second component of it is something like you and your colleagues have like discussed as like you're incorporating more of this stuff into these earlier classes.
Yeah, totally. So I think the question is about using computing just to reinforce things like t-tests, versus to introduce things like permutation tests that get the spirit of hypothesis testing without the math necessarily. Yeah, that's a big discussion. I think Beth Chance teaches a class fully on simulation and permutation tests and none of the other stuff. For me, I kind of believe in teaching the classical concepts, because they are going to see that. Even if they never do stats, they're going to read papers in their fields, and those will probably have a t-score. They should know what that is. And so mostly I've been using it to reinforce those concepts. But I also feel really strongly about not doing something in a class that is never, ever going to happen in the real world. So as those things go away, it would make more sense to emphasize the other stuff.
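For readers unfamiliar with the simulation-based alternative the questioner raises, here is a minimal two-sample permutation test in base R, reusing the wine example; the data frame and column names are assumptions, and this is an illustration of the general technique, not code from the course.

```r
# Permutation test for a difference in mean pH between good and bad wines.
# (wine$quality is assumed to be a two-level good/bad factor)
set.seed(1)

observed <- diff(tapply(wine$pH, wine$quality, mean))

# Re-shuffle the quality labels many times; each shuffle breaks any real
# link between label and pH, giving one draw from the null distribution:
perm_diffs <- replicate(5000, {
  shuffled <- sample(wine$quality)
  diff(tapply(wine$pH, shuffled, mean))
})

# Two-sided p-value: how often does shuffling produce a gap this extreme?
mean(abs(perm_diffs) >= abs(observed))
```

The appeal for intro courses is that the p-value's meaning is visible in the code itself: it is literally the fraction of shuffles at least as extreme as what was observed.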
