Resources

Amelia McNamara | Implications of R Syntax in Intro Stats | Posit (2022)

This talk reports on a head-to-head comparison of the formula and tidyverse syntaxes in a full semester introductory statistics course, providing data to help guide other instructors in their pedagogical decision-making. The formula version of the class used the mosaic package for summary statistics, ggformula for graphics, and base functions such as t.test for inference. The tidyverse section used base functions inside summarize() calls for summary statistics, ggplot2 for graphics, and functions such as infer::t_test for inference. Analysis of materials allows us to determine the number of functions students were exposed to in each section, which functions they actually used, and how much time they spent on their assignments in each class.

Session: Lightning Talks

Oct 24, 2022
4 min

Transcript

This transcript was generated automatically and may contain errors.

I'm Amelia McNamara. I teach at the University of St. Thomas in Minnesota, and I tweet at @AmeliaMN. I just tweeted a link to these slides in case you want to follow along.

So I'm really interested in R syntax and also teaching statistics. If you are someone who teaches statistics, you know that we are always trying to have students working with multivariate data. So in this example, we're looking at the relationship between flipper length and bill length of penguins between different species to see if that relationship is the same.

And I chose this sort of contrived example to show you why educators believe that we should not be teaching base R syntax. And that is because it can get really complicated, really verbose. It's easy to make mistakes. And if you look at those graphics, they don't even have the same consistent axes. So most educators agree we should either use the formula mosaic package or the tidyverse syntax. And both of those are going to create beautiful ggplot2 graphics with those consistent axes. But we have lots of opinions about which of these we should use and not a lot of data.

The head-to-head comparison

So I did a semester-long head-to-head comparison of the formula and tidyverse syntaxes. Why did I do this to myself? I basically doubled my work, right? I had to create the materials in both syntaxes. I tried to make them as consistent as possible in R Markdown documents that had the same text. The only thing that was different was the R syntax. And the reason for this is that I wanted to get us some data, and I also really believe that constraints breed creativity. So if you put those constraints on yourself, it's going to make you more creative.

Results

So here are my results. Some things are really easy regardless of syntax. R Markdown or Quarto works so well. Inference works well in intro stats, whether you're using the mosaic package for inference or the infer package in the tidyverse.

But different things can be hard. So in the formula syntax, dealing with and explaining how we deal with missing data, that's pretty challenging. There's options, there's na.rm = TRUE, there's use = "complete.obs". That was tough. In the tidyverse, dealing with and explaining the relationship between two categorical variables was a challenge. So we're used to seeing two categorical variables explained in a two-way table. That's not a tidy data structure. And it's harder for me and for students to conceptualize the relationship when it's in that tidy format. And that can have implications when you do things like chi-square tests and proportion tests.
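As a rough illustration of the two pain points described above, here is a minimal base R sketch. The toy vectors and the small species/island data frame are made up for this example and are not from the course materials. The call to as.data.frame() on a table is used only as a base R stand-in for a tidy-style count (one row per combination), not as the approach used in class.

```r
# Toy measurements with missing values (invented for illustration).
flipper <- c(181, 186, NA, 193, 190)
bill    <- c(39.1, 39.5, 40.3, NA, 36.7)

# Different functions spell "ignore missing values" differently.
mean_default <- mean(flipper)                  # NA, because one value is missing
mean_flipper <- mean(flipper, na.rm = TRUE)    # drops the NA before averaging
cor_complete <- cor(flipper, bill, use = "complete.obs")  # keeps complete pairs only

# Two categorical variables (invented for illustration).
survey <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Chinstrap"),
  island  = c("Dream", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream")
)

# Familiar two-way table: species in rows, islands in columns.
two_way <- table(species = survey$species, island = survey$island)

# The same counts in a long layout: one row per species/island pair.
long_counts <- as.data.frame(two_way, stringsAsFactors = FALSE)

print(two_way)
print(long_counts)
```

The two-way table is the layout that functions such as chisq.test() accept directly, which is one reason the long layout can feel less natural for that kind of analysis.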

How many functions do students really need?

The other thing that I was really interested in is how much R code you really needed in an intro stat course. And it turns out not that much. So in the formula section, I exposed students to 37 functions. In the tidyverse section, they saw 50 functions. Those are a little bit different in terms of numbers. But both of them over the course of 15 weeks, totally reasonable. A big criticism of the tidyverse in teaching is, oh, there's so many tidyverse functions. How would students ever learn them all? Well, they're not going to learn them. You're just going to show them the 50 that they need.

And if you look at the top five in each of these sections, I think you'll kind of see some of the differences here as well. So in the tidyverse section, I used the summarize function a lot. Every time we did summary statistics. I used the ggplot function a lot. Every time I built a graphic. Versus in the formula section, in the top five, I have gf_histogram. And I have the mean function. So a specific summary statistic. And you'd see some of those things kind of coming up in the tidyverse section if I went further down the list.
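To make that contrast concrete, here is a hedged sketch of the same summary statistic and histogram written in each syntax. It assumes the mosaic, ggformula, dplyr, and ggplot2 packages are installed, and it uses the built-in mtcars data rather than anything from the course. The variable and bin choices are illustrative only.

```r
# Assumes these packages are installed; startup messages are suppressed.
suppressPackageStartupMessages({
  library(mosaic)     # formula-aware versions of mean(), sd(), and so on
  library(ggformula)  # formula interface to ggplot2 graphics
  library(dplyr)
  library(ggplot2)
})

# Formula syntax: a specific function per task, with the variable given as a formula.
formula_mean <- mean(~ mpg, data = mtcars)
formula_plot <- gf_histogram(~ mpg, data = mtcars, bins = 10)

# Tidyverse syntax: summarize() wraps the statistic, ggplot() starts every graphic.
tidy_summary <- mtcars %>% summarize(mean_mpg = mean(mpg))
tidy_plot    <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)

print(formula_mean)
print(tidy_summary)
```

In the formula version the function name changes with the task (mean, gf_histogram), while in the tidyverse version the same general-purpose functions recur, which is consistent with summarize and ggplot topping that section's list.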

The other thing I found is there was a difference in the amount of time students were spending computing on RStudio Cloud. Students in the tidyverse section, it looks like they spent more time computing. Why is this? I have no idea. I probably should have collected some data, done some interviews, asked them, is it because you're on there? You're like, oh, it's so fun. I want to keep looking. I want to play around. Or is it because it's really tough and you're banging your head against the wall and you're frustrated? I don't know. And that's a great avenue for future investigations.

Tools and resources

If you want to try something like this yourself, if you want to know how many R functions do I show my students, use the R function getParseData. And if you want to know how much time each individual student is spending on RStudio Cloud, you can dig into your browser's developer tools to kind of sneak that data out. If you want to know more about how I did this and what I found out, I've got a preprint on arXiv. And if you want my materials, my R Markdown documents, you want to see how I got that data for all my functions and how I analyzed the data for my paper, if you want to read the paper, all of that is available. I'm happy to take questions on Discord. And thanks so much.
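For anyone curious what counting functions with getParseData() might look like, here is a minimal sketch using only base R. The snippet of code being parsed is invented for illustration, and this is not necessarily how the analysis in the paper was carried out.

```r
# A small, made-up chunk of student-facing code, stored as text.
# It is only parsed, never run, so the packages it mentions are not needed.
code <- '
library(dplyr)
penguins %>%
  group_by(species) %>%
  summarize(avg_bill = mean(bill_length_mm, na.rm = TRUE))
summarize(penguins, n = n())
'

# keep.source = TRUE is required so that parse data is recorded.
parsed <- parse(text = code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# Function calls are tagged with the token type SYMBOL_FUNCTION_CALL.
calls <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]

# Tally how often each function appears, most frequent first.
function_counts <- sort(table(calls), decreasing = TRUE)
print(function_counts)

# Number of distinct functions in the snippet.
n_distinct_functions <- length(function_counts)
```

For an R Markdown file, one option would be to extract the code chunks to a script first, for example with knitr::purl(), and then parse that script the same way.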