
Exploratory Data Analysis with R in Positron
Learn exploratory data analysis (EDA) in R with this tutorial by Mine Çetinkaya-Rundel. Using Positron, Mine guides you through a real-world project, 'exploring deadlines,' to analyze the impact of homework deadlines on student performance and stress levels. Discover how to effectively clean, filter, and visualize data using ggplot2 for insightful comparisons. This tutorial emphasizes best practices for data organization and clear data presentation while highlighting Positron's features that streamline your data analysis workflow. Perfect for anyone looking to master data visualization in R and enhance their data science skills in this new IDE.
0:00 Introduction
0:25 Opening a new Positron project
1:48 Loading and exploring data
3:44 Creating a new R file
4:05 Running exploratory data analysis
16:37 Formatting code with Air
19:22 Copying a plot
Positron documentation: https://positron.posit.co/
Download Positron: https://positron.posit.co/download.html
Read the blog post: https://posit.co/blog/eda-in-positron
Air documentation: https://posit-dev.github.io/air/
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone. Welcome to this quick tutorial on doing exploratory data analysis with R in Positron, the next-generation data science IDE built by Posit. In this video I'll walk you through setting up a project, exploring data using the handy Data Explorer, writing an R script for loading data from a CSV file, then wrangling it and visualizing it, all seamlessly within Positron.
Setting up the project
So here I am in a fresh Positron window, and the first thing I'm going to do is create a new folder for my analysis. I want to keep everything associated with this analysis in that folder, including my data, my scripts, and so forth. So let's go ahead and click on New Folder, and since I'm going to be working with R, I'm going to select R Project. Let's click on Next, and I am going to name this exploring deadlines. It'll be clear in a second why we're naming it this. I could choose a different location to save this, but I'm just going to leave it in my user directory, and let's click on Next.
And here you can see that I can choose among the different versions of R that I might have installed on my system. I'm going to leave this at the release version of R and click Create. Positron asks me whether I would like to open this in a new Positron session or the current window. Since there's nothing else going on in my current window, let's go ahead and select the current window here.
And here we are. I have a Positron folder. I can see the name of it here in my folder selector, and if I go to my primary sidebar I can see I have no files in here right now. What Positron did for me with these actions, as I can see if I switch over to Finder, is create a new folder in my user directory called exploring deadlines.
The dataset
Let's go ahead now and figure out what data set we are going to use. We're going to use a data set from this paper on the impact of homework deadline times on college student performance and stress. It discusses an experiment in a business statistics course where different sections were randomly assigned a 4 p.m. or an 11:59 p.m. homework deadline, and students were then asked about their stress level associated with deadlines. The researchers also compared learning outcomes of the two groups. And when they published the paper, they published their data too.
So let's click on the supplemental material, which allows me to download the data set. Then I can go ahead and locate this data set in Finder. Let's go to my downloads, and I will copy this CSV file and go back over to Positron. To keep things organized I am going to create a new folder. I will call it data, and then I can simply paste my data set into the Positron window. It asks, are you sure you want to paste? Yes, let's go ahead and say paste.
Exploring data with the Data Explorer
And I can see that it places the CSV file in my data folder and pops open the Data Explorer for me. On the left is my summary panel, which gives me a high-level overview of all the variables in my data set, and if I click on them I can get some numerical summaries as well. On the right is my data grid, where I can see my data in a spreadsheet format, with variables in columns and observations in rows. I can also see the variable types in this view.
I can collapse the summary panel if I want more real estate for my data grid, and I can also open this file as plain text if I want to inspect it as a CSV. But instead of doing that, we are going to load this data into R. So I'm going to go back to the root of my exploring deadlines folder. I want to create a new file, and one way to do that is with the new file icon here; another is to use the command palette, which I can open with Command+Shift+P, and then open a new R file. I'm going to save this file as code.R, as I'm not feeling very inspired to come up with a different file name.
Loading and cleaning the data
Let's go ahead and load our packages. I will use the tidyverse package for this analysis, and whenever I'm working with a data set with non-snake-case variable names, I like using the janitor package as well for the clean_names() function. Let's load our data. I will call this homework_raw to begin with, and we will read these data from our data folder. I am tabbing to find the subfolders in my current folder. Let's run the code to load our packages first and then our data. I can see that my data is now in the session tab under data, and I can double-click on it to pop it back open in my data viewer.
Let's go ahead and collapse the summary panel and take a look at our data one more time. I can see the number of rows and columns reflected here, so if I had read the paper I would be able to check whether these numbers actually sound right. And before I even get to writing some code to explore the data, I can start looking at it using my data viewer as well.
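The loading step can be sketched roughly as follows. The transcript does not show the actual file name or the original column headers, so the file `homework.csv` and the three columns below are hypothetical stand-ins, written to a temporary folder so the example runs on its own.

```r
library(tidyverse)
library(janitor) # loaded here because clean_names() is used later on

# Hypothetical stand-in for the downloaded file: the real file name and
# original column names are not shown in the transcript.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "homework.csv")

write_csv(
  tibble(
    ID = 1:4,
    `Midnight Deadline` = c(0, 1, 1, 0),
    `Q3 Deadline Stress` = c(2, 4, NA, 3)
  ),
  csv_path
)

# In the project itself this would be a relative path inside the data folder.
homework_raw <- read_csv(csv_path, show_col_types = FALSE)

# Row and column counts, the same numbers the Data Explorer reports.
dim(homework_raw)
```

Keeping the path relative to the project folder is what makes the script portable when the whole folder is shared.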
For example, if I want to see the data organized based on whether the deadline was at midnight or not, with 0 indicating a 4 p.m. deadline and 1 indicating a midnight, or 11:59 p.m., deadline, I can click on the three dots here and say I want to sort these data in ascending or descending order. If I am done viewing the data in this manner, I can go ahead and say clear sorting. I can also sort based on multiple variables, so let's first sort ascending based on midnight deadline and then sort descending based on fall semester, for example. I'm going to clear these one more time.
Additionally, we can use the data viewer to filter the data. Let's go ahead and add a filter for midnight deadline, and I am going to say I'm looking for midnight deadline is equal to 1 and apply the filter. This shows me that there are 42 observations where the midnight deadline was 1, so these must be the 42 students who were randomly assigned to the section that had a midnight deadline. Let's go ahead and add another filter. Let's say I want to look at year in school, and this time let's use a different condition, is greater than or equal to 3, so we're looking at juniors or seniors here. Once again, at the bottom I can see that we're now down to 13 rows, or 13 students, who meet these two criteria.
Now, another cool thing I can do, now that I have these 13 students selected: say I want to share these data quickly with a collaborator. I could go ahead and copy their data. I'll copy, go to something like Excel, and paste just these 13 rows, so I am able to copy from my data viewer into an Excel file. Obviously this is the kind of thing I would prefer to write code for in a reproducible data analysis, but it's pretty neat that Positron allows me to do these quick things without having to actually write some code to export the data.
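For comparison, the same two-level sort can be written in code. This is a minimal sketch on a small made-up data frame; the snake-case column names `midnight_deadline` and `fall_semester` are assumed from the narration rather than taken from the real file.

```r
library(tidyverse)

# Toy data, for illustration only.
homework <- tibble(
  id = 1:6,
  midnight_deadline = c(1, 0, 1, 0, 1, 0),
  fall_semester = c(0, 1, 1, 0, 0, 1)
)

# Ascending by deadline group, then descending by semester within each group.
sorted <- homework |>
  arrange(midnight_deadline, desc(fall_semester))

sorted
```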
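A reproducible version of the same filter and export might look like the sketch below. The data frame is made up, the column names `midnight_deadline` and `year_in_school` are assumed from the narration, and the output file name is hypothetical.

```r
library(tidyverse)

# Toy data, for illustration only.
homework <- tibble(
  id = 1:6,
  midnight_deadline = c(1, 0, 1, 1, 0, 1),
  year_in_school = c(3, 4, 2, 4, 1, 3)
)

# Midnight-deadline students who are juniors or seniors.
upper_midnight <- homework |>
  filter(midnight_deadline == 1, year_in_school >= 3)

# Write the subset to a CSV that can be shared with a collaborator.
out_path <- file.path(tempdir(), "midnight-juniors-seniors.csv")
write_csv(upper_midnight, out_path)

nrow(upper_midnight)
```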
Now that I'm done looking at these 13 cases, I can clear the filters either one at a time, by X-ing out of each of the conditions, or I can go ahead and clear all filters at once.
Data wrangling in R
Now let's go ahead and write some code. I want to do a little bit of data prep, so I will start with my homework_raw object, and the first thing I'm going to do is clean the names so that each of the variables is now in snake case. I can take a look at the result in my console quickly. There are a couple of variables that I'm going to work with, one of which is the midnight_deadline variable. We can see that it's coded as 0 or 1, but I'd like to use informative labels for its levels, so let's mutate and create a new variable called deadline with an if-else condition: if midnight_deadline is equal to 1 we're going to call it midnight, otherwise we're going to call it 4 p.m. I'll place this variable after the id variable, since it's one that I want available to me every time I'm looking at the data. And here we go.
The other variable that I'll work closely with is the stress variable, q3_deadline_stress, so let's relocate that to be after my deadline variable as well, and let's save this cleaned-up data as homework, which I can use for the remainder of my analysis. I have the same number of rows in my raw data and my cleaned-up data, plus one additional column, deadline, which has the levels 4 p.m. and midnight.
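The data prep described above could be sketched like this. The raw column headers and the exact label strings ("Midnight", "4 PM") are assumptions, since the transcript only gives the cleaned names and the spoken labels.

```r
library(tidyverse)
library(janitor)

# Toy stand-in for the raw data; the original headers are hypothetical.
homework_raw <- tibble(
  ID = 1:4,
  `Midnight Deadline` = c(0, 1, 1, 0),
  `Fall Semester` = c(1, 1, 0, 0),
  `Q3 Deadline Stress` = c(2, 4, NA, 3)
)

homework <- homework_raw |>
  # Convert all column names to snake case.
  clean_names() |>
  # Replace the 0/1 coding with a readable label, placed right after id.
  mutate(
    deadline = if_else(midnight_deadline == 1, "Midnight", "4 PM"),
    .after = id
  ) |>
  # Move the stress variable next to the new deadline variable.
  relocate(q3_deadline_stress, .after = deadline)

names(homework)
```

Keeping `homework_raw` untouched and saving the result under a new name means the cleaning steps can always be rerun from the original data.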
Comparing deadline groups
The next thing I might want to do is compare these two groups, so let's compare the deadline groups' stress levels. I'm going to do a numerical summary first, so let's say I want to group by the new deadline variable and summarize the mean stress level. I can tab to find the variables that start with a particular string, so this is q3_deadline_stress. If I run this code, I'm getting an NA for my midnight group. I bet this is because there is an observation in the midnight group that didn't respond to the deadline stress question on the survey. Since this is the only data that I'm going to be working with for this quick tour, I'm going to redo my homework data frame to filter for the rows where the q3_deadline_stress variable is not NA. It seems like there was only one observation where we had an NA, so let's go ahead and do this.
My quick analysis suggests results that match up with the authors' conclusion that students in the midnight group are less stressed on average. At first glance that's not what these numbers say, 2.98 versus 2.65, but on this scale 1 indicates that the deadline made it much more stressful and 5 indicates that the deadline made it much better.
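The NA behavior described here is easy to reproduce on a small example. The values below are made up, and `mean_stress` is just an illustrative name for the summary column.

```r
library(tidyverse)

# Toy data with one missing response in the midnight group.
homework <- tibble(
  deadline = c("4 PM", "4 PM", "Midnight", "Midnight", "Midnight"),
  q3_deadline_stress = c(2, 3, 4, NA, 3)
)

# mean() returns NA for any group that contains a missing value.
with_na <- homework |>
  group_by(deadline) |>
  summarize(mean_stress = mean(q3_deadline_stress))

# Drop rows with a missing stress response, then summarize again.
homework <- homework |>
  filter(!is.na(q3_deadline_stress))

without_na <- homework |>
  group_by(deadline) |>
  summarize(mean_stress = mean(q3_deadline_stress))

without_na
```

An alternative would be `mean(q3_deadline_stress, na.rm = TRUE)`, which keeps the row in the data; filtering, as in the walkthrough, removes it from every later step, including the plot.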
Visualizing with ggplot2
Now, a better way of looking at this could be a visualization, so let's go ahead and do that as well. I'll add a plot. I'm going to use ggplot2 to make this, so the data frame is homework, and I would like to create a ggplot where on the x-axis I have my deadline variable, and then I want to fill by the categories of the deadline stress variable, and I would like to use geom_bar(). ggplot2 needs this variable to be a factor in order to be able to fill by it, so let's say as.factor() here. And since we've removed one observation from the data set, my two groups are not of equal size, so I'm going to say position = "fill" in order to make this a bar plot where the segments represent the proportions of students that fall into each fill category.
Let's try to make this plot a little easier to interpret. I'm going to add some labels: on the x-axis this is about the deadline, and on the y-axis we're actually looking at a proportion. My favorite theme in ggplot2 is the minimal theme, so I'm going to add that quickly as well. The next thing I want to do is add some informative labels to this legend, because with a funny variable name as my legend title, as well as a scale where it's not clear what 1 means versus what 5 means, the visualization is not that meaningful to me.
How am I going to do this? First let's change the color scheme from the default to a gradient color scheme, so I'm going to use scale_fill_viridis_d() for this, a discrete color scheme. Remember I said that 1 indicates that the deadline made it much worse and 5 indicates that it made it much better, so I'd like to reverse the scale as well, and then reverse the color scale as well. So far I have been able to rely on my ggplot2 knowledge to make these changes. However, I want to label the levels. I want to give them meaningful text so that anyone looking at this visualization can easily tell what each one of these levels means, and frankly I don't remember how to do that.
I can of course use the question mark command to go to my help, so I can do ?scale_fill_viridis_d, which will pop open the help view here. But let's think of another way that Positron can help us with this. Let's close the help for now and go to my command palette with Command+Shift+P. I will look at my help options, and I can see that there's an option called Show Help at Cursor, and I'll take note of the fact that I can also get to this with the F1 keyboard shortcut in the future. Let's go ahead and select this. What it does is, wherever my cursor is, if I press F1 it will show me the help for the function under my cursor. So let's open up the help for scale_fill_viridis_d() again and take a look at some of the options I might have. I think this dot-dot-dot argument might be helpful for me, so let's scroll down and see what other arguments we can pass on to a discrete scale. Some of the ones that are going to be useful for me are things like labels, for example.
I'm also not a huge fan of the default color scale for viridis, so I'm going to change my option to E, one that I like a little bit better. So 5, in yellow, indicates that the deadline made it much better, and 1, in dark blue, indicates that the deadline made it much worse in terms of stress. Let's go ahead and get those words on here. I will say that I want labels, and I want to create a vector, and in this vector let's copy from the data dictionary I had saved in another window here. Here's my data dictionary. I can use the multi-cursor with Option+Shift to make a multi-selection here, put some quotes around these, say these are equal to these explanations, add commas, and I'll just get rid of the last comma. Let's see if this actually does what I wanted it to do. And in fact my legend is looking a lot better. We're going to add a better title for this legend as well, so this is for the fill scale, and here is the survey question from the study. That's a really long text, so let's wrap this at 20 characters at a time, maybe, and things are starting to look a lot better. Now let's add a title for this plot as well, something like stress level versus deadline. And things are starting to look a little squished in our plot.
You can see that I can scroll back through my plotting history to see where things started to look not so right. I can also clear certain items from my plotting history, if I know that I'm not going to need them, by simply X-ing out of them. Let's run this code one more time. If things are starting to look squished, I can also open this plot in an editor tab, for example. It looks like if I were to size things correctly, that overlapping text is not going to happen. I can change how this plot is shown in the plot viewer, perhaps pick a different size for it or give it a custom size myself. But without even needing to worry about a custom size, I was able to find a size that doesn't overlap the text, so things are looking a lot better with my plot.
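The first version of the plot can be sketched as below, using made-up values in place of the real survey responses. The column names follow the cleaned data described earlier.

```r
library(tidyverse)

# Toy data with unequal group sizes, for illustration only.
homework <- tibble(
  deadline = rep(c("4 PM", "Midnight"), times = c(6, 5)),
  q3_deadline_stress = c(1, 2, 2, 3, 4, 5, 2, 3, 4, 4, 5)
)

# Stacked bars rescaled to proportions, so groups of different size compare fairly.
p <- ggplot(
  homework,
  aes(x = deadline, fill = as.factor(q3_deadline_stress))
) +
  geom_bar(position = "fill")

p
```

With `position = "fill"` each bar has total height 1, so the segments are read as shares of each deadline group rather than counts.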
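A rough sketch of the labeled version follows. Only the meanings of levels 1 and 5 come from the narration; the wording of the three middle labels and the legend title text are placeholders, since the data dictionary and the survey question are not quoted in the transcript. The sketch also leaves the level order as-is rather than reproducing the reversal step, whose exact code isn't shown.

```r
library(tidyverse)

# Toy data, for illustration only.
homework <- tibble(
  deadline = rep(c("4 PM", "Midnight"), times = c(6, 5)),
  q3_deadline_stress = c(1, 2, 2, 3, 4, 5, 2, 3, 4, 4, 5)
)

p <- ggplot(
  homework,
  aes(x = deadline, fill = as.factor(q3_deadline_stress))
) +
  geom_bar(position = "fill") +
  scale_fill_viridis_d(
    option = "E",
    # One label per factor level, in level order (middle labels are placeholders).
    labels = c(
      "1 - Much worse",
      "2 - Somewhat worse",
      "3 - No difference",
      "4 - Somewhat better",
      "5 - Much better"
    )
  ) +
  labs(
    x = "Deadline",
    y = "Proportion",
    # Placeholder for the survey question, wrapped at 20 characters per line.
    fill = str_wrap("Placeholder for the survey question about deadline stress", width = 20),
    title = "Stress level vs. deadline"
  ) +
  theme_minimal()

p
```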
Formatting code with Air
But my code is a little bit all over the place in terms of formatting. I could manually add line breaks and tabs and whatnot, but instead I can use Air to format my code. So let's go to our command palette with Command+Shift+P, type Air, and say we want to format the workspace folder. It says we need to save our code before we can do that. I'm going to say yes to that, and you can see that it has made a bunch of changes to my code, not to the text of my code but to its formatting.
I had to do this manually here, but going forward I can actually ask Positron to do this for me every time I save my code. So let's go back to our command palette one more time, and this time I want to set some user settings in the user settings JSON file, which is how Positron saves your user settings. You can see that I already have a few settings for myself here, and I'm going to add a new one for R, which will say that we want to use Air as our default formatter and that we want to format on save each time. Let's go ahead and save this. I don't need to restart Positron or anything, and let's go back to our code.
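For reference, the R-specific entry described here would look roughly like the fragment below. The transcript does not show the exact text, so treat this as an assumption based on the Air documentation linked above; in particular, the formatter identifier should be checked against the installed extension.

```json
{
    "[r]": {
        "editor.formatOnSave": true,
        "editor.defaultFormatter": "Posit.air-vscode"
    }
}
```

Scoping the settings under "[r]" limits format-on-save to R files and leaves other languages unaffected.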
Adding a caption and finishing the plot
And let's finish things up by adding a caption. For this caption I am going to go back to my paper and say that I want to cite this article. Let's copy this citation to the clipboard and give this a try. It is there, but it's all squished. We're probably going to need to do the str_wrap() trick one more time to wrap our text. I will try something like 100 characters for this one, and that's starting to look good, but I'm going to need to move it so that it's aligned properly at the bottom. In order to do that I need to add some theme settings. I personally never remember all of the theme settings available in ggplot2, so I will press F1 to open up the help for theme(). Within theme() I can search for anything related to caption, so I can see plot.caption and plot.caption.position. The plot.caption.position is panel by default, but I could try plot as well. Let's see if that helps me. That does seem to help a little bit, but things are again looking quite squished, so let's go back to our help. It looks like for plot.caption I can give it an element_text(), so let's try that too, and let's give it a horizontal justification of zero. That is looking a lot better. Let's give this plot a little bit more room to breathe so that we can actually see all of its contents, click on the copy plot to clipboard button here, go to my email, and go ahead and plop it in there.
And that's a quick intro to Positron for exploratory data analysis with R. From project setup and data exploration to writing scripts and making plots, Positron is packed with thoughtful features to support your data exploration workflows.
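The caption steps can be sketched as follows. The citation string is a placeholder, because the transcript does not include the actual reference, and the data are made up.

```r
library(tidyverse)

# Toy data, for illustration only.
homework <- tibble(
  deadline = rep(c("4 PM", "Midnight"), times = c(6, 5)),
  q3_deadline_stress = c(1, 2, 2, 3, 4, 5, 2, 3, 4, 4, 5)
)

# Placeholder citation; the real one would be pasted from the article page.
citation <- paste(
  "Placeholder citation: Author(s) (Year). Title of the article on homework deadline times,",
  "student performance, and stress. Journal name, volume(issue), pages."
)
caption_text <- str_wrap(citation, width = 100)

p <- ggplot(
  homework,
  aes(x = deadline, fill = as.factor(q3_deadline_stress))
) +
  geom_bar(position = "fill") +
  labs(caption = caption_text) +
  theme_minimal() +
  theme(
    # Align the caption to the whole plot rather than the panel, flush left.
    plot.caption.position = "plot",
    plot.caption = element_text(hjust = 0)
  )

p
```

The custom `theme()` call comes after `theme_minimal()` because a complete theme added later would overwrite the caption settings.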



