
What's New With Readxl | RStudio Webinar - 2017
This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/webinars/ . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message
image: thumbnail.jpg
Transcript#
This transcript was generated automatically and may contain errors.
Thanks to all of you who've signed on for this, which is my first RStudio webinar. I'm going to talk about the package readxl, which I've recently done a maintenance release on, and I want to talk a little bit about our sticker for readxl.
RStudio and many other package-producing entities have started to create these hex stickers so that you can show your love for packages, and the one we made for readxl is supposed to capture that really special feeling you get when you find out that the data you need is actually in Excel. So hopefully if you want one of these, you can find one of us in person at some meeting and maybe we'll have them. You can put them on your laptop to show your suffering.
readxl and the Tidyverse
So readxl is a package that finds itself inside a larger group of packages, which we're calling the Tidyverse, and this I believe is still a subset of the Tidyverse, but the idea of the Tidyverse is to take a collection of packages and make sure that they operate with a common philosophy and a really similar interface so that it's easy to transfer expertise from one package to the other. It's easy to move objects from one package to another, and so readxl is the Tidyverse solution for getting data out of Excel spreadsheets and into R.
Another really important sort of documentation, I would say, for the Tidyverse is there's a recent book, although it's been online for quite a while, by Hadley Wickham and Garrett Grolamund, both of our studio, called R for Data Science, and in some sense it's a meta-manual for the Tidyverse, or you will certainly see the Tidyverse packages working in their natural environment and working in concert with each other throughout that book. So it's a really good resource, and you can buy a physical book, or if you go to this URL that I'm showing there, it has always been developed in the open, and you can always access the content online for free.
And below I have a graphic that we see in a lot of talks by Hadley and other people who develop or work in the Tidyverse that's meant to depict a typical data science and data analysis workflow. And Read Excel is one of the packages we have to deal with that very first step, which is importing data out of a spreadsheet and getting it into R in a favorable form to sort of fly through the rest of this cycle at least as easily as possible.
History of readxl
So it was first released on CRAN in April of 2015, and that release was written entirely by Hadley, and then about a year later it had a patch release, which I think was just triggered by the need to update for compatibility with the package, so that didn't have a lot of changes in it. And so then it got its first major maintenance release about a month ago, which is when I took over as the new maintainer and we moved the version up to one. I feel like it is pretty feature complete, and we squashed lots of bugs and added a few new features, which I will talk about today.
So one way to get Read Excel installed on your computer is to install the entire Tidyverse. So we now have a package called Tidyverse, which is kind of different from the concept of the Tidyverse. It's a meta package that will install the whole set of packages on your computer, or you could install Read Excel all by itself. Then either way, you must load Read Excel individually. You can actually call library Tidyverse, and it will load a subset of the Tidyverse packages that are kind of called the core. And Read Excel is not part of that core, because you wouldn't need it in almost every analysis. It's only when you have a spreadsheet. So you'll always have to load it explicitly on its own.
The landscape of Excel-reading packages
So I promised in my abstract to give a little bit of the landscape of our packages that Read Excel lives in. And the reason I wanted to do that is I figure anyone who's willing to spend an hour with us talking about Excel must have some slightly non-trivial need to read Excel files. And it's kind of interesting just to know where the packages fit in and why some packages are harder to use than others or what the different features are.
So there's a huge set of packages that read Excel files, because this has been an important area for a very long time, obviously. So Read Excel might actually be the most recent entry into this group. OpenXLSX is also fairly recent, and it uses a lot of the same stuff under the hood as Read Excel. Packages that have been around much longer are XLConnect, XLSX, and GData, RODBC.
So when you're thinking about where does Read Excel fit in and which of these packages are relevant to me, I think there's at least three things you need to consider. So one, there's two broad classes of Excel files. And unless someone's been evil, you'll be able to identify them from their file extension. So .xls are the older Excel files. In our documentation, we call them legacy files. And those are files from Excel up to about 2007. But there's plenty of systems today that are still producing .xls files, and in fact, Excel itself is still quite content to produce these files. And then the more modern file format is XLSX. And that's what you're going to get by default coming out of Excel now.
So the main question to ask is not about the files, but it's about the R package. And there's sort of two main categories of R packages. There are those that are self-contained versus the ones that have an external dependency. So neither of these file formats is something that you want to parse by hand, all by yourself from the ground up. So all of the packages that read Excel files are standing on the shoulders of others and have some sort of dependency where we're bringing in an existing library to parse the file.
And it turns out that the most comprehensive package for doing that is a Java system called the Apache POI. And so the packages that wrap Java have great functionality in the sense that you can almost say if Excel can read this file, this package can read it because of all the accumulated conventional wisdom in Apache POI. And so that's why when you're writing an Excel reading package for R, it's very tempting to use Java because you can take advantage of this pretty amazing resource. And likewise, there are some sort of Perl scripts that have been around for a long time that have a lot of smarts about reading Excel files. And so we've taken another approach in ReadExcel, which is to bake everything you need into the package.
And then the third huge component to think about is whether the package you're using is oriented towards bringing in a rectangular piece of data, so a data frame usually, or is it more focused on or capable of focusing on individual cells and formatting.
So here's where ReadExcel comes out on all of these points. It is able to read XLS and XLSX. It is self-contained. So once you get it installed, which should be relatively trivial, and definitely let us know if it is not, it will just work. You don't need to configure an external dependency in the right place for it to work. And it is explicitly focused on bringing in a rectangle of data. So it is not a package that's going to reveal formatting for you, whereas some of those others will.
Working with Excel data rectangles
I'm going to show you, this is a screenshot of one of the Excel workbooks that ships with the package, because I like to use it a lot in examples. And it's something I created recently, and I guess it's a bit morbid, actually. It's a spreadsheet. It felt, you know how we talked about a lot of, it feels like a lot of very famous beloved people died recently around Christmas? And so I actually made a spreadsheet of recent deaths. And it's also more to show how we import data with different column types and whatnot.
But the thing I want you to notice here is I tried to make it sort of a typical spreadsheet where people have written some stuff at the top and some stuff at the bottom. And so Read Excel is really focused on helping you find or identify this data rectangle and bringing just that data rectangle in. So that's indicated by this yellow box. So Read Excel is highly focused on a data rectangle as opposed to absorbing an entire spreadsheet in all of its glory.
The second point I want to make is that the IDE, the RStudio IDE, has a really nice little helper for getting Excel files loaded into R. So this is a screen capture of that. And we will show that to you live. And this is me importing that same spreadsheet. And you can see a lot of the same data there in the data preview. And below in the import options, we've exposed some of the most useful arguments for the main function of the package, readExcel. And you can see a preview of the code you would need to write to get that.
Here's a screenshot right after I've brought that data set in. And we have a data frame in R. It's a TIBL, which is a special flavor of data frame that is arguably the strongest theme that runs throughout the tidyverse. So it prints very nicely, which we're enjoying right here. So you see we've got a TIBL that we brought in from Excel. It has 10 rows. It has six variables. And the thing I really want you to focus on is notice the types of the variables. That's one of the nice features of TIBL printing that you get to see these types. So name and profession are character vectors. Age is double or numeric. Class kids is a logical vector, true or false. Date of birth is a POSIX CT date time. And the code right above there actually brought this data set in, and you can see that I didn't have to specify any of those column types. ReadExcel guessed them, and in this case, guessed them correctly.
So I would describe readExcel's column typing as pretty groovy when it works.
So I would describe readExcel's column typing as pretty groovy when it works. You can also take explicit control, which we'll show in the demo.
Iterative workflows with multiple sheets
So here I'm importing that same data set. It turns out it has two worksheets or two tabs, and they have the same structure. They all have name, age, has kids, all those same variables. And so one thing I want to point out before we're all done here is that readExcel can be used quite nicely in workflows. And so what I'm showing here is I'm reading that workbook, but I'm reading both sheets, and I'm gluing them together row-wise to get one big data frame that has the data from the first sheet and the second sheet and has a variable indicating which sheet it came from.
And this seems to me to be an under-known, underutilized feature. The mapDF function I've circled there is not from readExcel, it's from Per, another member of the Tidyverse. But so I do really want to make sure that people are aware of these iterative workflows so that you have solutions when you have multiple worksheets in a workbook or you have multiple workbooks that you need to import.
