Jenny Bryan | Object of type ‘closure’ is not subsettable | RStudio (2020)

Your first “object of type ‘closure’ is not subsettable” error message is a big milestone for an R user. Congratulations, if there was any lingering doubt, you now know that you are officially programming! Programming involves considerably more troubleshooting and debugging than many of us expected (or signed up for). The ability to solve your own problems is an incredibly powerful stealth skill that is worth cultivating with intention. This talk will help you nurture your inner problem solver, covering both general debugging methods and specific ways to implement them in the R ecosystem.

Transcript

This transcript was generated automatically and may contain errors.

Good morning, and welcome back to RStudioConf. It's really great to see you all here again, whether it's in person or on one of our livestreams. But now it's my very great honor to introduce our next keynote speaker, my colleague and friend Jenny Bryan.

Jenny's work has almost certainly touched you in some way, whether it's one of her books like Happy Git With R or What They Forgot To Teach You, or perhaps it's because you're afraid that she'll set your computer on fire because you use setwd. Or maybe you're using one of her packages like readxl or googlesheets to get data out of spreadsheets, whether it's Excel or Google, and into R.

But of all the packages, of all the work that Jenny has done, I think my favorite package is the reprex package. Not only because it's such a great tool to help you get help from other people, but I think it's one of the rare packages that has, like, no precedent in any other package or in any other programming language. It's something that's genuinely new. So without further ado, I'd like to welcome Jenny Bryan.

So this, I think, is R's most infamous error message: object of type 'closure' is not subsettable. It was my title as a joke for a while, and then people thought I should actually stick with it.

So I have 20 years of experience triggering this bug, which is why I can now do it in two lines of code. And this is also, I think, commonly how it actually happens, although it's usually never quite this clear.

But you create your main data object, you call it dat, then you promptly lose all memory of having done so, and you ask for the x column of df, which you haven't made, but df exists. It's a function that gives you the density of the F distribution. So what you've asked for makes no sense, and R tells you this in this very special way.
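The two lines described here can be reproduced in a fresh R session. This is a minimal sketch, not the slide itself: the column values are made up, and tryCatch is only there so the snippet runs to completion instead of stopping at the error.

```r
# Create the main data object under one name...
dat <- data.frame(x = 1:3)

# ...then ask for a column of df. Nothing called df was ever created, so R
# finds stats::df (the density of the F distribution) and tries to subset
# a function, which is the "closure" in the message.
msg <- tryCatch(df$x, error = function(e) conditionMessage(e))
print(msg)

# What was actually meant:
dat$x
```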

And my sort of fantasy message down there is maybe it would be able to somehow read my mind, which is obviously not going to happen. And so this sets the mood for the next hour, where I want to talk about general strategies for coping with confusing and frustrating situations.

So you went into data science, you were probably told that it's going to be glamour and fun like 24-7, and you make very creative concoctions that you present, and people love to consume it. But there's all this drudgery, as there is in any job, where we actually spend a much greater proportion of our time and our mental energy.

And so I've sort of made a habit of talking and teaching about those things so that you feel really cool and have fun, but you get your drudgery done as well.

So we're not using Slido for live questions, although you're welcome to ask them, because I'm going to blog about this talk later, but I am using Slido live for some polls. So if you're willing to get your laptop or your phone out, I'm curious what your current main debugging method is, and if you use multiple, as you probably do, you will have to pick a favorite.

And while I'm letting you take this poll, I'm going to say a few more words about why I think this is so important, the drudgery part. So if we don't give a name to these things and give them dignity, when you lose half your day to doing something like this, it's extremely demotivating, because you feel like you haven't actually gotten any real work done.

And the other risk, especially with debugging, is if you're only reactive and you're always dealing with today's bug, it means that you are constantly putting out fires and you don't probably have the time at that point to develop your debugging skills and be a little bit proactive about it. But you shouldn't be perpetually surprised that there's a new bug. Like really? Again? Today? This is going to happen every day, and it's actually worth giving some thought to how you want to do things.

Overview of the talk

So here's where we're going to go. There's four sections of this talk, and I hope that there's something for everyone here, depending on, like, your R experience can be quite little or quite a lot, and that there will be something that you find interesting, useful, or at least makes you feel very validated in what you do. And this is also basically approximately the order in which I do these things. And they do all come up all the time on most puzzling situations.

Resetting your R session

So the first thing I want to talk about is the beauty of resetting things. Earlier this week we ran a ton of workshops. I helped out in one. I was not in charge because I was doing this, and this reaffirmed my commitment to how important the idea of resets is and why it really should be your first strategy.

So as soon as you get some sort of error, I don't know about you, but I immediately I just send that same command again. Because maybe it's going to work this time. And it does not ever, ever work.

But there is a small variation on this that is an extremely productive implementation, and it's the world's most famous troubleshooting advice for anything, but especially tech, because it's so hard to get this right, that you should try turning it off and turning it back on again. And why is that?

So this is a super corny phrase that we have all heard before, that if you love something, you need to set it free, and if it's really yours, it will come back. And I think this applies to unloved things as well. So if you have a strange problem, and it just actually doesn't make much sense, consider setting it free, restarting R, and see if it comes back.

So I want you to restart R often, and especially when things get weird. And you might think I'm being glib or I'm just trying to sweep some problem under the rug, but that is not actually true. One of the things that's pretty unusual about R is we install and update packages from R, and this is a little bit like working on your airplane engine while you're flying. And I think people updating and installing packages while they're doing work in R, and especially if they have multiple R sessions open, is a common reason why things get funky in a way that's quite difficult to debug and understand, and the good news is, if you just quit and restart, you're guaranteed that the package version that's loaded into memory is the one that was installed on disk. So that's an example of why this is a legitimate way to reset things.

So how do we actually do this in R? So this is the RStudio version. There is a menu entry where you can restart R. I have this keyboard shortcut emblazoned into my brain. It's something I do many times per day.

And the second thing I recommend that you consider, this is kind of a big lifestyle change, so don't do it right now, is to consider not reloading your workspace at startup and not saving your workspace when you quit R. It's a pretty radical lifestyle change.

And I get a lot of pushback on this, so I do appreciate the clapping. I'm going to remember that when I get this week's pushback. This is the way to do this if you were just starting R in a terminal, and I also just want to use this as a proxy for that figure I just showed you. So you can start R with command line flags, including --no-save and --no-restore-data.

And I want to argue this is vastly superior to another thing that a lot of us do, and I have lots of this on my computer from previous years, where people use rm(list = ls()). So that lists all the objects in the global workspace, and it deletes them. And so this is a really common command to see at the top of R scripts, and believe me, this was my 100% practice all the time. But the problem is it doesn't really go far enough.

So this brings me to my next poll, and I'm going to continue to believe that you're all filling out the poll, and I want you to think about these six R commands, and they all have some sort of effect. And then let's say you execute this command, rm(list = ls()). Which of these effects will persist in the session after that?

So I'm going to do the big reveal. So library(dplyr) leaves dplyr attached. So that persists after rm(list = ls()). Redefining the summary function, that's been cleared. So you have reset summary to its normal definition. If you've changed an option, like stringsAsFactors from TRUE to FALSE, that persists in the session. If you've changed the language of the session, which is going to affect what error messages look like, for example, that persists. Binding 1, 2, 3, 4, 5 to the name x, that's gone. x is gone. But if you've attached an environment or a data frame to the search path, that's still there as well.
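The poll can be replayed as a script. The sketch below uses base-R stand-ins so that it needs no add-on packages: tools instead of dplyr, a made-up option demo.flag instead of stringsAsFactors, a made-up environment variable DEMO_LANGUAGE instead of the session language, and an attached list instead of a data frame. Those substitutions are assumptions for illustration. Run it in a fresh, throwaway session, because it really does clear the global workspace.

```r
# Six kinds of session state, mirroring the six commands in the poll.
library(tools)                                   # attach a package
summary <- function(...) "not the real summary"  # mask a function
options(demo.flag = TRUE)                        # change an option
Sys.setenv(DEMO_LANGUAGE = "fr")                 # set an environment variable
x <- 1:5                                         # bind an ordinary object
attach(list(secret = 42), name = "demo_data")    # add data to the search path

rm(list = ls())  # the "reset" under discussion

# TRUE means the effect survived rm(list = ls()).
persisted <- c(
  package_attached = "package:tools" %in% search(),
  summary_masked   = !identical(summary, base::summary),
  option_changed   = isTRUE(getOption("demo.flag")),
  env_var_set      = identical(Sys.getenv("DEMO_LANGUAGE"), "fr"),
  x_exists         = exists("x", inherits = FALSE),
  data_attached    = "demo_data" %in% search()
)
print(persisted)

# Undo the leftovers by hand; restarting R would do all of this at once.
detach("demo_data")
options(demo.flag = NULL)
Sys.unsetenv("DEMO_LANGUAGE")
```

Only the two effects that live as objects in the global workspace (the masked summary and x) are cleared; the other four need the manual cleanup at the end, or a restart.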

And so all of those four things that persist here aren't top of mind. You're really thinking about those objects. But they all have an effect on how your subsequent code runs. And so this makes it very easy to develop code under a set of expectations that will not hold when someone else runs that code or when you are running it in a fresh R session.

So it's for that reason that I think starting R in a way where you don't reload the workspace and you don't save it is vastly superior to this practice. Because it's really, like, if you care enough to kill your workspace, you care enough to restart R. You should go that far.

So fresh starts, clean the workspace, reset options and environment variables and clear the search path. So I want us to think of this as your R sessions are like crops. You grow them. You harvest them without any fear. Not a house plant.

And this practice of kind of having no memory in some sense of not loading a workspace and not saving it is pretty difficult to implement by itself. It really works best in synergy with some other habits. In particular, saving your source is obviously very important. So source is real and there's some other habits that are quite important.

Making a reprex

So let's talk about reprex. I'm very excited that Hadley likes it. And I'm kind of talking about the package, but I really want to talk about the reprex mindset more than anything.

So you know that if I get a mistake, the first thing I do is I submit the same command again. The next thing you might do is sort of brood, dither, and fret about what you've just seen. And a lot of people just immediately go into speculating, usually about worst case scenarios, about what could possibly be wrong. And this is just as effective as submitting the same command again, which is to say that it is not effective in the least.

And so one good way to knock yourself out of this paralysis is to work a small example. And for years, I was a statistics professor who cared about R, but I was by no means a professional R programmer. And then over various years, I started hanging out more and more with the experts, and I think I've probably crossed the line into being one of those experts now. And here's one of the things I learned. I used to think that the experts just knew everything all the time. And that's not true. They know some things for sure. But a bigger distinction is they have this habit of working an example.

So if there's a really weird situation, they test a theory or they gather some data. And this is much more approachable as a strategy than trying to solve all your problems. All you have to do is work one small example that sheds a new light on the problem, confirms so-and-so's theory or rules so-and-so's theory out.

So the term minimum reproducible example is preexisting. It's important across all programming languages, and a minimum reproducible example is much beloved in places like Stack Overflow and GitHub, and my colleague decided to coin the term reprex by mushing those two words together. And at that same time, I was creating this package mostly out of wild frustration with code conversations with my students, so I used the name.

And making a reprex is both a science and an art. So the reproducible part is the science part, and that means you've provided code that someone else could actually run. And that's what the reprex package can help with. It can only help with sort of mechanical, robotic things. But then there's this whole aspect about the art of making a reprex, and that's making it minimal. And only humans, really, can do that. And so that comes with having more and more experience, and if you can't instantly give yourself more experience, you can hang out in places where you're exposed to good reprexes all the time, and you'll start to absorb what the principles of that are.

So template is a string. It's got placeholders for an exclamation and an adjective. And then I call a function praise on it. But this is not a function in base R, so the error that we get is that it can't find the function praise. So this is a problem that you see when people post a lot of code, but they don't tell you which packages they're using, and you get to slowly sort of figure that out through 20 questions.

So here's another small variant of this. Let's run that again in your head. Imagine it in a fresh R session, not the one that we just used. So here we do remember to attach the praise package, and then we call praise on template. Template has not been defined here. It might exist on someone's computer, whoever ran that code. But in terms of by the time this code goes somewhere, again, someone won't be able to run this.

So in this super tiny example, this is what a complete reprex would look like. We declare all of our dependencies. We are attaching the praise package. We create all of our inputs, like this template object. And then we do the thing we've come to do, which is to emit some praise.
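A complete reprex along these lines might look like the sketch below. It assumes the praise package is installed from CRAN, and the exact template wording is made up here; only the two placeholders (an exclamation and an adjective) come from the description above.

```r
# 1. Declare every dependency.
library(praise)

# 2. Create every input the code needs.
template <- "${EXCLAMATION} - your reprex is ${adjective}!"

# 3. Do the thing you came to do.
out <- praise(template)
print(out)
```

Dropping step 1 gives the "could not find function" error from the first variant, and dropping step 2 gives an "object not found" error for template, which is the second variant.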

And so making this type of error and correction easier is one of the main reasons for the reprex package is to sort of help people put their code on a little spaceship and send it somewhere to be executed in isolation. And so the reproducible part is that there's no reliance on hidden state or secret things that I know that you don't know or that are true about my R session that aren't true about yours.

And another reason it's incredibly important to provide to express your problem in runnable code is because of this. So I don't know if you've ever tried to, like, help a relative with a tech problem over the phone. But there's what someone thinks they're doing. I trust, like, everyone has good intentions. And what they say they're doing. And then there's what they're actually doing. And quite often this gap between these two things, like, that's where the problem is.

And so if all you're getting is what you think and say you're doing, it's incredibly hard for someone else to maybe help you troubleshoot things. So by providing your minimum reproducible examples as runnable code, you get rid of all sorts of opinions people have about what's wrong, potential for vocabulary confusion where you say potato, I say potahto, and it's much easier to figure out what's going on.

So to turn back to minimal. When you're trying to figure out what's going on in a confusing situation, you can think of it as the classic looking for a needle in a haystack exercise. And so common sense would tell you that if you can make that haystack smaller, it's going to be a lot easier to find your needle. And this is the basic principle behind why making a reprex minimal is so important. So your goal is to try to make the code and the data as small and simple as possible. And if you took anything else away, it wouldn't be making your point or it wouldn't show the error.

So I'm going to show a wild caught problem. And this was kindly donated by Brooke Watson Madubuonwu with her permission. And so this was a problem she shared with me privately at first. And it's okay if you can't see all of this code. It's kind of the point, the meta point of this slide is that wild caught puzzles are complicated.

And so the main thing you want to know here is that this is a little data ingest snippet. It brings in a bunch of Excel worksheets that come from a bunch of Excel workbooks, brings everything in as a list of many, many, many data frames, and then turns all the variables into character. And then out at the bottom pops this completely mysterious error message: the ... list does not contain 3 elements. So it's not at all clear where in the pipeline that's coming from or what she can actually do about it.

And the thing I want to show here, and this was a legitimate problem that needed to be solved, but this is like a wild caught example by definition probably uses private files that only you have. In this case, you know, she needs ten lines of code to do what she's doing. That's just the complexity of what she's doing. Eight functions from five different add-on packages.

So we chat a little and Brooke also goes off and sort of ruthlessly minimizes things and eventually this surfaces as an issue on dplyr. And by the time she was done, this is what the reprex looked like that enabled a coherent conversation to happen, and it was really more about things were happening in various packages and we gradually got a much better error message here. But the key features are that we have inlined data. She defines the data frame there. And she calls one function. So inline data, not private data. Two lines of code, not ten. It involves one package, not five, one function, not eight. And this gets to the heart of the matter much more quickly.

So how do you actually do this? And I would say one of the easiest places to start is simplifying the data. So if the data that created your problem has 500 rows, why can't it have 499? Why not 498? And, like, keep making it smaller and you will gradually reveal to yourself which features of that data frame are important for showing the problem.

So minimal reprex has small, simple inputs. It's awesome if you can inline them. And no calls to packages or functions that aren't actually needed.
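One way to get small, inlined inputs is to shrink the real data and then let dput() write the code that recreates it. This is a sketch using mtcars, which ships with R, as a stand-in for a private data frame; the choice of columns and rows is arbitrary.

```r
# Stand-in for a larger, private data frame.
big <- mtcars

# Shrink it: keep only the rows and columns needed to show the problem.
small <- head(big[c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object. Paste that code into the
# reprex instead of reading a file that only exists on your machine.
dput(small)

# Check that the printed code really rebuilds the same data.
code <- paste(capture.output(dput(small)), collapse = "\n")
rebuilt <- eval(parse(text = code))
all.equal(small, rebuilt)
```

If the problem disappears while shrinking, the last thing removed is a strong hint about which feature of the data matters.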

I watch a lot of the repositories for our packages on GitHub, and there's a certain type of notification or actually it's usually, like, 100 notifications I get at once when Hadley does issue triage. And so one of the things he will do is he'll post, and he always says it this way, slightly more minimal reprex. So I actually did a GitHub search for that phrase and went and looked at these issue threads. And this is a sketchy diagram because I like the way that looks, but this is actually based on data and a ggplot2 figure. And I counted the lines of code in Hadley's version versus the original version. And they're consistently a lot smaller.

And you might say, well, Jenny, that's great. Hadley has special knowledge of these packages and is maybe better at this than I am. And that is possibly true, but I think there's something else going on that makes me want to still point out how important this is.

But somehow the discipline of knowing that you're preparing something to show other people makes you, like, get your ducks in a row. And it also consistently forces you to minimize things. And a lot of people, the numbers vary. I obviously made these up. But a lot of people report that when they finally decide that they're going to post something on GitHub or Stack Overflow or RStudio's community site, 80 to 90% of the time they solve their own problem. Because it just got them working in a productive way.

And that won't happen every time. And so when it doesn't happen, it still means that you have this beautiful version of your pain that you can post somewhere in a way that other people are more likely to engage with it.

Proper debugging tools

We're going to move on in some sense to the heart of the talk, maybe what you thought it would be, which is about proper debugging. But I do want to say I think the two previous things are actually incredibly important. They don't feel like rocket science, but they can prevent you from getting to this point a lot. But so what if you haven't solved your own problem and no one stepped forward to solve it for you? You have no choice but to debug it yourself.

So this mood I want to set here is that this is a pretty famous Far Side cartoon about how we talk to our dogs, where we express all these detailed emotions or detailed instructions and we're pretty sure that all the dog hears is blah, blah, blah. And I think a lot of us process error messages this way.

So this is a real error message. It's going to strike fear in the hearts of those of you who can actually read it. It's the classic can't install rJava error message. And instead of reading all of this detailed information that might help you sort out exactly what went wrong, I think a lot of us just see error, no, failed, and go back into this speculate, dither, and fret cycle.

So these proper debugging tools are nerdy and they're technical and you're going to have to push through big, ugly error messages or call stacks, but you can do it. So I'm going to show you a small example in this section of a function I've written called fruit average.

So the type of input that it takes is a data frame where we have one column for each piece of fruit and one row for different fruit attributes. So we know that a blackberry has four calories, weighs nine grams, and my personal rating on yumminess is six. And so when you pass that object to fruit average and give it a pattern, it will find the matching columns and average their attributes.

So that's what it looks like when we're averaging two fruits. What if I ask for melon? Melon isn't in this data set, so I get no fruits. You could argue that found zero fruits, like, pluralization is really hard. I didn't get to spend a lot of time on this, so that's fine. It's not beautiful, but it's fine for an early edge case.

But here's a problem. So if I ask for the fruits whose names contain black, I thought maybe there would be more than one. Blackberry, blackcurrant. I get a weird message, like, found fruits, and then I get an error. The error is about rowMeans being applied to minidat, and I'm being told that 'x' must be an array. So I didn't call rowMeans. I didn't make minidat, and I don't know who x is. And so this is a common sort of confusion situation where you are going to have to slowly figure out what all of those mean.
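The transcript never shows the function's source, so the following is a reconstruction that is consistent with the behavior described, not the speaker's code. The name fruit_avg, the function body, and every number except the blackberry column (4 calories, 9 grams, yumminess 6) are assumptions.

```r
# Hypothetical reconstruction of "fruit average".
fruit_avg <- function(dat, pattern) {
  cols <- grep(pattern, names(dat))
  minidat <- dat[, cols]
  message("Found ", ncol(minidat), " fruits!")
  rowMeans(minidat)
}

dat <- data.frame(
  blackberry = c(4, 9, 6),     # values quoted in the talk
  blueberry  = c(1, 2, 8),     # invented for illustration
  peach      = c(59, 150, 7),  # invented for illustration
  row.names  = c("calories", "weight", "yumminess")
)

two  <- fruit_avg(dat, "berry")  # two matching columns: works
none <- fruit_avg(dat, "melon")  # zero matching columns: NaN, but no error
err  <- tryCatch(
  fruit_avg(dat, "black"),       # exactly one matching column: error
  error = function(e) conditionMessage(e)
)
print(err)
```

With exactly one match, dat[, cols] silently returns a plain vector, ncol() of a vector is NULL (hence the message with no number in it), and rowMeans refuses anything with fewer than two dimensions.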

You're going to fiddle around in the bowels of fruit average to figure out what's going on, and does fruit average contain a bug, or did you somehow send unexpected data?

So when I thought about this part of the talk, which is kind of hard to deliver in this setting, because you can't all do exercises and whatnot, I decided to take a tour through three modes of true R debugging, and I'm using a death metaphor. I think it's accurate, because we are talking about fatal errors.

So we're going to go through some things from the left side to the right, and they're basically in order of probably what you should try, and also in order of how much control you have. So the least amount of control is the death certificate, where you can just learn a few basic facts. The next level up is you get to participate in an autopsy, and you are actually allowed to examine the subject. And then finally, if you haven't watched Game of Thrones, this creature is reanimating a lot of dead people to create an army, but the idea is of reanimation or resuscitation.

So this is how I map these on to some classic R debugging strategies, and we're going to go through them in this order. So traceback is your first line of defense, I guess, where you can see what all was called on the way to death. So it is very much like a death certificate, where you get some rather spare facts. If that doesn't allow you to solve your problem, you might go a little bit more interventional, and you can change the error option in R so that right before the function exits, like still on a one-way ticket out, you can do an autopsy and inspect the call stack, but you can't really change the past at this point, whereas if you use browser and related techniques, you actually interrupt things before death is inevitable and get a much better opportunity to maybe fix things, so we're going to go through these.

So if I call fruit average on our troublesome example, I get the error. You immediately would call traceback here, and what you do is you read this from the bottom up, and it shows you the sequence of calls that led to the error. So you called fruit average, apparently somewhere inside fruit average, on line 5, in fact, rowMeans got called, and somewhere inside rowMeans, there was a stop. So this is called the call stack, and that's a term that applies across many, many languages. In R, we summon it with the function traceback.
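At the console the workflow is simply the failing call followed by traceback(). In a script, traceback() has nothing to report once an error has been caught, so the sketch below records the same information with sys.calls() inside a calling handler. The fruit_avg definition and the data are the same hypothetical reconstruction as before, not the speaker's source.

```r
# Hypothetical reconstruction of the function and data from the talk.
fruit_avg <- function(dat, pattern) {
  cols <- grep(pattern, names(dat))
  minidat <- dat[, cols]
  message("Found ", ncol(minidat), " fruits!")
  rowMeans(minidat)
}
dat <- data.frame(
  blackberry = c(4, 9, 6),
  blueberry  = c(1, 2, 8),  # invented values
  row.names  = c("calories", "weight", "yumminess")
)

# Interactive use:
#   fruit_avg(dat, "black")
#   traceback()

# Script-friendly equivalent: capture the call stack while the error is
# still in flight, then let try() swallow the error.
stack <- NULL
try(
  withCallingHandlers(
    fruit_avg(dat, "black"),
    error = function(e) stack <<- sys.calls()
  ),
  silent = TRUE
)
stack_text <- vapply(as.list(stack), function(cl) deparse(cl)[1], character(1))
print(stack_text)
```

Unlike traceback(), this listing runs from the outermost call down to the innermost, and it includes the error-handling machinery as extra frames.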

So when I decided to do this topic, I thought, I am finally going to get to the bottom of what all those different terms mean, and it turns out that you can take any two of those words and you can probably put them in any order, and you will find some pocket of the R community who uses that term, except callback, which is real and is totally different, but you hear people talk about the call stack, the traceback, the stack trace, and the back trace all the time, and they all mean the same thing.

An alternative view that's coming for back traces is from rlang, so rlang is accumulating more and more functionality for developers, really, for throwing classed errors, but one of the things it also offers is some new takes at how to present the call stack, and that's the one part of that area of rlang that might be relevant to just about anyone, mostly this is developer-facing, but this is an alternative view of that same call stack, and it's in a different order, and there's some nice nesting, and so this is being designed with readability in mind, so if you like looking at call stacks this way, there's something you can do in your startup file.
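The startup-file setting is not spelled out in the transcript. One documented way to opt in, assuming the rlang package is installed, is to make rlang::entrace the error handler in .Rprofile:

```r
# In ~/.Rprofile: record an rlang backtrace whenever an error reaches top level.
if (requireNamespace("rlang", quietly = TRUE)) {
  options(error = rlang::entrace)
}

# After an error at the console, rlang::last_trace() prints the backtrace
# as a nested tree.
```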

And if you like using RStudio, it also has a really nice default method for how to handle errors, and it will by default show you the base R traceback and also offer you an easy way to get to some of the techniques that are coming next. So that's traceback.

So the last two techniques I want to talk about are the ones where you get to intervene. Either intervene post-death in the autopsy or intervene even earlier, and this is just a little video that I think totally evokes what's going on and how it feels that you get to open this hidden door and go into this whole world that has a microwave and a refrigerator, and you can certainly look at things and you might be able to do things in there. That's how it feels to be in the debugger. It's less charming, I have to say. And then you come back out, close the door, and hopefully that hidden world goes back to being hidden. But that's kind of what's going on in these next two techniques.

So if you decide that you actually need to see the state of things at the time of an error, you can set your error option to the recover function. So right as you're on your way out the door, execution will pause, and you're allowed to look at frames, and those are the environments corresponding to the different function calls. So I'm going to pick one here. I want to see what minidat is, okay?

So we're in the usual interactive R console, but you know things are special because the prompt contains the word Browse, and a 1, which tells us which frame we're in, and you can print objects here, but a lot of what people do here is they use ls to see which objects exist or in this case I'm using ls.str to look at each object in the environment, and I am particularly interested in minidat because I know it's my nemesis and I notice that it's an integer vector, and I know from the error that it's being sent to rowMeans, which I'm pretty sure needs a two-dimensional object. So I think many of you, your spidey sense is already telling you what the problem might be here.
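The recover workflow only makes sense at an interactive console, so the sketch below guards the failing call with interactive() and lists the browser commands as comments. The fruit_avg definition and data are the same hypothetical reconstruction used for illustration, not the speaker's source.

```r
# Hypothetical reconstruction of the function and data from the talk.
fruit_avg <- function(dat, pattern) {
  cols <- grep(pattern, names(dat))
  minidat <- dat[, cols]
  message("Found ", ncol(minidat), " fruits!")
  rowMeans(minidat)
}
dat <- data.frame(
  blackberry = c(4, 9, 6),
  blueberry  = c(1, 2, 8),  # invented values
  row.names  = c("calories", "weight", "yumminess")
)

# Pause at the moment of the error instead of just printing it.
old <- options(error = recover)

if (interactive()) {
  fruit_avg(dat, "black")
  # recover() shows a numbered menu of frames. Pick the one for fruit_avg,
  # then at the Browse prompt:
  #   ls.str()   # every object in that frame, with its structure
  #   minidat    # the suspect
  #   Q          # leave the browser
}

# Put the previous error handler back when you are done.
options(old)
```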

If you do this recover work inside RStudio, it's great because your usual environment pane is brought to bear on this problem, and you can be looking at the execution environments of those functions in the normal beautiful environment viewer.

But let's say so now I have a theory about what's wrong. I think the fact that this somehow became a vector is my problem. But now I want to sort of test that. So the final most interventional thing you can possibly do is we're going to use browser, and so this last debugging mode is most powerful if you actually have the source of the function you're trying to work with. There are ways to get here without that, but it works best when you do because you'll have something called source references.

So that was the first look you got at the actual source of fruit average, and what I do is I insert a call to the browser function in the body of the function, and you want to insert it before the error, and if you knew where the error was, we wouldn't be having this conversation, so you often want to start high and then you can work it lower as you learn more about the problem, so I'm putting it as the first line.
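Inserting browser() as the first line might look like the sketch below. The function body is the same hypothetical reconstruction as elsewhere, and the call is guarded with interactive() because stepping through code needs a console.

```r
# Hypothetical reconstruction, with a temporary browser() call at the top.
fruit_avg <- function(dat, pattern) {
  browser()  # temporary: execution pauses here; remove once the bug is fixed
  cols <- grep(pattern, names(dat))
  minidat <- dat[, cols]
  message("Found ", ncol(minidat), " fruits!")
  rowMeans(minidat)
}
dat <- data.frame(
  blackberry = c(4, 9, 6),
  blueberry  = c(1, 2, 8),  # invented values
  row.names  = c("calories", "weight", "yumminess")
)

if (interactive()) {
  fruit_avg(dat, "black")
  # At the Browse prompt:
  #   n   run the next line
  #   c   continue to the end (or to the next browser() call)
  #   Q   quit the browser and abandon the call
}
```

Starting high in the function and moving the browser() call down as you learn more keeps the number of lines to step through small.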

And there are, as I said, other ways to get into a similar world that I simply don't have time to cover, so in the RStudio IDE, again, if you've got the source open, you can set what's called an IDE break point, and that's what that red dot means, and then you're not editing your code, so a lot of people find that a nicer workflow, and if you don't have the source to a function, maybe it's in a package owned by someone else and you haven't bothered to download the source, or it's base R, you can use debug to get a fairly similar experience, but it's a little more hampered by the fact that you don't have the actual source.

So this is a little video of me live browsing this problem. So first thing I do is I source a version of fruit average that has that browser call in it. Then I'm going to immediately call it on my usual troublesome example, and what you're going to see is I immediately get kicked into this slightly different version of the regular interactive R console, and the Browse thing will be in the prompt. So I can use n now to go next line, next line, and I'm walking through that function. It might be someone else's function, line by line. And finally we get to minidat.

So I'm going to inspect minidat very exhaustively and see what it looks like. I'm going to see what its dimensions are, which are NULL, because it's a vector, and how many columns it has, which is also NULL, because it's a vector, and I'm pretty sure this is my problem, but here's the cool thing that you can't do in any other debugging mode: I can redefine minidat, and I'm going to do the same subsetting, but I'm going to specify drop = FALSE, so that even if I get just one column, it's still going to be a two-dimensional object. I check the dimensions again. And then I can resume execution, and I'm going to get the message I should see and I'm going to get the result I should see, and so this is how you sort of test a theory or pilot a solution to what you think might be going wrong.
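The theory being tested in the browser can also be checked outside it. This sketch uses a small made-up data frame (only the blackberry values come from the talk) to show what single-column subsetting does with and without drop = FALSE.

```r
dat <- data.frame(
  blackberry = c(4, 9, 6),
  blueberry  = c(1, 2, 8),  # invented values
  row.names  = c("calories", "weight", "yumminess")
)

# Default: selecting one column from a data frame drops down to a vector.
dropped <- dat[, "blackberry"]
dim(dropped)   # NULL
ncol(dropped)  # NULL

# drop = FALSE keeps the result two-dimensional, even with one column.
kept <- dat[, "blackberry", drop = FALSE]
dim(kept)

# rowMeans() is happy again.
avg <- rowMeans(kept)
print(avg)
```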

So that's what using browser looks like, and there are different ways to get into this world. So to conclude this section, every time I've talked about this before, people come up to me afterwards to talk about this, and they're like, it must be in the talk, it has to be in the talk. So it's very easy, and I certainly have experienced this myself, to be in the browser, you think you're really clever, I'm going to upgrade my debugging skills, and you don't know how to get back out of it. So you will never remember this, but it's capital Q, okay? If you're in RStudio, there's also a helpful button, a stop button with a square that will get you out of it.

You'll eventually learn. And then two things that are more proactive to know about: if you've used debug on a function, it means every time you execute it, you're going to get kicked back into the browser. Calling undebug on the same function is how you cancel this behavior. And some people have been burned by this so badly so many times that they have a policy of only using debugonce, and what it does is it will send you into this browser environment exactly once, the first time you hit that point, and then never again. So it's sort of self-destructing. So those are all good to know about.
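A minimal sketch of that on/off cycle, using base R's `debug()`, `undebug()`, `isdebugged()`, and `debugonce()`. The function `fruit_count()` is a hypothetical stand-in, and the interactive part is guarded so that a non-interactive run never stops at a browser prompt.

```r
# Hypothetical stand-in for a function you want to step through.
fruit_count <- function(x) length(x)

debug(fruit_count)                  # every call now starts in the browser
flag_on <- isdebugged(fruit_count)  # TRUE

undebug(fruit_count)                # cancel: calls run normally again
flag_off <- isdebugged(fruit_count) # FALSE

# debugonce() flags the function for a single browser session.
# Guarded so that non-interactive runs never wait at a prompt.
if (interactive()) {
  debugonce(fruit_count)
  fruit_count(c("kiwi", "plum"))    # enters the browser this one time
  fruit_count(c("kiwi", "plum"))    # runs normally
}
```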

Building projects that are less hospitable to bugs

Our last section is more future-facing, and it's how do you create your projects in a way that they are less hospitable to bugs, and when things go wrong, because of course they're going to, you're giving yourself more information to help you solve it. So if you fixed something once or you've seen some weird edge behavior that causes all hell to break loose, do something to make sure that that stays fixed. And these tips are increasingly going to be more package development focused, although some of it is relevant to scripts.

But so based on the example we just worked, like if fruit average was in a package I maintain, once I make that fix, drop equals false, I also add this test to make sure that the behavior is what it should be when we have zero fruits, one fruit, and two fruits matching.
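A test along those lines might look like the sketch below. The body of `fruit_avg()` and the data are assumptions reconstructed for illustration, not the code shown on the slide; `test_that()`, `expect_true()`, and `expect_equal()` are real testthat functions.

```r
# Hypothetical reconstruction of the function from the example; the body
# and the data below are assumptions made for illustration.
fruit_avg <- function(dat, pattern) {
  cols <- grep(pattern, names(dat))
  mini_dat <- dat[, cols, drop = FALSE]   # the fix: stay two-dimensional
  message("Found ", ncol(mini_dat), " fruits!")
  rowMeans(mini_dat)
}

dat <- data.frame(
  blackberry = c(1, 2),
  blueberry  = c(3, 4),
  apple      = c(5, 6)
)

# Guarded so the script still runs where testthat is not installed.
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("fruit_avg() handles zero, one, and two matching fruits", {
    # suppressMessages() silences the "Found n fruits!" message.
    zero <- suppressMessages(fruit_avg(dat, "kiwi"))
    one  <- suppressMessages(fruit_avg(dat, "apple"))
    two  <- suppressMessages(fruit_avg(dat, "berry"))

    expect_true(all(is.nan(zero)))        # mean of no columns is NaN
    expect_equal(unname(one), c(5, 6))
    expect_equal(unname(two), c(2, 3))
  })
}
```

In a package, a test like this would typically live in a file under `tests/testthat/`, so that it runs together with the rest of the test suite.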

And so for people working with data analysis in a data script, this is another hypothetical, like let's imagine you're importing fruit data on a regular basis, and somewhere down in the pipeline there's an implicit assumption that everything's numeric, and maybe you've been burned by having things unexpectedly import as character. Once you've spent a half day debugging that, you should add an assertion in your pipeline so that the next time that happens, because you probably can't prevent it, you at least know very early and you have a really excellent error message for yourself.
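Such an assertion can be a few lines of base R. The sketch below assumes a hypothetical helper, `assert_numeric_cols()`, and a made-up import in which one stray value turned a column into character.

```r
# Hypothetical helper: fail early, and informatively, if a column that
# should be numeric was imported as something else.
assert_numeric_cols <- function(dat, cols) {
  bad <- cols[!vapply(dat[cols], is.numeric, logical(1))]
  if (length(bad) > 0) {
    stop(
      "Expected numeric columns, but these were imported as another type: ",
      paste(bad, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(dat)
}

# Made-up import where one value ("n/a") forced a column to character.
fruit <- data.frame(
  apple = c(1, 2),
  kiwi  = c("3", "n/a"),
  stringsAsFactors = FALSE
)

# Capture the error message instead of halting, to show what it says.
result <- tryCatch(
  assert_numeric_cols(fruit, c("apple", "kiwi")),
  error = function(e) conditionMessage(e)
)
result
```

Placed right after the import step, a check like this names the offending column at the point where the problem enters the pipeline, rather than several steps later.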

So over time you're going to accumulate lots of these sorts of checks, whether you make a data pipeline or a package, and so you want to be running them en masse and really often. So there are two great big collections of checks that you could be running, and this is definitely about packages now. R CMD check itself will show you if your package is meeting all the various criteria enforced by CRAN, and for the vast majority of those, it's a standard you want to meet, whether you're going to put your package on CRAN or not, and something you'd want to run really often. And then you're going to have custom tests, like the one I just showed you, that uniquely test the functionality of your work. Our group uses a package called testthat to express those and to sort of choreograph the running of them. When you set that up, you can run just them, and it's also wired up so that those will run every time R CMD check is run. This means you're much more likely to run all of your tests, and the sooner you learn that you broke something, the easier it is to fix it, because usually the delta is smaller and you're looking in a smaller haystack.

If you only run R CMD check every ten months, you've made a very big haystack to look through. And then the next level up is to run those checks not on your machine, but on their machine, and that pretty much means that you have to be using a whole other set of practices that are beyond this talk: you are keeping your code under version control, pushing it to a remote version control host like a GitHub or a GitLab or a Bitbucket, and you'll use something called continuous integration. Every time you make a change, it will kick off running R CMD check, which includes your tests, preferably over many different operating systems, so you get extra credit here if you are running it on different operating systems. And, again, this just means that you find errors sooner, when they're easier to fix.

This particular point is extremely personal. I have found that there are certain patterns I will use inside a function, or certain data structures, that later prove to be very enriched for bugs. They're like bug magnets, and when I have to go back in and tinker with that piece of code, I hate it. For me, it's recursion and high-dimensional data arrays, like big dimensional data cubes, okay? So if you need that abstraction to solve your problem, absolutely, of course, you should leave it in, but if I'm honest with myself, sometimes I did this because I could.

And both of these things I find make perfect sense when you've been thinking about nothing but that problem for three days, and it's just so elegant, and then you go away and six months later, this is where the bug will be. It is not so elegant anymore. It takes a long time to, like, upload all of that back into your RAM. And so I find that if you have some pattern like that where you kick yourself for using it when you could have done something simpler, stop doing that.

Here's a Douglas Adams quote that I think has a lot of relevance to building data packages. The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when the thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair. And so the idea I want to talk about here is that as you're building things, you will be so happy later if you've left yourself some kind of an access panel.

And in this case, you know, there's a valve here that you can go in and turn off or up or down. And I'm not going to go through all of the examples here, but I'm showing several packages near and dear to me, and I'm also showing an example from base R, where if you're trying to debug Excel import or making HTTP calls or some sort of hairy nonstandard evaluation problem, you have a way to flip a switch and suddenly be getting a lot more information. And this is great for package developers. It's very useful during development. And then also when you're trying to help a user debug something and you're having that sort of vocabulary communication problem, you can ask them to open the access panel, run their problem, and send you a lot more information that might help you get unstuck.
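One simple way to build such an access panel into your own code is an option that switches on extra diagnostics. The sketch below is only an illustration of the idea: the function `import_fruit()` and the option name `myproject.verbose` are invented here and do not come from any of the packages mentioned in the talk.

```r
# Hypothetical function with a built-in "access panel": an option
# (the name myproject.verbose is invented here) switches on extra detail.
import_fruit <- function(values) {
  verbose <- isTRUE(getOption("myproject.verbose", default = FALSE))
  if (verbose) {
    message(
      "import_fruit(): received ", length(values),
      " values of type ", typeof(values)
    )
  }
  as.numeric(values)
}

import_fruit(c("1", "2"))                  # quiet by default

old <- options(myproject.verbose = TRUE)   # open the access panel
import_fruit(c("1", "2"))                  # same result, plus a message
options(old)                               # restore the previous setting
```

A user who is stuck can then be asked to set the option, rerun the failing call, and send back the extra output.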

Writing better error messages

My last point is about writing error messages. So we've come back to where we started, which is with R's most famous error message, object of type closure is not subsettable. My theory is that the reason this throws people for such a loop is the word closure, and that a lot of people don't know what that means. So we could also in most cases use the word function there. I'm not sure if that would immediately make everyone love this error message, like I totally know what to do now, I can fix my problem. But I think it removes one communication barrier.
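A common way to run into this error is to subset a name that happens to belong to a function. The sketch below relies on the fact that, when no data frame called `df` has been defined, `df` resolves to `stats::df()`, the density function of the F distribution. The error is captured with `tryCatch()` so that the script keeps running.

```r
# No data frame named df has been defined, so R finds stats::df(),
# the density function of the F distribution, and tries to subset that.
msg <- tryCatch(
  df$fruit,
  error = function(e) conditionMessage(e)
)
msg          # the text of the error that R raised

typeof(df)   # "closure": R's internal name for an ordinary function
```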

In the tidyverse, we've been trying to create some sort of standard for ourselves for error messages, where we return as much information as we have. Like where the error occurred, the name of the object involved, if we're pretty sure we have that right, and maybe even a hint. So this is a much more controversial version of object of type closure is not subsettable, where we say I can't subset a function for you. Have you forgotten to define a variable named df? So this is the type of hint we give sometimes. It's extremely dangerous, though, because people really trust you, and it's really difficult to predict all the different ways people can get an error message. If you're wrong, you're going to send a lot of people on a wild goose chase.
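As a rough illustration of a message that names the object and offers a tentative hint, here is a base R sketch. The wrapper `subset_column()` and its wording are hypothetical; this is not how the tidyverse implements its messages.

```r
# Hypothetical wrapper: name the object involved, say what is wrong,
# and offer a tentative hint.
subset_column <- function(x, column) {
  if (is.function(x)) {
    stop(
      "Can't subset `", deparse(substitute(x)), "`: it is a function, ",
      "not a data frame.\n",
      "Hint: have you forgotten to define a variable with that name?",
      call. = FALSE
    )
  }
  x[[column]]
}

# df is not defined here, so it resolves to the function stats::df().
msg2 <- tryCatch(
  subset_column(df, "fruit"),
  error = function(e) conditionMessage(e)
)
cat(msg2, "\n")
```

The hint is phrased as a question on purpose: the function cannot know why it was handed a function, so it suggests the likely cause without asserting it.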

So I'm not sure that this error is really amenable to this, but a much cleaner example is from dplyr. So dplyr has a function called filter, where you can ask for just certain rows in a dataset. And it's very easy to use the single equal sign when you want the logical double equal sign. And apparently enough people have fallen down this pothole that the dplyr maintainers have had mercy on all of us. It's pretty clear what people mean, so the error suggests that maybe you need to be using the double equal sign. So I think of all the error messages that have hints, this is my favorite, and it's the one that has helped me the most.

Your troubleshooting blueprint

So that brings us to the end here. I'm going to review your troubleshooting blueprint, okay? So something weird happens. I think the very first thing you should do is turn it off and turn it on again. You'll be amazed how often your problem is gone. If it's not gone, try to make a reprex. Pretend like you're going to send it out into the world and you need to minimize it, you need to write a clean, self-contained version. Again, you will be amazed how often that process gets you unstuck and leads to a productive solution.

If that doesn't work, you will have to dig into the error. And I really think using the proper debugging tools is very useful, and I know that I put off that day way too long. So if they look kind of intimidating, I suggest that you time box it. So maybe say the next time I come up with one of these weird situations, I'm going to fiddle around with these traceback, recover, and browser for ten minutes. And if I haven't really gotten anywhere, I have permission to quit and go back to my old ways. And I think you'll find that with just a little bit of usage, you get a lot better much more quickly.

And then finally, plan for the unexpected. So you are clearly going to be debugging everything you build. I'm sorry to tell you that. And so you might as well build it in such a way that when it fails, it fails informatively. When you break things, you learn quickly. And again, just make it easier to recover from the bugs that will inevitably come up.

So there's the short link again if you want to see more links about the talk, get the slides. They're kind of large, so I would recommend that you look at them on Speaker Deck. But I really want to give big thanks to two people, virtual people. First, the Tidyverse team has listened to me practice parts of this talk for many, many months. And I have to say that we actually learned a lot about debugging from each other based on these conversations every two or three weeks or so.

And part of why I want to say that is a lot of this stuff maybe feels a bit exotic, especially the debugging section. And just let it be known that people on the Tidyverse team learned things by talking to each other about this. Not everything we talked about makes it into the talk. It's more technical. But people don't talk about this enough, I've decided, and people have cool tricks to share with you.

And finally, I want to thank Christine Cooper, who created the beautiful visual design for this talk. And without further ado, thank you very much.