Jenny Bryan | Help me help you: creating reproducible examples

Transcript

This transcript was generated automatically and may contain errors.

Welcome everyone to today's webinar. We're going to talk about reproducible examples from a conceptual point of view and why they're surprisingly important and then also a great deal from a mechanical point of view, how to make your reproducible examples in a way that they're easy to share with other people.

Okay, so this short link, rstd.io/reprex, currently points to a copy of these slides on Speaker Deck. Another very relevant URL is reprex.tidyverse.org, which is the website for the package I'm going to talk about. We may change what the short link points to as a result of this webinar, because we're pretty excited to get some of these materials captured, but I promise it will always point to something very relevant to this package that links to absolutely everything else. You're also going to see that short link many, many times during this presentation. And as said in the intro, this video will also be posted within 48 hours and all of these things will be publicized.

I think that conversations about code are much more productive if they contain three things: code that actually runs, code that I do not have to run as the reader, and code that I can easily run.

Self-contained code demo

So I want to be really detailed about what I mean when I say code that actually runs. You're gonna isolate a little piece of R code and hand it off to reprex. You've seen one demo; we're about to do a whole bunch more. That code is taken and run in a completely new R session, and that means it has to be completely self-contained. So it must include the commands to load all necessary packages and it must create all necessary objects. This can be very frustrating for people, but it's extremely important. So I'm going to go do this live to show exactly what I mean.

Okay, so I'm looking at an R script that contains the code you just saw on that slide, and I'm going to restart R. Let's imagine a typical interactive R session. I'm going to be down in the console here and I'm going to say, oh, I'd like to play a little bit with this praise package I've heard about. So there I go: I say library(praise) down in the console. Now, up in my source editor, I make a new object called template, and it's a template string: exclamation, your reprex is adjective. If I then call the praise() function from the praise package (I don't expect you to know this, I'm just using it as an example), it's going to create random little sentences for us, praising someone for their awesome reprex. So let's say I want to share my joy about this with people using the reprex package. I would select this little snippet of code (again, this is the long way, I'll show you a short way later), copy it, go down to the console, type reprex() and hit return. And now let's look at the preview here. It shows defining template, and then my praise() call fails: error in praise, could not find function "praise". That's because you don't have the library(praise) command here. So over in that fresh R session, the praise package is not available to use.

So here's something else you might do. You're like, okay, I'm going to add that command, then I'm going to make my call to the package. Let's see if that works. Copy, call reprex() again. I have a new error: error in grep, whatever, whatever, object 'template' not found. So this snippet is incomplete in a different way. It doesn't contain the code that defines the template object. So here's the full snippet. It loads the praise package, it defines the template object, and it makes this function call. I'm going to copy all of that to the clipboard, re-execute reprex(), and we have made an exquisite reprex.
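To make the "completely new R session" point concrete, here is a minimal base R sketch. It is not how the reprex package is implemented; run_in_fresh_session() is a hypothetical helper that writes a snippet to a temporary script and runs it with Rscript, so nothing from the current session can leak in. It uses toupper() instead of the praise package so that it runs without any extra packages installed.

```r
# Hypothetical helper (not part of reprex): run code in a brand-new R process.
run_in_fresh_session <- function(code) {
  script <- tempfile(fileext = ".R")
  on.exit(unlink(script))                      # remove the temp script afterwards
  writeLines(code, script)
  rscript <- file.path(R.home("bin"), "Rscript")
  # A failing script makes system2() warn about the exit status; the error
  # text is what we want to inspect, so the warning is muted.
  suppressWarnings(
    system2(rscript, c("--vanilla", shQuote(script)),
            stdout = TRUE, stderr = TRUE)
  )
}

# This object exists only in the *current* session.
template <- "your reprex is awesome"

# Incomplete snippet: silently relies on `template` already existing.
incomplete <- "toupper(template)"

# Complete snippet: creates every object it uses.
complete <- c(
  'template <- "your reprex is awesome"',
  "toupper(template)"
)

out_incomplete <- run_in_fresh_session(incomplete)  # fails: object 'template' not found
out_complete   <- run_in_fresh_session(complete)    # succeeds

print(out_incomplete)
print(out_complete)
```

The incomplete snippet works when typed into the current console and fails in the fresh process, which is the same kind of surprise the demo above produces.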

So that's a little belabored, but when I try to answer R questions for people and I try to run their code, the two most common ways that I fail are these: either they haven't explicitly listed all the packages they're using, and I have to sleuth it out of them or figure it out for myself and add those commands, or the objects they're referring to are not available to me. Those are the two reasons why I can't run their code.

Do's and don'ts for reprex

So on the reprex website I have a list of do's and don'ts, distilled from a lot of other really fantastic sources about creating reproducible examples, which are referenced there. The three big, big high points are these. First, you need to write this reproducible example using the smallest, the simplest, and the most built-in data set you can get away with, and that is very uncomfortable for people. Second, include commands on a ruthlessly strict need-to-run basis, so you really need to strip your example down. And third, I say pack it in, pack it out, and don't take liberties with other people's computers. This refers to making sure that if you create files, you remove them; if you change the working directory, you reset it; if you change options, you reset them. Basically, leave things as you found them.
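A small base R sketch of "pack it in, pack it out", under the assumption that the example needs a file, a changed option, and a changed working directory. The specific file and option are illustrative choices, not something from the talk.

```r
# Files: write to a temporary location and delete the file when done.
tmp <- tempfile(fileext = ".csv")
write.csv(head(mtcars), tmp, row.names = FALSE)   # mtcars is a built-in data set
dat <- read.csv(tmp)
unlink(tmp)

# Options: options() returns the previous values, so they can be restored.
old_opts <- options(digits = 3)
print(pi)                                         # prints [1] 3.14
options(old_opts)

# Working directory: setwd() returns the previous directory.
old_wd <- setwd(tempdir())
# ... code that needs to run in that directory would go here ...
setwd(old_wd)
```

Saving the return value of options() and setwd() is what makes the reset reliable, since the reader's starting state is unknown to the person writing the example.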

Here's what that web page looks like, if you want to read more. But let me give a short example of something that a lot of people struggle with, which is that they feel like they have some big hairy data object and they can only show their example using it. So, tricks to know. The read.csv() function is something you usually use to bring delimited data in from a file, but it also has a text argument that allows you to inline really tiny R objects. And then there's also just the data.frame() function itself. So I'm going to reprex those two snippets of code using a keyboard shortcut, and you see that's a very easy way to make a very tiny data frame, either inline using read.csv() or sort of from first principles using data.frame().
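The two base R tricks just described can be sketched as follows. The column names and values are made up for illustration; the snippets shown in the webinar are not captured in the transcript.

```r
# Inline a tiny data frame from CSV text via the `text` argument...
df_csv <- read.csv(text = "a,b\n1,x\n2,y")

# ...or build the same thing from first principles.
df_direct <- data.frame(a = 1:2, b = c("x", "y"))

print(df_csv)
print(df_direct)
```

Either form keeps the data inside the example itself, so the reader needs no external file.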

And then, if you are a tidyverse adherent, the tibble package is what takes care of the care and feeding of tibbles, which are a flavor of data frame. The tribble() function is extremely useful for creating tiny little data frames, because it allows you to write them in this really humane, row-wise way, the same way they would look in, for example, a CSV file. So if I reprex this little snippet, you'll see very similar output to what we just saw with the base functions, but it allows you to inline the creation of, in this case, a two-row, two-column data frame. Or you can again just use the tibble() function directly. If you make a lot of reprexes, you get really good at figuring out how to inline the creation of very small objects. Of course, figuring out what's the smallest object that still shows your problem is difficult, and we're going to talk about that at the very end.
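A sketch of the tidyverse equivalent, with made-up column names and values. The requireNamespace() guard is only there so the block does nothing when the tibble package is not installed; a real reprex would simply start with library(tibble).

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  # Row-wise layout: reads like the rows of a CSV file.
  tb_rowwise <- tibble::tribble(
    ~a, ~b,
     1, "x",
     2, "y"
  )

  # Column-wise layout with tibble() directly.
  tb_colwise <- tibble::tibble(a = c(1, 2), b = c("x", "y"))

  print(tb_rowwise)
  print(all.equal(tb_rowwise, tb_colwise))
}
```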

Okay, another principle is that the reprex should contain code I do not have to run. A lot of your readers have a great deal of R experience, and sometimes, not always, they can quickly see the point without actually running the code. But that is greatly enhanced if they can see the output, instead of having to run the code in their head and try to figure out what's happening. That's why I think it's important that your typical reprex contains the code and also reveals the output produced by that code.

So here's an example I took from the GitHub repository where the readr package is developed, because it's a perfect little example. It probably was produced with reprex, though you can't tell. This person is just reporting a bug, but it's a great minimal example. It says, you know, if the header in your CSV contains quoted newlines, you get kind of a weird column name and you get weird data. The fact that this person provided a small example that completely shows the problem is, I imagine, what allowed the maintainer, Jim Hester, who's listening to this call, to quickly label this as a bug. And we've already got at least one other user giving it a thumbs up, meaning they've experienced it as well. If you only had the code here, I think you'd have a lot less quick engagement with this issue.

Okay, so code that I can easily run is very important, and we're going to keep working with that issue. If that person had instead copied and pasted the output from their R console, this is what we would be faced with. If I were Jim Hester and I needed to reproduce this issue and make sure that it's still a problem, I'd have a lot of really annoying editing to do. I'd have to get rid of all the prompts at the beginning of the lines, and I'd have to get rid of all this output, to isolate the three lines of code that actually do anything. So copy-paste from the R console hits some of our checklist, but it's not great, because it's very hard for the next person to run this code.

Worse than copy-paste is the screenshot. This of course does, again, hit some of our checklist: it clearly shows the code and the output. But if somebody else wanted to check this and reproduce it, they would actually have to retype everything, which frankly is never going to happen. So this is what I want to see in a reprex, because it can be copy-pasted and run. I'm going to prove that to you right now. If I go to this issue on GitHub, I could copy all of this, or, as long as I get all the commands, I'm okay. So I'm going to put that on my clipboard and go back to R. Maybe to make this really explicit, I'll show you what I copied. Right, that's what I did. So I can copy this again and call reprex(), and I get exactly what this person was reporting on GitHub. I've been able to reproduce it very quickly from a copy-paste.

But as you saw, reprex is like, are you sure you want to do this, because I've got this output here. So if you really want to get clean code from a reprex that someone else has made, you capture it and use one of the undo functions in the reprex package. I could use reprex_clean(), and I'll show you that right now. Here's what I copied from GitHub, so I could copy that and call reprex_clean(), and now if I paste, you'll see all the output has been eliminated. I think that's a slightly obscure thing you might want to do, but there is a full set of backwards functions in reprex. They help you take code that people have copied from the console, or that they have already made a reprex from.
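To illustrate what "undoing" involves, here is a minimal base R sketch. Both helpers are hypothetical and deliberately naive; they are not the reprex package's implementation, and functions such as reprex_clean() handle more cases than this.

```r
# Hypothetical helper: drop output lines from a rendered reprex, assuming
# output lines start with the comment string "#>".
strip_reprex_output <- function(lines, comment = "#>") {
  lines[!startsWith(lines, comment)]
}

# Hypothetical helper: recover code from a console copy-paste by keeping only
# lines that start with a prompt ("> ") or continuation ("+ ") and removing it.
strip_console_prompts <- function(lines) {
  is_code <- grepl("^[>+] ", lines)
  sub("^[>+] ", "", lines[is_code])
}

rendered <- c("x <- c(1, 2, 3)", "mean(x)", "#> [1] 2")
console  <- c("> x <- c(1, 2, 3)", "> mean(x)", "[1] 2")

print(strip_reprex_output(rendered))
print(strip_console_prompts(console))
```

Both calls return the same two lines of runnable code, which is the editing work a reader would otherwise have to do by hand.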

Shock and awe: advanced features

Okay, so we've gotten to essentially the meat of the webinar now. If you were really interested in basic usage, you've seen it. Now I'm going to go into the shock and awe section, where I run through a lot of more interesting features of the reprex package that I still think are pretty cool. The slides show you what we're about to do live. I'm going back to RStudio, and I'm in a script called shock and awe. I'm gonna restart R just for good measure.

So the first thing I want to show you is how frictionless reprex can make it to talk to people about figures. I'm gonna load the gapminder data and ggplot2, and I'm gonna make a plot with ggplot2, so you see it down here in my plots pane. Let's say that there's something about this I don't like, or that I want to discuss with a colleague. I can use reprex for this. As usual, I can select the snippet, copy it to my clipboard, and run reprex(). You're gonna see all the same stuff, so we've got a nicely rendered reprex that includes the figure. But watch this. I'm gonna go to a GitHub repo that I created just to play around with, and I'm gonna create a new issue: I have a question about this plot. And I'm gonna paste. Let's look at what we've got. We have the usual nicely formatted markdown, and look at this: when reprex rendered this code, it made your figure, pushed it up to imgur, and dropped this link into your markdown. So if I submit this issue, people see my code and they see the actual figure that I just made. This is an example of one of the cool things you can do that removes a tremendous amount of friction if you're trying to have a quick conversation with somebody about code that produces figures.

Okay, so we're going to go back to this shock and awe script, and we're going to execute reprex() many times, showing some of the options and different arguments you have. So far I've only shown you reprexing when the source code is on the clipboard, but there are a lot of other ways to provide the input. You can provide it directly in the reprex() call as an expression, so here you see that the assignment of x and y gets done and we compute the correlation between them. There's also an input argument, which I'm actually not going to demonstrate, where you can provide the source as a file or as a character vector.
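A sketch of those two input routes. The exact code on the slide is not in the transcript, so rnorm() and cor() here are an assumption based on "the assignment of x and y" and "the correlation between them". The guard is there because reprex() is an interactive tool and the package may not be installed.

```r
# The same three lines, held as a character vector for the `input` route.
code <- c(
  "x <- rnorm(100)",
  "y <- rnorm(100)",
  "cor(x, y)"
)

if (interactive() && requireNamespace("reprex", quietly = TRUE)) {
  # Route 1: source supplied directly as an expression.
  reprex::reprex({
    x <- rnorm(100)
    y <- rnorm(100)
    cor(x, y)
  })

  # Route 2: source supplied as a character vector via `input`.
  reprex::reprex(input = code)
}
```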

reprex by default goes and does its work in the session temp directory; that's all part of it sandboxing all of your work. But if your reprex does, for example, file input and output, it can be much easier to force reprex to work in your current working directory, and outfile = NA is shorthand for that. So if I ask R to write the first six letters of the alphabet to a file with outfile = NA, all of a sudden these four files that reprex needs to create are being left behind in my working directory instead of in a temp directory. It's the R script that reprex makes, it's the HTML that it uses for the preview, and the markdown that it puts on the clipboard for you. So all those usual files are left behind in a much more accessible place. But you'll notice they have a god-awful file name, because we just created it out of thin air. If you want to work somewhere specific and have nice file names, you can also provide the base for that in outfile, and now you see that it leaves the same four files behind, but they have a much better file name.

Okay, so far I've been producing reprex output that's optimized for GitHub, so it's producing what's called GitHub-flavored markdown. But Stack Overflow is another common target, and that needs slightly different-looking markdown. Stack Overflow doesn't use fenced code blocks; it uses indented code blocks. Let me show you what this would look like. I'm going to pretend I'm going to answer my own question on Stack Overflow, but I won't actually submit this. Now if you paste that into Stack Overflow, which also has a preview feature, it will be formatted correctly for Stack Overflow.

You can also make reprex produce an R script, which seems sort of weird, but it's an R script that includes the output as comments, and that is very handy for pasting into an email or into Slack. So I'm going to show you the Slack version of that. This is me talking to myself on Slack. I'm going to create a code snippet, paste it (I always have it set to R), and create the snippet. That creates a little R file in Slack, properly syntax highlighted, and again, people can copy-paste it into R and run it. Or sometimes you just want to inline it. You don't get the syntax highlighting, but that also looks quite nice.

The final venue I'll talk about is RTF, for rich text format, and this is a very experimental venue. It probably only works on the Mac at this point, because I actually have to call an external utility to do this. But I'd like to show you this, and in fact it's how I made the slides for this talk. So we run a little bit of code, and now I can go over to Keynote or PowerPoint or something and paste that in, and I'm getting rendered R code that is properly syntax highlighted. That is in fact how all the snippets in my webinar were produced.
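The venues discussed above are selected with the venue argument of reprex(). A minimal sketch follows; the snippet being rendered is a made-up placeholder, and the guard is there because the calls are interactive and need the package installed.

```r
snippet <- c("x <- 1:10", "mean(x)")

if (interactive() && requireNamespace("reprex", quietly = TRUE)) {
  reprex::reprex(input = snippet, venue = "gh")   # GitHub-flavored markdown (the default)
  reprex::reprex(input = snippet, venue = "so")   # Stack Overflow
  reprex::reprex(input = snippet, venue = "r")    # R script with output as comments
  # venue = "rtf" is described above as experimental and dependent on an
  # external utility, so it is left out of this sketch.
}
```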

You can suppress the inclusion of that little ad at the bottom, or you can include it. You can ask for your reprex to include session info, and for the GitHub venue it can be placed in this cute little collapsible thing. This is a great thing to include if you think, for example, that the bug you're reporting could possibly be related to the version of software on your computer. I love that it gets folded here, because sometimes people include this when they don't need to and it's kind of overwhelming, so the fact that we can put it in this folding tag is really nice.

reprex can also use the styler package to restyle your code. So here's a really, I would say, poorly formatted piece of code. By default, reprex trusts that you know what you're doing and that you like your formatting. But if you don't trust yourself, you can explicitly ask reprex to restyle your code and give it a much more conventional layout. You can also be silly and change your comment string, and make it some sort of emoticon if you want.
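A sketch of the restyling and comment-string options. The messy code is invented for illustration, and the emoticon is just one possible comment string. The guard checks for both reprex and styler, since restyling depends on the latter.

```r
# Valid but poorly formatted code (made up for this sketch).
messy_code <- c(
  "x=c(1,2,3)",
  "if(mean(x)>1){print('big')}else{print('small')}"
)

if (interactive() &&
    requireNamespace("reprex", quietly = TRUE) &&
    requireNamespace("styler", quietly = TRUE)) {
  # style = TRUE asks for the code to be restyled before rendering;
  # comment sets the prefix used for output lines.
  reprex::reprex(input = messy_code, style = TRUE, comment = "#;-)")
}
```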

reprex is part of the tidyverse, right, and the tidyverse meta-package can be quite chatty at startup, telling you all the packages that you've just attached and whether there are conflicts between them. So we actually have a special argument where you can control whether you want that or not. Usually you don't, so we default to silencing it.

And then the last thing I'll show you is that reprex can actually capture output that, in an interactive R session, shows up in your console but is actually being sent to standard output or standard error. So I'm going to install a package from GitHub that requires compilation. This takes a moment, so I'm going to chat over it. What you're going to see when this reprex actually renders is that we have captured everything that would normally show up in the R console when doing this: the stuff that is coming from R, as well as the things that are being sent to standard output and standard error. I was hoping that would be enough... there we go. Okay, so here's the output of installing the bench package from GitHub, which does require compilation. This is the part that's coming through normal R channels, and this is capturing what's being sent to standard output and standard error.

So that was a very quick live demo of some of the more, I don't know if they're really advanced, but features you don't need in every reprex but that you might need before long. For a lot of the things that I showed you toggling on and off, you can actually set up your own personal defaults by, again, putting some code in your .Rprofile, which we've already talked about. This is just an example of someone who hates the ad, always wants to include session info, always wants to restyle their code, has a whimsical sense of what the comment string should be for output, and always wants to see the tidyverse startup message. These are not my defaults, but it's an example of what you can do.
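A sketch of what such a .Rprofile entry might look like for the hypothetical user just described. The option names are an assumption: they follow the pattern reprex.<argument name>, and the session info option in particular has been spelled differently across releases, so check the documentation for the installed version.

```r
# Would normally live in ~/.Rprofile so it runs at the start of every session.
options(
  reprex.advertise       = FALSE,    # hates the ad
  reprex.session_info    = TRUE,     # always include session info
                                     # (older releases are believed to use reprex.si)
  reprex.style           = TRUE,     # always restyle the code
  reprex.comment         = "#;-)",   # whimsical comment string for output
  reprex.tidyverse_quiet = FALSE     # always show the tidyverse startup message
)
```

Setting options has no effect until reprex() is next called, so this is safe to keep in a startup file even on machines where the package is not installed.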

And the last thing I'll show you mechanically is this: most of the time I do not do what I've shown you, which is copying code to the clipboard and then going to the console and typing reprex(). There are two RStudio add-ins that really accelerate your reprex life. One of them is called Render reprex, which launches a GUI; I'll show you that in a second. The other, Reprex selection, literally reprexes the code that you have selected, and it's absolutely conceived for use with a keyboard shortcut. RStudio lets you modify your keyboard shortcuts, so I have bound that add-in to Shift-Command-R. This is how I usually use reprex. Hadley, for example, also uses it a lot, and he has bound it to something else. Let me go show you the add-in.

So again, I could select the snippet of code that made that figure and launch the add-in. This allows you to specify, by clicking, a lot of the things that you can specify in the call. I am going to take the source from the current selection, let's target Stack Overflow, and yes, let's append session info. Click render, the usual things happen, and the usual output appears down here. Stack Overflow doesn't have the capability to support this little folding toggle, so the session info actually gets dumped in there in its full glory. So these are two other ways to get your input into the reprex() function that are probably more humane than typing it all the time.

The human side of asking for help

All right, the last thing I'll talk about, and I'll try to go quite quickly because we're already at 45 minutes, is the human side of making reproducible examples. This has nothing to do with the reprex package; it's just about asking questions so that they actually get answered. I like this image because it conveys that, you know, we're talking about programming, so maybe we're all supposed to be acting like robots, and people often seem to assume that they're talking to robots, but there are a lot of humans involved in this process.

And I want to warn you, I'm giving a little bit of tough love here. There's been a lot of talk, although probably still not enough, of experts being empathetic to newcomers and question askers. But since this is a talk targeted at people asking questions and preparing examples, I also want to say it has to go in the other direction as well. So bear with me for a moment here, but I need to say, with all the love in the world: sometimes people come with a question and they have a very rigid theory about what's wrong or how they should be solving a problem. But if your theory about what was going wrong was so great, you wouldn't be here asking this question right now. This is the origin of why people really want to see code instead of having a prose discussion, because it's sometimes very hard to tell what people are really talking about.

The other life phenomenon I want to link this to is this. I don't know if you've had this experience, but if you've ever tried to help, for example, one of your relatives sort out a computer problem over the phone, it can be extremely difficult. A lot of what they're saying doesn't really make sense, they don't use the words you're used to for referring to things, and you just feel like you can't really get a grip on things. This is basically what it feels like when you're trying to answer someone's programming question based only on English prose. And again, this is why people constantly push you to actually just show a small piece of code: it removes all sorts of ambiguity.

So let's assume that everybody, the question asker and the question answerer, is acting in good faith, and if they're not, then they're irrelevant to me. Okay, so everyone's in good faith. It turns out that experts posting on public sites actually are afraid to post code that doesn't work. So another reason why these people want to see your code is that they're not just reading it and guessing. Most of these people are actually running your code, proving that their proposed solution works, and then posting it when they know that it's safe to do so. This was a big revelation to me. I really used to think that the people I looked up to as experts just knew all this stuff by heart and were answering all these questions off the cuff. Then it gradually dawned on me that part of why they're experts, part of expert behavior, is that they are constantly running lots of small examples and experiments. So sharing your problem in code is extremely fruitful.

Making a good reprex is a lot of work. Sometimes you think, I can only show my problem in my R session, and I haven't restarted R for seven months, and it requires the full data set from my thesis. And that is in fact true, it is a lot of work. But you're asking other people to solve a problem, so this is part of meeting them halfway. It turns out you get a lot out of this as well, so let's be very selfish. If you make a good reprex out of your hairy, messy problem, and if you reproduce other people's problems (even just reproducing other people's problems is a real service, and sometimes you're gonna be able to solve them), it turns out this discipline is like playing scales, or, you know, serving over and over again. You actually get better at programming by doing this.

And the last selfish point that I'll make is this. It turns out that when you sit down to make a good reprex out of your problem, and you keep it self-contained, and you strip down your giant hairy data set to the smallest data set that reproduces the problem, it is amazing how often you end up answering your own question in the privacy of your own home, and you didn't have to make yourself vulnerable to other people. This is a great revelation, and I think the reason it works is that when you have a problem, it's very easy to just keep going in circles and banging your head against the desk. But there's something about preparing it for other people, and the reprex package is also a real hard-ass about making sure that your problem is self-contained. It kind of knocks you out of that very unproductive place and gets you back on the path of actually working the problem. Most people report this when they first start making reproducible examples: it's kind of amazing how often this exercise means you actually answer your own question.