
Hadley Wickham | testthat 3.0.0 | RStudio (2020)
In this webinar, I'll introduce some of the major changes coming in testthat 3.0.0. The biggest new idea in testthat 3.0.0 is the idea of an edition. You must deliberately choose to use the 3rd edition, which allows us to make breaking changes without breaking old packages. testthat 3e deprecates a number of older functions that we no longer believe are a good idea, and tweaks the behaviour of expect_equal() and expect_identical() to give considerably more informative output (using the new waldo package). testthat 3e also introduces the idea of snapshot tests which record expected value in external files, rather than in code. This makes them particularly well suited to testing user output and complex objects. I'll show off the main advantages of snapshot testing, and why it's better than our previous approaches of verify_output() and expect_known_output(). Finally, I'll go over a bunch of smaller quality-of-life improvements, including tweaks to test reporting and improvements to expect_error(), expect_warning() and expect_message(). Webinar materials: https://rstudio.com/resources/webinars/testthat-3/ About Hadley: Hadley Wickham is the Chief Scientist at RStudio, a member of the R Foundation, and Adjunct Professor at Stanford University and the University of Auckland. He builds tools (both computational and cognitive) to make data science easier, faster, and more fun. You may be familiar with his packages for data science (the tidyverse: including ggplot2, dplyr, tidyr, purrr, and readr) and principled software development (roxygen2, testthat, devtools, pkgdown). Much of the material for the course is drawn from two of his existing books, Advanced R and R Packages, but the course also includes a lot of new material that will eventually become a book called "Tidy tools"
Transcript
This transcript was generated automatically and may contain errors.
So today I am excited to talk about the 3rd edition of testthat. I'm going to start by explaining a little bit about what I mean by that: you certainly know what a version of a package is, but what do I mean by this being the 3rd edition of testthat?
Then I'm going to talk through three big new features in this edition: how testthat now displays differences much more clearly using the waldo package; a new type of test, snapshot tests; and, very briefly, testing in parallel across multiple processes. Finally, I'm going to give you a real quick goody bag of a bunch of small stuff that will hopefully make your life a little bit easier when using testthat.
The edition concept
So the big news with this upcoming release, testthat 3.0, is that we're going to introduce the idea of a 3rd edition. And I should say that anything you hear about today is also explained in more detail in vignettes on the testthat website, so if you want to learn more about anything I say today, feel free to take a look at those vignettes, and if they don't help you understand what's going on, please file an issue so we can make them better.
So the basic idea of an edition is that we want to make breaking changes to testthat. testthat is about 10 years old, and there are a bunch of things in the package that we now regret. But testthat is kind of a victim of its own popularity: something like 4,000 or 5,000 packages on CRAN use testthat, and many of them are not actively developed. So there's no way we can change testthat without breaking a bunch of packages, and since many of those packages aren't maintained, they'd just end up off CRAN because no one would have the energy to fix them.
So the idea of an edition is that it's something you have to deliberately opt into. If you want to use the new features of testthat I talk about today, you have to deliberately choose to use the 3rd edition, and to do that you basically add a line to your DESCRIPTION file: Config/testthat/edition. If you don't do this, you'll continue to use the 2nd edition, which is the existing behaviour of testthat. If you do, you'll get a bunch of new features, and a bunch of things, which I'll explain shortly, will have gone away.
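As a sketch, the opt-in is a single field in your package's DESCRIPTION file:

```
Config/testthat/edition: 3
```

Removing the line (or setting it to 2) keeps the existing 2nd-edition behaviour.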
The idea is that it should hopefully take less than 30 minutes to convert a package to the 3rd edition. It's a little bit of work, hopefully not too much, and it generally gets you up to date with our current best practices. And I should say that if this edition idea is successful, it's something we're likely to try out in other packages: a way of allowing existing code to continue working while giving you the choice to opt into a new set of behaviours.
Editions will always be coupled with a major version. That means testthat 2 and everything before it counts as the 2nd edition; testthat 3.0 uses the 2nd edition by default, or you can opt into the 3rd edition. And if one day, probably multiple years down the line, we decide to introduce a 4th edition, that would be tied to testthat 4.0.
What's going away in the 3rd edition
So what's going away in the 3rd edition? These are the things most likely to force you to change your tests. First, the context() function is going away. We've been moving away from it for quite some time in favour of just using the name of the test file; there's no need to duplicate that information in another function that you have to remember to update whenever you move your test files.
We've also lately been building, in devtools and usethis, a stronger coupling between files in the R directory and files in the tests directory, so that there's a one-to-one correspondence, which gives you a bunch of handy keyboard shortcuts and development tools.
The very old expect_that() function is going away. This was testthat's first, excessively clever API, where every test read like a sentence: "test that blah is true". I pretty quickly decided that was too clever and not very good, and we're finally deprecating it for good.
expect_is() is deprecated in favour of expect_s3_class(), expect_s4_class(), and expect_type(), just to be more precise about what exactly you're trying to express. expect_equivalent(), which I'll talk about a little later, basically becomes an argument to expect_equal() or expect_identical(). Some of the mocking functions are going away in favour of more featureful packages. And finally, we're moving away from setup and teardown in favour of test fixtures, which I'm not going to talk about today, but there's a whole vignette about them if you want to learn more.
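A sketch of the replacements, assuming the 3rd-edition API (the data frame here is invented for illustration):

```r
library(testthat)
local_edition(3)

df <- data.frame(x = 1:3)

# Instead of expect_is(df, "data.frame"), say exactly what you mean:
expect_s3_class(df, "data.frame")  # S3 class
expect_type(df, "list")            # underlying base type of a data frame

# Instead of expect_equivalent(a, b), make ignoring attributes explicit:
expect_equal(c(a = 1), c(b = 1), ignore_attr = TRUE)
```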
And hopefully that's all I need to say: these are the functions that are going away. If you really, really believe any of them are important, now is the time to speak up and say you want to keep them. Otherwise, when you switch to the 3rd edition and rerun your tests, you'll get deprecation messages telling you what to replace these functions with.
Waldo: clearer comparison output
So that's the stuff that's going away; now I want to focus on the cool new features. The feature you're most likely to encounter, and the most likely to give you pleasure in your life, is that test failures from expect_equal() and expect_identical() are now much easier to understand.
Previously, expect_equal() used the base all.equal() function. This was never really what all.equal() was intended for, but it did an OK job. There are a few cases, though, where it just doesn't give very useful output. Here I'm comparing mtcars without its first column to the full mtcars data set, and I get a lot of output: it tells me there are 10 string mismatches in the names, which doesn't really help me; it tells me there's a length mismatch; and then it gives me a bunch of mean relative differences. This certainly tells me that there's a difference, but it doesn't concisely point me to the fact that a column is missing from this data frame.
So now we use waldo. To demonstrate, I'm going to use a function called local_edition(). This is not something you generally need in your tests, because you'll convert the whole package to the 3rd edition, but we provide local_edition() for writing things like this presentation, and for vignettes explaining the new edition. It just temporarily sets the edition to the 3rd edition.
Now we're using the 3rd edition, and this is the new waldo output. The first principle is that waldo gives you the most important differences first, and in this case, I think the most important difference is that the lengths are different. Then it tells us how the names have changed, lining up the actual names with the expected names, and it uses colour to highlight the fact that the expected output has an extra column called mpg. And finally, down here, we get an exact description: the names are different, but the values are also different.
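As a sketch, the comparison just described looks roughly like this (mtcars is a built-in data set; run interactively to see waldo's coloured diff):

```r
library(testthat)
local_edition(3)

# 2nd edition: all.equal()-style "string mismatches" and "mean relative
# differences"; 3rd edition: waldo points at the missing column by name.
try(expect_equal(mtcars[-1], mtcars))
```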
So: I've switched to the 3rd edition, and waldo shows the most important differences first, uses colour to help highlight differences, and always uses the names of things where possible, because that makes it easier to understand exactly where the differences lie.
Another little example: here I have a factor, compared to an ordered factor with one extra level. When you look at the expect_equal() output from the 2nd edition, it kind of tells you they're different, but because it doesn't include the values, it's hard to know exactly what's changed. When we switch to the 3rd edition, we can see there's a new element of the class vector, "ordered", and a new element of the levels, "d", which hopefully makes it much, much easier to see exactly what's gone wrong. And certainly in my own work, this is the number one reason I've been converting packages to the 3rd edition of testthat: it makes it so much easier to see when something's gone wrong.
If you want to learn more about waldo, you can go to the waldo website, which shows a bunch of examples and some of the principles. A lot of the diffing of values is powered by a really nice package called diffobj, by Brodie Gaslam, which uses the same algorithm as the Unix diff utility, and makes it really easy to narrow in on exactly what has changed between two vectors.
This change also makes it possible to be precise about the difference between expect_equal(), expect_identical(), and expect_equivalent(), which was always a little vague in the past. Now all of these functions are equivalent to expect_identical() with some extra arguments set. expect_equal() is equivalent to expect_identical(), except that it ignores small floating point differences. And expect_equivalent(), which is now deprecated, is the same as expect_equal() with ignore_attr = TRUE; ignoring attributes is the only thing it does, and now we make that precise in a function argument. Any other arguments to expect_identical() or expect_equal() are passed on to waldo::compare(), which gives you the ability to fine-tune your comparisons as needed.
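A sketch of those relationships under the 3rd edition:

```r
library(testthat)
local_edition(3)

# expect_equal() is expect_identical() plus a numeric tolerance:
expect_equal(1, 1 + 1e-10)           # passes: within the default tolerance
try(expect_identical(1, 1 + 1e-10))  # fails: compared exactly

# the deprecated expect_equivalent(a, b) becomes:
expect_equal(c(a = 1), c(b = 1), ignore_attr = TRUE)

# any other arguments (e.g. tolerance =) are passed on to waldo::compare()
```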
Now, if this is such a great idea, you might wonder why it can't work for the 2nd edition as well. Unfortunately, when I implemented expect_equal(), it turns out I made a rather silly mistake in the tolerance comparison: depending on exactly which code path it goes down, it computes the tolerance in slightly different ways. One path always uses absolute tolerance, and the other uses absolute or relative tolerance. I did a few experiments, and regardless of which behaviour I pick, it causes hundreds of CRAN packages to break. That's the main reason this change has to be part of the 3rd edition.
Snapshot testing
The next big feature I want to talk about is snapshot testing. Again, this comes with a vignette, so if you want to learn more about it, you can read the vignette, and if the vignette doesn't make sense, please file an issue so we can make it better. The basic idea is that this is a new type of testing.
Normally in unit tests you describe the expected output using code, and in the vast majority of cases this is a really, really good idea: because tests are code, and code is a means of communication, the tests help document the expected behaviour of your functions. But sometimes describing the expected output is just really annoying. If it contains a bunch of special characters, like quotes or backslashes, you have to spend a bunch of time carefully escaping them, and then when something goes wrong, you've got to unescape them in your head, which is just a pain. Or maybe it's very large, like an entire HTML page or multiple paragraphs of text. Or maybe it's not even something you can easily describe with text: it's an image.
So that's the idea of a snapshot test, or golden test, an idea used in other programming languages and other testing frameworks. testthat draws its inspiration primarily from Jest, a JavaScript testing framework; Joe Cheng shared his experiences with it a bunch and persuaded me that this was something really useful.
The key idea of a snapshot test is that instead of recording the expected results inline in the test itself, they're stored in a separate file, and testthat provides a bunch of tools for managing that file: it creates it automatically the first time you run the test, and it gives you tools to update it when you decide there really should be a change. And if you've used verify_output() or expect_known_output(), which we never really advertised because we were never particularly happy with them, snapshot tests basically supersede those functions.
So what does a snapshot test look like? I'll give you a quick simulation in the presentation, and then show you what it looks like in a real package. Here I've got two files. In foo.R I've got a very silly, simple function, which has a mistake in it that we'll fix shortly, and then I've got a test: I run foo(), I expect that it returns a character vector, and then I use this new expectation, expect_snapshot_output(). Notice that it doesn't say what the expected value is, because that's going to be saved to disk.
The first time I run this, it warns that it's creating a new snapshot, and it creates a new file. Inside the tests/testthat directory it creates a new directory called _snaps, and inside that there's a file called foo.md. So our R file is called foo.R, our test file is called test-foo.R, and the snapshot file is called foo.md. All of these follow a strong naming convention, so you can easily find which snapshot corresponds to which test, which corresponds to which R file.
So what does that snapshot contain? It's a markdown file; I'll explain the syntax shortly, but we use a heading to indicate the test, and the output of that test goes directly into the file. Now if I run that test again, nothing has changed, so the test passes. If I change the function to fix that typo and run the test again, I get a failure, telling me that the previous value (with the typo) no longer matches the new value, "something complicated".
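A sketch of the test just described; foo() and its output are hypothetical stand-ins for the function on the slide, and the recorded value lands in tests/testthat/_snaps/foo.md:

```r
# R/foo.R (hypothetical)
foo <- function() "something complicated"

# tests/testthat/test-foo.R
library(testthat)
test_that("foo() output is stable", {
  x <- foo()
  expect_type(x, "character")
  # no expected value here: it is recorded in _snaps/foo.md on first run
  expect_snapshot_output(print(x))
})
```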
Now, the downside of snapshot tests is that there's no way for testthat to know which value is correct, so you as a human have to step in and intervene. If you really did mean to make this change, you run snapshot_accept(), and that accepts the change. The way this works is that when a snapshot changes, testthat writes a new markdown file containing the new value; if you accept it, that file replaces the old one.
So let's dive into a slightly more realistic example, starting with a simple one. I have a package containing the same test I showed before, and I'm going to run it by pressing Cmd+T, the keyboard shortcut for devtools::test_file(). That shortcut takes advantage of the naming convention: if you've got a file called foo.R, the corresponding test file will be test-foo.R.
This says I've added a new snapshot, and this is its value; I can look at it, and there's that markdown file again, which I'll explain shortly. And if I run the test again, you'll see that all the tests pass.
The other thing, if you've used devtools::test_file() before, is that there's now a slightly more compact display, to hopefully make it easier to run the tests for a single file interactively.
So I'm going to change this, correcting that typo, and test again, and now the test fails because the snapshot has changed. It shows you the current value, "something complicated", and the previous value, with the typo. If this is a deliberate change, you run snapshot_accept(). If it wasn't deliberate, oh, that was just a typo, I fix it, rerun, and all my tests pass again.
If you've ever used verify_output(), this is a little different, because verify_output() automatically updates the known value on disk, which basically forced you to use it with Git. Here, when the test fails, you can see in my _snaps directory that I've got foo.md, the previous value, and foo.new.md, the new value.
Okay, so let's look at a slightly more complicated example. So this is a bullets function, and it's basically used to create HTML bullets.
And this is a little annoying to test, because if you were going to put the expected output in a test, you'd have to escape all of these newlines and carefully manage all the whitespace; it's just a pain. It's a little easier to test this code with a snapshot test.
So I run this test, and here is a snapshot file with multiple expectations in it. The first uses bullets() with a single bullet, and the second also has a single bullet but sets the id as well. If I later want to change this bullets function, maybe I decide I don't want this indent, so let's get rid of it and rerun the test. Again, the failure uses waldo to highlight what's changed: it's a little hard to see, but what's the same is in grey, and we can see there's now no space where there was before.
I look at this, decide that yes, it's a deliberate change, and then run snapshot_accept("bullets") to update that snapshot.
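The accept step, as a sketch; snapshot_accept() takes the snapshot file name without its extension, and the guard here just keeps the example inert outside a package directory:

```r
# Run from the package root after reviewing a failed snapshot diff.
# Accepting replaces _snaps/bullets.md with _snaps/bullets.new.md.
if (dir.exists("tests/testthat/_snaps")) {
  testthat::snapshot_accept("bullets")  # accept one file's snapshots
  # testthat::snapshot_accept()         # or accept everything pending
}
```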
And that, basically, is snapshot testing. Of course there's more documentation, and I think it will take a little while to get your head around it all. I've shown you expect_snapshot_output() here. There's also expect_snapshot(), which, as well as printed output, captures messages, warnings, and errors; expect_snapshot_error(), if you want to capture specifically just an error message; and expect_snapshot_value(), which captures a return value. That last one is a little different: if, for example, you want to test the output of a complicated function and just make sure it doesn't change without warning, you can use expect_snapshot_value().
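A sketch of the four variants just listed (the values and messages are invented examples; run inside a package so the snapshots have somewhere to live):

```r
library(testthat)

test_that("snapshot variants", {
  # printed output only:
  expect_snapshot_output(letters[1:3])
  # code plus output, messages, warnings, and errors, reprex-style:
  expect_snapshot({
    message("loading data")
    1 + 1
  })
  # just an error message:
  expect_snapshot_error(stop("no such column"))
  # an arbitrary return value, serialised into the snapshot file:
  expect_snapshot_value(list(a = 1, b = 2), style = "json")
})
```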
The only other thing to mention is what these files look like, and why they're markdown files. If your package is a collaborative project and you're using GitHub or some other tool where you do code review, it's really important that these snapshots be human readable, because when someone reviews your code, they need to look at the snapshot and decide whether the change is reasonable. So they use markdown. There's one snapshot file per test file, which normally corresponds to a file in the R directory; there's a heading for each test name; and if you have multiple snapshot expectations in a single test, they're separated by a horizontal rule of three dashes.
One thing I'm working on at the moment, with some help from Joshua Kunst, is a Shiny app that will help you review all of these differences and accept or reject them by clicking buttons rather than typing at the console. The other big part of that is providing tools for image snapshots as well. We've implemented this in two places already: shinytest, which is used for testing Shiny apps, and vdiffr, which is used for testing ggplot2. The idea is to pull out the common code, centralise it in testthat, and invest a bunch in this whole snapshotting idea, so you've got a really nice workflow if you do need to do image tests.
Image tests are complicated because they can change for all sorts of reasons unrelated to your code, but sometimes they're all you have, and they're a really important part of testing both Shiny itself and ggplot2 itself.
OK, so that's snapshot tests. The main idea is that, compared to regular unit tests, which keep the expected results in the test file, snapshot tests store the expected results in a separate file, which makes them suitable for testing large output, output with quote marks and backslashes in it, and things you simply cannot describe with text, like images.
Parallel testing
Next, I want to talk briefly about parallel tests. This is still a work in progress by Gábor Csárdi, and again, you'll have to activate it specifically for your package by putting another line in your DESCRIPTION: Config/testthat/parallel. The payoff is pretty big: it runs your tests on multiple processes, so if you have long-running tests, this can make a big improvement to the total running time of your test suite.
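As a sketch, the opt-in is another DESCRIPTION field:

```
Config/testthat/parallel: true
```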
Now, the downside is that there's some overhead in starting multiple R processes and loading all your code into each of them. So if your tests are very fast, say they all run in under a second, it's probably not going to have a huge impact. But if you've got tests that take five or ten seconds, and the whole suite takes minutes, this should hopefully have a really big positive impact on your workflow.
There are other downsides alongside the big upside of speed. Tests will now effectively run in stochastic order, because you'll have, say, four processes, and each process takes the next test file in the queue; depending on exactly how long each test takes to run, the tests might run in a different order each time. This means that if you have any dependency between your test files, which is relatively easy to introduce because they normally run in alphabetical order, you'll get random test failures that occur sometimes and not others, depending on the order the tests ran in. In other words, a debugging nightmare.
So we're still thinking about tools to help if that happens to you, some debugging mode you can switch on to get more insight. And if you've used global setup or teardown, for example setting up a database or a CSV file for all of your tests, you'll need to think that through in a little more detail, because those setup and teardown files are now run by multiple processes. So there will be a little bit of work to convert your tests to run in parallel. We're still working on this, and there's a vignette if you want to learn more and try it out. But we're hopeful this will make a big impact if you've got long-running tests.
Goodies: reporters and other improvements
So far I've talked about the idea of the 3rd edition, a special mode you have to switch on if you want the latest and greatest testthat features. We've talked about waldo, which makes test failures from expect_equal() and friends much, much easier to understand; about snapshot tests; and about running tests in parallel. Now I just want to show you a bunch of little features that I think are cool.
In the course of working on testthat for this release, I made a bunch of improvements to the reporters. Reporters are the things that actually display the results: when I test a single file, that's done by one reporter, and when I press Cmd+Shift+T to run all the tests, that's a different reporter.
One reporter that you never normally call explicitly is the stop reporter. That's the reporter that runs when you run a test interactively. Here I'm running a test which is not working, probably for reasons I need to look into.
But if I create another example over here and run this test, this is the stop reporter. It now clearly tells you that your test passed, and it gives you some emoji. If your test does not pass, it nicely displays the failures, and it also displays any warnings, and, really conveniently, the backtrace for each warning. So if a warning occurs in any of your tests, you get a full backtrace and can figure out exactly where it came from.
One of my pet peeves is this: I have partial-matching warnings turned on, and if one occurs somewhere deep inside a function, inside a function, inside another function, it's really hard to track down. Now, in your tests, you get a nice backtrace, so you can figure out exactly what sequence of calls leads to the problem.
OK. So that's the stop reporter, which is used when debugging: it now uses colour and emoji, and generally gives you more information about problems when you're running tests interactively. I also showed you earlier the compact progress reporter, the single line here, which gives you a running progress bar of all the tests that are running, so you know exactly what's going on.
I've also updated the regular reporter, which you see most often when running all the tests in your package. The biggest change is that I've added a bunch of new praise with more emoji, because I think emoji are fun. The random praise veers a little into dad-joke territory, but hopefully it's a fun little feature of testthat that keeps you motivated and keeps you going when your tests aren't working well.
And finally, the last reporter is the check reporter, which runs inside R CMD check. It now reports all of the problems, including warnings, and it reports all your skipped tests by type, which is really useful for checking that you haven't accidentally skipped tests you meant to run. It also creates an RDS file with a machine-readable list of all the test results. That's probably not something you'll use directly, but it is something we'll start to build into our tooling, so that things like GitHub Actions can give you nicer displays of your test results.
We've also made a few changes to the way the condition expectations work. Here I have a function that calls warning() and message(). Now, if a condition, a message, warning, or error, is not explicitly caught by your expectation, it continues to bubble up. So if I run this line, the expectation passes: it finds the warning "hi", but the message "bye" still bubbles up. Or, if I expect the message "bye", I'll see the warning "hi". If I want to capture both of them, I need to use expect_message() and expect_warning() together. This hopefully gives you better control over exactly what's going on, though it's going to cause a few more warnings in your tests.
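A sketch of the behaviour just described, with an invented function that signals both conditions:

```r
library(testthat)
local_edition(3)

f <- function() {
  warning("hi")
  message("bye")
}

# catches the warning, but the message still bubbles up:
expect_warning(f(), "hi")

# to handle both conditions, nest the expectations:
expect_message(expect_warning(f(), "hi"), "bye")
```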
These won't cause your package to fail R CMD check, but they will require a little bit of work to make sure those tests behave as you expect. If you just want to ignore the conditions, you can switch from expectations to the base functions suppressMessages() and suppressWarnings().
Now, you might notice here that we've got expectations nested inside expectations, and you might naturally think: why can't I use the pipe for that? Unfortunately, you currently can't, because the pipe eagerly evaluates everything. If I do this, you get a "bye" and a "hi", and then the expectation fails, because the function is evaluated before the expectation is called. There's new work underway on magrittr that will make this work, and hopefully make magrittr a little more compatible with the native pipe that's likely to appear in the next version of R.
OK, so that also means that if you're one of the, I don't know, ten people in the world who use the all argument to expect_warning() or expect_message(), that's now deprecated. You'll have to take a slightly different approach, but I'm pretty confident the new API is much nicer overall and shouldn't change existing behaviour too much.
And the last thing: if you've ever used expect_error() and seen the message suggesting you set the class, we no longer encourage that. It fixed one type of test fragility at the cost of introducing a different type, so now you're better off using expect_snapshot_error() if you want to check that a specific error occurs, because that gives you a bunch of nice features for managing the change over time.
Summary and next steps
OK, I'm just about to wrap up, which is great, so there's plenty of time for questions. What have I talked about today? First, testthat 3.0 is coming out soon. We'll probably start the release process in about a month, which means it's at least two months before it's on CRAN. It would be really great if you'd try it out, and this is the right time to let us know if it's causing you pain, so we can fix it before release.
Plenty of time to do that. The 3rd edition will be part of testthat 3.0, which will be on CRAN in two months at the soonest, likely a little longer. You have to deliberately opt into the 3rd edition, which gives you a bunch of new features at the cost of a little bit of work cleaning up old APIs.
The new edition uses waldo to make comparisons, which will hopefully make your test failures much easier to debug. It provides snapshot tests, which are an alternative form of testing where the expected results are stored in separate files rather than inline in code, and then testthat provides some functions to help manage those files so that you can accept changes when you have made them deliberately, or revert changes when you've made them accidentally.
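As a small sketch of what a snapshot test looks like (`my_fun()` is a hypothetical function under test), the expected output is recorded in a `.md` file under `tests/testthat/_snaps/` on the first run, rather than written out by hand:

```r
library(testthat)

test_that("my_fun() gives a useful error for bad input", {
  # error = TRUE tells expect_snapshot() that an error is expected;
  # the error message itself is recorded in the snapshot file.
  expect_snapshot(error = TRUE, my_fun("not a number"))
})
```

When the recorded output changes, the test fails with a waldo-style diff, and you accept the new snapshot or revert the code depending on whether the change was deliberate.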
testthat 3.0 will also introduce parallel testing which, again, will take a little bit of setup work, just to make sure your global setup and teardown works appropriately and you don't accidentally have any dependencies between your tests. But the payoff is that if you have slow tests, they should run much, much faster because they'll run in parallel. And then, finally, I showed you a bunch of goodies, many of them featuring an emoji, that will hopefully make your day-to-day life using testthat a little bit more fun, a little bit more pleasant.
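Like the edition itself, parallel testing is opted into per package via a DESCRIPTION field (a sketch of the mechanism at the time of writing):

```
Config/testthat/parallel: true
```

Each test file then runs in its own worker process, which is why per-file setup has to be self-contained and tests can't depend on state left behind by another file.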
So if you do want to try it out today, you'll need to get the development version of testthat and the development version of devtools. And then in your package, if this is a package that uses continuous integration or similar, you'll need to add testthat to Remotes. And you'll need to opt into the third edition to take advantage of all the latest and greatest features.
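A minimal sketch of that setup, assuming the development versions live in the r-lib GitHub organization:

```r
# Install the development versions from GitHub
# install.packages("remotes")
remotes::install_github("r-lib/testthat")
remotes::install_github("r-lib/devtools")
```

And in your package's DESCRIPTION, so continuous integration also installs the development version:

```
Remotes: r-lib/testthat
```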
So that was a lot of content in 40 minutes. Hopefully, if you didn't take in everything I was saying, there's plenty of additional material in all of the vignettes. And if you don't find those understandable, please let me know so we can make them better. Thank you. And now, hopefully, Jenny will have some questions for me.
Q&A
So the first big one is, I think people want to hear more about why you're doing a third edition. So why not create an entirely new package, the same way that plyr led to dplyr? Or why not create testthat3? Or why can you not do this through semantic versioning of the existing package?
Okay. So the first question is, like, why not a new package? And that was something I considered. But, like, 90% of the code between the second edition and the third edition is the same. The biggest difference with the third edition is it just takes stuff away from you, stuff that we now regret. So if we created a new package, we'd have two packages with 90% overlap in the code, and whenever there was a bug, we'd have to remember to fix it in both packages. So I think for this case, where most of the stuff is the same and we're just trying to get rid of some old things, creating a new package would be overkill.
On the question about semantic versioning: testthat does use semantic versioning; this release is 3.0.0. But that doesn't really help in the R community, because you only have a single version of a package installed in a library. You can kind of work around that, but it's a little fiddly. And packages on CRAN always use the latest version of a package on CRAN. So if I just released testthat 3.0 with all of these new features in it, something like 500 packages would break on CRAN, because they would automatically use the new version of testthat. So that's why we want a little bit of friction: you have to deliberately choose to use these new conventions and basically stop using old stuff that you should have stopped using a while ago, without us duplicating a bunch of code that we'd then have to maintain in two places.
So now I have a series of questions, and the list is growing longer, that are smaller, and we'll just work through as many as we can. This first one is: have you tested whether snapshotting plays nice with things like covr or the goodpractice package? Does this third edition work have any effect on shinytest? So, how this works with other packages.
Yeah. So covr: I mean, snapshot testing is just testing. It works exactly the way you'd expect with covr. I don't know that much about goodpractice; I don't think it would cause problems for goodpractice. shinytest: it's not clear precisely how this will play out. I've been having some bigger discussions with the Shiny team about how testing Shiny should work, because my vision for testing Shiny is quite different from their vision of testing Shiny apps, and we're working towards that. I think in an ideal world, eventually shinytest would use more of this new snapshotting infrastructure that testthat provides. But there are some other big changes that have to happen to shinytest first, so it's hard to say today whether shinytest will eventually integrate this better, or whether there'll be a new successor package to shinytest that's more focused on the snapshot testing provided by testthat.
All right. The next one is, do test snapshots need to be included in the package or could you exclude them if they're large and then developers can recreate them from a stable version?
I think if you want people to be able to run your tests, you have to include them in your package, because they are literally the correct output. One thing I did forget to mention is that snapshot tests will be skipped on CRAN automatically. You can opt out of that with an argument if you want. That's because if a snapshot test fails, it isn't necessarily a real, meaningful failure; it might be an incidental difference, or a difference from a downstream package, that you don't want to cause a failure on CRAN. But yes, if you want people to run your tests, you have to include the expected results in the package.
All right. Is it possible to use snapshot tests if the function does not return anything, but rather has a side effect? So I mentioned briefly that expect_snapshot() captures the side effects of messages, warnings, and errors. If it's making other changes to the global state, you would need to make that explicit. For example, if your function attaches a package, you might do expect_snapshot(search()), which would capture the search path. Or maybe it's creating files in a directory. So it doesn't directly handle side effects apart from messages, warnings, and errors, but you could easily wrap up a little function that captures those side effects and makes them explicit, and snapshot that.
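As a sketch of the wrapping idea (`attach_pkg()` is a hypothetical function whose only job is the side effect of attaching a package):

```r
library(testthat)

test_that("attaching changes the search path", {
  # Hypothetical function under test: its side effect is attaching a package
  attach_pkg("stringr")
  # Make the side effect explicit by snapshotting the state it changed:
  # the search path is printed and recorded in the snapshot file.
  expect_snapshot(search())
})
```

The general pattern is the same for any state: write a small expression that prints the state your function mutated, and snapshot that expression.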
All right. The next person missed a little bit about what's going on with context(), but sees that it's deprecated and thinks that it's needed when using the JUnit reporter. So how are those two things going to work together?
Okay. So this is just per package. I just searched to find all the uses of context(), and you can see the test-lmap.R file contains a call to context("lmap"). test-mapper.R contains a call to context("as_mapper"), which is mildly inconsistent. test-win.R contains a call to context("win"); test-utils.R contains a call to context("utils"). So by and large, it's a one-to-one mapping between what you put in context() and the name of the file. And you just don't want that duplication, because you end up with inconsistencies.
Well, maybe if I can find one... here we can see that this is a test for pmap(). It probably used to be test-pmap, but we renamed the file and forgot to update the context, and so now there's this mismatch between the two. So that's the reason why we're getting rid of context(): it just introduces duplication for little real gain. But the context is still there implicitly; we just take the file name. So the context that the JUnit reporter sees, for example, will just be derived from the file name, so it shouldn't change anything practically with the JUnit reporter.
Okay. So this is a person, like me, who uses with_mock() and local_mock(). If they're going to be deprecated, what is the suggested way to mock functions in external packages? I read that the goal of the mockr package is to provide a drop-in replacement, but it does not have this feature.
Yeah. So basically, the way that with_mock() works is truly an abuse of R. It does something that I now believe to be tremendously ill-advised, and the chances of it stopping working in a future version of R are high, because what it does is so bizarre and horrific.
So basically, you have to use another approach. Both the mockr and mockery packages provide slightly different techniques; unfortunately, you just have to use those. I'm just so concerned that the way with_mock() works is not good, and it's surprising that it hasn't caused any problems to date. So yeah, unfortunately, it's something that is really nice and really useful, but was achieved in a really, really horrible way.
Does snapshot testing have a limit on the size of the data? And can you combine snapshot tests with more explicit tests of the snapshot?
Yeah, so there's no limit on the size. And I think, generally, you will want to mingle snapshot testing with regular testing. For example, if the result is a list, you might want to test that one component is a specific value, while another component is just a big blob that you don't want to type out. So I'd imagine you will want to mingle those as much as possible.
I think it will be tempting to overuse snapshot testing, just because it's so convenient: you don't have to write the right-hand side of expect_equal(), you don't have to make clear in your head exactly what you're expecting. But I think if you use snapshots too much, you'll find your tests are a little bit fragile and break more often than you'd otherwise like, and that will naturally push you back towards the finer-grained unit testing that you're used to in testthat.
Of course, putting really large files in your package is going to be annoying for other reasons, so I don't think you want to put megabytes of data in snapshots. But kilobytes of data, which would be a pain to put in a test file, is a really nice sweet spot for snapshot testing.
Okay, I have two questions that are a little bit of emoji pushback, not to

