Resources

Hadley Wickham | {purrr} 1.0: A complete and consistent set of tools for functions and vectors

{purrr} has reached the 1.0 milestone, with new features like progress bars, improvements to the map family, and tools for list flattening and simplification.

0:00 Introduction
0:11 What is purrr?
00:32 What is functional programming?
03:08 Announcing purrr 1.0
03:58 Progress bars
05:18 Better error messages
07:18 New map function: map_vec()
09:58 New list_* functions
12:04 Flattening and simplification
17:40 Breaking Changes
22:34 How the tidyverse handles deprecation
24:41 An overview of functional programming
26:22 Closing, resources to help with deprecation, how to submit issues

See more in the {purrr} 1.0.0 release blog post! https://www.tidyverse.org/blog/2023/03/tidyverse-2-0-0/


Transcript

This transcript was generated automatically and may contain errors.

Hi, I'm Hadley Wickham, the Chief Scientist at Posit, and for the purpose of this video, the chief maintainer of the purrr package. So what is the purrr package? It's kind of hard to put into words what purrr is, to be honest, but I think I've got a pretty good sense of what feels like a purrr function to me. I think the best way to think of it is as a toolkit for supporting functional programming.

What is functional programming?

What is functional programming? That's also kind of hard to explain, but the chief advantage is it gives you a bunch of tools for operating on vectors, or pairs of vectors, or triplets of vectors, and so on, where you work on each individual element of the vector in isolation. And that allows you to apply a function to do something to each element, knowing that there's no way that the other elements are going to affect the computation at all. So many of the functions of purrr are alternatives to for loops.

For loops are not by any means a bad thing. You should never be embarrassed or ashamed of using a for loop. For loops are great because they're very, very concrete. It's very obvious what you're doing, because you're saying: take this element and do this to it, take that element and do that. And you can very easily think about stepping through that. So the functions in purrr, particularly the map functions, are kind of a step up in abstraction. And that step up is a challenge, right? But I think there are some good reasons to do it.
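To make that step up in abstraction concrete, here is a minimal sketch comparing a for loop with the equivalent map call. The input vector and the doubling function are made up for illustration and are not from the video.

```r
library(purrr)

# Hypothetical input, chosen only for illustration
x <- c(1.5, 2, 10)

# For loop: pre-allocate the result, then fill it in one element at a time
doubled_loop <- numeric(length(x))
for (i in seq_along(x)) {
  doubled_loop[i] <- x[i] * 2
}

# map_dbl(): apply the same function to each element in isolation
# and collect the results into a double vector
doubled_map <- map_dbl(x, \(value) value * 2)

# Both approaches produce the same result
identical(doubled_loop, doubled_map)
```

The loop spells out every step, while the map call only states what happens to a single element.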

And the first one is that it makes it easy to avoid, or ensures that you avoid, some performance bottlenecks with loops. Now, for loops themselves are not slow, but in a for loop it's easy to accidentally repeatedly modify or repeatedly extend a vector, and that can be slow, although changes in the latest versions of R mean this is much less bad than it has been in the past. So performance is a little bit useful. I think one big advantage of switching to purrr is that you can also switch to the furrr package (not future, but furrr). The furrr package has exactly the same syntax as purrr, but it spreads the computation across multiple cores. Because purrr guarantees that all of the computations are independent, it's much, much easier to share that work across multiple cores.

The other thing, particularly in this latest version of purrr, is that it gives you access to a really powerful, useful tool, and that's progress bars, which I'll talk about a little bit later. But purrr is definitely kind of an advanced programming tool. You can go a very long way with for loops, but I think if you can master the map functions of purrr, you can write code that is more succinct, more clear, and more likely to be correct the first time you write it.


Announcing purrr 1.0

So today I'm going to talk about purrr 1.0, and the 1.0 release of any package is pretty special, because it marks that at least the initial development of the package is kind of done. This is a package that we are confident about, it's stable, and we expect it to be around for the long term. And for purrr in particular, this is kind of an opportunity to take a look at it broadly, and say which of the functions in this package are really related to the core purpose of functional programming, of working with vectors and functions, and which functions of purrr are not, and maybe need to be superseded or deprecated.

Progress bars

So why don't we start with some of my favorite features of purrr 1.0. And I think the most exciting feature, the coolest feature, is the new progress bar. So let's just make a very silly little example. Let's take a brief look at what life is like without a progress bar. So let's just imagine we're going to map over 100 numbers, and in lieu of doing anything useful here, I'm just going to sleep, I'm just going to pause for a tenth of a second.

So when I run this code, I know something's working, I know this is going to take 10 seconds in total, but I'm getting no feedback on what's going on. And for a real life job, you don't know exactly how long it's going to take, and you don't know when something's happening: should you just take a deep breath, should you go grab a coffee, or should you come back tomorrow? So what's great now, in the latest version of purrr, is you can say .progress = TRUE, and get a convenient progress bar that tells you both how far through the process you are, and an estimated time till completion, assuming that every single iteration takes about the same amount of time. Now that alone, I think, is pretty great for interactive usage.

If you are putting these inside of a function, then instead of just TRUE you can supply a string, which will be used as a label for the progress bar, so you know exactly what's going on.
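A scaled-down sketch of the progress bar example described above. The helper name slow_step and the label text are assumptions for illustration, and the input is shortened from 100 to 30 elements so it runs in about three seconds. The bar is only drawn in an interactive session.

```r
library(purrr)

# Stand-in for real work: pause for a tenth of a second, then return the input
slow_step <- function(x) {
  Sys.sleep(0.1)
  x
}

# .progress = TRUE turns on a basic progress bar
res <- map(1:30, slow_step, .progress = TRUE)

# A string is used as the label of the progress bar,
# which helps when the call is buried inside another function
res_labelled <- map(1:30, slow_step, .progress = "Sleeping")
```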

Better error messages

Okay, so my next really useful feature, and it's part of a big push across the tidyverse, is better error messages. So again, I'm just going to make up some dumb sample data: I've taken the numbers between 1 and 500 and randomly reorganized them. And I'm going to make a little function, and most of the time this function is just going to double the number, but every now and then it's going to have an error. Again, I'm just kind of simulating what might happen in your real code.

Not particularly useful by itself, but now if I run this code, what's really handy is that map is going to tell you exactly where that error occurred. So I can look at the 318th element, and I know that's 123, which, by inspection of this function, is pretty easy to determine, but in real life, this is really helpful for figuring out exactly where that problem occurred. Now, of course, you can still use all of the useful functions in purrr, like safely and possibly, if you expect errors and want to capture them in some way, but this new ability to tell you where the error occurred is hopefully something that's helpful, and just makes your life a little bit easier every day.
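A sketch of that error-location example, under some assumptions: the seed, the function name double_unless_123, and the error text are invented here, so the failing position will not necessarily be 318. The try() wrapper is only there so the script keeps running after the failure.

```r
library(purrr)

set.seed(2023)        # arbitrary seed, only to make the shuffle reproducible
x <- sample(1:500)    # the numbers 1 to 500 in random order

# Doubles its input, except for one value that triggers an error
double_unless_123 <- function(value) {
  if (value == 123) stop("Unlucky number!")
  value * 2
}

# The map call fails part-way through; the error message reports
# the index of the element that caused the failure
failed <- try(map_dbl(x, double_unless_123), silent = TRUE)
cat(failed)

# If errors are expected, possibly() swaps them for a default value instead
safe_double <- possibly(double_unless_123, otherwise = NA_real_)
out <- map_dbl(x, safe_double)

which(is.na(out))   # position of the element that failed
which(x == 123)     # the same position, found by inspecting the input
```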

We've also generally just reviewed the error messages elsewhere. Again, it's part of this kind of big tidyverse effort to make sure you get good error messages. So maybe if I try and supply an environment or something weird instead of a function, now we have this pretty standard kind of error framing where we try and tell you: what went wrong? It was an environment. Well, what did we want? A function. And you'll see this more and more as we work on improving error messages across the tidyverse.

New map function: map_vec()

We also have a new map function called map_vec. So if you're an existing purrr user, you probably know about map_lgl that produces a logical vector, map_int for integer vectors, map_dbl for double vectors, map_chr for character vectors. You might wonder, like, what about all the other things? Like, is there going to be a map date? Is there going to be, like, a map factor? Is there going to be some kind of map thing for every single possible type of object in R? This is something we've been struggling with for a while in purrr. We can never quite figure out what to do. But there's now a new function called map_vec that's going to work with any type of vector object. So anything like a factor or a date or a date time is now going to work.

So, for example, I could map over the numbers 1 to 3, and in my function I'm just going to add those to today's date. Each call gives me another date, and map_vec gives me a vector of dates. Or I could take the first three letters, turn each one into a factor, and then when I use map_vec, it's going to give me a factor, just as you'd expect.
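The two map_vec examples described above might look roughly like this. The exact code typed in the video is not in the transcript, so treat the anonymous functions as an approximation.

```r
library(purrr)

# Add 1, 2, and 3 days to today's date: the result is a Date vector
dates <- map_vec(1:3, \(i) Sys.Date() + i)
class(dates)

# Turn each of the first three letters into a factor:
# the results are combined into a single factor
fct <- map_vec(letters[1:3], \(letter) factor(letter))
class(fct)
```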

So it uses the rules from the vctrs package. vctrs is a package that hopefully you as a data scientist don't need to care about, but it's something that we're using more and more in the depths of the tidyverse to make sure all of the rules that we use are consistent. So you don't have to memorize these rules, but hopefully as you use functions around the tidyverse, you just naturally learn over time what those rules are. You might not be able to make them explicit, but because everything is consistent and behaves in the same way, you build up this internal knowledge of the theory that underlies these functions.

One thing I should note about map_vec is one of its invariants, one of the things that are always true: the length of the output is the same as the length of the input. So if you try to return a longer vector, maybe I try to return every element twice, that's not going to work. And again, I'm going to get this error message telling me what is supposed to happen. Every element has to be size one. In this case, I've got an element of size two.
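A small sketch of that size invariant. The function that returns each element twice is an assumed stand-in for whatever was typed in the video, and try() is used only so the script continues past the error.

```r
library(purrr)

# Each call returns a vector of size two, which breaks the rule that
# every result must have size one, so map_vec() throws an error
too_long <- try(map_vec(1:3, \(x) c(x, x)), silent = TRUE)
inherits(too_long, "try-error")

# Returning exactly one value per element works as expected
doubled <- map_vec(1:3, \(x) x * 2)
length(doubled)
```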

New list_* functions

So what happens if you do want to do something like this? Well, you can do this with map. That's going to give you a list with each of those components. And new in this version of purrr are a set of functions called list_c, list_rbind and list_cbind, which let you take the output of functions like this and, for example, combine them all into a vector. Again, this is using vctrs under the hood, so it's going to use exactly the same coercion rules that map_vec uses, but in this case, you can use it to produce something where every element has a different length.

So, for example, we could take each element and repeat it that number of times. So we get one one, two twos, and three threes. If we just map that, we can see a list with those elements. And then if we use list_c, that's going to concatenate them all into a single vector. list_rbind and list_cbind work similarly, except for data frames. These supersede map_dfr and map_dfc. These functions are super useful, and we know lots of people use them, but they're not really map functions because of this invariant, that the output should be the same length as the input for a map function. That is pretty important for the meaning of map. And so we've decided these functions aren't really that great, so we've superseded them.
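The repeat-and-concatenate example above can be sketched as follows; the variable names are chosen here for illustration.

```r
library(purrr)

# Repeat each number that many times: the pieces have different lengths,
# so map() has to return them as a list
pieces <- map(1:3, \(x) rep(x, x))
str(pieces)

# list_c() concatenates the pieces into a single vector
combined <- list_c(pieces)
combined
```

The input has three elements and the combined output has six, which is why this is a separate step and not a map variant.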

So if you're not familiar with what superseded means, it means they are not going away, they're not deprecated, they're not on a path to being removed, but we don't recommend them anymore. So in some sense, these functions are the safest functions to use because we're going to just put them away and we're never going to touch them, but we just don't recommend using them. So instead, we recommend combining a map with one of these list concatenation functions, whether it's list_cbind, list_rbind, or list_c.
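As a rough illustration of that recommendation, here is a map call followed by list_rbind, where older code might have used map_dfr, plus a list_cbind call for the column-wise case. The small data frames are invented for this sketch.

```r
library(purrr)

# One small data frame per input value (made-up columns)
rows <- map(1:3, \(i) data.frame(id = i, square = i^2))

# Stack the data frames on top of each other
stacked <- list_rbind(rows)
stacked

# list_cbind() places data frames with the same number of rows side by side
cols <- list(
  data.frame(id = 1:3),
  data.frame(label = c("a", "b", "c"))
)
side_by_side <- list_cbind(cols)
side_by_side
```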

Flattening and simplification

So another area of purrr that's got quite a lot of love in this release is this general idea of flattening and simplification. So this is about what happens if you've got some complicated nested list and you want to simplify things. And since we wrote this stuff in purrr, quite a lot of things have changed. In particular, for example, tidyr now has some really nice tools like unnest_longer and unnest_wider, so that if you are working with a very hierarchical data set, say you've got something from a JSON API and you want to turn it into a data frame, these tidyr functions are really the place to start now.

Previously, you had to use these quite low-level purrr functions. You can still keep using those, but generally unnest_longer and unnest_wider allow you to express what you want a little bit more simply and a little more clearly. But that gave us an opportunity to look at these functions in purrr and think: now that you're not being forced to use them for these data analysis operations, what are the right primitives here? These are things you're probably not going to use as much day-to-day. They're more useful when you're programming. They might be tools that we're going to use elsewhere in the tidyverse for solving problems. But we wanted to take a look holistically and think about what's going on.
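For the data-analysis case mentioned above, a minimal sketch with tidyr might look like this. The nested list is a made-up stand-in for a parsed JSON API response; the field names are hypothetical.

```r
library(tidyr)

# Hypothetical parsed API response: one nested list per user
api_result <- tibble(
  user = list(
    list(name = "ada", langs = c("R", "Python")),
    list(name = "bob", langs = "R")
  )
)

# unnest_wider(): each named component becomes its own column
wide <- unnest_wider(api_result, user)

# unnest_longer(): each element of the langs list-column becomes its own row
long <- unnest_longer(wide, langs)
long
```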

And certainly, we caused a lot of confusion in our team. We couldn't remember what does flatten do, what does simplify do. We've had some internal functions and other packages which do other things and just ended up causing a lot of confusion. We didn't have a shared language that we could use when talking to one another. So we tried to unify things, simplify things in this version. And just to give you one example, let's talk about the flatten function.

So here I have a list. The first element of the list is just a number. The second element of a list is itself another list. And the second element of that list is another list. So we have this kind of complicated tree-like structure. Now what we can do is use list_flatten to remove one layer of that hierarchy. So let's just call it, I'm going to use str again just to illustrate the structure a little bit more easily, and then see what happens. Let's just give a little bit more room in the console so we can kind of compare them. So previously we had a list of two things, now we've got a list of four things. You can kind of see what's happened, right? This list, this sub-list of three items, has kind of been like pulled up to the top level.

So that list itself had another list in it, so we could call list_flatten again to flatten that. And now we have what we'd call a perfectly flat list, a list that doesn't contain any other lists. And if we call list_flatten on that again, nothing's going to happen, we're still going to get a list. So list_flatten is always going to return a list. You don't know how long that list is going to be. It's normally going to increase in size because you're going to have sub-lists with varying numbers of children, and they're going to get kind of hoisted up into the parent. But you've made this list as flat as possible.
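The flattening walkthrough above can be reproduced with a list of the same shape. The values 1 to 5 are assumed here, since the transcript only describes the structure.

```r
library(purrr)

# A number, then a list of three items whose second item is itself a list
x <- list(1, list(2, list(3, 4), 5))
str(x)

# One call removes one layer: the three-item sub-list is pulled up to the top
once <- list_flatten(x)
length(once)     # 4 items, one of which is still a list

# A second call removes the remaining layer
twice <- list_flatten(once)
length(twice)    # 5 items, none of them lists

# Flattening a flat list changes nothing: the result is still a flat list
thrice <- list_flatten(twice)
length(thrice)
```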

Now if you want to make this simpler, you can't make this list any flatter, but we could use a simpler data structure, right? For example, if we call list_simplify now, we're going to go to a simpler data structure, which in this case is a numeric vector. So let's just illustrate the difference between flattening and simplification with a list that has a variety of types in it. So the same structure as before, but now we've got a logical in there and a character vector. Okay, let's flatten it. Look at the results. So we've still got a list. We'll flatten it again. We still have a list. Now, can we make this any simpler? Well, what happens if we call list_simplify? Well, no, we can't make this any simpler because we're trying to combine a double and a character, a number and a string together, and there's nothing simpler than a list that will allow you to have both of these in a single data structure.

So this is as simple as we get. And the idea is that now we've got this clear distinction between flattening, which only ever decreases the depth of nesting, and simplification. When you flatten a list, you still get a list, but the number of items might have changed. With simplification, you're always going to get something that's an atomic vector or a factor or a date, but you're always going to have the same number of items in there. I think it has this nice kind of symmetry. Either you're changing the size while keeping the type the same, or you're changing the type and keeping the size the same, and that is a nice kind of split that hopefully allows you to solve problems without getting confused about which functions to use.
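A sketch of the simplification side of that split, using already-flat lists with assumed contents. The strict = FALSE call is an addition not shown in the transcript; it asks list_simplify to hand back the list unchanged when no simpler type exists, where the default is to signal an error.

```r
library(purrr)

# A flat list of numbers simplifies to a numeric vector of the same length
flat <- list(1, 2, 3, 4, 5)
simple <- list_simplify(flat)
simple

# A flat list mixing numbers, a logical, and a string has no common type
mixed <- list(1, TRUE, "a", 4, 5)

# By default, a failed simplification is an error
failed <- try(list_simplify(mixed), silent = TRUE)
inherits(failed, "try-error")

# With strict = FALSE the list is returned as-is instead
unchanged <- list_simplify(mixed, strict = FALSE)
str(unchanged)
```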


Breaking changes

So to finish up, I wanted to cover what's possibly the biggest set of changes to purrr in this release, which is basically saying these are all the things that purrr no longer does, because we don't believe that they are actually related to the core purpose of purrr. And I think this is a really important part of the life cycle of packages, that as we understand the core purpose of the package better, we can remove things. And that comes at a cost. If you're an existing user, you might love some of these functions. That refinement hurts existing users, but we do it for the future users, because when people in the future come to purrr, they're going to see a tighter, more cohesive set of functions that are easier to understand.

So most of these we have superseded. There are some things that we've deprecated, because we think it's really pretty uncommon to use them, or there are major problems with them. So again, most of these will continue to work. Superseded functions will stay around for a long time. Deprecated functions are definitely on the path out, but we're trying to do that as gradually as possible. So if you use a deprecated function in the first release where it's deprecated, it'll warn you every eight hours that the function's going away. We're trying to walk this line between being annoying enough that you eventually get rid of the deprecated function in favor of the new solution, but not so annoying that it's as bad as just taking the function away.

So deprecated functions in this release will warn every eight hours. In the next bigger release, which will be in one or two years, they'll warn every single time you use them, and then in some release after that, probably two or three years after that, they will error when you use them. So with a deprecated function, you've still got a couple of years to fix it, but you really do need to work on that at some point, because those functions are going away.

How the tidyverse handles deprecation

So let's do a quick overview of those functions. I'm just looking at the news file, which tracks every single change we've made to the package. If you're a really big user of a package, I think it's a really good idea to familiarize yourself with these files. In the blog post, we try and just hit the biggest points, the things that we think are most important, but it's useful for you to scan the news file, because things that are important for you might not be important for lots of other people, or we might just be wrong. It's also great if you've got your own blog and want to find something to write about: read through the news of packages and look at stuff that we haven't advertised, because I can guarantee you there are useful blog posts to be written about things that we've changed that we're not going to be writing about, and that's a great way to dig into the development of a package and help educate others as well.

So, a kind of quick summary: cross, I think, was a pretty old function. I don't think many of you were using it, but you're much better off using tidyr's expand_grid function. We had a bunch of functions that used non-standard evaluation in some way. There were just three functions in purrr that used non-standard evaluation or tidy evaluation in some way, which just felt like not a good fit, so we've got rid of those. So no non-deprecated function in purrr uses tidy select anymore. We had the lift family of functions, which I think are a really cool idea. They're sort of a little bit mind-blowing about what you can do with functions and functional programming, but it's just an unusual style of programming and we think that most people are better off not knowing about it.
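As a small illustration of the suggested replacement for cross, tidyr's expand_grid builds every combination of its inputs as rows of a data frame. The inputs below are made up.

```r
library(tidyr)

# Every combination of x and y, one combination per row
combos <- expand_grid(x = 1:2, y = c("a", "b"))
combos
nrow(combos)   # 2 values of x times 2 values of y
```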

Then we have a bunch of functions where I have no idea why we ever included them in purrr, especially the random number generator ones. I'm not sure why they ever ended up in purrr, but they're now on the way out. And splice, which is a much older tool that allows you to kind of cram together lists of lists. We used to use it as an alternative for some of the stuff that we've replaced with dynamic dots or tidy select. That just doesn't make any sense anymore. There's other stuff you could do with the map functions that, again, I don't know why we ever implemented, because the inputs are not really vectors, so those don't work anymore. We got rid of the raw variants. I don't think anyone ever actually used them.

We've made some changes to how things like map_chr work. Previously, you could just do something like this, and that would give you a character vector. Now it's going to warn, and the reason we warn is that the way it does the coercion is pretty weird: if you're creating really small numbers, it just silently truncates them, which is clearly pretty bad, so we got rid of that.
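The exact call shown in the video is not in the transcript, so the following is only a sketch of the general idea: instead of relying on map_chr to coerce numbers to strings, do the conversion explicitly inside the function. The inputs and the choice of as.character() and format() are assumptions.

```r
library(purrr)

# Old style, which now triggers a deprecation warning because the function
# returns numbers and map_chr() has to coerce them:
#   map_chr(1:3, \(x) x)

# Explicit conversion: the function itself returns a string
labels <- map_chr(1:3, \(x) as.character(x))
labels

# For small numbers, format() makes the rounding a visible, deliberate choice
precise <- map_chr(c(0.000001234, 2), \(x) format(x, digits = 3))
precise
```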

There are also a bunch of other functions deprecated in previous releases. They're kind of marching along the process to going away altogether. So here are two functions that we deprecated five years ago and that have now gone away. Now we can finally get rid of them.

So I think a really important part of the package life cycle and package development is not just adding new functions; it is equally important to remove functions. We know that removing functions that used to work is frustrating, and so we have developed this life cycle process of deprecation, where it starts with pretty mild warnings and gets progressively more and more aggressive over time. We hope this is useful. We hope that this gradually pushes you towards the golden path without forcing you rudely, when you're working on something else, to urgently switch something. But at the same time, we want these functions to be better. We want these things to have a more consistent, cohesive syntax because that makes it easier for future users, and I think we're doing pretty well. We've certainly got a lot better, I think, at handling deprecations and breakages, but we're certainly still trying, still learning, and if you have any feedback, please let us know.

An overview of functional programming

If you're interested in learning more about functional programming in general, I think a really good place to start is Advanced R, which has three chapters on functional programming: Functionals, Function Factories, and Function Operators. There's a lot of text here, but you don't necessarily have to read the text because I think it's got some really good pictures as well. This is the basic idea of a map function. The first argument to map is a vector, and you've got a function, and what that gives you is a vector where you've called f on each element of the input in turn. Once you've got that idea, we can scroll down: what happens if you have a list of elements? Or go down a little bit further: what happens if you supply additional arguments, or what happens if you have multiple vectors that you want to iterate through in parallel? I think these pictures give you a really quick overview. What does map2 do? Well, map2 has two vectors, and it calls f on the first element of the first one and the first element of the second one, then the second element of each, and so on. And once you've internalized what map2 does, you might wonder, well, what happens if we've got three vectors, or if I don't care about the output? If you just skim through these pictures, all the way to the reduce family and other more complicated things, this can give you a great sense of the basic idea of these functions.
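The pictures described above correspond to calls along the following lines. This is a minimal sketch with made-up inputs, not code from Advanced R or the video.

```r
library(purrr)

# map: call the function on each element of one vector
squares <- map_dbl(1:3, \(x) x^2)

# map2: walk through two vectors in parallel
sums <- map2_dbl(1:3, 4:6, \(x, y) x + y)

# pmap: the same idea for three (or more) vectors, supplied as a list
totals <- pmap_dbl(list(1:3, 4:6, 7:9), \(a, b, c) a + b + c)

# walk: call the function for its side effect and ignore the output
walk(c("a", "b"), \(x) cat("Processing", x, "\n"))

# reduce: combine all elements into a single value, two at a time
total <- reduce(1:4, \(acc, x) acc + x)
```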

Closing and resources

So in this video I've given you my top highlights of purrr 1.0. If you want to learn more, read the blog post, and if you want more details than that, I'd recommend reading the news file, as I mentioned earlier. If you are using a deprecated function, or a superseded function, and want to learn what you should do now, I forgot to mention this earlier, but if you go to the documentation for that function, and scroll down to the examples, you'll see that there's a bunch of examples that show you what you did previously, and what you should do now. So hopefully that will guide you in updating your code too.

If you've got other problems, if we have broken something that's really important to you, a great place to go is the GitHub page for purrr. We love your issues, we love questions. If you do have a problem with your code, it's really, really useful for us if you create a reprex, or reproducible example. Of course we have a package to help you do that, so you might want to learn a little bit about using reprexes. This just gives us some runnable code that we can immediately run. There's no trying to figure out exactly what you did. We can take your code, see what you did, run it ourselves, and that just makes answering a question that much easier. If you have something that doesn't quite rise to the level of a bug or a GitHub issue, another great place to reach me is on Mastodon. I'm very happy to answer any general questions there, and I'll point you in the right direction to places to look things up. So I hope you've enjoyed learning more about purrr, and thanks for watching this video.