R-Ladies Rome (English) - What's new in the tidyverse - Isabella Velasquez

Transcript

This transcript was generated automatically and may contain errors.

Welcome everyone to R-Ladies Rome. We are going to talk about what's new in the tidyverse.

So again, welcome everyone to the third event hosted by the R-Ladies Rome chapter. My name is Federica Gazzelloni and I am one of the organizers. I'm thrilled to have you all here tonight, and we are also delighted to be joined by Isabella Velasquez, who will be our esteemed speaker for the evening. A bit of Italian, so: welcome. This is the third event for this R-Ladies Rome group. I am Federica Gazzelloni, one of the organizers. We are joined by Isabella Velasquez. Hi Isabella, our leader for this evening. All the material will be shared during the presentation and you already have it in the chat, so if you want you can already take a look at the tidyverse documentation that we will be talking about.

I cannot believe how many times select and filter have gotten me, you know, in my data analysis workflows, and so it's just something the tidyverse team recommends that you use.

Another note: at the bottom of all of these posts I'm going to be linking to the blog post that I am referencing, as well as the release notes and any other resources that I have found. Hopefully, again, that is helpful for you as you continue learning about what's new in the tidyverse.

dplyr 1.1.0: per-operation grouping

Okay, so dplyr 1.1.0 has been released, and it has a bunch of new functions and things that you can do. Again, I highly recommend looking at everything; it's incredible. The first thing that we're going to be talking about is what is called per-operation grouping. This is something new, so let's go through it. As usual, you can install the package however you'd like, and I'm going to load it here. Per-operation grouping is something experimental: instead of group_by, you have the option of using the by/.by argument for grouping in your pipes.

So let's cover what group_by is again. It is a function that lets you group by one or more variables. Say you have this table of transactions: you have companies A and B, you have the year, and you have revenue for that year, and you notice there are some years that have duplicate rows for the revenue. A very common thing that you may be asked is how to get the total revenue by company and year. Very commonly you will take the data frame transactions, group by company and year, and mutate total equals the sum of that revenue.

So taking a look here, what you have done is create a total column, and for company A, year 2019, it will keep both of the rows but it will have summed the 20 and the 50 from that year and that company into the total column. That's great, that works and gives you what you need, but you may notice up here there's this message that says Groups: company, year. The reason for that is that group_by is what is called persistent grouping. It lasts for more than one operation, and just because you have created this column and done what you wanted to do doesn't mean that the groups have gone away.

Another option that you may have thought of is summarize. summarize is another function from dplyr: you can take the transactions, group_by, and then summarize. As you notice here, the difference between this and what we did above is that (this is 2019, I'm sorry that it's not showing all the way) you have one row for each group of company and year, as opposed to multiple rows like we had up here. That's what summarize does: it gives you one result row for each group. And if you notice here, the groups message just says company, so year has been peeled off the grouping, which is good if you know about it, but if you don't, it might be very confusing as you try to do further operations and wonder why things are not grouping or summing the correct way.

So what if you just didn't want groups anymore? You had a couple of options. The first one is ungroup: group_by, your operation, ungroup, and doing so means that you no longer have any groups there. The second option is, within your summarize statement, to say .groups equals "drop", so summarize this and then drop the groups. In both of these cases you can see that there is no longer a group listed in your output.
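The group_by workflow described above can be sketched roughly as follows. The transactions values are illustrative assumptions; only the 20 and 50 for company A in 2019 are taken from the talk.

```r
library(dplyr)

# Illustrative data (assumed values, apart from the 20 and 50 for A/2019).
transactions <- tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019, 2019, 2020, 2021, 2023, 2023),
  revenue = c(20, 50, 4, 10, 12, 18)
)

# Persistent grouping: the result is still grouped by company and year.
with_total <- transactions %>%
  group_by(company, year) %>%
  mutate(total = sum(revenue))

# summarise() returns one row per group and peels off the last grouping
# variable, so the result is still grouped by company.
by_company_year <- transactions %>%
  group_by(company, year) %>%
  summarise(total = sum(revenue))

# Two ways to end up with no groups at all.
dropped_a <- transactions %>%
  group_by(company, year) %>%
  summarise(total = sum(revenue)) %>%
  ungroup()

dropped_b <- transactions %>%
  group_by(company, year) %>%
  summarise(total = sum(revenue), .groups = "drop")

group_vars(with_total)       # remaining groups after mutate()
group_vars(by_company_year)  # remaining groups after summarise()
```

Checking group_vars() after each step is one quick way to see whether groups are still hanging around.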

And now there is the new by idea, so this is per-operation grouping. What that means is you take your bare data frame, you do your thing, and you get back a bare data frame. With group_by, the ones that still have groups even after you do your operation are called grouped data frames; by doesn't do that. It's a bare data frame, your operation, a bare data frame. The way that it works is that your by statement is in line with the operation that you want to run. In this case we want to create a total column with mutate that sums the revenue, and we're saying do it by company and year, directly in the line.

Doing that gives you the original output that we were looking for, but now we don't have a grouped data frame, which is very, very handy. It's no longer grouped on the way out: it does the one operation and then the grouping drops off. This makes more sense on the website, but essentially it's reiterating what I just said. Before, it was a bare table, transactions, and then you ended up with a grouped data frame; now it's a bare table or data frame, transactions, and then a bare table or data frame.

This has several advantages. Again, you didn't see that message about regrouping, and you never have to remember to ungroup again, which is pretty sweet if you've ever encountered that before. Another thing is that order doesn't matter. If you remember, with summarize, company stayed and year was the one that was peeled off, and that's because year was the second group. In this case it doesn't matter whether you have company first or year first, because you're doing it line by line as opposed to all together. Another thing is that you might consider this easier to read, because you're associating exactly what you're grouping by with the operation. And another advantage is that you can use tidyselect, things like all_of, contains, and so on, within your by argument, which is very handy if that's something that you need.

But there are some caveats. by is only for selection, so it doesn't create columns, which is something that you can do with group_by. It always returns an ungrouped, or bare, data frame, and in your previous code you may have used group_by and depended on that grouped data frame, so it's just something to note if you intend to switch out your code. Again, you have to create those columns ahead of time, because by is not going to create columns for you; they need to already exist. And there's also the question of sorting: group_by sorts groups in ascending order, and that impacts the results; by does not, so that's something else to be aware of.

Just a quick note that this was inspired by data.table, which is another package for data manipulation. The tidyverse team saw that they had implemented this per-operation grouping and thought, what about for dplyr? So that's the summary. There are various dplyr verbs that are supported by by, so here's the list, and there's more information in the documentation. You might be wondering whether there is a period or not in by; everywhere, I tried to write by/.by, and this depends on the function that you're using. It's just a thing about dplyr: some use the period, some don't. If you try it with the incorrect one it will give you an error, and it's very informative: it's going to let you know that you need to use by without a period here instead. Running that gives you what you need, and then you can continue forward. And you may be wondering, what about group_by? It is not going away, it's not deprecated, it's not superseded, and there's no pressure to use by or .by; only if it makes sense to you.
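As a minimal sketch of per-operation grouping, assuming the same illustrative transactions table as before (the values are not from the talk, apart from the 20 and 50):

```r
library(dplyr)

# Illustrative data (assumed values).
transactions <- tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019, 2019, 2020, 2021, 2023, 2023),
  revenue = c(20, 50, 4, 10, 12, 18)
)

# The grouping lives inside the one operation; the result is a bare tibble.
totals <- transactions %>%
  mutate(total = sum(revenue), .by = c(company, year))

# The order of the grouping columns does not change the mutate() result.
totals_rev <- transactions %>%
  mutate(total = sum(revenue), .by = c(year, company))

# tidyselect helpers are allowed inside .by.
summed <- transactions %>%
  summarise(total = sum(revenue), .by = all_of(c("company", "year")))

is_grouped_df(totals)  # FALSE: nothing to ungroup afterwards
```

No ungroup() call is needed in any of these pipelines because no grouping survives the verb it was attached to.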
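The sorting caveat can be seen on a tiny made-up table (the data frame df and its values are hypothetical): group_by() orders the group keys, while .by keeps them in order of first appearance.

```r
library(dplyr)

# Hypothetical data where the group "b" appears before "a".
df <- tibble(g = c("b", "a", "b"), x = c(1, 2, 3))

# group_by() + summarise(): group keys come back sorted.
sorted <- df %>%
  group_by(g) %>%
  summarise(total = sum(x))

# .by: group keys come back in the order they first appear in the data.
as_seen <- df %>%
  summarise(total = sum(x), .by = g)

sorted$g
as_seen$g
```

If downstream code relies on sorted groups, adding an explicit arrange() after the .by version is one option.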
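The by versus .by distinction might look like this in practice. The data is illustrative, and the error is captured with tryCatch() only so that the sketch runs end to end.

```r
library(dplyr)

# Illustrative data (assumed values).
transactions <- tibble(
  company = c("A", "A", "A", "B", "B", "B"),
  year    = c(2019, 2019, 2020, 2021, 2023, 2023),
  revenue = c(20, 50, 4, 10, 12, 18)
)

# mutate(), summarise(), filter() and friends take `.by` ...
totals <- transactions %>%
  mutate(total = sum(revenue), .by = company)

# ... while the slice_*() family takes `by` (no period).
top <- transactions %>%
  slice_max(revenue, n = 1, by = company)

# Using the wrong spelling is an error; here we keep only its message.
wrong <- tryCatch(
  transactions %>% slice_max(revenue, n = 1, .by = company),
  error = function(e) conditionMessage(e)
)
wrong
```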

This part of the package... yeah, so the question says: within the same format, is it possible to not have revenue while summarizing and directly have unique rows, like it's used to being grouped by?

Yes, sorry, is it okay if I reach out to you afterwards? I want to make sure that I understand your question correctly. Okay, let's read the other one, thank you. How are by group operations on already grouped tables handled? Do we get the original groups after the operation?

Oh, that is a great question. I actually have not tried to see what happens if you use by within a group. Let's see what happens if we try it. I think the assumption is you're using one or the other: you're either using group_by or you're using by. But let's see. So we have transactions; this one is grouped, transactions grouped, let's save it in two. Live coding is always a little dangerous, so please bear with me. It doesn't work. So the question is like, can you... do I still use the reader pipe, so something like

summarize

Okay, I think it looks like it retains the original group, but it's a great question. I'll dig in a bit deeper. I think another point with by is that it is one of those examples of something that makes you be a little bit more explicit, in that if you want to keep company and year, you have to write company and year. So I think it's about learning your workflows and things like that. But yeah, that's a good question, I hadn't thought about that. Okay, I don't see another question, but I have one about the .by option. Actually, two questions about that.

So as I group by, and you use it inside the mutate function... you said that I cannot create columns, but with mutate I create columns when I make a modification. Is it that I'm not able to create new columns, or just the name of the column? Yeah, I believe that's referring to this: this does create a new column, but I believe these here have to already exist.

Okay, so those are the ones that you use for grouping. My question is, can I do more than one, like a summarizing vector? Yes, but the idea is again that you do one operation for each by, so you would pipe it and create a new operation, and then specify the groups for that operation there.

So you kind of layer on all the things that you want to do. And one more: the dot, does it belong to a particular type of function, or do just some require it and others not? Yeah, so it depends on the dplyr verb that you're using. If you're using, for example, slice_max, it requires the one without a dot. If you use the incorrect one it will give you an error, but here in the information it's going to tell you why you got an error and what to do instead. In this case it says, did you mean to use by without a period, and so you can rerun it like so and you'll see that that works. So that's great. So I just realized that you are using this by for grouping inside different functions, right? I was thinking about mutate, and now, yes, of course, you need to use the one without the dot, for example, with slice. So has this been implemented for all functions or for a certain type of function? Right, yes, for certain verbs, so you can see them here. Again, whether or not you use the dot depends on the function, but these are the ones that are supported by by.

Thank you. Yeah, of course, thank you, great questions. If I can't answer anything I will find out and I will update the documentation, and hopefully it can serve as a good resource too. So thank you. Awesome. So another set of dplyr additions or changes is pick, reframe, and arrange.

Pick, reframe, and arrange

And so I'm going to refresh my R session to make sure I don't have any output. Okay, awesome. So again... oh, why is this still here?

I'm using my personal laptop, so please bear with it, it's a little old. Oh, okay, well, that's okay. So we're going to load dplyr.

I think I may have overloaded my computer. There we go, okay, sorry, load dplyr. So while that is loading: across is a function that you can use for applying functions across columns, and it's very commonly used for that, but it turns out that you can also use it for column selection instead, and the developers noticed that this was something that was happening. So say you have a data frame with all these columns and you want the number of columns that start with x and the number of columns that start with y, and you put it within summarize. If you notice, you're using across within summarize, and it does work: it gives you two columns for x and one column for y.

But the issue is that that's actually not what across was originally implemented for; it was for applying functions. But that's okay, not all hope is lost. dplyr 1.1.0 has a new function specifically for this, and it's called pick. If you run that, it will give you exactly what you expected up above. So it's an alternative, well, it's a better option, because it is a function that's meant to do what you expect it to do. So if you have been using across for selection in functions like mutate or summarize, use pick instead. They mentioned that across still works in these functions for now, but it's going to be deprecated.
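A small sketch of pick() used for column selection inside summarise(). The column names and values in df are made up for illustration.

```r
library(dplyr)

# Hypothetical data: two columns starting with "x", one starting with "y".
df <- tibble(
  x_1 = c(1, 3, 2, 1, 2),
  x_2 = 6:10,
  w_4 = 11:15,
  y_2 = c(5, 2, 4, 0, 6)
)

# pick() selects columns with tidyselect and returns them as a data frame,
# so ncol() counts how many columns matched.
counts <- df %>%
  summarise(
    n_x = ncol(pick(starts_with("x"))),
    n_y = ncol(pick(starts_with("y")))
  )

counts
```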

Okay, reframe is something that has been implemented because of what the developers had been noticing in terms of how people use these different functions. In dplyr 1.0.0, summarize could return per-group results of any length, and that was different from before. Before, it had to be length one; if you got zero or anything more than one, it would throw an error. So for example, here you have, I'm sorry, a vector called table, and df, a table, and say you wanted to intersect the values in df with that vector, by g, so it's referring to g as the group and using the by argument which we just covered, and then returning the intersection. If you run this, you will notice that g, the group, has multiple results per group.

And again, this is something that started in dplyr 1.0.0, but it raised a lot of concerns. First of all, it increased the chances for bugs, and the idea behind a summary is one row per group, not multiple, so it's like, why can summarize do this now? And for folks who use dbplyr, it made translation very difficult. So this feature has now been walked back, which means that if you summarize and get zero or more than one row per group, you will get a warning. But again, there is an alternative: if this is something that you intend to do, there is the function called reframe. It's just "do something for each group", and as opposed to using summarize like we did above, we use reframe, which is a function that's meant for this purpose. Something to note is that reframe will always give an ungrouped data frame, even if the input was grouped, so it will ungroup it. And again, another habit to pick up and carry forward as these functions get
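The reframe() pattern described here could look like the following. The vector table and the contents of df are assumed for illustration.

```r
library(dplyr)

# Hypothetical lookup vector and data.
table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# intersect() can return any number of values per group, which is what
# reframe() is meant for; summarise() would warn in this situation.
res <- df %>%
  reframe(x = intersect(x, table), .by = g)

# A grouped input still comes back ungrouped.
res_grouped_input <- df %>%
  group_by(g) %>%
  reframe(x = intersect(x, table))

res
```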

And then finally I wanted to mention arrange. This one isn't a replacement; arrange has existed before, but a difference is that it's now using what's called the C locale instead of the system locale for sorting character vectors, and essentially what that means is that it makes arrange in dplyr 1.1.0 a lot faster than before. So here are 500,000 random strings that I'm creating in df, you can look at it here, and with the previous locale, from before, let's see how long it takes to sort this.

Okay, that took nine seconds, so quite a bit of time. And let's see with the new locale from the newest version of dplyr: that took 1.27, so you can see there's a significant difference in terms of the speed of sorting. If you want to use the old locale, there is this option, with withr; however, it's going to be removed at some point, just be aware. Or if you want to be explicit with another locale, you can mention it as an argument here, so it's not that you have to use the C locale at all. Again, very much faster, but there is something to be aware of: it slightly changes how vectors are sorted. To be honest, I'm not 100% sure of the details behind that, but if it is something that could impact you, just be aware that that's the case.
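On the "slightly changes how vectors are sorted" point, one visible difference is how mixed-case strings are ordered. The sketch below uses a made-up five-element vector; the .locale variant is guarded because it needs the stringi package to be installed.

```r
library(dplyr)

# Hypothetical mixed-case data.
df <- tibble(x = c("a", "b", "C", "B", "c"))

# Default in dplyr 1.1.0: C locale, where uppercase letters sort
# before lowercase letters.
c_sorted <- arrange(df, x)
c_sorted$x

# Being explicit about another locale via the .locale argument
# (only attempted when stringi is available).
if (requireNamespace("stringi", quietly = TRUE)) {
  en_sorted <- arrange(df, x, .locale = "en")
  print(en_sorted$x)
}

# The old behaviour is available through the global option
# dplyr.legacy_locale = TRUE, which the talk notes is temporary.
```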


Case when, case match, and consecutive ID

Okay, so that was pick, reframe, and arrange. All right, let's talk about case_when, case_match, and consecutive_id, also changes in dplyr. I told you there's a lot. Let's load our package. If you have ever used case_when, you're in good company; I feel like it's a very popular function for generalized if-else statements. And so there are a few changes to case_when.

I don't know how long you have been using R, but I remember a time when I could run this and not get an error, and then at some point things changed and all of a sudden I started getting this error: NA must be character, not logical. So from one day to the next I had to know what kind of NA I needed for my case_when statement. Oh, I'm sorry, just a quick overview of how case_when works. You say case_when, and here is a vector of numbers. It's saying: when x is greater than or equal to 10, call it large; when it's greater than or equal to zero, it's small; when it's less than zero, it's an NA. In the good old times you didn't have to say NA_character_, which looks like this. However, now, thanks to vctrs, the package that I mentioned before, you no longer have to specify the type of NA. Very, very exciting. So now when you do this it just works, and no error is thrown. That's one really big change if, like me, you can never remember the different NAs.

The second change: before, if you wanted to set a default in case_when, you had to do this. You say large, small, missing, and then for everything else you write TRUE, squiggle, other. Why? I don't know. I actually remember the day my co-worker taught me this, and I was like, why is it TRUE, and he was like, I don't know, but that's the way that you set defaults in case_when. In this case negative one, which is not greater than or equal to 10, not greater than or equal to zero, and not NA, would fall under other, which is here. So this syntax is really odd, very different from anything I had seen before, and kind of hard to remember too. But now with the new version of dplyr there's an explicit argument called .default, which makes it much easier to read and much easier to remember what exactly you need. You write .default with the thing that you want for everything else that doesn't fit your logical statements, and you run it. So it's just another way that dplyr has improved the way that we work with data. The TRUE version is not deprecated yet, but it will be in the future, so it's another thing to start picking up and using in your workflow.
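Both case_when() changes can be sketched on a small numeric vector. The values in x are assumed for illustration; the conditions follow the ones described in the talk.

```r
library(dplyr)

# Hypothetical input.
x <- c(1, 12, -5, 6, -2, NA, 0)

# A bare NA on the right-hand side now works; no NA_character_ needed.
sizes <- case_when(
  x >= 10 ~ "large",
  x >= 0  ~ "small",
  x < 0   ~ NA
)

# .default replaces the older `TRUE ~ "other"` idiom.
labelled <- case_when(
  x >= 10  ~ "large",
  x >= 0   ~ "small",
  is.na(x) ~ "missing",
  .default = "other"
)

sizes
labelled
```

Negative values fall through every condition in the second call, so they pick up the .default value.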

Okay, so if you've ever used case_when, there are times that it can be a bit repetitive. Here we have a vector of countries: we have USA, Canada, UK, China, Mexico, Russia. Whenever the value is in USA, Canada, Mexico, we want it to be North America; whenever it's Wales or UK, we want to replace it with Europe; and whenever it's China, we want to replace it with Asia.

So case_when is very handy, a much better alternative than using a bunch of nested if-else statements, but it is still a little repetitive. With the new version of dplyr there's a special case, case_match, that lets you do this without having to rewrite stuff all the time. Now you give it what it is that you want to replace, and instead of having to name the vector every single time, you just say, this is what I want to replace it by.

So again, it's really just a nice special-case alternative for streamlining your code. And before, you may have noticed, for NAs we had to say is.na for the value; now you can just list NA like this and say what you want to replace it by.

And then it also works with .default, which we just covered, instead of TRUE. And just a thing to note: if_else has the same updates as case_when.
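A sketch of case_match() on a country vector along the lines described above. The exact contents of the vector and the "missing" and "unknown" labels are assumptions.

```r
library(dplyr)

# Hypothetical country vector.
countries <- c("USA", "Canada", "Wales", "UK", "China", NA, "Mexico", "Russia")

# The vector is named once; each left-hand side lists the values to match.
continent <- case_match(
  countries,
  c("USA", "Canada", "Mexico") ~ "North America",
  c("Wales", "UK")             ~ "Europe",
  "China"                      ~ "Asia",
  NA                           ~ "missing",
  .default = "unknown"
)

continent
```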

All right, and the last thing in this document is consecutive_id. I highly recommend reading Davis Vaughan's blog post; the example that he gave is quite fun, but I just created another short little one for today. So this is Friends dialogue: here we have the text, the thing that was said, and then here we have who spoke. A fun fact: Monica was the first person to speak in Friends. If you notice in the dialogue, there are some times where it switches from Monica to Joey to Chandler, but then Phoebe speaks two times, and the transcript is broken up into two different rows. There might be a case where you want what she said to be together in one line.

And so if you try to do that in R, you can try to use summarize to put the groups together, and use the stringr function called str_flatten to collapse the data into consecutive dialogue. But if we do that, we see that we went too far. Originally we had 10 lines of dialogue; now we only have four lines, and that's because it grouped everything for Monica together and put it in one line, everything for Joey, everything for Chandler, and everything for Phoebe. That means that the transcript is now out of order and not exactly what we want. The new version of dplyr has something that can help with this, and it's called consecutive_id. You create a new column that provides the consecutive ID, and you can see here: Monica spoke first, so she's one, Joey's two, Chandler's three, Phoebe spoke twice in a row, and since it's the same person it will assign the same ID for any consecutive series of the group, and similarly Monica five, and then Chandler six, six, six.

And so now that we have this ID, we can use it in our summarize statement to group the dialogue correctly, and now we have the correct order of consecutive dialogue for each of the characters.

Okay, that was case_when, case_match, and consecutive_id.

There is a question: can you use regex in case_match? I believe so. I believe anything you could do in case_when you could do in case_match, but I will double check.
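The consecutive_id() idea can be sketched with placeholder dialogue. The speaker sequence mirrors the IDs read out in the talk; the text values are placeholders, not the real script.

```r
library(dplyr)
library(stringr)

# Placeholder dialogue: only the order of speakers matters here.
dialogue <- tibble(
  speaker = c("Monica", "Joey", "Chandler", "Phoebe", "Phoebe",
              "Monica", "Chandler", "Chandler", "Chandler"),
  text    = paste("line", 1:9)
)

# Grouping by speaker alone goes too far: one row per character.
too_far <- dialogue %>%
  summarise(text = str_flatten(text, collapse = " "), .by = speaker)

# consecutive_id() gives each uninterrupted run of a speaker its own id.
with_id <- dialogue %>%
  mutate(id = consecutive_id(speaker))

# Grouping by the run id keeps the transcript in its original order.
collapsed <- with_id %>%
  summarise(text = str_flatten(text, collapse = " "), .by = c(id, speaker))

with_id$id
collapsed
```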

stringr updates

Thank you. Okay, now we're going to talk about a new function. I would like to clear my output; I'm sorry if it takes a little while. So stringr is a package that makes it really easy to work with strings. It has a ton of functions that help you find repeated values, pull out the first character, pull out the last three characters, etc. It just got its first release in three years, and so it has accumulated a lot of functions that are now officially part of stringr. So let's load it.

The first one, and probably the biggest change, is this function called str_view, and it lets you see a string with special characters. You might have something like this with a lot of backslash n's, and when you print it out it looks exactly like that, but that's not very informative, because that's not actually what it looks like. In base R there's a function called writeLines that lets you take a good look at it, and here we can see the backslash n is a new line, and so they're actually on separate lines: the a, then, not the colon, sorry, the backslash, then the quotation mark, and the c. In the new version of stringr there is str_view, so that is something new, and there you can see very clearly what exactly that string looks like. Very, very handy, especially if you're working with stuff that looks like that. Another thing is that it highlights special characters. In this case this is a special kind of space, so when you print it, it looks normal, like a regular space that you've seen anywhere, but if you test equality of that and one with a regular space, you'll see that it's not actually the same. str_view provides really nice highlighting and functionality to clearly point out those kinds of special characters. Similarly with tabs: if you've ever run into something that had a tab that you couldn't see, str_view now lets you find that very easily.

And it also makes matches really stand out, if you're ever double checking your work and making sure that you caught everything that you expected to catch. Here we have abc, def, fghi, and we want to single out every time we see a, e, i, o, u. str_view has, again, that nice color highlighting and puts the match between less-than and greater-than signs, so it's really easy to see: oh, here's the a, here's the e, here's the i. This is the regex for the last character in a string, so again just another example of how str_view shows you exactly what you're looking for. And then finally, this is the regex for anything that is a duplicate, and here we go, pointing out the pp, the ll, the rr, etc.
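A rough sketch of the str_view() uses mentioned here. All strings are made up, and the exact highlighting depends on the console, so no printed output is shown.

```r
library(stringr)

# A string containing newlines: print() shows the escapes,
# writeLines() shows what it really looks like.
x <- "a\nb\nc"
print(x)
writeLines(x)
str_view(x)

# A non-breaking space looks like a regular space when printed,
# but the two strings are not equal; str_view() makes it visible.
nbsp <- "Hi\u00A0you"
nbsp == "Hi you"
str_view(nbsp)

# Matches are wrapped in < > so they stand out.
words <- c("abc", "def", "fghi")
str_view(words, "[aeiou]")  # vowels
str_view(words, ".$")       # last character of each string

# A repeated character, using a backreference.
str_view(c("apple", "hello", "carrot"), "(.)\\1")
```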

That is str_view. Like I mentioned, the new version of stringr has a ton of different functions that may be helpful for you. There's str_equal, which tells you if two strings are the same, and you can ignore case if you want. This one is another str_equal example: these are both a with an accent, but they're encoded in two different ways, and they are equivalent. If you do a1 == a2, it's going to say FALSE, but since it's the same character in just a different encoding, str_equal will let you know that they're actually equal and return TRUE.

Another one is str_rank, so it'll give you the rank of the values. In this case you assign one to a, two to b, and then four to c, because there are two b's. str_unique, as you can imagine, returns the unique values, and you can also ignore case in this.

And another function: before, if you wanted to split a single string and return a character vector, you needed to do str_split and then unlist, because it would always return a list. Now, with the new version of stringr, there's a function called str_split_1, and it will do all of that in one go, like that. So it's split by the dash and gives you a character vector.

And finally, a similar one is str_split_i. What this does is extract a specific piece of a string that you tell it to. We have these strings, we say split by the dash and give me the second output, the second value, and as you can imagine this is going to give you b, it's going to give you e, and then it's going to give you g, because that's the second piece of each string after it's been split. If you ask for something that doesn't exist it'll output an NA; abc only has three pieces after it's been split, so asking for the fourth is going to return an NA. You can also get the last value by using a negative number in str_split_i. Oh yes, and this is str_like, which is like str_detect for anybody that uses SQL and wants to use a similar syntax. I just need to add a heading to it. And that's stringr. Any questions?
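The comparison and ranking helpers can be tried on a few small vectors. The inputs are assumptions chosen to match the cases described.

```r
library(stringr)

# str_equal(): comparison with optional case-insensitivity.
str_equal("a", "A")
str_equal("a", "A", ignore_case = TRUE)

# The same accented letter, encoded two different ways.
a1 <- "\u00e1"   # single precomposed character
a2 <- "a\u0301"  # "a" followed by a combining accent
same_accent <- str_equal(a1, a2)

# str_rank(): ties share the lowest rank, so the rank after a tie is skipped.
ranks <- str_rank(c("a", "c", "b", "b"))

# str_unique(): unique values, optionally ignoring case.
uniq      <- str_unique(c("a", "a", "A"))
uniq_case <- str_unique(c("a", "a", "A"), ignore_case = TRUE)

ranks
uniq
uniq_case
```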
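The splitting helpers might be used like this. The input strings are hypothetical, picked so that the second pieces are b, e, and g as in the talk.

```r
library(stringr)

# str_split_1(): split ONE string and get a character vector back,
# with no unlist() step.
pieces <- str_split_1("a-b-c", "-")

# str_split_i(): split each string and keep only the i-th piece.
x <- c("a-b-c", "d-e", "f-g-h-i")
second <- str_split_i(x, "-", 2)
fourth <- str_split_i(x, "-", 4)   # NA where there is no fourth piece
last   <- str_split_i(x, "-", -1)  # negative index counts from the end

# str_like(): SQL LIKE-style matching, where % matches any run of characters.
starts_with_app <- str_like("apple", "app%")

pieces
second
fourth
last
```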

Things that return elements already unlisted, so that's very useful, because you do grouping and things and then you want to use the vector that you have extracted for, you know, pulling out some strings that correspond to the vector that you have just created, and then you need to unlist. So that is very useful. Yeah, for sure. I think these functions are going to be very helpful. I work a lot with text data and it comes in all different formats with all sorts of hidden stuff, so it'll be very helpful to see where those things are before running something and having it be incorrect.

What's new in tidyr

Oh cool, all right, thanks. So now we're going to move on to tidyr, which also has several new changes. tidyr 1.3.0 has a few new separate functions and... oh, that's why it looks like that, I'm going to move to my visual editor. All right.

Okay, so I've loaded tidyr. There were several ways of separating with tidyr, and it wasn't very standardized. For example, if you wanted to separate with a delimiter like a comma or colon, for columns you would use separate and give the delimiter in sep, or for rows you would use separate_rows. To separate by position you would use this, similar to that, but then there was no equivalent for rows, and for regular expressions there was extract, but only for columns. In the 1.3.0 release of tidyr there's now a new separate family of functions that supersede all of these, and that means there are better alternatives for you out there. What those are and how they look: separate_wider_ followed by what you want to separate by, and separate_longer_ followed by what you want to separate by.

So now if you want to separate with a delimiter, for columns you would use separate_wider_delim and for rows separate_longer_delim. As you can imagine, for position for columns it's separate_wider_position, with a regular expression it's separate_wider_regex, and then for rows and positions it's separate_longer_position. They are longer, but doing this presentation it just made so much more sense. When I was like, oh, I want to do it by this now, it was a lot easier to just switch out delim to position to regex, as opposed to trying to remember whether it's an argument or a whole different function. So that is what's new in tidyr: the new separate family of functions.
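A minimal sketch of the new family on made-up data. The data frames and the column names first, second, letter, and num are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical delimited data.
df <- tibble(id = 1:2, x = c("a,b", "c,d"))

# Delimiter into columns: one new column per piece.
wide <- df %>%
  separate_wider_delim(x, delim = ",", names = c("first", "second"))

# Delimiter into rows: one new row per piece.
long <- df %>%
  separate_longer_delim(x, delim = ",")

# Regular expression into columns: named patterns become columns,
# unnamed patterns are matched but dropped.
codes <- tibble(code = c("a-1", "b-22"))
parsed <- codes %>%
  separate_wider_regex(code, patterns = c(letter = "[a-z]", "-", num = "\\d+"))

wide
long
parsed
```

Switching between delim, position, and regex only changes the function suffix and the argument that describes the pieces.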

and so I created a full list data from the tidy hideout package from our opensci and it provides a bunch of real-time data um on Canadian uh station water stations and so once we load it I'm going to pull some data and I'm going to show you what the date looks like the date column

just give second okay so now we have a column called date um but as you can see it's kind of all uh put together the the um year month day the hour minute second and there are different ways of doing this but for the purpose of this demonstration we're going to use tidy r so say that we want to separate the date and the time um into their own separate rows we can use separate wider position um if if we choose to so uh because the date is like you know the same length throughout and the hour minute second is the same date throughout let's just see how to do it with a separate wider position so with the function we're going to say we're going to select that column and then we're going to say a year month date create a column called year month date that's 10 characters long um another column that's called space that's one and our uh minute second that's eight characters long

so doing that gives us uh what we expect now um the date is it's in its own column space is in its own column and then our minute second in its own column but you may be wondering like we don't really need this column like you know can we do it without um putting that out there and the answer is yes but uh how do you do it so say you were to run something like this where you just try to ignore it and just don't mention the space at all um you just try to pull out 10 characters for the day the year month day and eight characters for the hour minute second

Okay, well, yes. So if you give it too few or too many of whatever it is that you need, it will give you an error, but not just that, it will tell you what exactly happened. In this case, the number of characters in all of this is 19, but we only gave it 18, right? We only gave it 10 and 8. So it was expecting 18 but it got 19. You can start using the debugging features that are now part of tidyr 1.3.0 to figure out what exactly happened. It will even tell you: use this to diagnose the problem, use this to silence the problem. So let's try and debug. We're going to use too_many = "debug", and so rerunning with that added here, right,

it will provide these really informative columns that tell you what exactly happened. So here is date: date is 19 characters long, is it okay? No. And it tells you which rows failed. All of this is pretty uniform, but you can imagine if the rows are not uniform it would be very helpful to pinpoint, oh, this one failed because maybe there's an extra space. So now we know the problem happened with every single row in this instance, and we need to fix it, because we gave it 18 and it's actually 19. But tidyr does have a way of omitting output or components that you don't want: you just don't assign a name to it. So in this case we do year-month-day equals 10, and then 1, which signifies the space that we don't want, and then hour-minute-second equals eight. Doing that, we have the right number of characters, which is 19, and we get only the two columns that we were actually interested in. So let's go ahead and run this and see
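A minimal sketch of the position-based split described here. The two-row `stations` tibble is a hypothetical stand-in for the water-station data shown in the talk, and the column names `ymd` and `hms` are illustrative only.

```r
library(tidyr)
library(tibble)

# Hypothetical stand-in for the real-time data: 19-character date-times.
stations <- tibble(date = c("2023-03-01 10:15:00", "2023-03-02 23:59:59"))

# Widths sum to 18, not 19. With too_many = "debug" tidyr adds diagnostic
# columns (and warns) instead of stopping with an error.
debugged <- stations |>
  separate_wider_position(date, widths = c(ymd = 10, hms = 8), too_many = "debug")

# The unnamed width of 1 consumes the space but creates no column.
fixed <- stations |>
  separate_wider_position(date, widths = c(ymd = 10, 1, hms = 8))
fixed
```

Leaving a width unnamed is what drops that piece from the output, so no follow-up select() is needed.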

what happens. As another example, if you wanted to do it with delim instead of position, or wanted to continue this with delim instead of position: here the year, month, and day are separated by dashes, and the hour, minute, and seconds are separated by colons. So you can use separate_wider_delim() and specify the different delimiters like so. As you can see, it's very uniform and very easy to remember, and you get exactly what you would expect if you separate it out by dashes and by colons.
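The delimiter-based version could look like the sketch below. The one-row tibble and the output column names are assumptions for illustration, not the data from the talk.

```r
library(tidyr)
library(tibble)

# Hypothetical result of the earlier position-based split.
parts <- tibble(ymd = "2023-03-01", hms = "10:15:00")

split_parts <- parts |>
  separate_wider_delim(ymd, delim = "-", names = c("year", "month", "day")) |>
  separate_wider_delim(hms, delim = ":", names = c("hour", "minute", "second"))
split_parts
```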

And then finally, to give an example of separate_wider_regex(), we're going to load the tidycensus package, which is an amazing package that makes it very easy to pull data from the census.

So we have a data frame, and under name it has the block, the block group, the census tract, you know, everything mushed together with commas. As you can imagine, there are also various different ways of splitting this out. But say we're only interested in the block numbers, the block group numbers, and the census tract numbers, and everything else we really don't need. You can use regex to specify, you know, only give me the numbers, and then put that within separate_wider_regex() to get the output that we want. So in this case we're saying: look within name and pull out the number associated with the block. It'll be the word "Block", then the space, and then whatever is between that and the comma. Then similarly with block group and with census tract. It also has regex for county and state, so as to not capture the comma. So anyway, lots of regex statements within separate_wider_regex() to get you exactly what you need, and it just simplifies the workflow: rather than having to have multiple steps for data manipulation, you're able to do it all at once. And so here we have the output, where for block we just have a number, for block group we just have the number, and no commas, no characters or anything like that. So that's tidyr.
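A sketch of that regex split, assuming a name string laid out like the one described. The value below is made up, and real tidycensus output may be formatted differently.

```r
library(tidyr)
library(tibble)

# Hypothetical value; the real column comes from a tidycensus query.
blocks <- tibble(
  NAME = "Block 1001, Block Group 1, Census Tract 12.01, Example County, Example State"
)

parsed <- blocks |>
  separate_wider_regex(
    NAME,
    patterns = c(
      "Block ",          block       = "\\d+",     # unnamed pieces are matched but dropped
      ", Block Group ",  block_group = "\\d+",
      ", Census Tract ", tract       = "[0-9.]+",
      ", ",              county      = "[^,]+",
      ", ",              state       = ".*"
    )
  )
parsed
```

Named patterns become columns, so the commas and labels never reach the output.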

Any questions? Oh, I guess that's not all of tidyr. So those were the new separate functions, and there's still a bit more on tidyr.

So in tidyr there are some improvements to unnest_wider() and unnest_longer(). I'm just going to load it again.

So, in the spirit of a lot of what we've mentioned so far: here's a nested list, where the first id contains a and b and the second id contains d, e, and f. tidyr will now be more explicit with errors if there is one. So you're trying to unnest wider here, and it's going to say: hey, you can't unnest elements with missing names, can you please use names_sep to create some names? So really nice, really informative, and you can fix it by doing so

and specifying names_sep. Then you can see that it's all been unnested here. I don't know why it's all squished, though, apologies. On the website it should show up a bit better.
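A small sketch of the names_sep fix, using a made-up tibble shaped like the one described (unnamed elements of different lengths).

```r
library(tidyr)
library(tibble)

# Hypothetical nested data: the list elements carry no names.
nested <- tibble(id = 1:2, x = list(c("a", "b"), c("d", "e", "f")))

# Without names_sep this errors, because there are no names to build columns from.
# With names_sep, tidyr numbers the pieces: x_1, x_2, x_3.
wide <- nested |> unnest_wider(x, names_sep = "_")
wide
```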

And there is a function called, I'm sorry, there's an argument called keep_empty in unnest() that lets you keep empty values, and now unnest_longer() also has keep_empty. So here we have a nested data frame with one, two, three as id. One is NULL, two is an empty integer, and three has one through three. Before, when you did unnest_longer(), it would get rid of ids one and two because they were empty. But now there's a keep_empty argument, and when you use it, it will keep them but also show that they're empty. So another improvement from tidyr.
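The keep_empty behaviour can be sketched with a tibble built to match that description (the object names are illustrative).

```r
library(tidyr)
library(tibble)

nested <- tibble(id = 1:3, x = list(NULL, integer(), 1:3))

# Default: ids 1 and 2 vanish because their elements are empty.
dropped <- nested |> unnest_longer(x)

# keep_empty = TRUE keeps one row per empty element, filled with NA.
kept <- nested |> unnest_longer(x, keep_empty = TRUE)
kept
```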

What's new in ggplot2

Any questions about those changes? Okay. Yeah, I know, so many changes. Yes, it's amazing. I'm gonna see in my visual, thank you. Cool, so let's talk about ggplot2. There's a new version, ggplot2 3.4.0, and one of the changes is a new aesthetic called linewidth. Essentially it takes over sizing for the width of lines. Previously that was done with the aesthetic called size, but as you will see, there were some questions about that. So, loading that.

There's a way that you can use different versions of packages within the same document, but it just takes a little while to load, so I'm just going to show you the image on the website, if that's all right.

Okay, so essentially, before, when you wanted to change the width of lines you would specify size, so scale_size(), and then you'd give it a range like so. The issue is that for line widths it's a length, right? But it was also the same aesthetic that you could use for changing the size of points, and that one is an area. So there was a little bit of discrepancy as to what exactly size did. Now it's been replaced for these sorts of aesthetics by linewidth, and so you can see that instead of size you're going to use scale_linewidth(). They look very similar, but putting them side by side like this you can see that there's a little bit of a difference. So it will change your plots if you've used size for line width before.

You get the gradient, you know, with linewidth. And just a note: if you try to use size with lines now, you're going to get a deprecation warning, and it's going to be very explicit: please use linewidth. Something like geom_pointrange(), if you've used it before, is a mix of lines and circles. So if size is meant to be used for that aesthetic, then it's not going to give you a warning. It's only going to give you one if it doesn't make sense. And a really quick note is that for sf plots the default line is now a little bit thinner, so it looks quite nice. That's just another thing to know if you rerun old things with a new version of ggplot2 and it looks different.
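As a sketch of the new aesthetic, the plot below maps linewidth and sets its range with scale_linewidth(). It uses the built-in airquality data rather than the example from the talk.

```r
library(ggplot2)

# linewidth (not size) now controls how thick lines are drawn.
p <- ggplot(airquality, aes(Day, Temp, group = Month)) +
  geom_line(aes(linewidth = Wind)) +
  scale_linewidth(range = c(0.5, 3))

# A fixed width works the same way.
p_fixed <- ggplot(airquality, aes(Day, Temp, group = Month)) +
  geom_line(linewidth = 1.2)
```

Printing `p` or `p_fixed` draws the plot. Using `size` in either place would still work for now but triggers the deprecation warning mentioned above.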

So generally, again, there are better error messages in ggplot2 now with the newer version. Say you run this, and I do this many, many times. You run it and it's going to throw an error, and the reason is that I used a pipe instead of a plus sign. If you've ever been like me and done that, it's going to be really informative as to what exactly went wrong. It also tells you where it went wrong, so here it says the mapping must be created by aes(), so I know the issue is here, and what to do about it: use the plus sign instead of a pipe.
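The pipe-for-plus mistake can be reproduced with a sketch like this one (mtcars is a stand-in dataset). The error is caught so the script keeps running, and the exact wording of the message may vary between ggplot2 versions.

```r
library(ggplot2)

# Piping a plot into geom_point() passes the plot as `mapping`, which fails.
result <- tryCatch(
  ggplot(mtcars, aes(mpg, disp)) |> geom_point(),
  error = function(e) e
)
cat(conditionMessage(result), "\n")

# The working version adds the layer with `+`.
ok <- ggplot(mtcars, aes(mpg, disp)) + geom_point()
```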

And here's just another example where it shows you explicitly what happened. If you take a look at this, maybe you can anticipate what the error will be. We're running it, here we go: stat_count() can only have an x or y aesthetic. And it'll let you know it happened in the first layer, so I know I'm looking here to fix the problem. So, better error messages in ggplot2. Any questions about ggplot2?

What's new in purrr

I didn't see any questions in there. Okay, thank you. Cool, okay, let's talk about purrr. purrr is a package for functional programming, which is working with functions in R. It is incredibly helpful, especially if you're doing things across tables or working with lists, so yeah, a very handy package. And you can probably guess it also has a new update, 1.1.0, oh, it should be 1.0.0, I believe, sorry, where there are a lot of big new changes. purrr is seven years old, and this is the first major release, as in it's 1.0.0, which is very exciting. Because it's a major release, it was the chance for the tidyverse team to really think about the fundamentals of what purrr is as a package, implement the functions that they believe should be part of purrr, and then release it. That's why we'll see there have been a lot of changes that come with a major release.

So first up is mapping. Hold on, I'm going to restart my session. I know it takes a while, but I think it's worth it.

Loading. All right, so the first thing is progress bars. This is very exciting. If you ever run a long-running job in purrr and you're like, is this even close to being over? Now you do not have to wonder. By adding .progress = TRUE, it will let you know exactly how far along it is. Let me make a bit of room to open the console. So here it's going to map through this function. So neat.

It gives you an ETA and everything. And say you're doing multiple things, or you just want to make sure that you name things and understand what exactly is going on: you can give it an identifier like so, and here you can see what the identifier is. I just put "Waiting". So that's progress bars.
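A sketch of both forms of the progress argument. `slow_square()` is a made-up function, and the bar only appears in an interactive session once the job has run for a couple of seconds, so this short example mainly shows the syntax.

```r
library(purrr)

# Hypothetical slow function standing in for a long-running job.
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

# Basic progress bar.
res <- map(1:20, slow_square, .progress = TRUE)

# Named progress bar: the string is shown as the label.
res_named <- map(1:20, slow_square, .progress = "Waiting")
```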

One more time: better errors that will let you know exactly what caused the problem. Here we have a list with a 10, a 5, and then a character "x", and we're going to map over all of these elements saying, please divide by 2. As you can imagine, it's going to hit an error, and if this was more complex you might be wondering what exactly happened. It will tell you where, so in index 3, that's where the error happened, and what exactly happened: non-numeric argument, because I tried to divide a character by two.
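That failing map can be sketched as follows. The error is captured with tryCatch() so the message can be inspected; its exact text is whatever purrr prints and is not reproduced here.

```r
library(purrr)

values <- list(10, 5, "x")

# The third element is a character, so dividing it by 2 fails.
err <- tryCatch(
  map(values, \(x) x / 2),
  error = function(e) e
)
cat(conditionMessage(err), "\n")
```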

And a final addition in this document for purrr: there's a new map_vec() function. What the map family does is apply a function to every element of a list, and then you also specify what kind of thing you want back. So map() always returns a list, map_int() returns an integer vector, map_dbl() a double, map_chr() a character, etc. In this case, going back to the earlier one, we're going to divide each of these elements by two, and you get back 0.5, 1, and 1.5. With those, you have to know exactly what you wanted the output to be, and if you used map_dbl() when you wanted map_chr(), you have to go back and fix it. So now there's a general map_vec(), which is more generalized and can work with other sorts of things, like dates for example. Here we're going to apply this function to 1, 2, and 3, and we're going to use map_vec() to get back a date. As opposed to map(), which would return a list, map_vec() gives us back what the function produced, which was a date. A thing to note: if you try to combine different things, again like a character and a number, it's just going to throw an error, because you can't combine the two.
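A short sketch contrasting the typed variants with map_vec(). The date arithmetic is an illustrative function, not the one used in the talk.

```r
library(purrr)

# Typed variant: you state up front that you want doubles back.
halves <- map_dbl(1:3, \(x) x / 2)

# map_vec() works for any vector type the function returns, such as dates.
dates <- map_vec(1:3, \(i) as.Date("2023-01-01") + i)

# map() would instead return a list of three separate Date objects.
as_list <- map(1:3, \(i) as.Date("2023-01-01") + i)
```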

So that's purrr.

Just have a quick question. Could you go back to where you were? I'm not very practical with this function, but do I need to put the slash? Is that a requirement for the function to...

Oh, I'm sorry, could you repeat that?

So here, where it's 1, 2, 3 and then map_vec(). Do I need to use this backslash?

Oh yes, the backslash. So that is, I believe since R 4.1, a way that you can write functions in R, and it's really just a shorthand. Instead of having to write out function, you can use a backslash, and then here is what you put in as the argument to your function, and next to it is the function body itself. In the purrr blog post, and I do recommend reading it, I mean I recommend reading all of them, but that one in particular goes through a bit more on the base R syntax and how to use it now with purrr. I think for the most part with dplyr and tidyr we didn't really need to think too much about how to create what are called anonymous functions, but with purrr it happens a lot. So you have the option to do it the previous way, but I think the hope is to get a bit comfortable with creating these functions in base R to be able to use them within purrr, as opposed to the previous way of doing it before there was this functionality in base R.

Yeah, thank you.

Yeah, and again, please take a look at the blog post I wrote with my brother. I think we tried really hard to break down the different steps of creating this in base R, because it is very new and it looks very different.

Yeah, because it reminds me of a Python lambda function.

Yes, so that's exactly what it is. Yep, so now base R just has that functionality as well.
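The shorthand discussed in this exchange can be sketched in base R alone. The function names here are illustrative.

```r
# Long form and backslash shorthand define the same function.
double_long  <- function(x) x * 2
double_short <- \(x) x * 2

# The shorthand is most useful inline, where the function has no name.
squares <- sapply(1:3, \(x) x^2)
```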

Cool. So I know we just have a few documents left, and then we can open up to more general questions. There are two functions from purrr called keep() and discard(). They keep and discard elements by value. So here we want to create ten 10s, and for each one create a sample of five, and then we just want to keep the ones that have an average greater than six. And here that anonymous function is showing up again, so you could do it either way, this or that. So here we have two results, vectors where the mean is greater than six, which is all we wanted to keep from this simulation, I guess. Similarly, with discard() we want to remove all the results that have a mean greater than six. But if you notice, that's by value: we have to say the value is greater than six or whatnot. So now there's a new one that lets you do it by position or name as opposed to by value. So say we have

this list and we want to keep a, b, and c. We use this new function keep_at() in order to keep elements by name, as opposed to having to specify what exactly the value is.

And similarly there's one for discard, discard_at().

And sorry, one last thing: you can also give keep_at() and discard_at() a logical vector instead of having to give the names of the things that you want to keep or discard.
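A sketch of filtering by value versus by name. The seed and the list `x` are assumptions added for illustration, so the number of samples kept will differ from what was shown live.

```r
library(purrr)

set.seed(1)  # arbitrary seed so the sketch is repeatable

# Ten samples of five numbers drawn from 1:10.
samples <- rep(10, 10) |> map(sample, 5)

# By value: keep or drop depending on each element's mean.
high <- keep(samples, \(x) mean(x) > 6)
rest <- discard(samples, \(x) mean(x) > 6)

# By name: no need to look at the values at all.
x <- list(a = 1, b = 2, c = 3, d = 4)
abc <- keep_at(x, c("a", "b", "c"))
no_d <- discard_at(x, "d")
```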

Flattening and simplification

Okay, so two new things are flattening and simplification. list_flatten() is incredibly helpful. It's just a way to flatten lists hierarchy by hierarchy. In this case we have one list, the second element has another list, the third one has another list, and then it branches out from there, so you can see it's a deeply hierarchical list. Say we want to remove one of the layers: we can use list_flatten() once, and that removes one of the layers, and everything shifts over one. The number of elements actually changes, because this is a list within a list and then another list. That's why it was a list of two, and now it's a list and then three other elements, which is a list of four. Especially if you work with lists, it's very helpful to be able to move the hierarchy depending on what you need. If you do it again, it'll keep going until there are no more nested lists, and then it will just stay in one hierarchy,

like so. It won't make the list disappear.
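The level-by-level flattening can be sketched with a nested list shaped like the one described. The exact values are an assumption.

```r
library(purrr)

# A list of two: the number 1, and a list that itself contains a list.
x <- list(1, list(2, list(3, 4), 5))

x1 <- list_flatten(x)   # one layer removed: 1, 2, list(3, 4), 5
x2 <- list_flatten(x1)  # second layer removed: 1, 2, 3, 4, 5
x3 <- list_flatten(x2)  # nothing left to flatten, still a list
```

Each call removes only the outermost layer of nesting, which is why the length goes from 2 to 4 to 5 and then stops changing.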

And finally there's list_simplify(), which produces a simpler type of the thing that you give it. So here's a list. Oh, I didn't print x, but the type of x is a list. Say we want to simplify it, and we just want these elements in a character vector. We can use list_simplify(), run it again, and now we have a character vector, which is great.

There are a few rules. It'll only succeed if every element has length one, so here we have three and four together within this list, so it's not going to work. Everything must be compatible, so here we have a character with two numbers, so that's not going to work. And if you want it so that it simplifies when possible, but doesn't throw an error when it can't, you can add strict = FALSE.

And then you can also be specific as to what exactly you want the output type to be.
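A sketch of those rules with made-up inputs. The `ptype` argument is how the output type is requested here.

```r
library(purrr)

# Every element has length one and the same type: simplifies to a vector.
chars <- list_simplify(list("a", "b", "c"))

# One element has length two, so simplification is impossible.
# strict = FALSE returns the list unchanged instead of erroring.
unchanged <- list_simplify(list(1, 2, 3:4), strict = FALSE)

# Ask for a specific output type.
dbls <- list_simplify(list(1, 2, 3), ptype = double())
```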

Oh, and one more note, a deprecation. So map_dfr() and map_dfc() have been, I'm sorry, not deprecated, superseded, and instead they suggest that you use list_rbind() and list_cbind(). So whereas you may have used these in the past, and this is another one of that map family of functions, they're suggesting that first you use map() to create a list and then use list_rbind() to create what it is that you need.
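The suggested two-step replacement for map_dfr() might look like this sketch. The data frames are invented for illustration.

```r
library(purrr)

# Step 1: map() builds a list of small data frames.
pieces <- map(1:3, \(i) data.frame(id = i, squared = i^2))

# Step 2: list_rbind() stacks them into one data frame.
combined <- list_rbind(pieces)
combined
```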

Q&A: list flatten

Okay, any questions?

Thank you. So there is one from Dorota. Hi Dorota. She asked, can you control which list is removed in the hierarchy?

I'll try and open it up again. Yeah, I believe it does it by hierarchy, this specific function. I can double check. So it will just move things over. I think there are other functions if you want to be more specific.

So does it remove sort of from the inner to the outer list? Is that the idea? I guess I just want to clarify.

Yeah, no, for sure. Okay, so let's walk through this example. So here we have list one, and I think this is one list, and then here we have another list that has

Well, I think you're muted, sorry.

Oh, I'm sorry, I don't know when I muted myself, apologies. I was going to say, let's walk through this example. I think it does it from the outside in, if I'm trying to conceptualize it. So first we have one, which is its own element here, and then we have the second element, which is a list made up of two, then another list, and then five. That's represented here. When you use list_flatten(), the first one stays the same because it has nowhere to go, there's nothing to flatten. But this one now becomes three elements, because the two has been flattened out into its own element. This one stays a list of two, but now it's not within here, and then this one

Yeah, okay, I see what happened.

Yeah, so if you see it here, I had to print this out a bunch of times to see exactly what happened. And so then if you do it one more time, this element, this element, and this element again cannot be flattened anymore, and then this one with its two children becomes flattened. Oops, I'm sorry, here we go. This list with these two children flattens, and now they're all in the same hierarchy.

Thank you, appreciate it.

Yeah, of course. It kind of reminds me of Excel, you know, where you're like back, back, back, and you make the numbers smaller.

Breaking changes in purrr

Cool. And then finally, just some breaking changes. As you have seen, map_dfr() and map_dfc() have been superseded, and then there are also some breaking changes in purrr that you should just be aware of if these are functions that you have used in the past. The tidyverse team did a lot of research and made sure that everything has alternatives and that everything has been communicated to folks that they knew were using these within their packages. But if this is something that comes up in your data workflow, again it's just something to keep in mind so that your code doesn't break.

And so let me load purrr. One thing is pluck(), which is a function that lets you get an element within these kinds of nested structures that we were just talking about. It has a default setting, and now when you supply a default, it will only return the default for NULL and absent elements. So here we have a list where y$a is an empty character vector, and then here we have NULL for y$b.

And so here you might expect that if you pluck a from y and set the default as NA, because it's an empty character, you would get an NA. But now purrr will give you back your empty character. So that's something new to keep in mind. Versus if you pull this NULL value out from b, it will be replaced by the default, and if you ask for something that doesn't exist, like c, it will also give you the default. So that's one breaking change. It impacts map() because map() uses pluck() when it's given an integer vector, character vector, or list instead of a function. So again, just another thing to be aware of. And as part of using the vctrs package, purrr is going to be more explicit in the things that it wants and then give you better errors.
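The three cases described here can be sketched as follows, assuming purrr 1.0.0 or later and a list built to match the description.

```r
library(purrr)

x <- list(y = list(a = character(), b = NULL))

# Present but empty: the empty vector comes back, not the default.
empty <- pluck(x, "y", "a", .default = NA)

# NULL element: replaced by the default.
was_null <- pluck(x, "y", "b", .default = NA)

# Absent element: also replaced by the default.
absent <- pluck(x, "y", "c", .default = NA)
```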

So before, you might have been able to get away with using map_chr() with numbers, but as you can see it's really not ideal. You're getting these character values out, but they have a bunch of zeros, and that's probably not exactly what you want. Now it's going to give you a warning and be like, please be explicit. And what does explicit mean? It means saying, yeah, I do want a character for map_chr(), like so.

So that's also kind of assuming that this is generally what you were expecting, instead of this.
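Being explicit with map_chr() could look like this sketch, where the conversion to character happens inside the function.

```r
library(purrr)

# Relying on automatic number-to-character conversion now warns, e.g.
#   map_chr(1:4, \(x) x + 1)
# Converting explicitly avoids the warning and the trailing zeros.
labels <- map_chr(1:4, \(x) as.character(x + 1))
labels
```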

Another thing just to note: there are different ways that R deals with NULLs, and previously it would depend on how you wrote what you wanted to do which result you would get, if that makes sense. So here we want to set a to NULL. We can assign it with the dollar sign, and when we look at it, a is no longer part of the list. But we could also do it this way, with the brackets, and say x2 a, make that NULL, and then when you look at that, it is different: a still exists, but it's a NULL. So you could imagine a world where you expect them to do the same thing, but they actually do different things, just because of how base R works with brackets and dollar signs and things like that. Now in purrr there's a consistent way: when you use list_modify() to change something into a NULL, it's going to be the same no matter how you do it.

And if you want to remove it, like in that first example that we showed, use this function called zap().
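A sketch of the two base R behaviours next to the purrr approach. It assumes purrr 1.0.0 or later, and the result of the `a = NULL` call is printed rather than asserted, since it depends on the installed version.

```r
library(purrr)

x1 <- x2 <- x3 <- list(a = 1, b = 2)

# Base R, dollar sign: the element is removed.
x1$a <- NULL

# Base R, single brackets with list(NULL): the element stays, holding NULL.
x2["a"] <- list(NULL)

# purrr: setting to NULL and removing are spelled differently.
set_null <- list_modify(x3, a = NULL)   # intended to keep `a` as NULL
removed  <- list_modify(x3, a = zap())  # zap() removes the element
str(set_null)
str(removed)
```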

Nope, thanks.

Deprecations: cross and expand_grid

And finally, some deprecations. If you've ever used cross(), all of that has been deprecated in favor of tidyr's expand_grid(). An example of why is one I found in the GitHub discussions. Here we have some letters, and what cross() does is provide all of the combinations of the things that you give it.

So, running cross_df() on the x up above, I'm just going to wait here a little bit.

This is where the progress bar comes in handy. So that's like 39 seconds, and then using expand_grid() to do the same thing took 0.23. So that is why cross() has been deprecated, and you can check out the documentation for examples of how to translate your code if you'd like to do so. And then at the bottom here, and again in more detail in the tidyverse blog posts, are just various things that have been superseded and deprecated for various reasons, like splice(), lift(), prepend(), and when(). This was just part of the effort of finding what the core of purrr is, which is what they were hoping to do with a major release. And that is what I have, thanks.
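The replacement for cross() might be sketched like this. The inputs are small made-up vectors, so no timing comparison is implied.

```r
library(tidyr)

# All combinations of the supplied vectors, returned as a tibble.
combos <- expand_grid(first = letters[1:3], second = letters[1:3])
combos
```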
