Resources

Profvis - Profiling tools for Faster R code | RStudio Webinar - 2016

This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/web... . We try to host a couple each month, with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message.


Transcript

This transcript was generated automatically and may contain errors.

Okay, so as Bill said, I'm going to be talking about profvis and profiling tools for faster R code. Thanks for coming. So here's a question that probably comes up a lot if you work with R: how do I make my R code faster? And that naturally leads to another question, which is: why is my R code slow? My strategy for this talk is to walk you through a couple of examples and show you some of the tools that can help you diagnose these problems and speed up your code. Hopefully we'll also learn about a few particular things in R that can be surprisingly slow, and ways to speed them up.

Okay, so I'm just going to start with this example. Let's say you've got a data frame that has 400,000 rows and 150 columns, and I'm generating some random data here with rnorm. There's nothing special about this data; I picked it because it's a good size and gives good timings for these demonstrations. So one thing you might do is write a function that normalizes the columns. Let's say I want to take each of these 150 columns and make it so that the mean of each column is zero: for each column, I find its mean and then subtract that mean from the column. So I've got this function, norm_cols, and what it does is take the mean of every column using the apply function, then subtract the mean I calculated from each column, and return the resulting data frame. This is just a first-pass attempt at normalizing the columns.
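The setup described above might look roughly like this (a sketch reconstructed from the talk; the exact variable and function names on screen may differ):

```r
# Generate a 400,000 x 150 data frame of random normal values.
times <- 4e5
cols  <- 150
data  <- as.data.frame(matrix(rnorm(times * cols, mean = 5), ncol = cols))

# First-pass normalizer: center each column at zero.
norm_cols <- function(d) {
  means <- apply(d, 2, mean)        # column means via apply()
  for (i in seq_along(means)) {
    d[, i] <- d[, i] - means[i]     # subtract each column's mean
  }
  d
}
```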

Okay, so after defining this function, I would call norm_cols on the data that I've generated. Let's try that. I've got my code here, the same code you saw before. First I'll generate the data; this takes a few seconds, but fortunately we only have to do this part once. Okay, it's done. Now I'll define my function... oops, sorry, I typed an extra character there accidentally, and now it's trying to print out the data. There's a lot of it, so that's very slow; let me stop this and restart my R session. All right, starting over: let's create the data, define our function norm_cols, and then run the function. This time we'll wrap it in invisible so it doesn't try to print the result, because that is very slow. As you saw, that took a few seconds to run. I'll do it again so you can see: it sits there for a few seconds and then returns to the console.

Using system.time and the sampling profiler

Okay, so this takes a few seconds, and if we want a closer look at the amount of time it takes, the first-pass thing you might do is run system.time. You take the same code and wrap it in system.time, and it tells you how long the code inside takes to run. Let's try that: the total time was 2.6 seconds. So if you're trying to figure out which parts of your code are slow, this is the first thing you might do, just to see how long the whole thing takes. But after that you're typically left to your own devices: you have to look at your code and guess at which parts are slow and which are not.
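The timing step above is just a wrapper around the same call (assuming the norm_cols sketch from earlier):

```r
# system.time() reports how long the wrapped code takes overall;
# the presenter's machine showed about 2.6 seconds elapsed.
system.time({
  d <- norm_cols(data)
})
```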

Getting an accurate idea of which parts are expensive can take a lot of experience, and sometimes what you think is slow is not actually slow. So to help figure out which parts are slow, so you can speed them up, you can use profiling tools.

So R has a built-in profiler. It's a sampling profiler, and what it does is tell you what the R process is doing over time. Using it is really simple: you call Rprof(), then you run your code, and then you call Rprof(NULL), which stops the profiler. Here's what happens. Let's say this is time along the x-axis. The R process does a lot of computation, and periodically (I believe the default is every 20 milliseconds) the profiler stops the R process and queries it: what function are you in right now? Our code is very simple, so it'll be in the norm_cols function, and the profiler records that. Then the R process goes back to doing all sorts of computation, gets stopped again by the profiler, and again it's in the norm_cols function, so the profiler saves that again. It keeps doing this every 20 milliseconds until the code is done.
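Using the built-in profiler directly looks like this (the output file name is an arbitrary choice; 0.02 seconds is the default sampling interval):

```r
# Start R's built-in sampling profiler, writing samples to a file.
Rprof("profile.out", interval = 0.02)

d <- norm_cols(data)        # the code we want to profile

Rprof(NULL)                 # stop profiling
summaryRprof("profile.out") # summarize the recorded call-stack samples
```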

Now obviously, if this were all it recorded, it would not be particularly helpful information; we already know we're in the norm_cols function this entire time. But it actually records more than that. It records not just the outermost function being called, which was norm_cols, but also the other functions in the call, what's called the call stack. So norm_cols calls apply; here's our function definition, and it first calls apply. And apply, it turns out, calls as.matrix, which calls as.matrix.data.frame and unlist. This is the call stack, and these are all R functions. The functions at the top of the call stack typically call out to C code, which the profiler does not record and gives no insight into.

So that's what might happen at the first sample. At the second or third sample, something different might happen: norm_cols calls apply, and apply calls mean, which calls mean.default, and so on. Eventually, after all the apply work is done, the minus function gets called, because these operators in R are all functions too. So this is the information that the sampling profiler records.


Now, the data that comes out of Rprof, the built-in R profiler, takes some practice and expertise to work with. Oops, actually, let's go back here; one important thing to remember. If this map is like the full complexity of everything the computer is doing, then what the profiler tells you is a very low-resolution view of it. But that's okay, because for this sort of work we're trying to find the large, slow, expensive things and speed them up. We typically don't need all the fine detail in order to make improvements. We can see the large features, like the continents and the oceans, and those are the things we need to look at for speeding up our code.

Introducing profvis

Okay, so as I was saying, the data that comes out of the profiler, and the built-in tools in R, take some practice and expertise to use. So I've worked on a package called profvis, which makes it a lot easier to interact with and interpret the profiling information that gets recorded. To use it, you run install.packages("profvis") and then library(profvis), and then you take the same code and wrap it in profvis, similar to how we used system.time before.
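In code, the steps just described are (again assuming the norm_cols sketch from earlier):

```r
install.packages("profvis")  # one-time installation from CRAN
library(profvis)

# Wrap the code to profile in profvis(), just like system.time() before.
profvis({
  d <- norm_cols(data)
})
```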

Okay, so I'll show you what happens if we do that. Here's our code, wrapped in profvis; let's run it. I'm running this from within RStudio, and in RStudio the result pops up in this panel right here. If you're running it outside of RStudio, it comes up in a web browser, and the interactions are exactly the same. So there are two parts that show up. At the top is the code, all the code that we ran, and at the bottom there's a flame graph. The code itself is self-explanatory, and on the right there are two little graphs for each line of code that are useful. The most important one is the time: about 1,700 milliseconds were spent on this line of code and about 280 milliseconds on this one, and on this line there are about 2,000 milliseconds. It turns out this 2,010 is the sum of the other two lines: the time attributed to this line was actually spent inside the function it called, so it shows up within those other two lines as well. That's the time column. The memory column shows how much memory was allocated and deallocated within each line of code. In the call to the norm_cols function, there were 2,000 megabytes of memory allocated and 900 megabytes deallocated; I'll talk more about that in a bit.

Okay, so that's the code, and there's this corresponding thing at the bottom called a flame graph. This is really important. In the flame graph, the x-axis is time: this is where we started, it goes across as time progresses, and this is where it ended. All of the time was spent in the norm_cols function, and norm_cols called out to apply, which we saw before in that diagram I showed you. apply called as.matrix, which called a bunch of other functions, as.matrix.data.frame and unlist. After apply finished with as.matrix, it called aperm, and then it called the function we passed in, which is mean. And after apply finished, the minus function was called. The whole thing took about two seconds.

Then there's this GC block, which is a garbage collection event. When R has allocated a lot of memory and decides it needs to free up some of the memory it previously allocated, it does a garbage collection, which stops all the processing and just spends time clearing up memory. That's where the memory deallocations happen, so that minus 903 megabytes was the garbage collection event right there.

Comparing approaches for column means

Okay, so now let's talk about this code a bit and what's expensive here. The apply function first converts our data frame to a matrix, and then calls aperm on it, which essentially transposes the matrix. All of that work takes almost 800 milliseconds, and it's just to transform the data into the right form before it can start doing the actual work, which is taking the mean of each column. That seems kind of like a waste, so let's think about some other ways of getting the means of columns that might be more helpful.

So let's find some other ways of doing this and run profvis on them. Here are four ways. The first is to apply the mean function using apply, as before. The next is colMeans: there's a built-in function in R for taking column means, and it seems like that should be pretty fast, right, since it's a specialized function. And there are two other ways, using lapply and vapply. These use the fact that data frames in R are implemented as lists: a data frame is actually a list of vectors, where each column is one of those vectors. So we can lapply the mean over all of those columns just by calling it like this, and vapply does basically the same thing except it enforces a return type, which can be useful; but they're essentially the same. If we run this code, it gives us the ability to compare these four different ways of getting the means. The first is apply; we knew this was slow, and this time it took about 1,700 milliseconds. The next way we tried was colMeans, which is much faster, about four times faster, but it's not the fastest way, which may be a little surprising. And lapply and vapply are both very fast: each of them is over ten times faster than our original method. Now again, the profiler gives us useful information: it tells us not just that colMeans is slower, but also gives us some insight into why. (By the way, you can zoom in on the flame graph with these controls if you want a closer look.) The reason colMeans is slower is that, like the apply function, it converts the data to a matrix first. If you read the documentation for colMeans, it says that it's meant for matrix data, but if you give it a data frame it will of course convert it to a matrix, and that turns out to be an expensive operation. After that, this little empty block is where it calls out to C code to actually take the means, and that part is very fast. lapply and vapply don't have that matrix-conversion step, so they're overall much faster than colMeans or apply.
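The four-way comparison described above can be profiled in one profvis call, something like this (a sketch; the on-screen code may have differed slightly):

```r
library(profvis)

profvis({
  means <- apply(data, 2, mean)            # converts to a matrix first: slow
  means <- colMeans(data)                  # also converts to a matrix
  means <- lapply(data, mean)              # treats the data frame as a list of columns
  means <- vapply(data, mean, numeric(1))  # same, with an enforced return type
})
```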

Okay, so then we can go back and write a more optimized version of the code. In this example I used vapply, just because the data that comes out of it is a little more convenient to work with, but I could have used lapply here as well. We just make a copy of the data, take the means of each column, and then modify that data frame by subtracting the means. Let's take a quick look at that. Here's the original version; we worked on the slowest part of the code first, which was this line, and now that part is much faster in the corresponding part of our new code. Now it's this other line that's slower, so there's actually still some room for improvement, but this is already much faster than our original code. Originally it took about two seconds to do the normalizing, and now it takes about 400 milliseconds, so it's about five times faster just from that one optimization. Again, there's room for a bit more improvement here, but I'm not going to go into depth on that.
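The optimized normalizer might look like this (a sketch under the same assumptions as before; vapply avoids the matrix conversion that made apply slow):

```r
norm_cols_fast <- function(d) {
  # vapply works directly on the data frame's columns (a list of vectors),
  # so no matrix conversion or transpose is needed.
  means <- vapply(d, mean, numeric(1))
  for (i in seq_along(means)) {
    d[, i] <- d[, i] - means[i]     # subtract each column's mean
  }
  d
}
```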

Second example: parsing profiler output

Okay, so I want to show a second example. This is a problem that actually came up while I was working on profvis, in processing the profiler's data. The data recorded by the profiler looks like this; it's basically staged as a text file. Each line of text represents one sample, one moment in time, and shows the call stack. It's in order, so apply is at the bottom of the call stack, then as.matrix, then as.matrix.data.frame, then unlist. The next sample looks like that too, and a sample further down the line might only be calling apply and then aperm, and so on. There are about 750 lines of this, and we need it transformed into a data frame that looks like this, where each row represents one of those samples. It records the original line it came from, which is essentially the time step; the column, which (it's in reversed order here) is essentially the depth in the call stack; and the label. The way the original code did this was to take each line, convert it into a small data frame like this four-row one, and then stick all those little data frames together. I'm just going to walk through this code real quick, and you can look for some of the things that might seem slow. First we read in the text file, and we create a list to store all those little data frames. Then, for each line, we process the line of text: we save the line in a variable, split the line on spaces, create a data frame from the result, and store that data frame in the list. We keep looping over the lines, and at the end we combine all the data frames together by calling do.call with rbind. Okay, let's run this code real quick. It takes about 400 milliseconds, and we can see what's expensive.
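A sketch of that original parsing code (the file name and column names are assumptions; the structure follows the steps just described):

```r
lines <- readLines("profile.out")
results <- list()                  # not pre-allocated

for (i in seq_along(lines)) {
  line <- lines[i]                       # save the line in a variable
  labels <- strsplit(line, " ")[[1]]     # split on spaces
  df <- data.frame(                      # one small data frame per sample
    time  = i,                           # which time step this sample is
    depth = rev(seq_along(labels)),      # depth in the call stack
    label = labels,
    stringsAsFactors = FALSE
  )
  results[[i]] <- df                     # grow the list in the loop
}

# Combine all the little data frames at the end.
result <- do.call(rbind, results)
```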

Well, first let's take a look at some of the things that seem like they might be expensive. These are typical rules of thumb in the R community for things that are slow. First, this list here is not pre-allocated. I'm using a for loop, and people tend to say to avoid for loops because they can be slow. And I'm growing this list in the loop: because I did not pre-allocate the list, every time I add something to it, it may have to allocate a whole new list. These are things that seem like they should be expensive, but if we take a look at the actual profiling result, that's not necessarily the case. Again, it took about 400 milliseconds to do all this, and most of the time, as we can see, was spent here, creating the data frames. There's some time spent splitting up the strings, but much more time creating the data frames, and, at the end, putting the data frames together with do.call and rbind.

So that's interesting: actually storing the data and growing the list in the loop was not very expensive. It looks like that took about 20 milliseconds total, which is a negligible amount of time compared to everything else that's happening. It just goes to show that the expensive parts are sometimes counterintuitive: creating data frames, it turns out, is expensive in R.


Now, I won't go into detail about this, but the way I solved this particular problem was, instead of creating a data frame on each iteration of the loop, to put the data in a list, because that's essentially what a data frame is. A data frame is essentially a list, except it enforces a few things, like each of the vectors being the same length; but if we're careful, we can just use a list instead. Then I create a data frame once at the end, by calling data.frame and extracting each of these columns from the list of lists that I have. Again, I won't go into detail, but the point is this: since we know what the actual expensive parts are, instead of spending time optimizing the wrong thing, like converting the for loop to lapply or something like that, we can spend time fixing what's actually expensive, which is the data frame creation. If I run this faster version of the code, the whole thing takes 20 milliseconds, as opposed to 400. And again, we can do this because we know what the actual slow parts of the code are, and that's the information the profiling tools give us.
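The faster approach might be sketched like this (same assumed file and column names as the slow version; plain lists in the loop, one data.frame call at the end):

```r
lines <- readLines("profile.out")
results <- list()

for (i in seq_along(lines)) {
  labels <- strsplit(lines[i], " ")[[1]]
  results[[i]] <- list(                # a plain list, not a data frame
    time  = rep(i, length(labels)),
    depth = rev(seq_along(labels)),
    label = labels
  )
}

# Build a single data frame once, extracting each column
# from the list of lists.
result <- data.frame(
  time  = unlist(lapply(results, `[[`, "time")),
  depth = unlist(lapply(results, `[[`, "depth")),
  label = unlist(lapply(results, `[[`, "label")),
  stringsAsFactors = FALSE
)
```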

RStudio integration and sharing profiles

Okay, so those are my two examples, and as you can see, I've been using profvis with RStudio. I believe this still requires a recent daily build of RStudio, which you can find if you just google for "RStudio daily build". RStudio has some other nice integration features which I want to show you. First, there's a profiling menu. I've been running all of this by wrapping my code in profvis, but you don't have to do it that way if you're using RStudio. What you can do is select all of the code you want to run (this time not wrapped in profvis; it's just your code, in whatever R file you're working on) and choose Profile, then "Profile selected lines". This takes a few seconds, and then the profile shows up here like that, so you don't have to modify the code and add profvis anywhere. Once you have it like this, if you want to save it for later, you can just click on the save button here. Let's save it as temp.Rprofvis. If I close that and then find the file, temp.Rprofvis, and click on it, it opens up again here. It also turns out that if you want to share this with people, there are a couple of ways of doing it. This .Rprofvis file is really just an HTML file, so if I rename it to temp.html and double-click on it, it opens up in a web browser, and I can interact with it the same way as before, because it's just a web page. So that's one way of sharing it: save it as an HTML file and send it to somebody. But there's also a publish button here, and if you hit that, you can publish it to RPubs. If you're not familiar with RPubs, it's just a free hosting service that RStudio runs for documents. So I'll just publish it, call it "profvis example", and continue, and then it's online at this URL here. You can share that URL with people: if you want to show them the profiling of your code, what's fast and what's slow, you can just send them that URL. It's a nice, convenient way of doing it.

Okay, and one other thing I want to show you. Previously I selected my code and ran "Profile selected lines". Another way you can do it is to choose Profile, then "Start profiling", and then run your code; well, I guess I only have to run this one line here. So I started profiling, then I ran that line of code, and it's still profiling; when I hit stop, it stops, and then it pops up the information here. When you do it this way, though, you don't get the source code showing up here, so it's not quite as nice: you don't get to explore the code that you were running.

Oh yes, and as I mentioned before, you can use profvis without RStudio. If you're using R in the terminal, or R GUI, or say R from Emacs, and you run profvis, it will open the result in a separate web browser instead of popping it up in the RStudio panel like I was working with here.

Wrap-up and Q&A

Okay, so that pretty much wraps things up. Some things to remember: hopefully from this you've learned how the sampling profiler works and gotten a feel for the profvis interface, and we've also learned that performance bottlenecks are sometimes counterintuitive, so it's really useful to have visualization tools like profvis to help you figure out what those bottlenecks are so that you can fix them. If you go to the website for the package, rstudio.github.io/profvis, there's a lot more information. There are a lot of details I've glossed over in this talk, and if you're curious about them, you can learn more by visiting the website.

Okay, and I think I can take some questions now. First question: "I run code on a cluster; is there a way to pull the profile result to local?" I don't think the profiler will give appropriate information if you're running code on a cluster or in a distributed system. Unfortunately, I don't think the profiler records the information that's necessary to give you what you want there.

Let's see, there's another one: "Do you have to define the functions in the profvis call, or can you just call the function?" That's a really good question. In my examples I was defining the function inside the profvis call, and what that gives you is that you can view the code in the code viewer. If the function is defined outside of the profvis call, then you won't be able to see it in the code viewer. Now, if you're say developing a package and you want to load a package so that you can see all the source in the package, there are ways of doing that, which are described on the website. But generally speaking, it's best to define the functions inside profvis so that you can explore them.

Okay: "When I tried profvis this spring in RStudio on a Linux machine, it did not work; it only worked in Windows RStudio. Any progress?" That is surprising to me. It should work in RStudio on Linux, and if it doesn't, that's a bug. Now, it's possible that if you had an old version of RStudio, the result might not have come up inside RStudio, if that was the issue; but profvis itself should work fine across all platforms.

"How will this work with Shiny?" That is a really good question. Why don't I try to find an example here... oops, sorry... okay, so I have an example here of a Shiny application, and I can run it interactively too; on the website there's an example about profiling a Shiny application. Actually, I'll just run this code since it's right there. So I'm using runExample to run one of the built-in Shiny examples. Okay, so this is the tabsets example; I'll move the slider a few times, look at the summary, okay. And then when I close the Shiny app, profvis knows that it's finished and processes the data; that takes a little time. Okay, here we go; let me pop this out so it's a little bit larger.

Okay, so this is the Shiny app. The whole thing ran within runExample, which then called runApp, and the important things to look at are these outputs here. This one is blue, or, well, I guess it might be hard to tell, it's actually light blue: output$plot here. You'll notice I moved that slider a bunch of times, and every time I moved it, it would re-render this plot, so this gives you a view of what's taking time. I happen to have installed Shiny in a way that displays all of Shiny's source code here, so it's showing me all the internals of Shiny when I mouse over it; that's actually the reason the interactions are a little bit slow here. So output$plot calls a bunch of stuff, and it ends up calling hist, which was in the example, and plot, and so on. You can zoom in and get a little bit better look at this, but it gives you an idea of what takes time in the Shiny application. Now, you'll probably notice that the call stack is significantly deeper than in those simple examples I showed earlier, but it's actually far more complicated than that: there's this "Hide internal function calls" checkbox here, and you can toggle it.

Okay, and I have it split vertically instead of horizontally. So this is the full complexity of what's happening in Shiny, and you can see there are a lot of function calls. Normally we try to hide all that unnecessary information by hiding internal function calls, so you see something like this, which is a little bit easier to interpret. Again, if you go to the website for profvis, it has the same example, and on the website all these things are interactive as well, because profvis just generates web pages; they can be embedded into R Markdown documents, which is how the website is generated. So I hope that's helpful and gives some understanding of using it with Shiny.

Okay, let's see. Here's a question: "I don't see the Profile item in the top command list; it says I'm running the current version of RStudio. Do I need to use the daily update?" Yes, a daily build of RStudio. I'm using RStudio, and under About RStudio, this is version 0.99.1281. There's the release version of RStudio, and then there are these daily builds. The release versions will say they're up to date, but if you want all the cutting-edge features, then you want to use the daily builds. You can just search for "RStudio daily", and the RStudio daily builds page comes up. I'm on a Mac, so I'd choose Desktop, then Mac, and they're right there; I would just download and install this.

Okay, here's a question: "We discussed why 2,000 megabytes was allocated but only 903 was deallocated. Did the resulting new data frame need one gigabyte of storage?" That was back with this example here. What happened is that it allocated a lot of memory while it was taking those means, but it didn't deallocate it all, because, as you can see, a garbage collection event happened partway through the processing, but there wasn't another one at the end. If there had been another garbage collection event at the end, then I believe it would have deallocated about the same amount that it had allocated; well, not exactly, because it still needed to keep some memory for the means, but that's relatively small. And there are ways of triggering garbage collection: there's a function called gc which will just trigger a garbage collection, so if I had called that at the end, it would probably have come out that way.
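Forcing a collection at the end, as described above, would look something like this (a sketch using the earlier norm_cols example):

```r
library(profvis)

profvis({
  d <- norm_cols(data)
  gc()   # trigger garbage collection so deallocations are reported here
})
```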

All right, here's another question: "Why is converting a data frame to a matrix so computationally expensive?" It's actually not computationally expensive per se; it just requires a lot of memory allocation and moving things around in memory, and that's the expensive part. In fact, a lot of the time in R, when you have things that are surprisingly slow, it's because they're moving a lot of memory around in an inefficient way. So some of what I did here, using lapply() directly instead of converting to a matrix and then transposing with aperm() (which is what apply() does internally), saves time because it doesn't have to move things around in memory as much.
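As a small illustration of that point, here is a sketch comparing the two approaches on some made-up data (the data frame and sizes are just for demonstration). Both compute the same column means, but `apply()` first copies the whole data frame into a matrix, while `lapply()` iterates over the columns in place:

```r
df <- as.data.frame(matrix(rnorm(1000 * 50), ncol = 50))

# apply() coerces the data frame to a matrix first, copying all the
# data; that allocation is where most of the cost comes from.
means_apply <- apply(df, 2, mean)

# lapply() visits each column directly, with no matrix conversion.
means_lapply <- unlist(lapply(df, mean))

# Same result either way.
all.equal(means_apply, means_lapply)
```

On a data frame the size of the webinar's example (400,000 rows by 150 columns), that avoided copy is what makes the lapply() version noticeably faster.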

Okay, another question here: "Can I save the profile file within R code, instead of using the RStudio interface?" It is possible. If you want to share the profile with people outside of RStudio, or do it programmatically, there's a frequently asked questions section on the profvis website: "How do I share profvis visualizations?" You run profvis(), save the result in a variable, say p, and then call htmlwidgets::saveWidget() with the widget and a file name. That's how you'd save it without using the RStudio graphical interface.
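The steps just described can be sketched like this (the profiled expression and the output file name are placeholders; the saveWidget() call is the method from the profvis FAQ):

```r
library(profvis)
library(htmlwidgets)

# Capture the profile in a variable instead of letting it print.
p <- profvis({
  x <- rnorm(1e6)
  sort(x)
})

# Save it as a standalone HTML file that can be shared and opened
# in any browser, outside of RStudio.
saveWidget(p, "profile.html")
```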

Okay, this is a really good one: "In the flame graph, do the blank spaces correspond to C/C++ code being executed, or to time when you aren't clicking on anything in the Shiny app?" Let's see, in this example: the blank space at the very top, above everything else, is time that is probably spent in C code. The empty space between things is just because in those areas the call stack is not as deep; there weren't as many function calls, so there's nothing in between them. Now, there are cells in the flame graph that are yellow and ones that are white. The yellow ones are the ones for which we have source refs, which means we can show you the source code in the code panel on the left. That was the case for this Shiny application's code, and for Shiny itself, because, as I mentioned, I had loaded Shiny with source refs, so I can view all the Shiny code in that code panel. The white cells are calls into code that we don't have source refs for; these are probably functions in the base package, which is not installed with source refs by default in R, so we just don't have them. There are ways to install packages with source refs, but I haven't done that here.

Okay: "How do you install Shiny so it shows internal functions?" Great, we were just talking about this. Again, this is in the frequently asked questions section of the website, under "How do I get code from an R package to show in the code panel?" You need to install the package with source refs. Typically you'd call install.packages() with the package name and type = "source", and pass it the install option for keeping source. Any package you install that way will show its internals in the code panel of profvis.
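Concretely, the install command from the profvis FAQ looks like this (shiny is used here as the example package, matching the question):

```r
# Install a package from source, keeping source references so that
# its internal functions appear in profvis's code panel.
install.packages("shiny", type = "source",
                 INSTALL_opts = "--with-keep.source")
```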

"Is code with nested functions simpler but worse in memory usage? That is, is it good practice to define a function inside another function that will call it?" Defining functions within other functions in R has a very low memory and time overhead, so there's very little cost to doing it. Defining functions within functions can often help you mentally sort out your code, so if you're thinking about doing that somewhere, it's probably a good idea, but there's probably no real performance benefit or penalty either way.
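A quick sketch of the pattern being asked about, using a hypothetical `normalize_cols` in the spirit of the webinar's earlier example; the nested helper adds essentially no overhead but keeps the logic self-contained:

```r
# A helper defined inside the function that uses it: cheap in R,
# and it keeps the centering logic next to where it's applied.
normalize_cols <- function(df) {
  center <- function(col) col - mean(col)  # nested helper function
  as.data.frame(lapply(df, center))
}

df <- data.frame(a = 1:5, b = rnorm(5))
colMeans(normalize_cols(df))  # each column mean is now (numerically) zero
```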

Okay, here's another one: "Will profvis work with code being run in parallel, say with the parallel package?" I believe the answer is no. I haven't really tried this out myself, but I think somebody told me that it did not work, and that's because the built-in R profiler doesn't record the information you need when you're profiling code run with parallel.