Resources

Aaron Jacobs | Introducing xrprof: A New Way to Profile R | RStudio (2021)

Tracking down performance issues in R code usually means using R's built-in Rprof() profiler or one of the packages built around it. But the changing nature of the R community (towards more deployed applications) makes local profiling workflows frustrating, which is why I have written a new profiler: xrprof. xrprof is compatible with existing R tools, but unlike them it can be used to profile R code that is already running -- in fact, it is designed to be safe to point at R code running "in production". xrprof also works seamlessly when R is run inside Docker, and can even be run in complex environments like Kubernetes clusters. Taking inspiration from the {jointprof} package, xrprof can also show function calls at the C/C++ level alongside those from R. This can be immensely useful for diagnosing problems in packages that make heavy use of compiled code.

About Aaron: Aaron Jacobs is a Senior Data Scientist on the R&D team at Crescendo, a technology company in the sports betting space with a large internal R ecosystem. Prior to Crescendo he worked in Canadian public policy research. Aaron has a strong interest in the engineering side of data science and the emerging use of R "in production". He is the author of several CRAN and GitHub packages, as well as xrprof -- a new R profiling tool.

image: thumbnail.jpg

Transcript

This transcript was generated automatically and may contain errors.

Hi, my name is Aaron Jacobs. I work at Crescendo Technology in Toronto, which is a company focused on the technology side of sports betting. I'm also the author of xrprof, which is a new profiling tool for the R ecosystem that I'm hoping to tell you about today. It's an honor and a privilege to be part of this year's unusual RStudio conference.

Now, from my perspective, there's been an important change in the R ecosystem in the last five or so years. R has traditionally been a language for data analysis, visualization, summary, exploration locally, interactively, with some reporting. But with the emergence of frameworks like Shiny and Plumber, it's now being used more and more to build real applications. Now, on the one hand, this has created huge opportunities for our users to expand the scope of what they can do, and the value that they can bring to themselves and their organizations.

But with this increased opportunity has come much greater complexity and responsibility. The reality is that for many R users, deploying these applications means that the applications have users other than their authors for the first time. And often they're running in environments that are not the same local development environments that we're used to. This means that you now get bug reports like: hey, why is your Shiny app slow? Or: this report is taking an hour to run, but it used to take two minutes. What's happening?

Profiling in R

Now, traditionally, to answer performance related questions like this, you would make use of a profiling tool. Profiling itself is a term from computer science. It effectively means that we collect data about where our program is spending its time. Effectively, which functions are being called most often.

Now, for R users, this is actually kind of a comfortable situation because profiling produces data. And R users know how to deal with data. We can use all of our favorite tools for analysis, visualization, summary of profiling data, just like we could with any other kind. And profiling itself is a fundamentally empirical approach to performance analysis. Don't guess, measure. And then use those measurements to make informed decisions.


Profiling is such an important part of performance analysis that R actually has its own built-in sampling profiler. It's available through the Rprof() function. And there's a variety of functions in base R and the wider ecosystem that allow you to make use of Rprof's output format to produce, again, these analyses, these summaries, these visualizations. So there's the proftools package and the profvis package. There's also aprof and GUIProfiler, which are slightly less popular. These all do exactly this summary, analysis, visualization.
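As a minimal sketch of that built-in workflow (the file name and the workload here are arbitrary, just something slow enough to collect samples from):

```r
# Sample the call stack every 10 ms while some deliberately slow code runs.
Rprof("prof.out", interval = 0.01)
x <- lapply(1:200, function(i) sort(runif(1e5)))
Rprof(NULL)  # stop profiling

# Summarise where time was spent, by function.
summaryRprof("prof.out")$by.self
```

The same `prof.out` file is what proftools, profvis and friends consume to build their summaries and visualizations.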

And actually, because there's a wider profiling ecosystem beyond R, lots of these packages can produce conversions to popular formats used in external tools. So for example, the ability to convert Rprof's format for KCachegrind, Google's pprof tool, the speedscope format, or the original FlameGraph software: these are all available in the R ecosystem. So there's a big existing set of tools you can use for performance analysis built around the Rprof() function.

The problem with local profiling

Now, all of this sounds great, but let's get back to that whole Shiny performance issue for a moment. One of the interesting things about Shiny is that it's often used by coworkers or peers as an internal tool, which means that the users are often people who would essentially send you an email or a message on Slack if they ran into some kind of issue. And a performance bug report, for example, is basically going to come down to: hey, your app is slow. Can you make it faster so that I can do my job?

And of course, this is not really a helpful or informative thing for them to tell you. So you start asking follow-up questions like, hey, what were you doing? And they'll tell you, you know, I had clicked X and Y. You'll ask them things like, okay, that's not really enough information for me to recreate your issue. Do you remember what you set these inputs to?

And sometimes they can remember and tell you perfectly, but much more often they either don't remember, or they misremember and tell you they did one thing when really they did another. And you spend a long time going down this rabbit hole of trying to recreate their issue locally. And the fundamental reason for this is that when you're using Rprof, you have to do your profiling locally. So you're trying to reproduce an issue or an environment, and then profiling locally to collect data that you can use to analyze your program.

Now, there's all kinds of reasons why this is frustrating, but in addition to the frustration, there's also the problem that like your local environment may kind of differ from wherever your code ends up really running for these users. So a classic example of this is, say for myself, I often write and develop applications in Canada, but deploy them to servers that are in Europe for European users. This means that, you know, the kind of databases that they're connecting to, or the other network services that they might contact, might have very, very different latency profiles when they're actually deployed, as opposed to when I'm working on them locally.

And this can create both bugs and serious performance issues, where stuff that works fast for you locally doesn't work very fast at all for your users. So in addition to the difficulty of trying to recreate what your user was doing, you're also trying to recreate this environment. And overall, this is a fairly error-prone process, and it really slows down tracking down performance issues.

Introducing xrprof

So this disconnect between where code is really running and where you want to profile it is the fundamental motivation behind a new project that I've been working on called xrprof. Now, xrprof is not a clever name. It just means external Rprof, and it is exactly that. It's a standalone program that allows you to profile R code that is already running.

So the fundamental way of thinking about this is that Rprof allows you to profile by modifying your R code, usually locally in an interactive session. xrprof, on the other hand, is designed around profiling R code that is already running, without modification. And this is very powerful if, for example, you want to profile code running in a production environment, say on Shiny Server or RStudio Connect, or R applications running under Docker. xrprof can see and understand R programs running in all of those conditions and profile them for you without modifying them. This allows you to do your profiling with real users and real data, which makes it much easier to track down performance issues because you have data that's close to the source.
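As a rough sketch of what that looks like in practice (the `-p` and `-F` flags below follow my reading of the xrprof README, and the PID is a placeholder; attaching to another process generally requires root or elevated ptrace permissions, so check the project's documentation for your system):

```shell
# 1. Find the PID of the running R process (e.g. a Shiny app's R session).
pgrep -x R

# 2. Sample that process at 50 Hz, writing Rprof-compatible output
#    that the usual R tooling can read.
sudo xrprof -p 12345 -F 50 > Rprof.out
```

Because the output is ordinary Rprof-format text, you can then hand `Rprof.out` straight to `summaryRprof()` or your favourite flame-graph converter.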

Now, xrprof is fundamentally designed to be a compatible drop-in replacement for Rprof, which means it produces the same output format, and it works with all of the existing R ecosystem tools, and by extension, through conversions to other formats, all of the wider profiling world outside of R. Because of this compatibility, it's really easy to use, and it should be very familiar if you've worked with profiling data in R before.

Profiling C and C++ code

Now, there's another piece of this puzzle. One of the other major things that's happening in the R ecosystem is that more and more R packages are making use of C and C++ code. Now, it's always been possible to embed native code in R. It's actually one of the strengths of R as a programming language. But I think partly because of democratizing tools like Rcpp, which have made it much easier for R users to write C++ code, it's now much more common to include C++ in an R package.

This is great on the one hand for performance, but it does create a problem, which is that Rprof can only see what's happening inside the R code. It can only see R functions. That means that the more of this work happens in C and C++, the more of it is totally invisible to Rprof.

And this creates serious issues when the C and C++ code is actually the source of the performance problem. Now, I know you might be thinking: hold on, C and C++ are much faster than R. And this is true, but in practice, C and C++ code can contain exactly the same kinds of poor performance decisions you can make in R. You can do stuff you don't really need to do. You can do things much more often than you really need to do them. Or you can choose an inefficient data structure to represent the problem, or an inefficient algorithm to solve it.

So, just because code is written in C and C++ doesn't mean that it somehow becomes free of these issues. And I think that if you were trying to track down a C or C++ related performance issue in an R package, it's actually been kind of painful so far. Either you just guess what the problem is and keep trying various approaches, or you try and use some sort of C or C++ level profiling tool and an R level profiling tool at the same time, and then try and cross-reference what's happening. This is actually in practice a very inexact and frustrating process. There's just not enough information on either side for you to be able to knit these things together.

And this secondary frustration was actually a big motivation for xrprof as well. xrprof can actually see what is happening in your C and C++ code, and it can show that alongside your R-level functions, so you can see a more holistic picture of what's going on. I think this is a really powerful tool, especially for R package developers who need to do performance analysis that crosses this language barrier. And don't forget that R itself is actually written in C. So, in practice, sometimes when you're having performance issues that touch base R code, the issue is actually in the C code.

A real-world example: mongolite

So, again, what does this kind of look like? This is a more involved example. And in fact, it was a very real issue that I first encountered a couple of years ago.

So, at work, we have a whole bunch of semi-structured data in a database called MongoDB. And there's an existing R package called mongolite, which can be used to query MongoDB. And it actually works really, really well the vast majority of the time.

Now, if you're not familiar with MongoDB, effectively what it does is store data as a kind of JSON. Which means that tabular data, the kind you would see in a CSV or in a SQL table, is essentially represented as a JavaScript array of JavaScript objects. So this effectively looks like a big long list of dictionaries of key-value pairs, where every row is represented as a dictionary.
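To make that representation concrete, here is a tiny example using the jsonlite package (the JSON library that mongolite builds on); the data is made up:

```r
library(jsonlite)

# A two-row table stored MongoDB-style: an array of objects,
# with one object (a dictionary of key-value pairs) per row.
json <- '[{"name": "a", "n": 1}, {"name": "b", "n": 2}]'

# fromJSON() simplifies the array of objects back into a data frame
# with columns "name" and "n".
df <- fromJSON(json)
df
```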

Okay. Now, it turns out that there's a known performance issue with mongolite for large queries with lots of columns: they can take a long time. A surprisingly long time, at least. And when I first encountered this a couple of years ago, I wanted to figure out why.

So, I profiled it using Rprof. But I discovered that I couldn't really see anything. And the reason for this is that all of the interesting parts of mongolite are actually implemented in C, which meant it was very hard to figure out what was going on.

And so, at the time, I did exactly this very awkward dance of using C- and C++-level tools and R-level tools and trying to cross-reference them. And it was very painful and took me a long time. And I was never really sure if I was measuring the right thing. So, when I redid the analysis for this talk, it was kind of comforting that what had taken me several days a few years ago took me something like 90 seconds with xrprof.

And so, this is what happens when you use Rprof. This is a flame graph of what you would see, which is not that informative, because you just don't see what's happening inside a package that's mostly implemented in C. But if you can see the C code, this is what you will see: here the R functions are in blue and the C functions are in orange.

Now, of course, I know this is much too small for you to actually be able to read function names. And that's not really the point. So, I'll just point out a couple of things that I was able to pick up on once I had this view. The first is that there's a certain amount of time that R is communicating with Mongo itself. And this is important because effectively this represents the total upper bound of how fast we could ever possibly make this R code. Because the reality is we still need to spend time transmitting data from the database over the network. And there's nothing really we can do about the time that it takes to do that. Other than, say, querying less data. So, this kind of serves as a total upper bound.

But there were two kind of odd things that I noticed when I looked at these profiles. The first is that R is spending a lot of time allocating memory. And if you don't know what that means, well, mostly it doesn't matter, because allocating memory is normally really fast. For memory allocation to show up on a profile, it needs to be happening a lot, or at least happening really inefficiently. So when I saw this, it immediately piqued my interest, and it made me wonder what could be causing this many allocations. The second thing I noticed is that mongolite seems to be spending a ton of time transposing lists and checking for uniqueness.

Which seems like an odd operation to be doing. Because, again, checking uniqueness is normally pretty fast. In order to spend so much time doing it, you have to be doing it a lot.

So, originally, I puzzled over this for a while before I realized that, ultimately, it totally made sense. Remember how I said that JSON represents tabular data as an array of objects? Well, it turns out that the way mongolite works is that when it reads in rows, it takes these JSON objects and turns them into R lists, which is a very natural representation of a row. Basically, if you have 20 columns, that gives you a list of length 20, every element of that list can have a different type, and the names of your list are the names of your columns. And then what happens at the end is, basically, it takes all of these lists and smooshes them together into a data frame.

Now, this is kind of interesting, because if we were to create a data frame in advance, knowing how many rows we were getting, we would essentially just do one allocation for every column. So, if we had 20 columns, we'd do 20 allocations. But if we read in a million rows of 20 columns using this row-by-row approach, we do something like 20 million allocations. So you can see where some of this was showing up in the profile: instead of doing 20 allocations, we're doing 20 million.
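You can see the same effect in plain R, quite apart from mongolite. A small sketch (the sizes and helper code here are my own, not mongolite's actual implementation):

```r
n_rows <- 2000
rows <- lapply(seq_len(n_rows), function(i) list(a = i + 0, b = i * 2))

# Row-wise: turn each list into a one-row data frame, then bind them all.
# Every row costs several small allocations, and rbind() keeps re-copying
# and re-checking names/types as it grows the result.
t_rowwise <- system.time(df1 <- do.call(rbind, lapply(rows, as.data.frame)))

# Column-wise: one vector allocation per column, filled in a single pass.
t_colwise <- system.time(df2 <- data.frame(
  a = vapply(rows, function(r) r$a, numeric(1)),
  b = vapply(rows, function(r) r$b, numeric(1))
))

all.equal(df1$a, df2$a)  # same contents either way
t_rowwise
t_colwise
```

The two approaches produce the same table, but the row-wise version spends most of its time on exactly the kind of allocation and checking work that showed up in the profile.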

And the second thing is, because we have all these lists, when we smoosh them together, we have to do a bunch of things like check that all of the names are the same and all of the types are the same. And we have to do this for every single row we add, against all the data we've previously added. So, again, you can see how this adds up. Now, this story doesn't have a super happy ending. I did propose changing mongolite to use a different algorithm. The problem was that it was a very complex patch, something like 700 lines of new C code, so it wasn't accepted into mongolite. But it was about five times faster. Which just goes to show that writing something in a fast language like C or C++ doesn't mean you can't run into exactly the same kinds of performance issues you'd encounter in R code.


So, hopefully, that serves as some inspiration if you're trying to do remote profiling of any kind of production code, or profiling, local or remote, of R code that uses lots of C or C++.

Now, I should say that xrprof itself is open source. It's available on GitHub. Most of the development this year was generously funded by the R Consortium. And I'm currently looking for users to try it out, see if it works for you, see if you can break it, and let me know. I'm hoping that xrprof can become a new tool in the R profiling toolbox.

I'm Aaron Jacobs. You can find me on Twitter, on GitHub, and on my own site. I work at Crescendo Technology in Toronto. And if this stuff piques your interest, we're always hiring. Thanks for listening, and I hope you enjoyed this and all of the great talks at this year's RStudio conference.