Gergely Daroczi | Getting things logged | RStudio (2020)

Transcript#

This transcript was generated automatically and may contain errors.

Good afternoon. Welcome to the programming session in the afternoon. We're going to jump right in and get started with Gergely, who's going to be talking about getting things logged.

Yeah, thank you very much. So my name is Gergely Daroczi, and I'm working at SystemOne, which is an ad tech company, but I will not talk about marketing, neither about our business models, and neither about the statistical models we are using, although there are some interesting problems. So we are doing a lot of time series forecasting, hierarchical models with quite large amount of data, working with multi-armed bandits and so on.

But I will just focus on all the open source R packages that we have built mainly to use R in production. And you can see a couple examples here. We have written thin layers around JDBC drivers so that you can easily connect to, let's say, to Snowflake from R, and a few other things like being able to use gRPC right from R, either as a server or as a client, and a few other things. And I will focus on this logging for R package.

And this is my favorite one, so log warnings and note the plural form. So once you issue this command, from that time, you will get detailed logs about all the warnings thrown by any R function.

A few other things, so logger is mainly to be used in R scripts and not really in interactive mode. So this is a quick example, loading the logger package, listing all the available packages on CRAN, logging the number of packages available on CRAN, and just iterating over all the letters of the English alphabet, counting the number of packages, including that letter, and depending on the number of packages, we will either use trace or debug log level to print the name of the package with the actual number, sorry, print the actual letter with the number of packages related. So this is the output. You can see that the trace log messages did not show up because by default, we are using the debug log level, but I will share more details on that later.

This was one of the first user requests. Can we have colorized outputs? Now that's possible if you're interested, although I would like to emphasize that if you want to grab outputs, it's probably not the best thing, but that's definitely doable.

How logger works internally

So let me give you a more structured overview on how this logger package works, and I will use this example of calling the log info on this string, and in logger, actually, these four steps are happening. So first, we check if the log threshold is met. So this is fairly easy. So here we know that the info level is, sorry, the log level is info, and the log threshold could be set to, let's say, error, which means I do not want this to be logged. So I just exit early, and this is happening fairly quickly, so there's not much overhead. On an average laptop like this one, this happens in 20 microseconds, so not speaking about milliseconds, it's 20 microseconds, which is important so that you can put as many logging functions in your production scripts as you wish. If you don't really need the log outputs, you can just set a higher log threshold.

So next step, after the log level threshold was met, is using the formatter function, which is just focusing on this part. So the hello curly braces name will be transformed to hello word, because the name was replaced by word, so this is just creating the log message that is to be recorded later in the means of a log record. A log record is a rendered version of the log message, and this simple layout is just adding the log level prefix, a timestamp, and the actual message, but you can use other layouts like JSON and other things as well. The fourth step defines where the log record has to be delivered, which can be a console, it can be a file, or it can be a database or maybe Slack, and the end result is here. So once you load the package, all these functions are automatically run, so that means that Logger is ready to be used right away, no need to set up anything, although you can fine tune it to your needs.

Logger components in detail

Let me go through all the related atomic items of Logger. Log levels are just integers, so making this comparison is cheap. We have the log message formatters, for example, the one that I mentioned previously, that is using Glue in the background, but you can use Sprintf, maybe because Glue is not installed or you just prefer that syntax, and there are a few others, and honestly, you can write a log message formatter pretty easily. You just have to make sure that you are transforming whatever object is passed to a string.

Next step is the layout, which is, again, the layout simple is just adding the log level prefix, timestamp, and the message, but you can fine tune that to your needs. I have this generator function, which provides you quite a few meta variables that you can use in your generator. For example, it can automatically log the host name, where you are running the R script, the process ID, the namespace from which you have been calling this Logger function, and so on, and of course, you can have layouts, for example, in JSON and other cases as well.

The log record destination is one of the most complex things, I think. You can log to console file, you can log to both, and you can deliver the messages maybe to Slack or other instant messaging systems. You can send the messages to syslog, you can send the log records to a live stream or to a database table and so on, and this is a recent feature, being able to use a log record appender in an asynchronous way, and to show you a quick example of that, I'm defining my own appender function, which hopefully also demonstrates how easy it is to create your own functions, so this is just writing to a given file, but doing that pretty slowly after sleeping for a second.

So once you start using this appender, and you log 25 lines, you can guess what happens here. This script runs for 25 seconds, not doing anything besides logging, so if you are using that in a production application, you are blocking your production job by logging, which is not cool at all. Instead of that, there's an async way of delivering the log record to the destination, and if you're interested, it's pretty detailed in the documentation how it's done. To keep it short, we have a local message queue where we are just pushing the log records, and the actual delivery is done by a background process, which is monitored by your main R session, so just to show you a quick example of that, you just need to wrap your appender function in appender async, and then it will create your temporary storage for the message queue, it will start the background process and monitor it for you for free, and then when you run the very same stuff, you will get the results within a few milliseconds instead of waiting for the log records actually delivered, and after that, if you are checking on the log file, you can see that the background process is still delivering the log records, but your R script managed to finish the job in time.

Last thing about the logger definition, it's not that simple as described here, because you can actually have multiple thresholds, multiple formatters and layouts if you would like to fine tune that to your needs. You do not have to do that, but if you would like to deliver errors to a database, all the trace logs to your file and the info messages to your console, that's totally possible. Because of time constraints, I will just go through this real quickly, but I will post the slides later online, and you will find these examples, so you can just stack all these log records pretty easily.

Helper functions and integrations

There are some helper functions, and I will just mention some of these, so there's TikTok logging, which is just printing the number of seconds since the previous log request. Other functions that I already mentioned, for example, the log warnings. These guys will automatically capture all the messages, warnings, and errors thrown by R and log it properly so that you know when those errors, messages, and so on happened.

There's a helper function for Shiny , so this is a fairly simple Shiny application, and this line might not be familiar, so just imputing this single line to your app, it will monitor all your inputs, so all these free inputs in your Shiny app, and it will give you free logging of all the changes happening in your application. So just putting that line in your Shiny app, you will get detailed logs of what inputs were changed by the user, which is, I think, really useful to track down how users are managing your app and what errors might have happened.

Using logger in R packages, and this is the main thing why I started to work on logger so that it's flexible enough to be used in an R package as the R package developer imagined what things should be logged, and giving access to the user to fine-tune where the log messages should be logged. So a few examples, there's this Boto R package, which is a wrapper around Python's Boto3 module, so this is a simple way of interacting with AWS. For example, you can read files from S3 by just specifying the object and the function to be used, and it will download the file, store it in a temporary file, read everything, and clear up your workspace. You get some logging for free. If you don't like that, you can just kill that, and you can get rid of the log messages, or you can actually update the log threshold to be traced so that you get more detailed logging.

For example, here it will do the decompressing, it will log the change in the file size, and also it will log the cleanup process. If you don't like that, again, you can just set, let's say, an error or very high log level threshold, and you will get rid of all those messages.

Logging database connections

Another example for logging is this one. This is a pretty common pattern in R scripts, using DBI and hard-coding passwords and host names in R scripts. I didn't like that approach, so I was thinking about something better. And just to keep it short, by the way, if you are interested, there are some other related slides on my homepage for this. A database connection reference can be as simple as this one. For example, just using SQLite. For that, you just provide the driver, and let's just use a temporary file for the actual database. If you want to run a simple query like this one, select 42, you get logging for free. Again, if you want to get rid of that, it's really easy to turn that down, but I think it's really important to have this in production jobs, like you see how or when we are making the database connection. You get the actual SQL statement logged to your console file, Slack, whatever. You get the number of rows, which is important when you are checking production job output and you see that no rows were returned and so on. And there are some attributes also logged, which might be useful.

More complex example is using MySQL and just using the Shiny demo server provided by RStudio . Although the password is not that secret, but still it's not very cool to have this in an unencrypted YAML file. You can use, let's say, Amazon KMS to do the encryption and decryption for you. So using this database connection can be as easy as this one. You write your SQL and this package will do the decryption via KMS for you, log the actual SQL results and so on. Yeah, it's fork safe and there are a few other features that might be interesting. Please check the slides, everything posted on my personal homepage. We have some other packages. All these packages have, I think, pretty detailed package down documentation. If you miss anything, please let me know and I will be happy to improve the documentation.

And thank you very much and just happy logging to everyone, so thank you.

Q&A

Thank you, Gergely. That was a great overview of the logger package. We have time for maybe one or two quick questions. So you did skip over it right at the end, but how does the logger package log computations inside MCL Apply or Parallel LApply?

So with MCL Apply, when you are forking, it will actually fork all the appender functions and formatter functions or whatever you have defined and will just use those. If you are afraid of making that in the case when you are using a database insert as the appender, actually the DDBR package is aware of that. So when you are forking the process, it can check if the PID changed, and if it did, then it will just create a new connection automatically. So it's making all these processes pretty fork safe.

One more quick question. Logs are data. How do you analyze the logs? Yeah. That's a great question as well. So it really depends on how you define your log message layouts. Anyway, you can define the log layout to be, let's say, a JSON line object. So this way, you are just logging JSON blobs, and you can just parse that with your favorite tool. So if you want to analyze your logs, probably you should not use the default text layout. You can also deliver the log records to a database. So you have that in a structured way right away.

Logs are data. So if you want to analyze your logs, probably you should not use the default text layout. You can also deliver the log records to a database. So you have that in a structured way right away.

Gergely Daroczi | Getting things logged | RStudio (2020)

Transcript#

Using R in production

Why logging matters

Introducing the logger package

How logger works internally

Logger components in detail

Helper functions and integrations

Logging database connections

Q&A

Featured software#

rstudio