Resources

Gergely Daroczi | Getting things logged | RStudio (2020)

One of the greatest strengths of R is the ease and speed of developing a prototype (let it be a report or dashboard, a statistical model or rule-based automation to solve a business problem etc), but deploying to production is not a broadly discussed topic despite its importance. This hands-on talk focuses on best practices and actual R packages to help transforming the prototypes developed by business analysts and data scientist into production jobs running in a secured and monitored environment that is easy to maintain -- discussing the importance of logging, securing credentials, effective helper functions to connect to database, open-source and SaaS job schedulers, dockerizing the run environment and scaling infrastructure

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Good afternoon. Welcome to the programming session in the afternoon. We're going to jump right in and get started with Gergely, who's going to be talking about getting things logged.

Yeah, thank you very much. So my name is Gergely Daroczi, and I'm working at SystemOne, which is an ad tech company, but I will not talk about marketing, neither about our business models, and neither about the statistical models we are using, although there are some interesting problems. So we are doing a lot of time series forecasting, hierarchical models with quite large amount of data, working with multi-armed bandits and so on.

But I will just focus on all the open source R packages that we have built mainly to use R in production. And you can see a couple examples here. We have written thin layers around JDBC drivers so that you can easily connect to, let's say, to Snowflake from R, and a few other things like being able to use gRPC right from R, either as a server or as a client, and a few other things. And I will focus on this logging for R package.

Using R in production

But before I focus on this logger package, I would like to clarify what I mean by using R in production and what is in production. And I included a couple of projects from the past years. I will skip many of these and we'll just focus on some important points. Like I will not mention the first one that was terrible, but this is happening in a lot of companies, I think, writing batch jobs in R that is to be run in an automated way, maybe triggered by Jenkins or cron schedule or something like that. Maybe just generating reports or doing some ETL processes or doing some daily operations like charging fees on customers, card, and so on.

So this is pretty common, but I think there's no standard way of logging everything in production which can be really useful if something goes wrong. Also you want to get notified if there's any errors with your jobs. I've been using R to be integrated in web applications. Jeffrey Horner's R Apache module was great. It makes it fairly easy to have scalable APIs written in R, although there are some issues like security that you should take seriously. Also you need a way of tracking all the errors in a central place and so on.

Using R for stream processing is not a standard thing, I think, but you can definitely do that. I was working at a fintech company, and we had the debit card transactions coming in in real time, and we had some R scripts doing the real-time notifications of our customers via text message, email, push notifications, and so on. So that's totally doable, but it's not really standard.

So just abstracting this away, I think using R in production is mainly using R without manual intervention. So having a cron or Jenkins or API request triggered the job. Also you should use some containerized environment. I would like to emphasize it's not as simple as putting everything in a Docker image and you are all set. But you have to fix your R version, your R package versions, your operating system versions. For example, if you are using Debian testing and the app team decides to do a major version update, for example, for the Postgres client overnight, then your jobs might fail.

Also you have to use a standard R profile and so on. I will skip security. That's a totally different topic, but really important. And I will focus on these. So you really have to monitor the jobs to get notified if there's some problem. You want to get a notification as a developer, and also you want to notify the users of your R applications. For example, in our case, the buyers of different campaigns. And it's, I think, not enough to monitor the errors, but you have to log everything so that if there's a problem, you can debug that pretty easily.

Why logging matters

This 10-second video really well describes how I feel about the R code I've written 10 years ago, or let's say yesterday. And I like to use this animation to highlight the importance of version control, but I think that's true for logging as well. For example, you have this super important business stuff that R script that you would like to run, and the only output you get is this one. Script out of bounds, I think the most frequently searched term on Stack Overflow. And here, at least, you get some hints. There's a function called fng, but if you get an error like this one, .subset2error, you have no idea what has happened, so logging is important.

I've seen this design in a lot of scripts. So you're looping over some vector and printing the iterating variable so that you know where the script is happening. So that's just a poor man's progress bar, or so that you know where the script failed. Or there are some fancier versions of that, logging the timestamp and the percentage left, which works great. For example, if you have an output like this one, at least you know that the script failed after the 17th iteration, but it's not perfect. Also, you can log to file by sync or cat, or you can have your own homebrew function to do the logging to a log file.

And this works in most cases, but for example, if you are forking the R process, your log lines can be mixed up, and then this subset2error happens. You have no idea in which part that happened. So I think it's a time to get a proper R package for logging. And of course, I did my homework. I looked around on CRAN, what packages are available for logging, and there are quite a few. I tried to use this nice regular expression to filter for all the packages related to logging, and obviously, I did a very bad job with that. Anyway, there are at least 10 packages doing logging on CRAN, but there were some issues. I will not get into the details. I've written a pretty long apologetics about why I'm not using the other packages. If you are interested, please check the GitHub page.

Introducing the logger package

Anyway, I came up with a design which can be used for all the other R packages. This is a design which is flexible enough to fit all the needs of the R users. Posted this on Twitter, got some good feedback, so ended up working on this package. This is called logger. Usage is hopefully fairly straightforward. So you load the package, and then you have this log underscore info, warn, error, debug, and trace functions, where you can just pass a string. So it will just print to the console the log level, the timestamp, and the message you have posted there.

And there are some formatters. By default, if the glue package is installed on your computer, that will be used. So you can use variable names or R expressions. That will be evaluated in your string. Glue will do the string interpolation, and here log the number of rows in the data set, and so on.

There are some helper functions in logger, and I will just show you a couple examples. For example, this log evil function will evaluate your R expression, this one, and also log the result. So you can see that this was wrong. This is the result. This is useful if you just want to record the changing values of a variable in a loop or in an R script.

And this is my favorite one, so log warnings and note the plural form. So once you issue this command, from that time, you will get detailed logs about all the warnings thrown by any R function. For example, here, I think this is useful so that now, right away when the warning happened with the exact timestamp, not like the default R behavior, printing all the warnings at the end of the R script.

And this is my favorite one, so log warnings and note the plural form. So once you issue this command, from that time, you will get detailed logs about all the warnings thrown by any R function.

A few other things, so logger is mainly to be used in R scripts and not really in interactive mode. So this is a quick example, loading the logger package, listing all the available packages on CRAN, logging the number of packages available on CRAN, and just iterating over all the letters of the English alphabet, counting the number of packages, including that letter, and depending on the number of packages, we will either use trace or debug log level to print the name of the package with the actual number, sorry, print the actual letter with the number of packages related. So this is the output. You can see that the trace log messages did not show up because by default, we are using the debug log level, but I will share more details on that later.

This was one of the first user requests. Can we have colorized outputs? Now that's possible if you're interested, although I would like to emphasize that if you want to grab outputs, it's probably not the best thing, but that's definitely doable.

How logger works internally

So let me give you a more structured overview on how this logger package works, and I will use this example of calling the log info on this string, and in logger, actually, these four steps are happening. So first, we check if the log threshold is met. So this is fairly easy. So here we know that the info level is, sorry, the log level is info, and the log threshold could be set to, let's say, error, which means I do not want this to be logged. So I just exit early, and this is happening fairly quickly, so there's not much overhead. On an average laptop like this one, this happens in 20 microseconds, so not speaking about milliseconds, it's 20 microseconds, which is important so that you can put as many logging functions in your production scripts as you wish. If you don't really need the log outputs, you can just set a higher log threshold.

So next step, after the log level threshold was met, is using the formatter function, which is just focusing on this part. So the hello curly braces name will be transformed to hello word, because the name was replaced by word, so this is just creating the log message that is to be recorded later in the means of a log record. A log record is a rendered version of the log message, and this simple layout is just adding the log level prefix, a timestamp, and the actual message, but you can use other layouts like JSON and other things as well. The fourth step defines where the log record has to be delivered, which can be a console, it can be a file, or it can be a database or maybe Slack, and the end result is here. So once you load the package, all these functions are automatically run, so that means that Logger is ready to be used right away, no need to set up anything, although you can fine tune it to your needs.

Logger components in detail

Let me go through all the related atomic items of Logger. Log levels are just integers, so making this comparison is cheap. We have the log message formatters, for example, the one that I mentioned previously, that is using Glue in the background, but you can use Sprintf, maybe because Glue is not installed or you just prefer that syntax, and there are a few others, and honestly, you can write a log message formatter pretty easily. You just have to make sure that you are transforming whatever object is passed to a string.

Next step is the layout, which is, again, the layout simple is just adding the log level prefix, timestamp, and the message, but you can fine tune that to your needs. I have this generator function, which provides you quite a few meta variables that you can use in your generator. For example, it can automatically log the host name, where you are running the R script, the process ID, the namespace from which you have been calling this Logger function, and so on, and of course, you can have layouts, for example, in JSON and other cases as well.

The log record destination is one of the most complex things, I think. You can log to console file, you can log to both, and you can deliver the messages maybe to Slack or other instant messaging systems. You can send the messages to syslog, you can send the log records to a live stream or to a database table and so on, and this is a recent feature, being able to use a log record appender in an asynchronous way, and to show you a quick example of that, I'm defining my own appender function, which hopefully also demonstrates how easy it is to create your own functions, so this is just writing to a given file, but doing that pretty slowly after sleeping for a second.

So once you start using this appender, and you log 25 lines, you can guess what happens here. This script runs for 25 seconds, not doing anything besides logging, so if you are using that in a production application, you are blocking your production job by logging, which is not cool at all. Instead of that, there's an async way of delivering the log record to the destination, and if you're interested, it's pretty detailed in the documentation how it's done. To keep it short, we have a local message queue where we are just pushing the log records, and the actual delivery is done by a background process, which is monitored by your main R session, so just to show you a quick example of that, you just need to wrap your appender function in appender async, and then it will create your temporary storage for the message queue, it will start the background process and monitor it for you for free, and then when you run the very same stuff, you will get the results within a few milliseconds instead of waiting for the log records actually delivered, and after that, if you are checking on the log file, you can see that the background process is still delivering the log records, but your R script managed to finish the job in time.

Last thing about the logger definition, it's not that simple as described here, because you can actually have multiple thresholds, multiple formatters and layouts if you would like to fine tune that to your needs. You do not have to do that, but if you would like to deliver errors to a database, all the trace logs to your file and the info messages to your console, that's totally possible. Because of time constraints, I will just go through this real quickly, but I will post the slides later online, and you will find these examples, so you can just stack all these log records pretty easily.

Helper functions and integrations

There are some helper functions, and I will just mention some of these, so there's TikTok logging, which is just printing the number of seconds since the previous log request. Other functions that I already mentioned, for example, the log warnings. These guys will automatically capture all the messages, warnings, and errors thrown by R and log it properly so that you know when those errors, messages, and so on happened.

There's a helper function for Shiny, so this is a fairly simple Shiny application, and this line might not be familiar, so just imputing this single line to your app, it will monitor all your inputs, so all these free inputs in your Shiny app, and it will give you free logging of all the changes happening in your application. So just putting that line in your Shiny app, you will get detailed logs of what inputs were changed by the user, which is, I think, really useful to track down how users are managing your app and what errors might have happened.

Using logger in R packages, and this is the main thing why I started to work on logger so that it's flexible enough to be used in an R package as the R package developer imagined what things should be logged, and giving access to the user to fine-tune where the log messages should be logged. So a few examples, there's this Boto R package, which is a wrapper around Python's Boto3 module, so this is a simple way of interacting with AWS. For example, you can read files from S3 by just specifying the object and the function to be used, and it will download the file, store it in a temporary file, read everything, and clear up your workspace. You get some logging for free. If you don't like that, you can just kill that, and you can get rid of the log messages, or you can actually update the log threshold to be traced so that you get more detailed logging.

For example, here it will do the decompressing, it will log the change in the file size, and also it will log the cleanup process. If you don't like that, again, you can just set, let's say, an error or very high log level threshold, and you will get rid of all those messages.

Logging database connections

Another example for logging is this one. This is a pretty common pattern in R scripts, using DBI and hard-coding passwords and host names in R scripts. I didn't like that approach, so I was thinking about something better. And just to keep it short, by the way, if you are interested, there are some other related slides on my homepage for this. A database connection reference can be as simple as this one. For example, just using SQLite. For that, you just provide the driver, and let's just use a temporary file for the actual database. If you want to run a simple query like this one, select 42, you get logging for free. Again, if you want to get rid of that, it's really easy to turn that down, but I think it's really important to have this in production jobs, like you see how or when we are making the database connection. You get the actual SQL statement logged to your console file, Slack, whatever. You get the number of rows, which is important when you are checking production job output and you see that no rows were returned and so on. And there are some attributes also logged, which might be useful.

More complex example is using MySQL and just using the Shiny demo server provided by RStudio. Although the password is not that secret, but still it's not very cool to have this in an unencrypted YAML file. You can use, let's say, Amazon KMS to do the encryption and decryption for you. So using this database connection can be as easy as this one. You write your SQL and this package will do the decryption via KMS for you, log the actual SQL results and so on. Yeah, it's fork safe and there are a few other features that might be interesting. Please check the slides, everything posted on my personal homepage. We have some other packages. All these packages have, I think, pretty detailed package down documentation. If you miss anything, please let me know and I will be happy to improve the documentation.

And thank you very much and just happy logging to everyone, so thank you.

Q&A

Thank you, Gergely. That was a great overview of the logger package. We have time for maybe one or two quick questions. So you did skip over it right at the end, but how does the logger package log computations inside MCL Apply or Parallel LApply?

So with MCL Apply, when you are forking, it will actually fork all the appender functions and formatter functions or whatever you have defined and will just use those. If you are afraid of making that in the case when you are using a database insert as the appender, actually the DDBR package is aware of that. So when you are forking the process, it can check if the PID changed, and if it did, then it will just create a new connection automatically. So it's making all these processes pretty fork safe.

One more quick question. Logs are data. How do you analyze the logs? Yeah. That's a great question as well. So it really depends on how you define your log message layouts. Anyway, you can define the log layout to be, let's say, a JSON line object. So this way, you are just logging JSON blobs, and you can just parse that with your favorite tool. So if you want to analyze your logs, probably you should not use the default text layout. You can also deliver the log records to a database. So you have that in a structured way right away.

Logs are data. So if you want to analyze your logs, probably you should not use the default text layout. You can also deliver the log records to a database. So you have that in a structured way right away.