Melissa Van Bussel | Achieving a seamless workflow between R, Python and SAS from within RStudio
Some of my best friends use Python...and all of my coworkers use SAS. Statistics Canada is the official statistical agency of Canada and has over 6,000 employees. Statistics Canada has a legal obligation to ensure that personal information collected for statistical purposes is kept strictly confidential. An internal system which prevents the release of confidential information is only implemented in SAS. As such, many Analysts and Data Scientists at Statistics Canada must use the SAS programming language as part of their workflow. It is therefore imperative to find ways to work with open source programming languages and SAS seamlessly. I will present a method for achieving a harmonious workflow between R, Python, and SAS, all within RStudio. Talk materials are available at https://github.com/melissavanbussel/rstudio-conf-2022
Session: Some of my best friends use Python
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
I'm Melissa, and today I'm going to be talking about how to achieve a seamless workflow between R, Python and SAS, all from within RStudio. Now we're at an RStudio conference, and more specifically, right now, we're in a session that's called Some of my best friends use Python. So I already know that everyone in this room loves at least three of the words that are in the title of my talk.
But before we talk about R or Python, I first want to take a moment to talk about something that I know we're all even more passionate about, which, of course, is SAS. Don't worry, I'm totally kidding, I was not referring to SAS here at all, I was actually talking about food. And I apologize in advance, because I know we're still about an hour away from our lunch break today, but I do just want to take a moment to talk about food before we go any further.
I love food, but I'm not a very good cook. So I really like these meal kit boxes that you can get delivered to your door, because the recipes are always super straightforward, and they always come with the exact right ingredients measured out in the exact right proportions. And usually if I follow along with the steps in the recipe in one of these meal kit boxes, I can get the results to look pretty close to the pictures that are in the recipe. But there's just one problem with these meal kit boxes. And that problem is that they give bad cooks like me false confidence.
These meal kit boxes have me thinking that I'm totally capable of recreating these masterpieces totally on my own, but a few weeks later, if I try to reproduce the food, I don't have all of the ingredients available, and suddenly I have to start making some substitutions, and all of a sudden my food can end up looking a little more something like this.
And I find that the same can be said for a lot of programming tutorials. If you have all of the right ingredients measured out in the exact right proportions, then it's pretty easy to follow along with a simple recipe and get the desired results. But if you're missing some key ingredients, then you start having to make some substitutions or find some workarounds, and then you'll end up with something that's perhaps technically edible, although certainly less than ideal.
And that leads me to the outline for my talk today. I'm going to be discussing how to incorporate these three languages into your workflow, but I'm going to present it in two different ways. The first way is an ideal approach, and this approach is going to be kind of like following along with a simple three-step recipe, but the downside here is that I'm going to assume that you have all of the ingredients available. And the second way that I'll present is a less than ideal approach, where you're missing some key ingredients, and so you have to find some workarounds.
Context: working at Statistics Canada
But before I jump into that, I first want to start by providing some context as to why this particular challenge arose for me. I'm an analyst at Statistics Canada, which, for those who are unfamiliar, is an agency within the government of Canada, and Statistics Canada is Canada's official statistical agency. And the particular challenge that I face at work is that my employer wants to adopt open-source programming languages like R and Python, but we have a legal responsibility to ensure that personal information collected for statistical purposes is kept strictly confidential.
This means that when we publish statistics, we first need to use an internal generalized system called GConfid, which is a series of SAS macros that prevents the release of confidential information in tabular data using cell suppression and controlled random rounding methods. And GConfid is only implemented in the SAS programming language. And what this means in practice is that many analysts and data scientists at Statistics Canada have to use SAS as part of their workflow if they want to be able to publish official statistics.
My typical workflow at StatCan usually looks a little bit something like this. I work with some SAS datasets that first get passed through some pre-written preprocessing scripts that are written in SAS. And once the data are ready to be analyzed, I bring the data into R, and then I do the vast majority of my analysis in R. But if I want those results to be publishable, then I need to bring the data from R back into SAS so that I can use these GConfid macros.
And you can think of GConfid kind of like a black box. It's a series of SAS macros that is completely uneditable by me, and these macros are themselves confidential. So I cannot and I would not want to implement GConfid in any other programming language. So for this step, I really do need to use SAS. Once I'm finished with GConfid, I bring those results back into R for data visualization purposes. And as you can see, my workflow really does require me to be able to go back and forth between SAS and R seamlessly.
And while I myself don't typically use Python at work, I do have a lot of colleagues who would appreciate the ability to incorporate it into their workflow. So all of this led me to the question, is there a way to combine these three languages in the same script in a way that does not cause a headache?
The ideal approach: SASPy
And the answer is yes, and there are multiple different approaches. The most straightforward approach is a simple three-step recipe, and step one of this recipe is to gather your required ingredients. For this particular recipe, the required ingredients are the languages that you're interested in using plus Java. And the other ingredient that's not pictured here, but which really is the most important ingredient for making this particular recipe succeed is admin privileges. And if you don't have admin privileges on your computer, you might not be able to get this recipe to work.
Step two of this three-step recipe, assuming you have all of those required ingredients from step one, is to install SASPy, which is a Python package that works very similarly to how reticulate works. So just like reticulate is an R interface to Python, SASPy is a Python interface to SAS.
The third and final step of this three-step recipe is to perform a couple of system configurations. And there's a lot of little details that are in this step, so I won't go over that today for the sake of time, but if you're interested in following along, there is a video on my YouTube channel that explains all of the steps and includes an example usage of combining these three languages in the same script using SASPy.
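To give a rough idea of what the SASPy side of this recipe can look like, here is a minimal sketch in Python. It assumes SASPy is already installed and configured, which is the part the talk skips. The function name freq_via_sas and the dataset names in_df and out_df are hypothetical choices for illustration, while df2sd, submit and sd2df are SASPy session methods.

```python
def freq_via_sas(sas, df, var="cyl"):
    """Send a pandas DataFrame to SAS, tabulate one column, and bring the result back.

    `sas` is expected to be an already-configured saspy.SASsession and `df` a
    pandas DataFrame. The names in_df / out_df are illustrative assumptions.
    """
    # pandas -> SAS: create WORK.IN_DF from the DataFrame
    sas.df2sd(df, table="in_df", libref="work")

    # Run ordinary SAS code against that dataset
    result = sas.submit(
        "proc freq data=work.in_df noprint;\n"
        f"  tables {var} / out=work.out_df;\n"
        "run;"
    )
    log = result["LOG"]  # submit() returns a dict; the SAS log is under 'LOG'

    # SAS -> pandas: read WORK.OUT_DF back as a DataFrame
    return sas.sd2df(table="out_df", libref="work"), log


# Typical use (needs a working SAS installation and SASPy configuration):
#   import pandas as pd
#   import saspy
#   sas = saspy.SASsession()
#   counts, log = freq_via_sas(sas, pd.DataFrame({"cyl": [4, 6, 6, 8]}))
```

From RStudio, a function like this would presumably be called through reticulate, which is how the three languages end up in the same script.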
Alternative approaches
Now that was definitely the most straightforward approach and the recommended, well, my personal recommended way of how to achieve this workflow, but if you have limited control over your computer, then it might not be possible for you. So in the next few slides, I'm going to present a few alternative approaches that will allow you to combine SAS and R in the same script, and sometimes Python. The approach that you choose is going to depend heavily on what you need to do and on how much control you have over your computer. And for the three alternative options that I'm about to present, for all three of them, there's going to be a big tradeoff between complexity and usefulness.
The first option is to use the SASmarkdown package, which, as you might be able to guess from the name, is a package that allows you to include SAS chunks in your R Markdown or Quarto documents. This option has the benefit of not requiring admin privileges, but the downside is that there's no interaction between SAS and R at all. So what this means is that the SAS data sets that you define in your SAS chunks are not going to be accessible to you in your R chunks and vice versa.
For me personally, I really need this interoperability between these two languages, so this particular option isn't the most useful for me, but this package does have some really interesting use cases. For example, if you're a course instructor, you could use this in order to show your students how to perform the same task using multiple different programming languages all in the same place. Once you have the SASmarkdown package installed, it's fairly easy to switch between the three languages within your R Markdown or Quarto documents, and all you need to do is specify which language you'd like to use in each individual code chunk.
If you want to use SAS, just set the engine option to sashtml, and you can set the engine.path option to the file path of your SAS executable. To use R, leave the chunk header as the default, and to use Python, set the chunk header to say python. And if you were to knit an R Markdown document that contains three chunks that look like this, the result will be something that looks like this.
The second approach that you can use is a SAS-first approach rather than an R-first approach. When SAS is launched, there's an option that you can enable, which is called RLANG. And the RLANG that you see there on the slide in all caps, I should mention, has absolutely no relationship to the rlang package in R. So this all caps RLANG that you see here is a SAS concept, not an R concept. The RLANG option can be enabled without admin privileges, and it's pretty easy to move data back and forth between SAS and R using this method, which makes this option quite a bit more useful than the SASmarkdown package, for my purposes at least.
Once the RLANG option is enabled, you can use PROC IML in SAS to execute R code. And in order to execute that R code, just use a submit statement with the R option enabled. And then anything that you put between the submit and endsubmit lines will be executed as R code rather than as SAS code. Now I mentioned that it's pretty easy to move data back and forth between R and SAS. And if you want to go from R to SAS, you can use the ImportDataSetFromR subroutine. And if you want to go in the opposite direction, so to go from SAS to R, use the ExportDataSetToR subroutine. Both of these subroutines use the exact same syntax. You start by specifying the name of the SAS data set, and then specify the name of the R data frame.
I mentioned before that this option doesn't require admin privileges, but really that's only partially true. If you want to use this method from base SAS, then you don't need to have admin privileges in order to enable this RLANG option. But if you're like me, and if you prefer to use an IDE like SAS Enterprise Guide, then you will need admin privileges in order to enable this feature. If you are interested in using this method from within SAS Enterprise Guide, there's a video on my YouTube channel that goes through the full instructions of how to set this up.
The third option that I'm going to talk about today is the one that is the most complex, but also the one that's the most useful if what you really need is the interoperability between SAS and R, and if you don't have admin privileges, and if you want to be working from within RStudio. And this is actually the approach that I was using myself at work for the past couple of months.
Myself and a coworker on my team, as well as a colleague from another part of our organization, worked together to develop an internal R package that contains a function that allows you to take an R data frame, manipulate it in SAS, and then return the results to R as an R data frame without ever leaving RStudio. And this package is not publicly available, but you can easily come up with a similar solution using the same general underlying principles that I'm about to show you.
The first argument that this function takes is the name of the R data frame that you'd like to manipulate in SAS, and here I'm just using the mtcars dataset, which is a dataset that comes built into R and includes variables about characteristics of cars, such as miles per gallon, horsepower, cylinders, and so on. The second argument that this function takes is the location of the SAS installation on your machine, and the final argument that this function takes is the SAS code that you'd like to apply to your R data frame.
This function assumes that in SAS you have access to a dataset that's called in_df, and then the SAS dataset that will eventually get sent back to the user as an R data frame is called out_df. In the example here, I'm just using the frequency procedure in SAS to create a frequency table of the cylinder variable from the mtcars dataset. And by executing just those two lines of code, the results from SAS are brought to the user in RStudio without ever having had to exit RStudio.
In terms of the actual implementation of this function, this function starts by creating a temporary file folder and then saving the user-inputted R data frame as a .rds file in that temporary file folder. From here, a temporary SAS program is created, and the user-inputted SAS code is then inserted into that temporary SAS program, making use of PROC IML and the ImportDataSetFromR and ExportDataSetToR subroutines that we talked about a little bit earlier. This temporary SAS program is then executed by sending a command to the command prompt, enabling the RLANG option, and then the temporary files are deleted after the results are brought back into R. And finally, the results are returned to the user as an R data frame.
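The internal package is written in R and is not public, so the following is only a sketch of the same general principles, written in Python for illustration. It assembles the temporary SAS program and the command line but does not actually launch SAS. The function names, file names, and the choice of command-line options beyond RLANG are all assumptions, not the package's real interface.

```python
import tempfile
from pathlib import Path


def build_sas_program(user_sas_code, rds_in, rds_out):
    """Wrap user SAS code so WORK.IN_DF comes from R and WORK.OUT_DF goes back to R."""
    lines = [
        "proc iml;",
        "  submit / R;",  # everything up to endsubmit runs as R code
        f'    in_df <- readRDS("{Path(rds_in).as_posix()}")',
        "  endsubmit;",
        '  call ImportDataSetFromR("work.in_df", "in_df");',  # R -> SAS
        "quit;",
        "",
        user_sas_code.strip(),  # user code reads in_df and must create out_df
        "",
        "proc iml;",
        '  call ExportDataSetToR("work.out_df", "out_df");',  # SAS -> R
        "  submit / R;",
        f'    saveRDS(out_df, "{Path(rds_out).as_posix()}")',
        "  endsubmit;",
        "quit;",
    ]
    return "\n".join(lines) + "\n"


def build_sas_command(sas_exe, program, log):
    """Command that runs the program in batch mode with the RLANG option enabled."""
    return [sas_exe, "-sysin", str(program), "-log", str(log), "-rlang"]


user_code = """
proc freq data=work.in_df noprint;
  tables cyl / out=work.out_df;
run;
"""

# The temporary folder (and everything in it) is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    program = tmp / "job.sas"  # hypothetical file name
    program.write_text(
        build_sas_program(user_code, tmp / "in_df.rds", tmp / "out_df.rds")
    )
    program_text = program.read_text()
    command = build_sas_command("sas.exe", program, tmp / "job.log")
    # With SAS installed, subprocess.run(command, check=True) would go here,
    # followed by reading out_df.rds back into R.
```

In the real workflow the caller would save the data frame to the input .rds file before running the command and read the output .rds file afterwards; those two steps happen in R and are left out of this sketch.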
Closing thoughts
We all love R, but there might be times where we want or need to use other programming languages. Tools like SASPy are making it easier than ever to combine multiple languages in the same place, but if these options are unavailable or infeasible to you for whatever reason, you might be able to find some clever workarounds if you're willing to put in the work.
For me, it's been a little over a year of exploring all these different options and more in order to figure out what's viable for me in my workplace, and I'm very happy to say that as of two weeks ago, some changes to our IT infrastructure have resulted in Statistics Canada employees being able to use SASPy without any issues at all. Thank you so much for listening. I'd be super happy to connect on LinkedIn or wherever if you want to chat about this more.