Resources

Javier Luraschi | Scaling R with Spark | RStudio (2019)

This talk introduces new features in sparklyr that enable real-time data processing, brand-new modeling extensions, and significant performance improvements. The sparklyr package provides an interface to Apache Spark to enable data analysis and modeling on large datasets through familiar packages like dplyr and broom.

View materials: https://github.com/rstudio/rstudio-conf/tree/master/2019/Scaling%20R%20with%20Spark%20-%20Javier%20Luraschi

About the author: Javier Luraschi is a software engineer with experience in technologies ranging from desktop, web, mobile, and backend to augmented reality and deep learning applications. He previously worked for Microsoft Research and SAP and holds a double degree in mathematics and software engineering.


Transcript

This transcript was generated automatically and may contain errors.

So, in order to give some context of where Spark fits when you're doing data analysis in R, it makes sense to ask this question, right? Like, if you have slow code, what do you do? And you know, like, it could be slow code because you're dealing with a lot of data. It could also be slow for other reasons, right?

So the answer is definitely not Spark, at least not as an entry point. One of the techniques that you can use is to just sample the data, right? If you have a lot of data, you can reduce the amount of data that you have. As long as you do it properly, using techniques like statistical sampling, reducing the amount of data is definitely progress in the right direction.
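As a minimal sketch of the sampling idea (not from the talk; plain R, no Spark needed), where big_df stands in for a hypothetical large dataset:

```r
# Simulate a "large" dataset; in practice big_df would be your real data.
set.seed(42)
big_df <- data.frame(x = rnorm(1e6), y = rnorm(1e6))

# Take a 1% simple random sample before doing exploratory analysis.
sample_df <- big_df[sample(nrow(big_df), size = 0.01 * nrow(big_df)), ]
nrow(sample_df)  # 10,000 rows instead of 1,000,000
```

As the talk notes, this only works if the sample is drawn properly; a biased sample can be worse than a slow computation.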

Another technique that you can use when you have slow code is, in the same way that Joe Cheng was presenting in the keynote this morning, you can use profvis, right? So if something is slow, you can look at why it is slow. It might be the case that it's your code or it's a package. You might have to use the package in a different way, or you might have to change to a different package, right? But definitely, profvis is a great tool, and you should use it when you hit this problem.
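A minimal profvis sketch (not from the talk), profiling a deliberately slow loop so the flame graph shows where the time goes:

```r
library(profvis)

# profvis() runs the expression and opens an interactive profile showing
# time and memory per line.
profvis({
  data <- data.frame(x = rnorm(5e5))
  result <- numeric(0)
  for (i in seq_len(1000)) {
    result <- c(result, mean(data$x) + i)  # growing a vector is slow
  }
})
```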

Scaling up vs. scaling out

And the next one is scaling up, right? So the next solution is you get bigger machines, which Joe presented and which Darby also mentioned in the previous session. One great way of scaling up is by saying, hey, if I have a bigger machine somewhere, we can now use the RStudio Server Pro job launcher to run that particular instance on a machine with more resources.

For me personally, I worked last year on a package called CloudML, and it was intended to take your deep learning model, package it up, and send it to Google Cloud, train it, and then get the results. What I've seen on the GitHub issues is that a lot of people are actually using this not for deep learning, but just in general for modeling, which is totally fine. You don't need to use it for deep learning if you don't want to.

And in general, there are multiple ways of scaling up. This talk is about scaling out, which means: if you're scaling up and you've hit a ceiling on the compute power that you have access to, or if you simply have a lot of spare machines to use, what can you do to take advantage of all those machines? And that's where Spark fits in.

Now, this slide is not fully comprehensive. There's many other ways of scaling up and scaling out, but at least that's the context of how you would usually scale out.

And for those of you who prefer diagrams, this is how it looks, right? You can scale up, or vertically, or you can scale out, or horizontally, basically by adding machines. It does mean that you need to install Spark on each machine in the cluster, but once you have Spark running on each machine, you can basically run, in this case, a linear regression across all the machines without having to manually run a subset of the data and then figure out how to aggregate it.
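A hedged sketch of that distributed linear regression with sparklyr (using mtcars as a stand-in dataset; on a real cluster you would point master at the cluster instead of "local"):

```r
library(sparklyr)
library(dplyr)

# Connect locally for illustration; Spark partitions and aggregates for you.
sc <- spark_connect(master = "local")

cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# The regression runs inside Spark, across whatever machines back `sc`.
model <- ml_linear_regression(cars_tbl, mpg ~ wt + cyl)
summary(model)

spark_disconnect(sc)
```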

Using R with Spark via sparklyr

So how do you use R with Spark? Well, first of all, sparklyr is an R package, so you need to install it from CRAN. The second step is to install Spark on that particular machine, which you do just by running spark_install(), pretty straightforward. And then you connect by calling spark_connect(master = "local"). With sparklyr, you can very easily work locally by specifying the parameter master = "local", but you can also run on a variety of cluster providers like Databricks, IBM, Microsoft, and Google, or on-premise distributions like Cloudera and MapR, et cetera.
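The setup steps described above, as a short sketch:

```r
# One-time setup: install sparklyr from CRAN, then a local copy of Spark.
install.packages("sparklyr")
library(sparklyr)
spark_install()

# Connect to the local Spark instance; to use a real cluster, swap `master`
# for the cluster's URL (e.g. a YARN, Mesos, or Kubernetes endpoint).
sc <- spark_connect(master = "local")
```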

So definitely, it's the same interface: you work with it locally pretty easily, and then you can change this parameter and connect to a proper cluster when needed. One of the design philosophies that we followed in sparklyr was to be as friendly to the R community as possible, so we try not to reinvent the wheel. If you already know how to use dplyr, you can use dplyr with sparklyr. If you want to use SQL with the DBI package, you can also just run dbGetQuery() and get data out of Spark, in parallel, with the ease of use of R.
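A sketch of both interfaces against the same Spark table (mtcars is used here as a placeholder dataset):

```r
library(sparklyr)
library(dplyr)
library(DBI)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed inside Spark.
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# The equivalent query through DBI, written as SQL by hand.
dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM mtcars GROUP BY cyl")

spark_disconnect(sc)
```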

Not only that, but you can also do modeling. There are more than 50 feature transformers and modeling tools available in Spark, and you can access them from sparklyr with ease.

One of the features that we worked on last year was pipelines, which Kevin Kuo presented last year. Pipelines allow you to take the workflows that you have already completed in your modeling step and export them to production.
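A hedged sketch of building, fitting, and saving a Spark ML pipeline with sparklyr (formula and file path are illustrative):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

# A pipeline is a sequence of feature transformers plus an estimator.
pipeline <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ wt + cyl) %>%
  ml_linear_regression()

# Fit it, then save the fitted pipeline so it can be reloaded in production.
fitted <- ml_fit(pipeline, cars_tbl)
ml_save(fitted, "mpg_pipeline_model", overwrite = TRUE)

spark_disconnect(sc)
```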

What's new in sparklyr

What's new in sparklyr? We covered streaming. Last year, we worked on MLeap, which basically allows you to take a pipeline, export it out of Spark, and put it into any Java-compatible environment that runs on the JVM. We added support for Kubernetes and a bit of RStudio 1.2 integration, so some of the great features of RStudio 1.2 are also available when using Spark: you can write a custom SQL query, or you can use the jobs pane to track your Spark jobs and then reopen a job in Spark.
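A hedged sketch of the MLeap export path using the mleap companion package (the exact setup may also require its install_maven() and install_mleap() helpers; names and paths here are illustrative):

```r
library(sparklyr)
library(mleap)
library(dplyr)

sc <- spark_connect(master = "local")
cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

fitted <- ml_pipeline(sc) %>%
  ft_r_formula(mpg ~ wt + cyl) %>%
  ml_linear_regression() %>%
  ml_fit(cars_tbl)

# Serialize the fitted pipeline to an MLeap bundle that any JVM
# application can load, without needing a Spark runtime.
ml_write_bundle(fitted, sample_input = cars_tbl, path = "mpg_model.zip")

spark_disconnect(sc)
```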

What are we currently working on? We're mainly working on two big features. One is support for Apache Arrow, and the other one is XGBoost on Spark.

Apache Arrow integration

So what is Apache Arrow? Wes did a much better job explaining this this morning, but the one-liner is: Arrow is a cross-language development platform for in-memory data. Really, the only thing that matters to you, I think, is that in an ideal world, you don't even need to know what Arrow is. We just want to make things faster for you, and that's exactly what we're trying to accomplish. So you should be able to load the arrow library, then sparklyr, and get performance improvements.


Let's take a quick look at how that looks. So what we're going to do is start another Spark instance, connecting again locally, and then we're simply going to process 10,000 records. That's not that much. Usually the first time takes a little longer because Spark is warming up, but it's just 10,000 rows. This computer only has two cores and, I think, 8 gigabytes of RAM, so nothing fancy at all.

So we're processing this dataset, and it takes nine seconds; the next time it should be a little faster. But what is crazy is that once you run this with Arrow, which is not on CRAN yet, it's under active development, we're processing one million rows, not 10,000, a hundred times more data, and we're doing that in four seconds. I don't even understand how this can be possible, but it's just great. So really hoping that you don't have to worry much about Arrow, but at least while using sparklyr you're going to get these performance improvements.
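A hedged sketch of the pattern the demo shows: loading arrow before sparklyr is enough for sparklyr to detect it and use Arrow's columnar format when moving data between R and Spark (at the time of the talk, arrow had to be installed from its development repository, not CRAN):

```r
library(arrow)     # load before sparklyr so it is picked up automatically
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Transfers between R and Spark are where Arrow helps most:
# copying data in, and collecting results back out.
df <- data.frame(x = runif(1e6))
tbl <- copy_to(sc, df, overwrite = TRUE)
result <- tbl %>% mutate(x2 = x * 2) %>% collect()

spark_disconnect(sc)
```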


XGBoost on Spark

And the other feature that we're working on is XGBoost on Spark, so you're going to be able to train XGBoost models on Spark, and that's exactly what we're doing here. It's an extension that Kevin Kuo is working on, and it basically trains a model using XGBoost. What is great about this is that the next talk is going to introduce MLflow, which is relevant to this case, because I'm basically leaving a model for Kevin to decide if he wants to use it in his talk, and he's going to teach you how to share models in a big data science team.
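A hedged sketch using the sparkxgb extension (the package Kevin Kuo was developing; at the time of the talk it was installed from GitHub, and the interface shown here is an assumption):

```r
library(sparklyr)
library(sparkxgb)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# Train an XGBoost classifier inside Spark via the usual formula interface.
model <- xgboost_classifier(iris_tbl, Species ~ .)
predictions <- ml_predict(model, iris_tbl)

spark_disconnect(sc)
```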

All right, so we covered those two, and with that I just want to say thank you and leave you with some resources. So thank you.

Q&A

Thank you so much, Javier. We do have time for a couple of questions, if anybody has some.

So I have a question about Arrow. That seemed really cool: you just load the library, and you don't have to do anything beyond that? sparklyr will automatically look for Arrow being installed?

Yeah, yeah. I mean, I'm not the expert in Arrow, but the way I see it, Arrow is a technology that provides support for developers, so the fact that we're using Arrow should be mostly transparent to you: you shouldn't have to change anything. And in fact, we're looking forward to Arrow because it should also give us some free features. Currently sparklyr doesn't support nested data; there are some restrictions around that. So by having a cross-language platform that can enable multiple technologies, features like nested data should become a free feature that we get in sparklyr just by using Arrow, which is great.

Hi, thank you. You demonstrated, right here, the Spark tasks as real-time: you called them on demand in R, they ran, and then they stopped. Do you have the capability to push the tasking straight to Spark and then return to the console, where you can then use the result from that, in a multi-node Spark topology?

Sorry, I'm not sure if I follow the question.

Okay, well, so you've got one topic you're pulling from, you do something to it, then maybe you push it back into another topic, and then another task is pulling from that topic. It seems like everything you're doing there is pulling from a Kafka topic and then doing stuff locally. Are you pushing tasks to Spark so that Spark continues to do that work while you're doing other things?

Right, that's the whole point of this, especially in the Shiny app. The Shiny app actually takes the raw stream, and if you're using a cluster, you basically have multiple machines pulling from Kafka, and the stream is processed in parallel: the aggregations and the filtering actually happen in the cluster. You are the one who chooses which data gets collected to the driver node, and you obviously don't want to collect a lot of data, but you also have the option of pushing it back to some other stream. So you can connect it back to a Kafka queue for output, or you can put it on a dashboard. You definitely need to be careful which data you end up collecting, but there is no restriction on how much data you process, as long as the destination can handle that much data, right? So that's, yeah.
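The pattern described in this answer, read from one Kafka topic, transform inside the cluster, write to another topic, can be sketched with sparklyr's streaming functions (broker address and topic names here are placeholders):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Subscribe to an input Kafka topic.
stream <- stream_read_kafka(
  sc,
  options = list(
    kafka.bootstrap.servers = "localhost:9092",
    subscribe = "input-topic"
  )
)

# The transformation runs inside Spark, not in the R session; only the
# output you choose to collect ever reaches the driver.
stream %>%
  mutate(value = as.character(value)) %>%
  stream_write_kafka(
    options = list(
      kafka.bootstrap.servers = "localhost:9092",
      topic = "output-topic"
    )
  )
```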