
Sparklyr: Using Spark with RMarkdown | RStudio Webinar - 2016
This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/web... . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message
Transcript
This transcript was generated automatically and may contain errors.
I want to go through Spark and do some examples of how to use Spark to analyze data that would not fit into R's memory. So I'll show you how you can use R and Spark together to analyze data that wouldn't fit in R. And I'm not just confident, I'm excited and optimistic about the direction some of these technologies are going.
Just from a personal standpoint, being able to see some of the tools that are now being delivered to analysts for large-scale data is extremely exciting, and I hope you can see some of that in today's discussion.
Alright, so if you've been analyzing data with R, chances are you're familiar with doing analytics on a laptop. Maybe you've used it on a server as well. You might be connecting to some sort of data store, whether that's a data warehouse, a database, a feed, flat files, or a file share, and we've all extracted data into R for analysis. That works great for things that might be a few gigabytes or even tens of gigabytes, and that's fine, because a lot of problems don't require any more firepower than that, and for that set of problems R will take care of the solution by and large through the large number of packages that are available to us.
However, some problems are very large, with hundreds of gigabytes, terabytes, or even petabytes of truly large data. They're involved with production workflows, they might involve streaming technologies, and they might have a variety of complex data sources and pipelines that need to be addressed at different stages. For those problems the question becomes: what is the role of R there? If I have data that are truly large and I want to do analytics on that data, how can I go about doing that?
Now there are a number of ways to address that problem. Spark is not the only one, but I want to show you why Spark is one way that you can solve it.
What is Apache Spark?
So what is Apache Spark? It's a fast and general engine for large-scale data processing, and I want to emphasize the engine part of that: it's used to manipulate data at scale. It can integrate with the Hadoop ecosystem, which is important because often the data that you want to access is already in Hadoop, so the fact that Spark can integrate with that ecosystem is extremely important.
It also supports Spark SQL, which is basically HiveQL (HQL). It has nice built-in machine learning algorithms, which we're going to talk about, that increase the toolkit for analyzing data at scale, and it's designed for performance: it was designed to work interactively with the analyst, as opposed to batch processing. Finally, it's extensible, so you can extend the Spark platform to include other modules, custom code, or other things that you want to contribute to that ecosystem.
Introducing sparklyr
So sparklyr is an R package that RStudio built. We released it onto CRAN a few weeks ago. It's open source and free to use, and as I mentioned, Spark and Hadoop are also open source, so you have a complete open source stack here if you choose to use it.
We integrated sparklyr with the RStudio IDE. We created a new Spark pane that allows you to browse the data and metadata inside your Spark instance, and we created a dplyr backend for sparklyr as well, so you can use dplyr to do all of your SQL translation on your Spark DataFrames. Finally, we made it extensible, so you can plug other things into it, as we previously discussed.
So if you're investing in Spark, if your organization is investing in Spark, there's really nothing stopping you from using it with the full power of R, and that's my main message to you today: if your organization is interested in Spark or using Spark, then the R analysts can access it as well. That's important because Spark appeals to a wide audience; other people that use Spark might have skills in Scala, Java, or Python, and we want you to know that if your tool of choice is R, you can also use Spark to do your work.
Now sparklyr communicates with Apache Spark through APIs, and I want to talk about three of these today: Spark SQL, which we use dplyr for; machine learning; and extensions. So let's jump into the first one, which is dplyr.
dplyr and R notebooks
If you haven't seen dplyr, dplyr is basically a fast and consistent tool for working with data-frame-like objects, both in memory and out of memory. Here's an example of dplyr code: you have an object called my table, you filter it, and you select some columns, and this example uses the pipe notation so that all of these commands get chained together into one execution at runtime.
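A chained query like the one described above might look like this in R (a minimal sketch; `my_table` and its columns are hypothetical stand-ins for whatever data you're working with):

```r
library(dplyr)

# A hypothetical data frame standing in for "my table"
my_table <- data.frame(
  carrier   = c("AA", "UA", "DL"),
  dep_delay = c(12, -3, 45)
)

# Chain verbs with the pipe; the steps read top to bottom
my_table %>%
  filter(dep_delay > 0) %>%    # keep only delayed departures
  select(carrier, dep_delay)   # keep just the columns we need
```

The same chain works unchanged whether `my_table` is a local data frame or a remote Spark table, which is the point of the dplyr backend.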
So I want to put in a quick plug for dplyr, because I've used R for a long time, and for most of my history I wasn't using dplyr, because it didn't exist. I would try to make my R code look really pretty, and I would always fail miserably at that attempt. When I started using dplyr, my code started looking a lot more regular and a lot cleaner, and that meant that when I went back to my code six or nine months later, I could actually read it again, which was a huge benefit. If you're familiar with nesting commands and the other data manipulation techniques in R, you know that you can write very powerful, efficient things that are completely incomprehensible six months down the road. So that's my plug for dplyr: it will clean up and regularize your code.
The other thing I want to show you is R Notebooks. An R Notebook is an R Markdown document, and if you've seen R Markdown documents, you know that's our web authoring tool for writing documents in R. Notebooks give you interactive code chunks and inline output, and we're going to be using those today. I just wanted to give you a heads-up, because Notebooks hasn't been officially released; it will be released in the next few weeks.
Live demo: connecting to Spark
So now that we've done the intros and the background on Spark I want to get to the real meat of the presentation which is actually using sparklyr and Spark and dplyr. So here what we're going to do is we're going to manipulate some data with dplyr and Spark using that Spark SQL API and we're going to do that on a local instance of Spark.
So this is the RStudio Server version of our product; you can tell because I'm in a web browser. This is also the professional version of the product, and you can tell it's professional because I have multiple versions of R here and I can support multiple sessions as well.
Alright, so what I want to show you today is this new Spark tab that's been added to the IDE. If I do a new connection, I can say I'm going to connect to local, so this is just going to be local to my machine; I'm not connecting to a Hadoop cluster, I'm keeping it right on my local machine. I'm going to go ahead and use dplyr, and I'm going to choose Spark 1.6.2, which is the default, and Hadoop 2.6. These are dependencies that are required to run Spark.
It writes this code for me which is nice and I'm going to go ahead and copy it into a new R notebook. I'll hit that and I'll hit connect. So this is the notebook, it's got this green bar, it means that these commands are running and you can see that it's going to go ahead and connect here to the Spark context on my machine.
If you haven't created a Spark connection before, you have to install Spark, and there's a simple command for doing that as well: spark_install(), which will install the Spark dependencies required for you to run on your local machine.
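The install-and-connect sequence described above looks roughly like this (a minimal sketch; the version numbers match the ones chosen in the demo's connection dialog):

```r
library(sparklyr)

# One-time setup: download and install a local copy of Spark
spark_install(version = "1.6.2")

# Connect to a local Spark instance (no cluster required)
sc <- spark_connect(master = "local", version = "1.6.2")

# ... work with Spark through sc ...

# Close the connection when finished
spark_disconnect(sc)
```

The `sc` connection object is what every subsequent sparklyr call uses to talk to the Spark context.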
Okay, so it looks like it connected, which is great, and I can open the Spark UI and see that the jobs ran. This UI is not provided by RStudio but by Spark. You can see things like storage (there's nothing in storage yet) and the executors; it's just the driver node, which is me.
Now if I want to put data into it, I have a variety of ways to load data, but here what I'm going to do is spark_read_parquet: I pass the connection, I create a new dataset called titanic, and it loads from a data source called titanic. Parquet is a columnar, compressed data format, so it works very well with Spark; it's a very efficient way to load data into and out of Spark, which is why Parquet is often associated with Spark.
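A call like the one described might be sketched as follows (the file path is hypothetical; substitute wherever your Parquet data lives):

```r
library(sparklyr)
sc <- spark_connect(master = "local")

# Load a Parquet file into Spark; "titanic" becomes the table name
# that appears in the Spark pane and the Spark UI
titanic <- spark_read_parquet(
  sc,
  name = "titanic",
  path = "data/titanic.parquet"  # hypothetical path
)
```

The returned `titanic` object is a remote table reference: the data stays in Spark, and dplyr verbs applied to it are translated to Spark SQL.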
So I'll go ahead and run that, and this just takes a second, and you can see that these data are now loaded into Spark; they also show up in the Spark UI. I can hit the down arrow to see the data that's in Spark, and I can click this tab and again see all of the data in Spark. This is great because these are not data in R; I'm looking at data in Spark. That's very powerful, because it means that when you connect to a Spark context sitting on top of Hadoop, you'll be able to browse all of the tables in the Hive metastore with your Spark tab, and I'll show you that in a minute when we go to cluster mode.
Manipulating data with Spark SQL and dplyr
So that's how to connect; now I want to show you manipulating the data with Spark SQL. I'm going to open up a new file here, and this one uses the flights data: I want to take the data on New York City flights and put that into my Spark instance as well.
So you can see I'm going to load the flights data and the airlines data, and they're going to show up right here, and again I can take a look. This is what the flights data look like: you get a record for every flight that tells you the carrier, like United Airlines, the origin airport, the destination airport, and how long the flight took. It also tells you whether it was running on time or delayed.
So now that we're all set up, let's go ahead and run some activities. The first thing I want to show you is the dplyr verbs: select, filter, arrange, mutate, and summarise. Here you can see I've written them into a single query with select, filter, arrange, and mutate. If I run this, it does all of those operations in Spark and returns the result in the console, because I'm using notebooks. So I've run Spark SQL on top of a dataset in the Spark context and returned the results very, very quickly.
This is a small dataset, but it scales up to very large datasets as well, and we'll work with a large dataset at the end of the demo. If you want to do grouping, you can group by something and then summarise; in this case I can find the mean and standard deviation of the delays for every month.
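The verbs-plus-grouping pattern described above might be sketched like this, assuming the nycflights13 package is installed to supply the data:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy the nycflights13 flights data into Spark as a remote table
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")

# These verbs are translated to Spark SQL and executed in Spark;
# only the small result comes back to the console
flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  group_by(month) %>%
  summarise(mean_delay = mean(dep_delay),
            sd_delay   = sd(dep_delay)) %>%
  arrange(month)
```

Nothing here is R-specific computation: `mean` and `sd` are translated to their Spark SQL aggregate equivalents.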
Window functions: if you use window functions in SQL, you'll know that these are supported, because anything in HiveQL is available here, and dplyr supports window functions. I've ranked the top three departure delays from worst to least worst by carrier, so you can see this is a 225-minute delay; those poor, poor people.
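A per-carrier ranking like the one described can be sketched this way (assuming the `flights_tbl` remote table from the earlier copy_to step):

```r
library(dplyr)

# min_rank() inside a grouped filter translates to a
# RANK() OVER (PARTITION BY carrier ORDER BY dep_delay DESC)
# window function in the generated Spark SQL
flights_tbl %>%
  group_by(carrier) %>%
  filter(min_rank(desc(dep_delay)) <= 3) %>%
  arrange(carrier, desc(dep_delay)) %>%
  select(carrier, flight, dep_delay)
```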
The joins that are supported are, of course, inner joins, outer joins, right joins, and left joins, the typical joins, and in my example I do a single join, but you can do nested joins as well. So let's come back here and run this. This basically joined in airlines, and now American Airlines actually has a label, and Virgin America has a label as well.
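The airline-label join described above can be sketched like this (again assuming the `flights_tbl` table and an `airlines` table copied into the same Spark connection):

```r
library(dplyr)

airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")

# Attach the full carrier name to each flight; the join
# runs inside Spark, not in R
flights_tbl %>%
  left_join(airlines_tbl, by = "carrier") %>%
  select(carrier, name, origin, dest)
```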
SQL translation: all of these functions are available to you through SQL translation. Here I'm going to show you a case statement, another window function, and the join, and this time, instead of actually running the query, I'm going to render the query so you can see what is being passed to Spark. So here's the query: dplyr translates things into your window function, your case statement, and your join (it's joining on carrier). This is the SQL statement being passed back, and this is also very powerful because it means dplyr can be used with other databases. If you haven't tried that, I'd encourage you to look at the dplyr help pages to learn more about how dplyr can communicate with your SQL databases.
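You can inspect the generated SQL yourself without running it; one way is dplyr's `show_query()` (a sketch, using a case-statement-style `ifelse` on the hypothetical `flights_tbl`):

```r
library(dplyr)

# Render the translated SQL instead of executing it;
# ifelse() becomes a CASE WHEN ... END expression
flights_tbl %>%
  mutate(status = ifelse(dep_delay > 15, "late", "on time")) %>%
  select(carrier, dep_delay, status) %>%
  show_query()
```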
Alright, laziness and piping. Here I've created two queries, and the second is based on the first: in the first I do some basic processing, and in the second I say, oh, I forgot, I want to convert air time into hours, so I divide it by 60, and then I take a look at that operation. When I run the first one, nothing executes; when I run the second one, nothing executes; but when I run this last one, then it executes.
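That lazy-evaluation behavior can be sketched as follows (using the hypothetical `flights_tbl` from earlier):

```r
library(dplyr)

# Defining these sends nothing to Spark yet; they are just
# query descriptions built on top of one another
c1 <- flights_tbl %>% filter(!is.na(air_time))
c2 <- c1 %>% mutate(air_time_hours = air_time / 60)

# Only when results are actually needed (printing, collect(),
# head(), etc.) does Spark run the single combined query
c2
```

Because the whole pipeline is translated into one SQL statement at execution time, adding steps to a query costs nothing until you ask for results.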
And then finally, once all of this data has been manipulated (hopefully you've seen by now that you can manipulate your data very easily with dplyr) and you're ready to move it somewhere or make it reusable, you can register the table, cache the table, and assign a reference. That populates the reference up here, so now I have access to this table; since I cached it, it's in memory, and since I created the reference, I can refer to it at any future time.
Typically you'll also want to collect data into R. The collect statement is what brings data from Spark into R so you can look at it there. This becomes extremely important because what you probably want to do is the heavy lifting in Spark, and then, when you're ready to do visualization, additional analytics, or publishing, bring over a much smaller representation of that data into R. Collect is what lets you do that.
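The register-cache-collect workflow just described might look like this (a sketch; table and column names follow the hypothetical flights example):

```r
library(sparklyr)
library(dplyr)

# Register the query result as a named Spark table, then cache
# it in Spark's memory for fast repeated access
flights_clean <- flights_tbl %>%
  filter(!is.na(dep_delay)) %>%
  sdf_register("flights_clean")
tbl_cache(sc, "flights_clean")

# Do the heavy lifting in Spark, then collect() only the small
# summary back into an ordinary R data frame for plotting
monthly <- flights_clean %>%
  group_by(month) %>%
  summarise(mean_delay = mean(dep_delay)) %>%
  collect()
```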
All right, so once that data is in R, nothing stops you from doing anything else: you can visualize it here, or you can look at pairwise comparisons, which is not a function in Spark but is in R, and you can tab between the two outputs by hitting these thumbnails; there are two commands here, and two thumbnails that go with each command. And finally, there are some other niceties: you can sample the data, which is great because you might want to sample some of your data from Spark into R, and it's as easy as specifying the size of the sample that you want; and you might want to write the data out in Parquet format for future use.
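Those last two niceties, sampling and writing Parquet, can be sketched as follows (the output path is hypothetical):

```r
library(sparklyr)

# Draw a 1% sample inside Spark; sdf_sample takes a fraction
sample_tbl <- sdf_sample(flights_tbl, fraction = 0.01,
                         replacement = FALSE, seed = 42)

# Persist a table back out as Parquet for future sessions
spark_write_parquet(sample_tbl, path = "flights_sample.parquet")
```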
So that's basically it in a nutshell. The last thing I want to show you about this notebook is that once this has all been done and you've commented your code like this, you can hit preview and see all your code with all the comments, formatted in HTML. This document now describes exactly what we did: we can see the queries that we ran, the HTML table embedded in the document, and the images embedded as well, and it's a single file that I can email out, publish on RPubs, or push to our new product called RStudio Connect. So this is meant to be shared and built upon for reproducibility purposes.
Machine learning and extensions
Okay, that in a nutshell is sparklyr and dplyr. Let's move on to some of the other topics. Like I mentioned before, dplyr is one of the interfaces to Spark; the others are ML and extensions. Spark MLlib is Apache Spark's scalable machine learning library; it's very easy to use, it's easy to deploy, and it's up to a hundred times faster than MapReduce. Extensions make it easy to invoke the rest of the Spark API from sparklyr.
We have a website, spark.rstudio.com, that has all sorts of great information, and one of the things it shows is ML and the extensions. These are the ML algorithms available to you, and I think you'll recognize most of them, because there are analogs to them in R, but these algorithms run inside Spark.
You also have transformers, utilities, and extensions. When you're ready to try this, let me show you a quick example, again using Spark 1.6.1 and the mtcars data. I'm going to partition the data into training and test sets with a seed, then run a linear regression using a standard R formula, but I'm not going to run it in R, I'm going to run it in Spark, and then I'm going to summarize the fit from the Spark model. You'll see that the output is formatted very similarly to what the linear model command produces in R.
Then I can take the prediction function, predict on the test data, collect the predictions back into R memory, and do a plot. So here, finally, is the analysis of that data; the plot is in R, but the analysis was done entirely in Spark.
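The partition-fit-predict workflow described above can be sketched like this (a sketch: the formula and partition sizes are illustrative, and current sparklyr uses `ml_predict()` where older releases used `sdf_predict()`):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars")

# Partition into training/test inside Spark, with a seed
partitions <- mtcars_tbl %>%
  sdf_partition(training = 0.75, test = 0.25, seed = 1099)

# Fit a linear regression in Spark using a standard R formula
fit <- partitions$training %>%
  ml_linear_regression(mpg ~ wt + cyl)
summary(fit)  # formatted much like summary() of an R lm fit

# Score the test set in Spark, then collect predictions into R
pred <- ml_predict(fit, partitions$test) %>% collect()
```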
So if you're interested in using other functions, you have access to them through the extensions, and this page explains how. It often boils down to the invoke command, which calls methods on the underlying Spark objects directly and gives you a lot of flexibility.
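A minimal illustration of `invoke` (a sketch, assuming the hypothetical `flights_tbl` remote table from earlier):

```r
library(sparklyr)

# spark_dataframe() exposes the underlying Java/Scala DataFrame;
# invoke() then calls one of its methods directly, here count()
n <- flights_tbl %>%
  spark_dataframe() %>%
  invoke("count")
```

This is the same mechanism extension packages use to reach parts of the Spark API that sparklyr doesn't wrap.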
One of the main extensions I want to talk about today is H2O's Sparkling Water extension. They created an H2O extension, which you can access under the ML tab, that allows users to move quickly between Spark DataFrames and Sparkling Water (H2O) frames, and that's important because you get access to a whole new set of algorithms. So now you have access to all of these GLM algorithms, and I'm sorry that I can't actually run through all of these today; we just don't have enough time. I would love to, and maybe we can do that in a future session where we do a deep dive into the algorithms in H2O, but just know that you have all of these in your toolkit now as well.
I'm going to highlight just a few of these. One is PCA; you'll notice there's some really nicely formatted output, which makes an analyst's life a lot nicer. They also produce some plots for you, which can be very handy, and then they have some really powerful features like grid search: I set up all these parameters, then run all of the models and compare them automatically using grid search. You can see that I ran 36 models, none of them failed, and I ordered the models from best to worst; the best model had an MSE of 88, it was model 35, and these were its parameters. So I've used H2O to find out exactly which parameters are going to give me the best performance.
Spark deployment and cluster mode
So, moving on to Spark deployment. Like I said before, you're probably not going to get much use out of Spark on your local machine; it's nice to play around with and learn from, but if you can do it on your local machine, you can probably do it in R anyway, for the most part. The reason you really want to use Spark is when you've got distributed data in cluster mode, and there's a variety of ways to configure your environment and infrastructure to use Spark.
I'm just going to call out these two: standalone mode, and, if you're using Hadoop with YARN, a yarn-client connection that will let you use that mode. We recommend at this point that you use RStudio Server and that you put it on the driver node, sometimes called a gateway node, which is basically part of the Spark cluster. The reason is that the driver does a lot of communication: these two arrows here become really important, because the driver is communicating with the worker nodes, so there's a lot of interactivity going on within the cluster itself.
Demo: 1 billion records with NYC taxi data
So at this point we recommend putting RStudio Server, with the connection UI and R, on that Spark master, and using a connection string something like this to connect to your Spark cluster. That's the exact configuration I'm going to show you right now: sparklyr with 1 billion records, 200 gigabytes of uncompressed data, the infamous New York City taxi trips dataset, all loaded in Spark. There's no way you could load that data and operate on it efficiently in open source R.
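A yarn-client connection string with some cluster tuning, as described, might be sketched like this (the specific core and memory values are hypothetical; tune them to your cluster):

```r
library(sparklyr)

# Tune cores and memory for the driver and executors
conf <- spark_config()
conf$spark.executor.cores  <- 4     # cores per executor (illustrative)
conf$spark.executor.memory <- "8G"  # memory per executor (illustrative)
conf$spark.driver.memory   <- "4G"  # driver memory (illustrative)

# Connect from the gateway/driver node to the cluster via YARN
sc <- spark_connect(master  = "yarn-client",
                    version = "2.0.1",
                    config  = conf)
```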
I will say that there are other ways to analyze this data; it doesn't have to be Spark, so let's be clear about that. But if you've chosen to use Spark and that's the technology you want to use, then this demonstration will hopefully help you get started on your larger projects.
Alright, so these data now live on an EC2 instance. I'll just pull up the Spark UI here; you can see my executors, I've got five of them, and I'm going to analyze the taxi trips data. I'm going to load all of the packages that will help me do that, and I want to point out that I'm going to tune the cluster somewhat: I'm going to decide how many cores I'm using on the driver and on the executors, and how much memory for each, and then I'm going to connect to Spark 2.0.1 in yarn-client mode.
I already ran this because it takes a few minutes to cache all of the data, but the ultimate output of this setup is the trips in Parquet format, joined, and I can see all of my data here. This is what I was saying before: all of these data live in your Hive metastore, and these are very large tables. This table is a billion records, and I'm only showing the top 1,000, but I can browse it easily from inside the IDE right now, and I can see the geolocation as well as the borough and the NTA, which is nice because a lot of my analyses are going to involve trips between geographic locations.
So the first thing I want to do is get a sense of how big these data are, so I'm going to do a count, and then I'm going to see what the counts look like by year, so I'm going to group by year and run this. Keep in mind, again, this is a billion records, and it's designed to be fast, so we'll see how fast it is. Go ahead and run it: it pulled off the size of the data very quickly, and then it did the counts by year very quickly, and I can see my outputs here.
So I just ran a very simple aggregated query on 200 gigabytes of data, and it took, I don't know, a second or so; then I collected the data and ran the plot command. If you've run any MapReduce jobs, you'll notice that was much faster than MapReduce.
Alright, this section is going to show you some nice window functions again: it's computing percentiles, which is pretty cool; it's doing some Unix timestamp conversions on the dates; and it's going to map the difference between two locations and then plot them. So you can see, again, running this statement and collecting here is where we've paused, and then we're good to go.
So again this ran very quickly. Now, granted, you do have a where statement here that's going to limit the data, but it's still scanning through all of that data; I haven't indexed it on these locations like a traditional database would, so I'd argue it's still running fairly quickly.
Now that I've done that, and this is just briefly: this axis is your pickup time and this one is your trip duration, so you can see that at 15, which is 3 in the afternoon, taxi trips between these two locations take a lot longer than they do in the middle of the night, and also in the morning: basically the commute times, right.
Okay, so this one is going to actually run a model. I'll go ahead and kick it off here. We're going to partition the data into training and testing, and I'm going to cache that data so I can run the model. Here's the formula; again, you can use your standard R formulas. Then I'm going to run an ML regression on the training data and summarize it. And then I can see which variable has the largest t-value: fare amount, right, fare amount has the most significance in this case, which makes sense because we're predicting trip amount.
If you want to do some visualizations with HTML widgets, it's a similar type of code: I'm going to do some aggregations, and an HTML widget is basically a JavaScript library, wrapped in R, that makes these plots interactive. Here I can see the pickup is the airport, that's the green circle, and then the most popular drop-offs: there's LaGuardia here, JFK and LaGuardia, and then Midtown appears to be the most popular drop-off area. I do not live in New York City, but that makes sense to me.
Alright, so that's HTML widgets, and finally I'm going to show a Shiny Gadget. Shiny Gadgets are Shiny apps that you run inside the IDE. I can choose a pickup location and a drop-off location and do a plot; the first run takes slightly longer than subsequent ones. So here are the pickups and drop-offs between those two locations, here's the map, and here's the data.
Now if I want to change this, I just go back to the inputs and choose anything else. I can pick West Village, go here, and this will update in a second or two. There you go. So you can do any sort of interactive analysis you want here, draw the map again, and get a sense for how performant this is on a large-scale dataset residing in a Hadoop cluster.
And then finally, when you're done with this, you can hit preview and you've got the entire document: reproducible, ready to share, and great documentation for any future use. Again, the widgets are embedded in here, as well as all of the other images, the code, and the prose that you used to describe your data.
Resources and Q&A
Okay, so that's analysis on a billion records. I want to put in one more plug for the spark.rstudio.com website, which documents a lot of these things. There are some great examples there that you can use to get started, including that taxi data, as well as some other nice notebooks and an end-to-end analysis. You can run dashboards and Shiny applications on this as well, and you can see how we've gone about that in this example, and you can get a lot more information about deployment and the various function calls; it's complete documentation of the package.
Alright, first question: what type of EC2 instance was used when running the New York City taxi data? I used C3 instances, and I think I was using extra-larges; that helps, since C3 is the compute-optimized family.
Are there plans to support tidyr with Spark? I'm not sure exactly what functionality you're looking for with tidyr. The tidyverse is obviously extremely important to us, and the dplyr package is generating the SQL code that is passed back to Spark, so I think that's as much as I can tell you at this point.
For deep learning functions, that's obviously going to be an H2O-related question, so you'll want to dig into H2O. There is an h2o package in R, and I kind of glossed over that fact; that might be a good place to start, and if that can't answer your question, then go to Sparkling Water.
When exporting data from Spark, are the attributes kept? When you're talking about exporting, you're talking about collecting the data, so I think the question is: if I've got data in Spark and I collect it back into R, do the formats stay the same? For the most part, yes, they'll be the same; numerics get copied over as numerics and so on.
I find this webinar very interesting; my question is about simulations: does using this package assure me that I can get simulation results faster? I think that's one thing the H2O grid search example is supposed to show you: if there are technologies that allow you to do that, then yes. But keep in mind that some of the things happening here are in R and others are happening in Spark, so you really need those simulation techniques to exist in Spark in order to leverage them with sparklyr; sparklyr itself doesn't do any of those things. If you want to create your own, of course, you can create your own and make an extension, and that would work as well.
Okay, the main difference between sparklyr and SparkR: sparklyr has the dplyr backend, communicates with those APIs, is downloadable from CRAN, and has the extensions. SparkR is a completely separate architecture done by a completely separate group, so those are two independent and distinct technologies.
Is the beta version of RStudio needed to use sparklyr, or will the current version work without the Spark tab and the new features? You don't have to use the IDE to run sparklyr; you can productionize it in an R shell or whatever, just run it from the terminal, and it'll work fine.
It looks like the beta and sparklyr have trouble with ymd_hms-style date parsing; when do you think that functionality will be available? It sounds like you've already used it and have some experience, so I encourage you to send those issues to the GitHub repo. sparklyr is open source, so we would love to hear from you on that. And that's a great place to end: if you have feedback for us, please send it to the GitHub repo for sparklyr, and we look forward to hearing back from you.
