Introducing an R interface for Apache Spark | RStudio Webinar

Transcript

This transcript was generated automatically and may contain errors.

Good morning everyone. Thank you so much for joining today's webinar. Before we get started, I would like to review what you can expect out of today's presentation. First, we have new slides and illustrations, so if you have followed sparklyr for some time, hopefully today's material will be new to you. Also, the demo code that we're going to use today, I'm calling it reproducible, mainly to say that we're not going to use a large Spark cluster or a YARN cluster with a setup that nobody can ever recreate. I'm going to use local mode, mainly because I want to focus on what sparklyr does. And because of how cool it is that you can actually run Spark in local mode, I'll be able to show you some of the inner workings of Spark as we go through the code.

Even though this webinar is not a "what's new in sparklyr," I will be highlighting some of the new features in sparklyr. There are three specific features that I'm very excited to share with you today, so we will definitely be reviewing those. As mentioned here, the goal today is to share with you something new and useful. If you have been using sparklyr for a while, it would be great if there are one or two things that you can pick up today and try later. Or, if you're new to sparklyr, hopefully most of this presentation will be beneficial for you.

Architecture overview

All right. So, let's move to the first slide. I was actually trying to look for something that would represent this overall architecture, and ended up just putting this together. Usually, we talk about sparklyr and R, how sparklyr provides a dplyr backend to Spark, how it also makes the machine learning libraries that are inside Spark available in R, and that, of course, you can create extensions. But behind all that, I want to take a few minutes to review how everything starts from the storage and ends up in R.

So, for example, we have big data storage sources, like HDFS or Hadoop. We can have an S3 bucket, databases, file systems, which are all accessed in Spark via a package. Now, these packages are not to be confused with R packages. These are Spark packages that are specific to the kind of data source that you're going to access, right? So, let's say that you want to access a Redshift database via Spark, then you would use a specific package to be able to do that, so that Spark can read the data. Once Spark is connected to your big data source, it opens the door for all these really cool things, like cluster computing, and the fact that you can do all this stuff in memory is really big.

Also, there's the fact that you can extend the API, meaning that not only does it have things that come out of the box, like the machine learning libraries, but you can also create your own programs that run in parallel. Another thing is that it provides a SQL interface when you're interacting with data, which is really good, because it gives you a very easy way to start using your data with Spark, and it also enables things like the dplyr backend, for example.

One of the most important things I want to mention here as a tip, or something that sometimes is not fully clear, is that Spark is not in itself data storage. All the data comes from the source, which sometimes causes some confusion, because you may have a different product, a data product that says, hey, we can access all these data sources, like Oracle, we can access Impala, we can access Spark. So Spark looks like a data source, but it's really not. That's because the product interacts with Spark, which in turn interacts with something in the background. But Spark doesn't have data. As soon as you turn off the session, the data that you would have cached in memory is gone, right? So that's very important to notice, because it will help you, as you are working with sparklyr, to manage those data frames and the data that you're interacting with.

So we mentioned that Spark accesses the big data sources via packages, and then sparklyr uses the Spark shell to access the Spark capabilities, and makes all these cool things available to R via R functions. So let's go ahead and dig deeper into this portion right here about the Spark shell and sparklyr.

So, you can see, even in local mode, how much performance you can get out of caching your data. And how important that is for data analysts, right? Because this allows you to iterate constantly over the data without the tax of having to go back every time to get more information or wait for it to read the files again.
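The caching idea can be sketched roughly as follows. This is a minimal illustration, not the demo code: it assumes a local Spark installation is available to sparklyr, and the small data frame and the table names `flights_cache_demo` and `delays_cached` are made up for the example.

```r
library(sparklyr)
library(dplyr)

# Local mode: Spark runs on this machine, no cluster required
sc <- spark_connect(master = "local")

# Stand-in data (assumption); the webinar demo used much larger flight files
flights_local <- data.frame(
  origin    = c("DFW", "DFW", "ORD", "ORD", "ATL"),
  dep_delay = c(5, 30, -2, 12, 45),
  stringsAsFactors = FALSE
)
flights <- copy_to(sc, flights_local, "flights_cache_demo", overwrite = TRUE)

# compute() stores the result of the pipeline as a table in Spark memory,
# so later steps read from the cache instead of re-running the pipeline
delays <- flights %>%
  filter(dep_delay > 0) %>%
  compute("delays_cached")

# This aggregation now runs against the cached table
origin_counts <- delays %>%
  count(origin) %>%
  collect()

spark_disconnect(sc)
```

Because Spark holds no data of its own, the cached table disappears once the session is disconnected.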

Sampling and new dplyr features

Another cool thing in sparklyr 0.6 that I want to highlight is an improved way of doing sampling. So, if you use sample_frac or sample_n, you can use the same ones in sparklyr. And if you tried it before 0.6, it was probably really, really slow for you. Just know that that's improved now. It's going to be much faster than before. It's going to take a little while here, but it is much, much better than it was, because it's using a way of sampling that is more native to Spark, rather than the more complex operations it was doing before.

Another tip I want to mention is that between sample_frac and sample_n, inside Spark, it's just the same TABLESAMPLE command. But for some reason, TABLESAMPLE with just a number, like sample_n uses, doesn't work as well, in my opinion, as sample_frac. In the result here, you notice that between the two years that we have in the data, it's pretty even, bringing in data from both files. When I try this with sample_n, it actually just brings more data from 2003. I tested this again some time back, and it was the same thing. So, our recommendation is that you use sample_frac and use a fraction of the data. Just as a tip. Again, this is not a sparklyr thing. It's more about how Spark, or even Hive, does the selection.
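A rough sketch of the recommended fractional sampling follows. It assumes a local Spark installation; the two-year data frame is synthetic and only stands in for the two yearly files used in the demo.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Synthetic stand-in (assumption): 500 rows for each of two years
flights_local <- data.frame(
  year      = rep(c(2003, 2004), each = 500),
  dep_delay = as.numeric(1:1000)
)
flights <- copy_to(sc, flights_local, "flights_sample_demo", overwrite = TRUE)

# Sample roughly 10% of the rows inside Spark
sampled <- flights %>% sample_frac(0.1)

# Check how the sample is spread across the two years
by_year <- sampled %>%
  count(year) %>%
  collect()

spark_disconnect(sc)
```

The sample is random, so the per-year counts in `by_year` will vary from run to run.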

New in dplyr. So, if you attended Hadley's latest webinar talking about what's new in dplyr, we now have pull and case_when. I wanted to point out that pull and case_when do work in sparklyr, and I think that's really, really nice, especially pull. I use it all the time. It lets you take the first column, if I don't pass anything, or if I pass the name of a field, then it's going to take that field and bring it back as a vector. Of course, this is a character vector that I can then use for other things. It's very convenient, because we always end up having to do some subsetting after running a command. Whenever you use pull, you don't need collect, so it makes it very convenient at the end of my chunk here.

Right now, we're using it for an actual analysis, right? So, we're telling it to give me the top five airports with the most activity, and we have the top five here ordered by how much activity they have. case_when also works, and I did a quick one here that labels each origin airport depending on how many flights it has: if it's over 200,000, it's big, over 100,000, it's medium, and everybody else gets the default value, small. And then you can see that working, and you saw how fast that is. I'm not running this in R memory, right? Everything is run in Spark, and it runs really fast, because of how optimized it is here, and the fact that it's broken out into smaller files.
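As a hedged sketch of the two dplyr verbs just described, the following uses a small invented table of per-airport flight counts (the airport codes and numbers are assumptions, not the demo's results) and assumes a local Spark installation.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Invented per-airport activity counts (assumption, not the demo data)
activity_local <- data.frame(
  origin  = c("ATL", "ORD", "DFW", "LAX", "IAH", "PHX"),
  flights = c(410000, 380000, 350000, 150000, 120000, 90000),
  stringsAsFactors = FALSE
)
activity <- copy_to(sc, activity_local, "activity_demo", overwrite = TRUE)

# pull() returns a single column as an R vector, so no collect() is needed
top_five <- activity %>%
  arrange(desc(flights)) %>%
  head(5) %>%
  pull(origin)

# case_when() is translated to a SQL CASE WHEN and evaluated inside Spark
sized <- activity %>%
  mutate(size = case_when(
    flights > 200000 ~ "big",
    flights > 100000 ~ "medium",
    TRUE ~ "small"
  )) %>%
  collect()

spark_disconnect(sc)
```

The `top_five` vector can then be used in a later `filter(origin %in% top_five)` step.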

SDF functions and feature transformers

I mentioned SDF functions. This is the second feature I want to highlight, because I think it's really awesome how it works, and we've had a lot of customers and folks who are using sparklyr asking about, you know, what about spread? You know, with tidyr, I always use spread, and I have to bring the data into memory and then do the spread there. So now, with the SDF functions, you can do basically the same operation, and for those of you who haven't used spread yet, I want to show you the concept of it.

So here, I have the data grouped by origin and destination, and I'm taking my top five airports. Instead of having one row for each combination like this, I want to see just one line for DFW, one line for ORD, one line for AAH, and then a column for each of the destinations and the number inside those columns, right? If I want that crosstab look to it, you can actually do all that inside Spark by just sending it to sdf_pivot. So I'll run this, and the pivoting itself is taking place in Spark.
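The crosstab idea might look roughly like this. The route data is invented for illustration, a local Spark installation is assumed, and the default aggregation of `sdf_pivot` (a count of rows per cell) is used.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Invented origin/destination pairs (assumption, not the demo data)
routes_local <- data.frame(
  origin = c("DFW", "DFW", "DFW", "ORD", "ORD", "ATL"),
  dest   = c("ORD", "ORD", "ATL", "DFW", "ATL", "DFW"),
  stringsAsFactors = FALSE
)
routes <- copy_to(sc, routes_local, "routes_demo", overwrite = TRUE)

# One row per origin, one column per destination, counts in the cells.
# The pivot itself runs in Spark; only the small result is collected.
crosstab <- routes %>%
  sdf_pivot(origin ~ dest) %>%
  collect()

spark_disconnect(sc)
```

Combinations that never occur in the data come back as missing values rather than zeros.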

So, I'll pause here for those of you who have been wanting this function for a while to, you know, catch your breath a little bit. All right. So, feature transformers. I actually didn't use feature transformers that much, but then I found out that there's more than meets the eye. Hopefully, you got that reference. So, the first one I want to show you is the feature binarizer.

That one will let us do something really cool, which is, if a numeric column is over a certain value, then it's going to return a value of one, and if not, it's going to be a zero, right? So, in this case, it's working with flights. Let's say that we want to tag anything that is considered a delayed flight, which is anything that's over 15 minutes. So, if I run this, you'll notice that I tell it to look at the departure delay, and if it's over 15 minutes, then create a new column called delayed, and make it one, right? So, you don't have to use ifelse, you don't have to use more complex logic to make it work in SQL or in dplyr. You can just call the feature binarizer, and you get that functionality.
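A minimal sketch of the binarizer step, assuming a local Spark installation and an invented three-row table. The input and output columns are passed by position because the argument names have differed between sparklyr releases; the column names `dep_delay` and `delayed` follow the talk.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Invented departure delays in minutes (assumption)
delays_local <- data.frame(
  flight    = c("A", "B", "C"),
  dep_delay = c(5, 15, 42),   # must be a double column
  stringsAsFactors = FALSE
)
delays <- copy_to(sc, delays_local, "delays_demo", overwrite = TRUE)

# Values strictly above the threshold become 1, everything else 0
tagged <- delays %>%
  ft_binarizer("dep_delay", "delayed", threshold = 15) %>%
  collect()

spark_disconnect(sc)
```

Note that a delay of exactly 15 minutes is not tagged, since the comparison is strictly greater than the threshold.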

All these feature transformers are meant for data preparation for analysis, right? So, we'll see in a minute how we can actually combine all these. Feature bucketizer is another really cool one. This one allows us to do basically the same as cut. So, I can send it the splits, and it takes the departure hour, excuse me, the scheduled departure, and then depending on the time range, it'll give me a number, right? So, if it's zero to 400, it's going to be zero, 400 to 800 is going to be one, 800 to 1200 is two, and so forth, right? So, this will let us create those segments inside our data that we can then use for other things like, of course, running a model. Another use that I found very handy is if you want to create a histogram: it's much faster to run a feature bucketizer than to try to create the buckets with a SQL statement, because obviously it's using a lower-level API inside Spark.
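The bucketing described above could be sketched like this, assuming a local Spark installation. The scheduled departure times are invented, and the output column name `dep_hour_bucket` is a placeholder chosen for the example.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Invented scheduled departure times in HHMM form (assumption)
sched_local <- data.frame(sched_dep_time = c(30, 615, 930, 1745, 2359))
sched <- copy_to(sc, sched_local, "sched_demo", overwrite = TRUE)

# Each value is replaced by the index of the interval it falls into:
# [0, 400) -> 0, [400, 800) -> 1, [800, 1200) -> 2, and so on
bucketed <- sched %>%
  ft_bucketizer(
    "sched_dep_time", "dep_hour_bucket",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  collect()

spark_disconnect(sc)
```

Counting rows per bucket before collecting, for example with `count(dep_hour_bucket)`, gives the data for a histogram without pulling the raw rows into R.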

Machine learning and model training

All right, so, going to the machine learning libraries here. Before we go into that, I want to show you how we can combine dplyr verbs with feature transformers and with SDF functions, and get them to all work together to run an analysis. So, a new function here is sdf_partition. I can tell Spark to create a subset for training at one percent of the data, a subset for testing at nine percent of the data, and then hold everything else in the remaining 90 percent. sdf_partition requires that the total of these adds up to one, so this is normally how I do it. I don't want to split it, you know, 40, 50, because the data's too big in this case, so I want to use just one percent to run a demo training today.

Usually you would be able to pipe it into compute here, but what happens is that sdf_partition actually creates one Spark data frame for each of the splits, so you have to specify which ones you want to cache, right? So, in this case, I cache training, and it's important for me to mention this. So now we have training, right? It's a much smaller set at five megabytes, and it's also split into eight partitions. I want to mention this because we're going to come back to it for a different function that I want to show you.
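A sketch of the split-then-cache pattern follows, under the assumption of a local Spark installation and a synthetic table. The table names are placeholders. In more recent sparklyr releases the same operation is also offered as `sdf_random_split`.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Synthetic stand-in data (assumption)
flights_local <- data.frame(
  id        = 1:10000,
  dep_delay = as.numeric(1:10000 %% 60)
)
flights <- copy_to(sc, flights_local, "flights_partition_demo", overwrite = TRUE)

# The weights add up to one: 1% training, 9% testing, 90% held back
partitions <- flights %>%
  sdf_partition(training = 0.01, testing = 0.09, rest = 0.9, seed = 100)

# Each split is its own Spark data frame, so the one to cache is
# registered and cached explicitly
sdf_register(partitions$training, "training_demo")
tbl_cache(sc, "training_demo")
training <- tbl(sc, "training_demo")

split_names <- names(partitions)

spark_disconnect(sc)
```

The split sizes are approximate, since rows are assigned to each split at random according to the weights.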

All right, so I'm going to train the model here. I'm going to use logistic regression. Delayed is the same one that we created earlier using the feature binarizer, so zero or one, just delayed or not. Departure delay is a continuous variable, and then departure hour is the one that we created with the feature bucketizer, but in this case, I used paste to add an H to the hours to make sure that it's recognized as a categorical variable instead of a continuous variable.

I'm sure there are better ways to do this, so I apologize if I didn't do it the best way, but this worked pretty well for me. Again, I just want to give you a sample of what a model would look like. I ran sdf_predict, which, as the name implies, runs the predictions, but instead of running it over the training data, we're running it over the testing data. I didn't cache this data because you saw that it runs pretty quickly, so it's okay. It all depends on how big your data is. I want to show you the top records because there are some differences in how predict works here, sdf_predict specifically. The difference is that it will return the same data set that you sent to it, but with more fields. You'll notice immediately that sparklyr has created dummy variables for us for each of the values inside departure hour, and it returns a prediction, right? So, we have a prediction of 0 or 1, and the actual value of 0 or 1, so now we can see how the model performs here.
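The train-then-predict flow might be sketched as follows. Everything about the data is an assumption: the rows are simulated, the split is 70/30 rather than the demo's 1/9/90, and a local Spark installation is assumed. The sketch also uses `ml_predict`, the prediction function in more recent sparklyr releases, in place of the `sdf_predict` call named in the talk.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Simulated flights (assumption): delay in minutes plus a categorical hour
set.seed(42)
n <- 2000
flights_local <- data.frame(
  dep_delay = round(runif(n, -10, 120)),
  dep_hour  = paste0("h", sample(0:5, n, replace = TRUE)),
  stringsAsFactors = FALSE
)
# Noisy 0/1 label so the classes are not perfectly separable
flights_local$delayed <- as.numeric(
  flights_local$dep_delay + rnorm(n, sd = 10) > 15
)
flights <- copy_to(sc, flights_local, "flights_model_demo", overwrite = TRUE)

partitions <- sdf_partition(flights, training = 0.7, testing = 0.3, seed = 100)

# Fit on the training split; the character column is treated as categorical
model <- ml_logistic_regression(
  partitions$training,
  delayed ~ dep_delay + dep_hour
)

# Score the testing split; the input rows come back with extra columns,
# including `prediction`
predictions <- ml_predict(model, partitions$testing)

# Simple confusion table, computed in Spark and collected as a few rows
confusion <- predictions %>%
  count(delayed, prediction) %>%
  collect()

spark_disconnect(sc)
```

Only the small `confusion` table is brought back into R; the model fitting and scoring both happen inside Spark.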

So, I'm just going to compare them and see how it does. There are obviously much better ways of doing this, but I guess I want to highlight the fact that you are able to run those models inside Spark without bringing your data back into memory, just the results in this case, right? So, we can see our false positives and our true negatives, you can see which ones were false positives and true positives, so they're okay, and we can iterate through the data. Of course, we have cached the training data, so if you want to test different models, it's not going to take forever to run.

Distributed R

All right, so the last feature I want to talk about today is distributed R. We've had several people and customers asking for this, the capability of being able to take an R function and run it in Spark, right? This is a big, big new thing that sparklyr has in this next version. Because it's brand new, I definitely suggest that before you start writing up solutions using it, you test it really well in your cluster before you implement it. We're constantly working on improving it, but I think it's already worth showing to you today so you can see how it works.

So, I'm going to run spark_apply, which is the command that we will use to distribute R functions in Spark, and I'm just going to use a very simple one called nrow, right? We would expect the nrow command to return one number, but it actually returns eight, and what that means is that it's taking the count from each of these RDDs in Spark and giving us the count, right? So, inside this 700 kilobyte memory chunk, it's around 18,000 records, right? So, this is okay, because we can run it over each of these data frames, but normally we don't want the data grouped by the RDDs. We want some sort of custom grouping.

So, the team has created this group_by argument that you can use to run that parallel deployment of your R code, right? Because that's usually what you want. Why do you want to use R code in Spark? Because you want it to run in parallel, you want to run the same machine learning algorithm over specific categories, the same, you know, whatever function over those categories. That's the reason why, right? So, then you have this, and now I can group it by the departure hour. And the last one, and I'm running out of time, is the fact that you can run something like a broom function, and run essentially the same GLM function over each of the origin airports, and it will return a tidy data set, but none of this ran at any point inside my R session.
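The difference between the per-partition and the grouped call can be sketched as below. This assumes a local Spark installation with R available to the Spark workers; the data is invented, and the number of partitions (and therefore the number of rows in the first result) depends on how Spark happens to split the table.

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Invented data (assumption): 10, 20 and 30 rows for three departure hours
flights_local <- data.frame(
  dep_hour  = rep(c("h0", "h1", "h2"), times = c(10, 20, 30)),
  dep_delay = as.numeric(1:60),
  stringsAsFactors = FALSE
)
flights <- copy_to(sc, flights_local, "flights_apply_demo", overwrite = TRUE)

# Without group_by, the R function runs once per Spark partition,
# so the result has one row count per partition
per_partition <- flights %>%
  spark_apply(function(df) nrow(df)) %>%
  collect()

# With group_by, the R function runs once per group instead;
# the first column of the result holds the group value
per_hour <- flights %>%
  spark_apply(function(df) nrow(df), group_by = "dep_hour") %>%
  collect()

spark_disconnect(sc)
```

The same pattern applies to a model-fitting function: the function receives the rows of one group as an ordinary R data frame and returns a data frame of results for that group.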

So, I'm very excited about this. I think it's really cool. I hope you start testing it and giving us some feedback, because this will open a lot more doors for how you can use Spark and sparklyr. Hopefully, this will be one of the tools in your tool set. You saw how we can effectively use the other pieces and other functions that make this combination of tools really exciting. Okay. So, I think this is the end of the demo. Let me go back to ...
