
Advanced Features of sparklyr | RStudio Webinar - 2017
This is a recording of an RStudio webinar. You can subscribe to receive invitations to future webinars at https://www.rstudio.com/resources/webinars/ . We try to host a couple each month with the goal of furthering the R community's understanding of R and RStudio's capabilities. We are always interested in receiving feedback, so please don't hesitate to comment or reach out with a personal message.
Transcript
This transcript was generated automatically and may contain errors.
All right, so as Anne already mentioned, this is the third one in our sparklyr series, and we are going to be covering different topics in this webinar.
First of all, we're going to start with an overview, as we always do, and we're going to close with questions. But really, we're going to cover three topics in this webinar. The first one is about scaling R with a new sparklyr function that we introduced in sparklyr 0.6 called spark_apply. Edgar already gave us an intro to this function in the first webinar; we're going to get into the details of how it works in more depth.
Second topic: we're going to briefly discuss Livy, which is a service that can run alongside Apache Spark to let us connect from our desktop environments to Livy and Spark services. And the last topic that we're going to consider today is building applications that use Spark with Shiny and sparklyr. With that, let's get started with the overview.
And as I mentioned, this is the third in the series of webinars. If you are starting with this webinar, I highly recommend that you also take the intro webinar. I'm not going to cover everything, but if this is the first webinar in the series for you, all you need to know is that Apache Spark is a cluster computing environment. It's independent from storage, so you can have your data in different data sources that Spark will enable you to compute against. And sparklyr, the topic of this webinar, is an R package, just like any other package in the CRAN ecosystem, and it basically enables connecting from R into Apache Spark.
In the second webinar, we discussed how to create extensions. We went over, first of all, how to use extensions, how to use R code to extend sparklyr with functionality that is not core to the package. We also took a look at writing Scala code to really take advantage of 100% of the functionality available in Spark, and finally, at creating R packages that can be published to the CRAN repo as well. It's definitely another interesting webinar if you're looking into topics like using H2O, GraphX, or connecting to multiple data sources.
Overview of the dataset and topics
Now for this webinar, we're going to focus mostly on Spark and sparklyr, but we're going to take a look at one dataset that is a little bit more interesting than what we used in the previous webinars. This dataset is called the Common Crawl, and it's basically an archive of web pages that this organization provides. It's a pretty big dataset, and in order to extract information out of it, what we really need to take into consideration is how to scale our computations so we can both access and process the data.
We're also going to talk about Livy as a way of connecting to Spark from your desktop computer. And last but not least, how do we build Shiny applications using sparklyr? Again, since sparklyr is an R package, you can use it with any of the other existing packages in the R community. Using Shiny is definitely a good match, but you can also consider using, for instance, sparklyr with R Markdown and other packages to really create compelling use cases while using Apache Spark with R.
Scaling R with spark_apply
So, first topic: scaling R. First of all, I want to mention the motivation for running R code directly while you're scaling your computations. You're always using R code while working with sparklyr through tools like dplyr or the MLlib wrappers that sparklyr supports, but there are cases where you might want the actual R code running at scale on your cluster.
So the first one is the example that we have at the top. What it's really getting at is that a lot of you are experts in R, and what sparklyr provides is a way of leveraging our current R skills in Spark. So for instance, suppose that we have the iris table and we want to change the table to contain some jitter. If you're familiar with R, you might also be familiar with the jitter function. Now, there's a way of doing this in Spark, I'm sure, but off the top of our heads we might not really remember what the exact function signature is, et cetera. We're just familiar with R, and we want to use the tools that we know and love.
So in this case, what we can do is take the iris Spark table that we previously copied into Spark and run spark_apply with a custom function. What is important to notice here is that, on the right side, the sapply call is applying the jitter function over each column. This is R code, right? This is code that we're familiar with, and it's R code that is executing across the cluster. So it's a very easy entry point for reusing the skills that we already know in R.
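A minimal sketch of what that call might look like, assuming an `sc` connection and an `iris_tbl` created with `copy_to(sc, iris)` (note that Spark replaces the dots in the iris column names with underscores, and the species column is dropped here so jitter only sees numeric columns):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_tbl <- copy_to(sc, iris)

# Run R's jitter() over every numeric column on each partition,
# across the cluster; the function must return a data frame
iris_jitter <- iris_tbl %>%
  select(-Species) %>%
  spark_apply(function(df) {
    data.frame(sapply(df, jitter))
  })
```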
And the second use case is basically as a way to complement Spark with R. By this, what I mean is that Spark is a rich ecosystem where you can do a lot of things, but there are still a lot of packages in the R community that provide a lot of value. For instance, at times you might want to run a linear regression over Spark, and MLlib or H2O might be just what you need. In other cases, you might want to use R's actual linear regression on a subset of the data, because it better matches the expectations that you have for the model being created.
So for that particular case, you can also use spark_apply. In this case, we're using the broom package to tidy the linear model that we're creating. But the linear model being executed is exactly the same linear model that you would run in R; it just happens to be executing over a dataset large enough to be worth splitting into smaller pieces. So we're using a group by over the species here, which means that basically we're asking Spark to split the dataset into groups by species.
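A sketch of that pattern, again assuming the `iris_tbl` Spark table from before (the exact formula is illustrative, not necessarily the one from the slide; spark_apply distributes the broom package to the workers automatically):

```r
# Fit one R linear model per group and tidy it with broom;
# group_by tells Spark how to partition the data before applying
models <- iris_tbl %>%
  spark_apply(
    function(df) broom::tidy(lm(Petal_Width ~ Petal_Length, data = df)),
    group_by = "Species"
  )
```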
Something worth noting about spark_apply is that the parameter your function receives is a data frame, and the result of the function should also be a data frame. We've definitely gotten a bunch of questions along the lines of "hey, spark_apply is not working properly," and a lot of them were mostly related to not returning the correct data type: every function that you write for spark_apply should return a data frame.
How spark_apply works
So this diagram shows what your Spark cluster probably looks like today if you're already a sparklyr user. What you're probably doing today is: you have a machine, which is usually your gateway into the cluster and might also be the master node, that has the R runtime installed, and you're already using sparklyr and several other packages from this machine. It is usually the case that RStudio Server or Shiny Server would be installed on this machine as well, so this should be a diagram that looks familiar from your day-to-day use.
So the first important thing to understand about spark_apply is that it requires a change in your cluster, and by this what I mean is that it is your responsibility, since sparklyr won't help you with this particular step, to install R on each worker node. This is usually a pretty straightforward step. In my demo, for instance, I provisioned my cluster with a script, and I basically just had to add one more line to say: hey, when you're provisioning this cluster, make sure that R is installed on each worker node.
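As a rough sketch, that extra provisioning line can be as simple as a package-manager install on each node. The exact command depends on your distribution; this is an illustrative example, not the script from the demo:

```shell
# Install R on a worker node (Debian/Ubuntu shown; use yum on RHEL/Amazon Linux)
sudo apt-get update
sudo apt-get install -y r-base
```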
What is important to notice is that if you have a cluster that looks more like this, without R on the workers, you won't be able to use spark_apply; you will still be able to use all the other sparklyr functionality that you already know and are familiar with. But if you want to use spark_apply, you need to get R installed on each worker node, and how you do this really depends on your cluster: different clusters require different ways of installing R on each node. Some service providers, for instance Databricks, already support execution of R code across their clusters, so there you wouldn't have to worry about this.
So the next step is: okay, what happens when we actually run spark_apply? You would run spark_apply from RStudio, and in this case we're running the same function that we looked at, which is just using jitter to add some noise to our dataset. The first time that a worker node needs to make use of this distributed functionality, what sparklyr and Spark are going to do is deploy all the packages that you're currently using to each worker node.
So just to be clear here: you don't need to ask your system administrators to manage packages while you're using spark_apply. spark_apply will do that for you while you're executing this, in this case, pretty small closure; it will basically deploy all those packages to each worker node for you. It does take a little bit of time, as we'll see in an example, and it depends on your cluster, but it's not bad, and you basically get all the rich functionality that you would expect to use in R on each of the worker nodes.
And then, last but not least, your code is going to execute on each worker node. In this case, again, it's the jitter function. Each worker node is going to get this code along with a subset of the dataset. The way Spark works is based on RDD partitions, and a partition is basically a set of rows from your dataset. So whenever a node gets a partition, we apply the function that you give us through sparklyr on that node, and then Spark collects the results and puts them all back together for you.
Real world example: Common Crawl dataset
The next part that I want to touch on is a real-world example of using spark_apply with a decent-sized dataset. I mentioned at the beginning of this webinar that the Common Crawl is a very interesting project because what it basically does is download a lot of web pages into a common data store. You could potentially do this same exercise with a web scraper, but it would take a really long time, because you would need to do the work of actually retrieving each web page yourself, and that takes even more resources and adds much more complexity. So instead, what we can do is use this dataset, which is called the Common Crawl.
For this particular demo, I wrote a library called sparkwarc that basically reads these files called WARC files, which is the file format stored in this project. To my knowledge, there's no parser of WARC files for Spark, and the format itself doesn't map well to existing file formats; it's not like we can use a CSV reader or anything of the sort. I'm sure you could hack your way into using existing tools as well, but it was an interesting dataset, and also an interesting file format for doing our own parsing, because there are always going to be edge cases where you have to write your own extensions in order to process data that is domain-specific.
So this library, sparkwarc, uses two interesting features. One is spark_apply, and the other is that we can use any package; the package that I chose to parse these files is Rcpp. It's interesting because Rcpp brings a lot of computational power and is very efficient. You can see the code here. We won't get into the details; you can take a look at the GitHub repo if you're interested. But it's not that much, it's less than 100 lines of Rcpp code.
So I'm going to switch to this cluster. This is a cluster that I literally provisioned a couple of hours ago; I saved us a little bit of time by starting it before this webinar. You can see that there are about 50 nodes in this cluster, so it definitely gives us a good sense of how this works at scale.
And the way that I set up this cluster was basically following the diagram that we saw a couple of slides ago, where you have RStudio installed on the master node, and that's where I'm going to do all my day-to-day work as I'm analyzing data.
So the first interesting thing to do here is to connect. I already ran these lines for us. Basically, the interesting thing here is that since we're using YARN client mode, we want to connect with the master set to yarn-client. Sometimes for demos we use master equals local because it helps us get up and running faster, but in this case we definitely wanted to take advantage of the cluster.
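A sketch of that connection code (the exact configuration from the demo isn't shown, so treat this as illustrative):

```r
library(sparklyr)
library(dplyr)

# Connect through YARN in client mode to use the whole cluster;
# master = "local" would run everything on this single machine instead
sc <- spark_connect(master = "yarn-client")
```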
The next interesting thing is that we're going to see the result of executing a very simple function across the cluster. As I mentioned, this cluster has 50 nodes, and each node has eight CPUs, so I wanted to take full advantage of those. And I executed this line.
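The exact line isn't shown in the transcript, but a trivial distributed call along these lines would exercise every core (sdf_len builds a Spark DataFrame of the given length; the repartition count of 400 is an assumption to match the 50 nodes times 8 CPUs):

```r
# Count rows on each of 400 partitions, one task per core
sdf_len(sc, 1e8, repartition = 400) %>%
  spark_apply(function(df) nrow(df))
```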
If we go to the Spark UI, first of all, it's worth noting that you should see all your worker nodes in this cluster. And then the interesting thing here is that the first spark_apply call that we ran, this one, number zero, actually took 1.3 minutes, even though the function is pretty simple: it's just counting the rows. This is the one-time cost that I was mentioning spark_apply requires as you start processing data in your cluster.
Basically, what you see here when we zoom in is that a lot of the time for this particular job was spent on task deserialization. This is basically Spark copying all your package dependencies in a distributed way across all the nodes. And you can see that, for the most part, this is a parallel process and it runs pretty fast for the number of machines that we have.
If we run this one more time on the cluster, you can see that it's pretty fast; it probably took about two seconds. Since the cluster has already been initialized, you can make full use of it without any overhead anymore.
All right, so the next call that I ran is basically reading data from the Common Crawl project. In this case, since we have 50 nodes times eight CPUs, we have 400 cores, so we were able to read 400 files from this project in parallel. Each file is about three gigabytes, so we read about 1.2 terabytes of data. You can see the result of this job here: it took about 6.6 minutes to read those 1.2 terabytes. It scales very nicely with the number of nodes, so if you wanted to actually read the full dataset, which I believe is about 72 terabytes, you could throw more machines at it and be able to process this data.
Now that we have our dataset loaded, what can we do? Well, one of the things that we can do is count the number of tags. The sparkwarc library gives us these parsing statistics by default, and we can see that across all the web pages that we processed in these 400 files, we found about 36 billion tags. If we count the number of entries we found, which roughly matches the number of web pages that were processed, we see about 10 million web pages.
One thing to note is that what I'm interested in extracting from this dataset is the keywords tag from HTML. For those of you who do web development, this might look very familiar. For those who are not familiar with web development: each web page on the web, well, it's not required, but a lot of them have a tag that helps search engines and other consumers of the page understand what it is about. So you're going to see this tag on a lot of web pages. It basically says, hey, this is a meta tag, it contains keywords, and inside it there's a comma-separated list of keyword values that describes the web page that you're browsing.
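For reference, the tag in question looks something like this (the content values here are made up):

```html
<meta name="keywords" content="math, geometry, algebra, worksheets">
```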
So what we want to do is extract this tag and split it by commas, such that we get a nice, hopefully tidy, table of this data. And it has already run; you can see here that it wasn't that bad performance-wise. We can take a quick look at, first of all, how many keywords we have in total. We had about 10 million pages, and you can see that we have about 85 million keywords, so it seems there are roughly eight keywords per web page on average, something along those lines.
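A hypothetical sketch of that transform in sparklyr, assuming a `pages_tbl` with the raw keywords attribute in a `keywords` column and a `page` identifier (both names are assumptions); `split`, `explode`, and `trim` are Hive SQL functions that sparklyr passes through to Spark:

```r
keywords_tbl <- pages_tbl %>%
  mutate(keyword = explode(split(keywords, ","))) %>%  # one row per keyword
  mutate(keyword = trim(keyword)) %>%                  # strip surrounding spaces
  select(page, keyword)
```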
What I want to mention is that usually, when you get to the point where you have processed a significant amount of information that you might want to reuse later, or share with colleagues and other tools, you want to write it out as a Parquet file. We're going to use this later in the webinar. Basically, what I'm doing here is taking this table of keywords that we found and also writing it to disk across the cluster as a keywords table.
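The write itself is a one-liner; the path below is a placeholder, not the one from the demo:

```r
# Persist the processed keywords so a smaller cluster (or a Shiny app)
# can load them later without redoing the computation
spark_write_parquet(keywords_tbl, "keywords.parquet")
```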
We can take a look at some of the keywords that we found. For instance, we can look at all the web pages that contain the keyword math. What we're doing in this case is getting all the pages that have that keyword, and then getting the keywords related to the next two pages that contain it. Here are some of the related ones. This set is not ordered, but there are definitely some more popular ones here, like geometry and math; you would see those two keywords more correlated than others.
Connecting to Spark with Livy
We'll definitely use this dataset again, but first I want to talk a bit about Livy. So what is Livy? Livy is a service that enables you to connect to Apache Spark remotely. In this cluster, we've been using RStudio directly on the cluster, and that means we need to connect into the cluster or make the cluster available. You can obviously have great authentication through RStudio into your cluster, which is a great way of setting things up. In other cases, though, you might be limited, or you might have a need to use Spark from your desktop machine.
The way you would do this is a two-step process. First, you would have to ask your system administrator to provide Livy in your cluster. Some clusters, especially those from cloud providers, already support Livy; I know that at least Microsoft Azure supports Livy by default in their clusters, but there might be others that provide it by default as well. And if not, it can always be installed: it's a couple of lines, basically a download of a zip file, then configuration and starting up the service.
Once you have that up and running, connecting is pretty straightforward. I highly recommend that you set up authentication with Livy, especially since it runs over HTTP, which means it's basically a website that anyone with access to this cluster would be able to reach. So it's highly recommended that as part of setting up Livy, you also set up authentication. But once you have that, you can basically just run spark_connect with the address of Livy.
So in this case, we're not going to run through the whole connection, but I do want to show you how this would look if you're using the latest version of sparklyr, which I also highly recommend. As long as you're using the RStudio 1.1 preview, which you can download from our preview site, you can basically set the address to your Livy service. The port is not going to be 8787; that's the port for RStudio. I believe the default is 8998 for Livy, but that's something your system administrator can give you. Same for username and password. And then you can connect: this is a desktop version of RStudio, and you get a connection to that remote cluster as well.
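A hedged sketch of what such a connection might look like in code (host, port, and credentials are placeholders; 8998 is Livy's usual default):

```r
library(sparklyr)

# Connect to a remote cluster through Livy from desktop RStudio
sc <- spark_connect(
  master = "http://livy-server:8998",
  method = "livy",
  config = livy_config(
    username = "analyst",  # hypothetical credentials
    password = rstudioapi::askForPassword("Livy password")
  )
)
```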
Building Shiny applications with SparklyR
What we're looking at in this diagram is one of the functions you're probably familiar with in sparklyr, called tbl_cache. What this function does is basically load your table, persisted in storage, into Spark's memory. Usually this operation takes a long time: when we were working with the Common Crawl dataset, if we were to run tbl_cache on the original dataset, it would basically have to read everything, and that would take some time.
So underneath the covers, what happens when we run tbl_cache is that sparklyr connects to Apache Spark through a socket, and this operation is synchronous, meaning it has to wait for things to complete before continuing. So we run tbl_cache, the call triggers the caching functionality in Apache Spark, and then, while the operation is running to completion, our R process is going to block. This is fine while you're working in a single-user environment: you don't want to cache a table and then work with something that hasn't finished caching.
So in order to fix this behavior, the easiest way to go is to basically assign one R process to each user. This is functionality that is very easy to enable in Shiny Server Pro, so it's definitely worth considering while working with sparklyr applications.
Once you've enabled one R process per user, each user is going to have their own R process, and when a second user comes in and triggers the Spark operation, they won't have to wait on resources, except when the Spark cluster itself is busy. It's definitely worth knowing about this. It's not something you will necessarily experience while building your Shiny application, but it's definitely something you want to consider, especially when you deploy your application.
Now, taking a look at a quick example: transitioning an application from plain Shiny to using Spark is actually pretty straightforward. This, I believe, is the hello world application that you get by default when you install Shiny. There are a couple of things that you want to change. In the global setup, you want to make sure that you load sparklyr and connect to your cluster; this is a simple example, so we're connecting to local and then copying the faithful table into Spark. That's the setup part. The other line that I had to change to make this work with sparklyr was line 27, where we filter this dataset: instead of filtering a data frame, in this case we're using dplyr, pulling the waiting time. And the rest of the application is exactly the same.
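A minimal sketch of the adapted app, assuming the standard faithful histogram example (the demo's exact code isn't shown, so treat the details as illustrative):

```r
library(shiny)
library(sparklyr)
library(dplyr)

# Global setup: connect and copy the data into Spark once
sc <- spark_connect(master = "local")
faithful_tbl <- copy_to(sc, faithful, overwrite = TRUE)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(sliderInput("bins", "Number of bins:", 1, 50, 30)),
    mainPanel(plotOutput("distPlot"))
  )
)

server <- function(input, output) {
  output$distPlot <- renderPlot({
    # The one changed line: pull the waiting times from Spark via dplyr
    waiting <- faithful_tbl %>% pull(waiting)
    bins <- seq(min(waiting), max(waiting), length.out = input$bins + 1)
    hist(waiting, breaks = bins, col = "darkgray", border = "white")
  })
}

shinyApp(ui, server)
```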
So I have a very simple text box somewhere in here. Oh, not that one. This is a text box, and we have a plot. For plotting the related keywords, we're using the wordcloud package, and then on the server is where I'm almost copy-pasting the dplyr query that we used to explore some of the keywords. So let's take a look at what we're doing.
We're getting the keywords from the text box and splitting them; this really has nothing to do with sparklyr or Spark. But once we have this list of keywords, we're going to run the operation with a progress indicator, saying that we're executing something in Spark, and we're going to perform a similar query where we basically get all the web pages that contain that keyword. Then we join them with the same table to get the full list of keywords, group by keyword, and count them. Then we filter: well, we just remove the keyword that we searched for. Then we arrange the keywords top to bottom to see which are the most common ones, take the top thousand, and collect the result.
So we're only bringing one thousand rows of data back to R, and this is important: you don't want to retrieve a significant amount of data back to your main node. Right now we're running on 50 nodes, so it's definitely not feasible to bring all the data back. We need to be conscious, when we plot or explore results, to only retrieve the data that we're interested in and that fits into the current R process. And the last step is to run wordcloud over the data that we retrieved; wordcloud takes the keywords and the counts. So that's pretty much it.
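Putting the steps just described together, a hypothetical sketch of the server-side query, assuming the keywords Spark table has `page` and `keyword` columns (both names are assumptions):

```r
related_keywords <- function(keywords_tbl, term, top_n = 1000) {
  # Pages containing the searched keyword
  matching_pages <- keywords_tbl %>%
    filter(keyword == term) %>%
    select(page)

  keywords_tbl %>%
    inner_join(matching_pages, by = "page") %>%  # all keywords on those pages
    filter(keyword != term) %>%                  # drop the search term itself
    group_by(keyword) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    head(top_n) %>%
    collect()                                    # only top_n rows come back to R
}
```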
All right, so this is our Shiny application, and behind it, it's running on Spark. We can see here that the last job was job 27. So let's just start with something; let's just search for...
Statistics. Basically, what's happening under the covers is that we should see a job running here, which has already completed as job 28. And what we get back is basically all the keywords associated with statistics that we found in this dataset. Zooming in a little bit: probably the things that you would expect, maybe not, I don't know: probability, answers, help, algebra.
And there are definitely some more. Just for fun, if we were to use stats instead of statistics, the keywords we find on the web might be pretty different. For instance, in this case, NBA is one that seems to be closely correlated with stats, along with fantasy, soccer, player, et cetera. And we can take a look at other keywords.
Superheroes, why not? We get a good sense of the keywords associated with them: books, resources, Batman associated with Superman, Avengers, Wonder, Da Vinci. And the cool thing to notice here is that we were able to get up and running with literally one hour of compute time. I mean, I set up the cluster a little bit earlier, but basically you can run this with one hour of compute time on 50 nodes. We processed about 1.2 terabytes in this case, but you can also scale it up. The cost is not astronomical at all; it's less than the cost of going out for lunch in a lot of places.
Wrap-up and resources
Let's get back and close up here. We've covered Shiny and sparklyr, so we have the basics of how to build these types of applications. Interesting resources for this webinar: we've definitely been recommending going to spark.rstudio.com, where you'll find more information about spark_apply. shiny.rstudio.com is a website you're probably already familiar with; that's where you can download Shiny, configure it, and find the administration guide. For all Apache Livy related questions, Livy is currently under Apache as an incubation project, so you can go to livy.incubator.apache.org and read more about it and how to configure it appropriately. And obviously, if you find any issues with any of these topics, or you find yourself with a question that hasn't been answered on Stack Overflow by the community, you can find us on GitHub under rstudio/sparklyr.
So that's what we have. We have most of the sparklyr team in this session, so if you have questions related to this webinar, or even ones that are not, if time allows, we'll try to address as many as possible. And if not, definitely reach out to us on GitHub and we'll follow up.
Q&A
I see that Javier is using lm. Does this mean we can use normal R machine learning packages with spark_apply, instead of just relying on MLlib?
So here the quick answer is yes and no. There's no automatic way of parallelizing an algorithm that was created to run on a single computer so that it automatically becomes parallelizable across multiple computers. So even though you can run machine learning on each machine, it doesn't necessarily mean that you can run an already-written machine learning algorithm at scale. Now, there are interesting ways of doing this. For instance, in the example that we looked at, a lot of times you don't need to run a linear model across your entire dataset; you might instead run a linear model across many medium-sized datasets.
Right. For instance, if you were doing predictions per city for your company, you could partition your data. Say you have a thousand cities: you could partition your data by city and then run any machine learning that you're already familiar with in R, per city, at scale across your cluster. That would run really fast because it's running across your cluster. However, if you had to run this prediction across all your cities together, and that's a dataset that doesn't fit on one machine, then the answer is no.
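That per-city pattern is exactly the group_by form of spark_apply; a hypothetical sketch (the table and column names here are made up for illustration):

```r
# One independent R model per city, fitted in parallel across the cluster
city_models <- sales_tbl %>%
  spark_apply(
    function(df) {
      fit <- lm(revenue ~ month, data = df)
      data.frame(slope = coef(fit)[["month"]])
    },
    group_by = "city"
  )
```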
There's an interesting useR talk that you can read about, called, I believe, "Software Alchemy", where one of our users explores aggregating independently computed machine learning models, especially for models that have the property that you can sort of average the results. I believe in his talk he mentions that in some cases you can do this with linear regression, where you could potentially run linear regression across multiple partitions and then aggregate them. But this is a little bit more advanced. I think in general the answer is no: you won't be able to automatically run these models at scale unless you partition the data, which does happen in a good number of cases, but not in all.
Have you guys tested the performance against any other packages that allow connecting to Spark? Well, this is a pretty broad question, but yes, we've done this in different areas. We haven't done an exhaustive performance test across every single function, because that would be too time-consuming. But for instance, comparing sparklyr with SparkR on connection times, we're on par. When I ran this dataset, I also ran the same parsing of the WARC files using pure Scala code, and it was taking roughly the same time. So we take it on a case-by-case basis; we don't have comprehensive performance benchmarks across everything. But definitely, when something is interesting, or when someone raises this question for a particular function, we try to take a look and do due diligence to be as fast as any other package, and also any other technology like Scala or Python, while working with Spark.
Does anything exist, or are there plans to have, a helper that will spin up a Spark EMR cluster for us given our AWS credentials? I would really like that, but honestly it's not that bad today, and we don't have any particular plans to do this. What I would recommend is to search for that particular question, "sparklyr and EMR." There was a blog post from the Amazon AWS team, and the script that I ran is basically their script, called rstudio_sparklyr_emr5.sh. When you provision the cluster in your AWS EMR console, you can add a bootstrap action. In this case I'm using my own copy of the script, because I modified it to use the latest version of RStudio by default, but theirs actually runs pretty well. So, to answer your question: no, we don't have any plans that I am aware of. I think it would be nice, but the existing solution is not that bad. It's honestly copy-paste and you get your cluster up and running.
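For reference, the bootstrap-action approach can also be scripted from the AWS CLI rather than the EMR console. This is a hypothetical sketch: the S3 path, bucket, key pair, and instance sizing are all placeholders you would substitute with your own values and with the bootstrap script from the AWS blog post:

```shell
# Provision an EMR cluster with Spark plus a bootstrap action that
# installs RStudio and sparklyr on the master node.
# s3://my-bucket/... and my-key-pair are placeholders.
aws emr create-cluster \
  --name "sparklyr-cluster" \
  --release-label emr-5.8.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --bootstrap-actions Path=s3://my-bucket/rstudio_sparklyr_emr5.sh
```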
Can you elaborate on why we should use Parquet? Oh, I totally forgot to mention that. Yes. In this demo I actually didn't end up reading back from Parquet. What I wanted to say during the demo is that a lot of times, the cluster that you use for computing your initial understanding of your data might be much bigger than the cluster you need to serve that data in a Spark-backed Shiny application. So what I wanted to mention there is that it often makes a lot of sense to use a pretty big cluster for the initial computation, then scale the cluster back down to just the data that you actually need for running your Shiny application, and load that back from Parquet.
Sorry about that. I feel it wasn't very clear why we were saving it as Parquet. But it definitely makes sense, especially if you're going to be disconnecting from the cluster and switching to an application that runs 24/7. You don't need the full cluster where you did your initial analysis, and saving to Parquet and then loading it back from Parquet are the steps that I would recommend following for your Spark application.
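The save-then-reload workflow described above might look like the following sketch. The `nycflights13` data, the HDFS path, and the aggregation are assumptions for illustration; `spark_write_parquet()` and `spark_read_parquet()` are the sparklyr functions involved:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# On the big analysis cluster: compute the summarized result
# and persist it as Parquet (path is a placeholder)
flights_summary <- copy_to(sc, nycflights13::flights) %>%
  group_by(origin, dest) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE))

spark_write_parquet(flights_summary, "hdfs:///data/flights_summary")

# Later, on a much smaller cluster backing the Shiny app:
# read only the summarized data, not the raw flights
summary_tbl <- spark_read_parquet(
  sc,
  name = "flights_summary",
  path = "hdfs:///data/flights_summary"
)
```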
So is Parquet better than caching? Well, it depends. And also, is there a quicker way to cache using sparklyr? The fastest way to cache is to put the data in memory, and that's still desirable in a lot of cases. Even when we write to Parquet, you would want to load it back into memory to get fast response times with Shiny applications. But no, in general Parquet is not better than caching, except for cases where you don't have enough memory. If you don't have enough memory to cache all your data, saving your results as a Parquet file also makes a lot of sense, because it saves you the computation.
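On the "quicker way to cache" part of the question, sparklyr exposes Spark's in-memory caching directly. A minimal sketch, with `mtcars` standing in for real data:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
copy_to(sc, mtcars, name = "mtcars", overwrite = TRUE)

# Force the registered table into executor memory so that
# repeated queries against it are fast
tbl_cache(sc, "mtcars")

# Or materialize an intermediate result as its own cached
# temporary table in one step with dplyr's compute()
big_engines <- tbl(sc, "mtcars") %>%
  filter(cyl > 4) %>%
  compute("mtcars_big_engines")
```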
But in this particular example, the point was that if you're building this application to share with your colleagues, for this particular case you don't need to recompute everything all the time. You want to save the result as a Parquet file, then load it into memory each time the Shiny server starts the application for the first time, and then you're good to go. All right, well, thank you so much. I believe that was the last question. Thanks again to everyone who attended.
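Putting the pieces together, a minimal Shiny app following that pattern might look like this sketch. The Parquet path, table name, columns, and origin choices are all assumptions; the key point is that the connect-and-load happens once at app startup, outside the server function, with `memory = TRUE` caching the data for responsive queries:

```r
library(shiny)
library(sparklyr)
library(dplyr)

# Runs once, when the Shiny server starts the app: connect to a
# small cluster and load the precomputed Parquet data into memory
sc <- spark_connect(master = "local")
flights_tbl <- spark_read_parquet(
  sc,
  name   = "flights_summary",
  path   = "hdfs:///data/flights_summary",
  memory = TRUE  # cache so queries from the UI are responsive
)

ui <- fluidPage(
  selectInput("origin", "Origin", choices = c("EWR", "JFK", "LGA")),
  tableOutput("delays")
)

server <- function(input, output, session) {
  output$delays <- renderTable({
    flights_tbl %>%
      filter(origin == !!input$origin) %>%
      collect()
  })
}

shinyApp(ui, server)
```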
