
Analyze and explore data stored in Snowflake using R
James Blair, Senior Product Manager, Cloud Integrations at Posit, demonstrates using the R language to analyze and explore data stored in Snowflake. He also shows how easy it is to set up an R environment inside Posit Workbench that runs as a native app on Snowpark Container Services. You'll also find out how the dbplyr interface can be used to push computation into Snowflake, giving you access to greater memory and compute power than in a standard R session. It's easy to get started. In just a few minutes, you can work in your R session securely inside Snowflake using the RStudio Pro IDE in Posit Workbench. Posit also supports VS Code and Jupyter for data scientists who prefer to work in other languages like Python, so you can continue to use the tools you know and love. Learn more about the Snowflake and Posit integration: https://posit.co/solutions/snowflake/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
Hi, my name is James Blair and I'm a Senior Product Manager at Posit over Cloud Integrations. Today we're going to walk through a demo of using the R language to analyze and explore data that's stored in Snowflake. We will do this by setting up an R environment inside of Posit Workbench that's running as a native app on Snowpark Container Services.
We will also explore how the dbplyr interface from R can be used to push computation into Snowflake so that you have access to greater memory and compute resources than you typically would have in a standard R session. Let's get started.
Accessing data from the Snowflake Marketplace
The first thing we need to do is gain access to our data. Luckily, Snowflake makes this very easy through the Marketplace. If we go to Data Products inside of Snowflake and then the Marketplace, we can search for different providers. In our case, we're going to look for Cybersyn.
Cybersyn offers a number of different data assets on the Snowflake Marketplace, but we're going to scroll down and we're going to look for the Web Traffic dataset.
Here we can see the Web Traffic Foundation (Experimental) dataset; select it. I've already subscribed to this dataset, so I have the option to open it, but if this is the first time you've gone through this process, you would select the option to get this dataset. Now that this dataset is configured and we have access to it through Snowflake, we need to open up Posit Workbench.
Setting up Posit Workbench on Snowpark Container Services
Posit Workbench within Snowpark Container Services can be set up and configured by a Snowflake administrator. Once that service has been configured, the administrator can share a URL with you that gives you access to the service.
Here we'll open Posit Workbench and notice that this is running inside of Snowflake. The first thing that we need to do here is create a new session. I'll click on the New Session button, choose RStudio Pro, and then you'll notice that I have an option to sign in to my current Snowflake account. I'll click Sign In. This will bring me to a page where I can enter my Snowflake credentials and then verify that this is signing into the account with the role that I want it to use. Once I click Allow, I'll be notified that I can close this window and come back to my session for RStudio. Here I can see that I'm now signed in to this Snowflake account and I can click Start Session.
Once our RStudio session is opened, we need to gain access to the code that we're going to walk through. This code is made available in a GitHub repository. I'm going to visit the repository, copy the HTTPS clone URL, and then come back into RStudio. Here inside of RStudio, I can now create a new project using version control with Git, and paste in the URL of the repository. We'll give this a name, I'll choose the R version that I want to use, and we'll create our project.
Connecting to Snowflake with dbplyr
Now that our project is open, we have a Quarto document that we can open that contains code that allows us to connect to and work with the data contained in Snowflake. The first thing we're going to do is load some of the R packages we need into our R session.
Once our packages are loaded, we can connect directly to Snowflake using the dbconnect function from the dbi package. Notice that in order to create this connection, we didn't have to supply any additional credentials, details, or authentication information because when we signed in to Posit Workbench, we signed directly into our Snowflake warehouse. This allows Posit Workbench to supply credentials for us when we make these connections into Snowflake. This is both easier for the developer and more secure when it comes to connecting to Snowflake data.
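A minimal sketch of what that connection can look like is below. The `odbc::snowflake()` helper is available in recent versions of the odbc package; the warehouse and database names here are illustrative placeholders, not values from the demo.

```r
library(DBI)

# Connect to Snowflake. Because we signed in to Snowflake when starting
# this Workbench session, Posit Workbench supplies the credentials;
# no username, password, or token appears in the code.
# Warehouse and database names are illustrative placeholders.
con <- dbConnect(
  odbc::snowflake(),
  warehouse = "DEFAULT_WH",
  database  = "WEB_TRAFFIC_FOUNDATION_EXPERIMENTAL"
)
```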
Now we can create references to two tables contained within the Cybersyn dataset that we gained access to earlier. With these two objects, timeseries and attributes, in hand, we'll do some light data cleaning and combine the two into a single timeseries object.
One of the things that's unique about the way that we've connected to Snowflake here is that this time series object does not represent data that's stored in local memory inside of our RStudio session, but instead it represents a SQL query that needs to be executed against Snowflake in order to create the view or the object that we're referencing here.
To show an example of this, if we use the show_query() function against timeseries, we can see the SQL statement that will be executed in order to create this object within Snowflake.
Analyzing and visualizing web traffic data
Now let's take a look at some of the top web domains from this web traffic dataset that we've collected. If we filter our data to include only the domains that we've listed, we can now create a final view of this data that will allow us to visualize and analyze web traffic to these select domains. Here's a view of what that final data will look like. Notice that in this final version of the data, we have each domain, a date, the number of page views, users, and sessions. All of this was done while keeping all the data inside of Snowflake and using the dbplyr interface from R to analyze and work with that data.
Now that we have our data in this final form, we can create some compelling visualizations. For example, comparing web traffic to Airbnb and VRBO. We can do this using ggplot, which will result in a graphic that shows the comparison between VRBO traffic and Airbnb traffic since 2021.
We can also use the GT package to create a summary table that contains some summary statistics about the different domains that we have access to. Here's an example that shows each domain and the number of page views, users, and sessions with a highlight on Airbnb and VRBO.
Finally, let's do some marketing analysis. There are two main ratios that we might be interested in: content engagement, the number of page views per user, and repeat website usage, the number of sessions per user. We can calculate these ratios for all the domains we have by running the following code. We can then take this final table that we've created, called marketing, and visualize it using ggplot. The resulting graphic shows marketing ratios over time for each of the select domains we've chosen from our web traffic dataset.
Interactive exploration with a Shiny app
Before we conclude, let's take a look at one more thing. If we go back to the files pane inside the RStudio IDE, there's a folder called app. Inside this folder is an app.R file containing a Shiny application for real-time, interactive exploration of this dataset.
This application uses many of the same principles that we just outlined in the Quarto document. If we select Run App here, this will create an interactive application running on top of data inside of Snowflake that allows us to explore and investigate comparisons across different web traffic domains. Here we're comparing Facebook and Google, but we could change this to Facebook and Instagram, and the plot and the page update as a result of the selections made here, all backed by queries that are submitted in real time to our Snowflake data warehouse.
As we've seen today, R is a powerful language for interactive data analysis and exploration. We've demonstrated how you can use RStudio within Posit Workbench running on top of Snowpark Container Services in order to securely and quickly access data inside of a Snowflake data warehouse. This interaction speeds up the ability of data scientists to arrive at meaningful business insights using data that's securely governed within Snowflake.
We focused today on using the RStudio IDE within Posit Workbench, but as we saw earlier, Posit Workbench also supports other environments like VS Code and Jupyter-based environments for data scientists who may prefer to work within languages like Python. We hope you've enjoyed the demo today, and thank you for watching.
