bigdataclass

This is a two-day workshop that teaches R users how to connect to and analyze data held in external systems such as databases (SQL Server, Oracle, PostgreSQL) and Big Data clusters (Hadoop, Hive, Spark). It focuses on using the dplyr, DBI, odbc, and sparklyr packages to translate R code into SQL queries and to run models directly in Spark.
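As a rough illustration of the R-to-SQL translation described above, the sketch below builds a dplyr pipeline against a database table and prints the SQL generated for it. It is not taken from the workshop materials. It assumes the RSQLite and dbplyr packages are installed, and it uses an in-memory SQLite database with the built-in mtcars data as a stand-in for a real server; with odbc, only the dbConnect() call would differ.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database stands in for a real database server.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# tbl() creates a reference to the remote table; no rows are pulled into R.
cars_db <- tbl(con, "mtcars")

# Ordinary dplyr verbs build a query instead of operating on local data.
avg_mpg <- cars_db %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE))

# Print the SQL that was generated for the pipeline.
show_query(avg_mpg)

# Keep the generated SQL as a plain string for inspection.
query_sql <- as.character(sql_render(avg_mpg))

dbDisconnect(con)
```

The exact SQL text depends on the database backend, so the printed query will look different for SQL Server, Oracle, or PostgreSQL.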

The workshop covers practical topics including connection configuration, security practices, and deployment strategies using RStudio’s professional tools. It addresses the challenge of working with data that’s too large or impractical to load directly into R by enabling analysis where the data lives. Students learn to use familiar dplyr syntax while the underlying operations execute efficiently in databases or distributed computing environments.
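The idea of analyzing data where it lives can be sketched as follows: the aggregation runs inside the database, and only the small summary is brought into R with collect(). This is a minimal example rather than workshop code. The in-memory SQLite database (via the RSQLite package) and the mtcars table are assumptions standing in for a large remote table.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Assumption: an in-memory SQLite database plays the role of the remote system.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

# The pipeline is lazy: it describes the work but does not fetch any rows yet.
summary_db <- tbl(con, "mtcars") %>%
  group_by(cyl) %>%
  summarise(n = n(), avg_mpg = mean(mpg, na.rm = TRUE))

# collect() runs the query in the database and returns only the
# aggregated rows (one per cylinder group) as a local data frame.
summary_local <- collect(summary_db)

dbDisconnect(con)

print(summary_local)
```

Keeping collect() at the end of the pipeline means that only the aggregated result crosses the network, not the underlying table.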

Contributors

Edgar Ruiz

Christophe Dervieux