ZJ | Easy larger-than-RAM data manipulation with {disk.frame} | RStudio

Learn how to handle 100GBs of data with ease using {disk.frame} - the larger-than-RAM-data manipulation package. R loads data in its entirety into RAM. However, RAM is a precious resource and often runs out. That's why most R users will have run into the "cannot allocate vector of size xxB" error at some point. However, the need to handle larger-than-RAM data doesn't go away just because RAM isn't large enough, so many useRs turn to big data tools like Spark for the task. In this talk, I will make the case that {disk.frame} is sufficient and often preferable for manipulating larger-than-RAM data that fits on disk. I will show how you can apply familiar {dplyr} verbs to manipulate larger-than-RAM data with {disk.frame}.

About ZJ: ZJ is a machine learning developer based in Melbourne, Australia. He regularly contributes to open source projects. He had more than 10 years of experience in banking before joining the tech sector. In his free time, he enjoys playing Go/Baduk/Weiqi.

Transcript

This transcript was generated automatically and may contain errors.

Hi, I am ZJ. I'm a data scientist based in Melbourne. I want to talk about this package called disk.frame. Have you ever encountered this message? "Cannot allocate vector of size" whatever. So what is the issue here? Well, the issue is that R tries to load the whole data set into RAM, and sometimes it doesn't fit. If your data is quite large, how do you deal with that? In this talk, I'll talk about disk.frame, but Apache Spark, via, for example, sparklyr, is also possible.

How disk.frame works

So how does disk.frame work? Well, disk.frame is very simple. It's just a folder with many fst files. If you're not familiar with fst files, see this website. Each file is called a chunk. And basically, if your data set is too large, I don't have to load everything into memory. I just break that whole data set into smaller chunks and load it chunk by chunk. Now, once I can load the data chunk by chunk, I can do many things. For example, I can operate on the chunks in parallel.
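
The chunking idea described here can be sketched in base R. This is only a conceptual illustration under stated assumptions, not how disk.frame is implemented: it stores chunks as RDS files instead of fst files, and the names `write_chunks`, `chunk_apply`, and `chunk_dir` are hypothetical helpers made up for this example.

```r
# Conceptual sketch in base R only. disk.frame itself stores chunks as fst
# files; RDS is a stand-in here. All helper names are hypothetical.

# Split a data frame into n_chunks files inside a single folder.
write_chunks <- function(df, chunk_dir, n_chunks) {
  dir.create(chunk_dir, recursive = TRUE, showWarnings = FALSE)
  chunk_id <- rep_len(seq_len(n_chunks), nrow(df))
  pieces <- split(df, chunk_id)
  for (i in seq_along(pieces)) {
    saveRDS(pieces[[i]], file.path(chunk_dir, paste0(i, ".rds")))
  }
  invisible(chunk_dir)
}

# Load one chunk at a time, apply fn to it, and keep only the small results.
chunk_apply <- function(chunk_dir, fn) {
  files <- list.files(chunk_dir, pattern = "\\.rds$", full.names = TRUE)
  lapply(files, function(f) fn(readRDS(f)))
}

chunk_dir <- file.path(tempdir(), "example_chunks")
unlink(chunk_dir, recursive = TRUE)  # start from an empty folder
write_chunks(mtcars, chunk_dir, n_chunks = 4)

# Compute per-chunk partial sums, then combine them into an overall mean.
# Only one chunk needs to be in memory at any time.
partials <- chunk_apply(chunk_dir, function(chunk) {
  c(total = sum(chunk$mpg), n = nrow(chunk))
})
combined <- Reduce(`+`, partials)
mean_mpg <- combined[["total"]] / combined[["n"]]
```

Because each chunk is processed independently, the `lapply` call inside `chunk_apply` could be swapped for a parallel equivalent such as `parallel::mclapply` on Unix-alikes, which is the kind of chunk-level parallelism mentioned above.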

Featured software

dplyr

rstudio