Resources

Javier Luraschi | Datasets in Reproducible Research with 'pins' | RStudio (2020)

Open source code is an essential piece in making science reproducible. Tools like 'rmarkdown' and GitHub facilitate running and sharing outcomes with colleagues and with the broader scientific community. However, it is less clear what tools should be used to retrieve, store and share datasets; while it is possible to make datasets part of your workflows today, it is usually hard, and we are often left manually sharing or downloading links to datasets. Not only that, but it's also hard to share or discover datasets. In this talk we will introduce for the first time the 'pins' package, a package designed to pin, discover and share resources. This means you can use 'pins' to simplify your data science workflows by easily fetching resources from GitHub, Kaggle, CRAN and RStudio Connect. We will present a 'pin' as a generic resource that can contain tabular datasets like CSVs, unstructured data like JSON files, image archives as ZIP files and so on. This talk will be highly interactive, showing you how to get started by installing 'pins' from CRAN, retrieve and cache resources, and share and discover useful and fun data resources to improve and enhance your day-to-day workflows.


Transcript

This transcript was generated automatically and may contain errors.

Yeah, thank you everyone, and welcome to the Strange Downloads Lightning Talk, where you're going to learn how to use the pins package to make sure your data science workflow remains reproducible.

Now I know that most of you take reproducibility for granted. You think or you want to live in a world where everything is reproducible, and you know, to be fair, we have like great tools like rmarkdown, which was designed from the very beginning to be reproducible.

However, you know, you want to be able to copy paste stuff anywhere, right? You want to be able to grab R code and place it on a different R session. You want to be able to grab that same code and put it on a different machine, and that all should work really nicely. And that's the case. And if that's the world where you live, you're very lucky.

The upside down of data science

Well, let me tell you now about a very, very dark place, which I happen to call the upside down of data science. So in this world, not everything is as reproducible as it may seem, and you might find that there's code that requires very strange downloads.

So for instance, in this case, we have a package that supports Python and R, and when you copy paste the code, it basically doesn't work, right? It's going to fail, because one thing that you find over and over is that there's like a local file named whatever.csv, and when you run the code, it doesn't exist. So you need to scroll all the way up and figure out, like, how exactly to download the file, then you download it, then you put it on a specific path, you change the code to point to that path, and when you want to rerun the code in a different session or a different machine, you do it all over again.

Introducing the pins package

So is there a better way to do this? Well, today we're going to find a way of closing this awful portal to the upside down of data science with a new pins package, which basically, all it does, it allows you to download a remote resource locally.

So all you have to do is say pin, and then you have a URL, and it basically converts the remote URL into a local URL, and that's about it. Then you can make use of that resource, and not only that, but the pins package allows you to cache the resource, so if you rerun the code, we're not going to be redownloading this over and over again. And if you happen to lose internet connection and you run this code, the package is smart enough to not rerun the download and just use the cached version and make sure your code doesn't break.
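The caching behavior described here can be sketched in a few lines. This is only an illustration of the idea, not the pins implementation: the `pin` function below, its `cache_dir` parameter, and the hashed file naming are assumptions made for the example, written in Python with the standard library. It maps a URL to a local path and touches the network only when nothing is cached yet, so reruns and offline runs keep working. Unlike a real tool, it never checks whether the remote file has changed.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def pin(url, cache_dir):
    """Return a local path for `url`, downloading it only on first use.

    Hypothetical helper for illustration; not the pins package API.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Derive a stable file name from the URL so every rerun hits the same entry.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    local = cache_dir / key

    # Cached already: return it without any network access, so the code
    # also works when there is no internet connection.
    if local.exists():
        return local

    # First use: download to a temporary name, then rename, so an
    # interrupted download never leaves a half-written cache entry.
    tmp = cache_dir / (key + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(local)
    return local
```

Calling `pin(url, "cache")` twice with the same URL downloads once and returns the same path both times, which is what lets the rest of a script refer to a local file instead of a hand-downloaded `whatever.csv`.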


But what else? I'm sure a lot of you work with data sets, and sometimes you tidy your data set, and then you wish you could share that with others, and the pins package also allows you to do that. You can say pin with a specific board, and then all you have to do is register your board, so for instance, you have boards for Kaggle, GitHub, RStudio Connect, Azure, Google Cloud, and S3, and whenever you pin a local data set to one of these boards, you basically are sharing it on these remote cloud providers or products.
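To make the board idea concrete, here is a toy sketch in Python. Everything in it is hypothetical: `FolderBoard` and its `pin`, `pin_get`, and `pin_find` methods are names invented for this example, and a local folder stands in for a remote service such as Kaggle or S3. It only shows the shape of the workflow: register a board, pin a named data set to it, and let someone else list and fetch it by name.

```python
import csv
from pathlib import Path


class FolderBoard:
    """Toy board: a directory in which each pin is stored as <name>.csv.

    Hypothetical class for illustration; real boards are remote services.
    """

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def pin(self, rows, name):
        """Store a non-empty list of dict rows under `name`; return the path."""
        path = self.root / (name + ".csv")
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        return path

    def pin_get(self, name):
        """Read a pinned data set back as a list of dicts (values are strings)."""
        with open(self.root / (name + ".csv"), newline="") as f:
            return list(csv.DictReader(f))

    def pin_find(self):
        """List the names of everything pinned to this board."""
        return sorted(p.stem for p in self.root.glob("*.csv"))
```

Anyone with access to the same folder can construct `FolderBoard` on it, call `pin_find()` to discover what has been shared, and `pin_get(name)` to load it, which mirrors the pin, discover, and share steps from the talk.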

That's about it, and let's see if I can get a little demo to work. No, it won't work, so thank you. That's it.