
Javier Luraschi | Using pins with Python and JavaScript | RStudio
Last year, pins was released as a brand new R package to pin, discover and cache remote resources for R users. This package has matured to support many use cases, from caching remote URLs and easily sharing datasets with other R users to building automated pipelines. However, in order to truly collaborate in multi-disciplinary data-driven teams, one needs to consider how to collaborate beyond R. How can we share resources with designers and machine learning experts who happen to use different programming languages like Python and JavaScript? This talk will introduce the pinsjs project, a cross-language community project which has the goal of bringing pins to the broader open source community to enable rich workflows across larger data-driven teams. About Javier: Javier is the author of “Mastering Spark with R”, pins, sparklyr, mlflow and torch. He holds a double degree in Math and Software Engineering and has decades of industry experience with a focus on data analysis. Javier is currently working on a project of his own, and previously worked at RStudio, Microsoft Research and SAP.
Transcript
This transcript was generated automatically and may contain errors.
Hi everyone, I'm Javier Luraschi, and today I'm super excited to talk to you about using pins with JavaScript and Python. But why? Like, you're probably already aware that R is a pretty great programming language for data science, so why should we care?
Well, let me show you something. There was a survey created by Stack Overflow that shows that the top 5 programming languages that are the most popular are JavaScript, HTML with CSS, SQL, Python, and Java. These happen to be the programming languages that basically run the software in the world. Now, R is not there, it has 5.7% awareness compared to 67.7% awareness of JavaScript. So the likelihood of you having to interoperate with JavaScript or having to collaborate with someone that loves JavaScript or Python or SQL is quite high. So how can we get R to interoperate with these programming languages and for us as R data scientists to collaborate with others? And that's what I want to explore in this talk.
Introducing the fictional team
But let me give you some more context. Rather than talking about technologies, let's talk about a fictional team that is working on something. And this team is composed of three members. The first one is Data Scientist Darla. Many of you will feel familiar with her because she happens to love R. R is her main programming language. She's very into data science and she's very competent with the tools she knows and loves, like the tidyverse or tidymodels or even base R or other packages from the R community.
There's one more character that we'll introduce, which is Graphics Greg. Now, Graphics Greg comes from a background in design and he loves using JavaScript. There's no way that Greg is going to learn how to use R. And that's not necessarily bad. R is great for data science, but Greg loves doing static websites and helping people with their design. So really the skills that he needs are CSS, HTML, and JavaScript, and he's very competent with those. And there's one more person that perhaps exemplifies the Python community, and this would be Machine Learning Monica. Monica likes to do machine learning with things like network analysis and deep learning, and she happens to love Python. She might be familiar with R, but she's really into Python and it's really hard to convince her to do otherwise.
The collaboration problem
So the problem that we have at hand is a pretty common problem, which is that we live in a world where not everyone speaks the same language. Not everyone speaks R. So let's try to solve this problem from the perspective of Darla, who is our data scientist who loves R. What can Darla do to get us closer to collaborating with Greg and Monica? A lot of the time, what Darla does in her day-to-day is analyze files that she has in front of her. She has different files, perhaps we can call them data frames, and she usually tidies them, analyzes them, shares them, and creates Shiny applications or R Markdown documents out of them. And it wouldn't really be acceptable for Darla to ask Greg or Monica to learn R or to install R and RStudio just to collaborate with her.
So in her mind, what she's thinking is, well, if I can share the basic building block, which is the dataset that I'm working on, if I could share that with Greg and Monica, I have a better chance of convincing them to collaborate with me. So what's going on in her mind is, well, how can I share these data frames through technologies like GitHub or Kaggle or RStudio Connect to help Greg and Monica reuse my datasets without having to install R?
The pins package recap
And there's actually great news for many of you that are already familiar with this particular problem, which is that Darla, being the competent data scientist that she is, knows that there's a package called pins, which we've been developing in the R community for the last year and a half, and which allows her to do tasks like this: it allows her to retrieve data and upload data into the cloud. So what is the pins package, as a quick recap? Well, first of all, the pins package allows you to get a remote resource into your local machine. So say you have a remote URL with a CSV file, you can create a local cache on your machine very easily. Also, once you're connected to remote services like Kaggle or GitHub or RStudio Connect, you can search those services for datasets that are interesting to you, all within the pins package. And last but not least, and especially important for this particular problem that we're trying to solve, the pins package allows you to share datasets in different cloud providers like Google, Kaggle, Azure, and RStudio Connect. And you can do that quite easily. And this is something that Darla is already familiar with. She knows that she can share datasets with other colleagues that also use R and the pins package.
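To make the first capability concrete, here is a minimal Python sketch of the caching idea only: download a remote file once and reuse the local copy afterwards. The helper name `cache_remote` and the hashed file-name scheme are assumptions made for this illustration; this is not the pins API or how pins lays out its cache.

```python
import hashlib
import urllib.request
from pathlib import Path


def cache_remote(url: str, cache_dir: Path) -> Path:
    """Download `url` once and return the path of the local copy.

    Hypothetical helper: it illustrates the caching idea, not the pins API.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable file name from the URL so repeated calls hit the cache.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    target = cache_dir / key
    if not target.exists():
        # Only touch the network (or disk, for file:// URLs) on a cache miss.
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
    return target
```

Calling `cache_remote(some_url, Path("cache"))` twice returns the same path, and the second call does not download anything.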
The pins package is actually quite simple. You install it from CRAN, you load the library, and then you can ask it to pin a dataset, either an R object or a remote resource or a local file, into your local board, which is what we're doing in this particular example. And then after you pin that resource locally, you can use your favorite tools to process it. In this case, we're just reading a remote JSON file, saving it locally, and then using jsonlite to read it. Now, that's the simple use case of the pins package, to bring things into your local machine. But you can also use the pins package to push things out of your local machine. And what Darla could do here potentially is register a board, which is a concept from the pins package. The board that she registers could be GitHub, DigitalOcean, Google Cloud, RStudio Connect, Kaggle, Azure, or AWS S3. And she can basically push a dataset, even something as simple as an array of 10 numbers, by saying, pin one to 10, name equals numbers, and then board equals, say, Google Cloud, whatever.
And the dataset that she has locally can be pushed to be shared to those particular services. And if she needs to get it back at any point in time, she can just say, pin get, and the dataset will come back from those services into her local machine again.
You can do slightly more complex stuff. Suppose that you're collaborating on an image processing project and you have ImageNet. You can retrieve a smaller version of ImageNet called Tiny ImageNet by saying, pin this remote resource into my local machine and we'll call it ImageNet. And then you can also say, oh, give me back ImageNet, and now I want to push it to perhaps Kaggle, because maybe you did some cleaning of the data or some processing. So all of these are tasks that the pins package supports, including finding. You can just say, pin find, and it will find the different datasets that are available in the different boards.
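The board workflow described above (pin a value under a name, get it back, find what is available) can be sketched with a toy board that is just a folder of JSON files. The class `LocalBoard` and its method names are hypothetical and only mirror the spoken "pin", "pin get" and "pin find"; real boards do considerably more.

```python
import json
import tempfile
from pathlib import Path


class LocalBoard:
    """A toy 'board': a folder where each pin is one JSON file.

    Hypothetical class for illustration; it is not the pins implementation.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def pin(self, value, name: str) -> None:
        # Store the value under its name, overwriting any earlier pin.
        (self.root / f"{name}.json").write_text(json.dumps(value))

    def pin_get(self, name: str):
        # Read the value back from the board.
        return json.loads((self.root / f"{name}.json").read_text())

    def pin_find(self, text: str = "") -> list[str]:
        # List the names of pins whose name contains `text`.
        return sorted(p.stem for p in self.root.glob("*.json") if text in p.stem)


if __name__ == "__main__":
    board = LocalBoard(Path(tempfile.mkdtemp()))
    board.pin(list(range(1, 11)), name="numbers")
    print(board.pin_get("numbers"))
```

Because the stored format is plain JSON, any language that can read the folder (or the bucket behind it) can read the pin, which is the property the rest of the talk relies on.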
Introducing PinsJS
So this is quite great. And if we think about it from Darla's point of view, we're pretty close to the space that Greg and Monica are interested in. We've moved our datasets from our local machine to the cloud. But then again, we have the question that, in this case, Nelson Bighetti is asking. He's thinking, wait, this is a pretty great package for the R community. Why can't we just do the same from JavaScript or Python? Rather than having to download random JSON files or CSV files and figure out authentication and permissions, why can't we just have the same code being used in Python or JavaScript? And sure enough, the whole point of this talk is for me to introduce you to a new project, PinsJS, which is a reimplementation of the pins R package in JavaScript that supports Python as well. This enables you to use pins from technologies like the web browser with HTML, or web applications running Node.js, or even Python. And the way that we accomplish this is basically by reimplementing the functionality available in R in JavaScript. And this allows you to interoperate with R. So if someone shares a dataset from R, you can then quite easily get the dataset from JavaScript or Python, which is quite great, and even push datasets from Python and JavaScript back into cloud service providers.
So what does this look like? Well, here we have some HTML code. It's quite simple. It's declaring that this particular HTML file needs the pins library and is running within the browser environment. And then we're basically just defining some JavaScript callbacks to use pins. In this particular case, we are saying board register to register a local board. And then we're creating a dataset, which is just the number 42, and we're saving it in the local board. And then we're retrieving the value and printing it into the web page inside a div, which is our result.
And we can complicate this a little bit more. Rather than just creating one dataset, we can create multiple datasets in JavaScript, say using a for loop from one to 10, and push those datasets into the local board. And then we can search them and create perhaps a page table in JavaScript that shows us all the different pins that we have stored, which you can see on the top right. And we can even go a step further.
If you're really into state-of-the-art JavaScript applications, you can use a library called Babel, which allows you to transpile modern JavaScript into compatible JavaScript. And you can make use of perhaps the new pipe operator to get the iris JSON file using pins, pipe it into the function that reads this dataset from JSON, and then create a data table with the entire iris dataset. And that's what we see on the top right. So again, you're just using JavaScript to different degrees, using pins in the way that you would expect to use JavaScript.
And similarly for Python, if you want to use pins from Python, all you have to do is run pip install with a URL that is hosting the PinsJS Python library. And then you import the library with import pins, and then you create pins with pins pin, the number 42, and then the board where you're storing this data. You can also do things like retrieving a pin, obviously, from either the local board or a remote board. And then you can also use find to figure out which pins are available to you, which in this case are just the number that we just pinned and the iris dataset.
Real-world use case: Game of Thrones
All right. So this gives us kind of like a pretty broad overview of how the pins package works. But we want to see it in action on a real kind of like use case. So let's think about this. Darla, Greg, and Monica get together over a lunch, and they're trying to figure out which is the most important character of the last book in Game of Thrones. So, you know, like they have some insight that perhaps, you know, like either Daenerys or maybe Tyrion are the most important characters, but they can't really figure out exactly which one is the most important one.
So Darla, being the competent data scientist that we know she is, gets back to her office after lunch, launches an R instance, and runs the pins library to find out if there are any interesting datasets on the Kaggle service. And sure enough, she finds that Kaggle has at least three datasets: one containing all the scripts from the HBO series, another one with the subtitles, and a third dataset that contains the actual books, with all the dialogues and content from Game of Thrones. So sure enough, this looks interesting. She uses pin get to get this particular dataset and finds out that there are five files, one for each book. She loads the first book using pin get and looks at the head of that particular file. And she finds that, sure enough, the file starts with A Game of Thrones, Book One of A Song of Ice and Fire.
Great. She has the data. So what she can do next is something that you're pretty familiar with. She can use tools like dplyr or tidytext or even stringr with regular expressions, whatever skills you already know and that you are learning about during this conference. Basically, Darla makes use of those skills to transform the dataset from raw text into a tidy table that contains the relationships between characters. She finds out, for instance, that one of the first interactions is Addam Marbrand with Jaime Lannister, with a weight of three. And maybe the way that Darla accomplished this was by parsing each sentence, extracting the characters from each sentence, and figuring out how they're connected. So yeah, sure enough, she creates a beautiful dataset, which contains most of the interactions. And she reports back and tells Greg and Monica, hey, I think that in the fifth book, there are a lot of relationships between Jon Snow and also Tyrion and other characters. So check it out. And what she does is she shares the dataset using pins by registering a board in the AWS S3 service. And she shares this cleaned-up dataset with Greg and Monica.
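The talk only guesses at how the interaction table was built ("maybe ... by parsing each sentence"), so the following is a sketch of that guess rather than Darla's actual method. It counts how often two character names appear in the same sentence; the function name `interaction_weights`, the naive sentence splitter, and the plain substring matching are all assumptions made for this illustration.

```python
import re
from collections import Counter
from itertools import combinations


def interaction_weights(text: str, characters: list[str]) -> Counter:
    """Count how often each pair of characters shares a sentence."""
    weights: Counter = Counter()
    # Naive sentence split: break on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        # Sort names so each unordered pair always gets the same key.
        present = sorted(name for name in characters if name in sentence)
        # Every pair found in the same sentence counts as one interaction.
        for pair in combinations(present, 2):
            weights[pair] += 1
    return weights
```

Each key of the result is a `(source, target)` pair and each value is the weight, which maps directly onto the rows of a tidy edge table.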
And what is great is that when Greg hears the news, he's super excited, because Greg is not that interested in doing data science, but he's very interested in creating intuitive visualizations that can really help us understand how data behaves. So he looks at Darla's dataset and just runs to boot his Sublime editor or Visual Studio Code or whatever he uses, loads the pins package, and then retrieves the dataset from S3. And as you can see, he's accomplishing this with just two lines of code. He gets the data and then he loads it and he's good to go. So then he thinks about it and he's like, well, maybe I can use D3, which is a modern data visualization library available in JavaScript, to create a network visualization of what all the relationships between these characters look like. And sure enough, he uses his skills to create this particular graph, which actually looks quite compelling and interesting. As you can see on this graph that Graphics Greg created, there are two major components. There's one component on the top surrounding Jon Snow, with other characters like Theon Greyjoy and also Stannis Baratheon. But there's also another component of characters surrounding Daenerys Targaryen, like Tyrion Lannister and Cersei. And it's interesting because this is just a different type of skill that produces something great looking and intuitive that anyone can understand. Perhaps it's interactive, and it really prompts other people to understand this dataset with the skills from Greg and Darla combined.
And sure enough, Monica is also super excited about this. She looks at the dataset and is like, wow, I really need to get into this. But what goes on in her mind is, I really want to know exactly who's the most important character. Is it Jon Snow or is it Daenerys? I want to know, with a numeric value, which is the most important character. And sure enough, with pins, she can now load the pins library, retrieve this dataset that Darla carefully created, load it into her Python session, and load the characters. And again, with three lines of code, she's ready to go. She can do some more analysis. And taking the dataset from Darla and the inspiration from Greg, it pops into her head that maybe she should use some graph processing. So she loads the NetworkX library, which allows her to do graph processing over graph datasets. And she basically uses a concept called degree centrality. The way that she interprets this concept is, if she can find the degree centrality of this particular graph of Game of Thrones characters, that could also indicate which is the most central character in each of the different books that are available. And sure enough, after running her script and doing some magic, she finds out that in the first book, King Stark is the most important character. But in the fifth book, the race between Jon Snow and Daenerys is pretty close. Jon Snow seems to be a bit more of a central character with a score of 0.19, while Daenerys Targaryen has a score of 0.18. So again, she finds out that perhaps objectively Jon Snow is the most important character in the entire season and the series, or at least in the last book.
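Degree centrality is simple enough to sketch in plain Python: a node's score is the number of distinct neighbours it has, divided by the number of other nodes in the graph. This is, as far as I know, the same normalization NetworkX uses in its `degree_centrality` function, but the version below is a standalone illustration, not Monica's script. It assumes every edge joins two different characters, and the edge list in the usage note is invented toy data, not the dataset from the talk.

```python
from collections import defaultdict


def degree_centrality(edges: list[tuple[str, str]]) -> dict[str, float]:
    """Degree centrality: neighbours of a node divided by (n - 1)."""
    neighbours: dict[str, set[str]] = defaultdict(set)
    for a, b in edges:
        # Undirected graph: record the link in both directions.
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n <= 1:
        # Nothing to normalize against in an empty or single-node graph.
        return dict.fromkeys(neighbours, 0.0)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}
```

With toy edges such as `[("Jon Snow", "Stannis"), ("Jon Snow", "Theon"), ("Daenerys", "Tyrion"), ("Jon Snow", "Tyrion")]`, `max(scores, key=scores.get)` picks out the best-connected character, which is the kind of ranking described above.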
But honestly, more importantly, what we just realized here is that we found a way for three people with three different technologies and three very different sets of skills to collaborate using technologies that they all love and common libraries that they can reuse to better collaborate and tackle bigger, more complex data science problems.
So I'm really excited to see what you do with this new library. If you need more information, please visit pinsjs.github.io. I want to say that this is a community library that we've been developing, and it definitely needs the help of the community to extend support for more boards beyond S3, local boards, and RStudio Connect to, you know, boards like Kaggle and GitHub. And I also want to say thanks to people that have been involved in this project, like Natalia Stefanova and Michael Calleghan. Thank you so much for listening to this talk, and I hope that you are ready to start sharing your datasets.
