Riva Quiroga | The development of "datos" package for the R4DS Spanish translation | RStudio (2020)

Originally posted at https://rstudio.com/resources/rstudioconf-2020/the-development-of-datos-package-for-the-r4ds-spanish-translation/

Transcript

This transcript was generated automatically and may contain errors.

Hello, everyone. Do you hear me well? Yeah, cool. Hola. So I'm going to talk to you about the development of the datos package that's part of the R for Data Science translation. So this talk is about what it's like to be an R user when English is not your first language. So we have to acknowledge that there's a language gap, that for people who don't speak English or who are not very proficient in English, being an R user can be a very difficult thing to do.

So for example, when you try to learn something new, you want to insert some data into a vector and want to learn how to use the append function, you may run into this resource, the R Cookbook. So you want to learn how to use this function, and you have this whole context that explains to you how to use it. But when English is not your first language and you're not very proficient, that context doesn't exist at all. So you have to try to figure out how to understand that function without the context, and sometimes that can be very difficult.

So someone may ask, why don't you just learn English? That would be much easier. Okay. Well, that's a possibility. But for many people, learning English is a privilege. In the country I come from, I live in Chile, one level of an English course can cost you like $500, but the minimum wage is $350. So it's not only costly, but we also have to think that it's time consuming. So you want to learn R. So why do I have to learn English first and then learn R? That's not the idea. So happy R users like me want everybody to learn R. So sometimes we have dreams like this one, for example, have R for Data Science translated to Spanish and have R for Data Science. So O'Reilly, you can help us.

How the translation project started

So the story I'm going to tell you starts in 2017, and like many stories in the R community, it started on Twitter. So Hadley Wickham announced that R for Data Science was going to be translated into Chinese, French, Korean, Portuguese, Russian, Serbian. Cool. So Marcella Alfaro from the Costa Rican R community asked, well, what about Spanish? And Hadley said, strangely, no. And that's very strange, because Spanish is the language with the second-largest number of native speakers, so there are a lot of Spanish-speaking R users that want to read this.

So in 2018, Edgar Ruiz from RStudio forked the original repository and decided to start the translation, but he didn't get to do anything at all, because this is very difficult to do, translating a book. So he needed a community, and he was alone in this. And a couple of months later, at the rOpenSci unconf, Laura asked, Hadley, what about the translation? And he said, no, it's not going to happen. O'Reilly is not interested in Spanish. But you can do it if you want. So she spread the news around the R-Ladies network and on Twitter, and the next week, we had the first conference call between all the people that were interested in this project. And just two days after the video conference, we had our hex sticker, because, you know, if you don't have a hex sticker, you're not a real R project. So we became a sticker-driven project. We had a team and a sticker, so we could start.

So everyone started forking the repository, and 21 people were involved in the translation from different backgrounds, from different countries, and we had around another 30 people that reviewed those translations. So this was about a 50-person effort. So all these people speak different varieties of Spanish, which is very useful, because you don't want to be rude. You know, you translate something one way, and that's very offensive in another part of Latin America. People from different backgrounds. There were people that were from statistics, from biology, from social science, from the humanities. Also people with different R experience, experts and novices, and people with different GitHub experience, and this was very risky, because we forked the repository and then people forked our fork, so you have around 30 people trying to manage how to contribute.

Decisions about infrastructure and translation

So in this first conference, we decided two things. First, that we wanted to translate the text and the datasets that are used in the book, because they're useful not only for the book, but for teaching R, and also which infrastructure we were going to use. So we used mainly GitHub and Slack. In GitHub, we have our fork of the book, we have our repository with all the documentation of the project, and the repository for the package. And we have responsibilities. Edgar was leading the development of the package, but I was involved with taking care that we didn't break the repository, and I was involved with the translation and coordinating the whole process. So GitHub is very useful for translation, because you can have both versions of the text, and reviewers can put comments in between, so they can make suggestions on how to improve the translation. And Slack was also very useful, well, so we could organize this, but also because of polls, because we had to make many decisions about how we were going to translate some stuff. For example, we have mtcars. We have to translate the cars, and we can say cars in three different ways in Spanish. So we had to decide: autos, coches, or carros. So autos won, and now we have mtautos.

And there's some other very difficult stuff to decide. For example, we have the pipe, and in Spanish, all nouns have a grammatical gender, so we have to decide if the pipe was going to be masculine or feminine. So we made a poll, and everyone voted that the pipe was masculine, except me, because for me the pipe was a she. So now I have to change my mind about how I thought about the pipe.

So we decided that we wanted to translate the text and the datasets, and for the text, there were previous experiences that were very helpful. For example, The Carpentries have translated some lessons to Spanish, and they have this guide with the agreements about how they are translating technical words, so we got inspired by that. And also the Programming Historian, it's a website where we publish peer-reviewed tutorials that help people from the humanities to learn digital tools, and it has versions in Spanish and French. So there are, like, guidelines for reviewers, for translation, for authors, and it was very useful to get inspired by that.

Translating the datasets with Datalang

And what about the datasets? There were two possible approaches. What we wanted is, for example, we have the diamonds dataset, we want diamantes, and we wanted to have, like, the variables translated and the names of the different values. And one possible approach was just to, for example, rename the variables and then translate the names of the values. But another approach was using Datalang, a package that Edgar Ruiz developed, that has this function that's called translate_data, that takes a YAML file with the translation, the specification for the translation, and as a result, you get your dataset translated.

So how it works: you have this YAML file, so you put which is the source package, the original dataset in English, how it's going to be named in Spanish, you specify how you want to translate the variables, for example, price becomes precio, and the description. You also explain which values you want to translate, and you also translate the help. So you have documentation in Spanish also. So you not only get the dataset translated, but also you get the help page for the dataset. So that's very useful for people who are learning.
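To make the idea of a translation specification concrete, here is a minimal sketch in base R. It is not Datalang's actual API or file format: the spec is written as an R list standing in for the YAML file, the field names and the helper `apply_spec()` are hypothetical, the data frame is a toy stand-in for the diamonds dataset, and the Spanish labels are only illustrative.

```r
# Toy stand-in for the English source dataset (not the real diamonds data).
diamonds_en <- data.frame(
  price = c(326, 334, 423),
  cut   = c("Ideal", "Good", "Premium"),
  stringsAsFactors = FALSE
)

# Hypothetical spec written as an R list; in the project this information
# lives in a YAML file. The field names here are assumptions.
spec <- list(
  new_name  = "diamantes",
  variables = list(
    price = list(new = "precio"),
    cut   = list(
      new    = "corte",
      values = c(Ideal = "Ideal", Good = "Bueno", Premium = "Premium")
    )
  )
)

# Hypothetical helper: rename columns and recode values according to the spec.
apply_spec <- function(data, spec) {
  out <- data
  for (old in names(spec$variables)) {
    entry <- spec$variables[[old]]
    if (!is.null(entry$values)) {
      # Look up each original value in the named translation vector.
      out[[old]] <- unname(entry$values[as.character(out[[old]])])
    }
    # Replace the English column name with the translated one.
    names(out)[names(out) == old] <- entry$new
  }
  out
}

diamantes <- apply_spec(diamonds_en, spec)
```

The point of the design is that a contributor only edits the spec, never the code that applies it.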

So this approach of translating the book and the datasets using this package makes it very easy to contribute, because you can participate in a package just by knowing how to edit a YAML file.

Finishing the package and going to CRAN

So last year, when we were developing the package, we sent a proposal to Luz, but the package wasn't finished, so it became a conference-driven package. We had to finish it before the conference. So we were working fine, but when we decided to send the package to CRAN, we realized it was too bulky, because we had a package with a lot of datasets inside. So what we decided to do was that the datasets are translated on the fly. So actually the translated datasets are not in the package. In the package, you only find the YAML files. So we have these two functions that translate the data on the fly. Once you call, for example, diamantes, it goes and finds the YAML file, translates the dataset, and offers you the translated version. So that makes the package much lighter.
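The idea of shipping only the specifications and translating on request can be sketched roughly like this in base R. Everything here is an assumption made for illustration: the `specs` registry stands in for the YAML files, `translate_on_demand()` is a hypothetical helper, and the source data is a toy data frame. It is not the code of the datos package.

```r
# Hypothetical registry standing in for the YAML files shipped in the package:
# one small spec per translated dataset, and no translated data stored.
specs <- list(
  diamantes = list(
    source  = function() data.frame(price = c(326, 334)),  # toy source data
    renames = c(price = "precio")
  )
)

# Hypothetical helper: build the translated dataset only when it is requested.
translate_on_demand <- function(name) {
  spec <- specs[[name]]
  if (is.null(spec)) stop("No translation spec for: ", name)
  data <- spec$source()
  # Find the columns named in the spec and give them their translated names.
  idx <- match(names(spec$renames), names(data))
  names(data)[idx] <- unname(spec$renames)
  data
}

diamantes <- translate_on_demand("diamantes")
```

Because only the small specs are stored, the size of the package no longer grows with the size of the translated data.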

So we were very happy with that, but one day we were trying to do something, and we realized we were not able to call the dataset that way, and we were, like, very stuck, and Hadley remembered that this function exists, delayedAssign, and that helped us to solve that. So now you can call the datos package whichever way you want.
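For readers who have not met it, `delayedAssign()` is a base R function that binds a name to an expression that is only evaluated the first time the name is used. The sketch below shows that behaviour with a toy dataset; `build_diamantes()` and the `calls` counter are hypothetical, and this is an illustration of the mechanism rather than the way datos actually uses it.

```r
# Counter used only to show when the translation actually runs.
calls <- 0

# Hypothetical, deliberately tiny stand-in for the real translation step.
build_diamantes <- function() {
  calls <<- calls + 1
  data.frame(precio = c(326, 334))
}

# Create a promise: `diamantes` exists as a name, but build_diamantes()
# is not evaluated until the first time the name is used.
delayedAssign("diamantes", build_diamantes())

calls_before <- calls        # still 0: nothing has been translated yet
first  <- nrow(diamantes)    # first access forces the promise
second <- nrow(diamantes)    # later accesses reuse the cached value
calls_after <- calls         # 1: the translation ran exactly once
```

So the dataset looks like an ordinary object to the user, while the work of translating it is put off until someone really asks for it.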

So we are on CRAN now. Edgar was leading the development of the package, and I helped with the whole process, and also people from the community got involved. And where we are now, the datasets are all translated. The code, we had to edit all the code from the book, because it has, like, the original datasets. The translations are ready, and the reviews are ready. So we have, like, the first complete version of the book at es.r4ds.hadley.nz. So now we have this first version that people can use to learn R.

And there was a thing we had to decide. What about the editing? Because we want this book to always be up to date. So I decided that a good idea was to watch the repository. So every time they change something in the repository, I can add it to the book. And one day, January 15, I woke up, and my inbox had collapsed with 42 messages, because they decided that that was the day to update all the pull requests that were in the repository.

So I thought I was able to be a conference-driven editor, but I failed, so I'm still, like, making the last changes.

What the project really created

So is this really the end of the project, now that the book is ready? This started in 2018, but I think it's a project that will last forever, because what we have created is not just a translation of a book and a package. What we created was human and technical infrastructure to shorten the language gap that currently exists. And I think our big contribution, developing the datos package, was creating brand-new errors, because people now, if they fail using the package, they are not going to find, like, ways to solve it alone. But I think this is an opportunity, because failing is a way to make our community stronger.

So thank you very much.

Q&A

So once again, if you'd like to ask a question, we use the app, app.sli.do. We have one question here. Have you seen an increase in Spanish-speaking R users since the release of the package?

Yes, I think so. We put the book online from the beginning, so you could start, like, using the chapters. Some of them were translated very fast. So, yeah, there are many people that are using the package to teach, so it now has a life outside the book translation.

Fabulous. That's the only question we have. Well, please, thank you.