Resources

Karl Broman | R/qtl2: Rewrite of a very old R package | RStudio (2019)

For nearly 20 years, I've been developing, maintaining, and supporting an R package, R/qtl, for mapping quantitative trait loci (genetic loci that contribute to variation in quantitative traits, such as blood pressure) in experimental crosses (such as in mice). It's a rather large package, with 39k lines of R code, 24k lines of C code, and nearly 300 user-accessible functions. In the past several years, I've been working on rewriting the package to better handle high-dimensional data and more complex experimental crosses. This has been a good opportunity to take advantage of many new tools, including Rcpp, roxygen2, and testthat. I'll describe my efforts to avoid repeating the mistakes I made the first time around.

VIEW MATERIALS: https://bit.ly/rstudio2019

About the Author

Karl Broman is Professor in the Department of Biostatistics & Medical Informatics at the University of Wisconsin–Madison; his research is in statistical genetics, and he is the developer of R/qtl. Karl received a BS in mathematics in 1991 from the University of Wisconsin–Milwaukee and a PhD in statistics in 1997 from the University of California, Berkeley; his PhD advisor was Terry Speed. He was a postdoctoral fellow with James Weber at the Marshfield Clinic Research Foundation, 1997–1999, and a faculty member in the Department of Biostatistics at Johns Hopkins University, 1999–2007. In 2007, he moved to the University of Wisconsin–Madison, where he is now Professor. Karl is a Senior Editor for Genetics, Academic Editor for PeerJ, and a member of the BMC Biology Editorial Board.

Karl is an applied statistician focusing on problems in genetics and genomics, particularly the analysis of meiotic recombination and the genetic dissection of complex traits in experimental organisms. The latter is often called "QTL mapping": a QTL is a quantitative trait locus, a genetic locus that influences a quantitative trait. Recently he has been focusing on the development of interactive data visualizations for high-dimensional genetic data; see his R/qtlcharts package and his D3 examples.


Transcript

This transcript was generated automatically and may contain errors.

I'm an applied statistician working at a university. My primary goal is helping my collaborators make sense of their data. My secondary goal is to develop new statistical methods to make sense of their data, and my third is to implement those methods in software. For nearly 20 years I've been working on an R package called R/qtl, and I want to tell you about that experience.

The idea for this package came about when a friend of mine visited and we were talking about the need for specialized software in the applications we worked in. It was actually the week that R version 1 came out, in January of 2000. He wanted to write it in MATLAB and I wanted to write it as an R package. I was five years younger and devoted more time to it, so it became an R package.

This figure shows lines of code over time; each dot is one CRAN release. The package was first on CRAN at the end of 2001. It's grown to nearly 40,000 lines of R code and 25,000 lines of C code, plus over 15,000 lines of .Rd documentation, which back in the day we typed by hand. For the first seven years it was without version control; the way I kept track of versions was by tarring up the files in their current state and saving that .tar.gz file. I didn't get into version control until 2008.

About the science and the package

I feel like I should tell you a little bit about the package, and in doing so, I really need to tell you a little bit about the science. I study the genetics of complex traits, mostly in mice. Mice have a big advantage for genetics in that you can make inbred lines: you can take a pair of mice and mate them, then do repeated matings between siblings for 20 generations. What you get at the end is a strain of mice that are all genetically identical. If you can figure out which one is a male and which is a female and cross them, the offspring will all be genetically identical again.

Our interest in the genetics in mice is really focused on understanding the genetics of human disease. We're not trying to cure disease in mice. We're trying to have a better understanding of the underlying biology of human disease.

The basic experiment would be an intercross, where you take two strains of mice that differ in some trait: one strain has high blood pressure, the other has low blood pressure, and you're trying to find what genes contribute to that difference. You do a cross, identical to what Mendel did with his peas, to mix up the high-blood-pressure chromosomes and the low-blood-pressure chromosomes. Then you get a bunch of offspring and measure blood pressure in all those mice, which for that phenotype would really be the hardest part of the study. You measure the blood pressure of all the offspring, and you measure genotype along each chromosome: whether a mouse is blue-blue, pink-pink, or blue-pink at a given position. Then you scan across the genome and look for regions where there's an association between genotype, whether you're blue-blue, blue-pink, or pink-pink, and the outcome, blood pressure.

The result of this sort of analysis is a plot across the mouse genome: on each chromosome, a curve for a test statistic, and where that curve is high, you're saying there's evidence for an association between genotype at that location and blood pressure. Those peaks we call QTL, quantitative trait loci: regions of the genome that affect a quantitative trait, here blood pressure. Effectively, at every position in the genome you do an analysis of variance to ask whether the average trait differs across the three genotypes. My package is called R/qtl because these sites we're looking for are QTL, the genetic loci that affect a quantitative trait. It does a bunch of other things, too.
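A minimal sketch of that per-position analysis in base R, on simulated data (this is illustrative, not R/qtl's actual implementation): at each marker, compare a genotype-specific-means model to the null model and record a LOD score.

```r
# Simulated cross: 100 mice, 5 markers; marker 3 is the simulated QTL.
set.seed(1)
n <- 100
geno <- matrix(sample(c("AA", "AB", "BB"), n * 5, replace = TRUE), nrow = n)
qtl <- 3
bp <- rnorm(n, mean = 100 + 10 * (geno[, qtl] == "BB"), sd = 5)  # blood pressure

# At each marker, an analysis of variance: does the mean trait differ by genotype?
lod <- apply(geno, 2, function(g) {
  rss0 <- sum(resid(lm(bp ~ 1))^2)          # null model: one overall mean
  rss1 <- sum(resid(lm(bp ~ factor(g)))^2)  # genotype-specific means
  (n / 2) * log10(rss0 / rss1)              # LOD score
})
which.max(lod)  # the peak should land on the simulated QTL, marker 3
```

The genome scan is just this comparison repeated at every position, with the curve of LOD scores plotted along each chromosome.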

The good and bad of RQTL

Going back to the package, I want to talk a little bit about the state of it and my experience with it. Twenty years into a project, there may be some good things about it, but they start to become harder to identify. Some of the code in this package is really good, and I'm proud of it. Some aspects of the user interface, the code the user types to get something to happen, I think were a good idea I had back in 2000 or 2001. It includes a lot of data diagnostics and data visualization, and it's quite comprehensive: I had this idea that I would implement all methods, good and bad, so that you would have one place where you could try things out and understand them. And it's quite flexible: I can implement new ideas for analyzing data with it, and other people could potentially do so as well.

It's easier for me to think about the bad things. The first example is just about getting data into R for this kind of cross. For this sort of experiment, I need a set of traits, then genotype data — are you blue-blue, blue-pink, or pink-pink at sites along the genome — and then I need to know where those positions are in the genome. The main input format we use is a comma-delimited format where the first few columns are the phenotypes, the traits, and the remaining columns are positions in the genome and their genotypes. For those later columns with the markers, the sites in the genome, the second row has the chromosome ID and the third row has the position, while the corresponding cells under the phenotype columns are blank. The first thing I do when reading this in is split the trait data from the genotype data, which I do by looking for the first non-blank cell in that second row.
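The layout just described can be sketched with a toy file (a simplified example; the marker names, chromosomes, and positions here are made up):

```r
# Row 1 = column names, row 2 = chromosome IDs (blank for phenotype columns),
# row 3 = marker positions, remaining rows = one mouse each.
csv_text <- "bp,sex,D1M1,D1M2,D2M1
,,1,1,2
,,0.0,12.3,5.1
101.2,f,AA,AB,BB
95.7,m,AB,AA,AB"

x <- read.csv(text = csv_text, header = TRUE,
              colClasses = "character", check.names = FALSE)
chr_row <- unlist(x[1, ])                 # the chromosome row
first_marker <- which(chr_row != "")[1]   # first non-blank cell marks the split
pheno <- x[-(1:2), seq_len(first_marker - 1), drop = FALSE]  # trait columns
geno  <- x[-(1:2), first_marker:ncol(x),  drop = FALSE]      # marker columns
```

With this toy file, the first non-blank cell in the chromosome row is column 3, so the first two columns are phenotypes and the rest are genotypes.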

As R/qtl got used, people started to load data with lots of traits, say 25,000 traits. They would ask me, why is it taking so long to load the data? I would say, how much data do you have? They'd say, 25,000 phenotypes. I'd say, it's going to take a while; it's a big data set. Then I had a collaboration with data that was 20,000 traits — 20,000 phenotypes, the usual number of markers, and maybe 300 mice. It took a minute to load this CSV file into R, and I thought, something's wrong. It shouldn't take a minute to load a file that's really not very big.

So I looked into the code, which I declare to be the stupidest thing I've ever done. It's the stupidest code ever. This code is trying to find the first non-blank cell in that second row. It does that by looping from 1 to n, where n is the number of columns, asking: is the first cell blank? Are the first two cells blank? Are all three of the first cells blank? Are the first four cells blank? Are the first five cells all blank? When that finally fails, it breaks out of the loop and then does a bit of extra work to pin down the position.
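The pattern just described, reconstructed as a sketch (not the original code): each iteration re-tests all the cells already checked, so the loop does quadratic work, and the fix is a single linear scan.

```r
# Quadratic version: test ever-longer prefixes of the row for blankness.
first_nonblank_slow <- function(row) {
  n <- length(row)
  for (i in seq_len(n)) {
    if (!all(row[1:i] == "")) break  # re-checks cells 1..i every iteration
  }
  i
}

# Linear version: one vectorized scan.
first_nonblank_fast <- function(row) {
  which(row != "")[1]
}

row <- c(rep("", 5000), "chr1")
first_nonblank_slow(row)  # same answer, O(n^2) work
first_nonblank_fast(row)  # same answer, O(n) work
```

With 25,000 leading blank columns, the quadratic version does on the order of 300 million cell comparisons just to find the split point.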

In loading my data with 25,000 columns of traits, about 58 seconds of that minute was spent in this for loop, and two seconds was actually loading the data and everything else.

I like to say open source means everyone can see my stupid mistakes. And version control means everyone can see every stupid mistake I've ever made.


So, that's not the only bad part. That code is no longer in the package — you can find it in version control, on GitHub — but there is other bad code that's still in the package. One of the R functions, which does a major thing, a two-dimensional scan of the genome to look for a pair of these QTL, is 1,400 lines long. And of the almost 25,000 lines of C code I mentioned, almost 20% is related to this one task. The problem is complicated, but the code is more complicated than it needs to be. It's not what I would do today.

I'm really happy with the package's interface for users. I'm less happy with the data structures that I set up in the year 2000; there are some really baroque data structures, to put it nicely. There are lots of inconsistencies in the code: functions doing similar things have names that don't match, because they weren't designed. And they have arguments that are really the same argument, but the two different functions give it different names. A marker might be `marker` in one function and `mname1` in another.

I would say that maybe the biggest mistake I've made has been really useless warning messages. For a lot of users of R/qtl, this is their first experience with R — maybe their first experience with software other than Excel, web browsers, and Word. So they get a warning message: "X'X is singular." What are they supposed to do with that? What does that mean to them? A lot of the questions I get are: I get this warning message, X'X is singular, what should I do? In looking at this yesterday, I thought: I know what X'X means, but to someone who doesn't know the details of linear regression, it could be a picture of something. Even parsing the sentence is difficult.
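As a sketch of the alternative (the function name and wording here are hypothetical, not R/qtl's actual code): detect the rank deficiency, but report it in terms of the user's data and what they might do about it.

```r
# Warn in the user's terms, not the linear algebra's.
check_design <- function(X) {
  if (qr(X)$rank < ncol(X)) {
    warning("Some genotype columns are redundant (the model matrix is not ",
            "full rank), so some effects can't be estimated separately. ",
            "This often happens when a genotype class is missing or two ",
            "markers have identical genotypes; consider dropping one.")
  }
  invisible(X)
}

# A matrix with a duplicated column triggers the friendlier message:
X <- cbind(1, c(1, 0, 1), c(1, 0, 1))
```

The singularity check itself is the same; only the message changes, from notation to a description of the likely cause and a suggested fix.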

Tests. There are not very many tests. I actually said this: I don't need tests. I have users.

Jenny Bryan has said: if you use software that lacks automated tests, you are the tests. It brings me some comfort that we agree. Hadley, in his paper about unit testing and the testthat package, wrote: it's not that we don't test our code, it's that we don't store our tests so they can be rerun automatically. That, to me, is the essence of Hadley, and exactly why I love him. He starts off by saying it's not that we don't test our code — so it's not that we're terrible people, it's just that we have room for improvement.
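A minimal testthat sketch of that idea, using a hypothetical helper: the checks you'd otherwise run once by hand get stored in a file under `tests/testthat/` and rerun automatically on every build.

```r
library(testthat)

# Hypothetical helper: index of the first non-empty cell in a row.
first_nonblank <- function(row) which(row != "")[1]

test_that("first non-blank cell is found", {
  expect_equal(first_nonblank(c("", "", "chr1")), 3)
  expect_equal(first_nonblank(c("chr1", "")), 1)
  expect_true(is.na(first_nonblank(c("", ""))))
})
```

The expectations are the same spot-checks one would do interactively; storing them is the whole difference.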


How it happened and lessons learned

So, how did this happen? How did I spend 20 years developing, maintaining, and supporting a package that has these problems? I didn't know any better at the time. I started to know better, but by then it seemed too late. There was not enough rewriting done, because the to-do list in there is just absurd — none of that stuff am I ever going to do — so going back and redoing things I've already done seems pointless. I mean, it works; I don't have any tests that would prove it, but I did test it. There was not enough planning. And, you know, personal weaknesses: there are tradeoffs to make, and maybe I made the wrong choices.

A couple of other quick things I've learned making R/qtl: when you're writing documentation for a package, what users care about is tailored examples. Show me your package in action doing exactly what I want to do, so I can plug in my data at the top and run it. So if you're going to put effort into documentation, think about how people are going to use it and focus on that first, a general user guide second, and examples within the detailed documentation third. Formal documentation seems like it's important, but no one seems to read it.

User support: I put a lot of effort into answering people's questions; I still get emails at least a couple of times a week. I think if you want people to continue to use your software, you really need to help them get started. I really want to make myself useful: if they need help getting their data into R/qtl, I'm happy to have them send me the data, and I'll figure out what's going on and send back working code and fixed data.

RQTL2 and moving forward

Going back to QTL mapping: the main target is to find a gene, to isolate the position on the chromosome where the gene affecting the trait sits. The regions you get are too big, so modern experiments of this kind focus on breaking the chromosome down into smaller pieces by doing more generations of crossing and by mixing together more than two strains. Neither of these can really be studied effectively with the current package, and the sizes of datasets have also increased enormously.

So I've been focusing now on R/qtl2, a rewrite that basically starts over completely, to handle high-dimensional data and these modern multi-strain crosses. I'm trying to avoid making the mistakes I made the first time around. Rcpp, roxygen2, and unit tests with testthat just make everything work better; my life is so much nicer now. But I still have some personal failings, so R/qtl2 is not without its flaws.
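For example, a roxygen2 sketch (the helper is hypothetical, not R/qtl2's API): the documentation lives in comments next to the code, and roxygen2 generates the .Rd files that I used to type by hand.

```r
#' Convert genotype codes to allele dosages
#'
#' Count the number of "B" alleles in each genotype code.
#'
#' @param g Character vector of genotypes ("AA", "AB", or "BB").
#' @return Integer count of "B" alleles (0, 1, or 2) for each element.
#' @examples
#' dosage(c("AA", "AB", "BB"))
#' @export
dosage <- function(g) {
  c(AA = 0L, AB = 1L, BB = 2L)[g]  # named-vector lookup
}
```

Because the `@param` and `@return` tags sit right above the function, the documentation is far more likely to be updated when the code changes.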

Finally: I've devoted a lot of my time to software, even though the people who evaluate my work don't really care about it at all — they care about grant money and citations of papers. But it has actually been really helpful for my own scientific work and for finding new collaborators. As statisticians, our key job should be to make ourselves useful, and this has really been the way in which I've been most useful, I think. A lot of people have been involved in it. And I'm a Bitly squatter: if you want to see these slides, they're at https://bit.ly/rstudio2019. Thanks very much.