
Nick Strayer | Stochastic Block Models with R: Statistically rigorous clustering | RStudio (2020)
Nick Strayer | January 31, 2020

Often a machine learning research project starts with brainstorming, continues to one-off scripts while an idea forms, and finally, a package is written to disseminate the product. In this talk, I will share my experience rethinking this process by spreading the package writing across the whole process. While there are cognitive overheads involved with setting up a package framework, I will argue that these overheads can serve as scaffolding for not only good code but robust research practices. The result of this experiment is the SBMR package: a native R package written to fit and investigate the results of bipartite stochastic block models that forms the backbone of my PhD dissertation. By going over the ups and downs of this process, I hope to leave the audience with inspiration for moving the package writing process closer to the start of their projects and melding research and code more closely to improve both.
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
I don't know if anybody watches the Great British Bake Off, but you know they always have someone going on and saying, I've practiced this at home a bunch of times and I've never made it in the time limit. I've practiced this talk three times and I've never made it in 20 minutes, so buckle in. So, I'm going to be talking about stochastic block models with R and how they can be used to do statistically rigorous clustering, but with a lens of rigorous code. And I don't know if you read the actual title on the conference page, but in my title with 11 words in it, I made three typos, so I'm getting there.
So this talk is going to be laid out kind of roughly into this organization of what I'm selling, kind of every talk is selling something, the why, the process and the product.
So quickly, just a little bit about me to help orient you. Previously I've done some software engineering roles, primarily working with JavaScript, and I had the pleasure of working at the New York Times in the graphics department for a while building data visualizations. Currently I'm a Ph.D. candidate at Vanderbilt Biostatistics, and I focus on visualization and unsupervised learning. And for the future, I'm currently looking for data science positions. I'm going to graduate in about four months. So if that interests you at all, you should probably leave right now so I don't ruin everything after this.
What I'm selling
So getting into what I'm selling, because like I said, every talk is selling something and it's nice if they're a little bit more up front about what that is. So I'm selling two things really. First is just that, a thing. It's this package SBMR that I've been making that performs uncertainty aware, unsupervised clustering of graph data. We'll get into a little bit of that jargon in a little bit. But more importantly, I'm selling an idea. And that idea is that building an R package alongside a research project can serve to strengthen both the package and the research.
Jargon break
So let's start with why. But first, I'm going to do a jargon break. A lot of times you sit in talks, especially in modeling talks, and people throw out phrases at you that you don't know, and then you stop listening, and that's no fun.
So first I'm going to define bi- and polypartite networks, and networks in general, because I'm going to mention them a little bit. So bipartite or polypartite networks are networks that have nodes of multiple distinct types, where connections can't form between nodes of the same type. So in this example, we have various pollinators and various flowers. And you can draw an edge between a pollinator and a flower based on whether that pollinator pollinates that flower. So butterflies pollinate lilies and clovers. But you can never have a connection from butterfly to moth, because that doesn't really make sense.
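As a quick illustration of that bipartite constraint, here is a toy sketch in Python (the talk's package is R; the node names and the helper function below are made up for this example):

```python
# Toy bipartite network: pollinators and flowers (two distinct node types).
# Edges may only connect nodes of different types.
pollinators = {"butterfly", "moth", "bee"}
flowers = {"lily", "clover", "daisy"}

edges = [
    ("butterfly", "lily"),
    ("butterfly", "clover"),
    ("moth", "daisy"),
    ("bee", "clover"),
]

def is_bipartite(edges, type_a, type_b):
    """Check that every edge connects one node of each type."""
    return all(
        (u in type_a and v in type_b) or (u in type_b and v in type_a)
        for u, v in edges
    )

print(is_bipartite(edges, pollinators, flowers))  # True
# An edge like ("butterfly", "moth") would violate the bipartite constraint:
print(is_bipartite(edges + [("butterfly", "moth")], pollinators, flowers))  # False
```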
And then also stochastic block model. This is a phrase I'm going to use a bunch, and I'm also going to abbreviate it as SBM. And this is simply just a network model that clusters nodes into blocks of nodes. And these nodes are grouped by the similarity of their edges. So in this example, we have this red cluster right here, and it's characterized by connections within itself. So all of the nodes tend to connect to nodes that are in the same cluster. The same thing goes for the blue. And all that needs to be known about a given node is what block it is in and what the connection propensities of that block are.
So because I exist in academia right now, and if you don't show an equation in your talk, people assume that you don't know what you're talking about, this is just the actual equation that governs this whole stochastic block model. And it's quite simple, even though there are a lot of Greek letters going on. Basically, you just need to know the data, which in this case is A, and that comes in the form of edges between nodes; which group each node belongs to; and then a matrix of the expected edge counts between those groups, which I was talking about earlier as the propensity. And then you might notice, if you're familiar with some statistical formulas, that this looks very familiar. And that's because it's simply a Poisson model. So Poisson models model counts based on a parameter that is the average expected number of counts. And so in this case, the number of edges from a node to another node is governed by this Poisson distribution, where the propensity or lambda value is what is learned in the model.
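To make that concrete, here is a hedged Python sketch of a Poisson edge-count likelihood of this general form (my own notation and toy numbers, not the exact parameterization the package uses): A[i][j] holds the observed edge count between nodes i and j, groups gives each node's block, and lam[r][s] is the expected edge count between a node in block r and one in block s.

```python
import math
from itertools import combinations

def sbm_log_likelihood(A, groups, lam):
    """Sum Poisson log-probabilities over all unordered node pairs."""
    n = len(A)
    ll = 0.0
    for i, j in combinations(range(n), 2):
        rate = lam[groups[i]][groups[j]]
        a = A[i][j]
        # Poisson log-pmf: a*log(rate) - rate - log(a!)
        ll += a * math.log(rate) - rate - math.lgamma(a + 1)
    return ll

# Two blocks of two nodes: dense within blocks, sparse between.
A = [[0, 2, 0, 0],
     [2, 0, 0, 1],
     [0, 0, 0, 2],
     [0, 1, 2, 0]]
groups = [0, 0, 1, 1]
lam = [[2.0, 0.5], [0.5, 2.0]]
print(round(sbm_log_likelihood(A, groups, lam), 3))  # about -5.307
```

The fitting problem is then to find the groups and lambda matrix that make this quantity as large as possible.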
So really why I showed this equation is that a few years ago, I got the Poisson distribution tattooed on my arm. And I did that because I really just liked it. I thought it was a cool, elegant distribution. And little did I know that I would be spending the next couple years of my life intimately familiar and sort of hating the equation.
Why stochastic block models
So why stochastic block models? So in my research, we work a lot with electronic health records, or EHR data. And so with this type of data, questions that often arise are things like, who are these people similar to and why? So we might go to a collaborator, and they would have that question of, I'm interested in patients with type 2 diabetes. Can you show me other patients that are similar to them and tell me why they're similar?
So the first thing that I tried was deep learning methods. So these worked well, but they had a couple of problems. And one is that they lacked interpretability. It could tell you that these nodes were similar to each other, but it wouldn't tell you why it said that. You can get that out of the model, but it's a little bit hard. And it needed specialization for these networks that have multipartite structure.
And then also, we wanted a method that acknowledges the limitations of the data. So statisticians were very interested in uncertainty. And neural networks, again, there are kind of ways that you can get uncertainty out of them, but they are not fundamentally built with that idea in mind.
So next is, why would I write a package to do this? And so there's a package in Python called graph-tool. And it fits stochastic block models. But there are a few reasons why we didn't use it. One is that the lab that I work in uses R exclusively. So every time I would show them Python code, people's eyes would glaze over.
We're also quite focused on uncertainty. And graph-tool is great, but it doesn't focus on uncertainty as heavily as we would like. We also needed to tweak the underlying methods to optimize them for these bipartite networks. And I wanted to present at the RStudio conference, and I couldn't do that if I wrote a Python package.
The process: getting started
So we'll get into the process now. So with every R package and really every kind of research project that someone does, you kind of got to get to the starting line. And honestly, that's the hardest part most of the time.
And so for this project, there were kind of two things we had to do. And one was look into possible graph libraries, because we decided we wanted to use R to do this graph manipulation. And the elephant in the room is igraph. And igraph is a great package. However, much like graph-tool, it does a lot more than what we needed it to. And I don't know if you've ever tried to install a package that depends on igraph, but it is a pain.
And also, we wanted to add these kind of computationally heavy steps, something we'll illustrate later. And we would have had to do that in R by virtue of wrapping a package. And that kind of brought everything to a halt.
And if you're more interested in kind of this tradeoff of adding a dependency versus writing stuff from scratch, I'd really encourage you to look at Jim Hester's It Depends talk from last year's RStudio, where he kind of talks about how big of a difference taking away one dependency can make in the uptake of your package in terms of, you know, does it take five hours to download or does it take five seconds?
So after that, we decided that we were going to use Rcpp. And Rcpp, as you might know, is kind of a nice interface between R and C++ code. And it really just lets you write C++ code to do what you want. And then you can wire it up to R extremely easily. So this low-level interface kind of allowed us to get really close to the metal and keep it nice and memory efficient. We didn't want to, you know, haul around a bunch of extra data with our model such that we were wasting a lot of bits. And we also wanted to make sure our processing was nice and efficient. We didn't want to waste excess flops on things that weren't necessary for what we were doing.
Testing
So once you've gotten a package started, the next step, in research as in software, is writing tests. And I think about this in two ways. From the software side of things, this comes in the form of unit tests. So unit tests are these nice toy examples where you break the problem down as simply as possible. You try to probe the limits of the problem, find the edge cases, and make sure that your software behaves as you think it should. And this can benefit the statistics side of things by building strong foundations for larger models. If you're really sure that the model works in these minimal examples, you're going to be confident about it scaling up. And then it also builds intuition about behavior.
So, you know, with these graph-based models, there's a lot of intuition that goes on behind them. And when you code those minimal examples, you can gain that intuition pretty well. And then from the other side, when you do a statistics methods project, a lot of times you have to, and really should, work with simulated data. And this can actually help the software side of things, I think, quite well, because when you have simulated data, it's stochastic by definition, and it's going to find your edge cases, right? You will never be able to fully probe the limits of your functions with just unit tests. But with stochastic data, if there's a bug, it will be found in one of the thousand times that you simulate data. This happened many, many times. And it also gives you an idea of real-world performance, so you don't get sucked down a rabbit hole of thinking your stuff works really well when you have maybe an O(n squared) operation hidden somewhere that doesn't show up until you start using big data.
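A minimal sketch of that point, in Python with a hypothetical helper: a fixed unit test might never feed this function a 0- or 1-node graph, but a loop of simulated inputs hits that edge case almost immediately.

```python
import random

# Hypothetical example: a naive edge-density function. A hand-written unit
# test tends to cover only the cases you thought of; stochastic inputs
# eventually hit the ones you didn't.

def edge_density(edges, n_nodes):
    """Fraction of possible (unordered) node pairs that have an edge."""
    possible = n_nodes * (n_nodes - 1) // 2
    if possible == 0:          # the edge case simulation will find for you:
        return 0.0             # a 0- or 1-node graph would divide by zero
    return len(edges) / possible

random.seed(42)
for _ in range(1000):
    n = random.randint(0, 10)                  # sometimes 0 or 1 nodes!
    nodes = list(range(n))
    edges = {tuple(sorted(random.sample(nodes, 2)))
             for _ in range(random.randint(0, 5))} if n >= 2 else set()
    d = edge_density(list(edges), n)
    assert 0.0 <= d <= 1.0                     # invariant that must always hold
print("1000 simulated graphs passed")
```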
Optimizing and documenting
So carrying on with the idea of speed, the next step a lot of times is speeding stuff up. This is more of a software side of things. But one thing that you'll quickly learn if you start writing a package to do modeling is that the exact equations that you read in a paper are not going to be efficient. And the example in this package is the entropy of the model, which is essentially the negative log likelihood, if you're familiar with those terms. Basically, we want this number to be as small as possible. And one of the things that we do is we move a node to another group, and we test how much this value changes. And if you recompute this whole thing every time, it's a very computationally expensive task. But you might notice, if you dig into it and tear the equation apart, that different parts of it don't actually change, and so you don't need to recalculate them. We have this divide by 2 here because we're double counting using this formula, so we can really optimize the equation. And by really digging into this, we actually have to think about what's going on a little more deeply than you might necessarily do if you're just sitting there writing one-off R scripts or writing math on a page.
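The incremental-update idea can be sketched with a toy objective in Python (emphatically not the real SBM entropy; just squared block sizes, to show why only the affected terms need recomputing when one node moves):

```python
def full_objective(sizes):
    """Naive approach: recompute the whole sum every time a node moves."""
    return sum(n * n for n in sizes)

def move_delta(sizes, src, dst):
    """Incremental approach: only the two affected blocks' terms change."""
    na, nb = sizes[src], sizes[dst]
    return ((na - 1) ** 2 - na ** 2) + ((nb + 1) ** 2 - nb ** 2)

sizes = [40, 40, 40]
before = full_objective(sizes)
delta = move_delta(sizes, src=0, dst=2)   # move one node from block 0 to 2

sizes[0] -= 1
sizes[2] += 1
after = full_objective(sizes)

print(after - before == delta)  # True: O(1) update matches full recompute
```

In the real model the same reasoning applies to the entropy terms touching the moved node's old and new blocks, which is what makes per-move evaluation affordable.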
So taking a step back to the test side of things, this documentation serves as a test in a way, too. As you do this research, and you're constantly updating your documentation over that process, it keeps you focused on the big picture. And this is nice because, like I said before, you don't want to get stuck in the unit tests, working little toy examples, then open it up to the real world and start having problems where any time you run it with something over 100 nodes, it takes a million years to run. And it can serve as an informal sniff test for results. So this picture on the right is an example: I was working with some data, and it kept on having these really big drops. And I was like, what is going on? And it was actually dropping into negative values, which is something that shouldn't be possible. All my tests were passing; all the behavior that I was testing for was working. But by digging into that, I actually found a bug in the logic of the code. And I wouldn't have caught that if I wasn't constantly updating the documentation and looking at how things were changing.
So another mildly selfish part of this is writing vignettes. And I think about this in the context of publishing. As a statistician or as an R developer, one of the things that you really want is for people to use your stuff, right? That's how you get tenure. That's how you get recognition in your field. And if no one knows how to use what you're giving them, no one's going to use it, right? And so vignettes are a really great way to do that: a nice real-world example of using your package. And one of the nice things about this is that if you're like me, and you don't actually like sitting down and writing academic papers, this can trick you into writing a lot. And so this right here is a single vignette from the package. And, you know, when my advisor says, Nick, are you writing? I get to show him this nice long vignette. And I say, of course, I've been writing.
The product: SBMR
So now moving on to the last part, which is the actual product that this whole process has been leading up to. And so quickly taking a step back to reorient ourselves, what exactly is SBMR? And that's a tentative name for the package. It probably will change. It's not very creative. But it's a native R package, kind of, right? It uses C++ code. But thanks to Rcpp, it's essentially native at this point. And it fits stochastic block models. And its primary focus is investigating the uncertainty of the found structure by sampling from the Bayesian posterior. So if you're familiar with Bayesian models, you can get uncertainty by looking at the posterior distribution. That's how we get uncertainty in this clustering. And it provides a growing list of visualizations to communicate the results. So this is the link to the pkgdown site for it. It's not exactly ready for primetime yet. But if you want to go use it, there's a good amount of documentation written. And please report all the bugs that you will find.
So with any package or any kind of methods research product, you need to know what you need to use it. And for this package, you really need, I think, two classes of things. One is data. So you need data that can be represented as nodes and edges between those nodes, also known as graph data. One of the examples that we use it for is patients with diagnoses. So you have a patient as one node type, and they're connected to diagnoses as another node type. Another example that we've used it for is genotypes for people. So you have a person and a genotype. A more classic example is social relationships: networks of, you know, friends on different social networks. We don't currently support nondiscrete edges. And that's just because, as you saw, the distribution that we're using is the Poisson distribution, which is a discrete distribution. There are ways to extend it to nondiscrete edges, but we have not gotten there yet. But those would be examples like correlation networks, where every node is connected to every other node, just by varying amounts.
And you also need kind of a mentality, right? There are a lot of different ways to do clustering and unsupervised learning. To use this method, you're going to want to have an interest in how and why your data cluster together. This is the nice thing about stochastic block models: they're very transparent about why you are clustering. Any given cluster is entirely characterized by its vector of propensities to be connected to the other clusters. And so if you have five clusters, every cluster is characterized entirely by a length-five vector that says how likely any given node in that cluster is to be connected to each of the clusters. And you need a desire to understand how stable your clusters are. So if you just wanted to fit a stochastic block model and you didn't care about using R, you could use graph-tool. Currently there are a lot more features there. But what we focus on especially is looking at the stability of these clusters.
Live demo
So now with the few remaining minutes that I have, I'm going to give a quick demo of the workflow as you run through the package. So the first thing you can do is simulate a network. And I'm going to reload this page just because this is an interactive network, and I feel like it's more interesting to see the nodes move as I talk. So we're going to simulate a pretty dense network with three clusters of nodes and 40 nodes per cluster. And we have a function built into the package called sim_basic_block_network, and we just give it the various values. There's a lot more customization you can do; right now I'm just keeping it at this default simple example. And as you'll see, this is a very dense network, right? We have a lot of edges. And that's actually one of the harder types of networks to cluster, because there's a lot of competing information. It has to do a lot of sorting to figure out what's going on. And as you see in this force-directed layout, there's no clear structure in the data. There might be a little bit denser region of blue nodes in the center. But if someone were to say, can you cluster this by eye, it would be pretty hard to do. But we know that there are three clusters, because we simulated three clusters.
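For intuition, here is a rough Python sketch of what simulating a block-structured network like this involves (sim_basic_block_network is the real thing in the package; the Bernoulli edges and the probabilities below are my simplification of its Poisson draws):

```python
import random

random.seed(1)

n_blocks, nodes_per_block = 3, 40
p_within, p_between = 0.4, 0.1          # a dense network, like the demo

n = n_blocks * nodes_per_block
block = [i // nodes_per_block for i in range(n)]   # planted true clusters

edges = []
for i in range(n):
    for j in range(i + 1, n):
        # nodes in the same block connect more often than across blocks
        p = p_within if block[i] == block[j] else p_between
        if random.random() < p:
            edges.append((i, j))

print(f"{n} nodes, {len(edges)} edges")
```

Because the between-block probability is well above zero, a force-directed layout of a draw like this shows little obvious structure, even though three clusters are planted.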
So the first thing that you want to do whenever you're doing Bayesian sampling is, if you're going to sample from a posterior, you want to get as close to the optimal starting point as possible. And this is because in posterior sampling, you can spend a lot of time just getting to equilibrium. And so you might be familiar with the concept of burning in a chain, which is where you let it run for a while until it finds its natural resting point, and then you start sampling. But that's kind of an inexact science. And so we have an algorithm built into the package that does agglomerative merging. It basically takes every node, starts it in its own cluster, and then slowly merges those clusters together one by one until you have one cluster. And it records the model fit as this entropy value, where a higher value is a worse fit. And so what we're really looking for here is the point where true clusters start being merged together, because the model says, I have to take these three clusters and merge them into two. So it had to merge two clusters that should not have been merged in reality. And ideally, there will be a sharp increase in model penalty for that change.
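Here is a toy Python version of that agglomerative merge loop, under a stand-in cost (edges crossing clusters plus missing edges inside clusters) rather than the real SBM entropy; with three planted cliques, the cost stays low until the true clusters are forced to merge, which is exactly the jump the heuristic looks for.

```python
from itertools import combinations

# Three planted cliques of three nodes each, with no edges between cliques.
edges = {frozenset(e) for e in
         [(0, 1), (0, 2), (1, 2),      # clique A
          (3, 4), (3, 5), (4, 5),      # clique B
          (6, 7), (6, 8), (7, 8)]}     # clique C

def cost(clusters):
    """Stand-in for entropy: missing within-cluster edges + crossing edges."""
    c = 0
    for cl in clusters:
        for pair in combinations(sorted(cl), 2):
            if frozenset(pair) not in edges:
                c += 1                  # missing edge inside a cluster
    for e in edges:
        if not any(e <= cl for cl in clusters):
            c += 1                      # edge crossing two clusters
    return c

clusters = [frozenset([i]) for i in range(9)]   # every node starts alone
history = []
while len(clusters) > 1:
    # greedily apply the cheapest pairwise merge
    a, b = min(combinations(range(len(clusters)), 2),
               key=lambda p: cost([cl for k, cl in enumerate(clusters)
                                   if k not in p]
                                  + [clusters[p[0]] | clusters[p[1]]]))
    merged = clusters[a] | clusters[b]
    clusters = [cl for k, cl in enumerate(clusters) if k not in (a, b)] + [merged]
    history.append((len(clusters), cost(clusters)))

for k, c in history:
    print(k, c)   # cost stays low until the 3 true cliques must merge
```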
And so we have a heuristic for detecting that point, which we call the delta ratio; it's something that we've developed, and it's not particularly important here. But we can see that right here, we've very clearly picked out this point of three clusters, which is what we know is correct. I drew a line here to say, you know, we know that there are three clusters; what does the model tell us? And very clearly, the model has said, I think that there are three clusters in your data, so let me choose that. And we can do a sanity check, because we know what actually generated the data. And conveniently, we actually perfectly separated our clusters, right? So this is the simulated block of each node, and this is the inferred block. And what we see is that every single inferred block perfectly corresponds to the data-generating block. And this algorithm honestly works incredibly well. If you wanted to just do all of your inference with this step, without going on to the Bayesian sampling, you could do pretty well. It does surprisingly well.
So once we have that kind of equilibrated starting point, we can do MCMC sweeps. And so we can just pipe our model object to this function, mcmc_sweep, and tell it how many sweeps we want to do. We give it an epsilon value. And epsilon, in this case, is an ergodicity parameter. So it influences how random your choices of different groups are. So, you know, when a node is being checked in a sweep, how likely it is to be given a random group rather than one that follows the data.
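A very loose Python sketch of the role epsilon plays (the actual proposal mechanism used by the package is more involved; this only shows that a larger epsilon means more uniformly random group proposals, and the helper name and formula below are mine):

```python
import random
from collections import Counter

def propose_group(node, neighbors, groups, n_groups, eps, rng):
    """Propose a new group for `node`: mostly follow the neighbors' groups,
    but with probability related to eps pick uniformly at random."""
    if rng.random() < eps / (eps + 1):
        return rng.randrange(n_groups)            # ergodic random jump
    neighbor_groups = [groups[v] for v in neighbors[node]]
    if not neighbor_groups:
        return rng.randrange(n_groups)            # isolated node: no signal
    return Counter(neighbor_groups).most_common(1)[0][0]   # follow the data

rng = random.Random(0)
groups = [0, 0, 1, 1]
neighbors = {0: [1], 1: [0], 2: [3], 3: [2]}
print(propose_group(0, neighbors, groups, n_groups=2, eps=0.1, rng=rng))  # 0
```

With a small epsilon like 0.1, almost every proposal follows the data; as epsilon grows, the chain explores more freely at the cost of slower convergence.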
And we also have this interesting option called track pairs here that we'll get into in two seconds. But what we can see is that we have a very stable chain. And honestly, that's because it's perfectly equilibrated, right? We know exactly where it is, and it perfectly fit there. And we get a little bit of variation as we go over our sweeps, in the number of nodes moved. But everything is working pretty well.
So, you know, that's fine. But what we're interested in is looking at the uncertainty in this clustering. And so when we use that track pairs option, what it does is it keeps track, internally, at every single sweep: these two nodes, are they in the same cluster? And then as we do our sweeps, we just keep tallying up that value. And that gives us this kind of consensus matrix. And that will tell us, right, blue means the two nodes were in the same group 100% of the time. And so we can aggregate all of these MCMC sweeps to get an idea of the stability of our network.
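The pair-tracking tally can be sketched in a few lines of Python (toy assignments, not package output): across recorded sweeps, count how often each pair of nodes lands in the same cluster.

```python
from itertools import combinations

sweeps = [            # cluster assignment of 4 nodes at each recorded sweep
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],     # node 1 wobbles between the two clusters
    [0, 0, 1, 1],
]

n = len(sweeps[0])
consensus = [[0.0] * n for _ in range(n)]
for assignment in sweeps:
    for i, j in combinations(range(n), 2):
        if assignment[i] == assignment[j]:
            consensus[i][j] += 1
            consensus[j][i] += 1
for i in range(n):
    consensus[i][i] = len(sweeps)
# normalize tallies into proportions
consensus = [[c / len(sweeps) for c in row] for row in consensus]

print(consensus[0][1])   # 0.75: nodes 0 and 1 together in 3 of 4 sweeps
print(consensus[2][3])   # 1.0: nodes 2 and 3 always together
```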
And so this is an easy way to see it. However, you know, this has these nice little triangles here, and that's because we know the true structure, so we can sort it. So what if we don't know the true structure, so we can't sort it? Because if we didn't have this sorted, it would be hard to tell what's going on unless we knew the true sorting of the network.
So we have a new thing that we're calling consensus-SNE. You might be familiar with t-SNE, which is t-distributed stochastic neighbor embedding. We're calling this consensus-SNE; it's a totally made-up name that I came up with today. But it's basically taking these pairwise propensities and building a network between nodes, where a node is connected to another node if it passes some threshold of connection. And what we can see with our data, and I'll refresh this again, is that it perfectly separates itself into the component clusters that we generated. And what you'll see in less certain models is that there might be some connections between clusters, so you get the hierarchical structure without having to figure out how to sort your heat map.
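That thresholding step, sketched in Python with a made-up consensus matrix and cutoff: keep only the node pairs whose consensus proportion clears the threshold, and the surviving edges reveal the cluster structure without any sorting.

```python
# Hypothetical consensus proportions for 4 nodes (from pair tracking).
consensus = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.95],
    [0.0, 0.1, 0.95, 1.0],
]
threshold = 0.5   # illustrative cutoff, not a package default

n = len(consensus)
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if consensus[i][j] >= threshold]
print(edges)   # [(0, 1), (2, 3)]: two tight clusters, no cross links
```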
Future plans and thanks
So I have very little time left, but for future plans, we're going to focus a little bit more on node-level uncertainty, to allow us to try to find outlier nodes: values that don't really fit the generating distribution that we've inferred. And we're also going to add a lot more visualizations. I've been working on this project for a long time, and I have lots of visualization ideas, and I want to start pumping them out now that we've got the fundamental kernel going. And just a quick thanks: to my advisor, Yao-Min Zhu, at Vanderbilt; lab biostatistician Siyue Zhang, who's been working with me on this; and the other lab members for their thoughtful comments during the whole process. I'm supported by the Vanderbilt Biostatistics Development Grant. And most importantly, my cat has been very helpful all the days that I banged my head against the wall. While I couldn't figure out some bug, he was there. And I guess, you know, my wife was there, too, to help me out. So thank you very much.

