Jeroen Janssens - How I hacked UMAP and won at a plotting contest | PyData Amsterdam 2024

www.pydata.org In this talk, I’ll share my journey of animating UMAP, a cutting-edge dimensionality reduction algorithm, by visualizing not just its final output but each intermediate step as well. I’ll explain why and how I modified UMAP’s source code, while also demonstrating the use of Polars for data wrangling, Plotnine for visualization, and ffmpeg for animation. The result ultimately earned me a runner-up position in the 2024 Plotnine plotting contest. PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R. PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases. 00:00 Welcome! 01:06 Plotnine contest 02:18 Hoi! I'm Jeroen 03:13 Visualizing the behavior of an algorithm 07:58 Keeping track of intermediate predictions 09:23 Tool: UMAP 12:31 Tool: Polars 13:15 Python Polars: The Definitive Guide 13:40 MNIST dataset 14:46 Tool: Plotnine 17:04 Tool: FFmpeg 18:58 Final result 19:38 To conclude 20:40 Q & A

Jeroen Janssens

Oct 22, 2024

27 min

Python Tutorial Education NumFOCUS PyData Opensource Learn Software Python 3 Julia Coding Learn to Code How to Program Scientific Programming


Transcript

This transcript was generated automatically and may contain errors.

I've been a long time fan of plotnine. So when I heard that Posit was going to organize the first ever plotnine plotting contest, I of course had to participate. And there were some excellent submissions that really showed off plotnine's capabilities.

So here's a stacked area graph. This is actually the winner of this contest. And my own submission, the runner-up, was the only animation. What you see here is UMAP being animated, so we're not just showing the final result as you usually do, but all the intermediate steps as well.

And here's what the jury had to say. The judges praised both the process of creating the video and the visual style. The aesthetics of this animation evoke a sense of nostalgia, reminiscent of the retro visualizations often found in early machine learning books.

Which is a very nice way of saying that I've been around for a while. My PhD thesis has been hopelessly outdated for quite a while now. And now the Unix command line, which may appear outdated, is as relevant as ever. And I can assure you that it's going to be around long after we have retired. And as you will see later on in this presentation, it will save the day once again.

Well and then this Polars book here, which I'm currently writing together with Thijs Nieuwdorp. Well that book is going to be outdated the day it comes out. It's insane the pace at which they are developing new features. By the way, Polars has its origins in Xomnia, where I'm currently employed as a machine learning engineer.

And I'm here today to share with you my journey of how I created this UMAP visualization. But more generally, and more importantly, I would like to convince you that visualizing the behavior of an algorithm can be very insightful.


Understanding algorithm behavior through visualization

What do you typically do when you want to get a better understanding of an algorithm? You can read the article, you can study the implementation, or maybe implement it yourself from scratch. Or you just run it a bunch of times and see what happens.

So when I say behavior of an algorithm, I'm talking about how the output of that algorithm relates to its inputs, right? So we have the data, we have the hyperparameters of the algorithm, and the algorithm itself of course. Well, in the output there are, of course, the predictions, there is often a performance metric associated or derived from those predictions, and perhaps there's also some internal statistic. You can visualize this statically, or as an animation like I did, or perhaps even interactively so that you can play around with the inputs.

Now that's a bit abstract, so here are some strategies. I'm sure that many of you have seen a plot like this, the elbow method to determine the optimal number of clusters for a clustering algorithm, given a certain data set of course.

And here's a visualization created by Leland McInnes, the creator of UMAP, where we see one of its hyperparameters being changed, and then the output is also being animated, right? So this is different from the animation that I created. But it's yet another way of understanding, yeah, the algorithm, and then specifically how this hyperparameter influences the output of this algorithm.

Yesterday, Vincent Warmerdam gave a very interesting talk, "Run a benchmark they said, it will be fun they said," and don't be fooled, I'm pretty sure that Vincent is actually having fun here. What he did is he ran a particular algorithm with a whole bunch of hyperparameter settings and then used, well, a parallel coordinates plot to correlate the performance metric with certain hyperparameters.

You can plot the decision boundary of your machine learning algorithm. Right here on the right, this is what I've used long ago during my PhD to visualize density plots or to create density plots. And what you see here is how a new data point would be scored, right? And red means being an outlier according to the data set. And this allows you to, well, you can compare multiple algorithms with each other, right? How they are influenced by the data.

What I've done here is I've applied three different algorithms, right? These are unsupervised outlier detection algorithms and applied them to three different artificial data sets. Of course, this is limited to two dimensions.

What I've also done is created artificial data that have hyperparameters themselves so that you can tweak your data. For example, here in this bottom data set, we have two clusters and I gradually increase the overlap between these clusters and then you can see how each algorithm is able to deal with that.

And only recently, my co-author Thijs Nieuwdorp discovered a regression in the performance of Polars. So what he did, he benchmarked different versions of Polars. I'm not sure if it's entirely clear, but from a couple of versions ago up until the newest release, what we notice here is that there are certain queries that all of a sudden perform much worse, right? And it's now clear where this happened and I believe that on Monday, of course, the new release of Polars will fix this.

Hacking UMAP to capture intermediate steps

Okay, so then this sixth strategy that I would like to share with you, keeping track of intermediate predictions, that's what I did.

So in this journey, I took the following steps. I hacked UMAP so that it saves intermediate results. I ran it on the MNIST data set. I drew each intermediate result and then stitched all the frames together into an animation. And I, of course, won it, well, runner-up position, but I'm still proud of it. I blogged about it. And the "quickly presented" refers to the fact that I only learned this weekend that I'm allowed to give this presentation, because unfortunately Leland McInnes himself was unable to come, and some others. So with this presentation, I do hope to pay tribute to his awesome work on UMAP.

Of course, I can't do this by myself. I need tools and for this particular journey, I used four tools. I'm going to introduce each of them briefly. So, but I want you to know that these tools, they're useful by themselves, right? And you can, of course, mix and match. If you're more familiar with another plotting library, that's fine too, right?

UMAP in a nutshell. UMAP is a state-of-the-art dimension reduction algorithm created by Leland McInnes, I believe in 2018, inspired by t-SNE. It has a solid theoretical foundation, and I must admit that I have difficulty understanding all the math that's in there.

So the intuition behind UMAP, just as with t-SNE, is quite okay. It's quite intuitive to understand, but I wanted to get a better understanding. So let me first give you a few more examples of UMAP, right? Mostly some pretty pictures to look at.

Right, so what you see here, they're kind of fuzzy clouds, but those clouds consist of individual data points, right? Data points in a data set and that has been mapped from, of course, a very high dimension onto two dimensions. And I must say that these last two plots were created by Leland using his tool called DataMapPlot. I have a link to this tool at the end of this presentation.

Okay, so then the hack. And then, of course, not hacking by breaking into some other system. No, here with a hack I mean a clever workaround to a problem, often involving unconventional methods. And this here, this last line here at the bottom, that is the hack, yeah? So this line is quite trivial. What was the tricky part is finding where I needed to put this line. So where was this stochastic gradient descent happening?

And it's ugly. Oh my, it's ugly. It saves all these NumPy files in your current working directory, right? But it gets the job done. So for my purposes, this was fine. Maybe later on when the book is done and I have some more time, I'll file a pull request. Of course, not with this hack, but then properly solved.
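The pattern described here, one extra line inside the optimization loop that dumps the current state to disk, can be sketched as follows. This is a minimal illustration and not the actual change made to UMAP: the toy optimizer, the function name `optimize_with_snapshots`, and the file naming scheme `embedding_0000.npy` are all assumptions made for the example.

```python
import tempfile
from pathlib import Path

import numpy as np


def optimize_with_snapshots(points, n_epochs, out_dir, learning_rate=0.1):
    """Toy optimizer that saves its state after every epoch.

    This is NOT UMAP's optimizer. It simply pulls every point toward the
    centroid, so that there is something to snapshot.
    """
    out_dir = Path(out_dir)
    embedding = points.astype(np.float32).copy()
    for epoch in range(n_epochs):
        centroid = embedding.mean(axis=0)
        embedding += learning_rate * (centroid - embedding)
        # The "hack": a single line that writes the intermediate result.
        # The file name pattern is hypothetical.
        np.save(out_dir / f"embedding_{epoch:04d}.npy", embedding)
    return embedding


rng = np.random.default_rng(0)
start = rng.normal(size=(100, 2))

with tempfile.TemporaryDirectory() as tmp:
    final = optimize_with_snapshots(start, n_epochs=5, out_dir=tmp)
    snapshot_files = sorted(Path(tmp).glob("embedding_*.npy"))
    n_snapshots = len(snapshot_files)
    last_snapshot = np.load(snapshot_files[-1])
```

Zero-padding the epoch number keeps the files in the right order when sorted by name, which also helps later when the frames are fed to a video tool.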

So what I did is I forked UMAP. The code is on GitHub. I forked it. I changed the file. And then I needed to make sure that while the project that I was working on was pointing to the correct UMAP implementation, here it's pointing to another GitHub repo, but it can also point to a local directory, a local file.

Running UMAP is the easy part, thanks to its scikit-learn-like API.

And yeah, so the final embedding also has to be saved, right? Again, the hack is quite dirty. But once I ran this, I had a whole bunch of NumPy files, which I, of course, wrangled together using Polars. Now, Polars, as I learned yesterday, needs no introduction.

But if you do have any questions, come to me, or Thijs, or the Polars team, and we will happily tell you all about it. But there are two things that you need to know. Polars is fast, right? Polars is all the way on the left there. This is before we actually fixed the regression. And Polars is popular. Just look at Polars go. These are GitHub stars, right? I mean, that's as good a proxy as anything.

Did I already mention that I'm writing this book together with Thijs? You should check out our website, polarsguide.com. You can join the Discord server to talk about Polars. You can get early release PDFs of the book. And you can sign up to get a notification when the book is out.

The MNIST dataset and drawing with plotnine

Okay. So, of course, we're going to use the MNIST dataset, right? It's been used already way too many times. And it's the dataset, well, that I am at least familiar with. So why not just keep that constant so that we can focus on the algorithm itself? And for those who are not familiar with MNIST just yet, it is a collection of 70,000 handwritten digits digitized into 28 by 28 pixels, which gives us a 784-dimensional space, right? And using UMAP, we're going to embed that into a two-dimensional space. On the right here, you can see some samples of these handwritten digits. The colors, by the way, are unknown to UMAP. UMAP is an unsupervised algorithm. We're only using color here for our own purposes, right? For inspecting it.

So, yeah, this is the Polars code that I used. I mean, the code is not that important, but just look at it. Look at it. It's beautiful, isn't it? Then the drawing part. My favorite part, I should say. And for this, I use plotnine, which is, well, the grammar of graphics for Python, created by Hassan Kibirige, and he's been inspired by R, right? The ggplot2 package that, of course, everybody knows who has been using R. The API is very much alike.

It's based on matplotlib, and it allows you to create ad hoc plots very quickly, but it also allows you to modify it so that you can create production quality visualizations. It plays well with Polars. The only downside, I would say, is that it's not interactive. So it only produces static plots.

And this here, this one-liner captures plotnine, and it captures the essence of plotnine, right? You give it some data. You say how the columns should be mapped to certain properties, and then you specify the geometry. Now, it doesn't look very pretty, but it quickly gives you a sensible plot on which you can, yeah, further elaborate. And you can really go to town with this.

This here, by the way, is the very first epoch, right? The very first iteration that UMAP is performing. So what I learned by doing this is that it actually uses spectral embedding for its initial solution. Now, I could have perhaps read that in the paper, but I found out a different way.

And we can, again, we can also visualize the legend. If you do modify, right, really customize your plot, this is what your code would eventually look like. It may seem overwhelming, but realize that this is often, you know, produced in an iterative manner. So once you understand the API and the underlying grammar of graphics, I can assure you that you rarely need the documentation for this. So I'm a big fan.

Stitching frames with ffmpeg

Hey, all right, so final step. We need to stitch all those frames together. And stitching, actually, there are two kinds of stitching. So we can produce the legend on the right. We can produce the embedding on the left, right? We need to stitch those together. And we also need to stitch all those individual frames together into an animation.

So both of these things can be done in Python, right? But due to a couple of bugs, I needed to resort to something else. I looked into the source code of plotnine. Under the hood, it uses matplotlib. There were some artifacts when I wanted to combine these plots. But I found out that under the hood, it uses a command line tool called FFmpeg. Right? That's always the case, right? All your fancy DevOps pipelines or CI/CD integration, it's all command line under the hood, right?

So luckily, we can, of course, invoke this command line tool ourselves. FFmpeg has so many capabilities. And besides what we're going to do, you can turn a video into a GIF. You can record your screen activity. You can add subtitles to a video. It is really, really powerful.

Now, this is overwhelming. I would be overwhelmed by this as well. Luckily, it turns out that your favorite LLM is very good at producing an incantation like this. Yeah, so here, we take care of the stitching, meaning horizontal stacking, also using ffmpeg, and then a complex filter. If you rather solve this part in Python, then, of course, you can do that. I don't know, using Pillow or whatever.

Yeah, so we're specifying two sets of images here. Then this filter complex, which tells, okay, we should combine these horizontally. Then a couple of video settings. And then finally, we get this video, umap.mp4.
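The exact command from the slide is not in the transcript, so the following is only a sketch of what such an invocation could look like: two image sequences as inputs, a complex filter that stacks them horizontally, a few video settings, and `umap.mp4` as the output. The input patterns `embedding_%04d.png` and `legend_%04d.png`, the frame rate, and the helper name `build_ffmpeg_command` are assumptions. The command is only built and printed here, not executed.

```python
import shlex


def build_ffmpeg_command(embedding_pattern, legend_pattern, output, fps=30):
    """Build an ffmpeg argument list that stacks two image sequences side by side."""
    return [
        "ffmpeg",
        "-y",  # overwrite the output file if it already exists
        "-framerate", str(fps), "-i", embedding_pattern,  # input 0: embedding frames
        "-framerate", str(fps), "-i", legend_pattern,     # input 1: legend frames
        # Place input 0 and input 1 next to each other, left to right.
        "-filter_complex", "[0:v][1:v]hstack=inputs=2[v]",
        "-map", "[v]",
        "-c:v", "libx264",      # H.264 video codec
        "-pix_fmt", "yuv420p",  # pixel format that most players can handle
        output,
    ]


command = build_ffmpeg_command("embedding_%04d.png", "legend_%04d.png", "umap.mp4")
print(shlex.join(command))
```

To actually render the video, the list can be passed to `subprocess.run(command, check=True)`, provided `ffmpeg` is installed and the frames exist. Note that `hstack` expects both inputs to have the same height.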

All right, let's do this one more time then. All right, you can see that the noise is being reduced. Okay, one more time. Hey, so yeah, so lots of noise during the optimization. You can see a little bit of moving around. Not that much. I was surprised that it wasn't moving that much. When you visualize t-SNE, which I probably should have included as well, now that I'm thinking about it, there's a lot more going on.

Takeaways

But yeah, I did learn a couple of things here. And that's what I want to leave you with, is that hacking UMAP and winning the contest, runner-up position, wasn't really the point, of course, of this talk, right? It's the mindset that you can visualize the behavior of an algorithm, right? That this can be a really insightful exercise. That's my main take-home message for you. And there are various strategies to do so. You don't always have to hack an algorithm in order to do this. You don't always have to produce an animation, okay?

And of course, these tools that I've used, UMAP, Polars, plotnine, and FFmpeg, they're fantastic, also by themselves. Oh yeah, and again, I want to stress that in the very end, it was the Unix command line that saved the day. Thank you very much.


Q&A

I'm curious about, given the focus on Polars and data visualization and everything, given the recent developments in Polars, where there now is also a data visualization backend, is that something that has changed the plans for your book in how you're going to work with data visualizations and Polars? Or is it going to be used?

All right, I already see Thijs laughing there in the back. You've touched upon a very sensitive topic here. And that is that I had already written chapter 16, data visualization, when hvPlot was the default backend. But I should have known better. I should have known that this API was unstable and that it could change any moment. But I was just too excited about using Polars for plotting. And so you can understand that when I first heard from Marco Gorelli that the default plotting backend was going to change to Altair, I was not pleased. Although, of course, now I totally understand why Altair is a better choice for this, right? Especially for having this interactive workflow.

So, and I believe your question was, is the change of the visualization backend influencing the book? Yes, very practically. I have to rewrite chapter 16, yeah? So the book is probably delayed by a month. Now, it's going to be all right. We expected the book to be available somewhere in January.

Hi, Jeroen. I was wondering, since you're very much into different visualizations, do you have a favorite library? Maybe a favorite library for yourself that you like just to do the easy plots, and maybe the favorite library of your stakeholders at Xomnia.

Well, my favorite library, don't be surprised, is plotnine. This morning in the train, Thijs gave me some CSV with these timings, right? You remember the slide where I showed the timings of the different queries that Polars was performing and different versions of Polars? Thijs gave me the CSV this morning, and in the train, I was just hacking my way through this, you know? And what I like about plotnine is that it's based on the grammar of graphics, right? So it takes some time getting used to, but once you understand it, it all makes sense. And then you rarely have to consult the documentation.

I mean, don't get me wrong. With Seaborn, right, you can produce nice-looking plots. But I personally, I always have to consult the documentation on which arguments that function accepts, for example, right? And matplotlib, right? Well, plotnine is based on matplotlib, so matplotlib is the OG of plotting, and you can do anything you want in matplotlib, but it just takes a lot of coding, yeah? Altair, great for interactive visualizations. Perhaps less so for creating animations or videos, right? You would have to, it's browser-based. So it's also to say that each data visualization library has its pros and its cons. So I've definitely used others in my work, yeah. So if you're, like, left on the, you know, on an island with one plotting tool, it's going to be plotnine.

Yeah, thank you for your presentation. I was just wondering if you could extend plotnine functionalities, like the way you could do in ggplot2, because ggplot2 has a very rich ecosystem of packages, themes, we have gganimate, ggthemes. So how is plotnine in that regard?

Yeah, so the question was, ggplot, right, the R package has a rich ecosystem of all these plugins. plotnine, unfortunately, doesn't have such a thing just yet. And this is all because, well, for the last few years, Hassan has been the only person working on this. So he's using all his time to create this wonderful package. Hopefully, now that Posit has adopted plotnine, I do hope that soon it will have also the ability to add plugins, because I think that's really, really powerful. In the same way that Polars has the ability to write plugins for, right? Very powerful. So not yet, hopefully soon.

Did you finally understand Riemann manifolds math after this experimentation? Yeah, it came to me in a dream when I saw this animation again. No, I'm afraid that the advanced math used in the paper will always be a mystery to me. But that's okay. UMAP, I can still use it, right? Actually, I would say that UMAP is one of the very few algorithms of which I can safely say that you can just apply it, right? Common wisdom is that you shouldn't blindly apply a machine learning algorithm to some data and then expect magic to happen. So I would actually say that UMAP is the exception to that.

Featured software