Resources

Semantic Search for the Rest of Us with DuckDB (Marcos Huerta, Carmax) | posit::conf(2025)

Speaker(s): Marcos Huerta

Abstract: Semantic search matches a search query against documents not just by directly matching words, but by ranking results using semantic meaning - for example, searching for "cat" could also turn up documents that contain "tiger." While many enterprise-level solutions exist for semantic search, I will show a low-memory technique in Python that enables live semantic search for small to medium document collections that need to be deployed in constrained memory/CPU environments - such as Posit Connect Cloud or a small cloud server. I will show how a web app (such as Shiny for Python) using just a few hundred megabytes of memory can enable live search using the fixed array support in DuckDB and the open source library LlamaCPP.

posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Semantic Search for the Rest of Us. So, what is semantic search? Well, semantic search matches meaning, not just words. With a traditional text search, kind of exact string matching, you search for quick and you get documents that have the word quick. Or maybe you can use stemming and search for quickly and get documents that contain quick. Whereas with semantic search, you can search for fast and get documents that contain quick. It can match synonyms; it can match the semantic meaning.

So, I'm going to talk about this in the context of a kind of goofy web app that I made called Semantic Emoji Search. It was a goal that I had; I was nerd-sniped into making this app a couple of years ago. The idea was you search for launch and you get the emoji for rocket, right? You can search for spicy and get a pepper. You can search for frozen and get ice or snow. So, this was the goal I had. This is what I wanted to build.

And so, the goal was a website that looks like this. And spoiler alert, I did build something; I did show this off two years ago. But the caveat here is that, for me, this is kind of a goofy side project. I don't have access to enterprise-caliber tools, and I'm not going to go buy a license for Atlas or some other commercial semantic search. There are plenty of commercial products that will sell you semantic search, and if you're trying to deploy this for a big corporate website, you can go buy one of those.

So, I'm trying to deploy this on my own personal website with a relatively small budget and my existing resources. I want to use open source tools; I want to do this in an open source way. And I thought I knew how to do this. I thought, I can do this, I know what I need to do, there's a Python package. And that's the other caveat: I will be talking mostly about Python.

Sentence transformers and embeddings

So, there's a Python package called Sentence Transformers. This is a package that uses basically medium to small language models, not the big language models we hear a lot about, but the smaller ones. And it can create embeddings, or vector representations, of the words or sentences that you have. So, I thought, I'll use this package and I'll take all of the descriptions for the emojis. That little purple guy there is smiling face with horns, so I'll make an embedding of "smiling face with horns" and of all the other emoji descriptions. Then I'll encode the search term, whatever it is, into an embedding vector as well, and I'll just do the math to compute the vector distance between all of these vectors. And that'll work.

And a little bit more about embeddings, if you're not familiar with them. The large language models can do this as well, but again, I'm using these kind of small models. The idea is that these models are trained on all of this data out there on the Internet. So, when they turn mathematics or statistics into this vector, these 384 numbers, basically a high-dimensional vector, they understand that mathematics and statistics, projected into 2D space here, are obviously kind of close to each other. And so, you can do distance calculations, usually cosine distance. And tiger and lion are similar to each other. So, that's the idea.
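The distance math itself is small. A minimal sketch of cosine similarity, using made-up 2D vectors purely for illustration (real embeddings would be 384-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2D "embeddings" just to show the idea.
mathematics = [0.9, 0.1]
statistics = [0.8, 0.2]
tiger = [0.1, 0.9]

# Nearby concepts score high; unrelated ones score low.
print(cosine_similarity(mathematics, statistics))  # close to 1
print(cosine_similarity(mathematics, tiger))       # much lower
```

With real sentence embeddings the vectors come from a model rather than being hand-written, but the comparison step is exactly this.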

And so, the way a semantic search would work is that you take all of the emojis and throw them into this semantic space, this higher-dimensional space. If you project that down to two dimensions, you'll get something that looks like this, right? Putting emojis into semantic space; that's what this is. This has been projected into 2D with UMAP. And so, if you want to search for weather, you'll encode weather as a vector, and that'll land kind of here, and you'll get all the emojis near that part of the higher-dimensional semantic space.

The memory problem

So, that's the idea. I'm going to use Sentence Transformers. This should work, right? This is super simple; there's not much code in this talk, and I do have a QR code at the end with code examples and links to everything. This is what you would do to use Sentence Transformers: you import it, you make your model with one of the relatively small, pre-trained language models that come with Sentence Transformers, and then you take the word spicy and turn it into a 384-length vector, right?

The only problem here is that if you do this, you will use 1.45 gigabytes of memory just to run those three lines. And that's fine on this laptop, or even my computer at home, but I'm trying to deploy on a really small, 2-gigabyte-of-memory server in the cloud; I have a DigitalOcean droplet. So, this is not going to work for deployment. This is an out-of-memory problem. I'll burn all of the memory on my server just running this one little toy app.

That's why I don't want to do that. So, I found a workaround: I pre-computed everything. Two years ago, I showed off this app, and I was talking about the front end; I wrote it in Dash and Shiny and Streamlit. But the way it worked, I pre-computed everything. I found a list of the 10,000 most common words, I pre-computed all of the distances, and I threw it all into a SQLite database, and that was it. So, it worked for one word. If you typed in two words, nothing.

Enter Llama CPP and DuckDB

But several things happened in early 2024 that changed the game, at least for me, in terms of being able to do this with low resources and low memory. A software package called Llama CPP added support for BERT-style models, the kind of models that Sentence Transformers uses. Llama CPP was written to run large language models, like Llama, hence the name, but it added support for these models on February 8th, 2024. And there's a llama. Less than a week later, DuckDB added support for fixed-length arrays as a data type, so you could put an array inside a column of DuckDB. And there's a duck; that's not the official logo. So these two things combined were going to let me do what I had wanted to do two years ago, except this time I was right that they would let me do it.

So Llama CPP, as I mentioned earlier, is software written in C++ (that's what CPP stands for), and it's designed, again, to run these big models; it was written to run Llama, but without all of the overhead of Python. You don't need Python, you don't need PyTorch, you don't need all of the dependencies. It's just a C++ binary that runs language models.

So this is very similar code to what I showed you earlier. I'm making the model and, spoiler alert, in these three lines of code we encode one term with the model, the same model I was using earlier, but now in 120 megabytes of memory, not 1.45 gigabytes. So we're using about a tenth of the memory. This is now viable. I saw this, and I'm like, OK, I can now do what I wanted to do two years ago.

But I still have to encode all the emojis, and I still have to be able to do the math to compare the target, the thing that I encode, which is what I was just showing, against the whole database of emojis that I've created. And since DuckDB added support for fixed-length arrays as a data type, I can encode all of the emojis into these vectors, into these arrays, and then I can pump them into DuckDB, right? So this is a DBeaver screenshot. You have the emoji, and then that's not a single number there in that second column, that's an array. In DBeaver, if you click on the arrow, you'll see all the other numbers, right? Not that you want to look at them, but they're all there.

Cosine similarity in SQL

But it's not just that DuckDB supports the array as a data type; it has helper functions built in that let you do math on them, right? So here's a SQL query, and this is actually on legislation, not on emojis, but the key is that second line. We're selecting array_cosine_similarity; embedding is the name of one of the columns in the database, and the question mark is where, from Python, we're going to send in the new array that we encode, our search term. And the ::DOUBLE[384] just means treat this as a 384-element DOUBLE array, or vector. That is now your similarity score; it's doing all that vector math for you, and then you can sort by it, and this will give you the top 12 matches, in this case, to the bill summary.

So there's the question mark that I was talking about; that's where we're going to send in our new thing. This is the biggest chunk of code I will show you, and again, it's all linked at the end. Once I created the database, this is all I had to write to add live search to my previously one-word, pre-encoded website, right? So I'm going to go through this line by line. We do all the imports. We make our model with Llama CPP. We connect to DuckDB, and we connect read-only in case I run this with multiple processes: DuckDB can do concurrent reads, but it cannot do concurrent writes, so I set it to read-only in case I ever want to scale up. I cache the search function; this is a standard Python thing, so that if someone searches for the same thing, it will not do the math again, it will just return what it got the first time. We take our search term and encode it into a new vector. And then this little block of SQL, about four lines, with some joins in there because of how I set the database up, takes that new vector, does all the array cosine similarity math, and returns the top 25 results.

And so now you can search for Neil Armstrong on the emoji site, and because the models are trained on the Internet and they know Neil Armstrong is an astronaut, it will return the astronaut emoji, and you can search for multi-word phrases too. And what about the memory usage? Well, I have a bunch of apps running; actually, all three of these use semantic search, but as you can see, they're each using just a few hundred megabytes of memory, and that's with the web UI in front of them; these are all Dash apps. So all perfectly viable; they all live alongside each other on my site.

So this happened, and this worked, and I was very excited, so I'm like, well, where else can I use this? Are there other opportunities? And there was one more. I have a website called RecordedVote.org, which is about legislation in Virginia, the Virginia legislature. It's called RecordedVote because I'm analyzing how people voted on bills, but people also use it for search. There were two things I wanted to do. One was bill similarity: you have one piece of legislation, what are the 10 most similar bills to that bill? I could pre-compute all of that, and I did, before any of this DuckDB and Llama CPP support came out; I used the model to pre-compute the top 10 most similar bills. But pre-computing for bill search wasn't going to work, right? I mean, no one wants to search for solar; you want to search for solar power, or solar energy.

But now that I have this new toolkit, I'm like, well, I can take this and use it on this other project. So here's an example search, where you're searching for solar power regulation. The top three there are lexical matches, right? That's just SQLite FTS5 full-text search. But if there aren't enough results there, then it automatically does a semantic search and shows more matches. The classic example I usually give is that if you search for solar power, with lexical search you would not get bills that say solar energy, but with semantic search you can get both.
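The lexical half of that hybrid can be sketched with the standard library's SQLite bindings, assuming your SQLite build includes the FTS5 extension (most Python builds do); the bill summaries and threshold here are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table gives classic keyword (lexical) search over summaries.
con.execute("CREATE VIRTUAL TABLE bills USING fts5(summary)")
con.executemany(
    "INSERT INTO bills (summary) VALUES (?)",
    [
        ("A bill regulating solar power installations.",),
        ("A bill about solar energy tax credits.",),
        ("A bill on highway speed limits.",),
    ],
)

def lexical_search(query: str) -> list[str]:
    rows = con.execute(
        "SELECT summary FROM bills WHERE bills MATCH ? ORDER BY rank", (query,)
    ).fetchall()
    return [r[0] for r in rows]

hits = lexical_search('"solar power"')
print(hits)  # only the exact-phrase match; the "solar energy" bill is missed
if len(hits) < 3:
    # Too few lexical hits: this is where the DuckDB semantic search
    # from earlier would take over and surface the "solar energy" bill too.
    print("falling back to semantic search")
```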

Caveats and model selection

So I'm very excited that this was working well. There are some caveats. Llama CPP requires a special model format (GGUF), so you either have to convert a model into that format using some command-line tools in Python that come with the Llama CPP GitHub repository, or you can just go on Hugging Face and usually find a model that someone else has converted and uploaded for you. Sentence transformer models are suitable for small amounts of text; it's sentence transformers, not dissertation transformers, so you can't put in a big document. Think of an abstract, not a whole paper. And if you're deploying to something GitHub-based, where all your files live on GitHub, there's a slight chance you might hit the 100-megabyte file limit on GitHub if you have a big DuckDB database.

So the question you might be wondering about is what model to use.

When you look at the semantic search documentation, there are two kinds of searches. There's symmetric search, which is what the emoji engine really is, right? The things you're searching, these strings of text, are short, a couple of words, and the thing you're typing in is a couple of words, so that's symmetric. The legislation is asymmetric: I'm actually encoding the bill summaries as the target, so the thing you're searching for is two or three words, and the bill summary is a couple of hundred words or something. Different models are tuned for these different use cases, so I use two different models for the two cases. You don't need to memorize this, again, it's all at the end, but I use all-MiniLM for symmetric search, and then MS MARCO, which is fine-tuned on a question-answer pair dataset, so it's better, presumably, for the asymmetric search. These are both original, pre-trained sentence transformer models.

There are a bunch of models out there; you can go down a real rabbit hole trying other ones, and I have tried a lot of them. I really want the emoji search to return turtle for slow, and it won't. It's never on there. So I keep trying more models to see if one of them will find it, and none of them do. Snail is always, like, eighth, and turtle never shows up. It's very frustrating.

Speed and performance

So you might also be wondering about the speed. I tend to use Sentence Transformers when I'm actually building the database, and then I only use Llama CPP at the end, for the live encoding step, where that memory advantage matters.

So I have a decent new Mac mini; it's like the base model M4 Pro. It'll encode 20,000 of those bill summaries in just 26 seconds. It does use the GPU, and Sentence Transformers will automatically use the GPU on a Mac, which is very convenient. So if you have an older Mac, or you're trying to do this CPU-only, maybe it'd take a couple of minutes, but it's viable. Obviously, if you have a million documents, it might take you a little longer, and you might want to make sure you have access to a GPU. And then the live encoding, which, again, I'm doing in this small little droplet on DigitalOcean, that's the speed I really cared about, because that's the responsiveness of the website, and that's just 150 or 160 milliseconds. That's just the encoding of the search term. So that is quite reasonable, and if you play around with the emoji site, which I'll send out later, assuming you don't all hit it at the same time, you'll find it pretty responsive.

Advanced topics and R support

All right, so one advanced topic: retrieve and re-rank. Maybe you do have a ton of documents, like a million or something, and you still want to run in this kind of cheap and home-rolled way. One thing you could try is retrieve and re-rank. You do a lexical search first, get those results, and then only do the semantic math on, say, the top 25 or top 100 results. So you try to reduce how much of that cosine similarity math you have to do on all these big vectors.
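That retrieve-and-re-rank idea can be sketched in a few lines; here the candidate list stands in for the output of a cheap lexical search, and random vectors stand in for real embeddings:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend corpus: too many documents to score semantically one by one...
embeddings = {doc_id: [random.random() for _ in range(384)] for doc_id in range(1000)}
query_vec = [random.random() for _ in range(384)]

# ...so a cheap lexical search first narrows things to a small candidate set,
candidates = [3, 41, 97, 250, 512]  # stand-in for lexical search results

# ...and the expensive cosine math only runs on those few candidates.
reranked = sorted(
    candidates, key=lambda d: cosine(embeddings[d], query_vec), reverse=True
)
print(reranked[:3])
```

The work scales with the candidate count, not the corpus size, which is the whole point.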

One other little thing I always keep an eye on is the VSS extension for DuckDB. This is a way to build a vector index that makes all this semantic math faster, but right now it only turns on automatically for in-memory databases, because of some issues where, if it crashes, the index changes won't be stored in the write-ahead log. So anyway, I'm always keeping an eye on this, because in theory it can make this faster, so you don't have to do all that vector math on every single entry.

And then, since we're here at posit::conf, people might be wondering about R. Obviously DuckDB works well with R; we heard about that earlier. Llama CPP does not. There was a wrapper, but it hasn't been updated since 2023, which is before all of this BERT support came to Llama CPP. You could maybe use reticulate; I don't exactly know what the memory overhead of all that would be. There is also a command-line utility that just runs on the Mac or, presumably, the Windows command line, so you could potentially call out to that and get some structured data back. But ideally, someone in here, or someone somewhere, will fork that old Llama R wrapper and bring it into the modern age, which would be great. So if someone wants to do that, please do.

And so here is a link to my talk notes on my blog. It has links to all of the documentation; the Sentence Transformers documentation is really good if you want to know more about semantic search, with great resources and a bunch of interesting tutorials. It has example code, and it will actually walk you through making your own little search function; I made a CSV file that has all the data in there, so it just pulls all of that in and lets you play with it. And then links to everything you could want. So with that, I think I have time for a few questions.

Q&A

Thank you. So we've got a few questions. Why does Llama CPP use so much less RAM for the same model?

I think it's mainly that it's written to do just this one thing, right? Sentence Transformers needs Python, and Python can do a lot more than one thing. And the dependencies that Sentence Transformers needs are things like PyTorch and these big, general-purpose tensor math libraries. Llama CPP can strip away all of that, and it's written directly in C++ as opposed to being written in Python. So I think that's the answer.

For the emoji search, was there more difficulty with emojis that were face-like versus items?

Well, I encode the description of the emoji, the text. There's a Python package called emoji, pip install emoji, and inside of there, there is basically an emoji data dictionary, and I turned that into a pandas data frame; that's how I got the descriptions. So I'm not actually encoding the emoji, I'm encoding the string. And yeah, some of them are weird, some of them are very specific. The descriptions of these emojis are sometimes more helpful than others, and the ones with long and not particularly helpful, very specific descriptions are hard to find semantically. Some emojis you're better off searching for in another way, not semantically. The model's going to do what it's going to do; some things it returns easily and some things it doesn't. Like turtle and slow; I can't get that to work.

Semantic search seems to be more prevalent than ever with the rise of LLMs and RAG. Is cosine similarity still the best way to do this? Have there been any interesting innovations?

That is a very good question. I think it really depends, and I am not an expert on language models, but I think it really depends on how they build the model. Some of these models are constructed in a way where cosine similarity is the best tool to use, and some are actually optimized for dot product instead. So it really depends on how they built the model. I think cosine similarity is still reasonable for most of them, but it does depend on how they construct the model and the vectors that come out of it.

All right, thank you. Let's thank Marcos again.