Resources

Semantic Search for the Rest of Us with DuckDB (Marcos Huerta, Carmax) | posit::conf(2025)

Speaker(s): Marcos Huerta

Abstract: Semantic search matches a search query against documents not just by directly matching words, but by ranking results using semantic meaning - for example, searching for "cat" could also turn up documents that contain "tiger." While many enterprise-level solutions exist for semantic search, I will show a low-memory technique in Python that enables live semantic search for small to medium document collections that need to be deployed in constrained memory/CPU environments - such as Posit Connect Cloud or a small cloud server. I will show how a web app (such as Shiny for Python) using just a few hundred megabytes of memory can enable live search using the fixed array support in DuckDB and the open source library LlamaCPP.

posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/


Transcript

This transcript was generated automatically and may contain errors.

Semantic Search for the Rest of Us. So, what is semantic search? Well, semantic search matches meaning, not just words. With a traditional text search, kind of exact string matching, you search for quick and you get documents that have the word quick. Or maybe you can use stemming and search for quickly and get documents that contain quick. Whereas with semantic search, you can search for fast and get documents that contain quick. It can match synonyms; it can match the semantic meaning.

So, I'm going to talk about this in the context of a kind of goofy web app that I made called Semantic Emoji Search. It was a goal that I had; I was nerd-sniped into making this app a couple of years ago. The idea was you search for launch and you get the emoji for rocket, right? You can search for spicy and get a pepper. You can search for frozen and get ice or snow. So, this was the goal I had. This is what I wanted to build.

And so, the goal was a website that looks like this. And spoiler alert, I did build something; I did show this off two years ago. But the caveat here is that, for me, this is kind of a goofy side project. I don't have access to enterprise-caliber tools, and I'm not going to go buy a license for Atlas or some other commercial semantic search. There are plenty of commercial products that will sell you semantic search, and if you're trying to deploy this for a big corporate website, you can go buy one of those.

So, I'm trying to deploy this on my own personal website with a relatively small budget and my existing resources. I want to use open source tools; I want to do this in an open source way. And I thought I knew how to do this. I thought, I can do this, I know what I need to do, there's a Python package. And that's the other caveat: I will be talking mostly about Python.

Sentence transformers and embeddings

So, there's a Python package called Sentence Transformers. This is a package that uses basically medium to small language models, not the big language models we hear a lot about, but the smaller ones. And it can create embeddings, or vector representations, of the words or sentences that you have. So, I thought, I'll use this package and I'll take all of the descriptions for the emojis. That little purple guy there is smiling face with horns, so I'll make an embedding of "smiling face with horns" and of all the other emoji descriptions. Then I'll encode the search term, whatever it is, into an embedding vector as well, and I'll just do the math to compute the vector distance between all of these vectors. And that'll work.

And a little bit more about embeddings, if you're not familiar with them. The large language models can do this as well, but again, I'm using these kind of small models. The idea is that these models are trained on all of this data out there on the Internet. So, when they turn mathematics or statistics into this vector, these 384 numbers, basically a high-dimensional vector, they understand that mathematics and statistics, projected into 2D space here, are obviously kind of close to each other. And so, you can do distance calculations, usually cosine distance. And tiger and lion are similar to each other. So, that's the idea.
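The distance math itself is small. A minimal sketch of cosine similarity, using made-up 2D vectors purely for illustration (real embeddings would be 384-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2D "embeddings" just to show the idea.
mathematics = [0.9, 0.1]
statistics = [0.8, 0.2]
tiger = [0.1, 0.9]

# Nearby concepts score high; unrelated ones score low.
print(cosine_similarity(mathematics, statistics))  # close to 1
print(cosine_similarity(mathematics, tiger))       # much lower
```

With real sentence embeddings the vectors come from a model rather than being hand-written, but the comparison step is exactly this.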

And so, the way a semantic search would work is that you take all of the emojis and throw them into this semantic space, this higher-dimensional space. If you project that down to two dimensions, you'll get something that looks like this, right? Putting emojis into semantic space; that's what this is. This has been projected into 2D with UMAP. And so, if you want to search for weather, you'll encode weather as a vector, and that'll land kind of here, and you'll get all the emojis near that part of the higher-dimensional semantic space.

The memory problem

So, that's the idea. I'm going to use Sentence Transformers. This should work, right? This is super simple; there's not much code in this talk, and I do have a QR code at the end with code examples and links to everything. This is what you would do to use Sentence Transformers: you import it, you make your model with one of the relatively small, pre-trained language models that come with Sentence Transformers, and then you take the word spicy and turn it into a 384-length vector, right?

The only problem here is that if you do this, you will use 1.45 gigabytes of memory just to run those three lines. And that's fine on this laptop, or even my computer at home, but I'm trying to deploy on a really small, 2-gigabyte-of-memory server in the cloud; I have a DigitalOcean droplet. So, this is not going to work for deployment. This is an out-of-memory problem. I'll burn all of the memory on my server just running this one little toy app.

That's why I don't want to do that. So, I found a workaround: I pre-computed everything. Two years ago, I showed off this app, and I was talking about the front end; I wrote it in Dash and Shiny and Streamlit. But the way it worked, I pre-computed everything. I found a list of the 10,000 most common words, I pre-computed all of the distances, and I threw it all into a SQLite database, and that was it. So, it worked for one word. If you typed in two words, nothing.

Enter Llama CPP and DuckDB

But several things happened in early 2024 that changed the game, at least for me, in terms of being able to do this with low resources and low memory. A software package called Llama CPP added support for BERT-style models, the kind of models that Sentence Transformers uses. Llama CPP was written to run large language models, like Llama, hence the name, but it added support for these models on February 8th, 2024. And there's a llama. Less than a week later, DuckDB added support for fixed-length arrays as a data type, so you could put an array inside a column of DuckDB. And there's a duck; that's not the official logo. So these two things combined were going to let me do what I had wanted to do two years ago, except this time I was right that they would let me do it.

So Llama CPP, as I mentioned earlier, is software written in C++ (that's what CPP stands for), and it's designed, again, to run these big models; it was written to run Llama, but without all of the overhead of Python. You don't need Python, you don't need PyTorch, you don't need all of the dependencies. It's just a C++ binary that runs language models.

So this is very similar code to what I showed you earlier. I'm making the model and, spoiler alert, in these three lines of code we encode one term with the model, the same model I was using earlier, but now in 120 megabytes of memory, not 1.45 gigabytes. So we're using about a tenth of the memory. This is now viable. I saw this, and I'm like, OK, I can now do what I wanted to do two years ago.

But I still have to encode all the emojis, and I still have to be able to do the math to compare the target, the thing that I encode, which is what I was just showing, against the whole database of emojis that I've created. And since DuckDB added support for fixed-length arrays as a data type, I can encode all of the emojis into these vectors, into these arrays, and then I can pump them into DuckDB, right? So this is a DBeaver screenshot. You have the emoji, and then that's not a single number there in that second column, that's an array. In DBeaver, if you click on the arrow, you'll see all the other numbers, right? Not that you want to look at them, but they're all there.

Cosine similarity in SQL

But it's not just that DuckDB supports the array as a data type; it has helper functions built in that let you do math on them, right? So here's a SQL query, and this is actually on legislation, not on emojis, but the key is that second line. We're selecting array_cosine_similarity; embedding is the name of one of the columns in the database, and the question mark is where, from Python, we're going to send in the new array that we encode, our search term. And the ::DOUBLE[384] just means treat this as a 384-element DOUBLE array, or vector. That is now your similarity score; it's doing all that vector math for you, and then you can sort by it, and this will give you the top 12 matches, in this case, to the bill summary.

So there's the question mark that I was talking about; that's where we're going to send in our new thing. This is the biggest chunk of code I will show you, and again, it's all linked at the end. Once I created the database, this is all I had to write to add live search to my previously one-word, pre-encoded website, right? So I'm going to go through this line by line. We do all the imports. We make our model with Llama CPP. We connect to DuckDB, and we connect read-only in case I run this with multiple processes: DuckDB can do concurrent reads, but it cannot do concurrent writes, so I set it to read-only in case I ever want to scale up. I cache the search function; this is a standard Python thing, so that if someone searches for the same thing, it will not do the math again, it will just return what it got the first time. We take our search term and encode it into a new vector. And then this little block of SQL, about four lines, with some joins in there because of how I set the database up, takes that new vector, does all the array cosine similarity math, and returns the top 25 results.

And so now you can search for Neil Armstrong on the emoji site, and because the models are trained on the Internet and they know Neil Armstrong is an astronaut, it will return the astronaut emoji, and you can search for multi-word phrases too. And what about the memory usage? Well, I have a bunch of apps running; actually, all three of these use semantic search, but as you can see, they're each using just a few hundred megabytes of memory, and that's with the web UI in front of them; these are all Dash apps. So all perfectly viable; they all live alongside each other on my site.

So this happened, and this worked, and I was very excited, so I'm like, well, where else can I use this? Are there other opportunities? And there was one more. I have a website called RecordedVote.org, which is about legislation in Virginia, the Virginia legislature. It's called RecordedVote because I'm analyzing how people voted on bills, but people also use it for search. There were two things I wanted to do. One was bill similarity: you have one piece of legislation, what are the 10 most similar bills to that bill? I could pre-compute all of that, and I did, before any of this DuckDB and Llama CPP support came out; I used the model to pre-compute the top 10 most similar bills. But pre-computing for bill search wasn't going to work, right? I mean, no one wants to search for solar; you want to search for solar power, or solar energy.

But now that I have this new toolkit, I'm like, well, I can take this and use it on this other project. So here's an example search, where you're searching for solar power regulation. The top three there are lexical matches, right? That's just SQLite FTS5 full-text search. But if there aren't enough results there, then it automatically does a semantic search and shows more matches. The classic example I usually give is that if you search for solar power, with lexical search you would not get bills that say solar energy, but with semantic search you can get both.
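The lexical half of that hybrid can be sketched with the standard library's SQLite bindings, assuming your SQLite build includes the FTS5 extension (most Python builds do); the bill summaries and threshold here are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table gives classic keyword (lexical) search over summaries.
con.execute("CREATE VIRTUAL TABLE bills USING fts5(summary)")
con.executemany(
    "INSERT INTO bills (summary) VALUES (?)",
    [
        ("A bill regulating solar power installations.",),
        ("A bill about solar energy tax credits.",),
        ("A bill on highway speed limits.",),
    ],
)

def lexical_search(query: str) -> list[str]:
    rows = con.execute(
        "SELECT summary FROM bills WHERE bills MATCH ? ORDER BY rank", (query,)
    ).fetchall()
    return [r[0] for r in rows]

hits = lexical_search('"solar power"')
print(hits)  # only the exact-phrase match; the "solar energy" bill is missed
if len(hits) < 3:
    # Too few lexical hits: this is where the DuckDB semantic search
    # from earlier would take over and surface the "solar energy" bill too.
    print("falling back to semantic search")
```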

Caveats and model selection

So I'm very excited that this was working well. There are some caveats. Llama CPP requires a special model format (GGUF), so you either have to convert a model into that format using some command-line tools in Python that come with the Llama CPP GitHub repository, or you can just go on Hugging Face and usually find a model that someone else has converted and uploaded for you. Sentence transformer models are suitable for small amounts of text; it's sentence transformers, not dissertation transformers, so you can't put in a big document. Think of an abstract, not a whole paper. And if you're deploying to something GitHub-based, where all your files live on GitHub, there's a slight chance you might hit the 100-megabyte file limit on GitHub if you have a big DuckDB database.

So the question you might be wondering about is what model to use.

When you look at the semantic search documentation, there are two kinds of searches. There's symmetric search, which is what the emoji engine really is, right? The things you're searching, these strings of text, are short, a couple of words, and the thing you're typing in is a couple of words, so that's symmetric. The legislation is asymmetric: I'm actually encoding the bill summaries as the target, so the thing you're searching for is two or three words, and the bill summary is a couple of hundred words or something. Different models are tuned for these different use cases, so I use two different models for the two cases. You don't need to memorize this, again, it's all at the end, but I use all-MiniLM for symmetric search, and then MS MARCO, which is fine-tuned on a question-answer pair dataset, so it's better, presumably, for the asymmetric search. These are both original, pre-trained sentence transformer models.

There are a bunch of models out there; you can go down a real rabbit hole trying other ones, and I have tried a lot of them. I really want the emoji search to return turtle for slow, and it won't. It's never on there. So I keep trying more models to see if one of them will find it, and none of them do. Snail is always, like, eighth, and turtle never shows up. It's very frustrating.

Speed and performance

So you might also be wondering about the speed. I tend to use Sentence Transformers when I'm actually building the database, and then I only use Llama CPP at the end, for the live encoding step, where that memory advantage matters.

So I have a decent new Mac mini; it's like the base model M4 Pro. It'll encode 20,000 of those bill summaries in just 26 seconds. It does use the GPU, and Sentence Transformers will automatically use the GPU on a Mac, which is very convenient. So if you have an older Mac, or you're trying to do this CPU-only, maybe it'd take a couple of minutes, but it's viable. Obviously, if you have a million documents, it might take you a little longer, and you might want to make sure you have access to a GPU. And then the live encoding, which, again, I'm doing in this small little droplet on DigitalOcean, that's the speed I really cared about, because that's the responsiveness of the website, and that's just 150 or 160 milliseconds. That's just the encoding of the search term. So that is quite reasonable, and if you play around with the emoji site, which I'll send out later, assuming you don't all hit it at the same time, you'll find it pretty responsive.

Advanced topics and R support

All right, so one advanced topic: retrieve and re-rank. Maybe you do have a ton of documents, like a million or something, and you still want to run in this kind of cheap and home-rolled way. One thing you could try is retrieve and re-rank. You do a lexical search first, get those results, and then only do the semantic math on, say, the top 25 or top 100 results. So you try to reduce how much of that cosine similarity math you have to do on all these big vectors.
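That retrieve-and-re-rank idea can be sketched in a few lines; here the candidate list stands in for the output of a cheap lexical search, and random vectors stand in for real embeddings:

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Pretend corpus: too many documents to score semantically one by one...
embeddings = {doc_id: [random.random() for _ in range(384)] for doc_id in range(1000)}
query_vec = [random.random() for _ in range(384)]

# ...so a cheap lexical search first narrows things to a small candidate set,
candidates = [3, 41, 97, 250, 512]  # stand-in for lexical search results

# ...and the expensive cosine math only runs on those few candidates.
reranked = sorted(
    candidates, key=lambda d: cosine(embeddings[d], query_vec), reverse=True
)
print(reranked[:3])
```

The work scales with the candidate count, not the corpus size, which is the whole point.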

One other little thing I always keep an eye on is the VSS extension for DuckDB. This is a way to build a vector index that makes all this semantic math faster, but right now it only turns on automatically for in-memory databases, because of some issues where, if it crashes, the index changes won't be stored in the write-ahead log. So anyway, I'm always keeping an eye on this, because in theory it can make this faster, so you don't have to do all that vector math on every single entry.

And then, since we're here at posit::conf, people might be wondering about R. Obviously DuckDB works well with R; we heard about that earlier. Llama CPP does not. There was a wrapper, but it hasn't been updated since 2023, which is before all of this BERT support came to Llama CPP. You could maybe use reticulate; I don't exactly know what the memory overhead of all that would be. There is also a command-line utility that just runs on the Mac or, presumably, the Windows command line, so you could potentially call out to that and get some structured data back. But ideally, someone in here, or someone somewhere, will fork that old Llama R wrapper and bring it into the modern age, which would be great. So if someone wants to do that, please do.

And so here is a link to my talk notes on my blog. It has links to all of the documentation; the Sentence Transformers documentation is really good if you want to know more about semantic search, with great resources and a bunch of interesting tutorials. It has example code, and it will actually walk you through making your own little search function; I made a CSV file that has all the data in there, so it just pulls all of that in and lets you play with it. And then links to everything you could want. So with that, I think I have time for a few questions.

Q&A

Thank you. So we've got a few questions. Why does Llama CPP use so much less RAM for the same model?

I think it's mainly that it's written to do just this one thing, right? Sentence Transformers needs Python, and Python can do a lot more than one thing. And the dependencies that Sentence Transformers needs are things like PyTorch and these big, general-purpose tensor math libraries. Llama CPP can strip away all of that, and it's written directly in C++ as opposed to being written in Python. So I think that's the answer.

For the emoji search, was there more difficulty with emojis that were face-like versus items?

Well, I encode the description of the emoji, the text. There's a Python package called emoji, pip install emoji, and inside of there, there is basically an emoji data dictionary, and I turned that into a pandas data frame; that's how I got the descriptions. So I'm not actually encoding the emoji, I'm encoding the string. And yeah, some of them are weird, some of them are very specific. The descriptions of these emojis are sometimes more helpful than others, and the ones with long and not particularly helpful, very specific descriptions are hard to find semantically. Some emojis you're better off searching for in another way, not semantically. The model's going to do what it's going to do; some things it returns easily and some things it doesn't. Like turtle and slow; I can't get that to work.

Semantic search seems to be more prevalent than ever with the rise of LLMs and RAG. Is cosine similarity still the best way to do this? Have there been any interesting innovations?

That is a very good question. I think it really depends, and I am not an expert on language models, but I think it really depends on how they build the model. Some of these models are constructed in a way where cosine similarity is the best tool to use, and some are actually optimized for dot product instead. So it really depends on how they built the model. I think cosine similarity is still reasonable for most of them, but it does depend on how they construct the model and the vectors that come out of it.

All right, thank you. Let's thank Marcos again.