A hacker's guide to open source LLMs - posit::conf(2023)

Transcript

This transcript was generated automatically and may contain errors.

So let's start by talking about what a language model is. As you might have heard before, a language model is something that knows how to predict the next word of a sentence or knows how to fill in the missing words of a sentence.

We can look at an example of one. OpenAI has a language model, text-davinci-003. We can play with it by passing in some words and asking it to predict what the next words might be. There's a nice site called nat.dev, which lets us play with a variety of language models. Here I've selected text-davinci-003, and I'll hit submit and it starts printing stuff out.

The pandas were happily playing and eating the frogs that had fallen from the sky. It was an amazing sight to see these animals taking advantage of such a unique opportunity. The staff took quick measures to ensure the safety of the pandas and the frogs. So there you go, that's what happened after the extraordinary rain of live frogs at the panda breeding facility.

Tokens and tokenization

Now you might notice here it hasn't predicted "pandas", it's predicted " pand". And then separately, "as". OK, after " pand" it's going to be "as". So it's not always a whole word. Here it's " un" and then "harmed". So you can see that it's not always predicting words. Specifically, what it's doing is predicting tokens. Tokens are either whole words or sub-word units, pieces of a word. Or it could even be punctuation or numbers and so forth.

So let's have a look at how that works. Creating tokens from a string is called tokenization. We can use the same tokenizer that GPT uses by using tiktoken. And we can specifically say we want to use the same tokenizer that that model, text-davinci-003, uses. And so, for example, when I earlier tried this, it talked about the frogs splashing. And the result of encoding that is a bunch of numbers. And what those numbers are, they're basically just lookups into a vocabulary that OpenAI, in this case, created.

And if I then decode those, it says, oh, these numbers are "They", " are", " spl", "ashing". And so put that all together: "They are splashing". So you can see that the start of a word, with a space before it, is also being encoded here.
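To make the lookup idea concrete, here is a minimal sketch of a tokenizer that maps text to integers and back. It is a toy: the vocabulary is made up for illustration and the greedy longest-match rule is a simplification, not the byte-pair encoding that tiktoken actually uses.

```python
import string

# Hypothetical vocabulary: a few multi-character pieces, plus every printable
# ASCII character as a fallback. Real vocabularies have tens of thousands of entries.
PIECES = ["They", " are", " spl", "ashing", " pand", "as"]
VOCAB = PIECES + list(string.printable)
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text):
    """Greedy longest match: at each position take the longest vocabulary entry.

    Only handles printable ASCII; anything else has no fallback entry.
    """
    ids, pos = [], 0
    while pos < len(text):
        match = max((tok for tok in VOCAB if text.startswith(tok, pos)), key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Look each integer up in the vocabulary and join the pieces."""
    return "".join(VOCAB[i] for i in ids)


ids = encode("They are splashing")
print(ids)                      # integers that index into VOCAB
print([VOCAB[i] for i in ids])  # ['They', ' are', ' spl', 'ashing']
print(decode(ids))              # They are splashing
```

Note that the leading space is part of the token, which is why " are" and "are" would be different vocabulary entries.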

How language models are trained: ULMFiT and fine-tuning

So these language models are quite neat, in that they can work at all. But they're not, of themselves, really designed to do anything.

The basic idea of what ChatGPT, GPT-4, Bard, et cetera, are doing comes from a paper which describes an algorithm that I created back in 2017 called ULMFiT. Sebastian Ruder and I wrote a paper describing the ULMFiT approach, which was the one that basically laid out what everybody's doing, how this system works. And the system has three steps. Step one is language model training. We actually described it as pre-training.

Now, language model pre-training is the thing which predicts the next word of a sentence. And so in the original ULMFiT paper (the algorithm I developed in 2017, and then Sebastian Ruder and I wrote it up in early 2018), what I originally did was train this language model on Wikipedia. Now, what that meant is I took a neural network, and a neural network is just a function. If you don't know what it is, it's just a mathematical function that's extremely flexible, and it's got lots and lots of parameters. And initially, it can't do anything. But using stochastic gradient descent, or SGD, you can teach it to do almost anything if you give it examples.

And so I gave it lots of examples of sentences from Wikipedia. So, for example, from the Wikipedia article for The Birds: "The Birds is a 1963 American natural horror-thriller film produced and directed by Alfred", and then it would stop. And so then the model would have to guess what the next word is. And if it guessed "Hitchcock", it would be rewarded. And if it guessed something else, it would be penalized. And effectively, it's trying to maximize those rewards. It's trying to find a set of weights for this function that makes it more likely that it would predict "Hitchcock".

To do a good job of solving this problem as well as possible, of guessing the next word of sentences, the neural network is going to have to learn a lot of stuff about the world. It's going to learn that there are things called objects, that there's a thing called time, that objects react to each other over time. That there are things called movies, that movies have directors, that there are people, that people have names and so forth. And that a movie director is Alfred Hitchcock and he directed horror films and so on and so forth. It's going to have to learn an extraordinary amount if it's going to do a really good job of predicting the next word of sentences.

Now, these neural networks specifically are deep neural networks. This is deep learning. When I created this, I think it had something like 100 million parameters; nowadays, they have billions of parameters. These deep neural networks have the ability to create a rich hierarchy of abstractions and representations, which they can build on.

And so this is really the key idea behind neural networks and language models. It's that if it's going to do a good job of being able to predict the next word of any sentence in any situation, it's going to have to know an awful lot about the world. It's going to have to know about how to solve math questions or figure out the next move in a chess game or recognize poetry and so on and so forth.

So the key idea here for me is that this is a form of compression. And this idea of the relationship between compression and intelligence goes back many, many decades. And the basic idea is that, yeah, if you can guess what words are coming up next, then effectively you're compressing all that information down into a neural network.
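One way to see the link to compression, using standard information theory rather than anything specific to this talk: a token predicted with probability p can ideally be stored in -log2(p) bits, so a better predictor needs fewer bits. The probabilities below are made up for illustration.

```python
import math


def bits_to_encode(probabilities):
    """Ideal code length, in bits, given the probability the model gave each actual token."""
    return sum(-math.log2(p) for p in probabilities)


# Hypothetical numbers: a 4-token sequence and a 50,000-token vocabulary.
vocab_size = 50_000
uniform = [1 / vocab_size] * 4     # a model that knows nothing
informed = [0.5, 0.9, 0.25, 0.99]  # a model that usually guesses well

print(round(bits_to_encode(uniform), 1))   # many bits
print(round(bits_to_encode(informed), 1))  # far fewer bits
```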

Now, I said this is not useful of itself. Well, why do we do it? Well, we do it because we want to pull out those capabilities. And the way we pull out those capabilities is we take two more steps. The second step is we do something called language model fine-tuning. And in language model fine-tuning, we are no longer just giving it all of Wikipedia. (Nowadays, in fact, we don't just give it all of Wikipedia in pre-training; a large chunk of the Internet is fed to these models.) In the fine-tuning stage, we feed it a set of documents a lot closer to the final task that we want the model to do. But it's still the same basic idea. It's still trying to predict the next word of a sentence.

After that, we then do a final classifier fine-tuning. And the classifier fine-tuning is the kind of end task we're trying to get it to do. Now, nowadays, very specific approaches are taken for these two steps. For step two, the language model fine-tuning, people nowadays do a particular kind called instruction tuning. The idea is that the task we most of the time want to achieve is to solve problems and answer questions. And so in the instruction tuning phase, we use data sets like this one. This is a great data set called OpenOrca, created by a fantastic open source group. And it's built on top of something called the Flan collection.

And you can see that basically there's all kinds of different questions in here. So there's four gigabytes of questions. Here are some examples of instructions. I think this is from the Flan data set, if I remember correctly. So, for instance, it could be: does the sentence "In the Iron Age" answer the question "The period of time from 1200 to 1000 BCE is known as what?" Choices: 1. yes, or 2. no. And then the language model is meant to write 1 or 2 as appropriate for yes or no. So it's still doing language modeling. So fine-tuning and pre-training are kind of the same thing. But this is more targeted now, not just to be able to fill in the missing parts of any document from the Internet, but to fill in the words necessary to answer questions, to do useful things.

OK, so that's instruction tuning. And then step three, which is the classifier fine-tuning. Nowadays, there are various approaches, such as reinforcement learning from human feedback and others, which basically give humans, or sometimes more advanced models, multiple answers to a question. Here are some from a reinforcement learning from human feedback paper: "List five ideas for how to regain enthusiasm for my career." And so the model will spit out two possible answers, or there will be a less good model and a better model. And then a human or a better model will pick which is best. And so that's used for the final fine-tuning stage.
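How the "pick which is best" signal becomes something trainable is not spelled out here. One common formulation, assumed for this sketch, trains a reward model with a pairwise loss: it gives each answer a score, and the loss is small when the preferred answer scores higher than the rejected one.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(chosen - rejected)), written as log(1 + exp(-margin))."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


# Hypothetical reward-model scores for two answers to the same prompt.
print(round(preference_loss(2.0, 0.0), 3))  # preferred answer scored higher: low loss
print(round(preference_loss(0.0, 2.0), 3))  # preferred answer scored lower: high loss
```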

OK, so we have built our own code interpreter from scratch. I think that's pretty amazing.

Running open source models locally

So that is some of the stuff you can do with OpenAI. What about stuff that you can do on your own computer? For that, what we're going to be using is a library called Transformers from HuggingFace. And the reason for that is that basically people upload lots of pre-trained models or fine-tuned models up to the HuggingFace hub. And in fact, there's even a leaderboard where you can see which are the best models.

All right, so you can find models to try out from things like this leaderboard. And there's also a really great leaderboard called FastEval, which I like a lot, because it focuses on some more sophisticated evaluation methods, such as this chain of thought evaluation method. So I kind of trust these a little bit more. And these are also, you know, GSM8K is a difficult math benchmark, BIG-Bench Hard, and so forth. So, yeah, so, you know, StableBeluga2, WizardMath-13B, Dolphin-Llama-13B, etc. These would all be good options.

Yeah, so you need to pick a model. And at the moment, nearly all the good models are based on Meta's Llama 2. So when I say based on, what does that mean? Well, what that means is this model here, Llama2-7B. So it's a Llama model. That's just the name Meta called it. This is their version 2 of Llama. This is their 7 billion size one. It's the smallest one that they make. And specifically, these weights have been created for HuggingFace, so you can load it with HuggingFace Transformers. And this model has only got as far as here. It's done the language model pre-training. It's done none of the instruction tuning and none of the RLHF. So we would need to fine-tune it to really get it to do much that's useful.

So we can just say, OK, automatically create the appropriate model for language modeling. So "causal LM" basically refers to that ULMFiT stage 1 process, or stage 2, in fact. So get the pre-trained model from this name, meta-llama, Llama 2, etc. OK. Now, generally speaking, we use 16-bit floating point numbers nowadays. But if you think about it, 16 bits is 2 bytes. So 7B times 2, it's going to be 14 gigabytes just to load in the weights. So you've got to have a decent GPU to be able to do that. Perhaps surprisingly, you can actually just cast it to 8-bit and it still works pretty well, thanks to something called discretization.
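The arithmetic above generalizes to other model sizes and precisions. A small sketch (the function name is just for illustration, and this counts the weights only, not activations or other runtime overhead):

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 7B parameters at 16-bit: 14.0 GB
# 7B parameters at 8-bit: 7.0 GB
# 7B parameters at 4-bit: 3.5 GB
```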

So remember, this is just a language model that can only complete sentences. We can't ask it a question and expect a great answer. So let's just give it the start of a sentence: "Jeremy Howard is a". And so we need the right tokenizer, so this will automatically create the right kind of tokenizer for this model. We can grab the tokens as PyTorch. Here they are. And just to confirm, if we decode them back again, we get back the original plus a special token to say this is the start of a document. And so we can now call generate. So generate works autoregressively: it calls the model again and again, passing its previous result back as the next input. And I'm just going to do that 15 times.

So you can write this for loop yourself. This isn't doing anything fancy. In fact, I would recommend writing this yourself to make sure that you know how it all works. OK. We have to put those tokens on the GPU. And at the end, I recommend putting the result back onto the CPU. And here are the tokens. Not very interesting. So we have to decode them using the tokenizer. And so the first 25, sorry, first 15 tokens are: "Jeremy Howard is a 28 year old Australian AI researcher and entrepreneur". OK, well, 28 years old is not exactly correct, but we'll call it close enough. I like that. Thank you very much, Llama 7B.
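As a sketch of what that for loop looks like, here is the autoregressive pattern with a toy stand-in for the model. The lookup table is hypothetical; a real model returns scores over the whole vocabulary and you would pick (or sample) the next token from those.

```python
# Stand-in "model": given the tokens so far, return the next token.
NEXT_WORD = {"Jeremy": "Howard", "Howard": "is", "is": "a", "a": "researcher"}


def next_token(tokens):
    return NEXT_WORD.get(tokens[-1], "<eos>")


def generate(tokens, max_new_tokens):
    tokens = list(tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # one call to the model
        if tok == "<eos>":        # the model signalled that it is finished
            break
        tokens.append(tok)        # feed the result back in as the next input
    return tokens


print(" ".join(generate(["Jeremy", "Howard", "is"], max_new_tokens=15)))
# Jeremy Howard is a researcher
```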

So, OK, so we've got a language model completing sentences. It took one and a third seconds. And that's a bit slower than it could be because we used 8-bit. If we use 16-bit, there's a special thing called bfloat16, which is a really great 16-bit floating point format that's usable on any somewhat recent NVIDIA GPU. Now, if we use it, it's going to take twice as much RAM as we discussed. But look at the time. It's come down to 390 milliseconds.

Now, there is a better option still than even that. There's a different kind of discretization called GPTQ, where a model is carefully optimized to work with 4 or 8 or other, you know, lower precision data automatically. And this particular person, known as The Bloke, is fantastic at taking popular models, running that optimization process, and then uploading the results back to HuggingFace. So we can use this GPTQ version. And internally, I'm not sure exactly how many bits this particular one is going to use. I think it's probably going to be 4 bits. But it's going to be much more optimized. And so look at this. 270 milliseconds. It's actually faster than 16-bit, even though internally it's actually casting it up to 16-bit for each layer to do it. And that's because there's a lot less memory moving around. And to confirm, in fact, what we can even do now is we can go up to 13B. Easy. And in fact, it's still faster than the 7B, now that we're using the GPTQ version. So this is a really helpful tip.
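GPTQ itself is more involved than a short example can show. As a rough illustration of the underlying idea only (store low-precision integers, cast back up when computing), here is plain round-to-nearest 8-bit quantization with a single shared scale. This is not GPTQ, and the weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    largest = max(abs(w) for w in weights)
    scale = largest / 127 if largest else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Cast back up to floats before doing arithmetic with them."""
    return [v * scale for v in quantized]


weights = [0.12, -0.5, 0.03, 0.49]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # small integers, one byte each
print(max_error)  # at most half a quantization step
```

The stored integers take a quarter of the space of 32-bit floats (half that of 16-bit), which is where the reduction in memory traffic comes from.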

So let's put all those things together. The tokenizer, the generate, the batch decode. We'll call this gen for generate. And so we can now use the 13B GPTQ model. And let's try this. Jeremy Howard is a, so it's got to 50 tokens, so fast. 16-year veteran of Silicon Valley. Co-founder of Kaggle, a marketplace for predictive model. His company, Kaggle.com, has become a data science competition. I don't know what I was going to say. But anyway, it's on the right track. I was actually there for 10 years, not 16. But that's all right.
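A minimal sketch of such a helper with HuggingFace Transformers might look like the following. The function names and defaults are assumptions for illustration, not the code used in the talk; `device_map="auto"` additionally assumes the `accelerate` package is installed, and loading GPTQ checkpoints needs extra packages that are not shown.

```python
def load(model_name):
    """Load a causal language model and its tokenizer in bfloat16."""
    # Imported here so the sketch can be read without the libraries installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )
    return model, tokenizer


def gen(model, tokenizer, prompt, max_new_tokens=50):
    """Tokenize the prompt, generate, and decode the result in one call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # tokens onto the GPU
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Bring the result back to the CPU and turn the token ids into text.
    return tokenizer.batch_decode(output_ids.cpu(), skip_special_tokens=True)[0]


# Example use (downloads several gigabytes; the Llama 2 weights also require
# accepting Meta's licence on the HuggingFace hub):
# model, tokenizer = load("meta-llama/Llama-2-7b-hf")
# print(gen(model, tokenizer, "Jeremy Howard is a"))
```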

So this is looking good. But probably a lot of the time we're going to be interested in, you know, asking questions or using instructions. So Stability AI has this nice series called Stable Beluga, including a small 7B one and other bigger ones. And these are all based on Llama 2, but these have been instruction tuned. They might even have been RLHFed. I can't remember now. So we can create a Stable Beluga model.

And now something really important that I keep forgetting, everybody keeps forgetting, is that during the instruction tuning process, the instructions that are passed in don't just appear like this. They actually always are in a particular format. And the format, believe it or not, changes quite a bit from fine-tune to fine-tune. And so you have to go to the web page for the model and scroll down to find out what the prompt format is. So here's the prompt format. So I generally just copy it. And then I paste it into Python. And I created a function called make prompt that uses the exact same format that it said to use.

And so now if I want to say, who is Jeremy Howard, I can call gen again. That was that function I created up here. And make the correct prompt from that question. And then it returns back. Okay, so you can see here all this prefix. This is the system instruction. This is my question. And then the assistant says, Jeremy Howard is an Australian entrepreneur, computer scientist, co-founder of machine learning and deep learning company Fast.ai. Okay, this one's actually all correct. So it's getting better by using an actual instruction-tuned model.
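A prompt-building function of this kind can be sketched as below. The "### System / ### User / ### Assistant" layout and the system text are assumptions for illustration; always copy the exact format from the model's own page, since it differs between fine-tunes.

```python
# Assumed system text; replace with whatever the model card recommends.
SYSTEM = "You are a helpful assistant."


def make_prompt(question, system=SYSTEM):
    """Wrap a question in the (assumed) instruction format the model was tuned on."""
    return f"### System:\n{system}\n\n### User:\n{question}\n\n### Assistant:\n"


print(make_prompt("Who is Jeremy Howard?"))
```

The prompt ends right after the assistant marker, so the model's completion is the answer.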

And so we could then start to scale up. So we could use the 13b. And in fact, we looked briefly at this OpenOrca data set earlier. So Llama2 has been fine-tuned on OpenOrca. And then also fine-tuned on another really great data set called Platypus. And so the whole thing together is the OpenOrca Platypus. And then this is going to be the bigger 13b. GPTQ means it's going to be quantized. So that's got a different format. Okay, a different prompt format. So again, we can scroll down and see what the prompt format is.

We can create a function called makeOpenOrcaPrompt that has that prompt format. And so now we can say, okay, who is Jeremy Howard? And now I've become British, which is kind of true. I was born in England, but I moved to Australia. Professional poker player? Definitely not that. Co-founding several companies, including Fast.ai. Also Kaggle. Okay, so not bad. It was acquired by Google. Was it 2017? Probably something around there. Okay. So you can see we've got our own models giving us some pretty good information.

Retrieval augmented generation

How do we make it even better? You know, because it's still hallucinating, you know. And, you know, Llama2, I think, has been trained with more up-to-date information than GPT-4. It doesn't have the September 2021 cutoff. But, you know, it's still got a knowledge cutoff. You know, we would like to be able to use the most up-to-date information. We want to use the right information to answer these questions as well as possible. So to do this, we can use something called Retrieval Augmented Generation.

So what happens with Retrieval Augmented Generation is when we take the question we've been asked, like, who is Jeremy Howard? And then we say, okay, let's try and search for documents that may help us answer that question. So obviously, we would expect, for example, Wikipedia to be useful. And then what we do is we say, okay, with that information, let's now see if we can tell the language model about what we found and then have it answer the question.

So let's actually grab a Wikipedia Python package. We will scrape Wikipedia, grabbing the Jeremy Howard web page. And so here's the start of the Jeremy Howard Wikipedia page. It has 613 words. Now, generally speaking, these open source models will have a context length of about 2000 or 4000 tokens. So the context length is how many tokens it can handle. So that's fine. It'll be able to handle this web page. And what we're going to do is we're going to ask it the question. So we're going to have here "question", followed by the question. But before it, we're going to say "answer the question with the help of the context". We're going to provide this to the language model. And we're going to say "context", and then we're going to have the whole web page. So suddenly now our prompt is going to be a lot bigger. Right. So our prompt now contains the entire web page, the whole Wikipedia page, followed by a question.
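The prompt described above can be sketched as a small function. The exact wording and section markers are assumptions, and the length check is only a rough rule of thumb (the tokens-per-word ratio is a guess); counting tokens with the model's own tokenizer is the reliable way.

```python
def make_rag_prompt(question, context):
    """Put the retrieved text first, then the question it should help answer."""
    return (
        "Answer the question with the help of the provided context.\n\n"
        f"## Context\n\n{context}\n\n"
        f"## Question\n\n{question}"
    )


def fits_context(prompt, context_length=2000, tokens_per_word=1.5):
    """Very rough estimate of whether the prompt fits in the context window."""
    return len(prompt.split()) * tokens_per_word <= context_length


page = "word " * 613  # stand-in for the 613-word page text
prompt = make_rag_prompt("Who is Jeremy Howard?", page)
print(fits_context(prompt))
```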

And so now it says, Jeremy Howard is an Australian data scientist, entrepreneur and educator, known for his work in deep learning, co-founder of Fast.ai, teaches courses, develops software, conducts research, used to be. Yeah. OK. It's perfect. Right. So it's actually done a really good job. Like if somebody asked me to send them a, you know, 100 word bio, that would actually probably be better than I would have written myself. And you'll see, even though I asked for 300 tokens, it actually sent back the end-of-stream token. And so it knows to stop at this point.

Well, that's all very well, but how do we know to pass in the Jeremy Howard Wikipedia page? Well, the way we know which Wikipedia page to pass in is that we can use another model to tell us which web page or which document is the most useful for answering a question. And the way we do that is we can use something called sentence transformer and we can use a special kind of model that's specifically designed to take a document and turn it into a bunch of activations where two documents that are similar will have similar activations.

What I'm going to do is I'm going to grab just the first paragraph of my Wikipedia page and I'm going to grab the first paragraph of Tony Blair's Wikipedia page. OK, so we're pretty different people. Right. This is just like a really simple, small example. And I'm going to then call this model, I'm going to say encode, and I'm going to encode my Wikipedia first paragraph, Tony Blair's first paragraph, and the question, which was, who is Jeremy Howard? And it's going to pass back a 384 long vector of embeddings for the question, for me, and for Tony Blair. And what I can now do is I can calculate the similarity between the question and the Jeremy Howard Wikipedia page. And I can also do it for the question versus the Tony Blair Wikipedia page. And as you can see, it's higher for me. And so that tells you that if you're trying to figure out what document to use to help you answer this question, better off using the Jeremy Howard Wikipedia page than the Tony Blair Wikipedia page.
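The similarity step can be sketched with cosine similarity, a common choice for comparing embeddings (assumed here). The vectors below are made-up 3-dimensional stand-ins; a real embedding model would return 384 numbers for each text.

```python
import math


def cosine_similarity(a, b):
    """Close to 1.0 when two vectors point the same way, near 0.0 when unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for the question and two candidate documents.
question = [0.9, 0.1, 0.2]
docs = {
    "Jeremy Howard": [0.8, 0.2, 0.1],
    "Tony Blair": [0.1, 0.9, 0.3],
}

scores = {name: cosine_similarity(question, vec) for name, vec in docs.items()}
best = max(scores, key=scores.get)
print(best)  # Jeremy Howard
```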

So if you had a few hundred documents you were thinking of using to give back to the model as context to help it answer a question, you could literally just pass them all through to encode, go through each one, one at a time, and see which is closest. When you've got thousands or millions of documents, you can use something called a vector database, where basically as a one-off thing, you go through and you encode all of your documents.
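To show the shape of that encode-once, search-many pattern, here is a brute-force in-memory stand-in. Real vector databases avoid scanning every document by using approximate nearest-neighbor indexes; the class, the document names, and the vectors here are all hypothetical.

```python
import heapq
import math


class VectorIndex:
    """Brute-force stand-in for a vector database."""

    def __init__(self):
        self._items = []  # (doc_id, unit-length vector)

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def add(self, doc_id, vec):
        # Done once per document, ahead of time.
        self._items.append((doc_id, self._normalize(vec)))

    def search(self, query, k=1):
        # With unit vectors, the dot product equals the cosine similarity.
        q = self._normalize(query)
        scored = (
            (sum(a * b for a, b in zip(q, v)), doc_id) for doc_id, v in self._items
        )
        return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


index = VectorIndex()
index.add("ulmfit.pdf", [0.9, 0.1, 0.0])
index.add("other.pdf", [0.0, 0.2, 0.9])
print(index.search([1.0, 0.0, 0.1], k=1))  # ['ulmfit.pdf']
```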

And so in fact, there's lots of pre-built systems for this. Here's an example of one called h2oGPT. And this is just something that I've got running here on my computer. It's just an open source thing written in Python, sitting here running on port 7860. And so I've just gone to localhost 7860. And what I did was I just clicked upload, and uploaded a bunch of papers. And so, you know, we could look at ... Can we search? Yeah, I can. So for example, we can look at the ULMFiT paper that Sebastian Ruder and I did. And you can see it's taken the PDF and turned it into, slightly crappily, a text format. And then it's created an embedding for each section.

So I could then ask it, you know, what is ULMFiT? And I'll hit enter. And you can see here it's now actually saying, based on the information provided in the context. So it's showing us, it's been given some context. What context did it get? So here are the things that it found. Right? So it's being sent this context. So this is kind of citations. Goal of ULMFiT. Improves the performance by leveraging the knowledge and adapting it to the specific task at hand.

How does ULMFiT improve performance? Or maybe I should say, what techniques? Be more specific. Does ULMFiT. Let's see how it goes. Okay. There we go. So here's the three steps. Pre-train. Fine-
