🐙 Lunch & Learn: Let's Build An AI Assistant With GPT-4o (w/ Joe & Winston of @Posit)

Transcript#

This transcript was generated automatically and may contain errors.

Did she forget? No, I was being fashionably late. We were all being fashionably late. I was definitely not forgetting.

Alright, let's just jump into it because there's a lot to go through. As you can see, there are two wonderful guests that we have today. I'm going to let them introduce themselves, but very quickly, the structure of today is I'm going to go show you guys the demos that I'm talking about.

GPT-4o got released, I think it was a week ago, two weeks ago at this point. They were demonstrating how cool the multi-modalities are. They have video functions, really natural voice, and things like that. But the problem is that you can't directly access it right now. Most people don't have access to it. Also, if you go through their API, you can't directly call it. That's why we need to build this application so that we're able to have a similar user experience.

That's what we're going to be doing today. After we build that, show the demos. After we do that, then we're going to actually use it and try to replicate their demos to see if they are, in fact, being very honest with us. I bet it can actually reproduce that, because as we know, sometimes tech companies like to make their things seem really cool, but it might not be as cool.

Word could be really cool, so we shall see. It's the one that they, it was, yeah, at least from the demos, it was honestly quite impressive.

In that case, I will hand it over to Joe and Winston to introduce themselves. As always, guys, drop everything in the chat that you see fit. While Joe and Winston are going to introduce themselves, I'm also going to drop the GitHub link, so if you want to follow along the code today.

We say something into a video camera, and we get an audio response back.

I think it might be worth pointing out that I know you mentioned this earlier, but just in case everyone out there didn't quite catch this, the OpenAI app can do more stuff than this right now because they have access to more functionality than is exposed via their public API. And we can just use the public API. So this is why we're doing it this way.

And by the way, one thing that you brought up, Tina, last week is, could we take all the OpenAI bits here and replace them with open source equivalents? And yeah, the answer is totally. We just need something that converts speech to text, something that goes from text to speech, and then something that does chat completion. So that's something we'll revisit a little bit later on today.

Jupyter notebook proof of concept

So that diagram we took and turned it into a Jupyter notebook that let us kind of give a proof of concept like, is this going to work? Can the model make sense of this kind of input? So I'm not going to go line by line through all of this. But I do want to show you kind of what we encountered when we saw this working.

So first we load up some OpenAI credentials. We're going to make three OpenAI calls. So we need an API key. Create the client. And then the input is in the form of a video file, input.mod in this case.

So the first step is to take that video file and split it into audio and images. And then we're going to take that audio and we send it to the OpenAI Whisper API that does speech to text. And very quickly it comes back with, can you describe what I'm wearing right now? So perfect. That's correct.

And now we have text and we have images. So we're ready to talk to GPT-4o. So we're going to pass in the text. We're going to pass in the images. And a system prompt. And if you're not familiar, a system prompt is like a set of instructions for the GPT to tell it what kinds of responses you're looking for. And in this case, I want it to speak the answer. So I want it to not use sort of formal written tone, but kind of more casual tone. I want it to know that if I talk about I or me in the input, that it's talking about the person in the video. And I want to make it aware. I'm giving you images, but they're actually part of a video. So you should think of it and refer to it as such.

So let's do this completion. Takes about five seconds or so. And hey, so in the video you're wearing a gray hoodie. Great. So it worked.

And then going back from text to speech, we are using this one more OpenAI API. This model is called TTS1. And you can choose different voices. You do not have the option of the Scarlett Johansson voice anymore, unfortunately. But they have a number of different voices that are pretty good.

You will need to install FFmpeg and you will need an OpenAI API key. And those are in the readme. Instructions for those are in the readme of the GitHub repo.

So we can get a sense of how, what kind of responses we can expect. But it's a little bit tedious to do it this way.

Why wrap this in a web app

So we can do that, but if we want to ask it question after question, that's going to be really annoying. We have to open up some kind of video recording app, save it to a file, change the file name in the Jupyter notebook, run all the cells, scroll down to the bottom, hit play. It's just going to be a really annoying workflow.

And when we're coding, it's nice to have all of that code in front of us. But when we're just trying to iterate, now the notebook is getting in our way. And, of course, if I want to share this code with other people, if I want to let non-technical people try their own queries and prompts, this is going to be extremely difficult for them as well if they don't know how to run Python.

So why don't we instead wrap all this up in a web app, and we'll use Shiny to make a much friendlier experience on top of this.

Introduction to Shiny for Python

So what is Shiny? Shiny is a framework for making interactive web apps in pure Python. You do not need to know HTML, CSS, or JavaScript. If you don't want to, just pure Python, you're already halfway there if you know how to code in Python. But if you do know HTML, CSS, and JavaScript, you can absolutely use them to enhance Shiny by customizing it and extending it, which is something that I spent a lot of time doing this week, and I can't wait to show you the results in a few moments here.

Shiny is specifically designed for data-related applications. There's a lot of visualization built in that it works with Matplotlib and Plotly and is designed to work with data frames. So you can use it for a lot of different things, but it is especially well-suited for data-related applications.

The heart of Shiny is a reactive programming model. If you have not heard that term before, you don't have to worry about it right now except to know that this approach is really good at making your apps efficient because only a minimum amount of code needs to run every time you interact with your app in order to get your outputs to be correct. And it's also a very good programming paradigm for avoiding the kinds of bugs that are usually pretty common in UI programming.

It is an open-sourced framework, MIT-licensed, and after you've written your Shiny app, you can either run them locally or you can deploy them in a variety of ways in the cloud or on-prem. We have a number of free and paid cloud options as well as open-source and enterprise on-prem options, and all that is described in the Shiny for Python documentation I'll link to at the end.

On-prem means on-premises, meaning you're running it on your own server that you manage, whether that is in your bedroom or in your data center or in a cloud of your choice. But if you're managing the server, then that's called on-prem, whereas the cloud offerings, you don't even know that a server exists.

Shiny vs. Streamlit and deployment options

So if you know anyone who's built a lot of Streamlit apps, you can ask them. They'll probably tell you that at a certain point, you kind of need to do something else. The main reason is because every time you do anything in a Streamlit app, it re-executes your entire Streamlit app from top to bottom. And for simple apps, that's fine. And for some of the kinds of apps I'll show you today, that is very not fine. And Shiny is designed to work from very, very simple, you know, just one or two inputs and outputs to very, very complex, dozens of pages with hundreds of outputs. And Shiny still works extremely well and efficiently.

The main reason is because every time you do anything in a Streamlit app, it re-executes your entire Streamlit app from top to bottom. And for simple apps, that's fine. And for some of the kinds of apps I'll show you today, that is very not fine.

I would go to the shiny.posit.co. And just start at Learn Shiny. The first page is designed for you to just start from nothing. And, in fact, you don't even need to install Shiny right away. If you don't want to, you can pip install Shiny. But if you just want to start kicking the tires right away, right from within the web page, you can run Shiny apps, like right here.

So this is a pretty standard type of Shiny app. A lot of people use Shiny for various kinds of dashboards, BI dashboards or experiment results, things like that. And, you know, you can change parameters on the page, and everything updates, you know, right away.

So the nice thing about Shiny is that it only executes the minimum amount of code. So let's say if I uncheck this and only this table was affected by that particular control, then that's the only thing that would execute. You don't have to re-execute everything.

So let me show one more demo, and then we'll get back to GPT-4o. So this is a fun demo that shows a 3D visualization of Brownian motion, which is the kind of motion that a molecule undergoes while it's sitting on a surface. So I can hit new data and randomly generate data. And some of these are pretty interesting, but then like other ones of them, like it's a little hard to see what's going on here just because of the camera angle in this 3D perspective. This is using a Plotly output, and one of the features of Plotly is that you can click and drag to move it around.

But to make this demo extra fun, I have added a hand tracking component. So I can just wave my hand and the plot moves. So what's happening here is that in the browser, it's using a Google library called MediaPipe to recognize the features of my hand and sending those to Shiny. And Shiny is calculating a normal vector shooting out of my palm, and then I'm using that to orient the Plotly camera.

Live coding: building the Shiny app

So that's Shiny in a nutshell. Now let's get to the live coding part where we'll find out how good I actually am at writing Shiny code.

So just as a reminder, we started with this Jupyter notebook that works, but it's a little bit clunky. The first thing I want to do is I want to take all this logic and I want to reuse it.

I want to turn it into a Python function. And luckily, Jupyter has a handy utility built in called nbconvert. And I can say, take this notebook and turn it into a Python script. And just with that one command, it gives me a Python file.

I'm going to rename it so it's easier to import. I can't import it if it has dashes. So I'll call it query.py.

And as you can see, this is Python code. It took all the cells out. So in order to use this as a function, there's a couple of things we're going to need to do. First of all, we don't want to use this hard-coded input.mov. We want to take the video file as a parameter. So let's make a function called query. And this will be the sole parameter. And all the rest of the code becomes the body of this function.

And now the only other change I need to do is go through, and there are a bunch of things I put in just to show things in the notebook. I can take them out now because we're not in a notebook anymore. But all the logic is going to remain the same.

Okay, and at the end here, this was me showing the little audio player so we could hear the response. We don't need that. Instead, we're going to take that response audio and just return it.

Okay, so now we have a function called query sitting in a module called query. And this is not the Shiny app yet. This is just a regular Python module. And now we're going to write a Shiny app that is going to use that.

Building the Shiny app

Okay, so Shiny apps are generally called app.py. And you do not author them in Jupyter Notebooks. You do them in Python files, usually using an IDE like VS Code or PyCharm.

I'm going to go very slowly for this part. The actual app is not going to end up being that big, so we're going to sort of iterate our way into it so it doesn't feel like we just sort of jumped to the end.

Okay, so I'm going to start by just putting nothing but a string there and showing that we have a Shiny VS Code extension here. And if you have that installed, then you can just hit this play button if you're on a file called app.py and using Shiny. And when you do that, it'll automatically start Shiny, and it will put a preview up here. And if I change this and hit save, then it'll automatically reload.

So like I said, we're going to do this in sort of baby steps. The first version of this, let's still use video files. I can use Shiny to let you upload.

So I'm making this function call called UI.inputFile, and that will create a file upload control. So I can say browse and then select my movie, and it'll upload it to Shiny. So far, we're not doing anything with it.

So let's show something very simple about this video clip. Let's see how many bytes it has or something like that, or maybe the name of the file.

So I can access this input by using the keyword input, and then this first parameter is clip. That's the ID. So I refer to it here, and then I call it like a function to get the value out. And in this particular case, I happen to know this type of input gives me back a number of different files, because I could upload a bunch of files if I want to. So I'm just going to take the first one, and I'm going to print the name of it.

So what is this? What is this little function with this render express on top? This is that reactivity that I was talking about. Basically, things that appear at the top level, these are things that are rendered into our web page when it first loads. And then anything that needs to change, anything that needs to re-render or recalculate over and over, you put into little functions and then decorate them with these expressions like render.express.

In this particular case, we are reading this input clip and getting the name off of the first entry. So it automatically knows that any time input.clip changes, meaning any time I upload a new clip, it will automatically re-execute this. But if I have other controls on the page and other of these little outputs that are reading those other controls, then they wouldn't need to change when the clip changes.

So it automatically knows that any time input.clip changes, meaning any time I upload a new clip, it will automatically re-execute this. But if I have other controls on the page and other of these little outputs that are reading those other controls, then they wouldn't need to change when the clip changes.

So here is our video file. This is how we actually get the data from this uploaded file. So we have a video file here. So I'm going to call query and give it the video file. And that will give me an audio file as a response, right? And I can take that audio URI and insert an HTML tag called audio.

So actually, because it's going to take a while, let's say, like, calling gbt just so that we know that it's working.

Oh, and by the way, because this is... this error, this is happening because input.clip. Before I've uploaded something, it's null. So I can say if input.clip is not none... it is none. Then just don't do anything. Okay. So that error is now gone.

Oh, because of the screen sharing, you won't be able to hear this, but I'll tell you what it says. Hey, there. So in the video, you're wearing a gray hoodie. Looks comfy. Okay. So that worked.

Adding progress indication

It drives me crazy that it sits there for so long showing nothing. And as a user, we're going to sit here and wonder, is this thing even working? So what we want to do is we should add, instead of printing this to the Python console, which is not going to help anyone, we can add some progress indication.

So this is a feature of Shiny where we can say pset a message processing. So everything that is inside of this with statement is going to have a progress message showing. And it's going to show processing, and that's it. And so when this line finishes executing, then the progress will automatically go away.

So there is our progress. And now at least we have some idea that something's happening.

Adding a webcam input control

So now let's get fancy. The worst part of this experience is having to upload a video clip. I don't want to fire up QuickTime Recorder every time I want to ask a new question. So I did some work in advance after Tina posed this challenge to us last week. This whole week we've been working furiously to create a couple of components that would really help this experience.

So input video clip. So I've made a dedicated control to help make this easier. So we're going to call it clip still. And it actually, instead of uploading a file, it returns data that's immediately ready to drop into the query function. It's going to be a video URI.

And unfortunately, VS Code will not let you use the camera on your phone. So now we have this experience. I have a webcam that's live here. I can switch to different devices. And I can hit record to start recording and then stop to stop recording.

Before I try this, let me make one more change before I switch to screen sharing so you can hear my audio. So let's, because I know a lot about web development, I can say max width is 600 pixels. Oh, but I want it centered and maybe with a little bit of breathing room on top. So I can, again, using my extensive web development skills, I can say put some padding in my direction and center it.

Live demo with the webcam app

So let's try this. Hey, looking at the video, it seems like you're in an office or co-working space. The setup with the desks, partitions, and lighting gives off that vibe. It looks like a pretty productive spot. I don't know about productive, but it is a co-working space. So I got that correct.

And this is so much easier for us to iterate. Hey, from what I can see in the video, your hair looks dark. Probably black or dark brown.

So we're already making a lot of progress. But now I was pretty dissatisfied with the way that output control looks, like that kind of stock HTML audio player. I created another component called audio spinner. And instead of using default HTML audio tag, I'm going to use this control.

Hey there. In the video, you're wearing a light gray jacket over what looks like a dark gray or black shirt. And you also have white earphones in. Looking good.

Opening the app to the audience

So now I think we are almost at the point where we can ask it any scenario that we can think of. Oh, I just realized the time. Tina, do we have a hard stop on the hour? Should I move on to the playing around with this? We don't have a hard stop. But I know some people do drop at the hour mark. So, yeah, maybe we can try out the demo.

So I've deployed this. I've not load tested it. But I'm putting it in a chat right now. Why don't you all click on that and you give this a try. Ask it questions. I've only tested it with English. I don't know if it works with other languages. But let's give it a try.

If something funny happens or something interesting, then post it in the chat. Don't spam, okay? I was nervous right now, guys. We make a good impression. Be on your best behavior.

This is my personal open AI API key. Oh, my god.

Trying out demos

So, Joe, this version of it doesn't have the history? It doesn't have history, so I think I need to, like. No, I think I meant to deploy the one with history. Does it have history? It's supposed to. I'm not juggling any versions of this app today, so let me see. Yeah, this is the one with history.

Can you tell me a story about robots and love, but make it really, really, really dramatic? The longer the answers are, the longer it takes both to generate and synthesize.

In a future where robots and humans coexisted, there lived a robot named Aria, designed with the ability to feel emotions. Aria served a wealthy family, but deep down, she yearned to experience love. One day, she met another robot named Pax while recharging at a robot station.

So, do you think there's a way for us to actually vary the emotional level of the voice? Because I think they were trying to show the fact. The part that understands what you mean by more emotion is not the same that's transcribing the text into speech. So, yeah, there's definitely some of that.

Math equation demo

I'm going to write down an equation and ask it. Can you help me solve this math equation? Just write 3X.

Can you help me solve this math equation that you see here? Sure thing. So in the video, I see the equation is 3X plus 1 equals 4. Let's solve it step by step. First, let's subtract 1 from both sides of the equation. 3X1 minus 1 equals 4 minus 1. This simplifies to 3X equals 3. Now we divide both sides by 3. 3X squared 3 equals 3, 3. So X equals 1. There you go. The value of X is 1.

Wow. That's impressive. You know, I've tried to use ChatGPT to do other more sophisticated math, and I'm a little afraid to try that because it has failed miserably. But that was pretty cool.

Testing with other languages

Somebody said they tried it with Arabic. It's working. I think you should go for something light and delicious. Maybe some dumplings or a bowl of noodle soup. Pretty good. If you're in the mood for something sweet, how about some fruit or a pastry? Just pick something that makes you happy and enjoy it.

Oh, thank you. Nice. That was pretty good. That was really good. I was going to ask it to respond in English, but I asked it what should I do after this live stream? What should I go eat? So that was really good.

Am I just asking it what kind of facial expression am I making? Yeah, but do like an expression.

Okay, I'm going to do it with this special version that you have not seen yet.

So Tina raised the question of how does this model do versus open source multimodal models. So this version of the app is going to run it against three models simultaneously. So the new GPT 4.0, the existing GPT 4 Vision Preview, and this open source one called Lava.

So what kind of facial expression am I making here?

Okay, so it's asking all three.

Let's check out your expressions in the video. It looks like you're making a few different faces here. First, it seems like you're concentrating on something or listening intently. Then you shift to a kind of playful or inquisitive expression. Towards the end, it looks like you're squinting or maybe trying to figure something out. Super expressive.

So that's one thing about this, too, is that the speech and the video is not lined up. So it doesn't know when I finish asking the question, then I made a face. It doesn't know that. It just has a series of images. So I wonder if it will do better. So I wonder if it will do better when the real API is available.

Let's see how GPT 4 did. It looks like you're cycling through a few different expressions here. At the start, you seem to have a neutral or slightly serious look. Then your expression shifts to a small smile, and then it transitions into a more pronounced smile. Almost as if you're about to laugh. Lastly, your expression changes to a slight frown with raised eyebrows, giving off a look of confusion or concern. It's like you're reacting to different things happening, each requiring a different facial response.

I think this one's better. I'm going to vote for that. It looks like the person in the video is making a surprised or curious facial expression. They seem to be looking directly at something, possibly reacting to someone or something off camera.

I think GPT 4.0 is the best. So I voted for it, and then it goes on to this leaderboard visualization here. So I can keep the conversation going. This time I'll make a consistent face the whole time.

Testing the model with expressions and food

Alright, let's see what's going on in these frames. It looks like in the video you're making an expression that seems a bit concerned or worried. Your eyebrows are raised, and your mouth is kind of downturned, which usually indicates concern or maybe a bit of confusion. It's like you're reacting to something unexpected or trying to figure something out. Super expressive.

In this video, you're showing an expression that suggests concern or sadness. Your eyebrows are raised and pulled together, and your mouth is slightly turned down at the corners. This type of expression might indicate that you're feeling a bit worried, unsure, or possibly empathetic to something that you've heard or seen.

Hey there, how can I assist you today?

Okay, so not the open source one. I think I have to give to this one again. So you get the idea. You can keep doing this over and over, and you can keep the score of who's doing the best here. And if I had more time, we could do this in a way that everybody could participate, and this would be the global leaderboard. But that would require a database, which I didn't feel like setting up.

Running Lava locally with Ollama

So someone asked, where are you running Lava? And because I'm running it locally, I can't deploy it. I never used Lava until yesterday, basically. And I did it with Ollama, which I only know how to run locally right now. So I'll deploy it if I can figure out a way to do it in a sensible way. But that's the reason this is just local.

If you want, this version is in the repo. It's called app-bakeoff. So if you load that up... There's actually instructions in the readme of how to run this app. It's the very last thing in the readme is how to run this app locally. So you can try it if you want. But you need to have Ollama installed as well. Installing Ollama is super, super easy. When I did it, I was shocked at how easy it was.

Do you guys want to do maybe, if you don't mind, a couple of the suggestions from comments? And then we can go into Q&A. And then after that, maybe, Joe, you can show some of the other Shiny apps. I think people had a bunch of questions earlier about how to build things.

Math equation demo

I can do the math equation. Let me just write that down here. x plus 2y equals 4. And x plus y equals 5. My experience is that ChatGPT is not very good at math.

Can we get it to actually speak in another language? I want to see if it's good at actually speaking the language. Can you please solve the math equation that's written here? Sorry, the pair of math equations that are written here.

Sure thing. Let's solve the pair of equations written in the video. Here are the equations. 1, 2x plus 2y equals 4. 2, x plus y equals 5. First, let's solve the second equation for y. x plus y equals 5 implies y equal 5 minus x. Now, let's substitute y in the first equation. 2x plus 2, 5 minus x. This simplifies to 2x plus 10 minus 2x equals 4 in place 10 equals 4. Oops, that means there's no solution to this system because it leads to a contradiction. Equations represent parallel lines that never intersect. So, no values of x and y will satisfy both equations at the same time.

I'm having trouble thinking in my feet. Is that right?

I am actually pretty good at math.

AI safety and Posit's values

It's basically saying this can be really scary because it can be used in problematic ways when it comes to programs dealing with humans and especially without proper supervision. Honestly, I absolutely agree. That's something that I brought up a lot in videos as well.

One thing I want to note is that technology is not inherently good or bad. This technology, I think the one thing we can agree on is that it is going to rapidly develop whether we like it or not. But it is up to the responsibility of all of us in order to actually make sure it's in the right hands doing the right things.

One thing I want to note is that technology is not inherently good or bad. This technology, I think the one thing we can agree on is that it is going to rapidly develop whether we like it or not. But it is up to the responsibility of all of us in order to actually make sure it's in the right hands doing the right things.

I actually think, if you don't mind, Joe or Winston, I think this is actually a great place to talk about what Posit believes in. Because Posit has a very special strong stance surrounding open source in these topics. I'm actually going to let you guys talk about that.

I think we do have a very strong stance around open source and less directly relevant to LLMs. But our general philosophy as a company is we believe in developing tools that help people understand the world using data and making that available to everybody regardless of economic means, which is why all of our core technology is open source.

That being said, when it comes to these sort of large AI models that we don't really fully understand and don't have an accurate mental model of how they work and how we can sort of keep them safe, I think within the company there's a lot of different opinions. And I think some people are very, very optimistic and some people are very, very pessimistic.

And I think there are people in the company that are working on taking this technology and making it really good for our users by enhancing the experience in our studio and letting people build tools in Shiny that help enhance their experience of working with these AIs. But we also have people at and affiliated with the company that are working on AI safety and are extremely, extremely concerned about what this means for humanity. And we have a lot of deep involvement in different efforts to make that safe. I think we'll be talking more about that in the future. It's nothing to share publicly right now.

Yeah, Posit really is a very special company. There were many times in which you guys could have chosen to not be open source. There's a lot of times in which you could have not prioritized that. But every step of the way, that's what you prioritize. And I think, yeah, in my opinion, at least, it was what really stood out when I first interacted with you guys, like how strong of a stance.

And behaviorally speaking, some companies, they go like, oh, safety is important. They just leave it as that, where they're like, oh, we should think about this. But really, Posit has done that every step of the way. And yeah, I just love that about them.