Alok Pattani - Using Generative AI to Increase the Impact of Your Data Science Work
Over the past year plus, generative AI has taken the world by storm. While use cases for helping people with writing, code generation, and creative endeavors are abundant, less attention has been paid to how generative AI tools can be used to do new things within data science workflows. This talk will cover how Google's generative AI models, including Gemini, can be used to help data practitioners work with non-traditional data (text, images, videos) and create multimodal outputs from analysis, increasing the scale, velocity, and impact of data science results. Attendees should expect to come away with ideas of how to apply Google generative AI tools to real-world data science problems in both Python and R.

Talk by Alok Pattani
Slides: https://docs.google.com/presentation/d/18lS3d3pn-ImyOO0gE0DSLreH4iA6-YSZbRJq-I44j0c/
GitHub Repo: https://github.com/alokpattani/gcp-datascience/tree/master/olympics-medals-analysis
Blog Post: https://medium.com/google-cloud/achieving-gold-medal-level-data-science-communication-with-gemini-and-vertex-ai-536078880191
Transcript
This transcript was generated automatically and may contain errors.
All right, good afternoon. Excited to be here. Last talk of the almost last session, so I hope we'll keep the energy up, talk about how generative AI can make you a better data scientist.
So let's start with this setting, right? Many of us here are involved in data science, right? We work on a project, we've been working on it for a while. We imported our data, we worked through a lot of data engineering challenges. We did all the cleaning, right? Data cleaning, 80% of the job, whatever it is, we spend so much time doing that. We transformed our data. We tried all these different models. We finally found a model we like. We built the outputs from that model we like. We predicted, we made predictions from that model, and we're feeling really good. So we're finished, right? Finished our data science project.
Now, if you're familiar with this diagram, this very popular diagram that Hadley came up with, there is this extra part about communication, and Laura talked about this extensively in her talk. And we're not done, right? For most of our stakeholders, we're probably gonna have to show them something or tell them something, and that often is going to be in written form.
So we're not done. I work at Google Cloud, and I've been in data science for 15 years or so. And for about 15 years, I've struggled with taking my data analysis and then finally doing this last, critical step of communicating it.
So the good news here is that, no, we're not done. The bad news is we're not done. The good news is maybe generative AI can help us with this critical part of communication.
So the plan for the talk, I'm gonna talk about Gemini, Google's foundational multimodal model and how that's kind of a game changer here. We're gonna then run through a couple examples where we will explain machine learning model outputs at scale. And then another example where we're gonna look at how we can take our data analysis results and create talking points, again, in text form that we can use to communicate those results.
Introducing Gemini
So I wanted to talk a little bit first about Gemini. To be clear here, we use the word Gemini to refer to many things, including the chatbot formerly known as Bard, which Laura featured in her last presentation.
But what I'm gonna talk about is actually the underlying model, which can be called via an API, which is what I'll demonstrate in the examples here. So key things to know about Gemini: it's natively multimodal. That means it can take in not just text, but also video, audio, images, and even things like PDFs, which are combinations of some of those. Obviously that makes it very powerful from an input standpoint. Sophisticated reasoning: it can take those inputs, and then you can ask questions like, hey, given this question and this audio and video, give me a summary of these.
So one kind of interesting use case I had for this that worked pretty well: I was gonna go on a podcast, this was a couple of months ago. I was really excited to go on. They have a weekly podcast about sports analytics, and I like sports analytics, but I hadn't really listened to the podcast, so I didn't know what they talk about. And I was like, how can I prep? Well, I'll listen to the last few episodes. But it's an hour per episode, and I don't necessarily have all that time. So what I did was download the files from the last four episodes and put them into Gemini. I gave it a pretty detailed prompt: hey, give me some questions you might expect. What are some things they talk about? They have multiple hosts; what are their personalities like? And within a minute or two, I actually had a really good summary of what they do and what I might expect. And then, of course, I could ask for timestamps and go back and listen if there were particular parts that were interesting to me, right?
So that's one kind of example. The other thing with Gemini is that we have a robust ecosystem of other tools to help with things like retrieval-augmented generation, tuning, and evaluation. I won't get into those, but it's good to know they're there.
So one more thing on Gemini. In the current version of Gemini, we have two different models. Again, we're talking about models you can call via API and integrate into your workflows: Gemini 1.5 Flash and Gemini 1.5 Pro. The key thing with Flash is that it's the fastest and most cost-efficient, with low latency. So when you have a task where you want to make a lot of calls and you want answers pretty fast, Flash is probably the right choice. Pro, on the other hand, is the bigger, more top-tier model: it can handle advanced reasoning over vast amounts of data.
A couple of things to point out here when I say cost-efficient for Gemini 1.5 Flash. As of this week, it's 50% cheaper than a competing model from OpenAI for prompts of standard length. And again, it's very popular for these low-latency, high-throughput tasks. Then another thing to point out here is the context window: both of these context windows are actually enormous. To give you an idea of what the 2 million token context window means, which is many multiples bigger than most competitors', you can actually put in 1.4 million words of text, which is many novels, or two hours of video, or 22 hours of audio. So this is what I was taking advantage of with that podcast summarization piece.
Example 1: Explaining ML model outputs at scale
All right, cool. Enough about Gemini. So let's see how we can take that and put it into a place of data science.
Again, the happier part of Laura's talk for me was that she used the Olympics. I'm also gonna use the Olympics, so that's a part we're aligned on here.
So this is a sample data science task. I thought it would be, again, timely. Let's pretend that we're gonna build a model to predict Olympic medals won in the Paris Olympics. We have the common medal table, this one from Tokyo, and we're gonna predict that total column, the total number of medals that a country wins. We have a few predictors that we think might be useful, including their medal share at the previous two Olympics, population, GDP, and, importantly, host status. There's a well-known bump: the host country usually wins more medals than it does at other times.
So this isn't by any means the most sophisticated model; that's not the point here. We could do something much better if we went sport by sport. The idea is to show that we have multiple predictors of this target we're interested in. We'll use an XGBoost model, a type of model that's pretty good for tabular data like this, and we might have interactions and other things that we want to take into account.
So again, I'm not gonna dwell too much on the results of this model. I'll put a link up at the end to a GitHub notebook where you can see this in detail. But the model looks to be really good on the training set, of course, and also pretty good on the test set. An R-squared of 94% in this context is actually quite good, and we're off by about two and a half medals per prediction. It turns out this is actually a pretty good model.
And that's kind of my point here. Good enough. Let's move forward with what we want to do with it, which is make predictions.
So let's talk about the predictions and the things we can get out. I picked the host country here, France. You can see the predictions we have up top: the predicted medal share translates to about 38 total medals. And if you rank that prediction amongst all the predictions for the countries we have, it would rank about sixth. So that sounds good.
But again, our stakeholders may want to ask: okay, well, why? What contributed to France's prediction? So this is what we call a Shapley values plot. Some of you may be familiar with it, some of you may not be. Shapley values are a way to, on an individual example-by-example basis, take into account how much each predictor contributes to the model output. So it captures both global variable importance and local variable importance, and we can show it in this sort of waterfall plot. The main thing that matters here, and in most cases, is the last Olympics medal share; it's a really good predictor of how well you're gonna do in the next Olympics. And in this case it's France, so they also get a good bump for being the host country.
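To make the Shapley idea concrete, here's a minimal sketch, not the talk's actual code, of computing exact Shapley values for a tiny model by enumerating feature coalitions. The function name and baseline-substitution scheme are just for illustration; in practice you'd use a library like shap, which is what typically generates these waterfall plots.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values for model f over the features in x.
    Features absent from a coalition are set to their baseline value."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # Shapley weight for a coalition of size |S|
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                with_i = [x[j] if (j in S or j == i) else baseline[j] for j in range(n)]
                without_i = [x[j] if j in S else baseline[j] for j in range(n)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi
```

The attributions always sum to the difference between the model's output at `x` and at the baseline, which is why the waterfall plot "adds up" to the final prediction.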
So that's cool. I'm a data scientist. I got my prediction. I got my Shapley values. I'm good, right? I can just turn this over to some... Yeah, I see some people laughing. I can't turn this over to someone. Most sort of stakeholders, senior folks, are not gonna be able to interpret this plot, and they're just going to stare at me.
So what should we do here? Well, this is where we can ask Gemini for help. I'm gonna build up this prompt iteratively here. I'm gonna put in some details, but not all of them. Again, Gemini is multimodal, so I can put in that plot; we'll show that in a second, but let's set the stage first: you're a data scientist who does a great job of explaining various outputs from data analysis and modeling, including numbers, tables, and plots, to help a more general audience understand your results better. So this sets the stage, and we can reuse it for other parts too.
Now we get into the specifics. Please use the information below and the provided Shapley plot to explain France's 2024 medals prediction. Put in those predictions, put in that plot; again, it takes images, so I have the prompt and I have the image. There are a few other details I'm leaving out where I get very specific: hey, this part means this, these are Shapley values, which it kinda knows.
This is also key at the bottom. Stick to only the data information provided in creating a response. Your response should be understandable by a non-technical audience, 100 words or fewer. Obviously you can change that. And in English, I have English in red here. We'll talk about a change to that in a minute.
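Putting those pieces together, the prompt built up here could be assembled along these lines. This is a sketch, not the talk's actual code; `build_prompt` and its parameters are hypothetical names, and the one-word language swap described later falls out of a single argument.

```python
PREAMBLE = (
    "You are a data scientist who does a great job of explaining various "
    "outputs from data analysis and modeling, including numbers, tables, "
    "and plots, to help a more general audience understand your results better."
)

def build_prompt(country, predicted_medals, predicted_rank,
                 language="English", max_words=100):
    # Hypothetical helper mirroring the prompt built up on the slides:
    # preamble, task-specific instructions, data, then constraints.
    return "\n\n".join([
        PREAMBLE,
        (f"Please use the information below and the provided Shapley plot "
         f"to explain {country}'s 2024 medals prediction."),
        f"Predicted total medals: {predicted_medals:.1f} (rank {predicted_rank}).",
        ("Stick to only the data and information provided in creating a "
         "response. Your response should be understandable by a "
         f"non-technical audience, {max_words} words or fewer, and in {language}."),
    ])
```

Changing `language="English"` to `language="French"` reproduces the one-word tweak from the talk while leaving everything else identical.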
So we do this, we submit it to Gemini, and we get a response; I have it at the top here. The model predicts France will win 38.1 medals in the 2024 Olympics, ranking them fifth overall. The prediction is largely driven by France's strong performance in the previous Olympics, as well as the fact that they're hosting the games. The model also suggests that France's GDP has a small positive impact on their medal count.
I think this is pretty good, right? This comes out from the model. They took into account the right context. It was able to read the plot, see which parts were most important, and give me a summary of the results with an explanation.
And this is something where I could potentially present this to a stakeholder or include it in a report and feel pretty good about it.
So that's good, we get an explanation in English. But we're talking about France, and maybe some of the people who are interested in this content don't speak English; they speak French. The good thing is Gemini actually knows 40-plus languages, and that number is growing. So it's the same long prompt as before, and I changed one word, the last one, from English to French. And what do I get?
Now, I don't know French, so I can't tell you exactly what this says, but I did put it back into a translation tool. It's not the same exact response translated, but it has the same flavor. It explains in French: hey, we think France is gonna win this many medals, this medal share translates to about this many medals and a fifth-place finish, and the host country piece plays an important role. So with just one tweak to my prompt, I can now get an explanation in a language I don't even know.
Okay, so that's cool. You're like, eh, you explained France. I could maybe explain France myself; I could run it through a translation tool. But I don't just want an explanation for France, right? I want it for all 200-plus countries in the Olympics. And this can scale pretty well.
This is the only slide where I have code, and I just wanna talk through a little bit about how it isn't that challenging to set this up. I'm showing how you can do this in R; you can do this in Python as well. We're gonna use reticulate to use the Vertex AI library from Python and set it up with our Google Cloud project information. Again, Vertex AI is our Google Cloud machine learning and artificial intelligence platform, a set of tools that includes Gemini along with other things. We're gonna set up the Gemini 1.5 Flash model in this case, because I wanna do this for a lot of countries, so I'm happy to get the higher speed and lower latency. And there are some other pieces for configuration.
And then finally, this is the actual call to the Gemini model. The two parts in red are what we wanna focus on: we put in a link to that plot we built, which we save as a PNG on Cloud Storage, and then we put in the country-specific prompt, the one we built up in the last step. And we have some parameters around the configuration; we talked about those. So this is your function to generate the explanation. Then you can take this code snippet, put it into a function, and use your favorite pmap or apply to call it across every country, every plot.
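In the same spirit as the R snippet described above, here is a Python sketch of the per-country call plus the "map it across every country" step. The `explain_country` and `explain_all` names and the bucket path are made up for illustration, and the model object is only assumed to expose a `generate_content` method like the Vertex AI SDK's `GenerativeModel`; any stub with that method works.

```python
from concurrent.futures import ThreadPoolExecutor

def explain_country(model, prompt, plot_uri):
    """One Gemini call per country: the country-specific prompt plus a
    reference to its Shapley plot (saved as a PNG on Cloud Storage)."""
    response = model.generate_content([plot_uri, prompt])
    return response.text

def explain_all(model, countries, max_workers=8):
    """Fan the per-country calls out in parallel, the Python analogue of
    pmap/apply in the R version on the slide."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(
                explain_country, model,
                f"Explain {c}'s 2024 medals prediction.",
                f"gs://my-bucket/shapley_{c}.png",  # hypothetical bucket path
            )
            for c in countries
        ]
        # Results come back in the same order the countries were submitted.
        return [f.result() for f in futures]
```

With the real SDK you would pass a `GenerativeModel("gemini-1.5-flash")` instance as `model`; the fan-out pattern is what turns one worked example into 206 explanations in a few minutes.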
So now Flash can write all these explanations, 206 of them, in 40 different languages, right? Because all the countries at the Olympics speak different languages, and there'll be a growing number. And this took, on a Jupyter notebook running in the cloud with no special hardware, about seven minutes. So you can imagine, as a human, how much time this would save you compared to writing these reports for all 206 countries. Maybe you have some tricks and shortcuts you can use after you write a few, but here you're getting all of this in seven minutes.
So now we have explanations for every country, right? So we can talk about the US: the model predicts the US will win 96.6 medals in the Olympics, the most of any country, driven by their strong performance in the previous Olympics and large GDP. So that's cool. That seems reasonable to me.
And then I'll take my parents' home country of India. A little more modest prediction here, eight medals or so, and we get the response in Hindi. So now I'm getting an explanation for every country in a language native to or commonly spoken in that country, along with my plots if I want to include those with my predictions.
So I think this is really a great use case: we can take all this data science work, and we're still doing it the same way, and then in that last mile of communication, we can use large language models to help.
Example 2: Writing a report from data analysis
All right, so now's the part where you tell me: well, that's cool, Alok, I like all those predictions from before the Olympics, but sorry to tell you, the Olympics are over. And yes, I know, I'm very sad about it too. But here are the results. So what if we wanted to do an analysis after the fact: how did countries do in the actual Olympics?
So let's imagine now that we took the results from this year's Olympics and we want to build a report about how countries did: not just who won the most medals, but using that model from the previous step to talk about who did well relative to expectation, right? I probably could have told you that countries like the US, China, and Great Britain would do better than smaller countries sending a couple of athletes. That's obvious, and that's why we have this prediction.
So let's say we built a plot like this, one per country, that shows their gold, silver, and bronze medals and their total; in the bottom middle we show the prediction, and in the bottom right we show their total relative to that prediction. And now I want to tell stories around all of these, right? This is an example plot for China, who did quite well: tied the US for the most gold medals, second in overall total medals, and also did well relative to their prediction. So this is cool. I can probably write that explanation for China, write up the results for China pretty well. But I actually want to find all the interesting countries, and I am not going to look across 200 plots manually myself. Maybe I can get some help.
So again, we'll turn to Gemini. In this case, we're going to use that larger context window: we're going to stuff all 206 ggplots in there. And again, I'm leaving out parts of the beginning; I'm going to actually use that same preamble. You're a great data scientist, you know how to explain your results, all this stuff. And I'll say: please look through all of these plots carefully. Tell me which countries have interesting results that would be worth writing about in a report summarizing the Olympic results. Focus specifically on countries that won a lot of medals, but also those that performed well or poorly compared to their expectation. Give me some talking points. Don't be too technical, this sort of thing.
So if we do this, we use Gemini Pro. That's my choice because this felt like a more sophisticated reasoning task to me, right? We're going across many different countries, and I really want the most interesting, kind of long-form, report-level output. In this case, it's only one prompt; the other example was one call per country, while this is one big call. So it takes three to four minutes of response time. I'll take that; it's probably going to save me a lot more time on my report writing.
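The "one big call" can be sketched as assembling a single multimodal request: every country's plot followed by the reporting instructions. This is an illustrative structure only, with a hypothetical function name; with the Vertex AI SDK each image entry would become a `Part.from_uri(uri, mime_type="image/png")` and the whole list would be passed to one `generate_content` call on a Gemini 1.5 Pro model.

```python
def build_report_request(plot_uris, instructions):
    """Assemble one long multimodal request: all the per-country plots,
    then the text instructions. Kept as plain (mime_type, value) tuples
    so the structure is easy to see without the SDK."""
    parts = [("image/png", uri) for uri in plot_uris]
    parts.append(("text/plain", instructions))
    return parts
```

For 206 plots this yields a 207-part request, which is exactly the kind of payload the long context window makes feasible.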
So this is a long response, but I'm going to kind of map through it real quick. Here's the first part: the 2024 Olympics saw some countries exceeding expectations and others falling short. Here are some highlights. Overachievers: South Korea, Canada, Romania. And you can see there are explanations: they exceeded expectations, they won more medals than predicted, they were strong in bronze, they were strong overall, this is where they landed. It's clearly reading each plot and clearly identifying three countries that did do well.
And obviously you're taking my word for it here, but I did go and look at these plots and ask, hey, is this actually true? In most cases, it's getting at something that is an interesting story.
And then it gets at underachievers, right? And some of these are interesting. For Japan it actually gives a pretty good explanation: it said that despite winning the third-most gold medals, they underperformed relative to their overall projection, 7.7 fewer medals than predicted. So we've got the overachievers, it highlights the underachievers, and then finally we get the high achievers. These are the countries that won the most medals, and it talks through the different pieces for each of them.
And then it has the sort of LLM ending: this reveals a mix of triumphs, surprises, and disappointments, highlighting the unpredictable nature of this global sporting event. Maybe I'd use that, maybe I wouldn't, I don't know. But this clearly is going to save me time, right?
And at the top, you see my heading: a great head start on your report. I am not of the opinion that this is your report, right? I still think we should be, to Laura's point, human in the loop. We should be looking at this, but it should give us a huge head start. We identified nine countries that we can look into and may want to use. And then we can also go back and prompt it again; when I was building this prompt, there was a lot of back and forth about what I actually wanted it to look at. You can do a lot more here, too, with structured outputs, things like that.
Wrapping up
All right, so let's wrap this up and take it to the finish line. We did our whole data science project, but we were still struggling a little bit with that communication piece, right? So Gemini helped us with its multimodal capability and sophisticated reasoning. We used it to explain ML model outputs at scale, explaining every single prediction we make from a model, and we also used it to help us write a report based on data analysis we did. So I think we're able to say, yes, we checked that one off with Gemini's help. And like Steph says, we can put this one to bed.
All right, I have a link here for the notebook. All right, thank you, Aaron. That was a very good talk.