Resources

Andrew Gard - Teaching and learning data science in the era of AI

Talk by Andrew Gard

Oct 31, 2024
5 min

image: thumbnail.jpg

Transcript#

This transcript was generated automatically and may contain errors.

Our penultimate speaker is Andrew Gard, who will be talking about teaching and learning data science in the era of AI.

We're at a really unique point in time in the history of the development of data science, in the following sense.

Nearly everyone practicing data science today learned to code prior to having ready access to AI, and yet we all have ready access to it now. On the other hand, nearly everyone learning data science today already has ready access to AI tools and is strongly incentivized to use them.

I'm Andrew Gard. I'm a professor of mathematics at Lake Forest College, where I teach statistics in R to students at all different levels. I also maintain the popular R stats learning channel Equitable Equations (like, share, subscribe). So I'm in the thick of all of this.

AI and the classic data science exercise

I want to look at a question that I've asked students for many years in my introduction to data science with R classes: using the classic relig_income data set, give me a horizontal bar chart of income levels for one group of people in the set. And I want to use AI to approach this.

I'm using GitHub Copilot, although it doesn't particularly matter if you're using a different tool like ChatGPT to do something like this. The comments that you're seeing here are my prompts. I've asked Copilot to pivot the data set and then produce that horizontal bar chart. And in a flash, I get a pretty passable thing right there.
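The talk shows the Copilot session on screen rather than the code itself, but the working version is presumably along these lines. This is a sketch, assuming the relig_income data set from tidyr; the variable names "income" and "count" and the choice of the Agnostic group are illustrative, not taken from the talk:

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Prompt: pivot the relig_income data set longer
relig_long <- relig_income |>
  pivot_longer(
    cols      = -religion,   # every column except religion is an income bracket
    names_to  = "income",    # bracket labels become values of one variable
    values_to = "count"      # cell values become a count variable
  )

# Prompt: horizontal bar chart of income levels for one group
relig_long |>
  filter(religion == "Agnostic") |>
  ggplot(aes(x = count, y = income)) +
  geom_col()
```

Mapping the income brackets to the y axis is what makes the bars horizontal; note that the prompt only works because "pivot" points the tool at `pivot_longer()`.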

And if you think about this from the perspective of a new learner, your reaction might be: I don't even need to learn how to code. I don't need to know pivot_longer(). This is not important anymore.

And I want to throw a little cold water on this, or maybe at least lukewarm water, with this cautionary tale. What I've done here is modify the prompt very slightly, combining the two prompts I used before and taking out the word pivot. And what's happened here, of course, is that GitHub Copilot has failed. It has tried to build a ggplot using variables that aren't actually in this data set.

What really went wrong

So what really went wrong here specifically? Why was one result awesome and the other silly? The main point I want to drive home today is that this is not just a problem of prompt engineering. In the second case here, we're seeing some important concepts that are missing from the prompter's brain. They don't have the vocabulary they need to say specifically what they want. The AI had to guess, and of course, it guessed wrong.

Now this isn't surprising when we look back at the relig_income data set. Nowhere in this set does the word income appear, or the word count. So that's information we brought to the table. While we're all hoping and expecting AI tools to get better and to hallucinate less over time, it's not realistic to expect them to reliably guess at information that isn't in the data and that we don't provide on our own.
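You can check this claim for yourself in a console; this assumes the tidyr copy of the data set:

```r
library(tidyr)

# The columns are religion plus the bracket labels themselves,
# e.g. "<$10k": neither "income" nor "count" appears anywhere.
names(relig_income)
```

So a prompt that says "income" or "count" is supplying vocabulary that exists only in the prompter's head, not in the data.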

And this is why learning pivot_longer() still really does bring value to the table. By learning tools like this, we learn important, I would say essential, data science ideas, like the difference between variables and observations, columns and rows. And while learning to code is not the only way to learn these essential concepts, it is the time-honored way.
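The variables-versus-observations idea can be seen in a toy example with made-up numbers (not from the talk): in the wide form the income brackets masquerade as separate columns; pivot_longer() turns them into values of a single variable, one observation per row:

```r
library(tidyr)

# Made-up wide data: one column per income bracket
wide <- tibble::tibble(
  religion  = c("A", "B"),
  `<$10k`   = c(27, 12),
  `$10-20k` = c(34, 27)
)

# Pivot: a 2 x 3 wide table becomes a 4 x 3 long one,
# one row per (religion, income bracket) observation
long <- wide |>
  pivot_longer(
    cols      = -religion,
    names_to  = "income",
    values_to = "count"
  )
long
```

Knowing that this reshaping step exists, and what to call it, is exactly the vocabulary the failed prompt was missing.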

Updating our pedagogy

Unfortunately, students are increasingly resistant to this technique. So how can we potentially update our pedagogy to reflect this new reality? Well, I don't exactly ask that same question anymore about the religious income data set, because I get a lot of AI-generated responses back. Some are great. Some are awful. Some are just plain hilarious.

Instead, I'm asking things a little bit differently, providing students with that AI prompt and its output that did not work and asking questions like, why didn't this work? What information is missing? How can we provide it? Can you make this code actually work?

I think this has some important implications for all of us as continuous learners. We think about AI tools as things that are enhancing our productivity and allowing us to move faster. But when you're learning something new in data science and incorporating AI into your process, I want to urge you to slow down. Take some additional time right then and ask questions like, what's working here? What's not working here? How can I break this?

Because of course, AI is not going to excuse us from learning to code. It's never going to replace human understanding. But neither is using it an immediate path to failure.

Since I haven't been gonged yet, I'll say really quickly that Equitable Equations brings a human-first perspective to the teaching of data science, R programming, and machine learning. As my fellow educators head back to the classroom in the next couple of weeks, I hope it's a free resource that can bring you a little bit of additional value. Thanks, everybody.