
Leveraging LLMs for student feedback in introductory data science courses (Mine Çetinkaya-Rundel)
Leveraging LLMs for student feedback in introductory data science courses Speaker(s): Mine Cetinkaya-Rundel Abstract: A considerable recent challenge for learners and teachers of data science courses is the proliferation of the use of LLM-based tools in generating answers. In this talk, I will introduce an R package that leverages LLMs to produce immediate feedback on student work to motivate them to give it a try themselves first. I will discuss technical details of augmenting models with course materials, backend and user interface decisions, challenges around evaluations that are not done correctly by the LLM, and student feedback from the first set of users. Finally, I will touch on incorporating this tool into low-stakes assessment and ethical considerations for the formal assessment structure of the course relying on LLMs. Slides - http://duke.is/help-from-ai-conf25 posit::conf(2025) Subscribe to posit::conf updates: https://posit.co/about/subscription-management/
image: thumbnail.jpg
Transcript
This transcript was generated automatically and may contain errors.
All right, thank you very much. Very happy to be here. And it's a tough act to follow, but let's try.
So I'm gonna tell you a little bit about some of my inspirations going into this project that I'm calling Help from AI. And I'll read this paragraph to you. A Microsoft study finds: "A key irony of automation is that by mechanizing routine tasks and leaving exception handling to the human user, you deprive the user of routine opportunities to practice their judgment and strengthen their cognitive musculature, leaving them atrophied and unprepared when the exceptions do arise."
Setting aside the SAT words, I agree with this statement. And as a teacher of introductory data science, I feel like this is actually a lot of where I live. And I don't necessarily want these tasks offloaded.
The other inspiration was this tweet by an author I don't actually know, but I agree with the sentiment. You know what the biggest problem with pushing all things AI is? Wrong direction. I want AI to do my laundry and dishes so that I can do art and writing, not for AI to do my art and writing so that I can do my laundry and dishes.
My laundry and dishes as a teacher is grading, I think. Needs to be done and is important, but you know, not as fun as art.
Rethinking AI in the classroom
So this is me thinking about the conversations I really don't like having, like R versus Python. I also really don't like having conversations like, how can we stop cheating with AI? That is not where I wanna spend my brain energy. So what I've been thinking about over the last year is: how can AI support student learning instead of helping students take shortcuts in their learning?
And I myself as an educator also love to learn. So I figured I need to interact with these things a little bit myself, maybe programmatically more than just type things into a chat bot and see where we can get within a year. So this is a story of that.
As a context, I teach an introductory data science course. It is a large course, at least large for my institution. I have 300 students. I have TA support. Many of them are undergraduate TAs who have just taken the course, so super helpful, super good with the students, but don't always have sort of the big picture just established in their head yet.
And in terms of what we cover in the course, it is most of R for Data Science, the beginning of Tidy Modeling with R, and a lot from my textbook, Introduction to Modern Statistics. So there's a lot of interpreting results as well as using code to get to results. And in terms of the technology, students are learning to code in R in RStudio. Every single thing is a Quarto document, and every single assignment is submitted as a GitHub repository.
AI policy and the chatbot project
Now, the AI policy that I set last year was that you can use it for code, and here is some information on how to cite it properly. You can't use it to generate narrative. Unless instructed otherwise, you must come up with your own narrative. And you can use it for learning, but don't believe everything it says, in longer words.
I think that was a little too optimistic. To be perfectly honest, I haven't necessarily changed this policy, but I used that learning opportunity throughout last year to figure out where to go next from here.
So the first project, I will admit, a little uninspired, because no one really needs another chatbot, but we started here. The thing about the chatbot is that it is one that hopefully generates good, helpful, and correct answers that come from course content. That's the important piece, and prefers terminology, syntax, methodology, and workflows that are taught in the course.
In terms of the technical details, in the back end we're working with open-source models, using RAG to build a searchable, traversable knowledge base from the course textbooks, and also using semantic similarity search. I'll give you an example of what this looks like in a second. So we're using these tools to figure out, when a student asks a question, where in the materials the answers could be.
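The retrieval step described above can be sketched in a few lines of base R. This is a minimal illustration, not the actual backend: `embed()` is a hypothetical stand-in for whatever embedding model the system calls, and the page names are placeholders.

```r
# Cosine similarity between two embedding vectors
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Given a student question, find the k textbook pages whose embeddings
# are most similar, so the UI can link students back to the materials.
# `embed()` is a hypothetical helper standing in for the embedding model;
# `page_embeddings` is a named list of numeric vectors, one per page.
retrieve_pages <- function(question, page_embeddings, k = 3) {
  q <- embed(question)
  scores <- vapply(
    page_embeddings,
    function(p) cosine_similarity(p, q),
    numeric(1)
  )
  names(sort(scores, decreasing = TRUE))[seq_len(k)]
}
```

The returned page identifiers are what drives the "page numbers to course textbooks" links shown in the demo.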
So we embedded this chatbot in the Canvas learning management system, which is what our university uses, and which is, unfortunately, where things go to die. But we had to embed it somewhere not publicly accessible, because one of the other motivations behind this project was that I don't want how far a student can go with AI tools to be limited by their credit card limit. We wanted to make sure that all enrolled students have access to the same level of AI tooling.
So this is what it looks like. Let's say you ask a question; it thinks about it a little bit, and then gives you some answer, generally okay answers. But importantly, you can see on the side we have page numbers to course textbooks, and then it actually opens up, I'm gonna track back a little bit, a page from the OpenIntro Statistics book.
So the idea being that if a student is constantly being pointed to the same area in the book, maybe they will pause and say, maybe I'll just read this chapter instead of constantly asking questions. Let's ask another question that's more of an R question, and here they're going to get sent to the right page in R for Data Science as well.
By making this link immediate for students, my hope was that they would come back to the course content. So the chatbot itself is not adding a whole lot to what we have around us, but hopefully this immediate linking is.
And we were able to track students' questions. These were all being written to a backend anonymously, so I was able to analyze the data over the summer. We saw that students were using it, though not many of them, but it was helpful as a proof of concept: we can build this thing, and we can let students make calls to it.
The feedback bot
So come the next project that I think is, at least for me, a little more inspired, a little bit different than what I've seen out there, and something that I've enjoyed working on, and now I can say that my students have sort of enjoyed it as well, so I'll share some details about that.
So this was a feedback bot that, again, hopefully generates good, helpful, and correct feedback based on an instructor-designed rubric, and suggests terminology, syntax, methodology, and workflows taught in the course. I want to underscore the word feedback. I am not talking about grading, and I am not talking about assigning scores here. We're talking about giving immediate feedback to students.
Now, let's talk a little bit about motivations. I mentioned a couple of things I don't like saying. I also really don't like the word "disrupt" in anything related to tech and education, but I figured this is my call, my opportunity to do it. There's an increasing number of students who use AI tools as a first step before thinking about how to approach a task. So my goal was: can I wiggle my way in there, between the time a student reads an exercise and the first line of code that they write?
Another motivation was leveraging work that's already being done: I have to write really meticulous rubric items. My university uses Gradescope, which, if you've used it, requires well-defined rubric items. And I have a large number of TAs with varying experience levels grading a large number of students' work, so I need to make sure they stick to a very well-defined rubric. These rubrics are a pain to write, to be perfectly honest, but that work is getting done no matter what, so I've been thinking: how else can we leverage it?
And also, hopefully, shifting resources. Could AI help TAs redistribute their time towards higher-value, more enjoyable touch points with students, and away from repetitive and error-prone tasks like grading, most of which goes unread? Because guess what happens when the homework is done mostly by AI? No one reads the feedback, because they think: I don't actually know what's in there in the first place, so a week after I turn this in, why would I read the feedback written for me?
And self-care. Neither the TAs nor I want to provide detailed feedback to answers generated solely with AI tools. And solely is the word I wanna underscore there. This is not about discouraging people from using the AI tools. I use them myself. It would be very hypocritical for me to say you can't, but to be copying and pasting the first answer ChatGPT gives you, and then for another human to spend time giving feedback to that just doesn't seem like a good use of resources. Or mental health.
Technical implementation
So the technical details: we have a tool that uses prompt engineering to ground the feedback bot with a question, some rubrics, and an answer. This is a shortened version of the call using ellmer, but I tried to make it fit on a slide. So we have a system prompt that basically says: you're a helpful course instructor teaching a course on data science with the R programming language and the tidyverse and tidymodels suite of packages; you give succinct but precise feedback.
And we feed the bot the question and a detailed rubric that even has the answer key in it. The bot compares the student's answer against this detailed rubric, but then writes feedback back to the student using a much simpler rubric, so that the feedback doesn't just give away the answer. It doesn't say the answer should be 30 instead of 60; it just says something that hopefully nudges the student in the right direction.
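The two-rubric grounding described above can be sketched with ellmer along these lines. This is an illustrative reconstruction, not the actual course code: `chat_ollama()`, the model name, and the variables `question_text`, `rubric_detailed`, `rubric_simple`, and `student_answer` are all assumptions standing in for the Duke-hosted model and the real course files.

```r
library(ellmer)

# System prompt paraphrasing the one shown on the slide
chat <- chat_ollama(
  model = "llama3.1",  # placeholder; the talk uses a Duke-hosted open-source model
  system_prompt = paste(
    "You are a helpful course instructor teaching a course on data science",
    "with the R programming language and the tidyverse and tidymodels",
    "suite of packages. You give succinct but precise feedback."
  )
)

# Ground the request with the question, both rubrics, and the answer.
# The detailed rubric (with answer key) is used for comparison only;
# the feedback is structured around the simple rubric so the answer
# itself is never revealed.
feedback <- chat$chat(paste(
  "Question:", question_text,
  "\nDetailed rubric, including the answer key (do not reveal):", rubric_detailed,
  "\nSimple rubric (structure your feedback around these items):", rubric_simple,
  "\nStudent answer:", student_answer,
  "\nCompare the answer against the detailed rubric, but write the",
  "feedback using only the simple rubric items, nudging the student",
  "in the right direction without giving the answer away."
))
```

The key design choice is that the model sees the answer key but is instructed to surface only the simple rubric items in its response.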
Demo of the tool
I'll give you a demo. The TL;DR of the question is: make a box plot and identify outliers from the data. And this is what the rubric looks like; we don't have to focus on the words. The important thing is that it has bullet points. You can imagine a human going through the answer and doing almost a true/false check for each of these rubric items; that is how the rubrics are designed.
And this is what the tool looks like. We are in RStudio, writing a Quarto document. The student writes their answer and then selects it, narrative and code. Then they use an add-in, which looks something like this. For now, the feeding of the backend is a little hard-coded: I'm saying these are the questions that correspond to this week's homework. Maybe in the future we'll make that a little more flexible.
But the student chooses the homework number and the question number. And you can see that they can't type other things into the chat box right now. It is whatever comes from their Quarto document, which I am hoping means they rendered it, it actually worked, and they feel good about it to begin with. And then they say, get feedback. It thinks for a little bit, and then it gives some feedback.
And the feedback looks something like this. The bolded text is those rubric items that we saw, that we fed it from the simple rubric. Each one of them has some succinct narrative that goes with it, which is more than a human could do in this course for every single rubric item. And then an overall summary as well. And note that all of this is happening in the IDE and while the student is working on the assignment.
So these questions, part one of the homework is now not graded, but it is for them to practice with and get experience with what their evaluation will look like before they submit the other half to be graded by humans, which will use a similar workflow, basically.
Let's give another example to close this out. Another question; the tool works in exactly the same way. You can choose your question, ask for feedback, and get feedback on this. Now, if this were just code, we wouldn't necessarily need LLMs to do this, but note that we're giving feedback on narrative and code. This is not something that I, or even some of the best educators I know out there, know how to write meticulous unit tests for. The other thing is that I am not writing any sort of tests, so it's a little easier to update these from semester to semester, as long as I'm willing to write rubric items, which I would have to do either way.
So the tool itself is a Shiny app, and you can use it as an RStudio add-in. It is out there today and you can look at it, but I will be very honest: for those of you who might know me, this is a little antithetical, because I like making things very, very open. Currently this is hard-coded to send to a backend that you wouldn't be able to send to, because we're hosting the model ourselves, but the calls are there. I'll talk a little bit about next steps in terms of making it widely available, but the idea and the implementation are very borrowable. It's just the model we're hosting; basically, Duke won't let me allow all of you to make calls to the model, is what it is.
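The add-in's first step, grabbing the student's selected narrative and code from the open Quarto document, can be sketched with the rstudioapi package. This is a minimal illustration of the approach, not the app's actual implementation.

```r
# Sketch: read the current selection in the RStudio source editor,
# which is what gets sent (with the homework and question numbers)
# to the feedback backend. Uses the rstudioapi package.
get_selected_answer <- function() {
  context <- rstudioapi::getActiveDocumentContext()
  selected <- context$selection[[1]]$text
  if (!nzchar(selected)) {
    stop("Select your answer (narrative and code) before requesting feedback.")
  }
  selected
}
```

Restricting input to the editor selection, rather than a free-text chat box, is what nudges students to render their document and have a working answer before asking for feedback.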
Takeaways and student feedback
So, a few takeaways. In terms of the process, it was lots of fiddling with the rubric file, trying and testing over and over. And it was frankly a little unclear, or hard to measure, to what extent we got the rubric items right. But separating out into a simple rubric and a detailed rubric, which is something I didn't do when I had human graders, was really helpful for hiding the answers while still testing against the answers for each of the questions.
In terms of the good: the "spell out your reasoning" instruction, which I had originally started with, resulted in really lengthy feedback that frankly no student would be willing to read. But adding word limits really helps. They're a little arbitrary, and I'm feeling my way through them, but the nice thing is it works. This actually is helpful, and I believe the feedback it gives about 90% of the time.
Now let's talk a little bit about the bad. I think the most concerning is that the feedback tends to catch errors but not acknowledge the good. And it seems to reiterate the rubric item whether the item is met or not, which could frustrate the students, and I worry about that. But it's a little on par with an inexperienced TA, to be perfectly honest, who tries to do pattern matching anyway.
The inevitable is the inconsistency. If you send the same answer, you don't always get the same feedback. Is this frustrating? I don't know; we'll see as the students use it more. So far they've only been able to use it for one homework in the class; the semester just started. And hallucinations do happen. Sometimes it says you haven't used the base pipe when you absolutely have, so that's something we need to work on. And sometimes it says things that, as a student, would really annoy me, like "it's not aligning with rubric expectations." What does that mean, rubric expectations? Really vague.
Now the student feedback, as preliminary as it is, is that they love the immediate feedback in the IDE. They like it as a helpful quick check. They find it too picky, but then they find humans too picky as well. So I don't know. And they said that it's not as helpful as office hours. And I'm like, that's great, actually. Like that is exactly where I would want this to land, to be perfectly honest.
And in terms of revisiting the motivations, I think we're getting there a little bit, because these submissions are also logged, and I'm not seeing that the answers they're submitting to the tool look as generically ChatGPT-generated. But give me some time, because those models are getting better too, and detection is hard. I think we're able to leverage the work we're already putting in. And we have been able to shift resources towards more in-person time with the TAs, because they're not spending as much time grading. We'll see with part two, which is being graded this week, whether the graded part is still mostly ChatGPT or not.
And in terms of next steps, we're gonna do a little bit of model evaluation and keep working on prompt enhancements. One thing the students requested, which I already had in mind, was a continuation: we already have the chat tool anyway, so can we, after the feedback, actually allow them to talk back and forth? That's gonna be V2 of this app. I need to improve selection with the visual editor; it's a little janky right now. And I wanna work on the tool itself to share it more widely. Maybe one day I'll do a proper assessment of it, but currently I'm enjoying the tool-building phase.
I really wanna thank my partner in crime, Mark McHale from Duke OIT, who helped with the backend, and Duke OIT for letting him use his time for this. Thank you very much. I'd be happy to answer any questions. Thank you.

