
Leveraging LLMs for student feedback - Mine Çetinkaya-Rundel
A considerable recent challenge for learners and teachers of data science courses is the proliferation of LLM-based tools for generating answers. In this talk, I will introduce an R package that leverages LLMs to produce immediate feedback on student work, motivating students to give it a try themselves first. I will discuss technical details of augmenting models with course materials, backend and user interface decisions, challenges around evaluations the LLM does not carry out correctly, and feedback from the first set of student users. Finally, I will touch on incorporating this tool into low-stakes assessment and the ethical considerations of a course's formal assessment structure relying on LLMs.
Transcript
This transcript was generated automatically and may contain errors.
Okay, thank you very much and thanks for coming. The title of this talk was Leveraging LLMs for Student Feedback in Introductory Data Science Courses, but in short I'm going to call it Help from AI. Let's see how far we can get with that.
To provide some context, I teach one of our largest courses at the university, the Introductory Data Science course. I used to teach it every semester; now it's at least once a year. The course is made up of five learning modules. We start the students off in the Hello World section, introducing them mostly to the toolkit and the ideas. The toolkit comprises R, RStudio, Quarto for writing reproducible reports, so every single assignment is developed in that, and Git as well, so every single assignment is turned in as a Git repository.
We try to circumvent our learning management system as much as possible, but don't tell people outside of this building I said so. Then we go to our exploring data unit, where we start by visualizing data, take a step back and ask what if it's not ready to be visualized and we need to get it into the right shape, then take another step back and ask what if we don't even have it. Maybe it's in a tabular format and you can use a single function to read it in, or it's very unstructured and on the web and we need to scrape it and get it into a tabular format.
This brings us to the ethics unit, which is placed there for two reasons. One, I did not want it to be an afterthought at the end of the semester, which is where it used to be when I first started teaching this course. Two, it plays really nicely after web scraping to stop and ask: now that you've learned some tools for web scraping, and some tools for checking whether the website owner is okay with you scraping these data, should you really be scraping these data? That brings us to our ethics discussion. We talk about data privacy, but also things like misrepresenting data science results and algorithmic bias. Not necessarily teaching them the algorithms per se, but giving them some intuition around making decisions with models before we actually dive into the modeling unit.
The fourth unit is perhaps the most traditionally statistical one, where we start by trying to understand what we mean by a statistical model. We focus quite a bit on predictive models in this class, so classification as well, and then we end with statistical inference. We keep our discussion to simulation-based statistical inference, and at the very end of that conversation we say: every single bell curve you've been seeing every time you bootstrapped or did a randomization test was not there by happenstance. There is statistical theory that grounds all of this, so please take another class where you can learn about it.
The last unit is the looking further unit where, depending on how much time is left in the semester, we dive into things either I or the students are interested in diving into, so that's somewhere on the order of two or three lectures, and students are also working on their projects at that time. You can see the communicate thread throughout.
One thing I will say about this course is that no number or plot or table is reported without some narrative that goes along with it. That narrative can be a whole write-up, that narrative can be a sentence or two, but everything gets interpreted, so the assessment of the student work is not limited to checking the code they used to develop something, but also what they wrote about the final numbers or distributions or whatever they have come up with.
The challenge of AI tools in student assignments
This is what my grading scheme looked like when I taught this course in fall 2024. For many, many years I have been, and maybe still am, a firm believer in not putting all of the weight on exams and giving students an opportunity to accrue those points in assignments they can do at home on their own time, coming to office hours for help when they're stuck, and so on. The thing is, when you look at this as a student, that is a very high weight attached to something that can take you a lot of time, or that can take you much less time to turn in maybe mediocre work, if you have an assistant that will happily barf out code for you.
This was my AI policy; let me go through it real quick. I said you can use AI tools for code, but you must cite it, and I provided some guidance for how to cite it, and I said the prompt used can't be copied and pasted from the assignment. How am I going to police that? I can't, but I said it anyway, just to have said it. I said you can't use AI tools for narrative, so the sentences you write must be your own words. And for your learning, I said, I don't know, do whatever you want to do. I think this was all too optimistic, ultimately.
Building a course chatbot
So project one, honestly, was inspired both by developments at the university, particularly Mark McCahill, who had been supporting the computational infrastructure for our course for a while, coming and saying, we have some resources to work on some AI-driven things, is there something we could do, and also by my own goal: if students are going to use AI tools for learning, is there a way we can keep bringing them back to the course materials that we keep assigning them to read and nobody's reading?
So the goal was to build a chatbot that hopefully generates good, helpful, and correct answers that come from course content, so it prefers terminology, syntax, methodology, and workflows that are actually taught in the course. The first motivation is that the course size has been increasing over the years. We teach over 600 students a semester, but the number of TAs isn't growing at anywhere near the same rate. Therefore, we may not always have as much support as the students might need.
The other motivation was to ask: could we generate good answers, comparable to answers from the course instructor or a TA, that actually stay current with course content? In my opinion, one of the challenges of the wild, wild west of all the LLM tools out there is that they will happily give answers to you, but they're sampling from a much wider space than my course alone.
So here's an example of how you would do a particular task in R. In this case, we're trying to fix, or clean, this membership status variable. This course is taught using tidyverse syntax, and for those with a keen eye, you will see that there's none of that here. In fact, there are some functions here, like setdiff(), that never come up in my course.
So when students get answers like this, I get two types of reactions. One, this must be good enough. Number two, how dare she expect me to write this, given what she has taught me? And that latter one is really hard to answer, because it's absolutely true. If this was the correct answer to a question, I have not been teaching them in the right way. But it is difficult when you're entirely new to this. Like many of you in this room might look at this and say, that's not tidyverse code. And you can then say, and that's base R code, and it's good or not good. To my students who have not done any programming before, it's just this looks different than what I have learned. Why have I not been prepared to read this answer that's given to me?
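The slide code itself isn't reproduced in the transcript, so here is a hedged sketch of the kind of contrast being described, with invented data and level names: a base-R-flavored cleanup like the one the chatbot produced, next to the tidyverse phrasing the course actually teaches.

```r
library(dplyr)

# Hypothetical data; the actual membership status example from the slide
# is not shown in the transcript, so these values are made up.
members <- data.frame(status = c("active", "Active", "ACTIVE", "inactive", NA))

# A base-R answer might use indexing and replacement the course never teaches:
members_base <- members
members_base$status <- tolower(members_base$status)
members_base$status[is.na(members_base$status)] <- "unknown"

# The tidyverse phrasing students actually learn in the course:
members_tidy <- members |>
  mutate(
    status = tolower(status),
    status = if_else(is.na(status), "unknown", status)
  )
```

Both versions produce the same cleaned column; the point of the anecdote is that only the second one looks like anything the students have been taught.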
The technical details: basically, we wanted to create this chatbot that is augmented with course materials and that also brings students back to the course materials when they ask it a question. And I'll be very honest, I have not implemented the model underneath it myself; Mark, who is in this room, has. So I'll show the graphic he developed to give you an idea of what's happening.
And this chatbot is not publicly available. We've put it behind Canvas, the learning management system for the course. The main reason is the equity aspect of these AI tools, something I often think about: should your learning be determined by whether you can put down a credit card at some point to get more answers? We wanted to make sure that every student in the class had access to this, and sometimes that means making sure no one else has access to it, so we can properly support the students in the class.
We started with PDF documents that come from the course materials, the OpenIntro Statistics textbook, and the R for Data Science textbook, and we built a knowledge graph based off of that. Whenever a student asks a question, that's the user query you're seeing on this diagram, two things happen: one, the question gets logged, and two, the student gets an answer. So we have logs of each of these question threads.
And in case this question comes up: I can't answer the question of how learning changed. These logs are completely de-identified. The students knew that I had access to their questions and that I did not have a way of connecting those to their student IDs. And I guess they trusted me on it, because remember that don't-copy-and-paste thing?
So I can't answer the question about student learning and using this tool. Maybe that's an interesting exploration, but we wanted to go forward with this without having to think through IRB implications. Therefore, we never identified the students.
So here's a quick demo of what this looks like. A student asks a question, it thinks for a while, and then you'll see that it gives an answer. I'm going to try to pause the video here. So we have a written-up answer that should seem very familiar to anyone who has used a chatbot interface, but you're also seeing, on the side, the resources. Now, that is not all of the resources the model may have used to generate those sentences; there's a lot happening, obviously, to generate those sentences. But those are the course resources things are grabbed from.
For a student, what this does is they read the answer a little bit, and right in front of their eyeballs are the resources from the course materials that they can come back to. My hope is that this brings them back to those materials, to say, oh, it is in the book, maybe I should read the chapter. Or if they keep getting directed back to the same chapter, at some point they start thinking, this is the chapter I may want to reread; maybe I'm okay on the others. And similarly for a question about R syntax: how do we read data from Excel into R? Now it brings us to pages from the R for Data Science book. Obviously, there are many other things it could be pointing to off the web, but the goal is to bring them back to the course materials.
Student usage data
Here's some data from the fall 2024 class. This is a very crude data analysis of the students' questions, but let's take a look at the word counts. There are some questions that are way long, and when I look at those, it's: here's all of this stuff R spit out to me, I put it in here to see if it can help me make sense of it. So, fix this code, and then a bunch of code.
In the middle are mostly verbatim questions from the assignments. This is why I figured they believed me when I told them I can't identify you, because I had told them explicitly not to do that. But then here are mostly good interactions, and not the shortest ones. What I mean by that is these were threads where the students were going back and forth: but wait, what about that? And how do you do this? That's a nice interaction, I think, especially as long as we can direct them back to the course resources.
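As a hedged sketch of the kind of crude word-count analysis being described here (the real log format isn't shown, so the column name and example questions below are assumptions):

```r
library(dplyr)
library(stringr)

# Invented stand-in for the de-identified question logs.
logs <- data.frame(
  question = c(
    "fix this code ...",
    "How do I join two data frames in R?",
    "a very long question with lots of pasted R output ..."
  )
)

# Count whitespace-separated tokens per question, then summarize.
summary <- logs |>
  mutate(word_count = str_count(question, "\\S+")) |>
  summarize(
    shortest = min(word_count),
    median   = median(word_count),
    longest  = max(word_count)
  )
```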
Building a feedback bot
Now, this brought me to thinking about where else we could use this same model to support the course, and the idea was to build a feedback bot. Let's think about what that means. My experience teaching this course has been that an ever-increasing number of students use AI tools as their first step, before even thinking about how to approach a task themselves.
This is one of the somewhat recent studies that found that AI leaves human cognition atrophied and unprepared. There have been more of these over the last few months as well, saying that the more folks use these tools, the more differently they use their brains, at least relative to what we understand to be good learning.
Motivation two, though, is to say that doesn't mean they shouldn't use it at all. At least for the course that I'm teaching, and the things that I'm intellectually interested in, I'm not in a land of absolutely zero use of AI tools. But I would hope for students to use these tools in a way that enriches their learning, and maybe also gains us some TA time as well.
Instead of having the TAs do some of the feedback-giving to students, we can redistribute their time to one-on-one student interaction. So really, I want AI to do my laundry so I can do art, not the other way around, is what I mean.
And number three is self-motivation, self-care: I don't want to read ChatGPT answers. I am very happy to read edited ChatGPT answers that have been made better by a student's writing, but I don't want to read the raw output. We have actually put our teaching assistants, who do much of the grading for this course, in a very bizarre situation, where we tell them their performance is measured, in part, by the quality of their feedback, but what they're giving feedback to is not written by real humans. And those humans are probably not reading that feedback if they never wrote the code in the first place. So it's a very funny and not nice place to be in.
And I have to write these awfully detailed rubrics nonetheless. So, a little bit of a TLDR on the technical details: we can use ellmer, the R package that Hadley talked about for interfacing with an LLM, to give it a system prompt. This is a shortened version of the system prompt to fit onto the screen, but the idea is: you're a helpful course instructor and you like to give succinct feedback. Read the question, read the rubric, read the student answer, and give feedback against this rubric, okay?
And basically, we're using that same model as before, and we're logging the student data as we go as well. So let's take a look. A question on fitting a linear regression model and interpreting the coefficients. I've purposefully picked this example because the answer is not a single number that we're measuring against; there's some narrative as well.
This is what a student's answer might look like, for example. They have written some code, so part A is code, part B is a write-up about the model, and parts C and D are narrative answers. We're sending it all at once.
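As a rough sketch of what this setup can look like with ellmer (the file names, prompt wording, and choice of OpenAI as the provider are my assumptions, not the course's actual implementation):

```r
library(ellmer)

# Condensed stand-in for the system prompt described in the talk.
system_prompt <- paste(
  "You are a helpful course instructor and you like to give succinct feedback.",
  "Read the question, the rubric, and the student answer,",
  "then give feedback on the answer against the rubric."
)

# Combine the pieces into a single user prompt. File names below are
# hypothetical placeholders for the assignment question, the detailed
# rubric, and the student's submission.
make_prompt <- function(question, rubric, answer) {
  paste("Question:", question, "Rubric:", rubric,
        "Student answer:", answer, sep = "\n\n")
}

# Requires an API key (e.g. OPENAI_API_KEY) to actually run.
chat <- chat_openai(system_prompt = system_prompt)
feedback <- chat$chat(
  make_prompt(
    paste(readLines("question.md"), collapse = "\n"),
    paste(readLines("rubric.md"), collapse = "\n"),
    paste(readLines("student-answer.qmd"), collapse = "\n")
  )
)
```

Keeping the rubric in its own file, as described below, makes it easy to swap per question while the system prompt stays fixed.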
And this is the demo rubric. You can see how detailed it is; this is really the rubric we've been using to grade this question: code does this, narrative does this, and so on. And here's what the feedback from the AI tool looks like. It picks up the rubric items we had and writes a short narrative next to each, saying you have met this or you have not met this.
I'll give you another example. This one was for a data tidying question, and here we had instructed the tool to categorize each item as met, partially met, or not met, to draw the student's attention a little more. Take a look at a few things here. This is basically a pivot task, and it catches that you haven't named the variable or the data frame the way you were asked to, which should be an easy thing to spot. But it also says something about code style and readability, which actually turns out to be a hallucination: the student's code style was perfectly fine. So while this looks plausible, it's not perfect feedback either.
Takeaways and next steps
So, quickly, the takeaways. In terms of the process: lots of fiddling with the rubric file to land at a place where things start to look happy, though at some point you ask yourself, could I go on forever with this? Separating out the rubric from the rubric details helped avoid giving the correct answers away every time.
The good: it sort of works. In terms of the bad, I think the most concerning is that it tends to catch errors but not to call out the good answers, so the students might sometimes feel like there's no winning here, which they do sometimes feel in a classroom anyway. I would categorize this as on the order of feedback from an inexperienced TA, of which we have many, until they get more experienced, obviously. The inevitable is inconsistency from one run to another, but humans are inconsistent too. And, as I mentioned, hallucinations do happen.
And then it says really annoying things, like, this doesn't align with rubric expectations. What a nugget to give to students to be pissed off with me; they would come to me and ask, what are your expectations, obviously. But I want to emphasize, this is not grading. It's just to provide some feedback to the students so they can get better, and I'm hoping that will be a good thing.
I will skip over these for now. I just want to acknowledge Mark and Duke OIT for supporting this. We're going to be running this experiment in fall 2025, in the course, for part of the course content. So ask me again in January how I feel about the whole thing.
