Study: Students Who Used ChatGPT Did Not Get Better Grades on Essays
Plus, OpenAI says it can detect AI in images. Plus, Adobe is OK with industrial cheating. Plus, ICAI awards.
Issue 269
To join the 3,741 smart people who subscribe to “The Cheat Sheet,” enter your e-mail address below. New Issues every Tuesday and Thursday. It’s free:
If you enjoy “The Cheat Sheet,” please consider joining the 15 amazing people who are chipping in a few bucks via Patreon. Or joining the 32 outstanding citizens who are now paid subscribers. Thank you!
Study: Using ChatGPT Doesn’t Improve Essay Grades, Though I Am Quite Sure That Doesn’t Matter
Admittedly, this, from the journal Nature, is a minor study. Its sample size was small, just nine students in the control group and nine in the experimental group. Nonetheless, it has some findings that are, I think, worth adding to our conversation about the use of generative AI in academic settings.
The paper was published in October and written by Željana Bašić, Ana Banovac, Ivana Kružić & Ivan Jerković — all from the University of Split in Croatia.
To set the scene, the researchers had nine students write an argument essay using ChatGPT and nine write an essay on the same topic without it. The writing was observed, timed, and submitted to professors for grading after being anonymized. Final papers were checked for authentic text, by which the team seems to mean plagiarism, and checked for AI-generated text. The essay writers were grad students.
For grades, the research found that essays in both groups received an average grade of C. And that:
Our results demonstrate that the ChatGPT group did not perform better in either of the indicators; the students did not deliver higher quality content, did not write faster, nor had a higher degree of authentic text.
The finding about writing speed is interesting. And that the GPT text was not better than actual human writing does not really surprise me.
The researchers do note that, although the average grades were the same, students who wrote without AI did do slightly better on average, and the AI-generated text did show up more frequently as “unauthentic.” The authors take the findings about the grades as evidence that the results
can relieve some concerns about this tool’s usage in academic writing.
If the concern is that students who use AI will outperform those who don’t use it, I agree. But if the concern is that students who use AI are not demonstrating their abilities or, at best, demonstrating different ability, then the concern very much remains. Or if the concern is that students are not learning by doing the writing, that remains as well.
I am highly skeptical that research showing that AI users and non-AI users get similar marks will do anything but incentivize inappropriate use of AI. I think if you told students they could write the paper themselves and get a C or have AI spit it out and do some refining and editing for the same grade, most would take the easy way.
Rephrased, I think most students use AI not because it’s faster or better, but because it’s easier. They will take the C with a smile.
Anyway, the paper also has other findings worth sharing.
The authors entertain the idea that writing with AI is actually harder than just writing without it. It’s a theory I’m inclined to accept, and one that may explain why writers who are already skilled or know the material well tend to get more benefit from AI than others, making AI an accelerator of existing advantage rather than an overall leveler.
This paper says:
The overall essay score was slightly better in the control group, which could probably result from the students in the experimental group over-reliance on the tool or being unfamiliar with it. This was in line with Fyfe’s study on writing students’ essays using ChatGPT-2, where students reported that it was harder to write using the tool than by themselves (Fyfe, 2022). This issue is presented in the study of Farrokhnia et al., where the authors pointed out the ChatGPT weakness of not having a deep understanding of the topic, which, in conjunction with students’ lack of knowledge, could lead to dubious results (Farrokhnia et al., 2023). Students also raised the question of not knowing the sources of generated text which additionally distracted them in writing task (Fyfe, 2022).
And further:
This demanding task could have been even more difficult when using ChatGPT
Again, I buy that.
Similarly, and as mentioned up top, the paper showed that writing with ChatGPT was not faster. And that the more time a student took to write their paper, the better they did:
The other interesting finding is that the use of ChatGPT did not accelerate essay writing and that the students of both groups required a similar amount of time to complete the task. As expected, the longer writing time in both groups related to the better essay score. This finding could also be explained by students’ feedback from Fyfe’s (2022) study, where they specifically reported difficulties combining the generated text and their style. So, although ChatGPT could accelerate writing in the first phase, it requires more time to finalize the task and assemble content.
I also think that, to the extent that these findings hold, understanding that using AI to write is neither easier nor faster will be very helpful in limiting its academic usage. I mean, if it’s not faster or easier or better, what’s the point?
On the AI detection front, there are also some good notes to take from this paper. It says, for example:
The available ChatGPT text detector (Farrokhnia et al., 2023) did not perform well, giving false positive results in the control group.
Though, a few big, red-letter caveats are in order.
One, the team used OpenAI’s classifier, which has repeatedly performed very, very poorly. So poorly, in fact, that OpenAI shelved it. In other words, it’s a real mistake to use OpenAI’s bad results as a proxy for all detection.
Two, even OpenAI’s classifier got the classifications largely correct. From the paper:
The AI text classifier showed that, in the control group, two texts were possibly, one likely generated by AI, two were unlikely created by AI, and four cases were unclear. The ChatGPT group had three possible and five cases likely produced by AI, while one case was labeled as unclear.
I can’t really defend OpenAI’s findings in the control group — flagging three of nine as at least “possibly” AI is not great. But for the AI group, it went eight for nine, with the ninth being inconclusive. If you discard the “unclear” results, even this test was 10 of 13 in accuracy — 77%.
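For anyone who wants to check my math, here is a quick back-of-the-envelope sketch in Python using the classification counts quoted above. How the paper’s categories map to “correct” is my reading, not the authors’:

```python
# Counts reported in the paper, per group.
# Control group (human-written essays):
control = {"possibly": 2, "likely": 1, "unlikely": 2, "unclear": 4}
# ChatGPT group:
chatgpt = {"possibly": 3, "likely": 5, "unlikely": 0, "unclear": 1}

# My assumption: "possibly" and "likely" count as AI flags,
# "unclear" cases are discarded.
correct = control["unlikely"] + chatgpt["possibly"] + chatgpt["likely"]
classified = sum(v for k, v in control.items() if k != "unclear") + \
             sum(v for k, v in chatgpt.items() if k != "unclear")

print(f"{correct} of {classified} correct = {correct / classified:.0%}")
# -> 10 of 13 correct = 77%
```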
Finally on this point, though the study authors said the AI detector “did not perform well,” they do not call for avoiding such tools. Instead, they write:
The detectors of AI-generated text are developing daily, and it is only a matter of time before highly reliable tools are available.
It is. Highly reliable tools are available.
OpenAI Says it Has a Detector for AI Images
OpenAI, the maker of ChatGPT, said it is working on a classifier for AI-produced images, according to some news coverage.
From the story:
“Our internal testing has shown promising early results, even where images have been subject to common types of modifications,” reads a blog update. “We plan to soon make it available to our first group of testers — including journalists, platforms, and researchers — for feedback.”
A needless reminder that OpenAI said that detecting text created by AI — they are in the business of creating text with AI — was impossible (see Issue 241). But detecting AI fingerprints in images, no problem. Like magic.
It’s also interesting to me that this photo detection tool from OpenAI is woven into a story about election misinformation and deception, about which the reporting says:
OpenAI, arguably the world’s most influential generative AI company, has moved to clarify its policies relating to elections, stating that people cannot use its GPT tools to impersonate politicians or try and dissuade voters from going to the ballot box.
Ah. I see. They want to be clear that using their technology to misrepresent things in politics is bad. However, they have said next to nothing about using their technology to misrepresent things in education. In fact, OpenAI has been pretty consistently on the other side of that issue, doing everything they can to make misrepresentation in education as easy as possible.
The more I pay attention to OpenAI, the less there is to like. We can detect your AI, no one can detect ours. Don’t use our stuff in politics. But in the classroom, have a blast.
Anyway, OpenAI says it can detect AI when it’s in images. Thought you should know.
Chegg CEO is on the Board of Adobe
Not really breaking news, but Dan Rosensweig, the CEO of cheating giant Chegg, is on the Board of Directors of Adobe, the software company with many products in frequent use by educators.
It’s more evidence of cheating companies being woven into legitimate ones. And evidence of how so many otherwise credible people and companies don’t do any due diligence at all.
As a reminder, Chegg’s value has dropped about 90% over the past two years, it’s tangled in several investor legal challenges, and it’s being sued over its core offering — the answers to homework and test questions. And, just by the way, it’s probably the world’s largest cheating profiteer, as more than 90% of its revenue comes from selling answers.
Adobe, it seems, is fine with all that.
ICAI Awards Deadline January 31
The International Center for Academic Integrity (ICAI) will host its annual conference in Canada in March.
As part of that, ICAI will also issue several awards. Nominations for those awards close on January 31. If you know someone deserving of an award, please nominate them. Recognition is important.