332: Research: Professors (Still) Have "Substantial Difficulty" Spotting AI Text
Plus, I'm jealous of England. Plus, some insightful numbers on misconduct from Western University in Canada. Plus, ICAI early registration.
Issue 332
Subscribe below to join 4,272 (+9) other smart people who get “The Cheat Sheet.” New Issues every Tuesday and Thursday.
If you enjoy “The Cheat Sheet,” please consider joining the 16 amazing people who are chipping in a few bucks via Patreon. Or joining the 44 (+1) outstanding citizens who are now paid subscribers. Paid subscriptions start at $8 a month or $80 a year, and corporate or institutional subscriptions are $240 for a year. Thank you!
Research: Professors Can Differentiate AI Text and Student Writing Only 70% of the Time. It’s Not as Good as It Sounds.
This research is not brand new. It was published in July of 2024 and tested whether professors, when given two short passages of text, could tell which one was created by ChatGPT and which was written by a student.
The paper is by Tal Waltzer, Celeste Pilegard, and Gail D. Heyman, all of whom are from the University of California, San Diego.
The test asked 140 professors and college teaching assistants to review six pairs of short writing samples with the same subject or prompt. The instructors were told one was composed by AI, the other was written by a college student. They were asked to say which was a bot and which was a student.
The headline finding is that:
Instructors correctly identified which essay was written by the chatbot 70% of the time
That may seem fine, but chance is 50%. So, professors beat blind guessing by just 20 percentage points. That’s less fine. It also means that 30% of the time, professors tagged AI text as human and human text as AI. That’s very not fine.
Also, for the record, that 70% accuracy rate is considerably lower than any good AI detection system. Good detectors catch AI text in the high 90% range while incorrectly flagging human text as AI only about 2% of the time. And they manage that without the luxury the professors had: comparing a known AI sample against a known human one, side by side.
In fact, the 70% success rate looks even weaker since the test subjects — the educators — knew that one sample was AI. As we saw in other research, when teachers are not told to look for AI, they miss 94% of AI work (see Issue 325). But even knowing that one of two options is AI, teachers were still only 70% right.
That’s less than not fine, that’s really pretty awful.
The team repeated the test of paired text, with one AI and one human, with college students and with ChatGPT itself. Both were less accurate than professors and TAs, which I guess is something. For the record, ChatGPT cannot tell whether something was written by AI or not. Do not ask it. It has no idea.
I found it interesting that familiarity with the subject of the text did not improve AI detection accuracy. Also interesting was that professors said they used errors in the text — grammar or spelling or odd sentence structure — to decide a text was human. That makes sense, especially when you can see two similar texts side by side, one with an error, another without. But as a stand-alone query, errors may indicate intent to deceive, as adding errors is a common way to try to trick AI detectors. And, it seems, professors.
That professors and teaching assistants can spot AI text only 70% of the time, even when they know they are looking at it, and that they misdiagnose human writing as AI at a 30% clip, means that schools should be announcing that student writing may no longer be scanned for AI by teachers, due to the likelihood of inaccuracies. If administrators are drawing the line at a 2% false positive rate and all those “false accusations” from AI detection software, there is no way they will allow a 30% false positive rate. Right?
It’s snark. I am not sorry.
Still, the paper has other findings, such as confidence levels and scoring of the submissions. So, you may want to give the paper a review.
Whether you do or not, the conclusion is that teachers aren’t necessarily good at spotting AI chatbot text:
This relatively poor performance suggests that college instructors have substantial difficulty detecting ChatGPT-generated writing.
Also:
an inability to do so could threaten the ability of institutions of higher learning to promote learning and assess competence.
You’re probably tired of reading it by now, but refusing to use an AI detection system is not smart. People, even trained professors, cannot do this on their own.
England’s Policy to Prevent Test and Qualifications Fraud
Ofqual is England’s Office of Qualifications and Examinations Regulation, and according to their publication linked below:
Ofqual regulates the qualifications, examinations and assessments of over 200 awarding organisations (AOs) in England. We set rules about how they should safely design, deliver and award qualifications. This includes rules about how they should prevent, detect and deal with malpractice and qualification fraud.
The United States has nothing even remotely like this. Here, exam security for professional licenses or certifications is left to the providing authority, which presents many potential conflicts.
So, when I read about stuff like this, I feel like a jilted ex — angry and jealous. Jealous because I question why we can’t have nice things like an exam watchdog. Angry because it did not work out.
Recently, the United States has had exam and credential fraud in teacher certifications, insurance licenses, real estate, police departments, fire departments, elevator repair, and financial auditing. That’s just off the top of my head, and just the ones we know about. It is currently not working out.
Anyway, over in England, where people take things seriously enough to have someone in charge, Ofqual last November published a new “action plan for the prevention of qualification fraud.” Imagine that.
From it, I will highlight a few things. It starts, in part:
Everyone should be able to trust that a certificate proves that the holder has undertaken specific training and demonstrated the appropriate level of competence. Employers want to trust that people are properly and legitimately qualified to do their job.
I am so jealous right now.
Among the steps it proposes, or will take:
Ofqual will ensure awarding organisations understand the implications of permitting or failing to prevent qualification fraud. We will also explore all avenues to tackle organisations committing qualification fraud. This includes working with government agencies to assess whether existing legislation can be used to prosecute organisations committing this type of fraud.
I note the inclusion of “failing to prevent qualification fraud.” And the not-so-subtle threat of prosecution for organizations that engage in, allow, or fail to stop fraud.
The English government gets the link between shoddy credentials and public trust and safety. Here, we just say we do. It’s pretty hard to make the case that we actually do.
Western University in Canada Releases Misconduct Numbers
The CBC reports on the most recent academic misconduct figures from Western University, which are both unremarkable and insightful.
The lead paragraphs are:
Hundreds of Western University students failed assignments, courses and exams last year for cheating, plagiarizing, or engaging in one of several scholastic offences, a new report shows.
According to the London, Ont., university's most recent annual report on scholastic offences, covering July 1, 2023 to June 30, 2024, at least 426 offences were recorded, roughly the same as 2022-23 and 2021-22.
I have just a few things to say about the numbers overall.
One, a big cheer for Western University for sharing. You may not have noticed, but for the first two “year in review” issues of The Cheat Sheet, I started with a long list of every school for which incidents of misconduct were public, usually in the news. Originally, I meant the list to be a kind of wake-up call about the size and scope of academic misconduct. I did not do a list this year because, after three years of writing about cheating, I see news about misconduct differently. Schools that release data and discuss cheating openly are to be congratulated. We cannot fix a problem until we know what it is. Good for Western.
The other point here is that, as I say often, I have no sense whatsoever whether 426 “offenses” is the right number for Western. For a school as large as Western, I prefer seeing 426 to, say, 46. But other than that, I don’t draw much from the topline figure.
A third point is the reminder that the number of cases at any school is a very poor stand-in for actual cheating rates, since most misconduct goes undetected and what is detected is very infrequently escalated into formal cases, which are the numbers we see. In other words, 426 is not the number of cheating incidents at Western.
In the news coverage, the numbers were broken down further:
At least 143 students plagiarized assignments or tests, said the report, which goes before Western's senate on Friday.
Another 110 students were caught cheating on exams, while 91 engaged in unauthorized collaboration on an assignment or exam. Eleven engaged in contract cheating, in which a student pays a website or asks a personal acquaintance to complete an essay or assignment for the individual.
A student representative said about the misconduct cases report:
"The more supports there are for students, the better academic integrity is going to be," she said, pointing to essay support, peer tutoring and compassionate academic consideration as examples.
"When students feel supported ... I feel like they're less likely to go down a path that maybe would land them on the scholastic offences report."
Sure. Student support is important, and it cannot help but reduce misconduct. But let’s be clear as well that a significant share of academic misconduct is not driven by stress or panic, the types of factors that may be mitigated by support. Much — maybe most — cheating is driven by opportunity and rationalization, and those are best mitigated by limiting opportunity and changing the risk/reward calculus.
Moving on.
For my final point here, I’m going to take sections of the article in reverse order because I think it’s more illustrative. One:
At Western, decisions around AI use is largely left to each professor, [an administrator] said. Many offences in the report, however, were problems long before AI.
While tools like Turnitin and Proctortrack can help with plagiarism and cheating, "no tool is going to be completely foolproof," she said.
Earlier this year, Western announced it would stop using an AI writing detection from Turnitin over concerns about inaccurate results, according to the Western Gazette.
No integrity tool is foolproof. No tool is. But, if that reporting is right, Western unplugged its AI detection software, closing its eyes to what is now probably the most frequent method of cheating.
And now we consider this, from the article:
At least 11 offences involved students using content generated by AI tools without authorization or attribution, the report said.
Eleven.
That is simply not credible.
You’re telling me that at Western University, less than 3% of all academic integrity cases involved inappropriate or unauthorized use of generative AI. Three percent? The school has an enrollment of about 36,000 students and there were eleven formal cases of misconduct with AI? Really? I’ll do that math for you too. That’s .03%. Point. Zero. Three. That’s three in ten thousand.
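For anyone who wants to check that arithmetic, here is a quick sketch using the figures cited above (426 recorded offences, 11 involving AI, roughly 36,000 students enrolled):

```python
# Sanity check on the Western University numbers reported above.
total_offences = 426   # total recorded scholastic offences, 2023-24
ai_offences = 11       # offences involving generative AI
enrollment = 36_000    # approximate student enrollment

share_of_cases = ai_offences / total_offences * 100
share_of_students = ai_offences / enrollment * 100

print(f"AI cases as a share of all offences: {share_of_cases:.1f}%")
print(f"AI cases as a share of enrollment: {share_of_students:.2f}%")
```

The first figure comes out around 2.6%, the second around 0.03%, which is the "three in ten thousand" above.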
Now, do we think that Western has 11 cases of misconduct with generative AI because so few students are doing it? Or is it more likely that Western has so few cases of misconduct with generative AI because they unplugged their detection system?
It’s amazing what you cannot see when you close your eyes.
Early Registration Discounts for ICAI 2025 Still Available
ICAI, the International Center for Academic Integrity, has extended its early registration discounts for the 2025 conference, coming up in March in Chicago.
The same link also connects to the conference schedule and award nomination opportunities. Plus, hotel info.
The conference will be, I have no doubt, well worth your investment, despite the highly questionable keynote speaker (see Issue 320).