(352) Research: Human Graders Do Not Spot AI, Authentic Assessments Are No Shield
Plus, a Thai Police Colonel is busted cheating on an exam to be a Judge. And a provost at WVU talks about academic integrity and AI in online classes.
Issue 352
Subscribe below to join 4,550 (+4) other smart people who get “The Cheat Sheet.” New Issues every Tuesday and Thursday.
The Cheat Sheet is free, although patronage through paid subscriptions is what makes this newsletter possible. Individual subscriptions start at $8 a month ($80 annual), and institutional or corporate subscriptions are $250 a year.
Study: Human Graders Cannot Detect Text Created by Computers, “Authentic Assessments” Are No “Panacea” for AI Cheating
A new study hit the public square recently, showing — again — that human educators and graders are not good at spotting AI text, even when they’re told it’s there.
Perhaps more significantly, the study also shows that, air-quote:
authentic assessments are neither a shield for academic integrity nor an immediate solution to the GenAI challenge.
The paper is from Alexander K. Kofinas, Crystal Han-Huei Tsay, and David Pike. Kofinas and Pike are listed as affiliated with the Graduate School of Business at the University of Bedfordshire, while Tsay is affiliated with the University of Greenwich, in London.
In addition to the two findings I’ve highlighted in the headlines, the paper also has two other interesting contributions. Well, they’re interesting to me.
And while I have a few minor issues with the research construction, I think the findings are well considered, especially since they generally reinforce the results we’ve seen from other inquiries.
The Design
The researchers took existing papers that they believed to be human-written, as evidenced by low probability scores from Turnitin’s AI detection systems, and modified some with AI, creating AI-edited text. They also used ChatGPT to create AI-only submissions to the assessment prompts. So, there were three types of papers — presumed human, AI-assisted, and AI-generated. The team also consciously selected papers that were at a failing level, some that were adequate, and others that were considered well done.
These papers were given to university-level graders who were told that text created by AI was among the papers they were asked to review. The paper notes that it is accepted practice to have student work reviewed independently by two graders, who then confer to reconcile any discrepancies in their assessments — a practice the team followed for this test.
The two central questions this research aimed to answer were — can experienced higher education graders identify which assignments were AI-created or AI-modified? And would a deliberate attempt to use “authentic assessments” make a difference in whether or how the AI material was detected or scored?
The Findings
In short — no, and no.
From the paper summary:
Our findings indicate that markers, in general, are not able to distinguish assessments that have had GenAI input from assessments that did not
And:
Our findings also suggest that the level of authenticity in an assessment has no impact on the ability to safeguard against or detect GenAI usage in assessment creation … and that the higher education sector cannot rely on authentic assessments alone to control the impact of GenAI on academic integrity
That humans “are not able to distinguish” the AI text is not a new finding (see Issue 253 or Issue 325).
And although there has not been much research on the AI cheatability of “authentic assessments” directly, numerous people and organizations have raised the idea that they are some kind of antidote to using AI to fake scholarly work — a claim which I have routinely questioned (see Issue 138 or Issue 296).
One such organization, as the authors of this study point out, is the UK’s QAA — Quality Assurance Agency — which published a paper describing generative AI as “a powerful catalyst for change” and called for “reviewing assessment strategies.” Among the “desirable outcomes” for such a review was:
Developing a range of authentic assessments in which students are asked to use and apply their knowledge and competencies in real-life, often workplace related, settings.
QAA does not outright say that this will limit AI-related misconduct, but the implication is pretty strong since the entire paper is about how schools should deal with AI.
Another institution on the “authentic assessment” train is the University of Texas at Austin, which made the utterly indefensible decision to turn off its AI detection systems (see Issue 296). In trying to defend or explain that craziness, the school wrote, in part:
We have created lots of resources around developing authentic assessments and to help move us away from simple assignments for which an essay turned in using a large language model would be sufficient.
That’s the obviously absurd, though often repeated, idea that if AI can pass your assessment, you have a bad assessment. The idea that, if teachers just reframed their entire assessment matrices and made them more “authentic” to students’ lives or workplace realities, cheating would go down because AI would be insufficient.
Neither idea was ever true. They are justifications for not actually dealing with academic misconduct. Redesign your assessments to be “authentic” was the pot of gold at the end of the rainbow — fiction, an enticing distraction, and perpetually unattainable.
Authentic assessments may be better, even significantly, for any number of reasons. But they are zero percent better at preventing cheating.
But I rant. Circling back to this study: it found that — surprise, surprise — “authentic assessments” could quite easily be done by ChatGPT, receiving good grades and going undetected by human graders. The paper says, citations removed:
it is doubtful whether authentic assessments are the panacea that the Quality Assurance Agency for Higher Education and Advance HE suggest when dealing with GenAI.
Further:
it seems that moderate/high-level authentic assessments are neither a shield for academic integrity nor an immediate solution to the GenAI challenge. As we saw in the methodology section, some assessments were of low authenticity, and others were of high authenticity; however, this made very little difference in detecting GenAI usage. All nine assessments were relatively easy to reproduce or modify using GenAI, and the markers could not readily identify the difference between human-authored and GenAI-augmented or GenAI-generated variations.
More:
Authentic assessments by themselves cannot provide enough of a safeguard against using GenAI to circumvent assessments
Shorter version — if “authentic assessments” are your solution, you have no solution.
Obviously, I think that’s a significant finding. But I don’t want it to take away from the other topline finding that, once again, experienced teaching and grading professionals struggled to spot AI text in assessments, even when they were told AI was present.
From the paper again:
our results strongly indicate that assessments were easily compromised using GenAI, and many of our markers found it challenging to identify which were human-authored, which were GenAI-modified, and which were GenAI-authored.
Strongly indicate — and easily compromised.
Two Other Noteworthy Points
The simpler of the two other points from this paper is that, when presented with a flag of possible AI use from a detector, the protocol at some UK institutions appears to be to move to an oral examination. Sounds great to me.
The problem is that the research team thinks a high volume of likely AI text flags will necessitate a high number of oral reassessments. That seems probable. In which case, they argue, schools should scrap the written assessments and move directly to in-person, oral exams. From the paper, citations removed for ease of reading:
using such tools can be problematic, as markers may not easily assess the level of knowledge demonstrated by the student without further assessment. In fact, in many universities, current guidance regarding the usage of GenAI suggests that for all assessments that score high on AI detection, students would need to undergo an oral examination, making the marking process far more onerous and time-consuming. Extrapolating from this, if many written assessment submissions require a second layer of oral examinations to prove their academic integrity, we could argue that such assessments should be replaced altogether with face-to-face types of assessments based on performance or oral presentation.
Although that’s implausible in most online and assembly-line schools, the team has a point.
Finally, and more significantly, there is this:
We suspect that the presence of GenAI indeed influenced the markers' grading process. Considering that the markers were aware that some of the samples they were marking had been generated by GenAI, it seems possible that markers may have subconsciously deducted marks from work they perceived as “too perfect.”
And:
the presence of GenAI affects the way markers approach the marking process
Wow.
By using papers that had already been submitted and graded once, this research was able to compare grades and found that, upon a second scoring, good papers from assumed human sources were downgraded, presumably because graders were concerned they might be AI. Again, wow.
As yet another way AI misuse hurts students, this one had not occurred to me. And I have to say, I find the possibility disturbing. That a good, honest student could be penalized by comparison — say, three students are marked as A-level performance, but only one did honest work — and penalized again in the marking process because work that is too good may look suspicious, is alarming.
If this is happening, it makes not detecting AI or not enforcing policies about its use all the more unethical. How educators and schools can stand aside and let dishonest students penalize honest students, I cannot understand.
For the record, this research also found that AI use generally did not significantly improve the grades of poor papers; it did improve the grades of papers in the middle ranges, though not significantly.
But the idea that the mere presence of AI in a group of submissions — which is a certainty, by the way — could suppress the grades of high-performing students ought to be terrifying.
Another Note on “Authentic Assessment” and Misconduct
Those even more interested in connections between “authentic assessment” and cheating may wish to review this paper from Tim Fawns, Margaret Bearman, Phillip Dawson, Juuso Henrik Nieminen, Kevin Ashford-Rowe, Keith Willey, Lasse X Jensen, Nona Press, and Crina Damşa.
Fawns is with Monash University, Bearman and Dawson are with Deakin University, Nieminen is with the University of Hong Kong, Ashford-Rowe and Press are with Queensland University of Technology, Willey is with the University of Technology Sydney, Jensen is with the University of Copenhagen, and Damşa is with the University of Oslo. The paper was published in September 2024.
It’s less a research study than a thoughtful literature review and piece of persuasion. It addresses the supposed link between “authentic assessment” and cheating mitigation.
I’ll share a few bits from the paper, citations removed. For example:
Claims that authentic assessment is an effective way to address cheating are widespread, likely in response to the rise in contract cheating and widely-available generative AI technologies. The fundamental (and very appealing) idea is that through authentic assessment, we can ‘design out’ cheating
Indeed.
Continuing:
There appears to be minimal empirical evidence that authentic assessment prevents cheating, despite some confident claims that it does.
And:
In short, we are not aware of any evidence demonstrating that authenticity, in and of itself, reduces rates of cheating.
On the other hand, there is empirical evidence that authenticity does not solve the problem of cheating.
Exactly.
University of Texas, please call your office.
News from West Virginia
A small, buried news article about online classes at West Virginia University also features comments about academic integrity from an interview with Evan Widders, an Associate Provost for Undergraduate Education.
After discussing that online classes at WVU are growing, the article includes:
“As AI continues to advance, it can be more difficult to make sure that academic integrity is being observed in online courses,” he said. “We’ve found ourselves in this sort of arms race at times with these artificial intelligence providers who are out there providing subscriptions and selling ways for students to more or less cheat.”
Determining if a student used AI for plagiarism can be difficult, Widders said.
“There are no tools we are aware of that can provide a decent confidence level that a student has used AI in an assignment,” he said. “With traditional plagiarism, we had very good tools that were available. AI is a much bigger challenge now.”
However, professors might be able to notice subtle differences, Widders said.
“Generally the thing the AI has written doesn’t seem like the student’s work necessarily,” he said. “In that case, it starts a conversation.”
He goes on to say that the school is trying to teach students to use AI ethically:
“A lot of our efforts revolve around the ethics of AI,” he said. “It’s trying to teach students the ethics of using a tool that can serve as a shortcut.”
Evan, call me. I’ve seen this one. I know how it ends.
Thailand: Police Colonel Caught Cheating During Exam for Judgeship
I try not to veer more than one degree away from textbook academic integrity, although I do swerve now and again into exam fraud generally — as I am doing now because this headline is just too good.
Coverage from Singapore has this headline:
Thai Police Colonel Caught Cheating During a Judge Exam, May Face Suspension or Dismissal
First is the obvious — cheating police and would-be judges. Fantastic.
Second, I give the guy credit. Earbuds? ChatGPT? Did he Chegg it? Nope — dude used an old-school cheat sheet, tucked under his exam papers. There are photos. Respect.