Research Edition: Language Bias Among AI Detectors and "Low-Effort" Interventions that Reduce Cheating
Research out of Stanford on Language Bias is Really Bad. UC Riverside Research, Meanwhile, Spins Gold. Plus, Class Notes.
Issue 216
To join the 3,316 smart people who subscribe to “The Cheat Sheet,” enter your e-mail address below. New Issues every Tuesday and Thursday. It’s free:
If you enjoy “The Cheat Sheet,” please consider joining the 13 amazing people who are chipping in a few bucks a month via Patreon. Or joining the 12 outstanding citizens who are now paid subscribers. Thank you!
Class Notes
I've mentioned a few times that I envisioned “The Cheat Sheet” as a way to share important news and research about academic integrity.
Unfortunately, I’m not doing as much on the research side as I’d like. The reason is simple. Analyzing research takes time. I’m the guy who reads the sources in the footnotes. The explanations and details are often complicated. The breakdowns tend to be long.
Anyway, in this Issue, I’m focusing on two recent papers that are being shared and influencing our dialogue. If research reviews aren’t the content you most want from the newsletter, I understand and have given thought to creating separate editions on a different schedule than the regular, more news-laden issues. Frankly, I’m not sure I have the time to do more issues beyond the regular Tuesday, Thursday cadence, but maybe.
If you have thoughts about the mix of research-related content versus news and comment, or how to handle that, please let me know. Replies to the newsletter come to me.
In the meantime, here are some (long) notes and thoughts on two recent pieces of academic integrity research. I hope you find them helpful. Regular content will resume Thursday.
Researchers Slam Potential Language Biases in AI Detection - I Have Serious Questions
In April, a group of researchers at Stanford University led by James Zou of the Departments of Computer Science, Electrical Engineering and Biomedical Data Science, published a paper with the seemingly conclusive headline:
GPT detectors are biased against non-native English writers
For researchers at Stanford to make such a definitive statement, they must have a solid case, right?
No, not really.
We’ll get to that in a second.
I normally would not care what people kind of make up. But this “research” is already finding its way into news coverage. This article from the Australian Broadcasting Corporation (ABC), for example, cites the above research. Other people have been using the paper to justify pretty wild stuff.
As such, it deserves some review and scrutiny.
Another quick detour before we start, though. I concede that AI-similarity detection systems probably do tend to highlight text written by non-native speakers more often than text written by native English speakers.
If you understand that these “AI detectors” are trained to find the cues of predictable, average and repetitive writing, that makes sense (see Issue 215). If you asked me to write in Italian, I’m sure my prose would be very textbook, formulaic and clunky. It would feel like a computer wrote it. I’m not sure that’s bias.
Further, it’s easy to accept that non-native writers may use AI technologies such as grammar fixers or translators at a higher-than-average clip - tools whose output AI classifiers can flag because it is, in fact, generated by AI.
Even the study’s main author concedes the simplified writing part, as he told ABC:
Professor Zou said many of the current AI detection algorithms had an over-reliance on a "perplexity" metric, a measure of complicated words being used in the text.
"If there are a lot of complicated words, then they'll have high perplexity," he said.
Non-native speakers' writing was often misclassified as AI generated, Professor Zou said, because they did not use as many "fancy" words.
That’s right. The average and predictable writing of some non-native writers can feel like it’s written by a computer because, as the professor says, they don’t use “fancy” words. Or something like that.
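To make the perplexity idea concrete, here is a minimal sketch of how a perplexity-based check works, assuming GPT-2 through the Hugging Face transformers library. The model choice and the cutoff are purely illustrative - not what any particular commercial detector actually uses.

```python
# Minimal sketch of a perplexity-based "AI detector," assuming GPT-2 via
# Hugging Face transformers. The threshold is illustrative only; real
# detectors use their own models, calibration, and scoring.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under GPT-2."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # The model returns the mean cross-entropy loss when labels are given.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()

def flag_as_ai(text: str, threshold: float = 60.0) -> bool:
    # Low perplexity = predictable, "average" wording = more likely flagged.
    # Simple, formulaic prose (human or not) can land below the threshold.
    return perplexity(text) < threshold
```

The point is the last function: anything the model finds predictable - including plain, textbook prose from a second-language writer - scores low and gets flagged.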
But if you’re using these detection tools correctly, that’s not a big deal. Educators know which students are second-language writers. They know their writing styles. Or at least we hope they do. And if they do, seeing a flag from an AI-similarity checker would be easy to understand and easy to dismiss. No effort. No problem. And if there are questions, a quick follow-up conversation with the student can often resolve it. Again, no big deal.
Now, at last, to Professor Zou’s research.
First, there’s what ought to be an obvious error in the paper’s headline. As the professor says, most “AI detectors” actually detect average writing. That’s not bias. That’s math. A detector does not know who, or what, wrote the text, so it cannot possibly be biased on the basis of origin.
Then, on the very first page of the paper, the team writes:
Although several publicly available GPT detectors have been developed to mitigate the risks associated with AI-generated content, their effectiveness and reliability remain uncertain due to limited evaluation. This lack of understanding is particularly concerning given the potentially damaging consequences of misidentifying human-written content as AI-generated, especially in educational settings.
Two problems. One is, as mentioned, that a flag for AI similarity - even a mistaken one - causes no damage by itself. The potential damage comes when that information is improperly acted upon, and even then it’s only potential. In such cases, the error is human, not in the classification. Well-trained instructors can reduce “potentially damaging” outcomes to zero.
The bigger problem is that, in the quoted section above, the last sentence footnotes two sources (citations 22 and 23). Neither one mentions misidentification of human writing at all, let alone any potential dangers. No, really. The paper’s citations on “the potentially damaging consequences of misidentifying human-written content” simply are not about that.
Go check. Here they are. The links are mine:
22. Rosenblatt, K. ChatGPT banned from New York City public schools’ devices and networks. NBC News (2023). Accessed: 22.01.2023. [link]
23. Kasneci, E. et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learn. Individ. Differ. 103, 102274 (2023). [link]
One is an article from NBC about New York City schools blocking access to ChatGPT on their systems. The other is a “position paper” about how ChatGPT could perhaps be used in education. Not a word in either one about the dangers of misidentification. Not a word about misidentification at all.
I confess, I have no idea how that can happen in research from Stanford.
The paper does say:
Given the transformative impact of generative language models and the potential risks associated with their misuse, developing trustworthy and accurate detection methods is crucial.
I agree, it is. Detection tools should be accurate and trustworthy.
But here’s the hill on which this paper is built:
In this study, we evaluate several publicly available GPT detectors on writing samples from native and non-native English writers. We uncover a concerning pattern: GPT detectors consistently misclassify non-native English writing samples as AI-generated while not making the same mistakes for native writing samples. Further investigation reveals that simply prompting GPT to generate more linguistically diverse versions of the non-native samples effectively removes this bias, suggesting that GPT detectors may inadvertently penalize writers with limited linguistic expressions.
Again - these detection systems don’t determine AI authorship, and we’ve discussed why they are likely to flag the kind of predictable, not-fancy writing that some non-native writers may use. And - again - these detection systems don’t penalize anyone, ever. So, two more errors.
We’re still on the first page.
And those aren’t nearly the largest errors.
In the tests themselves, the authors say they tested “seven widely-used GPT detectors.” They name them: Originality.AI, Quil.org, Sapling, OpenAI, Crossplag, GPTZero, and ZeroGPT.
The problem is that at least two of these are spectacularly awful, and two I simply don’t know; they may be even worse. Moreover, the most widely used AI-classifier in education, by far, was not tested. The authors treat all seven as equally valid and lump them together as a proxy for all classifiers, which feels inappropriate. It’s like saying, “we tested seven cars and they were not fast, so cars are not fast.”
Moreover - and this deserves an exclamation point (!) - the paper says:
most of the detectors assessed in this study utilize GPT-2 as their underlying backbone model, primarily due to its accessibility and reduced computational demands. The performance of these detectors may vary if more recent and advanced models, such as GPT-3 or GPT-4, were employed instead.
You’re kidding, right?
In other words, not only did they test bad cars, they knowingly tested old cars. The Model T topped out at 45 miles an hour. It’s little wonder the classifiers didn’t perform so well when tested with GPT-3 and GPT-3.5. I mean, come on.
But even worse - yes, worse - the research then tested these seven systems on:
a corpus of 91 human-authored TOEFL essays obtained from a Chinese educational forum and 88 US 8-th grade essays sourced from the Hewlett Foundation’s Automated Student Assessment Prize (ASAP) dataset
Are you kidding me? I’m being punked, right?
The non-native papers they tested were “obtained from a Chinese educational forum.” By “educational forum,” they mean a test-prep center that sells courses and study guides for a variety of tests. It’s not what I’d call credible, by any definition. I certainly would not base research on it.
How TOEFL papers wound up on the site of a Chinese “test prep” company is anyone’s guess - cough, cough. Even if you buy that they’re legit, the TOEFL in China? If there’s a test that’s compromised more often, I’d like to know about it.
Are we sure that none of these papers used automated grammar checkers? No translation apps? No paid tutors or essay mills? I’m not. Do we know if these essays even passed the TOEFL standards? We do not. The authors tell us only that these test papers were “human-authored,” and I am not sure how they can be certain of that. It’s an odd chain of custody on which to base a study, at a minimum.
To summarize, we have supposedly human-written papers from non-native English speakers that passed through the possession of a test-prep company, tested on outdated AI checking tools, half of which were terrible to start with.
I’m overcome with confidence.
The paper continues that the detection systems:
misclassified over half of the TOEFL essays as "AI-generated" (average false positive rate: 61.22%). All seven detectors unanimously identified 18 of the 91 TOEFL essays (19.78%) as AI-authored, while 89 of the 91 TOEFL essays (97.80%) are flagged as AI-generated by at least one detector.
First, and again, AI-checkers don’t green-light or red-light papers. That’s not how this works. Is a score of 10% “likely AI-generated” considered a finding of “AI-generated”? Is 80%? The authors don’t say what threshold they used to determine when a flag was an indicator of being “AI-generated.” As such, this core finding doesn’t tell us much.
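To see why the missing threshold matters, here is a toy sketch with made-up scores showing how the same detector outputs produce very different “false positive rates” depending on where the line is drawn. None of these numbers come from the paper.

```python
# Toy illustration (made-up numbers): the same detector scores produce very
# different "misclassification" rates depending on the cutoff chosen.
# Scores are hypothetical "likelihood AI-generated" values for human essays.
scores = [0.05, 0.10, 0.35, 0.48, 0.52, 0.61, 0.75, 0.80, 0.92, 0.15]

def false_positive_rate(scores, threshold):
    """Share of human-written essays labeled 'AI-generated' at this cutoff."""
    flagged = sum(1 for s in scores if s >= threshold)
    return flagged / len(scores)

for threshold in (0.1, 0.5, 0.8, 0.98):
    print(f"cutoff {threshold:.2f}: FPR = {false_positive_rate(scores, threshold):.0%}")
```

Without knowing the cutoff, a reported “61.22% false positive rate” is hard to interpret.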
Still, I’m going to go out on a limb here and say that if seven AI similarity checkers - even the old clunkers - unanimously identified written work as likely AI-authored, we should consider that it may have been, in fact, AI-authored. When the results are that consistent, the glitch may not be where the authors think it is. Just maybe the source of the tested material wasn’t pristine, though that’s a possibility the authors never consider.
By now I hope you’re getting the point that this paper, despite being circulated and cited widely, is not great.
Even if you are, I’ve got one more. It requires you to stay with me just a bit longer.
To test their theory that the AI-similarity checkers were pinging on perplexity - the use of “fancy” versus predictable and basic words - the authors ran their test papers through ChatGPT, asking it to “enhance” the words in the Chinese TOEFL papers and to “adjust” and “simplify” the words in the 8th-grade papers to “mimic non-native speaker writing.” Then, our authors checked the papers again in the seven detectors.
After juicing up the TOEFL papers, they report that the classifiers flagged them less often, which is not surprising. Beyond muddying the signal on word choice, it’s well known that altering AI-created text does flummox some classifiers, especially the bad ones. If the papers had been run through a paraphraser, translation engine, or grammar changer, we would expect the flags to still exist but the confidence to drop. The text is still AI-generated, just less clearly so.
Meanwhile, about the 8th-grade papers, the authors write:
In contrast, applying ChatGPT to adjust the word choices in US 8th-grade essays to mimic non-native speaker writing (Prompt: "Simplify word choices to reflect non-native speaker usage.") led to a significant increase in the misclassification rate as AI-generated text, from an average of 5.19% across detectors to 56.65%
Did you catch it?
If you did, that’s where you probably just throw this research paper in the garbage.
In case you did not catch it - this research team took papers, told ChatGPT to change them, then, when AI-classifiers flagged them, called it “misclassification.” That’s simply mind-boggling. When ChatGPT changed the text as directed, that became AI-generated text. Flagging AI-created text as likely created by AI is not a misclassification.
I want to scream.
The team also, for the record, asked ChatGPT-3.5 to create responses to the “2022-2023 US Common App college admission essay prompts” and found that the detectors were “initially effective” at flagging the fraud. The detectors did struggle when the team told the AI to “elevate the provided text by employing literary language,” which is expected.
Earlier too, the research team found that:
The detectors demonstrated near-perfect accuracy for US 8-th grade essays.
In other words, when given straightforward classification tasks - unaltered 8th-grade essays or unaltered GPT-created content - even these old and bad detectors were “initially effective” and “demonstrated near-perfect accuracy.”
If I believed anything in this paper, I’d find that significant.
I’d also find it significant - actually, I think it is significant - that despite all this work and using words such as “bias” and “potential danger,” the paper does not call for not using AI classifiers. It calls for building and using better ones:
As demonstrated, a straightforward second-round self-edit prompt can drastically reduce detection rates for both college essays and scientific abstracts, highlighting the susceptibility of perplexity-based approaches to manipulation. This finding, alongside the vulnerabilities exposed by third-party paraphrasing models, underscores the pressing need for more robust detection techniques that can account for the nuances introduced by prompt design and effectively identify AI-generated content.
But no one on Earth expects the retellings of this study to include those bits. No one will say how, in this test, even the old and busted detectors were dead-on accurate when given direct classification tasks. Or how this study “underscores the pressing need for more robust detection techniques.” Which, I’d argue, we already have.
Instead, this work will be shared and repeated as evidence of bias and misclassification and harm in AI-classifiers. It already is. Some folks just can’t let accuracy muddle a philosophy.
“Low-Effort” Interventions Appear to Limit Cheating
On the other end of the research spectrum is this recent paper by a small team of scholars led by Frank Vahid of the Department of Computer Science at the University of California, Riverside. For the record, I’m working from this pre-publication version of the research.
The paper examines real world, in-class tactics for reducing cheating and measures their collective results. The interventions are, as the title foretells, designed to be easy. They’re also fairly practical and, judging by the outcomes, successful. As such, this contribution to the conversation is quite good.
There are also a few large gold nuggets buried in this paper, which I’ll highlight near the end. They are important.
To start, the “low-effort” integrity interventions delivered in an introductory computer science class were:
(1) Discussing academic integrity for 20-30 minutes, several weeks into the term, (2) Requiring an integrity quiz with explicit do's and don'ts, (3) Allowing students to withdraw program submissions, (4) Reminding students mid-term about integrity and consequences of getting caught, (5) Showing tools in class that an instructor has available (including a similarity checker, statistics on time spent, and access to a student's full coding history), (6) Normalizing help and pointing students to help resources.
Candidly, I love this. I’ve long maintained that the single best intervention for any classroom is simply talking about misconduct, sending the message of awareness and import. Second, I love that the discussion about integrity is intentionally not done early, because, according to the paper:
The discussion was done in Week 4, and not in the first week, because of our belief that talks on cheating held in the first week are less effective since no student is considering cheating in the first week
Brilliant.
I add that having this in the first week or first class tends to feel like a required disclosure, a checking of the box of housekeeping. Adults tend to discount such pronouncements, preferring to see what may or may not be actually important. Covering it early and never mentioning it again tells everyone it’s not.
The demonstration of detection tools is good too. Other research has supported that this can deter misconduct by making detection possibilities and probabilities real.
I confess that the policy of allowing students to withdraw submissions is new to me. And I like it too. As explained, a student may withdraw coursework and receive a zero, without subjecting it to review and possible sanctions if misconduct is found. This, the authors say, helps mitigate the desperate, deadline-driven cheating that can occur. A single bad decision under pressure can be erased, though appropriately not without penalty. Good stuff.
What surprised me about this seemingly obvious policy was that, according to the paper:
on average 10 students withdrew at least one program each term.
In a class of 100 students, which is roughly the size of the sections studied, that’s 10% of the class affirmatively taking the zero at least once to avoid integrity checks. To me, that says the policy works, especially when coupled with a culture in which detection is at least somewhat likely. Also, consider that withdrawing is essentially an admission of cheating - and still, 10%.
And here is the team’s signature finding:
The totality of the results above suggests that the low-effort interventions had a substantial impact on the student behavior metrics. Students spent more time programming, and a smaller percentage of students submitted highly-similar code.
The measurements were thoughtful and somewhat rigorous. The authors did not, for example, simply count the number of students cheating. In their data, they found that students in the intervention sections spent more time working on assignments and were less likely to submit answers with high similarity to other students’ work or to material from outside, disallowed sources.
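For anyone unfamiliar with what a code-similarity check does, here is a toy sketch using Python’s standard-library difflib. It only illustrates the concept; the tools the instructors actually used are not named here and are certainly more robust (real checkers normalize variable names, whitespace, statement order, and so on).

```python
# Toy similarity check using Python's standard library. Real code-similarity
# tools are far more robust; this only illustrates the basic idea of scoring
# how close two submissions are.
from difflib import SequenceMatcher

def similarity(code_a: str, code_b: str) -> float:
    """Rough 0-1 similarity between two code submissions."""
    return SequenceMatcher(None, code_a, code_b).ratio()

submission_1 = "total = 0\nfor n in nums:\n    total += n\nprint(total)"
submission_2 = "s = 0\nfor x in values:\n    s += x\nprint(s)"

score = similarity(submission_1, submission_2)
print(f"similarity: {score:.0%}")  # a high score would prompt a closer look
```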
On time-on-assignment, the team found that, in the intervention group, it:
rose, from 6 min 56 sec, to 11 min 6 sec, amounting to a 60% increase in time spent.
On similarity:
The average percentage across the 7 labs dropped from 33% to 18%, amounting to a 45% reduction.
The team checked the interventions with the same instructor too - same course, same class size, same modality - and found that, with the interventions:
median time [on assignment] rose, from 6:41 and 7:19, to 10:49 and 11:06, and the percentage of high-similarity students fell, from 34% and 29%, to 12% and 18%.
Those are big increases and big reductions. At least to me. And if the research stopped there, that would already be plenty for me to recommend it.
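If you want to verify the paper’s percentages from the quoted figures, the arithmetic is straightforward:

```python
# Checking the reported changes against the figures quoted above.
before, after = 6 * 60 + 56, 11 * 60 + 6      # time on assignment, in seconds
print((after - before) / before)               # ~0.60, i.e., the "60% increase"

sim_before, sim_after = 0.33, 0.18             # average high-similarity percentage
print((sim_before - sim_after) / sim_before)   # ~0.45, i.e., the "45% reduction"
```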
But, as mentioned, the gold.
Armed with data about student conduct in the intervention sections, the research team then examined the cases of known cheating. Their takeaway is that students who were caught in acts of misconduct tended to spend less time working on the assignments and made fewer attempts.
They have charts. They are eye-opening.
To me, these data undercut the idea that large portions of cheating behavior are driven by desperation after honest effort. Cheating due to repeated lack of success definitely happens, but in these examples, the cheaters did nearly no initial work whatsoever. There’s a clear correlation between lack of effort and cheating, at least in these classes.
Further, and to their enormous credit, the writers of this paper examine whether such integrity-focused interventions negatively impacted teacher evaluations. They write:
A concern that many instructors have is that if they discuss cheating too much with students and pursue cheating cases, their end-of-term student evaluation scores will drop. This is of great concern to many instructors, especially those whose employment or advances depend on such student evaluation scores.
A-freaking-men.
Some of you may have heard me tell this story before, but one of my first interactions with misconduct was with an adjunct/contract teacher at Rutgers University. She taught an online introduction to music course, a course she conceded was neither rigorous nor challenging. The professor told me that she knew somewhere near 90% of her students were cheating - copying Wikipedia and submitting verbatim answers she’d seen over and over. When I asked her what she did about it, she said, “I give them As.”
Flabbergasted, I asked her to explain. She said, essentially, that if she filed misconduct cases on 90% of her class, her reviews would plummet. The Dean would hate her for making so much work, she said. And they’d find someone else, someone who would not create waves and get bad reviews. She said she needed the job and addressing misconduct simply was not worth losing it.
My point is that classroom culture and diligence on misconduct are heavily influenced by job security, which can be directly linked to student reviews.
Big, big kudos to this UC Riverside team for going there. And when they arrived, they found:
The scores for instructor effectiveness / course overall were: Spring 2019 4.82/4.64, Fall 2020 4.85/4.76, Spring 2020 4.38/4.29, Fall 2021 4.23/4.26. The first two terms were normal, while the latter two were intervention terms.
And:
As can be seen, the evaluation scores are lower in the two intervention terms.
And:
the [evaluations from] intervention terms included 2-3 comments like "The professor put more effort in trying to find 'cheaters' than in actually teaching the class"
The gold here is that administrators should be keenly aware of this link and reward educators who invest in integrity, even when - especially when - their evaluation scores dip. It may mean they’re actually trying to curtail academic fraud and that some students don’t like it when professors make it hard to cheat.
Finally, another wink to the team behind this paper for writing:
we define cheating on a class programming assignment as a student submitting code that is not their own, typically by copying code from a classmate or website (GitHub, Chegg, CourseHero, Quizlet, etc.), or by having someone else code for them (a friend, family member, or contractor), in a way that violates the collaboration policies of the class.
Mentioning Chegg and Course Hero and Quizlet by name is great. It tells me these folks know what’s going on. Good for them.
Good stuff.