(Long) Research Edition: AI Detectors Are Inaccurate, Can be Beaten
And where it goes wildly wrong. Plus, did you miss the deadline to apply to speak at the Course Hero event? Don't worry. I didn't.
Issue 288
Subscribe below to join 3,819 other smart people who get “The Cheat Sheet.” New Issues every Tuesday and Thursday.
If you enjoy “The Cheat Sheet,” please consider joining the 18 amazing people who are chipping in a few bucks via Patreon. Or joining the 37 outstanding citizens who are now paid subscribers. Thank you!
Research: AI Detectors Don’t Work And Are Easy to Hack
A new paper on AI detectors and efforts to evade them is making the rounds among those who doubt or discount the accuracy or value of AI similarity detection systems in education. Or, as is more often the case, among those who would prefer we all just ignore academic misconduct entirely.
That Venn Diagram is a pretty tight circle.
The title of the paper is:
GenAI Detection Tools, Adversarial Techniques and Implications for Inclusivity in Higher Education
Listed authors are Mike Perkins, Binh H. Vu, Darius Postma, Don Hickerson, James McGaughran, and Huy Q. Khuat of British University Vietnam, as well as Jasper Roe, of James Cook University Singapore.
Preamble, Apology
This is long and super-nerdy. And I am sorry. I would not wade this far into such murky waters if I did not believe that having a clear record, on the record, was important. Research, especially misunderstood or poor-quality research, should not be above skepticism and criticism. In fact, I think it’s required. And I want there to be a record somewhere of the kind of nonsense that seemingly smart people are endorsing.
So, here we are.
Most people, I’m willing to wager, never read past the first two paragraphs which say, in part:
The results demonstrate that the detectors' already low accuracy rates (39.5%) show major reductions in accuracy (17.4%) when faced with manipulated content, with some techniques proving more effective than others in evading detection.
The accuracy limitations and the potential for false accusations demonstrate that these tools cannot currently be recommended for determining whether violations of academic integrity have occurred
That’s a shame.
Speaking of wagers, mine, about anyone being able to demonstrate that AI similarity checkers do not work, still stands. Actually, it was more of an open offer, a call for anyone who had evidence showing that AI detectors do not work (see Issue 241). So far, no one has shown any evidence to support the idea. In fact, efforts to disprove their effectiveness have shown the opposite (see Issue 250).
This current paper from Perkins et al. does not do the trick either. Despite claiming that “detectors” as a group have an accuracy rate below 40%, there are errors in reaching this conclusion — a few major, several minor but still significant.
Cheating Blueprints
Before we even get to the errors, one of the things that deeply set me off about this paper is that it’s a blueprint for defeating AI detection. Which is silly if you believe they don’t work in the first place.
This paper tests several evasion techniques and reports on which are most effective against which detection regimes. The paper actually identifies as research questions:
What are the most effective adversarial techniques for deceiving text detectors?
Which AI tools provide outputs that are easier or more challenging for detectors to identify?
Someone should have thought a bit harder about these decisions.
They also write:
the choice of the GenAI tool has an impact in which adversarial techniques might be chosen if reducing the detectability of the text is the goal
You know, if reducing the detectability was a goal. Conveniently, they even made a chart of evasion success rates.
They continue:
The lower detectability of the content generated by [redacted] could make them more appealing to those who intend to circumvent academic honesty policies. [Redacted] superior performance in evading detection across all categories suggests that its outputs might align more closely with the nuanced and variable patterns of human writing
The redactions are mine. They may have been fine with publicizing which AI engines are better at not being caught, but I am not going to.
Either way, just in case someone intended to “circumvent academic honesty,” the authors tell them how to be successful. Moreover, I cannot imagine what a teacher or school is supposed to do with that information. It seems as though the only people who could benefit from knowing which systems are most likely to evade detection are those who want to evade detection.
Differences in which text generators and which “adversarial techniques” work best are a problem, the authors say, because:
students with access to advanced AI tools and knowledge of adversarial techniques could gain an unfair advantage over their peers, further widening extant digital inequalities and the digital divide
Boy, they are right. It really could be a problem if students had “knowledge of adversarial techniques” like the ones they just published.
The Tests
The research team tested three types of papers — human-written, AI-generated, and AI-generated but then altered by AI in ways specifically designed to fool or subvert AI detection technology. The team tested these three types of papers on seven detection systems:
Turnitin AI detect
GPTZero
ZeroGPT
Copyleaks
Crossplag
GPT-2 Output Detector (OpenAI)
GPTKit
N.B., the paper says in several places, including in the abstract, that the test covered six systems:
This study investigates the efficacy of six major Generative AI (GenAI) text detectors when confronted with machine-generated content that has been modified using techniques designed to evade detection
The test clearly covered seven systems. But that’s not one of the problems.
Major Issue One — Samples
Among the larger and more obvious problems with this trial is that the researchers did not develop or submit test samples that the selected detection systems are trained to classify. The papers developed and tested were:
Mini (short form) university essay testing AI’s ability to construct coherent, argumentative, or exploratory work within an HE setting.
Professional blog post to assess whether GenAI-generated content can demonstrate professionalism, industry knowledge, and expertise, while maintaining reader engagement.
Cover letter to apply for an internship designed to test GenAI’s ability to design tailored content specific to a position and the suitability and motivation of the applicant.
Middle-school level comparative analysis task designed to test GenAI's use of language specific to that of a younger author requiring clarity and simplicity.
Magazine article intended to test for content and tone in a journalistic manner that may appeal to a broad audience.
Obviously, the AI detection systems have no training — and therefore no specific ability — to detect AI in cover letters, professional blog posts, or magazine articles. The ones that perform best in academic settings are those that were developed to work well in academic settings. As we know from related research, the more specific data an AI detection system is trained on, the better it is at working in that area.
At best, only two of these five paper types are academic in nature and only one of those is higher ed related. In other words, 60% of the entire sample was designed to miss. Judging Turnitin or even Copyleaks on how well they do with cover letters or magazine articles is like, as I read recently, judging a fish by how well it can climb a tree. And it destroys any objective finding on how well these systems work.
Major Issue Two — Unreal Evasion
Even if you think, for some reason, that AI detectors designed and built for schools and academic writing should be good at scanning articles with a “journalistic manner,” whatever validity this test had is fully obliterated when the researchers ask the AI to rewrite the papers with tactics designed to evade detection.
The “adversarial techniques” that the authors used — or rather, asked the AI to apply — included adding spelling errors, varying the length and structure of sentences, making the language more complex, using language a non-native speaker may use, and using a third-party paraphrasing bot.
For the record, when the research team wanted a paraphrase engine to intentionally deceive an AI detector, they turned to Quillbot, which is owned by cheating conglomerate Course Hero.
Anyway, after several attempts to get the AI to alter its own work to avoid detection, the team got results that were unnatural and implausible.
For spelling errors, for example, the AI spit out papers with “more than 20” spelling errors. When asked to change the sentence lengths and structures, the team says it:
sometimes resulted in overly short sentences that might not be suitable for formal or professional contexts.
When asked to make the language more complex, they said:
We observed that the outputs often descended into jargon or strayed significantly from the expected style which may result in suspicion as to the authorship of the work
In other words, not only were most of the starter texts outside of the trained range of most academic AI detectors, but the evasion tactics the team then tested yielded things that no one in the right mind would submit. More than twenty spelling errors in a 300-word paper? You’re kidding.
To their credit, the paper concedes this flaw:
some of the samples generated after applying adversarial techniques for testing may not accurately represent the quality of work that students would submit in a real-world setting. Although these samples evaded detection by software tools, they are likely to evoke suspicion from human markers because of their poor quality, strange phrasing choices, and excessive errors.
You think?
Let me repeat that. The papers, “may not accurately represent the quality of work that a student would submit in a real-world setting.”
But they tested them anyway — papers that students would not conceivably submit in an educational setting — and concluded that the detectors had a problem. That’s solid work.
Another problem — Bias
While there are some places in the paper where the authors seem reasonable, even skeptical, it’s clear that the team was biased. They, for example, consider their entire test within the context of previous research that AI detection systems show bias against non-native English speakers:
However, research has shown that GenAI text detectors have the potential to be barriers to inclusive assessment practices by disproportionately targeting individuals who don't speak English as their first language, or those with lower English proficiency
“Disproportionately targeting” is a pretty big tell. But even more substantially, the research they cited is suspect and has been openly questioned by several experts (see Issue 251).
Further evidence of bias is that the authors repeatedly reference AI detectors making “false accusations.” Detectors do not make accusations of any kind, which seems obvious. Still, the authors write:
error analysis was conducted to assess false accusations and undetected cases
Or:
Instances of AI detectors falsely accusing students of academic misconduct are not uncommon and cause concern regarding inclusivity, fairness, and ethical practice in education.
At the end of the paper, the bias in the research is even more obvious. But, for now, for the quazillionth time, detection systems do not make accusations.
But Another Big Problem — The Results
To get to the conclusion that the tested detectors have “already low accuracy rates (39.5%)” the authors fully commit to the fallacy of averages, even though the actual results vary considerably. And worse, they only cite one side of the test results.
Here are the results by detection system on the baseline, unaltered texts — 10 human and 15 from AI. But again, on things such as cover letters and corporate blog posts.
Copyleaks 64.8%
Turnitin 61%
Crossplag 60.8%
GPT-2 detector 57.2%
ZeroGPT 46.1%
GPTKit 37.3%
GPTZero 26.3%
To start, we see three systems — Copyleaks, Turnitin, and Crossplag — scored above 60% on basic detection.
I will also highlight once again that GPTZero is awful. It’s always been awful. I am amazed that anyone takes it seriously. Let me even quote the authors on this point:
The worst-performing detector was GPTZero, with a considerably lower accuracy rating of approximately 26%
I’ve also never heard of GPTKit. And the GPT-2 detector from OpenAI was so bad that they shut it down.
In other words, the average of these averages includes at least two systems that are junk and one that is a complete mystery. Unsurprisingly, none of those three did well. Yet they are counted as equals with Crossplag, Turnitin, and Copyleaks. Research teams who want to “test” AI detection keep treating systems that no one uses or those that are known to be terrible as representative of the class.
Moreover, if you’re quick with math you will note that the average of the seven averages listed above is not 39.5%, as the authors said. It’s actually 50.5%. So, that’s fishy.
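If you want to check that yourself, here is a quick sketch of the arithmetic. The only inputs are the seven baseline accuracy figures listed above; nothing else is assumed:

```python
# Baseline overall accuracy figures reported in the paper, per detector (in %)
baseline_accuracy = {
    "Copyleaks": 64.8,
    "Turnitin": 61.0,
    "Crossplag": 60.8,
    "GPT-2 detector": 57.2,
    "ZeroGPT": 46.1,
    "GPTKit": 37.3,
    "GPTZero": 26.3,
}

# Average of the seven per-detector averages
mean_accuracy = sum(baseline_accuracy.values()) / len(baseline_accuracy)
print(f"{mean_accuracy:.1f}%")  # 50.5% -- not the 39.5% headline figure
```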
To get the 39.5% number, the study uses only the accuracy at detecting the unaltered AI text. That’s not overall accuracy, which is what the authors confusingly implied when they wrote, as quoted above:
The results demonstrate that the detectors' already low accuracy rates (39.5%)
Maybe it’s just bad construction. But it’s — at best — misleading.
The 39.5% number is also thrown way, way off by a finding of just 6% accuracy from GPTKit. Come on. In fact, if you remove the three obvious junk systems (GPTKit, GPTZero, and OpenAI’s own classifier), the accuracy with unaltered AI text goes from 39.5% to 52%.
But that’s not as good a story.
I am also not sure what to make of the mislabeling of the results. In one place, the paper refers to the results — 64.8%, 61%, 60.8% and so on — as:
being able to detect 64,8% of AI-generated texts
But also as:
The results of our baseline testing of 15 AI-generated samples and 10 human samples
I think it’s the latter. But still, giving conflicting descriptions of the basic results is not great.
Human Accuracy, False Positives
Though the paper does not include it in its calculation of “accuracy rates,” the test did find that the seven systems had — as an average of averages — a 67% accuracy rate with human text.
Importantly, regarding the human-written control samples, only 67% of the tests were accurate, leading to significant concerns regarding the potential for false accusations from these tools.
False accusations again. But digging in, the real results are that:
four of the seven detectors did not misclassify any of the human-written samples. Notably, despite detecting the highest proportion of manipulated text, Copyleaks possessed the highest likelihood of producing false results, with 50% of human generated samples misidentified as AI written.
So, if I am reading that right, more than half of the detectors yielded exactly no false positive results. None. Among those was Turnitin, which is the most used detection system in education. That seems important. For a research project so focused on “false positives” and accuracy, a perfect score on misclassification from four systems seems like it should be big news.
Another Confusing Point on “False Accusations”
Having clearly said that the accuracy rate with human text was 67%, the authors also say that the “false accusation” rate was actually 15%, which is very different.
In the math, a total of nine human papers were misclassified as AI, out of 60 total. But, as mentioned, five of those nine were from one system — Copyleaks (which no school should ever use anyway — see Issue 208). So, again, one bad result from one bad system taints the whole batch. Without Copyleaks, the “false accusation” rate falls to about 7%.
Not that 7% is great. Or the 15%. But flipped over, that means that the seven tested detectors were somewhere around 90% accurate at avoiding false positives. This is also different from saying that 10% of human-written submitted papers will be incorrectly flagged. The market leader, Turnitin, had zero false positives. If Turnitin represents 80% of the academic market, your actual, real-world false positive rate is like 2%. At worst.
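For what it's worth, here is the back-of-envelope math behind that claim, as a sketch. The 80% market share figure is my rough assumption, as is treating the rest of the market as averaging roughly the 10% false positive rate implied above; neither number comes from the paper.

```python
# Rough, assumption-laden estimate of a real-world false positive rate.
# Assumptions (mine, not the paper's): Turnitin handles ~80% of academic
# submissions with a 0% false positive rate in this test; everything else
# averages out to roughly a 10% false positive rate.
turnitin_share, turnitin_fp = 0.80, 0.00
other_share, other_fp = 0.20, 0.10

blended_fp = turnitin_share * turnitin_fp + other_share * other_fp
print(f"{blended_fp:.0%}")  # ~2%
```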
Post-Alteration Results
If you remember way back when we started, the goal of this paper was to test AI detectors with “adversarial” adjustments to AI-created text. By using QuillBot, adding spelling errors, using more complex language, or adjusting sentence types, the paper found that:
A comparison of the tools showed reductions in accuracy with variations ranging from 1.5% to more than 42% when the outputs were subjected to adversarial techniques (mean value 17.4%).
The 17.4% number is described as “major reductions in accuracy.”
Is it though?
I mean if you’re trying to fool the AI detector — if deception is your goal — and you’re only degrading your risk by 17% … I don’t know. I mean 17% is not nothing. But is it major? I think I am somewhat reassured that deliberate evasion tactics moved the accuracy so little.
Further, the two alterations that confused detectors the most were the two that were borderline bonkers to start with — the spelling errors (more than 20 errors) and varying sentence structures (generating sentences that “might not be suitable for formal or professional contexts.”) It’s a flaw the authors, to their credit, acknowledge:
we recognise that an examination of many of the samples produced using the SE technique resulted in an output that would be very unlikely to be submitted by a student. Although they evaded detection, they would very likely receive poor marks in a real-world setting because of the high number of errors
Ah, no kidding. So the finding is that, when altered to the point that they become nearly unusable, these papers are about 17% more likely to escape detection.
Gee.
But the larger point here is that, yes, AI detectors can be fooled. Any detection system can be fooled. This is not news. And, in the analogy I have used many times, door locks can be picked or doors can simply be kicked in. We still lock our doors. Detection and prevention efforts need not be foolproof to be valuable.
And Finally
Just a few more notes, I promise.
In the conclusion, the authors knock detectors for a high “false accusation” rate and for missing AI content:
With the rate of false accusations at 15%, considering the major impact that this could have on student outcomes, we consider this to be a major concern for student equity. Although some detectors did not have any false accusations, this appeared to come at the cost of a higher UDC ratio, indicating that many instances of AI-generated content could go unnoticed, potentially providing an unfair advantage to dishonest students who can apply these adversarial techniques in a matter of seconds to hide the true source of text.
Not 15%. In reality, like 2%. And detectors don’t make accusations. But we covered that.
My point here is that you cannot have it both ways. You can set your smoke detector to sensitive and it will go off all the time for innocuous reasons. But you will be very safe. Or you can dial it down, potentially missing some risks but having very few false alarms. To knock a detector for not having false alarms coming “at the cost” of letting some things go is rich. I wanted to type “dishonest” there.
Fine, it’s dishonest.
Another point, from the authors:
If the goal of any given HEI was to use AI text detectors solely to determine whether a student has breached academic integrity guidelines, we would caution that the accuracy levels we have identified, coupled with the risks inherent in false accusations, means that we cannot recommend them for this purpose.
This makes me insane.
No one suggests using AI detection “solely to determine whether a student has breached academic integrity guidelines.” No one. No one has ever even suggested doing that. No one should do that. This is a strawman, and not worthy of honest research.
That section nonetheless proves that the authors know well that detectors don’t make accusations — that they are part of an evidence package. And it really upsets me that the paper spends so much time pretending otherwise.
Predictably, the authors of this paper end with the typical high-minded nothingness of:
GenAI tools offer an opportunity to reconsider the traditional notions of misconduct and the potential barriers and inequities that a punitive approach to detection can face
And:
This requires fundamentally rethinking how we assess student learning, moving away from the traditional assessment methods that are easily compromised by AI tools
No, AI does not require or even suggest a reconsideration of misconduct. Misconduct is, by definition, doing what is not allowed. And we would do well to keep things such as education fraud and intellectual theft in the “don’t do it” column.
And ending with the bromides about a “punitive approach” and “rethinking how we assess” is just perfect because it’s what this paper was really about — sowing doubt about detection so we can look away from misconduct.
As mentioned earlier, this Venn Diagram is pretty tight.
The Bottom Line
This paper is not good. People who cite it or share it should be required to explain it and defend it. But they will not. Because they likely cannot.
Did You Miss the Deadline to Apply to Speak at the Course Hero Event? I Didn’t.
Course Hero, as you probably know, is one of the top three cheating providers in the country (see Issue 97). This is true even if we skip over the company’s other business model — abusing copyrights (see Issue 252).
From time to time, Course Hero, in an effort to appear legitimate, hosts events with teachers, where they can speak or be paid to speak (see Issue 226).
Oddly, the topic of cheating never seems to come up at these Course Hero events.
Anyway, Course Hero is having another one of these things, soliciting speaker proposals for presentations. The deadline to apply was Monday, April 15. But don’t worry. In case you missed your chance to sell your personal and institutional credibility to an academic fraud provider (see Issue 42), I applied.
The submission form limited my proposal to 100 words. Here’s what I asked Course Hero to let me talk about:
Academic misconduct is a threat to education equity, fundamental fairness, and the value of education credentials. Unfortunately, the issue is often overlooked.
Or, in some cases, cheating is even exploited for profit.
In this session, I'll present an examination of several contract cheating companies that sell cheating services -- looking at their business models, customer bases, and services. As well as efforts that are already underway to shut them down around the globe.
Come learn how investors are making money by selling cheating to your students. And what you may be able to do about it.
As you can see here, Course Hero will review my submission and let me know in May. I’m very excited.