Research Edition: Paper Shows AI Detectors Work Very Well. The Study's Author Disagrees.
In tests, three different AI detection systems were 96% accurate with zero "false positives."
Issue 250
A European Research Paper Tested AI Detectors. The Results Are Clear. The Message Is Not.
Note: I apologize for the length of this, but it’s important and an unusual circumstance. If you prefer not to read it all, the short version is that a research paper came out in June showing that the better AI detectors are 96% accurate at differentiating between AI and human writing, with zero false positives. But one of the authors of the study says they are unreliable.
In June, a team of researchers in Europe released a paper on their efforts to test AI detection systems, and it’s been on my list to share here since then. Here are the listed researchers:
Debora Weber-Wulff (University of Applied Sciences HTW Berlin, Germany), Alla Anohina-Naumeca (Riga Technical University, Latvia), Sonja Bjelobaba (Uppsala University, Sweden), Tomáš Foltýnek (Masaryk University, Czechia), Jean Guerrero-Dib (Universidad de Monterrey, Mexico), Olumide Popoola (Queen Mary University of London, UK), Petr Šigut (Masaryk University, Czechia), Lorna Waddington (University of Leeds, UK)
I’m getting to it now because in Issue 241 I wrote that, despite what OpenAI (the makers of ChatGPT) say, there is no research anywhere showing that AI detectors do not work:
[there are a] litany of academic studies and press stunts that show uniformly that AI text-classifiers work. Unlike the one from OpenAI, the good ones work quite well. Not a single study that I know of shows anything to the contrary. If you know of a study that shows they flatly do not work at all, please send it.
In Issue 248 I mentioned that no one had sent me anything.
A few hours later, Debora Weber-Wulff, the lead author of the above study, e-mailed me. She highlighted my call for evidence and sent a link to her paper.
So, let’s get into it.
The Goalposts
To answer the question, we should be clear on what constitutes “working” in AI detection.
To me, an AI detector works if it can tell the difference between text that is created by humans and text that is created by AI at a rate better than random chance, which is 50%. Pretty simple.
I think of a police radar detector. To “work” it has to be able to tell whether a car is speeding or not speeding. If it only did so at a 50% clip, we’d consider it useless — any person in a blindfold would be 50/50. So, to me, that’s the target — better than simple chance at categorizing something as one thing or another. Yes, I know AI detection is a bit more complex because the detectors report levels, or suspected levels, of AI instead of just a “yes” or “no.” But I think the standard translates.
The Research Test
In the research above, the team used 14 AI detectors to test nine papers in each of six types of written content. A few of the companies are misidentified, but the paper says the 14 detectors tested were:
Check For AI
Compilatio
Content at Scale
Crossplag
DetectGPT
Go Winston
GPT Zero
GPT-2 Output Detector Demo
OpenAI Text Classifier
PlagiarismCheck
Turnitin
Writeful GPT Detector
Writer
ZeroGPT
The six tested text varieties were:
Human written
Text written in languages other than English, then machine-translated into English
AI-generated (unaltered, ChatGPT)
AI-generated (unaltered, ChatGPT; a second, separate set of nine papers)
AI-generated, then human edited
AI-generated, then computer paraphrased (Quillbot, by the way)
That’s a total of 54 papers checked across these 14 systems. And although that makes the baseline control sample of human-written text just nine papers, we’re good so far.
A note that two of the systems — Turnitin and PlagiarismCheck — gave the research team access to their paid/subscription tools for these tests. Good on them.
The Results on Human Writing
When I responded to Weber-Wulff, I wrote that I was happy to hear from her and that her paper:
in my view, is very supportive of the idea that AI classifiers work and work well.
I said that because according to her paper:
The overall accuracy for case 01-Hum (human-written) was 96%.
In other words, when smushed together, all 14 detectors correctly identified human-written text as human-written text with 96% accuracy. Ten of the 14 systems were a perfect nine for nine. Only Compilatio (8 of 9 correct), Winston AI (7 of 9), GPTZero (6 of 9), and PlagiarismCheck (8 of 9) recorded incorrect results on human writing.
This is a very important moment to aggressively highlight that correctly identifying human content as human content is the tip of the spear on so-called “false positives” — incorrectly classifying genuine human writing as AI-generated. It’s this possibility that has some well-meaning folks worried about incorrect accusations of cheating and “traumatized students.” However, as I went through in Issue 244, that concern is largely an invention.
As such, the fact that ten of 14 classifiers tagged human text flawlessly is worth underlining. And it tells me that, if you’re worried about “false positives,” the solution may not be to turn off AI detection, but to turn off the AI detectors that suck at correctly identifying human writing. Here too — a special note for GPTZero, which correctly identified just six of nine human samples and was the worst of the bunch. I’ve said dozens of times that GPTZero is not a good system. It is not.
The Results on AI Text
With half of the “do they work?” question answered, how did the systems do at spotting text created by AI?
Pretty well, thank you very much.
Taken together, the 14 systems were 66% accurate at identifying straight AI-generated text — 167 accurate IDs out of 252 samples.
That should answer the question of whether they work. They do.
According to this research, the 14 tested systems were 96% accurate with human writing and 66% accurate with AI text. That’s an overall accuracy rate of roughly 75%, well better than random chance (50%).
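If you want to check that arithmetic, here is a minimal back-of-envelope sketch in Python. It assumes nine human samples and 18 unaltered AI samples per detector (which is how the 54 papers and the 252 AI checks break down across the 14 systems):

```python
# Back-of-envelope check of the aggregate figures above, assuming 9 human
# samples and 18 unaltered-AI samples per detector across 14 detectors.

detectors = 14
human_total = detectors * 9      # 126 human-written samples
ai_total = detectors * 18        # 252 unaltered AI samples

human_correct = round(0.96 * human_total)   # ~96% accuracy on human text -> ~121 correct calls
ai_correct = 167                             # reported correct IDs on unaltered AI text

print(round(ai_correct / ai_total, 2))       # 0.66 -- the 66% AI-side figure
print(round((human_correct + ai_correct) / (human_total + ai_total), 2))  # 0.76 -- in the ballpark of the ~75% overall figure
```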
And you may note here that nearly all the inaccuracy is on the AI side — not flagging AI work as AI work. Which is exactly what we should want. Using the radar detector analogy, if your technology is going to have holes, you want it to let some speeders go and not say someone was speeding when they were not.
Of course, you could argue that 66% with AI text and 75% overall isn’t great — that such results can’t be considered effective or reliable.
You could argue that if you stopped reading here.
But I don’t think you should because the 66% and 75% accuracy numbers are very misleading. What’s at work here is the fallacy of averages. It’s the old joke — when Bill Gates walks into a bar, on average, everyone in the bar is a millionaire.
In this research, the overall success averages are dragged down by some really bad performances, making the whole group look mediocre. For example, one system (Content at Scale) got a zero — not one of the 18 AI samples was correctly identified. Writefull GPT Detector was just 28% accurate with these tests. Five of the 14 systems were just 50% accurate or worse.
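To make that averages point concrete, here is a tiny illustrative sketch. The accuracy values below are stand-ins loosely based on figures quoted in this issue, not the paper’s full results table:

```python
# Illustration of the "fallacy of averages": a few very poor performers drag
# the group mean down. These values are illustrative stand-ins, not the
# paper's complete results.

good = [0.94, 0.94, 0.94, 0.94, 0.89, 0.89]   # the strong performers cited in this issue
poor = [0.00, 0.28, 0.50, 0.50]               # e.g., Content at Scale's zero, Writefull's 28%

print(round(sum(good) / len(good), 2))                # 0.92 -- the good systems on their own
print(round(sum(good + poor) / len(good + poor), 2))  # 0.68 -- the blended average looks mediocre
```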
Consequently, some great performers were buried and tarnished by these pulled-down, mediocre averages. That’s a shame because some detectors were very good. Are very good.
Four of the 14 systems (CheckforAi, Winston AI, GPT-2 Output, and Turnitin) were 94% accurate with the AI work, missing just one of 18 test samples each. Another two detectors (Compilatio and CrossPlag) were 89% accurate with AI text.
Three of the four that were 94% accurate with AI were also 100% accurate with human-written work. So was one of the systems that was 89% accurate with the AI.
In other words, three of the 14 tested detectors were more than 96% accurate overall — getting the human work perfect and missing just one of 18 AI works. Another was 92% accurate overall - getting all the human work right but missing two of 18 AI submissions.
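The per-detector arithmetic behind those overall figures is easy to verify, again assuming nine human samples and 18 unaltered AI samples per detector:

```python
# Per-detector overall accuracy for the best performers, assuming 9 human
# samples plus 18 unaltered AI samples = 27 relevant samples per detector.

samples = 9 + 18

best = (9 + 17) / samples        # perfect on human text, missed 1 of 18 AI texts
print(round(best, 3))            # 0.963 -> "more than 96% accurate overall"

runner_up = (9 + 16) / samples   # perfect on human text, missed 2 of 18 AI texts
print(round(runner_up, 3))       # 0.926 -> the "92% accurate overall" detector
```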
When I wrote this to Weber-Wulff, she replied:
I realize that the numbers sound good, but look at the non-accuracy rate and what the results are of false positives and false negatives.
The numbers do sound good. They are good. As far as false positives, among the top performers there simply were none. The best systems were perfect with the human writing.
When four separate detectors are more than 92% accurate at discerning human text from AI text, it’s obvious they work.
And it highlights what I’ve also written many times … those who are concerned about the accuracy of AI detectors would do well to investigate the differences between bad ones and good ones and not use the bad ones. They are not all the same.
Frankentext
As shown above, the use of averages obscures that some AI detection systems are really good. That’s a research choice.
Using modified text on AI detectors is also a choice. As noted above, the research team did that repeatedly. Half the tested writing samples were neither entirely human nor entirely AI. They were Frankentext. And while testing AI detectors with modified Frankentext is interesting and informative, measuring the results on a scale of overall accuracy is pointless — which is to say that it doesn’t prove much.
Said another way, what do we learn by knowing that a system designed to tell one thing from another isn’t especially good when it’s presented with something that is neither?
Using my radar detector analogy, what the research team did was study the tool with cars at fixed speeds, speeding and not. The results were clear; the good systems were more than 90% accurate.
But then they decided to have a car enter the detection zone speeding but slam on its brakes. Then they decided to test a car that started slow and sped up. They tested two cars at the same time — one speeding, one not. They tested the car going away from the detector. They tested what happens if you put the radar gun in a speeding car and point it at a parked car. Fun, but not really informative to the central question of whether or not the system can spot a speeding car.
Worse, the research team then counted these results in an overall “accuracy” score for each system. Six tests, half of them on altered content. As I wrote to Weber-Wulff:
Where the findings in this study goes astray, in my view, is where the classifiers are given altered and edited material and those results are counted in an overall score. The alterations make the test documents neither human nor AI-created. If my math is right, fully half the tested samples were altered and included in the overall accuracy scores of detecting AI. Since the samples can't honestly be called AI, that feels off.
We’ve known from the outset that tactics such as manual editing or paraphrasing can confuse and confound AI detectors. In most cases, they are meant to do exactly that. Even the research team describes these altered texts as “obfuscation techniques,” yet they include them in overall accuracy, counting them as equal with spotting human text.
Frankentext Results
Even with this stacked sample of altered texts, the good systems still scored decently: Turnitin was 74% “accurate” overall across all tests. Compilatio was 74%. CrossPlag, 69%. Even with efforts to evade and complicate detection, the good systems were still well better than random chance.
As we saw from the previous tests, not all the systems do well. One, Content At Scale, was just 33% “accurate” overall. PlagiarismCheck, 39%. Six of the 14 scored just 54% or worse. GPTZero was 54% — not much better than a coin flip.
I don’t know what else to say — the systems matter.
The team also accounted for partially correct identifications, creating a scoring system in which absolute accuracy was heavily rewarded (16 points) while false classifications earned almost nothing (1 point). In this model, we again see the good systems doing well. Turnitin’s score jumps to 81%. Three others are 75% or better.
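For illustration only, here is one way a weighted score like that could be computed. The 16-point and 1-point weights come from the description above; the partial-credit weight and the example counts are my assumptions, not necessarily the paper’s exact scheme:

```python
# Hypothetical sketch of a weighted scoring scheme: a fully correct
# classification earns 16 points and a false classification earns 1 point
# (per the description above). The partial-credit weight and the example
# outcome counts below are assumed for illustration; the paper's actual
# scheme may differ.

WEIGHTS = {
    "correct": 16,   # fully correct classification
    "partial": 8,    # assumed partial-credit value
    "false": 1,      # false classification
}

def weighted_score(outcomes):
    """Return a 0-1 score for a list of outcome labels."""
    earned = sum(WEIGHTS[label] for label in outcomes)
    maximum = WEIGHTS["correct"] * len(outcomes)
    return earned / maximum

# Example: a detector that is mostly right, with some partial and false calls.
example = ["correct"] * 40 + ["partial"] * 8 + ["false"] * 6
print(round(weighted_score(example), 2))   # 0.82
```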
The Biggest Mistakes
In addition to using averages that treat all AI detection systems as equal, the researchers take the bad results from several systems and apply them to the entire class, writing:
The researchers conclude that the available detection tools are neither accurate nor reliable and have a main bias towards classifying the output as human-written rather than detecting AI-generated text. Furthermore, content obfuscation techniques significantly worsen the performance of tools
I’m sorry, what?
That’s simply not true. As I wrote to Weber-Wulff:
It's also misleading, in my view, to lump scores of some really bad systems in with the averages and say all classifiers are inaccurate. Some are. Not all are. Some systems I've never heard of scored below 50%, yet they were averaged as if all the systems are equal. They are not. I don't think you can measure a Yugo and a Ferrari and use the average results to describe cars. Or at least I don't think it's accurate to do that.
I’ve used the Yugo and Ferrari thing before.
But I simply do not understand how anyone can say “the available detection tools are neither accurate nor reliable” when their own research surfaces several that are more than 90% accurate and reliable at standard detection.
Repeating from earlier:
three of the 14 tested detectors were more than 96% accurate overall - getting the human work perfect and missing just one of 18 AI works. Another was 92% accurate overall - getting all the human work right but missing two of 18 AI submissions.
I will not apologize for this — seeing those results in your own research and saying all classifiers are “neither accurate nor reliable” is dishonest.
More from my reply to Weber-Wulff:
I think it's unfortunate that many people are reading your research to say that AI detection does not work or that they are, as you wrote, unreliable. I do not see that as being supported by the research, yours or others.
I’ve already seen people citing this research as proof that AI detection does not work. Unfortunate to be sure.
Finally, Some Context
I wrote to Weber-Wulff that I am thankful for her work and that of her team. I am. Getting these issues into the conversation is essential.
However, I feel compelled to share some other portions of her response to my first e-mail because I think they illuminate the context in which this particular research was initiated and conducted.
And I thank you for sticking with me.
If you were reading closely up top, you noticed that I put the phrase “traumatized students” in quotes. That’s because it comes from Weber-Wulff, here:
I realize that the numbers sound good, but look at the non-accuracy rate and what the results are of false positives and false negatives.
My school has 14 000 students. Suppose that each has one text per year examined. That means 840 wrong classifications! And even if only one percent was a false accusation, that is still 140 traumatized students. And it will tend to be those writing in a second language that are thus penalized, as other studies have shown (such as https://arxiv.org/abs/2304.02819).
I don’t know where to start.
To get 840 “wrong classifications” from 14,000, she’s using the 94% average overall correct-classification rate for all 14 systems in her trials. Not all 14 systems are in use at her school, or at any school. Some of the tested systems, I think it’s fair to say, are used by no one, anywhere. Who is using GPT-2 Output Detector Demo? Yet it’s weighted the same as Turnitin and CrossPlag, which are in wide academic use.
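Her arithmetic, and the difference the choice of detector makes, can be laid out in a few lines. The 6% error rate follows from the 94% figure she cites; the 0% rate is Turnitin’s false-positive result on human writing in this study:

```python
# The arithmetic behind "840 wrong classifications" and "140 traumatized
# students," versus the same calculation with a detector that had a 0%
# false-positive rate on human writing in this study (Turnitin).

students = 14_000                    # one examined text per student per year

avg_error_rate = 1 - 0.94            # from the 94% average correct-classification rate she cites
print(round(students * avg_error_rate))     # 840 wrong classifications

accusation_rate = 0.01               # "even if only one percent was a false accusation"
print(round(students * accusation_rate))    # 140

turnitin_false_positive_rate = 0.0   # 0 of 9 human papers misclassified in these tests
print(round(students * turnitin_false_positive_rate))  # 0 wrongly flagged human papers
```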
By far, the most widely used AI detection system across all schools is Turnitin. It had a 0% “wrong classification” rate for human writing. Zero.
Then, Weber-Wulff seamlessly swaps the terms “classifications” and “accusations” as if they are the same. This assumes that any and every flag of AI text turns into an accusation of misconduct.
As I walked through in Issue 248, this is fantasy. It’s also very dismissive of teachers — assuming they will blindly follow the technology, and the technology only, without using any of their skills or experience. Anyway, all that is in Issue 248; I don’t need to repeat it here. But it does expose a profound lack of understanding about academic integrity realities.
Finally, the study she cites (I left the link in above) is, to be kind, suspect. I wrote about it at length in Issue 216.
Especially curious, when juxtaposed with the previous excerpt, is this from Weber-Wulff:
The systems appear to default to human if they are not sure. That is good for an educational setting, but that does not prove that they work. And our sample was not really that large.
Setting aside the “not really that large” part, which I touched on at the start, it’s very interesting that she says a lean in favor of not flagging human writing is “good for an educational setting.” It is. We’re discussing it in an educational setting.
Not counting citations, her research paper uses the word “students” 31 times. It literally has a section titled, “False Accusations: Harm to individual students.” Yet she says that defaulting to human and minimizing false positives is “good for an educational setting.” I’m confused.
And, no, a “default to human” does not prove they work. Four different systems with accuracy rates above 90% prove they work.
The last bit I will share from Weber-Wulff came after I told her that I read her findings as showing that AI detection works. She wrote:
That is exactly NOT our conclusion! Our focus on academic integrity is a rather different from yours - you assume that students cheat.
We want to teach students to work with integrity. Once they get to the workplace there won't be "guardrails" in place, so our only hope is to instill good academic practice in them!
Yes, her conclusion is clear. We covered that.
I do assume students cheat. Because students cheat. This is not an opinion.
All I can leave you with is that the cited lead author of this study, by inference, does not assume students cheat, wants to “teach students to work with integrity,” and references “traumatized students” because of an assumption that every flag of AI similarity spins up a full integrity accusation.