(357) Report: Humans Can Reliably Spot AI-Created Text
Plus, a must-read essay on the performance of teaching and learning in the age of AI. Plus, a new cheating company raises $5.3 million to 'cheat on everything.'
Issue 357
Subscribe below to join 4,625 (+5) other smart people who get “The Cheat Sheet.” New Issues every Tuesday and Thursday.
The Cheat Sheet is free, although patronage through paid subscriptions is what makes this newsletter possible. Individual subscriptions start at $8 a month ($80 annual), and institutional or corporate subscriptions are $250 a year. You can also support The Cheat Sheet by giving through Patreon.
Report: Humans Can Detect AI Text Quite Well
It’s been pretty well established that humans, usually teachers, cannot detect text created by AI, even when they’re told to look for it (see Issue 352 or Issue 332).
But as I wrote in Forbes this week, a recent study shows that people who use generative AI frequently for writing tasks can reliably spot text created by AI. That’s interesting and important. But in the universe where teaching, learning, and assessment happen, the finding probably does not change much.
Humans Are Good
First, the paper is by Jenna Russell, Marzena Karpinska, and Mohit Iyyer of the University of Maryland, Microsoft, and the University of Massachusetts, Amherst, respectively. The headline finding is:
that annotators who frequently use LLMs for writing tasks excel at detecting AI-generated text, even without any specialized training or feedback.
That is absolutely worth knowing. And good news, on balance. The more ways we collectively have to spot AI garble, the better. But as a practical matter, the finding is well short of applicable, which the researchers acknowledge.
Before getting there, the study asked human reviewers to assess three types of written content — human-authored, AI-generated, and AI-generated then edited by paraphrasing or spinning, a tactic intended to deceive AI detection technology. The researchers also submitted these three kinds of papers to existing AI detectors, including GPTZero and Pangram.
Seems fair, although I do have a few questions.
Two Issues
Briefly, the experienced human annotators were recruited through the freelance platform Upwork and questioned about their AI use, education, and language skills. They were paid — about $2 per paper for reading it, assessing whether it was AI, and providing comments on their assertions. There were 300 papers in the sample.
But the paper does not say, as far as I can tell, whether the annotators were supervised during these reviews. They could have used an AI detector to make or refine their judgments. Unsupervised, they could have asked friends or compromised the findings in countless other ways. If we assume that unsupervised, online academic assessments are fatally compromised ab ovo, we should at least highlight that the human assessors in this study appear to have been recruited from the public at large and to have worked unsupervised, online. I want the results here to be air-tight. But people will lie or fake work for money. They will lie and fake work even for no money.
Also, I love that this study checked humans versus AI detectors. But it did not test Turnitin, the most widely used detector in academia. I’ve criticized other studies for using odd or off-brand detectors as stand-ins for popular, successful, commercial ones. At least this study, as mentioned, tested two known brands — GPTZero and Pangram. I am sure there were good reasons why Turnitin was not tested. I just wish it had been tested, or that we knew why it was not.
Findings and Limits
Anyway, my quibbles aside, the paper is quite strong and finds that human reviewers were very accurate at identifying the source of written material. The human experts were 99.3% accurate at spotting text created by AI and 100% accurate at picking out the human text as human, according to the paper.
That’s strong.
The limits on this conclusion, however, are quite significant. Most notably, the human assessors were not acting individually. To record a finding of AI- or human-generated, the study asked the human reviewers to vote, with a majority of five reviewers needed to make a determination one way or the other.
So, it’s not that a human is particularly good at spotting AI; it’s that a group of us, by majority vote, can be accurate. In fact, considered on a human-by-human basis, the accuracy for any particular human is good, but not great. From the report:
the five annotators who have significant experience with using LLMs for writing-related tasks are able to detect AI-generated text very reliably, achieving a TPR of 92.7%. The average FPR for this population of annotators was 3.3%, meaning that they rarely mistake human-written text as AI-generated.
FPR is false-positive rate — the rate at which human writing is falsely classified as being AI-generated. And for the most experienced reviewers as individuals, that false-positive rate was above 3%.
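For a rough sense of why the panel’s vote still came out clean, here is a back-of-the-envelope sketch — mine, not the paper’s — that assumes each of five annotators flags human writing as AI independently at that 3.3% individual rate. Independence is my simplifying assumption; real annotators likely make correlated mistakes.

```python
from math import comb

def majority_fpr(p_individual: float, n: int = 5) -> float:
    """Chance that a majority of n independent annotators, each with
    false-positive rate p_individual, wrongly flags a human-written text.
    Assumes independent errors -- a simplification, not the paper's claim."""
    need = n // 2 + 1  # votes required for a majority
    return sum(
        comb(n, k) * p_individual**k * (1 - p_individual) ** (n - k)
        for k in range(need, n + 1)
    )

# With a 3.3% individual FPR, a five-person majority errs about 0.03% of the time.
print(f"{majority_fpr(0.033):.4%}")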
For some reason, a few very vocal people in education have decided that any false-positive rate above zero is inexcusable. So much so as to disqualify all AI detection. It’s silly. But by that absurd standard, the human detectors fail as individuals.
This means that in real education and assessment settings, using a single screener who has “significant experience with using LLMs for writing-related tasks” is unlikely to pass the absolutism threshold of the detractors. At the same time, while using a panel of five experienced screeners does cut the FPR to zero, it is not feasible in realistic assessment settings. From the report:
An obvious drawback is that hiring humans is expensive and slow: on average, we paid $2.82 per article including bonuses, and we gave annotators roughly a week to complete a batch of 60 articles
Having the 300 papers reviewed by nine test reviewers came to:
a total cost of $4.9K USD.
No school in the world is going to approve that kind of expense, or time, to scan papers for AI misuse — not even to grade or give academic feedback, just to scan for AI.
In other words, good solution. Can’t use it.
The Non-Human Detectors
So, a committee of humans got the AI text — even modified AI text — right 99.3% of the time and the human text right 100% of the time. That’s top-notch. GPTZero and Pangram were tested on the same 300 papers. From the paper:
Our paper demonstrates that a population of ‘expert’ annotators—those who frequently use LLMs for writing-related tasks—are highly accurate and robust detectors of AI-generated text without any additional training. The majority vote of five such experts performs near perfectly on a dataset of 300 articles, outperforming all automatic detectors except the commercial Pangram model (which the experts match)
And:
The majority vote of expert humans ties Pangram Humanizers for highest overall TPR (99.3) without any false positives, while substantially outperforming all other detectors
Pangram, in other words, was “nearly perfect” in this test. More than ninety-nine percent accurate with no false positives. That’s impressive. In fact, the paper says Pangram “also outperforms each expert individually.”
It’s also more evidence that AI detectors work. Not all of them. But the good ones do. Reliably. And at amazingly high rates. People who say AI detection does not work are misinformed and need to be challenged.
Speaking of not all of them, GPTZero does not perform well. From the paper:
Pangram is near perfect on the first four experiments and falters just slightly on humanized O1-PRO articles, while GPTZero struggles significantly on O1-PRO with and without humanization.
Yikes. In one set of written samples, GPTZero’s accuracy rate is below 50%. As I have written countless times now, GPTZero is awful (see Issue 288). No one should use it.
For completeness, the research team also tested other AI detectors, most of which are complete unknowns to me. Turns out, for good reason. They are: Fast-DetectGPT, Binoculars, and RADAR. Across five tests on AI-generated material, they were a disaster. One, RADAR, scored an overall accuracy rate with AI text of just 15%. Sorry, 15.3%.
In other words, the system you use matters. It matters quite a bit. And this reality means that schools or programs that leave AI detection up to individual teachers are creating massive problems for themselves, as the exact same writing can be flagged as AI by one teacher in one class and not flagged in another, exposing serious legal risks.
Schools should use AI detectors. They should use a good one. And they should set its use by policy — e.g., if you’re going to use an AI detector you must use System X. It’s been more than two years now.
Real Benefits
The best bit of the paper, in my view, is that, by using highly accurate human reviewers, the research team can uncover clues as to why AI work can be reliably flagged. That’s a big deal in general, and in cases where academic integrity is in question, being able to articulate the why and how is important.
The paper says humans:
can provide detailed explanations of their decision-making process, unlike all of the automatic detectors in our study.
True. To be able to say, “we flagged your work as AI because of a, b, and whatever,” is good. And something the AI detector cannot or does not do.
Summary and Uses
While it’s impractical to have student work scanned by a team of experienced human experts, there is comfort in knowing that we are not left to the vagaries of machines alone to catch deception spun up by other machines.
As I mentioned in Forbes, I can envision a use in a two-tier system in schools, where student work is initially scanned with a quality, reliable detector and reviewed by the teacher. Where the authenticity of the work is challenged after that, a panel of experienced reviewers could assess it. Humans checking the machine, in a way.
In these circumstances, the number of reviews may be limited and the cost and time may be justified. If a good detector, the instructor, and a panel of experts all say the work is inauthentic, I’d be willing to accept that it is.
All schools have to do is want to do this — to make protecting their grades and degrees worth their time and money. But literally everyone reading this knows they won’t. Closing your eyes is easy and free.
A Must-Read Essay — In the Age of AI, Is Education Just an Illusion?
A few people sent in this must-read essay (account required) in The Chronicle of Higher Ed, from Dan Sarofian-Butin, “professor in the School of Education and Social Policy at Merrimack College.”
Thank you, Chronicle, for publishing it. And thank you, Professor Sarofian-Butin, for writing it.
I don’t teach. But I feel seen. I’ll share some of it here and reserve my amens to the limits of my self-control.
The opening paragraph is:
For two long years, professors have been fighting a rear-guard battle against artificial intelligence. We brought back blue books and in-person tests, appealed to our students’ ethical principles, used multiple and ever-shifting platforms in an attempt to catch cheaters, positioned AI as a supplement to (rather than a replacement of) their learning, and developed ever-more desperate attempts to “AI proof” our assignments. Nothing has worked.
I am resisting the urge.
He continues:
The vast majority of college students now use some form of AI to do their assignments.
Resisting.
The professor recounts an exchange with a student who shared that a different professor was giving “crazy” amounts of feedback on the student’s writing. Upon review, our essay author suspected the feedback was AI-generated. The student was not surprised and admitted to using ChatGPT to “write substantial portions of his papers” in the first place.
Continuing:
Let me put it starkly: We are not facing a cheating crisis. We are in the midst of a crisis of purpose.
Personally, I think the crisis of purpose is being fueled by the crisis of cheating — too many people who could act, could intercede, but simply do not want to spend the time and money. But that’s my opinion and I do not disagree with his.
Our author compares education today to professional wrestling:
Everyone — insiders and outsiders, participants and audiences — knew wrestling was fake yet still leaned into the wink, wink, nod, nod of authenticity.
Further:
There is supposedly no place for illusion or entertainment in higher education, where the life of the mind is sacrosanct, the coin of the realm is knowledge, the only buying and selling is in the marketplace of ideas. Yes, a few students may cheat and a few professors may cut corners, but higher education is not a performative spectacle!
But if you believe that, I really don’t know where you’ve been these last two long years. Probably not in a classroom.
He continues:
I was walking around my classroom the other day, observing as students were supposedly doing a quick bit of online research to a question I had posed. One student was playing sudoku; another student’s computer was off (“It ran out of power,” she claimed); another student said they couldn’t access the internet. All of them, I knew, would turn in superb reflection papers on the topic once class had ended.
Nonetheless:
In wrestling, they would derisively call me a “mark”: someone who did not understand that everything was staged. All I really had to do was perform my role. And, in turn, the students would perform theirs.
I am still resisting.
But I beg you, please read this and let it sink in just a tiny bit:
I am well aware that students have always cheated and faculty have always cut corners. This is why the sociologist Willard Waller, almost a hundred years ago, saw the classroom in a “state of perilous equilibrium.” But whereas before ChatGPT I believed that we could manage this situation, today I realize that we cannot.
Saying that, “We used to take it for granted that teaching and learning took time,” the professor writes:
none of us can simply download what we want to learn, like Neo in The Matrix, and 30 seconds later, have it appear on our computer screens.
But ChatGPT can.
And that’s precisely the problem. It collapses the entire process of teaching and learning (thoughtful, recursive, effortful) into an instantaneous, efficient, and polished transaction — no struggle, no iteration, no friction. It creates the perfect performative illusion of teaching and learning.
The friction of teaching and learning — everything from my academic labor of preparing my lectures to my students’ listening to it and then writing about it — has been instantaneously smoothed over.
Amen.
I am disappointed in myself.
When my views on academic integrity move up a degree or two, I land on two words I read here: struggle and friction. I use “effort.” But same idea. Here, cheating — especially cheating with AI — is steroids (see Issue 278). It’s the performative illusion of achievement. And students want, and expect, to be richly rewarded for it because they assume that everyone is in on the joke.
Moving on, we get:
There is no happy ending here. A disengagement spiral is upon us as AI supercharges students’ disinterest in learning and faculty disinvestment from teaching.
This, also, you must read:
I had a Zoom meeting with a first-year student who had clearly used AI for one of his assignments. His earlier assignments had a nice but vague style and tone, and the depth is what you would expect from someone straight out of high school. His most recent assignment, though, was clean, crisp, and substantial, using language and concepts I would expect from my graduate students. I pushed him on all of this, and it was clear he could not explain what he had written. (I heard him typing some of the words on his computer and then reading the explanations that popped up.) Yet he resisted any sort of confession. He was trying, he claimed, to write in a different mode; his ADHD, he reminded me, made him forget some words he used before; his high-school teachers, he solemnly explained, told him to always vary his style.
It was, to be honest, maddening. I asked him to flip the situation and imagine he was the professor and a student displayed completely different writing styles. Wouldn’t he too think something was up? “Yes,” he noted, “I see your point.” But in his case, he insisted, staring straight at me, not blinking, not hesitating, not squeamish, not sweating, not pausing, everything was absolutely his own writing.
It was a perfect performance.
A bit further, he says:
So do you know what I did? Staring straight at him, not blinking, not hesitating, not squeamish, not sweating, not pausing, I gave him an A.
I’ve heard this story more than a few times. As to why he gave the A, he says:
I’m tired of being the “mark,” tired of fighting for the authenticity of the classroom in an AI-saturated world. Yes, teaching has always been hard but usually enjoyable. And grading, well, as the saying goes, we get paid to do it. But now, today, the emotional labor of keeping my students “honest” as to whether or not they are using AI is just too exhausting.
It is. I get it.
When this whole thing blows up, let’s not be squeamish as to why — those who wanted to preserve actual learning by protecting integrity just gave up (see Issue 346). There are only so many such battles you can fight alone. I get it.
This, I think, is important too:
And, please, don’t get me wrong. I was one of those early adopters, fundamentally rethinking and revising my entire syllabus and teaching practices in order to leverage AI as (yes indeed!) supplementing my students’ learning.
But I’m tired. If all is spectacle and performative, who am I to hold up the crumbling edifice?
Overt Cheating Company Raises $5.3 Million to ‘Cheat on Everything’
A few people sent me this one too.
That former Columbia student who left or was kicked out for selling a service expressly designed to cheat during job interviews (see Issue 354) has launched a new company with the brazen purpose of letting people:
Use Invisible AI to Cheat on Everything
Seriously, go look.
They even published a “manifesto” on the site with the headline:
We want to cheat on everything.
It says, in part:
We built Cluely so you never have to think alone again.
It sees your screen. Hears your audio.
Feeds you answers in real time.
While others guess — you're already right.
And yes, the world will call it cheating.
But so was the calculator.
So was spellcheck.
So was Google.
Every time technology makes us smarter, the world panics.
Then it adapts. Then it forgets.
And suddenly, it's normal.
Whatever. These are kids, filled to the brim with ignorance and unearned arrogance. But they will make a pretty penny in the process, which is the entire motive, I am sure. As I’ve said too many times to count — cheating pays; it’s obscenely profitable.
What upsets me about this is not that kids are saying and doing stupid stuff — stuff that won’t work long term and probably does not even work now. It’s that supposed adults have, it seems, rushed to join the party. Profit predictably buries things such as integrity, responsibility, and maturity.
As mentioned, the company says it’s raised more than $5 million to sell cheating. From the coverage:
On Sunday, 21-year-old Chungin “Roy” Lee announced he’s raised $5.3 million in seed funding from Abstract Ventures and Susa Ventures for his startup, Cluely, that offers an AI tool to “cheat on everything.”
I put the links to the venture capital firms in that paragraph. Neither firm, from what I can tell, has a statement of ethics or guiding principles on its site. I see why.
Frankly, and I don’t say this often, I am disgusted.