AI detection software not a fan of non-native English speakers
Some are already convinced that “AI” will change the world beyond recognition. The dream applications of the technology are many and widespread. For the time being, though, “AI” is mainly an escape hatch for people too lazy to write their own texts. Not that I don’t understand. Having been a student myself, I can say with some confidence that the temptation must be great for those unfortunates who have to write yet another nonsensical essay. It is not for nothing that it is mainly teachers asking for tools that can detect the work of “AI”. It’s a pity that, just like the “AI” tools themselves, those detectors regularly fail.
Up to 99% accurate, unless your parents speak another language
At least, that is what some vendors of “AI” detection software claim. Given the high demand for such software, you would hope those claims are well substantiated. That hope turns out to be misplaced. Researchers had several “AI” detection programs pass judgment on a series of essays. One batch was written by students for whom English is not their first language, the other by schoolchildren of secondary school age who were raised in English.
All seven programs studied showed a difference in their assessment of these two groups. The essays from the first group came from the TOEFL, or Test of English as a Foreign Language, a widely recognized test of English proficiency. Of the essays written by students who do not speak English as a first language, more than half were flagged as “AI”-generated. One detection program even marked 98% of these essays, all written by humans, as the work of “AI”.
By comparison, the schoolchildren raised in English fared a lot better: on average, the “AI” detection software judged more than 90% of their essays to be written by a human. The study’s dataset is not very large, so follow-up research will be necessary. Even so, the gap between the two groups is large enough that the researchers find it worrying.
“Text perplexity” as an indicator
Now let me address the most obvious concern right away: no, the detection software isn’t racist. At least, not intentionally. The reason these tools are so bad at correctly classifying the essays of the non-native English group has to do with how “AI” works, and with one key concept in particular. “Text perplexity” is the degree of difficulty a so-called artificial intelligence has in predicting which word comes next. Predicting the next word is essentially all a text-generating “AI” does. When the model can easily predict the next word, the text’s perplexity is low. Does the model struggle? Then the perplexity is high.
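Since the whole argument hinges on this concept, here is a minimal sketch of what perplexity means in practice. This is my own illustration, not the study’s code: it scores text with the openly available GPT-2 model via the Hugging Face transformers library, whereas the detectors in the study are proprietary and certainly more involved.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Tokenize the text; the model will try to predict every next token.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # average cross-entropy of its next-word predictions.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    # Perplexity is the exponential of that average loss:
    # low = the model found the text easy to predict.
    return torch.exp(loss).item()

# Formulaic phrasing tends to score lower than unusual phrasing.
print(perplexity("The weather is very nice today."))
print(perplexity("Perplexity quantifies predictive surprise."))
```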
To appear as credible as possible, large language models such as ChatGPT keep text perplexity low: they stick to the safe, predictable word choices. The result is text with a lot of repetition and stock phrasing. An “AI” detection program is trained to recognize exactly those patterns. I think the problem is clear by now. People for whom English is not their first language often have a smaller vocabulary, and a smaller vocabulary means more repetition. Those who have not fully mastered a language will also use it more formulaically. All of it reads, to the detection software, like the fingerprint of “AI”.
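To make that failure mode concrete: in caricature, a perplexity-based detector is little more than a threshold check. The sketch below reuses the hypothetical perplexity function from the previous snippet, and the cutoff value is invented for illustration; real products are more elaborate, but the bias arises the same way.

```python
# A deliberately simplified caricature of a perplexity-based detector,
# reusing the perplexity() function from the sketch above. The threshold
# is invented for illustration and not taken from any real product.
AI_THRESHOLD = 40.0

def flagged_as_ai(text: str) -> bool:
    # Low perplexity = highly predictable text = "reads like a model".
    # A human writing with a small vocabulary and formulaic phrasing
    # lands in exactly the same region.
    return perplexity(text) < AI_THRESHOLD
```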
But wait, it gets even better. To drive home this inherent flaw in the tested detection software, the researchers turned to a piece of irony that Alanis Morissette could learn from. They fed the essays of the non-native English students to ChatGPT and asked it to rewrite them in more refined language. The result? This time, the detection software marked all the essays as human work. The whole thing led the researchers to a rather paradoxical conclusion: if you are not writing in your native language, you had better use ChatGPT to get past the “AI” detection.
Another reason for minorities to be wary of “AI”
I don’t think I have to explain to anyone that the academic consequences of such mistakes can be huge. The research exposes yet another way in which “AI” technology threatens to make life more difficult for certain groups of people. As mentioned earlier, “AI” finds it extremely difficult to correctly interpret infrequent data points; usually they are simply lumped in with the majority. We need only look at the facial recognition debacle to see how such an inability to deal with rarer data points can lead to big problems.
“The implications of GPT detectors for people who do not write in their native language are serious, and we need to think carefully about how to prevent such instances of discrimination,” the researchers warn. If the discrimination in “AI” detection software exposed by the study is not addressed, the consequences are all but guaranteed. Students with a migration background will increasingly be shut out of academic and professional careers by false accusations. And Google, for example, gives lower priority to content it has designated as “AI” material, which likewise threatens to marginalize people who do not write in their native language.
There is no doubt that we need ways to distinguish “AI”-generated material from human work. But it should not be the case that, in order to separate the shoddy output of “AI” from the merely uninspired student work, we apply methods that are just as sloppy as the “AI” itself.