Human or artificial intelligence? 5 tools under test
At the latest since the release of ChatGPT, artificial intelligence has been on everyone’s lips, because texts from AI language models can hardly be distinguished from human-written ones. By now, however, there are both telltale signs that can point to AI texts and so-called AI checkers. A test report.
The topic of artificial intelligence is currently generating real hype. At the latest since the release of ChatGPT, one thing is clear: AI texts can hardly be distinguished from human ones.
Nevertheless, there are some indications that point to artificial intelligence as the author of a text. So-called AI checkers in particular promise a remedy: they claim to be able to distinguish human texts from AI texts. But how good and reliable are these tools really? We put five providers to the test.
AI checkers in the test: how reliably do the tools recognize AI texts?
In order to assess the AI checkers, we had the tools analyze various texts that we had already used for our comparison of ChatGPT 3.5 and version 4.0.
We had both versions write a 100-word text about the beginnings of the iPhone as well as three sentences explaining how Twitter works. In addition, we wrote our own text for both prompts.
Both versions of ChatGPT impressed with their objective style, with GPT 3.5 even appearing the more human of the two. But do the AI checkers recognize that the texts come from an artificial intelligence?
1. AI Text Classifier: OpenAI’s AI checker
The so-called “AI Text Classifier” is a tool from ChatGPT developer OpenAI itself. But can the AI checker recognize the texts of its in-house artificial intelligence? Unlike some other providers, OpenAI is a bit more reserved. The company writes about its tool:
“The classifier is not always accurate; it can mislabel both AI-generated and human-written text.” To perform a check, the “AI Text Classifier” requires an input of at least 1,000 characters. Depending on word length, this corresponds to roughly 150 to 250 words.
Because of the character minimum, we had to rephrase our prompt for ChatGPT 3.5 and 4.0. Our own text, on the other hand, was long enough. The verdict on the “human” content: “The classifier considers the text unclear if it is AI-generated.”
The AI checker classifies the ChatGPT 3.5 text as “probably AI-generated”. The same applies to the content from the GPT 4.0 version. Since both AI models come from the tool’s own developer, this shouldn’t really come as a surprise.
The three-sentence Twitter text, meanwhile, is too short to meet the requirements. If we lengthen the content, we get a result similar to the iPhone text: the “AI Text Classifier” recognizes GPT 3.5 and 4.0 as “probably AI-generated”. However, the tool also believes that our “human” content came from an AI.
2. Writer
According to the AI checker Writer, content that reads “as if it was created entirely by an artificial intelligence” can hurt search engine rankings. To use the free “AI Content Detector”, a text must not exceed 1,500 characters.
The result: the tool classifies every one of our inputs, both the iPhone and the Twitter texts, as 100 percent human. That verdict is correct only for the content of actually human origin; the AI checker failed to flag a single AI text from GPT 3.5 or GPT 4.0.
3. Copyleaks AI Content Detector
The “AI Content Detector” by Copyleaks promises a lot: “Paste your content below and we’ll tell you within seconds with exceptional accuracy if any of it was AI-generated.” The tool also claims to be “the only AI content detection solution for the enterprise.” But does the AI checker deliver on its promises? The result:
The “AI Content Detector” incorrectly classified ChatGPT 3.5’s iPhone text as “this is human text”. The same applies to the lines from the GPT 4.0 version, part of the paid ChatGPT Plus subscription. Our “human” text, however, was correctly recognized as such.
The result looks different for the three-sentence explanatory text about Twitter: here the “AI Content Detector” correctly recognized the contributions of both ChatGPT 3.5 and version 4.0 as “AI content”. It also correctly classified our “human” three-liner. The AI checker thus seems to be more accurate on shorter content.
4. GPTZero
The GPTZero platform describes itself as the “world’s leading AI detector with over 1 million users”. The minimum requirement for checking a text is 250 characters – significantly less than OpenAI’s “AI Text Classifier” demands. But how reliable is GPTZero?
The result: the AI checker classifies the GPT 3.5 iPhone text as “probably written entirely by a human”, which is wrong. The same verdict is correct for the iPhone text we wrote ourselves. For the GPT 4.0 text, however, the tool again claims that the content comes from a human.
For the Twitter texts, the picture is as follows: the GPT 3.5 content was detected as AI text, while the GPT 4.0 contribution was again, incorrectly, deemed human. The content we authored was correctly classified as human.
What is striking about GPTZero: of all the tools tested, this AI checker takes the longest to analyze a text. The platform also frequently spits out error messages, possibly due to overload.
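The input-length requirements quoted so far differ from tool to tool. As a small sketch, using only the limits stated in this test, a text could be pre-checked before submission like this (the `LIMITS` table and helper name are our own illustration, not part of any tool’s API):

```python
# Input-length limits quoted in this test (in characters, as of our check):
# OpenAI's AI Text Classifier requires at least 1,000, Writer's free
# detector accepts at most 1,500, and GPTZero requires at least 250.
LIMITS = {
    "AI Text Classifier": (1000, None),  # (minimum, maximum); None = no limit
    "Writer": (None, 1500),
    "GPTZero": (250, None),
}

def accepted_by(text: str) -> list[str]:
    """Return the tools whose length requirements the text satisfies."""
    n = len(text)
    return [
        tool
        for tool, (lo, hi) in LIMITS.items()
        if (lo is None or n >= lo) and (hi is None or n <= hi)
    ]

print(accepted_by("x" * 300))
```

A 300-character text, for instance, would pass Writer’s and GPTZero’s limits but fall short of the classifier’s 1,000-character minimum, which is why we had to lengthen the Twitter texts for OpenAI’s tool.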
5. Crossplag
The Crossplag platform advertises its “AI Content Detector” as follows: “Originality has a new threat, and here is the solution.” Whether there is a minimum or maximum input length in words or characters is not apparent, at least at first glance. What does become noticeable after two checks, however, is that the tool then requires registration.
Like the AI checker Writer, Crossplag classifies all texts as “This text is mainly written by a human”, which is incorrect in four out of six cases. Combined with the forced login, which we bypassed by switching browsers, the tool is neither effective nor the promised solution; rather, caution is advised.
Conclusion: recognizing artificial intelligence – AI checkers in the test
Note: To keep this article from getting out of hand, we have not reproduced the example texts here. You can find both the GPT 3.5 and 4.0 outputs, as well as our corresponding human-written contributions, in our ChatGPT comparison.
We have recorded the results of our AI checker test in the following table. A red X stands for an incorrectly classified text, a green tick for a correctly classified one. Since we checked two examples per category, each cell contains two symbols: the first for the iPhone text, the second for the Twitter text.
| | ChatGPT 3.5 | ChatGPT 4.0 | Human |
|---|---|---|---|
| AI Text Classifier | ✓✓ | ✓✓ | ✓X |
| Writer | XX | XX | ✓✓ |
| Copyleaks AI Content Detector | X✓ | X✓ | ✓✓ |
| GPTZero | X✓ | XX | ✓✓ |
| Crossplag | XX | XX | ✓✓ |
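As a sanity check, the tallies behind our verdict can be reproduced with a short script. The score strings below are our own transcription of the results described in the text (per tool: ChatGPT 3.5, ChatGPT 4.0, then the human texts; iPhone text first, Twitter text second in each pair):

```python
# "✓" = correct classification, "X" = misclassification, per our test.
RESULTS = {
    "AI Text Classifier": "✓✓ ✓✓ ✓X",
    "Writer": "XX XX ✓✓",
    "Copyleaks AI Content Detector": "X✓ X✓ ✓✓",
    "GPTZero": "X✓ XX ✓✓",
    "Crossplag": "XX XX ✓✓",
}

def score(cells: str) -> tuple[int, int]:
    """Count correct classifications (✓) out of all checks."""
    marks = cells.replace(" ", "")
    return marks.count("✓"), len(marks)

for tool, cells in RESULTS.items():
    correct, total = score(cells)
    print(f"{tool}: {correct}/{total} correct")
```

This yields 5 of 6 correct for OpenAI’s classifier and only 2 of 6 for Writer and Crossplag, matching the conclusion below.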
The AI Text Classifier from OpenAI achieved the best overall result. Since the company developed both GPT 3.5 and version 4.0, this is hardly surprising. However, the AI checker was also the only one in our test to classify a human text as AI content.
While Copyleaks’ “AI Content Detector” and GPTZero each achieved at least partial success in recognizing AI texts, Writer and Crossplag failed completely: both tools classified every input as “human”. Crossplag, moreover, appears to be rather hungry for data.