What happens when you train AI with AI material?
You would almost have forgotten, but around this time last year we hardly talked about “AI” at all. Back then everything revolved around “NFTs” and the blockchain. That was definitely the technology of the future, according to the companies that went into business with it. Every piece of media was supposedly better off as an NFT. Speculation ran wild, and there was hardly a company that didn’t show off some blockchain-driven tech to its shareholders.
It’s pretty quiet on that front these days, because there’s a new kid in town: generative AI. ChatGPT has only been on the market for half a year, yet every company says that “AI” is the way forward. Maybe they’ll be right this time. But if you see what happens when you train an “AI” on “AI”-generated material, you may, like me, start to have doubts.
Let’s call it teething troubles
“AI”, meanwhile, is firmly entrenched in the news cycle. Here at Apparata we (I say ‘we’, of course, to shirk personal responsibility as much as possible) are guilty of that too. But actually implementing “AI” technology is not yet a simple matter. Even so, about half of all employees at large global companies already work with some kind of “AI” somewhere in their workflow.
Meanwhile, generative “AI” is no longer limited to half-baked chatbots; all kinds of media are being generated by “AI”. Opinions are divided on whether “AI” actually creates something new or simply remixes existing data. What no one can deny is that training “AI” leans heavily on other people’s intellectual property, which many consider unacceptable. And so a number of researchers asked themselves: can you train an “AI” on material generated by other so-called artificial intelligences?
The short answer is no
Intellectual property aside, it’s a question worth asking. The fact is that the internet is filling up with “AI”-generated material. Since almost all training data is scraped from the internet, it is wise to know what that change in the dataset does to your model. It turns out nothing good: when this butcher consumes his own wares, irreversible defects are the result.
Unfortunately, what we generally call artificial intelligence has nothing to do with intelligence. Intelligence is admittedly not a simple concept and definitions differ, but an essential part of it is being able to demonstrate an understanding of your environment: creativity, applying abstract concepts, things like that. That is not what the things we call “AI” do. In short, an “AI” is a statistical model that predicts the most likely continuation. For chatbots such as ChatGPT that means words; for models such as Midjourney, pixels.
Those predictions are based on training data. You tell the model what the data you feed it represents, and in doing so the model “learns”. Once the model is up and running, selection happens mostly on the results: as long as it generates convincing output, it is considered fine. But an “AI” has no knowledge of what it produces. It does not form an abstract thought and then convert it into a sentence; after each word it calculates which word is most likely to follow. There is no overarching thought behind it, and that is rather awkward, because an underlying thought is precisely what distinguishes language from a random stream of sounds.
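To make that word-by-word guessing concrete, here is a minimal sketch in Python. It is my own toy illustration and bears no resemblance to ChatGPT’s scale or architecture: a so-called bigram model that only counts which word tends to follow which in a snippet of training text, and then strings words together on that basis alone.

```python
import random
from collections import defaultdict

# Toy illustration (not how ChatGPT works internally): count which word follows
# which in the training text, then generate by repeatedly picking a likely next
# word. There is no thought behind the output, only counted frequencies.
training_text = (
    "the cat sat on the mat and the cat ate the fish "
    "the dog sat on the rug and the dog ate the bone"
).split()

follow_counts = defaultdict(lambda: defaultdict(int))
for current_word, next_word in zip(training_text, training_text[1:]):
    follow_counts[current_word][next_word] += 1

def generate(start_word, length=10):
    word, output = start_word, [start_word]
    for _ in range(length):
        candidates = follow_counts.get(word)
        if not candidates:
            break
        # Pick the next word in proportion to how often it followed this one.
        words, weights = zip(*candidates.items())
        word = random.choices(words, weights=weights)[0]
        output.append(word)
    return " ".join(output)

print(generate("the"))  # grammatical-looking snippets, with no idea behind them
```

Real language models use enormous contexts and learned weights instead of raw counts, but the principle is the same: predict the next token, nothing more.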
Model collapse
Besides being a cool name for an industrial metal band, model collapse is the answer to this article’s central question. As noted above, material generated by “AI” lacks the central idea that gives language and media an understandable logic. The more “AI” material an “AI” model consumes, the less cohesive its output becomes, as the last remnants of human input are scrambled beyond recognition. The underlying data structure disappears, even under ideal conditions. According to the researchers, the process is inevitable.
“As time goes on, errors in the generated data accumulate, forcing models that learn from generated data to perceive reality more and more poorly,” says Ilia Shumailov, one of the study’s lead authors. “We were surprised by how quickly the models forgot the original dataset they learned from in the beginning.”
Long story short: the more an “AI” model is exposed to “AI”-generated material, the worse the results. Errors pile up, the variety of what remains correct shrinks, and the model degenerates into producing barely coherent nonsense. Developers of “AI” who scrape their data from the internet have noticed this too; the Internet Archive (archive.org) is being inundated with their requests.
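What that feedback loop does can be illustrated with a very small simulation. This is my own toy setup, not the researchers’ experiment: the “model” is nothing more than a mean and a standard deviation, and every new generation is fitted only to samples produced by the previous one.

```python
import random
import statistics

# Toy model-collapse loop (assumed setup, not the study's actual experiment):
# each generation's "model" is a Gaussian fitted to samples drawn from the
# previous generation's model. No human data is ever added back in.
random.seed(42)

mu, sigma = 0.0, 1.0          # the "real" data distribution we start from
samples_per_generation = 10   # small samples make the drift easier to see

for generation in range(1, 51):
    # Train on material generated by the previous generation only.
    generated = [random.gauss(mu, sigma) for _ in range(samples_per_generation)]
    mu = statistics.mean(generated)
    sigma = statistics.stdev(generated)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mu:+.2f}, std={sigma:.2f}")

# In most runs the estimated spread keeps shrinking and the mean drifts away
# from the original data: the tails (the rare material) are the first to go.
```

The exact numbers differ per run, but the direction is the point: with every cycle the model sees only a distorted echo of the previous one, and the original distribution is never recovered.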
No sausage like uniform sausage
Of course, people themselves are not always able to interpret their environment correctly and then represent it faithfully; the data that “AI” is fed with is proof of that. But even with a very representative dataset, something strange happens in the way an “AI” produces its output.
Shumailov explains it like this. Imagine you teach an “AI” what a cat is by giving it 100 pictures of cats: 90 cats have yellow fur and 10 have blue. The model learns that yellow cats are more common. But those 10 blue cats are in the dataset too, so you would expect the model, when asked for a cat, to occasionally produce a blue one. That is not quite what happens. The “AI” depicts the blue cats as more yellow than they really are, eventually turning them green.
As the “AI” generates more and more data, eventually no blue cats are left. Let the cycle run long enough and even the green cats disappear, leaving only yellow ones. Data points that form a minority thus vanish from the dataset over time. Starting with a representative dataset is a must, but even an ideal dataset gets “contaminated” by “AI”-generated material, and the results become unrepresentative. “AI” has great trouble learning rare data.
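The same drift can be mimicked with nothing more than coloured labels. The sketch below is a crude, hypothetical re-creation of the cat example (it skips the gradual “greening” and only models the frequencies): each generation the “model” learns the colour distribution of its training set and then produces the next training set by sampling from it.

```python
import random
from collections import Counter

# Crude, hypothetical re-creation of the cat example: the "model" learns only
# the colour frequencies of its training set, then generates the next training
# set by sampling from those frequencies.
random.seed(7)

dataset = ["yellow"] * 90 + ["blue"] * 10

for generation in range(1, 101):
    counts = Counter(dataset)
    colours = list(counts)
    weights = [counts[colour] for colour in colours]
    # The next dataset consists entirely of generated cats.
    dataset = random.choices(colours, weights=weights, k=100)
    if generation % 20 == 0:
        blue_cats = Counter(dataset)["blue"]
        print(f"generation {generation:3d}: {blue_cats} blue cats out of 100")

# The blue share drifts with every cycle; the moment "blue" fails to be sampled
# even once, it can never return. In most runs it is gone long before the end.
```

Because each new dataset is a finite sample, the share of blue cats performs a random walk, and once blue fails to be drawn even once it is gone for good; with a 90/10 starting ratio, that is by far the most likely outcome.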
The implications
The above will hopefully make clear why I have stubbornly been putting the abbreviation AI between quotation marks for some time now. It has little to do with intelligence. The results are sometimes convincing, but only on a superficial level; artificial intelligences have no awareness of the material they produce. Even when the model was explicitly asked not to repeat data too often, things went wrong: it simply started making things up. Such creative outbursts are becoming more common.
For example, one scientist asked ChatGPT to compile his bibliography. The resulting list contained some publications the scientist actually had to his name, but others were simply made up by ChatGPT. It is a problem many professionals have run into.
The problem shows that there are fundamental flaws in the way “AI” works. The current models are not an artificial approximation of the human mind, however convincing the results may seem at first glance. I wouldn’t say it’s a problem “AI” developers will never solve. But given the rapid rise of “AI”, it would be wise to remember that an industry that thrives on hype may be deliberately misrepresenting its own products. AI is not intelligent, what it produces is full of distortions, and it has a tendency to simply make things up. Perhaps the scariest part is that the more we as a society embrace this technology, the less we seem to understand it.