Microsoft has developed a new artificial intelligence (AI) called VALL-E that can imitate human voices. A three-second voice sample is apparently already sufficient.
Artificial intelligence and AI tools are playing an increasingly important role. However, the algorithms often lack human characteristics such as independent thinking. That is how, for example, supposedly historical texts about bears in space came to be generated. Still, tools like ChatGPT show what’s possible these days.
Microsoft is now showing that deceptively realistic communication with a machine can go beyond text. The company recently presented its own text-to-speech (TTS) model called VALL-E. The scary thing about it: it can imitate people in a deceptively realistic way. VALL-E needs only three seconds of voice recording to do so.
VALL-E imitates human voices
In principle, the AI can thus imitate any person in the world in a deceptively realistic way. The system is based on a technology called EnCodec from the technology company Meta, which the US company first announced in October 2022. In the process, the AI analyzes how a person speaks. It then draws on training data to simulate other pitches.
Three seconds of recorded audio are enough to produce a natural-sounding voice. In theory, voice assistants could be created that sound like Barack Obama or Angela Merkel.
Training data from the LibriLight audio library, also a creation of Meta, ensures an even better result. The library contains 60,000 hours of audio recordings from 7,000 English speakers.
New model carries some risks
The model is also said to reproduce the acoustic environment of the voice sample. For example, if the system is given a voice sample recorded over the phone, the generated voice also sounds like a person on the phone. As anyone can imagine, this approach carries many risks. Microsoft shares that view.
To prevent the model from being misused, the company has therefore developed a detection model that can clearly tell whether a recording comes from VALL-E. This is meant to stop criminals from misusing the technology, for example to pass voice authentication or for other purposes. Whether that will be enough remains to be seen.