This new AI can mimic a human voice with just 3 seconds of training

Humanity has taken another step toward the inevitable war against the machines (which we will lose) with the birth of Vall-E, an AI developed by a Microsoft research team that can create high-quality replicas of a human voice from just a few seconds of training audio.

Vall-E is not the first AI-powered voice tool; others, such as xVASynth, have been around for several years. But it promises to surpass them in terms of pure capability. In a paper published on arXiv (via Windows Central), the Vall-E researchers note that most current speech synthesis systems rely on "high-quality clean data" to accurately synthesize high-quality speech, and are therefore "limited."

"Large data crawled from the Internet cannot meet this requirement and will always result in poor performance," the paper states.

"Due to the relatively small amount of training data, current TTS systems still suffer from poor generalizability. In the zero-shot scenario, speaker similarity and speech naturalness are dramatically degraded for unseen speakers."

(In this case, a "zero-shot scenario" essentially refers to the AI's ability to reproduce a voice without being specifically trained on it.)
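To make the distinction concrete, here is a minimal sketch of what a zero-shot cloning interface looks like next to a conventional multi-speaker TTS system. The function names and signatures are hypothetical illustrations only; Microsoft has not released VALL-E's actual API:

```python
import numpy as np

# Hypothetical interfaces for illustration; not VALL-E's real API.

def conventional_tts(text: str, speaker_id: int) -> np.ndarray:
    """Classic multi-speaker TTS: can only voice speakers it was
    trained on, selected by a fixed ID."""
    ...

def zero_shot_tts(text: str, prompt_audio: np.ndarray) -> np.ndarray:
    """Zero-shot TTS: clones whatever voice appears in a short
    acoustic prompt (VALL-E claims ~3 seconds of it is enough),
    with no per-speaker training or fine-tuning step."""
    ...

# Conventional: speaker 42 must have been in the training set.
# audio = conventional_tts("Hello world", speaker_id=42)

# Zero-shot: a short clip of an unseen speaker is all that's needed.
# prompt = np.zeros(3 * 24000)  # 3 s of audio at 24 kHz (placeholder)
# audio = zero_shot_tts("Hello world", prompt_audio=prompt)
```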

Vall-E, on the other hand, has been trained on a much larger and more diverse data set: 60,000 hours of English speech drawn from over 7,000 unique speakers, all of it transcribed by speech recognition software. Although the data fed to the AI contains "noisier speech and inaccurate transcriptions" than those used by other speech synthesis systems, the researchers believe that the sheer scale and diversity of the input will allow for much more flexible, adaptive, and (this is the big point) natural speech synthesis than its predecessors.
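The paper doesn't name the transcription system the team used, but as a rough illustration of that kind of bulk-labeling pipeline, here is a minimal sketch using the open-source Whisper library; the library choice, directory name, and file extensions are assumptions for illustration:

```python
import os
import whisper  # pip install openai-whisper; illustrative choice only

# Load a small ASR model once; larger models trade speed for accuracy.
model = whisper.load_model("base")

def transcribe_corpus(audio_dir: str) -> dict[str, str]:
    """Produce (possibly noisy) transcripts for every clip in a directory,
    mimicking how a large untranscribed speech corpus could be labeled."""
    transcripts = {}
    for name in os.listdir(audio_dir):
        if not name.endswith((".wav", ".flac")):
            continue
        result = model.transcribe(os.path.join(audio_dir, name))
        transcripts[name] = result["text"]
    return transcripts

# transcripts = transcribe_corpus("speech_clips/")  # placeholder path
```

At VALL-E's scale this matters because imperfect machine transcripts are far cheaper than human-verified ones, which is what makes a 60,000-hour training set feasible in the first place.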

"Experiments

"Experimental results show that VALL-E significantly outperforms state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity," the paper, filled with numbers, equations, diagrams, and other complications, says.

"Furthermore, we found that VALL-E was able to preserve the speaker's emotion and the acoustic environment of the acoustic prompts during synthesis.

You can hear Vall-E in action on GitHub, where the research team shares a brief breakdown of how it works, along with dozens of samples of inputs and outputs. The quality varies: some voices sound like robots, some sound like humans. But as a kind of first-pass technology demo, it's impressive: imagine what this technology will look like in a year, two years, five years, as the system improves and the voice training data set expands further.

Of course, that's also why it's a problem: AI art generator Dall-E has faced backlash over privacy and ownership concerns, and the ChatGPT bot is so convincing that it was recently banned by the New York City Department of Education; Vall-E could be even more worrisome, since it could be used to bolster scam calls and deepfake videos. That may sound a bit sobering, but as our Executive Editor, Tyler Wilde, noted earlier this year, this stuff is not going away. It is imperative that we recognize the problem and regulate the creation and use of AI systems before potential problems become real (and really big) problems.

The VALL-E research team addressed these "broader impacts" in the paper's conclusion, writing that "VALL-E's ability to synthesize speech while preserving the identity of the speaker may entail potential risks in the misuse of the model, such as spoofing of speech identification or impersonation of a particular speaker." To mitigate such risks, the team says, "it is possible to build a detection model that identifies whether a speech clip was synthesized by VALL-E or not. In further developing the model, we plan to put into practice Microsoft's AI principles."
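The paper doesn't describe how such a detector would be built, but a common baseline for this kind of task is a binary classifier over acoustic features. Here is a toy sketch under that assumption, using librosa and scikit-learn; the libraries, file names, and feature choice are all illustrative assumptions, not anything the VALL-E team describes:

```python
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_features(path: str) -> np.ndarray:
    """Summarize a clip as the mean of its MFCC frames, a crude but
    standard fingerprint of its spectral character."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

# Placeholder file lists: labeled examples of real and synthesized speech.
real_clips = ["real_0.wav", "real_1.wav"]     # label 0: genuine recordings
synth_clips = ["synth_0.wav", "synth_1.wav"]  # label 1: synthesized clips

X = np.stack([mfcc_features(p) for p in real_clips + synth_clips])
y = np.array([0] * len(real_clips) + [1] * len(synth_clips))

# A toy classifier; a real detector would need far more data and a
# stronger model than logistic regression on averaged MFCCs.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# print(clf.predict_proba([mfcc_features("suspect.wav")]))  # P(synthetic)
```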

