Microsoft's latest speech generator is so good that they are afraid to release it to the public.

"What we have created is so great that we can't risk releasing it to the public." That's basically what Microsoft says about its latest speech generator, VALL-E 2.

If all this is accurate, what does it say about Microsoft that it is knowingly building AI tools it considers too dangerous to release to the public?

Anyway, these are the basic facts. In a recent blog post (via ExtremeTech), Microsoft stated that its latest neural codec language model for speech synthesis, known as VALL-E 2, has achieved "human-like" performance for the first time.

Specifically, "VALL-E 2 is able to generate accurate and natural speech in the original speaker's voice, rivaling human performance." To some extent, this is nothing new. What makes VALL-E 2 remarkable is the incredible speed with which it works, or, to put it another way, how few samples and prompts it needs to pull off the feat.

VALL-E 2 can accurately mimic the voice of a particular person from only a few seconds of sample audio. It pulls off this trick using a vast training library that maps variations in pronunciation, intonation, and cadence to models and samples, then spits out what sounds like a perfectly convincing synthetic voice.

Microsoft's blog post includes a variety of audio clips showing how VALL-E 2 (and indeed its predecessor, VALL-E) can turn short 3- to 10-second samples into convincing synthetic voices that are often indistinguishable from a real human voice.

This is a process known as zero-shot speech synthesis, or zero-shot TTS for short. Again, the approach itself is not new, but the accuracy and the brevity of the required audio sample are.
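The core idea of zero-shot TTS is that the model is never retrained for a new speaker: a short "enrollment" sample is reduced to a speaker representation, which then conditions synthesis of arbitrary new text. The toy sketch below is purely illustrative (it is not VALL-E 2's actual architecture, and all function names are hypothetical); it stands in a crude pitch profile for the rich speaker embedding a real model would learn.

```python
# Toy illustration of zero-shot conditioning (hypothetical; not how
# VALL-E 2 actually works). A short enrollment sample fixes "speaker"
# parameters, which then condition synthesis of new text without any
# retraining - that absence of retraining is the "zero-shot" part.
import math

def extract_speaker_profile(sample):
    """Estimate crude 'voice' features from an enrollment sample.
    Here the sample is just a list of per-frame pitch values in Hz."""
    mean_pitch = sum(sample) / len(sample)
    pitch_range = max(sample) - min(sample)
    return {"mean_pitch": mean_pitch, "pitch_range": pitch_range}

def synthesize(text, profile, frames_per_char=2):
    """Generate a per-frame pitch contour for new text, conditioned on
    the speaker profile rather than on any speaker-specific training."""
    n = len(text) * frames_per_char
    contour = []
    for i in range(n):
        # Simple sinusoidal intonation around the speaker's mean pitch.
        f0 = profile["mean_pitch"] + 0.5 * profile["pitch_range"] * math.sin(2 * math.pi * i / n)
        contour.append(round(f0, 1))
    return contour

# A few seconds of "enrollment" audio reduced to pitch frames:
enrollment = [110, 120, 115, 125, 118, 112]
profile = extract_speaker_profile(enrollment)
contour = synthesize("Hello", profile)
print(round(profile["mean_pitch"], 1))  # prints 116.7
```

A real system replaces the hand-picked pitch features with a learned speaker embedding and the sine-wave contour with a neural decoder, but the shape of the pipeline (enroll once, condition everywhere) is the same.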

Of course, the idea of weaponizing such tools to create fake content for malicious purposes is not new either. But the capabilities of VALL-E 2 seem to take that threat to a whole new level. That is why the "ethics statement" attached to the blog post makes it clear that Microsoft does not currently intend to release VALL-E 2 to the public.

"VALL-E 2 is a pure research project. We currently have no plans to incorporate VALL-E 2 into a product or make it available to the public," Microsoft writes, adding, "There are potential risks associated with misuse of the model, such as spoofing speech identification or impersonating a particular speaker. We conducted our experiments under the assumption that the user agrees to be the target speaker of the speech synthesis. If the model is to be generalized to unknown speakers in the real world, a protocol needs to be included to ensure that the speaker authorizes the use of his or her voice, along with a synthetic speech detection model."

Microsoft has expressed similar concerns about VASA-1, which can turn still images of people into compelling motion video. "VASA-1 is not intended to create content to mislead or deceive. However, like other related content generation technologies, it could be exploited to impersonate humans," Microsoft said of that tool.

The obvious observation, perhaps, is that the problems associated with such a model are hardly a surprise. Even before anyone succeeded in building a near-perfect speech synthesis model, it was easy to imagine what could go wrong if such a tool were made available to the public.

So the problems were easy to foresee, yet Microsoft went ahead anyway. Now it claims to have achieved its goal, only to decide that the result is unfit for public release.

This rather raises the question: what other tools is Microsoft developing, knowing in advance that they are too problematic to release to the public? And one can't help but wonder what its aims are.

There is also the inevitable "genie out of the bottle" conundrum. Microsoft created this tool, and it is hard to imagine that it, or something like it, will not eventually escape into the wild. In short, the ethics here are murky. Where it will all end up is still anyone's guess.
