Microsoft Corporation MSFT unveiled a text-to-speech artificial intelligence, or AI, model that can generate realistic voice imitations using a three-second audio sample.
What Happened: Last week, Microsoft announced a new AI model called VALL-E that requires only a three-second audio sample to closely simulate anyone's voice, including their emotional tone, Ars Technica reported.
The tech giant calls VALL-E a neural codec language model based on EnCodec technology announced by Meta Platforms Inc. META in October 2022.
In contrast to traditional text-to-speech techniques, which synthesize speech by manipulating waveforms, Microsoft's AI model generates discrete audio codec codes from text and acoustic prompts.
The company trained VALL-E's speech synthesis capabilities on Meta's audio library LibriLight, which consists of 60,000 hours of English-language speech from more than 7,000 speakers, the report noted.
Simply put, VALL-E analyzes the sample audio, breaks it into discrete tokens, and uses what it learned from the training data to infer how that voice would sound speaking phrases outside of the given audio sample.
However, it can only generate a realistic imitation if the audio sample closely resembles a voice in the training data.
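The pipeline described above can be sketched at a very high level. This is an illustrative toy only: VALL-E's real components are large neural networks (an EnCodec-style codec and a language model over codec tokens), and every function below is a hypothetical stand-in showing the shape of the flow, not Microsoft's implementation.

```python
# Toy sketch of a VALL-E-style flow: encode a short voice sample into
# discrete tokens, predict new audio tokens conditioned on text plus
# the prompt tokens, then decode tokens back to audio. All functions
# are placeholders; the real system uses trained neural models.

def encode_to_tokens(audio_sample: bytes, codebook_size: int = 1024) -> list:
    """Stand-in for a neural codec encoder: raw audio -> discrete
    codebook indices ("tokens")."""
    return [b % codebook_size for b in audio_sample]

def predict_audio_tokens(text: str, prompt_tokens: list) -> list:
    """Stand-in for the language model: conditioned on the target text
    and the prompt's tokens (carrying the speaker's "voice"), emit a
    new token sequence. Here, a deterministic mix as a placeholder."""
    return [(ord(c) + prompt_tokens[i % len(prompt_tokens)]) % 1024
            for i, c in enumerate(text)]

def decode_tokens(tokens: list) -> bytes:
    """Stand-in for the codec decoder: tokens -> waveform bytes."""
    return bytes(t % 256 for t in tokens)

# Three-second sample in, speech for unseen text out.
sample = b"\x01\x02\x03\x04"                      # pretend audio clip
prompt_tokens = encode_to_tokens(sample)
speech = decode_tokens(predict_audio_tokens("Hello world", prompt_tokens))
```

The key design point the sketch mirrors is that the "voice" travels through the pipeline as discrete tokens rather than as a waveform, which is what lets a language-model-style component continue it for arbitrary text.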
Why It's Important: Unlike OpenAI, which released ChatGPT for the public to experiment with, Microsoft appears cautious about VALL-E's potential to be used deceptively.
The research paper stated, "Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker."
The company said a detection model could be built to authenticate whether VALL-E synthesized an audio clip.
© 2024 Benzinga.com. Benzinga does not provide investment advice. All rights reserved.