Meta's AI Voicebox Aims To Do For Audio What ChatGPT, Dall-E Did For Text, Image Generation

Zinger Key Points
  • Meta's Voicebox uses advanced AI that can produce realistic, contextually accurate speech output from text.
  • Voicebox can also edit audio clips, removing noise from speech and replacing misspoken words.

Meta Platforms Inc (NASDAQ: META) CEO Mark Zuckerberg on Friday introduced Voicebox, a generative artificial intelligence model for text-to-speech (TTS).

What Happened: Voicebox is an advanced AI model that can produce realistic, contextually accurate speech output from given text and has the potential to complete tasks for which it was not explicitly trained. Engadget compared Voicebox to what OpenAI's ChatGPT did for text output and what Dall-E did for image generation.

Zuckerberg made the announcement via his Meta Channel on Instagram, accompanied by a video showing how Voicebox can convert text into speech in various styles, how it can handle background noise much like an audio eraser and how it can even replace spoken words.

Voicebox is built on a “non-autoregressive flow-matching model trained to infill speech, given audio context and text,” Engadget noted. Its training involved more than 50,000 hours of diverse, unfiltered audio in multiple languages, including English, French, Spanish, German, Polish and Portuguese.

Voicebox leverages its varied training to deliver conversationally fluid speech in various languages. According to the report, speech recognition models trained on synthetic speech generated by Voicebox performed nearly as well in tests as models trained on real speech, with a 1% degradation in error rate.

Why It Matters: One of Voicebox’s defining characteristics is its ability to edit existing audio clips: it can eliminate noise from speech and replace misspoken words. After identifying a noisy segment in the speech, the user can crop it and instruct the model to regenerate that segment, much like using image-editing software to retouch photos.
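The crop-and-regenerate workflow can be pictured with a minimal Python sketch. Everything in it is an assumption for illustration: Meta has not released Voicebox or its code, so the function names (mask_segment, placeholder_infill, regenerate) are hypothetical, and the stand-in filler simply interpolates between neighboring samples rather than generating speech. It only shows the shape of the task: mark a time span, blank it out, and have a model fill the gap while the rest of the clip stays untouched.

```python
from typing import Callable, List, Tuple

Samples = List[float]


def mask_segment(audio: Samples, sample_rate: int,
                 start_s: float, end_s: float) -> Tuple[Samples, List[bool]]:
    """Blank out the span [start_s, end_s) and return (cropped audio, mask)."""
    start = max(0, int(start_s * sample_rate))
    end = min(len(audio), int(end_s * sample_rate))
    mask = [start <= i < end for i in range(len(audio))]
    cropped = [0.0 if m else x for x, m in zip(audio, mask)]
    return cropped, mask


def placeholder_infill(cropped: Samples, mask: List[bool],
                       transcript: str) -> Samples:
    """Hypothetical stand-in for a generative model.

    It fills each masked run by linear interpolation between the samples
    on either side and ignores the transcript. A real infilling model
    would use the surrounding audio and the text to generate speech.
    """
    out = list(cropped)
    n = len(out)
    i = 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        j = i
        while j < n and mask[j]:
            j += 1
        left = out[i - 1] if i > 0 else 0.0
        right = out[j] if j < n else 0.0
        span = j - i + 1
        for k in range(i, j):
            out[k] = left + (right - left) * (k - i + 1) / span
        i = j
    return out


def regenerate(audio: Samples, sample_rate: int, start_s: float, end_s: float,
               transcript: str,
               infill: Callable[[Samples, List[bool], str], Samples] = placeholder_infill
               ) -> Samples:
    """Crop a segment, ask the infill function to fill it, keep the rest as-is."""
    cropped, mask = mask_segment(audio, sample_rate, start_s, end_s)
    filled = infill(cropped, mask, transcript)
    # Only masked samples are replaced; unmasked audio is copied from the original.
    return [f if m else a for a, f, m in zip(audio, filled, mask)]
```

In this sketch, a call such as regenerate(clip, 16000, 1.2, 1.8, "corrected words") would replace only the samples between 1.2 and 1.8 seconds; swapping placeholder_infill for an actual generative model is the part Voicebox would supply.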

Unlike existing TTS generators, Voicebox doesn’t require extensive source material to mimic a subject. That capability is the result of Meta’s zero-shot text-to-speech training method, known as Flow Matching.

While the potential applications of Voicebox are nearly endless, Meta has decided not to release the app or its source code to the public for now because of concerns about potential misuse, according to Engadget.

META Price Action: Shares of Meta are trading 0.92% higher at $284.42 at last check, according to Benzinga Pro.
