Meta's AI Voicebox Aims To Do For Audio What ChatGPT, Dall-E Did For Text, Image Generation

by Aj Fabino, Benzinga Staff Writer

Zinger Key Points

Meta's Voicebox uses advanced AI that can produce realistic, contextually accurate speech output from text.
Voicebox can edit audio clips actively, eliminating noise from speech and replacing misspoken words.

Meta Platforms Inc. (NASDAQ: META) CEO Mark Zuckerberg on Friday introduced Voicebox, a generative artificial intelligence model for text-to-speech (TTS).

What Happened: Voicebox is an advanced AI model that can produce realistic, contextually accurate speech from a given text and can potentially complete tasks for which it was not explicitly trained. Engadget likened Voicebox's potential for audio to what OpenAI's ChatGPT did for text output and what Dall-E did for image generation.

Zuckerberg made the announcement via his Meta Channel on Instagram, accompanied by a video showing how Voicebox can convert text into speech in various styles, how it can handle background noise much like an audio eraser and how it can even replace spoken words.

Voicebox is built on a “non-autoregressive flow-matching model trained to infill speech, given audio context and text,” Engadget noted. Its training involved more than 50,000 hours of diverse, unfiltered audio in multiple languages, including English, French, Spanish, German, Polish and Portuguese.

Voicebox draws on this varied training to deliver conversationally fluid speech in several languages. In tests, speech recognition models trained on synthetic speech generated by Voicebox performed nearly as well as models trained on real speech, with only a 1% degradation in error rate, the report said.
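For readers unfamiliar with how a speech recognizer's error rate is typically scored, the sketch below computes word error rate (WER), the most common such metric. The report does not say which error metric was used, so treating it as WER is an assumption, and the function name and sample sentences are purely illustrative rather than anything from Meta's evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only; not taken from Meta's Voicebox evaluation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Made-up transcripts: one word of six is dropped by the recognizer.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Under this reading, a recognizer trained on synthetic speech would be compared with one trained on real speech by scoring both on the same test transcripts and looking at the gap between the two rates.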

Why It Matters: One of Voicebox’s defining characteristics is its ability to edit audio clips, eliminating noise from speech and replacing misspoken words. A user can identify a noisy segment of speech, crop it and instruct the model to regenerate that segment, much like using image-editing software to touch up photos.

Unlike existing TTS generators, Voicebox doesn’t require extensive source material to mimic a subject. It’s the result of Meta’s zero-shot text-to-speech training method known as Flow Matching.

While the potential applications of Voicebox are nearly endless, Meta has decided not to release the app or its source code to the public for now because of concerns about potential misuse, according to Engadget.

META Price Action: Shares of Meta are trading 0.92% higher at $284.42 at last check, according to Benzinga Pro.

Photo via Pixabay.
