AIModels.fyi

Zero-shot voice cloning without transcription

MiniMax-Speech: Intrinsic zero-shot text-to-speech with a learnable speaker encoder

aimodels-fyi
May 22, 2025

Text-to-speech systems have improved quickly in recent years, but most still struggle to clone a voice without extensive training data for that speaker. MiniMax-Speech addresses this with zero-shot voice cloning that needs only a reference audio clip, with no transcript of that clip.

The architecture of MiniMax-Speech includes tokenizers, an autoregressive Transformer with speaker encoder, and a latent flow matching model. All images are from the paper.

Innovations That Set MiniMax-Speech Apart

Two key innovations make MiniMax-Speech stand out from existing text-to-speech systems:

  1. A learnable speaker encoder that captures voice characteristics from untranscribed audio, enabling true zero-shot voice cloning

  2. A Flow-VAE architecture that enhances audio quality and speaker similarity by improving information representation

Many earlier models described as “zero-shot” still require a paired text-audio example as a prompt. MiniMax-Speech can instead generate high-quality speech in a target voice from an untranscribed audio sample alone.

True Zero-Shot Voice Cloning

MiniMax-Speech employs an autoregressive Transformer architecture similar to those used in large language models. The key difference is its speaker encoder, which extracts timbre and vocal style from reference audio without needing any transcription.

“Different voice cloning approaches in AR Transformer. (a.) One-shot approach requiring paired text-audio prompt. (b.) Intrinsic zero-shot approach using only untranscribed audio. (c.) Enhanced one-shot approach combining both methods.”
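
The three setups in the figure can be sketched as a function that assembles the conditioning sequence handed to the autoregressive model. This is a minimal illustration and not the paper's implementation: `Segment`, `build_ar_input`, the segment labels, and the ordering of segments are all hypothetical, and strings stand in for real tensors and token ids.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    kind: str     # hypothetical label such as "speaker_embedding"
    payload: str  # stand-in for real tensors or token ids


def build_ar_input(
    target_text: str,
    reference_audio: Optional[str] = None,
    prompt_text: Optional[str] = None,
    prompt_audio: Optional[str] = None,
) -> List[Segment]:
    """Assemble a conditioning sequence for the AR model (illustrative only)."""
    if (prompt_text is None) != (prompt_audio is None):
        raise ValueError("a one-shot prompt needs both text and audio")
    if reference_audio is None and prompt_audio is None:
        raise ValueError("need reference audio, a paired prompt, or both")

    segments: List[Segment] = []
    if reference_audio is not None:
        # (b) Zero-shot path: untranscribed audio goes through the speaker encoder.
        segments.append(Segment("speaker_embedding", reference_audio))
    if prompt_text is not None:
        # (a) One-shot path: the transcript of the prompt precedes the target text.
        segments.append(Segment("prompt_text", prompt_text))
    segments.append(Segment("target_text", target_text))
    if prompt_audio is not None:
        # (a) One-shot path: the prompt's audio tokens are what generation continues from.
        segments.append(Segment("prompt_audio_tokens", prompt_audio))
    return segments


# (b) needs no transcript; (c) simply supplies both kinds of conditioning.
zero_shot = build_ar_input("Hello there", reference_audio="ref.wav")
enhanced = build_ar_input(
    "Hello there",
    reference_audio="ref.wav",
    prompt_text="Good morning",
    prompt_audio="prompt.wav",
)
print([s.kind for s in zero_shot])
print([s.kind for s in enhanced])
```

In this sketch the zero-shot call never touches text associated with the reference clip, which is why the reference can be in any language or have no transcript at all.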

The speaker encoder is jointly trained with the autoregressive model, unlike systems that use pre-trained speaker verification models. This joint training allows the encoder to better capture the specific characteristics needed for high-quality speech synthesis.
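
To make the difference between a jointly trained encoder and a frozen pre-trained one concrete, here is a toy sketch in which each component is reduced to a single scalar weight. Everything in it is an assumption made for illustration (the `sgd_step` function, the squared-error loss, the learning rate); the real model uses neural networks and a different objective. The only point is that with joint training the synthesis loss also updates the encoder.

```python
from typing import Tuple


def sgd_step(
    w_enc: float,
    w_ar: float,
    x: float,
    y: float,
    lr: float = 0.1,
    train_encoder: bool = True,
) -> Tuple[float, float]:
    """One gradient step on the toy loss (w_ar * (w_enc * x) - y) ** 2."""
    emb = w_enc * x                # toy stand-in for the speaker embedding
    err = w_ar * emb - y           # toy stand-in for the synthesis error
    grad_ar = 2 * err * emb        # d(loss)/d(w_ar)
    grad_enc = 2 * err * w_ar * x  # d(loss)/d(w_enc), via the chain rule
    new_ar = w_ar - lr * grad_ar
    # A frozen encoder ignores its gradient; a jointly trained one applies it.
    new_enc = w_enc - lr * grad_enc if train_encoder else w_enc
    return new_enc, new_ar


joint = sgd_step(1.0, 1.0, x=1.0, y=2.0, train_encoder=True)
frozen = sgd_step(1.0, 1.0, x=1.0, y=2.0, train_encoder=False)
print("joint :", joint)
print("frozen:", frozen)
```

In the frozen case the encoder weight stays where its original objective (for example, speaker verification) left it, so the embedding is never shaped by what the synthesizer actually needs.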

This approach offers several advantages:

  • No transcription needed for reference audio

  • Cross-lingual synthesis capabilities

  • More natural prosody as the model isn’t constrained by prompt examples

  • Flexible voice cloning across 32 languages

While MiniMax-Speech supports both zero-shot and one-shot cloning, its zero-shot capabilities are what truly distinguish it from other systems like VALL-E, CosyVoice 2, and Seed-TTS, which require paired text-audio samples for speaker conditioning.

Enhancing Audio Quality with Flow-VAE

The second major innovation in MiniMax-Speech is its Flow-VAE architecture, which significantly improves audio quality.
