Vevo: Controllable Zero-Shot Voice Imitation with
Self-Supervised Disentanglement

Abstract

Imitating a voice with respect to specific speech attributes, such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre and style effectively, which makes controllable generation difficult, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or the content tokens of speech as input, we use an autoregressive transformer, prompted by a style reference, to generate the content-style tokens; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer, prompted by a timbre reference, to produce acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility.
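
As a rough illustration of the codebook-size bottleneck mentioned in the abstract, the following minimal sketch shows only the nearest-neighbor lookup step of a vector quantizer and how the vocabulary size bounds the information carried per token. It is not our implementation: the codebooks are random rather than learned, the random `frames` array merely stands in for HuBERT features, and the names `quantize` and `bits_per_token` as well as the vocabulary sizes 32 and 4096 are hypothetical choices made for this example.

```python
import numpy as np


def quantize(features, codebook):
    """Return, for each frame, the index of its nearest codebook vector (L2)."""
    # Pairwise squared distances between frames (T, D) and codes (K, D).
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def bits_per_token(vocab_size):
    """Upper bound on the information one token can carry, in bits."""
    return float(np.log2(vocab_size))


rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))          # stand-in for continuous features

# Hypothetical vocabulary sizes, chosen only to contrast a narrow and a wide
# bottleneck; the codebooks here are random, not trained VQ-VAE codebooks.
narrow_codebook = rng.normal(size=(32, 16))
wide_codebook = rng.normal(size=(4096, 16))

narrow_tokens = quantize(frames, narrow_codebook)
wide_tokens = quantize(frames, wide_codebook)

print(bits_per_token(32), bits_per_token(4096))
```

A smaller vocabulary forces each token to discard more detail from the continuous features, which is the sense in which the vocabulary size acts as a bottleneck.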

Note: For Vevo-Timbre, we directly extract the content-style tokens from the source speech and feed them into the flow-matching transformer for generation.

| Task | Model | Source | Style Reference | Timbre Reference |
|---|---|---|---|---|
| Zero-Shot Timbre Imitation | Vevo-Timbre | \( \textcolor{blue}{U_i} \) | / | \( \textcolor{red}{U_r} \) |
| Zero-Shot Style Imitation | Vevo-Style | \( \textcolor{blue}{U_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{blue}{U_i} \) |
| Zero-Shot Voice Imitation | Vevo-Voice | \( \textcolor{blue}{U_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{red}{U_r} \) |
| Zero-Shot Voice Imitation | Vevo-TTS | \( \textcolor{blue}{T_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{red}{U_r} \) |
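
The task table can also be read as a small routing configuration. The sketch below encodes which input plays which role for each model variant and which of the two stages runs. It is only an illustration of the table and the note on Vevo-Timbre, not released code: `InferencePlan`, `PLANS`, and `stages` are hypothetical names, and the strings "U_i", "U_r", and "T_i" are just labels for the source utterance, the reference utterance, and the input text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InferencePlan:
    source: str                     # "U_i" (source speech) or "T_i" (input text)
    style_reference: Optional[str]  # None: no style prompt (the "/" in the table)
    timbre_reference: str           # prompt for the flow-matching transformer


# One entry per row of the task table.
PLANS = {
    "Vevo-Timbre": InferencePlan("U_i", None, "U_r"),
    "Vevo-Style": InferencePlan("U_i", "U_r", "U_i"),
    "Vevo-Voice": InferencePlan("U_i", "U_r", "U_r"),
    "Vevo-TTS": InferencePlan("T_i", "U_r", "U_r"),
}


def stages(plan: InferencePlan) -> List[str]:
    """List the stages that run for a plan.

    Without a style reference, content-style tokens are taken directly from
    the source speech, so only acoustic modeling runs.
    """
    if plan.style_reference is None:
        return ["acoustic modeling"]
    return ["content-style modeling", "acoustic modeling"]


for name, plan in PLANS.items():
    print(name, plan, stages(plan))
```

Read this way, Vevo-Style and Vevo-Voice differ only in the timbre reference: the former keeps the source speaker's timbre, while the latter takes both style and timbre from the reference.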

We will showcase the diverse capabilities of Vevo through the following examples.

Timbre Imitation and Voice Imitation (Conversion Task)

Vevo-Timbre vs. Vevo-Voice

Vevo-Timbre and Vevo-Voice can both convert the speaker characteristics of speech. The key difference between them is that Vevo-Timbre only imitates the timbre of the reference speech, while effectively preserving the style of the source speech (e.g., prosody, emotion, accent). On the other hand, Vevo-Voice imitates both the timbre and style of the reference speech, resulting in higher speaker similarity to the reference. We will demonstrate the characteristics and distinctions between Vevo-Timbre and Vevo-Voice using samples from different domains.

For the accented corpus, we observe that Vevo-Timbre retains the accent of the source speech, whereas Vevo-Voice mimics the accent of the reference speech:

| Source | Reference | Vevo-Timbre | Vevo-Voice |
|---|---|---|---|
| Mandarin-accented | Hindi-accented | Mandarin-accented | Hindi-accented |
| Mandarin-accented | Arabic-accented | Mandarin-accented | Arabic-accented |

For the emotional corpus, we observe that Vevo-Timbre retains the emotion of the source speech, whereas Vevo-Voice mimics the emotion of the reference speech:

| Source | Reference | Vevo-Timbre | Vevo-Voice |
|---|---|---|---|
| Neutral | Angry | Neutral | Angry |
| Angry | Neutral | Angry | Neutral |

For the general corpus, we observe that Vevo-Timbre maintains the prosody of the source speech, while Vevo-Voice can imitate the speech rate, stress, and rhythm of the reference speech:

| Source | Reference | Vevo-Timbre | Vevo-Voice |
|---|---|---|---|

Compared to Baselines

We select several baselines in zero-shot voice conversion, including HierSpeech++, LM-VC, UniAudio, and FACodec.

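| Domain | Source | Reference | Vevo-Timbre | Vevo-Voice | HierSpeech++ (2023) | LM-VC (SPL 2023) | UniAudio (ICML 2024) | FACodec (ICML 2024) |
|---|---|---|---|---|---|---|---|---|
| Audiobook | | | | | | | | |
| Common Voice | | | | | | | | |
| Accent | | | | | | | | |
| Emotion | | | | | | | | |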

Style Imitation

We present the performance of Vevo-Style in the zero-shot style imitation task, focusing on widely studied styles such as accent and emotion. Notably, Vevo-Style performs style imitation in a zero-shot manner (i.e., using just a few seconds of speech), which is rarely seen in existing research.

Accented Corpus

For accent imitation, we select baselines from the accent conversion field: ASR-AC, VoiceShop, and Conv-Speak. We use their demo website samples as our evaluation set. We also introduce a zero-shot style imitation model, Vevo-Style (ASR), for comparison. It differs from Vevo-Style only in that it uses an ASR model, rather than our proposed content tokenizer, to extract content tokens (see Section 4.3 of our paper).

| Setting | Source | ASR-AC (ICASSP 2023) | Reference | Vevo-Style (ASR) | Vevo-Style |
|---|---|---|---|---|---|
| Hindi to American | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| British to American | British-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| | British-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| British to Hindi | British-accented | | Hindi-accented (male) | | |
| | | | Hindi-accented (female) | | |
| | British-accented | | Hindi-accented (male) | | |
| | | | Hindi-accented (female) | | |

| Setting | Source | VoiceShop (2024) | Reference | Vevo-Style (ASR) | Vevo-Style |
|---|---|---|---|---|---|
| American to Hindi | American-accented | | Hindi-accented (male) | | |
| | | | Hindi-accented (female) | | |
| American to Mandarin | American-accented | | Mandarin-accented (male) | | |
| | | | Mandarin-accented (female) | | |
| British to American | British-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| British to Hindi | British-accented | | Hindi-accented (male) | | |
| | | | Hindi-accented (female) | | |
| Hindi to American | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |

| Setting | Source | Conv-Speak (MM 2024) | Reference | Vevo-Style (ASR) | Vevo-Style |
|---|---|---|---|---|---|
| Hindi to American | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| Hindi to American | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |
| Hindi to American | Hindi-accented | | American-accented (female) | | |
| | | | American-accented (male) | | |

Emotional Corpus

For emotion imitation, we select a baseline from the emotion conversion field, Emovox. We use its demo website samples as our evaluation set.

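| Setting | Source | Emovox (TAC 2023) | Reference | Vevo-Style (ASR) | Vevo-Style |
|---|---|---|---|---|---|
| Neutral to Angry | Neutral | | Angry (male) | | |
| | | | Angry (female) | | |
| Neutral to Angry | Neutral | | Angry (male) | | |
| | | | Angry (female) | | |
| Neutral to Angry | Neutral | | Angry (male) | | |
| | | | Angry (female) | | |
| Neutral to Happy | Neutral | | Happy (male) | | |
| | | | Happy (female) | | |
| Neutral to Happy | Neutral | | Happy (male) | | |
| | | | Happy (female) | | |
| Neutral to Happy | Neutral | | Happy (male) | | |
| | | | Happy (female) | | |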

Voice Imitation (Synthesis Task)

To evaluate the performance of Vevo-TTS in the zero-shot voice imitation (synthesis) task, we select classic baselines from the zero-shot TTS field, including the non-autoregressive (non-AR) model Voicebox and autoregressive (AR) models such as VALL-E and VoiceCraft, all of which are trained only on audiobook speech data. For comparison, we also include two stronger state-of-the-art models, CosyVoice and MaskGCT, which are trained on large-scale private corpora derived from in-the-wild video data with highly diverse distributions.

Accented Corpus

| Accent | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
|---|---|---|---|---|---|---|---|---|---|
| Arabic-accented | Men like Joe Goose dated existence from drunk to drunk. | | | | | | | | |
| | The Churchill narrowed and its current became swifter as they progressed. | | | | | | | | |
| | We leave the eventuality to time and law. | | | | | | | | |
| Hindi-accented | In the picture he saw each moment a greater resemblance to Jeanne. | | | | | | | | |
| | In short my joyous individualism was dominated by the orthodox bourgeois ethics. | | | | | | | | |
| | About him everywhere were the evidences of luxury and of age. | | | | | | | | |
| Mandarin-accented | Like a flash he launched himself into the feathered mass of the owl. | | | | | | | | |
| | Outwardly he maintained a calm and smiling aspect. | | | | | | | | |
| | The land exchanged its austere robes for the garb of a smiling wanton. | | | | | | | | |
| Spanish-accented | Gregson held a lighted match until it burnt his fingertips. | | | | | | | | |
| | Between him and all domestic animals there must be no hostilities. | | | | | | | | |
| | It is very plausible to such people a most convincing hypothesis. | | | | | | | | |

Emotional Corpus

| Emotion | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
|---|---|---|---|---|---|---|---|---|---|
| Angry | She and I are running a neck and neck race. | | | | | | | | |
| | From each cake, there sprang a huge dog. | | | | | | | | |
| | Perhaps you think that is a queer title for this chapter. | | | | | | | | |
| Happy | And they were sandy yellow brownish all over. | | | | | | | | |
| | She just smiled calmly, mother knows that is the best smile. | | | | | | | | |
| | And what are doves. And what are doves. | | | | | | | | |
| Sad | Your midget wife never can sing a song. | | | | | | | | |
| | Annie please please don't hurt me! | | | | | | | | |
| | The new born baby is stolen as we go. | | | | | | | | |
| Surprise | Perhaps you think is a queer title for this chapter. | | | | | | | | |
| | Clear are your eyes and bright your breath! | | | | | | | | |
| | Born once every one hundred years, dies in flames! | | | | | | | | |

General Corpus

| Domain | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
|---|---|---|---|---|---|---|---|---|---|
| Audiobook | He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story. | | | | | | | | |
| | A moment before the ghost of the ancient kingdom of the Danes had looked forth through the vesture of the hazewrapped City. | | | | | | | | |
| | "I know; but that renders your uncle a most agreeable companion and gossip," declared Dr. Pipt. | | | | | | | | |
| | "Well, he wrote so furiously that he broke his pencil, and had, as you observe, to sharpen it again. | | | | | | | | |
| | Emerald and black and russet and olive, it moved beneath the current, swaying and turning. | | | | | | | | |
| Common Voice | The stained glass offered a hypnotic atmosphere. | | | | | | | | |
| | One by one, the campfires were extinguished, and the oasis fell as quiet as the desert. | | | | | | | | |
| | Take these capsules over to Mrs. David's house. | | | | | | | | |
| | I thought about the difficulties in translation that might arise. | | | | | | | | |
| | What is the forecast for California for rain? | | | | | | | | |

Ethics Statement

As with other powerful new AI innovations, we recognize this technology brings the potential for misuse and unintended harm. We will build a highly effective classifier that can distinguish between authentic speech and audio generated with Vevo to mitigate these possible future risks.