Vevo: Controllable Zero-Shot Voice Imitation with
Self-Supervised Disentanglement

Xueyao Zhang¹ Xiaohui Zhang² Kainan Peng² Zhenyu Tang² Vimal Manohar² Yingru Liu² Jeff Hwang² Dangna Li² Yuhao Wang² Julian Chan² Yuan Huang² Zhizheng Wu¹ Mingbo Ma²

¹ The Chinese University of Hong Kong, Shenzhen ² Meta AI

Abstract

The imitation of voice, targeting specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre and style effectively, which makes controllable generation challenging, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or the content tokens of speech as input, we utilize an autoregressive transformer, prompted by a style reference, to generate the content-style tokens; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer, prompted by a timbre reference, to produce acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck, and adjust it carefully to obtain the disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility.
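
To illustrate the idea of treating the codebook vocabulary size as an information bottleneck, the following is a minimal sketch and not the Vevo implementation. It assumes random vectors as stand-ins for HuBERT features and random (untrained) codebooks, whereas a real VQ-VAE learns its codebook. The helper names `quantize` and `bits_per_token`, and the vocabulary sizes 8 and 256, are hypothetical and chosen only for illustration. The point it shows is that a token drawn from a vocabulary of size K can carry at most log2(K) bits per frame, so a smaller vocabulary forces the tokenizer to discard more information.

```python
import math
import random


def quantize(frames, codebook):
    """Map each feature vector to the index of its nearest codebook entry
    (squared Euclidean distance). This is the lookup step of vector quantization."""
    tokens = []
    for frame in frames:
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def bits_per_token(vocab_size):
    """Upper bound on the information one discrete token can carry."""
    return math.log2(vocab_size)


rng = random.Random(0)
dim = 4  # toy feature dimension (assumption; real features are much larger)

# Stand-in for continuous hidden features of a speech encoder (assumption).
frames = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

# Two untrained codebooks with different (hypothetical) vocabulary sizes.
small_codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(8)]
large_codebook = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

small_tokens = quantize(frames, small_codebook)
large_tokens = quantize(frames, large_codebook)

print("narrow bottleneck:", bits_per_token(len(small_codebook)), "bits/frame at most")
print("wide bottleneck:  ", bits_per_token(len(large_codebook)), "bits/frame at most")
```

Under this reading, a narrow vocabulary would be the natural choice for tokens meant to keep only linguistic content, and a wider one for tokens that should also retain style; the actual sizes used by Vevo are given in the paper and are not assumed here.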

What can Vevo do?

Vevo can take either speech or text as input, and perform zero-shot imitation with controllable linguistic content (controlled by the source), style (controlled by the style reference), and timbre (controlled by the timbre reference) in a single forward pass. This inference pipeline can be adjusted for various zero-shot imitation tasks.

Given a source speech (\( \textcolor{blue}{U_i} \)) or text (\( \textcolor{blue}{T_i} \)), and a reference speech (\( \textcolor{red}{U_r} \)), we propose the following variants of Vevo according to the specific imitation task:

**Note**: For Vevo-Timbre, we directly extract the content-style tokens from the source speech and feed them into the flow-matching transformer for generation.
Task	Model	Source	Style Reference	Timbre Reference
Zero-Shot Timbre Imitation	Vevo-Timbre	\( \textcolor{blue}{U_i} \)	/	\( \textcolor{red}{U_r} \)
Zero-Shot Style Imitation	Vevo-Style	\( \textcolor{blue}{U_i} \)	\( \textcolor{red}{U_r} \)	\( \textcolor{blue}{U_i} \)
Zero-Shot Voice Imitation	Vevo-Voice	\( \textcolor{blue}{U_i} \)	\( \textcolor{red}{U_r} \)	\( \textcolor{red}{U_r} \)
Zero-Shot Voice Imitation	Vevo-TTS	\( \textcolor{blue}{T_i} \)	\( \textcolor{red}{U_r} \)	\( \textcolor{red}{U_r} \)
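
The table above can be read as a routing rule from a variant name to the inputs of the two stages. The sketch below is a hypothetical illustration of that rule, not the API of any released Vevo code: `ImitationPlan`, `plan_variant`, and the field names are invented here. It assumes, following the note on Vevo-Timbre, that the autoregressive content-style stage is skipped only for Vevo-Timbre.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImitationPlan:
    """Which input drives each controllable attribute (hypothetical structure)."""
    content_from: str          # source speech U_i or source text T_i
    style_ref: Optional[str]   # prompt for the autoregressive stage, if used
    timbre_ref: str            # prompt for the flow-matching stage
    uses_ar_stage: bool        # False when content-style tokens come straight from the source


def plan_variant(variant: str, source: str, reference: str) -> ImitationPlan:
    """Return the input routing for one Vevo variant, mirroring the table rows."""
    if variant == "Vevo-Timbre":
        # Style stays with the source; only timbre comes from the reference.
        return ImitationPlan(source, None, reference, uses_ar_stage=False)
    if variant == "Vevo-Style":
        # Style comes from the reference; timbre stays with the source.
        return ImitationPlan(source, reference, source, uses_ar_stage=True)
    if variant in ("Vevo-Voice", "Vevo-TTS"):
        # Both style and timbre come from the reference.
        return ImitationPlan(source, reference, reference, uses_ar_stage=True)
    raise ValueError(f"unknown variant: {variant}")


for name, src in [("Vevo-Timbre", "U_i"), ("Vevo-Style", "U_i"),
                  ("Vevo-Voice", "U_i"), ("Vevo-TTS", "T_i")]:
    print(name, plan_variant(name, src, "U_r"))
```

Vevo-Voice and Vevo-TTS share the same routing and differ only in whether the source is speech or text.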

We will showcase the diverse capabilities of Vevo through the following examples.

Model	Source	Style Reference	Timbre Reference	Results	Applications
Vevo-TTS	I'm a versatile zero-shot voice cloner. I can mimic a wide range of vocal styles from either text or waveform as input.	American-accented, Female		American-accented, Female	Style and Timbre Controllable TTS
Vevo-TTS	I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.	Arabic-accented, Male	Mandarin, Female	Arabic-accented, Female	Style and Timbre Controllable TTS
Vevo-Timbre	American-accented, Female	/	Arabic-accented, Male	American-accented, Male	Style-Preserved or Style-Converted VC
Vevo-Voice	American-accented, Female	Arabic-accented, Male	Arabic-accented, Male	Arabic-accented, Male	Style-Preserved or Style-Converted VC
Vevo-Style	American-accented, Female	Arabic-accented, Male	American-accented, Female (Same as Source)	Arabic-accented, Female	Accent Conversion, Emotion Conversion, etc.
		Emotional, Male		Emotional, Female
		Slow-paced, Male		Slow-paced, Female

Timbre Imitation and Voice Imitation (Conversion Task)

Vevo-Timbre vs. Vevo-Voice

Vevo-Timbre and Vevo-Voice can both convert the speaker characteristics of speech. The key difference between them is that Vevo-Timbre only imitates the timbre of the reference speech, while effectively preserving the style of the source speech (e.g., prosody, emotion, accent). On the other hand, Vevo-Voice imitates both the timbre and style of the reference speech, resulting in higher speaker similarity to the reference. We will demonstrate the characteristics and distinctions between Vevo-Timbre and Vevo-Voice using samples from different domains.

On the accented corpus, we observe that Vevo-Timbre retains the accent of the source speech, whereas Vevo-Voice mimics the accent of the reference speech:

Source	Reference	Vevo-Timbre	Vevo-Voice
Mandarin-accented	Hindi-accented	Mandarin-accented	Hindi-accented
Mandarin-accented	Arabic-accented	Mandarin-accented	Arabic-accented

On the emotional corpus, we observe that Vevo-Timbre retains the emotion of the source speech, whereas Vevo-Voice mimics the emotion of the reference speech:

Source	Reference	Vevo-Timbre	Vevo-Voice
Neutral	Angry	Neutral	Angry
Angry	Neutral	Angry	Neutral

On the general corpus, we observe that Vevo-Timbre maintains the prosody of the source speech, while Vevo-Voice imitates the speech rate, stress, and rhythm of the reference speech:

Source	Reference	Vevo-Timbre	Vevo-Voice

Compared to Baselines

We select several baselines in zero-shot voice conversion, including HierSpeech++, LM-VC, UniAudio, and FACodec.

Domain	Source	Reference	Vevo-Timbre	Vevo-Voice	HierSpeech++ (2023)	LM-VC (SPL 2023)	UniAudio (ICML 2024)	FACodec (ICML 2024)
Audiobook


Common Voice


ACCENT


EMOTION

Different domains

Style Imitation

We present the performance of Vevo-Style in the zero-shot style imitation task, focusing on widely studied styles such as accent and emotion. Notably, Vevo-Style performs style imitation in a zero-shot manner (i.e., using just a few seconds of reference speech), which is rarely seen in existing research.

Accented Corpus

For accent imitation, we select baselines from the accent conversion field: ASR-AC, VoiceShop, and Conv-Speak. We use the samples from their demo websites as our evaluation set. We also introduce a zero-shot style imitation model, Vevo-Style (ASR), for comparison. It differs from Vevo-Style only in using an ASR model, rather than our proposed content tokenizer, to extract content tokens (see Section 4.3 of our paper).

Setting	Source	Reference
Hindi to American	Hindi-accented	American-accented (female)
	Hindi-accented	American-accented (male)
	Hindi-accented	American-accented (female)
	Hindi-accented	American-accented (male)
British to American	British-accented	American-accented (female)
	British-accented	American-accented (male)
	British-accented	American-accented (female)
	British-accented	American-accented (male)
British to Hindi	British-accented	Hindi-accented (male)
	British-accented	Hindi-accented (female)
	British-accented	Hindi-accented (male)
	British-accented	Hindi-accented (female)

Compared to ASR-AC

Setting	Source	Reference
American to Hindi	American-accented	Hindi-accented (male)
American to Hindi	American-accented	Hindi-accented (female)
American to Mandarin	American-accented	Mandarin-accented (male)
American to Mandarin	American-accented	Mandarin-accented (female)
British to American	British-accented	American-accented (female)
British to American	British-accented	American-accented (male)
British to Hindi	British-accented	Hindi-accented (male)
British to Hindi	British-accented	Hindi-accented (female)
Hindi to American	Hindi-accented	American-accented (female)
Hindi to American	Hindi-accented	American-accented (male)

Compared to VoiceShop

Setting	Source	Reference
Hindi to American	Hindi-accented	American-accented (female)
Hindi to American	Hindi-accented	American-accented (male)
Hindi to American	Hindi-accented	American-accented (female)
Hindi to American	Hindi-accented	American-accented (male)
Hindi to American	Hindi-accented	American-accented (female)
Hindi to American	Hindi-accented	American-accented (male)

Compared to Conv-Speak

Emotional Corpus

For emotion imitation, we select a baseline from the emotion conversion field, Emovox. We use the samples from its demo website as our evaluation set.

Setting	Source	Reference
Neutral to Angry	Neutral	Angry (male)
Neutral to Angry	Neutral	Angry (female)
Neutral to Angry	Neutral	Angry (male)
Neutral to Angry	Neutral	Angry (female)
Neutral to Angry	Neutral	Angry (male)
Neutral to Angry	Neutral	Angry (female)
Neutral to Happy	Neutral	Happy (male)
Neutral to Happy	Neutral	Happy (female)
Neutral to Happy	Neutral	Happy (male)
Neutral to Happy	Neutral	Happy (female)
Neutral to Happy	Neutral	Happy (male)
Neutral to Happy	Neutral	Happy (female)

Compared to Emovox

Voice Imitation (Synthesis Task)

To evaluate the performance of Vevo-TTS in the zero-shot voice imitation (synthesis) task, we select classic baselines from the zero-shot TTS field, including the non-AR model Voicebox and the AR models VALL-E and VoiceCraft, all of which are trained only on audiobook speech data. For comparison, we also include two stronger state-of-the-art models, CosyVoice and MaskGCT, which are trained on large-scale private corpora derived from in-the-wild video data with highly diverse distributions.

Accented Corpus

Accent	Source	Reference	Ground Truth	Vevo-TTS	CosyVoice (2024)	MaskGCT (2024)	VoiceCraft (ACL 2024)	Voicebox (NeurIPS 2023)	VALL-E (2023)
Arabic-accented	Men like Joe Goose dated existence from drunk to drunk.
	The Churchill narrowed and its current became swifter as they progressed.
	We leave the eventuality to time and law.
Hindi-accented	In the picture he saw each moment a greater resemblance to Jeanne.
	In short my joyous individualism was dominated by the orthodox bourgeois ethics.
	About him everywhere were the evidences of luxury and of age.
Mandarin-accented	Like a flash he launched himself into the feathered mass of the owl.
	Outwardly he maintained a calm and smiling aspect.
	The land exchanged its austere robes for the garb of a smiling wanton.
Spanish-accented	Gregson held a lighted match until it burnt his fingertips.
	Between him and all domestic animals there must be no hostilities.
	It is very plausible to such people a most convincing hypothesis.

Different Accents

Emotional Corpus

Emotion	Source	Reference	Ground Truth	Vevo-TTS	CosyVoice (2024)	MaskGCT (2024)	VoiceCraft (ACL 2024)	Voicebox (NeurIPS 2023)	VALL-E (2023)
Angry	She and I are running a neck and neck race.
	From each cake, there sprang a huge dog.
	Perhaps you think that is a queer title for this chapter.
Happy	And they were sandy yellow brownish all over.
	She just smiled calmly, mother knows that is the best smile.
	And what are doves. And what are doves.
Sad	Your midget wife never can sing a song.
	Annie please please don't hurt me!
	The new born baby is stolen as we go.
Surprise	Perhaps you think is a queer title for this chapter.
	Clear are your eyes and bright your breath!
	Born once every one hundred years, dies in flames!

Different Emotions

General Corpus

Domain	Source	Reference	Ground Truth	Vevo-TTS	CosyVoice (2024)	MaskGCT (2024)	VoiceCraft (ACL 2024)	Voicebox (NeurIPS 2023)	VALL-E (2023)
Audiobook	He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story.
	A moment before the ghost of the ancient kingdom of the Danes had looked forth through the vesture of the hazewrapped City.
	"I know; but that renders your uncle a most agreeable companion and gossip," declared Dr. Pipt.
	"Well, he wrote so furiously that he broke his pencil, and had, as you observe, to sharpen it again.
	Emerald and black and russet and olive, it moved beneath the current, swaying and turning.
Common Voice	The stained glass offered a hypnotic atmosphere.
	One by one, the campfires were extinguished, and the oasis fell as quiet as the desert.
	Take these capsules over to Mrs. David's house.
	I thought about the difficulties in translation that might arise.
	What is the forecast for California for rain?

Different Domains

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{vevo,
  author       = {Xueyao Zhang and Xiaohui Zhang and Kainan Peng and Zhenyu Tang and Vimal Manohar and Yingru Liu and Jeff Hwang and Dangna Li and Yuhao Wang and Julian Chan and Yuan Huang and Zhizheng Wu and Mingbo Ma},
  title        = {Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement},
  booktitle    = {{ICLR}},
  publisher    = {OpenReview.net},
  year         = {2025}
}

If you use the Vevo implementation in Amphion, please also cite:

@article{amphion2,
  title        = {Overview of the Amphion Toolkit (v0.2)},
  author       = {Jiaqi Li and Xueyao Zhang and Yuancheng Wang and Haorui He and Chaoren Wang and Li Wang and Huan Liao and Junyi Ao and Zeyu Xie and Yiqiao Huang and Junan Zhang and Zhizheng Wu},
  year         = {2025},
  journal      = {arXiv preprint arXiv:2501.15442},
}

@inproceedings{amphion,
    author={Xueyao Zhang and Liumeng Xue and Yicheng Gu and Yuancheng Wang and Jiaqi Li and Haorui He and Chaoren Wang and Ting Song and Xi Chen and Zihao Fang and Haopeng Chen and Junan Zhang and Tze Ying Tang and Lexiao Zou and Mingxuan Wang and Jun Han and Kai Chen and Haizhou Li and Zhizheng Wu},
    title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
    booktitle={{IEEE} Spoken Language Technology Workshop, {SLT} 2024},
    year={2024}
}