Voice imitation that targets specific speech attributes, such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre and style effectively, which makes controllable generation difficult, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: given either text or the content tokens of speech as input, an autoregressive transformer, prompted by a style reference, generates the content-style tokens; (2) Acoustic Modeling: given the content-style tokens as input, a flow-matching transformer, prompted by a timbre reference, produces acoustic representations. To obtain the content and content-style tokens of speech, we design a fully self-supervised approach that progressively decouples the timbre, style, and linguistic content of speech. Specifically, we adopt a VQ-VAE as the tokenizer for the continuous hidden features of HuBERT. We treat the vocabulary size of the VQ-VAE codebook as the information bottleneck and adjust it carefully to obtain disentangled speech representations. Trained solely with self-supervision on 60K hours of audiobook speech data, without any fine-tuning on style-specific corpora, Vevo matches or surpasses existing methods in accent and emotion conversion tasks. Additionally, Vevo's effectiveness in zero-shot voice conversion and text-to-speech tasks further demonstrates its strong generalization and versatility.
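The idea of using the codebook size as an information bottleneck can be illustrated with a minimal sketch. It is not Vevo's tokenizer: the 2-D feature vectors stand in for HuBERT hidden states, the codebooks are hand-written rather than learned, and the names `quantize` and `bits_per_token` are hypothetical. It only shows that a codebook with K entries can carry at most log2(K) bits per token, so a smaller codebook is forced to merge frames that a larger one keeps apart.

```python
import math


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]


def bits_per_token(codebook_size):
    """Upper bound on the information a single token can carry."""
    return math.log2(codebook_size)


# Toy 2-D "features" (assumption: stand-ins for continuous HuBERT features).
frames = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.5), (0.2, 0.9)]

# Hand-written codebooks (assumption: real codebooks are learned by the VQ-VAE).
small_codebook = [(0.0, 0.0), (1.0, 1.0)]                          # K = 2
large_codebook = [(0.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.0, 1.0)]  # K = 4

small_tokens = quantize(frames, small_codebook)
large_tokens = quantize(frames, large_codebook)

print(small_tokens, bits_per_token(len(small_codebook)))
print(large_tokens, bits_per_token(len(large_codebook)))
```

With the small codebook, several distinct frames collapse onto the same token, while the larger codebook assigns each frame its own token. This is the sense in which a narrower vocabulary discards more detail.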
Vevo can take either speech or text as input, and perform zero-shot imitation with controllable linguistic content (controlled by the source), style (controlled by the style reference), and timbre (controlled by the timbre reference) in a single forward pass. This inference pipeline can be adjusted for various zero-shot imitation tasks.
Given a source speech (\( \textcolor{blue}{U_i} \)) or text (\( \textcolor{blue}{T_i} \)), and a reference speech (\( \textcolor{red}{U_r} \)), we propose the following variants of Vevo according to the specific imitation task:
Task | Model | Source | Style Reference | Timbre Reference |
---|---|---|---|---|
Zero-Shot Timbre Imitation | Vevo-Timbre | \( \textcolor{blue}{U_i} \) | / | \( \textcolor{red}{U_r} \) |
Zero-Shot Style Imitation | Vevo-Style | \( \textcolor{blue}{U_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{blue}{U_i} \) |
Zero-Shot Voice Imitation | Vevo-Voice | \( \textcolor{blue}{U_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{red}{U_r} \) |
Zero-Shot Voice Imitation | Vevo-TTS | \( \textcolor{blue}{T_i} \) | \( \textcolor{red}{U_r} \) | \( \textcolor{red}{U_r} \) |
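The variant table above can be read as a small routing rule: each variant decides which of the two available inputs (the source and the single reference) serves as the style prompt and which as the timbre prompt. The sketch below encodes only that mapping. `ImitationInputs` and `resolve_inputs` are hypothetical names introduced for illustration and are not part of any released Vevo API, and the file names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImitationInputs:
    source: str                     # U_i (speech) or T_i (text)
    style_reference: Optional[str]  # None corresponds to "/" in the table
    timbre_reference: str


def resolve_inputs(variant: str, source: str, reference: str) -> ImitationInputs:
    """Pick the style and timbre prompts for a variant, following the table."""
    if variant == "Vevo-Timbre":
        # Only the timbre comes from the reference; no style prompt is used.
        return ImitationInputs(source, None, reference)
    if variant == "Vevo-Style":
        # Style comes from the reference; timbre stays that of the source.
        return ImitationInputs(source, reference, source)
    if variant in ("Vevo-Voice", "Vevo-TTS"):
        # Both style and timbre come from the reference.
        return ImitationInputs(source, reference, reference)
    raise ValueError(f"unknown variant: {variant}")


# Placeholder inputs (assumption: any identifiers for U_i and U_r would do).
for name in ("Vevo-Timbre", "Vevo-Style", "Vevo-Voice"):
    print(name, resolve_inputs(name, "u_i.wav", "u_r.wav"))

# For Vevo-TTS the source is text (T_i) rather than speech.
print("Vevo-TTS", resolve_inputs("Vevo-TTS", "Hello there.", "u_r.wav"))
```

Keeping the routing separate from the models makes the point of the table explicit: the variants differ in how the prompts are wired, not in the underlying two-stage pipeline.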
We will showcase the diverse capabilities of Vevo through the following examples.
Model | Source | Style Reference | Timbre Reference | Results | Applications |
---|---|---|---|---|---|
Vevo-TTS | I'm a versatile zero-shot voice cloner. I can mimic a wide range of vocal styles from either text or waveform as input. | American-accented, Female | American-accented, Female | American-accented, Female | Style and Timbre Controllable TTS |
Vevo-TTS | I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences. | Arabic-accented, Male | Mandarin, Female | Arabic-accented, Female | Style and Timbre Controllable TTS |
Vevo-Timbre | American-accented, Female | / | Arabic-accented, Male | American-accented, Male | Style-Preserved or Style-Converted VC |
Vevo-Voice | American-accented, Female | Arabic-accented, Male | Arabic-accented, Male | Arabic-accented, Male | Style-Preserved or Style-Converted VC |
Vevo-Style | American-accented, Female | Arabic-accented, Male | American-accented, Female (Same as Source) | Arabic-accented, Female | Accent Conversion, Emotion Conversion, etc. |
Vevo-Style | American-accented, Female | Emotional, Male | American-accented, Female (Same as Source) | Emotional, Female | Accent Conversion, Emotion Conversion, etc. |
Vevo-Style | American-accented, Female | Slow-paced, Male | American-accented, Female (Same as Source) | Slow-paced, Female | Accent Conversion, Emotion Conversion, etc. |
Vevo-Timbre and Vevo-Voice can both convert the speaker characteristics of speech. The key difference between them is that Vevo-Timbre only imitates the timbre of the reference speech, while effectively preserving the style of the source speech (e.g., prosody, emotion, accent). On the other hand, Vevo-Voice imitates both the timbre and style of the reference speech, resulting in higher speaker similarity to the reference. We will demonstrate the characteristics and distinctions between Vevo-Timbre and Vevo-Voice using samples from different domains.
On the accented corpus, we observe that Vevo-Timbre retains the accent of the source speech, whereas Vevo-Voice mimics the accent of the reference speech:
Source | Reference | Vevo-Timbre | Vevo-Voice |
---|---|---|---|
Mandarin-accented | Hindi-accented | Mandarin-accented | Hindi-accented |
Mandarin-accented | Arabic-accented | Mandarin-accented | Arabic-accented |
On the emotional corpus, we observe that Vevo-Timbre retains the emotion of the source speech, whereas Vevo-Voice mimics the emotion of the reference speech:
Source | Reference | Vevo-Timbre | Vevo-Voice |
---|---|---|---|
Neutral | Angry | Neutral | Angry |
Angry | Neutral | Angry | Neutral |
On the general corpus, we observe that Vevo-Timbre maintains the prosody of the source speech, while Vevo-Voice can imitate the speech rate, stress, and rhythm of the reference speech:
Source | Reference | Vevo-Timbre | Vevo-Voice |
---|---|---|---|
We select several baselines in zero-shot voice conversion, including HierSpeech++, LM-VC, UniAudio, and FACodec.
Domain | Source | Reference | Vevo-Timbre | Vevo-Voice | HierSpeech++ (2023) | LM-VC (SPL 2023) | UniAudio (ICML 2024) | FACodec (ICML 2024) |
---|---|---|---|---|---|---|---|---|
Audiobook | ||||||||
Common Voice | ||||||||
ACCENT | ||||||||
EMOTION | ||||||||
We present the performance of Vevo-Style in the zero-shot style imitation task, focusing on widely studied styles such as accent and emotion. Notably, Vevo-Style imitates style in a zero-shot manner (i.e., using just a few seconds of reference speech), which is rarely seen in existing research.
For accent imitation, we select baselines from the accent conversion field: ASR-AC, VoiceShop, and Conv-Speak. We use the samples from their demo websites as our evaluation set. We also introduce a zero-shot style imitation model, Vevo-Style (ASR), for comparison. It differs from Vevo-Style only in that it uses an ASR model, rather than our proposed content tokenizer, to extract content tokens (see Section 4.3 of our paper).
Setting | Source | ASR-AC (ICASSP 2023) | Reference | Vevo-Style (ASR) | Vevo-Style |
---|---|---|---|---|---|
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
British to American | British-accented | | American-accented (female) | | |
British to American | British-accented | | American-accented (male) | | |
British to American | British-accented | | American-accented (female) | | |
British to American | British-accented | | American-accented (male) | | |
British to Hindi | British-accented | | Hindi-accented (male) | | |
British to Hindi | British-accented | | Hindi-accented (female) | | |
British to Hindi | British-accented | | Hindi-accented (male) | | |
British to Hindi | British-accented | | Hindi-accented (female) | | |
Setting | Source | VoiceShop (2024) | Reference | Vevo-Style (ASR) | Vevo-Style |
---|---|---|---|---|---|
American to Hindi | American-accented | | Hindi-accented (male) | | |
American to Hindi | American-accented | | Hindi-accented (female) | | |
American to Mandarin | American-accented | | Mandarin-accented (male) | | |
American to Mandarin | American-accented | | Mandarin-accented (female) | | |
British to American | British-accented | | American-accented (female) | | |
British to American | British-accented | | American-accented (male) | | |
British to Hindi | British-accented | | Hindi-accented (male) | | |
British to Hindi | British-accented | | Hindi-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
Setting | Source | Conv-Speak (MM 2024) | Reference | Vevo-Style (ASR) | Vevo-Style |
---|---|---|---|---|---|
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
Hindi to American | Hindi-accented | | American-accented (female) | | |
Hindi to American | Hindi-accented | | American-accented (male) | | |
For emotion imitation, we select a baseline from the emotion conversion field, Emovox. We use its demo website samples as our evaluation set.
Setting | Source | Emovox (TAC 2023) | Reference | Vevo-Style (ASR) | Vevo-Style |
---|---|---|---|---|---|
Neutral to Angry | Neutral | | Angry (male) | | |
Neutral to Angry | Neutral | | Angry (female) | | |
Neutral to Angry | Neutral | | Angry (male) | | |
Neutral to Angry | Neutral | | Angry (female) | | |
Neutral to Angry | Neutral | | Angry (male) | | |
Neutral to Angry | Neutral | | Angry (female) | | |
Neutral to Happy | Neutral | | Happy (male) | | |
Neutral to Happy | Neutral | | Happy (female) | | |
Neutral to Happy | Neutral | | Happy (male) | | |
Neutral to Happy | Neutral | | Happy (female) | | |
Neutral to Happy | Neutral | | Happy (male) | | |
Neutral to Happy | Neutral | | Happy (female) | | |
To evaluate the performance of Vevo-TTS in the zero-shot voice imitation (synthesis) task, we select classic baselines from the zero-shot TTS field: the non-autoregressive (non-AR) model Voicebox and the autoregressive (AR) models VALL-E and VoiceCraft, all of which are trained only on audiobook speech data. For comparison, we also include two stronger state-of-the-art models, CosyVoice and MaskGCT, which are trained on large-scale private corpora derived from in-the-wild video data with highly diverse distributions.
Accent | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
---|---|---|---|---|---|---|---|---|---|
Arabic-accented | Men like Joe Goose dated existence from drunk to drunk. | ||||||||
Arabic-accented | The Churchill narrowed and its current became swifter as they progressed. | ||||||||
Arabic-accented | We leave the eventuality to time and law. | ||||||||
Hindi-accented | In the picture he saw each moment a greater resemblance to Jeanne. | ||||||||
Hindi-accented | In short my joyous individualism was dominated by the orthodox bourgeois ethics. | ||||||||
Hindi-accented | About him everywhere were the evidences of luxury and of age. | ||||||||
Mandarin-accented | Like a flash he launched himself into the feathered mass of the owl. | ||||||||
Mandarin-accented | Outwardly he maintained a calm and smiling aspect. | ||||||||
Mandarin-accented | The land exchanged its austere robes for the garb of a smiling wanton. | ||||||||
Spanish-accented | Gregson held a lighted match until it burnt his fingertips. | ||||||||
Spanish-accented | Between him and all domestic animals there must be no hostilities. | ||||||||
Spanish-accented | It is very plausible to such people a most convincing hypothesis. | ||||||||
Emotion | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
---|---|---|---|---|---|---|---|---|---|
Angry | She and I are running a neck and neck race. | ||||||||
Angry | From each cake, there sprang a huge dog. | ||||||||
Angry | Perhaps you think that is a queer title for this chapter. | ||||||||
Happy | And they were sandy yellow brownish all over. | ||||||||
Happy | She just smiled calmly, mother knows that is the best smile. | ||||||||
Happy | And what are doves. And what are doves. | ||||||||
Sad | Your midget wife never can sing a song. | ||||||||
Sad | Annie please please don't hurt me! | ||||||||
Sad | The new born baby is stolen as we go. | ||||||||
Surprise | Perhaps you think is a queer title for this chapter. | ||||||||
Surprise | Clear are your eyes and bright your breath! | ||||||||
Surprise | Born once every one hundred years, dies in flames! | ||||||||
Domain | Source | Reference | Ground Truth | Vevo-TTS | CosyVoice (2024) | MaskGCT (2024) | VoiceCraft (ACL 2024) | Voicebox (NeurIPS 2023) | VALL-E (2023) |
---|---|---|---|---|---|---|---|---|---|
Audiobook | He shrugged his shoulders in ungracious acquiescence, while our visitor in hurried words and with much excitable gesticulation poured forth his story. | ||||||||
Audiobook | A moment before the ghost of the ancient kingdom of the Danes had looked forth through the vesture of the hazewrapped City. | ||||||||
Audiobook | "I know; but that renders your uncle a most agreeable companion and gossip," declared Dr. Pipt. | ||||||||
Audiobook | "Well, he wrote so furiously that he broke his pencil, and had, as you observe, to sharpen it again. | ||||||||
Audiobook | Emerald and black and russet and olive, it moved beneath the current, swaying and turning. | ||||||||
Common Voice | The stained glass offered a hypnotic atmosphere. | ||||||||
Common Voice | One by one, the campfires were extinguished, and the oasis fell as quiet as the desert. | ||||||||
Common Voice | Take these capsules over to Mrs. David's house. | ||||||||
Common Voice | I thought about the difficulties in translation that might arise. | ||||||||
Common Voice | What is the forecast for California for rain? | ||||||||
As with other powerful new AI innovations, we recognize this technology brings the potential for misuse and unintended harm. We will build a highly effective classifier that can distinguish between authentic speech and audio generated with Vevo to mitigate these possible future risks.