V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization
Jiangyi Deng¹, Fei Teng¹, Yanjiao Chen¹, Xiaofu Chen², Zhaohui Wang², Wenyuan Xu¹
¹Zhejiang University, ²Wuhan University
The paper has been accepted to the USENIX Security Symposium 2023.
Abstract
Voice data generated on instant messaging or social media applications contains unique user voiceprints that malicious adversaries may abuse for identity inference or identity theft. In this paper, we develop a voice anonymization system, named V-Cloak, which anonymizes voice in real time while preserving the intelligibility, naturalness, and timbre of the audio.
We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French) and CommonVoice (Italian), five Automatic Speaker Verification (ASV) systems (including two DNN-based, two statistical and one commercial ASV), and eleven Automatic Speech Recognition (ASR) systems (for different languages), demonstrating the effectiveness, robustness, and efficiency of V-Cloak.
Hopefully, V-Cloak may provide a cloak for us in a prism world.
Demo Audios
In this part, we provide demo audio samples generated by V-Cloak and by four previous works (B1 to B4), alongside the raw recording (B0).
Group 1: Speaker #1188 (Male)
B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)
Group 2: Speaker #61 (Male)
B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)
Group 3: Speaker #2961 (Female)
B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)
Group 4: Speaker #3575 (Female)
B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)
Different Anonymization Levels
In this part, we present audio anonymized with different values of ε.
Speaker #7021 (Male)
B0: Raw
ε=0.02
ε=0.04
ε=0.06
ε=0.08
ε=0.10
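To give a rough feel for what varying ε could mean, here is a minimal Python sketch. It rests on assumptions this page does not state: that ε is an upper bound on how much any single sample of a waveform normalized to [-1, 1] may change, and that the change is additive. It is not the V-Cloak model. Random noise stands in for whatever the system actually produces, and the names `bound_perturbation` and `max_abs_change` are hypothetical.

```python
import math
import random


def bound_perturbation(waveform, perturbation, epsilon):
    """Add a perturbation to a waveform, limiting each sample's change to +/- epsilon.

    Assumption (not from the page): epsilon is a per-sample amplitude bound
    on a waveform normalized to [-1, 1].
    """
    bounded = [max(-epsilon, min(epsilon, p)) for p in perturbation]
    # Keep the result inside the valid amplitude range.
    return [max(-1.0, min(1.0, w + b)) for w, b in zip(waveform, bounded)]


def max_abs_change(original, modified):
    """Largest per-sample difference between two equal-length signals."""
    return max(abs(m - o) for o, m in zip(original, modified))


# One second of a 220 Hz tone as a stand-in for speech (illustrative only).
SAMPLE_RATE = 16000
waveform = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE)
            for n in range(SAMPLE_RATE)]

# Random noise as a placeholder perturbation; a real system would compute this.
rng = random.Random(0)
raw_perturbation = [rng.gauss(0.0, 0.2) for _ in waveform]

# The same epsilon values as the demo list above.
for eps in (0.02, 0.04, 0.06, 0.08, 0.10):
    anonymized = bound_perturbation(waveform, raw_perturbation, eps)
    change = max_abs_change(waveform, anonymized)
    print(f"eps={eps:.2f}  max |change| = {change:.4f}")
```

Under this reading, a larger ε allows a larger deviation from the raw recording, which would match the progression from ε=0.02 to ε=0.10 in the samples above.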