Application Scenario of V-Cloak

V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization

Jiangyi Deng¹, Fei Teng¹, Yanjiao Chen¹, Xiaofu Chen², Zhaohui Wang², Wenyuan Xu¹

¹Zhejiang University, ²Wuhan University

The paper has been accepted to the USENIX Security Symposium 2023.

Abstract

Voice data generated on instant messaging or social media applications contains unique user voiceprints that malicious adversaries may abuse for identity inference or identity theft. In this paper, we develop a voice anonymization system, named V-Cloak, which achieves real-time voice anonymization while preserving the intelligibility, naturalness, and timbre of the audio.

We have conducted extensive experiments on four datasets, i.e., LibriSpeech (English), AISHELL (Chinese), CommonVoice (French), and CommonVoice (Italian); five Automatic Speaker Verification (ASV) systems (two DNN-based, two statistical, and one commercial); and eleven Automatic Speech Recognition (ASR) systems (covering different languages). The results demonstrate the effectiveness, robustness, and efficiency of V-Cloak.

We hope V-Cloak can provide a cloak for us in a prism world.


Demo Audios

In this part, we provide demo audio samples generated by V-Cloak and four previous methods (B1 to B4), alongside the raw recordings (B0).

Group 1: Speaker #1188 (Male)

B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)

Group 2: Speaker #61 (Male)

B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)

Group 3: Speaker #2961 (Female)

B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)

Group 4: Speaker #3575 (Female)

B0: Raw
B1: NSF
B2: HFGAN
B3: McAdams
B4: VoiceMask
V-Cloak (ε=0.1)

Different Anonymization Levels

In this part, we present audio samples anonymized with different values of ε.

Speaker #7021 (Male)

B0: Raw
ε=0.02
ε=0.04
ε=0.06
ε=0.08
ε=0.10