Controllable Accented Text-to-Speech Synthesis

Rui Liu, Berrak Sisman, Guanglai Gao, Haizhou Li

Abstract
Accented text-to-speech (TTS) synthesis seeks to generate speech with an accent (L2) as a variant of the standard version (L1). Accented TTS synthesis is challenging because L2 differs from L1 in both phonetic rendering and prosody pattern. Furthermore, there is no easy solution for controlling the accent intensity in an utterance. In this work, we propose a neural TTS architecture that allows us to control the accent and its intensity during inference. This is achieved through three novel mechanisms: 1) an accent variance adaptor to model the complex accent variance with three prosody controlling factors, namely pitch, energy, and duration; 2) an accent intensity modeling strategy to quantify the accent intensity; 3) a consistency constraint module to encourage the TTS system to render the expected accent intensity at a fine level. Experiments show that the proposed system attains superior performance to the baseline models in terms of accent rendering and intensity control. To the best of our knowledge, this is the first study of accented TTS synthesis with explicit intensity control.

Contents

(Section IV-D & IV-E) Audio Quality of Generated L1 Speech & Audio Quality of CAI-TTS
1. Audio Quality of Generated L1 Speech
2. Audio Quality of CAI-TTS
(Section IV-G) Controllable Accent Intensity

(Section IV-D & IV-E) Audio Quality of Accented TTS Synthesis & Accent Variance Learning

1. Audio Quality of Generated L1 Speech

Author of the danger trail, Philip Steels, etc.

Synthesized L1 Speech L2 Speech (GT)
(Speaker: ABA; Accent: Arabic)
(Speaker: ASI; Accent: Hindi)
(Speaker: EBVS; Accent: Spanish)
(Speaker: HKK; Accent: Korean)
(Speaker: NCC; Accent: Mandarin)

Will we ever forget it?

Synthesized L1 Speech L2 Speech (GT)
(Speaker: HQTV; Accent: Vietnamese)
(Speaker: LXC; Accent: Mandarin)
(Speaker: NJS; Accent: Spanish)
(Speaker: SKA; Accent: Arabic)
(Speaker: YDCK; Accent: Korean)

He was a head shorter than his companion, of almost delicate physique.

Synthesized L1 Speech L2 Speech (GT)
(Speaker: BWC; Accent: Mandarin)
(Speaker: RRBI; Accent: Hindi)
(Speaker: TLV; Accent: Vietnamese)
(Speaker: TNI; Accent: Hindi)
(Speaker: YKWK; Accent: Korean)

2. Audio Quality of CAI-TTS

A moment before, he was intoxicated by a joy that was almost madness.
(Speaker: EBVS; Accent: Spanish)

GT
GT (mel+HiFi-GAN)
Tacotron2 (mel+HiFi-GAN)
Transformer TTS (mel+HiFi-GAN)
FastSpeech2 (mel+HiFi-GAN)
CAI-TTS (mel+HiFi-GAN)
CAI-TTS w/o phoneme pitch & energy (mel+HiFi-GAN)
CAI-TTS w/o accent intensity (mel+HiFi-GAN)
CAI-TTS w/o consistency constraint (mel+HiFi-GAN)

Among my minor afflictions, I may mention a new and mysterious one.
(Speaker: PNV; Accent: Vietnamese)

GT
GT (mel+HiFi-GAN)
Tacotron2 (mel+HiFi-GAN)
Transformer TTS (mel+HiFi-GAN)
FastSpeech2 (mel+HiFi-GAN)
CAI-TTS (mel+HiFi-GAN)
CAI-TTS w/o phoneme pitch & energy (mel+HiFi-GAN)
CAI-TTS w/o accent intensity (mel+HiFi-GAN)
CAI-TTS w/o consistency constraint (mel+HiFi-GAN)

By that answer my professional medical prestige stood or fell.
(Speaker: MBMPS; Accent: Spanish)

GT
GT (mel+HiFi-GAN)
Tacotron2 (mel+HiFi-GAN)
Transformer TTS (mel+HiFi-GAN)
FastSpeech2 (mel+HiFi-GAN)
CAI-TTS (mel+HiFi-GAN)
CAI-TTS w/o phoneme pitch & energy (mel+HiFi-GAN)
CAI-TTS w/o accent intensity (mel+HiFi-GAN)
CAI-TTS w/o consistency constraint (mel+HiFi-GAN)

Unconsciously, our yells and exclamations yielded to this rhythm.
(Speaker: TXHC; Accent: Mandarin)

GT
GT (mel+HiFi-GAN)
Tacotron2 (mel+HiFi-GAN)
Transformer TTS (mel+HiFi-GAN)
FastSpeech2 (mel+HiFi-GAN)
CAI-TTS (mel+HiFi-GAN)
CAI-TTS w/o phoneme pitch & energy (mel+HiFi-GAN)
CAI-TTS w/o accent intensity (mel+HiFi-GAN)
CAI-TTS w/o consistency constraint (mel+HiFi-GAN)

He made no reply as he waited for Whittemore to continue.
(Speaker: NCC; Accent: Mandarin)

GT
GT (mel+HiFi-GAN)
Tacotron2 (mel+HiFi-GAN)
Transformer TTS (mel+HiFi-GAN)
FastSpeech2 (mel+HiFi-GAN)
CAI-TTS (mel+HiFi-GAN)
CAI-TTS w/o phoneme pitch & energy (mel+HiFi-GAN)
CAI-TTS w/o accent intensity (mel+HiFi-GAN)
CAI-TTS w/o consistency constraint (mel+HiFi-GAN)

(Section IV-G) Controllable Accent Intensity

A moment before, he was intoxicated by a joy that was almost madness.
(Speaker: EBVS; Accent: Spanish)

CAI-TTS
(accent intensity = 0.1)
CAI-TTS
(accent intensity = 0.2)
CAI-TTS
(accent intensity = 0.3)
CAI-TTS
(accent intensity = 0.4)
CAI-TTS
(accent intensity = 0.5)
CAI-TTS
(accent intensity = 0.6)
CAI-TTS
(accent intensity = 0.7)
CAI-TTS
(accent intensity = 0.8)
CAI-TTS
(accent intensity = 0.9)
CAI-TTS w/o consistency constraint
(accent intensity = 0.1)
CAI-TTS w/o consistency constraint
(accent intensity = 0.2)
CAI-TTS w/o consistency constraint
(accent intensity = 0.3)
CAI-TTS w/o consistency constraint
(accent intensity = 0.4)
CAI-TTS w/o consistency constraint
(accent intensity = 0.5)
CAI-TTS w/o consistency constraint
(accent intensity = 0.6)
CAI-TTS w/o consistency constraint
(accent intensity = 0.7)
CAI-TTS w/o consistency constraint
(accent intensity = 0.8)
CAI-TTS w/o consistency constraint
(accent intensity = 0.9)

Gregson had left the outer door slightly ajar.
(Speaker: ABA; Accent: Arabic)

CAI-TTS
(accent intensity = 0.1)
CAI-TTS
(accent intensity = 0.2)
CAI-TTS
(accent intensity = 0.3)
CAI-TTS
(accent intensity = 0.4)
CAI-TTS
(accent intensity = 0.5)
CAI-TTS
(accent intensity = 0.6)
CAI-TTS
(accent intensity = 0.7)
CAI-TTS
(accent intensity = 0.8)
CAI-TTS
(accent intensity = 0.9)
CAI-TTS w/o consistency constraint
(accent intensity = 0.1)
CAI-TTS w/o consistency constraint
(accent intensity = 0.2)
CAI-TTS w/o consistency constraint
(accent intensity = 0.3)
CAI-TTS w/o consistency constraint
(accent intensity = 0.4)
CAI-TTS w/o consistency constraint
(accent intensity = 0.5)
CAI-TTS w/o consistency constraint
(accent intensity = 0.6)
CAI-TTS w/o consistency constraint
(accent intensity = 0.7)
CAI-TTS w/o consistency constraint
(accent intensity = 0.8)
CAI-TTS w/o consistency constraint
(accent intensity = 0.9)

Her own betrayal of herself was like tonic to Philip.
(Speaker: NJS; Accent: Spanish)

CAI-TTS
(accent intensity = 0.1)
CAI-TTS
(accent intensity = 0.2)
CAI-TTS
(accent intensity = 0.3)
CAI-TTS
(accent intensity = 0.4)
CAI-TTS
(accent intensity = 0.5)
CAI-TTS
(accent intensity = 0.6)
CAI-TTS
(accent intensity = 0.7)
CAI-TTS
(accent intensity = 0.8)
CAI-TTS
(accent intensity = 0.9)
CAI-TTS w/o consistency constraint
(accent intensity = 0.1)
CAI-TTS w/o consistency constraint
(accent intensity = 0.2)
CAI-TTS w/o consistency constraint
(accent intensity = 0.3)
CAI-TTS w/o consistency constraint
(accent intensity = 0.4)
CAI-TTS w/o consistency constraint
(accent intensity = 0.5)
CAI-TTS w/o consistency constraint
(accent intensity = 0.6)
CAI-TTS w/o consistency constraint
(accent intensity = 0.7)
CAI-TTS w/o consistency constraint
(accent intensity = 0.8)
CAI-TTS w/o consistency constraint
(accent intensity = 0.9)

I was in New York when the crash came.
(Speaker: ZHAA; Accent: Arabic)

CAI-TTS
(accent intensity = 0.1)
CAI-TTS
(accent intensity = 0.2)
CAI-TTS
(accent intensity = 0.3)
CAI-TTS
(accent intensity = 0.4)
CAI-TTS
(accent intensity = 0.5)
CAI-TTS
(accent intensity = 0.6)
CAI-TTS
(accent intensity = 0.7)
CAI-TTS
(accent intensity = 0.8)
CAI-TTS
(accent intensity = 0.9)
CAI-TTS w/o consistency constraint
(accent intensity = 0.1)
CAI-TTS w/o consistency constraint
(accent intensity = 0.2)
CAI-TTS w/o consistency constraint
(accent intensity = 0.3)
CAI-TTS w/o consistency constraint
(accent intensity = 0.4)
CAI-TTS w/o consistency constraint
(accent intensity = 0.5)
CAI-TTS w/o consistency constraint
(accent intensity = 0.6)
CAI-TTS w/o consistency constraint
(accent intensity = 0.7)
CAI-TTS w/o consistency constraint
(accent intensity = 0.8)
CAI-TTS w/o consistency constraint
(accent intensity = 0.9)

Unconsciously, our yells and exclamations yielded to this rhythm.
(Speaker: TXHC; Accent: Mandarin)

CAI-TTS
(accent intensity = 0.1)
CAI-TTS
(accent intensity = 0.2)
CAI-TTS
(accent intensity = 0.3)
CAI-TTS
(accent intensity = 0.4)
CAI-TTS
(accent intensity = 0.5)
CAI-TTS
(accent intensity = 0.6)
CAI-TTS
(accent intensity = 0.7)
CAI-TTS
(accent intensity = 0.8)
CAI-TTS
(accent intensity = 0.9)
CAI-TTS w/o consistency constraint
(accent intensity = 0.1)
CAI-TTS w/o consistency constraint
(accent intensity = 0.2)
CAI-TTS w/o consistency constraint
(accent intensity = 0.3)
CAI-TTS w/o consistency constraint
(accent intensity = 0.4)
CAI-TTS w/o consistency constraint
(accent intensity = 0.5)
CAI-TTS w/o consistency constraint
(accent intensity = 0.6)
CAI-TTS w/o consistency constraint
(accent intensity = 0.7)
CAI-TTS w/o consistency constraint
(accent intensity = 0.8)
CAI-TTS w/o consistency constraint
(accent intensity = 0.9)
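
Each utterance in this section is presented under the same set of conditions: two systems, each at nine accent intensity values from 0.1 to 0.9. As a minimal sketch, that condition grid can be enumerated as below. The names `SYSTEMS`, `INTENSITIES`, and `sample_grid` are hypothetical and used only for illustration; the snippet lists the conditions and does not call any synthesis model.

```python
from itertools import product

# System labels and intensity values as they are listed on this page.
SYSTEMS = ["CAI-TTS", "CAI-TTS w/o consistency constraint"]
# 0.1, 0.2, ..., 0.9 (rounded to avoid floating-point residue such as 0.30000000000000004).
INTENSITIES = [round(0.1 * k, 1) for k in range(1, 10)]


def sample_grid(systems=SYSTEMS, intensities=INTENSITIES):
    """Return (system, intensity) pairs in the order the samples are listed."""
    # product() varies the last argument fastest, so all intensities
    # of the first system come before those of the second system.
    return list(product(systems, intensities))


grid = sample_grid()
for system, intensity in grid:
    print(f"{system} (accent intensity = {intensity})")
```

With two systems and nine intensity values, the grid holds 18 conditions per utterance, which matches the number of samples listed under each sentence above.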