• The aim of multi-speaker emotional speech synthesis is to generate speech for a designated speaker in a desired emotional state. The task is challenging because speech variations such as noise, content, and timbre can obstruct emotion extraction and transfer. This paper proposes a new approach to multi-speaker emotional speech synthesis. The proposed method, built on a seq2seq synthesizer, integrates an emotion embedding as a conditioning variable to convey emotional information. To enhance emotion representation extraction, we use a three-dimensional feature map as input. In addition, a generalization module based on adaptive instance normalization (AdaIN) is proposed to obtain emotion embeddings with strong generalization capability, which also improves controllability. The resulting emotion embedding can be readily conditioned by affine parameters, allowing control over both the emotion category and the intensity of the synthesized speech (see the sketch below). We evaluate our model on both Mandarin and English datasets from an emotional speech database. The results demonstrate state-of-the-art performance in multi-speaker emotional speech synthesis, together with the notable advantage of high emotion controllability.
• Submitted to IEEE Transactions on Affective Computing (Major Revision; Re-submitted)
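To make the AdaIN-based conditioning concrete, below is a minimal sketch (not the authors' code) of how an emotion embedding might drive affine parameters that re-normalize the synthesizer's hidden features, with a scalar intensity factor controlling emotion strength. The class name, dimensions, and the `intensity` parameter are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch, assuming a PyTorch implementation; names and shapes are hypothetical.
import torch
import torch.nn as nn


class AdaINConditioner(nn.Module):
    def __init__(self, emotion_dim: int, feature_dim: int):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the emotion embedding.
        self.to_affine = nn.Linear(emotion_dim, 2 * feature_dim)

    def forward(self, features: torch.Tensor, emotion_emb: torch.Tensor,
                intensity: float = 1.0) -> torch.Tensor:
        # features:    (batch, time, feature_dim) hidden states of the seq2seq synthesizer
        # emotion_emb: (batch, emotion_dim) embedding extracted from a reference utterance
        gamma, beta = self.to_affine(emotion_emb).chunk(2, dim=-1)

        # Instance-normalize over the time axis to strip utterance-specific statistics.
        mean = features.mean(dim=1, keepdim=True)
        std = features.std(dim=1, keepdim=True) + 1e-5
        normalized = (features - mean) / std

        # Apply the emotion-dependent affine transform; scaling it by `intensity`
        # is one plausible way to weaken or strengthen the conveyed emotion.
        gamma = 1.0 + intensity * gamma.unsqueeze(1)
        beta = intensity * beta.unsqueeze(1)
        return gamma * normalized + beta


# Usage: condition 256-dim synthesizer features on a 128-dim emotion embedding.
layer = AdaINConditioner(emotion_dim=128, feature_dim=256)
feats = torch.randn(2, 50, 256)
emo = torch.randn(2, 128)
weak = layer(feats, emo, intensity=0.5)    # weaker emotion
strong = layer(feats, emo, intensity=1.5)  # stronger emotion
```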
TTS Samples on Cross-speaker Emotion Transfer
For each emotion (Angry, Happy, Surprise, Neutral, Sad), audio samples are provided for the emotion reference, the target speaker reference, and the generated speech.
TTS Samples on Emotion Control by Affine Parameters
For each emotion (Angry, Happy, Surprise, Neutral, Sad), audio samples are provided for the target speaker reference and for the generated speech at weak, medium, and strong emotion intensity.