Van Thang Nguyen, Thanh Long Luong, Huan Vu
Author affiliations


  • Van Thang Nguyen Innovation Center, VNPT-IT, Ha Noi, Viet Nam
  • Thanh Long Luong Innovation Center, VNPT-IT, Ha Noi, Viet Nam
  • Huan Vu University of Transport and Communications, Ha Noi, Viet Nam



Emotional speech synthesis; Emotion transplantation; Text-to-speech.


Emotional speech synthesis is a challenging task in speech processing. To build an emotional Text-to-speech (TTS) system, one would need a high-quality emotional dataset of the target speaker. However, collecting such data is difficult, sometimes even impossible. This paper presents our approach to transplanting a source speaker's emotional expression to a target speaker, one of the Vietnamese Language and Speech Processing (VLSP) 2022 TTS tasks. Our approach comprises a complete data pre-processing pipeline and two training algorithms: we first train an expressive TTS model on the source speaker, then adapt its voice characteristics to the target speaker. Empirical results show the efficacy of our method in generating expressive speech for a speaker under a limited training data regime.
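The two-stage recipe described in the abstract (train an expressive model on the source speaker, then adapt only the speaker-dependent part on limited target data) can be illustrated with a minimal toy sketch. This is not the paper's actual model: the linear "emotion" and "speaker" parameter groups, the freezing strategy, and all hyperparameters below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground truth for the toy problem: both speakers share one expressive
# mapping A, but differ in a speaker-dependent offset (b_src vs. b_tgt).
A = rng.normal(size=(4, 4))
b_src, b_tgt = rng.normal(size=4), rng.normal(size=4)

# Toy "TTS model" parameters.
W_emotion = np.zeros((4, 4))   # stand-in for the expressive layers
w_speaker = np.zeros(4)        # stand-in for speaker characteristics

def predict(x):
    return W_emotion @ x + w_speaker

def sgd_step(x, y, lr=0.05, update_emotion=True):
    """One squared-error SGD step; the emotion weights can be frozen."""
    global W_emotion, w_speaker
    err = predict(x) - y
    if update_emotion:
        W_emotion -= lr * np.outer(err, x)
    w_speaker -= lr * err

# Stage 1: fit all parameters on plentiful source-speaker data.
for _ in range(3000):
    x = rng.normal(size=4)
    sgd_step(x, A @ x + b_src)

# Stage 2: freeze the expressive part, adapt only the speaker part
# on a small number of target-speaker samples.
for _ in range(200):
    x = rng.normal(size=4)
    sgd_step(x, A @ x + b_tgt, update_emotion=False)

# The expressive mapping is preserved while the speaker offset has
# moved from b_src toward b_tgt.
print(np.abs(W_emotion - A).max(), np.abs(w_speaker - b_tgt).max())
```

The point of the sketch is only the division of labor between the two stages: expressive behavior is learned where data is abundant, and only the speaker-identity parameters are updated where data is scarce.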






How to Cite

V. T. Nguyen, T. L. Luong, and H. Vu, “THE VNPT-IT EMOTION TRANSPLANTATION APPROACH FOR VLSP 2022”, JCC, vol. 39, no. 4, Nov. 2023.