THE VNPT-IT EMOTION TRANSPLANTATION APPROACH FOR VLSP 2022
DOI: https://doi.org/10.15625/1813-9663/18236
Keywords: Emotional speech synthesis, Emotion transplantation, Text-to-speech
Abstract
Emotional speech synthesis is a challenging task in speech processing. To build an emotional text-to-speech (TTS) system, one needs a high-quality emotional dataset of the target speaker. However, collecting such data is difficult, and sometimes even impossible. This paper presents our approach to transplanting a source speaker's emotional expression to a target speaker, one of the Vietnamese Language and Speech Processing (VLSP) 2022 TTS tasks. Our approach consists of a complete data pre-processing pipeline and two training algorithms: we first train an expressive TTS model on the source speaker, then adapt its voice characteristics to the target speaker. Empirical results show the efficacy of our method in generating expressive speech for a speaker under a limited training data regime.
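The abstract outlines a two-stage recipe: first train an expressive TTS model on the source speaker's emotional corpus, then adapt the voice characteristics to the target speaker. The sketch below illustrates that idea in PyTorch with a toy acoustic model. The architecture, the data loaders, the choice of which modules to freeze during adaptation, and all hyper-parameters are illustrative assumptions, not the authors' actual implementation.

# Minimal, hypothetical sketch of a two-stage emotion-transplantation recipe.
# Stage 1: train an expressive TTS model on the source speaker's emotional data.
# Stage 2: adapt the voice to the target speaker on limited data, keeping the
# emotion-related parameters frozen (an assumption, not the paper's exact setup).
import torch
import torch.nn as nn


class ToyTTS(nn.Module):
    """Stand-in acoustic model: text encoder + emotion embedding + decoder."""

    def __init__(self, vocab=256, n_emotions=5, hidden=128, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, hidden)
        self.emotion_embedding = nn.Embedding(n_emotions, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, tokens, emotion_id):
        # Condition the text encoding on a learned emotion embedding.
        h = self.text_encoder(tokens) + self.emotion_embedding(emotion_id).unsqueeze(1)
        out, _ = self.decoder(h)
        return self.mel_head(out)


def train_stage(model, loader, lr, epochs=1):
    """Optimise only the parameters that are not frozen."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for tokens, emotion_id, mel in loader:
            opt.zero_grad()
            loss = loss_fn(model(tokens, emotion_id), mel)
            loss.backward()
            opt.step()


model = ToyTTS()

# Stage 1: expressive model on the source speaker's emotional corpus
# (source_emotional_loader is a hypothetical DataLoader yielding (tokens, emotion_id, mel)).
# train_stage(model, source_emotional_loader, lr=1e-4, epochs=50)

# Stage 2: speaker adaptation on the target speaker's limited data.
# Freezing the emotion embedding preserves the learned expressiveness while the
# remaining weights shift toward the target voice.
for p in model.emotion_embedding.parameters():
    p.requires_grad = False
# train_stage(model, target_speaker_loader, lr=1e-5, epochs=5)

Freezing the emotion-related parameters during adaptation is one common way to keep the transplanted expressiveness intact while the rest of the network moves toward the target speaker's timbre; the actual VLSP system may partition the model differently.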
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.