THE VNPT-IT EMOTION TRANSPLANTATION APPROACH FOR VLSP 2022
DOI: https://doi.org/10.15625/1813-9663/18236
Keywords: Emotional speech synthesis, Emotion transplantation, Text-to-speech
Abstract
Emotional speech synthesis is a challenging task in speech processing. To build an emotional text-to-speech (TTS) system, one needs a high-quality emotional dataset of the target speaker. However, collecting such data is difficult, and sometimes even impossible. This paper presents our approach to transplanting a source speaker's emotional expression to a target speaker, one of the Vietnamese Language and Speech Processing (VLSP) 2022 TTS tasks. Our approach consists of a complete data pre-processing pipeline and two training algorithms: we first train an expressive TTS model on the source speaker, then adapt its voice characteristics to the target speaker. Empirical results show the efficacy of our method in generating expressive speech for a speaker under a limited training data regime.
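The abstract outlines a two-stage recipe: first train an expressive TTS model on the source speaker's emotional corpus, then adapt the voice characteristics to the target speaker. The sketch below illustrates that idea in PyTorch with a toy acoustic model. The architecture, the data loaders, the choice of which modules to freeze during adaptation, and all hyper-parameters are illustrative assumptions, not the authors' actual implementation.

# Minimal, hypothetical sketch of a two-stage emotion-transplantation recipe.
# Stage 1: train an expressive TTS model on the source speaker's emotional data.
# Stage 2: adapt the voice to the target speaker on limited data, keeping the
# emotion-related parameters frozen (an assumption, not the paper's exact setup).
import torch
import torch.nn as nn


class ToyTTS(nn.Module):
    """Stand-in acoustic model: text encoder + emotion embedding + decoder."""

    def __init__(self, vocab=256, n_emotions=5, hidden=128, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, hidden)
        self.emotion_embedding = nn.Embedding(n_emotions, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.mel_head = nn.Linear(hidden, n_mels)

    def forward(self, tokens, emotion_id):
        # Condition the text encoding on a learned emotion embedding.
        h = self.text_encoder(tokens) + self.emotion_embedding(emotion_id).unsqueeze(1)
        out, _ = self.decoder(h)
        return self.mel_head(out)


def train_stage(model, loader, lr, epochs=1):
    """Optimise only the parameters that are not frozen."""
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for tokens, emotion_id, mel in loader:
            opt.zero_grad()
            loss = loss_fn(model(tokens, emotion_id), mel)
            loss.backward()
            opt.step()


model = ToyTTS()

# Stage 1: expressive model on the source speaker's emotional corpus
# (source_emotional_loader is a hypothetical DataLoader yielding (tokens, emotion_id, mel)).
# train_stage(model, source_emotional_loader, lr=1e-4, epochs=50)

# Stage 2: speaker adaptation on the target speaker's limited data.
# Freezing the emotion embedding preserves the learned expressiveness while the
# remaining weights shift toward the target voice.
for p in model.emotion_embedding.parameters():
    p.requires_grad = False
# train_stage(model, target_speaker_loader, lr=1e-5, epochs=5)

Freezing the emotion-related parameters during adaptation is one common way to keep the transplanted expressiveness intact while the rest of the network moves toward the target speaker's timbre; the actual VLSP system may partition the model differently.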
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.