ADAPT-TTS: HIGH-QUALITY ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH ADAPTIVE-BASED FOR VIETNAMESE

Phuong Pham Ngoc, Chung Tran Quang, Mai Luong Chi
Authors

  • Phuong Pham Ngoc, Thai Nguyen University
  • Chung Tran Quang, AIMed Vietnam Artificial Intelligence Solutions, Vietnam; Japan Advanced Institute of Science and Technology (JAIST)
  • Mai Luong Chi, Institute of Information Technology, Vietnam Academy of Science and Technology, Vietnam

DOI:

https://doi.org/10.15625/1813-9663/18136

Keywords:

Zero-shot TTS, multi-speaker, text-to-speech, diffusion models, mel-spectrogram denoiser, Extracting Mel-vector, EMV, Adapt-TTS

Abstract

Current adaptive speech synthesis techniques follow two main approaches: (1) fine-tuning the model on a small amount of adaptation data, and (2) conditioning the entire model on a speaker embedding of the target speaker. However, both methods require the adaptation data to be available during training, which makes producing new voices quite expensive. In addition, traditional TTS models use a simple loss function to reproduce the acoustic features, but this optimization rests on incorrect distributional assumptions and yields noisy synthesized audio. To address these problems, we introduce the Adapt-TTS model, which synthesizes high-quality audio from a small adaptation sample without additional training. Our key contributions are: (1) the Extracting Mel-vector (EMV) architecture, which better represents speaker characteristics and speaking style; and (2) an improved zero-shot model with a denoising diffusion component (the Mel-spectrogram denoiser), which synthesizes new voices without training and with better quality (less noise). Evaluation results demonstrate the model's effectiveness: given only a single utterance (1-3 seconds) from a reference speaker, the system produces high-quality speech with high similarity to the target voice.
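
The paper's implementation details are not given on this page, so the snippet below is only a minimal, illustrative sketch (not the authors' code) of the two ideas named in the abstract: pooling a short reference mel-spectrogram into a fixed speaker/style vector, in the spirit of the EMV extractor, and conditioning a DDPM-style mel-spectrogram denoiser on that vector so that an unseen voice can be handled without retraining. All module names, layer sizes, and the simple MLP denoiser are assumptions made for illustration.

# Illustrative sketch only; module names, sizes, and the MLP denoiser are assumptions.
import torch
import torch.nn as nn

class MelVectorExtractor(nn.Module):
    """Pools a short reference mel-spectrogram (1-3 s) into a fixed speaker/style vector."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(n_mels, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_mel):                 # ref_mel: (batch, frames, n_mels)
        return self.proj(ref_mel).mean(dim=1)   # average over time -> (batch, dim)

class MelDenoiser(nn.Module):
    """Predicts the noise added to a mel-spectrogram, conditioned on timestep and speaker vector."""
    def __init__(self, n_mels=80, dim=256, steps=1000):
        super().__init__()
        self.t_emb = nn.Embedding(steps, dim)
        self.net = nn.Sequential(nn.Linear(n_mels + 2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, n_mels))

    def forward(self, noisy_mel, t, spk):       # noisy_mel: (B, T, n_mels), t: (B,), spk: (B, dim)
        T = noisy_mel.size(1)
        cond = torch.cat([self.t_emb(t), spk], dim=-1)          # (B, 2*dim)
        cond = cond.unsqueeze(1).expand(-1, T, -1)              # broadcast over frames
        return self.net(torch.cat([noisy_mel, cond], dim=-1))   # predicted noise

# Standard DDPM-style training step (Ho et al., 2020): add noise at a random timestep,
# then train the denoiser to recover that noise given the speaker/style vector.
extractor, denoiser = MelVectorExtractor(), MelDenoiser()
betas = torch.linspace(1e-4, 0.02, 1000)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

ref_mel = torch.randn(2, 120, 80)     # reference utterance of the target speaker (dummy data)
clean_mel = torch.randn(2, 300, 80)   # mel-spectrogram to be refined (dummy data)
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(clean_mel)
a = alpha_bar[t].view(-1, 1, 1)
noisy = a.sqrt() * clean_mel + (1 - a).sqrt() * noise
loss = nn.functional.mse_loss(denoiser(noisy, t, extractor(ref_mel)), noise)
loss.backward()

At inference, the standard DDPM reverse process would run such a denoiser, conditioned on the EMV of the unseen reference speaker, to refine the acoustic model's coarse mel-spectrogram before vocoding; the speaker-specific information enters only through the conditioning vector, which is what allows a new voice to be synthesized without additional training.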

Published

12-06-2023

How to Cite

[1]
P. Pham Ngoc, C. Tran Quang, and M. Luong Chi, “ADAPT-TTS: HIGH-QUALITY ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH ADAPTIVE-BASED FOR VIETNAMESE”, JCC, vol. 39, no. 2, pp. 159–173, Jun. 2023.

Issue

Vol. 39 No. 2 (2023)

Section

Articles