ADAPT-TTS: HIGH-QUALITY ZERO-SHOT MULTI-SPEAKER TEXT-TO-SPEECH ADAPTIVE-BASED FOR VIETNAMESE
DOI: https://doi.org/10.15625/1813-9663/18136

Keywords: Zero-shot TTS, multi-speaker, text-to-speech, diffusion models, mel-spectrogram denoiser, Extracting Mel-vector (EMV), Adapt-TTS

Abstract
Current adaptive speech synthesis techniques follow two main streams: 1. fine-tuning the model with small amounts of adaptive data, and 2. conditioning the entire model on a speaker embedding of the target speaker. However, both methods require the adaptive data to appear during training, which makes generating new voices quite expensive. In addition, traditional TTS models use a simple loss function to reproduce the acoustic features, but this optimization rests on incorrect distribution assumptions and leads to noisy synthesized audio. To solve these problems, we introduce the Adapt-TTS model, which synthesizes high-quality audio from a small adaptive sample without additional training. Key contributions: 1. the Extracting Mel-vector (EMV) architecture gives a better representation of speaker characteristics and speaking style; 2. an improved zero-shot model with a denoising diffusion component (Mel-spectrogram denoiser) synthesizes new voices without training and with better quality (less noise). The evaluation results demonstrate the model's effectiveness: given only a single utterance (1-3 seconds) from the reference speaker, the synthesis system produces high-quality results with high speaker similarity.
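For illustration only, the sketch below shows, in PyTorch, the general shape of a DDPM-style training step for a mel-spectrogram denoiser conditioned on a reference-speaker embedding (such as an EMV-like vector). It is a minimal sketch under stated assumptions, not the paper's implementation; all class, function, and variable names (MelDenoiser, ddpm_training_step, spk_emb, and the dimensions) are hypothetical.

# Illustrative sketch: conditional DDPM denoising of mel-spectrograms.
# All names and sizes are assumptions for the example, not from Adapt-TTS.
import torch
import torch.nn as nn

class MelDenoiser(nn.Module):
    """Predicts the noise added to a mel-spectrogram at diffusion step t,
    conditioned on a speaker/style embedding (e.g., an EMV-like vector)."""
    def __init__(self, n_mels=80, emb_dim=256, hidden=512, n_steps=1000):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, hidden)   # diffusion-step embedding
        self.cond_proj = nn.Linear(emb_dim, hidden)     # speaker-embedding projection
        self.net = nn.Sequential(
            nn.Linear(n_mels + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, noisy_mel, t, spk_emb):
        # noisy_mel: (batch, frames, n_mels); t: (batch,); spk_emb: (batch, emb_dim)
        cond = self.step_emb(t) + self.cond_proj(spk_emb)           # (batch, hidden)
        cond = cond.unsqueeze(1).expand(-1, noisy_mel.size(1), -1)  # broadcast over frames
        return self.net(torch.cat([noisy_mel, cond], dim=-1))       # predicted noise

def ddpm_training_step(model, mel, spk_emb, betas):
    """Standard DDPM objective (Ho et al., 2020): corrupt the mel-spectrogram
    at a random step and train the network to predict the added noise."""
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    t = torch.randint(0, betas.size(0), (mel.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1, 1)
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1 - a_bar).sqrt() * noise
    pred = model(noisy_mel, t, spk_emb)
    return nn.functional.mse_loss(pred, noise)

# Usage example with random tensors standing in for real data.
model = MelDenoiser()
betas = torch.linspace(1e-4, 0.02, 1000)
mel = torch.randn(4, 200, 80)    # batch of ground-truth mel-spectrograms
spk_emb = torch.randn(4, 256)    # reference-speaker embeddings (EMV-like)
loss = ddpm_training_step(model, mel, spk_emb, betas)
loss.backward()

At inference time, the same network would be applied iteratively to refine a noisy mel-spectrogram into a clean one conditioned on the reference speaker, which is the role the abstract attributes to the Mel-spectrogram denoiser.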
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is made on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and a reference to the original journal publication.
3. We warrant that the Work is our own result, has not been published before in its current or a substantially similar form, is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder(s) to reproduce in the article any materials not owned by us, including tables, diagrams or photographs.