The NO_TRAIN_NO_GAIN system for O-COCOSDA and VLSP 2022 - A-MSV shared task: ASIAN multilingual speaker verification

Ngoc-Dung Nguyen; Nhat-Nam Ly; Trong-Khanh Le

doi:10.15625/1813-9663/18248

Author affiliations

Authors

Ngoc-Dung Nguyen School of Information and Communication Technology, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam
Nhat-Nam Ly School of Information and Communication Technology, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam
Trong-Khanh Le School of Information and Communication Technology, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam

Keywords:

Speaker verification, ECAPA-TDNN, GMM, fine-tuning, score normalization

Abstract

This paper presents a semi-supervised multilingual speaker verification (MSV) system submitted to the two tasks of the O-COCOSDA and VLSP 2022 challenge: MSV for Asian languages inside the training set (T01) and outside the training set (T02). Our strategy is to train a baseline acoustic model on the provided labeled data (MSV CommonVoice) and then fine-tune it on both the labeled data and the provided unlabeled data (MSV Youtube). For fine-tuning, the unlabeled data is converted to labeled data by pseudo-labeling: embedding vectors extracted with the trained acoustic model are clustered, and the cluster assignments serve as pseudo-labels. We also apply test-time augmentation, back-end scoring, and score normalization with the AS-Norm technique to improve the results. Evaluated on the challenge's test set, our best system, built on an ECAPA-TDNN baseline, achieves an equal error rate (EER) of 2.296% on T01 and 3.3296% on T02, ranking second in both tasks.
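
The abstract names AS-Norm without giving its configuration, so the following is only a minimal sketch of how adaptive symmetric score normalization is commonly computed, not the authors' implementation. It assumes cosine scoring, and the function names, the toy 3-dimensional embeddings, the cohort, and the top-k size are all hypothetical choices made for illustration.

```python
import math
from statistics import mean, pstdev


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_stats(emb, cohort, k):
    """Mean and std of the k highest cohort scores for one embedding."""
    scores = sorted((cosine(emb, c) for c in cohort), reverse=True)[:k]
    # Guard against a zero std when the top-k scores are identical.
    return mean(scores), max(pstdev(scores), 1e-8)


def as_norm(enroll, test, cohort, k=2):
    """Adaptive symmetric normalization of a raw cosine trial score.

    The raw score is z-normalized twice, once with the enrollment side's
    top-k cohort statistics and once with the test side's, then averaged.
    """
    raw = cosine(enroll, test)
    mu_e, sd_e = top_k_stats(enroll, cohort, k)
    mu_t, sd_t = top_k_stats(test, cohort, k)
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_t) / sd_t)


# Toy data (hypothetical): real systems use high-dimensional speaker
# embeddings and a cohort of hundreds or thousands of utterances.
cohort = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
enroll = [0.9, 0.1, 0.0]
test_same = [0.8, 0.2, 0.1]   # close to the enrollment embedding
test_diff = [0.0, 0.1, 0.9]   # far from the enrollment embedding

score_same = as_norm(enroll, test_same, cohort)
score_diff = as_norm(enroll, test_diff, cohort)
```

In this sketch the normalized score of the matching pair stays well above that of the mismatched pair, and swapping the enrollment and test embeddings leaves the score unchanged, which is the "symmetric" part of the name.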
