THE NO TRAIN NO GAIN SYSTEM FOR O-COCOSDA AND VLSP 2022 - A-MSV SHARED TASK: ASIAN MULTILINGUAL SPEAKER VERIFICATION
DOI:
https://doi.org/10.15625/1813-9663/18248
Keywords:
Speaker verification, ECAPA-TDNN, GMM, fine-tuning, score normalization
Abstract
This paper proposes a semi-supervised multilingual speaker verification (MSV) system submitted to the two tasks of the O-COCOSDA and VLSP 2022 challenge: MSV for Asian languages inside the training set (T01) and outside the training set (T02). Our strategy is to train a baseline acoustic model on the given labeled data (MSV CommonVoice) and then fine-tune the trained model on both the labeled data and the given unlabeled data (MSV Youtube). To make fine-tuning possible, the unlabeled data is converted to labeled data with a pseudo-labeling technique that clusters the embedding vectors extracted by the trained acoustic model. We also apply test-time augmentation, back-end scoring, and score normalization with the AS-Norm technique to improve the results. Evaluated on the VLSP 2022 challenge test set, our best system, built on a baseline ECAPA-TDNN, achieves an equal error rate (EER) of 2.296% on T01 and 3.3296% on T02, ranking second in both tasks.
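The abstract names AS-Norm (adaptive symmetric score normalization) without giving its configuration. As a rough illustration only, the sketch below assumes cosine scoring and a cohort of speaker embeddings; the function names, the cohort, and the `top_k` value are hypothetical and are not taken from the paper. Each raw trial score is normalized by the mean and standard deviation of the `top_k` highest cohort scores of the enrollment side and of the test side, and the two normalized scores are averaged.

```python
import numpy as np


def cosine_scores(x, cohort):
    """Cosine similarity between one embedding x (D,) and cohort embeddings (N, D)."""
    x = x / np.linalg.norm(x)
    cohort = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    return cohort @ x


def as_norm(enroll, test, cohort, top_k=300, eps=1e-8):
    """Adaptive symmetric normalization of one enrollment/test trial.

    top_k and the cohort are illustrative assumptions, not the paper's settings.
    """
    # Raw cosine score of the trial.
    raw = float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

    # Keep only the top-k most similar cohort entries for each side.
    k = min(top_k, len(cohort))
    enroll_top = np.sort(cosine_scores(enroll, cohort))[-k:]
    test_top = np.sort(cosine_scores(test, cohort))[-k:]

    # Normalize the raw score against each side's cohort statistics.
    enroll_norm = (raw - enroll_top.mean()) / (enroll_top.std() + eps)
    test_norm = (raw - test_top.mean()) / (test_top.std() + eps)

    # Symmetric: average of the two normalized scores.
    return 0.5 * (enroll_norm + test_norm)
```

In practice the cohort would typically be drawn from training-set embeddings, and the normalized score, rather than the raw cosine score, would be compared against the decision threshold.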
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that We have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials including tables, diagrams or photographs not owned by me/us.