MSV challenge language-adversarial training for indic multilingual speaker verification

Hoang Long Vu, Nguyen Van Huy, Ngo Thi Thu Huyen, Pham Viet Thanh
Authors

  • Hoang Long Vu, Hanoi University of Science and Technology, 1 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam
  • Nguyen Van Huy, Hanoi University of Science and Technology, 1 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam
  • Ngo Thi Thu Huyen, Hanoi University of Science and Technology, 1 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam
  • Pham Viet Thanh, Hanoi University of Science and Technology, 1 Dai Co Viet Street, Hai Ba Trung District, Ha Noi, Viet Nam

DOI:

https://doi.org/10.15625/1813-9663/18320

Keywords:

Speaker verification, adversarial training, multilingual.

Abstract

Speaker verification now achieves reasonable accuracy in voice-based biometric applications. Recent research on deep neural networks that predict speaker identity from speaker embeddings has been remarkably successful. However, results remain limited when verifying multilingual speakers. In this paper, we propose an ensemble system submitted to the I-MSV Challenge 2022. The system is built upon the ECAPA and RawNet models with additional adversarial training layers. Probabilistic Linear Discriminant Analysis (PLDA) back-end scoring and Large Margin Cosine Loss are used to obtain more discriminative features. Experimental results show that, on the Constrained Private Test set of the task, our proposed model ranked third with an Equal Error Rate (EER) of 2.9734%.
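
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an Equal Error Rate can be estimated from verification scores. It is an illustration only and not the challenge's official scoring tool: the function name `equal_error_rate`, the accept rule (score at or above the threshold), and the toy scores are all assumptions made for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Return (eer, threshold) for two lists of trial scores.

    Assumption for this sketch: a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Hypothetical toy scores (not taken from the paper).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

On these toy scores the false acceptance and false rejection rates are both 25% at a threshold of 0.6, so the sketch reports an EER of 25%. A lower EER means the target and non-target score distributions overlap less.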

Published

10-09-2024

How to Cite

[1]
H. L. Vu, N. V. Huy, N. T. T. Huyen, and P. V. Thanh, “MSV challenge language-adversarial training for indic multilingual speaker verification”, JCC, vol. 40, no. 3, pp. 287–298, Sep. 2024.

Section

Articles