DEVELOPMENT OF HIGH-PERFORMANCE AND LARGE-SCALE VIETNAMESE AUTOMATIC SPEECH RECOGNITION SYSTEMS
DOI: https://doi.org/10.15625/1813-9663/34/4/13165

Keywords: Speech recognition, Vietnamese, speech corpus

Abstract
Automatic Speech Recognition (ASR) systems automatically convert human speech into the corresponding transcription. They have a wide range of applications, such as controlling robots, call center analytics, and voice chatbots. Recent studies on ASR for English have achieved performance that surpasses human ability. Those systems were trained on large amounts of data and perform well in many environments. With regard to Vietnamese, there have been many studies on improving the performance of existing ASR systems; however, many of them were conducted on small-scale data sets, which do not reflect realistic scenarios. Although the corpora used to train the systems were carefully designed to maintain phonetic balance, efforts to collect them at a large scale are still limited. In particular, existing works evaluated only a certain accent of Vietnam. In this paper, we first describe our efforts in collecting a large data set that covers all three major accents of Vietnam, spoken in the Northern, Central, and Southern regions. Then, we detail our ASR system development procedure utilizing the collected data set and evaluating different model architectures to find the best structure for Vietnamese. In the VLSP 2018 challenge, our system achieved the best performance with 6.5% WER, and on our internal test set of more than 10 hours of speech collected in real environments, the system also performs well with 11% WER.
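The accuracy figures above are reported as word error rate (WER). As a quick illustration of how such a figure is computed, the sketch below implements the standard edit-distance definition of WER; it is an illustrative example only, not the authors' evaluation code, and the sample sentences are hypothetical.

```python
# Minimal sketch of word error rate (WER), the metric reported in the abstract.
# WER = (substitutions + deletions + insertions) / number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one substitution over four reference words -> 25% WER.
print(wer("xin chào các bạn", "xin chào cả bạn"))  # 0.25
```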