Statistical syntax-based machine translation approach to diacritization problem

Nguyen Minh Hai, Nguyễn Minh Tuấn


In this paper, automatic diacritization is modeled as a statistical syntax-based machine translation problem in which the source is undiacritized text and the target is diacritized text in the same language. The grammatical inference technique ABL proposed in [2] is extended to learn a probabilistic synchronous context-free grammar from a training corpus containing only plain diacritized sentences. Diacritization is then performed by parsing input sentences with the probabilistic CKY algorithm under the learned grammar. The method is applied to Vietnamese with high-quality results. Because the construction is language independent, it can also be applied to other languages.
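
As a rough illustration of the parsing step described above, the following minimal sketch runs a Viterbi (probabilistic) CKY parse over a tiny hand-written synchronous grammar and reads the diacritized sentence off the best derivation. Everything in it is an assumption made for illustration: the nonterminal names, the rules, the probabilities (toy values, not normalized and not learned by ABL), and the function name `diacritize` do not come from the paper.

```python
import math

# Hypothetical toy synchronous lexical rules:
# undiacritized source word -> [(nonterminal, diacritized target, probability)]
LEXICAL = {
    "toi": [("PRO", "tôi", 0.7), ("ADJ", "tối", 0.3)],
    "la": [("V", "là", 0.8), ("N", "lá", 0.2)],
    "sinh": [("N1", "sinh", 1.0)],
    "vien": [("N2", "viên", 0.6), ("N2", "viện", 0.4)],
}

# Hypothetical binary rules in Chomsky normal form:
# (left child, right child) -> [(parent nonterminal, probability)]
# Word order is the same on both sides, so one rule covers source and target.
BINARY = {
    ("PRO", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("N1", "N2"): [("NP", 1.0)],
}


def diacritize(words, start="S"):
    """Return the target yield of the best parse, or None if no parse exists."""
    n = len(words)
    # chart[i][j][A] = (best log-probability, target tokens) for span i..j
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Fill width-1 spans from the lexical rules.
    for i, word in enumerate(words):
        cell = chart[i][i + 1]
        for nonterminal, target, prob in LEXICAL.get(word, []):
            score = math.log(prob)
            if nonterminal not in cell or score > cell[nonterminal][0]:
                cell[nonterminal] = (score, [target])

    # Combine adjacent spans bottom-up, keeping the best score per nonterminal.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = chart[i][j]
            for k in range(i + 1, j):
                for left, (left_score, left_out) in chart[i][k].items():
                    for right, (right_score, right_out) in chart[k][j].items():
                        for parent, prob in BINARY.get((left, right), []):
                            score = left_score + right_score + math.log(prob)
                            if parent not in cell or score > cell[parent][0]:
                                cell[parent] = (score, left_out + right_out)

    best = chart[0][n].get(start)
    return " ".join(best[1]) if best else None


print(diacritize("toi la sinh vien".split()))
```

Because diacritization never reorders words, the target side can be carried along as a simple token list in each chart cell; a general synchronous grammar would need to track reordering as well.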


Automatic diacritization, syntax-based machine translation, grammatical inference.


Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology