Retranslating number expression unknown word in Chinese-Vietnamese statistical machine translation

Phước Thanh Trần, Điền Đinh


Word boundaries in Chinese and Vietnamese are not marked by spaces. Word segmentation is therefore always the first step in Chinese-Vietnamese natural language processing in general, and in Chinese-Vietnamese statistical machine translation (SMT) in particular. Word segmentation improves the final translation quality, but it also causes many unknown words (UKW) to appear in the target translation. A common type of unknown word in Chinese-Vietnamese translation systems is the named entity (NE). In this paper, we present a hybrid method that combines statistics and rules to re-translate number expression NE-UKWs (NumExp-NE-UKW). Experimental results show that applying this method to Chinese-Vietnamese SMT significantly improves translation performance.
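To make the rule-based side of such a hybrid approach concrete, the following is a minimal sketch of one rule a re-translation step might apply: normalizing a Chinese numeral string to an integer before it is rendered in the target language. It is an illustration only and not the method of this paper; the function name `chinese_numeral_to_int` and the lookup tables are hypothetical, and the sketch assumes well-formed numerals that use 万 and 亿 at most once each (compound units such as 万亿 are not handled).

```python
# Hypothetical lookup tables for this sketch (not taken from the paper).
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_numeral_to_int(text: str) -> int:
    """Convert a well-formed Chinese numeral string to an integer."""
    total = 0    # value already closed off by 万 or 亿
    section = 0  # value of the current group below 10,000
    digit = 0    # most recent digit not yet attached to a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as the leading 十 in 十五 counts as one unit.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            # Close the current group and scale it by 万 or 亿.
            total += (section + digit) * BIG_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    for s in ["十五", "三百二十五", "一千零五", "两万三千"]:
        print(s, chinese_numeral_to_int(s))
```

In a full system, a rule like this would only handle the numeric core of the expression; choosing the Vietnamese reading and the surrounding classifier or unit word would presumably rely on the statistical component.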


Chinese-Vietnamese statistical machine translation, unknown word, named entity, number expression.


Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology