A model for exploiting the target language characteristics to extract bilingual base noun phrases

Nguyen Chi Hieu

Abstract


Bilingual Base Noun Phrase (BaseNP) extraction is one of the key tasks of Natural Language Processing (NLP). This task is more challenging for the pair of English-Vietnamese due to the lack of available Vietnamese language resources such as treebanks, part-of-speech taggers, and parsers. In this paper, we propose a combination model that uses language characteristics based on statistics and the projection method to extract BaseNP correspondences from a bilingual corpus. The language characteristics used in this model include the word segmentation, word order and word classification [1]. Our model overcomes not only the lack of resources of Vietnamese, but also improves the performance of miss-alignment, null-alignment, overlap and conflict projection of the existing methods. The proposed model can be easily applied to other language pairs. Experiment on 66,646 pairs of sentences in the English-Vietnamese bilingual corpus shows that our proposed model is very satisfactory.

Keywords


Npbase, classifiers, word order, NLP

Full Text:

PDF


DOI: https://doi.org/10.15625/1813-9663/30/2/3591

Journal of Computer Science and Cybernetics ISSN: 1813-9663

Published by Vietnam Academy of Science and Technology