IN-ORDER TRANSITION-BASED PARSING FOR VIETNAMESE
John Bauer, et al.
DOI: https://doi.org/10.15625/1813-9663/18363

Keywords: Constituency parsing, Vietnamese constituency parsing, Transition parsing, Parser, Dynamic oracle.

Abstract
In this paper, we implement a general neural constituency parser based on an in-order parser. We apply this parser to the VLSP 2022 Vietnamese treebank, obtaining a test score of 0.8393 F1, the top result on the private test leaderboard. Earlier versions of the parser for languages other than Vietnamese are already included in the publicly released Python package Stanza [35]. The next Stanza release will include the Vietnamese model, along with all of the code used in this project.
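Since the parser is distributed through Stanza, the sketch below shows how the released Vietnamese model could be invoked once it ships. This is a minimal sketch, assuming the Vietnamese constituency model is available in the installed Stanza release; the example sentence is purely illustrative.

    # Minimal usage sketch for the Vietnamese constituency parser in Stanza.
    # Assumes a Stanza release that includes the Vietnamese constituency model.
    import stanza

    stanza.download("vi")  # fetch the Vietnamese models (one-time)

    # The constituency processor runs on top of tokenization and POS tagging.
    nlp = stanza.Pipeline("vi", processors="tokenize,pos,constituency")

    doc = nlp("Tôi thích đọc sách.")  # "I like reading books."
    for sentence in doc.sentences:
        # Each parsed sentence carries its predicted phrase-structure tree.
        print(sentence.constituency)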
References
P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, 2020. doi: 10.48550/ARXIV.2003.07082. [Online]. Available: https://arxiv.org/abs/2003.07082.
J. Liu and Y. Zhang, “In-order transition-based constituent parsing,” Transactions of
the Association for Computational Linguistics, vol. 5, pp. 413–424, 2017. doi: 10.1162/
tacl_a_00070. [Online]. Available: https://aclanthology.org/Q17-1029.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated
corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2,
pp. 313–330, 1993. [Online]. Available: https://aclanthology.org/J93-2004.
N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, “The Penn Chinese Treebank: Phrase structure annotation of a large corpus,” Natural Language Engineering, vol. 11, pp. 207–238, 2005.
J. Silva, A. Branco, S. Castro, and R. Reis, Out-of-the-Box Robust Parsing of Portuguese,
Springer, Berlin, 2010.
N. Kara, B. Marşan, M. Özçelik, et al., “Creating a syntactically felicitous constituency treebank for Turkish,” in 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), 2020, pp. 1–6. doi: 10.1109/ASYU50717.2020.9259873.
Y. K. Thu, W. P. Pa, M. Utiyama, A. Finch, and E. Sumita, “Introducing the Asian language treebank (ALT),” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia: European Language Resources Association (ELRA), May 2016, pp. 1574–1578. [Online]. Available: https://aclanthology.org/L16-1249.
R. Delmonte, A. Bristot, and S. Tonelli, “VIT – Venice Italian Treebank: Syntactic and
quantitative features,” in Proceedings of the Sixth International Workshop on Treebanks
and Linguistic Theories, 2007, pp. 43–54.
C. Bosco, “Multiple-step treebank conversion: From dependency to Penn format,” in Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic: Association for Computational Linguistics, Jun. 2007, pp. 164–167. [Online]. Available: https://aclanthology.org/W07-1526.
E. Bick, “Arboretum, a hybrid treebank for Danish,” in Proceedings of Treebanks and Linguistic Theory, J. Nivre and E. Hinrichs, Eds., 2003, pp. 9–20.
M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu, “Fast and accurate shift-reduce constituent parsing,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 434–443. [Online]. Available: https://aclanthology.org/P13-1043.
C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, “Recurrent neural network
grammars,” in Proceedings of the 2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, San
Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 199–209.
doi: 10.18653/v1/N16-1024. [Online]. Available: https://aclanthology.org/N16-1024.
K. Yang and J. Deng, “Strongly incremental constituency parsing with graph neural
networks,” in Neural Information Processing Systems (NeurIPS), 2020.
T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in International Conference on Learning Representations, 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:5959482.
J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162. [Online]. Available: https://aclanthology.org/D14-1162.
K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China: Association for Computational Linguistics, Jul. 2015, pp. 1556–1566. doi: 10.3115/v1/P15-1150. [Online]. Available: https://aclanthology.org/P15-1150.
A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
K. Fukushima, “Cognitron: A self-organizing multilayered neural network,” Biological
Cybernetics, vol. 20, no. 3, pp. 121–136, 1975.
D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), 2016. doi: 10.48550/ARXIV.1606.08415. [Online]. Available: https://arxiv.org/abs/1606.08415.
D. Misra, Mish: A self regularized non-monotonic activation function, 2019. doi: 10.48550/ARXIV.1908.08681. [Online]. Available: https://arxiv.org/abs/1908.08681.
S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, 2017. doi: 10.48550/ARXIV.1702.03118. [Online]. Available: https://arxiv.org/abs/1702.03118.
D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network
learning by exponential linear units (ELUs), 2015. doi: 10.48550/ARXIV.1511.07289.
[Online]. Available: https://arxiv.org/abs/1511.07289.
A. G. Howard, M. Zhu, B. Chen, et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017. doi: 10.48550/ARXIV.1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861.
N. Kitaev and D. Klein, Constituency parsing with a self-attentive encoder, 2018. doi: 10.48550/ARXIV.1805.01052. [Online]. Available: https://arxiv.org/abs/1805.01052.
K. Mrini, F. Dernoncourt, Q. Tran, T. Bui, W. Chang, and N. Nakashole, Rethinking self-attention: Towards interpretability in neural parsing, 2019. doi: 10.48550/ARXIV.1911.03875. [Online]. Available: https://arxiv.org/abs/1911.03875.
M. D. Zeiler, Adadelta: An adaptive learning rate method, Dec. 2012. [Online]. Available:
https://arxiv.org/abs/1212.5701.
I. Loshchilov and F. Hutter, Decoupled weight decay regularization, 2017. doi: 10.48550/
ARXIV.1711.05101. [Online]. Available: https://arxiv.org/abs/1711.05101.
A. Defazio and S. Jelassi, Adaptivity without compromise: A momentumized, adaptive,
dual averaged gradient method for stochastic optimization, 2021. arXiv: 2101.11075
[cs.LG].
D. Zeman, J. Hajič, M. Popel, et al., “CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies,” in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1–21. doi: 10.18653/v1/K18-2001. [Online]. Available: https://aclanthology.org/K18-2001.
A. T. Nguyen, M. H. Dao, and D. Q. Nguyen, “A pilot study of text-to-SQL semantic
parsing for Vietnamese,” in Findings of the Association for Computational Linguistics:
EMNLP 2020, 2020, pp. 4079–4085.
A. Akbik, D. Blythe, and R. Vollgraf, “Contextual string embeddings for sequence labeling,” in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1638–1649. [Online]. Available: https://aclanthology.org/C18-1139.
J. Abadji, P. Ortiz Suarez, L. Romary, and B. Sagot, “Towards a cleaner document-oriented multilingual crawled corpus,” arXiv e-prints, arXiv:2201.06642, Jan. 2022. arXiv: 2201.06642 [cs.CL].
D. Q. Nguyen and A. T. Nguyen, “PhoBERT: Pre-trained language models for Vietnamese,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1037–1042.
N. L. Tran, D. M. Le, and D. Q. Nguyen, “BARTpho: Pre-trained sequence-to-sequence
models for Vietnamese,” in Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022.
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
arXiv: 1810.04805. [Online]. Available: http://arxiv.org/abs/1810.04805.
Y. Liu, M. Ott, N. Goyal, et al., RoBERTa: A robustly optimized BERT pretraining approach, 2019. doi: 10.48550/ARXIV.1907.11692. [Online]. Available: https://arxiv.org/abs/1907.11692.
L. Parisi, S. Francia, and P. Magnani, UmBERTo: An Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
F. Souza, R. Nogueira, and R. Lotufo, “BERTimbau: Pretrained BERT models for
Brazilian Portuguese,” in 9th Brazilian Conference on Intelligent Systems, BRACIS,
Rio Grande do Sul, Brazil, October 20-23 (to appear), 2020.
S. Schweter, BERTurk – BERT models for Turkish, version 1.0.0, Apr. 2020. doi: 10.5281/zenodo.3770924. [Online]. Available: https://doi.org/10.5281/zenodo.3770924.
Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, “Revisiting pre-trained models for Chinese natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, Online: Association for Computational Linguistics, Nov. 2020, pp. 657–668. [Online]. Available: https://www.aclweb.org/anthology/2020.findings-emnlp.58.
G. Attardi, Wikiextractor, https://github.com/attardi/wikiextractor, 2015.
D. Q. Nguyen, D. Q. Nguyen, T. Vu, M. Dras, and M. Johnson, “A fast and accurate
Vietnamese word segmenter,” in Proceedings of the 11th International Conference on
Language Resources and Evaluation (LREC 2018), 2018, pp. 2582–2587.
D. K. Choe and E. Charniak, “Parsing as language modeling,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2331–2336. doi: 10.18653/v1/D16-1257. [Online]. Available: https://aclanthology.org/D16-1257.
K. Nguyen, V. Nguyen, A. Nguyen, and N. Nguyen, “A Vietnamese dataset for evaluating machine reading comprehension,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 2595–2605. doi: 10.18653/v1/2020.coling-main.233. [Online]. Available: https://aclanthology.org/2020.coling-main.233.
Y. Goldberg and J. Nivre, “A dynamic oracle for arc-eager dependency parsing,” in
COLING, 2012.
M. Coavoux and B. Crabbé, “Neural greedy constituent parsing with dynamic oracles,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 172–182. doi: 10.18653/v1/P16-1017. [Online]. Available: https://aclanthology.org/P16-1017.
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is made on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work presents our own results, has not been published before in its current or a substantially similar form, is not under consideration by another publication, does not contain any unlawful statements, and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials, including tables, diagrams or photographs, not owned by me/us.