IN-ORDER TRANSITION-BASED PARSING FOR VIETNAMESE
John Bauer, et al.
DOI: https://doi.org/10.15625/1813-9663/18363

Keywords: Constituency parsing, Vietnamese constituency parsing, Transition parsing, Parser, Dynamic oracle.

Abstract
In this paper, we implement a general neural constituency parser based on an in-order parser. We apply this parser to the VLSP 2022 Vietnamese treebank, obtaining a test score of 0.8393 F1, the top result on the private test leaderboard. Earlier versions of the parser for languages other than Vietnamese are already included in the publicly released Python package Stanza [35]. The next Stanza release will include the Vietnamese model, along with all of the code used in this project.
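Since the parser is distributed through Stanza, the sketch below shows how the released Vietnamese model could be invoked once it ships. This is a minimal sketch, assuming the Vietnamese constituency model is available in the installed Stanza release; the example sentence is purely illustrative.

    # Minimal usage sketch for the Vietnamese constituency parser in Stanza.
    # Assumes a Stanza release that includes the Vietnamese constituency model.
    import stanza

    stanza.download("vi")  # fetch the Vietnamese models (one-time)

    # The constituency processor runs on top of tokenization and POS tagging.
    nlp = stanza.Pipeline("vi", processors="tokenize,pos,constituency")

    doc = nlp("Tôi thích đọc sách.")  # "I like reading books."
    for sentence in doc.sentences:
        # Each parsed sentence carries its predicted phrase-structure tree.
        print(sentence.constituency)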
References
P. Qi, Y. Zhang, Y. Zhang, J. Bolton, and C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, 2020. doi: 10.48550/ARXIV.2003.07082. [Online]. Available: https://arxiv.org/abs/2003.07082.
J. Liu and Y. Zhang, “In-order transition-based constituent parsing,” Transactions of
the Association for Computational Linguistics, vol. 5, pp. 413–424, 2017. doi: 10.1162/
tacl_a_00070. [Online]. Available: https://aclanthology.org/Q17-1029.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated
corpus of English: The Penn Treebank,” Computational Linguistics, vol. 19, no. 2,
pp. 313–330, 1993. [Online]. Available: https://aclanthology.org/J93-2004.
N. Xue, F. Xia, F.-D. Chiou, and M. Palmer, “The Penn Chinese Treebank: Phrase structure annotation of a large corpus,” Natural Language Engineering, vol. 11, pp. 207–238, 2005.
J. Silva, A. Branco, S. Castro, and R. Reis, Out-of-the-Box Robust Parsing of Portuguese,
Springer, Berlin, 2010.
N. Kara, B. Marşan, M. Özçelik, et al., “Creating a syntactically felicitous constituency treebank for Turkish,” in 2020 Innovations in Intelligent Systems and Applications Conference (ASYU), 2020, pp. 1–6. doi: 10.1109/ASYU50717.2020.9259873.
Y. K. Thu, W. P. Pa, M. Utiyama, A. Finch, and E. Sumita, “Introducing the Asian language treebank (ALT),” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia: European Language Resources Association (ELRA), May 2016, pp. 1574–1578. [Online]. Available: https://aclanthology.org/L16-1249.
R. Delmonte, A. Bristot, and S. Tonelli, “VIT – Venice Italian Treebank: Syntactic and
quantitative features,” in Proceedings of the Sixth International Workshop on Treebanks
and Linguistic Theories, 2007, pp. 43–54.
C. Bosco, “Multiple-step treebank conversion: From dependency to Penn format,” in Proceedings of the Linguistic Annotation Workshop, Prague, Czech Republic: Association for Computational Linguistics, Jun. 2007, pp. 164–167. [Online]. Available: https://aclanthology.org/W07-1526.
E. Bick, “Arboretum, a hybrid treebank for Danish,” in Proceedings of Treebanks and Linguistic Theory, J. Nivre and E. Hinrichs, Eds., 2003, pp. 9–20.
M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu, “Fast and accurate shift-reduce constituent parsing,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria: Association for Computational Linguistics, Aug. 2013, pp. 434–443. [Online]. Available: https://aclanthology.org/P13-1043.
C. Dyer, A. Kuncoro, M. Ballesteros, and N. A. Smith, “Recurrent neural network
grammars,” in Proceedings of the 2016 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language Technologies, San
Diego, California: Association for Computational Linguistics, Jun. 2016, pp. 199–209.
doi: 10.18653/v1/N16-1024. [Online]. Available: https://aclanthology.org/N16-1024.
K. Yang and J. Deng, “Strongly incremental constituency parsing with graph neural
networks,” in Neural Information Processing Systems (NeurIPS), 2020.
T. Mikolov, K. Chen, G. S. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in International Conference on Learning Representations, 2013. [Online]. Available: https://api.semanticscholar.org/CorpusID:5959482.
J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1532–1543. doi: 10.3115/v1/D14-1162. [Online]. Available: https://aclanthology.org/D14-1162.
K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China: Association for Computational Linguistics, Jul. 2015, pp. 1556–1566. doi: 10.3115/v1/P15-1150. [Online]. Available: https://aclanthology.org/P15-1150.
A. Paszke, S. Gross, F. Massa, et al., “PyTorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
K. Fukushima, “Cognitron: A self-organizing multilayered neural network,” Biological
Cybernetics, vol. 20, no. 3, pp. 121–136, 1975.
D. Hendrycks and K. Gimpel, Gaussian error linear units (GELUs), 2016. doi: 10.48550/ARXIV.1606.08415. [Online]. Available: https://arxiv.org/abs/1606.08415.
D. Misra, Mish: A self regularized non-monotonic activation function, 2019. doi: 10.48550/ARXIV.1908.08681. [Online]. Available: https://arxiv.org/abs/1908.08681.
S. Elfwing, E. Uchibe, and K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, 2017. doi: 10.48550/ARXIV.1702.03118. [Online]. Available: https://arxiv.org/abs/1702.03118.
D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network
learning by exponential linear units (ELUs), 2015. doi: 10.48550/ARXIV.1511.07289.
[Online]. Available: https://arxiv.org/abs/1511.07289.
A. G. Howard, M. Zhu, B. Chen, et al., MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017. doi: 10.48550/ARXIV.1704.04861. [Online]. Available: https://arxiv.org/abs/1704.04861.
N. Kitaev and D. Klein, Constituency parsing with a self-attentive encoder, 2018. doi: 10.48550/ARXIV.1805.01052. [Online]. Available: https://arxiv.org/abs/1805.01052.
K. Mrini, F. Dernoncourt, Q. Tran, T. Bui, W. Chang, and N. Nakashole, Rethinking self-attention: Towards interpretability in neural parsing, 2019. doi: 10.48550/ARXIV.1911.03875. [Online]. Available: https://arxiv.org/abs/1911.03875.
M. D. Zeiler, Adadelta: An adaptive learning rate method, Dec. 2012. [Online]. Available:
https://arxiv.org/abs/1212.5701.
I. Loshchilov and F. Hutter, Decoupled weight decay regularization, 2017. doi: 10.48550/
ARXIV.1711.05101. [Online]. Available: https://arxiv.org/abs/1711.05101.
A. Defazio and S. Jelassi, Adaptivity without compromise: A momentumized, adaptive,
dual averaged gradient method for stochastic optimization, 2021. arXiv: 2101.11075
[cs.LG].
D. Zeman, J. Hajič, M. Popel, et al., “CoNLL 2018 shared task: Multilingual parsing from raw text to Universal Dependencies,” in Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 1–21. doi: 10.18653/v1/K18-2001. [Online]. Available: https://aclanthology.org/K18-2001.
A. T. Nguyen, M. H. Dao, and D. Q. Nguyen, “A pilot study of text-to-SQL semantic
parsing for Vietnamese,” in Findings of the Association for Computational Linguistics:
EMNLP 2020, 2020, pp. 4079–4085.
A. Akbik, D. Blythe, and R. Vollgraf, “Contextual string embeddings for sequence labeling,” in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1638–1649. [Online]. Available: https://aclanthology.org/C18-1139.
J. Abadji, P. Ortiz Suarez, L. Romary, and B. Sagot, “Towards a cleaner document-oriented multilingual crawled corpus,” arXiv e-prints, arXiv:2201.06642, Jan. 2022. arXiv: 2201.06642 [cs.CL].
D. Q. Nguyen and A. T. Nguyen, “PhoBERT: Pre-trained language models for Vietnamese,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1037–1042.
N. L. Tran, D. M. Le, and D. Q. Nguyen, “BARTpho: Pre-trained sequence-to-sequence
models for Vietnamese,” in Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022.
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
arXiv: 1810.04805. [Online]. Available: http://arxiv.org/abs/1810.04805.
Y. Liu, M. Ott, N. Goyal, et al., RoBERTa: A robustly optimized BERT pretraining approach, 2019. doi: 10.48550/ARXIV.1907.11692. [Online]. Available: https://arxiv.org/abs/1907.11692.
L. Parisi, S. Francia, and P. Magnani, UmBERTo: An Italian language model trained with whole word masking, https://github.com/musixmatchresearch/umberto, 2020.
F. Souza, R. Nogueira, and R. Lotufo, “BERTimbau: Pretrained BERT models for
Brazilian Portuguese,” in 9th Brazilian Conference on Intelligent Systems, BRACIS,
Rio Grande do Sul, Brazil, October 20-23 (to appear), 2020.
S. Schweter, BERTurk – BERT models for Turkish, version 1.0.0, Apr. 2020. doi: 10.5281/zenodo.3770924. [Online]. Available: https://doi.org/10.5281/zenodo.3770924.
Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu, “Revisiting pre-trained models for Chinese natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, Online: Association for Computational Linguistics, Nov. 2020, pp. 657–668. [Online]. Available: https://www.aclweb.org/anthology/2020.findings-emnlp.58.
G. Attardi, Wikiextractor, https://github.com/attardi/wikiextractor, 2015.
D. Q. Nguyen, D. Q. Nguyen, T. Vu, M. Dras, and M. Johnson, “A fast and accurate
Vietnamese word segmenter,” in Proceedings of the 11th International Conference on
Language Resources and Evaluation (LREC 2018), 2018, pp. 2582–2587.
D. K. Choe and E. Charniak, “Parsing as language modeling,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2331–2336. doi: 10.18653/v1/D16-1257. [Online]. Available: https://aclanthology.org/D16-1257.
K. Nguyen, V. Nguyen, A. Nguyen, and N. Nguyen, “A Vietnamese dataset for evaluating machine reading comprehension,” in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 2595–2605. doi: 10.18653/v1/2020.coling-main.233. [Online]. Available: https://aclanthology.org/2020.coling-main.233.
Y. Goldberg and J. Nivre, “A dynamic oracle for arc-eager dependency parsing,” in
COLING, 2012.
M. Coavoux and B. Crabbé, “Neural greedy constituent parsing with dynamic oracles,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 172–182. doi: 10.18653/v1/P16-1017. [Online]. Available: https://aclanthology.org/P16-1017.
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is made on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work presents our own results, has not been published before in its current or a substantially similar form, is not under consideration by another publication, does not contain any unlawful statements, and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials, including tables, diagrams or photographs, not owned by me/us.