A data challenge for Vietnamese abstractive multi-document summarization

Mai Vu Tran; Hoang Quynh Le; Duy Cat Can; Quoc An Nguyen

doi:10.15625/1813-9663/18291

Author affiliations

Mai Vu Tran, Faculty of Information Technology, VNU University of Engineering and Technology, E3 Building, 144 Xuan Thuy Street, Cau Giay District, Ha Noi, Viet Nam
Hoang Quynh Le, Faculty of Information Technology, VNU University of Engineering and Technology, E3 Building, 144 Xuan Thuy Street, Cau Giay District, Ha Noi, Viet Nam
Duy Cat Can, Faculty of Information Technology, VNU University of Engineering and Technology, E3 Building, 144 Xuan Thuy Street, Cau Giay District, Ha Noi, Viet Nam
Quoc An Nguyen, Faculty of Information Technology, VNU University of Engineering and Technology, E3 Building, 144 Xuan Thuy Street, Cau Giay District, Ha Noi, Viet Nam

Keywords:

Abstractive summarization, Vietnamese summarization dataset, multi-document summarization.

Abstract

This paper provides an overview of the Vietnamese abstractive multi-document summarization shared task (AbMuSu) for Vietnamese news, hosted at the 9th annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of the shared task is to develop automated summarization systems that generate an abstractive summary for a given set of documents on a specific topic: the input consists of several news documents on the same topic, and the output is the corresponding abstractive summary. The AbMuSu shared task focuses solely on Vietnamese news summarization. To this end, a human-annotated dataset comprising 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories, has been developed. Participating models are evaluated and ranked by their ROUGE-2 F1 score, which is the most common evaluation metric for document summarization problems.
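
To make the ranking metric concrete, the following is a minimal sketch of ROUGE-2 F1 for one candidate summary against one reference summary. It is not the official AbMuSu scoring script: the function names `bigrams` and `rouge2_f1` are hypothetical, and lowercased whitespace tokenization is an assumption made for illustration (the shared task may tokenize, normalize, or aggregate over clusters differently).

```python
from collections import Counter


def bigrams(tokens):
    """Return a multiset (Counter) of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: str, reference: str) -> float:
    """ROUGE-2 F1 between one candidate and one reference summary.

    Assumption: tokens are lowercased, whitespace-separated units.
    """
    cand = bigrams(candidate.lower().split())
    ref = bigrams(reference.lower().split())

    # Clipped overlap: each bigram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand.values())  # matched / candidate bigrams
    recall = overlap / sum(ref.values())      # matched / reference bigrams
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge2_f1("a b c d", "a b c e")` shares two of three bigrams on each side, so precision and recall are both 2/3 and the F1 score is 2/3. A system-level score would typically average this value over all test clusters.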
