A data challenge for Vietnamese abstractive multi-document summarization
DOI:
https://doi.org/10.15625/1813-9663/18291
Keywords:
Abstractive summarization, Vietnamese summarization dataset, multi-document summarization.
Abstract
This paper gives an overview of the abstractive multi-document summarization shared task (AbMuSu) for Vietnamese news, hosted at the 9th annual workshop on Vietnamese Language and Speech Processing (VLSP 2022). The goal of the shared task is to develop automated systems that generate an abstractive summary for a given set of documents on a specific topic: the input is several news documents on the same topic, and the output is a single abstractive summary of them. The task focuses solely on Vietnamese news. To this end, a human-annotated dataset of 1,839 documents in 600 clusters, collected from Vietnamese news in 8 categories, was developed. Participating models are evaluated and ranked by their ROUGE-2 F1 score, one of the most common evaluation metrics for document summarization.
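To make the ranking metric concrete, the following is a minimal sketch of how a ROUGE-2 F1 score can be computed from bigram overlap between a system summary and a reference summary. It is not the official AbMuSu scorer: the lowercasing, the whitespace tokenization, and the function names `bigrams` and `rouge2_f1` are assumptions made for illustration, and the shared task's actual preprocessing (for example, Vietnamese word segmentation) is not specified here.

```python
from collections import Counter


def bigrams(text):
    """Count adjacent token pairs; tokenization is a plain whitespace split (an assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate, reference):
    """Illustrative ROUGE-2 F1: harmonic mean of bigram precision and recall."""
    cand, ref = bigrams(candidate), bigrams(reference)
    # Clipped overlap: each bigram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of the three bigrams match, so precision = recall = 2/3.
    print(round(rouge2_f1("a b c e", "a b c d"), 3))
```

With a single reference per cluster, a system's overall score under this sketch would be the average of `rouge2_f1` over all clusters; that averaging scheme is likewise an assumption rather than a detail given in the abstract.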
License
1. We hereby assign copyright of our article (the Work) in all forms of media, whether now known or hereafter developed, to the Journal of Computer Science and Cybernetics. We understand that the Journal of Computer Science and Cybernetics will act on my/our behalf to publish, reproduce, distribute and transmit the Work.
2. This assignment of copyright to the Journal of Computer Science and Cybernetics is done so on the understanding that permission from the Journal of Computer Science and Cybernetics is not required for me/us to reproduce, republish or distribute copies of the Work in whole or in part. We will ensure that all such copies carry a notice of copyright ownership and reference to the original journal publication.
3. We warrant that the Work is our results and has not been published before in its current or a substantially similar form and is not under consideration for another publication, does not contain any unlawful statements and does not infringe any existing copyright.
4. We also warrant that we have obtained the necessary permission from the copyright holder/s to reproduce in the article any materials, including tables, diagrams or photographs, not owned by me/us.