EMPIRICAL STUDY OF FEATURE EXTRACTION APPROACHES FOR IMAGE CAPTIONING IN VIETNAMESE
DOI: https://doi.org/10.15625/1813-9663/38/4/17548

Keywords: Grid features, Region features, Image captioning, VieCap4H, UIT-ViIC, Faster R-CNN, Cascade R-CNN, Grid R-CNN, VinVL

Abstract
Image captioning is a challenging task that is still being actively addressed in the 2020s. The input is an image, and the output is a generated caption describing the content of that image. In this study, I focus on image captioning in Vietnamese. Specifically, I present an empirical study of feature extraction approaches that use current state-of-the-art object detection methods to represent images in the model space. Each feature type is used to train a Transformer-based captioning model. I investigate the effectiveness of the different feature types on two standard Vietnamese benchmark datasets, UIT-ViIC and VieCap4H. The experimental results provide important insights into feature extraction for image captioning in Vietnamese.
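To make the described pipeline concrete, below is a minimal PyTorch sketch, not the exact model used in this study: it extracts grid features from a plain ResNet-50 backbone (whereas the study uses detector backbones such as Faster R-CNN, Cascade R-CNN, Grid R-CNN, and VinVL to obtain grid or region features) and decodes a caption with a Transformer decoder. The class name GridFeatureCaptioner, all hyperparameters, and the random-data smoke test are illustrative assumptions.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class GridFeatureCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8, num_layers=3, max_len=40):
        super().__init__()
        cnn = resnet50(weights=None)  # untrained toy backbone; a pretrained detector backbone would be used in practice
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # keep the final conv feature map
        self.proj = nn.Linear(2048, d_model)                       # map each 2048-d grid cell into the model space
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))     # learned positional encoding for caption tokens
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, H, W); captions: (B, T) token ids of the ground-truth caption
        grid = self.backbone(images)                 # (B, 2048, h, w) grid features
        grid = grid.flatten(2).transpose(1, 2)       # (B, h*w, 2048), one token per grid cell
        memory = self.proj(grid)                     # (B, h*w, d_model)
        T = captions.size(1)
        tgt = self.embed(captions) + self.pos[:T]
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)  # block attention to future tokens
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out(hidden)                      # (B, T, vocab_size) next-token logits

# Smoke test on random data
model = GridFeatureCaptioner()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])

Region features would replace the flattened grid with per-object feature vectors pooled from detector proposals, but the decoder side stays the same.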