Khang Nguyen
Author affiliations


  • Khang Nguyen, University of Information Technology, Ho Chi Minh City, Vietnam National University, Ho Chi Minh City, Vietnam




Grid features, Region features, Image captioning, VieCap4H, UIT-ViIC, Faster R-CNN, Cascade R-CNN, Grid R-CNN, VinVL.


Image captioning remains a challenging and actively studied task in the 2020s. The input is an image, and the output is a generated caption that describes the context of the input image. In this study, I focus on image captioning in Vietnamese. Specifically, I present an empirical study of feature extraction approaches that use current state-of-the-art object detection methods to represent images in the model space. Each feature type is used to train a Transformer-based captioning model. I investigate the effectiveness of the different feature types on two standard Vietnamese benchmark datasets, UIT-ViIC and VieCap4H. The experimental results provide important insight into the feature extraction task for image captioning in Vietnamese.








How to Cite

K. Nguyen, “EMPIRICAL STUDY OF FEATURE EXTRACTION APPROACHES FOR IMAGE CAPTIONING IN VIETNAMESE”, J. Comput. Sci. Cybern., vol. 38, no. 4, pp. 327–346, Dec. 2022.


