OD-VR-Cap: Image captioning based on detecting and predicting relationships between objects

Nguyen Van Thinh, Tran Van Lang, Van The Thanh
Authors

  • Nguyen Van Thinh, Institute of Mechanics and Applied Informatics, Vietnam Academy of Science and Technology (VAST), 291 Dien Bien Phu Street, District 3, Ho Chi Minh City, Viet Nam
  • Tran Van Lang, Journal Editorial Department, HCMC University of Foreign Languages and Information Technology (HUFLIT), 828 Su Van Hanh, District 10, Ho Chi Minh City, Viet Nam, https://orcid.org/0000-0002-8925-5549
  • Van The Thanh, Faculty of Information Technology, HCMC University of Education (HCMUE), 280 An Duong Vuong, District 5, Ho Chi Minh City, Viet Nam

DOI:

https://doi.org/10.15625/1813-9663/20929

Keywords:

Image captioning, object detection, visual relationship, attention mechanism, deep neural network.

Abstract

Recent image captioning methods often focus on global features or on individual object regions of the image without exploiting the relational information between them, which limits their accuracy. This paper proposes an image captioning model that leverages the relationships between objects in an image to capture its content more fully and thereby improve accuracy. The approach proceeds in four steps. First, objects in the image are detected using an object detection model combined with a graph convolutional network (GCN). Next, a relationship prediction model based on relational context and prior knowledge classifies the relationships between the detected objects and assembles them into a relationship graph that represents the image. A dual attention mechanism is then built so that the model can focus on the relevant object regions and on the relevant vertices of the relationship graph while generating captions. Finally, an LSTM network equipped with this dual attention is trained to generate captions from the image representation and the given ground-truth captions. Experiments on the MS COCO and Visual Genome datasets show that the proposed model achieves higher accuracy than baseline methods and several recently published works.
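
To make the decoding step concrete, below is a minimal PyTorch-style sketch of a dual-attention LSTM decoder of the kind the abstract describes: one additive attention head over the detected object-region features and a second over the GCN-produced relationship-graph vertex features, with both attended contexts fed into the LSTM at every step. All names here (DualAttentionDecoder, region_feats, graph_feats, feat_dim, and so on) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class DualAttentionDecoder(nn.Module):
        """Sketch of a dual-attention LSTM decoder: one additive attention
        head over object-region features and one over relationship-graph
        vertex features, fused at every decoding step (illustrative only)."""

        def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # One scorer per modality (Bahdanau-style additive attention).
            self.region_scorer = nn.Linear(feat_dim + hidden_dim, 1)
            self.graph_scorer = nn.Linear(feat_dim + hidden_dim, 1)
            # The LSTM consumes the word embedding plus both attended contexts.
            self.lstm = nn.LSTMCell(embed_dim + 2 * feat_dim, hidden_dim)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def _attend(self, scorer, feats, h):
            # feats: (B, N, feat_dim); h: (B, hidden_dim)
            h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            scores = scorer(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
            alpha = torch.softmax(scores, dim=1)               # attention weights over N items
            return (alpha.unsqueeze(-1) * feats).sum(dim=1)    # context vector (B, feat_dim)

        def forward(self, region_feats, graph_feats, captions):
            # region_feats: (B, R, feat_dim) from the object detector;
            # graph_feats:  (B, V, feat_dim) from the GCN over the relationship graph;
            # captions:     (B, T) token ids of the ground-truth caption (teacher forcing).
            B, T = captions.shape
            h = region_feats.new_zeros(B, self.lstm.hidden_size)
            c = torch.zeros_like(h)
            logits = []
            for t in range(T):
                ctx_r = self._attend(self.region_scorer, region_feats, h)
                ctx_g = self._attend(self.graph_scorer, graph_feats, h)
                x = torch.cat([self.embed(captions[:, t]), ctx_r, ctx_g], dim=-1)
                h, c = self.lstm(x, (h, c))
                logits.append(self.out(h))
            return torch.stack(logits, dim=1)  # (B, T, vocab_size)

In training, the stacked logits would be scored against the next-token targets with cross-entropy under teacher forcing, matching the abstract's description of training on the image representation together with the given captions.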

Published

03-12-2024

How to Cite

[1]
Nguyen Van Thinh, T. V. Lang, and V. T. Thanh, “OD-VR-Cap: Image captioning based on detecting and predicting relationships between objects”, J. Comput. Sci. Cybern., vol. 40, no. 4, pp. 327–346, Dec. 2024.

Issue

Vol. 40 No. 4 (2024)

Section

Articles