OD-VR-Cap: Image captioning based on detecting and predicting relationships between objects
DOI: https://doi.org/10.15625/1813-9663/20929
Keywords: Image captioning, object detection, visual relationship, attention mechanism, deep neural network.
Abstract
Recent image captioning works often focus on global features or on individual object regions within an image, without exploiting the relational information between them, which limits their accuracy. This paper proposes an image captioning model that leverages the relationships between objects in the image to better capture its content and improve caption quality. The approach proceeds in four steps. First, objects in the image are detected by an object detection model combined with a graph convolutional network (GCN). Second, a relationship prediction model based on relational context information and prior knowledge classifies the relationships between the detected objects, producing a relationship graph that represents the image. Third, a dual attention mechanism is built so that, while generating captions, the model can focus on the relevant object regions and on the relevant vertices of the relationship graph. Finally, an LSTM network with dual attention is trained to generate captions from the image representation and the given ground-truth captions. Experiments on the MS COCO and Visual Genome datasets show that the proposed model achieves higher accuracy than baseline methods and several recently published works.
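To make the dual attention step concrete, below is a minimal, hypothetical PyTorch sketch of one decoding step: the LSTM hidden state attends separately to object-region features and to relationship-graph vertex features, and both context vectors are fed to the LSTM cell together with the previous word. All module names, feature dimensions, and the additive attention form are illustrative assumptions for exposition; the paper does not publish this code.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Bahdanau-style attention over one feature set (an assumed formulation).
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, n, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.hidden_proj(hidden).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)        # attention weights over n items
        return (alpha * feats).sum(dim=1)      # context vector: (batch, feat_dim)

class DualAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, feat_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.obj_attn = AdditiveAttention(feat_dim, hidden_dim, 256)  # object regions
        self.rel_attn = AdditiveAttention(feat_dim, hidden_dim, 256)  # graph vertices
        self.lstm = nn.LSTMCell(emb_dim + 2 * feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, obj_feats, rel_feats, h, c):
        # Attend to object-region features and relationship-graph vertex
        # features separately, then feed both contexts plus the previous
        # word embedding to the LSTM cell.
        obj_ctx = self.obj_attn(obj_feats, h)
        rel_ctx = self.rel_attn(rel_feats, h)
        x = torch.cat([self.embed(prev_word), obj_ctx, rel_ctx], dim=-1)
        h, c = self.lstm(x, (h, c))
        return self.out(h), h, c               # word logits and new LSTM state

# Toy usage: batch of 4 images, 10 detected regions, 6 graph vertices.
decoder = DualAttentionDecoder(vocab_size=1000)
obj_feats = torch.randn(4, 10, 512)
rel_feats = torch.randn(4, 6, 512)
h = torch.zeros(4, 512)
c = torch.zeros(4, 512)
prev_word = torch.zeros(4, dtype=torch.long)   # e.g. <start> token ids
logits, h, c = decoder.step(prev_word, obj_feats, rel_feats, h, c)

Keeping the two attention modules separate lets the decoder weight object regions and graph vertices independently at each word, which is the intuition behind the dual attention mechanism described in the abstract.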