[1] Thinh, N.V. et al. 2025. RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge. Vietnam Journal of Science and Technology. 64, 1 (Jul. 2025), 123–138. DOI: https://doi.org/10.15625/2525-2518/22381.