Thinh, N. V., Lang, T. V., & Thanh, V. T. (2025). RGTranCNet: Effective image captioning model using cross-attention and semantic knowledge. Vietnam Journal of Science and Technology, 64(1), 123–138. https://doi.org/10.15625/2525-2518/22381