Thinh, Nguyen Van, et al. “RGTranCNet: Effective Image Captioning Model Using Cross-Attention and Semantic Knowledge.” Vietnam Journal of Science and Technology, vol. 64, no. 1, July 2025, pp. 123–138, doi:10.15625/2525-2518/22381.