Triet M. Thai, Son T. Luu
Author affiliations


  • Triet M. Thai University of Information Technology, Vietnam National University Ho Chi Minh City
  • Son T. Luu University of Information Technology, Vietnam National University, Ho Chi Minh City



Visual Question Answering, Sequence-to-sequence learning, Multilingual, Multimodal.


Visual question answering is a task that requires computers to give correct answers for the input questions based on the images. This task can be solved by humans with ease, but it is a challenge for computers. The VLSP2022-EVJVQA shared task carries the Visual question answering task in the multilingual domain on a newly released dataset UIT-EVJVQA, in which the questions and answers are written in three different languages: English, Vietnamese, and Japanese. We approached the challenge as a sequence-to-sequence learning task, in which we integrated hints from pre-trained state-of-the-art VQA models and image features with a convolutional sequence-to-sequence network to generate the desired answers. Our results obtained up to 0.3442 by F1 score on the public test set and 0.4210 on the private test set.


Metrics Loading ...


S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.

D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2016.

I. Chowdhury, K. Nguyen, C. Fookes, and S. Sridharan, “A cascaded long short-term memory (lstm) driven generic visual question answering (vqa),” in 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 1842–1846.

Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” CoRR, vol. abs/1612.08083, 2016. [Online]. Available:">

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https: //

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” ICLR, 2021.

[ D. Dzendzik, J. Foster, and C. Vogel, “English machine reading comprehension datasets: A survey,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov. 2021, pp. 8784–8804. [Online]. Available: https: //

H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are you talking to a machine? dataset and methods for multilingual image question answering,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, p. 2296–2304.

J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17., 2017, p. 1243–1252.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, p. 6904–6913.

W. He, K. Liu, J. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, X. Liu, T. Wu, and H. Wang, “DuReader: a Chinese machine reading comprehension dataset from real-world applications,” in Proceedings of the Workshop on Machine Reading for Question Answering. Melbourne, Australia: Association for Computational Linguistics, Jul. 2018, pp. 37–46. [Online]. Available:">

D. A. Hudson and C. D. Manning, “Gqa: A new dataset for real-world visual reasoning and compositional question answering,” Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

N. V. Kiet, T. Q. Son, N. T. Luan, H. V. Tin, L. T. Son, and N. L.-T. Ngan, “VLSP 2021 - ViMRC Challenge: Vietnamese Machine Reading Comprehension,” VNU Journal of Science: Computer Science and Communication Engineering, vol. 38, no. 2, 2022.

W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139. PMLR, 18–24 Jul 2021, pp. 5583–5594. [Online]. Available:">

R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei, “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” Int. J. Comput. Vision, vol. 123, no. 1, p. 32–73, may 2017. [Online]. Available:">

S. Lim, M. Kim, and J. Lee, “Korquad1. 0: Korean qa dataset for machine reading comprehen- sion,” arXiv preprint arXiv:1909.07005, 2019.

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Doll ́ar, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.

T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, Sep. 2015, pp. 1412–1421. [Online]. Available:">

K. Nguyen, V. Nguyen, A. Nguyen, and N. Nguyen, “A Vietnamese dataset for evaluating machine reading comprehension,” in Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, Dec. 2020, pp. 2595–2605. [Online]. Available: https://aclanthology. org/2020.coling-main.233

M. V. Nguyen, V. Lai, A. P. B. Veyseh, and T. H. Nguyen, “Trankit: A light-weight transformer- based toolkit for multilingual natural language processing,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstra- tions, 2021.

N. L.-T. Nguyen, N. H. Nguyen, D. T. D. Vo, K. Q. Tran, and K. V. Nguyen, “VLSP 2022 - EVJVQA Challenge: Multilingual visual question answering,” Journal of Computer Science and Cybernetics, 2023.

P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Austin, Texas: Association for Computational Linguistics, Nov. 2016, pp. 2383–2392. [Online]. Available:">

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recog- nition, 2016, pp. 779–788.

A. Rogers, M. Gardner, and I. Augenstein, “Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension,” ACM Comput. Surv., sep 2022. [Online]. Available:">

A. Rogers, O. Kovaleva, and A. Rumshisky, “A primer in BERTology: What we know about how BERT works,” Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020. [Online]. Available:">

N. Shimizu, N. Rong, and T. Miyazaki, “Visual question answering dataset for bilingual image understanding: A study of cross-lingual transfer using attention maps,” in Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1918–1928. [Online]. Available:">

K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recog- nition,” arXiv preprint arXiv:1409.1556, 2014.

B. So, K. Byun, K. Kang, and S. Cho, “Jaquad: Japanese question answering dataset for machine reading comprehension,” arXiv preprint arXiv:2202.01764, 2022.

K. Q. Tran, A. T. Nguyen, A. T.-H. Le, and K. V. Nguyen, “ViVQA: Vietnamese visual question answering,” in Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation. Shanghai, China: Association for Computational Lingustics, 11 2021, pp. 683–691. [Online]. Available:">

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017. [Online]. Available:">

P. Wang, A. Yang, R. Men, J. Lin, S. Bai, Z. Li, J. Ma, C. Zhou, J. Zhou, and H. Yang, “Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework,” CoRR, vol. abs/2202.03052, 2022.




How to Cite