OHYEAH AT VLSP2022-EVJVQA CHALLENGE: A JOINTLY LANGUAGE-IMAGE MODEL FOR MULTILINGUAL VISUAL QUESTION ANSWERING
DOI: https://doi.org/10.15625/1813-9663/18122
Keywords: Machine reading comprehension, Question answering
Abstract
Multilingual Visual Question Answering (mVQA) is an extremely challenging task that requires answering questions posed in different languages based on the visual context of an image. Addressing it requires combining Natural Language Processing and Computer Vision. In this paper, we propose applying a jointly developed language-image model to the task of multilingual visual question answering. Specifically, we conduct experiments on a multimodal sequence-to-sequence transformer model derived from the T5 encoder-decoder architecture: text tokens and dense image embeddings from a Vision Transformer (ViT) are fed into the encoder, and a decoder is then used to predict discrete text tokens. We achieved an F1-score of 0.4349 on the private test set and ranked 2nd in the EVJVQA task at the VLSP 2022 shared task. To reproduce the results, the code can be found at https://github.com/DinhLuan14/VLSP2022-VQA-OhYeah
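As a rough illustration of the pipeline described in the abstract (not the authors' released code; see the linked repository for the actual implementation), the sketch below assumes Hugging Face Transformers with an mT5 encoder-decoder and a ViT backbone. The checkpoint names and the linear projection that maps ViT patch embeddings into the mT5 embedding space are illustrative assumptions; in practice the projection and both backbones would be fine-tuned on the EVJVQA training data.

```python
# Minimal sketch of a joint text-image encoder-decoder for mVQA:
# ViT patch embeddings are projected to the mT5 hidden size, concatenated
# with the embedded question tokens, and passed to the encoder; the decoder
# then generates the answer text. Checkpoints and the projection layer are
# illustrative assumptions, not the authors' exact configuration.
import torch
from PIL import Image
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          ViTImageProcessor, ViTModel)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
t5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Hypothetical projection from the ViT hidden size to the mT5 embedding size;
# randomly initialized here, it would be learned during fine-tuning.
proj = torch.nn.Linear(vit.config.hidden_size, t5.config.d_model)

def answer(image: Image.Image, question: str) -> str:
    # Dense image embeddings from the ViT encoder (one vector per patch).
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    image_embeds = proj(vit(pixel_values=pixel_values).last_hidden_state)

    # Embedded question tokens from the mT5 input embedding table.
    text = tokenizer(question, return_tensors="pt")
    text_embeds = t5.get_input_embeddings()(text.input_ids)

    # Concatenate image and text embeddings into one encoder input sequence.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(image_embeds.shape[:2], dtype=torch.long), text.attention_mask],
        dim=1,
    )

    # Run the shared encoder once, then decode the answer token by token.
    encoder_outputs = t5.get_encoder()(inputs_embeds=inputs_embeds,
                                       attention_mask=attention_mask)
    output_ids = t5.generate(encoder_outputs=encoder_outputs,
                             attention_mask=attention_mask,
                             max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```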
References
Aishwarya Agrawal et al. “VQA: Visual Question Answering”. In: arXiv preprint arXiv:1505.00468 (2015).
Qi Wu et al. “Visual Question Answering: A Survey of Methods and Datasets”. In: arXiv preprint arXiv:1607.05910 (2016).
Jacob Devlin. “Multilingual BERT README”. 2018. url: https://github.com/google-research/bert/blob/master/multilingual.md.
Alexis Conneau et al. “Unsupervised Cross-lingual Representation Learning at Scale”. In: CoRR abs/1911.02116 (2019). arXiv: 1911.02116. url: http://arxiv.org/abs/1911.02116.
Linting Xue et al. “mT5: A massively multilingual pre-trained text-to-text transformer”. In: CoRR abs/2010.11934 (2020). arXiv: 2010.11934. url: https://arxiv.org/abs/2010.11934.
Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: CoRR abs/2010.11929 (2020). arXiv: 2010.11929. url: https://arxiv.org/abs/2010.11929.
Hangbo Bao, Li Dong, and Furu Wei. “BEiT: BERT Pre-Training of Image Transformers”. In: CoRR abs/2106.08254 (2021). arXiv: 2106.08254. url: https://arxiv.org/abs/2106.08254.
Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”. In: CoRR abs/2103.14030 (2021). arXiv: 2103.14030. url: https://arxiv.org/abs/2103.14030.
Yinhan Liu et al. “Multilingual Denoising Pre-training for Neural Machine Translation”. In: CoRR abs/2001.08210 (2020). arXiv: 2001.08210. url: https://arxiv.org/abs/2001.08210.
Teven Le Scao et al. “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”. In: arXiv preprint arXiv:2211.05100 (2022).
Colin Raffel et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: CoRR abs/1910.10683 (2019). arXiv: 1910.10683. url: http://arxiv.org/abs/1910.10683.
Mike Lewis et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: CoRR abs/1910.13461 (2019). arXiv: 1910.13461. url: http://arxiv.org/abs/1910.13461.
Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805.
Yinhan Liu et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: CoRR abs/1907.11692 (2019). arXiv: 1907.11692. url: http://arxiv.org/abs/1907.11692.
Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: CoRR abs/2005.14165 (2020). arXiv: 2005.14165. url: https://arxiv.org/abs/2005.14165.
Guillaume Lample and Alexis Conneau. “Cross-lingual Language Model Pretraining”. In: CoRR abs/1901.07291 (2019). arXiv: 1901.07291. url: http://arxiv.org/abs/1901.07291.
Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: CoRR abs/1409.0575 (2014). arXiv: 1409.0575. url: http://arxiv.org/abs/1409.0575.
Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.
Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context”. In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. url: http://arxiv.org/abs/1405.0312.
Khanh Quoc Tran et al. “ViVQA: Vietnamese Visual Question Answering”. In: Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation, PACLIC 2021, Shanghai International Studies University, Shanghai, China, 5-7 November 2021. Ed. by Kaibao Hu et al. Association for Computational Linguistics, 2021, pp. 683–691. url: https://aclanthology.org/2021.paclic-1.72.
Yash Goyal et al. “Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering”. In: CoRR abs/1612.00837 (2016). arXiv: 1612.00837. url: http://arxiv.org/abs/1612.00837.
Xi Chen et al. “PaLI: A Jointly-Scaled Multilingual Language-Image Model”. In: arXiv preprint arXiv:2209.06794 (2022).
Luis Perez and Jason Wang. “The Effectiveness of Data Augmentation in Image Classification using Deep Learning”. In: CoRR abs/1712.04621 (2017). arXiv: 1712.04621. url: http://arxiv.org/abs/1712.04621.
Ashish Vaswani et al. “Attention Is All You Need”. In: CoRR abs/1706.03762 (2017). arXiv: 1706.03762. url: http://arxiv.org/abs/1706.03762.
Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks”. In: CoRR abs/1907.12461 (2019). arXiv: 1907.12461. url: http://arxiv.org/abs/1907.12461.
Ngan Luu-Thuy Nguyen et al. “VLSP 2022 - EVJVQA Challenge: Multilingual Visual Question Answering”. In: Journal of Computer Science and Cybernetics (2023). doi: https://doi.org/10.15625/1813-9663/18157.
Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. 2017. doi: 10.48550/ARXIV.1711.05101. url: https://arxiv.org/abs/1711.05101.
Wonjae Kim, Bokyung Son, and Ildoo Kim. “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”. In: International Conference on Machine Learning. PMLR. 2021, pp. 5583–5594.