OHYEAH AT VLSP2022-EVJVQA CHALLENGE: A JOINTLY LANGUAGE-IMAGE MODEL FOR MULTILINGUAL VISUAL QUESTION ANSWERING
DOI: https://doi.org/10.15625/1813-9663/18122
Keywords: Machine reading comprehension, Question answering
Abstract
Multilingual Visual Question Answering (mVQA) is an extremely challenging task that requires answering questions posed in different languages based on the visual context of an image. Addressing it requires combining Natural Language Processing and Computer Vision. In this paper, we propose applying a jointly developed language-image model to the task of multilingual visual question answering. Specifically, we conduct experiments on a multimodal sequence-to-sequence transformer model derived from the T5 encoder-decoder architecture: text tokens and dense image embeddings from a Vision Transformer (ViT) are fed into the encoder, and a decoder is then used to predict discrete text tokens. We achieved an F1-score of 0.4349 on the private test set and ranked 2nd in the EVJVQA task at the VLSP 2022 shared task. To reproduce the results, the code can be found at https://github.com/DinhLuan14/VLSP2022-VQA-OhYeah
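As a rough illustration of the pipeline described in the abstract (not the authors' released code; see the linked repository for the actual implementation), the sketch below assumes Hugging Face Transformers with an mT5 encoder-decoder and a ViT backbone. The checkpoint names and the linear projection that maps ViT patch embeddings into the mT5 embedding space are illustrative assumptions; in practice the projection and both backbones would be fine-tuned on the EVJVQA training data.

```python
# Minimal sketch of a joint text-image encoder-decoder for mVQA:
# ViT patch embeddings are projected to the mT5 hidden size, concatenated
# with the embedded question tokens, and passed to the encoder; the decoder
# then generates the answer text. Checkpoints and the projection layer are
# illustrative assumptions, not the authors' exact configuration.
import torch
from PIL import Image
from transformers import (AutoTokenizer, MT5ForConditionalGeneration,
                          ViTImageProcessor, ViTModel)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
t5 = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Hypothetical projection from the ViT hidden size to the mT5 embedding size;
# randomly initialized here, it would be learned during fine-tuning.
proj = torch.nn.Linear(vit.config.hidden_size, t5.config.d_model)

def answer(image: Image.Image, question: str) -> str:
    # Dense image embeddings from the ViT encoder (one vector per patch).
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    image_embeds = proj(vit(pixel_values=pixel_values).last_hidden_state)

    # Embedded question tokens from the mT5 input embedding table.
    text = tokenizer(question, return_tensors="pt")
    text_embeds = t5.get_input_embeddings()(text.input_ids)

    # Concatenate image and text embeddings into one encoder input sequence.
    inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
    attention_mask = torch.cat(
        [torch.ones(image_embeds.shape[:2], dtype=torch.long), text.attention_mask],
        dim=1,
    )

    # Run the shared encoder once, then decode the answer token by token.
    encoder_outputs = t5.get_encoder()(inputs_embeds=inputs_embeds,
                                       attention_mask=attention_mask)
    output_ids = t5.generate(encoder_outputs=encoder_outputs,
                             attention_mask=attention_mask,
                             max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```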
References
Aishwarya Agrawal et al. “VQA: Visual Question Answering”. In: arXiv preprint arXiv:1505.00468 (2015).
Qi Wu et al. “Visual Question Answering: A Survey of Methods and Datasets”. In: arXiv preprint arXiv:1607.05910 (2016).
Jacob Devlin. “Multilingual BERT README”. 2018. url: https://github.com/google-research/bert/blob/master/multilingual.md.
Alexis Conneau et al. “Unsupervised Cross-lingual Representation Learning at Scale”. In: CoRR abs/1911.02116 (2019). arXiv: 1911.02116. url: http://arxiv.org/abs/1911.02116.
Linting Xue et al. “mT5: A massively multilingual pre-trained text-to-text transformer”. In: CoRR abs/2010.11934 (2020). arXiv: 2010.11934. url: https://arxiv.org/abs/2010.11934.
Alexey Dosovitskiy et al. “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. In: CoRR abs/2010.11929 (2020). arXiv: 2010.11929. url: https://arxiv.org/abs/2010.11929.
Hangbo Bao, Li Dong, and Furu Wei. “BEiT: BERT Pre-Training of Image Transformers”. In: CoRR abs/2106.08254 (2021). arXiv: 2106.08254. url: https://arxiv.org/abs/2106.08254.
Ze Liu et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”. In: CoRR abs/2103.14030 (2021). arXiv: 2103.14030. url: https://arxiv.org/abs/2103.14030.
Yinhan Liu et al. “Multilingual Denoising Pre-training for Neural Machine Translation”. In: CoRR abs/2001.08210 (2020). arXiv: 2001.08210. url: https://arxiv.org/abs/2001.08210.
Teven Le Scao et al. “BLOOM: A 176B-Parameter Open-Access Multilingual Language Model”. In: arXiv preprint arXiv:2211.05100 (2022).
Colin Raffel et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer”. In: CoRR abs/1910.10683 (2019). arXiv: 1910.10683. url: http://arxiv.org/abs/1910.10683.
Mike Lewis et al. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: CoRR abs/1910.13461 (2019). arXiv: 1910.13461. url: http://arxiv.org/abs/1910.13461.
Jacob Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. In: CoRR abs/1810.04805 (2018). arXiv: 1810.04805. url: http://arxiv.org/abs/1810.04805.
Yinhan Liu et al. “RoBERTa: A Robustly Optimized BERT Pretraining Approach”. In: CoRR abs/1907.11692 (2019). arXiv: 1907.11692. url: http://arxiv.org/abs/1907.11692.
Tom B. Brown et al. “Language Models are Few-Shot Learners”. In: CoRR abs/2005.14165 (2020). arXiv: 2005.14165. url: https://arxiv.org/abs/2005.14165.
Guillaume Lample and Alexis Conneau. “Cross-lingual Language Model Pretraining”. In: CoRR abs/1901.07291 (2019). arXiv: 1901.07291. url: http://arxiv.org/abs/1901.07291.
Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: CoRR abs/1409.0575 (2014). arXiv: 1409.0575. url: http://arxiv.org/abs/1409.0575.
Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. url: http://arxiv.org/abs/1512.03385.
Tsung-Yi Lin et al. “Microsoft COCO: Common Objects in Context”. In: CoRR abs/1405.0312 (2014). arXiv: 1405.0312. url: http://arxiv.org/abs/1405.0312.
Khanh Quoc Tran et al. “ViVQA: Vietnamese Visual Question Answering”. In: Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation, PACLIC 2021, Shanghai International Studies University, Shanghai, China, 5-7 November 2021. Ed. by Kaibao Hu et al. Association for Computational Linguistics, 2021, pp. 683–691. url: https://aclanthology.org/2021.paclic-1.72.
Yash Goyal et al. “Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering”. In: CoRR abs/1612.00837 (2016). arXiv: 1612.00837. url: http://arxiv.org/abs/1612.00837.
Xi Chen et al. “PaLI: A Jointly-Scaled Multilingual Language-Image Model”. In: arXiv preprint arXiv:2209.06794 (2022).
Luis Perez and Jason Wang. “The Effectiveness of Data Augmentation in Image Classification using Deep Learning”. In: CoRR abs/1712.04621 (2017). arXiv: 1712.04621. url: http://arxiv.org/abs/1712.04621.
Ashish Vaswani et al. “Attention Is All You Need”. In: CoRR abs/1706.03762 (2017). arXiv: 1706.03762. url: http://arxiv.org/abs/1706.03762.
Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks”. In: CoRR abs/1907.12461 (2019). arXiv: 1907.12461. url: http://arxiv.org/abs/1907.12461.
Ngan Luu-Thuy Nguyen et al. “VLSP 2022 - EVJVQA Challenge: Multilingual Visual Question Answering”. In: Journal of Computer Science and Cybernetics (2023). doi: https://doi.org/10.15625/1813-9663/18157.
Ilya Loshchilov and Frank Hutter. “Decoupled Weight Decay Regularization”. 2017. doi: 10.48550/ARXIV.1711.05101. url: https://arxiv.org/abs/1711.05101.
Wonjae Kim, Bokyung Son, and Ildoo Kim. “ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision”. In: International Conference on Machine Learning. PMLR. 2021, pp. 5583–5594.