Document Type: Research Article
Abstract
Joint learning of vision and language has become an active research area in Artificial Intelligence (AI). Tasks such as Graphical Question Answering (GQA) have recently emerged at the intersection of AI, natural language processing (NLP), and computer vision. Here, we present the task of the Multimedia Machine Modal Comprehension Question Answering Algorithm (MMCQA), which addresses multimodal questions spanning text, figures, and images. The accompanying dataset contains roughly twelve thousand lessons and more than thirty-six thousand multimodal questions drawn from a science curriculum. Our analysis shows that a substantial fraction of the questions requires parsing text, interpreting figures, and higher-level reasoning, indicating that this dataset is considerably more challenging than earlier, more straightforward question answering datasets. Finally, we propose a method based on a dual LSTM with spatial and temporal attention, and we demonstrate through experimental studies that it compares favorably with standard GQA methods.
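The abstract does not specify the architecture in detail, so the following is only a minimal illustrative sketch of what a dual-LSTM model with spatial and temporal attention might look like in PyTorch. All module names, dimensions, the fusion scheme, and the answer-classification head are assumptions, not the authors' implementation: one LSTM reads the question tokens and is summarized by temporal attention, a second LSTM scans image-region features and is summarized by spatial attention, and the two attended vectors are fused to score candidate answers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualLSTMAttention(nn.Module):
    """Illustrative sketch (not the paper's implementation) of a dual LSTM
    with temporal attention over text and spatial attention over regions."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 region_dim=512, num_answers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Temporal stream: LSTM over question tokens
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Spatial stream: LSTM over a sequence of image-region features
        self.region_lstm = nn.LSTM(region_dim, hidden_dim, batch_first=True)
        # One attention score per time step / per region (assumed form)
        self.temporal_attn = nn.Linear(hidden_dim, 1)
        self.spatial_attn = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(2 * hidden_dim, num_answers)

    def attend(self, states, scorer):
        # states: (batch, steps, hidden) -> attention-weighted sum over steps
        weights = F.softmax(scorer(states).squeeze(-1), dim=1)  # (batch, steps)
        return torch.bmm(weights.unsqueeze(1), states).squeeze(1)

    def forward(self, question_ids, region_feats):
        # question_ids: (batch, seq_len) token ids
        # region_feats: (batch, num_regions, region_dim), e.g. CNN grid features
        text_states, _ = self.text_lstm(self.embed(question_ids))
        region_states, _ = self.region_lstm(region_feats)
        text_vec = self.attend(text_states, self.temporal_attn)
        region_vec = self.attend(region_states, self.spatial_attn)
        return self.classifier(torch.cat([text_vec, region_vec], dim=-1))

# Toy usage with random inputs
model = DualLSTMAttention(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 12)),   # two questions, 12 tokens
               torch.randn(2, 49, 512))           # 7x7 grid of region features
print(logits.shape)  # torch.Size([2, 4])
```

In this sketch the answer is chosen among a fixed set of candidates via a softmax over the classifier logits; other fusion or decoding choices would be equally consistent with the abstract's description.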