MoReVQA: Exploring Modular Reasoning Models for
Video Question Answering

Juhong Min1,2*        Shyamal Buch1        Arsha Nagrani1        Minsu Cho2        Cordelia Schmid1
1Google Research             2POSTECH
(* Work done during an internship at Google)
CVPR 2024

MoReVQA

a modular and decomposed multi-stage pipeline for video question answering




Abstract

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extends to related tasks (grounded videoQA, paragraph captioning).

What's this project about?

We tackle the challenging task of Video Question-Answering (VideoQA). Our goal is to help advance interpretable (modular) systems for multimodal long video reasoning. We make two key contributions:

First, we introduce a new “simple program” baseline that just captions every frame (JCEF, “Just Caption Every Frame”), and show that it (surprisingly) outperforms state-of-the-art visual programming methods for videoQA! (This means there is still a lot of room for improvement in visual programming and modular reasoning.)
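For concreteness, here is a minimal sketch of a JCEF-style baseline in Python. The caption_frame and llm wrappers below are hypothetical placeholders, not the paper's actual models or prompts:

# Minimal sketch of a JCEF-style baseline ("Just Caption Every Frame").
# `caption_frame` and `llm` are hypothetical wrappers, not the paper's exact models.
from typing import Callable, List, Sequence

def jcef_answer(frames: Sequence, question: str,
                caption_frame: Callable[[object], str],
                llm: Callable[[str], str]) -> str:
    """Caption every sampled frame, then ask a text-only LLM to answer from the captions."""
    captions: List[str] = [f"Frame {i}: {caption_frame(frame)}"
                           for i, frame in enumerate(frames)]
    prompt = ("You are given per-frame captions of a video.\n"
              + "\n".join(captions)
              + f"\n\nQuestion: {question}\nAnswer:")
    return llm(prompt).strip()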

Second, we introduce MoReVQA, a multi-stage, modular reasoning framework for VideoQA. Our method improves over prior single-stage visual programming methods by decomposing the overall task into more focused sub-tasks inherent to video-language reasoning: event parsing, event grounding, and event reasoning. Each stage has its own sub-program generation step (with a flexible API), and all stages are unified through a shared external memory.
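A structural sketch of this multi-stage flow is below; the stage prompts, the API tools ("localize", "caption"), and the memory keys are illustrative assumptions, and the paper's actual few-shot prompts and module APIs differ:

# Structural sketch of a MoReVQA-style multi-stage pipeline.
# Prompts, API functions, and memory keys here are illustrative assumptions only.
from typing import Callable, Dict, List

def morevqa_answer(frames: List, question: str,
                   llm: Callable[[str], str],
                   api: Dict[str, Callable]) -> str:
    # External memory shared across all stages.
    memory: Dict[str, object] = {"question": question}

    # Stage 1: event parsing -- decompose the question into its constituent events.
    memory["events"] = llm(f"List the events referenced in this question: {question}")

    # Stage 2: event grounding -- localize the parsed events in the video via API calls.
    memory["grounded_frames"] = api["localize"](frames, memory["events"])

    # Stage 3: event reasoning -- describe the grounded frames and reason over memory.
    memory["captions"] = [api["caption"](f) for f in memory["grounded_frames"]]
    answer_prompt = (f"Question: {question}\n"
                     f"Parsed events: {memory['events']}\n"
                     f"Grounded frame captions: {memory['captions']}\n"
                     "Answer:")
    return llm(answer_prompt).strip()

The key point of this sketch is that every stage reads from and writes to the same external memory, so grounding and reasoning are conditioned on the parsed events rather than on a single ungrounded plan.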

We show that MoReVQA improves over prior modular reasoning and visual programming baselines (with consistent base models), and sets a new state of the art across a range of videoQA benchmarks and domains (long videos, egocentric videos, instructional videos, etc.), with extensions to related tasks. To learn more, please see our paper (link).



Multi-Stage Reasoning Mechanism of MoReVQA

(Additional visualization coming soon)
Example qualitative result on NExT-QA

Example qualitative result on iVQA




Qualitative Comparison Between
MoReVQA, JCEF, and Visual Programming


BibTeX

@inproceedings{min2024morevqa,
  author    = {Min, Juhong and Buch, Shyamal and Nagrani, Arsha and Cho, Minsu and Schmid, Cordelia},
  title     = {MoReVQA: Exploring Modular Reasoning Models for Video Question Answering},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}