MoReVQA: Exploring Modular Reasoning Models for
Video Question Answering

Juhong Min1,2*        Shyamal Buch1        Arsha Nagrani1        Minsu Cho2        Cordelia Schmid1
1Google Research             2POSTECH
(* Work done during an internship at Google)
CVPR 2024

MoReVQA

a modular and decomposed multi-stage pipeline for video question answering




Abstract

This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extends to related tasks (grounded videoQA, paragraph captioning).

What's this project about?

We tackle the challenging task of Video Question-Answering (VideoQA). Our goal is to help advance interpretable (modular) systems for multimodal long video reasoning. We make two key contributions:

First, we introduce a new “simple program” baseline that just captions every frame (JCEF, “Just Caption Every Frame”), and show that it (surprisingly) outperforms state-of-the-art visual programming methods for videoQA! (This means there is still a lot of room for improvement in visual programming and modular reasoning.)
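For concreteness, here is a minimal sketch of a JCEF-style baseline in Python. The caption_frame and llm wrappers below are hypothetical placeholders, not the paper's actual models or prompts:

# Minimal sketch of a JCEF-style baseline ("Just Caption Every Frame").
# `caption_frame` and `llm` are hypothetical wrappers, not the paper's exact models.
from typing import Callable, List, Sequence

def jcef_answer(frames: Sequence, question: str,
                caption_frame: Callable[[object], str],
                llm: Callable[[str], str]) -> str:
    """Caption every sampled frame, then ask a text-only LLM to answer from the captions."""
    captions: List[str] = [f"Frame {i}: {caption_frame(frame)}"
                           for i, frame in enumerate(frames)]
    prompt = ("You are given per-frame captions of a video.\n"
              + "\n".join(captions)
              + f"\n\nQuestion: {question}\nAnswer:")
    return llm(prompt).strip()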

Second, we introduce MoReVQA, a multi-stage, modular reasoning framework for VideoQA. Our method improves over prior single-stage visual programming methods by decomposing the overall task into more focused sub-tasks inherent to video-language reasoning: event parsing, event grounding, and event reasoning. Each stage has its own sub-program generation step (with a flexible API), and all stages are unified through a shared external memory.
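A structural sketch of this multi-stage flow is below; the stage prompts, the API tools ("localize", "caption"), and the memory keys are illustrative assumptions, and the paper's actual few-shot prompts and module APIs differ:

# Structural sketch of a MoReVQA-style multi-stage pipeline.
# Prompts, API functions, and memory keys here are illustrative assumptions only.
from typing import Callable, Dict, List

def morevqa_answer(frames: List, question: str,
                   llm: Callable[[str], str],
                   api: Dict[str, Callable]) -> str:
    # External memory shared across all stages.
    memory: Dict[str, object] = {"question": question}

    # Stage 1: event parsing -- decompose the question into its constituent events.
    memory["events"] = llm(f"List the events referenced in this question: {question}")

    # Stage 2: event grounding -- localize the parsed events in the video via API calls.
    memory["grounded_frames"] = api["localize"](frames, memory["events"])

    # Stage 3: event reasoning -- describe the grounded frames and reason over memory.
    memory["captions"] = [api["caption"](f) for f in memory["grounded_frames"]]
    answer_prompt = (f"Question: {question}\n"
                     f"Parsed events: {memory['events']}\n"
                     f"Grounded frame captions: {memory['captions']}\n"
                     "Answer:")
    return llm(answer_prompt).strip()

The key point of this sketch is that every stage reads from and writes to the same external memory, so grounding and reasoning are conditioned on the parsed events rather than on a single ungrounded plan.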

We show that MoReVQA improves over prior modular reasoning and visual programming baselines (with consistent base models), and sets a new state of the art across a range of videoQA benchmarks and domains (long videos, egocentric videos, instructional videos, etc.), with extensions to related tasks. To learn more, please see our paper (link).



Multi-Stage Reasoning Mechanism of MoReVQA

(Additional visualization coming soon)
Example qualitative result on NExT-QA

Example qualitative result on iVQA




Qualitative Comparison Between
MoReVQA, JCEF, and Visual Programming


BibTeX

@inproceedings{min2024morevqa,
  author    = {Min, Juhong and Buch, Shyamal and Nagrani, Arsha and Cho, Minsu and Schmid, Cordelia},
  title     = {MoReVQA: Exploring Modular Reasoning Models for Video Question Answering},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2024},
}