Foveated Reasoning

Stateful, Action-based Visual Focusing for Vision-Language Models

AI Center – Mountain View, Samsung Electronics
Teaser figure placeholder
Foveated Reasoning lets a vision-language model reason, decide when to focus, and acquire high-resolution visual evidence within a single autoregressive trajectory.

Abstract

We introduce Foveated Reasoning, a vision-language reasoning framework that enables a model to adaptively focus on task-relevant visual regions while generating its reasoning trajectory. Instead of processing the entire image at uniformly high resolution, the model starts from a low-resolution view, emits textual reasoning tokens, and selectively triggers foveation actions to retrieve high-resolution evidence only when needed, all within a single autoregressive decoding trajectory.

Stateful Reasoning

The model maintains a hidden interaction state through its autoregressive context and uses it to decide both what to say and where to look next.

Action-based Foveation

A special foveation action predicts a continuous visual region, retrieves high-resolution evidence, and inserts it back into the same reasoning stream.

Efficient High-res VLM

The model avoids uniformly expensive high-resolution processing by allocating visual tokens adaptively based on instance difficulty.

Method

Foveated Reasoning treats visual understanding as a sequential decision-making process. Given an initial low-resolution image and an instruction, the model alternates between ordinary language generation and visual focusing actions.

1. Low-resolution observation: The model first receives a global low-resolution image and the user instruction.
2. Reason or focus: At each decoding step, the model either emits a text token or triggers a foveation action.
3. High-resolution evidence: The foveation action predicts a region, retrieves high-resolution visual evidence, and injects it into the context.
4. Final answer: The model uses both its reasoning history and acquired evidence to produce the final answer.
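The four steps above can be sketched as a single decoding loop. This is a toy illustration, not the paper's actual implementation: the token names (`FOVEATE`, `EOS`), the `policy` callable standing in for the model, and the fixed center crop are all assumptions made for clarity.

```python
import numpy as np

FOVEATE, EOS = "<foveate>", "<eos>"  # hypothetical special tokens

def crop_highres(image, box):
    """Crop a normalized (x0, y0, x1, y1) box from the full-resolution image."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = box
    return image[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]

def foveated_decode(image, policy, max_steps=16):
    """Run one trajectory that interleaves text tokens and foveation actions."""
    context, evidence = [], []
    for _ in range(max_steps):
        token = policy(context)
        if token == FOVEATE:
            # A real model would predict a continuous region from its hidden
            # state; this toy version always requests a center crop.
            box = (0.25, 0.25, 0.75, 0.75)
            evidence.append(crop_highres(image, box))
        elif token == EOS:
            break
        else:
            context.append(token)
    return context, evidence

# Toy policy: reason twice, foveate once, then answer and stop.
script = iter(["think", "think", FOVEATE, "answer", EOS])
tokens, patches = foveated_decode(np.zeros((100, 100, 3)),
                                  lambda ctx: next(script))
# tokens == ["think", "think", "answer"]; one 50x50 high-res crop acquired
```

Note that the foveation action does not interrupt generation: the retrieved crop is appended to the same context the model keeps decoding from, which is what makes the trajectory stateful.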

Method figure placeholder
Overview of the action-based foveation loop.

Results

Foveated Reasoning improves high-resolution visual understanding while using an adaptive visual token budget. The model learns to request more visual evidence for difficult examples and little or none for easy examples.

| Setting | Input | Visual Budget | Key Observation |
|---|---|---|---|
| Standard VLM | Fixed resolution | Fixed | Uniform computation regardless of difficulty |
| High-resolution VLM | High resolution | Large | Better detail, but expensive |
| Foveated Reasoning | Low-res + adaptive foveation | Adaptive | Focuses computation on task-relevant evidence |
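To make the budget comparison concrete, here is some back-of-the-envelope arithmetic under ViT-style 16x16 patchification. The resolutions and the two-crop count are illustrative assumptions, not measurements from the paper.

```python
# Visual-token budgets under 16x16 patchification (illustrative numbers only).
def num_visual_tokens(height, width, patch=16):
    return (height // patch) * (width // patch)

uniform_high_res = num_visual_tokens(1024, 1024)        # 4096 tokens, always
global_low_res   = num_visual_tokens(256, 256)          # 256 tokens up front
two_foveations   = 2 * num_visual_tokens(256, 256)      # 512 tokens on demand
adaptive_total   = global_low_res + two_foveations      # 768 tokens worst case
```

Under these assumptions, an instance needing two foveations costs 768 visual tokens versus 4096 for uniform high-resolution processing, and an easy instance answered from the global view alone costs only 256.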
Result figure placeholder
Add benchmark tables, qualitative examples, or token-budget plots here.

BibTeX

Please cite our work if you find it useful.

@article{min2026foveatedreasoning,
  title   = {Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models},
  author  = {Min, Juhong and Valkov, Lazar and Petsiuk, Vitali and Souri, Hossein and Mohan, Deen Dayal},
  journal = {arXiv preprint arXiv:2604.21079},
  year    = {2026}
}