Abstract
We introduce Foveated Reasoning, a vision-language reasoning framework that enables a model to adaptively focus on task-relevant visual regions while generating its reasoning trajectory. Instead of processing the entire image at uniformly high resolution, the model starts from a low-resolution view, emits textual reasoning tokens, and selectively triggers foveation actions to retrieve high-resolution evidence only when needed, all within a single autoregressive decoding trajectory.
Stateful Reasoning
The model maintains a hidden interaction state through its autoregressive context and uses it to decide both what to say and where to look next.
Action-based Foveation
A special foveation action predicts a continuous visual region, retrieves high-resolution evidence, and inserts it back into the same reasoning stream.
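One way to picture the region-retrieval step is as a crop around a predicted continuous point. The sketch below is an illustration only: the function name `foveate` and the `(cx, cy, scale)` parameterization are assumptions for exposition, not the paper's actual interface.

```python
import numpy as np

def foveate(image: np.ndarray, cx: float, cy: float, scale: float) -> np.ndarray:
    """Crop a region around a continuous foveation point.

    cx, cy, scale are normalized to [0, 1]; the crop is clamped so it
    always stays inside the image. (Hypothetical interface.)
    """
    h, w = image.shape[:2]
    half_h = max(1, int(scale * h / 2))
    half_w = max(1, int(scale * w / 2))
    y, x = int(cy * h), int(cx * w)
    # Clamp the window so the crop never leaves the image bounds.
    top = min(max(y - half_h, 0), h - 2 * half_h)
    left = min(max(x - half_w, 0), w - 2 * half_w)
    return image[top:top + 2 * half_h, left:left + 2 * half_w]
```

In a full system the crop would be taken from the original high-resolution image and re-encoded into visual tokens before being spliced into the reasoning stream.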
Efficient High-res VLM
The model avoids uniformly expensive high-resolution processing by allocating visual tokens adaptively based on instance difficulty.
Method
Foveated Reasoning treats visual understanding as a sequential decision-making process. Given an initial low-resolution image and an instruction, the model alternates between ordinary language generation and visual focusing actions.
Low-resolution observation
The model first receives a global low-resolution image and the user instruction.
Reason or focus
At each decoding step, the model either emits a text token or triggers a foveation action.
High-resolution evidence
The foveation action predicts a region, retrieves high-resolution visual evidence, and injects it into the context.
Final answer
The model uses both its reasoning history and acquired evidence to produce the final answer.
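The four steps above can be sketched as a single decoding loop. This is a toy illustration under assumed names (`ToyModel`, `foveated_decode`, the `<crop...>` placeholder), not the actual model API; the real policy is an autoregressive VLM rather than a hand-written rule.

```python
from typing import List, Tuple

class ToyModel:
    """Stand-in policy: foveates once, then answers.

    State lives entirely in the autoregressive context, mirroring how
    the model's decisions are conditioned on its own history.
    """

    def step(self, context: List[str]) -> Tuple[str, object]:
        # If high-res evidence is already in context, produce the answer;
        # otherwise request a crop at a predicted (cx, cy, scale).
        if any(t.startswith("<crop") for t in context):
            return ("answer", "cat")
        return ("foveate", (0.5, 0.5, 0.2))

def foveated_decode(model, lowres_placeholder, instruction, max_steps=16):
    # Step 1: start from the global low-res view plus the instruction.
    context = [lowres_placeholder] + list(instruction)
    for _ in range(max_steps):
        # Step 2: at each step, reason, foveate, or answer.
        action, payload = model.step(context)
        if action == "foveate":
            # Step 3: retrieve high-res evidence for the predicted
            # region and splice it into the same decoding stream.
            context.append(f"<crop{payload}>")
        elif action == "text":
            context.append(payload)
        else:  # "answer"
            # Step 4: final answer conditioned on the full history.
            return payload, context
    return None, context
```

Because the crop token is appended to the same context the model conditions on, no separate memory module is needed: the interaction state is the decoding history itself.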
Results
Foveated Reasoning improves high-resolution visual understanding while using an adaptive visual token budget. The model learns to request more visual evidence for difficult examples and little or none for easy ones.
| Setting | Input | Visual Budget | Key Observation |
|---|---|---|---|
| Standard VLM | Fixed resolution | Fixed | Uniform computation regardless of difficulty |
| High-resolution VLM | High resolution | Large | Better detail, but expensive |
| Foveated Reasoning | Low-res + adaptive foveation | Adaptive | Focuses computation on task-relevant evidence |
BibTeX
Please cite our work if you find it useful.
@article{min2026foveatedreasoning,
  title   = {Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models},
  author  = {Min, Juhong and Valkov, Lazar and Petsiuk, Vitali and Souri, Hossein and Mohan, Deen Dayal},
  journal = {arXiv preprint arXiv:2604.21079},
  year    = {2026}
}