
ProgressLM

Towards Progress Reasoning in Vision-Language Models

Jianshu Zhang* Chengxuan Qian* Haosen Sun Haoran Lu Dingcheng Wang Letian Xue Han Liu

* Equal Contribution

ProgressLM Teaser

Can vision-language models acquire progress estimation as a general reasoning capability from a single observation? Given a task demonstration and a single observation, the goal is to estimate how much of the task has already been completed. Direct prediction can often judge whether the task is unfinished, but struggles to assign a well-calibrated progress score. Progress reasoning instead follows a coarse-to-fine process: it first performs episodic retrieval to coarsely locate the observation along the demonstrated task, then applies mental simulation to imagine the transition from the retrieved anchor to the current observation, enabling a fine-grained estimate of completed progress.
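
To make the coarse-to-fine estimate concrete, below is a minimal numerical sketch of how the two stages could be fused into a single score; the function name and the linear interpolation form are illustrative assumptions rather than the paper's exact formulation.

def combine_retrieval_and_simulation(anchor_progress: float,
                                     next_step_progress: float,
                                     simulated_fraction: float) -> float:
    """Fuse the two reasoning stages into one progress score.

    anchor_progress:    cumulative progress at the retrieved anchor step (episodic retrieval)
    next_step_progress: cumulative progress at the following key step
    simulated_fraction: how far the observation has moved from the anchor toward the
                        next step, in [0, 1], as judged by mental simulation
    """
    return anchor_progress + simulated_fraction * (next_step_progress - anchor_progress)

# Example: retrieval anchors the observation at the end of step 2 of 4 (50% done),
# and simulation judges the transition toward step 3 (75%) to be roughly 40% complete.
print(combine_retrieval_and_simulation(0.50, 0.75, 0.40))  # -> 0.6, i.e. 60% progress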

Abstract

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. In this work, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. We further explore a human-inspired two-stage progress reasoning paradigm that combines episodic retrieval with mental simulation. We instantiate this paradigm through both training-free prompting and a training-based approach built on an automatically curated dataset, ProgressLM-45K. Evaluating 14 VLMs on Progress-Bench, we find that current models struggle to reliably estimate task progress. While training-free prompting that enforces structured progress reasoning yields improvements, these gains are limited and model-dependent. In contrast, ProgressLM-3B achieves consistent improvements in accuracy, robustness to viewpoint variation, and calibrated handling of unanswerable cases, even at small model scale. Further analyses reveal characteristic error patterns of existing VLMs and clarify when and why progress reasoning succeeds or fails.

How do we annotate progress under controlled shifts?
Progress Annotation

We construct progress data by pairing each task demonstration with a single observation sampled from intermediate or boundary moments during execution, and assign each observation's progress label via temporal interpolation between adjacent key steps. To systematically probe progress reasoning beyond static matching, we introduce controlled shifts along three dimensions: demonstration modality (vision-based vs. text-based), viewpoint correspondence (same-view vs. cross-view for vision demos), and answerability, where semantic mismatches are injected so that progress becomes ill-defined and the correct output is N/A.
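
A minimal sketch of how such an interpolated label could be computed, assuming each key step contributes an equal share of total progress; the helper below and its uniform-weighting assumption are illustrative, not the dataset's actual labeling code.

from bisect import bisect_right

def progress_label(step_boundaries: list[float], t_obs: float) -> float:
    """step_boundaries: timestamps ending each key step, e.g. [2.0, 5.0, 9.0] seconds.
    t_obs: timestamp of the sampled observation. Returns progress in [0, 1]."""
    n_steps = len(step_boundaries)
    starts = [0.0] + step_boundaries[:-1]
    k = min(bisect_right(step_boundaries, t_obs), n_steps - 1)  # index of the current step
    within = (t_obs - starts[k]) / (step_boundaries[k] - starts[k])
    within = max(0.0, min(1.0, within))
    return (k + within) / n_steps  # each step contributes 1/n_steps of total progress

print(progress_label([2.0, 5.0, 9.0], 3.5))  # observation midway through step 2 -> 0.5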

Scaling progress data across robots and objects
Dataset statistics: bar chart and Sankey diagram

To scale progress reasoning beyond a single embodiment or environment, we curate progress estimation data spanning single-arm manipulators, dual-arm systems, and humanoid robots, together with diverse object interactions. Progress-Bench evaluates models on thousands of observation queries grounded in real manipulation trajectories, while ProgressLM-45K further expands training coverage with large-scale supervised and reinforcement learning samples. This diversity encourages models to learn transferable progress cues—tracking long-horizon state evolution rather than overfitting to specific robots, viewpoints, or object appearances.

Method

Direct Prediction

The model directly outputs a progress percentage from the demonstration and the current observation, without any explicit reasoning decomposition.

Training-free Approach

We enforce a human-inspired two-stage progress reasoning process via structured prompting, where the model first performs episodic retrieval to locate a coarse anchor step, then applies mental simulation to infer fine-grained progress.
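
An illustrative prompt skeleton for this structured two-stage prompting; the exact wording used in the paper may differ.

TWO_STAGE_PROMPT = """You are given a task demonstration and a single current observation.
Estimate how much of the task has been completed.

Step 1 - Episodic retrieval: compare the observation against the demonstrated key steps
and identify the most recent step that has clearly been completed (the anchor).

Step 2 - Mental simulation: imagine how the scene evolves from that anchor toward the
next key step, and judge how far along this transition the observation lies.

If the observation does not semantically match the demonstrated task, answer "N/A".
Otherwise, output a single progress percentage between 0 and 100."""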

SFT

We explicitly train the model to internalize episodic retrieval and mental simulation using cold-start supervised fine-tuning on ProgressLM-25K-CoT via LLaMA-Factory.
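
For intuition, a hypothetical layout of a single CoT training record; the field names and content are illustrative assumptions, not the actual ProgressLM-25K-CoT schema.

cot_sample = {
    "demonstration": ["frame_001.jpg", "frame_002.jpg", "frame_003.jpg"],  # vision-based demo
    "observation": "query_frame.jpg",
    "reasoning": (
        "Episodic retrieval: the observation matches the state right after step 2 "
        "(object grasped). Mental simulation: the arm has moved roughly halfway "
        "toward the placement location described in step 3."
    ),
    "answer": "62%",
}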

RL

We further refine the SFT model with GRPO-based reinforcement learning on ProgressLM-20K-RL via EasyR1 and verl, yielding ProgressLM with improved accuracy, robustness, and calibrated handling of unanswerable cases.
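
A minimal sketch of a rule-based reward that such GRPO training could optimize, rewarding parseable and accurate progress estimates as well as correct N/A rejections; the reward shape is an assumption for illustration and may differ from the one used to train ProgressLM.

import re

def progress_reward(response: str, gt: float | None) -> float:
    """gt is the ground-truth progress in [0, 1], or None for unanswerable cases."""
    says_na = "N/A" in response
    if gt is None:
        return 1.0 if says_na else 0.0          # reward correct rejection
    if says_na:
        return 0.0                               # penalize rejecting an answerable case
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", response)
    if match is None:
        return 0.0                               # unparseable output gets no reward
    pred = float(match.group(1)) / 100.0
    return max(0.0, 1.0 - abs(pred - gt))        # closer predictions earn higher reward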

Performance on Answerable Scenarios

How well do current VLMs perform at progress estimation?

Table 1: Performance on Answerable Scenarios

Overall, current VLMs exhibit limited and highly variable progress estimation ability under direct prediction. Even strong models achieve only moderate performance and remain highly sensitive to the demonstration setting, while several models show abnormally low or even negative PRC, indicating distorted temporal ordering rather than meaningful ordinal progress reasoning. In addition, some models adopt overly conservative behaviors with elevated AFRR, rejecting answerable cases instead of producing calibrated progress scores.
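
To illustrate what a negative PRC implies, the snippet below approximates PRC as a Spearman rank correlation between predicted and ground-truth progress along one trajectory (the benchmark's exact PRC definition may differ): predictions that run backwards in time yield a correlation of -1.

from scipy.stats import spearmanr

gt   = [0.10, 0.30, 0.50, 0.70, 0.90]   # ground-truth progress along a trajectory
pred = [0.80, 0.60, 0.55, 0.30, 0.20]   # predictions that run backwards in time

rho, _ = spearmanr(gt, pred)
print(rho)  # -1.0: the predicted ordering is fully inverted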

Does training-free progress reasoning help?

The training-free approach provides only conditional benefits: it tends to help large models improve temporal consistency (often reflected in PRC), but its effect is unstable and can be neutral or harmful for smaller models. This suggests that limited-capacity VLMs may imitate the structured format without truly improving progress understanding, leading to degraded pointwise accuracy (higher NSE) or increased rejection behavior (higher AFRR).

Does training-based progress reasoning help?

Yes: explicit learning yields consistent and substantial improvements even at small model scale. Compared to the base model, ProgressLM improves both single-point progress accuracy (lower NSE) and trajectory-level temporal consistency (higher PRC), demonstrating that robust progress reasoning is not merely a consequence of scale. Notably, the reinforced variant further strengthens overall reliability, showing that progress reasoning can be effectively learned through targeted supervision and optimization rather than relying on prompting alone.

Supported Models

We conduct extensive validation on VLMs spanning different architectures and model scales, covering both open-source and proprietary models.

Qwen2.5-VL (3B, 7B, 32B, 72B)
Qwen3-VL (2B, 4B, 8B, 32B)
InternVL3.5 (4B, 8B, 14B, 38B)
GPT-5 / GPT-5-mini

All models are evaluated under direct prediction, non-thinking, and thinking modes.

Citation

If you find our work useful, please consider citing:

@article{zhang2025progresslm,
  title={ProgressLM: Towards Progress Reasoning in Vision-Language Models},
  author={Zhang, Jianshu and Qian, Chengxuan and Sun, Haosen and Lu, Haoran and Wang, Dingcheng and Xue, Letian and Liu, Han},
  journal={arXiv preprint arXiv:2505.XXXXX},
  year={2025}
}