Preprint Visual Planning Assistance Grounding-as-Verification GRPO

RECIPE: Procedural Planning via Grounding in Instructional Video

A reinforcement-learning framework that turns a noisy instructional-video corpus into a scalable verifier for procedural plans — without pseudo-labels and without a separately trained critic.

Luigi Seminara¹,², Antonino Furnari², Lorenzo Torresani¹

¹ Northeastern University • ² University of Catania

arXiv

Overview of RECIPE. Generated plans are scored against a noisy instructional-video corpus (HowTo100M) via a two-stage text alignment, treating the corpus as a verifier rather than a label source. The reward credits the policy only for the alignment improvement the continuation adds beyond the history alone.

Overview

Visual planning asks a model to generate the remaining steps of a procedure in natural language, given a partial video context and a goal. Progress is bottlenecked by annotation: clean labeled datasets are small and encode a single execution trajectory per example, while large-scale instructional corpora such as HowTo100M are noisy and unsuitable as a direct supervision source.

RECIPE (Reward from Empirical Corpus of Instructional Procedure Examples) reframes this corpus as a verifier rather than a label source. Extracting clean step labels from noisy videos is hard, but verifying whether a generated plan is temporally grounded in ASR transcripts is comparatively cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry by using grounding quality as the reward signal for Group Relative Policy Optimization (GRPO), training the planner end-to-end.
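
To make the idea of scoring a plan against precomputed transcript embeddings concrete, here is a minimal two-stage sketch. It is an illustration under stated assumptions, not the RECIPE implementation: embeddings are plain lists of floats, the coarse stage compares mean-pooled vectors, and the fine stage uses a greedy order-preserving match as a stand-in for whatever temporal alignment the method actually uses. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def stage1_shortlist(plan_vecs, corpus, k):
    """Coarse stage (assumed): rank videos by pooled similarity, keep the top k."""
    query = mean_vector(plan_vecs)
    ranked = sorted(
        corpus,
        key=lambda vid: cosine(query, mean_vector(corpus[vid])),
        reverse=True,
    )
    return ranked[:k]

def stage2_ordered_score(plan_vecs, segment_vecs):
    """Fine stage (assumed): match each step to a transcript segment that comes
    after the previous match, then average the similarities. Steps that cannot
    be matched because the transcript ran out contribute zero."""
    start, total = 0, 0.0
    for step in plan_vecs:
        if start >= len(segment_vecs):
            break
        best = max(
            range(start, len(segment_vecs)),
            key=lambda j: cosine(step, segment_vecs[j]),
        )
        total += cosine(step, segment_vecs[best])
        start = best + 1
    return total / len(plan_vecs)

def grounding_score(plan_vecs, corpus, k=2):
    """Best fine-stage score among the videos shortlisted by the coarse stage."""
    shortlist = stage1_shortlist(plan_vecs, corpus, k)
    return max(stage2_ordered_score(plan_vecs, corpus[vid]) for vid in shortlist)

# Toy corpus: video "a" follows the plan's order, video "b" reverses it.
toy_corpus = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.0, 1.0], [1.0, 0.0]]}
toy_plan = [[1.0, 0.0], [0.0, 1.0]]
```

In this toy setup both videos look identical to the coarse stage, and only the order-aware fine stage separates them, which is the kind of distinction a temporal grounding check is meant to capture.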

Key contributions:

Grounding-as-verification. We use instructional videos as verifiers for generated plans, optimizing a history-normalized GRPO reward across text- and video-conditioned planners.
Strong empirical gains. RECIPE-RL improves over the base checkpoint at every scale and on every benchmark, outperforms supervised and pseudo-label training, transfers to video-token planning, and strengthens search-based planning.
Open-ended evaluation. We introduce a seven-dataset, six-criterion LLM-judge evaluation and will release code plus an online server for open-dictionary visual planning.
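
For readers new to GRPO: as commonly described, it samples a group of completions for the same prompt and standardizes each reward against the group's mean and standard deviation, so no separately trained critic is needed. The sketch below shows only that normalization step. The function name and the `eps` stabilizer are illustrative choices, not taken from the RECIPE code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of plans sampled for the same prompt.

    eps (an assumed stabilizer) avoids division by zero when every plan in the
    group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Plans that ground better than their group's average get a positive advantage and the rest a negative one. A group with identical rewards yields all-zero advantages and therefore no learning signal.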

Experiments

We evaluate RECIPE on seven procedural benchmarks (CrossTask, COIN, CaptainCook4D, EgoProceL for in-domain; NIV, Ego-Exo4D, EgoPER for zero-shot) using a reference-based LLM-as-judge protocol. Below we summarize the main findings.

Main results across scales

RECIPE-RL improves over the base checkpoint at every scale (0.5B, 3B, 7B) and on every benchmark, with macro-accuracy gains of +7 to +8 points in-domain and up to +16 points zero-shot. At 7B, in-domain and zero-shot scores are within 0.5 points (46.6 vs. 46.1), indicating the gains do not come at the cost of generalization.
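
As a reading aid, the snippet below shows one common definition of macro accuracy, the unweighted mean of per-benchmark accuracies, and checks the 7B gap quoted above. Whether the paper uses exactly this definition is an assumption, and the per-benchmark values in the example are placeholders, not reported results.

```python
def macro_accuracy(per_benchmark):
    """Unweighted mean of per-benchmark accuracies (assumed definition)."""
    return sum(per_benchmark.values()) / len(per_benchmark)

# Placeholder numbers for illustration only; they are not results from the paper.
placeholder_scores = {"bench_a": 40.0, "bench_b": 50.0, "bench_c": 60.0}
placeholder_macro = macro_accuracy(placeholder_scores)

# Gap between the two 7B macro scores quoted in the text (46.6 and 46.1).
gap_7b = round(46.6 - 46.1, 1)
```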

Main results: macro accuracy at 0.5B, 3B, 7B for Base vs. RECIPE-RL

Same corpus, two roles

Using HowTo100M as a pseudo-label source for supervised fine-tuning degrades the base checkpoint at every scale. Using the same corpus as a verifier for RL improves it substantially. Both RECIPE variants (RL only and SFT→RL) outperform both SFT baselines.

Comparison of SFT (annotated, HT100M) and RECIPE-RL variants

Robustness to weak supervision

We vary the fraction of weakly supervised training examples (where both history and continuation are derived from a VLM rather than from human annotations) from 0% to 100%. RECIPE-RL accuracy stays flat across the supervision mix, while SFT collapses when annotations are scarce: at high weak-supervision fractions, SFT zero-shot accuracy falls below the no-fine-tuning baseline.
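
One simple way to build such a supervision mix is sketched below. The sampling scheme, the function name, and the fixed seed are assumptions made for illustration. The page does not describe how the mixes were actually constructed.

```python
import random

def mix_training_set(annotated, weak, weak_fraction, size, seed=0):
    """Draw `size` examples, a `weak_fraction` share of them from the weak pool.

    Both pools must hold at least as many examples as are drawn from them.
    """
    if not 0.0 <= weak_fraction <= 1.0:
        raise ValueError("weak_fraction must lie in [0, 1]")
    rng = random.Random(seed)
    n_weak = round(weak_fraction * size)
    n_annotated = size - n_weak
    sample = rng.sample(weak, n_weak) + rng.sample(annotated, n_annotated)
    rng.shuffle(sample)
    return sample

# Toy pools tagged by their source so the mix can be inspected.
annotated_pool = [("annotated", i) for i in range(10)]
weak_pool = [("weak", i) for i in range(10)]
mixed = mix_training_set(annotated_pool, weak_pool, weak_fraction=0.3, size=10)
```

Sweeping `weak_fraction` from 0.0 to 1.0 while holding `size` fixed keeps the amount of training data constant, so only the supervision source changes between runs.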

Robustness to weak supervision: RECIPE-RL vs. SFT under varying weak-supervision fractions

Plug-in to Visual Planning for Assistance

We integrate RECIPE-RL into the propose–assess–search pipeline of VidAssist as an additional proposer. The rest of the pipeline is unchanged, isolating the contribution of the proposal distribution. Pairing VidAssist with RECIPE-RL improves over every zero-shot baseline in our comparison on the accuracy metrics across all planning horizons.
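
The plug-in idea can be pictured with a generic propose-assess-search loop in which proposers are interchangeable callables. This is a hypothetical sketch, not VidAssist's actual interface: the beam search, the function signatures, and the toy proposers and scorer are all assumptions.

```python
def plan_with_search(goal, history, proposers, assess, horizon, beam_width=2):
    """Beam search over next steps; candidates are pooled from every proposer."""
    beams = [(0.0, [])]
    for _ in range(horizon):
        expanded = []
        for score, steps in beams:
            context = history + steps
            candidates = []
            for propose in proposers:
                for step in propose(goal, context):
                    if step not in candidates:
                        candidates.append(step)
            for step in candidates:
                expanded.append((score + assess(goal, context, step), steps + [step]))
        if not expanded:
            break
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]

# Toy stand-ins: a base proposer, an extra proposer, and a simple scorer.
def base_proposer(goal, context):
    return ["stir"]

def extra_proposer(goal, context):
    return ["add salt", "stir"]

def toy_assess(goal, context, step):
    if step in context:
        return 0.0  # repeating a step earns nothing
    return 1.0 if step == "add salt" else 0.5

plan_base = plan_with_search("season soup", [], [base_proposer], toy_assess, horizon=2)
plan_both = plan_with_search(
    "season soup", [], [base_proposer, extra_proposer], toy_assess, horizon=2
)
```

Because the assessor and the search are untouched, any difference between `plan_base` and `plan_both` comes from the candidates the extra proposer contributes, which mirrors how the experiment isolates the proposal distribution.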

Visual Planning for Assistance benchmark: VidAssist with RECIPE-RL improves accuracy at all horizons

Reward-component ablation

Removing any one of the four reward components (Stage-2 high-fidelity scoring, history baseline, relative-progress normalization, progress gate) costs 4–11 macro-accuracy points. Stage-2 scoring is the most critical for zero-shot generalization; the history baseline prevents the policy from paraphrasing the history to inflate reward.
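
The sketch below gives one plausible reading of how three of these components could fit together, given alignment scores produced by the Stage-2 scorer. The exact formula is not stated on this page, so the headroom normalization, the `min_gain` threshold, and both function names are assumptions rather than the paper's definition. Scores are assumed to lie in [0, 1].

```python
def grounding_reward(score_history, score_full, min_gain=0.0):
    """Reward only the alignment the continuation adds beyond the history.

    score_history: alignment score of the observed history alone.
    score_full: alignment score of the history plus the generated continuation.
    """
    gain = score_full - score_history      # history baseline
    if gain <= min_gain:                   # progress gate (assumed threshold)
        return 0.0
    headroom = 1.0 - score_history         # relative-progress normalization
    return gain / headroom if headroom > 0 else 0.0

def reward_without_history_baseline(score_full):
    """Ablated variant: the raw alignment score is used as the reward."""
    return score_full

# A continuation that merely paraphrases the history adds no alignment.
paraphrase_reward = grounding_reward(score_history=0.6, score_full=0.6)
paraphrase_reward_ablated = reward_without_history_baseline(score_full=0.6)
```

Under these assumptions the paraphrased continuation earns nothing with the history baseline in place, while the ablated variant still pays it the full history score. That is the inflation the baseline is described as preventing.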

Procedural diversity on COIN

A reward that credits many valid orderings should leave the policy free to express that variety. On the COIN test split, SFT collapses on both axes (lowest diversity and lowest accuracy), while RECIPE-RL is the only configuration in the upper-right region: it matches the base checkpoint's diversity while adding nearly 10 accuracy points.
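
The page does not define the diversity measure. As a rough illustration of what "variety" could mean for sampled plans, the sketch below computes the share of distinct step sequences among several samples for the same input. The metric, the normalization, and the function name are assumptions, and the example plans are made up.

```python
def distinct_plan_ratio(plans):
    """Fraction of sampled plans that are distinct step sequences.

    Steps are compared after trimming whitespace and lowercasing, so purely
    cosmetic differences do not count as variety, while reorderings do.
    """
    if not plans:
        return 0.0
    unique = {tuple(step.strip().lower() for step in plan) for plan in plans}
    return len(unique) / len(plans)

# Made-up samples: two are the same plan up to casing, one reorders the steps.
sampled_plans = [
    ["Crack eggs", "Whisk"],
    ["crack eggs", "whisk"],
    ["Whisk", "Crack eggs"],
]
diversity = distinct_plan_ratio(sampled_plans)
```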

Procedural diversity vs. Score@1 on COIN: RECIPE-RL preserves variety while improving accuracy

Contact

For technical questions, please reach out via email.

Email: seminara.l@northeastern.edu