Visual planning asks a model to generate the remaining steps of a procedure in natural language, given a partial video context and a goal. Progress is bottlenecked by annotation: clean labeled datasets are small and encode a single execution trajectory per example, while large-scale instructional corpora such as HowTo100M are noisy and unsuitable as a direct supervision source.
RECIPE (Reward from Empirical Corpus of Instructional Procedure Examples) reframes such corpora as verifiers rather than label sources. Writing a correct plan label is expensive, but checking whether a generated plan is temporally grounded in automatic speech recognition (ASR) transcripts is comparatively cheap and scales to millions of videos via precomputed text embeddings. We exploit this asymmetry by using grounding quality as the reward signal for Group Relative Policy Optimization (GRPO), training the planner end-to-end.
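To make the idea of temporal grounding concrete, the following is a minimal sketch of one way a grounding score could be computed from precomputed embeddings. It is an illustration under stated assumptions, not the scoring function used by RECIPE: the names `cosine` and `grounding_score`, the use of cosine similarity, and the order-preserving dynamic program are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def grounding_score(step_embs, segment_embs):
    """Hypothetical grounding score (an assumption, not the paper's definition).

    Matches each plan step to a distinct transcript segment so that the
    matched segments appear in the same temporal order as the steps, and
    returns the mean cosine similarity of the best such matching.
    """
    n, m = len(step_embs), len(segment_embs)
    if n == 0 or n > m:
        # No order-preserving matching exists; return the lowest cosine value.
        return -1.0
    neg_inf = float("-inf")
    # best[i][j]: best total similarity for the first i steps
    # matched within the first j transcript segments.
    best = [[0.0] * (m + 1)] + [[neg_inf] * (m + 1) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(i, m + 1):
            skip = best[i][j - 1]  # leave segment j unmatched
            take = best[i - 1][j - 1] + cosine(step_embs[i - 1], segment_embs[j - 1])
            best[i][j] = max(skip, take)
    return best[n][m] / n


# Toy embeddings: three transcript segments in temporal order.
segments = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
in_order = grounding_score([[1, 0, 0], [0, 0, 1]], segments)
reversed_plan = grounding_score([[0, 0, 1], [1, 0, 0]], segments)
```

In this toy case the plan whose steps follow the transcript order scores higher than the same steps reversed, which is the property a temporally grounded reward needs. Because the segment embeddings can be computed once and cached, scoring a new plan only requires embedding its steps.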
Key contributions:
- Grounding-as-verification. We use instructional videos as verifiers for generated plans, optimizing a history-normalized GRPO reward across text- and video-conditioned planners.
- Strong empirical gains. RECIPE-RL improves results at every scale and on every benchmark, outperforms supervised and pseudo-label training, transfers to video-token planning, and strengthens search-based planning.
- Open-ended evaluation. We introduce a seven-dataset, six-criterion LLM-judge evaluation and will release code plus an online server for open-dictionary visual planning.
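
For readers unfamiliar with GRPO, the sketch below shows the standard group-relative advantage computation that turns per-plan rewards into a training signal: several plans are sampled for the same input, and each plan's reward is standardized against the group's mean and standard deviation. The function name `group_relative_advantages` and the `eps` stabilizer are assumptions for illustration. The history normalization mentioned in the contributions is not defined in this section and is not modeled here.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled plans.

    `rewards` holds one scalar reward per plan sampled for the same input.
    `eps` (an assumed stabilizer) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled plans for one input, scored by some grounding reward.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
# Identical rewards carry no preference signal, so every advantage is zero.
flat = group_relative_advantages([0.5, 0.5, 0.5])
```

Plans that score above the group mean receive positive advantages and are reinforced, while below-average plans are discouraged. Only the relative ordering within a group matters, so the reward scale does not need to be calibrated across inputs.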