TrajPilot: Trajectory-Conditioned Egocentric Prediction

Shot-outcome prediction accuracy
Method	Description	Accuracy
Pixels only	No trajectory (near chance)	56.8%
V-JEPA 2.1 attentive probe	Pixels only, stronger head	63.5%
TrajPilot (Ours)	Predicted future trajectory	81.1%
Oracle	Observed future trajectory	93.2%

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way.

Two findings shape our approach. First, the future camera trajectory (the path the head carves through space) lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, and it substantially outperforms language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that the trajectory need not be observed at test time to recover most of the gain.
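As one rough illustration of "most of the gain", the shot-outcome accuracies reported on this page can be turned into the fraction of the baseline-to-oracle gap that a predicted trajectory closes. The helper name `gap_recovered` and the metric itself are assumptions made here for illustration, not something the page defines.

```python
# Accuracies (%) copied from the shot-outcome table on this page.
ACC = {
    "pixels_only": 56.8,   # no trajectory
    "vjepa_probe": 63.5,   # pixels only, stronger head
    "trajpilot": 81.1,     # predicted future trajectory
    "oracle": 93.2,        # observed future trajectory
}


def gap_recovered(baseline: float, method: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by `method`.

    Hypothetical summary metric, introduced only for this illustration.
    """
    return (method - baseline) / (oracle - baseline)


vs_pixels = gap_recovered(ACC["pixels_only"], ACC["trajpilot"], ACC["oracle"])
vs_probe = gap_recovered(ACC["vjepa_probe"], ACC["trajpilot"], ACC["oracle"])

print(f"gap recovered vs pixels only:     {vs_pixels:.1%}")
print(f"gap recovered vs attentive probe: {vs_probe:.1%}")
```

Under this reading, the predicted trajectory closes roughly two thirds of the gap to the oracle relative to the pixels-only baseline, and a little under 60% relative to the stronger attentive probe.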

We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.
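The page does not spell out how predictions are read out of the action-aligned embedding space. One common pattern, assumed here purely for illustration, is nearest-neighbour lookup by cosine similarity against a bank of action embeddings. The action names, the three-dimensional toy vectors, and the function `read_out` are all hypothetical.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical action bank. In an action-aligned space these vectors would be
# shaped by language during training; at inference they only serve as readout
# targets and are never fed to the model as a conditioning input.
ACTION_EMBEDDINGS: Dict[str, List[float]] = {
    "pick up knife": [0.9, 0.1, 0.0],
    "cut onion": [0.1, 0.9, 0.1],
    "wash hands": [0.0, 0.2, 0.9],
}


def read_out(predicted: List[float]) -> str:
    """Return the action whose embedding is closest to the predicted one."""
    return max(
        ACTION_EMBEDDINGS,
        key=lambda name: cosine(predicted, ACTION_EMBEDDINGS[name]),
    )


# Toy predicted embedding standing in for the model's output.
predicted_embedding = [0.2, 0.8, 0.0]
print(read_out(predicted_embedding))
```

The point of the sketch is the separation of roles: language determines where the readout targets sit, while the conditioning inputs stay purely visual and geometric.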

Two empirical findings. Future camera trajectory substantially outperforms language as a conditioning signal, and is itself partially predictable from the observed context, together making future trajectory a deployable conditioning signal at test time.
TrajPilot, a trajectory-piloted predictor of human activity. A causal model conditioned on (start, goal, predicted future trajectory) that reads predictions out in an action-aligned embedding space, with an inference-time scorer that decides when to trust a predicted trajectory.
Procedural planning across four egocentric benchmarks. On Ego-Exo4D atomic, Keystep, GoalStep, and EgoPER, TrajPilot outperforms strong VLM and structured-planner baselines. The trajectory advantage widens with horizon and holds under RGB-only pose estimation.
One recipe, multiple tasks. The same recipe applies to goal-free anticipation on EPIC-Kitchens-100 and predicts whether a shot will score from a few frames of pre-shot context in Ego-Exo4D basketball.
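The inference-time scorer mentioned in the second contribution can be pictured as a gate: pick the highest-scoring candidate trajectory, and fall back to trajectory-free prediction when no candidate looks trustworthy. The sketch below is a guess at that control flow, not the authors' implementation. `Candidate`, `select_trajectory`, `predict`, the threshold rule, and the stub predictors are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical representation: a trajectory is a list of 3D camera positions.
Trajectory = List[Tuple[float, float, float]]


@dataclass
class Candidate:
    trajectory: Trajectory
    score: float  # scorer output; higher is assumed to mean more trustworthy


def select_trajectory(
    candidates: Sequence[Candidate], threshold: float
) -> Optional[Trajectory]:
    """Return the best-scoring trajectory, or None if none clears the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.score)
    return best.trajectory if best.score >= threshold else None


def predict(
    context: str,
    goal: str,
    candidates: Sequence[Candidate],
    threshold: float,
    with_traj: Callable[[str, str, Trajectory], str],
    without_traj: Callable[[str, str], str],
) -> str:
    """Condition on a trusted trajectory when available, else fall back."""
    chosen = select_trajectory(candidates, threshold)
    if chosen is None:
        return without_traj(context, goal)
    return with_traj(context, goal, chosen)


# Stub predictors standing in for the real model.
def stub_with_traj(context: str, goal: str, trajectory: Trajectory) -> str:
    return f"prediction piloted by {len(trajectory)} waypoints"


def stub_without_traj(context: str, goal: str) -> str:
    return "prediction without trajectory"


candidates = [
    Candidate([(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)], score=0.35),
    Candidate([(0.0, 0.0, 0.0), (0.0, 0.1, 0.0), (0.0, 0.2, 0.0)], score=0.80),
]
print(predict("ctx", "goal", candidates, 0.5, stub_with_traj, stub_without_traj))
print(predict("ctx", "goal", candidates, 0.9, stub_with_traj, stub_without_traj))
```

A hard threshold is only one option. A soft weighting across candidates would fit the same description equally well.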

Conditioning	Val ℓ₁ ↓	Δ vs none
none	0.4773	—
text	0.4761	−0.001
trajectory	0.4572	−0.020
text + traj	0.4569	−0.020
shuffled text	0.4575	−0.020
shuffled traj	0.4964	+0.019
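The "Δ vs none" column can be reproduced directly from the validation ℓ₁ values in the table above. The short check below recomputes each delta against the unconditioned baseline. The dictionary name `VAL_L1` is just a label chosen here.

```python
# Validation L1 values copied from the conditioning table on this page.
VAL_L1 = {
    "none": 0.4773,
    "text": 0.4761,
    "trajectory": 0.4572,
    "text + traj": 0.4569,
    "shuffled text": 0.4575,
    "shuffled traj": 0.4964,
}

baseline = VAL_L1["none"]

# Delta relative to the unconditioned model; negative means lower (better) loss.
deltas = {
    name: round(value - baseline, 3)
    for name, value in VAL_L1.items()
    if name != "none"
}

for name, delta in deltas.items():
    print(f"{name:14s} {delta:+.3f}")
```

The recomputed deltas match the table after rounding to three decimals: text alone moves the loss by about 0.001, trajectory lowers it by about 0.020, and shuffled trajectories push it above the unconditioned baseline.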

Setting (Train → Eval)	Description	Mid mAcc	Full mAcc
Aria → Aria	Ground-truth pose	19.47	24.29
PI3 → PI3	RGB-estimated pose	19.65	24.23
Aria → PI3	Source mismatch (failure mode)	12.65	18.96
Aria+PI3 → PI3	Mixed training	19.53	24.37

How You Move Tells What You'll Do