How You Move Tells What You'll Do

Trajectory-Conditioned Egocentric Prediction

Sejoon Jun, Hai Nguyen-Truong, Luigi Seminara, Lorenzo Torresani

Northeastern University

{jun.se, nguyentruong.h, seminara.l, l.torresani}@northeastern.edu

arXiv:2605.20388 Preprint

Do you think the shot will go in?

From a single pre-shot egocentric frame, can you tell?

The same egocentric pre-shot view admits multiple shot outcomes, which the future trajectory disambiguates.
Shot-outcome prediction accuracy
Method | Accuracy
Pixels only (no trajectory, near chance) | 56.8%
V-JEPA 2.1 attentive probe (pixels only, stronger head) | 63.5%
TrajPilot (Ours) (predicted future trajectory) | 81.1%
Oracle (observed future trajectory) | 93.2%
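
As a worked check of how much of the oracle's advantage a predicted trajectory recovers, the short calculation below uses only the accuracies listed in the table above. The dictionary keys and the function name are illustrative choices, not names from the paper.

```python
# Accuracies (%) copied from the shot-outcome table above.
ACCURACY = {
    "pixels_only": 56.8,
    "vjepa_probe": 63.5,
    "trajpilot": 81.1,
    "oracle": 93.2,
}


def fraction_of_oracle_gain(method: str, baseline: str = "pixels_only") -> float:
    """Share of the oracle's improvement over `baseline` that `method` recovers."""
    gain = ACCURACY[method] - ACCURACY[baseline]
    oracle_gain = ACCURACY["oracle"] - ACCURACY[baseline]
    return gain / oracle_gain


recovered = fraction_of_oracle_gain("trajpilot")
probe = fraction_of_oracle_gain("vjepa_probe")
print(f"TrajPilot recovers {recovered:.0%} of the oracle gain")
print(f"The pixels-only attentive probe recovers {probe:.0%}")
```

On these numbers the predicted trajectory closes about two thirds of the gap between the pixels-only model and the oracle that observes the future trajectory.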

Abstract

Predicting how a person's first-person view will evolve (what action will follow, what plan completes a task, whether an in-progress shot will score) is fundamentally under-specified: the same context admits many plausible futures, and a model trained to minimize prediction error is forced to hedge or average across them, getting it wrong either way.

Two findings shape our approach. First, the future camera trajectory, the path the head carves through space, lets the model commit to one of those futures: it carries the operator's intent in a form fine enough to determine how an action will unfold, substantially outperforming language as a conditioning signal. Second, this same intent makes the trajectory itself partially predictable from the context at hand, enough that trajectory need not be observed at test time to recover most of the gain.

We instantiate these findings as TrajPilot, a model that predicts candidate future trajectories from egocentric context and uses them to pilot action prediction in an action-aligned embedding space where language shapes the structure but is never used as a conditioning input. TrajPilot beats VLM and structured-planner baselines on procedural planning across Ego-Exo4D atomic, Ego-Exo4D Keystep, Ego4D GoalStep, and EgoPER, with the trajectory advantage widening with horizon (exactly where prior planners collapse) and holding under RGB-only camera-pose estimation. With the goal masked at inference, the same model performs goal-free anticipation, beating VLM baselines on Ego-Exo4D atomic and extending to EPIC-Kitchens-100 and basketball shot-outcome prediction.

Contributions

  • Two empirical findings. Future camera trajectory substantially outperforms language as a conditioning signal, and is itself partially predictable from the observed context, together making future trajectory a deployable conditioning signal at test time.
  • TrajPilot, a trajectory-piloted predictor of human activity. A causal model conditioned on (start, goal, predicted future trajectory) that reads predictions out in an action-aligned embedding space, with an inference-time scorer that decides when to trust a predicted trajectory.
  • Procedural planning across four egocentric benchmarks. On Ego-Exo4D atomic, Keystep, GoalStep, and EgoPER, TrajPilot outperforms strong VLM and structured-planner baselines. The trajectory advantage widens with horizon and holds under RGB-only pose estimation.
  • One recipe, multiple tasks. The same recipe applies to goal-free anticipation on EPIC-Kitchens-100 and predicts whether a shot will score from a few frames of pre-shot context in Ego-Exo4D basketball.

Key ideas

01

Same view, many futures

Egocentric observations are ambiguous. Multiple actions and outcomes are consistent with the same visual context.

02

Motion reveals intent

Future camera trajectory is physically grounded and fine-grained, capturing how the wearer is about to move.

03

Predict the trajectory

Future trajectory is unobserved at test time, so TrajPilot retrieves candidate trajectories and learns when to trust them.
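
The third idea, retrieving candidate trajectories and deciding when to trust them, can be sketched roughly as follows. This is only an illustration under stated assumptions: the nearest-neighbour retrieval over a hypothetical bank of (context embedding, future trajectory) pairs and the fixed similarity threshold are stand-ins for TrajPilot's retrieval and its learned scorer, whose details are not given on this page.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
# Toy stand-in for a camera trajectory: a list of 3-D positions.
Trajectory = List[Tuple[float, float, float]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def retrieve_candidates(
    context: Vector, bank: List[Tuple[Vector, Trajectory]], k: int = 2
) -> List[Tuple[float, Trajectory]]:
    """Return the k bank trajectories whose stored context is most similar."""
    scored = [(cosine(context, ctx), traj) for ctx, traj in bank]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def choose_trajectory(
    candidates: List[Tuple[float, Trajectory]], trust_threshold: float = 0.8
) -> Optional[Trajectory]:
    """Assumed trust rule: use the best candidate only if its score is high.

    Returning None means "fall back to trajectory-free prediction".
    """
    if not candidates:
        return None
    score, trajectory = candidates[0]
    return trajectory if score >= trust_threshold else None


# Hypothetical bank of (context embedding, future trajectory) pairs.
BANK = [
    ([1.0, 0.0], [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]),
    ([0.0, 1.0], [(0.0, 0.0, 0.0), (0.0, 0.1, 0.0)]),
    ([0.7, 0.7], [(0.0, 0.0, 0.0), (0.0, 0.0, -0.1)]),
]

confident = choose_trajectory(retrieve_candidates([0.9, 0.1], BANK))
uncertain = choose_trajectory(retrieve_candidates([0.5, -0.5], BANK))
print(confident)  # a retrieved trajectory: the context closely matches the bank
print(uncertain)  # None: no candidate is similar enough to be trusted
```

The point of the fallback branch is that a wrong trajectory can mislead the predictor, so a gate of some kind is needed before a predicted trajectory is used as conditioning.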

Two questions

Question 1

Why trajectory?

Language is coarse. The cook can "whisk the eggs" gently or splatter them out of the bowl — the description is shared across success and failure. What separates them is how the body moves.

Camera trajectory is physically grounded and causally tied to outcome. Trained on Ego-Exo4D, a future-latent predictor reduces validation ℓ1 loss roughly 20× more when conditioned on trajectory than when conditioned on language:

Conditioning | Val ℓ1 | Δ vs none
none | 0.4773 | (baseline)
text | 0.4761 | −0.001
trajectory | 0.4572 | −0.020
text + traj | 0.4569 | −0.020
shuffled text | 0.4575 | −0.020
shuffled traj | 0.4964 | +0.019
Shuffling trajectory hurts; shuffling text doesn't.
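
The Δ column and the trajectory-versus-text ratio can be recomputed from the validation losses in the table. The snippet below is a small arithmetic check; the variable and function names are illustrative. With the four-decimal losses the ratio comes out at roughly 17, while the 20× figure corresponds to the deltas as rounded to three decimals (0.020 / 0.001).

```python
# Validation l1 losses copied from the conditioning table above.
VAL_L1 = {
    "none": 0.4773,
    "text": 0.4761,
    "trajectory": 0.4572,
    "text + traj": 0.4569,
    "shuffled text": 0.4575,
    "shuffled traj": 0.4964,
}


def delta_vs_none(name: str) -> float:
    """Change in validation loss relative to the unconditioned model."""
    return VAL_L1[name] - VAL_L1["none"]


for name in VAL_L1:
    if name != "none":
        print(f"{name:14s} {delta_vs_none(name):+.4f}")

# How many times larger the loss reduction is under trajectory than under text.
ratio = delta_vs_none("trajectory") / delta_vs_none("text")
print(f"trajectory / text reduction ratio: about {ratio:.0f}x")
```
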
Question 2

Why not search in V-JEPA space?

A natural baseline is to search over candidate trajectories by ℓ1 distance to the goal latent in V-JEPA space. This performs worse than ignoring trajectory entirely.

V-JEPA latents are organized by visual continuity, not by what someone is doing. The cross-entropy method (CEM) used for this search therefore ends up optimizing visual similarity rather than action correctness, which motivates an action-aligned space.

CEM in V-JEPA latent space underperforms No-Traj at every horizon.
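
For readers unfamiliar with this baseline, the sketch below shows a generic cross-entropy-method search that minimizes ℓ1 distance to a goal latent. It illustrates the kind of search being argued against, not the authors' implementation: the two-dimensional latent, the toy `rollout` function (start latent plus trajectory displacement), and all hyperparameters are assumptions made for the example.

```python
import random
import statistics
from typing import Callable, List, Sequence


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """l1 distance between two latents."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cem_search(
    cost: Callable[[List[float]], float],
    dim: int,
    iters: int = 20,
    pop: int = 64,
    n_elite: int = 8,
    init_std: float = 2.0,
    seed: int = 0,
) -> List[float]:
    """Cross-entropy method: sample, keep the lowest-cost elites, refit, repeat."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(iters):
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)]
        samples.sort(key=cost)
        elites = samples[:n_elite]
        # Refit a diagonal Gaussian to the elite samples.
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]
    return mean


START = [0.0, 0.0]
GOAL = [1.0, -2.0]


def rollout(trajectory: List[float]) -> List[float]:
    """Toy latent predictor (assumption): the trajectory shifts the start latent."""
    return [s + t for s, t in zip(START, trajectory)]


best = cem_search(lambda traj: l1(rollout(traj), GOAL), dim=2)
print("best trajectory:", [round(v, 3) for v in best])
print("l1 to goal:", round(l1(rollout(best), GOAL), 4))
```

The search reliably drives the ℓ1 cost down, which is exactly the concern raised above: a low distance in a visually organized latent space does not imply the right action.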

Our architecture

TrajPilot compresses start and goal clips with frozen V-JEPA 2.1 features, aligns trajectory embeddings with action semantics, and rolls out a causal predictor whose outputs are read against an EgoVLPv2 action bank.

TrajPilot architecture: training and inference modes.
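
Reading a prediction "against an action bank" can be pictured as a nearest-neighbour lookup by cosine similarity. The sketch below assumes a tiny hand-written bank with three-dimensional embeddings; in TrajPilot the bank entries come from EgoVLPv2, and the exact scoring rule is not specified on this page, so the function and variable names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def read_out(
    predicted: Sequence[float], action_bank: Dict[str, Sequence[float]], k: int = 1
) -> List[str]:
    """Return the k action labels whose embeddings best match the prediction."""
    p = normalize(predicted)
    scores = {
        action: sum(x * y for x, y in zip(p, normalize(embedding)))
        for action, embedding in action_bank.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical action bank; real embeddings would be high-dimensional.
ACTION_BANK = {
    "crack egg": [1.0, 0.0, 0.0],
    "whisk eggs": [0.0, 1.0, 0.0],
    "pour milk": [0.0, 0.0, 1.0],
}

predicted_embedding = [0.1, 0.9, 0.2]
top2 = read_out(predicted_embedding, ACTION_BANK, k=2)
print(top2)
```

Because the readout is a lookup over a bank of label embeddings, the vocabulary can be changed by swapping the bank without retraining the predictor, which is one plausible reason for this design in open-vocabulary settings.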

Results

TrajPilot improves planning and anticipation across open- and closed-vocabulary settings on Ego-Exo4D, with the trajectory advantage widening at long horizons.

Atomic-action planning results across horizons.
Planning. Left and middle: open vocabulary (|V|=8,472), Mid R@1 and exact mid-sequence match versus horizon; baselines are VLMs. Right: closed vocabulary (|V|=1,165) against structured procedural planners.
Open-vocabulary atomic anticipation results.
Anticipation. Open-vocabulary atomic anticipation on Ego-Exo4D with the goal removed at inference. TrajPilot (Scorer) beats the best VLM by +13 to +19 pp across horizons.
Robustness to RGB-only camera-pose estimation (Keystep, H=3–8)
Setting (Train → Eval) | Mid mAcc | Full mAcc
Aria → Aria (ground-truth pose) | 19.47 | 24.29
PI3 → PI3 (RGB-estimated pose) | 19.65 | 24.23
Aria → PI3 (source mismatch, failure mode) | 12.65 | 18.96
Aria+PI3 → PI3 (mixed training) | 19.53 | 24.37

Qualitative examples

Predicted action plans versus ground truth on Ego-Exo4D atomic actions. TrajPilot recovers the correct steps where the VLM baseline substitutes plausible-looking but wrong actions.

Predicted action plans vs ground truth on four Ego-Exo4D examples.

BibTeX

@misc{jun2026trajpilot,
  title         = {How You Move Tells What You'll Do:
                   Trajectory-Conditioned Egocentric Prediction},
  author        = {Jun, Sejoon and Nguyen-Truong, Hai and
                   Seminara, Luigi and Torresani, Lorenzo},
  year          = {2026},
  eprint        = {2605.20388},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.20388},
  url           = {https://arxiv.org/abs/2605.20388}
}