Task: Put the blue cube on the red cube
For the policy to generalize, the VLM itself must generalize well to new tasks. We fine-tune a pre-trained VLM on a large, automatically labeled robotics dataset to produce paths and masking points. Paths tell the policy what to do, and masking points tell it where to focus.
Data Labeling Pipeline: The pipeline (1) detects task-relevant moving points, (2) masks irrelevant areas and extracts gripper paths, (3) segments trajectories, and (4) generates gripper paths and masking points for each segment. This scalable pipeline lets us label 20 robotics datasets from Open-X! A toy sketch of the four steps follows below.
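To make the four steps concrete, here is a self-contained toy sketch in Python. Every heuristic is an illustrative stand-in, not the released pipeline: frame differencing stands in for the actual moving-point detector, the gripper path is assumed to be pre-projected into pixels, and gripper open/close transitions stand in for the segmentation rule.

```python
"""Toy sketch of the auto-labeling pipeline (step numbers match the text above).
All heuristics are illustrative stand-ins, not the released implementation."""
import numpy as np

def label_trajectory(frames, gripper_xy, gripper_open, motion_thresh=10.0):
    """frames: (T, H, W, 3) uint8 video; gripper_xy: (T, 2) gripper pixels
    (already projected, standing in for step 2's path extraction);
    gripper_open: (T,) bool gripper state."""
    T, H, W, _ = frames.shape

    # (1) Detect task-relevant moving points, here via simple frame
    # differencing as a stand-in for a real point tracker.
    diff = np.abs(frames[1:].astype(np.float32)
                  - frames[:-1].astype(np.float32)).mean(axis=-1)
    moving = diff.max(axis=0) > motion_thresh           # (H, W) bool

    # (2) Keep only the moving pixels as masking points; everything else
    # counts as irrelevant area to be masked out.
    ys, xs = np.nonzero(moving)
    mask_points = (np.stack([xs, ys], axis=-1)
                   if len(xs) else np.empty((0, 2), int))

    # (3) Segment the trajectory at gripper open/close transitions.
    change = np.nonzero(gripper_open[1:] != gripper_open[:-1])[0] + 1
    bounds = [0, *change.tolist(), T]
    segments = list(zip(bounds[:-1], bounds[1:]))

    # (4) Emit one (gripper path, masking points) label per segment.
    return [{"path": gripper_xy[s:e], "mask_points": mask_points}
            for s, e in segments]
```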
VLM Fine-tuning: We fine-tune VILA-1.5-3B on the automatically labeled data to produce paths and masking points conditioned on robot image observations and task instructions.
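For intuition, here is a hedged sketch of how one fine-tuning example might be serialized. The prompt wording and the coordinate text format are assumptions for illustration, not PEEK's released spec.

```python
# Hypothetical serialization of one VLM fine-tuning example; the prompt
# wording and coordinate format are assumed, not PEEK's released spec.
def make_vlm_example(image, instruction, path, mask_points):
    """path, mask_points: lists of (x, y) integer pixel coordinates."""
    prompt = (
        f"<image>\nTask: {instruction}\n"
        "Predict the gripper path and the points to keep visible."
    )
    # Serialize coordinates as plain text tokens the VLM can emit.
    path_str = "; ".join(f"({x},{y})" for x, y in path)
    mask_str = "; ".join(f"({x},{y})" for x, y in mask_points)
    target = f"path: {path_str}\nmask: {mask_str}"
    return {"image": image, "prompt": prompt, "target": target}
```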
At training time, we use the VLM to predict paths and masks every $H$ steps in closed loop. These paths and masks are drawn onto all images that the policy sees, $o_t^{p,m}$, from which it predicts low-level environment actions. At inference time, the VLM is used in the same way, constantly predicting updated paths and masks in closed loop during policy execution. This process is compatible with any RGB-input policy!
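A minimal sketch of this closed loop, assuming hypothetical `vlm.predict(rgb, instruction)`, `policy.act(image)`, and gym-like `env` wrappers. The overlay rendering, which keeps only disks around the masking points and draws the path in red, is one plausible scheme rather than the paper's exact one.

```python
"""Minimal closed-loop sketch. `vlm.predict`, `policy.act`, and the overlay
rendering are hypothetical stand-ins for the paper's actual interfaces."""
import numpy as np

def draw_overlays(rgb, path, mask_points, radius=40):
    """Render o_t^{p,m}: keep only regions near the masking points (one
    plausible masking scheme), then draw the predicted path in red.
    Coordinates are integer pixels."""
    yy, xx = np.mgrid[: rgb.shape[0], : rgb.shape[1]]
    keep = np.zeros(rgb.shape[:2], bool)
    for x, y in mask_points:
        keep |= (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
    img = rgb.copy()
    img[~keep] = 0                                   # mask irrelevant areas
    for x, y in path:
        img[max(y - 2, 0): y + 3, max(x - 2, 0): x + 3] = (255, 0, 0)
    return img

def run_episode(env, vlm, policy, instruction, H=8, max_steps=200):
    obs = env.reset()
    path, mask_points = [], []
    for t in range(max_steps):
        if t % H == 0:  # re-query the VLM every H steps, in closed loop
            path, mask_points = vlm.predict(obs["rgb"], instruction)
        obs_pm = draw_overlays(obs["rgb"], path, mask_points)
        action = policy.act(obs_pm)  # any RGB-input policy works here
        obs, done = env.step(action)  # hypothetical env API
        if done:
            break
```

Because the overlays are rendered directly into the RGB observation, the low-level policy needs no architectural changes, which is what makes the scheme policy-agnostic.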
(1) Franka Sim-to-Real: We train only in simulation and evaluate zero-shot in the real world with a 3D policy (3DDA).
(2) WidowX Bridge: We train on the BRIDGE dataset and evaluate zero-shot on our custom WidowX setup.
Task: Put the blue cube on the red cube
Task: Put the blue cube on the red cube
8x speed; pauses come mainly from FoundationStereo depth queries.
Task: Give the banana to Jensen Huang (gets stuck)
Task: Put the green pepper in the green drawer (wrong drawer)
Task: Put the healthy food on the plate (grabs pizza instead of carrot)
Task: Give the banana to Jensen Huang
Task: Put the green pepper in the green drawer
Task: Put the healthy food on the plate
Videos play at dynamic iPhone timelapse speed
Task: Push the button (doesn't know what to do)
Task: Put the carrot in the drawer (goes for drawer)
Task: Slide the pot to the shrimp (goes for shrimp)
Task: Push the button
Task: Put the carrot in the drawer (partial success: missed drop)
Task: Slide the pot to the shrimp
Videos play at dynamic iPhone timelapse speed
Task: Put the flamingo on Kobe Bryant
Task: Close the drawer
Task: Put the Labubu toy on the oreo box
Task: Put the spiderman in the lunch box
Task: Put the Labubu toy in the shopping bag
Videos play at 8x speed
Task: Put the green pepper in the green drawer
Task: Put the blue block on the red block
Task: Put the healthy food on the plate
Task: Put the red block on the blue block
| Paths $p$ | Masks $m$ | Success (%) |
|---|---|---|
| ✗ | ✗ | 33.5 ± 3.1 |
| ✓ | ✗ | 52.8 ± 2.9 |
| ✗ | ✓ | 65.6 ± 3.1 |
| ✓ | ✓ | 73.6 ± 3.9 |
PEEK benefits from both paths and masks: their combination achieves the highest success rate even on plain block stacking in simulation, with no generalization challenges. The real-world results show that masks become even more important in cluttered real-world scenes.
@article{zhang2025peek,
  title={PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies},
  author={Jesse Zhang and Marius Memmel and Kevin Kim and Dieter Fox and Jesse Thomason and Fabio Ramos and Erdem Bıyık and Abhishek Gupta and Anqi Li},
  journal={arXiv preprint arXiv:2509.18282},
  year={2025},
}