Task: Put the blue cube on the red cube
For the policy to generalize, the VLM itself must generalize well to new tasks. We fine-tune a pre-trained VLM on a large, automatically labeled robotics dataset to produce paths and masking points. Paths tell the policy what to do, and masking points tell it where to focus.
Data Labeling Pipeline: The pipeline (1) detects task-relevant moving points, (2) masks irrelevant areas and extracts gripper paths, (3) segments trajectories, and (4) generates gripper paths and masking points for each segment. This scalable pipeline lets us label 20 robotics datasets from Open-X! A toy sketch of the four steps follows below.
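To make the four steps concrete, here is a self-contained toy sketch in Python. Every heuristic is an illustrative stand-in, not the released pipeline: frame differencing stands in for the actual moving-point detector, the gripper path is assumed to be pre-projected into pixels, and gripper open/close transitions stand in for the segmentation rule.

```python
"""Toy sketch of the auto-labeling pipeline (step numbers match the text above).
All heuristics are illustrative stand-ins, not the released implementation."""
import numpy as np

def label_trajectory(frames, gripper_xy, gripper_open, motion_thresh=10.0):
    """frames: (T, H, W, 3) uint8 video; gripper_xy: (T, 2) gripper pixels
    (already projected, standing in for step 2's path extraction);
    gripper_open: (T,) bool gripper state."""
    T, H, W, _ = frames.shape

    # (1) Detect task-relevant moving points, here via simple frame
    # differencing as a stand-in for a real point tracker.
    diff = np.abs(frames[1:].astype(np.float32)
                  - frames[:-1].astype(np.float32)).mean(axis=-1)
    moving = diff.max(axis=0) > motion_thresh           # (H, W) bool

    # (2) Keep only the moving pixels as masking points; everything else
    # counts as irrelevant area to be masked out.
    ys, xs = np.nonzero(moving)
    mask_points = (np.stack([xs, ys], axis=-1)
                   if len(xs) else np.empty((0, 2), int))

    # (3) Segment the trajectory at gripper open/close transitions.
    change = np.nonzero(gripper_open[1:] != gripper_open[:-1])[0] + 1
    bounds = [0, *change.tolist(), T]
    segments = list(zip(bounds[:-1], bounds[1:]))

    # (4) Emit one (gripper path, masking points) label per segment.
    return [{"path": gripper_xy[s:e], "mask_points": mask_points}
            for s, e in segments]
```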
VLM Fine-tuning: We fine-tune VILA-1.5-3B on the automatically labeled data to produce paths and masking points conditioned on robot image observations and task instructions.
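For intuition, here is a hedged sketch of how one fine-tuning example might be serialized. The prompt wording and the coordinate text format are assumptions for illustration, not PEEK's released spec.

```python
# Hypothetical serialization of one VLM fine-tuning example; the prompt
# wording and coordinate format are assumed, not PEEK's released spec.
def make_vlm_example(image, instruction, path, mask_points):
    """path, mask_points: lists of (x, y) integer pixel coordinates."""
    prompt = (
        f"<image>\nTask: {instruction}\n"
        "Predict the gripper path and the points to keep visible."
    )
    # Serialize coordinates as plain text tokens the VLM can emit.
    path_str = "; ".join(f"({x},{y})" for x, y in path)
    mask_str = "; ".join(f"({x},{y})" for x, y in mask_points)
    target = f"path: {path_str}\nmask: {mask_str}"
    return {"image": image, "prompt": prompt, "target": target}
```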
At training time, we use the VLM to predict paths and masks every $H$ steps in closed loop. These paths and masks are drawn onto all images that the policy sees, $o_t^{p,m}$, from which it predicts low-level environment actions. At inference time, the VLM is used in the same way, constantly predicting updated paths and masks in closed loop during policy execution. This process is compatible with any RGB-input policy!
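A minimal sketch of this closed loop, assuming hypothetical `vlm.predict(rgb, instruction)`, `policy.act(image)`, and gym-like `env` wrappers. The overlay rendering, which keeps only disks around the masking points and draws the path in red, is one plausible scheme rather than the paper's exact one.

```python
"""Minimal closed-loop sketch. `vlm.predict`, `policy.act`, and the overlay
rendering are hypothetical stand-ins for the paper's actual interfaces."""
import numpy as np

def draw_overlays(rgb, path, mask_points, radius=40):
    """Render o_t^{p,m}: keep only regions near the masking points (one
    plausible masking scheme), then draw the predicted path in red.
    Coordinates are integer pixels."""
    yy, xx = np.mgrid[: rgb.shape[0], : rgb.shape[1]]
    keep = np.zeros(rgb.shape[:2], bool)
    for x, y in mask_points:
        keep |= (xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2
    img = rgb.copy()
    img[~keep] = 0                                   # mask irrelevant areas
    for x, y in path:
        img[max(y - 2, 0): y + 3, max(x - 2, 0): x + 3] = (255, 0, 0)
    return img

def run_episode(env, vlm, policy, instruction, H=8, max_steps=200):
    obs = env.reset()
    path, mask_points = [], []
    for t in range(max_steps):
        if t % H == 0:  # re-query the VLM every H steps, in closed loop
            path, mask_points = vlm.predict(obs["rgb"], instruction)
        obs_pm = draw_overlays(obs["rgb"], path, mask_points)
        action = policy.act(obs_pm)  # any RGB-input policy works here
        obs, done = env.step(action)  # hypothetical env API
        if done:
            break
```

Because the overlays are rendered directly into the RGB observation, the low-level policy needs no architectural changes, which is what makes the scheme policy-agnostic.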
(1) Franka Sim-to-Real: We train only in simulation and evaluate zero-shot in the real world with a 3D policy (3DDA).
(2) WidowX Bridge: We train on the BRIDGE dataset and evaluate zero-shot on our custom WidowX setup.
Task: Put the blue cube on the red cube
Task: Put the blue cube on the red cube
8x speed; pauses come mainly from FoundationStereo depth queries.
Task: Give the banana to Jensen Huang (gets stuck)
Task: Put the green pepper in the green drawer (wrong drawer)
Task: Put the healthy food on the plate (grabs pizza instead of carrot)
Task: Give the banana to Jensen Huang
Task: Put the green pepper in the green drawer
Task: Put the healthy food on the plate
Videos play at dynamic iPhone timelapse speed
Task: Push the button (doesn't know what to do)
Task: Put the carrot in the drawer (goes for drawer)
Task: Slide the pot to the shrimp (goes for shrimp)
Task: Push the button
Task: Put the carrot in the drawer (partial success: missed drop)
Task: Slide the pot to the shrimp
Videos play at dynamic iPhone timelapse speed
Task: Put the flamingo on Kobe Bryant
Task: Close the drawer
Task: Put the Labubu toy on the oreo box
Task: Put the spiderman in the lunch box
Task: Put the Labubu toy in the shopping bag
Videos play at 8x speed
Task: Put the green pepper in the green drawer
Task: Put the blue block on the red block
Task: Put the healthy food on the plate
Task: Put the red block on the blue block
| Paths $p$ | Masks $m$ | Success (%) |
|---|---|---|
| ✗ | ✗ | 33.5 ± 3.1 |
| ✓ | ✗ | 52.8 ± 2.9 |
| ✗ | ✓ | 65.6 ± 3.1 |
| ✓ | ✓ | 73.6 ± 3.9 |
PEEK benefits from both paths and masks: their combination achieves the highest success rate even on plain block stacking in simulation, with no generalization challenges. The real-world results show that masks become even more important in cluttered real-world scenes.
@article{zhang2025peek,
  title={PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies},
  author={Jesse Zhang and Marius Memmel and Kevin Kim and Dieter Fox and Jesse Thomason and Fabio Ramos and Erdem Bıyık and Abhishek Gupta and Anqi Li},
  journal={arXiv preprint arXiv:2509.18282},
  year={2025},
}