Goal-VLA: Image-Generative VLMs as Object-Centric World Models Enable Zero-shot Robot Manipulation


1 National University of Singapore
2 The University of Hong Kong, 3 Peking University, 4 Tsinghua University
* denotes equal contribution; † denotes corresponding author

Abstract

Generalization remains a fundamental challenge in robotic manipulation. To tackle this challenge, recent Vision-Language-Action (VLA) models build policies on top of Vision-Language Models (VLMs), seeking to transfer their open-world semantic knowledge. However, their zero-shot capability lags significantly behind that of the base VLMs, as the available instruction-vision-action data is too limited to cover diverse scenarios, tasks, and robot embodiments. In this work, we present Goal-VLA, a zero-shot framework that leverages Image-Generative VLMs as world models to generate desired goal states, from which the target object pose is derived to enable generalizable manipulation. The key insight is that the object state representation is the golden interface, naturally separating a manipulation system into high-level and low-level policies. This representation abstracts away explicit action annotations, allowing the use of highly generalizable VLMs while simultaneously providing spatial cues for training-free low-level control. To further improve robustness, we introduce a Reflection-through-Synthesis process that iteratively validates and refines the generated goal image before execution. Both simulated and real-world experiments demonstrate that Goal-VLA achieves strong performance and promising generalizability in manipulation tasks.
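To make the Reflection-through-Synthesis idea concrete, below is a minimal Python sketch of a generate-validate-refine loop. It is an illustrative sketch, not the authors' implementation: the `generate_goal` and `critique_goal` callables are assumed placeholders standing in for the image-generative VLM and its self-verification prompt.

```python
# Hedged sketch of a Reflection-through-Synthesis loop: propose a goal image,
# let the VLM critique it against the instruction, and regenerate with the
# feedback until the goal is judged feasible. The two callables are assumed
# placeholders for the image-generative VLM, not a released Goal-VLA API.
from typing import Callable, Optional, Tuple

def reflect_through_synthesis(
    instruction: str,
    initial_image,                                                      # e.g. an RGB array of the current scene
    generate_goal: Callable[[str, object, Optional[str]], object],      # (instruction, image, feedback) -> goal image
    critique_goal: Callable[[str, object, object], Tuple[bool, str]],   # (instruction, image, goal) -> (ok, feedback)
    max_rounds: int = 3,
):
    """Return a validated goal image, or the last attempt if no round passes."""
    feedback = None
    goal = generate_goal(instruction, initial_image, feedback)
    for _ in range(max_rounds):
        ok, feedback = critique_goal(instruction, initial_image, goal)
        if ok:
            return goal
        # Regenerate, conditioning on the critique of the previous attempt.
        goal = generate_goal(instruction, initial_image, feedback)
    return goal
```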


Framework Overview

Overview of the Goal-VLA framework, which decouples the manipulation pipeline into three stages: (a) Goal State Reasoning: A VLM generates a goal image from instructions and refines it for task feasibility, yielding a validated goal with image, mask, and depth. (b) Spatial Grounding: The object's transformation is computed by feature matching and point cloud registration between the initial and goal states. (c) Low-level Policy: The gripper's goal pose is derived by applying the object's transformation to a contact pose, after which a motion planner generates the final trajectory for robot execution.
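As a concrete illustration of stages (b) and (c), the sketch below estimates the object's rigid transformation from matched 3D points in the initial and goal states (Kabsch/SVD alignment, one common choice for point cloud registration) and applies it to a contact pose to obtain the gripper's goal pose. This is a simplified stand-in under assumed inputs, not the paper's exact grounding module: feature matching, depth estimation, and motion planning are omitted, and the example points, rotation, and contact pose are made-up numbers.

```python
# Simplified sketch of spatial grounding + goal-pose derivation: given matched
# 3D points of the object in the initial and goal states, estimate the rigid
# transform T_obj with the Kabsch/SVD method, then carry the grasp (contact)
# pose along with the object to obtain the gripper's goal pose.
import numpy as np

def estimate_rigid_transform(pts_init: np.ndarray, pts_goal: np.ndarray) -> np.ndarray:
    """Least-squares rigid transform (4x4) mapping pts_init (Nx3) onto pts_goal (Nx3)."""
    c_init, c_goal = pts_init.mean(axis=0), pts_goal.mean(axis=0)
    H = (pts_init - c_init).T @ (pts_goal - c_goal)       # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                              # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_goal - R @ c_init
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Matched object points before (initial state) and after (goal state) the motion;
# in the real pipeline these would come from feature matching and back-projection.
pts_init = np.array([[0.40, 0.00, 0.02], [0.45, 0.05, 0.02],
                     [0.40, 0.05, 0.06], [0.45, 0.00, 0.06]])
theta = np.deg2rad(30.0)                                   # made-up 30-degree yaw plus a translation
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
pts_goal = pts_init @ R_true.T + np.array([0.10, -0.05, 0.00])

T_obj = estimate_rigid_transform(pts_init, pts_goal)       # object's transformation (stage b)

# Stage (c): apply the object's transformation to the contact pose to get the
# gripper's goal pose, which would then be handed to a motion planner.
T_contact = np.eye(4)
T_contact[:3, 3] = [0.42, 0.02, 0.10]                      # assumed grasp pose on the object (world frame)
T_gripper_goal = T_obj @ T_contact
print(np.round(T_gripper_goal, 3))
```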

Real-world Experiments (1x)

All experiments are conducted in a zero-shot setting, without any task-specific or additional training.


Goal Visualization
