VLA-OS: Structuring and Dissecting Planning Representations
and Paradigms in Vision-Language-Action Models

1National University of Singapore, 2University of Science and Technology of China, 3Tsinghua University, 4Nanyang Technological University

TL;DR

Visually grounded planning representations are better than language planning representations.

The Hierarchical-VLA paradigm performs comparably to the Integrated-VLA paradigm but shows greater generalization ability.



VLA-OS is a comprehensive framework for evaluating Vision-Language-Action (VLA) models with task planning capabilities. It provides a unified interface that supports multiple VLA paradigms, composable planning and action heads, various planning representations, systematic evaluation metrics, and diverse evaluation benchmarks and robotic end-effectors.



Abstract

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline of task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components to be further improved. To systematically investigate the impact of different planning paradigms and representations in isolation from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to that of other paradigms in task performance, pretraining, generalization, scalability, and continual learning, albeit at the cost of slower training and inference.


Video Introduction


VLA-OS Model Components

VLA-OS provides standard and unified VLA components for task planning: 1) VLMs with the same model structure. We choose a series of LLM backbones that vary only in parameter count: in this work, the Qwen-2.5 LLM series with model sizes ranging from 0.5B to 7B, pretrained on the LLaVA v1.5 dataset to become VLMs. Note that our codebase supports any off-the-shelf LLM from HuggingFace; 2) unified action heads supporting both regression training and flow matching training. These action heads take as input the KV cache from the LLM backbone and output action chunks; 3) unified planning heads for three kinds of planning representations: language reasoning, visual reasoning, and goal-image reasoning.
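To make the action-head interface concrete, here is a minimal sketch in PyTorch under our own assumptions; the class name, dimensions, and the use of backbone hidden states as a stand-in for the KV cache are all illustrative, not the released VLA-OS code. It shows a head whose learned queries cross-attend to backbone features and regress an action chunk with the L1 objective used for regression training.

import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Illustrative action head: learned queries cross-attend to backbone features."""
    def __init__(self, d_model=512, n_heads=8, chunk_len=8, action_dim=7):
        super().__init__()
        # One learned query per action step in the chunk.
        self.queries = nn.Parameter(torch.randn(chunk_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, feats):  # feats: (batch, seq, d_model) from the VLM backbone
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, feats, feats)  # attend over backbone features
        return self.mlp(attended)                       # (batch, chunk_len, action_dim)

# Regression training step with the L1 objective.
head = ActionHead()
feats = torch.randn(4, 128, 512)   # stand-in for cached backbone features
target = torch.randn(4, 8, 7)      # ground-truth action chunk
loss = nn.functional.l1_loss(head(feats), target)
loss.backward()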




VLA-OS Model Family

With the VLA-OS model components, we build a family of VLA-OS models covering the different VLA paradigms: 1) VLA-OS-A for ActionOnly-VLA; 2) VLA-OS-I for Integrated-VLA, with both implicit planning and explicit planning variants; 3) VLA-OS-H for Hierarchical-VLA.


Task Planning Representations and Datasets

We provide three kinds of task planning representations for six manipulation task benchmarks: LIBERO, Deformable (real-world), The Colosseum, FurnitureBench, DexArt, and PerAct2, together with their data annotation code. Please select a task to see its reasoning data visualization videos.


The LIBERO-Long Reasoning Dataset




The Real World Deformable Object Manipulation Reasoning Dataset




The Colosseum Manipulation Reasoning Dataset




The FurnitureBench Reasoning Dataset




The DexArt Dexterous Manipulation Reasoning Dataset




The PerAct2 Dual-Arm Manipulation Reasoning Dataset


Experiment Findings

We perform systematic and controlled experiments to investigate different perspectives of task planning in VLA models using our VLA-OS model family and annotated datasets. The findings are as follows.


Finding 1: For downstream tasks, larger VLA models trained on large-scale datasets do not necessarily outperform smaller models trained from scratch. Model architectures and algorithmic designs still matter at present. The time has not yet come to scale up VLA model sizes.

We train VLA-OS-A on four suites from LIBERO (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long) from scratch with the L1 loss and compare it with Diffusion Policy, fine-tuned OpenVLA, fine-tuned CoT-VLA, fine-tuned DiT Policy, and fine-tuned π_0 and its variant π_0-FAST. Results are shown below.




We can see that VLA-OS-A-S outperforms Diffusion Policy (trained from scratch) by +13.2%, fine-tuned OpenVLA by +9.1%, CoT-VLA by +4.5%, and DiT Policy by +3.2%, and is comparable to fine-tuned π_0-FAST (+0.1%). Although our model still trails the SOTA method, these results demonstrate that our model design is competitive, especially considering that VLA-OS-A-S is trained from scratch and uses only a 0.5B LLM backbone.



Finding 2: For the Integrated-VLA paradigm, implicit planning yields a positive performance gain, whereas explicit planning incurs a significant performance degradation when trained from scratch.

We perform experiments with language planning, visual planning, image foresight planning, and their combinations on the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of the implicit and explicit planning variants of Integrated-VLA. Results are shown below.

The implicit planning paradigm uses various auxiliary task-planning objectives as additional training losses; at inference time it is indistinguishable from ActionOnly-VLA, so it yields a performance improvement. This shows that using task planning as auxiliary losses can improve performance. The explicit planning paradigm, however, must complete the entire planning process before the action head generates actions at inference time, which introduces severe planning accumulation errors. Typically, the number of planning tokens far exceeds the number of action tokens (approximately 2000 vs. 8), which exacerbates the accumulation-error issue compared with action tokens alone. Additionally, the embeddings from every layer of the planning head are fed into the action head, affecting its internal representations. Meanwhile, the action head receives no raw visual or language inputs, only embeddings from the VLM and planning heads, which deprives it of the necessary error-correction capability.
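For concreteness, the two variants can be sketched as follows. This is a simplification of our own; `backbone`, `plan_head`, `action_head`, and the loss weight are illustrative placeholders rather than the VLA-OS API. Implicit planning only changes the training loss, while explicit planning changes the inference path.

def implicit_planning_loss(backbone, plan_head, action_head, batch, lam=0.1):
    # Auxiliary planning objectives are added to the action loss at training
    # time only; inference runs the action head exactly as in ActionOnly-VLA.
    feats = backbone(batch["obs"], batch["lang"])
    action_loss = action_head.loss(feats, batch["actions"])
    plan_loss = plan_head.loss(feats, batch["plan_targets"])
    return action_loss + lam * plan_loss

def explicit_planning_inference(backbone, plan_head, action_head, obs, lang):
    # The full plan (~2000 tokens) must be decoded before any action is
    # generated, so planning errors accumulate into action generation.
    feats = backbone(obs, lang)
    plan = plan_head.generate(feats)
    return action_head(feats, plan)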




Finding 3: Visually grounded planning representations work better than language planning representations, and also offer faster inference and lower training cost.

We perform experiments with language planning, visual planning, image foresight planning, and their combinations on the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of different planning representations. Results are shown below.




Finding 4: When employing multiple planning representations concurrently, Hierarchical-VLA outperforms the Integrated-VLA paradigm.

We illustrate the performance of both the Integrated-VLA and Hierarchical-VLA paradigms with different planning representations on the LIBERO-LONG benchmark.




Finding 5: Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across a broad spectrum of tasks (2D, 3D, simulation, and real‐world), with their performances largely comparable.

We illustrate the performance of all VLA paradigms on the six benchmarks, along with their average success rates. We can see that Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across all benchmarks, and their performances are largely comparable.




Finding 6: Both Integrated-VLA and Hierarchical-VLA benefit similarly from task‐planning pretraining, exhibiting analogous gains in task success rate.

Finding 7: Hierarchical-VLA demonstrates the best generalization ability.

We illustrate the generalization performance of all VLA paradigms on The Colosseum (All-Perturbation) benchmark, and the performance improvement of the Integrated-VLA and Hierarchical-VLA paradigms from task-planning pretraining on LIBERO-90, tested on LIBERO-LONG. We can see that Hierarchical-VLA achieves the best generalization performance, and that both Integrated-VLA and Hierarchical-VLA benefit similarly from task-planning pretraining.



Finding 8: Hierarchical-VLA performs better than Integrated-VLA in task planning.

It is important to discern whether task failures arise from the planning component or from policy learning. Here we use LIBERO-LONG to separately evaluate the task planning and policy learning parts of Integrated-VLA (task planning only) and Hierarchical-VLA for all three planning representations. For evaluation, we manually divide each long-horizon task into several sub-tasks and forcibly reset the environment to the initial state of each sub-task. We then compute the average planning correctness (0 or 1) of the planning outcomes and the execution success (0 or 1) of the action head across all sub-task start points. For a given task trajectory, this yields a Task Decomposition Score (DCS) and a Policy Following Score (PFS). Note that for Hierarchical-VLA, we provide the ground-truth planning results when testing PFS.
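Under our reading of this protocol, the two scores reduce to simple averages of per-subtask binary outcomes; the following sketch (function and variable names are ours, not the paper's code) makes that explicit.

from statistics import mean

def decomposition_score(plan_correct):
    # Task Decomposition Score (DCS): mean planning correctness over subtasks.
    return mean(plan_correct)

def policy_following_score(exec_success):
    # Policy Following Score (PFS): mean execution success over subtasks.
    # For Hierarchical-VLA, ground-truth plans are provided at each subtask.
    return mean(exec_success)

# Example: a 4-subtask trajectory where planning succeeds on 3 subtasks
# and the action head executes 2 of them correctly.
print(decomposition_score([1, 1, 0, 1]))     # 0.75
print(policy_following_score([1, 0, 1, 0]))  # 0.5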

We can see that Hierarchical-VLA consistently outperforms Integrated-VLA in task planning across different planning representations.



Finding 9: Visually grounded planning representations are easier for the low-level policy to follow.

As stated above, we show the policy following score (PFS) of Hierarchical-VLA under different planning representations.

We can see that visually grounded planning representations (visual and image foresight) are easier for the low-level policy to follow.



Finding 10: The autoregressive property of the language-planning representation head is the principal cause of its higher training cost and slower inference speed.

To investigate why different planning representations have different training costs and inference speeds, we show the forward pass of the different planning heads of Hierarchical-VLA in the following picture.

We can see that, because the language planning and visual planning heads are autoregressive, they need several hundred forward passes to generate the planning tokens, incurring high training cost and slow inference, while the image foresight planning head (in this work, a VAR-like generator) needs only 7 forward passes (around 100x fewer than language planning and visual planning) to generate the planning tokens.
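The contrast between the two decoding regimes can be sketched schematically as follows; the interfaces (`next_token`, `next_scale`, `eos`) are simplified placeholders of our own design, not the VLA-OS planning-head API.

def autoregressive_decode(head, context, max_tokens=2000):
    # Language/visual planning: one forward pass per planning token.
    tokens = []
    for _ in range(max_tokens):
        nxt = head.next_token(context, tokens)
        if nxt == head.eos:
            break
        tokens.append(nxt)
    return tokens  # several hundred forward passes in practice

def var_decode(head, context, num_scales=7):
    # Image foresight planning with a VAR-like generator: one forward pass
    # per scale, coarse to fine, for roughly 100x fewer passes overall.
    maps = []
    for _ in range(num_scales):
        maps.append(head.next_scale(context, maps))
    return maps  # 7 forward passes total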



Finding 11: The performance of all VLA paradigms improves as the amount of action-labeled demonstration data increases, i.e., all VLA paradigms exhibit data scalability.

For data scalability, we use LIBERO-LONG, a dataset with 10 tasks and a total of 500 demonstrations. We train the three VLA paradigms at model size S on 10%, 40%, 70%, and 100% of the data.

We can see that all VLA paradigms exhibit data scalability: performance improves as the amount of action-labeled demonstration data increases.



Finding 12: For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size kept under 1B parameters.

For model scalability, we use LIBERO-90, a dataset with 90 tasks and 4,500 demonstrations, training on all of its data. We choose Qwen-2.5 LLM backbones with 0.5B, 1.5B, 3B, and 7B parameters for the experiments.
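Since the codebase accepts any off-the-shelf HuggingFace LLM, swapping backbone sizes for this study amounts to changing a model ID. A minimal sketch follows; the model IDs below are the public Qwen2.5 checkpoints, and the actual VLA-OS integration may differ.

from transformers import AutoModelForCausalLM, AutoTokenizer

for size in ["0.5B", "1.5B", "3B", "7B"]:
    name = f"Qwen/Qwen2.5-{size}"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    # Report the actual parameter count of each backbone.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e9:.2f}B parameters")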

We can see that the performance of all VLA paradigms do not improve as the model size increases, and their performance even decreases when the model size is larger than 3B.



Finding 13: VLA paradigms with task planning (Integrated-VLA and Hierarchical-VLA), compared to the non-planning paradigm (ActionOnly-VLA), achieve higher forward transfer but incur faster forgetting.

We test the continual learning capabilities of the three paradigms on the 10 tasks of LIBERO-LONG sequentially, using Sequential Finetuning (SEQL) as the lifelong learning algorithm. We use the original metrics from LIBERO: forward transfer (FWT) and negative backward transfer (NBT).
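Assuming the usual lifelong-learning formulation, where R[i][j] is the success rate on task j evaluated after sequentially training through task i, the two metrics can be sketched as below. This follows our reading of LIBERO's definitions (which additionally track checkpoints during each task), so consult the LIBERO paper for the exact formulas.

import numpy as np

def fwt(R):
    # Forward transfer: average success on each task right after learning it.
    return float(np.mean(np.diag(R)))

def nbt(R):
    # Negative backward transfer: average drop on task k after later tasks;
    # higher NBT means faster forgetting.
    K = R.shape[0]
    drops = [np.mean(R[k, k] - R[k + 1:, k]) for k in range(K - 1)]
    return float(np.mean(drops))

R = np.array([[0.8, 0.0, 0.0],
              [0.6, 0.7, 0.0],
              [0.5, 0.6, 0.9]])  # toy 3-task success matrix
print(fwt(R), nbt(R))  # 0.8, 0.175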

We can see that Integrated-VLA and Hierarchical-VLA achieve higher forward transfer (FWT) than ActionOnly-VLA, but incur faster forgetting (higher NBT).



Finding 14: Visually grounded planning representations deliver superior forward transfer and exhibit slower forgetting relative to language-based planning representations.

We test the continual learning capabilities of the three planning representations on the 10 tasks of LIBERO-LONG sequentially, again using Sequential Finetuning (SEQL) as the lifelong learning algorithm and the LIBERO metrics FWT and NBT.

We can see that visually grounded planning representations (visual and image foresight) deliver superior forward transfer (FWT) and exhibit slower forgetting (lower NBT) relative to language-based planning representations.



Take Home Messages

1. The time has not yet come to scale up VLA model sizes. (Finding 1)

2. Visually grounded planning representations (visual and image foresight) are better than language planning representations in terms of success rate, low-level following, and continual learning. (Findings 3, 9, 14)

3. Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA in task performance and generalization ability, but incur faster forgetting. (Findings 5, 6, 13)

4. Integrated-VLA and Hierarchical-VLA perform comparably in task performance and planning-head pretraining, but Hierarchical-VLA generalizes better and has stronger task-planning performance. (Findings 4, 7, 8)

5. All VLA paradigms exhibit data scalability. For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size kept under 1B parameters. (Findings 11, 12)


Future Research Directions

1. Why are visually grounded planning representations better than language-based ones?

2. How to avoid gradient conflict between planning head losses and action head losses on the VLM backbone?

3. How to design network architectures to effectively extract information from VLM?

4. How to design faster alternatives to autoregressive planning heads?

5. How to design better low-level action heads with stronger planning-following ability?

6. How to construct large-scale task planning datasets for VLA? How to convert existing datasets into task planning datasets?


Acknowledgments

We thank Zhixuan Xu for his valuable discussions and his guidance on drawing the figures.


BibTeX

@article{gao2025vlaos,
  title   = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
  author  = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
  journal = {arXiv preprint arXiv:2506.17561},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.17561}
}