VLA-OS: Structuring and Dissecting Planning Representations
and Paradigms in Vision-Language-Action Models

1National University of Singapore, 2University of Science and Technology of China, 3Tsinghua University, 4Nanyang Technological University

TL;DR

Visually grounded planning representations are better than language planning representations.

The Hierarchical-VLA paradigm performs comparably to the Integrated-VLA paradigm but shows greater generalization ability.



VLA-OS is a comprehensive framework for evaluating Vision-Language-Action (VLA) models with task planning capabilities. It provides a unified interface that supports multiple VLA paradigms, composable planning and action heads, various planning representations, systematic evaluation metrics, and diverse evaluation benchmarks and robotic end-effectors.



Abstract

Recent studies on Vision-Language-Action (VLA) models have shifted from the end-to-end action-generation paradigm toward a pipeline of task planning followed by action generation, demonstrating improved performance on various complex, long-horizon manipulation tasks. However, existing approaches vary significantly in network architectures, planning paradigms, representations, and training data sources, making it challenging for researchers to identify the precise sources of performance gains and the components to be further improved. To systematically investigate the impact of different planning paradigms and representations in isolation from network architectures and training data, in this paper, we introduce VLA-OS, a unified VLA architecture series capable of various task planning paradigms, and design a comprehensive suite of controlled experiments across diverse object categories (rigid and deformable), visual modalities (2D and 3D), environments (simulation and real-world), and end-effectors (grippers and dexterous hands). Our results demonstrate that: 1) visually grounded planning representations are generally better than language planning representations; 2) the Hierarchical-VLA paradigm generally achieves performance superior or comparable to that of other paradigms in task performance, pretraining, generalization, scalability, and continual learning, albeit at the cost of slower training and inference.


Video Introduction


VLA-OS Model Components

VLA-OS provides standard and unified VLA components for task planning: 1) VLMs with the same model structure. We choose a series of LLM backbones that vary only in parameter count: in this work, the Qwen-2.5 LLM series with model sizes ranging from 0.5B to 7B, pretrained on the LLaVA v1.5 dataset to become VLMs. Note that our codebase supports any off-the-shelf LLM from HuggingFace; 2) unified action heads supporting both regression training and flow matching training. These action heads take as input the KV cache from the LLM backbone and output action chunks; 3) unified planning heads for three kinds of planning representations: language reasoning, visual reasoning, and goal-image reasoning.
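To make the action-head interface concrete, here is a minimal sketch in PyTorch under our own assumptions; the class name, dimensions, and the use of backbone hidden states as a stand-in for the KV cache are all illustrative, not the released VLA-OS code. It shows a head whose learned queries cross-attend to backbone features and regress an action chunk with the L1 objective used for regression training.

import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Illustrative action head: learned queries cross-attend to backbone features."""
    def __init__(self, d_model=512, n_heads=8, chunk_len=8, action_dim=7):
        super().__init__()
        # One learned query per action step in the chunk.
        self.queries = nn.Parameter(torch.randn(chunk_len, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, action_dim)
        )

    def forward(self, feats):  # feats: (batch, seq, d_model) from the VLM backbone
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        attended, _ = self.cross_attn(q, feats, feats)  # attend over backbone features
        return self.mlp(attended)                       # (batch, chunk_len, action_dim)

# Regression training step with the L1 objective.
head = ActionHead()
feats = torch.randn(4, 128, 512)   # stand-in for cached backbone features
target = torch.randn(4, 8, 7)      # ground-truth action chunk
loss = nn.functional.l1_loss(head(feats), target)
loss.backward()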




VLA-OS Model Family

With the VLA-OS model components, we build a family of VLA-OS models covering the different VLA paradigms: 1) VLA-OS-A for ActionOnly-VLA; 2) VLA-OS-I for Integrated-VLA, with both implicit planning and explicit planning variants; 3) VLA-OS-H for Hierarchical-VLA.


Task Planning Representations and Datasets

We provide three kinds of task planning representations for six manipulation task benchmarks: LIBERO, Deformable (real-world), The Colosseum, FurnitureBench, DexArt, and PerAct2, together with their data annotation code. Please select a task to see its reasoning data visualization videos.


The LIBERO-Long Reasoning Dataset




The Real World Deformable Object Manipulation Reasoning Dataset




The Colosseum Manipulation Reasoning Dataset




The FurnitureBench Reasoning Dataset




The DexArt Dexterous Manipulation Reasoning Dataset




The PerAct2 Dual-Arm Manipulation Reasoning Dataset


Experiment Findings

We perform systematic and controlled experiments to investigate different perspectives of task planning in VLA models using our VLA-OS model family and annotated datasets. The findings are as follows.


Finding 1: For downstream tasks, larger VLA models trained on large-scale datasets do not necessarily outperform smaller models trained from scratch. Model architectures and algorithmic designs still matter at present. The time has not yet come to scale up VLA model sizes.

We train VLA-OS-A on four suites from LIBERO (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long) from scratch with the L1 loss and compare it with Diffusion Policy, fine-tuned OpenVLA, fine-tuned CoT-VLA, fine-tuned DiT Policy, and fine-tuned π_0 and its variant π_0-FAST. Results are shown below.




We can see that VLA-OS-A-S outperforms Diffusion Policy (trained from scratch) by +13.2%, fine-tuned OpenVLA by +9.1%, CoT-VLA by +4.5%, and DiT Policy by +3.2%, and is comparable to fine-tuned π_0-FAST (+0.1%). Although our model still trails the SOTA method, these results demonstrate that our model design is competitive, especially considering that VLA-OS-A-S is trained from scratch and uses only a 0.5B LLM backbone.



Finding 2: For the Integrated-VLA paradigm, implicit planning yields a positive performance gain, whereas explicit planning incurs a significant performance degradation when trained from scratch.

We perform experiments with language planning, visual planning, image foresight planning, and their combinations on the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of the implicit and explicit planning variants of Integrated-VLA. Results are shown below.

The implicit planning paradigm uses various auxiliary task-planning objectives as additional training losses; at inference time it is indistinguishable from ActionOnly-VLA, so it yields a performance improvement. This shows that using task planning as auxiliary losses can improve performance. The explicit planning paradigm, however, must complete the entire planning process before the action head generates actions at inference time, which introduces severe planning accumulation errors. Typically, the number of planning tokens far exceeds the number of action tokens (approximately 2000 vs. 8), which exacerbates the accumulation-error issue compared with action tokens alone. Additionally, the embeddings from every layer of the planning head are fed into the action head, affecting its internal representations. Meanwhile, the action head receives no raw visual or language inputs, only embeddings from the VLM and planning heads, which deprives it of the necessary error-correction capability.
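For concreteness, the two variants can be sketched as follows. This is a simplification of our own; `backbone`, `plan_head`, `action_head`, and the loss weight are illustrative placeholders rather than the VLA-OS API. Implicit planning only changes the training loss, while explicit planning changes the inference path.

def implicit_planning_loss(backbone, plan_head, action_head, batch, lam=0.1):
    # Auxiliary planning objectives are added to the action loss at training
    # time only; inference runs the action head exactly as in ActionOnly-VLA.
    feats = backbone(batch["obs"], batch["lang"])
    action_loss = action_head.loss(feats, batch["actions"])
    plan_loss = plan_head.loss(feats, batch["plan_targets"])
    return action_loss + lam * plan_loss

def explicit_planning_inference(backbone, plan_head, action_head, obs, lang):
    # The full plan (~2000 tokens) must be decoded before any action is
    # generated, so planning errors accumulate into action generation.
    feats = backbone(obs, lang)
    plan = plan_head.generate(feats)
    return action_head(feats, plan)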




Finding 3: Visually grounded planning representations work better than language planning representations, and also offer faster inference and lower training cost.

We perform experiments with language planning, visual planning, image foresight planning, and their combinations on the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of different planning representations. Results are shown below.




Finding 4: When employing multiple planning representations concurrently, Hierarchical-VLA outperforms the Integrated-VLA paradigm.

We illustrate the performance of both the Integrated-VLA and Hierarchical-VLA paradigms with different planning representations on the LIBERO-LONG benchmark.




Finding 5: Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across a broad spectrum of tasks (2D, 3D, simulation, and real‐world), with their performances largely comparable.

We illustrate the performance of all VLA paradigms on the six benchmarks, along with their average success rates. We can see that Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across all benchmarks, and their performances are largely comparable.




Finding 6: Both Integrated-VLA and Hierarchical-VLA benefit similarly from task‐planning pretraining, exhibiting analogous gains in task success rate.

Finding 7: Hierarchical-VLA demonstrates the best generalization ability.

We illustrate the generalization performance of all VLA paradigms on The Colosseum (All-Perturbation) benchmark, and the performance improvement of the Integrated-VLA and Hierarchical-VLA paradigms from task-planning pretraining on LIBERO-90, tested on LIBERO-LONG. We can see that Hierarchical-VLA achieves the best generalization performance, and that both Integrated-VLA and Hierarchical-VLA benefit similarly from task-planning pretraining.



Finding 8: Hierarchical-VLA performs better than Integrated-VLA in task planning.

It is important to discern whether task failures arise from the planning component or from policy learning. Here we use LIBERO-LONG to separately evaluate the task planning and policy learning parts of Integrated-VLA (task planning only) and Hierarchical-VLA for all three planning representations. For evaluation, we manually divide each long-horizon task into several sub-tasks and forcibly reset the environment to the initial state of each sub-task. We then compute the average planning correctness (0 or 1) of the planning outcomes and the execution success (0 or 1) of the action head across all sub-task start points. For a given task trajectory, this yields a Task Decomposition Score (DCS) and a Policy Following Score (PFS). Note that for Hierarchical-VLA, we provide the ground-truth planning results when testing PFS.
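Under our reading of this protocol, the two scores reduce to simple averages of per-subtask binary outcomes; the following sketch (function and variable names are ours, not the paper's code) makes that explicit.

from statistics import mean

def decomposition_score(plan_correct):
    # Task Decomposition Score (DCS): mean planning correctness over subtasks.
    return mean(plan_correct)

def policy_following_score(exec_success):
    # Policy Following Score (PFS): mean execution success over subtasks.
    # For Hierarchical-VLA, ground-truth plans are provided at each subtask.
    return mean(exec_success)

# Example: a 4-subtask trajectory where planning succeeds on 3 subtasks
# and the action head executes 2 of them correctly.
print(decomposition_score([1, 1, 0, 1]))     # 0.75
print(policy_following_score([1, 0, 1, 0]))  # 0.5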

We can see that Hierarchical-VLA consistently outperforms Integrated-VLA in task planning across different planning representations.



Finding 9: Visually grounded planning representations are easier for the low-level policy to follow.

As stated above, we show the policy following score (PFS) of Hierarchical-VLA under different planning representations.

We can see that visually grounded planning representations (visual and image foresight) are easier for the low-level policy to follow.



Finding 10: The autoregressive property of the language-planning representation head is the principal cause of its higher training cost and slower inference speed.

To investigate why different planning representations have different training costs and inference speeds, we show the forward pass of the different planning heads of Hierarchical-VLA in the following picture.

We can see that, because the language planning and visual planning heads are autoregressive, they need several hundred forward passes to generate the planning tokens, incurring high training cost and slow inference, while the image foresight planning head (in this work, a VAR-like generator) needs only 7 forward passes (around 100x fewer than language planning and visual planning) to generate the planning tokens.
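The contrast between the two decoding regimes can be sketched schematically as follows; the interfaces (`next_token`, `next_scale`, `eos`) are simplified placeholders of our own design, not the VLA-OS planning-head API.

def autoregressive_decode(head, context, max_tokens=2000):
    # Language/visual planning: one forward pass per planning token.
    tokens = []
    for _ in range(max_tokens):
        nxt = head.next_token(context, tokens)
        if nxt == head.eos:
            break
        tokens.append(nxt)
    return tokens  # several hundred forward passes in practice

def var_decode(head, context, num_scales=7):
    # Image foresight planning with a VAR-like generator: one forward pass
    # per scale, coarse to fine, for roughly 100x fewer passes overall.
    maps = []
    for _ in range(num_scales):
        maps.append(head.next_scale(context, maps))
    return maps  # 7 forward passes total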



Finding 11: The performance of all VLA paradigms improves as the amount of action-labeled demonstration data increases, i.e., all VLA paradigms exhibit data scalability.

For data scalability, we use LIBERO-LONG, a dataset with 10 tasks and a total of 500 demonstrations. We train the three VLA paradigms at model size S on 10%, 40%, 70%, and 100% of the data.

We can see that all VLA paradigms exhibit data scalability: performance improves as the amount of action-labeled demonstration data increases.



Finding 12: For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size kept under 1B parameters.

For model scalability, we use LIBERO-90, a dataset with 90 tasks and 4,500 demonstrations, training on all of its data. We choose Qwen-2.5 LLM backbones with 0.5B, 1.5B, 3B, and 7B parameters for the experiments.
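Since the codebase accepts any off-the-shelf HuggingFace LLM, swapping backbone sizes for this study amounts to changing a model ID. A minimal sketch follows; the model IDs below are the public Qwen2.5 checkpoints, and the actual VLA-OS integration may differ.

from transformers import AutoModelForCausalLM, AutoTokenizer

for size in ["0.5B", "1.5B", "3B", "7B"]:
    name = f"Qwen/Qwen2.5-{size}"
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    # Report the actual parameter count of each backbone.
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e9:.2f}B parameters")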

We can see that the performance of all VLA paradigms do not improve as the model size increases, and their performance even decreases when the model size is larger than 3B.



Finding 13: VLA paradigms with task planning (Integrated-VLA and Hierarchical-VLA), compared to the non-planning paradigm (ActionOnly-VLA), achieve higher forward transfer but incur faster forgetting.

We test the continual learning capabilities of the three paradigms on the 10 tasks of LIBERO-LONG sequentially, using Sequential Finetuning (SEQL) as the lifelong learning algorithm. We use the original metrics from LIBERO: forward transfer (FWT) and negative backward transfer (NBT).
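Assuming the usual lifelong-learning formulation, where R[i][j] is the success rate on task j evaluated after sequentially training through task i, the two metrics can be sketched as below. This follows our reading of LIBERO's definitions (which additionally track checkpoints during each task), so consult the LIBERO paper for the exact formulas.

import numpy as np

def fwt(R):
    # Forward transfer: average success on each task right after learning it.
    return float(np.mean(np.diag(R)))

def nbt(R):
    # Negative backward transfer: average drop on task k after later tasks;
    # higher NBT means faster forgetting.
    K = R.shape[0]
    drops = [np.mean(R[k, k] - R[k + 1:, k]) for k in range(K - 1)]
    return float(np.mean(drops))

R = np.array([[0.8, 0.0, 0.0],
              [0.6, 0.7, 0.0],
              [0.5, 0.6, 0.9]])  # toy 3-task success matrix
print(fwt(R), nbt(R))  # 0.8, 0.175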

We can see that Integrated-VLA and Hierarchical-VLA achieve higher forward transfer (FWT) than ActionOnly-VLA, but incur faster forgetting (higher NBT).



Finding 14: Visually grounded planning representations deliver superior forward transfer and exhibit slower forgetting relative to language-based planning representations.

We test the continual learning capabilities of the three planning representations on the 10 tasks of LIBERO-LONG sequentially, again using Sequential Finetuning (SEQL) as the lifelong learning algorithm and the LIBERO metrics FWT and NBT.

We can see that visually grounded planning representations (visual and image foresight) deliver superior forward transfer (FWT) and exhibit slower forgetting (lower NBT) relative to language-based planning representations.



Take Home Messages

1. The time has not yet come to scale up VLA model sizes. (Finding 1)

2. Visually grounded planning representations (visual and image foresight) are better than language planning representations in terms of success rate, low-level following, and continual learning. (Findings 3, 9, 14)

3. Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA in task performance and generalization ability, but incur faster forgetting. (Findings 5, 6, 13)

4. Integrated-VLA and Hierarchical-VLA perform comparably in task performance and planning-head pretraining, but Hierarchical-VLA generalizes better and has stronger task-planning performance. (Findings 4, 7, 8)

5. All VLA paradigms exhibit data scalability. For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size kept under 1B parameters. (Findings 11, 12)


Future Research Directions

1. Why are visually grounded planning representations better than language-based ones?

2. How to avoid gradient conflict between planning head losses and action head losses on the VLM backbone?

3. How to design network architectures to effectively extract information from VLM?

4. How to design faster alternatives to autoregressive planning heads?

5. How to design better low-level action heads with stronger planning-following ability?

6. How to construct large-scale task planning datasets for VLA? How to convert existing datasets into task planning datasets?


Acknowledgments

We thank Zhixuan Xu for his valuable discussions and his guidance on drawing the figures.


BibTeX

@article{gao2025vlaos,
  title   = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
  author  = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
  journal = {arXiv preprint arXiv:2506.17561},
  year    = {2025},
  url     = {https://arxiv.org/abs/2506.17561}
}