Task Planning Representations and Datasets
We provide three kinds of task planning representations for six different manipulation task benchmarks: LIBERO, Deformable (real-world),
The Colosseum, FurnitureBench, DexArt, and PerAct2, as well as their data annotation code. Please select a task to see its reasoning data visualization videos.
The LIBERO-Long Reasoning Dataset
The Real World Deformable Object Manipulation Reasoning Dataset
The Colosseum Manipulation Reasoning Dataset
The FurnitureBench Reasoning Dataset
The DexArt Dexterous Manipulation Reasoning Dataset
The PerAct2 Dual-Arm Manipulation Reasoning Dataset
Experiment Findings
We perform systematic and controllable experiments to investigate different perspectives of task planning on VLA models with our VLA-OS model family and
annotated datasets. The findings are shown as follows.
Finding 1: For downstream tasks, larger VLA models trained on large-scale datasets do not necessarily outperform smaller models that are trained from scratch. Model architectures and algorithmic designs are still important at the current moment. The time has not yet come to scale up VLA model sizes.
We train VLA-OS-A on four suites from LIBERO (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal,
LIBERO-Long) from scratch with L1 loss and compare them with Diffusion-Policy, fine-tuned
OpenVLA, fine-tuned CoT-VLA, fine-tuned DiT Policy, and fine-tuned
π_0 and its variant π_0-FAST. Results are shown as follows.
We can see that VLA-OS-A-S outperforms Diffusion Policy (trained from scratch) by +13.2%, fine-tuned OpenVLA by +9.1%,
CoT-VLA by +4.5%, and DiT Policy by +3.2%, and is comparable to fine-tuned π_0-FAST (+0.1%). Although it still
trails the SOTA method, these results demonstrate that our model design is competitive.
Note that VLA-OS-A-S is trained from scratch and uses only a 0.5B LLM backbone.
Finding 2: For the Integrated-VLA paradigm, implicit planning yields a positive performance gain, whereas explicit planning incurs a significant performance degradation when trained from scratch.
We perform experiments with language planning, visual planning, image foresight planning, and their combinations on
the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of
the implicit and explicit planning variants of Integrated-VLA. Results are shown as follows.
The implicit planning paradigm uses various task planning objectives as auxiliary losses during training;
at inference time it behaves identically to ActionOnly-VLA, and it brings a performance improvement.
This shows that using task planning as auxiliary losses can improve performance. The explicit
planning paradigm, in contrast, must complete the entire planning process before the action head generates actions at inference time,
which introduces severe planning accumulation errors. The planning tokens are typically far more numerous
than the action tokens (approximately 2000 vs. 8), which exacerbates the accumulation error compared to generating
action tokens alone. Additionally, the embeddings from every layer of the planning head are fed into the action head,
affecting its internal representations. Meanwhile, the action head does not receive raw visual or language inputs;
it only receives embeddings from the VLM and planning heads, leaving it without the error-correction capability needed to recover from planning mistakes.
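The following is a minimal, hypothetical sketch (not the released VLA-OS code) of how the two variants differ; all names (vlm, action_head, planning_heads, lam) are illustrative assumptions.

```python
# Conceptual sketch of implicit vs. explicit planning in Integrated-VLA.
# All interfaces below are illustrative placeholders, not the actual implementation.

def implicit_planning_step(vlm, action_head, planning_heads, batch, lam=0.1):
    """Training: planning objectives act only as auxiliary losses."""
    feats = vlm(batch["image"], batch["language"])
    action_loss = action_head.loss(feats, batch["actions"])
    planning_loss = sum(h.loss(feats, batch[h.target]) for h in planning_heads)
    # At inference time only the action head is used, exactly as in ActionOnly-VLA.
    return action_loss + lam * planning_loss


def explicit_planning_inference(vlm, action_head, planning_head, obs, instruction):
    """Inference: the full plan (~2000 tokens) must be generated before any action."""
    feats = vlm(obs, instruction)
    plan_tokens, plan_embeddings = planning_head.generate(feats)  # autoregressive
    # The action head conditions on planning-head embeddings rather than raw inputs,
    # so errors accumulated over the long plan propagate into the ~8 action tokens.
    return action_head.predict(feats, plan_embeddings)
```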
Finding 3: Visually grounded planning representations work better than language planning representations, with faster inference and lower training cost.
We perform experiments with language planning, visual planning, image foresight planning, and their combinations on
the LIBERO-LONG benchmark, which contains 10 long-horizon tasks with 50 demonstrations each, to investigate the performance of
different planning representations. Results are shown as follows.
Finding 4: When employing multiple planning representations concurrently, Hierarchical-VLA outperforms the Integrated-VLA paradigm.
We illustrate the performance of both Integrated-VLA and Hierarchical-VLA paradigms with different planning representations on the LIBERO-LONG benchmark.
Finding 5: Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across a broad spectrum of tasks (2D, 3D, simulation, and real‐world), with their performances largely comparable.
We illustrate the performance of all VLA paradigms on six benchmarks along with their average success rates. We can see that Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA across all benchmarks, and their performances are largely comparable.
Finding 6: Both Integrated-VLA and Hierarchical-VLA benefit similarly from task‐planning pretraining, exhibiting analogous gains in task success rate.
Finding 7: Hierarchical-VLA demonstrates the best generalization ability.
We illustrate the generalization performance of all VLA paradigms on The-Colosseum (ALL-Perturbation) benchmark, as well as the performance improvement
of the Integrated-VLA and Hierarchical-VLA paradigms from task planning pretraining on LIBERO-90 followed by testing on LIBERO-LONG. We can see that Hierarchical-VLA
achieves the best generalization performance, and both Integrated-VLA and Hierarchical-VLA benefit similarly from task-planning pretraining.
Finding 8: Hierarchical-VLA performs better than Integrated-VLA in task planning.
It is imperative to discern whether task failures arise from the planning component or from policy learning. In this part, we use
LIBERO-LONG for Integrated-VLA (task planning only) and Hierarchical-VLA to separately evaluate the task planning part and
the policy learning part of the model for all three planning representations. For evaluation, we manually divide each long-horizon task
into several sub-tasks and forcibly reset the environment to the initial state of each subtask. We then compute the average planning
correctness (0 or 1) of the planning outcomes and the execution success rate (0 or 1) of the action head across all subtask start
points. Thus, for a given task trajectory, we obtain a Task Decomposition Score (DCS) and a Policy Following Score (PFS). Note
that for Hierarchical-VLA, we provide the ground-truth planning results when testing PFS.
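Below is an illustrative sketch of this evaluation protocol; the helpers and data structures (env.reset_to, model.plan, subtask.plan_is_correct, etc.) are hypothetical assumptions, not the released evaluation code.

```python
# Sketch of computing DCS and PFS for one long-horizon task split into subtasks.
# All object interfaces are placeholders used only to mirror the protocol above.

def evaluate_trajectory(env, model, subtasks):
    plan_correct, exec_success = [], []
    for subtask in subtasks:
        env.reset_to(subtask.initial_state)          # forcibly reset to the subtask start
        plan = model.plan(env.observe(), subtask.instruction)
        plan_correct.append(1.0 if subtask.plan_is_correct(plan) else 0.0)
        # For Hierarchical-VLA, the ground-truth plan is provided when scoring PFS.
        rollout = model.act(env, plan=subtask.ground_truth_plan)
        exec_success.append(1.0 if rollout.success else 0.0)
    dcs = sum(plan_correct) / len(plan_correct)      # Task Decomposition Score
    pfs = sum(exec_success) / len(exec_success)      # Policy Following Score
    return dcs, pfs
```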
We can see that Hierarchical-VLA consistently outperforms Integrated-VLA in task planning across different planning representations.
Finding 9: Visually grounded planning representations are easier for low-level policy to follow.
As described above, we show the Policy Following Score (PFS) of Hierarchical-VLA for different planning representations.
We can see that visually grounded planning representations (visual and image foresight) are easier for low-level policy to follow.
Finding 10: The autoregressive property of the language‐planning representation head is the principal cause of its higher training cost and slower inference speeds.
To investigate the reason why different planning representations have different training costs and inference speeds,
we show the forward pass of different planning heads of Hierarchical-VLA in the following picture.
We can see that, because the language planning and visual planning heads are autoregressive, they need several hundred forward passes
to generate the planning tokens, which incurs a high training cost and slow inference speed, whereas the image foresight planning head (in this work, a VAR-like generator)
needs only 7 forward passes (around 100x fewer than language and visual planning) to generate its planning tokens.
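The toy sketch below contrasts the number of forward passes; the token and scale counts follow the text above, while the head interfaces (next_token, next_scale) are simplified placeholders.

```python
# Toy illustration of why autoregressive planning heads are slower.
# Numbers (~2000 planning tokens vs. 7 scale-wise steps) come from the text;
# the model interfaces are hypothetical.

def autoregressive_plan(head, context, num_tokens=2000):
    tokens = []
    for _ in range(num_tokens):              # one forward pass per token
        tokens.append(head.next_token(context, tokens))
    return tokens                            # ~2000 forward passes

def var_like_image_foresight(head, context, num_scales=7):
    token_maps = []
    for _ in range(num_scales):              # one forward pass per scale
        token_maps.append(head.next_scale(context, token_maps))
    return token_maps                        # ~7 forward passes (~100x fewer)
```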
Finding 11: The performance of all VLA paradigms improves as the amount of action-labeled demonstration data increases, i.e., all VLA paradigms exhibit data scalability.
For data scalability, we use LIBERO-LONG, a dataset of 10 tasks with a total of 500 demonstrations. We use 10%, 40%, 70%, and 100%
of the data to train all three VLA paradigms at model size S.
We can see that all VLA paradigms exhibit data scalability: the performance of every paradigm improves as the amount of action-labeled demonstration data increases.
Finding 12: For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size should be kept under 1B parameters.
For model scalability, we use LIBERO-90, a dataset with 90 tasks and 4,500 demonstrations, for the experiment with all training data.
We choose Qwen-2.5 LLM backbones with 0.5B, 1.5B, 3B, and 7B parameters for the experiments.
We can see that the performance of all VLA paradigms does not improve as the model size increases, and their performance even decreases when the model size exceeds 3B.
Finding 13: VLA paradigms with task planning (Integrated-VLA and Hierarchical-VLA), compared to the non-planning paradigm (ActionOnly-VLA), achieve higher forward transfer but incur faster forgetting.
We test the continual learning capacities of three paradigms on 10 tasks of LIBERO-LONG sequentially.
We only use Sequential Finetuning (SEQL) as our lifelong learning algorithm. We use the original metrics from
LIBERO, including forward transfer (FWT) and negative backward transfer (NBT).
We can see that Integrated-VLA and Hierarchical-VLA achieve higher forward transfer (FWT) than ActionOnly-VLA, but incur faster forgetting (NBT).
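As a rough illustration, simplified versions of these two metrics can be computed from a success-rate matrix as below; the paper follows the original LIBERO definitions, which this sketch only approximates, and the matrix layout here is an assumption.

```python
import numpy as np

# Simplified continual-learning metrics for illustration only.
# R[i, j] = success rate on task j after sequentially finetuning through task i.

def forward_transfer(R: np.ndarray) -> float:
    """How well each new task is learned when it is reached (higher is better)."""
    K = R.shape[0]
    return float(np.mean([R[k, k] for k in range(K)]))

def negative_backward_transfer(R: np.ndarray) -> float:
    """Average drop on earlier tasks after learning later ones (lower is better)."""
    K = R.shape[0]
    drops = [R[k, k] - R[i, k] for k in range(K - 1) for i in range(k + 1, K)]
    return float(np.mean(drops))
```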
Finding 14: Visually grounded planning representations deliver superior forward transfer and exhibit slower forgetting relative to language-based planning representations.
We test the continual learning capacities of three planning representations on 10 tasks of LIBERO-LONG sequentially.
We only use Sequential Finetuning (SEQL) as our lifelong learning algorithm. We use the original metrics from
LIBERO, including forward transfer (FWT) and negative backward transfer (NBT).
We can see that visually grounded planning representations (visual and image foresight) deliver superior forward transfer (FWT) and exhibit slower forgetting (NBT) relative to language-based planning representations.
Take Home Messages
1. The time has not yet come to scale up VLA model sizes. (Finding 1)
2. Visually grounded representations (visual and image foresight) outperform language planning representations in terms of success rate, low-level following, and continual learning. (Finding 3, 9, 14)
3. Integrated-VLA and Hierarchical-VLA outperform ActionOnly-VLA in task performance and generalization ability, but incur faster forgetting. (Finding 5, 6, 13)
4. Integrated-VLA and Hierarchical-VLA perform comparably in task performance and planning-head pretraining gains, but Hierarchical-VLA generalizes better and has stronger task-planning performance. (Finding 4, 7, 8)
5. All VLA paradigms exhibit data scalability. For tasks trained from scratch with roughly 5,000 demonstrations, the LLM backbone should be limited to 0.5B parameters, or the total model size should be kept under 1B parameters. (Finding 11, 12)
Future Research Directions
1. Why are visually grounded representations better than language?
2. How to avoid gradient conflict between planning head losses and action head losses on the VLM backbone?
3. How to design network architectures to effectively extract information from VLM?
4. How to speed up autoregressive planning heads?
5. How to design low-level action heads with better planning-following ability?
6. How to construct large-scale task planning datasets for VLA? How to transfer current datasets to task planning datasets?
Acknowledgments
We thank Zhixuan Xu for valuable discussions and his guidance on drawing the figures.
BibTeX
@article{gao2025vlaos,
title = {VLA-OS: Structuring and Dissecting Planning Representations and Paradigms in Vision-Language-Action Models},
author = {Gao, Chongkai and Liu, Zixuan and Chi, Zhenghao and Huang, Junshan and Fei, Xin and Hou, Yiwen and Zhang, Yuxuan and Lin, Yudi and Fang, Zhirui and Jiang, Zeyu and Shao, Lin},
journal = {arXiv preprint arXiv:2506.17561},
year = {2025},
url = {https://arxiv.org/abs/2506.17561}
}