How learned manipulation policies represent and generate actions, 2023–2026. A high-level overview compares every model on the dimensions that generalize across paradigms; the π-family section below drills into architecture-specific detail. Models are tagged by paradigm so you can read across families or across types.
| | ACT | Diffusion Policy | π0 | π0.5 | π0.6 / π*0.6 | π0.7 |
|---|---|---|---|---|---|---|
| Paradigm | Action-chunking imitation | Diffusion visuomotor policy | Vision-Language-Action | Vision-Language-Action | Vision-Language-Action · + RL (π*) | VLA + world model |
| Summary | Transformer predicts a chunk of future actions; generative head is a CVAE. Trained from scratch per task, no language. | Action distribution modelled as a denoising diffusion process. Trained from scratch per task, no language. | Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | A VLA paired with a generative world model that predicts visual subgoals to plan over. |
| Year · origin | 2023 · Stanford (ALOHA) | 2023 · Columbia / TRI / MIT | 2024 · Physical Intelligence | 2025 · PI | 2025 · PI | 2026 · PI |
| Params | ~80M | ~67M (CNN variant) | 3.3B | ~3.3B | ~5.3B (4B + 860M) | ~5B + 14B world model |
| Perception backbone | ResNet-18 × 4 cams | ResNet-18 (spatial-softmax, GroupNorm) | PaliGemma (SigLIP 400M + Gemma 2B) | PaliGemma (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 3 4B | Gemma 3 4B + SigLIP + MEM |
| Action representation | CVAE (transformer enc-dec), L1 + KL | DDPM diffusion (1D U-Net / transformer), ε-pred | Conditional flow matching | Hybrid FAST (AR) + flow matching | KI: FAST in VLM + flow in expert | KI + flow; world-model subgoals |
| Action chunk (H) | k = 100 | Tp = 16 predicted, Ta = 8 executed (To = 2 observation steps) | H = 50 | H = 50 | H = 50 | H = 50 |
| Language-conditioned | No | No | Yes | Yes | Yes | Yes |
| Cross-embodiment | No | No | Yes | Yes | Yes | Yes (zero-shot) |
| Control freq | 50 Hz | ~10 Hz | up to 50 Hz | 50 Hz | 50 Hz | 50 Hz |
| Inference | single forward pass; temporal ensembling (+3.3%) | DDIM, 10 denoising steps (trained with 100 DDPM steps) | 10 flow steps; open-loop (ensembling hurt) | 10 flow steps | 5 flow steps; 63 ms / chunk | 5 flow steps; 38–127 ms |
| Training data | ~50 demos / task | 136–250 demos / task | ~10,000 h, 903M steps, 7 robots | + ~400 h mobile, ~100 homes | π0.5 + RL rollouts + interventions | + egocentric human video + autonomous |
| Generalization scope | single task, single robot | single task; models multimodal demos | fine-tune to new tasks | open-world new homes | specialist-level out-of-box | compositional, new embodiments |
| Key contribution | Introduced action chunking | Introduced the diffusion action head | First flow-matching VLA | Generalize to entirely new homes | RL (RECAP) at VLA scale (π*0.6) | Compositional generalization |
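
The "Action chunk" and "Inference" rows describe two ways of consuming a predicted chunk: execute part of it and replan (Diffusion Policy's Tp = 16, Ta = 8), or average overlapping predictions for the same timestep (ACT's temporal ensembling). The sketch below illustrates both in plain Python. It is a toy under stated assumptions, not code from any of these projects: actions are single floats, `policy`, `run_receding_horizon`, and `temporal_ensemble` are hypothetical names, and the exponential weights `exp(-m * i)` with the oldest prediction at index 0 follow the scheme described in the ACT paper, with `m = 0.01` chosen only for illustration.

```python
import math
from typing import Callable, List

Chunk = List[float]  # one scalar action per step (illustrative simplification)


def run_receding_horizon(policy: Callable[[int], Chunk],
                         episode_len: int,
                         exec_steps: int) -> List[float]:
    """Query the policy for a chunk, execute its first `exec_steps` actions, replan."""
    executed: List[float] = []
    t = 0
    while t < episode_len:
        chunk = policy(t)  # predicted actions for steps t, t+1, ...
        n = min(exec_steps, len(chunk), episode_len - t)
        if n <= 0:
            raise ValueError("policy returned an empty chunk or exec_steps <= 0")
        executed.extend(chunk[:n])  # the rest of the chunk is discarded
        t += n
    return executed


def temporal_ensemble(predictions: List[float], m: float = 0.01) -> float:
    """Weighted average of overlapping predictions for a single timestep.

    predictions[0] is the oldest prediction; weight i is exp(-m * i),
    so a larger m favours older predictions more strongly.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical policy: predicts 16 steps ahead, as in the Tp = 16 setting.
    def toy_policy(t: int) -> Chunk:
        return [float(t + i) for i in range(16)]

    print(run_receding_horizon(toy_policy, episode_len=20, exec_steps=8))
    print(temporal_ensemble([0.9, 1.0, 1.1]))
```

With `exec_steps=8`, a 20-step episode needs three policy queries (at steps 0, 8, and 16). Executing a whole chunk open-loop, as the π0 column reports, corresponds to setting `exec_steps` equal to the chunk length.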
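
The "10 flow steps" and "5 flow steps" entries refer to a fixed number of integration steps that carry a noise sample to an action chunk along a learned velocity field. A minimal sketch of that sampling loop follows, assuming forward Euler integration from τ = 0 to τ = 1 and a hand-written toy velocity field in place of a trained network. `sample_chunk`, `toy_velocity`, and `TARGET` are hypothetical names introduced for illustration, and none of this is taken from Physical Intelligence's implementation.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_chunk(velocity: Callable[[Vector, float], Vector],
                 dim: int,
                 num_steps: int = 10,
                 seed: int = 0) -> Vector:
    """Integrate da/dtau = velocity(a, tau) from tau = 0 to tau = 1 with forward Euler."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        tau = k * dt  # largest tau visited is 1 - dt, so 1 - tau never reaches 0
        v = velocity(a, tau)
        a = [ai + dt * vi for ai, vi in zip(a, v)]
    return a


# Toy field (assumption): straight-line flow towards one fixed target "chunk".
TARGET: Vector = [0.5, -0.25, 1.0]


def toy_velocity(a: Vector, tau: float) -> Vector:
    return [(t - ai) / (1.0 - tau) for t, ai in zip(TARGET, a)]


if __name__ == "__main__":
    print(sample_chunk(toy_velocity, dim=len(TARGET), num_steps=10))
    print(sample_chunk(toy_velocity, dim=len(TARGET), num_steps=5))
```

In a real model the velocity would be a network conditioned on observations and language, and fewer steps trade integration accuracy for latency. In this toy field the straight-line flow reaches the target (up to floating-point error) for any step count, so it shows only the loop structure, not that trade-off.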