How learned manipulation policies represent and generate actions, 2023–2026. A high-level overview compares every model on the dimensions that generalize across paradigms; the π-family section below drills into architecture-specific detail. Models are tagged by paradigm so you can read across families or across types.
| | ACT: Action-chunking imitation. Transformer predicts a chunk of future actions; generative head is a CVAE. Trained from scratch per task, no language. | Diffusion Policy: Diffusion visuomotor policy. Action distribution modelled as a denoising diffusion process. Trained from scratch per task, no language. | π0: Vision-Language-Action. Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | π0.5: Vision-Language-Action. Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | π0.6 / π\*0.6: Vision-Language-Action · + RL (π\*). Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments. | π0.7: VLA + world model. A VLA paired with a generative world model that predicts visual subgoals to plan over. | mimic-video: World Action Model. Grounded in a pretrained video-generation model; predicts the future as video and derives actions via inverse dynamics (dream the future, then act). | DreamZero: World Action Model. Grounded in a pretrained video-generation model; predicts the future as video and derives actions via inverse dynamics (dream the future, then act). |
|---|---|---|---|---|---|---|---|---|
| Year · origin | 2023 · Stanford (ALOHA) | 2023 · Columbia / TRI / MIT | 2024 · Physical Intelligence (PI) | 2025 · PI | 2025 · PI | 2026 · PI | Dec 2025 · mimic robotics / ETH / Microsoft / UC Berkeley | Feb 2026 · NVIDIA |
| Params | ~80M | ~67M (CNN variant) | 3.3B | ~3.3B | ~5.3B (4B + 860M) | ~5B + 14B world model | 2B video backbone + flow-matching action decoder | 14B |
| Perception backbone | ResNet-18 × 4 cams | ResNet-18 (spatial-softmax, GroupNorm) | PaliGemma (SigLIP 400M + Gemma 2B) | PaliGemma (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 3 4B | Gemma 3 4B + SigLIP + MEM | NVIDIA Cosmos-Predict2 (2B latent DiT) | Wan2.1-I2V-14B (autoregressive video diffusion) |
| Action representation | CVAE (transformer enc-dec), L1 + KL | DDPM diffusion (1D U-Net / transformer), ε-prediction | Conditional flow matching | Hybrid: FAST tokens (autoregressive) + flow matching | Knowledge insulation (KI): FAST in VLM + flow in expert | KI + flow; world-model subgoals | Flow-matching inverse-dynamics decoder on video latents (partial denoise to τv) | Jointly denoises video + action chunks; implicit inverse-dynamics model (IDM) → normalized joint positions |
| Action chunk (H) | k = 100 | Tp = 16 predicted, Ta = 8 executed (To = 2 observed) | H = 50 | H = 50 | H = 50 | H = 50 | not stated | H=48 @ 30 Hz (1.6 s, AgiBot) · H=24 @ 15 Hz (DROID) |
| Language-conditioned | No | No | Yes | Yes | Yes | Yes | Yes (T5 instruction encoder) | Yes (frozen text encoder) |
| Cross-embodiment | No | No | Yes | Yes | Yes | Yes (zero-shot) | Tested on several embodiments (WidowX/Panda/bimanual); not yet a unified model | Yes; adapts to a new robot (YAM) with 30 min of play data; video-only demos give +42% on unseen tasks |
| Control freq | 50 Hz | ~10 Hz | up to 50 Hz | 50 Hz | 50 Hz | 50 Hz | not stated | 7 Hz closed-loop (real-time) |
| Inference | single forward pass; temporal ensembling (+3.3%) | DDIM 10 steps (DDPM 100 train) | 10 flow steps; open-loop (ensembling hurt) | 10 flow steps | 5 flow steps; 63 ms / chunk | 5 flow steps; 38–127 ms | Partial denoising of video to τv=1, then decode actions from latents | 16 denoising steps; Flash variant 4→1 step (~350→~150 ms) |
| Training data | ~50 demos / task | 136–250 demos / task | ~10,000 h, 903M steps, 7 robots | + ~400 h mobile, ~100 homes | π0.5 + RL rollouts + interventions | + egocentric human video + autonomous | LIBERO 50 demos/task · real bimanual 512 + 480 eps · 10× sample-efficiency vs VLA | ~500 h teleop on AgiBot G1 across 22 environments + DROID |
| Generalization scope | single task, single robot | single task; models multimodal demos | fine-tune to new tasks | open-world new homes | specialist-level out-of-box | compositional, new embodiments | 77% from 1 episode/task (2% of action data); converges ~2× faster than VLA | >2× task progress vs SOTA VLAs; zero-shot new tasks & environments |
| Key contribution | Introduced action chunking | Introduced the diffusion action head | First flow-matching VLA | Generalizes to entirely new homes | RL (RECAP) at VLA scale (π*0.6) | Compositional generalization | Video-Action Model: video backbone supplies dynamics; decoder only solves control | World Action Model: dream the future in video pixels, then act; yields a strong zero-shot policy |
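
The "Action chunk (H)" and "Inference" rows describe two mechanics that are easier to see in code: sampling a whole chunk of actions with a fixed number of flow steps, and executing only part of each chunk before replanning. The following is a minimal sketch under stated assumptions, not any of these models' actual implementation. `toy_velocity` is a hypothetical stand-in for a learned velocity network, actions are scalars for brevity, and the convention that integration runs from noise at tau = 0 to actions at tau = 1 is an assumption. The defaults (10 Euler steps, 16 predicted, 8 executed) are borrowed from the π0 and Diffusion Policy columns purely for illustration; Diffusion Policy itself samples with DDIM denoising, not flow matching.

```python
import random


def toy_velocity(actions, tau, target):
    """Hypothetical stand-in for a learned velocity network v(a, tau | obs).

    For the straight-line path a_tau = (1 - tau) * noise + tau * target,
    the exact velocity at a_tau is (target - a_tau) / (1 - tau).
    """
    return [(t - a) / (1.0 - tau) for a, t in zip(actions, target)]


def sample_chunk(target, num_steps=10, rng=None):
    """Generate one action chunk by Euler-integrating from noise to actions."""
    rng = rng or random.Random(0)
    actions = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * dt  # tau stays strictly below 1, so no division by zero
        velocity = toy_velocity(actions, tau, target)
        actions = [a + dt * v for a, v in zip(actions, velocity)]
    return actions


def run_receding_horizon(total_steps, predict_horizon=16, execute_horizon=8):
    """Predict a chunk, execute only its first part, then replan."""
    executed = []
    policy_calls = 0
    t = 0
    while t < total_steps:
        # Toy "ground truth": the ideal action at step t is just the number t.
        target = [float(t + k) for k in range(predict_horizon)]
        chunk = sample_chunk(target)
        policy_calls += 1
        n = min(execute_horizon, total_steps - t)
        executed.extend(chunk[:n])  # the rest of the chunk is discarded
        t += n
    return executed, policy_calls


actions, calls = run_receding_horizon(total_steps=32)
print(calls, len(actions))  # 4 policy calls for 32 executed steps
```

Executing fewer steps than are predicted trades compute for reactivity: a smaller `execute_horizon` means more policy calls per episode but fresher observations behind each action. Setting `execute_horizon` equal to `predict_horizon` corresponds to running each chunk fully open-loop.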