
Robot manipulation policy architectures

How learned manipulation policies represent and generate actions, 2023–2026. A high-level overview compares every model on the dimensions that generalize across paradigms; the π-family section below drills into architecture-specific detail. Models are tagged by paradigm so you can read across families or across types.

ACT: Action-chunking imitation
Transformer predicts a chunk of future actions; generative head is a CVAE. Trained from scratch per task, no language.
Diffusion Policy: Diffusion visuomotor policy
Action distribution modelled as a denoising diffusion process. Trained from scratch per task, no language.
π0: Vision-Language-Action
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.5: Vision-Language-Action
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.6 / π*0.6: Vision-Language-Action · + RL (π*)
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.7: VLA + world model
A VLA paired with a generative world model that predicts visual subgoals to plan over.
mimic-video: World Action Model
Grounded in a pretrained video-generation model. Predicts the future as video and derives actions via inverse dynamics (dream the future, then act).
DreamZero: World Action Model
Grounded in a pretrained video-generation model. Predicts the future as video and derives actions via inverse dynamics (dream the future, then act).
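The chunked-prediction idea in the ACT entry can be illustrated with a small receding-horizon loop: the policy is queried once, only the first few actions of the chunk are executed, and then it is queried again from a fresh observation. This is a minimal sketch, not any model's actual controller. `run_chunked`, `toy_policy`, `toy_apply`, and the 1-D state are hypothetical names introduced here, and the values 16 and 8 simply mirror the Tp=16 / Ta=8 setting listed for Diffusion Policy in the comparison table.

```python
from typing import Callable, List

Action = List[float]


def run_chunked(
    policy: Callable[[List[float]], List[Action]],
    get_obs: Callable[[], List[float]],
    apply_action: Callable[[Action], None],
    total_steps: int,
    exec_per_chunk: int,
) -> int:
    """Run a chunk-predicting policy; return how many times it was queried."""
    if exec_per_chunk < 1:
        raise ValueError("exec_per_chunk must be at least 1")
    queries = 0
    step = 0
    while step < total_steps:
        chunk = policy(get_obs())  # one query yields many future actions
        if not chunk:
            raise ValueError("policy returned an empty chunk")
        queries += 1
        # Execute only the first few actions, then re-plan from a new observation.
        for action in chunk[:exec_per_chunk]:
            if step >= total_steps:
                break
            apply_action(action)
            step += 1
    return queries


# Hypothetical toy setup: a 1-D position that the "policy" nudges toward zero.
state = [1.0]


def toy_policy(obs: List[float], horizon: int = 16) -> List[Action]:
    # Stand-in for a learned model: plan `horizon` equal steps toward the origin.
    return [[-obs[0] / horizon] for _ in range(horizon)]


def toy_apply(action: Action) -> None:
    state[0] += action[0]


num_queries = run_chunked(
    toy_policy, lambda: list(state), toy_apply, total_steps=32, exec_per_chunk=8
)
print(num_queries, state[0])
```

Executing 8 of every 16 predicted actions means 32 control steps need only 4 policy queries, which is the main reason chunking reduces how often a large model has to run.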
| Dimension | ACT | Diffusion Policy | π0 | π0.5 | π0.6 / π*0.6 | π0.7 | mimic-video | DreamZero |
|---|---|---|---|---|---|---|---|---|
| Year · origin | 2023 · Stanford (ALOHA) | 2023 · Columbia / TRI / MIT | 2024 · Physical Intelligence | 2025 · PI | 2025 · PI | 2026 · PI | Dec 2025 · mimic robotics / ETH / Microsoft / UC Berkeley | Feb 2026 · NVIDIA |
| Parameters | ~80M | ~67M (CNN variant) | 3.3B | ~3.3B | ~5.3B (4B + 860M) | ~5B + 14B world model | 2B video backbone + flow-matching action decoder | 14B |
| Perception backbone | ResNet-18 × 4 cams | ResNet-18 (spatial-softmax, GroupNorm) | PaliGemma (SigLIP 400M + Gemma 2B) | PaliGemma (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 3 4B | Gemma 3 4B + SigLIP + MEM | NVIDIA Cosmos-Predict2 (2B latent DiT) | Wan2.1-I2V-14B (autoregressive video diffusion) |
| Action representation | CVAE (transformer enc-dec), L1 + KL | DDPM diffusion (1D U-Net / transformer), ε-pred | Conditional flow matching | Hybrid FAST (AR) + flow matching | KI: FAST in VLM + flow in expert | KI + flow; world-model subgoals | Flow-matching inverse-dynamics decoder on video latents (partial denoise to τv) | Jointly denoises video + action chunks; implicit IDM → normalized joint positions |
| Action chunk (H) | k = 100 | Tp=16 predict, Ta=8 exec (To=2) | H = 50 | H = 50 | H = 50 | H = 50 | not stated | H=48 @ 30 Hz (1.6 s, AgiBot) · H=24 @ 15 Hz (DROID) |
| Language-conditioned | No | No | Yes | Yes | Yes | Yes | Yes (T5 instruction encoder) | Yes (frozen text encoder) |
| Cross-embodiment | No | No | Yes | Yes | Yes | Yes (zero-shot) | Tested on several embodiments (WidowX/Panda/bimanual); not yet a unified model | Yes, adapts to new robot (YAM) with 30 min play; video-only demos +42% on unseen |
| Control frequency | 50 Hz | ~10 Hz | up to 50 Hz | 50 Hz | 50 Hz | 50 Hz | not stated | 7 Hz closed-loop (real-time) |
| Inference | single forward pass; temporal ensembling (+3.3%) | DDIM 10 steps (DDPM 100 train) | 10 flow steps; open-loop (ensembling hurt) | 10 flow steps | 5 flow steps; 63 ms / chunk | 5 flow steps; 38–127 ms | Partial denoising of video to τv=1, then decode actions from latents | 16 denoising steps; Flash variant 4→1 step (~350→~150 ms) |
| Training data | ~50 demos / task | 136–250 demos / task | ~10,000 h, 903M steps, 7 robots | + ~400 h mobile, ~100 homes | π0.5 + RL rollouts + interventions | + egocentric human video + autonomous | LIBERO 50 demos/task · real bimanual 512 + 480 eps · 10× sample-efficiency vs VLA | ~500 h teleop on AgiBot G1 across 22 environments + DROID |
| Generalization scope | single task, single robot | single task; models multimodal demos | fine-tune to new tasks | open-world new homes | specialist-level out-of-box | compositional, new embodiments | 77% from 1 episode/task (2% of action data); converges ~2× faster than VLA | >2× task progress vs SOTA VLAs; zero-shot new tasks & environments |
| Key contribution | Introduced action chunking | Introduced the diffusion action head | First flow-matching VLA | Generalize to entirely new homes | RL (RECAP) at VLA scale (π*0.6) | Compositional generalization | Video-Action Model: video backbone supplies dynamics; decoder only solves control | World Action Model: dream the future in video pixels, then act, a strong zero-shot policy |
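The "10 flow steps" and "5 flow steps" entries in the Inference row count the integration steps used to turn Gaussian noise into an action chunk. The loop below is a minimal sketch of that sampling procedure, assuming plain forward-Euler integration from τ=0 (noise) to τ=1 (actions). `sample_actions` and `make_toy_velocity` are hypothetical names, and the closed-form velocity field is only a stand-in for a learned, observation-conditioned network.

```python
import random
from typing import Callable, List

Vector = List[float]


def sample_actions(
    velocity: Callable[[Vector, float], Vector],
    dim: int,
    steps: int = 10,
    seed: int = 0,
) -> Vector:
    """Integrate dx/dtau = velocity(x, tau) from tau=0 to tau=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    dt = 1.0 / steps
    for k in range(steps):
        tau = k * dt  # tau stays strictly below 1 inside the loop
        v = velocity(x, tau)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    # Toy assumption: along a straight noise-to-target path, the velocity that
    # reaches `target` at tau=1 is (target - x) / (1 - tau). A real policy
    # would predict this with a network conditioned on images and language.
    def velocity(x: Vector, tau: float) -> Vector:
        return [(t - xi) / (1.0 - tau) for t, xi in zip(target, x)]

    return velocity


target_chunk = [0.5, -0.25, 1.0]  # hypothetical 3-number "action chunk"
sampled_10 = sample_actions(make_toy_velocity(target_chunk), dim=3, steps=10)
sampled_5 = sample_actions(make_toy_velocity(target_chunk), dim=3, steps=5)
print(sampled_10, sampled_5)
```

With this idealized straight-line field, both step counts land on the target up to floating-point error. A learned field is only approximately straight, so halving the step count trades some sample quality for lower latency.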
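The action-chunk and control-frequency rows can be combined into the wall-clock time each chunk covers, which is where the 1.6 s figure quoted for DreamZero comes from. The short worked example below assumes the actions in a chunk are evenly spaced at the listed rate. That spacing is stated for DreamZero but is an assumption for the ACT and π0.5 rows, where the table lists horizon and control frequency separately. `chunk_duration_s` is a hypothetical helper.

```python
def chunk_duration_s(horizon: int, rate_hz: float) -> float:
    """Seconds covered by `horizon` actions evenly spaced at `rate_hz`."""
    if rate_hz <= 0:
        raise ValueError("rate_hz must be positive")
    return horizon / rate_hz


# (horizon, rate in Hz) pairs copied from the comparison table.
chunks = {
    "ACT": (100, 50.0),
    "π0.5": (50, 50.0),
    "DreamZero (AgiBot)": (48, 30.0),
    "DreamZero (DROID)": (24, 15.0),
}

durations = {name: chunk_duration_s(h, hz) for name, (h, hz) in chunks.items()}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.1f} s")
```

Under that assumption ACT's chunk spans 2.0 s and π0.5's spans 1.0 s, while both DreamZero settings cover the same 1.6 s despite their different horizons.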