
Robot manipulation policy architectures

How learned manipulation policies represent and generate actions, 2023–2026. A high-level overview compares every model on the dimensions that generalize across paradigms; the π-family section below drills into architecture-specific detail. Models are tagged by paradigm so you can read across families or across types.

ACT: Action-chunking imitation
Transformer predicts a chunk of future actions; generative head is a CVAE. Trained from scratch per task, no language.
Diffusion Policy: Diffusion visuomotor policy
Action distribution modelled as a denoising diffusion process. Trained from scratch per task, no language.
π0: Vision-Language-Action
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.5: Vision-Language-Action
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.6 / π*0.6: Vision-Language-Action (π*0.6 adds RL)
Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.
π0.7: VLA + world model
A VLA paired with a generative world model that predicts visual subgoals to plan over.
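To make the action-chunking idea in the ACT card concrete, here is a minimal sketch of a control loop that queries a policy for a chunk at every step and blends the overlapping predictions (temporal ensembling, which the comparison lists under ACT's inference). The weighting `w_i = exp(-m * i)`, with `i = 0` for the oldest prediction, is the scheme usually attributed to ACT. The value of `m`, the chunk length, and the constant-output `dummy_policy` are illustrative assumptions, not details taken from this page.

```python
import numpy as np


def temporal_ensemble(chunks, t, m=0.01):
    """Blend every stored prediction for timestep t.

    chunks maps the step s at which a chunk was predicted to an array of
    shape (k, action_dim) holding the actions for steps s .. s + k - 1.
    """
    # Collect the predictions that cover step t, oldest chunk first.
    preds = [chunk[t - s] for s, chunk in sorted(chunks.items())
             if 0 <= t - s < len(chunk)]
    preds = np.stack(preds)
    # w_i = exp(-m * i); i = 0 is the oldest prediction (assumed convention).
    weights = np.exp(-m * np.arange(len(preds)))
    weights /= weights.sum()
    return (weights[:, None] * preds).sum(axis=0)


def dummy_policy(obs, k=4, action_dim=2):
    # Hypothetical stand-in for a learned policy: a constant chunk.
    return np.full((k, action_dim), float(obs))


chunks = {}
executed = []
for t in range(6):
    chunks[t] = dummy_policy(obs=t)              # query the policy every step
    executed.append(temporal_ensemble(chunks, t))  # act on the blended action
executed = np.stack(executed)
```

Without ensembling, the loop would instead execute each chunk open-loop and query the policy only once every `k` steps.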
| Dimension | ACT | Diffusion Policy | π0 | π0.5 | π0.6 / π*0.6 | π0.7 |
| --- | --- | --- | --- | --- | --- | --- |
| Year · origin | 2023 · Stanford (ALOHA) | 2023 · Columbia / TRI / MIT | 2024 · Physical Intelligence | 2025 · PI | 2025 · PI | 2026 · PI |
| Parameters | ~80M | ~67M (CNN variant) | 3.3B | ~3.3B | ~5.3B (4B + 860M) | ~5B + 14B world model |
| Perception backbone | ResNet-18 × 4 cams | ResNet-18 (spatial-softmax, GroupNorm) | PaliGemma (SigLIP 400M + Gemma 2B) | PaliGemma (SigLIP 400M + Gemma 2B) | SigLIP 400M + Gemma 3 4B | Gemma 3 4B + SigLIP + MEM |
| Action representation | CVAE (transformer enc-dec), L1 + KL | DDPM diffusion (1D U-Net / transformer), ε-pred | Conditional flow matching | Hybrid FAST (AR) + flow matching | KI: FAST in VLM + flow in expert | KI + flow; world-model subgoals |
| Action chunk (H) | k = 100 | Tp = 16 predict, Ta = 8 exec (To = 2) | H = 50 | H = 50 | H = 50 | H = 50 |
| Language-conditioned | No | No | Yes | Yes | Yes | Yes |
| Cross-embodiment | No | No | Yes | Yes | Yes | Yes (zero-shot) |
| Control frequency | 50 Hz | ~10 Hz | up to 50 Hz | 50 Hz | 50 Hz | 50 Hz |
| Inference | single forward pass; temporal ensembling (+3.3%) | DDIM 10 steps (DDPM 100 train) | 10 flow steps; open-loop (ensembling hurt) | 10 flow steps | 5 flow steps; 63 ms / chunk | 5 flow steps; 38–127 ms |
| Training data | ~50 demos / task | 136–250 demos / task | ~10,000 h, 903M steps, 7 robots | + ~400 h mobile, ~100 homes | π0.5 + RL rollouts + interventions | + egocentric human video + autonomous |
| Generalization scope | single task, single robot | single task; models multimodal demos | fine-tune to new tasks | open-world new homes | specialist-level out-of-box | compositional, new embodiments |
| Key contribution | Introduced action chunking | Introduced the diffusion action head | First flow-matching VLA | Generalize to entirely new homes | RL (RECAP) at VLA scale (π*0.6) | Compositional generalization |
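The chunk lengths and control frequencies in the comparison translate directly into how much wall-clock time each chunk covers and how often the policy has to be queried. The sketch below does that arithmetic. It assumes chunks are executed open-loop (no temporal ensembling) and that the listed frequencies are exact; `chunk_timing` is a hypothetical helper introduced only for this illustration.

```python
def chunk_timing(horizon_steps, control_hz, executed_steps=None):
    """Return (seconds covered by one predicted chunk, policy queries per second).

    Assumes open-loop execution: the policy is queried again only after
    `executed_steps` actions have been sent to the robot.
    """
    if executed_steps is None:
        executed_steps = horizon_steps  # execute the whole chunk
    chunk_seconds = horizon_steps / control_hz
    queries_per_second = control_hz / executed_steps
    return chunk_seconds, queries_per_second


# Diffusion Policy: predict Tp = 16 steps, execute Ta = 8, at roughly 10 Hz.
dp_seconds, dp_queries = chunk_timing(16, 10, executed_steps=8)

# π-family: H = 50 at 50 Hz, so one chunk spans one second of motion.
pi_seconds, pi_queries = chunk_timing(50, 50)

# π0.6: the listed 63 ms per chunk as a fraction of the time the chunk covers.
latency_fraction = 0.063 / pi_seconds
```

Under these assumptions a Diffusion Policy chunk covers 1.6 s with a re-plan every 0.8 s, while a π-family chunk covers 1.0 s, so a 63 ms inference call takes up about 6% of the time the chunk spans.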
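The "flow steps" in the Inference row refer to numerically integrating a learned velocity field from noise to an action chunk. A minimal sketch with forward Euler steps follows, assuming the convention that τ = 0 is pure noise and τ = 1 is the data. The velocity function here is a toy stand-in (its data distribution is a single fixed target chunk), not a trained network, and the action dimension of 7 is an arbitrary illustrative choice.

```python
import numpy as np


def sample_chunk(velocity_fn, horizon, action_dim, num_steps=10, seed=0):
    """Integrate d(actions)/d(tau) = velocity_fn(actions, tau) from tau = 0 to 1."""
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))  # tau = 0: Gaussian noise
    delta = 1.0 / num_steps
    for i in range(num_steps):
        tau = i * delta
        # Forward Euler step along the velocity field.
        actions = actions + delta * velocity_fn(actions, tau)
    return actions


# Toy target chunk: H = 50 steps, 7 action dimensions (illustrative only).
target = np.linspace(0.0, 1.0, 50)[:, None] * np.ones((1, 7))


def toy_velocity(actions, tau):
    # Hypothetical stand-in for the learned network. For a linear noise-to-data
    # path with a single data point, the velocity is (target - a) / (1 - tau).
    return (target - actions) / (1.0 - tau)


chunk = sample_chunk(toy_velocity, horizon=50, action_dim=7, num_steps=10)
```

In a real policy the velocity network is also conditioned on images, language, and robot state. Fewer integration steps lower latency but make the Euler approximation coarser.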