Robot manipulation policy architectures

How learned manipulation policies represent and generate actions, 2023–2026. A high-level overview compares every model on the dimensions that generalize across paradigms; the π-family section below drills into architecture-specific detail. Models are tagged by paradigm so you can read across families or across types.

Model	ACT	Diffusion Policy	π₀	π_0.5	π_0.6 / π*_0.6	π_0.7	mimic-video	DreamZero
Paradigm	Action-chunking imitation	Diffusion visuomotor policy	Vision-Language-Action	Vision-Language-Action	Vision-Language-Action + RL (π*_0.6)	VLA + world model	World Action Model	World Action Model
Summary	Transformer predicts a chunk of future actions; generative head is a CVAE. Trained from scratch per task, no language.	Action distribution modelled as a denoising diffusion process. Trained from scratch per task, no language.	Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.	Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.	Initialized from a pretrained vision-language model; language-conditioned; trained across embodiments.	A VLA paired with a generative world model that predicts visual subgoals to plan over.	Grounded in a pretrained video-generation model that predicts the future as video and derives actions via inverse dynamics (dream the future, then act).	Grounded in a pretrained video-generation model that predicts the future as video and derives actions via inverse dynamics (dream the future, then act).
Year · origin	2023 · Stanford (ALOHA)	2023 · Columbia / TRI / MIT	2024 · Physical Intelligence	2025 · PI	2025 · PI	2026 · PI	Dec 2025 · mimic robotics / ETH / Microsoft / UC Berkeley	Feb 2026 · NVIDIA
Params	~80M	~67M (CNN variant)	3.3B	~3.3B	~5.3B (4B + 860M)	~5B + 14B world model	2B video backbone + flow-matching action decoder	14B
Perception backbone	ResNet-18 × 4 cams	ResNet-18 (spatial-softmax, GroupNorm)	PaliGemma (SigLIP 400M + Gemma 2B)	PaliGemma (SigLIP 400M + Gemma 2B)	SigLIP 400M + Gemma 3 4B	Gemma 3 4B + SigLIP + MEM	NVIDIA Cosmos-Predict2 (2B latent DiT)	Wan2.1-I2V-14B (autoregressive video diffusion)
Action representation	CVAE (transformer enc-dec), L1 + KL	DDPM diffusion (1D U-Net / transformer), ε-pred	Conditional flow matching	Hybrid FAST (AR) + flow matching	KI: FAST in VLM + flow in expert	KI + flow; world-model subgoals	Flow-matching inverse-dynamics decoder on video latents (partial denoise to τ_v)	Jointly denoises video + action chunks; implicit IDM → normalized joint positions
Action chunk (H)	k = 100	Tp=16 predict, Ta=8 exec (To=2)	H = 50	H = 50	H = 50	H = 50	not stated	H=48 @ 30 Hz (1.6 s, AgiBot) · H=24 @ 15 Hz (DROID)
Language-conditioned	No	No	Yes	Yes	Yes	Yes	Yes (T5 instruction encoder)	Yes (frozen text encoder)
Cross-embodiment	No	No	Yes	Yes	Yes	Yes (zero-shot)	Tested on several embodiments (WidowX/Panda/bimanual); not yet a unified model	Yes, adapts to new robot (YAM) with 30 min play; video-only demos +42% on unseen
Control freq	50 Hz	~10 Hz	up to 50 Hz	50 Hz	50 Hz	50 Hz	not stated	7 Hz closed-loop (real-time)
Inference	single forward pass; temporal ensembling (+3.3%)	DDIM 10 steps (DDPM 100 train)	10 flow steps; open-loop (ensembling hurt)	10 flow steps	5 flow steps; 63 ms / chunk	5 flow steps; 38–127 ms	Partial denoising of video to τ_v=1, then decode actions from latents	16 denoising steps; Flash variant 4→1 step (~350→~150 ms)
Training data	~50 demos / task	136–250 demos / task	~10,000 h, 903M steps, 7 robots	+ ~400 h mobile, ~100 homes	π_0.5 + RL rollouts + interventions	+ egocentric human video + autonomous	LIBERO 50 demos/task · real bimanual 512 + 480 eps · 10× sample-efficiency vs VLA	~500 h teleop on AgiBot G1 across 22 environments + DROID
Generalization scope	single task, single robot	single task; models multimodal demos	fine-tune to new tasks	open-world new homes	specialist-level out-of-box	compositional, new embodiments	77% from 1 episode/task (2% of action data); converges ~2× faster than VLA	>2× task progress vs SOTA VLAs; zero-shot new tasks & environments
Key contribution	Introduced action chunking	Introduced the diffusion action head	First flow-matching VLA	Generalizes to entirely new homes	RL (RECAP) at VLA scale (π*_0.6)	Compositional generalization	Video-Action Model: video backbone supplies dynamics; decoder only solves control	World Action Model: dream the future in video pixels, then act; a strong zero-shot policy
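
The "Action representation" and "Inference" rows list flow matching with a fixed number of flow steps (10 for π₀, 5 for π_0.6). As a rough illustration of what a flow step is, the sketch below integrates a velocity field from noise to an action chunk with fixed-step Euler updates. It is a toy under stated assumptions, not any of these models' implementations: `toy_velocity` is a hypothetical stand-in for the learned network, each action is a single scalar rather than a joint vector, and the target chunk is invented for the example.

```python
import random

def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned velocity network.

    For a straight-line path from noise (t = 0) to one target chunk (t = 1),
    the ideal velocity at time t is (target - x) / (1 - t).
    """
    return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, target)]

def sample_chunk(target, num_steps=10, seed=0):
    """Integrate the velocity field from t = 0 to t = 1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt  # largest t is 1 - dt, so (1 - t) is never zero
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Assumed toy target: a chunk of H = 50 scalar actions.
target_chunk = [0.01 * k for k in range(50)]
chunk_10 = sample_chunk(target_chunk, num_steps=10)
chunk_5 = sample_chunk(target_chunk, num_steps=5)
```

With this idealized field the path is a straight line, so both 10 and 5 steps land on the target. A learned field is only approximately straight, which is why the step count trades accuracy against latency.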

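The "Action chunk (H)" row gives Diffusion Policy a prediction horizon Tp = 16 and an execution horizon Ta = 8: the policy predicts 16 actions, executes the first 8, then replans. A minimal sketch of that receding-horizon loop follows. The function names and `dummy_policy` are hypothetical, and the observation history (To = 2) is omitted for brevity.

```python
def run_receding_horizon(policy, total_steps, predict_horizon=16, execute_horizon=8):
    """Predict a chunk, execute only its first part, then replan."""
    executed = []
    num_calls = 0
    t = 0
    while t < total_steps:
        chunk = policy(t, predict_horizon)  # one policy inference per replan
        num_calls += 1
        for action in chunk[:execute_horizon]:
            if t >= total_steps:
                break
            executed.append(action)
            t += 1
    return executed, num_calls

def dummy_policy(t, horizon):
    """Hypothetical policy: tags each action with (replan step, target step)."""
    return [(t, t + k) for k in range(horizon)]

executed, calls = run_receding_horizon(dummy_policy, total_steps=40)
```

Here 40 control steps need 5 policy calls, and the last 8 actions of every predicted chunk are discarded. Executing a whole chunk open-loop, as the π₀ column describes, corresponds to setting `execute_horizon` equal to `predict_horizon`.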