PhysDB - Vision-language-action models

What it is

RT-2, Gemini Robotics, PI0, and GR00T-style systems all belong at or near this node, though they differ in architecture and in what has been publicly released.

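VLA models make robot policy learning look more like multimodal foundation modeling: a pretrained vision-language backbone is adapted to map images and instructions to robot actions. They still depend on robot data and on a defined action interface.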

A VLA model is not a full robot product or safety case.
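One way an action interface can be made to look like language modeling is to discretize each continuous action dimension into a fixed number of bins and treat the bin indices as tokens. The sketch below illustrates that idea only. The function names, the 256-bin default, and the example ranges are assumptions for illustration, not details taken from this entry or from any specific system's code.

```python
from typing import List, Sequence

NUM_BINS = 256  # illustrative assumption, not a value stated in this entry


def encode_action(action: Sequence[float], low: Sequence[float],
                  high: Sequence[float], bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action value to an integer bin index (a 'token')."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        if hi <= lo:
            raise ValueError("each high bound must exceed its low bound")
        a = min(max(a, lo), hi)            # clip to the declared range
        frac = (a - lo) / (hi - lo)        # position in [0, 1]
        tokens.append(min(int(frac * bins), bins - 1))
    return tokens


def decode_action(tokens: Sequence[int], low: Sequence[float],
                  high: Sequence[float], bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the center value of each bin."""
    return [lo + (t + 0.5) / bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]
```

Decoding returns bin centers, so a round trip loses at most half a bin width per dimension. The ranges passed as `low` and `high` come from the robot, not from the model, which is one concrete sense in which the model depends on an action interface.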

contains

Physical AI model taxonomy

VLA models are components, not whole robots.

instantiates

Robot action generation

Policy behavior is embodiment-specific.
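A minimal sketch of why the same policy output means different things on different robots, assuming the policy emits a normalized vector in [-1, 1]. The `ActionSpec` type, the two example embodiments, and all ranges are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class ActionSpec:
    """Hypothetical description of one embodiment's action interface."""
    names: Tuple[str, ...]
    low: Tuple[float, ...]
    high: Tuple[float, ...]


def denormalize(action: Sequence[float], spec: ActionSpec) -> Dict[str, float]:
    """Map a normalized action in [-1, 1] onto the embodiment's own ranges."""
    if len(action) != len(spec.names):
        raise ValueError(
            f"expected {len(spec.names)} values, got {len(action)}")
    command = {}
    for name, a, lo, hi in zip(spec.names, action, spec.low, spec.high):
        a = max(-1.0, min(1.0, a))                  # clip to normalized range
        command[name] = lo + (a + 1.0) * 0.5 * (hi - lo)
    return command


# Two made-up embodiments with different dimensions and units.
ARM = ActionSpec(names=("dx_m", "dy_m", "gripper_m"),
                 low=(-0.05, -0.05, 0.0),
                 high=(0.05, 0.05, 0.08))
BASE = ActionSpec(names=("linear_mps", "angular_radps"),
                  low=(-0.5, -1.0),
                  high=(0.5, 1.0))
```

With these specs, a three-value output that is valid for `ARM` is rejected for `BASE`, and even a vector of matching length would be interpreted in different units. A policy trained against one spec does not transfer to another without retraining or an explicit mapping.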

trained by

Robot learning

Web pretraining does not replace robot action data.

overlaps with

Task interpretation

Some systems separate reasoning from action output.
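The separation noted above can be sketched as two stages with a text interface between them: a reasoning stage that turns an instruction and scene into a subtask, and an action stage that turns the subtask and robot state into an action. Both stages here are toy rule-based stand-ins, and every name is hypothetical. Real systems would use learned models in each role.

```python
from typing import Callable, List, Tuple

# (instruction, scene description) -> subtask text
Reasoner = Callable[[str, str], str]
# (subtask text, robot state) -> action vector
ActionHead = Callable[[str, List[float]], List[float]]


def toy_reasoner(instruction: str, scene: str) -> str:
    """Stand-in for a vision-language reasoning stage."""
    if "cup" in instruction and "cup" in scene:
        return "reach cup"
    return "search"


def toy_action_head(subtask: str, state: List[float]) -> List[float]:
    """Stand-in for a low-level action decoder: one small delta per joint."""
    step = 0.1 if subtask.startswith("reach") else 0.0
    return [step for _ in state]


def run_step(reasoner: Reasoner, action_head: ActionHead,
             instruction: str, scene: str,
             state: List[float]) -> Tuple[str, List[float]]:
    """Run one reasoning step followed by one action step."""
    subtask = reasoner(instruction, scene)
    action = action_head(subtask, state)
    return subtask, action
```

Because the two stages share only the subtask string, either one can be swapped or evaluated on its own. That is the practical reason task interpretation overlaps with, but is not identical to, action generation.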

evaluates

Model comparison

Benchmark scores do not establish deployment readiness.