PhysDB Physical AI Map

Evaluation

Benchmarks

Structured tests used to compare models, policies, simulation quality, task success, generalization, or safety-relevant behavior.

evaluation, metrics

What it is

Benchmarks translate vague capability claims into narrower tasks and metrics.

Why it matters

Without benchmarks and failure logs, claims about robot capability remain anecdotal.

How not to overread it

A high benchmark score can reflect overfitting to the benchmark's own task distribution, so it should not be read as evidence of deployment readiness.
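One rough way to check for this is to compare success on the benchmark's own tasks with success on tasks held out from it. The sketch below is illustrative only: the function names and the episode outcomes are hypothetical and are not drawn from any particular benchmark.

```python
def success_rate(outcomes):
    """Fraction of episodes that succeeded; outcomes is a list of booleans."""
    if not outcomes:
        raise ValueError("no episodes to score")
    return sum(outcomes) / len(outcomes)


def generalization_gap(in_distribution, held_out):
    """Benchmark success rate minus held-out success rate."""
    return success_rate(in_distribution) - success_rate(held_out)


# Hypothetical outcomes, made up for illustration only.
benchmark_outcomes = [True, True, True, True, False]    # 4 of 5 succeed
held_out_outcomes = [True, False, False, True, False]   # 2 of 5 succeed

gap = generalization_gap(benchmark_outcomes, held_out_outcomes)
print(f"benchmark: {success_rate(benchmark_outcomes):.2f}")
print(f"held-out:  {success_rate(held_out_outcomes):.2f}")
print(f"gap:       {gap:.2f}")
```

A large positive gap suggests the score depends on the benchmark's task distribution. A small gap on its own still says nothing about deployment readiness.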

Related edges

extends

Safety evaluation

Safety-relevant evaluation

Safety evaluation needs failure modes, not only success scores.
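As a minimal sketch of what reporting failure modes alongside a success score could look like, the snippet below tallies a labeled failure mode for each failed episode. The record layout, the `summarize` helper, and the failure labels are assumptions for illustration, not a PhysDB format.

```python
from collections import Counter


def summarize(episodes):
    """Summarize (succeeded, failure_mode) pairs.

    failure_mode is expected to be None for successful episodes.
    """
    if not episodes:
        raise ValueError("no episodes to summarize")
    successes = sum(1 for ok, _ in episodes if ok)
    # Count only the failed episodes, grouped by their labeled failure mode.
    failure_modes = Counter(mode for ok, mode in episodes if not ok)
    return {
        "success_rate": successes / len(episodes),
        "failure_modes": dict(failure_modes),
    }


# Hypothetical episode log, made up for illustration only.
episodes = [
    (True, None),
    (False, "dropped_object"),
    (True, None),
    (False, "collision"),
    (False, "dropped_object"),
]

report = summarize(episodes)
print(report["success_rate"])
print(report["failure_modes"])
```

Two policies with the same success rate can differ sharply in which failures they produce, which is the information a safety-relevant evaluation needs.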