PhysDB - Benchmarks

What it is

Benchmarks translate vague capability claims into narrower tasks and metrics.

Without benchmarks and failure logs, robot claims stay anecdotal.

Benchmark scores can overfit to a benchmark's task distribution and should not be read as evidence of deployment readiness.

evaluates

Model comparison

Benchmark scores do not establish deployment readiness.

evaluates

World model evaluation

Benchmark scope must be named.

extends

Safety-relevant evaluation

Safety evaluation needs failure modes, not only success scores.
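
As one illustration of reporting failure modes alongside a success score, here is a minimal Python sketch. It is not a PhysDB schema or API: the `Episode` record, the `summarize` function, the scope string, and the failure-mode labels are all hypothetical names chosen for the example. The report carries the named scope and a per-mode failure count next to the success rate, so the score cannot be quoted without them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    # One benchmark trial (hypothetical record); failure_mode is None on success.
    success: bool
    failure_mode: Optional[str] = None


def summarize(scope: str, episodes: list) -> dict:
    """Return success rate together with the named scope and failure-mode counts."""
    if not episodes:
        raise ValueError("no episodes to summarize")
    successes = sum(1 for e in episodes if e.success)
    # Count each failed episode under its label; unlabeled failures stay visible.
    failures = Counter(
        e.failure_mode or "unlabeled" for e in episodes if not e.success
    )
    return {
        "scope": scope,
        "episodes": len(episodes),
        "success_rate": successes / len(episodes),
        "failure_modes": dict(failures),
    }


# Illustrative data only; these are not measured results.
episodes = [
    Episode(True),
    Episode(True),
    Episode(False, "dropped_object"),
    Episode(False, "collision"),
]
report = summarize("tabletop pick-and-place, simulation only", episodes)
print(report)
```

Two systems with the same success rate can differ sharply in this report: one that fails by dropping objects and one that fails by colliding with its surroundings pose different safety questions, which a single score hides.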