Evaluation infrastructure for robot world models

The test suite your robot futures can depend on.

WorldBench catches futures that look right but fail the checks that matter for control: action consistency, contact physics, temporal stability, object permanence, and visual grounding.

Python 3.10+ · 5 control-aware metrics · Local dashboard
Measured against pretty-video metrics

Visual similarity is not enough when a planner trusts the future.

[Figure: world-model diagnosis of a bad_model rollout, plotted over rollout time. SSIM / PSNR verdict: plausible. WorldBench verdict: warning, control failure.]

A model can keep frames visually close while moving opposite the action log or teleporting contact objects. WorldBench scores those failures directly.

42/100: Example bad-model overall score from the README report
5: Built-in metrics spanning visual, action, time, contact, and object checks
0: Cloud services required for the CLI, reports, or local dashboard
01 - Infrastructure

One local evaluation layer, from rollout folder to failure evidence.

WorldBench is not another generator. It is the practical harness robotics teams run before trusting generated futures inside planning loops.

01 Input

Rollout dataset validation

Check frames, actions, states, predictions, and metadata before an eval run starts.
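As a rough illustration of what such a pre-flight check involves, here is a minimal sketch. The `Rollout` container, the one-entry-per-frame convention, and the required metadata keys (`fps`, `robot`) are assumptions made for this example; they are not WorldBench's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical in-memory rollout; not WorldBench's real data model."""
    frames: list
    actions: list
    states: list
    predictions: list
    metadata: dict = field(default_factory=dict)


def validate_rollout(rollout: Rollout, required_meta=("fps", "robot")) -> list[str]:
    """Return human-readable problems; an empty list means the rollout passed."""
    problems = []
    n = len(rollout.frames)
    if n == 0:
        problems.append("no frames")
    # Assumed convention: one action, state, and prediction per frame.
    for name in ("actions", "states", "predictions"):
        got = len(getattr(rollout, name))
        if got != n:
            problems.append(f"{name}: expected {n} entries, found {got}")
    # Assumed metadata keys; adjust to whatever your dataset records.
    for key in required_meta:
        if key not in rollout.metadata:
            problems.append(f"metadata missing '{key}'")
    return problems


good = Rollout([0, 1, 2], [0, 1, 2], [0, 1, 2], [0, 1, 2], {"fps": 10, "robot": "arm"})
bad = Rollout([0, 1, 2], [0, 1], [0, 1, 2], [0, 1, 2], {"fps": 10})
print(validate_rollout(good))
print(validate_rollout(bad))
```

Returning a list of problems rather than raising on the first one lets a single validation pass report everything wrong with a folder before an eval run is attempted.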

02 Control

Action consistency

Measure whether predicted motion follows the robot command sequence instead of drifting.
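One simple way to express this idea is to compare the direction of each commanded displacement with the direction the robot actually moves in the predicted rollout. The sketch below assumes 2D displacement commands and end-effector positions already extracted from the predicted frames; the function name and the 0-100 rescaling are illustrative choices, not WorldBench's implementation.

```python
import math


def action_consistency(actions, predicted_positions, eps=1e-9):
    """Score 0-100 from the mean cosine similarity between each commanded
    displacement and the predicted displacement over the same step."""
    sims = []
    steps = zip(actions, predicted_positions, predicted_positions[1:])
    for action, p0, p1 in steps:
        motion = (p1[0] - p0[0], p1[1] - p0[1])
        norm_a = math.hypot(*action)
        norm_m = math.hypot(*motion)
        if norm_a < eps and norm_m < eps:
            sims.append(1.0)   # commanded stillness, predicted stillness
        elif norm_a < eps or norm_m < eps:
            sims.append(0.0)   # one side moved, the other did not
        else:
            dot = action[0] * motion[0] + action[1] * motion[1]
            sims.append(dot / (norm_a * norm_m))
    if not sims:
        raise ValueError("need at least one action and two positions")
    # Rescale mean similarity from [-1, 1] to [0, 100].
    return 50.0 * (1.0 + sum(sims) / len(sims))


actions = [(1, 0), (1, 0), (0, 1)]
following = [(0, 0), (1, 0), (2, 0), (2, 1)]     # moves as commanded
opposite = [(0, 0), (-1, 0), (-2, 0), (-2, -1)]  # moves against the command
print(action_consistency(actions, following))
print(action_consistency(actions, opposite))
```

The two toy rollouts could look almost identical frame by frame, yet the second moves against every command and lands at the bottom of the scale.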

03 Physics

Contact realism

Flag object motion before contact and other contact-rich failures that visual scores miss.
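A minimal sketch of the "motion before contact" check, assuming gripper and object positions have already been tracked in the predicted frames. The contact radius, tolerance, and function name are hypothetical and chosen only for this example.

```python
import math


def premature_motion_steps(gripper_xy, object_xy, contact_radius=0.05, tol=1e-3):
    """Return step indices where the object moved although the gripper
    had not yet come within contact_radius of it."""
    flagged = []
    touched = False
    for t in range(len(object_xy)):
        # Contact is assumed once the gripper is close enough to the object.
        if math.dist(gripper_xy[t], object_xy[t]) <= contact_radius:
            touched = True
        moved = t > 0 and math.dist(object_xy[t - 1], object_xy[t]) > tol
        if moved and not touched:
            flagged.append(t)
    return flagged


gripper = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
resting_cube = [(0.3, 0.0)] * 4                                # waits for the gripper
sliding_cube = [(0.3, 0.0), (0.3, 0.0), (0.4, 0.0), (0.4, 0.0)]  # moves untouched
print(premature_motion_steps(gripper, resting_cube))
print(premature_motion_steps(gripper, sliding_cube))
```

The sliding cube shifts at step 2 while the gripper is still 0.2 away, which is exactly the kind of failure a pixel-similarity score tends to miss.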

04 Memory

Object permanence

Track task-relevant objects so models cannot simply drop them from the future.
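As an illustration, object permanence can be approximated as the fraction of (frame, object) pairs in which an expected object is still detected. The sketch assumes some upstream detector or tracker supplies a set of object IDs per predicted frame; it deliberately ignores legitimate occlusion, which a real metric would need to handle.

```python
def permanence_score(expected_ids, detections_per_frame):
    """Score 0-100: share of expected objects still present, averaged over frames."""
    expected = set(expected_ids)
    if not expected or not detections_per_frame:
        raise ValueError("need expected objects and at least one frame")
    present = sum(len(expected & set(frame)) for frame in detections_per_frame)
    return 100.0 * present / (len(expected) * len(detections_per_frame))


# Hypothetical detections: the cube vanishes for two frames, then reappears.
frames = [{"cube", "gripper"}, {"gripper"}, {"gripper"}, {"cube", "gripper"}]
print(permanence_score({"cube", "gripper"}, frames))
```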

05 Time

Temporal stability

Catch flicker and inconsistent rollouts across sequential predictions.
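One rough way to spot flicker is to look at the second difference of a per-frame statistic such as mean intensity: smooth change keeps it near zero, while a frame that jumps and snaps back produces a large value. The threshold and function name below are assumptions for this sketch, not WorldBench's metric.

```python
def flicker_frames(mean_intensity, threshold=0.1):
    """Return frame indices where the intensity signal bends sharply,
    measured by the absolute second difference."""
    flagged = []
    for t in range(1, len(mean_intensity) - 1):
        second_diff = mean_intensity[t + 1] - 2 * mean_intensity[t] + mean_intensity[t - 1]
        if abs(second_diff) > threshold:
            flagged.append(t)
    return flagged


smooth = [0.50, 0.52, 0.54, 0.56, 0.58]   # steady brightening
flicker = [0.5, 0.5, 0.9, 0.5, 0.5]       # one frame spikes, then snaps back
print(flicker_frames(smooth))
print(flicker_frames(flicker))
```

Because the second difference spans three frames, a single spike also flags its immediate neighbours; a stricter version could keep only the local maximum.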

06 Evidence

Reports and dashboard

Export Markdown reports or open a zero-dependency HTML dashboard for debugging.

02 - The SDK

Run a robotics world-model eval in one command, then automate it in Python.

Start with the synthetic robot-cube demo, compare good and bad prediction folders, then plug the same evaluator into your model workflow.

  • CLI first. Demo, validate, eval, compare, benchmark, report, dashboard.
  • SDK ready. Import WorldBench, evaluate predictions, and save JSON artifacts.
  • Adapter path. Experimental LeRobot-style local folder import is included.
quickstart.sh (bash)
git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python -m pip install -e ".[dev,video]"
worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset \
  --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset \
  --models good_model bad_model
worldbench dashboard .worldbench/runs/latest/result.json
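The SDK's Python API is not shown on this page, so the sketch below automates the same eval step by driving the CLI from Python instead. It uses only the `worldbench eval` command and `--predictions` flag that appear in the quickstart; the helper names `eval_command` and `run_eval` are hypothetical.

```python
import subprocess


def eval_command(dataset: str, predictions: str) -> list[str]:
    """Build the argv for the `worldbench eval` call shown in the quickstart."""
    return ["worldbench", "eval", dataset, "--predictions", predictions]


def run_eval(dataset: str, predictions: str) -> int:
    """Run the eval and return the process exit code.

    Raises FileNotFoundError if the `worldbench` executable is not on PATH.
    """
    completed = subprocess.run(eval_command(dataset, predictions), check=False)
    return completed.returncode


cmd = eval_command("examples/demo_dataset", "examples/demo_dataset/bad_model")
print(" ".join(cmd))
```

Calling `run_eval("examples/demo_dataset", "examples/demo_dataset/bad_model")` from a script or test harness would then execute the evaluation, assuming the package is installed as in the quickstart.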
03 - Proof loop

Give researchers and evaluation engineers a view they can act on.

WorldBench writes timestamped result JSON under .worldbench/runs/, updates latest/result.json, generates Markdown reports, and opens a local dashboard.
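Because the latest result lands at a stable path, a CI job can read it and fail the build when a score regresses. The sketch below assumes the JSON contains a top-level numeric field named `overall_score`; that key name and the 70-point threshold are guesses for illustration, so check the actual result file before relying on them.

```python
import json
from pathlib import Path


def load_result(path: str = ".worldbench/runs/latest/result.json") -> dict:
    """Read a result file written by a previous eval run."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def passes_gate(result: dict, min_overall: float = 70.0, key: str = "overall_score") -> bool:
    """Return True when the overall score meets the threshold.

    `key` is a hypothetical field name; adapt it to the real schema.
    """
    return float(result[key]) >= min_overall


# Stand-in result mirroring the 42/100 example score on this page.
example = {"overall_score": 42}
print(passes_gate(example))
```

In a pipeline, `passes_gate(load_result())` could decide the job's exit status after the eval step has run.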

[Screenshot: WorldBench local dashboard showing the WorldBench Report for result.json. Overall Score 42/100; Action Consistency 31/100; Contact Realism 20/100. Main failure: plausible frames, wrong control response.]
04 - Who it is for

Built for teams that need more than plausible video.

01

World-model builders

Find the failure modes that block generated futures from driving real planning loops.

02

Robotics ML researchers

Compare prediction folders and benchmark scenarios with reproducible local artifacts.

03

Evaluation engineers

Turn rollout regressions into reports, dashboards, and CI-friendly JSON outputs.

Make robot futures trustworthy before they touch hardware.

WorldBench is open source, local-first, and ready for robotics world-model evaluation workflows.