The test suite your robot futures can depend on.
WorldBench catches futures that look right but fail the checks that matter for control: action consistency, contact physics, temporal stability, object permanence, and visual grounding.
Visual similarity is not enough when a planner trusts the future.
A model can keep frames visually close while the robot moves in the opposite direction to the action log or contact objects teleport. WorldBench scores those failures directly.
One local evaluation layer, from rollout folder to failure evidence.
WorldBench is not another generator. It is the practical harness robotics teams run before trusting generated futures inside planning loops.
Rollout dataset validation
Check frames, actions, states, predictions, and metadata before an eval run starts.
Action consistency
Measure whether predicted motion follows the robot command sequence instead of drifting.
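As an illustration of the idea, and not WorldBench's actual metric, one simple way to score action consistency is the mean cosine similarity between each commanded motion and the displacement the model predicts for that step. The function name and the 2D `(dx, dy)` action format below are assumptions made for this sketch.

```python
import math


def action_consistency(actions, positions, eps=1e-9):
    """Mean cosine similarity between commanded and predicted per-step motion.

    actions:   list of (dx, dy) commands, one per step (hypothetical format)
    positions: list of (x, y) predicted end-effector positions,
               with len(positions) == len(actions) + 1
    Returns a value in [-1, 1]. Steps where the command or the predicted
    displacement is near zero are skipped because direction is undefined.
    """
    if len(positions) != len(actions) + 1:
        raise ValueError("expected one more position than actions")
    sims = []
    for (ax, ay), (x0, y0), (x1, y1) in zip(actions, positions, positions[1:]):
        dx, dy = x1 - x0, y1 - y0
        na, nd = math.hypot(ax, ay), math.hypot(dx, dy)
        if na < eps or nd < eps:
            continue
        sims.append((ax * dx + ay * dy) / (na * nd))
    # No usable steps: report a neutral score rather than dividing by zero.
    return sum(sims) / len(sims) if sims else 0.0
```

A rollout that follows the commands scores near 1, while one that moves against the action log scores near -1 even if individual frames look plausible.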
Contact realism
Flag object motion before contact and other contact-rich failures that visual scores miss.
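The "motion before contact" check can be sketched as follows. This is a minimal example under stated assumptions, not WorldBench's implementation: positions are 2D points per frame, and the `contact_dist` and `move_eps` thresholds are hypothetical values chosen for illustration.

```python
import math


def premature_motion_frames(gripper, obj, contact_dist=0.05, move_eps=0.01):
    """Return frame indices where the object moves without nearby contact.

    gripper, obj: lists of (x, y) positions, one entry per frame.
    A frame t is flagged when the object moved more than move_eps since
    frame t - 1 while the gripper was farther than contact_dist from it.
    """
    if len(gripper) != len(obj):
        raise ValueError("gripper and obj must have the same number of frames")
    flagged = []
    for t in range(1, len(obj)):
        moved = math.dist(obj[t], obj[t - 1]) > move_eps
        # Contact is judged on the previous frame: the gripper must already
        # be close to the object in order to have pushed it.
        near = math.dist(gripper[t - 1], obj[t - 1]) <= contact_dist
        if moved and not near:
            flagged.append(t)
    return flagged
```

A pixel-level similarity score can stay high in exactly these frames, because a cube sliding a few centimeters early changes very little of the image.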
Object permanence
Track task-relevant objects so models cannot simply drop them from the future.
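One way to picture an object permanence score is the fraction of predicted frames in which each tracked object is still detected. The sketch below assumes a hypothetical input of per-frame sets of object ids and ignores legitimate occlusion, so it illustrates the idea and is not a description of how WorldBench tracks objects.

```python
def permanence_scores(detections, tracked):
    """Fraction of frames in which each tracked object is present.

    detections: list of sets, one per frame, holding detected object ids
    tracked:    iterable of object ids that matter for the task
    Returns {object_id: fraction in [0, 1]}.
    """
    if not detections:
        raise ValueError("need at least one frame")
    total = len(detections)
    return {
        obj: sum(1 for frame in detections if obj in frame) / total
        for obj in tracked
    }


frames = [{"cube", "arm"}, {"arm"}, {"cube", "arm"}, {"arm"}]
scores = permanence_scores(frames, ["cube", "arm"])
# The cube is missing from half the frames, so it scores 0.5; the arm scores 1.0.
```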
Temporal stability
Catch flicker and inconsistent rollouts across sequential predictions.
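A rough proxy for flicker is the mean absolute pixel change between consecutive predicted frames. The sketch below assumes frames are flat lists of intensities in [0, 1] and uses a hypothetical threshold; it is an illustration and not WorldBench's scoring code.

```python
def frame_deltas(frames):
    """Mean absolute change between each pair of consecutive frames.

    frames: list of equal-length flat lists of pixel intensities.
    Returns a list with len(frames) - 1 entries.
    """
    deltas = []
    for prev, cur in zip(frames, frames[1:]):
        if len(prev) != len(cur):
            raise ValueError("all frames must have the same size")
        deltas.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return deltas


def flicker_steps(frames, threshold=0.2):
    """Indices of frames whose change from the previous frame exceeds threshold."""
    return [t + 1 for t, d in enumerate(frame_deltas(frames)) if d > threshold]
```

A single bright frame between two dark ones is flagged twice, once going in and once coming out, which is the signature that separates flicker from steady motion.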
Reports and dashboard
Export Markdown reports or open a zero-dependency HTML dashboard for debugging.
Run a robotics world-model eval in one command, then automate it in Python.
Start with the synthetic robot-cube demo, compare good and bad prediction folders, then plug the same evaluator into your model workflow.
- CLI first. Demo, validate, eval, compare, benchmark, report, dashboard.
- SDK ready. Import WorldBench, evaluate predictions, and save JSON artifacts.
- Adapter path. Experimental LeRobot-style local folder import is included.
git clone https://github.com/tigee1311/worldbench.git
cd worldbench
python -m pip install -e ".[dev,video]"
worldbench demo
worldbench validate examples/demo_dataset
worldbench eval examples/demo_dataset \
  --predictions examples/demo_dataset/bad_model
worldbench compare examples/demo_dataset \
  --models good_model bad_model
worldbench dashboard .worldbench/runs/latest/result.json
Give researchers and evaluation engineers a view they can act on.
WorldBench writes timestamped result JSON under .worldbench/runs/, updates latest/result.json, generates Markdown reports, and opens a local dashboard.
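Because the result is plain JSON, a CI job could gate on it with a few lines of Python. The schema is not documented here, so the sketch below assumes a hypothetical top-level "scores" mapping from metric name to a number; the metric names and thresholds are also placeholders to adapt to the real file.

```python
import json
from pathlib import Path


def failing_metrics(result_path, thresholds):
    """Return {metric: score} for metrics that are missing or below threshold.

    Assumes (hypothetically) that the result file looks like
    {"scores": {"metric_name": 0.93, ...}}. Adjust to the real schema.
    """
    data = json.loads(Path(result_path).read_text(encoding="utf-8"))
    scores = data.get("scores", {})
    failures = {}
    for name, minimum in thresholds.items():
        score = scores.get(name)
        # A missing metric is treated as a failure so regressions cannot hide.
        if score is None or score < minimum:
            failures[name] = score
    return failures
```

In a pipeline, a non-empty return value from `failing_metrics(".worldbench/runs/latest/result.json", {...})` would be the signal to exit with a non-zero status.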
Built for teams that need more than plausible video.
World-model builders
Find the failure modes that block generated futures from driving real planning loops.
Robotics ML researchers
Compare prediction folders and benchmark scenarios with reproducible local artifacts.
Evaluation engineers
Turn rollout regressions into reports, dashboards, and CI-friendly JSON outputs.
Make robot futures trustworthy before they touch hardware.
WorldBench is open source, local-first, and ready for robotics world-model evaluation workflows.