# Benchmarks

WorldFlux separates synthetic smoke benchmarks from evidence-oriented `env_policy` lanes.
## Synthetic Smoke Benchmarks

These scripts are compatibility and artifact-generation checks. They are not performance claims.
Shared CLI contract:

- `--quick` (CI-safe short run)
- `--full` (longer run for manual/scheduled validation)
- `--seed <int>`
- `--output-dir <path>`
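As a rough illustration, the shared contract above could be expressed with stdlib `argparse`. This is a hypothetical sketch, not the scripts' actual parser; the `build_parser` name and the mutually exclusive `--quick`/`--full` grouping are assumptions.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the shared benchmark CLI contract;
    # the real scripts may wire these flags differently.
    parser = argparse.ArgumentParser(description="WorldFlux benchmark (sketch)")
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--quick", action="store_true", help="CI-safe short run")
    mode.add_argument("--full", action="store_true",
                      help="longer run for manual/scheduled validation")
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--output-dir", type=str, default="outputs/benchmarks")
    return parser

args = build_parser().parse_args(["--quick", "--seed", "7"])
print(args.quick, args.seed)  # True 7
```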
Synthetic smoke outputs:

- summary JSON (`summary.json`)
- visualization artifact (`imagination.ppm`)
### Benchmark 1: DreamerV3 (Atari-oriented)

```bash
uv run python benchmarks/benchmark_dreamerv3_atari.py --quick --seed 42
```

Full-mode example:

```bash
uv run python benchmarks/benchmark_dreamerv3_atari.py --full --data atari_data.npz --seed 42
```
Expected minimum result:
- finite losses
- imagination artifact generated
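The "finite losses" criterion can be checked programmatically from `summary.json`. The sketch below assumes the summary contains a `"losses"` mapping of names to numbers; that field name is an assumption about the schema, not a documented contract.

```python
import json
import math
from pathlib import Path

def losses_are_finite(summary_path: Path) -> bool:
    """Return True if every value under a hypothetical 'losses' key in
    summary.json is a finite number. The exact schema is an assumption."""
    summary = json.loads(summary_path.read_text())
    losses = summary.get("losses", {})
    return all(
        isinstance(v, (int, float)) and math.isfinite(v)
        for v in losses.values()
    )
```

A NaN or infinite loss makes this check fail, which is the minimal sanity signal the smoke benchmarks are after.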
### Benchmark 2: TD-MPC2 (MuJoCo-oriented)

```bash
uv run python benchmarks/benchmark_tdmpc2_mujoco.py --quick --seed 42
```

Full-mode example:

```bash
uv run python benchmarks/benchmark_tdmpc2_mujoco.py --full --data mujoco_data.npz --seed 42
```
Expected minimum result:
- finite losses
- imagination artifact generated
### Benchmark 3: Diffusion Imagination

```bash
uv run python benchmarks/benchmark_diffusion_imagination.py --quick --seed 42
```

Full-mode example:

```bash
uv run python benchmarks/benchmark_diffusion_imagination.py --full --seed 42
```
Expected minimum result:
- finite losses
- imagination artifact generated
## Evidence Lanes

The two canonical evidence lanes in the current MVP are:

- DreamerV3 on `ALE/Breakout-v5`
- TD-MPC2 on `HalfCheetah-v5`

These are reproducible evidence bundles, not SOTA claims or proof claims.
### Evidence Lane 1: DreamerV3 Breakout

```bash
uv run python benchmarks/evidence_dreamerv3_breakout.py \
  --quick \
  --output-dir outputs/benchmarks/dreamerv3-breakout-evidence
```
Artifacts:

- `summary.json`
- `returns.jsonl`
- `learning_curve.csv`
- `checkpoint_index.json`
- `report.md`
- dataset manifest + replay buffer bundle
Evidence semantics:

- `eval_mode = env_policy`
- `policy_impl = candidate_actor_stateful_eval`
- learned-policy Atari rollout only; random env sampling is invalid evidence
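A downstream consumer could guard against invalid evidence by validating these fields before trusting a bundle. The sketch below assumes `summary.json` records `eval_mode` and `policy_impl` as top-level keys; the function name and schema placement are assumptions.

```python
import json
from pathlib import Path

def check_evidence_semantics(summary_path: Path) -> None:
    # Assumes summary.json records the evidence fields named in this doc
    # as top-level keys; the precise schema is an assumption.
    summary = json.loads(summary_path.read_text())
    if summary.get("eval_mode") != "env_policy":
        raise ValueError(f"invalid evidence: eval_mode={summary.get('eval_mode')!r}")
    if summary.get("policy_impl") != "candidate_actor_stateful_eval":
        raise ValueError(f"unexpected policy_impl={summary.get('policy_impl')!r}")
```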
### Evidence Lane 2: TD-MPC2 HalfCheetah

```bash
uv run python benchmarks/evidence_tdmpc2_halfcheetah.py \
  --quick \
  --collector-policy random \
  --output-dir outputs/benchmarks/tdmpc2-halfcheetah-evidence
```

Preferred collection path:

```bash
uv run python benchmarks/evidence_tdmpc2_halfcheetah.py \
  --quick \
  --policy-checkpoint ./outputs/checkpoint_final.pt \
  --output-dir outputs/benchmarks/tdmpc2-halfcheetah-evidence
```
Evidence lane artifacts:

- `summary.json`
- `returns.jsonl`
- `learning_curve.csv`
- `checkpoint_index.json`
- `report.md`
- dataset manifest + replay buffer bundle
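To inspect a bundle locally, the per-episode returns can be aggregated from `returns.jsonl`. This sketch assumes one JSON object per line carrying an `episode_return` field; that field name is an assumption about the bundle schema, not a documented guarantee.

```python
import json
from pathlib import Path
from statistics import mean, stdev

def summarize_returns(path: Path) -> dict:
    # Assumes one JSON object per line with an "episode_return" field;
    # the field name is an assumption about the bundle schema.
    returns = [
        json.loads(line)["episode_return"]
        for line in path.read_text().splitlines()
        if line.strip()
    ]
    return {
        "episodes": len(returns),
        "mean_return": mean(returns),
        "std_return": stdev(returns) if len(returns) > 1 else 0.0,
    }
```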
Evidence semantics:

- `eval_mode = env_policy`
- `policy_impl = cem_planner_eval`
- learned-policy MuJoCo rollout only; replay/data collection provenance is recorded separately
These lanes are intended to produce reproducible evidence bundles, not SOTA or paper-parity claims.
## Reproducibility Notes

- Keep `seed` fixed for comparisons.
- CPU is the default benchmark target in quick mode.
- Full mode is intended for manual or scheduled evidence runs.
- Runtime and artifacts depend on hardware and optional dependencies.
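For custom scripts built around these benchmarks, a global seeding helper keeps runs comparable. This is a generic sketch, not a WorldFlux API; the `set_global_seed` name is hypothetical, and the numpy/torch calls are guarded because both may be optional dependencies.

```python
import os
import random

def set_global_seed(seed: int) -> None:
    """Fix common RNG sources so repeated runs are comparable.
    Hypothetical helper; numpy/torch seeding is guarded because
    they may be optional dependencies."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```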