PROTOCOL RECOMPUTATION · PUBLIC LEDGER

Compare recomputed VLA scores against reference results.

WorldFlux keeps reference results visible, then shows what happened when the same benchmark protocol was rerun with retained evidence. Ranking tables make the score, denominator, delta, and caveat visible in one scan.

Models: 2

Total episodes: 4,000

Mean delta: -0.15pt

Best WorldFlux: 98.05%

Official-reference recomputation

Ranked by WorldFlux recomputed success rate.

Rows are ranked only inside the displayed benchmark family. Click a row to inspect suites, evidence caveats, and reference links.
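The ranking rule above can be sketched in a few lines of Python. This is a minimal illustration, not WorldFlux's implementation: the `Row` type, its field names, and the `rank_within_family` helper are hypothetical, and the sample rows reuse the counts shown on this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical record type; the field names are assumptions for illustration.
    model: str
    family: str      # e.g. "LIBERO" or "LIBERO-Pro"
    successes: int
    episodes: int

    @property
    def rate(self) -> float:
        return self.successes / self.episodes


def rank_within_family(rows):
    """Group rows by benchmark family, then sort each group by success rate."""
    by_family = defaultdict(list)
    for row in rows:
        by_family[row.family].append(row)
    return {
        family: sorted(group, key=lambda r: r.rate, reverse=True)
        for family, group in by_family.items()
    }


rows = [
    Row("OpenPI pi0.5", "LIBERO", 1935, 2000),
    Row("MolmoAct2", "LIBERO", 1961, 2000),
]
ranking = rank_within_family(rows)
print([r.model for r in ranking["LIBERO"]])
```

Grouping before sorting means a row is only ever compared against rows from the same family, so scores from different families never share a rank.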

MolmoAct2

MolmoAct2 official LIBERO full

WorldFlux 98.05% · Official 98.25% · Delta -0.20pt · Successes 1961/2000 · Grade B

Suite breakdown

- LIBERO-10: 481/500 · WorldFlux 96.2% · Official 96.6% · Delta -0.40pt

- Goal: 488/500 · WorldFlux 97.6% · Official 98.0% · Delta -0.40pt

- Object: 499/500 · WorldFlux 99.8% · Official 100.0% · Delta -0.20pt

- Spatial: 493/500 · WorldFlux 98.6% · Official 98.4% · Delta +0.20pt
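As a cross-check, the per-suite and aggregate figures for this row can be recomputed from the displayed counts. The sketch below is illustrative only: the helper name `rate_pct` is hypothetical, and it assumes the suites are equally sized, so that the official aggregate equals the plain mean of the official per-suite rates.

```python
from decimal import Decimal

# Per-suite successes out of 500 episodes, and the official per-suite rates (%),
# in the order displayed for MolmoAct2.
SUITE_EPISODES = 500
successes = [481, 488, 499, 493]
official_pct = [Decimal("96.6"), Decimal("98.0"), Decimal("100.0"), Decimal("98.4")]


def rate_pct(hits: int, episodes: int) -> Decimal:
    """Success rate in percent, computed from integer counts."""
    return Decimal(hits) * 100 / Decimal(episodes)


suite_pct = [rate_pct(s, SUITE_EPISODES) for s in successes]
suite_delta = [ours - ref for ours, ref in zip(suite_pct, official_pct)]

# Assumption: suites are equally sized, so the official aggregate is the
# plain mean of the official per-suite rates.
total_episodes = SUITE_EPISODES * len(successes)
overall_pct = rate_pct(sum(successes), total_episodes)
official_overall = sum(official_pct) / len(official_pct)
overall_delta = overall_pct - official_overall

print(f"{sum(successes)}/{total_episodes} = {overall_pct:.2f}%")
print(f"delta = {overall_delta:+.2f}pt")
```

`Decimal` is used here instead of binary floats so that differences such as 98.05 minus 98.25 come out as exactly -0.20 percentage points.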

Evidence details

Protocol-level recomputation with model files hashed before evaluation. Not Grade A because the runtime image identity and exact official H100 environment were not fully attested.

- Official LeRobot MolmoAct2 LIBERO evaluation method

- 4 LIBERO suites · 50 trials per task

- Pre-run model and checkpoint SHA-256 digests recorded

- All visible attempts retained; no hidden best-run selection
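Recording a digest before evaluation might look like the following sketch. It uses only Python's standard `hashlib`; the function names `sha256_file` and `record_digests` are hypothetical, and nothing here is taken from the WorldFlux tooling.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns the empty bytes sentinel.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_digests(paths):
    """Return {file name: hex digest}; intended to be stored before the run starts."""
    return {Path(p).name: sha256_file(Path(p)) for p in paths}
```

Storing the returned mapping before the first episode, and comparing it again afterwards, is one way to show that the evaluated files are the files that were hashed.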

Verified AWS artifact snapshot · Official reference

OpenPI pi0.5

OpenPI official LIBERO

WorldFlux 96.75% · Official 96.85% · Delta -0.10pt · Successes 1935/2000 · Grade B

Suite breakdown

- LIBERO-10: 465/500 · WorldFlux 93.0% · Official 92.4% · Delta +0.60pt

- Goal: 490/500 · WorldFlux 98.0% · Official 98.0% · Delta 0.00pt

- Object: 491/500 · WorldFlux 98.2% · Official 98.2% · Delta 0.00pt

- Spatial: 489/500 · WorldFlux 97.8% · Official 98.8% · Delta -1.00pt

Evidence details

Protocol-level recomputation. Model identity is recorded, but no pre-run model artifact digest was taken, so the checkpoint was not frozen before the run.

- Official OpenPI LIBERO evaluation method

- 4 LIBERO suites · 50 trials per task

- Score recomputation, not a custom benchmark

- Model identity signing shown as evidence-grade caveat

Verified local artifact snapshot · Official reference

Robustness ranking

LIBERO-family robustness readouts.

Standard LIBERO and LIBERO-Pro stay separate. Click a row to inspect axes, caveats, denominator, and source references.

MolmoAct2

WorldFlux LIBERO-Pro shard snapshot; 20/20 shards complete

Standard LIBERO 98.050% · LIBERO-Pro 52.500% · Gap -45.550pt

LIBERO-Pro perturbation axes

One official 40-instance slice per axis.

- Object: 85.00% (34/40)

- Semantic: 97.50% (39/40)

- Environment: 55.00% (22/40)

- Position: 22.50% (9/40)

- Task: 2.50% (1/40)
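The LIBERO-Pro aggregate and the gap to standard LIBERO follow directly from the axis counts. The sketch below recomputes both for this row; it is an illustration under the assumption that the aggregate is the pooled success count over all five slices, and the variable names are hypothetical.

```python
from decimal import Decimal

# (successes, instances) per perturbation axis, as displayed for MolmoAct2.
axes = {
    "Object": (34, 40),
    "Semantic": (39, 40),
    "Environment": (22, 40),
    "Position": (9, 40),
    "Task": (1, 40),
}

pro_successes = sum(s for s, _ in axes.values())
pro_instances = sum(n for _, n in axes.values())

# Assumption: the aggregate pools all slices rather than averaging axis rates;
# with equal 40-instance slices the two approaches give the same number.
pro_pct = Decimal(pro_successes) * 100 / Decimal(pro_instances)
standard_pct = Decimal(1961) * 100 / Decimal(2000)   # standard LIBERO row
gap_pt = pro_pct - standard_pct                      # percentage points

print(f"{pro_successes}/{pro_instances} = {pro_pct:.3f}%")
print(f"gap = {gap_pt:+.3f}pt")
```

The gap is a difference between two benchmark families, so it is best read as a robustness readout rather than a delta against a reference result.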

Evidence and caveat

Official LIBERO-Pro denominator is 200 task instances. This row is Grade B, not Grade A, because the public snapshot does not expose every pre-run identity and attempt-history field required for Grade A publication. It is ranked only inside the LIBERO-Pro family, not as a replacement for standard LIBERO.

- Standard evidence: 1961/2000

- LIBERO-Pro evidence: 105/200

- Evidence grade: Grade B

- Source: WorldFlux

- Denominator: 200 LIBERO-Pro (200/200 official task instances complete; shard integrity passed)

LIBERO-Pro reference

OpenPI pi0.5

WorldFlux LIBERO-Pro shard snapshot; 20/20 shards complete

Standard LIBERO 96.750% · LIBERO-Pro 52.500% · Gap -44.250pt

LIBERO-Pro perturbation axes

One official 40-instance slice per axis.

- Object: 90.00% (36/40)

- Semantic: 100.00% (40/40)

- Environment: 47.50% (19/40)

- Position: 20.00% (8/40)

- Task: 5.00% (2/40)

Evidence and caveat

The run record discloses a pre-scoring worker replacement. The published aggregate uses completed 200/200 shard outputs under the frozen LIBERO-Pro manifest; raw failure logs are not included in this public snapshot. The row is Grade B, not Grade A, because the public snapshot does not expose every pre-run identity and attempt-history field required for Grade A publication.

- Standard evidence: 1935/2000

- LIBERO-Pro evidence: 105/200

- Evidence grade: Grade B

- Source: WorldFlux

- Denominator: 200 LIBERO-Pro (200/200 official task instances complete; completed shard integrity summary recorded)

LIBERO-Pro reference

Standard LIBERO and LIBERO-Pro are separate benchmark families. The LIBERO-Pro denominator follows the official task map: 4 base suites × 5 perturbation axes × 10 task instances = 200. Rows below that denominator stay out of this ranking.
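The denominator rule can be written out as a small check. This is a sketch under stated assumptions: the constant names and the `is_rankable` helper are hypothetical, and the only facts taken from the page are the 4 × 5 × 10 task map and the rule that rows below the denominator are excluded.

```python
BASE_SUITES = 4
PERTURBATION_AXES = 5
TASK_INSTANCES = 10  # per base suite and perturbation axis

# Full LIBERO-Pro denominator and the size of one per-axis slice.
OFFICIAL_DENOMINATOR = BASE_SUITES * PERTURBATION_AXES * TASK_INSTANCES
AXIS_SLICE = BASE_SUITES * TASK_INSTANCES


def is_rankable(completed_instances: int) -> bool:
    """A row enters the LIBERO-Pro ranking only if it is not below the denominator."""
    return completed_instances >= OFFICIAL_DENOMINATOR


print(OFFICIAL_DENOMINATOR, AXIS_SLICE)
print(is_rankable(200), is_rankable(180))
```

Both rows on this page report 200/200 completed task instances, so both pass this check; a partial run of 180 instances would not.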

The benchmark owner and reference result remain the source of truth. WorldFlux verifies the path to a recomputed score and shows the evidence caveats next to the numbers, instead of hiding them behind a single leaderboard rank.