PROTOCOL RECOMPUTATION · PUBLIC LEDGER
Compare recomputed VLA scores against reference results.
WorldFlux keeps reference results visible, then shows what happened when the same benchmark protocol was rerun with retained evidence. Ranking tables make the score, denominator, delta, and caveat visible in one scan.
Models: 2 · Total episodes: 4,000 · Mean delta: -0.15pt · Best WorldFlux: 98.05%
Official-reference recomputation
Ranked by WorldFlux recomputed success rate.
Rows are ranked only inside the displayed benchmark family. Click a row to inspect suites, evidence caveats, and reference links.
#1 MolmoAct2
MolmoAct2 official LIBERO full
WorldFlux: 98.05% · Official: 98.25% · Delta: -0.20pt · Successes: 1961/2000 · Grade B
Suite breakdown
- 10: 481/500 · WorldFlux 96.2% · Official 96.6% · Delta -0.40pt
- Goal: 488/500 · WorldFlux 97.6% · Official 98.0% · Delta -0.40pt
- Object: 499/500 · WorldFlux 99.8% · Official 100.0% · Delta -0.20pt
- Spatial: 493/500 · WorldFlux 98.6% · Official 98.4% · Delta +0.20pt
Evidence details
Protocol-level recomputation with model files hashed before evaluation. Not Grade A because the runtime image identity and exact official H100 environment were not fully attested.
- Official LeRobot MolmoAct2 LIBERO evaluation method
- 4 LIBERO suites · 50 trials per task
- Pre-run model and checkpoint SHA-256 digests recorded
- All visible attempts retained; no hidden best-run selection
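Recording pre-run digests, as listed above, could look roughly like the following. This is a minimal sketch and not WorldFlux's actual tooling: the function names `sha256_file` and `digest_manifest` and the directory layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of one file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_manifest(model_dir: Path) -> dict[str, str]:
    """Map every file under model_dir (POSIX-style relative path) to its digest.

    Hypothetical helper: the manifest would be written out before evaluation
    starts, so the evaluated files can be compared against it afterwards.
    """
    return {
        p.relative_to(model_dir).as_posix(): sha256_file(p)
        for p in sorted(model_dir.rglob("*"))
        if p.is_file()
    }
```

Recomputing the manifest after the run and comparing it with the stored one would show whether any model or checkpoint file changed in between.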
#2 OpenPI pi0.5
OpenPI official LIBERO
WorldFlux: 96.75% · Official: 96.85% · Delta: -0.10pt · Successes: 1935/2000 · Grade B
Suite breakdown
- 10: 465/500 · WorldFlux 93.0% · Official 92.4% · Delta +0.60pt
- Goal: 490/500 · WorldFlux 98.0% · Official 98.0% · Delta 0.00pt
- Object: 491/500 · WorldFlux 98.2% · Official 98.2% · Delta 0.00pt
- Spatial: 489/500 · WorldFlux 97.8% · Official 98.8% · Delta -1.00pt
Evidence details
Protocol-level recomputation. Model identity is recorded, but the checkpoint was not frozen with a pre-run model artifact digest before the run.
- Official OpenPI LIBERO evaluation method
- 4 LIBERO suites · 50 trials per task
- Score recomputation, not a custom benchmark
- Model identity signing shown as evidence-grade caveat
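The aggregate figures in this section can be reproduced from the per-suite success counts. The sketch below assumes that the delta is the WorldFlux rate minus the official rate in percentage points, and that the mean delta is the unweighted average over the two rows. Both assumptions match the published numbers, but the page does not state the formulas explicitly. The variable and function names are illustrative.

```python
TRIALS_PER_SUITE = 500  # denominator shown for each suite in the breakdowns

# Per-suite success counts and official aggregate scores, copied from the rows above.
ROWS = {
    "MolmoAct2": {
        "suites": {"10": 481, "Goal": 488, "Object": 499, "Spatial": 493},
        "official_pct": 98.25,
    },
    "OpenPI pi0.5": {
        "suites": {"10": 465, "Goal": 490, "Object": 491, "Spatial": 489},
        "official_pct": 96.85,
    },
}


def summarize(row: dict) -> dict:
    """Recompute successes, denominator, success rate, and delta for one row."""
    successes = sum(row["suites"].values())
    trials = TRIALS_PER_SUITE * len(row["suites"])
    rate_pct = round(100 * successes / trials, 2)
    delta_pt = round(rate_pct - row["official_pct"], 2)
    return {
        "successes": successes,
        "trials": trials,
        "rate_pct": rate_pct,
        "delta_pt": delta_pt,
    }


summaries = {name: summarize(row) for name, row in ROWS.items()}
total_episodes = sum(s["trials"] for s in summaries.values())
mean_delta = round(sum(s["delta_pt"] for s in summaries.values()) / len(summaries), 2)

for name, s in summaries.items():
    print(f"{name}: {s['successes']}/{s['trials']} = {s['rate_pct']:.2f}% "
          f"(delta {s['delta_pt']:+.2f}pt)")
print(f"total episodes: {total_episodes}, mean delta: {mean_delta:+.2f}pt")
```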
Robustness ranking
LIBERO-family robustness readouts.
Standard LIBERO and LIBERO-Pro stay separate. Click a row to inspect axes, caveats, denominator, and source references.
#1 MolmoAct2
WorldFlux LIBERO-Pro shard snapshot; 20/20 shards complete
Standard LIBERO: 98.050% · LIBERO-Pro: 52.500% · Gap: -45.550pt
LIBERO-Pro perturbation axes
One official 40-instance slice per axis.
- Object: 85.00% (34/40)
- Semantic: 97.50% (39/40)
- Environment: 55.00% (22/40)
- Position: 22.50% (9/40)
- Task: 2.50% (1/40)
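The LIBERO-Pro aggregate and the gap to standard LIBERO follow from the per-axis counts above. This sketch assumes the aggregate is the pooled success count over all five 40-instance slices, and that the gap is the LIBERO-Pro rate minus the standard LIBERO rate in percentage points. These assumptions reproduce the displayed 52.500% and -45.550pt, but the names used here are illustrative only.

```python
INSTANCES_PER_AXIS = 40

# MolmoAct2 per-axis success counts, copied from the list above.
AXIS_SUCCESSES = {
    "Object": 34,
    "Semantic": 39,
    "Environment": 22,
    "Position": 9,
    "Task": 1,
}
STANDARD_LIBERO_PCT = 98.05  # 1961/2000 from the standard LIBERO row

# Success rate for each perturbation axis.
axis_rates = {
    axis: 100 * wins / INSTANCES_PER_AXIS for axis, wins in AXIS_SUCCESSES.items()
}

# Pooled aggregate across all axes.
pro_successes = sum(AXIS_SUCCESSES.values())
pro_denominator = INSTANCES_PER_AXIS * len(AXIS_SUCCESSES)
pro_rate_pct = 100 * pro_successes / pro_denominator

# Gap in percentage points, rounded to avoid floating-point noise.
gap_pt = round(pro_rate_pct - STANDARD_LIBERO_PCT, 3)

print(f"LIBERO-Pro: {pro_successes}/{pro_denominator} = {pro_rate_pct:.3f}%")
print(f"Gap vs standard LIBERO: {gap_pt:+.3f}pt")
```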
Evidence and caveat
Official LIBERO-Pro denominator is 200 task instances. This row is Grade B, not Grade A, because the public snapshot does not expose every pre-run identity and attempt-history field required for Grade A publication. It is ranked only inside the LIBERO-Pro family, not as a replacement for standard LIBERO.
Standard evidence: 1961/2000 · LIBERO-Pro evidence: 105/200 · Evidence grade: Grade B · Source: WorldFlux · Denominator: 200 LIBERO-Pro
200/200 official task instances complete; shard integrity passed
LIBERO-Pro reference
#2 OpenPI pi0.5
WorldFlux LIBERO-Pro shard snapshot; 20/20 shards complete
Standard LIBERO: 96.750% · LIBERO-Pro: 52.500% · Gap: -44.250pt
LIBERO-Pro perturbation axes
One official 40-instance slice per axis.
- Object: 90.00% (36/40)
- Semantic: 100.00% (40/40)
- Environment: 47.50% (19/40)
- Position: 20.00% (8/40)
- Task: 5.00% (2/40)
Evidence and caveat
Run record discloses a pre-scoring worker replacement. The published aggregate uses completed 200/200 shard outputs under the frozen LIBERO-Pro manifest; raw failure logs are not included in this public snapshot. The row is Grade B, not Grade A, because the public snapshot does not expose every pre-run identity and attempt-history field required for Grade A publication.
Standard evidence: 1935/2000 · LIBERO-Pro evidence: 105/200 · Evidence grade: Grade B · Source: WorldFlux · Denominator: 200 LIBERO-Pro
200/200 official task instances complete; completed shard integrity summary recorded
LIBERO-Pro reference
Standard LIBERO and LIBERO-Pro are separate benchmark families. The LIBERO-Pro denominator follows the official task map: 4 base suites x 5 perturbation axes x 10 task instances = 200. Rows below that denominator stay out of this ranking.
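The denominator rule can be expressed as a simple eligibility filter. This is a sketch only: the helper `is_rankable` and the partial row `"hypothetical-partial-run"` are invented for illustration and do not appear in the ledger.

```python
BASE_SUITES = 4
PERTURBATION_AXES = 5
INSTANCES_PER_CELL = 10

# Official LIBERO-Pro denominator from the task map: 4 x 5 x 10 = 200.
LIBERO_PRO_DENOMINATOR = BASE_SUITES * PERTURBATION_AXES * INSTANCES_PER_CELL


def is_rankable(completed_instances: int) -> bool:
    """A row enters the ranking only if it covers the full official denominator."""
    return completed_instances >= LIBERO_PRO_DENOMINATOR


# Completed task-instance counts per row. The first two match the ledger;
# the third is a made-up partial run included to show the exclusion.
completed = {
    "MolmoAct2": 200,
    "OpenPI pi0.5": 200,
    "hypothetical-partial-run": 160,
}

ranked_rows = [name for name, done in completed.items() if is_rankable(done)]
print(LIBERO_PRO_DENOMINATOR, ranked_rows)
```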
The benchmark owner and reference result remain the source of truth. WorldFlux verifies the path to a recomputed score and shows the evidence caveats next to the numbers, instead of hiding them behind a single leaderboard rank.