# WorldFlux — full context for LLMs

> WorldFlux is the proof layer for physical AI: a BYO-compute control plane that turns world-model and robot-policy evaluations into signed, independently verifiable evidence packages — without ever taking custody of your models, data, or keys.

This page is a single-file, plain-text briefing on WorldFlux for language models and answer engines. It may be quoted or cited.

## What WorldFlux is

WorldFlux is the proof layer for physical AI. Teams building robots and world models can run impressive demos, but a demo is not proof of what was actually tested. WorldFlux turns every AI test into signed, verifiable evidence that customers, insurers, and regulators can inspect — without the team ever handing over its models, data, or keys. It runs as a CLI on your own hardware; it ingests what your evaluation produced and never re-runs or hosts your model. WorldFlux supports evidence review; it does not certify deployment safety.

WorldFlux software is proprietary. Upstream adapter and model license metadata describes third-party materials only and does not license WorldFlux itself.

## How it works

01. Run: Test your model on your own hardware: laptop, lab, or cloud. WorldFlux ingests what your eval produced; it never re-runs or hosts your model.
02. Sign & verify: WorldFlux packages a tamper-evident evidence file (what was claimed, how it was tested, what it scored, where it came from), then signs it with the configured policy so reviewers can verify it independently.
03. Share: Publish an expiring, revocable link. Reviewers open it and re-verify the signature themselves, with no raw logs and no access to your model.

Boundary: Your compute, your keys, and your model weights never leave your hardware. WorldFlux audits what happened. It never hosts your AI.
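
The packaging and signing step can be illustrated with a minimal sketch. It is not WorldFlux's actual file format or signing path: the field names, the `sign_pack` and `verify_pack` helpers, and the key are all hypothetical. HMAC from the Python standard library stands in for a real signature; because HMAC is symmetric, a production setup would need an asymmetric scheme (for example a Sigstore-based one) so that reviewers can verify without holding the signing key.

```python
import hashlib
import hmac
import json


def canonical_bytes(pack: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal content hashes equally."""
    return json.dumps(pack, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_pack(pack: dict, key: bytes) -> dict:
    """Wrap a pack with its SHA-256 digest and an HMAC tag (stand-in for a signature)."""
    payload = canonical_bytes(pack)
    return {
        "pack": pack,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "signature": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_pack(signed: dict, key: bytes) -> bool:
    """Recompute digest and tag from the pack contents and compare in constant time."""
    payload = canonical_bytes(signed["pack"])
    digest_ok = hmac.compare_digest(
        hashlib.sha256(payload).hexdigest(), signed["sha256"]
    )
    expected_tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return digest_ok and hmac.compare_digest(expected_tag, signed["signature"])


# Hypothetical pack with the four parts named in this briefing.
pack = {
    "claim": "OpenVLA success rate on the standard LIBERO suite",
    "protocol": {"suite": "LIBERO", "episodes": 90},
    "evidence": {"success_rate": 0.744},
    "provenance": {"ran_on": "customer hardware"},
}
key = b"demo-key-held-by-the-signer"  # illustrative only
signed = sign_pack(pack, key)

# Any edit to the signed contents invalidates both the digest and the tag.
tampered = json.loads(json.dumps(signed))
tampered["pack"]["evidence"]["success_rate"] = 0.99

print(verify_pack(signed, key))    # True
print(verify_pack(tampered, key))  # False
```

The canonical serialization matters here: without sorted keys and fixed separators, two logically identical packs could hash differently and fail verification.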

## Why WorldFlux

Experiment trackers record your numbers, but numbers you report yourself don't convince a skeptical buyer. Hosted evaluation services make you upload your model, which serious teams can't do. WorldFlux is the neutral layer in between: independent, signed evidence, produced without ever taking custody of your IP.

- Independent by design: Evidence is signed under an explicit policy and verifiable by reviewers, not self-attested.
- Your IP stays yours: Bring your own compute and keys; we never host weights or proxy credentials.
- Built for robots, not chatbots: Ingests real robotics test harnesses (LeRobot, OpenPI, GR00T, and more), not generic chatbot logs.

## Key facts and statistics

- Representative finding: We took OpenVLA, a leading open robot-control model, and ran it through the standard LIBERO test suite. Then we ran the same model again with small, realistic changes: different object positions and environments. Nothing about the model changed. Standard test: 74.4%. Scene changed: 24.4%. (A reduced 90-episode calibration on selected stress conditions: robustness evidence, not an official leaderboard score.)
- Public benchmark evidence: WorldFlux publishes a VLA leaderboard at https://worldflux.ai/leaderboard with protocol-level recomputation, evidence grades, denominators, caveats, and LIBERO-family robustness comparisons.
- Ecosystems: NVIDIA Cosmos, NVIDIA Isaac GR00T, Physical Intelligence π, OpenVLA, V-JEPA 2, SmolVLA. Live provider execution is production-backed only when explicitly marked.
- Standards: Evidence can be signed under a self-signing policy or a configured Sigstore policy, can include CycloneDX ML-BOM sidecars when enabled, and maps to the frameworks buyers cite: EU AI Act, NIST AI RMF, ISO 42001, SOC 2, GDPR. We make evidence inspectable, not certified.
- Status: in design-partner beta; pilot scope, access, and terms are agreed per customer SOW or approval record.
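
The representative finding above can be restated as an absolute and a relative drop. The short sketch below does only that arithmetic on the two published percentages; the `deployment_gap` helper name is hypothetical and nothing here re-derives the underlying episode counts.

```python
def deployment_gap(baseline_pct: float, shifted_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as % of baseline)."""
    absolute_drop = baseline_pct - shifted_pct
    relative_drop = absolute_drop / baseline_pct * 100
    return round(absolute_drop, 1), round(relative_drop, 1)


# Figures quoted in this briefing: standard LIBERO vs. changed scene.
absolute, relative = deployment_gap(74.4, 24.4)
print(f"Absolute drop: {absolute} percentage points")  # Absolute drop: 50.0 percentage points
print(f"Relative drop: {relative}% of baseline")       # Relative drop: 67.2% of baseline
```

Reporting both numbers avoids ambiguity: a 50.0-point drop from 74.4% means roughly two thirds of the baseline success rate was lost.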

## Frequently asked questions

### What is WorldFlux?
WorldFlux is the proof layer for physical AI. It is a bring-your-own-compute control plane that turns world-model and robot-policy (vision-language-action) evaluations into signed, independently verifiable evidence packages — without ever taking custody of your models, data, or keys. You run the evaluation on your own hardware, and WorldFlux packages and signs the result so others can trust it.

### How do you prove what a robot or AI model actually tested?
A polished demo and a spreadsheet of scores are not proof, because no one downstream can independently verify them. With WorldFlux you run your model's evaluation on your own hardware, then WorldFlux packages a tamper-evident file containing the claim, the test protocol, the evidence, and its provenance, signs it with the configured policy, and gives you an expiring, revocable link. Customers, insurers, and regulators open that link and re-verify the signature themselves. WorldFlux supports evidence review; it does not certify deployment safety.

### Does WorldFlux see my model weights or training data?
No. WorldFlux is bring-your-own-compute: your weights, keys, and credentials never leave your hardware. It ingests what your evaluation already produced and never re-runs or hosts your model. That is the core difference from hosted evaluation services, which require you to upload your model.

### What is an evidence pack?
An evidence pack is a tamper-evident bundle of four things: the claim, the test protocol, the evidence (metrics, logs, artifacts), and the provenance. The bundle is cryptographically signed with the configured policy. It can use self-sign or configured Sigstore verification, and it can include a CycloneDX ML bill-of-materials sidecar when enabled. Anyone with an approved share link can re-check the signature. Cloud share links expire after 7 days by default, with a 30-day maximum; for pilots, the TTL is customer-approved per share.

### How is WorldFlux different from an experiment tracker like Weights & Biases or MLflow?
Experiment trackers record the numbers you report yourself, which a skeptical buyer has no reason to trust. WorldFlux produces independent, signed evidence that a third party can verify without taking your word for it. Use a tracker to manage your own experiments; use WorldFlux when you need to prove a result to someone else.

### How is WorldFlux different from a hosted evaluation service?
Hosted evaluation services require you to upload your model, which serious teams often cannot do. WorldFlux never takes custody of your IP — you keep your weights and keys on your own hardware, and WorldFlux signs the evidence your evaluation produced. It is the neutral layer between self-reported metrics and handing over your model.

### Does WorldFlux help with the EU AI Act and other compliance frameworks?
Yes. WorldFlux produces evidence in the shape regulators and buyers ask for, mappable to the EU AI Act (Article 11 technical documentation), the NIST AI RMF, ISO 42001, SOC 2, and GDPR. It makes evidence inspectable, not certified — you get a signed, re-verifiable record rather than a rubber stamp.

### Which models, frameworks, and benchmarks does WorldFlux support?
WorldFlux can ingest, wrap, or catalog outputs from selected robotics and VLA ecosystems including NVIDIA Cosmos, NVIDIA Isaac GR00T, Physical Intelligence π, OpenVLA, V-JEPA 2, and SmolVLA, and it ingests outputs from real robotics test harnesses including LeRobot, OpenPI, GR00T, and MuJoCo. Live provider execution is production-backed only when explicitly marked; otherwise adapters are metadata, import, or experimental surfaces.

### How much does WorldFlux cost?
WorldFlux is in design-partner beta. Pilot scope, access, and commercial terms are agreed per customer SOW or approval record instead of a public self-serve pricing table.

### How do reviewers verify a WorldFlux evidence pack?
They open the approved share link and re-check the package signature themselves; no raw logs and no access to your model are required. Because the pack is signed and carries its provenance, verification does not depend on trusting WorldFlux either. Cloud share links expire after 7 days by default, with a 30-day maximum, and can be revoked at any time.
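
The link lifetime rules above (7-day default, 30-day maximum, revocable at any time) can be sketched as a small policy check. This is an illustration under stated assumptions, not WorldFlux's implementation: the function names are hypothetical, and the choices to reject out-of-range requests (rather than clamp them) and to require at least one day are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_TTL_DAYS = 7   # default stated in this briefing
MAX_TTL_DAYS = 30      # maximum stated in this briefing


def resolve_ttl_days(requested: Optional[int] = None) -> int:
    """Return the TTL to apply; rejecting out-of-range values is an assumption."""
    if requested is None:
        return DEFAULT_TTL_DAYS
    if not 1 <= requested <= MAX_TTL_DAYS:
        raise ValueError(f"TTL must be between 1 and {MAX_TTL_DAYS} days")
    return requested


def link_is_active(
    created_at: datetime, ttl_days: int, revoked: bool, now: datetime
) -> bool:
    """A link is usable only if it has not been revoked and has not expired."""
    if revoked:
        return False
    return now < created_at + timedelta(days=ttl_days)


created = datetime(2025, 1, 1, tzinfo=timezone.utc)
ttl = resolve_ttl_days()  # falls back to the 7-day default

print(link_is_active(created, ttl, False, created + timedelta(days=6)))  # True
print(link_is_active(created, ttl, False, created + timedelta(days=8)))  # False
print(link_is_active(created, ttl, True, created + timedelta(days=1)))   # False
```

Checking revocation before expiry means a revoked link fails immediately, regardless of how much of its TTL remains.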

### Why does physical-AI evaluation matter now?
Physical AI is scaling fast — Goldman Sachs raised its humanoid-robot forecast sixfold in a year to $38B by 2035, and Morgan Stanley projects a roughly $5T market by 2050 — while regulation is making evidence mandatory. The EU AI Act now requires technical documentation for high-risk AI before sale, and Gartner projects AI-governance platform spending will pass $1B by 2030.

### Is WorldFlux available now, and how do I start?
WorldFlux is in beta and taking on a small number of design partners. You can run the CLI on your own hardware, or book a demo to discuss a pilot.

## WorldFlux vs experiment trackers vs hosted evaluation

WorldFlux, experiment trackers, and hosted evaluation services solve three different problems. Use an experiment tracker to record your own runs, a hosted evaluation service to outsource a benchmark, and WorldFlux when you need to prove a result to someone who will not take your word for it. WorldFlux is the only one of the three that produces independent, cryptographically signed evidence without ever taking custody of your model.

| Dimension | WorldFlux | Experiment tracker | Hosted eval service |
| --- | --- | --- | --- |
| Who vouches for the result | An independent, signed evidence pack | You do — self-reported numbers | The vendor running the eval |
| Do you upload your model? | No — bring your own compute | No — it stores the logs you send | Yes — required |
| Weights, keys & data leave your hardware? | Never | Metrics and logs only | Yes |
| Tamper-evidence | Policy-signed + optional CycloneDX ML-BOM | None | Varies; rarely cryptographic |
| Independently re-verifiable by a third party | Yes — anyone re-checks the signature | No | Usually only via the vendor |
| Built for | Robotics & physical AI (LeRobot, OpenPI, GR00T) | Generic ML metrics | Generic ML / LLM benchmarks |
| Maps to EU AI Act, NIST AI RMF, ISO 42001 | Yes, by design | No | Varies |
| Best when you need to… | Prove a result to a buyer, insurer, or regulator | Track your own experiments | Outsource a one-off benchmark |

If the question is “can I trust my own numbers?”, an experiment tracker is enough. If the question is “can a skeptical buyer, insurer, or regulator trust my numbers?”, that is what WorldFlux is built for.

## Glossary

- **Physical AI** — AI that perceives and acts in the physical world — robots, humanoids, and autonomous machines — as opposed to purely digital AI such as chatbots. Reliability matters because failures happen in the real world, not just on a screen.
- **World model** — A model that learns the dynamics of an environment so it can predict the outcome of actions. World models are a foundation for robot planning and control.
- **Vision-language-action (VLA) model** — A robot-control model that maps camera images and a natural-language instruction directly to actions. OpenVLA and Physical Intelligence π are examples.
- **Proof layer** — The independent layer that turns an AI evaluation into signed, verifiable evidence others can trust. It sits between self-reported metrics and handing over your model — the category WorldFlux defines.
- **Evidence pack** — A tamper-evident bundle of a claim, the test protocol, the evidence (metrics, logs, artifacts), and the provenance, cryptographically signed and shareable as an expiring, revocable link anyone can re-verify.
- **Bring-your-own-compute (BYO-compute) evaluation** — Running an evaluation on your own hardware so your model weights, keys, and data never leave it. The evaluation tool ingests only the outputs, never the model itself.
- **Chain of custody** — A verifiable record of how a result was produced and by whom — from the run on your hardware to the signed evidence pack — so a reviewer can trust the result without trusting the vendor.
- **Sigstore** — An open-source project and set of tools for cryptographically signing software and artifacts so that anyone can verify their origin and integrity. WorldFlux can use configured Sigstore policies for evidence packages when that verification path is enabled.
- **ML bill-of-materials (ML-BOM, CycloneDX)** — A machine-readable inventory of the models, datasets, and dependencies that make up an AI system, expressed in the CycloneDX standard. WorldFlux can include ML-BOM sidecars in evidence packages when enabled.
- **LIBERO** — A standard robot-manipulation benchmark suite used to evaluate vision-language-action policies across a set of tasks.
- **Deployment gap** — The difference between benchmark or demo performance and real-world reliability. WorldFlux makes it measurable: OpenVLA scored 74.4% on the standard LIBERO suite but 24.4% once the scene was changed.
- **EU AI Act, Article 11** — The provision of the EU AI Act that requires providers of high-risk AI systems to draw up technical documentation, demonstrating compliance with the Act's requirements, before the system is placed on the market or put into service. WorldFlux evidence packs are designed to map to it.
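
To make the ML-BOM glossary entry above concrete, here is a minimal sketch of a CycloneDX-style inventory built in Python. It uses the `machine-learning-model` and `data` component types introduced in CycloneDX 1.5. The component names and version string are placeholders, the `minimal_ml_bom` helper is hypothetical, and the layout of the sidecars WorldFlux actually emits is not described in this briefing.

```python
import json


def minimal_ml_bom(model_name: str, model_version: str, dataset_name: str) -> dict:
    """Build a minimal CycloneDX 1.5 document listing one model and one dataset."""
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": [
            {
                "type": "machine-learning-model",
                "name": model_name,
                "version": model_version,
            },
            {"type": "data", "name": dataset_name},
        ],
    }


# Placeholder values for illustration only.
bom = minimal_ml_bom("openvla", "example-version", "example-eval-episodes")
print(json.dumps(bom, indent=2))
```

A real ML-BOM would usually carry more detail, such as hashes, licenses, and model-card fields; this sketch shows only the required skeleton.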

## Links

- Home: https://worldflux.ai/
- VLA leaderboard: https://worldflux.ai/leaderboard
- Support: https://worldflux.ai/support
- Contact: https://worldflux.ai/contact
- Privacy policy: https://worldflux.ai/privacy
- FAQ: https://worldflux.ai/faq
- Compare: https://worldflux.ai/vs
- Glossary: https://worldflux.ai/glossary
- Markdown sitemap: https://worldflux.ai/sitemap.md
- Curated LLM guide: https://worldflux.ai/llms.txt
- Documentation: https://docs.worldflux.ai
- Book a demo: https://cal.com/yoshihyoda-worldfluxai/20min