what we measured · what we tested · what is already shipping
Atlas is not a thesis. It is a measured enforcement system: contract code, property-tested kernels, and deployment-grade primitives for bounded agent authority.
The evidence is simple: runtime policy gets weaker as policy gets richer. Substrate policy does not.
| Substrate | Runtime evaluation | |
|---|---|---|
| 3-rule policy | 0.0% | 11.3% |
| 7-rule policy | 0.0% | 29.3% |
| Between-model spread | 0.0pp | 31.3pp |
Two benches, 600 LLM calls total, same five frontier models, same day. One bench against a three-rule policy. One against a seven-rule institutional policy.
Each bench compared two enforcement architectures. The first arm: software that evaluates policy at runtime — the dominant architecture today. The second arm: an Atlas substrate, where policy is encoded into the contract path itself.
At three rules, runtime evaluation leaked 11.3% across the five models, with spread of 15.2 percentage points between the most-resistant and least-resistant model. At seven rules, the same arm leaked 29.3%, with spread of 31.3pp. The substrate arm held at 0.0% at both depths across all five models.
The richer the policy, the wider the gap. Adding bounds increases the surface a runtime evaluator must reason about. Adding bounds to substrate enforcement does not create an equivalent reasoning burden. The predicate either passes or fails.
Full methodology and per-model breakdown are available in the research download.
The strongest models did not usually fail because they were persuaded by adversarial framing. They ignored social engineering reliably.
They failed because they computed the action wrong.
The dominant failure mode was quantitative compositional reasoning: miscalculating trade cost, misapplying a bound, or crossing the balance floor while satisfying another constraint.
That matters because institutional policy is compositional by nature. A valid action may need to satisfy amount caps, counterparty rules, liquidity floors, settlement windows, rate limits, and reporting thresholds simultaneously. Each additional bound increases the reasoning surface for a runtime evaluator.
Substrate enforcement does not depend on that reasoning.
Atlas ships eight predicate primitives that map directly onto the institutional policy primitives compliance teams already author:
| Atlas predicate | Institutional policy primitive |
|---|---|
Amount cap | Per-action cap (pre-approval workflow) |
Amount minimum | Minimum trade size (dust filter) |
Balance floor | Treasury reserve floor |
Recipient (exact) | Bilateral counterparty pinning (ISDA-style) |
Recipient allowlist | KYC / sanctions allowlist (BSA / OFAC) |
Expiry | Settlement-window close |
Not-before | Blackout / earnings-window freeze |
Rate limit | Daily / hourly money-movement cap (BSA / AML reporting) |
These predicates compose. A single envelope can carry a cap, recipient allowlist, settlement window, rate limit, and balance floor — AND-composed. The action is valid only if every predicate passes.
Compliance teams already write these rules on paper. Atlas makes them executable.
The benchmark measures the runtime-vs-substrate enforcement gap. The property suites test a different question: does the gate implementation match the bounded-authority specification?
Two property suites cover the fixed-rule kernel and the composable-predicate kernel. 42 property tests. More than 7,000 sampled inputs. Zero failures. The suites cover soundness, completeness, determinism, AND-composition equivalence, short-circuit evaluation, monotonicity, and concrete adversarial replays.
This is not a closed-form proof. SMT-backed verification remains on the roadmap.
What property testing establishes is narrower but still material: under realistic adversarial input distributions, the gate is observationally indistinguishable from the spec.
Atlas runs three on-chain primitives:
More than 600 contract tests pass. The reference implementation is available under access agreement. Atlas is in deployment with our first design partners.
We do not claim audit completeness we do not yet have. External third-party audit is sequenced with the institutional design-partner cohort and is not yet complete.
The current claim is precise: Atlas has measured the runtime-enforcement gap, implemented the substrate alternative, tested the gate against specification, and is now moving through design-partner deployment.
If the architecture is irreversible — and the measurements, property tests, and deployments above show that it is — then the only remaining question is who builds the substrate.
We do.
The agent can be wrong. The bound stays right.