Evidence — Atlas Protocol

Leak rate	Substrate	Runtime evaluation
3-rule policy	0.0%	11.3%
7-rule policy	0.0%	29.3%
Between-model spread (7-rule policy)	0.0pp	31.3pp

A leak means the evaluator allowed an action that violated the written policy. Five frontier models were tested under realistic adversarial conditions across two benches. The substrate result is 0.0% by construction: the action either satisfies the envelope or cannot be formed on-chain. The asymmetry is the argument.

§1 The empirical wedge

Two benches, 600 LLM calls total, same five frontier models, same day. One bench against a three-rule policy. One against a seven-rule institutional policy.

Each bench compared two enforcement architectures. The first arm: software that evaluates policy at runtime — the dominant architecture today. The second arm: an Atlas substrate, where policy is encoded into the contract path itself.

At three rules, runtime evaluation leaked 11.3% across the five models, with a spread of 15.2 percentage points between the most-resistant and least-resistant model. At seven rules, the same arm leaked 29.3%, with a spread of 31.3pp. The substrate arm held at 0.0% at both depths across all five models.
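For readers who want the two metrics pinned down, here is a minimal sketch of how a leak rate and a between-model spread could be computed from per-model counts. It assumes the aggregate rate is pooled over all calls, which the text above does not state. The model names and counts are placeholders chosen for easy arithmetic, not the benchmark's data.

```python
from typing import Dict, Tuple


def leak_rate(leaks: int, calls: int) -> float:
    """Percentage of calls in which a policy-violating action was allowed."""
    return 100.0 * leaks / calls


def summarize(per_model: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Summarize (leaks, calls) counts per model.

    Assumption: the aggregate is pooled over all calls; the spread is the
    gap between the least-resistant and most-resistant model.
    """
    rates = {name: leak_rate(l, c) for name, (l, c) in per_model.items()}
    total_leaks = sum(l for l, _ in per_model.values())
    total_calls = sum(c for _, c in per_model.values())
    return {
        "aggregate_pct": leak_rate(total_leaks, total_calls),
        "spread_pp": max(rates.values()) - min(rates.values()),
    }


# Placeholder counts (hypothetical, NOT the published results).
placeholder = {"model_a": (1, 20), "model_b": (2, 20), "model_c": (5, 20)}
summary = summarize(placeholder)
```

The spread is reported in percentage points because it is a difference between two rates, not a rate itself.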

The richer the policy, the wider the gap. Adding bounds increases the surface a runtime evaluator must reason about. Adding bounds to substrate enforcement does not create an equivalent reasoning burden. The predicate either passes or fails.

Full methodology and per-model breakdown are available in the research download.

§2 What the failures looked like

The strongest models did not usually fail because they were persuaded by adversarial framing. They ignored social engineering reliably.

They failed because they computed the action wrong.

The dominant failure mode was quantitative compositional reasoning: miscalculating trade cost, misapplying a bound, or crossing the balance floor while satisfying another constraint.

That matters because institutional policy is compositional by nature. A valid action may need to satisfy amount caps, counterparty rules, liquidity floors, settlement windows, rate limits, and reporting thresholds simultaneously. Each additional bound increases the reasoning surface for a runtime evaluator.

Substrate enforcement does not depend on that reasoning.
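The failure mode described above, where an action satisfies one bound while crossing another, can be illustrated with a small worked example. This is a hypothetical sketch: the `Trade` fields, the fee model, and every number are assumptions made for illustration, not values from the benchmark.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Trade:
    quantity: int    # units to buy
    unit_price: int  # price per unit, in cents
    fee_bps: int     # fee in basis points of notional (hypothetical fee model)


def total_cost(trade: Trade) -> int:
    """Notional plus fee, in cents. Omitting the fee is the classic slip."""
    notional = trade.quantity * trade.unit_price
    fee = notional * trade.fee_bps // 10_000
    return notional + fee


def check(trade: Trade, balance: int, amount_cap: int, balance_floor: int) -> Dict[str, bool]:
    """Evaluate both bounds against the same fully computed cost."""
    cost = total_cost(trade)
    return {
        "amount_cap": cost <= amount_cap,
        "balance_floor": balance - cost >= balance_floor,
    }


trade = Trade(quantity=90, unit_price=10_000, fee_bps=50)
results = check(trade, balance=1_200_000, amount_cap=1_000_000, balance_floor=300_000)
allowed = all(results.values())
```

Here the cost is 904,500 cents, which is under the cap, but the remaining balance of 295,500 falls below the 300,000 floor. An evaluator that forgot the 4,500 fee would compute a remaining balance of exactly 300,000 and allow the trade.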

§3 What Atlas enforces

Atlas ships eight predicate primitives that map directly onto the institutional policy primitives compliance teams already author:

Atlas predicate	Institutional policy primitive
`Amount cap`	Per-action cap (pre-approval workflow)
`Amount minimum`	Minimum trade size (dust filter)
`Balance floor`	Treasury reserve floor
`Recipient (exact)`	Bilateral counterparty pinning (ISDA-style)
`Recipient allowlist`	KYC / sanctions allowlist (BSA / OFAC)
`Expiry`	Settlement-window close
`Not-before`	Blackout / earnings-window freeze
`Rate limit`	Daily / hourly money-movement cap (BSA / AML reporting)

These predicates compose. A single envelope can carry a cap, recipient allowlist, settlement window, rate limit, and balance floor — AND-composed. The action is valid only if every predicate passes.
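AND-composition can be sketched in a few lines. The following is an illustrative model in Python, not the Atlas contract code: the `Action` fields, the predicate constructors, and the `envelope` helper are all hypothetical names. The rate limit is left out because it needs call history rather than a single action.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Action:
    amount: int
    recipient: str
    timestamp: int       # unix seconds
    balance_before: int  # balance before the action executes


Predicate = Callable[[Action], bool]


def amount_cap(limit: int) -> Predicate:
    return lambda a: a.amount <= limit


def recipient_allowlist(allowed: FrozenSet[str]) -> Predicate:
    return lambda a: a.recipient in allowed


def expiry(deadline: int) -> Predicate:
    return lambda a: a.timestamp <= deadline


def balance_floor(floor: int) -> Predicate:
    return lambda a: a.balance_before - a.amount >= floor


def envelope(*predicates: Predicate) -> Predicate:
    """AND-compose: the action is valid only if every predicate passes."""
    return lambda a: all(p(a) for p in predicates)


gate = envelope(
    amount_cap(1_000),
    recipient_allowlist(frozenset({"acct-1", "acct-2"})),
    expiry(1_700_000_000),
    balance_floor(5_000),
)

ok = Action(amount=800, recipient="acct-1", timestamp=1_699_999_000, balance_before=6_000)
bad_recipient = Action(amount=800, recipient="acct-9", timestamp=1_699_999_000, balance_before=6_000)
under_floor = Action(amount=1_000, recipient="acct-1", timestamp=1_699_999_000, balance_before=5_500)
```

In this sketch `gate(ok)` passes, while `gate(bad_recipient)` and `gate(under_floor)` are rejected even though each satisfies the other three predicates.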

Compliance teams already write these rules on paper. Atlas makes them executable.

§4 How we tested the gate

The benchmark measures the runtime-vs-substrate enforcement gap. The property suites test a different question: does the gate implementation match the bounded-authority specification?

Two property suites cover the fixed-rule kernel and the composable-predicate kernel. 42 property tests. More than 7,000 sampled inputs. Zero failures. The suites cover soundness, completeness, determinism, AND-composition equivalence, short-circuit evaluation, monotonicity, and concrete adversarial replays.
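To give a sense of what such properties look like, here is a minimal sketch that uses only the Python standard library. It is not the Atlas test suite. The toy `gate`, the two predicates, and the reading of "monotonicity" as "tightening a bound never turns a rejection into an acceptance" are assumptions made for illustration.

```python
import random
from typing import Callable, List

Predicate = Callable[[int], bool]


def cap(limit: int) -> Predicate:
    return lambda amount: amount <= limit


def minimum(limit: int) -> Predicate:
    return lambda amount: amount >= limit


def gate(predicates: List[Predicate], amount: int) -> bool:
    """Toy gate: short-circuits on the first failing predicate."""
    for predicate in predicates:
        if not predicate(amount):
            return False
    return True


def run_properties(samples: int = 1000, seed: int = 0) -> int:
    """Check three properties on randomly sampled bounds and amounts."""
    rng = random.Random(seed)
    for _ in range(samples):
        lo, hi = rng.randint(0, 1000), rng.randint(0, 1000)
        amount = rng.randint(0, 1200)
        predicates = [minimum(lo), cap(hi)]
        result = gate(predicates, amount)

        # Determinism: the same input always yields the same verdict.
        assert gate(predicates, amount) == result

        # AND-composition equivalence: the gate equals the conjunction
        # of its predicates evaluated independently.
        assert result == (minimum(lo)(amount) and cap(hi)(amount))

        # Monotonicity (assumed reading): anything accepted under a
        # tighter cap is also accepted under the looser one.
        if gate([minimum(lo), cap(hi - 1)], amount):
            assert result
    return samples


checked = run_properties()
```

A fixed seed keeps the run reproducible. A real suite would more likely use a property-testing library with shrinking, so that a failing input is reduced to a minimal counterexample.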

This is not a closed-form proof. SMT-backed verification remains on the roadmap.

What property testing establishes is narrower but still material: under realistic adversarial input distributions, the gate is observationally indistinguishable from the spec.

§5 What is shipping

Atlas runs three on-chain primitives:

a policy gate,
a commitment ledger,
an audit chain.

More than 600 contract tests pass. The reference implementation is available under access agreement. Atlas is in deployment with our first design partners.

We do not claim audit completeness we do not yet have. External third-party audit is sequenced with the institutional design-partner cohort and is not yet complete.

The current claim is precise: Atlas has measured the runtime-enforcement gap, implemented the substrate alternative, tested the gate against specification, and is now moving through design-partner deployment.