Active validation lane

AI reasoning

A reasoning system can keep generating after its sampled future has collapsed into the wrong family. The reasoning lane detects that collapse without waiting for a wrong endpoint: it measures the future field induced by the committed prefix while the system is still working.

In-flight observability for reasoning systems, built on the same conceptual object the SAT flagship measures.

On this page

What endpoint accuracy misses · Future field · Case study · Connection to SAT · What's being measured · Claim boundary · Vocabulary

Endpoint evaluation collapses a richer object.

A reasoning trace is not only an answer-producing sequence. It is a path of commitments. Each emitted token fixes part of the future: some continuations remain easy, some become narrow, some become budget-censored, and some become systematically wrong.

Standard evaluation usually sees only the endpoint. Process supervision sees steps. Self-consistency aggregates over endpoints. None of these, by itself, measures the distribution of futures that the committed state leaves open.

The reasoning lane measures that distribution. Given a committed prefix, a continuation policy, and a resource budget, we repeatedly continue from the prefix and examine where the continuations land. Correct-answer mass is one coordinate of that distribution. It is not the whole object.
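The sampling loop described above can be sketched in a few lines. This is a minimal illustration, not the lane's actual instrument: `continue_fn`, `toy_policy`, the 5% censoring rate, and the budget value are all hypothetical stand-ins for a real model rollout.

```python
import random
from collections import Counter

def sample_future_field(prefix, continue_fn, m, budget, seed=0):
    """Continue from the same committed prefix m times and tally the outcomes."""
    rng = random.Random(seed)
    outcomes = Counter()
    for _ in range(m):
        outcomes[continue_fn(prefix, budget, rng)] += 1
    return outcomes

def toy_policy(prefix, budget, rng):
    # Hypothetical stand-in for a model rollout. It ignores prefix and budget
    # and returns a final answer, or None for a budget-censored continuation.
    if rng.random() < 0.05:
        return None
    return 8 * rng.choice([7, 11, 14, 15])

field = sample_future_field("committed prefix", toy_policy, m=64, budget=512)

# Correct-answer mass is one coordinate of the field, not the whole object.
correct_mass = field[532] / 64
```

The returned counter is the empirical future field; every other quantity on this page (family mass, residue, diversity) is a function of it.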

The sentence the framework makes pronounceable

Self-consistency calls a state uncertain when the answers vary. This measurement calls it family-locked when the answers vary inside the same wrong family.

A committed prefix induces a field of futures.

Committed prefix, future field, diagnostic projection.

Three primitives, each of which has a plain-language version.

Primitive 01

Committed prefix

The state a process has already produced. In a reasoning system this is the token sequence written so far. In a solver it is a partial assignment. In a scheduler it is a partial schedule. The surface is different; the conceptual object is the same: an irreversible commitment.

Primitive 02

Future field

The distribution of outcomes the system can still produce from that exact committed state under a declared policy and budget. We measure it by sampling many continuations and reading where they land. The field has structure beyond a single most likely answer.

Primitive 03

Diagnostic projection

A way of looking at the future field through one labeling rule at a time. Are most futures multiples of 8? Are they all in the same operation family? Do they share a residue the target lacks? Each projection is a different question about the same field.
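One way to read "one labeling rule at a time" is as a set of functions applied to the same sample. The sketch below assumes that reading; the answer list and the projection names are illustrative, not measured data.

```python
from collections import Counter

def project(answers, label):
    """Histogram of one labeling rule applied to every sampled answer."""
    return Counter(label(a) for a in answers)

# Hypothetical sample; the values are illustrative only.
answers = [56, 88, 112, 120, 56, 56, 88, 120]

projections = {
    "residue mod 8": lambda a: a % 8,
    "multiple of 8": lambda a: a % 8 == 0,
    "exact answer": lambda a: a,
}

# Each band is a different question asked of the same field.
bands = {name: project(answers, rule) for name, rule in projections.items()}
```

Here the first two bands each collapse to a single label while the exact-answer band keeps four distinct values, which is the pattern the case study describes.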

The intuitive picture: tunnels and buckets.

Imagine the system writing a reasoning trace as opening a tunnel into the space of possible future answers. Once the tunnel is open, all subsequent continuations have to come out somewhere downstream of where the system is now standing. The tunnel can stay wide or narrow sharply. It can point at the right region or at a wrong one.

A diagnostic projection sorts the answers into buckets by some property — like sorting laundry by color. If the target is a blue shirt and the system keeps producing answers from the white-clothes pile, it isn't merely wrong. It's searching in a bucket that doesn't contain the target.

The reasoning lane measures both the shape of the tunnel and which bucket its outputs are landing in.

Why this is different from existing approaches

Self-consistency picks the majority answer. Process supervision rewards good steps. Tree-of-thought searches branches. Each treats the trace as a means to an answer. The reasoning lane treats the trace as a state and asks what futures remain attached to it.

First case study, on a math-tuned 7B reasoning model.

Scaffold-lattice confinement on an arithmetic reasoning trace.

A floor-sum target with correct answer 532. Three committed-prefix checkpoints — at 25%, 50%, and 75% of the trace. M = 64 continuations per checkpoint. The animation below shows the same future field projected four ways.

Interactive figure (checkpoint scrubber): each band is a different diagnostic projection of the same future field. The upper bands lock first; the lower bands stay diverse longer.

What the animation shows.

At 25% progress, the upper bands are already locked: 96.9% of all continuations, and every extracted numeric answer, land in the wrong arithmetic family, whether read as a residue (answer mod 8 = 0) or as scaffold-family membership (answer Y ∈ 8ℤ). At the same checkpoint, the parameter band still spans seven distinct values (the system is producing answers like 56, 88, 112, 120), and the answer band still has multiple peaks.

This is the central decoupling. The field is family-locked but parameter-diverse. Self-consistency would call this state uncertain because the answers vary. The measurement calls it already committed because the answers vary only inside one wrong family, and that family excludes the target.
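The decoupling can be made concrete by computing two numbers on the same sample: the entropy of the exact answers (what a self-consistency view reacts to) and the mass inside one family. The sketch below is illustrative; the answer values come from the examples above, but their counts are invented.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (bits) of the exact-answer distribution."""
    counts = Counter(answers)
    n = len(answers)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def family_mass(answers, in_family):
    """Fraction of answers that fall inside one family."""
    return sum(1 for a in answers if in_family(a)) / len(answers)

def in_8z(a):
    return a % 8 == 0

# Hypothetical counts for illustration.
answers = [56, 88, 112, 120, 56, 88, 56, 120]
target = 532

entropy = answer_entropy(answers)  # high: the answers disagree
family_locked = family_mass(answers, in_8z) == 1.0 and not in_8z(target)
```

A high entropy together with `family_locked == True` is the "uncertain by self-consistency, committed by family" state.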

By the 50% checkpoint, the lower bands have caught up. All four projections are locked. The system has consolidated onto the wrong attractor 56. By 75%, that consolidation persists. True rewind across 30 conditions and 1,920 continuations did not recover correctness.

The receipts

1,840 of 1,842 numeric continuations across the run land in the wrong arithmetic family 8ℤ.

The target 532 is not in 8ℤ. Its residue is 4.

Zero correct continuations across 2,112 total samples: three checkpoints plus 30 rewind conditions at M = 64 each (33 × 64 = 2,112).

Operation-sequence diversity drops from 51 distinct sequences at 25% to 19 at 50% to 15 at 75%, while exact-text diversity stays maximal. Surface variety, family lock.
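The contrast between surface variety and operation-sequence convergence amounts to counting distinct items under two different keys. The sketch below assumes each continuation has already been reduced to an operation sequence; that extraction step, the operation names, and the sample texts are all hypothetical.

```python
def distinct_count(items):
    """Number of distinct items in an iterable."""
    return len(set(items))

# Hypothetical continuations: (raw text, extracted operation sequence).
continuations = [
    ("First, split into 8 blocks, then multiply by 7.", ("split", "mul")),
    ("We split the sum into eight blocks and multiply.", ("split", "mul")),
    ("Break it into blocks; multiply the block value by 8.", ("split", "mul")),
    ("Compute one block, add the remainder, multiply.", ("split", "add", "mul")),
]

text_diversity = distinct_count(text for text, _ in continuations)
op_diversity = distinct_count(ops for _, ops in continuations)
```

In this toy sample every text is unique while only two operation sequences appear, the same qualitative gap as the measured drop from 51 to 15 sequences under maximal text diversity.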

The same conceptual object, on a substrate without an oracle.

The SAT flagship measures a partial assignment and asks whether satisfying completions remain reachable from it. The reasoning lane measures a token prefix and asks where the model's continuations land. The objects are different in surface and identical in structure: a committed history, and the futures it leaves open.

The substrates differ in one crucial way. SAT has an exact bridge oracle: a solver can certify that no satisfying completion remains from a given prefix. Reasoning systems do not. A model might recover under a larger budget, a different sampler, a different prompt, or a verifier we have not yet attached. The framework therefore makes a weaker claim on the reasoning side and labels it as such.

What the SAT side calls certified non-extendability, the reasoning side calls persistence under declared instrument. Same conceptual program, different epistemic status. The vocabulary tracks the distinction precisely so that a careful reader can never confuse the two.
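To show what an exact oracle buys on the SAT side, here is a brute-force extendability check on a toy formula. It is a sketch under stated assumptions: exhaustive enumeration stands in for a real solver, and the clause set is invented; nothing here is the flagship's actual bridge check.

```python
from itertools import product

def extendable(clauses, n_vars, partial):
    """Exact check: does any completion of `partial` satisfy every clause?

    clauses: list of clauses, each a list of nonzero ints (DIMACS-style literals).
    partial: dict mapping a variable index to its committed boolean value.
    """
    free = [v for v in range(1, n_vars + 1) if v not in partial]
    for values in product([False, True], repeat=len(free)):
        assignment = {**partial, **dict(zip(free, values))}
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return True
    return False

# Toy formula: (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
clauses = [[1, 2], [-1, 3], [-2, -3]]

live_prefix = extendable(clauses, 3, {1: True})            # still extendable
dead_prefix = extendable(clauses, 3, {1: True, 2: True})   # no completion left
```

A `False` result here is a certificate over all completions. No amount of sampling on the reasoning side yields the same kind of statement, which is why the reasoning vocabulary speaks of persistence instead.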

The cross-substrate sentence

A committed process can enter a region of future space that no longer intersects, or no longer substantially reaches, the target set. In SAT, the region is a dead prefix cylinder. In reasoning, it is a target-excluding answer fiber. The shape of the region is different. The phenomenon is the same.

What's currently being measured.

The first case study is one demonstration. Two ongoing measurements are designed to settle the questions it cannot answer.

Early-prefix tomography

A prompt-level baseline at a higher rollout count (M = 256), plus dense early checkpoints at 2.5%, 5%, 10%, 15%, 20%, and 25%. The question this run answers: did correct mass exist before the wrong family appeared? If yes, the case becomes a measured live-to-confined transition. If the prompt-level field is already on the wrong family, the case becomes a stable wrong-scaffold attractor for this model under this prompt. Either result sharpens the framework.
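The two possible readings of this run can be written as a small decision rule. The sketch below is an assumption-laden simplification: it treats any nonzero count of correct continuations as "live", whereas a real analysis would presumably use a confidence bound, and the example counts are invented, not results.

```python
def classify_case(correct_counts):
    """Classify a run from per-checkpoint counts of correct continuations.

    correct_counts maps trace progress (0.0 = prompt level) to the number
    of correct continuations sampled from that checkpoint.
    """
    live_points = sorted(p for p, k in correct_counts.items() if k > 0)
    if live_points:
        # Correct mass existed; report the last checkpoint where it was seen.
        return ("live-to-confined transition", live_points[-1])
    return ("stable wrong-scaffold attractor", None)

# Hypothetical counts, for illustration only.
example_live = {0.0: 9, 0.025: 4, 0.05: 1, 0.10: 0, 0.15: 0, 0.20: 0, 0.25: 0}
example_stable = {p: 0 for p in example_live}
```

With these inputs, `classify_case(example_live)` reports a transition with correct mass last seen at 5% progress, and `classify_case(example_stable)` reports a stable attractor.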

Predictive cross-problem replication

A panel of floor-sum variants whose block structure predicts a different lattice modulus before any rollout is sampled. If the wrong-family modulus tracks the predicted modulus across the panel, the lattice phenomenon is structural. If it doesn't, the framework needs revision. The replication is designed to be falsifiable: each prediction is fixed before sampling, so a mismatch counts against the framework.
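A replication check of this kind reduces to comparing each variant's sampled answers against a modulus fixed in advance. The sketch below assumes the wrong family has residue 0, as in the first case study; the variant names, predicted moduli, and answers are all invented for illustration.

```python
def lattice_mass(answers, modulus, residue=0):
    """Fraction of answers that sit on the predicted lattice."""
    return sum(1 for a in answers if a % modulus == residue) / len(answers)

# Hypothetical panel: variant name -> (modulus predicted before sampling, answers).
panel = {
    "variant A": (8, [56, 88, 112, 120]),
    "variant B": (6, [42, 66, 84, 90]),
    "variant C": (5, [35, 55, 70, 72]),
}

report = {name: lattice_mass(answers, modulus)
          for name, (modulus, answers) in panel.items()}
```

Because the modulus is an input and not something fitted to the answers, a low mass for a variant is evidence against the prediction for that variant.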

Why these specific runs.

The first case study has two acknowledged gaps. The earliest measured checkpoint was already zero-correct, so the run cannot show whether the field was ever live before lattice confinement appeared. And it ran on one problem on one model, so the lattice modulus could be a coincidence rather than a structural property. The two runs above are designed to close those gaps in a single round of compute.

Claim boundary.

Every claim on this page is indexed to a model, a policy, a budget, and a finite sample count M. The indexing is part of the contribution.

What is supported

Scaffold confinement under declared instrument

The case study supports finite-M, model/policy/budget-indexed claims about scaffold-lattice confinement and non-recovery. Wrong-scaffold consolidation from a parameter-diverse field at 25% to the X = 7 attractor (answer 56 = 8 × 7) at 50% is measured. Operation-sequence convergence under surface diversity is measured. Persistence under 30 rewind conditions is measured.

What is not yet supported

Live-to-dead transition · model-general behavior

The case study does not yet establish a live-to-dead transition (the earliest measured checkpoint was already zero-correct). It does not establish absolute irrecoverability (the result is finite-M with a Wilson upper bound of 0.0566 for the rewind grid). It does not establish model-general behavior (one model, one problem, one prompt family).
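For readers who want to reproduce the bound, here is a minimal sketch of the Wilson score upper limit. The 95% level (z = 1.96) is an assumption, since the page does not state the confidence level. Under that assumption, zero successes in 64 samples gives about 0.0566, which suggests the quoted figure is per rewind condition at M = 64; that reading is an inference, not something the page states.

```python
import math

def wilson_upper(successes, n, z=1.96):
    """Upper limit of the Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + half_width) / denom

# Zero correct continuations in one condition of 64 rollouts.
per_condition = wilson_upper(0, 64)

# Zero correct continuations pooled over 30 conditions x 64 rollouts.
pooled = wilson_upper(0, 30 * 64)
```

With zero successes the formula simplifies to z² / (n + z²), so the bound shrinks roughly in proportion to 1/n as samples are pooled. Pooling assumes the conditions are exchangeable, which the page does not claim.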

The epistemic distinction

Persistence, not non-extendability

The reasoning substrate has no exact oracle analogous to the SAT bridge check. A reasoning-side "dead" label is not a logical non-extendability certificate. It is an empirical persistence claim: low target mass under declared rollouts, low truncation, sufficient sample count, and failure to recover under prescribed rewind. The vocabulary tracks the distinction throughout.
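The four conditions in that definition can be read as a conjunction. The sketch below encodes them as a predicate; every threshold in `Instrument` is a hypothetical placeholder, since the page does not publish numeric cutoffs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instrument:
    # All thresholds are hypothetical placeholders, not published values.
    max_target_mass: float = 0.0
    max_truncation_rate: float = 0.05
    min_samples: int = 64

def persists_as_dead(n_samples, n_target, n_truncated, recovered_on_rewind,
                     inst=Instrument()):
    """Empirical persistence label; not a non-extendability certificate."""
    if n_samples < inst.min_samples:
        return False  # too few rollouts to support any label
    low_target = n_target / n_samples <= inst.max_target_mass
    low_truncation = n_truncated / n_samples <= inst.max_truncation_rate
    return low_target and low_truncation and not recovered_on_rewind
```

For example, `persists_as_dead(64, 0, 1, False)` is `True`, while the same counts with `recovered_on_rewind=True`, or with only 32 samples, return `False`. The label is only as strong as the declared instrument.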

The vocabulary.

Each term refers to a measurable object. The arithmetic is simple. The contribution is in choosing what to measure.

Committed prefix

The state already produced. In a reasoning trace, the token sequence written so far. The conceptual primitive is irreversible commitment, not the surface form.

Future field

The empirical distribution of outcomes reachable from a committed prefix under a declared policy and budget. The central observable.

Cylinder

The set of all continuations consistent with a committed prefix. A tunnel of futures opened by commitment. The term is borrowed from combinatorial probability, where prefix cylinders are standard objects.

Fiber

A bucket of outcomes sharing a hidden label under some projection — for example, all answers with the same residue mod 8. A wrong fiber is one that excludes the target.

Scaffold

The reasoning pattern that produces a particular fiber of answers. In the case study, the scaffold is "eight repeated blocks times a base-block value," which produces multiples of 8.

Lattice

The arithmetic shape of a fiber when the answers sit at evenly spaced integers. In the case study, the lattice is 8ℤ — multiples of 8. Not metaphorical; a measured support property of the answer field.

Residue

The remainder label that determines fiber membership. The target 532 has residue 4 mod 8; the wrong family has residue 0. The framework asks not just whether answers are wrong but whether they share the wrong residue.

Target-excluding fiber

The deepest version of the case-study finding. The system's continuations concentrate in a fiber whose label is incompatible with the target. The system isn't just searching badly. It's searching in the wrong bucket.
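Several of these terms can be tied together in one short sketch: group answers into fibers, estimate the lattice modulus, and test whether the target's residue is present. The answers are the example values quoted in the case study. Estimating the modulus as the gcd of differences is an assumption of this sketch, and it would be thrown off by even a few outliers.

```python
from collections import defaultdict
from functools import reduce
from math import gcd

def fibers(answers, label):
    """Group answers into buckets that share a label under one projection."""
    buckets = defaultdict(list)
    for a in answers:
        buckets[label(a)].append(a)
    return dict(buckets)

def lattice_modulus(answers):
    """gcd of differences from the first answer; 0 if all answers are equal."""
    base = answers[0]
    return reduce(gcd, (abs(a - base) for a in answers), 0)

answers = [56, 88, 112, 120]
target = 532

modulus = lattice_modulus(answers)
by_residue = fibers(answers, lambda a: a % modulus)

# Target-excluding: no sampled fiber carries the target's residue.
target_excluded = (target % modulus) not in by_residue
```

On these values the estimated modulus is 8, every answer falls in the residue-0 fiber, and the target's residue of 4 is absent.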
