AI reasoning
A reasoning system can keep generating after its sampled future has collapsed into the wrong family. The reasoning lane detects that collapse while the system is still working: rather than waiting for a wrong endpoint, it measures the future field induced by the committed prefix.
In-flight observability for reasoning systems, built on the same conceptual object the SAT flagship measures.
Endpoint evaluation collapses a richer object.
A reasoning trace is not only an answer-producing sequence. It is a path of commitments. Each emitted token fixes part of the future: some continuations remain easy, some become narrow, some become budget-censored, and some become systematically wrong.
Standard evaluation usually sees only the endpoint. Process supervision sees steps. Self-consistency aggregates over endpoints. None of these, by itself, measures the distribution of futures that the committed state leaves open.
The reasoning lane measures that distribution. Given a committed prefix, a continuation policy, and a resource budget, we repeatedly continue from the prefix and examine where the continuations land. Correct-answer mass is one coordinate of that distribution. It is not the whole object.
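The measurement loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the instrument used in the case study: `sample_future_field`, `target_mass`, the `continue_fn` callable, and the toy policy are all hypothetical names introduced here, and the toy policy only imitates a model that answers in multiples of 8.

```python
import random
from collections import Counter
from typing import Callable, Hashable, Optional

def sample_future_field(
    prefix: str,
    continue_fn: Callable[[str, int, random.Random], Optional[Hashable]],
    budget: int,
    m: int,
    seed: int = 0,
) -> Counter:
    """Continue from `prefix` m times and tally where the continuations land.

    `continue_fn` is a hypothetical stand-in for the continuation policy: it
    returns an extracted outcome, or None when the budget ran out first.
    """
    rng = random.Random(seed)
    field: Counter = Counter()
    for _ in range(m):
        outcome = continue_fn(prefix, budget, rng)
        field["<censored>" if outcome is None else outcome] += 1
    return field

def target_mass(field: Counter, target: Hashable) -> float:
    """Correct-answer mass: one coordinate of the field, not the whole object."""
    total = sum(field.values())
    return field[target] / total if total else 0.0

# Toy policy for illustration only; it is not the model from the case study.
def toy_policy(prefix: str, budget: int, rng: random.Random) -> Optional[int]:
    if rng.random() < 0.05:
        return None  # budget-censored continuation
    return 8 * rng.choice([7, 11, 14, 15])

field = sample_future_field("committed prefix text", toy_policy, budget=256, m=64)
print(field.most_common(), target_mass(field, 532))
```

The returned counter is the empirical future field for that prefix, policy, and budget. Everything else on this page is a way of reading it.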
The sentence the framework makes pronounceable
Self-consistency calls a state uncertain when the answers vary. This measurement calls it family-locked when the answers vary inside the same wrong family.
Committed prefix, future field, diagnostic projection.
Three primitives, each of which has a plain-language version.
Committed prefix
The state a process has already produced. In a reasoning system this is the token sequence written so far. In a solver it is a partial assignment. In a scheduler it is a partial schedule. The surface is different; the conceptual object is the same: an irreversible commitment.
Future field
The distribution of outcomes the system can still produce from that exact committed state under a declared policy and budget. We measure it by sampling many continuations and reading where they land. The field has structure beyond a single most likely answer.
Diagnostic projection
A way of looking at the future field through one labeling rule at a time. Are most futures multiples of 8? Are they all in the same operation family? Do they share a residue the target lacks? Each projection is a different question about the same field.
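As a rough sketch, a diagnostic projection can be modeled as a labeling function applied to every sampled answer. The names `PROJECTIONS` and `project` are hypothetical, and the four answers are the example values quoted in the case study on this page, used here only to show the mechanics.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, Iterable

Projection = Callable[[int], Hashable]

# Each projection is one labeling rule applied to the same answers.
PROJECTIONS: Dict[str, Projection] = {
    "residue_mod_8": lambda y: y % 8,
    "in_8Z": lambda y: y % 8 == 0,
    "exact_answer": lambda y: y,
}

def project(answers: Iterable[int], rule: Projection) -> Counter:
    """Sort answers into buckets according to one labeling rule."""
    return Counter(rule(y) for y in answers)

# Illustrative answers only: the example values quoted in the case study.
answers = [56, 88, 112, 120]
views = {name: project(answers, rule) for name, rule in PROJECTIONS.items()}
for name, view in views.items():
    print(name, dict(view))

# The target's label under the same projection.
target_label = PROJECTIONS["residue_mod_8"](532)
```

Under the residue projection all four answers fall into one bucket, while under the exact-answer projection they fall into four. The target's residue label is 4, so it sits in a bucket none of these answers occupy.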
The intuitive picture: tunnels and buckets.
Imagine the system writing a reasoning trace as opening a tunnel into the space of possible future answers. Once the tunnel is open, all subsequent continuations have to come out somewhere downstream of where the system is now standing. The tunnel can stay wide or narrow sharply. It can point at the right region or at a wrong one.
A diagnostic projection sorts the answers into buckets by some property — like sorting laundry by color. If the target is a blue shirt and the system keeps producing answers from the white-clothes pile, it isn't merely wrong. It's searching in a bucket that doesn't contain the target.
The reasoning lane measures both the shape of the tunnel and which bucket its outputs are landing in.
Why this is different from existing approaches
Self-consistency picks the majority answer. Process supervision rewards good steps. Tree-of-thought searches branches. Each treats the trace as a means to an answer. The reasoning lane treats the trace as a state and asks what futures remain attached to it.
Scaffold-lattice confinement on an arithmetic reasoning trace.
A floor-sum target with correct answer 532. Three committed-prefix checkpoints — at 25%, 50%, and 75% of the trace. M = 64 continuations per checkpoint. The animation below shows the same future field projected four ways.
What the animation shows.
At 25% progress, the upper bands are already locked: 96.9% of all continuations, and all extracted numeric answers, land in the wrong arithmetic family, both as a residue (Y mod 8 = 0) and as scaffold-family membership (Y ∈ 8ℤ), where Y is the extracted answer. But at the same checkpoint, the parameter band still spans seven distinct values, with the system producing answers like 56, 88, 112, and 120, and the answer band still has multiple peaks.
This is the central decoupling. The field is family-locked but parameter-diverse. Self-consistency would call this state uncertain because the answers vary. The measurement calls it already committed because the answers all vary inside the same wrong family that excludes the target.
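The decoupling can be made concrete with two numbers computed from the same answers: how much mass sits in the family, and how varied the answers are inside it. The sketch below uses a synthetic field invented for illustration (it is not the measured data), and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence

def family_mass(answers: Sequence[int], modulus: int, residue: int) -> float:
    """Fraction of answers whose residue matches the given family."""
    return sum(1 for y in answers if y % modulus == residue) / len(answers)

def distinct_parameters(answers: Sequence[int], modulus: int) -> int:
    """Number of distinct X values among answers of the form Y = modulus * X."""
    return len({y // modulus for y in answers if y % modulus == 0})

def majority_share(answers: Sequence[int]) -> float:
    """Share of the most common answer, the quantity a majority vote looks at."""
    return Counter(answers).most_common(1)[0][1] / len(answers)

# Synthetic field: invented for illustration, not the measured data.
answers = [56] * 5 + [88] * 4 + [112] * 4 + [120] * 3
target = 532

locked = family_mass(answers, 8, 0)           # 1.0: every answer is in 8Z
diversity = distinct_parameters(answers, 8)   # 4 distinct parameter values
top_share = majority_share(answers)           # 5/16: no clear majority answer
target_in_family = target % 8 == 0            # False: 532 is outside 8Z
```

A majority vote sees a top share of 5/16 and reads disagreement. The family projection sees mass 1.0 in a family that cannot contain the target.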
By the 50% checkpoint, the lower bands have caught up. All four projections are locked. The system has consolidated onto the wrong attractor 56. By 75%, that consolidation persists. True rewind across 30 conditions and 1,920 continuations did not recover correctness.
The receipts
1,840 of 1,842 numeric continuations across the run land in the wrong arithmetic family 8ℤ.
The target 532 is not in 8ℤ. Its residue is 4.
Zero correct continuations across 2,112 total samples — three checkpoints plus 30 rewind conditions.
Operation-sequence diversity drops from 51 distinct sequences at 25% to 19 at 50% to 15 at 75%, while exact-text diversity stays maximal. Surface variety, family lock.
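The counts above can be cross-checked with plain arithmetic. The only assumption in this sketch is that each rewind condition uses the same M = 64 as the checkpoints, which is consistent with 1,920 continuations across 30 conditions.

```python
M = 64
checkpoints = 3
rewind_conditions = 30

checkpoint_samples = checkpoints * M                  # 192
rewind_samples = rewind_conditions * M                # 1,920
total_samples = checkpoint_samples + rewind_samples   # 2,112

target = 532
target_residue = target % 8                           # 4, so 532 is not in 8Z

in_family, numeric = 1840, 1842
family_share = in_family / numeric                    # about 0.9989

print(total_samples, target_residue, round(family_share, 4))
```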
The same conceptual object, on a substrate without an oracle.
The SAT flagship measures a partial assignment and asks whether satisfying completions remain reachable from it. The reasoning lane measures a token prefix and asks where the model's continuations land. The objects are different in surface and identical in structure: a committed history, and the futures it leaves open.
The substrates differ in one crucial way. SAT has an exact bridge oracle: a solver can certify that no satisfying completion remains from a given prefix. Reasoning systems do not. A model might recover under a larger budget, a different sampler, a different prompt, or a verifier we have not yet attached. The framework adapts by claiming less on the reasoning side: it reports what persisted under the declared instrument, not what is impossible.
What the SAT side calls certified non-extendability, the reasoning side calls persistence under declared instrument. Same conceptual program, different epistemic status. The vocabulary tracks the distinction precisely so that a careful reader can never confuse the two.
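To show what an exact oracle means on the SAT side, here is a toy extendability check. It uses exhaustive enumeration as a stand-in for a real solver, so it is only practical for a handful of variables, and it is not a description of the flagship's actual bridge check. The formula and function names are invented for this example.

```python
from itertools import product
from typing import Dict, List

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means not x3

def satisfies(assignment: Dict[int, bool], cnf: List[Clause]) -> bool:
    """True when every clause has at least one literal made true."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def is_extendable(prefix: Dict[int, bool], cnf: List[Clause], n_vars: int) -> bool:
    """Exact check by enumeration: does any completion of `prefix` satisfy `cnf`?"""
    free = [v for v in range(1, n_vars + 1) if v not in prefix]
    for values in product([False, True], repeat=len(free)):
        full = {**prefix, **dict(zip(free, values))}
        if satisfies(full, cnf):
            return True
    return False  # certified: no satisfying completion lies in this prefix cylinder

# Toy formula: (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
cnf = [[1, 2], [-1, 3], [-2, -3]]
live = is_extendable({1: True}, cnf, 3)             # True: x1=T, x2=F, x3=T works
dead = is_extendable({1: True, 2: True}, cnf, 3)    # False: x3 cannot be set
```

A `False` result here is a certificate, because every completion was checked. Nothing comparable exists for a token prefix, which is why the reasoning side reports persistence instead.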
The cross-substrate sentence
A committed process can enter a region of future space that no longer intersects, or no longer substantially reaches, the target set. In SAT, the region is a dead prefix cylinder. In reasoning, it is a target-excluding answer fiber. The shape of the region is different. The phenomenon is the same.
What's currently being measured.
The first case study is one demonstration. Two ongoing measurements are designed to settle the questions it cannot answer.
Early-prefix tomography
A prompt-level baseline at higher rollout count (M = 256), plus dense early checkpoints at 2.5%, 5%, 10%, 15%, 20%, and 25%. The question this run answers: did correct mass exist before the wrong family appeared? If yes, the case becomes a measured live-to-confined transition. If the prompt-level field is already on the wrong family, the case becomes a stable wrong-scaffold attractor for this model under this prompt. Either result sharpens the framework.
Predictive cross-problem replication
A panel of floor-sum variants whose block structure predicts a different lattice modulus before any rollout is sampled. If the wrong-family modulus tracks the predicted modulus across the panel, the lattice phenomenon is structural. If it doesn't, the framework needs revision. The replication is designed to be falsifiable: each prediction is fixed before sampling, so a mismatch counts against the framework and cannot be explained away afterward.
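One way the predicted-versus-observed comparison could be scored is sketched below. Two choices here are assumptions of this example, not the declared analysis: the observed modulus is estimated as the gcd of the sampled answers, and agreement is also reported as the share of answers divisible by the predicted modulus. The panel entries and their numbers are invented.

```python
from functools import reduce
from math import gcd
from typing import Dict, List

def observed_modulus(answers: List[int]) -> int:
    """Largest d such that every answer is a multiple of d (gcd of the answers)."""
    return reduce(gcd, answers, 0)

def divisible_share(answers: List[int], modulus: int) -> float:
    """Fraction of answers that sit on the predicted lattice."""
    return sum(1 for y in answers if y % modulus == 0) / len(answers)

# Hypothetical panel: predictions are written down before answers are examined.
# All numbers below are invented to show the comparison, not measured results.
panel: Dict[str, Dict] = {
    "variant_a": {"predicted": 8, "answers": [56, 88, 112, 120]},
    "variant_b": {"predicted": 6, "answers": [42, 66, 84, 90]},
    "variant_c": {"predicted": 5, "answers": [21, 33, 42, 45]},  # prediction fails
}

report = {
    name: {
        "predicted": row["predicted"],
        "observed": observed_modulus(row["answers"]),
        "share": divisible_share(row["answers"], row["predicted"]),
    }
    for name, row in panel.items()
}
for name, row in report.items():
    print(name, row)
```

The gcd is exact but brittle, since a single off-lattice answer collapses it. The divisible share degrades gracefully, which matters when a few continuations fall outside the family.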
Why these specific runs.
The first case study has two acknowledged gaps. The earliest measured checkpoint was already zero-correct, so the run cannot show whether the field was ever live before lattice confinement appeared. And it ran on one problem on one model, so the lattice modulus could be a coincidence rather than a structural property. The two runs above are designed to close those gaps in a single round of compute.
Claim boundary.
Every claim on this page is indexed by model, policy, budget, and sample count. The indexing is part of the contribution.
Scaffold confinement under declared instrument
The case study supports finite-M, model/policy/budget-indexed claims about scaffold-lattice confinement and non-recovery. Wrong-scaffold consolidation from a parameter-diverse field at 25% to the X = 7 attractor (answer 56 = 8 × 7) at 50% is measured. Operation-sequence convergence under surface diversity is measured. Persistence under 30 rewind conditions is measured.
Live-to-dead transition · model-general behavior
The case study does not yet establish a live-to-dead transition (the earliest measured checkpoint was already zero-correct). It does not establish absolute irrecoverability (the result is finite-M with a Wilson upper bound of 0.0566 for the rewind grid). It does not establish model-general behavior (one model, one problem, one prompt family).
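The finite-M bound can be reproduced with the standard Wilson score interval. The quoted 0.0566 matches the 95% upper bound for zero successes in 64 samples, which suggests the figure applies to a single 64-sample condition. That reading is an inference from the arithmetic, not something the page states. The pooled figure is shown only for contrast and assumes the 30 conditions can be treated as one sample.

```python
from math import sqrt

def wilson_upper(successes: int, n: int, z: float = 1.959964) -> float:
    """Upper end of the Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center + half

per_condition = wilson_upper(0, 64)    # about 0.0566
pooled = wilson_upper(0, 1920)         # about 0.0020, if pooling were justified

print(round(per_condition, 4), round(pooled, 4))
```

With zero observed successes the bound reduces to z² / (n + z²), so it shrinks only as the sample count grows. That is why zero correct continuations supports a bounded claim and not an absolute one.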
Persistence, not non-extendability
The reasoning substrate has no exact oracle analogous to the SAT bridge check. A reasoning-side "dead" label is not a logical non-extendability certificate. It is an empirical persistence claim: low target mass under declared rollouts, low truncation, sufficient sample count, and failure to recover under prescribed rewind. The vocabulary tracks the distinction throughout.
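The four conditions in that definition can be written as an explicit check. This is a sketch only: the `FieldSummary` fields, the label strings, and every threshold are placeholders chosen for illustration, since the page does not state numeric cutoffs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSummary:
    n_samples: int          # rollouts drawn from the committed prefix
    n_target: int           # rollouts that reached the target
    n_truncated: int        # rollouts cut off by the budget
    rewind_recovered: bool  # did any prescribed rewind condition reach the target?

def persistence_label(
    s: FieldSummary,
    min_samples: int = 64,           # placeholder threshold
    max_target_mass: float = 0.0,    # placeholder threshold
    max_truncation: float = 0.05,    # placeholder threshold
) -> str:
    """Empirical label only; it never certifies logical non-extendability."""
    if s.n_samples < min_samples:
        return "undetermined: too few samples"
    if s.n_truncated / s.n_samples > max_truncation:
        return "undetermined: too much truncation"
    if s.n_target / s.n_samples > max_target_mass or s.rewind_recovered:
        return "live under declared instrument"
    return "persistent under declared instrument"

# Invented summaries, used only to exercise each branch.
confined = persistence_label(FieldSummary(64, 0, 2, False))
censored = persistence_label(FieldSummary(64, 0, 20, False))
recovered = persistence_label(FieldSummary(64, 0, 0, True))
```

The two "undetermined" branches matter as much as the labels. A field with heavy truncation or too few samples does not support a persistence claim in either direction.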
The vocabulary.
Each term refers to a measurable object. The arithmetic is simple. The contribution is in choosing what to measure.
Committed prefix
The state already produced. In a reasoning trace, the token sequence written so far. The conceptual primitive is irreversible commitment, not the surface form.
Future field
The empirical distribution of outcomes reachable from a committed prefix under a declared policy and budget. The central observable.
Cylinder
The set of all continuations consistent with a committed prefix. A tunnel of futures opened by commitment. Borrowed from probability theory on sequence spaces, where cylinder sets are standard objects.
Fiber
A bucket of outcomes sharing a hidden label under some projection — for example, all answers with the same residue mod 8. A wrong fiber is one that excludes the target.
Scaffold
The reasoning pattern that produces a particular fiber of answers. In the case study, the scaffold is "eight repeated blocks times a base-block value," which produces multiples of 8.
Lattice
The arithmetic shape of a fiber when the answers sit at evenly spaced integers. In the case study, the lattice is 8ℤ — multiples of 8. Not metaphorical; a measured support property of the answer field.
Residue
The remainder label that determines fiber membership. The target 532 has residue 4 mod 8; the wrong family has residue 0. The framework asks not just whether answers are wrong but whether they share the wrong residue.
Target-excluding fiber
The deepest version of the case-study finding. The system's continuations concentrate in a fiber whose label is incompatible with the target. The system isn't just searching badly. It's searching in the wrong bucket.