Arithmetic Without Numbers (Draft)

Exhibit 1 (interactive): Integers as phase

The spiral is a simplified picture of a Fourier-style number code: one part of the vector tracks phase around a circle, while another tracks coarse position.

Example reading: integer 137 has phase 49.3°, cos 0.65, sin 0.76, coarse 13. The integer value slider runs from 0 to 999.
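The exhibit's reading for 137 can be reproduced in a few lines. This is a sketch of the simplified picture only, not of anything measured inside a model. It assumes one full turn of the circle per 1000 integers and a coarse position that counts tens; both are inferred from the displayed values rather than stated by the exhibit, and the name `phase_code` is hypothetical.

```python
import math

def phase_code(n: int, period: int = 1000, coarse_step: int = 10) -> dict:
    """Split an integer into a clock-like phase and a coarse position.

    Assumptions (inferred from the exhibit, not confirmed): one full turn
    per `period` integers, and coarse bins of width `coarse_step`.
    """
    angle_deg = 360.0 * (n % period) / period
    angle_rad = math.radians(angle_deg)
    return {
        "phase_deg": round(angle_deg, 1),
        "cos": round(math.cos(angle_rad), 2),
        "sin": round(math.sin(angle_rad), 2),
        "coarse": n // coarse_step,
    }

print(phase_code(137))
```

The point of the split is that two nearby integers get nearby phases, while the coarse coordinate separates integers that would otherwise share a position on the circle.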

The question

A model has no fingers

If you learned arithmetic the ordinary human way, you probably learned it with a body. You counted on fingers. You grouped things into piles. You lined digits into columns. You carried a one. Later, perhaps, you used an abacus, graph paper, or a calculator.

A language model has none of that. It has matrices. Tokens enter, activations flow, logits come out. And yet, if you ask a modern language model for a greatest common divisor, a multiplication, or a division with remainder, something inside that matrix-only body responds.

Working vocabulary

Token: one unit the model reads or prints. A token might be a word, part of a word, punctuation, or a chunk of digits.

Vector: a list of numbers. A model stores each token's current state as a vector with many dimensions.

Activation: the model's temporary internal state while it is processing a token.

Readout: a small external model trained to recover a fact from an activation, such as the operation or an operand.

Logit: a raw score for a possible next token. Higher logit means the model is more likely to print that token.

Layer: one repeated processing step in the transformer. A modern model has many layers, each updating the running state.

Residual stream: the main running vector passed from layer to layer, like a shared scratchpad without named variables.

Attention: the part of a layer that lets one token position look at information from other positions.

MLP / feed-forward block: the part of each transformer layer that transforms one token position's vector by itself. Attention lets positions exchange information; the MLP then reshapes the local vector, often strengthening, suppressing, or recombining features already present there.

Next-token prediction: the training and generation rule for ordinary language models. Given the text so far, the model scores possible next tokens, prints one, then repeats the process.

Phase: position around a repeating cycle, like the angle of a hand on a clock. Helix-style number codes use phase-like geometry.

GCD / LCM: greatest common divisor and least common multiple. For example, gcd(84, 36) = 12.

Rune, the project behind this article, began with the debugging question behind the jargon: when a language model gives an arithmetic answer, is it recalling a pattern, running something like an algorithm, or merely producing a plausible next token?

The article's promise is not that a model invented arithmetic from nothing. It learned from human text about human mathematics. The surprise is that, given those symbols and a matrix-only body, gradient descent can settle on representations no human would learn with fingers: directions, phase, chunks, and value-like states in a high-dimensional vector.

The human contrast

We learned arithmetic with bodies

George Lakoff and Rafael E. Núñez argued in Where Mathematics Comes From that human mathematical ideas are grounded in embodied experience: grouping, moving, measuring, balancing, collecting, and mapping one domain onto another. Whatever one thinks of the full philosophical claim, it is a useful starting point for this story.

A transformer has no fingers, no beads, no written columns, and no scratch paper. It has token embeddings, attention, feed-forward networks, residual streams, and matrices. If it learns arithmetic at all, it has to invent a machine-native version of number.

Humans also do arithmetic in more than one way. We answer 7 x 8 from a memorized multiplication table. We may divide 963 / 17 by running a written algorithm. We estimate tips with shortcuts. So the first scientific problem was not just "can the model answer?" It was "what kind of answering is this?" A memorized table, a learned shortcut, and a real multi-step calculation can all print the same number.

| Human arithmetic body | Matrix arithmetic body |
| --- | --- |
| Fingers, beads, tally marks | Directions in activation space |
| Number line and motion | Phase rotation on a helix |
| Scratch paper and written columns | Residual stream with no named variables |
| Right-to-left carries | Left-to-right token emission |
| Running out of column width | Crowded chunk readout directions |

Residual stream

A vector changes as the model reads

Before we can ask whether a number is memorized, computed, or merely rendered, we need one more piece of machinery: the model's running state.

Imagine reading What is the gcd of 84 and 36? one token at a time. The model does not create a neat little variable named operand_a. Instead, each token position carries a long vector of numbers. As the prompt passes through the transformer layers, those vectors are updated again and again.

Some updates move information across positions: the token for 36 can affect the state near the answer position. Other updates reshape the local state: a direction in the vector may become more gcd-like, more operand-like, or more answer-like. The residual stream is the running scratchpad where those changes accumulate.

This is why readouts and patches are possible at all. If the operation and operands leave traces in the residual stream, a small readout may recover them. If a state really matters, a patch may change behavior. If a state is writable, an intervention may guide the model. But those are increasingly strong claims, and the vector itself does not label which claim is true.

Figure: a tiny residual vector. Layer update: scratch vector + attention + feed-forward.

Each token position carries a vector, shown here as the little bars. A layer first uses attention to gather information from other token positions, then uses a feed-forward network, often called an MLP, to transform that information. The result is added back into the running vector. The important part is not the formula; it is that the scratch vector can contain facts such as "this looks like operand A" or "this is where the answer begins."
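The additive update can be sketched with plain lists. This is a toy under loud assumptions: the attention weights are fixed by hand rather than computed from queries and keys, the feed-forward step is a bare ReLU, and normalization and learned weights are left out. The function names are hypothetical.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def toy_attention(states, weights):
    """Mix information across positions: each output row is a weighted
    sum of all position vectors (weights are hand-picked here)."""
    dim = len(states[0])
    return [
        [sum(w * s[d] for w, s in zip(row, states)) for d in range(dim)]
        for row in weights
    ]

def toy_mlp(vec):
    """Transform one position's vector by itself (a bare ReLU in this toy)."""
    return [max(0.0, x) for x in vec]

def layer_update(states, weights):
    # The attention output is added to the running vector...
    after_attn = [add(s, a) for s, a in zip(states, toy_attention(states, weights))]
    # ...and the feed-forward output is added too; nothing is overwritten.
    return [add(s, toy_mlp(s)) for s in after_attn]

states = [[1.0, 0.0], [0.0, -1.0]]    # two token positions, two dimensions
weights = [[1.0, 0.0], [0.5, 0.5]]    # position 0 sees itself; position 1 sees both
print(layer_update(states, weights))  # [[4.0, 0.0], [1.0, -1.5]]
```

After the update, position 1 carries a nonzero first coordinate that it did not have before: information from position 0 has been written into its scratch vector.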

Next-token constraint

The model emits the answer left to right

Humans often compute arithmetic from the rightmost digit inward because carries start at the ones place. A language model has the opposite interface: it must print the first visible answer token before it has printed the later ones.

Figure: 327 x 48 = 15696. Human scratch direction: the digits are settled in the order 6, 9, 6, 5, 1 (carry, then move left). Model emission direction: left to right. A next-token model must commit to the first visible chunk before it emits the later chunks, which makes answer rendering a separate pressure from exact computation.

For 327 x 48, a person can multiply from the ones place, carry intermediate values, and only later write the leftmost digit. The model does not get that luxury when it is generating text. To answer 15696, it first has to choose something like 15, then 696, then stop.
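A short sketch makes the two directions concrete. It runs schoolbook column multiplication and records the order in which answer digits become final, then compares that with the left-to-right order of the printed string. The function name is hypothetical, and this illustrates the human procedure only; it makes no claim about how a model computes.

```python
def schoolbook_digit_order(a: int, b: int) -> list:
    """Return the product's digits in the order a column method settles
    them: ones place first, carrying leftward."""
    a_digits = [int(d) for d in str(a)][::-1]
    b_digits = [int(d) for d in str(b)][::-1]
    columns = [0] * (len(a_digits) + len(b_digits))
    for i, x in enumerate(a_digits):
        for j, y in enumerate(b_digits):
            columns[i + j] += x * y
    digits, carry = [], 0
    for total in columns:
        carry, digit = divmod(total + carry, 10)
        digits.append(digit)
    # Drop unused leading columns (they sit at the end of this reversed list).
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()
    return digits

human_order = schoolbook_digit_order(327, 48)
emission_order = [int(d) for d in str(327 * 48)]
print(human_order)     # [6, 9, 6, 5, 1]
print(emission_order)  # [1, 5, 6, 9, 6]
```

The leading digit is the last thing the column method produces and the first thing a next-token model has to print.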

That matters for the helix story. As answers get longer, more digit chunks have to be represented, tracked, and emitted in order. The experiments found that these chunk readouts can remain partly readable while becoming less separated from one another. When the geometry gets crowded, the model may still have number-like structure inside, but the visible next-token decision loses resolution.

A separate counting experiment made the same pressure visible in a simpler setting. The model saw four consecutive large numbers and had to print the next one. An easy case looks like 314582706123450, 314582706123451, 314582706123452, 314582706123453, ...; the next number is just 314582706123454. Llama's tokenizer split that answer into 3-digit chunks: 314 | 582 | 706 | 123 | 454.

The failures appeared at carry boundaries. A deep-carry case has the shape 314582706999996, 314582706999997, 314582706999998, 314582706999999, .... The correct next number is 314582707000000: the tokenizer chunks move from 314 | 582 | 706 | 999 | 999 to 314 | 582 | 707 | 000 | 000. In the experiment, cases like this collapsed; the best deep-carry cell reached only 18.75% accuracy, and the dominant error was simply repeating 314582706999999.
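The difference between the easy case and the deep-carry case can be counted directly. The sketch below assumes chunking from the left in groups of three, which matches the splits quoted above for 15-digit numbers; real tokenizers may behave differently at other lengths. Both function names are hypothetical.

```python
def chunks_of_three(number: int) -> list:
    """Split a decimal string into 3-digit chunks from the left.
    Assumption: mirrors the splits quoted in the text for 15-digit numbers."""
    text = str(number)
    return [text[i:i + 3] for i in range(0, len(text), 3)]

def changed_chunks(n: int) -> list:
    """Indices of chunks that differ between n and n + 1."""
    before, after = chunks_of_three(n), chunks_of_three(n + 1)
    return [i for i, (x, y) in enumerate(zip(before, after)) if x != y]

easy = 314582706123453
deep = 314582706999999
print(chunks_of_three(easy + 1), changed_chunks(easy))
print(chunks_of_three(deep + 1), changed_chunks(deep))
```

In the easy case only the last chunk changes. In the deep-carry case three chunks have to change together, which is the coordination the experiment found the model failing at.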

In the long-continuation version, the failure became almost theatrical. Once the model missed a deep carry, 96.88% of deep cases collapsed to a fixed point: it kept re-emitting the last correct number instead of recovering. The subtraction helix may fail for different reasons, but the engineering lesson has the same shape: long numeric continuations can look stable until a carry or token boundary asks the model to coordinate more state than its learned shortcut can handle.

The original dream

A just-in-time compiler for arithmetic

The tempting product answer is simple: if the model is bad at arithmetic, call a calculator. A parser can read the prompt What is 84 times 37?, translate it into 84 * 37, send that expression to Python, and return the result.
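For reference, the ordinary calculator route fits in a dozen lines. The regular expression and function name below are hypothetical, chosen only to match the example prompt; a production parser would handle many more phrasings. This is the text-reading baseline, not an activation-based route.

```python
import re

# Hypothetical pattern that covers only the example phrasing.
PATTERN = re.compile(r"What is (\d+) times (\d+)\?")

def parse_and_compute(prompt: str):
    """Read operands from the prompt text and let Python do the arithmetic."""
    match = PATTERN.search(prompt)
    if match is None:
        return None  # not a prompt this parser understands
    a, b = int(match.group(1)), int(match.group(2))
    return f"{a} * {b}", a * b

print(parse_and_compute("What is 84 times 37?"))  # ('84 * 37', 3108)
```

Everything this function knows comes from the string. Nothing in it says anything about what the model represented internally.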

Rune was chasing a stricter question. Could we look inside the model and find the calculation it was trying to perform? Could the model's own activations tell us the operation and operands? And if we computed the exact answer, could we put that answer back inside the model so it continued naturally?

Use one concrete prompt: What is the gcd of 84 and 36? A normal tool route reads the text, extracts gcd, 84, and 36, and calls a calculator. Rune disallowed that at runtime. The route could see token IDs and internal activation vectors, but not the prompt string, regex matches, hidden operands, operation labels, or the gold answer. Only if the model's internal state supplied gcd, 84, 36 could Python compute 12.

Standard tool use already makes external computation available. PAL, Program-of-Thoughts, ReAct, Toolformer-style systems, and ordinary function calling all live in that neighborhood. Rune was not trying to beat that path on product simplicity. It was asking whether the arguments to the tool could come from the model's own hidden state rather than from the prompt text.

The plain version is easier to see with one prompt. Prompt parsing looks at the sentence as text: find the word gcd, find the two numerals, call the function. Internal-state observation treats the sentence as already swallowed by the model. The only things left are vectors at token positions and layers. If a small readout can recover gcd, 84, and 36 from those vectors, then the model has made those facts available internally. Rune tried to isolate exactly that claim.

The full version of that dream did not land. We did not prove behavior-preserving residual JIT replacement. But the failure was productive. It forced the project to separate three different things that are easy to confuse: rendering a known answer, reading a variable from activations, and safely writing a corrected state back into the model.

Claim ladder

Five ways to get the right answer

Use the same prompt as a test case: What is the gcd of 84 and 36? All five systems can print 12. The difference is where gcd, 84, and 36 came from, and what part of the model, if any, was actually changed.

Prompt parser

Regex or parser reads the prompt text: gcd(84, 36). Practical, but not evidence about model internals.

Generated program

The model emits math.gcd(84, 36) or a tool call. This is the PAL, Program-of-Thoughts, ReAct, and Toolformer neighborhood.

Final-token correction

A wrapper biases the next token toward 12. It may not know why 12 is right.

Activation-derived tool arguments

A readout observes internal states and decodes gcd, 84, 36. Only then does Python compute.

Residual JIT replacement

The system writes the computed result back into the hidden state and asks the model to continue naturally. This was the original dream; it did not land for the tested writes.

The first trap

Rendering is not computing

Early experiments showed that late-layer writer states could help the model emit numeric chunks. That was exciting: if you supplied the right internal state, the model could render a desired part of an answer.

But this created a trap. If the experiment already knows the answer and uses that answer to choose a steering vector, it has measured the model's ability to render a supplied value. Useful, yes; but it is not the same as showing that the model computed the value, or that a deployment system could find the value without help.

This is the same distinction people make intuitively. A child who recites 7 x 8 = 56 may be recalling a table. A child who solves 73 x 48 on paper is doing a procedure. Both answers are arithmetic, but they are not the same evidence. Rune had to keep asking whether a model was recalling, pattern-matching, computing, or being helped by the experiment.

So the rule became stricter. At runtime, the prompt must be opaque. No regex. No prompt parsing. No hidden harness operands. Python may compute only after the operation and operands have been decoded from model internals.

What prior work taught us

Helixes, heuristics, and causal tests

Rune did not discover the helix idea. Kantamneni and Tegmark's Language Models Use Trigonometry to Do Addition made the geometric hypothesis concrete: integer representations can lie on generalized helices, and addition can be described as manipulating phase. Nikankin and colleagues' Arithmetic Without Algorithms gives the complementary warning: model arithmetic can also look like a bag of learned heuristics rather than a clean schoolbook algorithm.

The tools were not invented here either. Probes, sparse autoencoders, activation patching, causal mediation, and tool-calling systems all come from existing interpretability and tool-use literatures. Rune's contribution is narrower: applying those ideas under a no-parser provenance rule and reporting where the stronger residual-JIT story did not hold.

That tension shaped the experiments. When a probe decoded a number, we asked whether it was reading a robust variable or a surface cue. When a steering vector made a token appear, we asked whether it had changed computation or merely forced rendering. When an activation patch worked, we asked whether the control patch worked too.

Figure: a cinematic render of an integer helix with phase points. One way a matrix can carry a number is by splitting it into coordinates: phase around a circle for repeating digit structure, and position along the spiral for scale. The engineering point is that numbers need not be stored as decimal strings. A readout may recover `84` from a direction or clock-like phase pattern even when no component is literally "the 84 slot."

The toolbox

How do you inspect a matrix body?

Most of Rune was not one big experiment. It was a toolbox applied over and over, with stricter controls each time.

The four instruments below touch the same running activation vector in different ways. A probe reads a fact. An SAE tries to name reusable parts of the vector. A patch copies part of one run into another and watches whether the answer follows. Steering is the blunt instrument: push the state and see what moves, or what breaks. Each is useful; each can mislead if treated as stronger evidence than it is.

Toolbox simulator

Four ways to touch an activation

The residual stream is the model's running scratch vector. These four tools touch that vector in four different ways: read a fact, name reusable features, test whether copying a state changes behavior, or push the state directly and watch what moves. Treating those as the same claim is how interpretability experiments overclaim.


Probe

gcd(84, 36) is already inside the model as vectors. Can a small readout recover the operation and numbers?

Diagram: hidden vector only -> dot -> decoded arguments op=gcd, a=84, b=36.

A probe is an external measuring instrument. We freeze the model, collect activation vectors, and train a small readout to answer questions such as "is this gcd?" or "is operand A equal to 84?"

Shows: information is readable. Does not show: the direction caused the model's behavior.
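A linear readout of this kind reduces to dot products followed by picking the largest score. The sketch below uses made-up 4-dimensional directions and a made-up activation; a real probe would be fit on activations collected from the frozen model, and the labels and names here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_readout(activation, class_directions):
    """Return the label whose direction scores highest against the activation."""
    scores = {label: dot(activation, d) for label, d in class_directions.items()}
    return max(scores, key=scores.get), scores

# Hypothetical probe directions; in practice these are learned, not hand-written.
operation_probe = {
    "gcd": [1.0, 0.0, 0.0, 0.0],
    "lcm": [0.0, 1.0, 0.0, 0.0],
    "mul": [0.0, 0.0, 1.0, 0.0],
}
activation = [0.9, 0.1, -0.2, 0.4]  # made-up hidden vector

label, scores = linear_readout(activation, operation_probe)
print(label, scores)
```

Note what the code never touches: the model's own forward pass. The readout observes the vector and reports a label, which is why a successful probe shows readability and nothing stronger.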

Sparse autoencoder

Can the dense state be described by a few reusable features instead of thousands of anonymous coordinates?

Diagram: dense activation -> encode -> feature dictionary (operand A, operand B, number phase, syntax).

A sparse autoencoder is a dictionary-building tool. It tries to rewrite a messy vector as a small set of active feature tags rather than one opaque blob.

Shows: a possible feature vocabulary. Does not show: those features are sufficient for computation.
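The encode and decode steps can be sketched with a hand-written dictionary. Everything numeric here is an assumption for illustration: a real sparse autoencoder learns its encoder, decoder, and biases by minimizing reconstruction error with a sparsity penalty, and its features do not arrive with names attached.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sae_encode(activation, encoder, bias):
    """ReLU(encoder @ activation + bias): most features end up exactly zero."""
    return [max(0.0, dot(row, activation) + b) for row, b in zip(encoder, bias)]

def sae_decode(features, decoder):
    """Rebuild the activation as a weighted sum of dictionary directions."""
    dim = len(decoder[0])
    return [sum(f * row[d] for f, row in zip(features, decoder)) for d in range(dim)]

# Toy dictionary: three hand-named features over a 2-dimensional activation.
names = ["operand A", "operand B", "syntax"]
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [-0.25, -0.25, -0.25]  # negative bias suppresses weak activations

activation = [1.0, 0.125]
features = sae_encode(activation, directions, bias)
active = [n for n, f in zip(names, features) if f > 0]
print(active, features, sae_decode(features, directions))
```

Only one feature fires, and the reconstruction is close to the input but not equal to it. That gap is part of why a feature vocabulary is a description of the state rather than proof that the features suffice for the computation.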

Activation patch

If we copy a state from gcd(90, 36) into the gcd(84, 36) run, does the decoded operand or answer follow the donor?

Diagram: donor gcd(90,36) -> swap -> recipient run.

A patch is a controlled transplant. Copy a donor state into a recipient run and ask whether the output follows the donor.

Shows: causal influence when controls pass. Does not show: safe deployment.
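The transplant-plus-control logic can be shown on a toy. The big simplification is deliberate and should be read as an assumption: here the hidden state has clean slots and the downstream computation is literally `math.gcd`, whereas a real residual stream is a dense vector with no labeled positions. The names are hypothetical.

```python
import math

def downstream(hidden):
    """Toy stand-in for the layers after the patch site."""
    return math.gcd(hidden[0], hidden[1])

def patch(recipient, donor, site):
    """Copy one site from the donor run into a copy of the recipient run."""
    patched = list(recipient)
    patched[site] = donor[site]
    return patched

recipient = [84, 36, 7]  # toy state for gcd(84, 36); slot 2 is unrelated
donor = [90, 36, 3]      # toy state for gcd(90, 36)

baseline = downstream(recipient)                  # 12
followed = downstream(patch(recipient, donor, 0)) # 18: output follows the donor
control = downstream(patch(recipient, donor, 2))  # 12: unrelated site, no change
print(baseline, followed, control)
```

The control line carries as much weight as the main one. If patching an unrelated site also moved the answer, the first result would say little about where the operand lives.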

Steering

Can we add a direction that pushes the next token toward 12 without breaking the rest of the sentence?

Diagram: push toward token "12".

Steering adds a vector to move the model toward a feature or token. Among these tools, it is the closest operation to "editing" the run.

Shows: the state is movable. Does not show: the rest of the model remains coherent.
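A toy version shows both the intended effect and the side effect. The 2-dimensional unembedding rows, the hidden vector, and the steering strength are all made up for illustration; the function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, alpha):
    """Add a scaled direction to the hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def logits(hidden, unembed):
    return {tok: dot(hidden, row) for tok, row in unembed.items()}

# Made-up unembedding rows for three candidate tokens.
unembed = {"12": [1.0, 0.0], "18": [0.0, 1.0], ".": [-1.0, 0.5]}
hidden = [0.0, 1.0]

before = logits(hidden, unembed)
after = logits(steer(hidden, unembed["12"], alpha=2.0), unembed)
print(before)  # top token is "18"
print(after)   # top token is "12", and "." has dropped as a side effect
```

The push succeeds at its target and also lowers the score of a token nobody meant to touch. In a full model, disturbances of that kind are what "the rest of the model remains coherent" is asking about.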

Readable vs writable

A decoded variable is not an API

The residual-write experiments were the closest version of the original compiler thesis. We tried to write corrected answer information into the residual stream and let the model continue.

For the tested single-site writes, that did not earn its keep. Residual interventions had no accuracy advantage over simpler token or logit correction, and they disturbed surrounding behavior more. On multi-token answers, forcing the first correct token could lead the model to complete the rest better than a crude residual write.

Think about gcd(84, 36) = 12. A final-token correction can simply make 12 more likely at the moment the answer is printed. A residual write is a harder promise: it should update the hidden state so the model can still explain, format, and continue consistently. Rune found the former much easier than the latter.
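Final-token correction, the easier of the two, amounts to adding a bias to one logit before sampling. The logit values and the bias below are invented for illustration, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits.values())
    exps = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def bias_token(logits, token, bias):
    """Return a copy of the logits with one token's score raised."""
    adjusted = dict(logits)
    adjusted[token] += bias
    return adjusted

native = {"12": 1.0, "6": 2.0, "18": 0.5}  # made-up scores at the answer position
before = softmax(native)
after = softmax(bias_token(native, "12", 3.0))
print(max(before, key=before.get), max(after, key=after.get))  # 6 12
```

The correction changes what gets printed at one position and leaves every hidden state upstream of the logits untouched. That is why it is cheap, and also why it promises nothing about later explanation or formatting.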

A readable variable is not necessarily a writable register. Mechanistic interpretability often celebrates reading; engineering wants writing, and Rune only got the first one to behave reliably in these tests.

Three candidates

One route survived the controls

By the end, the project had three plausible stories. The first was prompt parsing, which works but was outside the scientific question. The second was residual replacement, the original compiler dream, which was too brittle in the tested form. The third was activation-derived tool arguments: read the operation and operands from hidden states, compute externally, and keep a replay audit proving that no forbidden text fields slipped in.

The third story is the proof apparatus for the larger wonder. It does not say "we taught the model arithmetic." It says something narrower and cleaner: arithmetic prompts can leave recoverable operation and operand structure in the residual stream, and that structure can drive a calculator route under an opaque-prompt boundary.

Tool-use context

Tool use already works; that was not the question

Modern tool-use systems already know how to route arithmetic to external computation. PAL and Program of Thoughts ask models to emit executable programs. ReAct interleaves reasoning and actions. Toolformer studies models learning API calls. If the product goal is only "get the arithmetic right," a text parser plus Python is hard to beat.

Rune's question was narrower and stranger: can the tool arguments come from the model's internal state rather than the prompt text? The runtime boundary carries the weight. Calibration data can use labels. Evaluation can use gold answers for scoring. The deployed route cannot receive prompt text, regex spans, harness operands, or a hidden operation flag.

Rune is not competing with tool use. It asks what a tool-use boundary looks like when the argument source is mechanistic rather than textual.

This is also not neurosymbolic AI in the usual sense: there is no symbolic frontend bolted onto the model and no theorem prover. The operands are recovered from the model's own activations, and the only external step is the calculator at the end.

What survived

No-parser tool arguments from activations

At this point the important question is no longer whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.

The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule.

The production operand route used a layer-22 chunk path: selected operand chunks, chosen with attention-based selection, were decoded by a frozen chunk probe. Reading operands directly from their early input-token positions would be too close to parsing in disguise. This remains a readout claim, not a proof that we can safely rewrite the model's internal state.

On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm. Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent.

Across 11,736 locked examples and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. Locked means the examples, thresholds, and scoring rules were fixed before the final aggregate. Lift means the routed exact-answer rate minus the frozen model's native exact-answer rate on the target prompts. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.

The route's operand bounds were frozen at integers from 0 through 9999, with at most 12 generated answer tokens. In plain terms, the supported result is two-integer arithmetic within that published route rather than arbitrary long arithmetic. The claim-bearing rerun was preregistered on June 2, 2026, with thresholds, operand bounds, and runner choices fixed around scripts/goalB3_repaired_benchmark_suite.py.
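The calculator end of the route is the simple part, and a sketch shows how the frozen bounds act as a refusal rule. The operation labels and function name are hypothetical, the 12-token answer limit is not modeled, and the decoded arguments are assumed to arrive from readouts that are not shown here.

```python
import math

LOW, HIGH = 0, 9999  # operand bounds stated for the route

def routed_calculator(op: str, a: int, b: int):
    """Return the exact answer, or None when the route should stay silent."""
    if not (isinstance(a, int) and isinstance(b, int)):
        return None
    if not (LOW <= a <= HIGH and LOW <= b <= HIGH):
        return None  # outside the supported range
    if op == "mul":
        return a * b
    if op == "divmod":
        return divmod(a, b) if b != 0 else None  # (quotient, remainder)
    if op == "gcd":
        return math.gcd(a, b)
    if op == "lcm":
        return a * b // math.gcd(a, b) if a and b else 0
    return None  # unrecognized operation

print(routed_calculator("gcd", 84, 36))         # 12
print(routed_calculator("divmod", 7696, 5130))  # (1, 2566)
print(routed_calculator("lcm", 4740, 1152))     # 455040
print(routed_calculator("gcd", 10000, 36))      # None
```

Returning `None` rather than raising keeps "do nothing" as the default, which matches the requirement that the route stay silent whenever its inputs fall outside what was audited.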

The DeepMind Mathematics Dataset, introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check. Recognized is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: "Calculate the greatest common divisor of 2474 and 5568.", "What is the remainder when 5734 is divided by 5529?", and "Calculate the least common multiple of 839 and 6781."

On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. In plain percentages, routed exact rates were 99.2% for division with remainder, 100% for gcd, and 98.0% for lcm, with exact-answer gains of 81.0, 50.2, and 96.8 percentage points over the frozen model's native answers. The route was correcting a large fraction of cases the unassisted model missed.
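Because lift is defined as the routed rate minus the native rate, the frozen model's native exact rates follow by subtraction. The values printed below are derived here from the reported figures, on the assumption that the gains are plain percentage-point differences; they are not separately reported numbers.

```python
# Reported routed exact rates and exact-answer gains, in percent / percentage points.
routed = {"division with remainder": 99.2, "gcd": 100.0, "lcm": 98.0}
gain = {"division with remainder": 81.0, "gcd": 50.2, "lcm": 96.8}

# native = routed - gain, rounded to the one decimal place the article uses.
native = {op: round(routed[op] - gain[op], 1) for op in routed}

for op in routed:
    print(f"{op}: native {native[op]}% -> routed {routed[op]}%")
```

On this reading, lcm is where the unassisted model was weakest and where the route contributed the most.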

Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.

Should fire

Calculate the highest common factor of 5924 and 1024.

What is the remainder when 7696 is divided by 5130?

What is the smallest common multiple of 4740 and 1152?

Should not fire

She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.

A reporter typed '144 / 12' into her notes but the story was about a basketball game.

The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.

The honesty boundary

The provenance was the experiment

The most important engineering artifact may not be the accuracy number. It may be the replay boundary. Provenance means the audit trail for where the calculator arguments came from, and the final replay audit covered 15,558 runtime bundles while excluding forbidden fields: prompt text, regex outputs, decoded token spans, harness operands, operation labels, and gold answers. The route had to reproduce from allowed runtime artifacts.
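The core of such a check is small: walk each runtime bundle and fail if any forbidden field appears at any depth. The key names and bundle layout below are hypothetical, chosen to mirror the forbidden list in the text; the project's actual bundle schema is not shown in this article.

```python
# Hypothetical key names mirroring the forbidden fields listed above.
FORBIDDEN_FIELDS = {
    "prompt_text", "regex_outputs", "decoded_token_spans",
    "harness_operands", "operation_label", "gold_answer",
}

def audit_bundle(bundle: dict) -> list:
    """Return the sorted forbidden keys found anywhere in a nested bundle."""
    found = set()
    stack = [bundle]
    while stack:
        item = stack.pop()
        if isinstance(item, dict):
            for key, value in item.items():
                if key in FORBIDDEN_FIELDS:
                    found.add(key)
                stack.append(value)
        elif isinstance(item, (list, tuple)):
            stack.extend(item)
    return sorted(found)

clean = {"token_ids": [1, 2, 3], "activations": {"layer_22": [[0.1, 0.2]]}}
leaky = {"token_ids": [1, 2, 3], "debug": [{"prompt_text": "What is ..."}]}
print(audit_bundle(clean))  # []
print(audit_bundle(leaky))  # ['prompt_text']
```

A key-name scan only catches leaks that are labeled as such. Replaying the route from the allowed artifacts alone is the stronger test, since it fails if any hidden dependency was doing the work.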

The independent hard-negative audit asked whether the route fired when it should not. Those no-fire cases were generated separately from the positive arithmetic benchmark and included quoted arithmetic, do-not-compute prompts, wrong-operation prompts with the same numbers, tables, logs, code, invoices, distractor-heavy number text, decimals, signs, and out-of-domain cases. Across 10,200 non-trigger examples, it did not fire.

That zero is a scoped audit result rather than a universal safety guarantee. None of those 10,200 constructed negative cases triggered the route; arbitrary future text, other model families, and other arithmetic formats still need their own replay and hard-negative tests.

This is the difference between "we used a calculator" and the more interesting claim: the calculator arguments came from inside the model.

Resolution budget

Longer answers crowd the geometry

The subtraction scaling run found a practical boundary: exact free-generation stayed high at 6 digits (96.7%), fell to 63.3% at 10 digits, and crossed the 50% threshold between 13 digits (53.3%) and 14 digits (43.3%).

The helix-resolution tests did not say the representation simply disappears. For 12-digit answers, each 3-digit chunk remained strongly readable as a phase-like pattern. The failure signature was more subtle: the directions used to read nearby chunks became less separated, and the 14-digit run showed weaker readout quality for chunks 2-4. In the paper, R² measured how much chunk value a simple readout recovered; principal angle measured how separated two nearby readout directions were.
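Both metrics have compact definitions. The sketch below treats each readout as a single direction, in which case the principal angle is just the angle between two lines; for multi-dimensional readout subspaces it would need a singular value decomposition instead. How the project computed its figures is not specified here, so take this as the textbook form, with hypothetical function names.

```python
import math

def angle_between_directions_deg(u, v):
    """Angle between two readout directions treated as lines (sign ignored)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = min(1.0, abs(dot) / norm)  # clamp against rounding above 1
    return math.degrees(math.acos(cosine))

def r_squared(actual, predicted):
    """Share of variance in the true chunk values that the readout recovers."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

print(angle_between_directions_deg([1.0, 0.0], [0.0, 1.0]))  # orthogonal readouts
print(angle_between_directions_deg([1.0, 0.0], [1.0, 1.0]))  # partly overlapping
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))           # perfect readout
```

The two numbers answer different questions, which is how a chunk can stay readable (high R²) while its readout direction drifts toward its neighbor's (smaller angle).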

Figure: readout directions for chunks c1, c2, c3, and c4. Visual metaphor: with more chunks, the readout directions are closer together. The experiment measured this as tighter separation, not as literal circles in the model.

Subtraction exact match by digit count

| Digits | Exact match |
| --- | --- |
| 6 | 96.7% |
| 10 | 63.3% |
| 13 | 53.3% |
| 14 | 43.3% |
| 16 | 33.3% |
| 24 | 6.7% |

Source: cd_e10_operand_scaling, 30 subtraction generations per digit band, using the model's top-choice token at each step.

Chunk crowding in a late layer

| Chunk pair | 12-digit | 14-digit |
| --- | --- | --- |
| c1-c2 | 71.3° | 65.3° |
| c2-c3 | 66.0° | 60.2° |
| c3-c4 | 67.4° | 58.8° |

Source: eg_e2d_helix_resolution_5chunk_sub; 2 of 3 preregistered predictions about chunk crowding passed.
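With 30 generations per digit band, each reported subtraction percentage corresponds to a whole number of exact answers, and the 50% crossing can be located mechanically. The counts below are back-computed from the rounded percentages on that assumption; they are not quoted from the source files.

```python
# Reported exact-match percentages by answer length in digits.
exact_match = {6: 96.7, 10: 63.3, 13: 53.3, 14: 43.3, 16: 33.3, 24: 6.7}
SAMPLES = 30  # generations per digit band, as stated in the source note

# Back-computed counts (an inference from the rounded percentages).
counts = {d: round(p / 100 * SAMPLES) for d, p in exact_match.items()}

def first_crossing(series, threshold=50.0):
    """Return the adjacent pair of bands where the series drops below threshold."""
    bands = sorted(series)
    for lo, hi in zip(bands, bands[1:]):
        if series[lo] >= threshold > series[hi]:
            return lo, hi
    return None

print(counts)                       # {6: 29, 10: 19, 13: 16, 14: 13, 16: 10, 24: 2}
print(first_crossing(exact_match))  # (13, 14)
```

At 30 samples per band, the 13-digit and 14-digit results differ by three generations, so the crossing point is best read as a region rather than a sharp edge.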

Where the frontier remains

The wonder survives the caveats

The current final B3 route is Llama-specific. The project tried more than one model family: earlier mechanism and emission-path work touched Llama-3.1, Llama-3.2 1B/3B, Pythia, OLMo, Qwen-2.5, Mistral, Yi, and related cross-family checks. Some findings traveled: gcd motifs appeared across three families, operation-routing bands appeared across four families, and single-digit or chunk-emission results partially transferred.

Scope note: text parsers transfer because text is shared. Activation routes transfer only when the internal geometry lines up. In this project, some motifs traveled, but the final operand-localization route did not transfer as-is.

The strict final activation-derived operand route did not transfer as-is. In docs/goalB3_qwen_operand_diagnostics.md, the real Qwen operand-localization sweep reports QWEN_OPERAND_ROUTE_FAIL: across the sampled answer-site and input-token positions, ordered and unordered pair recovery stayed at 0.000. Internal activation routes are not portable like text parsers. A parser sees the same string. A different model may have a different body, a different geometry, and therefore a different arithmetic.

The final causal evidence is also scoped. Selected operand chunks influenced the decoded tuple and routed calculator answer, and division with remainder had enough matched intervention pairs for the final causal test. Gcd and lcm were supportive but did not have enough final matched pairs to carry the same strength of claim.

The next steps are concrete: build model-specific operand localizers, meaning readouts that find operands inside each model's own geometry instead of assuming transfer; complete causal interchange tests where source coverage allows it; compare any residual-write attempt against boring logit and parser baselines; and keep the no-parser replay boundary as a non-negotiable test.

The wonder remains. A transformer does not know fingers or an abacus. It has matrices, activations, and learned geometry. Under the right tests, those matrices contain traces of arithmetic: not human arithmetic exactly, but a machine's version of it.

References

Where this story sits

Embodied mathematics: George Lakoff and Rafael E. Núñez, Where Mathematics Comes From: How the Embodied Mind Brings Mathematics into Being, Basic Books, 2000.

Helix arithmetic: Subhash Kantamneni and Max Tegmark, Language Models Use Trigonometry to Do Addition, 2025.

Heuristic arithmetic: Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov, Arithmetic Without Algorithms, 2024.

Causal arithmetic mechanisms: Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan, A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis, 2023.

External benchmark source: David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli, Analysing Mathematical Reasoning Abilities of Neural Models, 2019; associated DeepMind Mathematics Dataset.

Tool routes: PAL, Program of Thoughts, ReAct, and Toolformer show the value of routing model work through external actions, programs, or APIs. Rune's narrower question is where the tool arguments come from: prompt text, generated code, or internal activations.

Sparse feature vocabularies: Anthropic's Towards Monosemanticity and Cunningham et al.'s Sparse Autoencoders Find Highly Interpretable Features in Language Models motivate the SAE discussion.

Activation-patching discipline: Fred Zhang and Neel Nanda's Towards Best Practices of Activation Patching in Language Models: Metrics and Methods is the reason the article keeps separating readout, causality, and writable replacement.

Experiment trail

Read the underlying work

Repository: Rune on GitHub.

Article draft: long-form markdown draft.

Helix and resolution artifacts: 12-digit helix-resolution test, 14-digit crowding follow-up, and subtraction digit-scaling result.

Final claim boundary: final preregistration and claim-control note.

Benchmark results: broad four-op frozen/adversarial cross-seed and DeepMind recognized-source three-op cross-seed.

Provenance and controls: full replay provenance audit, independent hard-negative audit, and final DeepMind causal-interchange summary.

Cross-model falsifier: real Qwen operand-localization failure.

Figure generators: matplotlib article figure script and Blender helix render script.

Static review exports: PDF article and Word draft. These are the print-safe sibling versions of the animated article.