Rune article companion

Arithmetic Without Numbers

What happens inside an LLM when it tries to calculate with nothing but matrices.

Exhibit 1

Integers as phase


The spiral is a simplified picture of a Fourier-style number code: one part of the vector tracks phase around a circle, while another tracks coarse position.

Example readout for integer 137 (slider range 0 to 999): phase 49.3°, cos 0.65, sin 0.76, coarse 13.
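The sample readout for 137 is consistent with a very simple encoding, sketched here. The period of 1000 and the coarse bucket width of 10 are assumptions inferred from those sample values; the function name helix_code is hypothetical. This illustrates the idea of a phase-plus-position code and is not a measurement of any model's representation.

```python
import math

def helix_code(n: int, period: int = 1000, bucket: int = 10) -> dict:
    """Toy Fourier-style code: an angle around a circle plus a coarse position."""
    phase_deg = 360.0 * (n % period) / period   # where n sits in the repeating cycle
    theta = math.radians(phase_deg)
    return {
        "phase_deg": round(phase_deg, 1),
        "cos": round(math.cos(theta), 2),
        "sin": round(math.sin(theta), 2),
        "coarse": n // bucket,                  # which bucket of ten n falls into
    }

print(helix_code(137))
# {'phase_deg': 49.3, 'cos': 0.65, 'sin': 0.76, 'coarse': 13}
```

The cos and sin pair carries the same information as the angle, but it varies smoothly as the integer wraps around the cycle, which is one reason circular codes are convenient for a vector-based system.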

The question

A model has no fingers

If you learned arithmetic the ordinary human way, you probably learned it with a body. You counted on fingers. You grouped things into piles. You lined digits into columns. You carried a one. Later, perhaps, you used an abacus, graph paper, or a calculator.

A language model has none of that. It has matrices. Tokens enter, activations flow, logits come out. And yet, if you ask a modern language model for a greatest common divisor, a multiplication, or a division with remainder, something inside that matrix-only body responds.

Working vocabulary

Token: one unit the model reads or prints. A token might be a word, part of a word, punctuation, or a chunk of digits.

Vector: a list of numbers. A model stores each token's current state as a vector with many dimensions.

Activation: the model's temporary internal state while it is processing a token.

Readout: a small external model trained to recover a fact from an activation, such as the operation or an operand.

Logit: a raw score for a possible next token. Higher logit means the model is more likely to print that token.

Layer: one repeated processing step in the transformer. A modern model has many layers, each updating the running state.

Residual stream: the main running vector passed from layer to layer, like a shared scratchpad without named variables.

Attention: the part of a layer that lets one token position look at information from other positions.

MLP / feed-forward block: the part of each transformer layer that transforms one token position's vector by itself. Attention lets positions exchange information; the MLP then reshapes the local vector, often strengthening, suppressing, or recombining features already present there.

Next-token prediction: the training and generation rule for ordinary language models. Given the text so far, the model scores possible next tokens, prints one, then repeats the process.

Phase: position around a repeating cycle, like the angle of a hand on a clock. Helix-style number codes use phase-like geometry.

GCD / LCM: greatest common divisor and least common multiple. For example, gcd(84, 36) = 12.
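For readers who want the gcd example spelled out, here is a short sketch of Euclid's algorithm, the standard procedure for computing a greatest common divisor, together with the lcm derived from it. The function names are illustrative; the standard library's math.gcd gives the same result.

```python
import math

def gcd_euclid(a: int, b: int) -> int:
    """Euclid's algorithm: replace (a, b) with (b, a mod b) until b is 0."""
    while b:
        a, b = b, a % b
    return abs(a)

def lcm(a: int, b: int) -> int:
    """Least common multiple via the identity lcm(a, b) * gcd(a, b) = |a * b|."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // gcd_euclid(a, b)

# 84 = 2 * 36 + 12, then 36 = 3 * 12 + 0, so the gcd is 12.
print(gcd_euclid(84, 36))   # 12
print(lcm(84, 36))          # 252
```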

Rune began with the debugging question behind the jargon: when a language model gives an arithmetic answer, is it recalling a pattern, running something like an algorithm, or merely producing a plausible next token?

The human contrast

We learned arithmetic with bodies

George Lakoff and Rafael E. Núñez argued in Where Mathematics Comes From that human mathematical ideas are grounded in embodied experience: grouping, moving, measuring, balancing, collecting, and mapping one domain onto another. Whatever one thinks of the full philosophical claim, it is a useful starting point for this story.

A transformer has no fingers, no beads, no written columns, and no scratch paper. It has token embeddings, attention, feed-forward networks, residual streams, and matrices. If it learns arithmetic at all, it has to invent a machine-native version of number.

Humans also do arithmetic in more than one way. We answer 7 x 8 from a memorized multiplication table. We may divide 963 / 17 by running a written algorithm. We estimate tips with shortcuts. So the first scientific problem was not just "can the model answer?" It was "what kind of answering is this?" A memorized table, a learned shortcut, and a real multi-step calculation can all print the same number.

Residual stream

A vector changes as the model reads

Before we can ask whether a number is memorized, computed, or merely rendered, we need one more piece of machinery: the model's running state.

Imagine reading What is the gcd of 84 and 36? one token at a time. The model does not create a neat little variable named operand_a. Instead, each token position carries a long vector of numbers. As the prompt passes through the transformer layers, those vectors are updated again and again.

Some updates move information across positions: the token for 36 can affect the state near the answer position. Other updates reshape the local state: a direction in the vector may become more gcd-like, more operand-like, or more answer-like. The residual stream is the running scratchpad where those changes accumulate.

This is why readouts and patches are possible at all. If the operation and operands leave traces in the residual stream, a small readout may recover them. If a state really matters, a patch may change behavior. If a state is writable, an intervention may guide the model. But those are increasingly strong claims, and the vector itself does not label which claim is true.

[Figure: a tiny residual vector. Layer update: scratch vector + attention + feed-forward.]

Each token position carries a vector, shown here as the little bars. A layer first uses attention to gather information from other token positions, then uses a feed-forward network, often called an MLP, to transform that information. The result is added back into the running vector. The important part is not the formula; it is that the scratch vector can contain facts such as "this looks like operand A" or "this is where the answer begins."
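The additive update can be sketched in a few lines of NumPy. Everything here is a toy: the sizes, the random weights, and the weight-free attention are assumptions made for illustration, and real transformer layers also include normalization, multiple heads, and learned query, key, and value projections.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pos = 8, 4                       # toy sizes, not a real model's

x = rng.normal(size=(n_pos, d_model))       # one residual vector per token position

def toy_attention(x):
    """Causal attention without learned weights: each position mixes itself
    and earlier positions, weighted by softmax similarity."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)            # no looking ahead
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

W_in = rng.normal(size=(d_model, 4 * d_model)) * 0.1
W_out = rng.normal(size=(4 * d_model, d_model)) * 0.1

def toy_mlp(x):
    """Position-wise feed-forward block: expand, apply ReLU, project back."""
    return np.maximum(x @ W_in, 0.0) @ W_out

after_attn = x + toy_attention(x)           # attention output is added, not substituted
after_mlp = after_attn + toy_mlp(after_attn)
print(after_mlp.shape)
```

Because each block adds to the running vector instead of overwriting it, earlier information stays available to later layers unless something actively cancels it.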

Next-token constraint

The model emits the answer left to right

Humans often do written arithmetic from the rightmost digit leftward because carries start at the ones place. A language model has the opposite interface: it must print the first visible answer token before it has printed the later ones.

[Figure: 327 x 48 = 15696. Human scratch direction: the digits are fixed in the order 6, 9, 6, 5, 1 (carry, then move left). Model emission direction: left to right.]

A next-token model must commit to the first visible chunk before it emits the later chunks. That is why answer rendering and exact computation are not the same problem.

For 327 x 48, a person can multiply from the ones place, carry intermediate values, and only later write the leftmost digit. The model does not get that luxury when it is generating text. To answer 15696, it first has to choose something like 15, then 696, then stop.
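The mismatch between the two directions is easy to see in code. This sketch implements ordinary column multiplication and records the digits in the order the carries settle them; the function name is illustrative.

```python
from typing import List

def schoolbook_digits(a: int, b: int) -> List[int]:
    """Product digits in the order a carry-based method fixes them: ones place first."""
    a_digits = [int(c) for c in str(a)][::-1]
    b_digits = [int(c) for c in str(b)][::-1]
    columns = [0] * (len(a_digits) + len(b_digits))
    for i, da in enumerate(a_digits):
        for j, db in enumerate(b_digits):
            columns[i + j] += da * db           # raw column sums, no carrying yet
    digits, carry = [], 0
    for total in columns:
        carry, digit = divmod(total + carry, 10)
        digits.append(digit)                    # this digit is now final
    while len(digits) > 1 and digits[-1] == 0:
        digits.pop()                            # drop unused leading positions
    return digits

scratch_order = schoolbook_digits(327, 48)
print("".join(map(str, scratch_order)))         # 69651
print("".join(map(str, scratch_order[::-1])))   # 15696
```

The leading 1 is the last digit the carry procedure determines, yet it is the first character a left-to-right generator has to emit.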

That matters for the helix story. As answers get longer, more digit chunks have to be represented, tracked, and emitted in order. The experiments found that these chunk readouts can remain partly readable while becoming less separated from one another. When the geometry gets crowded, the model may still have number-like structure inside, but the visible next-token decision loses resolution.

A separate counting experiment made the same pressure visible in a simpler setting. The model saw four consecutive large numbers and had to print the next one. An easy case looks like 314582706123450, 314582706123451, 314582706123452, 314582706123453, ...; the next number is just 314582706123454. Llama's tokenizer split that answer into 3-digit chunks: 314 | 582 | 706 | 123 | 454.

The failures appeared at carry boundaries. A deep-carry case has the shape 314582706999996, 314582706999997, 314582706999998, 314582706999999, .... The correct next number is 314582707000000: the tokenizer chunks move from 314 | 582 | 706 | 999 | 999 to 314 | 582 | 707 | 000 | 000. In the experiment, cases like this collapsed; the best deep-carry cell reached only 18.75% accuracy, and the dominant error was simply repeating 314582706999999.

In the long-continuation version, the failure became almost theatrical. Once the model missed a deep carry, 96.88% of deep cases collapsed to a fixed point: it kept re-emitting the last correct number instead of recovering. That does not prove the subtraction helix failed for the same reason. It shows the broader engineering lesson: long numeric continuations can look stable until a carry or token boundary asks the model to coordinate more state than its learned shortcut can handle.
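A small sketch makes the difference between the easy and deep-carry cases concrete. It assumes 3-digit chunks grouped from the left, which reproduces the chunks quoted above for 15-digit numbers; the function names are illustrative and no real tokenizer is called.

```python
from typing import List

def chunk3(n: int) -> List[str]:
    """Split a decimal string into 3-digit chunks from the left (assumed grouping)."""
    s = str(n)
    return [s[i:i + 3] for i in range(0, len(s), 3)]

def changed_chunks(n: int) -> int:
    """Count how many chunks differ between n and n + 1."""
    before, after = chunk3(n), chunk3(n + 1)
    if len(before) != len(after):
        return len(after)                       # the number grew by a chunk
    return sum(b != a for b, a in zip(before, after))

print(chunk3(314582706123454))                  # ['314', '582', '706', '123', '454']
print(changed_chunks(314582706123453))          # 1
print(changed_chunks(314582706999999))          # 3
```

In the easy case only the last chunk changes. In the deep-carry case three chunks must change together, which is the coordination the counting experiment found fragile.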

The original dream

A just-in-time compiler for arithmetic

The tempting product answer is simple: if the model is bad at arithmetic, call a calculator. A parser can read the prompt What is 84 times 37?, translate it into 84 * 37, send that expression to Python, and return the result.
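A minimal sketch of that parser route follows. The prompt pattern and the word-to-operator table are assumptions chosen for this one sentence shape, and the sketch looks the operator up in a table instead of evaluating a generated expression string.

```python
import operator
import re
from typing import Optional

# Hypothetical pattern covering only prompts shaped like "What is 84 times 37?"
PATTERN = re.compile(r"What is (\d+) (plus|minus|times) (\d+)\?")
OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}

def parse_and_compute(prompt: str) -> Optional[int]:
    """Read the operands and operation from the prompt text, then compute."""
    match = PATTERN.fullmatch(prompt.strip())
    if match is None:
        return None                             # not a supported arithmetic prompt
    a, word, b = match.groups()
    return OPS[word](int(a), int(b))

print(parse_and_compute("What is 84 times 37?"))    # 3108
```

Everything this function knows comes from the string. That is exactly the information source Rune ruled out at runtime.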

Rune was chasing a stricter question. Could we look inside the model and find the calculation it was trying to perform? Could the model's own activations tell us the operation and operands? And if we computed the exact answer, could we put that answer back inside the model so it continued naturally?

Use one concrete prompt: What is the gcd of 84 and 36? A normal tool route reads the text, extracts gcd, 84, and 36, and calls a calculator. Rune disallowed that at runtime. The route could see token IDs and internal activation vectors, but not the prompt string, regex matches, hidden operands, operation labels, or the gold answer. Only if the model's internal state supplied gcd, 84, 36 could Python compute 12.

That is different from standard tool use. PAL, Program of Thoughts, ReAct, Toolformer-style systems, and ordinary function calling already make external computation available. Rune was not trying to beat that path on product simplicity. It was asking whether the arguments to the tool could come from the model's own hidden state rather than from the prompt text.

That distinction is easy to miss, so here is the plain version. Prompt parsing looks at the sentence as text: find the word gcd, find the two numerals, call the function. Internal-state observation treats the sentence as already swallowed by the model. The only things left are vectors at token positions and layers. If a small readout can recover gcd, 84, and 36 from those vectors, then the model has made those facts available internally. That is the claim Rune tried to isolate.

The full version of that dream did not land. We did not prove behavior-preserving residual JIT replacement. But the failure was productive. It forced the project to separate three different things that are easy to confuse: rendering a known answer, reading a variable from activations, and safely writing a corrected state back into the model.

Claim ladder

Five ways to get the right answer

Use the same prompt as a test case: What is the gcd of 84 and 36? All five systems can print 12. The difference is where gcd, 84, and 36 came from, and what part of the model, if any, was actually changed.

1

Prompt parser

Regex or parser reads the prompt text: gcd(84, 36). Practical, but not evidence about model internals.

2

Generated program

The model emits math.gcd(84, 36) or a tool call. This is the PAL, Program of Thoughts, ReAct, and Toolformer neighborhood.

3

Final-token correction

A wrapper biases the next token toward 12. It may not know why 12 is right.

4

Activation-derived tool arguments

A readout observes internal states and decodes gcd, 84, 36. Only then does Python compute.

5

Residual JIT replacement

The system writes the computed result back into the hidden state and asks the model to continue naturally. This was the original dream; it did not land for the tested writes.

The first trap

Rendering is not computing

Early experiments showed that late-layer writer states could help the model emit numeric chunks. That was exciting: if you supplied the right internal state, the model could render a desired part of an answer.

But this created a trap. If the experiment already knows the answer and uses that answer to choose a steering vector, it has measured the model's ability to render a supplied value. That is useful, but it is not the same as showing that the model computed the value, or that a deployment system could find the value without help.

This is the same distinction people make intuitively. A child who recites 7 x 8 = 56 may be recalling a table. A child who solves 73 x 48 on paper is doing a procedure. Both answers are arithmetic, but they are not the same evidence. Rune had to keep asking whether a model was recalling, pattern-matching, computing, or being helped by the experiment.

So the rule became stricter. At runtime, the prompt must be opaque. No regex. No prompt parsing. No hidden harness operands. Python may compute only after the operation and operands have been decoded from model internals.
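The shape of that boundary can be sketched as a calculator that accepts only decoded values. The DecodedCall type, its field names, and the operation keys are hypothetical; the sketch shows the interface only and says nothing about how Rune's readouts produce the values.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class DecodedCall:
    """What hypothetical readouts hand over: decoded values, never prompt text."""
    fire: bool      # gate decision: is the calculator allowed to run?
    op: str         # decoded operation
    a: int          # decoded operand A
    b: int          # decoded operand B

CALCULATOR = {
    "mul": lambda a, b: str(a * b),
    "divmod": lambda a, b: "{} remainder {}".format(*divmod(a, b)),
    "gcd": lambda a, b: str(math.gcd(a, b)),
    "lcm": lambda a, b: str(a * b // math.gcd(a, b)),
}

def route(call: DecodedCall) -> Optional[str]:
    """Compute only when the gate fired and the decoded tuple is usable."""
    if not call.fire or call.op not in CALCULATOR:
        return None
    if call.a <= 0 or call.b <= 0:
        return None                             # toy guard: positive operands only
    return CALCULATOR[call.op](call.a, call.b)

print(route(DecodedCall(fire=True, op="gcd", a=84, b=36)))    # 12
print(route(DecodedCall(fire=False, op="gcd", a=84, b=36)))   # None
```

Note that route has no prompt parameter at all. If the operation or an operand is wrong, the only place the error can have come from is the decoding step.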

What prior work taught us

Helices, heuristics, and causal tests

Rune did not discover the helix idea. Kantamneni and Tegmark's Language Models Use Trigonometry to Do Addition made the geometric hypothesis concrete: integer representations can lie on generalized helices, and addition can be described as manipulating phase. Nikankin and colleagues' Arithmetic Without Algorithms gives the complementary warning: model arithmetic can also look like a bag of learned heuristics rather than a clean schoolbook algorithm.

The tools were not invented here either. Probes, sparse autoencoders, activation patching, causal mediation, and tool-calling systems all come from existing interpretability and tool-use literatures. Rune's contribution is narrower: applying those ideas under a no-parser provenance rule and reporting where the stronger residual-JIT story did not hold.

That tension shaped the experiments. When a probe decoded a number, we asked whether it was reading a robust variable or a surface cue. When a steering vector made a token appear, we asked whether it had changed computation or merely forced rendering. When an activation patch worked, we asked whether the control patch worked too.

[Figure: a render of an integer helix with phase points.]
One way a matrix can carry a number is by splitting it into coordinates: phase around a circle for repeating digit structure, and position along the spiral for scale. The engineering point is that numbers need not be stored as decimal strings. A readout may recover 84 from a direction or clock-like phase pattern even when no component is literally "the 84 slot."

The toolbox

How do you inspect a matrix body?

Most of Rune was not one big experiment. It was a toolbox applied over and over, with stricter controls each time.

The four instruments below touch the same running activation vector in different ways. A probe reads a fact. An SAE tries to name reusable parts of the vector. A patch asks whether copying a part changes the answer. Steering pushes the state and watches what happens. Each is useful; each can mislead if treated as stronger evidence than it is.

Toolbox simulator

Four ways to touch an activation

The residual stream is the model's running scratch vector. These four tools touch that vector in four different ways. A probe asks, "Can I read something?" An SAE asks, "Can I name the parts?" A patch asks, "Does this part matter?" Steering asks, "What happens if I push it?" Confusing those questions is how interpretability experiments accidentally overclaim.

Probe

gcd(84, 36) is already inside the model as vectors. Can a small readout recover the operation and numbers?

[Diagram: hidden vector only -> dot -> decoded arguments: op=gcd, a=84, b=36.]

A probe is an external measuring instrument. We freeze the model, collect activation vectors, and train a small readout to answer questions such as "is this gcd?" or "is operand A equal to 84?"

Shows: information is readable. Does not show: the direction caused the model's behavior.
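A probe can be as small as a least-squares fit. This sketch uses synthetic vectors in place of real activations: the planted direction, the noise level, and the sizes are all assumptions, chosen so that there is something for the probe to find.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 32, 400                               # toy sizes

# Synthetic stand-in for frozen activations: a planted "is this gcd?" direction plus noise.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)                # 1 = gcd prompt, 0 = anything else
acts = rng.normal(size=(n, d_model)) + 3.0 * np.outer(2 * labels - 1, direction)

train, test = slice(0, 300), slice(300, n)

# Linear probe fitted by least squares on +1/-1 targets; the extra column is a bias term.
X = np.hstack([acts, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(X[train], 2.0 * labels[train] - 1.0, rcond=None)

pred = (X[test] @ w > 0).astype(int)
accuracy = (pred == labels[test]).mean()
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only says the label is linearly readable from the vectors. Nothing in this code tests whether the model itself uses that direction.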

Sparse autoencoder

Can the dense state be described by a few reusable features instead of thousands of anonymous coordinates?

[Diagram: dense activation -> encode -> feature dictionary: operand A, operand B, number phase, syntax.]

A sparse autoencoder is a dictionary-building tool. It tries to rewrite a messy vector as a small set of active feature tags rather than one opaque blob.

Shows: a possible feature vocabulary. Does not show: those features are sufficient for computation.
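The encode and decode steps of a sparse autoencoder look like this in miniature. The dictionary below is hand-built and fixed, which is an assumption made for illustration: a real SAE learns a much larger dictionary by training on many activations, and its features rarely come with names attached.

```python
import numpy as np

rng = np.random.default_rng(0)
names = ["operand A", "operand B", "number phase", "syntax"]
d_model = 6

# Four orthonormal feature directions in a 6-dimensional space (hand-built, not learned).
q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
W_dec = q[:, :4].T                          # shape (4, d_model), one row per feature
W_enc = W_dec.T                             # tied encoder weights
b_enc = -0.05 * np.ones(4)                  # small negative bias silences near-zero features

def encode(x):
    """Dense vector -> non-negative feature activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def decode(f):
    """Feature activations -> approximate reconstruction of the dense vector."""
    return f @ W_dec

x = 2.0 * W_dec[0] + 1.0 * W_dec[2]         # a dense vector built from two features
f = encode(x)
active = [name for name, value in zip(names, f) if value > 0]
print(active)                               # ['operand A', 'number phase']
print(float(np.linalg.norm(decode(f) - x))) # small reconstruction error from the bias
```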

Activation patch

If we copy a state from gcd(90, 36) into the gcd(84, 36) run, does the decoded operand or answer follow the donor?

[Diagram: donor run gcd(90, 36) -> swap -> recipient run.]

A patch is a controlled transplant. Copy a donor state into a recipient run and ask whether the output follows the donor.

Shows: causal influence when controls pass. Does not show: safe deployment.
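The logic of a patch and its control fits in a toy model. The two-slot state and the functions below are deliberately trivial assumptions: in a real model the operand is not stored in a labeled slot, which is exactly why the control matters.

```python
import math
import numpy as np

def encode(a: int, b: int) -> np.ndarray:
    """Toy 'early layers': write the operands into a 2-slot state vector."""
    return np.array([a, b], dtype=float)

def readout(state: np.ndarray) -> int:
    """Toy 'late layers': compute the answer from whatever the state holds."""
    return math.gcd(int(state[0]), int(state[1]))

def run_with_patch(recipient, donor, slot: int) -> int:
    state = encode(*recipient).copy()
    state[slot] = encode(*donor)[slot]      # transplant one slot from the donor run
    return readout(state)

clean = readout(encode(84, 36))
patched = run_with_patch((84, 36), (90, 36), slot=0)   # slot where donor and recipient differ
control = run_with_patch((84, 36), (90, 36), slot=1)   # slot where they agree
print(clean, patched, control)                          # 12 18 12
```

The patched run follows the donor and prints gcd(90, 36) = 18, while the control patch leaves the answer at 12. A patch result without a matching control says much less.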

Steering

Can we add a direction that pushes the next token toward 12 without breaking the rest of the sentence?

[Diagram: push toward token "12".]

Steering adds a vector to move the model toward a feature or token. It is the closest operation to "editing" the run.

Shows: the state is movable. Does not show: the rest of the model remains coherent.
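A toy version of steering shows both the push and its side effects. The four-token vocabulary, the random unembedding matrix, and the steering strength alpha are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["12", "6", "18", "the"]
d_model = 16

W_U = rng.normal(size=(len(vocab), d_model))        # toy unembedding, one row per token
W_U /= np.linalg.norm(W_U, axis=1, keepdims=True)   # unit-norm rows keep the comparison simple
h = rng.normal(size=d_model)                        # toy final hidden state

target = vocab.index("12")
alpha = 5.0
steered = h + alpha * W_U[target]                   # push along the target token's direction

shift = W_U @ steered - W_U @ h                     # change in every logit
for token, delta in zip(vocab, shift):
    print(f"{token!r:>6}: logit change {delta:+.2f}")
```

The target logit rises by exactly alpha, but every other logit moves too, by alpha times its overlap with the steering direction. In a real model those unintended shifts are where coherence can break.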

Readable vs writable

A decoded variable is not an API

The residual-write experiments were the closest version of the original compiler thesis. We tried to write corrected answer information into the residual stream and let the model continue.

For the tested single-site writes, that did not earn its keep. Residual interventions had no accuracy advantage over simpler token or logit correction, and they disturbed surrounding behavior more. On multi-token answers, forcing the first correct token could lead the model to complete the rest better than a crude residual write.

Think about gcd(84, 36) = 12. A final-token correction can simply make 12 more likely at the moment the answer is printed. A residual write is a harder promise: it should update the hidden state so the model can still explain, format, and continue consistently. Rune found the former much easier than the latter.

The lesson is simple and important: a readable variable is not necessarily a writable register. Mechanistic interpretability often celebrates reading; engineering wants writing. Those are different problems.

Three candidates

One route survived the controls

By the end, the project had three plausible stories. The first was prompt parsing, which works but was outside the scientific question. The second was residual replacement, the original compiler dream, which was too brittle in the tested form. The third was activation-derived tool arguments: read the operation and operands from hidden states, compute externally, and keep a replay audit proving that no forbidden text fields slipped in.

That third story is less glamorous than "we taught the model arithmetic," but it is cleaner. It says something specific about model internals: arithmetic prompts can leave recoverable operation and operand structure in the residual stream, and that structure can drive a calculator route under an opaque-prompt boundary.

Tool-use context

Tool use already works; that was not the question

Modern tool-use systems already know how to route arithmetic to external computation. PAL and Program of Thoughts ask models to emit executable programs. ReAct interleaves reasoning and actions. Toolformer studies models learning API calls. If the product goal is only "get the arithmetic right," a text parser plus Python is hard to beat.

Rune's question was narrower and stranger: can the tool arguments come from the model's internal state rather than the prompt text? That is why the runtime boundary matters. Calibration data can use labels. Evaluation can use gold answers for scoring. But the deployed route cannot receive prompt text, regex spans, harness operands, or a hidden operation flag.

In that sense, Rune is not competing with tool use. It is asking what a tool-use boundary looks like when the argument source is mechanistic rather than textual.

What survived

No-parser tool arguments from activations

At this point the important question is not whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.

The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule.

On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm. Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent.

Across 11,736 locked examples and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. Locked means that the examples, thresholds, and scoring rules were fixed before the final aggregate was computed. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.

The DeepMind Mathematics Dataset, introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check. Recognized is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: Calculate the greatest common divisor of 2474 and 5568., What is the remainder when 5734 is divided by 5529?, or Calculate the least common multiple of 839 and 6781.

On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. The mean exact-answer gains were +0.810 for division with remainder, +0.502 for gcd, and +0.968 for lcm. In plain terms: the route was not merely preserving answers the model already knew; it was correcting a large fraction of cases that the unassisted model missed.

Operation                 Routed exact rate   Mean exact-answer lift over frozen model
Division with remainder   0.992               +0.810
GCD                       1.000               +0.502
LCM                       0.980               +0.968

Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.

Should fire

Calculate the highest common factor of 5924 and 1024.

What is the remainder when 7696 is divided by 5130?

What is the smallest common multiple of 4740 and 1152?

Should not fire

She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.

A reporter typed '144 / 12' into her notes but the story was about a basketball game.

The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.

The honesty boundary

The provenance was the experiment

The most important engineering artifact may not be the accuracy number. It may be the replay boundary. Provenance is the audit trail for where the calculator arguments came from, and the final replay audit covered 15,558 runtime bundles while excluding forbidden fields: prompt text, regex outputs, decoded token spans, harness operands, operation labels, and gold answers. The route had to reproduce from allowed runtime artifacts.
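A replay check of this kind can be sketched as a recursive scan for forbidden keys. The key names and the bundle layout are hypothetical stand-ins for the forbidden categories listed above, not Rune's actual artifact schema.

```python
from typing import List

# Hypothetical key names for the forbidden categories named in the text.
FORBIDDEN_FIELDS = {
    "prompt_text", "regex_outputs", "decoded_token_spans",
    "harness_operands", "operation_labels", "gold_answers",
}

def audit_bundle(bundle: dict) -> List[str]:
    """Return the path of every forbidden key found anywhere in a nested bundle."""
    found = []

    def walk(node, path):
        if isinstance(node, dict):
            for key, value in node.items():
                here = f"{path}.{key}" if path else str(key)
                if key in FORBIDDEN_FIELDS:
                    found.append(here)
                walk(value, here)
        elif isinstance(node, (list, tuple)):
            for i, item in enumerate(node):
                walk(item, f"{path}[{i}]")

    walk(bundle, "")
    return sorted(found)

clean = {"token_ids": [1, 2, 3], "activations": {"layer_20": [0.1, 0.2]}}
leaky = {"token_ids": [1, 2, 3], "meta": {"prompt_text": "What is the gcd of 84 and 36?"}}
print(audit_bundle(clean))    # []
print(audit_bundle(leaky))    # ['meta.prompt_text']
```

A check like this only catches leaks that arrive under a known name, so it complements the replay itself and does not replace it.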

The independent hard-negative audit asked whether the route fired when it should not. Those no-fire cases were generated separately from the positive arithmetic benchmark and included quoted arithmetic, do-not-compute prompts, wrong-operation prompts with the same numbers, tables, logs, code, invoices, distractor-heavy number text, decimals, signs, and out-of-domain cases. Across 10,200 non-trigger examples, it did not fire.

That zero is a scoped audit result, not a universal safety guarantee. It means none of those 10,200 constructed negative cases triggered the route. It does not mean arbitrary future text, other model families, or other arithmetic formats are safe without their own replay and hard-negative tests.

That is the difference between "we used a calculator" and the more interesting claim: the calculator arguments came from inside the model.

Resolution budget

Longer answers crowd the geometry

The subtraction scaling run found a practical boundary: exact free-generation stayed high at 6 digits (96.7%), fell to 63.3% at 10 digits, and dropped below 50% between 13 digits (53.3%) and 14 digits (43.3%).

The helix-resolution tests did not say the representation simply disappears. For 12-digit answers, each 3-digit chunk remained strongly readable as a phase-like pattern. The failure signature was more subtle: the directions used to read nearby chunks became less separated, and the 14-digit run showed weaker readout quality for chunks 2-4.

[Figure: readout directions for chunks c1, c2, c3, c4.]

Visual metaphor: with more chunks, the readout directions are closer together. The experiment measured this as tighter separation, not as literal circles in the model.

subtraction exact match by digit count

Digits   Exact match
6        96.7%
10       63.3%
13       53.3%
14       43.3%
16       33.3%
24       6.7%

Source: cd_e10_operand_scaling, 30 subtraction generations per digit band, using the model's top-choice token at each step.
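With 30 generations per band, each percentage corresponds to a whole number of correct answers, and the sampling noise is worth keeping in mind. The sketch below recovers the counts and a simple binomial standard error; treating each band as 30 independent trials is an assumption of the sketch, not something the source states.

```python
import math

N = 30                                              # generations per digit band
rates = {6: 96.7, 10: 63.3, 13: 53.3, 14: 43.3, 16: 33.3, 24: 6.7}

counts = {digits: round(pct / 100 * N) for digits, pct in rates.items()}

for digits, correct in counts.items():
    p = correct / N
    stderr = math.sqrt(p * (1 - p) / N)             # binomial standard error
    print(f"{digits:>2} digits: {correct}/{N} correct, "
          f"{100 * p:.1f}% +/- {100 * stderr:.1f} points")
```

The 13-digit and 14-digit bands differ by three generations (16 versus 13 of 30), so the exact location of the 50% crossing should be read as approximate.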

chunk crowding in a late layer

Chunk pair   12-digit   14-digit
c1-c2        71.3°      65.3°
c2-c3        66.0°      60.2°
c3-c4        67.4°      58.8°

Source: eg_e2d_helix_resolution_5chunk_sub; 2 of 3 preregistered predictions about chunk crowding passed.
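Separation between two readout directions can be measured as the angle between them. The sketch below assumes the reported degrees are angles of this kind; the helper name is illustrative, and the figures in the dictionary are the ones reported above.

```python
import numpy as np

def angle_deg(u, v) -> float:
    """Angle between two direction vectors, in degrees."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

print(round(angle_deg([1, 0], [0, 1]), 1))      # 90.0, fully separated
print(round(angle_deg([1, 0], [1, 1]), 1))      # 45.0, closer together

# Reported late-layer separations: (12-digit, 14-digit) for each adjacent chunk pair.
reported = {"c1-c2": (71.3, 65.3), "c2-c3": (66.0, 60.2), "c3-c4": (67.4, 58.8)}
for pair, (twelve, fourteen) in reported.items():
    print(f"{pair}: narrowed by {twelve - fourteen:.1f} degrees")
```

Orthogonal directions (90°) can be read independently. As the angle shrinks, a readout for one chunk picks up more of its neighbor's signal.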

Where the frontier remains

The wonder survives the caveats

The current final B3 route is Llama-specific. But the project tried more than one model family. Earlier mechanism and emission-path work touched Llama-3.1, Llama-3.2 1B/3B, Pythia, OLMo, Qwen-2.5, Mistral, Yi, and related cross-family checks. Some findings traveled: gcd motifs appeared across three families, operation-routing bands appeared across four families, and single-digit or chunk-emission results partially transferred.

Scope note: text parsers transfer because text is shared. Activation routes transfer only when the internal geometry lines up. In this project, some motifs traveled, but the final operand-localization route did not transfer as-is.

A real Qwen operand-localization sweep failed, which is exactly the sort of result a serious article should include. A parser sees the same string whatever the model; an activation route depends on internal geometry that may be entirely different from one model family to the next.

The final causal evidence is also scoped. Selected operand chunks influenced the decoded tuple and routed calculator answer, and division with remainder had enough matched intervention pairs for the final causal test. Gcd and lcm were supportive but did not have enough final matched pairs to carry the same strength of claim.

The next steps are concrete: build model-specific operand localizers, meaning readouts that find operands inside each model's own geometry instead of assuming transfer; complete causal interchange tests where source coverage allows it; compare any residual-write attempt against boring logit and parser baselines; and keep the no-parser replay boundary as a non-negotiable test.

Still, the wonder is there. A transformer does not know fingers or an abacus. It has matrices, activations, and learned geometry. Under the right tests, those matrices contain traces of arithmetic: not human arithmetic exactly, but a machine's version of it.

References

Where this story sits

Embodied mathematics: George Lakoff and Rafael E. Núñez, Where Mathematics Comes From: How the Embodied Mind Brings Mathematics into Being, Basic Books, 2000.

Helix arithmetic: Subhash Kantamneni and Max Tegmark, Language Models Use Trigonometry to Do Addition, 2025.

Heuristic arithmetic: Yaniv Nikankin, Anja Reusch, Aaron Mueller, and Yonatan Belinkov, Arithmetic Without Algorithms, 2024.

Causal arithmetic mechanisms: Alessandro Stolfo, Yonatan Belinkov, and Mrinmaya Sachan, A Mechanistic Interpretation of Arithmetic Reasoning in Language Models using Causal Mediation Analysis, 2023.

External benchmark source: David Saxton, Edward Grefenstette, Felix Hill, and Pushmeet Kohli, Analysing Mathematical Reasoning Abilities of Neural Models, 2019; associated DeepMind Mathematics Dataset.

Tool routes: PAL, Program of Thoughts, ReAct, and Toolformer show the value of routing model work through external actions, programs, or APIs. Rune's narrower question is where the tool arguments come from: prompt text, generated code, or internal activations.

Sparse feature vocabularies: Anthropic's Towards Monosemanticity and Cunningham et al.'s Sparse Autoencoders Find Highly Interpretable Features in Language Models motivate the SAE discussion.

Activation-patching discipline: Fred Zhang and Neel Nanda's Towards Best Practices of Activation Patching in Language Models: Metrics and Methods is the reason the article keeps separating readout, causality, and writable replacement.
