At this point the important question is no longer whether arithmetic can be routed to Python. It can. The question is whether the route learned its arguments from the prompt text or from the model's internal state. Rune's final supported claim is only about the latter.
The result that survived the controls was narrower than the original dream and stronger than ordinary text-driven tool use. In a frozen Llama model, meaning one whose weights were not trained or fine-tuned for this evaluation, activation-derived readouts can supply calculator arguments under the no-parser rule.
The production operand route used a layer-22 chunk path: selected operand chunks, chosen with attention-based selection, were decoded by a frozen chunk probe. Reading operands directly from their early input-token positions would be too close to parsing in disguise. This remains a readout claim, not a proof that we can safely rewrite the model's internal state.
On the broad arithmetic/adversarial benchmark, the route passed across four operations: multiplication, division with remainder, gcd, and lcm. Passing meant two things at once. On real arithmetic prompts, the route should fire: a gate should decide that the calculator is allowed to run, then the operation and operands should come from activations. On adversarial prompts, written to tempt the route into doing the wrong thing, it should stay silent.
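The fire-or-stay-silent behavior described above can be sketched as a gate followed by an exact calculator. This is a minimal illustration, not Rune's implementation: the names `route`, `calculate`, and `GATE_THRESHOLD`, the operation labels, and the threshold value are all assumptions, and the arguments `gate_score`, `op`, `a`, and `b` stand in for values that the real route reads out of activations rather than out of the prompt text.

```python
from math import gcd

# Hypothetical threshold; the real value is not given in the text.
GATE_THRESHOLD = 0.5


def calculate(op: str, a: int, b: int):
    """Exact arithmetic for the four operations named in the text."""
    if op == "mul":
        return a * b
    if op == "divrem":
        return divmod(a, b)  # (quotient, remainder); b must be nonzero
    if op == "gcd":
        return gcd(a, b)
    if op == "lcm":
        return 0 if a == 0 or b == 0 else a * b // gcd(a, b)
    raise ValueError(f"unsupported operation: {op}")


def route(gate_score: float, op: str, a: int, b: int):
    """Return the calculator result, or None when the route stays silent."""
    if gate_score < GATE_THRESHOLD:
        return None  # no fire: the frozen model answers by itself
    return calculate(op, a, b)
```

With these assumed names, `route(0.9, "gcd", 48, 18)` returns 6, while the same arguments with a gate score below the threshold return `None`, which is the intended outcome on an adversarial prompt.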
Across 11,736 locked examples and 1,536 targets, the route produced large exact-answer lifts with 0 fires on the constructed hard-negative suite used in this audit. Locked means that the examples, thresholds, and scoring rules were fixed before the final aggregate was computed. Lift means the routed exact-answer rate minus the frozen model's native exact-answer rate on the target prompts. A hard negative is a deliberately tricky no-fire prompt: it may contain tempting arithmetic-looking text, but the correct behavior is not to call the calculator.
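The two quantities defined above, lift and hard-negative fires, reduce to simple counting. The sketch below assumes per-example boolean flags as input; the function names and the toy data are invented for illustration and are not taken from the audit.

```python
def exact_rate(flags):
    """Percentage of True values in a list of per-example correctness flags."""
    return 100.0 * sum(flags) / len(flags)


def lift(routed_correct, native_correct):
    """Routed exact-answer rate minus native rate, in percentage points."""
    return exact_rate(routed_correct) - exact_rate(native_correct)


def hard_negative_fires(fired):
    """Number of hard-negative prompts on which the route fired (target: 0)."""
    return sum(fired)


# Toy data, invented for illustration only.
routed = [True, True, True, False]    # routed answers: 75% exact
native = [True, False, False, False]  # native answers: 25% exact
print(lift(routed, native))
print(hard_negative_fires([False, False, False]))
```

Both lists must describe the same target prompts for the subtraction to be meaningful.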
The route's operand bounds were frozen at integers from 0 through 9999, with at most 12 generated answer tokens. In plain terms, the supported result is two-integer arithmetic within that published route rather than arbitrary long arithmetic. The claim-bearing rerun was preregistered on June 2, 2026, with thresholds, operand bounds, and runner choices fixed around scripts/goalB3_repaired_benchmark_suite.py.
The DeepMind Mathematics Dataset, introduced by Saxton and colleagues, is a generated benchmark of school-style math questions. Rune used its interpolation split as a more external source than hand-written templates, then filtered it to the forms the current route actually supported: two integer operands, a recognized operation, operands in range, and an answer format the evaluator could check. "Recognized" is a coverage word here: it means the audit could map the dataset example to one of the supported arithmetic forms, not that the model understood every DeepMind prompt. Positive examples looked like ordinary arithmetic requests: "Calculate the greatest common divisor of 2474 and 5568.", "What is the remainder when 5734 is divided by 5529?", or "Calculate the least common multiple of 839 and 6781."
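The acceptance criteria above can be pictured as a predicate over dataset examples that have already been mapped to a structured form. The record layout (a dict with `op` and `operands` keys), the operation labels, and the function name `is_accepted` are assumptions made for this sketch; the operand bounds are the ones stated for the route, and the answer-format check is left out because the text does not describe it.

```python
# Hypothetical labels for the supported arithmetic forms.
SUPPORTED_OPS = {"mul", "divrem", "gcd", "lcm"}
LOW, HIGH = 0, 9999  # operand bounds stated for the route


def is_accepted(example: dict) -> bool:
    """True if the example has a supported op and two in-range integer operands."""
    if example.get("op") not in SUPPORTED_OPS:
        return False
    operands = example.get("operands", [])
    if len(operands) != 2:
        return False
    return all(
        isinstance(x, int) and not isinstance(x, bool) and LOW <= x <= HIGH
        for x in operands
    )


examples = [
    {"op": "gcd", "operands": [2474, 5568]},   # in range, supported
    {"op": "lcm", "operands": [839, 67810]},   # second operand out of range
    {"op": "sqrt", "operands": [16, 2]},       # not a supported form
]
accepted = [ex for ex in examples if is_accepted(ex)]
```

A filter of this kind belongs to the audit's data preparation only. It says nothing about how the route obtains its operands, which under the no-parser rule must come from activations.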
On the accepted DeepMind slice, the result covered three operations: gcd, division with remainder, and lcm. Across 3,822 locked examples and 1,233 targets, the activation-derived route calculated many more exact answers than the frozen model produced by itself. In plain percentages, routed exact rates were 99.2% for division with remainder, 100% for gcd, and 98.0% for lcm, with exact-answer gains of 81.0, 50.2, and 96.8 percentage points, respectively, over the frozen model's native answers. The route was correcting a large fraction of cases the unassisted model missed.
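Because lift is the routed rate minus the native rate, the reported figures imply the frozen model's native exact-answer rates on this slice. The sketch below only rearranges the numbers quoted above; since those numbers are rounded, the implied rates are approximate, and the helper name `implied_native_rate` is an assumption.

```python
# Figures as reported in the text:
# (routed exact rate in %, gain in percentage points).
reported = {
    "division with remainder": (99.2, 81.0),
    "gcd": (100.0, 50.2),
    "lcm": (98.0, 96.8),
}


def implied_native_rate(routed: float, gain: float) -> float:
    """Native rate implied by gain = routed - native, rounded to 0.1."""
    return round(routed - gain, 1)


native = {op: implied_native_rate(r, g) for op, (r, g) in reported.items()}
for op, rate in native.items():
    print(f"{op}: about {rate}% native exact answers")
```

Read this way, the frozen model by itself was exact on roughly 18.2% of division-with-remainder targets, 49.8% of gcd targets, and 1.2% of lcm targets.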
Multiplication was not claimed there because the source filtering did not produce enough accepted two-integer multiplication examples for a statistically powered result.
Should fire
Calculate the highest common factor of 5924 and 1024.
What is the remainder when 7696 is divided by 5130?
What is the smallest common multiple of 4740 and 1152?
Should not fire
She wrote 'gcd(48, 18) = 6' on the whiteboard and then changed the subject to budgets of 200 and 300.
A reporter typed '144 / 12' into her notes but the story was about a basketball game.
The chart showed 6, 12, 18, 24 as factor labels but the article discussed musical notation.