Experiment 27 - MicroGPT on ANE — Minimum Size Constraint Discovery

Date: 2026-05-03

Sources: Dragon Book §8.7 (Peephole Optimization) + Knuth TAOCP Vol. 2 Ch. 4 (arithmetic: numerics, overflow avoidance)

Context: Karpathy’s MicroGPT (gist / blog post 2026-02-12) is a 200-line educational GPT with scalar autograd. No pre-trained checkpoint exists — it is a training script. This experiment builds the full ANE pipeline: train from scratch, export weights, CoreML conv shard, Swift + Python chat runtime.

Problem discovered: The original MicroGPT architecture (n_embd=16, n_head=4, block_size=16, n_layer=1) converted to a 0.03 MB compiled INT8 shard. Every op fell to CPU — conv_ane=0/47, compute_ane=0/47. No error is raised; the ANE cost model simply refuses sub-threshold graphs.

Root cause (empirical ANE law): The ANE conv scheduler has a minimum compiled-shard size of approximately 14 MB for transformer 1×1-conv graphs. Below this floor the cost model prefers CPU scheduling regardless of op type. This threshold had been documented in ANE_CHAIN_SCHEMA.md but was never triggered by the Hy-MT or Phi-4 shards (both well above floor). MicroGPT’s toy architecture hit it for the first time.
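A pre-flight size check makes the floor explicit before a shard is ever loaded. The following is a minimal sketch, not part of the project's tooling: the helper names, the use of 1 MB = 10^6 bytes, and the throwaway directory standing in for a compiled shard are all assumptions for illustration. The 14 MB constant is the empirical floor stated above.

```python
import os
import tempfile

ANE_FLOOR_MB = 14.0  # empirical floor quoted in this log, not an Apple-documented limit


def shard_size_mb(path):
    """Total on-disk size of a file or directory tree, in MB (assumed 10**6 bytes)."""
    if os.path.isfile(path):
        return os.path.getsize(path) / 1e6
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6


def clears_ane_floor(path, floor_mb=ANE_FLOOR_MB):
    """True if the shard is at least floor_mb; size is only a proxy for placement."""
    return shard_size_mb(path) >= floor_mb


# Demo: a temporary directory with 30,000 bytes mimics the 0.03 MB toy shard.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "weights.bin"), "wb") as f:
        f.write(b"\0" * 30_000)
    toy_size = shard_size_mb(tmp)
    toy_ok = clears_ane_floor(tmp)
```

A check like this only flags shards that are likely to fall back; the op-placement counts (conv_ane, compute_ane) remain the ground truth.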

Fix — scaling to clear the ANE floor:

The correct response per project policy is to move compute onto ANE, never to optimise a CPU fallback. The model was scaled to:

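| Parameter          | Original | Scaled   |
|--------------------|----------|----------|
| n_embd             | 16       | 512      |
| n_head             | 4        | 8        |
| head_dim           | 4        | 64       |
| n_layer            | 1        | 6        |
| block_size         | 16       | 64       |
| Params             | ~4,192   | ~18.9 M  |
| Compiled INT8 size | 0.03 MB  | 19.07 MB |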

With n_embd=512 and n_layer=6 the compiled shard is 19.07 MB, roughly 5 MB above the 14 MB floor.
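The parameter counts can be sanity-checked with a few lines of arithmetic. This sketch assumes a particular layout that the log does not spell out: a 27-token vocabulary, separate (untied) token embedding and output head, a learned position table, four square attention matrices per layer, a 4x MLP, and no biases or norm weights. The function name and defaults are illustrative; under these assumptions the totals line up with both columns of the table.

```python
def count_params(n_embd, n_layer, block_size, vocab_size=27, mlp_mult=4):
    """Weight count for a bias-free GPT block stack (layout is an assumption)."""
    embed = vocab_size * n_embd              # token embedding table
    pos = block_size * n_embd                # learned position embedding table
    head = vocab_size * n_embd               # untied output projection
    attn = 4 * n_embd * n_embd               # wq, wk, wv, wo
    mlp = 2 * mlp_mult * n_embd * n_embd     # up-projection + down-projection
    return embed + pos + head + n_layer * (attn + mlp)


original = count_params(n_embd=16, n_layer=1, block_size=16)
scaled = count_params(n_embd=512, n_layer=6, block_size=64)
print(original, scaled)
```

At one byte per INT8 weight the scaled count corresponds to about 18.9 MB of raw weights, which is in line with the 19.07 MB compiled size once container overhead is included.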

Safe-norm peephole (Dragon Book §8.7): The original RMSNorm implementation accumulates the sum of squares directly in fp16, which can overflow at large channel counts because fp16 tops out at 65504. The peephole fix divides by √d before squaring, which shrinks every squared term by a factor of d, matching the pattern in gguf_to_ane.py:

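def safe_rmsnorm(x, eps=1e-5):       # wrapper name and eps default are illustrative
    # x: torch tensor with channels on dim 1
    K   = x.shape[1] ** 0.5          # √d, scalar
    xs  = x * (1.0 / K)              # x / √d, keeps fp16 squares in range
    # rsqrt of (mean(xs²) + eps/d) equals √d / sqrt(mean(x²) + eps)
    rms = (xs.pow(2).mean(dim=1, keepdim=True) + eps/(K*K)).rsqrt()
    return (xs * rms).half()         # √d cancels: x / sqrt(mean(x²) + eps)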

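This is a textbook peephole. The unstable pattern:

\[ y = x \left(\frac{1}{d}\sum_i x_i^2 + \varepsilon\right)^{-1/2} \]

is rewritten as:

\[ y = \frac{x}{\sqrt{d}} \left(\frac{1}{d}\sum_i \left(\frac{x_i}{\sqrt{d}}\right)^2 + \frac{\varepsilon}{d}\right)^{-1/2} \]

The two forms are mathematically identical: the bracket in the second form is 1/d times the bracket in the first, so its inverse square root contributes a factor of √d that cancels the leading 1/√d. The second form is numerically safe because every squared term is d times smaller, and it is preferred by the ANE cost model for norm ops.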
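The overflow can be reproduced without an ANE or PyTorch by emulating fp16 rounding with the standard library. This is a sketch under stated assumptions: every intermediate is rounded to half precision and the sum is accumulated sequentially in fp16, which may not match how any particular backend reduces. The helper names are hypothetical, and d=256 with x=32 is chosen only so that √d and all intermediates are exactly representable.

```python
import math
import struct


def fp16(v):
    """Round a float to the nearest IEEE half value; overflow becomes inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", v))[0]
    except OverflowError:
        return math.copysign(math.inf, v)


def fp16_sum(values):
    """Sequential accumulation with the running total kept in fp16."""
    acc = 0.0
    for v in values:
        acc = fp16(acc + v)
    return acc


def naive_rmsnorm(x, eps=1e-5):
    """Square first, then average: the sum of squares can exceed 65504."""
    d = len(x)
    mean_sq = fp16(fp16_sum(fp16(v * v) for v in x) / d)
    scale = fp16(1.0 / math.sqrt(fp16(mean_sq + eps)))
    return [fp16(v * scale) for v in x]


def safe_rmsnorm(x, eps=1e-5):
    """Divide by sqrt(d) before squaring, mirroring the rewritten form."""
    d = len(x)
    k = math.sqrt(d)
    xs = [fp16(v / k) for v in x]
    mean_sq = fp16(fp16_sum(fp16(v * v) for v in xs) / d)
    scale = fp16(1.0 / math.sqrt(fp16(mean_sq + eps / (k * k))))
    return [fp16(v * scale) for v in xs]


x = [32.0] * 256               # sum of squares = 262144, far above the fp16 maximum
naive_out = naive_rmsnorm(x)   # accumulator saturates to inf, so the scale collapses to 0
safe_out = safe_rmsnorm(x)     # scaled squares sum to 1024, well inside fp16 range
```

In this emulation the naive path silently returns zeros rather than raising, which mirrors how fp16 overflow tends to surface as degraded output rather than an error. The safe path recovers the expected unit-RMS result.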

Results:

Artifacts:

Key empirical law confirmed: Transformer 1×1-conv shards require ≥14 MB compiled INT8 for ANE placement. Shards below this threshold fall silently to CPU. The fix is always to scale the model, not to optimise the CPU path.