Experiment 27 - MicroGPT on ANE — Minimum Size Constraint Discovery
Date: 2026-05-03
Sources: Dragon Book §8.7 (Peephole Optimization) + Knuth TAOCP Vol. 2 Ch. 4 (arithmetic: numerics, overflow avoidance)
Context: Karpathy’s MicroGPT (gist / blog post 2026-02-12) is a 200-line educational GPT with scalar autograd. No pre-trained checkpoint exists — it is a training script. This experiment builds the full ANE pipeline: train from scratch, export weights, CoreML conv shard, Swift + Python chat runtime.
Problem discovered: The original MicroGPT architecture (n_embd=16, n_head=4,
block_size=16, n_layer=1) converted to a 0.03 MB compiled INT8 shard. Every
op fell to CPU — conv_ane=0/47, compute_ane=0/47. No error is raised; the ANE
cost model simply refuses sub-threshold graphs.
Root cause (empirical ANE law):
The ANE conv scheduler has a minimum compiled-shard size of approximately 14 MB
for transformer 1×1-conv graphs. Below this floor the cost model prefers CPU
scheduling regardless of op type. This threshold had been documented in
ANE_CHAIN_SCHEMA.md but was never triggered by the Hy-MT or Phi-4 shards
(both well above floor). MicroGPT’s toy architecture hit it for the first time.
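Because the fallback is silent, a size check on the compiled shard can flag a sub-threshold graph before any benchmarking. The sketch below is a hypothetical helper, not part of the project: the names `shard_size_mb` and `clears_ane_floor` are illustrative, MB is assumed to mean 10^6 bytes, and the 14.0 constant is the approximate empirical floor quoted above rather than a documented limit.

```python
from pathlib import Path

# Approximate empirical floor from this experiment (an assumption, not an Apple spec).
ANE_FLOOR_MB = 14.0


def shard_size_mb(path):
    """Total on-disk size of a compiled shard (file or directory) in MB (10^6 bytes)."""
    root = Path(path)
    if root.is_file():
        files = [root]
    else:
        files = [p for p in root.rglob("*") if p.is_file()]
    return sum(p.stat().st_size for p in files) / 1e6


def clears_ane_floor(path, floor_mb=ANE_FLOOR_MB):
    """True if the shard is at least as large as the assumed ANE size floor."""
    return shard_size_mb(path) >= floor_mb
```

A passing check is only a precondition. Residency still has to be confirmed from the per-op placement counts (conv_ane, compute_ane).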
Fix — scaling to clear the ANE floor:
The correct response per project policy is to move compute onto ANE, never to optimise a CPU fallback. The model was scaled to:
| Parameter | Original | Scaled |
|---|---|---|
| n_embd | 16 | 512 |
| n_head | 4 | 8 |
| head_dim | 4 | 64 |
| n_layer | 1 | 6 |
| block_size | 16 | 64 |
| Params | ~4,192 | ~18.9 M |
| Compiled INT8 size | 0.03 MB | 19.07 MB |
With n_embd=512 the shard is comfortably above the 14 MB floor.
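The parameter counts in the table can be reproduced with a short calculation. This is a sketch under stated assumptions, none of which are spelled out in the log: bias-free attention (four d×d matrices per layer), a 4× MLP (two d×4d matrices per layer), untied wte and lm_head, a learned wpe, parameter-free RMSNorm, and a character vocabulary of 27 symbols. The function names are hypothetical. The INT8 figure counts one byte per weight only and ignores container overhead, so it is a rough lower bound on compiled size.

```python
def count_params(n_embd, n_layer, block_size, vocab_size=27):
    """Weight count for a bias-free GPT block stack (assumed layout, see lead-in)."""
    attn = 4 * n_embd * n_embd            # Wq, Wk, Wv, Wo
    mlp = 2 * n_embd * (4 * n_embd)       # fc1 and fc2 with a 4x hidden width
    embeddings = vocab_size * n_embd + block_size * n_embd  # wte + wpe
    lm_head = vocab_size * n_embd         # untied output projection
    return n_layer * (attn + mlp) + embeddings + lm_head


def int8_weight_mb(n_params):
    """Weights-only size at one byte per parameter, in MB (10^6 bytes)."""
    return n_params / 1e6


original = count_params(n_embd=16, n_layer=1, block_size=16)
scaled = count_params(n_embd=512, n_layer=6, block_size=64)
print(original, scaled, round(int8_weight_mb(scaled), 2))
```

Under these assumptions the per-layer term 12·n_embd² dominates, so n_embd and n_layer are the levers that move a shard across the floor. n_head and block_size barely change the size.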
Safe-norm peephole (Dragon Book §8.7):
The original RMSNorm implementation squares and accumulates x directly in fp16.
The largest finite fp16 value is 65504, so x² overflows once |x| exceeds about
255, and an fp16 sum over many channels can overflow sooner still. The peephole
fix divides by √d before squaring, matching the pattern in gguf_to_ane.py:

    K = x.shape[1] ** 0.5    # √d, scalar
    xs = x * (1.0 / K)       # x / √d, keeps fp16 in range
    rms = (xs.pow(2).mean(dim=1, keepdim=True) + eps/(K*K)).rsqrt()  # rsqrt(mean(xs²) + ε/d)
    return (xs * rms).half()
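This is a textbook peephole. The unstable pattern:
\[ y = x \left(\frac{1}{d}\sum_i x_i^2 + \varepsilon\right)^{-1/2} \]
is rewritten as:
\[ y = \frac{x}{\sqrt{d}} \left(\frac{1}{d}\sum_i \left(\frac{x_i}{\sqrt{d}}\right)^2 + \frac{\varepsilon}{d}\right)^{-1/2} \]
The two forms are mathematically identical: the bracket in the second form is the first bracket divided by d, so its inverse square root is larger by a factor of √d, which cancels the 1/√d applied to x. The second form is numerically safe because every squared term is d times smaller, and it is the form preferred by the ANE cost model for norm ops.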
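The rewrite can be checked numerically. The sketch below is an illustration, not the project's code: it evaluates both forms in ordinary Python floats, so it demonstrates the algebraic identity and the size of the largest squared intermediate. It does not emulate fp16 rounding or ANE behaviour. The function names, the test vector, and the eps value are all hypothetical.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def rmsnorm_naive(x, eps):
    """x * rsqrt(mean(x^2) + eps); also returns the largest squared term."""
    d = len(x)
    squares = [v * v for v in x]
    scale = 1.0 / math.sqrt(sum(squares) / d + eps)
    return [v * scale for v in x], max(squares)


def rmsnorm_safe(x, eps):
    """Pre-scale by 1/sqrt(d) before squaring, as in the peephole rewrite."""
    d = len(x)
    k = math.sqrt(d)
    xs = [v / k for v in x]
    squares = [v * v for v in xs]
    scale = 1.0 / math.sqrt(sum(squares) / d + eps / (k * k))
    return [v * scale for v in xs], max(squares)


# Illustrative input: d = 512 channels with a few large activations.
x = [300.0, -150.0, 75.0, 10.0] * 128
eps = 1e-5  # assumed value; the log does not state eps

y_naive, peak_naive = rmsnorm_naive(x, eps)
y_safe, peak_safe = rmsnorm_safe(x, eps)

print("naive peak fits fp16:", peak_naive <= FP16_MAX)  # 300^2 = 90000 exceeds FP16_MAX
print("safe peak fits fp16: ", peak_safe <= FP16_MAX)
print("outputs agree:", all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(y_naive, y_safe)))
```

The pre-scaled form shrinks each squared term by a factor of d, which moves the element-wise overflow threshold from about 255 to about 255·√d.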
Results:
- Training: 18.9 M params, 5000 steps, Adam (β=(0.85, 0.99)), linear LR decay, dataset = 32,033 baby names (character-level), final loss 1.60.
- CoreML shard: local-artifacts/microgpt_shards/MicroGPT.mlpackage + .mlmodelc (19.07 MB).
- ANE residency: conv_ane=37/37, compute_ane=260/260, PASS=True, 100% ANE.
- Swift runtime: local-artifacts/microgpt_ane_runtime, stateful KV cache (MLState API), FLOAT16 conv shard, host-side embedding lookup + argmax.
- Benchmark: ~1535 tok/s warm (500 names, 3352 tokens in 2.18 s).
- Sample output: karrin, avian, ana, alina, jelah, dari — plausible name-like forms.
Artifacts:
- local-artifacts/microgpt_train.py: PyTorch training script (.venv)
- local-artifacts/microgpt_to_ane.py: CoreML conversion + compile (Xcode python3)
- local-artifacts/microgpt_export_runtime.py: wte/wpe fp16 bin export (.venv)
- local-artifacts/microgpt_ane.swift / microgpt_ane_runtime: Swift CLI
- python/microgpt_ane_chat.py: Python wrapper
- local-artifacts/microgpt_ane/: weights, vocab JSON, fp16 bins, manifest
Key empirical law confirmed: Transformer 1×1-conv shards require ≥14 MB compiled INT8 for ANE placement. Shards below this threshold fall silently to CPU. The fix is always to scale the model, not to optimise the CPU path.