2026-05-14 - T4.3 CLOSED: All-FP16 ANE inference passes golden gate
Intent recorded before this session: Move all Gemma-4-26B-A4B FFN shards from GPU to ANE by splitting from 2 sub-shards to 8 sub-shards at FP16.
Root cause confirmed: The original q8c FFN shards were 364 MB (p0of2) and 398 MB (p1of2) compiled — both above the empirically validated ~250 MB ANE shard limit. CoreML silently placed them on GPU. GPU float16 ≠ ANE float16 numerics + INT8 quantization error compounded across 30 layers, producing wrong decode tokens [236881, 236881] instead of [669, 5279].
Fix applied:
- Rebuilt all 30 FFN layers with
--ffn-shards 8 --quant-bits 0 - Each sub-shard: 1 expert pack (16 experts) ≈ 182 MB (p0–p6) or 216 MB (p7, includes combiner + norms)
- All 8 sub-shards per layer land on ANE — confirmed within the 250 MB limit
- All 30 attn shards also FP16 (rebuilt in prior session to fix global attn INT8 per-channel error)
- Total: 30 layers × (1 attn + 8 FFN) = 270 compiled mlmodelc files, all on ANE
- Production meta: helper script
Gate results (7-token prompt [3689,563,506,5279,529,7001,236881], 2 decode steps):
- Prompt pos 0–6 cosine vs
gemma_golden.npz[logits_full]: 0.9997, 0.9996, 0.9977, 0.9980, 0.9944, 0.9982, 0.9957 — all ≥ 0.97 PASS - Decode pos 0 cosine vs
gemma_golden.npz[next_token_logits][0]: 0.9976 PASS - Decode tokens:
[669, 5279]— exact match with HF reference
Key lesson (burn this in): CoreML does NOT warn when a shard exceeds the ANE limit — it silently falls back to GPU. The only reliable check is du -sh *.mlmodelc: if any shard > 250 MB, it’s on GPU regardless of the computeUnits = .cpuAndNeuralEngine flag. The fix is always to split further, never to optimise the GPU path.
Timing: 37 s per layer for 8-shard FP16 FFN export (Xcode python3, M4 Max). Full rebuild of 29 layers ≈ 18 min. TTFT with all-ANE: ~208 s (model load + 7-tok prefill, not optimised). Per-token decode: ~29 s (270 shards sequential, not optimised).
Dead end noted: Trying INT8 per-channel quantization on attn shards caused >0.03 cosine drop per global attention layer, cascading to 0.55 cosine at L25. FP16 attn is mandatory for quality.
Dead end noted: 2-shard FFN (even FP16) would be 364/398 MB → GPU. The 8-shard split is the minimum to stay under ANE limit for this model.
Next: ANE residency probe on one rebuilt FFN shard (project policy: ane-validator gate before scale-out). Then INT4 palettization investigation as next compression path.
Refs: research/ANE_CHAIN_SCHEMA.md