2026-05-14 - T4.1.5 CLOSED: Full 16-token decode exact match on all-FP16 ANE stack

Intent: Verify that the all-FP16 ANE stack (T4.3, all 270 shards on ANE) produces correct output across a full 16-token autoregressive decode, closing the T4 correctness milestone.

Setup: Runtime: gemma_swift_head_meta_allfp16.json, 270 shards (30 attn + 30×8 FFN, all FP16, all ANE). Prompt: [3689, 563, 506, 5279, 529, 7001, 236881] (7 tokens). Decode: --n-new 16. Reference: gemma_golden.npz[next_token_ids]. Hardware: M4 Max, no sudo, unoptimised sequential shard-reload path.

Result: Generated [669, 5279, 529, 7001, 236881, 669, 5279, 529, 7001, 236881, 669, 5279, 529, 7001, 236881, 669], an exact 16/16 match against the reference. T4 correctness milestone closed. Timing baseline: TTFT ~212 s (model load + 7-tok prefill), decode 28.9 s/tok (~0.035 tok/s), with all 270 shards reloaded sequentially for every token.
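
Sketch (illustrative, not the harness used for this run): one way to express the exact-match check is to report the index of the first diverging token rather than a bare pass/fail, since that index is what a divergence investigation starts from. The helper name first_divergence is hypothetical. The reference list is a stand-in equal to the generated output (which is what a 16/16 match implies); in practice it would presumably be loaded from gemma_golden.npz[next_token_ids], whose exact array layout is not stated in this entry.

```python
from typing import Optional, Sequence


def first_divergence(generated: Sequence[int],
                     reference: Sequence[int]) -> Optional[int]:
    """Return the index of the first mismatching token, or None on exact match.

    A length mismatch counts as a divergence at the end of the shorter sequence.
    """
    for i, (g, r) in enumerate(zip(generated, reference)):
        if g != r:
            return i
    if len(generated) != len(reference):
        return min(len(generated), len(reference))
    return None


# Token IDs copied from the Result line of this entry.
generated = [669, 5279, 529, 7001, 236881,
             669, 5279, 529, 7001, 236881,
             669, 5279, 529, 7001, 236881,
             669]

# Stand-in for the golden reference (assumption: loading is omitted so the
# sketch runs without the .npz file).
reference = list(generated)

idx = first_divergence(generated, reference)
if idx is None:
    print(f"exact match: {len(generated)}/{len(reference)}")
else:
    print(f"diverged at index {idx}: {generated[idx:idx + 1]} vs {reference[idx:idx + 1]}")
```

Reporting the index keeps the check useful when it fails: the first mismatching position is the natural starting point for per-layer attribution.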

Surprise / hurdle: The prior investigation (2026-04-24, Row 7 divergence 506 → 9405) went through several rounds of hidden-boundary attribution, gamma amplification analysis, and layer-27/28/29 debug taps before the GPU-FFN root cause was confirmed. In hindsight, du -sh *.mlmodelc would have identified the over-limit shards in seconds. The debugging took several days; the fix was a single build flag (--ffn-shards 8).

Lesson: Before debugging hidden-state divergence in a multi-shard ANE stack, always run du -sh *.mlmodelc first. Any shard larger than 250 MB silently runs on the GPU instead of the ANE, and without this check, GPU numerical drift compounding across 30 layers is indistinguishable from a model bug.
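
Sketch (illustrative): the same size check can be scripted so that it runs as a pre-flight step before any decode. The function names and the idea of wiring this into the build are assumptions; only the 250 MB threshold and the *.mlmodelc bundle naming come from this entry. Two caveats: the script sums apparent file sizes, whereas du reports disk usage, so results near the threshold may differ slightly, and treating "MB" as MiB (2**20 bytes) is also an assumption.

```python
from pathlib import Path

# Threshold taken from this log entry; interpreting MB as MiB is an assumption.
LIMIT_BYTES = 250 * 1024 * 1024


def bundle_size_bytes(bundle: Path) -> int:
    """Sum the apparent sizes of all regular files inside a .mlmodelc bundle."""
    return sum(p.stat().st_size for p in bundle.rglob("*") if p.is_file())


def oversized_shards(root: Path, limit_bytes: int = LIMIT_BYTES) -> list:
    """Return (bundle name, size in bytes) for each *.mlmodelc directory
    directly under root whose total size exceeds limit_bytes."""
    flagged = []
    for bundle in sorted(root.glob("*.mlmodelc")):
        if not bundle.is_dir():
            continue
        size = bundle_size_bytes(bundle)
        if size > limit_bytes:
            flagged.append((bundle.name, size))
    return flagged


def report(root: Path) -> None:
    """Print one line per over-limit shard, or a single all-clear line."""
    flagged = oversized_shards(root)
    if not flagged:
        print("all shards within limit")
    for name, size in flagged:
        print(f"{name}: {size / 2**20:.1f} MiB exceeds limit")
```

Calling report(Path(".")) from the shard directory is the scripted equivalent of eyeballing the du output; a non-empty result from oversized_shards could be turned into a hard failure in a build step.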

Next: T4 correctness is closed. The ANE chain primitive work (Rounds 2–3 in ANE_CHAIN_SCHEMA.md), which aims to eliminate the per-token shard reload overhead, is now the primary performance path. The 28.9 s/tok figure is the unoptimised correctness baseline to beat.
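
Sketch (back-of-envelope only): dividing the baseline by the shard count gives a rough mean cost per shard, which is the quantity the reload-elimination work targets. This assumes the per-token time is spread evenly across shards, which this entry did not measure; the helper name is hypothetical and the inputs are the figures reported above.

```python
def per_shard_seconds(seconds_per_token: float, shards_per_token: int) -> float:
    """Mean wall-clock time per shard, assuming an even split across shards."""
    return seconds_per_token / shards_per_token


SHARDS = 30 + 30 * 8   # 30 attention shards + 30 layers x 8 FFN shards (from Setup)
S_PER_TOK = 28.9       # measured decode time per token (from Result)
N_NEW = 16             # tokens decoded in this run

print(f"shards per token:    {SHARDS}")
print(f"throughput:          {1 / S_PER_TOK:.4f} tok/s")
print(f"mean time per shard: {per_shard_seconds(S_PER_TOK, SHARDS) * 1000:.0f} ms")
print(f"decode time, {N_NEW} tok: {N_NEW * S_PER_TOK:.1f} s")
```

At roughly 107 ms per shard on average, the baseline is consistent with reload overhead dominating, but confirming that would need a per-shard timing breakdown.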

Refs: research/ANE_CHAIN_SCHEMA.md