2026-04-29 - Phi Full-Stack GGUF Reference Gate Blocks Q8 Chat

Intent: Record the full-stack Phi-4-mini GGUF reference result as a validation gate only, not an inference shortcut, following validation-before-performance discipline and Knuth-style end-to-end sequential verification.

Result: The GGUF fp16 reference predicted first token 168394, decoded as a code fence token, with top8 [168394,1,1385,95839,62915,26178,185334,13225]. A 16-token reference run generated IDs [168394,259,7585,198,1314,61400,8595,271,2123,568,13225,11,5922,0,200020], decoded roughly as a code-fence/code path. The current ANE q8 runtime previously predicted 182298 and went into a Russian greeting path. Existing per-layer, multi-token state, and LM-head shard gates were green.

Surprise / hurdle: Shard-local gates were insufficient: all local quality/residency checks can pass while full-stack chat prefill+decode still diverges at the first generated token on a real prompt.

Lesson: Phi q8 chat must remain blocked until a full-stack prefill+decode golden gate passes against the GGUF reference.

Next: Run an ANE-vs-reference layer trace on the real prompt to localize the drift; consider targeted FP16 or mixed rebuild only after localization, ANE residency, and full-stack quality gates pass. Do not run long conversions from this finding alone.

Refs: runtime/phi4_mini_ane.swift; research/ANE_CHAIN_SCHEMA.md