2026-04-28 - Phi Private E5 One-Stream Timing Reality Check

Intent: Measure whether the validated private one-stream E5 path materially improves Phi decode latency by removing public CoreML hidden-state roundtrips between fused layer shards.

Setup: Added --iterations and --warmup-iterations to e5_two_op_stream_probe --manual-chain-all. Ran the full 16+8+6+2 fused layer stack with 10 warmup executes and 100 measured executes. Re-ran the public Swift runtime on phi4mini_runtime_meta_16_8_6_2.json with 5 warmup calls, 30 generated tokens, and --profile.
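Sketch: the warmup-then-measure pattern behind the iteration counts above, written in Python purely for illustration. It is not the probe's actual code; `time_executes` and `stand_in_execute` are hypothetical names, and the workload is a placeholder for one execute of the fused layer stack.

```python
import time


def time_executes(execute, warmup_iterations=10, iterations=100):
    """Return mean wall-clock milliseconds per call, excluding warmup calls."""
    # Warmup calls are run but not timed, so one-time setup costs
    # (caches, lazy initialization) do not skew the mean.
    for _ in range(warmup_iterations):
        execute()

    start = time.perf_counter()
    for _ in range(iterations):
        execute()
    elapsed_s = time.perf_counter() - start

    return elapsed_s * 1000.0 / iterations


def stand_in_execute():
    # Hypothetical placeholder workload, NOT the real layer stack.
    sum(i * i for i in range(1000))


if __name__ == "__main__":
    mean_ms = time_executes(stand_in_execute)
    print(f"{mean_ms:.3f} ms/execute")
```

Timing the whole measured loop once and dividing by the iteration count keeps timer overhead out of the per-execute figure, at the cost of hiding per-iteration variance.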

Result: The private stream stayed correct, with final hidden sum -196.834778. Private one-stream layers measured 52.593 ms/execute; public CoreML layers measured 53.121 ms/token. Public decode was 17.179 tok/s, with head_predict_reduce_ms=5.082.

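Surprise / hurdle: The private stream win is real but small: about 0.53 ms/token (53.121 - 52.593 ms, roughly 1% of layer time, treating one execute of the full stack as one token's layer pass) for this already-fused topology. The host hidden-state roundtrip is therefore not the primary bottleneck once the topology is 16+8+6+2.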
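Sketch: the arithmetic behind that comparison, using only the figures from the Result entry. Two assumptions are mine, not measurements: one private execute stands in for one token's layer pass, and everything outside the layers (LM head, sampling, other overhead) stays unchanged if the private path replaces the public one. The projected throughput is therefore an estimate, not a result.

```python
# Measured figures copied from the Result entry.
PRIVATE_LAYERS_MS = 52.593   # private one-stream layers, ms/execute
PUBLIC_LAYERS_MS = 53.121    # public CoreML layers, ms/token
HEAD_MS = 5.082              # head_predict_reduce_ms
PUBLIC_TOK_S = 17.179        # public decode throughput

# Per-token saving from removing the host hidden-state roundtrips.
saving_ms = PUBLIC_LAYERS_MS - PRIVATE_LAYERS_MS
saving_pct_of_layers = 100.0 * saving_ms / PUBLIC_LAYERS_MS

# End-to-end public token time implied by the measured throughput.
public_token_ms = 1000.0 / PUBLIC_TOK_S

# ASSUMPTION: only the layer time changes; all other per-token costs are fixed.
projected_token_ms = public_token_ms - saving_ms
projected_tok_s = 1000.0 / projected_token_ms

# Share of the public token time spent in the LM head.
head_pct_of_token = 100.0 * HEAD_MS / public_token_ms

if __name__ == "__main__":
    print(f"saving: {saving_ms:.3f} ms/token ({saving_pct_of_layers:.2f}% of layer time)")
    print(f"public token time: {public_token_ms:.3f} ms")
    print(f"projected private decode: {projected_tok_s:.3f} tok/s")
    print(f"LM head share of token time: {head_pct_of_token:.1f}%")
```

Under those assumptions the projected decode rate comes out at roughly 17.3 tok/s, a gain of under 0.2 tok/s over the public path, while the LM head alone accounts for close to 9% of each token.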

Lesson: Private E5 chaining is a capability breakthrough, not an immediate large throughput breakthrough for the current Phi topology. It may matter more for finer sharding, but current speed work should focus on ANE compute shape/topology and LM-head latency.

Next: Keep the private chain as a validated research path; prioritize higher-leverage public/ANE optimizations unless a future topology needs many more shard boundaries.

Refs: runtime/phi4_mini_ane.swift; research/ANE_CHAIN_SCHEMA.md