2026-05-12 - Exp 26 Follow-Up: Prompt-Length Sweep for N-Gram Speculative Decode
Intent: Characterise how n-gram speculative decode speedup scales with context length by running a 4-point sweep (100 → 200 → 372 → 800 tokens) and measuring how close it gets to the 2.04× simulated upper bound established in Exp 23. Working hypothesis: longer tiled prompts contain more repeated n-grams, so the draft lookup matches more often and the acceptance rate should rise monotonically but sublinearly with prompt length; the sweep is the empirical trace of that curve.
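For orientation, the drafting step this sweep exercises can be sketched as a generic prompt-lookup matcher. This is a minimal illustration and not the runtime's actual implementation; the function name `ngram_draft` and the parameters `n` and `max_draft` are hypothetical.

```python
from typing import List, Sequence


def ngram_draft(context: Sequence[int], n: int = 3, max_draft: int = 3) -> List[int]:
    """Propose draft tokens by finding the last n tokens earlier in the context.

    Hypothetical sketch: returns the tokens that followed the most recent
    earlier occurrence of the trailing n-gram, or [] when there is no match.
    """
    if len(context) <= n:
        return []
    tail = list(context[-n:])
    # Scan right to left so the most recent earlier occurrence wins.
    # The last valid start is len(context) - n - 1; one further is the tail itself.
    for start in range(len(context) - n - 1, -1, -1):
        if list(context[start:start + n]) == tail:
            return list(context[start + n:start + n + max_draft])
    return []


if __name__ == "__main__":
    unit = [1, 2, 3, 4, 5, 6]
    print(ngram_draft(unit))        # no repeated trigram, so no draft
    print(ngram_draft(unit * 2))    # tiling the prompt creates a match
```

Tiling the same token sequence twice is enough to turn a miss into a hit, which is the mechanism the hypothesis above relies on: more repeated context gives the lookup more chances to match.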
Setup: Runtime local artifacts; shards local artifacts (RangeDim T=1..4, 100% ANE residency, topology 20+4+6+2); manifest phi4mini_runtime_meta_rope96_rangedim_20_4_6_2.json. Prompts tiled from temporary output (dense Swift CoreML code: MLMultiArray, MLModel, etc.). 5 reps per length, 80 new tokens per request. Single daemon session; JIT paid once (T=1 JIT 113.4s, T=4 JIT 140.8s). Sweep script: helper script. Raw log: temporary output.
Result:
| Prompt length | Decode tok/s | Prefill tok/s | Wall/req | Speedup vs T=1 (17.8 tok/s) |
|---|---|---|---|---|
| 100 tokens | 21.1 | 70.1 | 5.17s | 1.19× |
| 200 tokens | 22.1 | 70.3 | 6.42s | 1.24× |
| 372 tokens | 26.7 | 70.1 | 8.26s | 1.50× |
| 800 tokens | 28.9 | 69.9 | 14.19s | 1.62× |
Prefill holds at ~70 tok/s across all lengths (69.9 to 70.3), so the T=4 chunked path scales cleanly. Decode speedup rises monotonically with prompt length and has not saturated at 800 tokens. Artifacts: helper script (sweep script), temporary output (raw output), the validation-first notes Exp 26 section updated with prompt-length sweep table, local runtime notes updated with sweep curve data.
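As a consistency check on the table, the speedup column and the wall time per request can be recomputed from the other columns. The sketch below assumes wall time is simply prefill time plus decode time for the 80 new tokens, with no other overhead; that decomposition is an assumption and is not taken from the sweep script.

```python
BASELINE_T1 = 17.8   # tok/s, T=1 decode baseline from the table header
NEW_TOKENS = 80      # new tokens per request, from Setup

# (prompt tokens, decode tok/s, prefill tok/s, measured wall s/req)
ROWS = [
    (100, 21.1, 70.1, 5.17),
    (200, 22.1, 70.3, 6.42),
    (372, 26.7, 70.1, 8.26),
    (800, 28.9, 69.9, 14.19),
]


def speedup(decode_tps: float, baseline: float = BASELINE_T1) -> float:
    """Decode speedup relative to the T=1 baseline."""
    return decode_tps / baseline


def estimated_wall(prompt: int, decode_tps: float, prefill_tps: float,
                   new_tokens: int = NEW_TOKENS) -> float:
    """Assumed model: wall time = prefill time + decode time."""
    return prompt / prefill_tps + new_tokens / decode_tps


if __name__ == "__main__":
    for prompt, dec, pre, wall in ROWS:
        est = estimated_wall(prompt, dec, pre)
        print(f"{prompt:>4} tok  speedup {speedup(dec):.2f}x  "
              f"est. wall {est:.2f}s  measured {wall:.2f}s")
```

Under that assumption the estimates land within 0.1 s of every measured wall time, so the decode, prefill, and wall columns are mutually consistent.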
Surprise / hurdle: The 1.62× at 800 tokens approaches but has not reached the 2.04× simulated ceiling, and the curve is still rising between 372 and 800 tokens (1.50× → 1.62×), so the acceptance rate has evidently not peaked. The gap implies that either some T=4 verify calls accept fewer than 4 tokens, or occasional fallbacks to T=1 persist at longer contexts. The sweep also shows that the 372-token point sits mid-curve rather than near saturation, so previous Exp 26 reports should not be cited as a plateau.
Lesson: N-gram acceptance rate depends strongly on context length and density and has not saturated by 800 tokens; every speedup claim should state the prompt length it was measured at.
Next: Extend the sweep to 1200–2048 tokens to find the saturation knee; instrument per-call draft acceptance count to close the measured-vs-simulated ceiling gap; if the curve has not flattened by 2048 tokens, revisit the Exp 23 upper-bound simulation assumptions.
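The per-call acceptance instrumentation could be as small as a histogram keyed by tokens emitted per verify call. The sketch below is one possible shape, not existing runtime code; the class name `AcceptanceStats` and its methods are hypothetical, and it assumes each call emits between 1 and 4 tokens (a T=1 fallback counting as 1).

```python
from collections import Counter


class AcceptanceStats:
    """Tally of tokens emitted per verify call (hypothetical helper)."""

    def __init__(self, max_tokens: int = 4) -> None:
        self.max_tokens = max_tokens
        self.hist: Counter = Counter()

    def record(self, emitted: int) -> None:
        """Record one verify call that emitted `emitted` tokens."""
        if not 1 <= emitted <= self.max_tokens:
            raise ValueError(f"emitted must be in 1..{self.max_tokens}")
        self.hist[emitted] += 1

    def calls(self) -> int:
        return sum(self.hist.values())

    def mean_tokens_per_call(self) -> float:
        calls = self.calls()
        if calls == 0:
            return 0.0
        return sum(k * v for k, v in self.hist.items()) / calls

    def full_accept_rate(self) -> float:
        """Fraction of calls that emitted the maximum number of tokens."""
        calls = self.calls()
        return self.hist[self.max_tokens] / calls if calls else 0.0


if __name__ == "__main__":
    stats = AcceptanceStats()
    for emitted in [4, 4, 1, 2, 4]:   # illustrative values, not measured data
        stats.record(emitted)
    print(dict(stats.hist), stats.mean_tokens_per_call(), stats.full_accept_rate())
```

Logging this histogram per prompt length would separate the two candidate explanations for the gap: a large count at 1 points to T=1 fallbacks, while mass at 2 and 3 points to partial acceptance within T=4 calls.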
Refs: research/ANE_CHAIN_SCHEMA.md