Experiment 25 - Prompt-Lookup Force Mode as a Head-Skip Ceiling
Sources: Knuth pattern matching + Dechter constraint propagation + public
CoreML MLState.withMultiArray(for:) state access
Public CoreML state access is better than expected: Python exposes
MLState.read_state/write_state, and the Swift SDK exposes
MLState.withMultiArray(for:). This means exact state copy is possible without
private API. It does not by itself create speculative speedup, because copying
the full Phi KV cache is a large host memory transfer and a single-token verifier
still performs one ANE layer pass per target token. Real pass-count speedup still
needs batch-token layer artifacts or another way to verify multiple tokens per
ANE call.
To quantify the cheap public ceiling, phi4_mini_ane.swift now has an
experimental approximate mode:
--ngram-force --ngram-min 2 --ngram-max 8
Unlike --ngram-probe, this changes generation: if prompt lookup finds a prior
suffix match, the runtime forces the proposed token and skips the ANE LM-head
prediction/reduction for that step. The ANE layer stack still runs so KV state
stays aligned with the emitted token stream.
Code-shaped suite result on the same 5 prompts / 95 decode tokens:
| mode | decode tokens | decode seconds | weighted tok/s | avg layer ms | avg head ms |
|---|---|---|---|---|---|
exact greedy + --ngram-probe |
95 | 5.605536 | 16.948 | 53.876 | 5.120 |
approximate --ngram-force |
95 | 5.269287 | 18.029 | 54.755 | 0.703 |
--ngram-force forced 82 of 100 target opportunities (force_rate=0.820) and
reduced mean head time by about 4.4 ms/token, but total speed improved only
~6.4% because the layer stack is now dominant. This is useful as a ceiling
measurement and maybe as a workload-specific approximate mode, but it is not an
exact speculative decoder and should not be the default shipping path without a
task-quality gate.