Experiment 32 - ZAYA1-8B Speculative Decode (T=4 Verifier + n-gram) [IMPLEMENTED; BOTTLENECKED]

Date: 2025-05
Objective: Port n-gram speculative decode from HyMT (Exp 28) to ZAYA1-8B, using the Exp 31 CCA stateful shards, which already carry rangedim_t_max: 4.

Key finding: ZAYA’s MoE-dominated compute makes T=4 batch decode ineffective without T=4 MoE shards. The attention layers (40 shards, T=4 enabled) account for only ~12% of wall-clock time per decode step (~15 ms of ~130 ms); the MoE layers (40 shards, T=1 fixed) account for ~85% (~110 ms).

Architecture analysis

| Compute | Per decode step | T=4 batch behaviour |
|---|---|---|
| Attn (40 shards, RangeDim T=1..4) | ~15 ms | ~15 ms for 4 tokens (4× cheaper per token) |
| MoE (40 shards, T=1 fixed) | ~110 ms | 4 × 110 ms = 440 ms (not cheaper) |
| LM head (3 shards, T=1) | ~5 ms | 4 × 5 ms = 20 ms |
| T=1 total | ~130 ms/tok = 7.7 tok/s | |
| T=4 verifier total | | 475 ms for 4-token batch |
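The totals in the table follow from a simple rule: a shard set that accepts a T=4 batch runs once per verifier call, while a T=1 shard set runs once per token. The sketch below reproduces that arithmetic. The dictionary layout and the name `step_cost_ms` are illustrative only and do not come from the experiment code; the latencies are the approximate figures listed above.

```python
# Per-component decode latencies in ms, taken from the table above.
# Second field: True if the shard set accepts a T=4 batch in a single call.
COMPONENTS = {
    "attn": (15.0, True),      # RangeDim T=1..4
    "moe": (110.0, False),     # T=1 fixed
    "lm_head": (5.0, False),   # T=1
}


def step_cost_ms(batch_tokens: int) -> float:
    """Cost of pushing `batch_tokens` tokens through every component once."""
    total = 0.0
    for latency_ms, batched in COMPONENTS.values():
        # A batched component runs once; a T=1 component runs once per token.
        total += latency_ms if batched else latency_ms * batch_tokens
    return total


t1 = step_cost_ms(1)
t4 = step_cost_ms(4)
print(f"T=1: {t1:.0f} ms/tok ({1000 / t1:.1f} tok/s)")   # 130 ms/tok (7.7 tok/s)
print(f"T=4 verifier: {t4:.0f} ms per 4-token batch")    # 475 ms
```

Flipping the `moe` entry to `True` shows how much of the verifier cost comes from the unbatched MoE shards.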

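Break-even condition: if a fraction p of the 3 draft tokens is accepted on average, each 475 ms verifier call yields 1 + 3p tokens (one guaranteed token plus the accepted drafts). Speculation beats T=1 decode only when:

\[ \frac{1 + 3p}{475\ \text{ms}} > \frac{1}{130\ \text{ms}} \]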

which means:

\[ p > \frac{475/130 - 1}{3} \approx 0.885 \]

That is, an 88.5% n-gram acceptance rate is required for any speedup.

Measured acceptance on synthetic prompts was 1.8%. Even with perfect acceptance (p = 1.0, all 3 draft tokens accepted on every call), the speedup would be only:

\[ \frac{(1 + 3) \times 130\ \text{ms}}{475\ \text{ms}} \approx 1.09\times \]

That is only a 9% improvement.
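The break-even threshold and the best-case speedup can be checked numerically. The sketch below uses the same linear model as the text (1 + 3p tokens per verifier call) and the latencies from the architecture table. The function names are hypothetical and exist only for this illustration.

```python
T1_MS = 130.0        # T=1 cost per token (from the table)
VERIFY_MS = 475.0    # T=4 verifier cost per call (from the table)
DRAFTS = 3           # draft tokens checked per verifier call


def expected_tokens(p: float) -> float:
    """Linear model used in this note: one guaranteed token plus 3p drafts."""
    return 1.0 + DRAFTS * p


def speedup(p: float, verify_ms: float = VERIFY_MS) -> float:
    """Speculative throughput relative to the T=1 baseline."""
    return expected_tokens(p) * T1_MS / verify_ms


def break_even(verify_ms: float = VERIFY_MS) -> float:
    """Acceptance rate at which speedup(p) equals 1."""
    return (verify_ms / T1_MS - 1.0) / DRAFTS


print(f"break-even p       = {break_even():.3f}")     # 0.885
print(f"speedup at p=1.0   = {speedup(1.0):.2f}x")    # 1.09x
print(f"speedup at p=0.018 = {speedup(0.018):.2f}x")  # 0.29x
```

At the measured 1.8% acceptance, the model predicts roughly 0.29× the baseline throughput, which is a slowdown.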

Implementation status

local-artifacts/zaya_ane.swift: complete and correct.

The implementation routes through runGeneration when --speculative is passed; the infrastructure is fully in place for when T=4 MoE shards are available.
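For readers unfamiliar with n-gram speculative decode, the following is a minimal Python sketch of the general technique, not a transcription of zaya_ane.swift. The draft step looks for the most recent earlier occurrence of the trailing n-gram and proposes the tokens that followed it; the verifier then keeps the leading drafts that match its own output. The names `propose_draft` and `count_accepted` and the parameter `ngram_max` are hypothetical, and `ngram_min` is only assumed to play the role of the --ngram-min flag.

```python
from typing import List, Sequence


def propose_draft(tokens: Sequence[int], ngram_min: int = 1,
                  ngram_max: int = 3, num_draft: int = 3) -> List[int]:
    """Return up to `num_draft` tokens that followed the most recent earlier
    occurrence of the trailing n-gram, trying the longest n first."""
    for n in range(min(ngram_max, len(tokens) - 1), ngram_min - 1, -1):
        suffix = list(tokens[-n:])
        # Scan right-to-left, skipping the trailing n-gram itself.
        for start in range(len(tokens) - n - 1, -1, -1):
            if list(tokens[start:start + n]) == suffix:
                return list(tokens[start + n:start + n + num_draft])
    return []  # no match: fall back to ordinary T=1 decode


def count_accepted(draft: Sequence[int], verified: Sequence[int]) -> int:
    """Number of leading draft tokens that agree with the verifier's tokens."""
    accepted = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted += 1
    return accepted


history = [5, 6, 7, 8, 5, 6]
draft = propose_draft(history)
print(draft)                             # [7, 8, 5]
print(count_accepted(draft, [7, 8, 9]))  # 2
```

The acceptance rate reported in this note corresponds to accepted drafts divided by proposed drafts, accumulated over all verifier calls.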

Benchmark results

| Mode | Prompt | max_new | Decode tok/s | vs Baseline |
|---|---|---|---|---|
| T=1 baseline | 41-tok | 40 | 7.66 | |
| --speculative --ngram-min 1 | 41-tok | 40 | 2.01 | −74% (MoE bottleneck) |

Acceptance rate: 1.8% (synthetic prompt; real code prompts may reach 60–80%, but the break-even rate is about 88.5%, so even those would not yield a speedup).

Conclusion and next step

The --speculative flag is implemented and correct. Real speedup requires T=4 MoE shards (Exp 33). The ZAYA MoE shard exporter (local-artifacts/zaya_full_convert.py) would need ct.RangeDim(lower_bound=1, upper_bound=4, default=1) added to the batch-token axis, and the shards would have to be recompiled (~40 shards × 193 MB compiled = ~7.7 GB). With T=4 MoE, the verifier cost drops from 475 ms to ~130 ms and the break-even acceptance rate falls to p > 0 (any n-gram hit is beneficial), matching the HyMT Exp 28 result (+62%).
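The projected effect of T=4 MoE shards can be sketched with the same latency figures. This assumes a batched MoE call at T=4 costs about the same as one T=1 call, which has not been measured here. The sketch also separates out the LM head: if it stays at T=1, the table's numbers suggest a verifier cost nearer 145 ms and a small but non-zero break-even rate. All function names are hypothetical.

```python
T1_MS = 130.0                                   # T=1 cost per token, from the table
ATTN_MS, MOE_MS, LM_HEAD_MS = 15.0, 110.0, 5.0  # per-call shard costs, from the table
DRAFTS = 3                                      # draft tokens per verifier call


def verifier_ms(moe_batched: bool, lm_head_batched: bool, batch: int = 4) -> float:
    """Verifier cost for one batch; attention is already batched (T=1..4)."""
    moe = MOE_MS if moe_batched else MOE_MS * batch
    head = LM_HEAD_MS if lm_head_batched else LM_HEAD_MS * batch
    return ATTN_MS + moe + head


def break_even(verify_ms: float) -> float:
    """Smallest acceptance rate p with (1 + 3p) / verify_ms >= 1 / T1_MS."""
    return max(0.0, (verify_ms / T1_MS - 1.0) / DRAFTS)


for label, moe_b, head_b in [
    ("current (MoE T=1, LM head T=1)", False, False),
    ("T=4 MoE, LM head T=1", True, False),
    ("T=4 MoE, T=4 LM head", True, True),
]:
    cost = verifier_ms(moe_b, head_b)
    print(f"{label}: {cost:.0f} ms, break-even p = {break_even(cost):.3f}")

# Compiled shard footprint quoted in the text: 40 shards x 193 MB each.
print(f"MoE shards: {40 * 193 / 1000:.1f} GB")
```

Under these assumptions the three scenarios give 475 ms, 145 ms, and 130 ms per verifier call, with break-even rates of about 0.885, 0.038, and 0.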

Reference: [EoP §2] — zero-alloc hot path; [Concrete Math Ch.9] — n-gram cost; [Dragon Book §8] — prefill head-skip optimisation.