Experiment 35 - ZAYA1-8B MoE INT4pal (per-grouped-channel palettization, group_size=32) [COMPLETE]
Date: 2026-05-13
Objective: Replace INT8 per-tensor MoE shards (Exp 34, 202 MB each) with INT4
per-grouped-channel palettized shards (OpPalettizerConfig(mode="uniform", nbits=4,
granularity="per_grouped_channel", group_size=32) → constexpr_lut_to_dense ops),
halving compiled shard size (~101 MB) and improving T=1 baseline throughput.
Shard build
Script: python/zaya_moe_export_int4pal.py
Config: OpPalettizerConfig(mode="uniform", nbits=4, granularity="per_grouped_channel",
group_size=32). RangeDim T∈[1..4] retained from Exp 34.
TMPDIR: $PWD/local-artifacts/zaya_ane/cml_tmp (main disk hit 100% capacity mid-session
due to 86 GB stale ANE plan caches from macOS 26 upgrade; freed by removing
~/Library/Caches/zaya_ane_runtime/, com.apple.python3/, and related runtime caches).
| Metric | Value |
|---|---|
| Shards built | 40/40 (L01, L03, … L79) |
| Compiled size per shard | 101.2 MB (halved from 202 MB INT8) |
| ANE residency (gate L01) | conv_ane=36/36 conv_non_ane=0 |
| Total disk | ~4.0 GB |
Attn shards (Exp 31 CCA, 40 shards) also required recompilation after macOS 26 upgrade
invalidated all .mlmodelc ANEF plans (CoreML error -14). Fixed via xcrun coremlcompiler
compile on all 40 .mlpackage files.
Golden validator
python/zaya_golden_validator.py --layer 1 --n-probes 8 on L01 INT4pal shard:
| Metric | Value |
|---|---|
| Min cosine | 0.9994 |
| Mean cosine | 0.9996 |
| Gate | GREEN ✓ |
Benchmark results
Hardware: M4 Max (Apple Neural Engine, 100% ANE residency)
Prompt: --prompt-ids 2,42 --max-new 40
| Mode | tok/s | vs Exp 34 INT8 |
|---|---|---|
| Baseline T=1 (INT8, Exp 34) | 8.59 | — |
| Baseline T=1 (INT4pal, Exp 35) | 9.25 | +7.7% |
| Speculative ngram (INT8 rangedim, Exp 34) | 2.69 | — |
| Speculative ngram (INT4pal, Exp 35) | 2.52 | −6% |
Speculative profile (Exp 35, --ngram-min 1, prompt-ids 2,42, 40 new tokens):
verifier_calls=32 drafted=96 accepted=7 fallbacks=31 acceptance=7.3%
Verifier wall cost: 15.455s / 32 calls ≈ 483 ms/call
Analysis: INT4pal improves T=1, not T=vbt verifier
T=1 baseline improvement (+7.7%): INT4pal halves MoE weight bandwidth. At T=1, ZAYA’s MoE forward pass is DRAM-streaming-bound — the ANE must stream 101 MB of LUT weights from DRAM per shard vs 202 MB INT8. This directly reduces per-token latency.
T=4 verifier cost unchanged (483 ms vs 499 ms INT8 = −3%): At T=4, ZAYA soft-routing
computes all 16 expert FFNs over all 4 tokens — compute load is 16 × 4 × FFN_hidden
MACs. INT4pal reduces weight bandwidth but not MAC operation count. The ANE becomes
MAC-bound at T=4, not bandwidth-bound. Therefore INT4pal delivers diminishing returns
on the verifier call, in contrast to the T=1 case.
This is the ANE equivalent of Knuth’s observation in TAOCP Vol. 2 §4.3 about arithmetic-vs-memory bottlenecks: the bottleneck shifts with the operation count, and optimizations targeting the wrong resource leave performance on the table.
Break-even acceptance rate with 483 ms verifier vs 109 ms T=1 (9.25 tok/s):
[ p_{\text{break-even}} = 1 - \frac{t_1}{t_v/\text{vbt}} = 1 - \frac{109}{483/4} \approx 0.10 ]
At 7.3% observed acceptance (synthetic prompt), speculative is below break-even — matching the pattern from Exp 34. Real code-completion prompts at 60–80% acceptance would yield approximately:
[ \frac{1 + 3 \times 0.7}{483\ \text{ms}} \approx 6.4\ \text{tok/s} ]
which is still slower than the T=1 baseline at 9.25 tok/s because the verifier cost dominates.
Conclusion
INT4pal is a net win for the T=1 baseline (memory-bandwidth-bound): +7.7% at half the shard size. It is not a win for the T=4 MoE verifier (MAC-bound at T=4): essentially no improvement. The path to speculative speedup on ZAYA requires either reducing soft routing to top-K sparse (like standard MoE), or moving to a model whose dominant compute is attn (not FFN).
Reference: [TAOCP Vol. 2 §4.3] — arithmetic vs memory bottleneck identification; [Dragon Book §8.7] — same principle applied to instruction scheduling; [EoP §4] — reduction via semigroup (INT4pal halves the semigroup element size, not the op count).