The Apple Neural Engine Inference Book
A practitioner’s guide to production inference on the Apple Neural Engine with CoreML, Swift runtimes, ANE-only residency checks, and validated model manifests.
By Alvaro Videla - @old_sound
Chapters
| Chapter | Topic |
|---|---|
| 00 - Modern Inference | Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick |
| 01 - ANE Laws | Empirical rules: shard limits, quantization, residency |
| 02 - Porting Recipe | GGUF to CoreML, step by step |
| 03 - Quantization | INT8 production, INT4 tradeoffs, the silent CPU fallback |
| 04 - Shard Sizing | Layer count vs size, 250 MB limit, LM-head splits |
| 05 - Stateful KV Cache | MLState, Swift daemon design, decode loop |
| 06 - RangeDim + Speculative | Variable T, n-gram acceptance |
| 07 - MoE on ANE | Soft routing, per-expert dispatch, ZAYA and Privacy Filter |
| 08 - Swift Runtime | Cache-friendly CoreML orchestration, state, buffers, and serving |
| 09 - Experiment Index | Searchable index of experiment writeups |
| 10 - Decision Journal | The thinking behind the hard calls |
| Glossary | Definitions for inference, CoreML, ANE, and validation terms |
Repository
The source code, converters, Swift runtimes, validators, and model manifests live in the ane-book repository.