The Apple Neural Engine Inference Book
A practitioner’s guide to production inference on the Apple Neural Engine. The book documents the practical path from model weights to ANE-resident CoreML graphs, Swift runtimes, validation gates, and the engineering tradeoffs found while porting real LLMs.
Read Online
This folder is configured as the source for the repository’s GitHub Pages site. When Pages is enabled for the repository, the rendered book is available at:
https://videlalvaro.github.io/ane-book/
Chapters
| Chapter | Topic |
|---|---|
| 00 - Modern Inference | Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick |
| 01 - ANE Laws | Empirical rules: shard limits, quantization, residency |
| 02 - Porting Recipe | GGUF to CoreML, step by step |
| 03 - Quantization | INT8 production, INT4 tradeoffs, the silent CPU fallback |
| 04 - Shard Sizing | Layer count vs size, compiler limits, LM-head splits |
| 05 - Stateful KV Cache | MLState, Swift daemon design, decode loop |
| 06 - RangeDim + Speculative | Variable token axes, prefill batching, n-gram speculation |
| 07 - MoE on ANE | Soft routing, expert shards, ZAYA and Privacy Filter |
| 08 - Swift Runtime | Cache-friendly CoreML orchestration, state, buffers, and serving |
| 09 - Experiment Index | Searchable index of experiment writeups |
| 10 - Decision Journal | Design decisions and the reasoning behind them |
| Glossary | Definitions for inference, CoreML, ANE, and validation terms |
What This Book Covers
- CoreML graph shapes that keep transformer compute on the Apple Neural Engine.
- The modern inference loop: tokens, prefill, decode, logits, sampling, and KV cache.
- Quantization choices that preserve quality without triggering CPU fallback.
- Shard sizing rules for compiler reliability and ANE residency.
- Stateful KV-cache runtimes using public MLState APIs.
- Cache-friendly Swift runtime design for warm decode and serving.
- RangeDim and speculative decoding patterns for better throughput.
- MoE-specific lessons from ZAYA and the Privacy Filter runtime.
Repository Context
The surrounding repository contains the converters, validators, Swift runtimes, model manifests, and demos referenced by the book. Start from the top-level README for setup instructions and model-specific entry points.
License
The book and repository are released under the MIT License.