The Apple Neural Engine Inference Book

A practitioner’s guide to production inference on the Apple Neural Engine (ANE). The book documents the practical path from model weights to ANE-resident CoreML graphs, along with the Swift runtimes, validation gates, and engineering tradeoffs encountered while porting real LLMs.

Read Online

This folder is configured as the source for the repository’s GitHub Pages site. When Pages is enabled for the repository, the rendered book is available at:

https://videlalvaro.github.io/ane-book/

Chapters

Chapter                       Topic
00 - Modern Inference         Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick
01 - ANE Laws                 Empirical rules: shard limits, quantization, residency
02 - Porting Recipe           GGUF to CoreML, step by step
03 - Quantization             INT8 production, INT4 tradeoffs, the silent CPU fallback
04 - Shard Sizing             Layer count vs size, compiler limits, LM-head splits
05 - Stateful KV Cache        MLState, Swift daemon design, decode loop
06 - RangeDim + Speculative   Variable token axes, prefill batching, n-gram speculation
07 - MoE on ANE               Soft routing, expert shards, ZAYA and Privacy Filter
08 - Swift Runtime            Cache-friendly CoreML orchestration, state, buffers, and serving
09 - Experiment Index         Searchable index of experiment writeups
10 - Decision Journal         Design decisions and the reasoning behind them
Glossary                      Definitions for inference, CoreML, ANE, and validation terms

What This Book Covers

Repository Context

The surrounding repository contains the converters, validators, Swift runtimes, model manifests, and demos referenced by the book. Start from the top-level README for setup instructions and model-specific entry points.

License

The book and repository are released under the MIT License.