The Apple Neural Engine Inference Book

A practitioner’s guide to production inference on the Apple Neural Engine with CoreML, Swift runtimes, ANE-only residency checks, and validated model manifests.

By Alvaro Videla - @old_sound

Chapters

Chapter                      Topic
00 - Modern Inference        Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick
01 - ANE Laws                Empirical rules: shard limits, quantization, residency
02 - Porting Recipe          GGUF to CoreML, step by step
03 - Quantization            INT8 production, INT4 tradeoffs, the silent CPU fallback
04 - Shard Sizing            Layer count vs size, 250 MB limit, LM-head splits
05 - Stateful KV Cache       MLState, Swift daemon design, decode loop
06 - RangeDim + Speculative  Variable T, n-gram acceptance
07 - MoE on ANE              Soft routing, per-expert dispatch, ZAYA and Privacy Filter
08 - Swift Runtime           Cache-friendly CoreML orchestration, state, buffers, and serving
09 - Experiment Index        Searchable index of experiment writeups
10 - Decision Journal        The thinking behind the hard calls
Glossary                     Definitions for inference, CoreML, ANE, and validation terms

Repository

The source code, converters, Swift runtimes, validators, and model manifests live in the ane-book repository.