The Apple Neural Engine Inference Book

A practitioner’s guide to production inference on the Apple Neural Engine with CoreML, Swift runtimes, ANE-only residency checks, and validated model manifests.

By Alvaro Videla - @old_sound

Chapters

Chapter	Topic
00 - Modern Inference	Tokens, prefill/decode, KV cache, ANE vs GPU vs CPU, the Conv2d trick
01 - ANE Laws	Empirical rules: shard limits, quantization, residency
02 - Porting Recipe	GGUF to CoreML, step by step
03 - Quantization	INT8 production, INT4 tradeoffs, the silent CPU fallback
04 - Shard Sizing	Layer count vs size, 250 MB limit, LM-head splits
05 - Stateful KV Cache	MLState, Swift daemon design, decode loop
06 - RangeDim + Speculative	Variable T, n-gram acceptance
07 - MoE on ANE	Soft routing, per-expert dispatch, ZAYA and Privacy Filter
08 - Swift Runtime	Cache-friendly CoreML orchestration, state, buffers, and serving
09 - Experiment Index	Searchable index of experiment writeups
10 - Decision Journal	The thinking behind the hard calls
11 - ONNX Bundles to ANE	ONNX Runtime contrib ops, local weight materialization, CoreML rebuilds
Glossary	Definitions for inference, CoreML, ANE, and validation terms

Repository

The source code, converters, Swift runtimes, validators, and model manifests live in the ane-book repository.