Glossary

Short definitions for terms used throughout the book.

Inference Basics

Autoregressive decode: Generating text one token at a time, feeding each sampled token back into the next model call.
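
The loop can be sketched in a few lines of Python. This is a minimal illustration, not the book's runtime: `toy_step` is a hypothetical stand-in for a real model call, and the sketch picks the highest-scoring token (greedy) instead of sampling so that the result is deterministic.

```python
from typing import Callable, List


def greedy_decode(step: Callable[[List[int]], List[float]],
                  prompt: List[int],
                  max_new_tokens: int,
                  eos_id: int) -> List[int]:
    """Append one token per model call until EOS or the length budget is hit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = step(tokens)                                   # one model call per new token
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax over the vocabulary
        tokens.append(next_id)                                  # feed the choice back in
        if next_id == eos_id:
            break
    return tokens


def toy_step(tokens: List[int]) -> List[float]:
    """Hypothetical model over a 5-token vocabulary: favors (last token + 1) mod 5."""
    target = (tokens[-1] + 1) % 5
    return [1.0 if i == target else 0.0 for i in range(5)]


generated = greedy_decode(toy_step, prompt=[0], max_new_tokens=10, eos_id=4)
```

A real decoder would pass only the newest token plus cached state to the model instead of the whole token list.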

Decode: The one-token-at-a-time generation phase after the prompt has been processed.

Embedding: A learned vector looked up from a token ID before the transformer layers run.

Hidden state: The per-token vector representation carried through the transformer stack; each layer reads it and writes an updated version.

KV cache: Stored key and value tensors from previous tokens, reused so decode does not recompute the entire prefix.
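
As a rough sketch of why the cache is safe to reuse, the toy below runs single-head attention two ways: once appending each token's key and value to a cache, and once rebuilding the whole prefix at every step. The `KVCache` class, the `attend` helper, and the hand-written vectors are illustrative assumptions; in a real model the queries, keys, and values come from projections of the hidden state.

```python
import math
from typing import List

Vector = List[float]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)                                  # subtract the max for stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


class KVCache:
    """Keeps keys and values of earlier tokens so each step adds only one row."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []

    def step(self, q: Vector, k: Vector, v: Vector) -> Vector:
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy (query, key, value) triples, one per token position.
qkv = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
       ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
       ([1.0, 1.0], [1.0, 1.0], [5.0, 6.0])]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in qkv]

# Reference path: rebuild the full prefix of keys and values at every position.
full_out = [attend(qkv[t][0],
                   [k for _, k, _ in qkv[:t + 1]],
                   [v for _, _, v in qkv[:t + 1]])
            for t in range(len(qkv))]
```

Both paths produce the same outputs; the cached path simply avoids recomputing keys and values for tokens it has already seen.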

Logits: Raw scores over the vocabulary. Sampling or argmax turns logits into the next token ID.
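
A small sketch of both routes from logits to a token ID. The function names and the temperature parameter are choices made for this example, not names taken from the book's code.

```python
import math
import random
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_token(logits: List[float]) -> int:
    """Greedy choice: the index of the largest logit."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Stochastic choice: draw an index in proportion to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 0.5, -1.0, 4.0]                  # toy scores over a 4-token vocabulary
greedy_id = argmax_token(logits)
sampled_id = sample_token(logits, temperature=0.8, rng=random.Random(0))
```

Argmax always returns the same ID for the same logits, while sampling can return any ID with nonzero probability.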

Prefill: The phase that processes the prompt tokens before decode begins.

Projection: A learned linear map, usually written as y = Wx. Attention, FFNs, and LM heads are projection-heavy.

Token: An integer ID representing a text fragment.

ANE and CoreML

ANE: Apple Neural Engine, Apple’s fixed-function neural accelerator.

ANEF: The ANE compiler used during CoreML compilation to decide whether operations can run on the Neural Engine.

CoreML MIL: CoreML’s Model Intermediate Language, the graph representation produced during conversion.

ios18.conv: The CoreML operation class that maps 1x1 convolution projections onto ANE.
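
The reason a projection can be expressed as a 1x1 convolution is that a kernel of size 1 never mixes neighboring positions, so it applies the same weight matrix to every token independently. The pure-Python sketch below checks that equivalence on toy numbers; the channels-first layout and the helper names are assumptions made for illustration, and the code does not call CoreML.

```python
from typing import List

Matrix = List[List[float]]


def project(W: Matrix, x: List[float]) -> List[float]:
    """y = Wx for a single token vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def conv1x1(W: Matrix, x: Matrix) -> Matrix:
    """1x1 convolution over a channels-first input x[c_in][t].

    Each output column depends only on the input column at the same position.
    """
    length = len(x[0])
    return [[sum(W[o][c] * x[c][t] for c in range(len(x))) for t in range(length)]
            for o in range(len(W))]


W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]       # 3 output channels, 2 input channels
tokens = [[1.0, 2.0], [4.0, 0.0]]               # two token vectors of width 2

per_token = [project(W, x) for x in tokens]     # projection applied token by token

channels_first = [list(col) for col in zip(*tokens)]    # transpose to [c_in][t]
conv_out = conv1x1(W, channels_first)
conv_as_tokens = [list(col) for col in zip(*conv_out)]  # transpose back to [t][c_out]
```

Here `conv_as_tokens` and `per_token` hold the same values, which is what makes the rewrite from a linear layer to a 1x1 convolution lossless.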

MLComputePlan: The ground-truth API for checking which compute device CoreML selected for each operation.

mlmodelc: A compiled CoreML model directory produced by xcrun coremlcompiler compile.

mlpackage: A CoreML model package before compilation.

MLState: CoreML’s public API for state tensors that persist across prediction() calls.

RangeDim: A CoreML shape declaration that allows a dimension, such as token length T, to vary within bounds at runtime.

Residency: Whether the intended operations actually run on ANE rather than CPU or GPU.

Porting and Validation

Cosine gate: A quality check comparing CoreML output to a reference output with cosine similarity, usually requiring at least 0.97.
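
A minimal sketch of such a gate, assuming both outputs have been flattened to equal-length lists of floats. The function names and sample vectors are hypothetical; the 0.97 default mirrors the threshold mentioned above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                               # treat an all-zero output as a failure
    return dot / (norm_a * norm_b)


def passes_cosine_gate(candidate: Sequence[float],
                       golden: Sequence[float],
                       threshold: float = 0.97) -> bool:
    """True when the candidate output is close enough in direction to the golden."""
    return cosine_similarity(candidate, golden) >= threshold


golden = [0.12, -1.50, 3.20, 0.70]               # illustrative reference output
close = [0.11, -1.48, 3.25, 0.69]                # small numeric drift
broken = [3.20, 0.70, 0.12, -1.50]               # same values in the wrong order
```

Cosine similarity ignores overall scale, so a gate like this catches wrong directions but not a uniformly rescaled output; a separate magnitude check would be needed for that.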

Golden: A trusted reference output captured from a known-good backend, usually PyTorch or FP16 CoreML.

Shard: A separately compiled piece of a larger model, such as a few transformer layers or one LM-head slice.

Silent fallback: A failure mode where a model compiles and produces correct output, yet CoreML places important operations on CPU or GPU instead of ANE without reporting an error.

Model Architecture

Attention: The transformer mechanism that lets a token read earlier tokens using query, key, and value projections.

FFN: Feed-forward network inside a transformer block, usually the largest projection-heavy part of a dense layer.

LM head: The final projection from hidden state to vocabulary logits.

MoE: Mixture of Experts, a layer design with multiple expert FFNs and a router that chooses which experts contribute.
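
One common routing scheme, shown here as an assumption and not as the design of any specific model, keeps the top-k experts by router score and mixes their outputs with softmax weights renormalized over the chosen experts. In this sketch the router logits are passed in directly; a real layer would compute them with a learned projection of the input.

```python
import math
from typing import Callable, List

Vector = List[float]


def moe_layer(x: Vector,
              router_logits: List[float],
              experts: List[Callable[[Vector], Vector]],
              top_k: int = 2) -> Vector:
    """Run only the top_k experts and blend their outputs by gate weight."""
    ranked = sorted(range(len(router_logits)),
                    key=router_logits.__getitem__, reverse=True)
    chosen = ranked[:top_k]
    peak = max(router_logits[i] for i in chosen)
    gates = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(gates)                           # renormalize over the chosen experts only
    out = [0.0] * len(x)
    for i, gate in zip(chosen, gates):
        y = experts[i](x)                        # experts outside the top_k are never called
        out = [o + (gate / total) * yi for o, yi in zip(out, y)]
    return out


# Four toy "expert FFNs" that simply scale the input by a fixed factor.
experts = [lambda v, s=s: [s * a for a in v] for s in (1.0, 2.0, 3.0, 4.0)]

x = [1.0, -1.0]
out = moe_layer(x, router_logits=[0.0, 5.0, 5.0, -2.0], experts=experts, top_k=2)
```

With equal scores for experts 1 and 2, each contributes half of the output, and the other two experts cost nothing.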

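RMSNorm: A normalization layer commonly used in modern decoder-only LLMs. It divides each hidden-state vector by its root mean square and applies a learned per-channel scale, without subtracting the mean.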
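
A minimal sketch of the computation on one vector. The `eps` value and the all-ones weight are illustrative assumptions; real models load the weight from the checkpoint and choose their own epsilon.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Scale x to unit root mean square, then apply a per-channel gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)     # eps guards against division by zero
    return [w * v * inv_rms for v, w in zip(x, weight)]


x = [3.0, -4.0, 0.0, 0.0]                        # root mean square is 2.5
y = rms_norm(x, weight=[1.0, 1.0, 1.0, 1.0])     # approximately [1.2, -1.6, 0.0, 0.0]
```

Unlike LayerNorm, nothing is subtracted before scaling, so the sign and relative proportions of the channels are preserved.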