Chapter 6 — RangeDim and Speculative Decode

Decode is sequential, but not every part of inference has to be one token wide. Prefill can process many prompt tokens at once, and speculative decoding can ask the full model to verify a short draft span in one call. Both techniques depend on the model accepting a variable token dimension T.

RangeDim is CoreML’s way to declare that variability. Speculative decoding is the algorithmic use case: propose several likely next tokens cheaply, run the real model on that small batch, and accept the longest correct prefix. If the acceptance rate is high enough, the runtime gets more than one generated token per expensive verifier call.

The Prefill/Decode Asymmetry

LLM inference has two phases with opposite bottlenecks:

Prefill: process N tokens in parallel. Bottleneck: compute (matrix-vector → matrix-matrix).
Decode: generate one token at a time. Bottleneck: memory bandwidth (load all weights for a single vector).

A naive port to ANE runs both at T=1 (one token per CoreML call). This works but leaves prefill speed on the table — the ANE can process multiple tokens in one kernel launch, but you need the model to accept variable-length inputs.

RangeDim: Variable Sequence Length in CoreML

CoreML RangeDim declares that a particular dimension can vary between a min and max value at runtime, without recompiling:

ct.TensorType(
    name="hidden",
    shape=[1, d_model, ct.RangeDim(lower_bound=1, upper_bound=4), 1]
)

This tells CoreML: “the T dimension (height) can be 1, 2, 3, or 4 tokens.”

At runtime, you pass actual T-length tensors and CoreML handles the dispatch. The same .mlmodelc is used for T=1 (decode) and T=2,3,4 (speculative / prefill).

Upper bound: set to 4 for decode + n-gram speculative. Set higher (e.g., 128) for aggressive prefill, but test ANE residency — very large T can trigger fallback.

n-Gram Speculative Decode

The idea: instead of generating tokens one at a time, draft n candidate tokens using a cheap heuristic (n-gram lookup), verify all n at once with the full model in a single T=n forward pass, and accept as many as are correct.

N-gram speculative draft and verify flow

The accepted draft prefix glows green; the first mismatch becomes the fallback token from the full model verifier.

Expected tokens/call with n-gram matching on natural language:

T=1: 1.0 tokens/call (baseline)
T=4 with good n-gram hit rate: 2.5–3.5 effective tokens/call

The acceptance logic:

func speculativeDecode(prompt: [Int], maxNew: Int) -> [Int] {
    var tokens = prompt
    var pos = prompt.count

    while tokens.count < prompt.count + maxNew {
        // Draft: look up next n tokens from n-gram table
        let drafts = ngramLookup(tokens.suffix(3), n: 4)  // [d1, d2, d3, d4]
        let T = drafts.count

        // Verify: one forward pass with T tokens
        let hidden = embedTokens(drafts, startPos: pos)  // [1, d_model, T, 1]
        let logits = forwardPass(hidden, startPos: pos)  // [1, vocab, T, 1]

        // Accept prefix of drafts that match argmax
        var accepted = 0
        for i in 0..<T {
            let predicted = argmax(logits[i])
            if i == 0 || predicted == drafts[i-1] {
                tokens.append(predicted)
                accepted += 1
            } else {
                break
            }
        }

        pos += accepted
        if accepted == 0 { break }  // no drafts accepted, fall back to greedy
    }
    return Array(tokens.dropFirst(prompt.count))
}

N-gram table construction: at inference time, maintain a [Int: [Int]: Int] map from context → next token, updated with every generated token. No external data.

RangeDim Conversion

# Convert with RangeDim T=1..4
example_input = torch.zeros(1, d_model, 4, 1)
traced = torch.jit.trace(layer.eval(), example_input)

model = ct.convert(
    traced,
    inputs=[ct.TensorType(
        name="hidden",
        shape=[1, d_model, ct.RangeDim(lower_bound=1, upper_bound=4), 1]
    )],
    outputs=[ct.TensorType(name="out_hidden")],
    convert_to="mlprogram",
    minimum_deployment_target=ct.target.macOS15,
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)

Trace at T=4 (the max), not T=1. Tracing at T=1 can cause CoreML to bake in the wrong slice shapes for the cache writes.

Stateful + RangeDim: Known Complication

Combining stateful KV cache with RangeDim is the trickiest configuration.

The state write (k_cache[:, :, pos:pos+T, :]) uses a dynamic slice that depends on T. CoreML’s MIL must see this as a valid scatter operation for the state backend to accept it.

Validation recipe:

Run T=1 prefill. Capture output logits.
Run T=1 prefill again from scratch using T=1 in a loop.
Compare outputs at every position — must be identical.
If divergence appears at T>1 but not T=1, the state write has a bug.

In practice: ZAYA1-8B uses RangeDim T=1..4 with stateful attention (CCA validated). Phi-4-mini uses RangeDim T=1..4, stateful, validated at 17 tok/s decode.

Benchmarked Speeds (M4 Max, 48 GB)

Model	Config	tok/s
Phi-4-mini 3.8B	INT8, RangeDim T=1..4, stateful	~17
Hy-MT 1.5B translation	INT8, RangeDim, stateful	~34
ZAYA1-8B MoE	INT8, RangeDim T=1..4, stateful	~9
Privacy Filter ~1.5B MoE	INT8, T=1	~24.6 sent/s

Checklist

[ ] Traced at T=max (not T=1)
[ ] RangeDim(lower_bound=1, upper_bound=N) matches Swift runtime expectations
[ ] MLComputePlan checked at both T=1 and T=max — 100% ANE both cases
[ ] Stateful writes validated: T=1 and T>1 outputs agree vs PyTorch
[ ] n-gram table populated from prompt context before decode starts