Chapter 4 — Shard Sizing
A shard is a separately compiled piece of the model. Sharding is not a model architecture feature; it is a packaging strategy forced by compiler limits, memory layout, and runtime load costs.
In a desktop GPU runtime, a whole model can often live behind one inference engine object. On ANE, large transformer models usually have to be split into layer groups, LM-head slices, and sometimes separate embedding assets. The host runtime then calls those pieces in order while preserving the same hidden state flow the original model used.
The central sizing question is: how much model can one CoreML package contain while still compiling reliably and staying ANE-resident?
The 250 MB Wall
The ANE compiler (ANEF) imposes a hard limit: compiled model weights ≥ ~1 GB produce
error -14 at compile time. But the practical ceiling for reliable ANE-resident
shards is lower: ~250 MB per .mlpackage.
Validated data points:
| Shard | Size | Layers | Result |
|---|---|---|---|
| Phi-4-mini, 3 layers, INT8 | 223 MB | 3 | ANE-resident ✓ |
| Gemma-4-26B, 1 layer, INT8 | ~180 MB | 1 | ANE-resident ✓ |
| ZAYA1-8B MoE, 1 layer, INT8 | ~120 MB | 1 | ANE-resident ✓ |
| Qwen 1.5B monolithic, INT8 | ~1.5 GB | 28 | Error -14 ✗ |
| Qwen 3B monolithic, INT4 | ~1.7 GB | 36 | Error -14 ✗ |
The 223 MB data point (Phi-4-mini 3-layer shard) is the validated ceiling. Beyond 250 MB, expect ANEF to reject the model.
Layer Counting: How Many Layers Per Shard?
The right number of layers per shard depends on two things:
- Per-layer weight size (function of
d_model,n_heads,d_ff) - Whether the shard includes the embedding table and/or LM head
Formula for a single transformer layer’s parameter count:
params_per_layer =
4 * d_model * d_model # Q, K, V, O projections (approx, ignoring GQA)
+ 3 * d_model * d_ff # gate, up, down projections
+ 2 * d_model # RMSNorm weights (x2)
At INT8 (1 byte/param), weight size in MB = params_per_layer / 1e6.
Example calculation:
| Model | d_model | d_ff | Params/layer | INT8 MB/layer | Layers at 200 MB |
|---|---|---|---|---|---|
| Phi-4-mini | 3072 | 8192 | ~62M | ~62 MB | 3 |
| Gemma-4-26B | 5120 | 32768 | ~200M | ~200 MB | 1 |
| ZAYA1-8B | 4096 | 14336 (MoE) | ~120M | ~120 MB | 1 |
| Qwen 0.5B | 896 | 4864 | ~10M | ~10 MB | 20 (monolithic OK) |
For models with d_ff > 16K, one layer per shard is the only viable option.
The LM Head Problem
The language model head is a weight matrix of shape [vocab_size, d_model].
For a model with vocab_size=32000 and d_model=4096:
LM head: 32000 * 4096 = 131M params → 131 MB at INT8
For vocab_size=151936 (Qwen 2.5 tokenizer):
LM head: 151936 * 4096 = 622M params → 622 MB at INT8 → ERROR -14
The LM head cannot be a single shard at large vocab sizes.
Solution: split the LM head into vocab slices.
# Split into N chunks along vocab dimension
vocab_chunk_size = vocab_size // n_lm_head_shards # e.g., 4 shards = 32K each
for i in range(n_lm_head_shards):
start = i * vocab_chunk_size
end = start + vocab_chunk_size
# Build a CoreML shard with lm_head.weight[start:end, :]
# Input: final hidden state [1, d_model, 1, 1]
# Output: logit slice [1, vocab_chunk_size, 1, 1]
At runtime, run all LM-head shards, concatenate the logit slices, then sample. The overhead is O(n_lm_head_shards) ANE calls but each call is small.
Phi-4-mini uses 2 LM-head shards. ZAYA1-8B uses 2.
Embedding Table
The embedding table ([vocab_size, d_model]) is a lookup, not a matmul. It does
not need to run on ANE. Implement it on the host:
// Swift: embedding lookup (host-side, not CoreML)
func embed(_ tokenId: Int) -> [Float] {
let offset = tokenId * dModel
return Array(embedWeights[offset ..< offset + dModel])
}
This is an intentional exception to the ANE-only mandate. Embedding lookup is an index operation (trivial, O(1)), not a matmul.
Shard Naming Convention
Use a consistent naming scheme so the runtime can discover shards:
layers/
layer_00.mlmodelc/
layer_01.mlmodelc/
...
layer_N.mlmodelc/
lm_head/
lm_head_0.mlmodelc/
lm_head_1.mlmodelc/
embed.bin ← raw float16 embedding matrix
runtime_meta.json ← vocab_size, d_model, n_layers, n_lm_head_shards, etc.
runtime_meta.json example:
{
"model_name": "phi4-mini",
"n_layers": 32,
"n_lm_head_shards": 2,
"d_model": 3072,
"n_heads": 32,
"n_kv_heads": 8,
"d_head": 96,
"vocab_size": 100352,
"max_seq_len": 4096,
"rangedim_max_t": 4,
"int8": true
}
Shard Sizing Checklist
[ ] Per-layer INT8 MB calculated: d_model × d_ff × 3 (approx)
[ ] Layers-per-shard chosen to stay under 200 MB (leave 50 MB headroom)
[ ] LM head split if vocab_size * d_model > 200M params
[ ] Embedding table extracted as .bin for host-side lookup
[ ] runtime_meta.json written before building Swift runtime
[ ] Each shard's MLComputePlan checked: 100% ANE conv ops