Chapter 4 — Shard Sizing
A shard is a separately compiled piece of the model. Sharding is not a model architecture feature; it is a packaging strategy forced by compiler limits, memory layout, and runtime load costs.
In a desktop GPU runtime, a whole model can often live behind one inference engine object. On ANE, large transformer models usually have to be split into layer groups, LM-head slices, and sometimes separate embedding assets. The host runtime then calls those pieces in order while preserving the same hidden state flow the original model used.
The central sizing question is: how much model can one CoreML package contain while still compiling reliably and staying ANE-resident?
The 250 MB Wall
The ANE compiler (ANEF) imposes a hard limit: compiled model weights of roughly
1 GB or more fail with error -14 at compile time. The practical ceiling for
reliable ANE-resident shards is lower: about 250 MB per .mlpackage.
Validated data points:
| Shard | Size | Layers | Result |
|---|---|---|---|
| Phi-4-mini, 3 layers, INT8 | 223 MB | 3 | ANE-resident ✓ |
| Gemma-4-26B, 1 layer, INT8 | ~180 MB | 1 | ANE-resident ✓ |
| ZAYA1-8B MoE, 1 layer, INT8 | ~120 MB | 1 | ANE-resident ✓ |
| Qwen 1.5B monolithic, INT8 | ~1.5 GB | 28 | Error -14 ✗ |
| Qwen 3B monolithic, INT4 | ~1.7 GB | 36 | Error -14 ✗ |
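The 223 MB data point (Phi-4-mini 3-layer shard) is the largest shard validated as ANE-resident. The table has no passing data point above it, so treat 250 MB as the planning ceiling and expect ANEF to reject anything beyond that.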
Layer Counting: How Many Layers Per Shard?
The right number of layers per shard depends on two things:
- Per-layer weight size (a function of d_model, n_heads, d_ff)
- Whether the shard includes the embedding table and/or LM head
Use formulas only for first-pass planning. The final authority is the compiled
artifact size plus MLComputePlan, because CoreML may store weights differently
after quantization, palettization, graph folding, or state packing.
For a dense decoder layer with full multi-head attention, a rough parameter count is:
params_per_layer =
4 * d_model * d_model # Q, K, V, O projections
+ 3 * d_model * d_ff # SwiGLU gate, up, down projections
+ 2 * d_model # RMSNorm weights, usually negligible
For grouped-query attention, K and V are smaller than Q and O. Let:
kv_dim = n_kv_heads * d_head
Then attention is closer to:
attention_params =
d_model * d_model # Q
+ d_model * kv_dim # K
+ d_model * kv_dim # V
+ d_model * d_model # O
For a SwiGLU FFN:
ffn_params = 3 * d_model * d_ff # gate, up, down
At INT8 each weight takes one byte, so raw weight size in MB is roughly
params / 1e6. Treat that as an estimate, not a promise. A measured .mlmodelc
size can be lower or higher than the raw count depending on the conversion path.
Example calculation:
Phi-4-mini planning estimate, using the 32-head / d_head=96 export docs:
d_model = 3072
n_kv_heads = 8
d_head = 96
kv_dim = 768
d_ff = 8192
attention ~= 2 * 3072^2 + 2 * 3072 * 768 = 23.6M params
FFN ~= 3 * 3072 * 8192 = 75.5M params
total ~= 99M params before converter-specific packing effects
The validated 3-layer Phi-4-mini INT8 shard was 223 MB compiled, so for that artifact family the measured planning number is about 74 MB per layer, roughly 25% below the ~99 MB raw estimate. When the two disagree, use the measured compiled artifact to choose a shard boundary.
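The arithmetic above can be reproduced with a small planning helper. This is a sketch rather than the repository's tooling: the function name `dense_layer_params` is hypothetical, and it assumes a bias-free decoder layer with grouped-query attention, a SwiGLU FFN, and two RMSNorm weight vectors.

```python
def dense_layer_params(d_model: int, n_kv_heads: int, d_head: int, d_ff: int) -> dict:
    """First-pass parameter count for one dense decoder layer (no biases assumed)."""
    kv_dim = n_kv_heads * d_head
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim  # Q and O, plus K and V
    ffn = 3 * d_model * d_ff                                   # gate, up, down
    norms = 2 * d_model                                        # two RMSNorm weight vectors
    return {
        "attention": attention,
        "ffn": ffn,
        "norms": norms,
        "total": attention + ffn + norms,
    }


# Phi-4-mini planning numbers from the example above.
est = dense_layer_params(d_model=3072, n_kv_heads=8, d_head=96, d_ff=8192)
est_mb_int8 = est["total"] / 1e6      # one byte per weight at INT8
measured_mb_per_layer = 223 / 3       # validated 3-layer compiled shard

print(f"attention {est['attention'] / 1e6:.1f}M, ffn {est['ffn'] / 1e6:.1f}M")
print(f"estimated {est_mb_int8:.1f} MB/layer vs measured {measured_mb_per_layer:.1f} MB/layer")
```

The gap between the two printed figures is the reason the formula is only a first pass: the compiled artifact, not the parameter count, decides where the shard boundary goes.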
Observed planning numbers from this repository:
| Artifact family | Measured compiled size | Practical packing note |
|---|---|---|
| Phi-4-mini INT8 | 223 MB for 3 layers | 3 layers is near the validated ceiling |
| Gemma-4-26B INT8 | ~180 MB for 1 layer | 1 layer per shard |
| ZAYA MoE INT8, 8-expert book variant | ~120 MB for 1 MoE layer | 1 MoE layer per shard |
| ZAYA Exp 34 RangeDim exporter | ~193-202 MB for 1 16-expert MoE shard | 1 MoE layer per shard |
| Qwen 0.5B INT8 | ~10 MB/layer class | monolithic or large layer groups are plausible |
For models with d_ff > 16K, one layer per shard is usually the only viable
starting point.
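Once a per-layer compiled size has been measured, choosing a layer count is a floor division against the size budget. The sketch below uses a hypothetical helper, `layers_per_shard`; the 200 MB and 250 MB budgets come from this chapter, and the helper assumes compiled size grows roughly linearly with layer count, which should be confirmed by compiling the shard.

```python
import math


def layers_per_shard(per_layer_mb: float, budget_mb: float = 200.0) -> int:
    """Largest layer count whose estimated size fits the budget.

    A result of 0 means even a single layer exceeds the budget.
    """
    if per_layer_mb <= 0:
        raise ValueError("per_layer_mb must be positive")
    return math.floor(budget_mb / per_layer_mb)


phi4_per_layer = 223 / 3  # measured Phi-4-mini INT8 figure, about 74 MB per layer

conservative = layers_per_shard(phi4_per_layer)                 # 200 MB budget
at_ceiling = layers_per_shard(phi4_per_layer, budget_mb=250.0)  # practical ceiling
gemma = layers_per_shard(180.0)                                 # ~180 MB per layer

print(conservative, at_ceiling, gemma)
```

With the 200 MB budget the helper returns 2 layers for Phi-4-mini. The validated 3-layer shard only fits under the 250 MB ceiling, which is why that artifact is described as near the ceiling.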
For MoE models, do not reuse the dense FFN formula directly. A soft-routed MoE
shard that runs all experts must budget every expert’s gate/up/down projections,
not only the top-k active experts.
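A rough MoE budget can be sketched the same way. The helper `moe_layer_params` and the configuration below are hypothetical illustrations, not ZAYA's real dimensions. The sketch assumes every expert is a SwiGLU FFN with its own gate/up/down projections and that the router is a single linear map of shape [d_model, n_experts].

```python
def moe_layer_params(d_model: int, kv_dim: int, d_ff_expert: int, n_experts: int) -> dict:
    """Parameter count for one MoE decoder layer when ALL experts live in the shard."""
    attention = 2 * d_model * d_model + 2 * d_model * kv_dim
    per_expert = 3 * d_model * d_ff_expert   # gate, up, down for one expert
    router = d_model * n_experts             # assumed linear router
    norms = 2 * d_model
    total = attention + n_experts * per_expert + router + norms
    return {"attention": attention, "per_expert": per_expert,
            "router": router, "norms": norms, "total": total}


# Hypothetical configuration, for illustration only.
n_experts, top_k = 8, 2
cfg = moe_layer_params(d_model=2048, kv_dim=512, d_ff_expert=1024, n_experts=n_experts)

# What the budget would look like if only the top-k experts were counted.
top_k_undercount = cfg["total"] - (n_experts - top_k) * cfg["per_expert"]

print(f"all experts: {cfg['total'] / 1e6:.1f} MB at INT8")
print(f"top-{top_k} only: {top_k_undercount / 1e6:.1f} MB at INT8 (undercount)")
```

In this made-up configuration, counting only the active experts would understate the shard by more than half, which is the mistake the paragraph above warns against.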
The LM Head Problem
The language model head is a weight matrix of shape [vocab_size, d_model].
For a model with vocab_size=32000 and d_model=4096:
LM head: 32000 * 4096 = 131M params → 131 MB at INT8
For vocab_size=151936 (Qwen 2.5 tokenizer):
LM head: 151936 * 4096 = 622M params → 622 MB at INT8 → ERROR -14
The LM head cannot be a single shard at large vocab sizes.
Solution: split the LM head into vocab slices.
# Split into N chunks along the vocab dimension.
# Ceiling division so a remainder is not silently dropped from the last shard.
vocab_chunk_size = -(-vocab_size // n_lm_head_shards)  # e.g., 4 shards = 32K each
for i in range(n_lm_head_shards):
    start = i * vocab_chunk_size
    end = min(start + vocab_chunk_size, vocab_size)
    # Build a CoreML shard with lm_head.weight[start:end, :]
    # Input:  final hidden state [1, d_model, 1, 1]
    # Output: logit slice [1, end - start, 1, 1]
At runtime, run all LM-head shards, concatenate the logit slices, then sample. The overhead is O(n_lm_head_shards) ANE calls but each call is small.
Phi-4-mini uses 2 LM-head shards. ZAYA1-8B uses 2.
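The shard count and the runtime merge can both be sketched in plain Python. The helper names `plan_lm_head_shards` and `greedy_token` are hypothetical. The 200M-parameter budget per slice is taken from the checklist at the end of this chapter, and greedy argmax stands in for whatever sampler the runtime actually uses.

```python
import math


def plan_lm_head_shards(vocab_size: int, d_model: int, max_params: int = 200_000_000):
    """Return (start, end) vocab row ranges so each slice holds at most max_params weights."""
    n_shards = max(1, math.ceil(vocab_size * d_model / max_params))
    chunk = math.ceil(vocab_size / n_shards)
    return [(s, min(s + chunk, vocab_size)) for s in range(0, vocab_size, chunk)]


def greedy_token(logit_slices):
    """Concatenate per-shard logit slices in shard order and return the argmax token id."""
    logits = [x for piece in logit_slices for x in piece]
    return max(range(len(logits)), key=logits.__getitem__)


phi4 = plan_lm_head_shards(vocab_size=100352, d_model=3072)
qwen = plan_lm_head_shards(vocab_size=151936, d_model=4096)

print(phi4)
print(len(qwen))
print(greedy_token([[0.1, 0.3], [2.5, -1.0]]))
```

For the Phi-4-mini dimensions the planner yields two slices, matching the shard count stated above. The slices must be concatenated in shard order, otherwise the argmax index no longer maps to the right token id.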
Embedding Table
The embedding table ([vocab_size, d_model]) is a lookup, not a matmul. It does
not need to run on ANE. Implement it on the host:
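// Swift: embedding lookup (host-side, not CoreML)
// Assumes embedWeights is a flat, row-major [Float] already decoded from embed.bin.
func embed(_ tokenId: Int) -> [Float] {
    let offset = tokenId * dModel
    precondition(tokenId >= 0 && offset + dModel <= embedWeights.count, "tokenId out of range")
    return Array(embedWeights[offset ..< offset + dModel])
}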
This is an intentional exception to the ANE-only mandate. Embedding lookup is an index operation (one offset computation plus a d_model-element copy), not a matmul.
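For tooling or tests outside the Swift runtime, the same lookup can be done directly against the file on disk. This is a sketch under stated assumptions: `embed_row` is a hypothetical helper, and the file is assumed to be a headerless, row-major, little-endian float16 matrix. The chapter says only that embed.bin holds raw float16 values, so the byte order and layout should be checked against the exporter.

```python
import os
import struct
import tempfile

BYTES_PER_F16 = 2


def embed_row(path: str, token_id: int, d_model: int, vocab_size: int) -> list:
    """Read one embedding row from a raw float16 file without loading the whole table."""
    if not 0 <= token_id < vocab_size:
        raise IndexError(f"token_id {token_id} outside [0, {vocab_size})")
    with open(path, "rb") as f:
        f.seek(token_id * d_model * BYTES_PER_F16)
        raw = f.read(d_model * BYTES_PER_F16)
    if len(raw) != d_model * BYTES_PER_F16:
        raise ValueError("file is shorter than vocab_size * d_model float16 values")
    # "<" = little-endian (assumed), "e" = IEEE 754 half precision.
    return list(struct.unpack(f"<{d_model}e", raw))


# Tiny self-check: a 3-token table with d_model=4 written to a temporary file.
table = [[0.0, 0.5, 1.0, 1.5], [2.0, 2.5, 3.0, 3.5], [-1.0, -0.5, 0.25, 4.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embed.bin")
    with open(path, "wb") as f:
        for row in table:
            f.write(struct.pack("<4e", *row))
    row1 = embed_row(path, token_id=1, d_model=4, vocab_size=3)

print(row1)
```

Seeking to the row keeps memory use at one row per lookup. A runtime that embeds many tokens would more likely memory-map or preload the table, as the Swift version assumes.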
Shard Naming Convention
Use a consistent naming scheme so the runtime can discover shards:
layers/
    layer_00.mlmodelc/
    layer_01.mlmodelc/
    ...
    layer_N.mlmodelc/
lm_head/
    lm_head_0.mlmodelc/
    lm_head_1.mlmodelc/
embed.bin            ← raw float16 embedding matrix
runtime_meta.json    ← vocab_size, d_model, n_layers, n_lm_head_shards, etc.
runtime_meta.json example:
{
  "model_name": "phi4-mini",
  "n_layers": 32,
  "n_lm_head_shards": 2,
  "d_model": 3072,
  "n_heads": 32,
  "n_kv_heads": 8,
  "d_head": 96,
  "vocab_size": 100352,
  "max_seq_len": 4096,
  "rangedim_max_t": 4,
  "int8": true
}
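A build script can sanity-check this file before the Swift runtime ever loads it. The sketch below is an illustration, not part of the repository: `check_runtime_meta` and `lm_head_shard_names` are hypothetical helpers, the 200M-parameter LM-head budget is taken from the checklist in this chapter, and the path pattern follows the naming convention shown above.

```python
import json
import math

REQUIRED_KEYS = ("n_layers", "n_lm_head_shards", "d_model",
                 "n_heads", "n_kv_heads", "d_head", "vocab_size")


def check_runtime_meta(meta: dict, max_lm_head_params: int = 200_000_000) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in meta]
    if problems:
        return problems
    if meta["n_heads"] % meta["n_kv_heads"] != 0:
        problems.append("n_heads is not a multiple of n_kv_heads")
    rows = math.ceil(meta["vocab_size"] / meta["n_lm_head_shards"])
    if rows * meta["d_model"] > max_lm_head_params:
        problems.append("LM-head slice exceeds the parameter budget; add shards")
    return problems


def lm_head_shard_names(meta: dict) -> list:
    """Relative paths the runtime would look for, per the naming convention."""
    return [f"lm_head/lm_head_{i}.mlmodelc" for i in range(meta["n_lm_head_shards"])]


meta = json.loads("""
{"model_name": "phi4-mini", "n_layers": 32, "n_lm_head_shards": 2,
 "d_model": 3072, "n_heads": 32, "n_kv_heads": 8, "d_head": 96,
 "vocab_size": 100352, "max_seq_len": 4096, "rangedim_max_t": 4, "int8": true}
""")

problems = check_runtime_meta(meta)
names = lm_head_shard_names(meta)
print(problems)
print(names)
```

Returning a list of problems instead of raising on the first one lets the build report every inconsistency in a single pass.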
Shard Sizing Checklist
[ ] Per-layer INT8 MB estimated with attention + FFN/MoE formulas
[ ] Compiled `.mlmodelc` size measured before scaling the shard family
[ ] Layers-per-shard chosen to stay under 200 MB (leaves 50 MB of headroom below the 250 MB ceiling)
[ ] LM head split if vocab_size * d_model > 200M params
[ ] Embedding table extracted as .bin for host-side lookup
[ ] runtime_meta.json written before building Swift runtime
[ ] Each shard's MLComputePlan checked: 100% ANE conv ops