Experiment 21 - APL-Style Token/Stream Batching

Sources: Iverson APL inner/outer product + Concrete Mathematics amortization

Single-token decode is a poor ANE shape: [1,D,1,1] gives the conv engine only one spatial point per weight load. The next array-shape probe should convert a representative layer shard to accept T > 1 positions, e.g. [1,D,T,1], and measure whether 1x1 conv weight reuse improves prefill, multi-agent serving, or speculative verification.
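
The weight-reuse argument can be illustrated with a minimal NumPy sketch. It is only a reference model of the math, not the ANE conversion path: the helper name conv1x1 and the small sizes D_IN, D_OUT, and T are illustrative assumptions, and nothing here measures hardware behavior. It shows that a 1x1 conv over [1,D,T,1] applies the same weight matrix to all T positions in one call and matches T separate [1,D,1,1] calls.

```python
import numpy as np

def conv1x1(x, w):
    """Reference 1x1 convolution on an NCHW tensor (hypothetical helper).

    x: [1, D_in, T, 1], w: [D_out, D_in] -> [1, D_out, T, 1].
    A 1x1 conv is a per-position matrix multiply, so one copy of w
    serves every spatial position in the input.
    """
    return np.einsum("oc,nchw->nohw", w, x)

rng = np.random.default_rng(0)
D_IN, D_OUT, T = 64, 32, 4  # small illustrative sizes, not the real layer

w = rng.standard_normal((D_OUT, D_IN)).astype(np.float32)
tokens = [rng.standard_normal((1, D_IN, 1, 1)).astype(np.float32) for _ in range(T)]

# Single-token shape: T separate calls, each touching the full weight matrix.
single = [conv1x1(x, w) for x in tokens]

# Batched shape: stack the tokens along the third axis and call once.
batched_in = np.concatenate(tokens, axis=2)  # [1, D_IN, T, 1]
batched_out = conv1x1(batched_in, w)         # [1, D_OUT, T, 1]

# Largest elementwise difference between the batched and per-token results.
max_abs_err = max(
    float(np.abs(batched_out[:, :, t:t + 1, :] - single[t]).max())
    for t in range(T)
)
```

The two paths are numerically equivalent, so any per-token gain from the batched shape would come from how often the weights are loaded, not from a change in the computation.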

Batching along T does not directly accelerate single-stream greedy decode unless speculation or batching supplies independent tokens, but it can be the largest throughput lever for coding-agent workloads.

First probe: the LM head now has an opt-in --batch-tokens builder path. The full 4-shard T=4 set, with hidden shape [1,3072,4,1], passed strict residency (conv_non_ane=0 and compute_non_ane=0 on every shard) and the numerical golden check against NumPy (cos_logits ranging from 0.999926 to 0.999937). A shard-0 microbenchmark measured one batched prediction at 0.691 ms/token versus four single-token predictions at 1.608 ms/token, a 2.33x per-token improvement for that shard. This is a shape lever for multi-stream serving, speculative verification, and prefill; it is not a direct win for greedy single-stream decode until the runtime can supply independent hidden vectors.
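
The two reported figures can be reproduced with a short sketch. It assumes cos_logits is the cosine similarity between flattened logits tensors, which is not confirmed by the harness itself; cosine_similarity and per_token_speedup are hypothetical helper names. Only the two timings come from the measurement above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two tensors after flattening (assumed metric)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def per_token_speedup(single_ms_per_token, batched_ms_per_token):
    """Ratio of per-token latencies; above 1 means batching is faster per token."""
    return single_ms_per_token / batched_ms_per_token

# Shard-0 microbenchmark figures quoted in the text.
speedup = per_token_speedup(1.608, 0.691)

# Sanity check of the metric on synthetic data: float16 rounding of a
# float32 reference should leave the cosine similarity very close to 1.
rng = np.random.default_rng(0)
reference = rng.standard_normal((1, 1000, 4, 1)).astype(np.float32)
rounded = reference.astype(np.float16).astype(np.float32)
cos_check = cosine_similarity(reference, rounded)
```

Rounded to two decimals, the speedup is the 2.33x quoted for shard 0. The synthetic cos_check only exercises the metric and is not a stand-in for the measured cos_logits values.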