Experiment 24 - Structured CoT as a Grammar-Constrained Sampler

Sources: Dechter constraint propagation + Willard/Louf guided generation + Kaya Omer, “Structured CoT: Shorter Reasoning with a Grammar File” (2026)

The structured-CoT post is directly relevant to Phi-on-ANE, but the expected win is lower energy and fewer tokens per task, not higher raw tok/s. The mechanism is an inference harness: constrain the scratchpad with a finite-state grammar such as GOAL/STATE/ALGO/EDGE/VERIFY, then leave the code/answer channel permissive.
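
To make the two-channel idea concrete, here is a minimal sketch of the scratchpad grammar written as a regular expression, with everything after the last field treated as the permissive code/answer channel. The "NAME: content" line layout, the field order, and the names SCRATCHPAD_RE and split_scratchpad are assumptions for illustration; the source only names the five fields.

```python
import re

# Assumed layout: one "NAME: content" line per field, in this fixed order.
FIELDS = ("GOAL", "STATE", "ALGO", "EDGE", "VERIFY")

# Each field is a fixed literal, free single-line content, then a newline.
SCRATCHPAD_RE = re.compile("".join(rf"{name}: [^\n]*\n" for name in FIELDS))


def split_scratchpad(text):
    """Split text into (plan, rest), or return None if the plan is malformed.

    `plan` is the grammar-constrained scratchpad; `rest` is the permissive
    code/answer channel that follows it.
    """
    match = SCRATCHPAD_RE.match(text)
    if match is None:
        return None
    return text[: match.end()], text[match.end():]
```

Because the plan is a regular language, the same structure can be enforced token by token during decoding instead of being checked after the fact.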

This fits the ANE-only mandate because it lives at the permitted host-side sampling boundary. Current Phi generation already does all heavy work in ANE layer shards plus ANE LM-head shards, then the host scans logits for argmax. A grammar/FSM would replace unconstrained argmax with constrained argmax over the valid next-token set. No CPU/GPU matmul, norm, attention, FFN, or LM-head compute is introduced.
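
A minimal sketch of the constrained argmax, assuming the host already holds the full logits as an indexable sequence and the grammar exposes the valid next-token ids for its current state. The function name and the tie-breaking rule (lowest token id wins) are illustrative choices, not taken from the runtime.

```python
from typing import Iterable, Sequence


def constrained_argmax(logits: Sequence[float], allowed: Iterable[int]) -> int:
    """Return the highest-logit token id among the grammar-valid ids.

    Host-side scan only: no matmul, norm, or LM-head work happens here.
    """
    best_id = None
    # Sorting makes ties deterministic: the lowest token id wins.
    for tok in sorted(allowed):
        if best_id is None or logits[tok] > logits[best_id]:
            best_id = tok
    if best_id is None:
        raise ValueError("grammar state has no valid next token")
    return best_id
```

When the allowed set is the whole vocabulary this reduces to the ordinary argmax, so the permissive code/answer channel can reuse the same call.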

Expected benefits:

ANE-specific caveats:

Smallest implementation path:

  1. Add a tokenizer-derived grammar manifest for fixed literals and newline.
  2. Add constrained argmax at the existing host sampling point.
  3. Add optional forced-token advance for grammar literals, skipping the LM-head call while still updating the KV cache on ANE.
  4. Gate on a small coding suite: pass/fail, total tokens, plan tokens, code extraction errors, and energy per solved task.
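
Steps 1 to 3 can be sketched as a small state machine plus a decode loop. This is an illustration under stated assumptions, not the runtime's code: the class and function names are hypothetical, the literal token ids stand in for whatever the tokenizer-derived manifest would contain, and sample_content stands in for the LM-head call plus host-side argmax.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ScratchpadFSM:
    literals: List[List[int]]  # token ids of each field literal (from the manifest)
    newline_id: int            # token id that closes a field
    field_idx: int = 0
    lit_pos: int = 0

    def done(self) -> bool:
        return self.field_idx >= len(self.literals)

    def forced_token(self) -> Optional[int]:
        """Return the next literal token if the grammar forces one, else None."""
        if self.done():
            return None
        lit = self.literals[self.field_idx]
        return lit[self.lit_pos] if self.lit_pos < len(lit) else None

    def advance(self, tok: int) -> None:
        if self.done():
            return
        lit = self.literals[self.field_idx]
        if self.lit_pos < len(lit):
            if tok != lit[self.lit_pos]:
                raise ValueError("token does not match the forced literal")
            self.lit_pos += 1
        elif tok == self.newline_id:
            # Newline closes the current field; move to the next literal.
            self.field_idx += 1
            self.lit_pos = 0


def decode(fsm: ScratchpadFSM, sample_content: Callable[[], int],
           max_new: int) -> Tuple[List[int], int, int]:
    """Return (tokens, forced_count, sampled_count) for the scratchpad."""
    out: List[int] = []
    forced = sampled = 0
    while len(out) < max_new and not fsm.done():
        tok = fsm.forced_token()
        if tok is None:
            tok = sample_content()  # LM head + host argmax in the real runtime
            sampled += 1
        else:
            forced += 1             # LM head skipped; layers still update KV
        fsm.advance(tok)
        out.append(tok)
    return out, forced, sampled
```

The forced and sampled counts correspond to two of the quantities step 4 would track, since plan tokens split into literal tokens that cost no LM-head call and content tokens that do.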

Implemented first shipping slice:

Smoke command:

local-artifacts/phi4_mini_ane_runtime \
  --meta local-artifacts/phi4_mini_ane/phi4mini_runtime_meta_20_4_6_2.json \
  --max-new 16 \
  --structured-cot \
  --profile

Smoke result: structured mode forced 6 literal tokens and emitted 10 field-content tokens, completing no fields within the short 16-token budget. It decoded at 16.609 tok/s after the cold CoreML first-use load. The per-token decode profile stayed in the known public baseline range: layers_ms=56.151, head_predict_reduce_ms=4.049.
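
As a rough consistency check on these numbers, the sketch below sums the two profiled stages and converts the total to tokens per second. It assumes those two stages account for essentially all per-token decode time, which the profile output does not itself state; the variable names are illustrative.

```python
# Figures copied from the smoke result above.
layers_ms = 56.151
head_predict_reduce_ms = 4.049
forced_tokens = 6
content_tokens = 10
reported_tok_s = 16.609

# Assumption: the two profiled stages dominate per-token decode time.
per_token_ms = layers_ms + head_predict_reduce_ms
implied_tok_s = 1000.0 / per_token_ms

print(f"per-token time: {per_token_ms:.3f} ms")
print(f"implied rate:   {implied_tok_s:.3f} tok/s (reported {reported_tok_s})")
print(f"tokens emitted: {forced_tokens + content_tokens}")
```

The implied rate comes out within about 0.01 tok/s of the reported figure, and the forced and content tokens together fill the 16-token budget exactly.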