We just finished a mini research program on Stories throughput inside Espresso. The headline is not the one people usually want to post.

We did not get a retained-path speedup. The best result we measured stayed on the existing exact path at about 121.88 tok/s. The two most plausible exact-path candidates both lost. A research-only 12-layer GQA proof produced useful infrastructure, but it was slower, compile-unstable, and the output quality was visibly worse.

Most performance work gets derailed by weak controls. Teams compare against stale baselines, keep half-working branches around, rerun ideas that were already falsified, and slowly lose track of what actually counts as a win. We wanted the opposite. This document describes what we actually found and how we tried to avoid fooling ourselves in the process.

The bar was narrow, deliberately

This program only counted wins that held up under real serving conditions. The Stories release benchmark had to improve. The .esp serving path had to work as-is. ANE-centric execution could not change. exact_head_backend=ane_classifier and cached_bindings_enabled=true both had to hold. No CPU fallback. No second full llama trunk runtime. No quality regressions dressed up as improvements.

Loose criteria produce loose results. Tight criteria are the only thing that makes “we improved the benchmark” mean anything.

The counted benchmark was:

./.build/arm64-apple-macosx/release/espresso-generate generate \
  --bundle /private/tmp/stories110m-ctx256-v1.esp \
  --max-tokens 64 \
  --benchmark-generate \
  --compare-warmup 1 \
  --compare-iterations 3 \
  Hello

If an idea could not beat that command while preserving those invariants, it was not a retained breakthrough. Research, yes. Product progress, no.
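That gate can be checked mechanically. A minimal sketch, assuming the benchmark's metrics can be captured as key=value lines like the ones quoted later in this post (the `check_gate` helper and the `tok_s` field name are illustrative assumptions, not Espresso's real output format):

```shell
# check_gate: retain a candidate only if it beats the same-revision control
# AND preserves the serving invariants. Metrics arrive as key=value lines.
check_gate() {
  metrics="$1"; baseline_tps="$2"
  tps=$(printf '%s\n' "$metrics"     | sed -n 's/^tok_s=//p')
  backend=$(printf '%s\n' "$metrics" | sed -n 's/^exact_head_backend=//p')
  cached=$(printf '%s\n' "$metrics"  | sed -n 's/^cached_bindings_enabled=//p')
  # invariants first: wrong backend or disabled bindings is an instant reject
  [ "$backend" = "ane_classifier" ] || { echo "REJECT: backend=$backend"; return 1; }
  [ "$cached" = "true" ]            || { echo "REJECT: cached bindings disabled"; return 1; }
  # then throughput: a win must strictly beat the same-revision control
  awk -v t="$tps" -v b="$baseline_tps" 'BEGIN { exit !(t > b) }' \
    || { echo "REJECT: $tps tok/s does not beat $baseline_tps"; return 1; }
  echo "RETAIN: $tps tok/s"
}
```

Note the ordering: invariant checks come before the throughput comparison, so a fast-but-off-surface candidate can never sneak through on its number alone.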

What we actually tested

We started from a pinned baseline commit (cf960f5) and checked our planning document against the repo’s experiment ledger: what had been tried, what had failed, and whether those failures actually settled the question under real constraints.

The queue we ended up running:

  1. Shorter-context exact SKUs (ctx128, ctx192)
  2. Compile-stable 6-layer hybrid retry
  3. 12-layer grouped-KV GQA proof
  4. 10-layer and 8-layer GQA sweep if the 12-layer proof held up
  5. Verify-first / pay-later staging
  6. Token-conditioned proposer or state-predictor path
  7. k>2 verifier ladder / MTP research

The rule: highest confidence, lowest assumption, easiest revert, in that order.

Git as the control plane

Every experiment started from a pinned baseline commit. If something produced a real, lasting improvement, it got committed cleanly. If it regressed throughput or broke invariants, it got reverted before it touched main. No half-baked experiments lingered.

Once you let mixed changes accumulate, you lose the ability to trust any individual result. You cannot tell what is working, what is broken, and what is just noise.

Baseline for this batch was cf960f5. The only commit we kept was 1e5840b, and that was research infrastructure, not a serving-path win.
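The retain-or-revert decision itself can be scripted as a dry run. A sketch, assuming throughput against the same-revision control is the deciding number (the `retain_or_revert` helper is illustrative, not an actual repo script; it echoes the git action rather than running it):

```shell
# Decide retain vs revert for an experiment branch from candidate vs
# same-revision control throughput. Echoes the git action (dry run);
# wire it to real git commands once trusted.
retain_or_revert() {
  branch="$1"; candidate_tps="$2"; baseline_tps="$3"
  if awk -v c="$candidate_tps" -v b="$baseline_tps" 'BEGIN { exit !(c > b) }'; then
    echo "git merge --ff-only $branch   # retained: $candidate_tps > $baseline_tps tok/s"
  else
    echo "git branch -D $branch         # reverted: $candidate_tps <= $baseline_tps tok/s"
  fi
}
```

For the ctx128 probe, `retain_or_revert exp/ctx128 107.16 121.88` prints the delete-branch action; nothing mixed ever lands on main.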

Every unit of work was a falsifiable experiment

Each experiment had to be stated in a way that could fail cleanly, before we wrote a line of code. For each lane, we wrote down the hypothesis, the exact mechanism being tested, which invariants we were not allowed to break, what success looked like, what would kill the experiment, the same-revision control run for comparison, the minimal code or artifact change that could test the idea, which tests we would run, which benchmark we would measure, short-prompt and long-prompt quality checks, the immediate decision (retain, reject, revert, or block), and the commit or revert record.
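Those fields can be captured as a boilerplate ledger entry so that no experiment starts without them. A minimal sketch, assuming a plain-text ledger format (the `new_experiment` helper and field layout are illustrative, not the repo's actual ledger schema):

```shell
# Emit a blank ledger entry with the fields every experiment had to fill in
# BEFORE any code was written. The point is that the kill gate is named up
# front, not after the numbers come in.
new_experiment() {
  cat <<EOF
lane: $1
hypothesis:   # what we believe, stated so it can fail cleanly
mechanism:    # the exact change being exercised
invariants:   # ane_classifier backend, cached bindings, no CPU fallback
success:      # counted benchmark strictly beats same-revision control
kill_gate:    # named before coding; tripping it ends the experiment
control:      # same-revision control tok/s
decision:     # retain | reject | revert | block
record:       # commit hash or revert note
EOF
}
```

Usage is just `new_experiment ctx128-probe > ledger/ctx128-probe.txt` before touching any code.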

This solved two problems we kept running into.

First: vague optimism. Performance ideas always sound promising until you make someone name their kill gate ahead of time. Suddenly that brilliant idea has a death certificate and nobody’s feelings get hurt.

Second: “interesting” regressions. A path that is slower, lower quality, or off the retained surface does not get spun into a better story. It gets rejected or blocked.

Comparing candidates against baselines

Comparing a candidate against a baseline from a different revision, a different thermal state, or a different compiler state is one of the quickest ways to fool yourself. We ran same-revision controls alongside every experiment.

On the Stories bundle, our control looked like this:

tok/s: 121.88
compile_ms: 1281.35
first_token_ms: 1.55
median_token_ms: 8.55
p95_token_ms: 10.56
exact_head_backend: ane_classifier
cached_bindings_enabled: true

After running candidates, we re-ran the control to confirm the baseline had not drifted. Post-candidate control came in at 119.56 tok/s, close enough that the losses we saw were real losses, not measurement noise.
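The drift figure is just a relative difference. Using the two control numbers above:

```shell
# Relative drift between the pre- and post-candidate control runs.
# Anything this small means candidate losses of 10+ tok/s are signal,
# not measurement noise.
pre=121.88    # control before candidates
post=119.56   # control after candidates
awk -v a="$pre" -v b="$post" 'BEGIN { printf "%.1f%% drift\n", (a - b) / a * 100 }'
# prints: 1.9% drift
```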

Experiment 1: smaller exact context

The obvious first move was to shrink the exact-path Stories bundles by dialing down maxSeq. Less context, lower compile and state costs, better throughput. Low risk, no semantic changes, just a capacity knob.

We built two probes: ctx128 and ctx192. Both preserved exact short-prompt and long-prompt behavior.

The results:

  • retained exact control: 121.88 tok/s
  • ctx128: 107.16 tok/s
  • ctx192: 105.18 tok/s

Both were slower. ctx64 never became a valid candidate. The benchmark command requires at least 68 total tokens, so a ctx64 claim could not be proven on the mandated run.

Smaller exact context did not buy us retained throughput on the real Stories path. That lane is closed unless something deeper in the runtime changes.

Experiment 2: the old 6-layer hybrid lane routes through CPU now

The planning document flagged the compile-stable 6-layer retry as a credible lane, on the theory that removing compile churn might open some headroom.

It turned out the repo no longer serves that path the way older experiments assumed. When we probed the existing 6L and 8L draft bundles, we got a CPU draft runtime instead of the ANE hybrid path:

  • 6L: 81.61 tok/s
  • 8L: 85.75 tok/s
  • exact_head_backend=cpu_exact_two_token_draft
  • cached_bindings_enabled=false

Not a retained-path comparison. A different execution model entirely.

The lesson is not “6 layers are bad.” The code surface has moved. exact_two_token drafts now route through a CPU exact-two-token runtime at HEAD. Old hybrid-draft evidence does not replay cleanly here.

Blocked lane, not a failed lane. Different implications downstream.

Experiment 3: grouped-KV GQA

The most substantive experiment this batch was a 12-layer GQA proof with grouped-KV teacher initialization.

The logic: the repo had already shown naive truncation was a dead end, but grouped-KV reduction might still expose useful structure. And even if this never became a serving path, better initialization would be useful infrastructure for later student experiments.

We added grouped-KV teacher initialization to the distillation pipeline, wrote tests, and exported a proof artifact. That is the kept commit: 1e5840b, “Add grouped-KV GQA Stories proof initialization.”

The benchmark numbers:

  • 111.75 tok/s
  • compile_ms=10338.24
  • compile_retries=22
  • compile_failures=27
  • first_token_ms=1.86
  • median_token_ms=8.88
  • p95_token_ms=13.17

The outputs were bad. Short prompts fell into “Icy. Icy. Icy.” repetition. Long prompts collapsed into fragments.

Grouped-KV initialization stays as research infrastructure. The 12-layer GQA artifact is not a serving win and not a product-quality path. Keep the enabling work. Reject the false claim.

What we did not rerun because the ledger already had answers

When the repo already contains decisive falsification under equivalent mechanism and constraints, rerunning it just wastes time and muddies the record. We marked those ideas as attempted and moved on:

  • unconditioned future-head retries
  • split-runtime speculation paths
  • prepared-pair upper-bound paths
  • second-runtime draft paths
  • naive truncation as a product path
  • factored-head paths as the primary retained bet
  • any claim about a simple ~400 tok/s two-token Stories win on the current retained constraints

Not laziness. Respect for evidence.

The blockers we found are structural, not motivational

The queue was not blocked because we ran out of ideas. It was blocked because the next ideas require new contracts or new runtime surfaces. Four things stand between here and the next experiment:

1. Draft bundles currently route through CPU serving. The exact_two_token bundle path uses a CPU exact-two-token runtime. Old hybrid-draft experiments cannot be replayed as retained ANE-path comparisons without changing serving architecture first.

2. The current ANE two-step runtime is proposer-first. “Verify first, pay later” sounds attractive when zero-accept cost is too high. But the runtime we have selects proposed future tokens before exact pair preparation. There is no verifier-first staging path to benchmark yet.

3. The FTS2 sidecar contract is unconditioned. Right now the future sidecar is an unconditioned futureRMS + futureClassifier head. There is no token-conditioned proposer or post-commit state-predictor contract in-tree. Until that exists, the next serious future-token acceptance experiment cannot even be expressed cleanly.

4. The artifact contract is hard-wired to horizon == 2. Anything beyond two-token speculation is blocked by representation. Bundle validation and TwoStepStudentContract both enforce horizon two. A k>2 verifier ladder is not something you can just try. It starts with changing the contract surface.

This was the most useful output of the whole batch. We now know where the problem stops being about tuning and starts being about architecture.

One practical note

Do not parallelize ANE throughput benchmarks.

We watched benchmark runs contaminate each other through shared compiler and thermal state. Once parallel runs have touched the state you care about, the numbers are no longer decision-grade. Suggestive, maybe. Trustworthy enough to retain or reject a path? No.

Run counted ANE throughput benchmarks sequentially, with explicit control bracketing. This is boring. It will probably save more time than any remaining speculative model tweak.
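A minimal sketch of that bracketing discipline, with a stand-in `bench` command (the helper is illustrative; swap in the real counted espresso-generate invocation):

```shell
# Run counted benchmarks strictly one at a time, bracketed by control runs.
# "$bench" is a stand-in command that prints tok/s for a bundle name.
# Never background these with & -- parallel runs leak shared compiler and
# thermal state into each other's numbers.
run_bracketed() {
  bench="$1"; shift
  pre=$("$bench" control)        # control before any candidate
  for bundle in "$@"; do
    "$bench" "$bundle"           # candidates run sequentially, in order
  done
  post=$("$bench" control)       # control again, to confirm no drift
  echo "control bracket: $pre -> $post tok/s"
}
```

If the closing control has drifted from the opening one by more than noise, the candidate numbers in between are not decision-grade and the whole bracket gets rerun.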

What this research style bought us

If you only look at the scoreboard, this batch produced no retained throughput win.

If you look at the program quality, it produced several things I trust much more than a noisy local speedup:

  • a cleaned-up experiment queue grounded in the real ledger
  • decisive rejection of smaller exact-context bundles as a retained win
  • a kept research-infrastructure commit for grouped-KV GQA initialization
  • a sharper map of what is blocked by runtime and artifact contracts
  • a workflow correction around benchmark discipline

Most importantly, it reduced ambiguity.

Ambiguity is expensive. It creates fake options, stale optimism, and repeat experiments that look active but are really just forgetting.

Where to go next

Stop trying to squeeze a breakthrough out of knobs that the current retained surface cannot support.

The next serious moves are structural:

  • restore a retained hybrid draft-sidecar serving path, or
  • design a token-conditioned or post-commit proposer contract that can plausibly drive real future-token acceptance, or
  • expand the artifact and runtime contracts beyond horizon two if the project wants to explore k>2 seriously

Those are not small tasks. They are also more honest than pretending another minor truncation sweep is going to unlock the answer.

Good performance research is not just about finding wins. It is about eliminating fake wins, naming real blockers, and making sure the next engineer starts from truth instead of momentum.


Espresso repo: github.com/christopherkarani/Espresso