The numbers

EdgeRunner is a Swift + Metal LLM inference engine for Apple Silicon. No llama.cpp. No MLX. No C++ at all. Everything from scratch.

Qwen3-0.6B Q8_0, 128-token greedy decode, M3 Max:

Engine        Decode tok/s
EdgeRunner    212
llama.cpp     ~183
MLX (Q8_0)    ~278

16% ahead of llama.cpp on the same model and chip. Still behind MLX by about 24%.

522 MB peak RSS. 4.1 ms time to first token.


Why build this

Almost every local LLM tool on Mac wraps llama.cpp or MLX. Both are great projects. But wrapping them means living with their architecture, their bottlenecks, their tradeoffs.

I wanted to see what happens when you start from zero. Pure Swift. Pure Metal. No C++ interop. No Python runtime. No Go wrapper. Just compute kernels talking directly to the GPU.

34 experiments in a single weekend. Most got rolled back. The ones that stuck pushed us to 212 tok/s.


The benchmark

swift test -c release --filter "PublishableBenchmark/fullBenchmark"

That’s it. One command. Deterministic, repeatable, same test every time.

Configuration:

  • Model: Qwen3-0.6B Q8_0 (639 MB GGUF)
  • Decode: 128 tokens, greedy (temperature = 0)
  • Chip: Apple M3 Max
  • Build: Release
  • Correctness guard: greedy prefix must start with [1, 1479, 35]

The correctness guard is the part most benchmark posts skip. It’s easy to make a model fast by breaking determinism. We don’t count those runs. If the prefix changes, it’s a regression, not an optimization.
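
A minimal sketch of what that guard amounts to, assuming a small helper (the function name is illustrative; only the expected prefix comes from the config above):

// Hedged sketch of the correctness guard. Only the prefix [1, 1479, 35]
// comes from the benchmark config; the helper name is illustrative.
func assertGreedyPrefix(_ tokens: [Int32]) {
    let expectedPrefix: [Int32] = [1, 1479, 35]
    precondition(Array(tokens.prefix(expectedPrefix.count)) == expectedPrefix,
                 "Greedy prefix drifted: regression, not an optimization")
}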


What worked

Five things got us from 0.06 tok/s to 212 tok/s. The other 29 experiments got reverted.

1. Fused GPU pipeline

All 28 transformer layers in a single Metal command buffer. 420+ dispatches, one sync point. Biggest win in the project.

Before: each layer created its own command buffer, committed it, waited for completion. 28 round-trips to the GPU per token. After: one command buffer, one commit, one wait.

Each dispatch encoding takes about 6 microseconds. Doesn’t sound like much until you’re doing it 420 times per token.
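
On the host side, the single-command-buffer shape looks roughly like this, using standard Metal APIs (the pipeline arrays, binding closure, and dispatch sizes are placeholders, not EdgeRunner's actual code):

import Metal

// Hedged sketch: one command buffer per decoded token, every layer's
// dispatches encoded back to back, a single commit and wait at the end.
// Pipeline arrays, the binding closure, and dispatch sizes are placeholders.
func encodeDecodeStep(queue: MTLCommandQueue,
                      layerPipelines: [[MTLComputePipelineState]],
                      bind: (MTLComputeCommandEncoder, Int, Int) -> Void) {
    guard let commandBuffer = queue.makeCommandBuffer(),
          let encoder = commandBuffer.makeComputeCommandEncoder() else { return }

    for (layerIndex, pipelines) in layerPipelines.enumerated() {
        for (dispatchIndex, pipeline) in pipelines.enumerated() {
            encoder.setComputePipelineState(pipeline)
            bind(encoder, layerIndex, dispatchIndex)   // set buffers and constants
            encoder.dispatchThreadgroups(MTLSize(width: 1, height: 1, depth: 1),
                                         threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
        }
    }
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()   // the single sync point per token
}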

2. Fused Q8_0 dequant + GEMV

Kernels read quantized weights directly. No float32 materialization anywhere in the hot path.

Q8_0 weights are 1 byte per element. Float32 is 4 bytes. That’s a 4x bandwidth reduction. The fused kernel does dequant math inline during the matrix-vector multiply. No separate pass.

We tried Q4_0 for the LM head. Slower. The nibble extraction (AND/SHIFT per element) added so much ALU work that it ate the bandwidth savings. On Apple Silicon, GEMV stays memory-bound only as long as the per-element ALU work is cheap; pile on enough bit twiddling and compute becomes the bottleneck.
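
For reference, the per-block math the fused kernel folds into the GEMV looks roughly like this on the CPU (a sketch of the GGUF Q8_0 layout, not the Metal source):

// CPU reference for the math the fused kernel does inline (a sketch, not the
// Metal source). GGUF Q8_0 stores blocks of 32 weights: one float16 scale
// followed by 32 signed bytes. Dequantized weight = scale * int8 value.
struct Q8_0Block {
    var scale: Float16
    var quants: [Int8]   // exactly 32 entries
}

// Dot product of one quantized weight row with a float activation vector.
// The GPU kernel accumulates the same way per thread; float32 weights are
// never materialized in memory.
func q8Dot(row: [Q8_0Block], x: [Float]) -> Float {
    var acc: Float = 0
    for (blockIndex, block) in row.enumerated() {
        let d = Float(block.scale)
        for i in 0..<32 {
            acc += d * Float(block.quants[i]) * x[blockIndex * 32 + i]
        }
    }
    return acc
}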

3. The mega-kernel

fused_qk_norm_rope_gqa does three things in one dispatch: per-head Q/K RMSNorm, NeoX RoPE rotation, and GQA attention. Replaced 56 dispatches with 1.

32 threads per head. One simdgroup. Each thread handles 4 head dimensions. KV threads compute norm + rotation, write to cache, and exit. Q threads keep going into the GQA phase. No threadgroup barrier between phases. The KV threads just return.

Zero barriers in the attention loop. That’s the point.
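
The dispatch geometry implied by that layout is simple on the host side: one 32-thread threadgroup, a single simdgroup, per head. A sketch, with illustrative parameter names rather than EdgeRunner's actual API:

import Metal

// Hedged sketch of the mega-kernel dispatch geometry: one 32-wide simdgroup
// per head, Q and KV heads in the same grid. KV threads write the cache and
// return early; Q threads carry on into the GQA phase inside the kernel.
func dispatchMegaKernel(encoder: MTLComputeCommandEncoder,
                        pipeline: MTLComputePipelineState,
                        numQHeads: Int,
                        numKVHeads: Int) {
    encoder.setComputePipelineState(pipeline)
    let totalHeads = numQHeads + numKVHeads
    encoder.dispatchThreadgroups(MTLSize(width: totalHeads, height: 1, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 32, height: 1, depth: 1))
}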

4. KV cache with single-token decode

During decode, only the new token gets processed. All previous K/V pairs are cached. Standard practice in every inference engine, but getting it right in Metal took a few tries.

The tricky part: asymmetric Q/KV shapes. Q is 1 token, KV is N tokens. The GQA kernel handles this with kvSeqLen and qOffset parameters.
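
A sketch of the bookkeeping, with illustrative types (only the kvSeqLen and qOffset names come from the post):

// Hedged sketch of single-token decode bookkeeping; the cache layout
// here is illustrative, not EdgeRunner's actual data structure.
struct KVCache {
    var keys: [[Float]] = []     // one K vector per cached token (per layer in practice)
    var values: [[Float]] = []   // one V vector per cached token

    // Decode appends exactly one new K/V pair; earlier tokens are reused.
    mutating func append(key: [Float], value: [Float]) {
        keys.append(key)
        values.append(value)
    }
}

// The asymmetric shape the GQA kernel is told about: Q is one token,
// K/V cover all kvSeqLen cached tokens, and qOffset says where the
// query token sits in the sequence.
struct GQAParams {
    var kvSeqLen: Int32
    var qOffset: Int32
}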

5. Decode warmup

15 dummy decode passes before the timed runs. Pre-heats Metal pipeline states, command processor cache, DRAM page tables.

This one surprised me. I expected noise. Got +16% median throughput instead. The first timed decode has a cold-start penalty from GPU pipeline JIT compilation for decode-specific kernel variants. Prefill warmup doesn’t cover that.
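
The measurement loop is roughly this shape (a sketch; decodeOneToken stands in for the engine's actual decode call):

import Foundation

// Hedged sketch of the warmup schedule. The counts (15 warmup passes,
// 128 timed tokens) are the post's; everything else is illustrative.
func benchmarkDecode(decodeOneToken: () async throws -> Void,
                     warmupPasses: Int = 15,
                     timedTokens: Int = 128) async throws -> Double {
    // Untimed warmup: heats pipeline states, command processor cache, page tables.
    for _ in 0..<warmupPasses {
        try await decodeOneToken()
    }

    // Timed greedy decode.
    let start = Date()
    for _ in 0..<timedTokens {
        try await decodeOneToken()
    }
    let seconds = Date().timeIntervalSince(start)
    return Double(timedTokens) / seconds   // decode tok/s
}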


The progression

Stage                      tok/s   What changed
Baseline                   0.06    First working Qwen3 inference
Weight buffer caching      3.6     Cached MTLBuffers, GPU LM head
Batched command buffers    5.1     Q/K/V in one buffer
Fused GPU pipeline         16      Entire transformer, one sync
Q8_0 fused kernels         63      Dequant inline with GEMV
Kernel fusion              120     RMSNorm+GEMV, Q/K norm+RoPE
Mega-kernel                212     Q/K norm + RoPE + GQA in one

0.06 to 212. 3,500x improvement. Most experiments rolled back. The ones that survived the correctness guard are the ones in the table.


What’s left

MLX hits ~278 tok/s on the same model and chip. We’re at 212. 66 tok/s gap.

The bottlenecks:

  • DRAM bandwidth: ~172 GB/s effective, 43% of M3 Max theoretical. MLX does better.
  • Dispatch count: still 140+ per decode. Metal Indirect Command Buffers could help.
  • Swift async overhead: ~0.23ms per decode call. A non-async forward pass would remove it.

Beating llama.cpp is nice. But MLX is the real target.


Run it yourself

EdgeRunner is open source. Loads GGUF models: Q8_0, Q4_K_M, Q6_K, whatever.

let model = try await LlamaLanguageModel.load(
    from: URL(fileURLWithPath: "Qwen3-0.6B-Q8_0.gguf")
)

github.com/christopherkarani/EdgeRunner

If you’ve got an M1, M2, or M4 Mac:

swift test -c release --filter "PublishableBenchmark/fullBenchmark"

Takes about 5 seconds. I’d love to see numbers from other chips.