An agent would run for hours. It accumulated decisions and facts across turns. Then the context window rolled over or the process restarted and the accumulated state vanished. Or the agent “remembered” by stuffing the entire transcript back into the prompt on every turn. Tokens burned. The one fact from three sessions ago that the current decision depended on stayed missing.
The usual fixes were wrappers. JSONL logs plus a vector store. Or a call to a memory service that ran LLM extraction on every turn and wrote into Postgres plus pgvector plus a graph. The file got corrupted on a hard kill. The sidecar SQLite drifted from the vectors. The cloud service added latency and a hard dependency. Dedup remained a best-effort LLM call that sometimes emitted three versions of the same preference.
The requirement was one file. An agent treats it like a document: AirDrop it, back it up, open it after a hard kill, and recover the exact state, including the pending WAL records, the vector index manifests, the FTS5 segments, and the bitemporal facts with their valid-time and system-time ranges. No network hop. No sidecar database whose state can diverge. No hand-wave about eventual consistency.
Wax is the engine that implements that file.
The File Format
A Wax file stores its own write-ahead log. It is not a bag of documents plus a separate index file. The layout is self-describing.
The layout on disk:
Offset    Region
0 KiB     Header Page A (4 KiB)
4 KiB     Header Page B (4 KiB)
8 KiB     WAL ring buffer (default 256 MiB)
...       Frame payloads (variable, compressed)
...       TOC (BinaryEncoder, up to 64 MiB)
...       Footer (64 bytes)
Two identical header pages exist so that a crash during a header write leaves the previous valid generation intact. On open we read both, validate checksums (SHA-256 of the header with the checksum field zeroed), and pick the one with the highest headerPageGeneration. The footer at the very end contains the magic WAX1FOOT, the TOC length, a SHA-256 of the TOC body, a generation, and the WAL committed sequence number. The TOC hash in the footer must match the TOC’s own self-checksum.
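The selection rule above can be sketched in a few lines. This is a minimal illustration, not Wax's code: `HeaderPageSketch` and its `checksumIsValid` field are hypothetical stand-ins for a decoded page and the result of its SHA-256 check. Only the name `headerPageGeneration` comes from the description above.

```swift
// Hypothetical model of a decoded header page. Only `headerPageGeneration`
// is named in the text; the rest is assumed for illustration.
struct HeaderPageSketch {
    let headerPageGeneration: UInt64
    let checksumIsValid: Bool   // outcome of the SHA-256 validation described above
}

/// Returns the checksum-valid page with the highest generation,
/// or nil when neither page validates (the file cannot be opened).
func selectBestHeaderPage(_ a: HeaderPageSketch, _ b: HeaderPageSketch) -> HeaderPageSketch? {
    return [a, b]
        .filter { $0.checksumIsValid }
        .max { $0.headerPageGeneration < $1.headerPageGeneration }
}
```

A page torn by a crash fails validation, so the older generation wins even though its number is lower.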
Every frame carries a FrameMeta record. It holds the id, timestamps, uri, title, payload offset and length, canonical encoding, checksums, kind, track, tags, labels, metadata dictionary, search text, content dates, role, parent id, supersede links, chunk index, and the active flag.
Payloads land in the WAL first. The orchestrator stages the indexes (FTS5 segments, vector manifests, temporal manifests) and serializes them into the TOC segment catalog only on commit. The commit advances the header generation and the WAL checkpoint.
Hybrid search, RAG assembly, structured memory, and VideoRAG all operate on the same frame records. Those records remain intact after abrupt termination because the WAL and dual-header protocol guarantee it.
WAL Ring and Recovery
The WAL is a fixed-size circular buffer starting at offset 8 KiB. Each record has a 48-byte header: 8-byte monotonic sequence, 4-byte payload length, 4-byte flags (bit 0 = padding), and 32-byte SHA-256 of the payload.
Record types are putFrame, deleteFrame, supersedeFrame, and putEmbedding. When the ring wraps, padding records fill the gap. A sentinel (all-zero header) marks the end of valid data.
Fsync policy is configurable: always, on commit (default), or every N bytes. There is also a proactive commit threshold so that the amount of data at risk in the WAL stays bounded.
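A rough sketch of how the three policies and the proactive threshold could be modeled follows. The type name, case names, and function names are hypothetical, and treating a commit as a sync point under the byte-count policy is an assumption, not something stated about Wax.

```swift
// Hypothetical model of the three fsync policies described above.
enum FsyncPolicySketch {
    case always            // fsync after every WAL append
    case onCommit          // fsync only when a commit is requested (the default)
    case everyBytes(Int)   // fsync once at least N bytes were written since the last sync

    /// Decides whether to fsync after an append.
    /// Assumption: a commit always forces a sync under `.everyBytes`.
    func shouldSync(isCommit: Bool, bytesSinceLastSync: Int) -> Bool {
        switch self {
        case .always:
            return true
        case .onCommit:
            return isCommit
        case .everyBytes(let n):
            return isCommit || bytesSinceLastSync >= n
        }
    }
}

/// Proactive commit check: commit early once the uncommitted WAL bytes
/// reach the configured threshold, so the data at risk stays bounded.
func needsProactiveCommit(pendingWALBytes: Int, threshold: Int) -> Bool {
    return pendingWALBytes >= threshold
}
```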
Recovery is the part most “local vector stores” hand-wave:
- Select the best header page by generation + checksum.
- If a WAL replay snapshot exists in the header (magic WALSNAP1), use its cursors to skip already-committed records.
- Scan forward from the checkpoint (or snapshot position) with WALRingReader.
- Apply the pending mutations to rebuild in-memory indexes and frame state.
- Corrupted pending records are tolerated for position tracking; later valid records are still applied.
- Validate checksums on committed payloads.
We test this path extensively with deliberate fault injection: manually appending uncommitted WAL records and then “crashing” before updating the header or footer, stomping individual header pages with zeros, corrupting checksums in pending records, and truncating the file at various points. In all recoverable cases the store either replays the pending work or correctly falls back to the last known good state. Unrecoverable corruption (bad magic, mismatched TOC hash on committed data, etc.) is rejected with a clear error instead of producing garbage.
The recovery contract is explicit. The WAL ring plus the dual header pages plus the footer scanner produce a deterministic state after a hard kill. The access pattern (append-heavy, occasional correction via supersede, frequent hybrid queries) drove every layout choice.
Concurrency Model
Every major piece is an actor: the low-level Wax coordinator in WaxCore, WaxSession, MemoryOrchestrator, FTS5SearchEngine, the vector engines, the embedding memoizer.
Actors give isolation. The hard part is I/O. Blocking POSIX calls on the actor thread pool would starve everything else. So there is a BlockingIOExecutor that owns a dedicated concurrent DispatchQueue. Reads run concurrently through it. Writes go through a barrier. All file operations (FDFile, pread/pwrite/fsync) are funneled there.
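The read/write split can be illustrated with a concurrent queue and barrier blocks. This is a sketch under assumptions, not the actual BlockingIOExecutor: the class name, queue label, and method names are invented here, and the closures stand in for real pread/pwrite/fsync calls.

```swift
import Dispatch

// Hypothetical sketch: blocking work runs on a dedicated concurrent queue,
// never on the cooperative pool that services actors.
final class BlockingIOExecutorSketch: Sendable {
    private let queue = DispatchQueue(label: "wax.sketch.blocking-io", attributes: .concurrent)

    /// Reads may run concurrently with other reads.
    func read<T: Sendable>(_ work: @escaping @Sendable () -> T) async -> T {
        await withCheckedContinuation { (continuation: CheckedContinuation<T, Never>) in
            queue.async { continuation.resume(returning: work()) }
        }
    }

    /// Writes are submitted as barriers: the queue drains in-flight reads,
    /// runs the write alone, then resumes concurrent execution.
    func write<T: Sendable>(_ work: @escaping @Sendable () -> T) async -> T {
        await withCheckedContinuation { (continuation: CheckedContinuation<T, Never>) in
            queue.async(flags: .barrier) { continuation.resume(returning: work()) }
        }
    }
}
```

The caller suspends instead of blocking, which is the property that keeps actor threads free while a slow fsync is in flight.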
On top of that sit purpose-built locks: AsyncReadWriteLock (actor + continuations, writer priority), ReadWriteLock (pthread for hot paths where async overhead hurts), AsyncMutex, UnfairLock, and FileLock (flock advisory with upgrade/downgrade for cross-process safety).
A writer lease (acquired with policy .fail, .wait, or .timeout) serializes logical mutations. Multiple readers are allowed. The lease is what lets you have a long-running recall while another task is ingesting in the background, without two writers colliding inside the WAL and TOC.
The actor surface stays simple because the I/O, the WAL append, the commit barrier, and the index updates are all routed through the dedicated BlockingIOExecutor and the writer lease. The caller sees ordinary method calls. The durability invariants are enforced underneath.
Frames, Supersede, Dedup, and Bitemporal Structured Memory
A frame is the universal durable unit. When you remember text, it gets chunked (default 400 tokens with 40-token overlap using the pinned BPE tokenizer), optionally embedded, written as one or more frames with parent/child links, indexed in FTS5 and the vector engine, and committed in batches.
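A sliding window over token ids shows what the default 400/40 setting implies: each chunk starts 360 tokens after the previous one. The function below is an illustrative sketch; its name and the handling of the final short chunk are assumptions, and tokenization itself is left out.

```swift
/// Splits a token sequence into windows of `size` tokens, where each window
/// repeats the last `overlap` tokens of the previous one.
/// Hypothetical helper; Wax's own chunker may treat boundaries differently.
func chunkTokens(_ tokens: [Int], size: Int = 400, overlap: Int = 40) -> [ArraySlice<Int>] {
    precondition(size > 0 && overlap >= 0 && overlap < size, "overlap must be smaller than size")
    guard !tokens.isEmpty else { return [] }

    let step = size - overlap          // 360 with the defaults
    var chunks: [ArraySlice<Int>] = []
    var start = 0
    while true {
        let end = min(start + size, tokens.count)
        chunks.append(tokens[start..<end])
        if end == tokens.count { break }   // the last window reached the end of the input
        start += step
    }
    return chunks
}
```

For a 1,000-token document this yields three chunks starting at tokens 0, 360, and 720.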
Supersede lets you correct earlier knowledge without losing history. Frame A can point to frame B as its replacement. Timeline queries and most recall paths exclude superseded frames unless you ask explicitly.
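As a small illustration of the supersede link, the sketch below models a frame with an optional pointer to its replacement. `FrameSketch`, `supersededBy`, and both functions are hypothetical names; the real FrameMeta carries many more fields.

```swift
// Hypothetical, pared-down frame: only the fields needed to show supersede.
struct FrameSketch {
    let id: UInt64
    let text: String
    var supersededBy: UInt64?   // nil while the frame is still current
}

/// Default recall view: superseded frames are hidden unless explicitly requested.
func visibleFrames(_ frames: [FrameSketch], includeSuperseded: Bool = false) -> [FrameSketch] {
    return includeSuperseded ? frames : frames.filter { $0.supersededBy == nil }
}

/// Follows supersede links from `id` to the newest replacement.
/// The `seen` set guards against a malformed cycle.
func resolveLatest(_ id: UInt64, in frames: [UInt64: FrameSketch]) -> UInt64 {
    var current = id
    var seen: Set<UInt64> = []
    while let next = frames[current]?.supersededBy, seen.insert(current).inserted {
        current = next
    }
    return current
}
```

History stays in the file: passing `includeSuperseded: true` returns the full chain, corrections included.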
Content dedup is content-addressed at the document level via ContentHasher (SHA-256 of canonical form) plus embedding identity and chunk count, stored in metadata under wax.content.hash. The orchestrator exposes rememberDedupProbe so a partial ingest can be retried without re-chunking or re-embedding the parts that already landed.
Structured memory lives in the same file. It is a bitemporal entity-fact model. Entities have kinds and normalized aliases. Facts are (subject, predicate, object) where object can be string, int, double, bool, data, timestamp, or entity reference. Every fact carries two half-open time ranges: valid time (when the fact is true in the world) and system time (when the system recorded it). Queries are asOf(systemTime, validTime). Retraction closes the system range. Evidence links every fact back to the exact frame, chunk, and UTF-8 span that produced it. Dedup is by SHA-256 of the triple.
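The asOf query reduces to two half-open range checks. The sketch below assumes integer timestamps and string objects for brevity; `TimeRangeSketch`, `FactSketch`, and the free function `asOf` are illustrative names, not Wax's API.

```swift
// Half-open range [start, end); a nil end means "still open".
struct TimeRangeSketch {
    let start: Int64
    let end: Int64?
    func contains(_ t: Int64) -> Bool {
        guard t >= start else { return false }
        if let end = end { return t < end }
        return true
    }
}

// Hypothetical fact row: a triple plus its valid-time and system-time ranges.
struct FactSketch {
    let subject: String
    let predicate: String
    let object: String
    let valid: TimeRangeSketch    // when the fact is true in the world
    let system: TimeRangeSketch   // when the system held this row as current
}

/// Returns what the system believed at `systemTime` about the world at `validTime`.
func asOf(_ facts: [FactSketch], systemTime: Int64, validTime: Int64) -> [FactSketch] {
    return facts.filter { $0.system.contains(systemTime) && $0.valid.contains(validTime) }
}
```

Retraction fits the same model: closing a row's system range removes it from later asOf queries while earlier system times still see it.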
The structured facts use the identical FTS5 engine, the identical WAL append path, and the identical commit protocol as every other frame. An assertion of “the user switched to dark mode on 2026-03-12” with its source span becomes a bitemporal row whose evidence pointer points back to the exact chunk and UTF-8 range. It survives the same WAL replay and header selection as the transcript frames that produced it.
Hybrid Search and Token-Aware RAG Assembly
Search has four lanes: BM25 (FTS5), vector (Accelerate or MetalANNS), structured facts, and pure timeline fallback.
A small offline RuleBasedQueryClassifier looks at surface cues (“when”, “recent”, “what is”, “how”, “why”) and picks a QueryType. That drives AdaptiveFusionConfig weights for reciprocal rank fusion. The fusion is rank-based with a k=60 constant, weighted per lane, then tie-broken deterministically by best rank then frame id.
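Weighted reciprocal rank fusion with the tie-break described above can be sketched as follows. The function name and the tuple shape for a lane are assumptions; the per-lane weights would come from AdaptiveFusionConfig in the real pipeline, and here they are simply passed in.

```swift
/// Fuses ranked lists of frame ids. Each lane contributes weight / (k + rank)
/// for every id it returned, with rank starting at 1.
/// Ties are broken by best (lowest) rank in any lane, then by frame id.
func fuseRanks(lanes: [(weight: Double, ranked: [UInt64])], k: Double = 60) -> [UInt64] {
    var score: [UInt64: Double] = [:]
    var bestRank: [UInt64: Int] = [:]

    for lane in lanes {
        for (index, id) in lane.ranked.enumerated() {
            let rank = index + 1
            score[id, default: 0] += lane.weight / (k + Double(rank))
            bestRank[id] = min(bestRank[id] ?? rank, rank)
        }
    }

    return score.keys.sorted { a, b in
        let (sa, sb) = (score[a]!, score[b]!)
        if sa != sb { return sa > sb }              // higher fused score first
        let (ra, rb) = (bestRank[a]!, bestRank[b]!)
        if ra != rb { return ra < rb }              // then best single-lane rank
        return a < b                                // then frame id, for determinism
    }
}
```

Because only ranks enter the score, lanes with incomparable raw scores (BM25 versus cosine similarity) can be mixed without normalization.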
After fusion the pipeline applies optional reranking. Term overlap, entity coverage, date bonuses, and distractor penalties (words such as “tentative” or “draft” in the wrong context) adjust the order.
The token budget is applied inside the same orchestrator. The top hit expands up to a cap. Tier-selected surrogates fill the rest: full text, gist, or an MMR sentence-extractive summary. The tier choice uses age, access statistics, and query specificity.
Metadata and time predicates push down. Pending writes are visible on request. A caller can ask for frames tagged “meeting” after last Tuesday and receive only those results.
Why This Is Different From the Usual Options
Mem0 performs LLM fact extraction on each turn and writes the results into a vector store plus an optional Postgres graph. The extraction cost and the durability guarantees are those of the chosen backing stores. No single file contains both the raw transcript and the derived facts with a built-in replay path.
Zep (via Graphiti) stores bi-temporal facts and episodes in a graph engine such as Neo4j together with vector indexes. Both the graph and the vector store are external systems that must be deployed and administered.
Wax records every mutation first in the WAL ring at offset 8 KiB. On commit it advances the header generation, updates the footer with the TOC SHA-256 and the committed WAL sequence, and persists the frame payloads, the FTS5 segments, the vector manifests, and the structured fact rows inside the same file. Recovery selects the higher-generation valid header page, then uses WALRingReader to replay any records after the checkpoint. The same substrate serves raw frames, lexical search, vector search, and bitemporal facts. VideoRAG and PhotoRAG are additional orchestrators on that substrate.
A .wax file carries the structured facts with their evidence pointers and the serialized vector index manifests. Opening the file on another machine replays the WAL from the committed sequence and reconstructs the in-memory indexes from the TOC. Most agent memory systems are not built around a single portable file with its own recovery protocol.
What It Enables
A coding agent keeps the architectural decision from three weeks ago and the exact file that recorded the constraint. The frame that captured the discussion carries its timestamp, the supersede links if the decision changed, and the FTS5 + vector entries that surface it on a recall for “why did we pick this approach”.
A personal knowledge base stays in the Documents folder. iCloud or Git syncs the single file. A query for last month’s HNSW conclusion pulls the exact paragraph, the surrounding chunk, the bitemporal fact that recorded the conclusion date, and the source span that produced it. No service call.
An iOS app bundles the orchestrator and the on-device embedder. The RAG path never leaves the device. The privacy surface for the memory component is the file the user already controls.
The constraints were the point. Apple platforms from iOS 17 and macOS 14. Swift 6 actors. CoreML or Accelerate embedders. The engine had to match the hardware limits and the concurrency model where on-device inference was already the normal path.
The Implementation
Here are the core structures that make the durability guarantees concrete.
WAL Record Header
Every change is first written as a small, checksummed record:
import Foundation // for Data

struct WALRecordHeader {
    let sequence: UInt64       // Monotonically increasing
    let payloadLength: UInt32  // Length of the payload that follows the header
    let flags: UInt32          // bit 0 = padding record
    let checksum: Data         // 32-byte SHA-256 of the payload
}
The header is 48 bytes (8 + 4 + 4 + 32). After the header comes the payload (a put, delete, supersede, or embedding operation). When the ring buffer wraps, we insert padding records so the next real record starts cleanly.
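To make the 48-byte layout concrete, here is a hedged encode/decode sketch. The field order follows the description (sequence, payload length, flags, checksum), but little-endian byte order is an assumption, and `WALRecordHeaderSketch` is a stand-in type that uses a byte array instead of Data to stay dependency-free.

```swift
// Hypothetical fixed-layout header: 8 + 4 + 4 + 32 = 48 bytes.
struct WALRecordHeaderSketch: Equatable {
    static let size = 48
    var sequence: UInt64
    var payloadLength: UInt32
    var flags: UInt32
    var checksum: [UInt8]   // 32-byte SHA-256 of the payload

    var isPadding: Bool { return flags & 1 == 1 }   // bit 0 = padding record

    /// Serializes the header. Little-endian is assumed here, not confirmed.
    func encoded() -> [UInt8] {
        precondition(checksum.count == 32, "checksum must be 32 bytes")
        var out: [UInt8] = []
        out.reserveCapacity(Self.size)
        withUnsafeBytes(of: sequence.littleEndian) { out.append(contentsOf: $0) }
        withUnsafeBytes(of: payloadLength.littleEndian) { out.append(contentsOf: $0) }
        withUnsafeBytes(of: flags.littleEndian) { out.append(contentsOf: $0) }
        out.append(contentsOf: checksum)
        return out
    }

    /// Parses 48 bytes. Returns nil for a wrong length or for the all-zero
    /// sentinel that marks the end of valid data.
    static func decode(_ bytes: [UInt8]) -> WALRecordHeaderSketch? {
        guard bytes.count == size, bytes.contains(where: { $0 != 0 }) else { return nil }
        func littleEndian<T: FixedWidthInteger>(_ range: Range<Int>) -> T {
            var value: T = 0
            for byte in bytes[range].reversed() { value = (value << 8) | T(byte) }
            return value
        }
        return WALRecordHeaderSketch(sequence: littleEndian(0..<8),
                                     payloadLength: littleEndian(8..<12),
                                     flags: littleEndian(12..<16),
                                     checksum: Array(bytes[16..<48]))
    }
}
```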
Dual Headers and Atomic Updates
There are two identical 4 KiB header pages at the start of the file. On every commit we write the new metadata, with a new and higher generation number, to the older page, then fsync. On open we pick the page with the highest generation among those whose checksum validates. A crash in the middle of a header write is harmless: the torn page fails its checksum and we fall back to the previous good generation.
Recovery Flow (Simplified)
// 1. Choose the newest valid header page
let header = selectBestHeaderPage()

// 2. (Optional) fast path using a snapshot stored in the header
if let snapshot = header.replaySnapshot {
    startScanningFrom(snapshot.checkpointPosition)
} else {
    startScanningFrom(header.walCheckpointPosition)
}

// 3. Replay every pending record after the checkpoint
for record in walRingReader.scanPending() {
    // A corrupted pending record still advances the scan position but is never applied
    if record.checksumFails() { continue }
    apply(record) // rebuild TOC, indexes, etc.
}

// 4. Verify the final TOC against the footer
Corrupted records in the uncommitted region are skipped for positioning, but later valid records are still applied. Committed data is never accepted if its checksum is bad.
Key Limits (baked into the format)
- Default WAL size: 256 MiB
- Maximum TOC size: 64 MiB
- Maximum single blob: 256 MiB
- Strings capped at 16 MiB
These numbers are chosen so the entire metadata for a very large agent memory still fits comfortably while keeping recovery fast.
The design is deliberately small and explicit. Most of the sophistication lives in the recovery paths and the careful layering of checksums (WAL record → header → TOC → footer).
The full source is available at https://github.com/christopherkarani/Wax.