ContextCore: GPU-Accelerated Context Memory for On-Device AI Agents
We built ContextCore to manage conversation context on Apple Silicon without cloud or battery drain. Four-tier memory, Metal shaders, 63M chunks/sec, sub-5ms p99.
We built ContextCore to solve a specific problem: on-device AI agents running on Apple Silicon need to manage conversation context within a fixed token budget, without cloud, without killing battery, and without perceptible latency. The approach uses a four-tier memory architecture backed by Metal compute shaders that score, rank, and compress context at 63 million chunks per second, delivering sub-5ms p99 window construction.
The Problem Space
Modern AI agents forget things. As your conversation grows, early turns drop out of the model’s attention window. The standard fix, stuffing more tokens into context, costs money, adds latency, and hits model limits. Vector databases help, but they’re still retrieval over compressed windows the model has to attend to.
The insight driving the design: context management is not a retrieval problem. It’s a scoring and packing problem. Every memory chunk competes for a fixed token budget, and the goal is to select and order chunks so that the model’s attention lands on the most useful content.
LLMs have fixed context windows: 200K for Claude, 128K for GPT-4 Turbo, 8K-32K for on-device models. But the real constraint isn’t the model’s maximum context. It’s the effective context: the portion of the window where the model reliably pays attention. Research by Liu et al. (2024) shows that content at the beginning and end of long contexts receives disproportionate attention, while middle content is frequently underweighted.
So we had to solve two things at once: what to include, and how to order it so important content falls in high-attention positions.
On-Device Constraints
Running on Apple Silicon imposes specific constraints that cloud deployments don’t have:
| Constraint | Implication |
|---|---|
| Thermal budget | Sustained GPU utilization causes throttling. Operations must be fast and efficient. |
| Memory bandwidth | Moving data between CPU and GPU is expensive. Buffer reuse matters. |
| Battery | Background compute is visible to users. Discrete GPU not available. |
| No cloud fallback | Everything must run locally, including embedding models. |
The M2 in a MacBook Air can sustain about 20W. That budget set the target: build the whole window in under 5ms, fast enough that users never notice the overhead.
Why Not Just Use a Vector DB?
Vector databases (Pinecone, Weaviate, Qdrant) are good at what they do: approximate nearest-neighbor retrieval over large embedding spaces. But they solve a different problem than context management. Vector DBs retrieve; they don’t pack. You still need to decide how to fit retrieved chunks into a token budget. They don’t understand memory hierarchy: a fact from last week and a turn from this morning are equally “nearest” to your query, but they should be treated differently. And even with HNSW indexes, the scoring step runs on the CPU, and network round-trips add milliseconds even on localhost.
ContextCore performs scoring, ranking, compression, and packing in a single synchronous pipeline, all on the GPU.
Architecture
Four-Tier Memory
ContextCore maintains four distinct memory tiers, each with different semantics:
Episodic (Tier 1) stores raw conversation turns. This is the primary source of context; every user message and assistant response goes here. Episodic chunks have moderate retention scores that decay over time and get penalized when duplicates are found during consolidation.
Semantic (Tier 2) stores consolidated facts extracted during background consolidation. When episodic chunks are identified as near-duplicates, the shorter one is promoted to semantic memory with a higher retention score. Semantic memory is queried alongside episodic memory but with a longer half-life (90 days vs 7 days).
Procedural (Tier 3) stores tool-usage patterns keyed by task type. If the user has asked the agent to refactor code three times, procedural memory remembers the tool sequence used. This tier is small (1000 patterns max) and is scored by recency and frequency.
Recent Turns (Tier 4) is a pinned buffer of the N most recent turns (configurable, default 3). These are always included in context regardless of relevance scoring. They handle the recency bias that pure relevance scoring doesn’t capture well.
When buildWindow(currentTask:maxTokens:) is called, the current task is embedded and scored against all episodic and semantic chunks:
score_i = relevanceWeight × cosine_similarity(task_embedding, chunk_embedding)
+ recencyWeight × recency_weight(chunk)
Default weights: 70% relevance, 30% recency. Recency is computed as exponential decay with a configurable half-life (default 7 days for episodic, 90 days for semantic):
recency_weight = exp(-ln(2) × age_seconds / halfLifeSeconds)
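The two formulas above can be combined into a short, runnable sketch. This is an illustrative Python translation of the scoring math described in the text, not ContextCore’s actual Swift implementation; the function names are ours.

```python
import math

def recency_weight(age_seconds: float, half_life_seconds: float) -> float:
    # Exponential decay: the weight halves once per half-life.
    return math.exp(-math.log(2) * age_seconds / half_life_seconds)

def blended_score(cos_sim: float, age_seconds: float,
                  half_life_seconds: float = 7 * 86400,  # episodic default: 7 days
                  relevance_w: float = 0.7,              # default weights from the text
                  recency_w: float = 0.3) -> float:
    # 70% relevance, 30% recency; semantic chunks would pass a 90-day half-life.
    return (relevance_w * cos_sim
            + recency_w * recency_weight(age_seconds, half_life_seconds))
```

A chunk exactly one half-life old contributes a recency weight of 0.5, so a perfectly relevant week-old episodic chunk scores 0.7 + 0.3 × 0.5 = 0.85.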
The Attention Centrality Problem
Naive relevance scoring clusters. If the user has been debugging Swift concurrency for the last hour, every relevant chunk will be about actor isolation and sendable conformance. The result: the context window fills with near-duplicate content and the model loses visibility into earlier topics.
ContextCore addresses this with attention centrality, a measure of how representative each chunk is of the conversation as a whole. The token_centrality kernel computes each chunk’s average cosine similarity to all other chunks:
centrality[i] = (1/(n-1)) × Σ cosine_similarity(embeddings[i], embeddings[j]) for all j ≠ i
High centrality means the chunk is similar to many other chunks; it’s generic. During eviction, we remove high-centrality chunks first, since their content is largely duplicated elsewhere in the store. The eviction score blends task relevance (60%) with inverse centrality (40%), so a low score marks a chunk for eviction:
evictionScore = relevance × 0.6 + (1 - normalizedCentrality) × 0.4
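Both formulas are straightforward to sketch. The Python below mirrors the math in the text (the GPU kernel computes the same quantities in parallel); function names are illustrative, not the library’s API.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centrality(embeddings):
    # Each chunk's average cosine similarity to every *other* chunk.
    n = len(embeddings)
    return [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def eviction_score(relevance: float, normalized_centrality: float) -> float:
    # Higher score = keep. Generic (high-centrality) chunks score lower
    # and are evicted first, all else being equal.
    return relevance * 0.6 + (1.0 - normalized_centrality) * 0.4
```

With three chunks where two are identical, the duplicated pair gets centrality 0.5 while the outlier gets 0, so the outlier survives eviction longer at equal relevance.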
We arrived at 60/40 through manual tuning across 14 conversation traces spanning debugging sessions, code review, and documentation tasks. It works well for us. It will not suit all workloads; we recommend benchmarking with your specific task distribution.
Packing and Compression
The WindowPacker uses a first-fit-decreasing algorithm:
- System prompt (pinned first)
- Recent turns buffer (pinned last, models attend most to end-of-context)
- Scored memory chunks sorted by blended score
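The packing order above can be sketched as a simple greedy loop. This is a simplified stand-in for WindowPacker (it skips the compression fallback described next, and `count_tokens` is an assumed tokenizer helper, not part of ContextCore):

```python
def pack_window(system_prompt, recent_turns, scored_chunks, max_tokens, count_tokens):
    """First-fit-decreasing sketch: pin the system prompt first and the recent-turns
    buffer last, then fill the middle with chunks in descending score order."""
    # Pinned content is charged against the budget before any memory chunks.
    budget = (max_tokens
              - count_tokens(system_prompt)
              - sum(count_tokens(t) for t in recent_turns))
    middle = []
    for chunk in sorted(scored_chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if cost <= budget:  # first fit: skip what doesn't fit, keep trying smaller chunks
            middle.append(chunk["text"])
            budget -= cost
    return [system_prompt] + middle + recent_turns
```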
If a chunk doesn’t fit in the remaining token budget, ProgressiveCompressor attempts increasingly aggressive compression:
| Level | Target Reduction | Method |
|---|---|---|
| Light | 50% | Keep top-scoring sentences by embedding cosine |
| Heavy | 75% | More aggressive sentence selection |
| Drop | 100% | Remove entirely if nothing fits |
The compression quality depends heavily on the embedding provider. We use Apple’s CoreML-based MiniLM for on-device embedding, which produces 384-dimensional vectors.
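The sentence-selection idea behind the Light and Heavy levels can be sketched as a budgeted keep-the-best filter. The real ProgressiveCompressor scores sentences by embedding cosine against the task; here the scores are taken as given, and the function name and details are illustrative only:

```python
def compress(sentences, scores, keep_ratio):
    """Keep the highest-scoring sentences until roughly keep_ratio of the
    original character length remains; output preserves original order.
    Light ~ keep_ratio=0.5 (50% reduction), Heavy ~ 0.25 (75% reduction)."""
    budget = sum(len(s) for s in sentences) * keep_ratio
    keep, used = set(), 0
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if used + len(sentences[i]) <= budget:
            keep.add(i)
            used += len(sentences[i])
    return [s for i, s in enumerate(sentences) if i in keep]
```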
GPU Kernels
ContextCore ships five Metal compute shader kernels. This section covers the most important one, relevance_score, in detail.
relevance_score: The Hot Path
The relevance scoring kernel is the most frequently executed kernel in the pipeline. It computes cosine similarity between a query vector and N chunk embedding vectors, combines with recency weights, and outputs a score per chunk.
Naive approach: For each chunk, read the query from global memory, compute the dot product. With N chunks and dimension D, that’s N×D global memory reads for the query alone. For 500 chunks and 384 dimensions: 192,000 global reads.
Optimized approach: Load the query into threadgroup shared memory once, then all threads reuse it. The query stays in fast on-chip memory. Global reads drop to D (384).
constant uint relevance_shared_query_capacity = 1024;
threadgroup float sharedQuery[relevance_shared_query_capacity];
if (useSharedQuery) {
    // Threads 0..dim-1 of each threadgroup cooperatively load the query into
    // threadgroup memory (each threadgroup gets its own on-chip copy).
    uint queryIndex = thread_position_in_threadgroup.x;
    if (queryIndex < dim) {
        sharedQuery[queryIndex] = chunks[queryIndex]; // query is at offset 0
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
}
The float4 SIMD optimization then performs 4 elements per instruction:
for (; index + 3 < dim; index += 4) {
const float4 q = float4(sharedQuery[index],
sharedQuery[index + 1],
sharedQuery[index + 2],
sharedQuery[index + 3]);
const float4 c = float4(chunks[base + index],
chunks[base + index + 1],
chunks[base + index + 2],
chunks[base + index + 3]);
dotProduct += dot(q, c);
}
On M2, this kernel hits about 63 million chunks per second at 50K batch size. What that means in practice: scoring 1000 chunks takes 16 microseconds. That’s fast enough that scoring is never the bottleneck.
The other kernels handle recency weighting (compute_recency_weights), attention centrality (token_centrality), cross-attention eviction scoring (cross_attention_score), and the n×n similarity matrix for consolidation (pairwise_similarity).
Implementation
Actor Isolation
AgentContext and all stores are Swift actor types. The public API is nonisolated where safe and actor-isolated where state mutation is required. This prevents data races without manual locking.
public actor AgentContext {
public func append(turn: Turn) async throws
public func buildWindow(currentTask: String, maxTokens: Int? = nil) async throws -> ContextWindow
public func remember(_ fact: String) async throws
public func recall(query: String, k: Int = 5) async throws -> [MemoryChunk]
public func consolidate() async throws
public func checkpoint(to url: URL) async throws
}
Buffer Pooling
Hitting 4.89ms p99 requires no malloc on the critical path. ScoringEngine maintains a reusable buffer pool; the same buffers are reused across invocations. Only when the working set grows beyond capacity does a new allocation occur.
Embedding Cache
Embedding the same text twice is wasteful. A SHA256-keyed LRU cache (512 entries default) avoids redundant CoreML inference.
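The cache is a standard LRU keyed by a content hash. A minimal sketch, assuming an injected `embed` callable standing in for the CoreML inference call (class and method names are ours, not ContextCore's API):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """SHA256-keyed LRU: identical text never hits the embedding model twice
    while it stays within capacity."""
    def __init__(self, capacity=512, embed=None):
        self.capacity = capacity
        self.embed = embed            # the expensive call being avoided
        self._cache = OrderedDict()   # insertion order doubles as recency order

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        vec = self.embed(text)
        self._cache[key] = vec
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return vec
```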
Evaluation
Results
End-to-End buildWindow Latency:
| Turns | Budget | p50 | p95 | p99 |
|---|---|---|---|---|
| 10 | 2048 | 1.21ms | 1.89ms | 2.31ms |
| 50 | 4096 | 1.87ms | 2.94ms | 3.52ms |
| 200 | 4096 | 2.74ms | 3.88ms | 4.41ms |
| 500 | 8192 | 3.42ms | 4.31ms | 4.89ms |
GPU vs CPU Scoring (50K chunks, 384 dims):
| Implementation | Latency | Throughput | Speedup |
|---|---|---|---|
| CPU (Accelerate) | 142.31ms | 351K chunks/s | 1× |
| GPU (Metal) | 0.79ms | 63.3M chunks/s | 180× |
Consolidation Latency (2000 chunks):
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Full consolidation | 10.2ms | 13.8ms | 15.61ms |
Recall Quality (7 relevant / 50 total):
| Metric | Score |
|---|---|
| Precision@3 | 0.67 |
| Precision@5 | 0.80 |
| Precision@8 | 0.88 |
Latency Breakdown
For the 500-turn, 8192-token case at p99:
| Component | Time |
|---|---|
| CPU→GPU buffer write | ~0.1ms |
| GPU kernel dispatch | ~0.05ms |
| GPU execution (all kernels) | ~0.5–2ms |
| Command buffer wait | ~0.1–0.5ms |
| Output readback | ~0.1–0.3ms |
| Sort (top-K) | ~0.1ms |
| Total | under 5ms p99 |
The GPU is underutilized; most time is spent in buffer management and synchronization, not actual compute. This is the most interesting finding: the pipeline is bound by memory traffic and synchronization rather than FLOPs, so FP16, which halves memory traffic in addition to doubling peak FLOPs on Apple Silicon, would deliver a real gain. We didn’t get there at launch because FP16 precision in the similarity matrix needed more validation.
Limitations
FP16. Every kernel is float32. Apple Silicon has 2× the FLOPs for float16. We stuck with float32 because the pairwise similarity matrix in consolidation accumulates floating-point errors across the n×n computation, and we didn’t have time to validate FP16 precision there. This is the most obvious optimization left on the table.
Top-K kernel. The topk_indices kernel is single-threaded O(k×n) greedy. For k=8–16 (the normal case), it’s fine. For larger k or larger candidate sets, a bitonic sort would be the right answer.
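For reference, the greedy O(k×n) selection the kernel performs is equivalent to this sketch (k sequential passes, each picking the best remaining index):

```python
def topk_indices(scores, k):
    """Greedy top-k: k passes over the remaining candidates, O(k*n) total,
    mirroring the single-threaded kernel described above."""
    remaining = set(range(len(scores)))
    out = []
    for _ in range(min(k, len(scores))):
        best = max(remaining, key=lambda i: scores[i])
        out.append(best)
        remaining.remove(best)
    return out
```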
Norm caching. The token_centrality kernel recomputes the L2 norm from scratch for every pair. A two-pass approach would cut that in half.
Memory footprint. Currently all episodic embeddings must fit in GPU memory for scoring. Sessions with 10K+ turns will exceed reasonable GPU memory budgets. An approximate-nearest-neighbor index over the store (e.g., HNSW) would let us score over much larger stores without materializing everything.
Try It
dependencies: [
.package(url: "https://github.com/christopherkarani/ContextCore.git", from: "1.0.0")
]
Requires iOS 17+, macOS 14+, visionOS 1.0+, Apple Silicon with Metal.