ContextCore: GPU-Accelerated Context Memory for On-Device AI Agents
We built ContextCore to manage conversation context on Apple Silicon without cloud or battery drain. Four-tier memory, Metal shaders, 63M chunks/sec, sub-5ms p99.
We built ContextCore to solve a specific problem: on-device AI agents running on Apple Silicon need to manage conversation context within a fixed token budget, without cloud, without killing battery, and without perceptible latency. The approach uses a four-tier memory architecture backed by Metal compute shaders that score, rank, and compress context at 63 million chunks per second, delivering sub-5ms p99 window construction.
The Problem Space
Modern AI agents forget things. As your conversation grows, early turns drop out of the model’s attention window. The standard fix, stuffing more tokens into context, costs money, adds latency, and hits model limits. Vector databases help, but they’re still retrieval over compressed windows the model has to attend to.
The insight driving the design: context management is not a retrieval problem. It’s a scoring and packing problem. Every memory chunk competes for a fixed token budget, and the goal is to select and order chunks so that the model’s attention lands on the most useful content.
LLMs have fixed context windows: 200K for Claude, 128K for GPT-4 Turbo, 8K-32K for on-device models. But the real constraint isn’t the model’s maximum context. It’s the effective context: the portion of the window where the model reliably pays attention. Research by Liu et al. (2024) shows that content at the beginning and end of long contexts receives disproportionate attention, while middle content is frequently underweighted.
So we had to solve two things at once: what to include, and how to order it so important content falls in high-attention positions.
On-Device Constraints
Running on Apple Silicon imposes specific constraints that cloud deployments don’t have:
| Constraint | Implication |
|---|---|
| Thermal budget | Sustained GPU utilization causes throttling. Operations must be fast and efficient. |
| Memory bandwidth | Moving data between CPU and GPU is expensive. Buffer reuse matters. |
| Battery | Background compute is visible to users. Discrete GPU not available. |
| No cloud fallback | Everything must run locally, including embedding models. |
The M2 in a MacBook Air can sustain about 20W. That budget set the target: build the whole window in under 5ms, fast enough that users never notice the overhead.
Why Not Just Use a Vector DB?
Vector databases (Pinecone, Weaviate, Qdrant) are good at what they do: approximate nearest-neighbor retrieval over large embedding spaces. But they solve a different problem than context management. Vector DBs retrieve; they don’t pack. You still need to decide how to fit retrieved chunks into a token budget. They don’t understand memory hierarchy: a fact from last week and a turn from this morning are equally “nearest” to your query, but they should be treated differently. And even with HNSW indexes, the scoring step runs on the CPU, and network round-trips add milliseconds even on localhost.
ContextCore performs scoring, ranking, compression, and packing in a single synchronous pipeline, all on the GPU.
Architecture
Four-Tier Memory
ContextCore maintains four distinct memory tiers, each with different semantics:
Episodic (Tier 1) stores raw conversation turns. This is the primary source of context; every user message and assistant response goes here. Episodic chunks have moderate retention scores that decay over time and get penalized when duplicates are found during consolidation.
Semantic (Tier 2) stores consolidated facts extracted during background consolidation. When episodic chunks are identified as near-duplicates, the shorter one is promoted to semantic memory with a higher retention score. Semantic memory is queried alongside episodic memory but with a longer half-life (90 days vs 7 days).
Procedural (Tier 3) stores tool-usage patterns keyed by task type. If the user has asked the agent to refactor code three times, procedural memory remembers the tool sequence used. This tier is small (1000 patterns max) and is scored by recency and frequency.
Recent Turns (Tier 4) is a pinned buffer of the N most recent turns (configurable, default 3). These are always included in context regardless of relevance scoring. They handle the recency bias that pure relevance scoring doesn’t capture well.
When buildWindow(currentTask:maxTokens:) is called, the current task is embedded and scored against all episodic and semantic chunks:
score_i = relevanceWeight × cosine_similarity(task_embedding, chunk_embedding)
+ recencyWeight × recency_weight(chunk)
Default weights: 70% relevance, 30% recency. Recency is computed as exponential decay with a configurable half-life (default 7 days for episodic, 90 days for semantic):
recency_weight = exp(-ln(2) × age_seconds / halfLifeSeconds)
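The two formulas above can be combined into a short, runnable sketch. This is an illustrative Python translation of the scoring math described in the text, not ContextCore’s actual Swift implementation; the function names are ours.

```python
import math

def recency_weight(age_seconds: float, half_life_seconds: float) -> float:
    # Exponential decay: the weight halves once per half-life.
    return math.exp(-math.log(2) * age_seconds / half_life_seconds)

def blended_score(cos_sim: float, age_seconds: float,
                  half_life_seconds: float = 7 * 86400,  # episodic default: 7 days
                  relevance_w: float = 0.7,              # default weights from the text
                  recency_w: float = 0.3) -> float:
    # 70% relevance, 30% recency; semantic chunks would pass a 90-day half-life.
    return (relevance_w * cos_sim
            + recency_w * recency_weight(age_seconds, half_life_seconds))
```

A chunk exactly one half-life old contributes a recency weight of 0.5, so a perfectly relevant week-old episodic chunk scores 0.7 + 0.3 × 0.5 = 0.85.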
The Attention Centrality Problem
Naive relevance scoring clusters. If the user has been debugging Swift concurrency for the last hour, every relevant chunk will be about actor isolation and sendable conformance. The result: the context window fills with near-duplicate content and the model loses visibility into earlier topics.
ContextCore addresses this with attention centrality, a measure of how representative each chunk is of the conversation as a whole. The token_centrality kernel computes each chunk’s average cosine similarity to all other chunks:
centrality[i] = (1/(n-1)) × Σ cosine_similarity(embeddings[i], embeddings[j]) for all j ≠ i
High centrality means the chunk is similar to many other chunks; it’s generic. During eviction, we remove high-centrality chunks first, since their content is largely duplicated elsewhere in the store. The eviction score blends task relevance (60%) with inverse centrality (40%), so a low score marks a chunk for eviction:
evictionScore = relevance × 0.6 + (1 - normalizedCentrality) × 0.4
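Both formulas are straightforward to sketch. The Python below mirrors the math in the text (the GPU kernel computes the same quantities in parallel); function names are illustrative, not the library’s API.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def centrality(embeddings):
    # Each chunk's average cosine similarity to every *other* chunk.
    n = len(embeddings)
    return [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

def eviction_score(relevance: float, normalized_centrality: float) -> float:
    # Higher score = keep. Generic (high-centrality) chunks score lower
    # and are evicted first, all else being equal.
    return relevance * 0.6 + (1.0 - normalized_centrality) * 0.4
```

With three chunks where two are identical, the duplicated pair gets centrality 0.5 while the outlier gets 0, so the outlier survives eviction longer at equal relevance.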
We arrived at 60/40 through manual tuning across 14 conversation traces spanning debugging sessions, code review, and documentation tasks. It works well for us. It will not suit all workloads; we recommend benchmarking with your specific task distribution.
Packing and Compression
The WindowPacker uses a first-fit-decreasing algorithm:
- System prompt (pinned first)
- Recent turns buffer (pinned last, models attend most to end-of-context)
- Scored memory chunks sorted by blended score
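The packing order above can be sketched as a simple greedy loop. This is a simplified stand-in for WindowPacker (it skips the compression fallback described next, and `count_tokens` is an assumed tokenizer helper, not part of ContextCore):

```python
def pack_window(system_prompt, recent_turns, scored_chunks, max_tokens, count_tokens):
    """First-fit-decreasing sketch: pin the system prompt first and the recent-turns
    buffer last, then fill the middle with chunks in descending score order."""
    # Pinned content is charged against the budget before any memory chunks.
    budget = (max_tokens
              - count_tokens(system_prompt)
              - sum(count_tokens(t) for t in recent_turns))
    middle = []
    for chunk in sorted(scored_chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if cost <= budget:  # first fit: skip what doesn't fit, keep trying smaller chunks
            middle.append(chunk["text"])
            budget -= cost
    return [system_prompt] + middle + recent_turns
```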
If a chunk doesn’t fit in the remaining token budget, ProgressiveCompressor attempts increasingly aggressive compression:
| Level | Target Reduction | Method |
|---|---|---|
| Light | 50% | Keep top-scoring sentences by embedding cosine |
| Heavy | 75% | More aggressive sentence selection |
| Drop | 100% | Remove entirely if nothing fits |
The compression quality depends heavily on the embedding provider. We use Apple’s CoreML-based MiniLM for on-device embedding, which produces 384-dimensional vectors.
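The sentence-selection idea behind the Light and Heavy levels can be sketched as a budgeted keep-the-best filter. The real ProgressiveCompressor scores sentences by embedding cosine against the task; here the scores are taken as given, and the function name and details are illustrative only:

```python
def compress(sentences, scores, keep_ratio):
    """Keep the highest-scoring sentences until roughly keep_ratio of the
    original character length remains; output preserves original order.
    Light ~ keep_ratio=0.5 (50% reduction), Heavy ~ 0.25 (75% reduction)."""
    budget = sum(len(s) for s in sentences) * keep_ratio
    keep, used = set(), 0
    for i in sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True):
        if used + len(sentences[i]) <= budget:
            keep.add(i)
            used += len(sentences[i])
    return [s for i, s in enumerate(sentences) if i in keep]
```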
GPU Kernels
ContextCore ships five Metal compute shader kernels. This section covers the most important one, relevance_score, in detail.
relevance_score: The Hot Path
The relevance scoring kernel is the most frequently executed kernel in the pipeline. It computes cosine similarity between a query vector and N chunk embedding vectors, combines with recency weights, and outputs a score per chunk.
Naive approach: For each chunk, read the query from global memory, compute the dot product. With N chunks and dimension D, that’s N×D global memory reads for the query alone. For 500 chunks and 384 dimensions: 192,000 global reads.
Optimized approach: Load the query into threadgroup shared memory once, then all threads reuse it. The query stays in fast on-chip memory. Global reads drop to D (384).
constant uint relevance_shared_query_capacity = 1024;
threadgroup float sharedQuery[relevance_shared_query_capacity];
if (useSharedQuery) {
    // Threads 0..dim-1 of each threadgroup cooperatively load the query into
    // threadgroup memory (each threadgroup gets its own on-chip copy).
    uint queryIndex = thread_position_in_threadgroup.x;
    if (queryIndex < dim) {
        sharedQuery[queryIndex] = chunks[queryIndex]; // query is at offset 0
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
}
The float4 SIMD optimization then performs 4 elements per instruction:
for (; index + 3 < dim; index += 4) {
const float4 q = float4(sharedQuery[index],
sharedQuery[index + 1],
sharedQuery[index + 2],
sharedQuery[index + 3]);
const float4 c = float4(chunks[base + index],
chunks[base + index + 1],
chunks[base + index + 2],
chunks[base + index + 3]);
dotProduct += dot(q, c);
}
On M2, this kernel hits about 63 million chunks per second at 50K batch size. What that means in practice: scoring 1000 chunks takes 16 microseconds. That’s fast enough that scoring is never the bottleneck.
The other kernels handle recency weighting (compute_recency_weights), attention centrality (token_centrality), cross-attention eviction scoring (cross_attention_score), and the n×n similarity matrix for consolidation (pairwise_similarity).
Implementation
Actor Isolation
AgentContext and all stores are Swift actor types. The public API is nonisolated where safe and actor-isolated where state mutation is required. This prevents data races without manual locking.
public actor AgentContext {
public func append(turn: Turn) async throws
public func buildWindow(currentTask: String, maxTokens: Int? = nil) async throws -> ContextWindow
public func remember(_ fact: String) async throws
public func recall(query: String, k: Int = 5) async throws -> [MemoryChunk]
public func consolidate() async throws
public func checkpoint(to url: URL) async throws
}
Buffer Pooling
Hitting 4.89ms p99 requires no malloc on the critical path. ScoringEngine maintains a reusable buffer pool; the same buffers are reused across invocations. Only when the working set grows beyond capacity does a new allocation occur.
Embedding Cache
Embedding the same text twice is wasteful. A SHA256-keyed LRU cache (512 entries default) avoids redundant CoreML inference.
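The cache is a standard LRU keyed by a content hash. A minimal sketch, assuming an injected `embed` callable standing in for the CoreML inference call (class and method names are ours, not ContextCore's API):

```python
import hashlib
from collections import OrderedDict

class EmbeddingCache:
    """SHA256-keyed LRU: identical text never hits the embedding model twice
    while it stays within capacity."""
    def __init__(self, capacity=512, embed=None):
        self.capacity = capacity
        self.embed = embed            # the expensive call being avoided
        self._cache = OrderedDict()   # insertion order doubles as recency order

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        vec = self.embed(text)
        self._cache[key] = vec
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # evict least recently used
        return vec
```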
Evaluation
Results
End-to-End buildWindow Latency:
| Turns | Budget | p50 | p95 | p99 |
|---|---|---|---|---|
| 10 | 2048 | 1.21ms | 1.89ms | 2.31ms |
| 50 | 4096 | 1.87ms | 2.94ms | 3.52ms |
| 200 | 4096 | 2.74ms | 3.88ms | 4.41ms |
| 500 | 8192 | 3.42ms | 4.31ms | 4.89ms |
GPU vs CPU Scoring (50K chunks, 384 dims):
| Implementation | Latency | Throughput | Speedup |
|---|---|---|---|
| CPU (Accelerate) | 142.31ms | 351K chunks/s | 1× |
| GPU (Metal) | 0.79ms | 63.3M chunks/s | 180× |
Consolidation Latency (2000 chunks):
| Operation | p50 | p95 | p99 |
|---|---|---|---|
| Full consolidation | 10.2ms | 13.8ms | 15.61ms |
Recall Quality (7 relevant / 50 total):
| Metric | Score |
|---|---|
| Precision@3 | 0.67 |
| Precision@5 | 0.80 |
| Precision@8 | 0.88 |
Latency Breakdown
For the 500-turn, 8192-token case at p99:
| Component | Time |
|---|---|
| CPU→GPU buffer write | ~0.1ms |
| GPU kernel dispatch | ~0.05ms |
| GPU execution (all kernels) | ~0.5–2ms |
| Command buffer wait | ~0.1–0.5ms |
| Output readback | ~0.1–0.3ms |
| Sort (top-K) | ~0.1ms |
| Total | under 5ms p99 |
The GPU is underutilized; most time is spent in buffer management and synchronization, not actual compute. This is the most interesting finding: the pipeline is bound by memory traffic and synchronization rather than FLOPs, so FP16, which halves memory traffic in addition to doubling peak FLOPs on Apple Silicon, would deliver a real gain. We didn’t get there at launch because FP16 precision in the similarity matrix needed more validation.
Limitations
FP16. Every kernel is float32. Apple Silicon has 2× the FLOPs for float16. We stuck with float32 because the pairwise similarity matrix in consolidation accumulates floating-point errors across the n×n computation, and we didn’t have time to validate FP16 precision there. This is the most obvious optimization left on the table.
Top-K kernel. The topk_indices kernel is single-threaded O(k×n) greedy. For k=8–16 (the normal case), it’s fine. For larger k or larger candidate sets, a bitonic sort would be the right answer.
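For reference, the greedy O(k×n) selection the kernel performs is equivalent to this sketch (k sequential passes, each picking the best remaining index):

```python
def topk_indices(scores, k):
    """Greedy top-k: k passes over the remaining candidates, O(k*n) total,
    mirroring the single-threaded kernel described above."""
    remaining = set(range(len(scores)))
    out = []
    for _ in range(min(k, len(scores))):
        best = max(remaining, key=lambda i: scores[i])
        out.append(best)
        remaining.remove(best)
    return out
```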
Norm caching. The token_centrality kernel recomputes the L2 norm from scratch for every pair. A two-pass approach would cut that in half.
Memory footprint. Currently all episodic embeddings must fit in GPU memory for scoring. Sessions with 10K+ turns will exceed reasonable GPU memory budgets. An approximate-nearest-neighbor index over the store (e.g., HNSW) would let us score over much larger stores without materializing everything.
Try It
dependencies: [
.package(url: "https://github.com/christopherkarani/ContextCore.git", from: "1.0.0")
]
Requires iOS 17+, macOS 14+, visionOS 1.0+, Apple Silicon with Metal.