Notes from building the Swift web tool I wanted Apple’s 3B Foundation Model to actually be able to use.

When I started building Swarm’s web tool, my test target was Apple’s 3B Foundation Model with a hard 4,096-token context window. Every design decision in the tool is downstream of that one constraint.

Most web-search tools for LLMs assume you have 100K+ tokens to spend. Mine couldn't. If I wanted a 3B on-device model to run real research queries — search, fetch, cite, follow up — I had to fit the whole tool output, plus the system prompt, plus the user's question, plus the answer, into 4K. One tool result gets maybe 1,200 of those tokens. Everything else is already spoken for.

So I named a field final4KAnswer. Not shortAnswer. Not compactResponse. I wanted the name to be a promise to anyone reading the code six months later: this field is the thing a 4K model reads. Here’s the struct, from Sources/Swarm/Tools/Web/WebSearchSupport.swift:245–292:

```swift
public struct WebSearchEnvelope: Codable, Sendable, Equatable {
    public var mode: String
    public var summary: String
    public var final4KAnswer: String
    public var semanticCore: String?
    // ...
}
```

The rest of this post is about why every other line looks the way it does. I’m walking through the build decisions, not just the surface — what I tried, what I rejected, and the specific numbers I settled on.


The constraint I started from

Here’s the math that determined almost everything.

| Context size | Typical home | Bytes (~4 chars/token) | What fits |
|---|---|---|---|
| 4K tokens | Apple Foundation Models, small quantized MLX | ~16 KB | System prompt + 1 tool result + 1 short answer |
| 8K tokens | Mid-size MLX on M-series | ~32 KB | System + 2 tool results + answer |
| 200K tokens | Claude / GPT-4-class cloud | ~800 KB | Entire books |

On a 4K model one web result is the budget. If my tool dumped 20 KB of HTML, I’d have nothing left for the agent to think with. So the tool had exactly one job: return something useful and small.

“Useful” meant ranked, grounded, cited, deduplicated. “Small” meant under ~1.2 KB. Most tools I looked at picked one. I wanted both. The rest of this post is the specific moves I made to get there.

★ Insight

The conventional on-device RAG stack starts by loading an embedding model into memory. I started by refusing to. Every time someone suggested adding vector search “to make recall sharper” I said no, because the embedding model itself would eat more RAM than I could spare on a device already running a 3B LLM.


Why I made the envelope a tiered data structure

I didn’t want a single “result” blob. I wanted the output to know who was reading it.

Here’s the full envelope:

```swift
public struct WebSearchEnvelope: Codable, Sendable, Equatable {
    public var mode: String
    public var summary: String
    public var final4KAnswer: String
    public var semanticCore: String?
    public var hits: [WebSearchHit]
    public var artifact: WebArtifactRecord?
    public var normalizedDocument: NormalizedWebDocument?
    public var sectionChunks: [WebSectionChunk]
    public var groundedEvidence: GroundedEvidence?
    public var citations: [CitationRecord]
    public var artifactRefs: [String]
    public var bundle: EvidenceBundleRecord?
    public var cacheStatus: String
    public var rawArtifactRef: String?
}
```

Fourteen fields. Grouped by who reads them:

| Tier | Fields | Typical bytes | Who reads it |
|---|---|---|---|
| 4K | final4KAnswer, summary | ~1.2 KB | Foundation Models, small MLX |
| 8K+ | semanticCore, hits | ~3–4 KB | Mid-size MLX |
| 32K+ | sectionChunks, citations, groundedEvidence | ~12 KB | Haiku, GPT-4o-mini |
| Metadata | artifact, bundle, artifactRefs, cacheStatus, rawArtifactRef | varies | Downstream tool calls, not LLMs directly |

At call time the tool picks which tiers land in the string the agent reads. The rest stays on disk, addressable by ID, recoverable via expand without re-fetching.

The enforcement is in boundedEnvelope at WebSearchSupport.swift:778:

```swift
let charBudget = max(configuration.contextProfile.maxToolOutputTokens * 4, 512)

bounded.summary       = trimmedSnippet(bounded.summary,       limit: min(320, charBudget / 4))
bounded.final4KAnswer = trimmedSnippet(bounded.final4KAnswer, limit: min(900, charBudget / 2))
bounded.semanticCore  = bounded.semanticCore.map { trimmedSnippet($0, limit: min(700, charBudget / 3)) }
```

Four lines. Every one is a decision I can defend.

Line 778: budget declared by the agent. Token count times four gives me characters. (Four chars per token is a rough heuristic for English. Close enough. Good enough for the math I need.)

Lines 780–782: the three string tiers get fractions of the budget. Summary gets a quarter. final4KAnswer gets half. semanticCore gets a third.

Run the math on a 4K agent allocating 1,024 tokens to a tool call. charBudget = 4096. Summary caps at 320 chars. final4KAnswer caps at 900. semanticCore caps at 700. Total: 1,920 characters across three tiers that deliberately fit together without overflowing.

I picked 900 for final4KAnswer because that’s about the length of a paragraph a 3B model can actually reason over while still leaving room for the answer it’s about to produce. I picked 320 for summary because a one-sentence abstract shouldn’t be longer than that. I picked 700 for semanticCore because three short snippets average out to about that much — and I needed it to coexist with the other two without blowing the budget.
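
The bounder leans on trimmedSnippet throughout, which isn't shown above. Here's a minimal sketch of the contract I depend on, assuming the shipped version behaves roughly like this: stay under the limit, prefer a word boundary, make the cut visible.

```swift
// Sketch of trimmedSnippet's contract, not the shipped implementation:
// stay under `limit` characters, break on whitespace where possible,
// and mark the truncation with an ellipsis.
func trimmedSnippet(_ text: String, limit: Int) -> String {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    guard trimmed.count > limit else { return trimmed }
    let cut = trimmed.prefix(limit)
    // Back up to the last whitespace so we never end mid-word.
    if let lastSpace = cut.lastIndex(where: \.isWhitespace) {
        return String(cut[..<lastSpace]) + "…"
    }
    return String(cut) + "…"
}
```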


The Detail knob

Strings aren’t the only output. The envelope also carries arrays — hits, sections, citations. For those I wrote a separate dial.

Detail enum, at Sources/Swarm/Tools/WebSearchTool.swift:30:

```swift
public enum Detail: String, Codable, Sendable, Equatable, CaseIterable {
    case compact
    case standard
    case deep
    case raw
}
```

Four levels. I mapped them like this:

| Detail | Fields included | Typical size | Lives on |
|---|---|---|---|
| compact | final4KAnswer + summary + top 5 hit titles/snippets | ~1.2 KB | 4K |
| standard | + up to 3 section chunks | ~4 KB | 8K |
| deep | + all section chunks + citations + grounded evidence | ~12 KB | 32K+ |
| raw | + full normalized document | ~30 KB+ | 100K+ |

The line I’m most careful with is the one that enforces how many things get returned, not how long each thing is:

```swift
// Sources/Swarm/Tools/Web/WebSearchSupport.swift:798
let maxSections = switch detail {
case .compact:
    0
case .standard:
    min(3, configuration.maxEvidenceSections)
case .deep, .raw:
    configuration.maxEvidenceSections
}
```

compact returns zero section chunks. Not “a few short ones.” Zero.

I rejected the version of this code where compact returned one or two short sections. The first few iterations had it at three, then two, then one. I kept cutting because every time I watched the tool run on the 3B model, those sections were crowding out the answer. Eventually I realized the right number was zero. A 4K agent doesn’t want shorter sections. It wants no sections and one really good paragraph in final4KAnswer instead.


Why I chose Wax (and turned vector search off)

Once the envelope shape was settled, I had to decide where to put everything the tool wasn’t returning. That’s where Wax comes in.

Wax is a file-backed graph memory library I use throughout Swarm (Package.swift pin: Wax 0.1.19, exact). The web tool’s wrapper is an actor, WaxWebArtifactStore, at WebSearchSupport.swift:881–1207. Three decisions shaped it.

Decision 1: enableVectorSearch = false

```swift
// WebSearchSupport.swift:904–906
var waxConfig = Wax.Memory.Config.default
waxConfig.enableVectorSearch = false
memory = try await Wax.Memory(at: indexURL, config: waxConfig)
```

I set this flag to false in four different places: WebSearchSupport.swift:905 and 1128, plus WebSearchEvidence.swift:469 and 556. Defense in depth. If six months from now someone copy-pastes Wax init code from another part of Swarm that does use vectors, I want every call site in the web tool to override it explicitly.

The reasoning is simple. Vector search requires an embedding model loaded in RAM, and that's a whole second model. On a phone already running a 3B LLM, adding another model just to compute similarities would kill me on memory pressure.

So Wax runs in text-only mode. Lexical match plus whatever semantic primitives Wax offers without embeddings. Not as sharp as cosine similarity on real embeddings. Good enough in the Pareto sense. And zero marginal memory.

Every on-device RAG stack I looked at before building this started by loading an embedding model. I started by refusing to. It’s the single decision I’m most confident was right, and the one I still get the most pushback on.

Decision 2: Index sections, not pages

When a document gets fetched, my WebContentExtractor (WebSearchSupport.swift:1410–1534) breaks it into WebSectionChunk records:

| Field | Why I added it |
|---|---|
| id | Stable reference from envelopes and bundles |
| artifactID | Back-pointer to the fetched document |
| heading | Retrieval by topic rather than page |
| text | The actual section content |
| index | Order within the source document |
| pageType | Feeds per-type freshness scoring |
| citations | Links to original source |

Wax indexes sections, not pages. Recall returns 500-byte sections, not 20 KB pages. A 4K agent can read a section. It cannot read a page. That asymmetry is the whole point — the retrievable unit has to match the budget of the caller, or the retrieval is useless.
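
Reconstructed from the field table above, the record looks roughly like this. The exact declarations for WebPageType and CitationRecord live in the source; treat this as a sketch of the shape, not the shipped definition.

```swift
// The section-level record Wax indexes, reconstructed from the field
// table above. WebPageType and CitationRecord are defined elsewhere in
// the source; this sketch only shows the shape.
public struct WebSectionChunk: Codable, Sendable, Equatable {
    public var id: String                  // stable reference from envelopes and bundles
    public var artifactID: String          // back-pointer to the fetched document
    public var heading: String             // retrieval by topic rather than page
    public var text: String                // the actual section content
    public var index: Int                  // order within the source document
    public var pageType: WebPageType       // feeds per-type freshness scoring
    public var citations: [CitationRecord] // links to the original source
}
```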

Decision 3: Bundles as reference types

For ground mode — search, fetch, extract cited evidence — I persist the whole result set as an EvidenceBundleRecord:

```swift
public struct EvidenceBundleRecord: Codable, Sendable {
    public var bundleID: String
    public var query: String
    public var artifactIDs: [String]
    public var sectionIDs: [String]
    public var summary: String
    public var createdAt: Date
    public var updatedAt: Date
}
```

Note what the bundle doesn’t store: the content. It stores IDs pointing to content. Next turn, the agent says mode: .expand with the bundle ID, and the tool rebuilds output from stored artifacts and sections at whatever Detail level the agent’s current budget supports.

The design I’m emulating is reference types in a programming language. You pass around a handle. The handle is tiny. The value lives somewhere else. The LLM talks about bundles by ID. Everything that happens on the other side of that handle — disk reads, section materialization, tiering — is invisible to the model.
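
To make the handle concrete, here's roughly what a follow-up turn looks like from the tool's side. WebToolRequest's exact fields aren't shown in this post, so the initializer here is illustrative.

```swift
// Illustrative only: WebToolRequest's real initializer isn't shown in
// this post. The point is the payload: the agent sends a bundle ID,
// not the bundle's content.
let followUp = WebToolRequest(
    mode: .expand,          // rebuild output from stored artifacts
    bundleID: "b7f3c2e1",   // the tiny handle from a prior turn
    detail: .standard       // whatever the current budget supports
)
// `configuration` is the same WebSearchTool.Configuration the agent
// was set up with.
let envelope = try await WebToolRuntime.shared.execute(
    request: followUp,
    configuration: configuration
)
```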

The whole substrate, drawn

```mermaid
flowchart LR
  A[HTTP Response] --> B[WebContentExtractor]
  B --> C["WebSectionChunk per heading"]
  C --> D["Wax text-only index (web-index.wax)"]
  C --> E["Artifact manifests (artifacts dir)"]
  C --> F["Raw bytes (raw dir)"]
  D --> G{recall}
  E --> H{expand}
  F --> I{raw detail}
  G --> J[WebSearchEnvelope]
  H --> J
  I --> J
```

Wax is doing the work a cloud vector DB would do. Indexing, recall, grouping by bundle. Minus the embedding model. Minus the network hop. Minus any user data leaving the device. It’s the right shape for the job, and I still haven’t found a case where I wished I’d taken the embedding-search shortcut.

The bundle-as-reference-type idea is the one I’m most attached to. It turns what would be a 40 KB JSON blob in the LLM’s context into a short opaque ID. From the model’s point of view, the world is just a conversation about bundle IDs that expand when asked. That’s close to how a human researcher works. You don’t re-read every source every time you want a citation. You keep a note pointing at the source. You re-fetch only when the note isn’t enough. That’s what I wanted the agent to have.


The decisions, cataloged

The tool is maybe 2,000 lines of Swift. Most of them are plumbing. The ones that matter are the fourteen below. Each one started as a constraint and ended as a one-liner.

1. final4KAnswer as a named field. WebSearchSupport.swift:248. I renamed this from answer after watching someone misuse the tool at a 200K context window and realize halfway through that they’d lost the citation graph. The name is a promise. It tells the reader what this field is for.

2. semanticCore is literally the top three snippets joined by newlines. WebSearchSupport.swift:249, 404. The populating line reads merged.prefix(3).map(\.snippet).joined(separator: "\n"). Three hits, capped at 700 chars by the bounder. The smallest “give me the gist” primitive I could build without losing ranking signal.

3. Base64 envelope smuggled in legacy text. WebSearchEvidence.swift:79–89:

```swift
static func embedEnvelope(_ envelope: WebSearchEnvelope, in legacyText: String) -> String {
    guard let data = try? JSONEncoder().encode(envelope) else {
        return legacyText
    }
    let encoded = data.base64EncodedString()
    return """
    \(legacyText)

    \(embeddedWebSearchEnvelopePrefix)\(encoded)\(embeddedWebSearchEnvelopeSuffix)
    """
}
```

With the constants at WebSearchEvidence.swift:4–5:

```swift
internal let embeddedWebSearchEnvelopePrefix = "[[swarm.websearch.envelope:"
internal let embeddedWebSearchEnvelopeSuffix = "]]"
```

I tried JSON in a fenced code block first. LLMs reformatted it. I tried raw JSON in the message. LLMs added commentary and broke the parse. I tried YAML. Same problem. Base64 was the only encoding that made it through every model I tested without being “helpfully” modified. Wrapping it in double-bracket markers meant it also survived markdown processors, copy-paste, and Slack quoting.

This is the trick I’m proudest of. Two constants, two short functions, and a side channel for structured metadata inside a text-only protocol. I use it for three other things in Swarm now.
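
For completeness, here's how I'd pull the envelope back out on the receiving side, using only the two constants. A minimal sketch; the shipped parser may be more defensive.

```swift
// Sketch of the decode side: find the bracket markers, base64-decode
// what sits between them, bail to nil if anything is off. The shipped
// parser may handle more edge cases than this.
static func extractEnvelope(from text: String) -> WebSearchEnvelope? {
    guard let start = text.range(of: embeddedWebSearchEnvelopePrefix),
          let end = text.range(of: embeddedWebSearchEnvelopeSuffix,
                               range: start.upperBound..<text.endIndex),
          let data = Data(base64Encoded: String(text[start.upperBound..<end.lowerBound]))
    else { return nil }
    return try? JSONDecoder().decode(WebSearchEnvelope.self, from: data)
}
```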

4. charBudget = maxToolOutputTokens * 4 is the contract. WebSearchSupport.swift:778. The agent declares its budget. The tool honors it per call. Not a best-effort trim at the end — a budget resolved upfront.

5. enableVectorSearch = false, four times. WebSearchSupport.swift:905, 1128 + WebSearchEvidence.swift:469, 556. Defense in depth. I don’t want this to silently flip back on.

6. Section chunking with WebSectionChunk. WebSearchSupport.swift:1410–1534. Retrievable unit matches caller budget. Sections, not pages.

7. Freshness decay tuned per page type. WebSearchSupport.swift:1206–1221:

```swift
static func freshnessScore(fetchedAt: Date, pageType: WebPageType) -> Double {
    let staleDays: Double = switch pageType {
    case .docs, .apiReference:                               7
    case .blog, .generic, .tableHeavy, .codeHeavy:           3
    case .forum:                                             1
    case .pdf:                                               7
    }
    let ageDays = Date().timeIntervalSince(fetchedAt) / 86_400
    let raw = max(0, 1 - (ageDays / max(1, staleDays)))
    return min(raw, 1)
}
```

I tuned the staleness windows empirically. Docs and API refs stay useful for a week. Blogs and generic pages decay in three days. Forums decay in one day because yesterday’s top post is already buried. Linear decay. No exponentials. I tried exponential first and the curve was too sharp — a 2-day-old doc would score lower than I wanted. With the linear version, a two-day-old docs page scores 1 − 2/7 ≈ 0.71, and a two-day-old forum post bottoms out at 0.

8. Host trust scoring during ranking. Official docs weight 1.0. Official product pages 0.9. Reference sites 0.75. Community 0.55. User-generated 0.35. When the 4K budget forces top-3 selection, I want those three to be the three most trustworthy. Not the three Tavily scored highest on keyword match.
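
In code form those weights are a five-line switch, sketched below. The classifier that maps a hostname to a class is the real work, and it isn't shown here.

```swift
// Sketch of the trust weights from the prose above. The classifier that
// maps a hostname to a HostClass is the real logic and isn't shown here.
enum HostClass { case officialDocs, officialProduct, reference, community, userGenerated }

func trustWeight(for host: HostClass) -> Double {
    switch host {
    case .officialDocs:    1.0
    case .officialProduct: 0.9
    case .reference:       0.75
    case .community:       0.55
    case .userGenerated:   0.35
    }
}
```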

9. Domain dedup in the merge step. WebSearchSupport.swift:293–309. If two hits come from the same domain I drop the weaker one. In a 4K world, same-domain duplicates are pure token waste. Early versions of the tool didn’t do this and I kept seeing the top three slots filled with three different pages from the same docs site.
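
A minimal sketch of that dedup step, assuming each hit exposes a url string and a score (the shipped merge also interleaves cached and live hits):

```swift
// Sketch of per-domain dedup: keep the strongest hit per host. Assumes
// each hit exposes `url` and `score`; the shipped merge step also
// interleaves cached and live results.
func dedupedByDomain(_ hits: [WebSearchHit]) -> [WebSearchHit] {
    var bestPerHost: [String: WebSearchHit] = [:]
    for hit in hits {
        guard let host = URL(string: hit.url)?.host?.lowercased() else { continue }
        if let current = bestPerHost[host], current.score >= hit.score { continue }
        bestPerHost[host] = hit
    }
    // Re-rank so the strongest survivors come first.
    return bestPerHost.values.sorted { $0.score > $1.score }
}
```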

10. Ephemeral URLSession per request. WebSearchSupport.swift:1309–1314:

```swift
private static func makeSession(timeout: TimeInterval) -> URLSession {
    let configuration = URLSessionConfiguration.ephemeral
    configuration.waitsForConnectivity = false
    configuration.timeoutIntervalForRequest = timeout
    configuration.timeoutIntervalForResource = timeout
    return URLSession(configuration: configuration)
}
```

I started with a shared URLSession. On LTE it caused pool-poisoning bugs — one bad connection would wedge the whole pool until the app was backgrounded. Ephemeral session per request fixed it. Creating a session per call is slightly more expensive, but not measurably so in the workloads I care about.

11. Tavily capped at 6 seconds inside a 20-second fetch budget. WebSearchSupport.swift:1277:

```swift
request.timeoutInterval = min(configuration.fetchTimeout, 6)
```

The search hop has to fail fast so there’s budget left for fetch + parse + extract. I watched too many agent runs die because the search took 15 seconds and the fetch never got to happen. Six seconds is empirically where Tavily’s p99 sits on decent WiFi. If it hasn’t responded by then, something is wrong and retrying is cheaper than waiting.

12. Conditional fetches with If-None-Match and If-Modified-Since. WebSearchSupport.swift:1333. A 304 response means no re-chunking, no re-persisting, no disk writes. I add these headers automatically whenever a prior ETag or Last-Modified is in the artifact store. Battery win on cellular.
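
The header dance, sketched. It assumes the stored artifact record keeps the validators from the previous fetch; field names are illustrative.

```swift
// Sketch of the conditional fetch, assuming the stored artifact record
// keeps the validators from the last fetch. Field names are illustrative.
var request = URLRequest(url: url)
if let etag = priorArtifact?.etag {
    request.setValue(etag, forHTTPHeaderField: "If-None-Match")
}
if let lastModified = priorArtifact?.lastModified {
    request.setValue(lastModified, forHTTPHeaderField: "If-Modified-Since")
}
// A 304 response means: reuse the cached chunks, write nothing to disk.
```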

13. 64 MB storage quota with LRU eviction. Configuration.storageQuotaBytes = 64 * 1024 * 1024. Enforcement at WebSearchSupport.swift:1048–1121. I picked 64 MB because it’s enough to cache a few hundred documents and small enough that a user who never looks at it won’t notice. I assume the tool is sharing disk with photos, apps, and the OS. It polices itself.
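
The eviction itself is the obvious loop. A sketch, assuming each stored artifact knows its size on disk and its last access time; field names are illustrative.

```swift
// Sketch of quota enforcement, assuming each stored artifact knows its
// on-disk size and last access time. Field names are illustrative; the
// real store also deletes files and index entries as it evicts.
func enforceQuota(_ artifacts: inout [StoredArtifact], quotaBytes: Int) {
    var total = artifacts.reduce(0) { $0 + $1.sizeBytes }
    guard total > quotaBytes else { return }
    artifacts.sort { $0.lastAccessedAt < $1.lastAccessedAt } // oldest first
    while total > quotaBytes, let evicted = artifacts.first {
        total -= evicted.sizeBytes
        artifacts.removeFirst()
    }
}
```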

14. refresh mode as explicit cache-bust. WebSearchSupport.swift:374. I didn’t want a TTL counting down in the background. The agent should decide when a source is worth re-fetching. That keeps the model in the loop and means the cache policy is a product decision, not a hidden invariant.

Each of those is one or two lines of source. They don’t read as clever. They read as small. Together they’re the design language.


The part I thought about longest: SSRF on a phone

This isn’t the section I expected to spend the most time on. It’s the one I did.

A datacenter agent sits inside a VPC. Outbound traffic is firewalled from 10.*, 192.168.*, 169.254.*, and the usual set of link-local and metadata addresses. If the model hallucinates a URL like http://169.254.169.254/latest/meta-data/ (hi, AWS), the network layer drops it. Not my problem.

An on-device agent is on your LAN. It shares a subnet with your router, your printer, your NAS, your smart bulbs, the IoT device you never got around to putting on a guest network. Nothing is firewalled from those targets by default. The only thing standing between the LLM and your Hue bulb’s local HTTP API is whatever I chose to block.

I block them at parse time. SafeWebFetcher in WebSearchSupport.swift:1384–1407:

```swift
private func validate(url: URL) throws {
    guard let scheme = url.scheme?.lowercased(), ["http", "https"].contains(scheme) else { ... }
    guard let host = url.host?.lowercased(), !host.isEmpty else { ... }

    if host == "localhost" || host == "::1" || host == "[::1]" || host.hasPrefix("127.") { ... }

    if host.hasPrefix("10.") || host.hasPrefix("192.168.") || host.hasPrefix("169.254.") { ... }

    if host.hasPrefix("172.") {
        let octets = host.split(separator: ".")
        if octets.count >= 2, let second = Int(octets[1]), (16 ... 31).contains(second) { ... }
    }
}
```

| Blocked | Example | What it stops on-device |
|---|---|---|
| Non-http(s) scheme | file:///etc/passwd, ftp://... | Reading local files the agent shouldn't touch |
| Loopback | 127.0.0.1, ::1, localhost | Other services running on the same device |
| Link-local | 169.254.* | Cloud metadata endpoints, AirPrint discovery, mDNS |
| Private class A | 10.* | Corporate VPN-side targets |
| Private class B | 172.16.* – 172.31.* | Home routers that use this block |
| Private class C | 192.168.* | Router admin page, IoT devices, NAS |
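
These are the probes I'd point at it from a test target, one per blocked class. A sketch; it assumes validate(url:) is reachable from tests, which the private above would need an adjustment for.

```swift
// Sketch of the tests I'd want, one probe per blocked class. Assumes
// validate(url:) is visible to the test target (it's private above) and
// that SafeWebFetcher has a no-argument initializer.
func testBlockedHosts() throws {
    let fetcher = SafeWebFetcher()
    let blocked = [
        "file:///etc/passwd",
        "http://localhost:8080/",
        "http://169.254.169.254/latest/meta-data/",
        "http://10.0.0.1/",
        "http://172.16.0.1/",
        "http://192.168.1.1/router-admin/reboot",
    ]
    for raw in blocked {
        XCTAssertThrowsError(try fetcher.validate(url: URL(string: raw)!), "should block \(raw)")
    }
    XCTAssertNoThrow(try fetcher.validate(url: URL(string: "https://developer.apple.com/")!))
}
```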

I wrote this after thinking about prompt injection. Imagine a user pastes a webpage into the agent’s context and that webpage contains <img src="http://192.168.1.1/router-admin/reboot"> as a trick. If the agent’s tool can fetch that URL, the user’s router just rebooted because of a webpage.

Twenty-five lines of Swift. One function call per fetch. Cheap defense. The blast radius if I hadn’t added it would have been enormous.


Why the runtime is an actor

```swift
// Sources/Swarm/Tools/Web/WebSearchSupport.swift:331
internal actor WebToolRuntime {
    static let shared = WebToolRuntime()
    private var stores: [String: WaxWebArtifactStore] = [:]

    func execute(
        request: WebToolRequest,
        configuration: WebSearchTool.Configuration
    ) async throws -> WebSearchEnvelope {
        let store = try await store(for: configuration)
        let engine = WebExecutionEngine(configuration: configuration, store: store)
        return try await engine.execute(request: request)
    }
    // ...
}
```

I reached for an actor because the alternative was a custom DispatchQueue wrapper, and that’s the kind of code I’ve spent too much time maintaining. Swift 6 actor isolation does the serialization for free. No locks. No manual queueing. The compiler enforces the invariants I’d otherwise have to enforce by convention.

On a server you’d have a worker pool, a Redis rate limiter, a shared connection manager. None of that exists on a phone. None of it is missed. The actor is the serialization. The ephemeral URLSession is the pool. The Wax store actor underneath is the disk lock.

One process. One of each thing. Fewer moving pieces. The pieces that are there are the ones the Swift language hands me for free.


One call, end to end

Here’s an actual trace from my test harness against a 3B Foundation Model. Query: “What changed in SwiftUI 6 navigation?”. Agent budget: 1,024 output tokens. Mode: ground. Detail: compact.

```mermaid
sequenceDiagram
  participant Agent
  participant Runtime as "WebToolRuntime actor"
  participant Wax as "Wax Index local"
  participant Tavily
  participant Fetch as SafeWebFetcher
  participant Store as WaxWebArtifactStore

  Agent->>Runtime: execute ground, query, compact
  Runtime->>Wax: query "SwiftUI 6 nav"
  Wax-->>Runtime: 2 hits, similarity 0.71
  Note over Runtime: below 0.82 threshold, fall through to live
  Runtime->>Tavily: POST /search, 6s cap
  Tavily-->>Runtime: 5 hits + scores
  Note over Runtime: merge, dedup, trust-weight, prefix 3
  Runtime->>Fetch: GET top 2, ETag-aware
  Fetch-->>Runtime: HTML ~40 KB each
  Runtime->>Store: chunk + persist
  Note over Runtime: boundedEnvelope compact: final4KAnswer<=900, summary<=320, sections=0
  Runtime->>Runtime: embed base64 envelope
  Runtime-->>Agent: ~1.2 KB text + bracket envelope
```

The agent reads about 1,200 characters. Roughly 300 tokens. The remaining 724 tokens of its output budget go to the actual answer it’s about to produce. The section chunks and raw documents stay on disk, addressable by bundle ID, available via expand if the agent wants to drill in later.

Every stage measured against the 4K ceiling. Nothing overflows. Nothing gets truncated mid-sentence. The tool knows the budget before it starts.

That’s the discipline. I wouldn’t call it elegant. I’d call it respectful — of the model’s context, of the user’s battery, of the device’s disk. Everything else flows from treating those as real constraints instead of things to be “handled.”


Three things I’d do the same again, and one I’d change

Same again:

Name the context-aware output field. final4KAnswer is the single most-cited line in the codebase when teammates discuss the tool. The name is a contract and it communicates more than three paragraphs of doc comments would.

Base64-envelope structured data inside legacy text. I’ve used this pattern three more times since. It keeps working. Two constants and a JSONEncoder is a shockingly small amount of code for what it buys.

Index sections, not documents. Match the retrievable unit to the caller’s context budget. Every time I’m tempted to just-index-the-page to save implementation effort, I remind myself the 3B model can’t read the page.

What I’d change:

The four-place enableVectorSearch = false duplication. I defended it as defense-in-depth earlier. It is. But I should have wrapped Wax init in a single makeWebMemory() helper so there’s one place to keep honest. Today if a new feature adds a fifth Wax init and forgets the flag, nothing catches it. Six months from now that’s going to bite me and I’ll remember this paragraph.
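
For the record, the helper is small. This sketch uses only the Wax calls already shown above:

```swift
// The helper I should have written: the one place the web tool opens a
// Wax index, so the no-embeddings decision can't silently regress.
// Uses only the Wax API already shown in this post.
func makeWebMemory(at indexURL: URL) async throws -> Wax.Memory {
    var config = Wax.Memory.Config.default
    config.enableVectorSearch = false // the entire point of this helper
    return try await Wax.Memory(at: indexURL, config: config)
}
```

Four call sites become four one-line calls, and a fifth can't forget the flag.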


The cloud builds its tools out of fetch() and summarize(). I built this one out of fetch(), chunk(), tier(), persist(), recall(), and a field named for a 4K window. It runs in a Swift actor on someone’s laptop or iPhone. No orchestrator. No second model for embeddings. No external vector DB. No user data leaving the device.

That’s what I was trying to build, and that’s what the code says now. The fact that it fits is what makes it work.


Source: Sources/Swarm/Tools/Web/ in the SwiftAgents repository. Line numbers pulled from the testing branch as of April 2026.