edge llm deployment — field notes
notes from running a few hundred Cloudflare zones.
1. what this is
I run a few hundred Cloudflare zones — a personal namespace of full-featured apps, sites, sketchpads, AI tools, scrape-shielded surfaces: the whole shape of what one operator can ship at the edge. This is not a hobby footprint. It works because the entire stack — DNS, registration, deploy, edge logic, monitoring — is driven from a script with an LLM in the loop composing the calls. Claude writes the wrangler invocations and the Namecheap API hits; wrangler ships the Workers; I read the result. One person runs the whole thing.
This is what I learned shipping it.
The snippets show the pairing I actually ship — Anthropic's Messages API on Cloudflare Workers — because code only runs as written if you pick one. The patterns generalize; OpenAI, Google, Groq, Cerebras, Vercel Edge, Deno Deploy, Fastly Compute swap in where I flag inline.
Assume you've shipped a Worker before and you have an API key. Where the official docs are authoritative I link them; where they're wrong or missing I say so. Everything's dated . After that, the numbers may be stale. Check before shipping.
claude-3-haiku-20240307 reaches its retirement date on ; code pinning that identifier should migrate to a currently-supported model from the same vendor before then. Comparable retirements have occurred or will occur across the other providers discussed here; always check the relevant vendor's model-lifecycle page before pinning.
2. about the .cx namespace
These notes live at claudeflare.cx, a personal domain used for community field notes and small utilities. The .cx side reads, in-house, as community extensions — third-party documentation and patterns that describe how the public surface of tools gets used by their communities. Content here is not limited to any one vendor; this edge-llm guide is one article in a broader set of independent notes.
3. the stack at a glance
Two axes: which provider you call, which runtime you call from. Most patterns survive the cross-product. The ones that don't, I flag inline.
llm providers referenced
| provider | api surface | primary-source docs |
|---|---|---|
| OpenAI | Chat Completions, Responses, Realtime | platform.openai.com/docs |
| Anthropic | Messages, prompt caching, batch | platform.claude.com/docs/en |
| Google | Gemini API, context caching | ai.google.dev/gemini-api/docs |
| Groq | OpenAI-compatible chat API on LPU | console.groq.com/docs |
| Cerebras | OpenAI-compatible chat API on wafer-scale inference | inference-docs.cerebras.ai |
edge runtimes referenced
| runtime | substrate | primary-source docs |
|---|---|---|
| Cloudflare Workers | V8 isolates on the Cloudflare network | developers.cloudflare.com/workers |
| Vercel Edge Functions | V8 isolates on Vercel's edge | vercel.com/docs/functions/edge-functions |
| Deno Deploy | Isolates running the Deno runtime | docs.deno.com/deploy |
| Fastly Compute | WebAssembly on Fastly POPs | fastly.com/documentation/guides/compute |
primitives an edge-llm proxy actually needs
| primitive | role | typical implementation |
|---|---|---|
| HTTP handler | fronting the vendor's api; streaming proxy; auth gate | any edge runtime's default request handler |
| secrets | holds provider api keys off the client; rotated via the runtime's cli | Cloudflare Workers Secrets; Vercel encrypted env vars; Deno Deploy secrets; Fastly secret store |
| strong-consistency state | per-user rate limiter; chat-session stickiness | Cloudflare Durable Objects; Vercel KV + lock; Upstash Redis; a small DO-equivalent; or a SQL row |
| small KV store | static assets; small idempotent response caches | Cloudflare KV; Vercel KV; Deno KV; Upstash |
| object storage | multi-megabyte context blobs (attachments, pdfs, transcripts) | Cloudflare R2; AWS S3; GCS; Backblaze B2 |
| metrics / time-series sink | per-call cost counters and ad-hoc metrics | Cloudflare Analytics Engine; Datadog; BigQuery; ClickHouse Cloud |
Bolt on a vector index (Vectorize, Turbopuffer, pgvector, Pinecone) for retrieval, plus a managed SQL/PG proxy (Hyperdrive, Neon, Supabase) when state gets heavy. Most small-to-medium edge-llm apps don't need either on day one. Start small.
4. pattern 1 — proxy worker for /v1/messages
Start with the dumbest proxy that survives a stream. Forward the bytes. Attach the key server-side. Don't touch the body. Auth, rate limiting, quotas, cost accounting all stack on top later, in their own functions, called from fetch.
// worker.js — thin streaming proxy in front of /v1/messages
export default {
async fetch(req, env) {
const url = new URL(req.url);
if (url.pathname !== "/v1/messages" || req.method !== "POST") {
return new Response("not found", { status: 404 });
}
// Forward the client's body and let the runtime stream it.
// Do NOT JSON.parse + re-stringify: that can miss the prompt cache.
const upstream = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body: req.body,
});
// Pass the upstream SSE stream through without buffering.
return new Response(upstream.body, {
status: upstream.status,
headers: {
"content-type": "text/event-stream",
"cache-control": "no-cache",
"connection": "keep-alive",
},
});
},
};
Three things to notice. First, the upstream fetch passes req.body directly; the runtime handles the streaming request body. Second, the response body is passed through unchanged: do not JSON-parse it, do not buffer it, do not re-emit it — that path misses the prompt cache (§6). Third, connection: keep-alive plus cache-control: no-cache is the shape recommended by cloudflare's agents HTTP/SSE reference. x-accel-buffering is an nginx-specific header; cloudflare's edge does not document it as required, and it is harmless to omit in a workers-only path.
In front of this, a real Worker adds: an auth check on the incoming request, a rate-limit call to a Durable Object (§10), an Analytics Engine write (§11), and an error envelope for non-2xx upstream. Keep the handler small; push policy into explicit functions called from fetch.
Other-provider swap: replace the upstream url with https://api.openai.com/v1/chat/completions and the auth header with authorization: Bearer $OPENAI_API_KEY; for gemini, https://generativelanguage.googleapis.com/v1beta/models/$MODEL:streamGenerateContent?key=$KEY. Other-runtime swap: the same handler shape runs on vercel edge functions, deno deploy, and (with a small wasm shim) fastly compute.
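The provider swap above is mechanical enough to express as a lookup. A minimal sketch, assuming the env binding names shown earlier (ANTHROPIC_API_KEY; OPENAI_API_KEY and GEMINI_API_KEY are hypothetical parallels) and the URLs quoted in the paragraph above — verify both against current vendor docs before shipping:

```javascript
// Hypothetical helper: per-provider upstream config for the same proxy shape.
function upstreamFor(provider, env, model) {
  switch (provider) {
    case "anthropic":
      return {
        url: "https://api.anthropic.com/v1/messages",
        headers: {
          "anthropic-version": "2023-06-01",
          "x-api-key": env.ANTHROPIC_API_KEY,
        },
      };
    case "openai":
      return {
        url: "https://api.openai.com/v1/chat/completions",
        headers: { authorization: `Bearer ${env.OPENAI_API_KEY}` },
      };
    case "gemini":
      // Gemini passes the key as a query parameter rather than a header.
      return {
        url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:streamGenerateContent?key=${env.GEMINI_API_KEY}`,
        headers: {},
      };
    default:
      throw new Error(`unknown provider: ${provider}`);
  }
}
```

The rest of the proxy — body forwarding, stream passthrough — stays identical across the three cases.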
5. pattern 2 — secret handling
The provider api key goes into Workers Secrets, not into [vars] in wrangler.toml, and not into a repo-checked .dev.vars for production. Cloudflare's secrets docs are explicit that [vars] is plaintext and meant for non-sensitive config; use wrangler secret put for anything that would not be pasted into a public repo.
# set a secret (prompts once, stored encrypted)
wrangler secret put ANTHROPIC_API_KEY
# list secret bindings on this Worker
wrangler secret list
# deploy with the new secret live
wrangler deploy
In code, read via env.ANTHROPIC_API_KEY. Never log it; never include it in an error message; never echo it back in a response header. If request bodies are logged for debugging, strip authorization and x-api-key before writing.
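The header-stripping rule above is worth making explicit. A minimal sketch (the helper name and the redaction marker are my own, not an API):

```javascript
// Hypothetical logging helper: drop credential headers before writing a
// request to any log sink. Everything else passes through unchanged.
const SENSITIVE = new Set(["authorization", "x-api-key", "cookie"]);

function scrubHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name.toLowerCase()] = SENSITIVE.has(name.toLowerCase())
      ? "[redacted]"
      : value;
  }
  return out;
}
```

Run every debug log line through this; it is cheaper than rotating a leaked key.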
For multi-tenant deployments where each customer holds their own key, pass the key in from the client over TLS and forward it with the request — do not store customer keys in Workers Secrets, because Secrets are per-Worker-global, and a single leaked Worker bundle would expose every tenant. For BYO-key, validate the key shape server-side before forwarding, and forward with the header name the upstream expects.
Other-runtime swap: vercel encrypted env vars (vercel env add); deno deploy environment variables; fastly secret store. All expose the same "set once, read at runtime, never checked in" shape.
6. pattern 3 — preserving prompt cache through a worker
Prompt caching is a substantial cost lever on anthropic's messages api. Cache reads are priced at 0.1× base input; 5-minute cache writes at 1.25× base; 1-hour cache writes at 2× base (prompt caching docs, retrieved ). Break-even: a 5-minute block pays off after one read; a 1-hour block after two. A Worker that breaks the cache pays full price. Openai's prompt caching and google's context caching follow the same engineering shape with different pricing; the worker-side rules below generalize.
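The break-even figures above are just multiplier arithmetic — the write premium is (writeMult − 1) × base, and each read saves (1 − readMult) × base:

```javascript
// Reads needed before a cache write pays for itself, given the write and
// read price multipliers relative to base input pricing.
function breakEvenReads(writeMult, readMult) {
  return Math.ceil((writeMult - 1) / (1 - readMult));
}
// 5-minute cache (1.25x write, 0.1x read): pays off after 1 read.
// 1-hour cache (2x write, 0.1x read): pays off after 2 reads.
```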
The cache is keyed on the prefix of the request's tools → system → messages blocks, with cache_control: {"type": "ephemeral"} marking cacheable boundaries. Up to four explicit breakpoints per request. Minimum cacheable block size depends on the model: current opus and haiku tiers require 4,096 tokens; current sonnet requires 2,048; older sonnets and some haikus require less. Requests below the minimum are processed without caching and return no error.
To preserve the cache across a Worker:
- Forward request bytes; do not re-serialize. JSON.stringify(JSON.parse(body)) can reorder keys, add whitespace, or strip a field. Any of those misses the cache.
- Do not change block ordering. Cache matching is prefix-based; a shuffled block list is a different prefix.
- Do not strip cache_control breakpoints. They are the client's statement of what to cache; removing them silently increases cost.
- Place breakpoints on the last block that is identical across requests, not on changing content. A breakpoint on the user's timestamp defeats itself.
- Keep policy pre-flight (auth, rate limit) separate from the body-forward path. A policy failure should return before the request body is touched.
On the response side, the api returns cache telemetry in the usage block: cache_creation_input_tokens, cache_read_input_tokens, ephemeral_5m_input_tokens, ephemeral_1h_input_tokens. Total input tokens are cache_read + cache_creation + input_tokens; the input_tokens field alone counts only tokens after the last cache breakpoint. Write the full shape to Analytics Engine (§11) so hit rate is visible over time. A falling hit rate is usually a Worker that started rewriting request bytes it should have forwarded unchanged.
On current 4.x-tier anthropic models, cache_read_input_tokens are excluded from input-tokens-per-minute rate limits. Effective ITPM with hot caches is substantially higher than the raw limit suggests.
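The hit-rate number worth charting falls out of the usage fields described above. A small sketch (helper name is mine; field names are the documented telemetry):

```javascript
// Cache hit rate from a Messages API usage block: cached reads as a share
// of total input tokens (read + creation + post-breakpoint fresh tokens).
function cacheHitRate(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const create = usage.cache_creation_input_tokens ?? 0;
  const fresh = usage.input_tokens ?? 0;
  const total = read + create + fresh;
  return total === 0 ? 0 : read / total;
}
```

Write the ratio (or its components) to Analytics Engine per call; a sudden drop is the serialization-drift signal described above.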
7. pattern 4 — retries, backoff, and 529
Three status shapes to handle: 429 Too Many Requests (a limit), 529 Overloaded (the upstream is shedding load), and an in-stream overloaded_error event (the same overload expressed as an SSE event rather than an HTTP status). The messages api returns a retry-after header on 429 where appropriate. Other providers use compatible shapes; the worker-side loop is the same.
async function sendWithRetry(body, env, { maxRetries = 2 } = {}) {
let attempt = 0;
while (true) {
const r = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body,
});
// Happy path and non-retryable errors pass through.
const retryable = r.status === 429 || r.status === 529 || r.status >= 500;
if (!retryable || attempt >= maxRetries) return r;
const retryAfter = Number(r.headers.get("retry-after")) || 0;
const backoff = Math.max(
retryAfter * 1000,
Math.min(8000, 500 * 2 ** attempt + Math.random() * 250),
);
await new Promise(res => setTimeout(res, backoff));
attempt++;
}
}
Three rules of thumb. First, do not retry a streaming request that has already emitted tokens to the client; restart from the beginning on the client, or don't restart. Second, handle in-stream overloaded_error events by closing the stream and surfacing a clean error to the user; do not attempt to retry mid-stream. Third, keep maxRetries small (0 to 2) — Workers' wall-clock budget while a client is connected is generous, but a Worker that retries for minutes hits user-patience limits before platform limits.
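Detecting the in-stream overloaded_error means watching SSE frames as they pass. A minimal line-level sketch, assuming the Anthropic streaming shape (an "event: error" line followed by a JSON "data:" payload) — real code should buffer partial frames across chunks:

```javascript
// Scan a chunk of SSE text for an error event carrying overloaded_error.
function hasOverloadedError(sseText) {
  const lines = sseText.split("\n");
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim() !== "event: error") continue;
    const dataLine = lines[i + 1] ?? "";
    if (!dataLine.startsWith("data:")) continue;
    try {
      const payload = JSON.parse(dataLine.slice(5));
      if (payload.error?.type === "overloaded_error") return true;
    } catch {
      // Frame split across chunks; caller should re-buffer and retry.
    }
  }
  return false;
}
```

When this fires mid-stream, close the client stream and surface a clean error, per the second rule above.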
Read the anthropic-ratelimit-* response headers into Analytics Engine: -requests-remaining, -input-tokens-remaining, -output-tokens-remaining. A running chart of "remaining budget at time of request" turns capacity planning into a look-at-the-graph problem.
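Collecting those headers into one metrics-ready object is a few lines. A sketch (helper name is mine; header names are as documented; works with anything exposing a .get(), such as Headers):

```javascript
// Snapshot the anthropic-ratelimit-* remaining-budget headers as numbers,
// null where the header is absent.
function rateLimitSnapshot(headers) {
  const read = (name) => {
    const v = headers.get(`anthropic-ratelimit-${name}`);
    return v === null || v === undefined ? null : Number(v);
  };
  return {
    requestsRemaining: read("requests-remaining"),
    inputTokensRemaining: read("input-tokens-remaining"),
    outputTokensRemaining: read("output-tokens-remaining"),
  };
}
```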
8. pattern 5 — token counting at the edge
Anthropic exposes /v1/messages/count_tokens, which returns the input-token count for a given request without executing the model. Useful for pre-call budget checks, free-tier gating, or warning a user before they blow a daily allowance. Openai exposes a similar shape via tiktoken locally or via usage-preview endpoints; gemini exposes countTokens.
async function countTokens(reqBody, env) {
const r = await fetch("https://api.anthropic.com/v1/messages/count_tokens", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body: reqBody,
});
if (!r.ok) return null;
const { input_tokens } = await r.json();
return input_tokens;
}
Call this conditionally. In a proxy Worker, only count when the client is near a quota boundary; otherwise skip the round-trip. A reasonable heuristic: only count if the client's cumulative usage this window is within 20% of their cap.
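The 20% heuristic above reduces to a one-line predicate (names are mine):

```javascript
// Only spend the count_tokens round-trip when this window's usage is
// within `margin` of the cap.
function shouldCountTokens(usedThisWindow, cap, margin = 0.2) {
  return usedThisWindow >= cap * (1 - margin);
}
```

Gate the countTokens call on this, and the extra round-trip only appears for users who are actually near their allowance.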
9. pattern 6 — r2 for long context
For attachments that are already big — pdfs, transcripts, codebase dumps — store them in R2 and inject references into the request rather than carrying the raw bytes through every call. R2 Standard storage is $0.015/GB-month; egress to the public internet is free; Class B (reads) are $0.36/M (R2 pricing, retrieved ). The same pattern holds with S3, GCS, or Backblaze B2 on other runtimes; differences are pricing and egress shape, not engineering.
Two patterns worth knowing:
- Pre-signed uploads. Generate a pre-signed PUT URL in the Worker, return it to the client, let the client upload directly to R2. Saves Worker CPU-ms.
- Cached context hydration. When building a request that references an R2 object, fetch the object inside the Worker and inline it into the messages block as a document. If the prefix is stable across calls, the prompt cache covers most of the cost; R2 is then a cheap persistence layer rather than a hot path.
Do not proxy the R2 object through the Worker on every request. Serve directly from R2 (via a Worker-signed URL for private objects) when possible; that avoids pushing bytes through Workers' request/response-size envelopes.
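The hydration pattern above can be sketched as follows. The document-block shape is my reading of the Messages API's plain-text document content type, and CONTEXT_BUCKET is an assumed R2 binding name — check both against the current docs:

```javascript
// Wrap text as a Messages API plain-text document block. Keep the bytes
// identical across calls so the prompt cache covers the hydrated context.
function documentBlockFromText(text, title) {
  return {
    type: "document",
    source: { type: "text", media_type: "text/plain", data: text },
    title,
    cache_control: { type: "ephemeral" },
  };
}

// Read an object from an R2 binding and hydrate it into a document block.
async function hydrate(env, key) {
  const obj = await env.CONTEXT_BUCKET.get(key); // assumed binding name
  if (!obj) return null;
  return documentBlockFromText(await obj.text(), key);
}
```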
Anthropic's Files API (anthropic-beta: files-api-2025-04-14, retrieved ) allows uploading a file once and referencing it by id across requests. For files reused many times within a workspace, that is often a better primitive than R2; use R2 when the file is already in the storage graph or needs a public URL.
10. pattern 7 — durable objects for rate limiting
Per-user rate limiting in a stateless Worker is impossible; every request may land on a different isolate, and KV is eventually consistent (propagation up to 60 seconds, per KV docs). Durable Objects solve this: each DO instance is a singleton addressed by a stable name, with strongly consistent in-memory state and an alarm API for time-based effects. On other runtimes the equivalent is Upstash Redis with per-key locks, Deno KV with atomics, or Vercel KV + a lock library.
// Durable Object: per-user token bucket
export class RateLimiter {
constructor(state) { this.state = state; this.tokens = null; this.last = 0; }
async fetch(req) {
if (this.tokens === null) {
const saved = await this.state.storage.get("bucket")
|| { tokens: 60, last: Date.now() };
this.tokens = saved.tokens;
this.last = saved.last;
}
const now = Date.now();
const refill = (now - this.last) / 1000 * 1; // 1 token/sec
this.tokens = Math.min(60, this.tokens + refill);
this.last = now;
if (this.tokens < 1) {
await this.state.storage.put("bucket",
{ tokens: this.tokens, last: this.last });
return new Response("rate limited",
{ status: 429, headers: { "retry-after": "1" } });
}
this.tokens -= 1;
await this.state.storage.put("bucket",
{ tokens: this.tokens, last: this.last });
return new Response("ok");
}
}
// Worker side
export default {
async fetch(req, env) {
const userId = await authenticate(req);
const id = env.RATE_LIMITER.idFromName(userId);
const limiter = env.RATE_LIMITER.get(id);
const check = await limiter.fetch("https://limiter/check");
if (check.status === 429) return check;
// ...proceed to forward to /v1/messages...
},
};
Two refinements worth the lines. First, put a small in-memory cache in the Worker in front of the DO so the common case (plenty of tokens, recently checked) does not cross a network hop. Second, use the DO alarm API to lazily reset buckets instead of polling.
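The first refinement can be sketched as a per-isolate allow cache (all names are mine; the DO stays the source of truth, so keep the TTL well under the bucket window):

```javascript
// Recently allowed users skip the DO hop for a short TTL.
const localAllow = new Map(); // userId -> expiry timestamp (ms)

function locallyAllowed(userId, now = Date.now()) {
  const until = localAllow.get(userId);
  return until !== undefined && now < until;
}

function recordAllowed(userId, now = Date.now(), ttlMs = 1000) {
  localAllow.set(userId, now + ttlMs);
}
```

In the Worker's fetch: if locallyAllowed(userId), skip the limiter.fetch; otherwise cross to the DO and call recordAllowed on a 200.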
11. pattern 8 — analytics engine for per-call metrics
Cloudflare's Analytics Engine (AE) provides a per-Worker time-series sink writable at request rate and queryable with SQL. For an llm proxy, write one data point per call: model id, user bucket, route, input tokens, output tokens, cache read tokens, status code, latency. Each data point is a row aggregated later. Other-runtime equivalents: Datadog, BigQuery streaming inserts, ClickHouse Cloud, Honeycomb.
env.AE.writeDataPoint({
blobs: [model, userBucket, route], // strings for grouping
doubles: [inputTokens, outputTokens, cacheRead, latencyMs],
indexes: [userBucket], // sampling key
});
AE limits as of : ≤250 data points per Worker invocation, ≤20 blobs and ≤20 doubles per data point, one index per call, all blobs ≤16 KB total, three-month retention (AE limits). Query through the SQL API for external dashboards (Grafana) or the GraphQL Analytics API for in-dashboard views.
Budget discipline: each AE write is billable. For high-QPS Workers, sample at the point of writing — e.g. 1/10 of successful calls plus 100% of errors. A good cost dashboard is one that does not dominate the cost it measures.
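The sampling rule above, as a predicate (rand is injected so the decision is testable):

```javascript
// Keep every error, ~1 in 10 successes.
function shouldWriteDataPoint(status, rand = Math.random()) {
  if (status >= 400) return true; // 100% of errors
  return rand < 0.1;              // ~10% of successes
}
```

Guard the writeDataPoint call with this; remember to scale success counts back up by the sampling factor at query time.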
12. pattern 9 — agents sdk and remote mcp over workers
For anything with durable tool state — a long-running assistant, a multi-turn retrieval agent, a remote MCP server hosting tools — cloudflare's agents sdk on top of Durable Objects is the shape that works. Each Agent instance is a DO; tools are methods; state persists across requests and hibernation; WebSocket hibernation keeps long-lived clients cheap.
Key primitives from the agents sdk: the Agent base class (per-agent SQL DB plus KV state), AIChatAgent for chat use cases, the @callable() decorator for typed RPC over WebSocket, and React hooks useAgent / useAgentChat. All inherit DO scheduling and hibernation.
For remote MCP servers specifically, the remote MCP guide (retrieved ) documents three paths:
- createMcpHandler() — stateless; no Durable Objects; fastest to deploy.
- McpAgent — stateful; requires DOs; supports both SSE and Streamable HTTP transports; per-session state; tool elicitation. Streamable HTTP is the current MCP spec standard.
- Raw WebStandardStreamableHTTPServerTransport — direct SDK, maximum control.
Inside a tool call, use the provider api as needed; return the tool result as the MCP response. Keep each tool's upstream call bounded. If a tool call runs long enough that platform limits become relevant, return a correlation id, kick off the work via waitUntil, and expose a /status/:id endpoint the client can poll.
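The correlation-id-plus-poll shape can be sketched with an in-memory registry (all names are mine; in production the state lives in the Agent's DO storage, not a Map):

```javascript
// Minimal job registry: start returns immediately, work completes out of
// band (e.g. via waitUntil), status is polled by id.
const jobs = new Map();

function startJob(id) {
  jobs.set(id, { status: "running", result: null });
  return id;
}

function completeJob(id, result) {
  const job = jobs.get(id);
  if (job) { job.status = "done"; job.result = result; }
}

function jobStatus(id) {
  return jobs.get(id) ?? { status: "unknown", result: null };
}
```

The /status/:id handler is then a one-line jobStatus lookup serialized to JSON.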
13. gotchas
Observed in the field, in rough order of pain:
sse stalls without connection: keep-alive, not x-accel-buffering
A browser that does not receive SSE chunks in real time is the most common symptom of a misconfigured proxy. The canonical shape is content-type: text/event-stream + cache-control: no-cache + connection: keep-alive. x-accel-buffering is an nginx directive and is not documented as required on cloudflare's edge; it is harmless to include but should not be assumed to fix a Workers SSE bug.
cpu-ms vs wall-clock are different budgets
A streaming Worker burns wall-clock (waiting on the upstream) but very little CPU-ms. On the paid tier, default CPU limit per HTTP invocation is 30 s (configurable to 5 min); wall-clock is not enforced while the client is connected (workers limits, retrieved ). On the free tier, CPU is capped at 10 ms per invocation — any non-trivial llm proxy needs the paid plan.
json.parse of a streaming response buffers it
Calling response.json() on an upstream SSE response buffers the entire stream into memory before returning. Use response.body for streaming; response.json() only for non-streaming calls.
prompt-cache misses from subtle mutations
Re-serializing the request body, sorting JSON keys, adding a trailing newline, or injecting a per-request timestamp into the cached prefix will all miss the cache. Treat the request body as opaque bytes on the hot path; if inspection is required, clone first.
6 concurrent outbound connections waiting on headers
Workers allow 6 simultaneous open outbound HTTPS connections awaiting response headers; once headers arrive the connection stops counting. For a fan-out pattern (parallel calls from a single invocation) calls serialize past six in-flight. Streaming bodies count as one open connection for the duration of the stream. (workers limits.)
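Keeping fan-out width under that budget is a standard bounded-concurrency loop. A sketch (helper name is mine):

```javascript
// Run tasks with at most `limit` in flight, preserving result order.
async function mapLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // synchronous claim; safe in single-threaded JS
      results[i] = await fn(items[i], i);
    }
  }
  const width = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: width }, worker));
  return results;
}
```

Call it with limit 6 (or lower, to leave headroom for the metrics write) for any parallel upstream calls from one invocation.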
node.js compat requires the right compatibility_date
When importing a provider's Node-centric SDK, wrangler.toml needs compatibility_flags = ["nodejs_compat"] and compatibility_date >= "2024-09-23". A subset of Node APIs is only partially supported and will fail at call site with [unenv] <method> is not implemented yet!. This is the most common "worked locally, broke on deploy" failure for sdk-on-workers paths.
kv is eventually consistent; do not use it for rate limits
Writes are usually immediately visible at the originating colo but not guaranteed; propagation to other colos can take up to 60 seconds. KV has no read-your-writes guarantee. Use Durable Objects for anything that must be strongly consistent.
free tier exhausts fast for popular proxies
Workers Free is 100,000 requests/day with 10 ms CPU per invocation. A popular llm proxy can exhaust the daily budget in a few hours. Verify current allowances in the dashboard; migrate to Workers Paid ($5/month minimum) before a public launch.
api keys rotate atomically at wrangler, not at the provider
wrangler secret put updates the deployed Worker atomically, but if the old key is revoked at the provider before every deploy has flushed, in-flight requests using the old key will 401. Safe order: set the new secret, confirm the next deploy is live, then revoke the old key.
14. cost math (dated )
All figures below reflect primary-source pricing retrieved . Re-verify before quoting them to a customer.
anthropic messages api (per million tokens)
| model | input | 5m cache write | 1h cache write | cache read | output |
|---|---|---|---|---|---|
| claude-opus-4-7 | $5 | $6.25 | $10 | $0.50 | $25 |
| claude-sonnet-4-6 | $3 | $3.75 | $6 | $0.30 | $15 |
| claude-haiku-4-5 | $1 | $1.25 | $2 | $0.10 | $5 |
cloudflare workers and friends
- workers free. 100,000 requests/day. 10 ms CPU per invocation. No cost.
- workers paid (standard). $5/month minimum. 10M requests/month included, then $0.30 per additional million. 30M CPU-ms/month included, then $0.02 per additional million CPU-ms.
- r2 standard. $0.015/GB-month storage. Class A (writes) $4.50/M. Class B (reads) $0.36/M. Egress to Workers and to the public internet: free.
- durable objects. Per-request pricing plus wall-clock plus storage. SQLite-backed DOs are on Free and Paid.
- kv. Free: 100k reads/day, 1k writes/day, 1 GB storage. Paid: unlimited reads/writes, 25 MiB/value ceiling.
- analytics engine. Write-per-data-point pricing; check the current pricing page before heavy production use.
The shape that matters: for any llm-backed product, the model api is almost always the dominant cost line. Every optimization that increases cache hit rate, shortens the output, or switches a route from a larger to a smaller model saves real money. Every optimization that shaves a millisecond of Worker CPU saves cents. Spend engineering time accordingly.
15. failure-mode taxonomy
Named failures are fixable failures. Observed in deployments:
| name | signature | fix |
|---|---|---|
| overload 529 | upstream returns 529; often in bursts | short exponential backoff; ≤2 retries; surface clean error if exhausted |
| in-stream overloaded | event: error with overloaded_error mid-stream | close stream; error to user; do not retry after partial emission |
| rate limit 429 | 429 with retry-after | obey the header; add DO-side circuit breaker |
| sse stall | stream sends no events for >30s from client view | confirm SSE response headers; remove any intermediate json() |
| cache silent miss | billed in full on a request expected to hit cache | diff request bytes against the last known hit; usually a serialization drift |
| workspace cache miss | two keys in different workspaces no longer share cache | consolidate workspace, or accept the miss and re-tune breakpoints |
| do hot shard | one identity monopolizes a DO instance | shard the DO namespace by a salt; reject at Worker edge for outliers |
| r2 egress spike | unexpected R2 Class B op count | pre-sign client uploads; hydrate context once per session |
| subrequest ceiling | Worker hits the platform subrequest cap | collapse fan-out; move iterative calls into an Agent DO |
| 6-connection head-of-line | parallel fan-out serializes past six in-flight | keep the outbound fan-out width ≤6 per invocation |
| kv read-after-write miss | write followed by read returns old value | KV is eventually consistent; use a DO for read-your-writes |
| key-rotation 401 | silent 401 after rotating the provider key | atomic order: deploy new secret, verify, then revoke old key |
| tokenizer drift | budget-cap heuristics tuned on an older tokenizer blow past their caps | recalibrate; new-generation tokenizers can consume more tokens for the same text |
| older-model retirement | hard failure after the vendor's retirement date | migrate pinned identifier to a currently-supported model from the same vendor |
16. field report — running a few hundred zones
These notes draw from operating a personal edge-CDN namespace of several hundred zones via api automation. The actual lived workload is multi-domain management (DNS posture, mail records, routing) and edge as a scrape-shield in front of high-traffic surfaces — not a high-volume llm-proxy fleet. Receipts below reflect that. Numbers are operationally aggregated; specific identifiers withheld for OPSEC.
| receipt | observation | takeaway |
|---|---|---|
| scale | several-hundred-zone personal namespace; ~twelve-month observation window; one router worker on a kv content-map serves the static surfaces | one worker, one binding scales surprisingly far |
| automation stack | claude + wrangler + namecheap api as the management triad: an llm composes the wrangler/api calls; wrangler deploys; namecheap api drives registrations and contact/ns posture; bulk DNS/CAA/MX changes batched and verified against the namespace | llm-assisted ops collapse the multi-domain admin surface; api beats ui at scale by orders of magnitude |
| null-mx fleet rollout () | RFC 7505 null-MX + SPF -all + DMARC p=reject + empty-DKIM wildcard applied across the namespace in a single afternoon; ~700 records written; zero failures; dig-verified spot-checks across a random subsample | idempotent posture changes are batchable; one afternoon vs one week of dashboard work |
| edge as scrape-shield | workers placed in front of high-traffic surfaces to absorb scraping loads; cf cache + worker-side pre-flight (auth shape, ip class, rate) pushes the cost off origin before bytes egress | the edge IS the right layer for absorbing adversarial traffic; bot-fight-mode covers residual-proxy scrapers, workers cover the rest |
| per-route, not per-namespace | several hundred zones + a small router worker stays inside workers-free most days; the first paid-tier upgrade was driven by a single high-traffic surface, not by aggregate growth | the unit cost is per-route load, not zone count; price the surface that's actually hot |
Other observations:
A single router function scales surprisingly far. For static sites and small dynamic handlers, one Worker with a KV-backed content map handles hundreds of zones without breaking the free tier most days. The design is boring: request arrives, Worker looks up host + path in KV, response is the value. The hot path is a KV read and a Response.
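The router's hot path can be sketched in a few lines (names are mine; a Map stands in for the KV namespace, whose get is async in production but whose lookup logic is the same):

```javascript
// Exact host+path lookup with a fallback to the host's index entry.
function resolveRoute(contentMap, host, path) {
  return (
    contentMap.get(`${host}${path}`) ??
    contentMap.get(`${host}/index`) ??
    null
  );
}
```

Everything else in the router Worker is just turning the resolved value into a Response.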
Zone Analytics is enough for aggregate traffic. The GraphQL Analytics API returns per-zone 1-hour granularity traffic without writing a single log line from the Worker. For per-Worker, per-route metrics, Analytics Engine is the answer; for "is this zone alive, how many visitors today," Zone Analytics is the no-code answer.
DNS over API beats DNS via UI at scale. Bulk posture changes (null-MX, SPF, DMARC, CAA) applied across hundreds of zones via the provider API took minutes. The same work via the dashboard would have taken a week.
Null-MX with strict DMARC is the default for any zone that does not send mail. RFC 7505 null-MX plus SPF -all plus DMARC p=reject plus an empty-value DKIM wildcard makes spoofing unpleasant for attackers and costs nothing. One afternoon applied across the whole namespace.
Bot Fight Mode is not the right firewall for API traffic. It is a cheap bot filter primarily targeting residential-proxy scrapers. For API endpoints needing differentiated treatment, a Worker-side check (auth, rate limit, shape validation) is the right tool; Bot Fight Mode lives in front of static surfaces.
The crawler substrate is moving under everyone's feet, and Cloudflare is the one writing it down. Their December 2025 Radar year-in-review put AI bots at 4.2% of HTML traffic and 20% of verified-bot traffic, with "user-action" crawling up 15× year-over-year (radar 2025 year-in-review, retrieved ). On they shipped AI Training Redirects with hard numbers on their own docs: 4.8M training-crawler hits to developers.cloudflare.com over a rolling 30 days — 46k OpenAI, 3.6k Anthropic, 1.7k Meta against a legacy-docs cohort — and 100% of canonical-tagged legacy pages redirected within 7 days of enabling the feature. The dataset is dev-audience-skewed dogfood. It is also the closest thing to ground truth that exists outside the model labs.
Most of the web hasn't moved. Cloudflare's same-day Agent Readiness scan () found that only 78% of sites have a robots.txt at all, only 4% declare AI preferences via the Content Signals draft, 3.9% serve negotiated markdown, and fewer than 15 sites publish an MCP Server Card. Being in the top single-digit percent on agent-readiness is currently an afternoon of work, and the gap widens each quarter. For a namespace already running through the wrangler-and-llm triad, picking up Content Signals plus a markdown variant per zone is a batch job, not a project.
The free tier covers more than expected, until it doesn't. The painful points have been (a) the Workers free-tier daily request budget when an llm-backed utility went viral, and (b) R2 Class A ops when write patterns were careless. Both were cheap to fix by paying the next tier.
Treat SSE streams as a first-class output type. A Worker that streams tokens to the client feels dramatically more responsive than the same Worker that buffers. Users abandon buffered calls at far lower latencies than they abandon streaming ones, and the engineering cost of streaming is one-time.
17. prior art
None of the patterns above are original. Primary sources that back them:
- Anthropic's docs (canonical), the anthropic-cookbook, and the prompt caching guide. Where these notes contradict them, they are right.
- OpenAI's platform docs and Google's gemini api docs — authoritative on their respective provider shapes.
- Cloudflare's workers docs, durable objects docs, analytics engine docs, and agents docs. The examples in those docs are a more complete library than any third-party guide can be.
- Vercel's edge functions docs, Deno's deploy docs, Fastly's compute docs — authoritative on their respective runtimes.
- Community boilerplates for proxy-as-worker patterns, widely available in public repositories.
Corrections: claudeflare.cx/contact.
18. attestation
Dated, signed, reproducible from source. Regenerating attestation/attest.sh against the canonical source reproduces the block below.
-----BEGIN ATTESTATION-----
document    : edge llm deployment — field notes
version     : v3.0
published   : 2026-04-18T02:51:59Z
canonical   : claudeflare-cx/v3/canonical.md
git-sha     : 653a9a0525cb4b86dfc5ac002bceff92bcba3cfc-dirty
sha256      : 3909bf6d8e17add112ae6df3bdca24115e8bea5601f9c2ca3fa2c36919962e4d
archive.org : pending — snapshot will be embedded post-deploy
rekor-uuid  : pending — sigstore keyless cert will be embedded post-sign
license     : CC BY 4.0
-----END ATTESTATION-----