edge llm deployment — field notes
notes from running a few hundred Cloudflare zones.
1. what this is
I run a few hundred Cloudflare zones — a personal namespace of full-featured apps, sites, sketchpads, AI tools, scrape-shielded surfaces: the whole shape of what one operator can ship at the edge. This is not a hobby footprint. It works because the entire stack — DNS, registration, deploy, edge logic, monitoring — is driven from a script with an LLM in the loop composing the calls. Claude writes the wrangler invocations and the Namecheap API hits; wrangler ships the Workers; I read the result. One person runs the whole thing.
This is what I learned shipping it.
The snippets show the pairing I actually ship — Anthropic's Messages API on Cloudflare Workers — because code only runs as written if you pick one. The patterns generalize; OpenAI, Google, Groq, Cerebras, Vercel Edge, Deno Deploy, Fastly Compute swap in where I flag inline.
Assume you've shipped a Worker before and you have an API key. Where the official docs are authoritative I link them; where they're wrong or missing I say so. Everything's dated . After that, the numbers may be stale. Check before shipping.
claude-3-haiku-20240307 reaches its retirement date on ; code pinning that identifier should migrate to a currently-supported model from the same vendor before then. Comparable retirements have occurred or will occur across the other providers discussed here; always check the relevant vendor's model-lifecycle page before pinning.
2. about the .cx namespace
These notes live at claudeflare.cx, a personal domain used for community field notes and small utilities. The .cx side reads, in-house, as community extensions — third-party documentation and patterns that describe how the public surface of tools gets used by their communities. Content here is not limited to any one vendor; this edge-llm guide is one article in a broader set of independent notes.
3. the stack at a glance
Two axes: which provider you call, which runtime you call from. Most patterns survive the cross-product. The ones that don't, I flag inline.
llm providers referenced
| provider | api surface | primary-source docs |
|---|---|---|
| OpenAI | Chat Completions, Responses, Realtime | platform.openai.com/docs |
| Anthropic | Messages, prompt caching, batch | platform.claude.com/docs/en |
| Google | Gemini API, context caching | ai.google.dev/gemini-api/docs |
| Groq | OpenAI-compatible chat API on LPU | console.groq.com/docs |
| Cerebras | OpenAI-compatible chat API on wafer-scale inference | inference-docs.cerebras.ai |
edge runtimes referenced
| runtime | substrate | primary-source docs |
|---|---|---|
| Cloudflare Workers | V8 isolates on the Cloudflare network | developers.cloudflare.com/workers |
| Vercel Edge Functions | V8 isolates on Vercel's edge | vercel.com/docs/functions/edge-functions |
| Deno Deploy | Isolates running the Deno runtime | docs.deno.com/deploy |
| Fastly Compute | WebAssembly on Fastly POPs | fastly.com/documentation/guides/compute |
primitives an edge-llm proxy actually needs
| primitive | role | typical implementation |
|---|---|---|
| HTTP handler | fronting the vendor's api; streaming proxy; auth gate | any edge runtime's default request handler |
| secrets | holds provider api keys off the client; rotated via the runtime's cli | Cloudflare Workers Secrets; Vercel encrypted env vars; Deno Deploy secrets; Fastly secret store |
| strong-consistency state | per-user rate limiter; chat-session stickiness | Cloudflare Durable Objects; Vercel KV + lock; Upstash Redis; a small DO-equivalent; or a SQL row |
| small KV store | static assets; small idempotent response caches | Cloudflare KV; Vercel KV; Deno KV; Upstash |
| object storage | multi-megabyte context blobs (attachments, pdfs, transcripts) | Cloudflare R2; AWS S3; GCS; Backblaze B2 |
| metrics / time-series sink | per-call cost counters and ad-hoc metrics | Cloudflare Analytics Engine; Datadog; BigQuery; ClickHouse Cloud |
Bolt on a vector index (Vectorize, Turbopuffer, pgvector, Pinecone) for retrieval, plus a managed SQL/PG proxy (Hyperdrive, Neon, Supabase) when state gets heavy. Most small-to-medium edge-llm apps don't need either on day one. Start small.
4. pattern 1 — proxy worker for /v1/messages
Start with the dumbest proxy that survives a stream. Forward the bytes. Attach the key server-side. Don't touch the body. Auth, rate limiting, quotas, cost accounting all stack on top later, in their own functions, called from fetch.
// worker.js — thin streaming proxy in front of /v1/messages
export default {
async fetch(req, env) {
const url = new URL(req.url);
if (url.pathname !== "/v1/messages" || req.method !== "POST") {
return new Response("not found", { status: 404 });
}
// Forward the client's body and let the runtime stream it.
// Do NOT JSON.parse + re-stringify: that can miss the prompt cache.
const upstream = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body: req.body,
});
// Pass the upstream SSE stream through without buffering.
return new Response(upstream.body, {
status: upstream.status,
headers: {
"content-type": "text/event-stream",
"cache-control": "no-cache",
"connection": "keep-alive",
},
});
},
};
Three things to notice. First, the upstream fetch passes req.body directly; the runtime handles the streaming request body. Second, the response body is passed through unchanged: do not JSON-parse it, do not buffer it, do not re-emit it — that path misses the prompt cache (§6). Third, connection: keep-alive plus cache-control: no-cache is the shape recommended by cloudflare's agents HTTP/SSE reference. x-accel-buffering is an nginx-specific header; cloudflare's edge does not document it as required, and it is harmless to omit in a workers-only path.
In front of this, a real Worker adds: an auth check on the incoming request, a rate-limit call to a Durable Object (§10), an Analytics Engine write (§11), and an error envelope for non-2xx upstream. Keep the handler small; push policy into explicit functions called from fetch.
Other-provider swap: replace the upstream url with https://api.openai.com/v1/chat/completions and the auth header with authorization: Bearer $OPENAI_API_KEY; for gemini, https://generativelanguage.googleapis.com/v1beta/models/$MODEL:streamGenerateContent?key=$KEY. Other-runtime swap: the same handler shape runs on vercel edge functions, deno deploy, and (with a small wasm shim) fastly compute.
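The provider swap above is mechanical enough to express as a lookup. A minimal sketch, assuming the env binding names shown earlier (ANTHROPIC_API_KEY; OPENAI_API_KEY and GEMINI_API_KEY are hypothetical parallels) and the URLs quoted in the paragraph above — verify both against current vendor docs before shipping:

```javascript
// Hypothetical helper: per-provider upstream config for the same proxy shape.
function upstreamFor(provider, env, model) {
  switch (provider) {
    case "anthropic":
      return {
        url: "https://api.anthropic.com/v1/messages",
        headers: {
          "anthropic-version": "2023-06-01",
          "x-api-key": env.ANTHROPIC_API_KEY,
        },
      };
    case "openai":
      return {
        url: "https://api.openai.com/v1/chat/completions",
        headers: { authorization: `Bearer ${env.OPENAI_API_KEY}` },
      };
    case "gemini":
      // Gemini passes the key as a query parameter rather than a header.
      return {
        url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:streamGenerateContent?key=${env.GEMINI_API_KEY}`,
        headers: {},
      };
    default:
      throw new Error(`unknown provider: ${provider}`);
  }
}
```

The rest of the proxy — body forwarding, stream passthrough — stays identical across the three cases.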
5. pattern 2 — secret handling
The provider api key goes into Workers Secrets, not into [vars] in wrangler.toml, and not into a repo-checked .dev.vars for production. Cloudflare's secrets docs are explicit that [vars] is plaintext and meant for non-sensitive config; use wrangler secret put for anything that would not be pasted into a public repo.
# set a secret (prompts once, stored encrypted)
wrangler secret put ANTHROPIC_API_KEY
# list secret bindings on this Worker
wrangler secret list
# deploy with the new secret live
wrangler deploy
In code, read via env.ANTHROPIC_API_KEY. Never log it; never include it in an error message; never echo it back in a response header. If request bodies are logged for debugging, strip authorization and x-api-key before writing.
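The header-stripping rule above is worth making explicit. A minimal sketch (the helper name and the redaction marker are my own, not an API):

```javascript
// Hypothetical logging helper: drop credential headers before writing a
// request to any log sink. Everything else passes through unchanged.
const SENSITIVE = new Set(["authorization", "x-api-key", "cookie"]);

function scrubHeaders(headers) {
  const out = {};
  for (const [name, value] of Object.entries(headers)) {
    out[name.toLowerCase()] = SENSITIVE.has(name.toLowerCase())
      ? "[redacted]"
      : value;
  }
  return out;
}
```

Run every debug log line through this; it is cheaper than rotating a leaked key.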
For multi-tenant deployments where each customer holds their own key, pass the key in from the client over TLS and forward it with the request — do not store customer keys in Workers Secrets, because Secrets are per-Worker-global, and a single leaked Worker bundle would expose every tenant. For BYO-key, validate the key shape server-side before forwarding, and forward with the header name the upstream expects.
Other-runtime swap: vercel encrypted env vars (vercel env add); deno deploy environment variables; fastly secret store. All expose the same "set once, read at runtime, never checked in" shape.
6. pattern 3 — preserving prompt cache through a worker
Prompt caching is a substantial cost lever on anthropic's messages api. Cache reads are priced at 0.1× base input; 5-minute cache writes at 1.25× base; 1-hour cache writes at 2× base (prompt caching docs, retrieved ). Break-even: a 5-minute block pays off after one read; a 1-hour block after two. A Worker that breaks the cache pays full price. Openai's prompt caching and google's context caching follow the same engineering shape with different pricing; the worker-side rules below generalize.
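The break-even figures above are just multiplier arithmetic — the write premium is (writeMult − 1) × base, and each read saves (1 − readMult) × base:

```javascript
// Reads needed before a cache write pays for itself, given the write and
// read price multipliers relative to base input pricing.
function breakEvenReads(writeMult, readMult) {
  return Math.ceil((writeMult - 1) / (1 - readMult));
}
// 5-minute cache (1.25x write, 0.1x read): pays off after 1 read.
// 1-hour cache (2x write, 0.1x read): pays off after 2 reads.
```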
The cache is keyed on the prefix of the request's tools → system → messages blocks, with cache_control: {"type": "ephemeral"} marking cacheable boundaries. Up to four explicit breakpoints per request. Minimum cacheable block size depends on the model: current opus and haiku tiers require 4,096 tokens; current sonnet requires 2,048; older sonnets and some haikus require less. Requests below the minimum are processed without caching and return no error.
To preserve the cache across a Worker:
- Forward request bytes; do not re-serialize. JSON.stringify(JSON.parse(body)) can reorder keys, add whitespace, or strip a field. Any of those misses the cache.
- Do not change block ordering. Cache matching is prefix-based; a shuffled block list is a different prefix.
- Do not strip cache_control breakpoints. They are the client's statement of what to cache; removing them silently increases cost.
- Place breakpoints on the last block that is identical across requests, not on changing content. A breakpoint on the user's timestamp defeats itself.
- Keep policy pre-flight (auth, rate limit) separate from the body-forward path. A policy failure should return before the request body is touched.
On the response side, the api returns cache telemetry in the usage block: cache_creation_input_tokens, cache_read_input_tokens, ephemeral_5m_input_tokens, ephemeral_1h_input_tokens. Total input tokens are cache_read + cache_creation + input_tokens; the input_tokens field alone counts only tokens after the last cache breakpoint. Write the full shape to Analytics Engine (§11) so hit rate is visible over time. A falling hit rate is usually a Worker that started rewriting request bytes it should have forwarded unchanged.
On current 4.x-tier anthropic models, cache_read_input_tokens are excluded from input-tokens-per-minute rate limits. Effective ITPM with hot caches is substantially higher than the raw limit suggests.
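The hit-rate number worth charting falls out of the usage fields described above. A small sketch (helper name is mine; field names are the documented telemetry):

```javascript
// Cache hit rate from a Messages API usage block: cached reads as a share
// of total input tokens (read + creation + post-breakpoint fresh tokens).
function cacheHitRate(usage) {
  const read = usage.cache_read_input_tokens ?? 0;
  const create = usage.cache_creation_input_tokens ?? 0;
  const fresh = usage.input_tokens ?? 0;
  const total = read + create + fresh;
  return total === 0 ? 0 : read / total;
}
```

Write the ratio (or its components) to Analytics Engine per call; a sudden drop is the serialization-drift signal described above.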
7. pattern 4 — retries, backoff, and 529
Three status shapes to handle: 429 Too Many Requests (a limit), 529 Overloaded (the upstream is shedding load), and an in-stream overloaded_error event (the same overload expressed as an SSE event rather than an HTTP status). The messages api returns a retry-after header on 429 where appropriate. Other providers use compatible shapes; the worker-side loop is the same.
async function sendWithRetry(body, env, { maxRetries = 2 } = {}) {
let attempt = 0;
while (true) {
const r = await fetch("https://api.anthropic.com/v1/messages", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body,
});
// Happy path and non-retryable errors pass through.
const retryable = r.status === 429 || r.status === 529 || r.status >= 500;
if (!retryable || attempt >= maxRetries) return r;
const retryAfter = Number(r.headers.get("retry-after")) || 0;
const backoff = Math.max(
retryAfter * 1000,
Math.min(8000, 500 * 2 ** attempt + Math.random() * 250),
);
await new Promise(res => setTimeout(res, backoff));
attempt++;
}
}
Three rules of thumb. First, do not retry a streaming request that has already emitted tokens to the client; restart from the beginning on the client, or don't restart. Second, handle in-stream overloaded_error events by closing the stream and surfacing a clean error to the user; do not attempt to retry mid-stream. Third, keep maxRetries small (0 to 2) — Workers' wall-clock budget while a client is connected is generous, but a Worker that retries for minutes hits user-patience limits before platform limits.
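Detecting the in-stream overloaded_error means watching SSE frames as they pass. A minimal line-level sketch, assuming the Anthropic streaming shape (an "event: error" line followed by a JSON "data:" payload) — real code should buffer partial frames across chunks:

```javascript
// Scan a chunk of SSE text for an error event carrying overloaded_error.
function hasOverloadedError(sseText) {
  const lines = sseText.split("\n");
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].trim() !== "event: error") continue;
    const dataLine = lines[i + 1] ?? "";
    if (!dataLine.startsWith("data:")) continue;
    try {
      const payload = JSON.parse(dataLine.slice(5));
      if (payload.error?.type === "overloaded_error") return true;
    } catch {
      // Frame split across chunks; caller should re-buffer and retry.
    }
  }
  return false;
}
```

When this fires mid-stream, close the client stream and surface a clean error, per the second rule above.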
Read the anthropic-ratelimit-* response headers into Analytics Engine: -requests-remaining, -input-tokens-remaining, -output-tokens-remaining. A running chart of "remaining budget at time of request" turns capacity planning into a look-at-the-graph problem.
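Collecting those headers into one metrics-ready object is a few lines. A sketch (helper name is mine; header names are as documented; works with anything exposing a .get(), such as Headers):

```javascript
// Snapshot the anthropic-ratelimit-* remaining-budget headers as numbers,
// null where the header is absent.
function rateLimitSnapshot(headers) {
  const read = (name) => {
    const v = headers.get(`anthropic-ratelimit-${name}`);
    return v === null || v === undefined ? null : Number(v);
  };
  return {
    requestsRemaining: read("requests-remaining"),
    inputTokensRemaining: read("input-tokens-remaining"),
    outputTokensRemaining: read("output-tokens-remaining"),
  };
}
```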
8. pattern 5 — token counting at the edge
Anthropic exposes /v1/messages/count_tokens, which returns the input-token count for a given request without executing the model. Useful for pre-call budget checks, free-tier gating, or warning a user before they blow a daily allowance. Openai exposes a similar shape via tiktoken locally or via usage-preview endpoints; gemini exposes countTokens.
async function countTokens(reqBody, env) {
const r = await fetch("https://api.anthropic.com/v1/messages/count_tokens", {
method: "POST",
headers: {
"content-type": "application/json",
"anthropic-version": "2023-06-01",
"x-api-key": env.ANTHROPIC_API_KEY,
},
body: reqBody,
});
if (!r.ok) return null;
const { input_tokens } = await r.json();
return input_tokens;
}
Call this conditionally. In a proxy Worker, only count when the client is near a quota boundary; otherwise skip the round-trip. A reasonable heuristic: only count if the client's cumulative usage this window is within 20% of their cap.
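The 20% heuristic above reduces to a one-line predicate (names are mine):

```javascript
// Only spend the count_tokens round-trip when this window's usage is
// within `margin` of the cap.
function shouldCountTokens(usedThisWindow, cap, margin = 0.2) {
  return usedThisWindow >= cap * (1 - margin);
}
```

Gate the countTokens call on this, and the extra round-trip only appears for users who are actually near their allowance.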
9. pattern 6 — r2 for long context
For attachments that are already big — pdfs, transcripts, codebase dumps — store them in R2 and inject references into the request rather than carrying the raw bytes through every call. R2 Standard storage is $0.015/GB-month; egress to the public internet is free; Class B (reads) are $0.36/M (R2 pricing, retrieved ). The same pattern holds with S3, GCS, or Backblaze B2 on other runtimes; differences are pricing and egress shape, not engineering.
Two patterns worth knowing:
- Pre-signed uploads. Generate a pre-signed PUT URL in the Worker, return it to the client, let the client upload directly to R2. Saves Worker CPU-ms.
- Cached context hydration. When building a request that references an R2 object, fetch the object inside the Worker and inline it into the messages block as a document. If the prefix is stable across calls, the prompt cache covers most of the cost; R2 is then a cheap persistence layer rather than a hot path.
Do not proxy the R2 object through the Worker on every request. Serve directly from R2 (via a Worker-signed URL for private objects) when possible; that avoids pushing bytes through Workers' request/response-size envelopes.
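The hydration pattern above can be sketched as follows. The document-block shape is my reading of the Messages API's plain-text document content type, and CONTEXT_BUCKET is an assumed R2 binding name — check both against the current docs:

```javascript
// Wrap text as a Messages API plain-text document block. Keep the bytes
// identical across calls so the prompt cache covers the hydrated context.
function documentBlockFromText(text, title) {
  return {
    type: "document",
    source: { type: "text", media_type: "text/plain", data: text },
    title,
    cache_control: { type: "ephemeral" },
  };
}

// Read an object from an R2 binding and hydrate it into a document block.
async function hydrate(env, key) {
  const obj = await env.CONTEXT_BUCKET.get(key); // assumed binding name
  if (!obj) return null;
  return documentBlockFromText(await obj.text(), key);
}
```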
Anthropic's Files API (anthropic-beta: files-api-2025-04-14, retrieved ) allows uploading a file once and referencing it by id across requests. For files reused many times within a workspace, that is often a better primitive than R2; use R2 when the file is already in the storage graph or needs a public URL.
10. pattern 7 — durable objects for rate limiting
Per-user rate limiting in a stateless Worker is impossible; every request may land on a different isolate, and KV is eventually consistent (propagation up to 60 seconds, per KV docs). Durable Objects solve this: each DO instance is a singleton addressed by a stable name, with strongly consistent in-memory state and an alarm API for time-based effects. On other runtimes the equivalent is Upstash Redis with per-key locks, Deno KV with atomics, or Vercel KV + a lock library.
// Durable Object: per-user token bucket
export class RateLimiter {
constructor(state) { this.state = state; this.tokens = null; this.last = 0; }
async fetch(req) {
if (this.tokens === null) {
const saved = await this.state.storage.get("bucket")
|| { tokens: 60, last: Date.now() };
this.tokens = saved.tokens;
this.last = saved.last;
}
const now = Date.now();
const refill = (now - this.last) / 1000 * 1; // 1 token/sec
this.tokens = Math.min(60, this.tokens + refill);
this.last = now;
if (this.tokens < 1) {
await this.state.storage.put("bucket",
{ tokens: this.tokens, last: this.last });
return new Response("rate limited",
{ status: 429, headers: { "retry-after": "1" } });
}
this.tokens -= 1;
await this.state.storage.put("bucket",
{ tokens: this.tokens, last: this.last });
return new Response("ok");
}
}
// Worker side
export default {
async fetch(req, env) {
const userId = await authenticate(req);
const id = env.RATE_LIMITER.idFromName(userId);
const limiter = env.RATE_LIMITER.get(id);
const check = await limiter.fetch("https://limiter/check");
if (check.status === 429) return check;
// ...proceed to forward to /v1/messages...
},
};
Two refinements worth the lines. First, put a small in-memory cache in the Worker in front of the DO so the common case (plenty of tokens, recently checked) does not cross a network hop. Second, use the DO alarm API to lazily reset buckets instead of polling.
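The first refinement can be sketched as a per-isolate allow cache (all names are mine; the DO stays the source of truth, so keep the TTL well under the bucket window):

```javascript
// Recently allowed users skip the DO hop for a short TTL.
const localAllow = new Map(); // userId -> expiry timestamp (ms)

function locallyAllowed(userId, now = Date.now()) {
  const until = localAllow.get(userId);
  return until !== undefined && now < until;
}

function recordAllowed(userId, now = Date.now(), ttlMs = 1000) {
  localAllow.set(userId, now + ttlMs);
}
```

In the Worker's fetch: if locallyAllowed(userId), skip the limiter.fetch; otherwise cross to the DO and call recordAllowed on a 200.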
11. pattern 8 — analytics engine for per-call metrics
Cloudflare's Analytics Engine (AE) provides a per-Worker time-series sink writable at request rate and queryable with SQL. For an llm proxy, write one data point per call: model id, user bucket, route, input tokens, output tokens, cache read tokens, status code, latency. Each data point is a row aggregated later. Other-runtime equivalents: Datadog, BigQuery streaming inserts, ClickHouse Cloud, Honeycomb.
env.AE.writeDataPoint({
blobs: [model, userBucket, route], // strings for grouping
doubles: [inputTokens, outputTokens, cacheRead, latencyMs],
indexes: [userBucket], // sampling key
});
AE limits as of : ≤250 data points per Worker invocation, ≤20 blobs and ≤20 doubles per data point, one index per call, all blobs ≤16 KB total, three-month retention (AE limits). Query through the SQL API for external dashboards (Grafana) or the GraphQL Analytics API for in-dashboard views.
Budget discipline: each AE write is billable. For high-QPS Workers, sample at the point of writing — e.g. 1/10 of successful calls plus 100% of errors. A good cost dashboard is one that does not dominate the cost it measures.
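The sampling rule above, as a predicate (rand is injected so the decision is testable):

```javascript
// Keep every error, ~1 in 10 successes.
function shouldWriteDataPoint(status, rand = Math.random()) {
  if (status >= 400) return true; // 100% of errors
  return rand < 0.1;              // ~10% of successes
}
```

Guard the writeDataPoint call with this; remember to scale success counts back up by the sampling factor at query time.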
12. pattern 9 — agents sdk and remote mcp over workers
For anything with durable tool state — a long-running assistant, a multi-turn retrieval agent, a remote MCP server hosting tools — cloudflare's agents sdk on top of Durable Objects is the shape that works. Each Agent instance is a DO; tools are methods; state persists across requests and hibernation; WebSocket hibernation keeps long-lived clients cheap.
Key primitives from the agents sdk: the Agent base class (per-agent SQL DB plus KV state), AIChatAgent for chat use cases, the @callable() decorator for typed RPC over WebSocket, and React hooks useAgent / useAgentChat. All inherit DO scheduling and hibernation.
For remote MCP servers specifically, the remote MCP guide (retrieved ) documents three paths:
- createMcpHandler() — stateless; no Durable Objects; fastest to deploy.
- McpAgent — stateful; requires DOs; supports both SSE and Streamable HTTP transports; per-session state; tool elicitation. Streamable HTTP is the current MCP spec standard.
- Raw WebStandardStreamableHTTPServerTransport — direct SDK, maximum control.
Inside a tool call, use the provider api as needed; return the tool result as the MCP response. Keep each tool's upstream call bounded. If a tool call runs long enough that platform limits become relevant, return a correlation id, kick off the work via waitUntil, and expose a /status/:id endpoint the client can poll.
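The correlation-id-plus-poll shape can be sketched with an in-memory registry (all names are mine; in production the state lives in the Agent's DO storage, not a Map):

```javascript
// Minimal job registry: start returns immediately, work completes out of
// band (e.g. via waitUntil), status is polled by id.
const jobs = new Map();

function startJob(id) {
  jobs.set(id, { status: "running", result: null });
  return id;
}

function completeJob(id, result) {
  const job = jobs.get(id);
  if (job) { job.status = "done"; job.result = result; }
}

function jobStatus(id) {
  return jobs.get(id) ?? { status: "unknown", result: null };
}
```

The /status/:id handler is then a one-line jobStatus lookup serialized to JSON.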
13. gotchas
Observed in the field, in rough order of pain:
sse stalls without connection: keep-alive, not x-accel-buffering
A browser that does not receive SSE chunks in real time is the most common symptom of a misconfigured proxy. The canonical shape is content-type: text/event-stream + cache-control: no-cache + connection: keep-alive. x-accel-buffering is an nginx directive and is not documented as required on cloudflare's edge; it is harmless to include but should not be assumed to fix a Workers SSE bug.
cpu-ms vs wall-clock are different budgets
A streaming Worker burns wall-clock (waiting on the upstream) but very little CPU-ms. On the paid tier, default CPU limit per HTTP invocation is 30 s (configurable to 5 min); wall-clock is not enforced while the client is connected (workers limits, retrieved ). On the free tier, CPU is capped at 10 ms per invocation — any non-trivial llm proxy needs the paid plan.
json.parse of a streaming response buffers it
Calling response.json() on an upstream SSE response buffers the entire stream into memory before returning. Use response.body for streaming; response.json() only for non-streaming calls.
prompt-cache misses from subtle mutations
Re-serializing the request body, sorting JSON keys, adding a trailing newline, or injecting a per-request timestamp into the cached prefix will all miss the cache. Treat the request body as opaque bytes on the hot path; if inspection is required, clone first.
6 concurrent outbound connections waiting on headers
Workers allow 6 simultaneous open outbound HTTPS connections awaiting response headers; once headers arrive the connection stops counting. For a fan-out pattern (parallel calls from a single invocation) calls serialize past six in-flight. Streaming bodies count as one open connection for the duration of the stream. (workers limits.)
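Keeping fan-out width under that budget is a standard bounded-concurrency loop. A sketch (helper name is mine):

```javascript
// Run tasks with at most `limit` in flight, preserving result order.
async function mapLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const i = next++; // synchronous claim; safe in single-threaded JS
      results[i] = await fn(items[i], i);
    }
  }
  const width = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: width }, worker));
  return results;
}
```

Call it with limit 6 (or lower, to leave headroom for the metrics write) for any parallel upstream calls from one invocation.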
node.js compat requires the right compatibility_date
When importing a provider's Node-centric SDK, wrangler.toml needs compatibility_flags = ["nodejs_compat"] and compatibility_date >= "2024-09-23". A subset of Node APIs is only partially supported and will fail at call site with [unenv] <method> is not implemented yet!. This is the most common "worked locally, broke on deploy" failure for sdk-on-workers paths.
kv is eventually consistent; do not use it for rate limits
Writes are usually immediately visible at the originating colo but not guaranteed; propagation to other colos can take up to 60 seconds. KV has no read-your-writes guarantee. Use Durable Objects for anything that must be strongly consistent.
free tier exhausts fast for popular proxies
Workers Free is 100,000 requests/day with 10 ms CPU per invocation. A popular llm proxy can exhaust the daily budget in a few hours. Verify current allowances in the dashboard; migrate to Workers Paid ($5/month minimum) before a public launch.
api keys rotate atomically at wrangler, not at the provider
wrangler secret put updates the deployed Worker atomically, but if the old key is revoked at the provider before every deploy has flushed, in-flight requests using the old key will 401. Safe order: set the new secret, confirm the next deploy is live, then revoke the old key.
14. cost math (dated )
All figures below reflect primary-source pricing retrieved . Re-verify before quoting them to a customer.
anthropic messages api (per million tokens)
| model | input | 5m cache write | 1h cache write | cache read | output |
|---|---|---|---|---|---|
| claude-opus-4-7 | $5 | $6.25 | $10 | $0.50 | $25 |
| claude-sonnet-4-6 | $3 | $3.75 | $6 | $0.30 | $15 |
| claude-haiku-4-5 | $1 | $1.25 | $2 | $0.10 | $5 |
cloudflare workers and friends
- workers free. 100,000 requests/day. 10 ms CPU per invocation. No cost.
- workers paid (standard). $5/month minimum. 10M requests/month included, then $0.30 per additional million. 30M CPU-ms/month included, then $0.02 per additional million CPU-ms.
- r2 standard. $0.015/GB-month storage. Class A (writes) $4.50/M. Class B (reads) $0.36/M. Egress to Workers and to the public internet: free.
- durable objects. Per-request pricing plus wall-clock plus storage. SQLite-backed DOs are on Free and Paid.
- kv. Free: 100k reads/day, 1k writes/day, 1 GB storage. Paid: unlimited reads/writes, 25 MiB/value ceiling.
- analytics engine. Write-per-data-point pricing; check the current pricing page before heavy production use.
The shape that matters: for any llm-backed product, the model api is almost always the dominant cost line. Every optimization that increases cache hit rate, shortens the output, or switches a route from a larger to a smaller model saves real money. Every optimization that shaves a millisecond of Worker CPU saves cents. Spend engineering time accordingly.
15. failure-mode taxonomy
Named failures are fixable failures. Observed in deployments:
| name | signature | fix |
|---|---|---|
| overload 529 | upstream returns 529; often in bursts | short exponential backoff; ≤2 retries; surface clean error if exhausted |
| in-stream overloaded | event: error with overloaded_error mid-stream | close stream; error to user; do not retry after partial emission |
| rate limit 429 | 429 with retry-after | obey the header; add DO-side circuit breaker |
| sse stall | stream sends no events for >30s from client view | confirm SSE response headers; remove any intermediate json() |
| cache silent miss | billed in full on a request expected to hit cache | diff request bytes against the last known hit; usually a serialization drift |
| workspace cache miss | two keys in different workspaces no longer share cache | consolidate workspace, or accept the miss and re-tune breakpoints |
| do hot shard | one identity monopolizes a DO instance | shard the DO namespace by a salt; reject at Worker edge for outliers |
| r2 egress spike | unexpected R2 Class B op count | pre-sign client uploads; hydrate context once per session |
| subrequest ceiling | Worker hits the platform subrequest cap | collapse fan-out; move iterative calls into an Agent DO |
| 6-connection head-of-line | parallel fan-out serializes past six in-flight | keep the outbound fan-out width ≤6 per invocation |
| kv read-after-write miss | write followed by read returns old value | KV is eventually consistent; use a DO for read-your-writes |
| key-rotation 401 | silent 401 after rotating the provider key | atomic order: deploy new secret, verify, then revoke old key |
| tokenizer drift | budget-cap heuristics tuned on an older tokenizer blow past their caps | recalibrate; new-generation tokenizers can consume more tokens for the same text |
| older-model retirement | hard failure after the vendor's retirement date | migrate pinned identifier to a currently-supported model from the same vendor |
16. field report — running a few hundred zones
These notes draw from operating a personal edge-CDN namespace of several hundred zones via api automation. The actual lived workload is multi-domain management (DNS posture, mail records, routing) and edge as a scrape-shield in front of high-traffic surfaces — not a high-volume llm-proxy fleet. Receipts below reflect that. Numbers are operationally aggregated; specific identifiers withheld for OPSEC.
| receipt | observation | takeaway |
|---|---|---|
| scale | several-hundred-zone personal namespace; ~twelve-month observation window; one router worker on a kv content-map serves the static surfaces | one worker, one binding scales surprisingly far |
| automation stack | claude + wrangler + namecheap api as the management triad: an llm composes the wrangler/api calls; wrangler deploys; namecheap api drives registrations and contact/ns posture; bulk DNS/CAA/MX changes batched and verified against the namespace | llm-assisted ops collapse the multi-domain admin surface; api beats ui at scale by orders of magnitude |
| null-mx fleet rollout () | RFC 7505 null-MX + SPF -all + DMARC p=reject + empty-DKIM wildcard applied across the namespace in a single afternoon; ~700 records written; zero failures; dig-verified spot-checks across a random subsample | idempotent posture changes are batchable; one afternoon vs one week of dashboard work |
| edge as scrape-shield | workers placed in front of high-traffic surfaces to absorb scraping loads; cf cache + worker-side pre-flight (auth shape, ip class, rate) pushes the cost off origin before bytes egress | the edge IS the right layer for absorbing adversarial traffic; bot-fight-mode covers residual-proxy scrapers, workers cover the rest |
| per-route, not per-namespace | several hundred zones + a small router worker stays inside workers-free most days; the first paid-tier upgrade was driven by a single high-traffic surface, not by aggregate growth | the unit cost is per-route load, not zone count; price the surface that's actually hot |
Other observations:
A single router function scales surprisingly far. For static sites and small dynamic handlers, one Worker with a KV-backed content map handles hundreds of zones without breaking the free tier most days. The design is boring: request arrives, Worker looks up host + path in KV, response is the value. The hot path is a KV read and a Response.
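The router's hot path can be sketched in a few lines (names are mine; a Map stands in for the KV namespace, whose get is async in production but whose lookup logic is the same):

```javascript
// Exact host+path lookup with a fallback to the host's index entry.
function resolveRoute(contentMap, host, path) {
  return (
    contentMap.get(`${host}${path}`) ??
    contentMap.get(`${host}/index`) ??
    null
  );
}
```

Everything else in the router Worker is just turning the resolved value into a Response.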
Zone Analytics is enough for aggregate traffic. The GraphQL Analytics API returns per-zone 1-hour granularity traffic without writing a single log line from the Worker. For per-Worker, per-route metrics, Analytics Engine is the answer; for "is this zone alive, how many visitors today," Zone Analytics is the no-code answer.
DNS over API beats DNS via UI at scale. Bulk posture changes (null-MX, SPF, DMARC, CAA) applied across hundreds of zones via the provider API took minutes. The same work via the dashboard would have taken a week.
Null-MX with strict DMARC is the default for any zone that does not send mail. RFC 7505 null-MX plus SPF -all plus DMARC p=reject plus an empty-value DKIM wildcard makes spoofing unpleasant for attackers and costs nothing. One afternoon applied across the whole namespace.
Bot Fight Mode is not the right firewall for API traffic. It is a cheap bot filter primarily targeting residential-proxy scrapers. For API endpoints needing differentiated treatment, a Worker-side check (auth, rate limit, shape validation) is the right tool; Bot Fight Mode lives in front of static surfaces.
The crawler substrate is moving under everyone's feet, and Cloudflare is the one writing it down. Their December 2025 Radar year-in-review put AI bots at 4.2% of HTML traffic and 20% of verified-bot traffic, with "user-action" crawling up 15× year-over-year (radar 2025 year-in-review, retrieved ). On they shipped AI Training Redirects with hard numbers on their own docs: 4.8M training-crawler hits to developers.cloudflare.com over a rolling 30 days — 46k OpenAI, 3.6k Anthropic, 1.7k Meta against a legacy-docs cohort — and 100% of canonical-tagged legacy pages redirected within 7 days of enabling the feature. The dataset is dev-audience-skewed dogfood. It is also the closest thing to ground truth that exists outside the model labs.
Most of the web hasn't moved. Cloudflare's same-day Agent Readiness scan () found that only 78% of sites have a robots.txt at all, only 4% declare AI preferences via the Content Signals draft, 3.9% serve negotiated markdown, and fewer than 15 sites publish an MCP Server Card. Being in the top single-digit percent on agent-readiness is currently an afternoon of work, and the gap widens each quarter. For a namespace already running through the wrangler-and-llm triad, picking up Content Signals plus a markdown variant per zone is a batch job, not a project.
The free tier covers more than expected, until it doesn't. The painful points have been (a) the Workers free-tier daily request budget when an llm-backed utility went viral, and (b) R2 Class A ops when write patterns were careless. Both were cheap to fix by paying the next tier.
Treat SSE streams as a first-class output type. A Worker that streams tokens to the client feels dramatically more responsive than the same Worker that buffers. Users abandon buffered calls at far lower latencies than they abandon streaming ones, and the engineering cost of streaming is one-time.
17. prior art
None of the patterns above are original. Primary sources that back them:
- Anthropic's docs (canonical), the anthropic-cookbook, and the prompt caching guide. Where these notes contradict them, they are right.
- OpenAI's platform docs and Google's gemini api docs — authoritative on their respective provider shapes.
- Cloudflare's workers docs, durable objects docs, analytics engine docs, and agents docs. The examples in those docs are a more complete library than any third-party guide can be.
- Vercel's edge functions docs, Deno's deploy docs, Fastly's compute docs — authoritative on their respective runtimes.
- Community boilerplates for proxy-as-worker patterns, widely available in public repositories.
Corrections: claudeflare.cx/contact.
18. attestation
Dated, signed, reproducible from source. Regenerating attestation/attest.sh against the canonical source reproduces the block below.
-----BEGIN ATTESTATION-----
document    : edge llm deployment — field notes
version     : v3.0
published   : 2026-04-18T02:51:59Z
canonical   : claudeflare-cx/v3/canonical.md
git-sha     : 653a9a0525cb4b86dfc5ac002bceff92bcba3cfc-dirty
sha256      : 3909bf6d8e17add112ae6df3bdca24115e8bea5601f9c2ca3fa2c36919962e4d
archive.org : pending — snapshot will be embedded post-deploy
rekor-uuid  : pending — sigstore keyless cert will be embedded post-sign
license     : CC BY 4.0
-----END ATTESTATION-----