Workers AI as a triage filter, not a generator

When teams first reach for Workers AI (or any small-model AI service running close to the edge), the default mental model is “use it to generate content.” Generate the email reply. Generate the summary. Generate the description. This is the obvious move because it’s what the demos show.

In production, the move that’s been earning its keep is the opposite end of the pipeline: Workers AI as a TRIAGE filter at INGEST, not as a GENERATOR at the OUTPUT. The reasoning is consistent across the two systems I’ve shipped this pattern in — Agentic Inbox, an AI-assisted email client, and an internal IRCC immigration intel monitor — and is worth writing down because it inverts the default intuition.

The naive default

Picture an email assistant. The naive design:

inbound email
   ▼
store in inbox
   ▼
user opens email
   ▼
[user clicks "Draft reply"]
   ▼
Workers AI generates reply
   ▼
user edits and sends

The Workers AI call is at the end. It does the most visible thing — generates English text. But it’s also the most expensive call (long context = many tokens), the most latency-sensitive (user is waiting on the button click), and the most quality-sensitive (a bad reply embarrasses the user). Three failure modes stacked into one call.

The inverted pattern

The pattern that works:

inbound email
   ▼
[Workers AI scores: is this important? what category? draft worth it?]
   ▼
store with score + category + draft hint
   ▼
user opens inbox → sorted by importance score
   ▼
user opens email → auto-draft already there (pre-computed at ingest for draft-worthy emails)
   ▼
user edits and sends

The Workers AI call is at INGEST. It does cheaper things — score an importance number, classify a category, decide whether a draft is even worth pre-computing. The score and category drive the inbox UI; the draft (if it was generated) is already there when the user opens the email.
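A minimal sketch of what the ingest-time call might look like, assuming the model is asked to reply with a small JSON object. `AiLike` is a stand-in for the Workers AI binding, reduced to a `run(model, { messages })` call that resolves to an object with a `response` string. The prompt wording, the field names, and the `NEUTRAL` fallback are illustrative assumptions, not code from either system.

```typescript
// Stand-in for the Workers AI binding (env.AI in a Worker). The shape is
// reduced to what this sketch needs and is an assumption.
interface AiLike {
  run(
    model: string,
    input: { messages: { role: "system" | "user"; content: string }[] },
  ): Promise<{ response?: string }>;
}

type Category = "transactional" | "personal" | "sales" | "newsletter" | "receipt";

interface Triage {
  importance: number;   // 0..1, drives sort order
  category: Category;   // drives folder routing
  draftWorthy: boolean; // should a draft be pre-computed?
}

const CATEGORIES: readonly Category[] = [
  "transactional", "personal", "sales", "newsletter", "receipt",
];

// Hypothetical neutral result, used whenever the model output is unusable.
const NEUTRAL: Triage = { importance: 0.5, category: "personal", draftWorthy: false };

const SYSTEM_PROMPT =
  "Classify the email. Reply with JSON only: " +
  '{"importance": <number 0..1>, "category": <one of ' +
  CATEGORIES.join("|") +
  '>, "draftWorthy": <boolean>}';

// Turn untrusted model text into a Triage; anything malformed becomes NEUTRAL.
function parseTriage(text: string): Triage {
  try {
    const raw = JSON.parse(text) as Record<string, unknown>;
    const { importance, category, draftWorthy } = raw;
    const known = CATEGORIES.find((c) => c === category);
    if (typeof importance !== "number" || !Number.isFinite(importance)) return NEUTRAL;
    if (known === undefined) return NEUTRAL;
    return {
      importance: Math.min(1, Math.max(0, importance)), // clamp into 0..1
      category: known,
      draftWorthy: draftWorthy === true,
    };
  } catch {
    return NEUTRAL;
  }
}

async function triageEmail(ai: AiLike, emailBody: string): Promise<Triage> {
  try {
    const out = await ai.run("@cf/meta/llama-3.1-8b-instruct", {
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: emailBody },
      ],
    });
    return parseTriage(out.response ?? "");
  } catch {
    return NEUTRAL; // a failed model call should not block ingest
  }
}
```

The output is three structured fields rather than prose, so the result can be validated and clamped before it touches the inbox UI.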

This pattern shows up in both Agentic Inbox and IRCC Monitor for the same three reasons:

Reason 1: cost asymmetry

A triage call is small. Score this 200-word email for “is it transactional, is it a question, is it a sales pitch?” → maybe 250 input tokens, 50 output tokens. A generation call is large. “Draft a reply to this 200-word email” → 250 input + system prompt (200) + 150 output = ~600 tokens.

If you triage every email at ingest and generate only for the 30% that triage flags as “draft-worthy,” you spend 1.0 × 300 + 0.3 × 600 = 480 tokens per email, against 1.0 × 600 = 600 tokens per email if every email gets a generated reply. At free-tier-equivalent volume on Workers AI, that 20% saving can be the difference between staying inside budget and getting throttled.
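The arithmetic can be written as a small sketch. The token counts are the rough estimates from the paragraphs above, and the function names are made up for illustration.

```typescript
// Expected tokens per item when every item is triaged and only a fraction
// `draftRate` of items goes on to the generation call.
function triageFirstCost(
  triageTokens: number,
  generationTokens: number,
  draftRate: number,
): number {
  return triageTokens + draftRate * generationTokens;
}

// Draft rate below which triage-first is cheaper than generating for every item.
// Derived from: triageTokens + rate * generationTokens < generationTokens.
function breakEvenDraftRate(triageTokens: number, generationTokens: number): number {
  return 1 - triageTokens / generationTokens;
}

const TRIAGE_TOKENS = 250 + 50;            // input + output
const GENERATION_TOKENS = 250 + 200 + 150; // input + system prompt + output

const perEmail = triageFirstCost(TRIAGE_TOKENS, GENERATION_TOKENS, 0.3);
const breakEven = breakEvenDraftRate(TRIAGE_TOKENS, GENERATION_TOKENS);
console.log(perEmail, breakEven);
```

With these estimates the result is 480 tokens per email and a break-even draft rate of 0.5: triage-first stays cheaper than generating for every email as long as fewer than half of the emails are flagged as draft-worthy.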

(The math is more dramatic on the IRCC monitor side, where ~85% of polled items are uninteresting and get filtered out by triage. The generator-at-the-end design would have spent 6x what the triage-at-the-start design did, for the same final output.)

Reason 2: latency asymmetry

A user clicking “Draft reply” expects a response in under one second. A user opening their inbox never sees the triage call; they just see a correctly sorted list. The triage call can be slow (multi-second) without anyone noticing; the generation call can’t.

If you move the generation to ingest time (auto-draft for important emails), you’ve turned an interactive latency problem into a background-job latency problem. The user opens an important email and the draft is already there. There’s no spinning indicator on “Generating…” — there’s just text.

This was specifically a design choice in Agentic Inbox. The EmailAgent Durable Object reads inbound mail, scores it, and for high-score messages, generates a draft reply BEFORE the user has opened the email. By the time the user opens the inbox at 9am, ~5-10 drafts are already waiting. The drafts ARE editable; the agent doesn’t auto-send.
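A rough sketch of that ingest flow, with the model and storage calls injected so it does not depend on any particular API. This is not the EmailAgent code: `IngestDeps`, `DRAFT_THRESHOLD`, and `sortInbox` are hypothetical names, and the 0.7 cutoff is an arbitrary placeholder.

```typescript
interface InboundEmail {
  id: string;
  body: string;
}

interface StoredEmail extends InboundEmail {
  importance: number;
  draft?: string; // present only when a draft was pre-computed at ingest
}

// Injected dependencies (hypothetical): one cheap call, one expensive call, one store.
interface IngestDeps {
  score(body: string): Promise<number>;      // triage call
  draftReply(body: string): Promise<string>; // generation call
  save(record: StoredEmail): Promise<void>;
}

// Placeholder cutoff; the real threshold is not stated in the text.
const DRAFT_THRESHOLD = 0.7;

async function ingest(email: InboundEmail, deps: IngestDeps): Promise<StoredEmail> {
  const importance = await deps.score(email.body);
  const record: StoredEmail = { ...email, importance };

  if (importance >= DRAFT_THRESHOLD) {
    try {
      // Runs in the background at ingest, so a multi-second call is acceptable.
      record.draft = await deps.draftReply(email.body);
    } catch {
      // No draft is fine: the user can still write the reply by hand.
    }
  }

  await deps.save(record);
  return record; // nothing is sent here; drafts stay editable suggestions
}

// Inbox view: highest importance first.
function sortInbox(records: StoredEmail[]): StoredEmail[] {
  return [...records].sort((a, b) => b.importance - a.importance);
}
```

The generation call only runs for messages above the cutoff, and it runs before anyone is waiting on it.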

Reason 3: quality asymmetry

A triage classifier (“is this email transactional?”) has a very forgiving error budget. If it gets 5% wrong, the inbox sorting is slightly imperfect, which is not a quality crisis. A generator that gets 5% of its replies wrong is a much worse user experience because the failure is highly visible (a sent reply with a hallucinated detail).

This matters more than it sounds. The small Workers AI models (@cf/meta/llama-3.1-8b-instruct, @cf/moonshotai/kimi-k2.5, etc.) are GREAT at classification and so-so at generation. They are basically built for the triage role. They struggle at the generation role unless you spend significant context budget on prompts and examples.

So the routing rule writes itself: use small Workers AI models for the things they’re good at (classify, score, filter, route) and reserve the generation role for either (a) a larger model invoked rarely or (b) the user themselves with a pre-computed draft as a starting point.

What “triage” actually looks like

A few concrete triage tasks that pay rent in the systems I’ve shipped:

Importance score — float 0..1 estimating how much the user cares about this item right now. Drives sort order.
Category — discrete label from a small allowed set (“transactional / personal / sales / newsletter / receipt”). Drives folder routing.
Draft-worthy flag — boolean: should we pre-compute a draft reply? Avoids wasting generation budget on emails that don’t deserve a reply at all.
Sentiment — useful when the inbox UI wants to surface “this customer sounds upset” without the user having to read every reply.
PII presence — flag items containing personal data so they get extra logging precautions or are excluded from certain displays.
Topic embedding — encode each item into a 384-dim vector for downstream “find similar” features. This isn’t classification, but it lives at the same point in the pipeline.

What ties them together: they’re all filters or labels, not generators. They produce structured fields, not free-form text. They’re easy to evaluate (you can ground-truth them), easy to cache (one call per item, ever), and easy to fail gracefully (if scoring fails, default to neutral).
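Those three properties can be sketched in a few lines: cache by item id, fall back to neutral labels on failure, and score predictions against hand labels. The names here (`Labels`, `LabelStore`, `labelOnce`, `categoryAccuracy`) are hypothetical, and an in-memory `Map` stands in for whatever durable store a real system would use.

```typescript
interface Labels {
  importance: number;
  category: string;
  draftWorthy: boolean;
}

// Hypothetical neutral defaults for when classification fails.
const NEUTRAL_LABELS: Labels = {
  importance: 0.5,
  category: "uncategorized",
  draftWorthy: false,
};

// Minimal key-value surface; a Map<string, Labels> satisfies it.
interface LabelStore {
  get(id: string): Labels | undefined;
  set(id: string, labels: Labels): void;
}

// Classify an item at most once; later lookups are served from the store.
async function labelOnce(
  id: string,
  store: LabelStore,
  classify: () => Promise<Labels>,
): Promise<Labels> {
  const cached = store.get(id);
  if (cached !== undefined) return cached;
  try {
    const labels = await classify();
    store.set(id, labels);
    return labels;
  } catch {
    // Fail gracefully: neutral labels, not cached, so a later pass can retry.
    return NEUTRAL_LABELS;
  }
}

// Ground-truth check: fraction of items whose predicted category matches a hand label.
function categoryAccuracy(predicted: Labels[], truth: string[]): number {
  if (predicted.length === 0) return 0;
  const hits = predicted.filter((p, i) => p.category === truth[i]).length;
  return hits / predicted.length;
}
```

Not caching the neutral fallback is a design choice in this sketch: a transient failure then costs one imperfectly sorted item rather than a permanently wrong label.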

A pattern, not a rule

This is “where most of my Workers AI calls live now,” not “never use Workers AI for generation.” There are legitimate generation use cases: when a user explicitly asks for a draft, when the system needs to summarize a thread, when the agent SDK exposes a tool that calls Workers AI directly.

But the default placement — the FIRST thing you reach for when you put a Workers AI call into a pipeline — should be triage at ingest. The cost / latency / quality math points the same direction in basically every production setting I’ve seen.