— John Young

May 11, 2026 · 5 min read

1. Main thesis (~250 words)

The real constraint on AI agent task size isn’t lines of code. It’s context.

Most people size tasks by gut feel — “that seems about right” — and then wonder why the agent spirals into corrections halfway through. Lines changed and files touched are secondary proxies. The actual limiter is how much context the agent burns reading, exploring, and running commands before it can start writing the diff.

This matters because model performance degrades as context fills. Chroma’s context-rot study tested 18 LLMs and found that models use their context less reliably as input length grows. Anthropic’s own engineering team says the same thing: bigger windows don’t fix it, they just push the degradation further out.

So the sizing question is: can this task finish before the agent’s reasoning quality degrades?

That reframes everything.

You stop asking “how many lines of code?” and start asking “how many files does the agent have to read to make a safe change?” You stop asking “is this small?” and start asking “is this one coherent thing with a single verifiable boundary?” Layer boundaries become natural seams. The one-sentence diff test becomes a real sizing tool, not a cute heuristic.

The well-sized task is the one where the agent can finish before the context window starts forgetting what you asked for.

That’s it. That’s the whole game.

Question: What’s the last task you handed to an agent where you knew, in hindsight, the real cost was context burn — not complexity?

2. Framework carousel (5–8 slides)

Slide 1 — Hook: Most people size agent tasks wrong. They count lines of code. The Sizing Decision Flowchart asks five questions instead.

Slide 2 — One-sentence diff test: Can you describe the expected diff in one sentence? If no, the task needs decomposition. “Add a nullable phone_number column with up/down migration” is one sentence. “Add phone number support across the stack” is not.

Slide 3 — Independent verifiability: Can the result be verified without the next task being done first? If no, you’ve either scoped too small (a fragment) or too entangled (restructure it). Each well-sized task is a checkpoint where you can say “this works” or “this doesn’t.”

Slide 4 — Layer boundaries: Does the task cross multiple architectural layers? Migration plus endpoint plus UI plus docs isn’t one task — it’s four. Each layer is a natural seam because each layer can be verified independently.

Slide 5 — Context burn: Does the agent need to read more than ~10 files to understand enough to make a safe change? If yes, either the task is too broad, or you haven’t pre-loaded enough context in the spec to shrink the exploration phase.

Slide 6 — Don’t over-decompose: Could you do it faster manually than writing the spec? If yes, bundle it with the next logical step. Max effort on a trivial task wastes context. Match the investment to the task.

Slide 7 — CTA: Read the full post — link in comments.
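The five questions on slides 2 to 6 can be sketched as a small decision function. This is a minimal sketch, not the flowchart itself: the `TaskProfile` fields, the verdict strings, and the assumption that the questions are asked in slide order are all illustrative. Only the ~10-file threshold comes from the slides.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Answers to the five sizing questions for one candidate task (hypothetical shape)."""
    one_sentence_diff: bool         # can the expected diff be described in one sentence?
    independently_verifiable: bool  # can it be checked before the next task exists?
    layers_touched: int             # architectural layers crossed (migration, endpoint, UI, docs)
    files_to_read: int              # files the agent must read before a safe change
    faster_by_hand: bool            # would doing it manually beat writing the spec?


MAX_FILES_TO_READ = 10  # the "~10 files" threshold from the context-burn slide


def size_task(task: TaskProfile) -> str:
    """Return a sizing verdict by asking the five questions in slide order (an assumption)."""
    if not task.one_sentence_diff:
        return "decompose: diff does not fit in one sentence"
    if not task.independently_verifiable:
        return "restructure: result cannot be verified on its own"
    if task.layers_touched > 1:
        return "split: one task per architectural layer"
    if task.files_to_read > MAX_FILES_TO_READ:
        return "narrow or pre-load context: too many files to read"
    if task.faster_by_hand:
        return "bundle: merge with the next logical step"
    return "well-sized"


if __name__ == "__main__":
    migration_only = TaskProfile(True, True, 1, 5, False)
    full_stack = TaskProfile(False, False, 4, 20, False)
    print(size_task(migration_only))  # well-sized
    print(size_task(full_stack))      # decompose: diff does not fit in one sentence
```

The function returns at the first failing question, so a task that fails several checks reports only the earliest one. Fix that, then ask again.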

Question: Which of these five questions catches the most badly-sized tasks on your team?

3. Contrarian take (~100 words)

Count files read, not files changed.

Files changed is the metric every engineering culture already tracks: PR size, diff lines, “small change.” For AI agents, it’s the wrong number. The bigger cost is how many files the agent has to read before it understands enough to make those changes safely. Twenty file reads ahead of the first line of output means a large share of the context is spent before any real work begins.
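A rough way to put a number on that cost is to estimate tokens from file sizes. This is a back-of-envelope sketch under loud assumptions: about 4 characters per token is a common rule of thumb rather than a real tokenizer, the file sizes and window size are hypothetical, and the estimate ignores command output, tool results, and the system prompt.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb, not a real tokenizer


def estimate_read_tokens(file_sizes_in_chars: list[int]) -> int:
    """Approximate tokens consumed by reading each file once, in full."""
    return sum(size // CHARS_PER_TOKEN for size in file_sizes_in_chars)


def context_spent_fraction(file_sizes_in_chars: list[int], window_tokens: int) -> float:
    """Fraction of the context window used before the first line of output."""
    return estimate_read_tokens(file_sizes_in_chars) / window_tokens


if __name__ == "__main__":
    # Hypothetical numbers: files of ~8,000 characters, a 200,000-token window.
    twenty_reads = [8_000] * 20
    five_reads = [8_000] * 5
    print(context_spent_fraction(twenty_reads, 200_000))  # 0.2
    print(context_spent_fraction(five_reads, 200_000))    # 0.05
```

Under these made-up numbers, twenty reads spend a fifth of the window on exploration alone, while five reads spend a twentieth. The absolute figures matter less than the ratio.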

This is also where a well-written task spec pays for itself. Relevant files, reference implementations, architectural notes — they’re not nice-to-have. They’re the difference between a fresh context and a half-spent one.

Question: When was the last time you counted files read on an agent task? Or do you only count files changed?

4. War story

The companion post on task anatomy ends with a worked example: add an optional phone number to user registration. Accepted on signup, persisted on the user record, returned by the API.

Tempting to hand to an agent as one task. Don’t.

It decomposes into four tasks, one per architectural layer. Migration adds the nullable column with reversible up/down SQL. Model regenerates the sqlc code against the updated schema. Service adds ValidatePhone using validate.PhoneE164, with table-driven tests. Handler wires the field through POST and GET, with integration tests.

Each milestone is independently verifiable. The migration runs in isolation. The service tests pass without the handler. The handler tests pass without touching the migration again. If step 3 fails, you don’t lose steps 1 and 2.
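The checkpoint behavior can be sketched as a loop that verifies each milestone in order and stops at the first failure. This is an illustration only: the `Milestone` type and `run_milestones` are hypothetical names, and each `verify` callable stands in for running that layer's real tests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Milestone:
    name: str
    verify: Callable[[], bool]  # stand-in for "run this layer's tests"


def run_milestones(milestones: list[Milestone]) -> list[str]:
    """Verify milestones in order; stop at the first failure and keep the earlier ones."""
    completed: list[str] = []
    for milestone in milestones:
        if not milestone.verify():
            break  # earlier milestones were already verified and stay valid
        completed.append(milestone.name)  # in practice: commit here
    return completed


if __name__ == "__main__":
    # Hypothetical run in which the service layer's tests fail.
    plan = [
        Milestone("migration", lambda: True),
        Milestone("model", lambda: True),
        Milestone("service", lambda: False),
        Milestone("handler", lambda: True),
    ]
    print(run_milestones(plan))  # ['migration', 'model']
```

The failed step costs one task's worth of work. Everything verified before it is still on the list.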

The diff fits in one sentence per task. The agent reads ~5 files per task, not 20. Each task lands well under the 200 LOC ceiling. Every gate of the sizing flowchart passes.

One feature, four well-sized tasks, four clean commits.

Question: When you’ve split a feature this way, did you find the layer boundary was the seam — or did you fight it?

5. Tactical tip

If you can’t describe the expected diff in one sentence, the task is too big and needs decomposition.

That’s the test. Try it before you start the session. “Add a nullable phone_number column to the users table with an up and down migration” is one sentence, and well-sized. “Add phone number support across the full stack” names a goal, not a diff: describing the actual changes takes several sentences, which means several tasks.

The Claude Code docs use this as a plan-or-execute litmus: skip the plan if the diff fits in one sentence. Flipped, it’s a sizing tool. If you need three clauses to describe what’s changing, you’re handing the agent a task that will spiral.

Question: What’s the last task you started where the one-sentence test would have caught the spiral before it began?