Research Index

The evidence base behind everything I publish on running AI coding agents in production. Every row is a verbatim figure or quote from a primary source — a study, a benchmark, an engineering post — not a paraphrase and not a bare link. Each one was pulled while researching a post, then checked against the live page; sources that had drifted or gone dead were dropped or flagged. Grouped by the six themes I write on, and it grows with every new post.

For the argument this evidence adds up to, read The State of AI Coding Agent Engineering — the synthesis across all six themes, free by email.

101 verified sources · 6 themes · updated 2026-07-01

1 · Task Design & Decomposition

Scoping, decomposing, and speccing work so an agent finishes it on the first try.

80% of tool calls come from agents that appear to have at least one kind of safeguard (like restricted permissions or human approval requirements), 73% appear to have a human in the loop in some way, and only 0.8% of actions appear to be irreversible

Irreversible agent actions are rare in real traffic, so oversight should concentrate on the small slice where a single error is costly.

such as sending an email to a customer
And while these higher-risk actions are rare as a share of overall traffic, the consequences of a single error can still be significant.

Anthropic: Measuring AI agent autonomy in practice Feb 18, 2026 data

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024

The primary paper behind the autonomy trend confirms a ~7-month doubling of the 50%-task-completion time horizon since 2019, driven mainly by greater reliability and error-adaptation — the mechanism that inflates calls per task.

"Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes".
"within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month".
"The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes"
50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate

Kwa, West, Becker, et al. (METR): Measuring AI Ability to Complete Long Software Tasks submitted 2025-03-18 data

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost, What AI Coding Agents Are Actually Good For (And When to Skip)
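
The doubling trend quoted in this entry reduces to simple exponential arithmetic. The sketch below is illustrative only: it assumes a constant 7-month doubling period (the paper itself notes the trend may have accelerated in 2024) and starts from the roughly 50-minute horizon quoted for Claude 3.7 Sonnet. The function names are hypothetical and not taken from the paper.

```python
import math


def projected_horizon_minutes(base_minutes: float, months_elapsed: float,
                              doubling_months: float = 7.0) -> float:
    """Extrapolate a 50%-success time horizon, assuming constant doubling."""
    return base_minutes * 2 ** (months_elapsed / doubling_months)


def months_until(base_minutes: float, target_minutes: float,
                 doubling_months: float = 7.0) -> float:
    """Months needed to grow from base to target under the same assumption."""
    return doubling_months * math.log2(target_minutes / base_minutes)


# Two doubling periods (14 months) quadruple a 50-minute horizon.
print(projected_horizon_minutes(50, 14))  # 200.0
# Reaching 400 minutes takes three doublings, i.e. 21 months.
print(months_until(50, 400))              # 21.0
```

Any projection from this is only as good as the constant-rate assumption, which is why the quoted forecasts are stated as conditionals.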

The length of tasks (measured by how long they take human professionals) that generalist frontier model agents can complete autonomously with 50% reliability has been doubling approximately every 7 months for the last 6 years.

The autonomous task length frontier agents can complete has doubled roughly every 7 months for 6 years, so autonomous runs — and the per-task call count behind them — keep growing.

current models have almost 100% success rate on tasks taking humans less than 4 minutes, but succeed <10% of the time on tasks taking more than around 4 hours
"If the measured trend from the past 6 years continues for 2-4 more years, generalist autonomous agents will be capable of performing a wide range of week-long tasks."
the best current models—such as Claude 3.7 Sonnet—are capable of some tasks that take even expert humans hours, but can only reliably complete tasks of up to a few minutes long
AI agents often seem to struggle with stringing together longer sequences of actions

METR: Measuring AI Ability to Complete Long Tasks March 19, 2025 data

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost, What AI Coding Agents Are Actually Good For (And When to Skip)

When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts.

Experienced developers were measurably slower with AI in codebases they know well, contradicting their own forecasts of a speedup.

16 experienced developers from large open-source repositories (averaging 22k+ stars and 1M+ lines of code)
developers expected AI to speed them up by 24%
they still believed AI had sped them up by 20%
developers estimated that they were sped up by 20% on average when using AI—so they were mistaken

METR (Becker, Rush, Barnes, Rein): Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity July 10, 2025 data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision, What AI Coding Agents Are Actually Good For (And When to Skip)

only 48% of developers consistently check AI-assisted code before committing it, even though 38% find that reviewing AI-generated logic actually requires more effort than reviewing human-written code.

Most teams under-review AI code even though reviewing it costs more effort, so the last-mile verification tax is real and often unpaid.

AI gets you 80% to an MVP; the last 20% requires patience, learning deeply or hiring engineers.

Addy Osmani: The 80% Problem in Agentic Coding Jan 28, 2026 practitioner partial

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

AI writes faster. Humans still have to prove it works.

AI speeds up writing but shifts the constraint to verification; a human still owns proving the code works.

If your pull request doesn't contain evidence that it works, you're not shipping faster
45% of AI-generated code contains security flaws
Logic errors appear at 1.75× the rate of human-written code

Addy Osmani: Code Review in the Age of AI January 5, 2026 practitioner partial

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

A user asked to "clean up old branches." The agent listed remote branches, constructed a pattern match, and issued a delete. This would be blocked since the request was vague, the action irreversible and destructive, and the user may have only meant to delete local branches.

Vague-plus-irreversible-plus-destructive is the dangerous combination to gate; a concrete incident shows why you don't delegate blast-radius actions blind.

Claude Code users approve 93% of permission prompts.
If a session accumulates 3 consecutive denials or 20 total, we stop the model and escalate to the human.
Destroy or exfiltrate. Cause irreversible loss by force-pushing over history, mass-deleting cloud storage, or sending internal data externally.
Instead, a false positive costs a single retry where the agent gets a nudge, reconsiders, and usually finds an alternative path.

Anthropic: How we built Claude Code auto mode: a safer way to skip permissions Mar 25, 2026 practitioner

Cited in What AI Coding Agents Are Actually Good For (And When to Skip), Tier Your AI Agent's Production Authority by Task Risk
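
The escalation rule quoted in this entry (3 consecutive denials or 20 total) can be sketched as a small counter. This is a minimal illustration, not Anthropic's implementation: the class name is hypothetical, and resetting the consecutive count on an approved action is an assumption about what "consecutive" means here.

```python
from dataclasses import dataclass


@dataclass
class DenialTracker:
    """Tracks denied actions in one session and signals when to escalate."""
    max_consecutive: int = 3   # threshold quoted in the source
    max_total: int = 20        # threshold quoted in the source
    consecutive: int = 0
    total: int = 0

    def record(self, denied: bool) -> bool:
        """Record one decision; return True when a human should take over."""
        if denied:
            self.consecutive += 1
            self.total += 1
        else:
            # Assumption: an approved action breaks the consecutive streak.
            self.consecutive = 0
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)


tracker = DenialTracker()
decisions = [True, True, False, True, True, True]  # True means "denied"
print([tracker.record(d) for d in decisions])
# [False, False, False, False, False, True]
```

Keeping both thresholds matters: the consecutive limit catches an agent stuck on one blocked path, while the total limit catches slow accumulation across a long session.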

When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all.

The default should be the simplest solution; reaching for an agent is a decision to justify, not an assumption.

Code solutions are verifiable through automated tests; Agents can iterate on solutions using test results as feedback
The autonomous nature of agents means higher costs, and the potential for compounding errors.
Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense.

Anthropic (Erik Schluntz and Barry Zhang): Building effective agents Dec 19, 2024 practitioner

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

For difficult tasks, I'll often reject five or six (or more!) agent attempts before accepting one as good enough to work with, or giving up and making the change by hand.

Getting value from agents on hard tasks means aggressively rejecting weak attempts and keeping judgment work human.

able to correctly diagnose 80% of issues on its own
The current core AI skill is shifting as much work onto AI agents as possible, without going too far.
I still don't use LLMs to write Slack messages, ADRs, issues and so forth.

Sean Goedecke: How I use LLMs as a staff engineer in 2026 May 17, 2026 practitioner

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

LLMs excel at writing code that works that doesn't have to be maintained.

Agents are best on throwaway and research code, not the maintained business logic and judgment writing you own long-term.

I would say that my use of LLMs here meant I got this done 2x-4x faster
It's rare that I let Copilot produce business logic for me
I **never** allow the LLM to write these for me

Sean Goedecke: How I use LLMs as a staff engineer February 4, 2025 practitioner

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

If your project has a robust, comprehensive and stable test suite agentic coding tools can _fly_ with it.

A strong automated test suite is the single biggest enabler of agent productivity on a codebase.

what should we call the other end of the spectrum, where seasoned professionals accelerate their work with LLMs while staying proudly and confidently accountable for the software they produce?
Automated testing / Planning in advance / Comprehensive documentation / Good version control habits / Effective automation / Culture of code review / Manual QA / Research skills / Ship to preview environment

Simon Willison: Vibe engineering 7th October 2025 practitioner

Cited in What AI Coding Agents Are Actually Good For (And When to Skip)

2 · Context Engineering

What goes in the window and when — CLAUDE.md, just-in-time retrieval, the instruction ceiling.

The most common failures are wrong tool selection and incorrect parameters, especially when tools have similar names like `notification-send-user` vs. `notification-send-channel`.

At scale, loading all tool definitions upfront is the failure driver; deferred tool loading cuts token cost and measurably raises tool-selection accuracy.

At Anthropic, we've seen tool definitions consume 134K tokens before optimization.
Opus 4.5 improved from 79.5% to 88.1%
This represents an 85% reduction in token usage while maintaining access to your full tool library.

Anthropic: Introducing advanced tool use on the Claude Developer Platform November 24, 2025 data

Cited in Where Just-in-Time Context Retrieval Silently Breaks

Even the best frontier models only achieve 68% accuracy at the max density of 500 instructions.

Instruction-following accuracy degrades sharply with density — the best frontier models hit only 68% at 500 instructions — so packing rules in measurably erodes compliance.

At 500 instructions, llama-4-scout exhibits an extreme O:M ratio of 34.88, indicating omission errors are over 30 times more frequent
Threshold decay: "Performance remains stable until a threshold, then transitions to a different (steeper) degradation slope" — exhibited by gemini-2.5-pro, o3
Primacy effects display an interesting pattern across all models: they start low at minimal instruction densities indicating almost no bias for earlier instructions, peak around 150–200 instructions

Daniel Jaroslawicz, Brendan Whiting, Parth Shah, Karime Maamari (Distyl AI): How Many Instructions Can LLMs Follow at Once? 2025 (arXiv 2507.11538v1) data

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

Procedural reliability, particularly tool initialization failures, constitutes the primary bottleneck for smaller models.

For smaller models, tool-invocation reliability (especially tool initialization) is the primary failure bottleneck, localizable via a 12-category taxonomy.

1,980 deterministic test instances
12-category error taxonomy capturing failure modes across tool initialization, parameter handling, execution, and result interpretation
Mid-sized models (qwen2.5:14b) offer practical accuracy-efficiency trade-offs on commodity hardware (96.6% success rate, 7.3 s latency)

Huang et al. (arXiv): When Agents Fail to Act: A Diagnostic Framework for Tool Invocation Reliability in Multi-Agent LLM Systems Submitted 22 January 2026 (accepted at ICAIBD 2026) data partial

Cited in Where Just-in-Time Context Retrieval Silently Breaks

models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows

Across 18 LLMs, models do not use context uniformly — reliability drops as input length grows — so a longer CLAUDE.md is not a neutral cost.

model performance degrades as input length increases, often in surprising and non-uniform ways
Even a single distractor reduces performance relative to the baseline (needle only).
Lower similarity needle-question pairs increases the rate of performance degradation
18 LLMs evaluated — Anthropic (Claude Opus 4, Sonnet 4, Sonnet 3.7, Sonnet 3.5, Haiku 3.5), OpenAI (o3, GPT-4.1 + mini/nano, GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo), Google (Gemini 2.5 Pro/Flash, 2.0 Flash), Alibaba (Qwen3-235B-A22B, Qwen3-32B, Qwen3-8B)

Kelly Hong, Anton Troynikov, Jeff Huber (Chroma): Context Rot: How Increasing Input Tokens Impacts LLM Performance July 14, 2025 data

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README, Where Just-in-Time Context Retrieval Silently Breaks, When One Agent Stops Being Enough: The Isolation Gate

While developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant

Empirically, teams pack context files with functional setup but almost no security or performance guardrails — the constraint side of CLAUDE.md is systematically under-specified.

2,303 agent context files across 1,925 repositories
Build and run commands: 62.3%, Implementation details: 69.9%, Architecture: 67.7%; Security: 14.5%, Performance: 14.5%
These files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code

Worawalan Chatlatanagulchai et al.: Agent READMEs: An Empirical Study of Context Files for Agentic Coding 17 Nov 2025 (submitted) data

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

We attribute this improvement to the legibility of failed logical search. Repeated failures under explicit lexical constraints provide a clearer signal that required evidence may be absent, whereas Agentic Hybrid may still return semantically related but unsupported passages.

Logical/lexical retrieval can signal 'nothing found' where embedding search cannot, which measurably reduces hallucination on answer-unavailable questions.

On average, its refusal rate increased from 0.767 to 0.828, while the hallucination rate decreased from 0.128 to 0.083.
anchoring the retrieval process in logical queries substantially reduces hallucinations in generated responses.
matches a strong agentic hybrid baseline, while substantially reducing construction and serving cost

Zeng et al. (arXiv): Rethinking Agentic RAG: Toward LLM-Driven Logical Retrieval Beyond Embeddings Submitted 26 May 2026 data

Cited in Where Just-in-Time Context Retrieval Silently Breaks

when the context window fills up and gets compacted, your CLAUDE.md values get summarized away with everything else

CLAUDE.md instructions decay mid-session — they get summarized away at compaction — so hook-based reinforcement is more reliable for must-follow standards.

hook output requires approximately 15 tokens per prompt reminder
Over 50-turn session, motto reminders total ~750 tokens against 200k context window
hook output arrives as clean system-reminder messages — no disclaimer, no 'may or may not be relevant' framing

Albert Nahas: Your CLAUDE.md Instructions Are Being Ignored - Here's Why (and How to Fix It) Feb 17 (year not stated on page; brief lists 2026) practitioner

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README
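
The overhead figures quoted in this entry are easy to check. The sketch below counts each reminder once, as the quoted figure does; the function name is hypothetical.

```python
def reminder_overhead(tokens_per_reminder: int, turns: int,
                      context_window: int) -> tuple:
    """Return (total reminder tokens, share of the context window)."""
    total_tokens = tokens_per_reminder * turns
    return total_tokens, total_tokens / context_window


total, share = reminder_overhead(15, 50, 200_000)
print(total, f"{share:.3%}")  # 750 0.375%
```

At well under one percent of a 200k window, the reminder cost is small next to the risk of a standard being summarized away.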

Of course, there's a trade-off: runtime exploration is slower than retrieving pre-computed data. Not only that, but opinionated and thoughtful engineering is required to ensure that an LLM has the right tools and heuristics for effectively navigating its information landscape.

Just-in-time context retrieval is not free: it trades latency for freshness and demands deliberate tool and heuristic design to work.

agents built with the 'just in time' approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools.
In certain settings, the most effective agents might employ a hybrid strategy, retrieving some data up front for speed, and pursuing further autonomous exploration at its discretion.
Context, therefore, must be treated as a finite resource with diminishing marginal returns.

Anthropic: Effective context engineering for AI agents September 29, 2025 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks

If Claude keeps doing something you don't want despite having a rule against it, the file is probably too long and the rule is getting lost. If Claude asks you questions that are answered in CLAUDE.md, the phrasing might be ambiguous. Treat CLAUDE.md like code: review it when things go wrong, prune it regularly, and test changes by observing whether Claude's behavior actually shifts.

Anthropic's own guidance says to maintain CLAUDE.md like code — prune it, and test rule changes by observing whether Claude's behavior actually shifts.

Claude stops when the work looks done. Without a check it can run, 'looks done' is the only signal available, and you become the verification loop: every mistake waits for you to notice it.
Keep it concise. For each line, ask: 'Would removing this cause Claude to make mistakes?' If not, cut it. Bloated CLAUDE.md files cause Claude to ignore your actual instructions!
Ruthlessly prune. If Claude already does something correctly without the instruction, delete it or convert it to a hook.
Unlike CLAUDE.md instructions which are advisory, hooks are deterministic and guarantee the action happens.

Anthropic (Claude Code Docs): Best practices for Claude Code 2026 (undated on page; brief dates it 2026) practitioner

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README, How to Verify AI Coding Agent Output: A Reviewer's Framework

CLAUDE.md content is delivered as a user message after the system prompt, not as part of the system prompt itself. Claude reads it and tries to follow it, but there's no guarantee of strict compliance, especially for vague or conflicting instructions.

CLAUDE.md is advisory context delivered as a user message, not enforced configuration — so strict compliance is not guaranteed, especially for vague or conflicting rules.

target under 200 lines per CLAUDE.md file. Longer files consume more context and reduce adherence.
Both are loaded at the start of every conversation. Claude treats them as context, not enforced configuration. To block an action regardless of what Claude decides, use a PreToolUse hook instead.
if two rules contradict each other, Claude may pick one arbitrarily.

Anthropic (Claude Code Docs): How Claude remembers your project 2026 (undated on page) practitioner

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README
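
The "under 200 lines" target quoted in this entry is simple to enforce mechanically, for example in CI. A minimal sketch, assuming the file content has already been read into a string; the function name is hypothetical, and counting blank lines toward the limit is an assumption.

```python
def exceeds_line_target(text: str, max_lines: int = 200) -> bool:
    """True when the file is at or over the line target (goal: under max_lines)."""
    return len(text.splitlines()) >= max_lines


short_file = "\n".join(f"rule {i}" for i in range(120))
long_file = "\n".join(f"rule {i}" for i in range(845))
print(exceeds_line_target(short_file), exceeds_line_target(long_file))
# False True
```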

the most important memory work isn't 'store more,' it's 'curate better': Retrieve selectively, prune aggressively, summarize carefully

Reliability comes from curating context (selective retrieval, aggressive pruning), and tool-count bloat degrades even capable models.

once its context grew beyond a certain point (on the order of 100,000 tokens in an experiment), it began to fixate on repeating its past actions
failed a task when given 46 tools to consider but succeeded when given only 19 tools

Elasticsearch Labs (Someshwaran Mohankumar): Managing agentic memory with Elasticsearch January 16, 2026 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks

Tool descriptions are not documentation. They are the LLM's primary decision surface.

Tool descriptions are the agent's decision surface and must be audited like production code; nearly all of them carry quality defects.

97.1% contain at least one quality issue
More than half (56%) have unclear purpose statements
augmented descriptions improved task success by 5.85 percentage points

Guy (AWS Heroes) / DEV Community: MCP Tool Design: Why Your AI Agent Is Failing (And How to Fix It) Posted Mar 18 (Edited Apr 8), 2026 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks

yes, compaction and smaller models help on cost per step. But my issue wasn't just inefficiency, it was agents retrying when they shouldn't. I needed visibility + limits per agent/task, and the ability to cut it off, not just optimize it.

Practitioners want per-agent/per-task limits and a hard cut-off, not just cost optimization — the wedge is attribute-and-enforce, not optimize.

"My AGENTS.md is 845 lines and it only started getting good once it got that long" (Sammi), directly contested by "sweet spot is between 60 and 120 lines. With psuedo xml tags between sections" (typpilol)
Budget alerts are not a kill switch. Credits are not protection.
Claude often ignores CLAUDE.md / The more information you have in the file the more it gets ignored
"cost control is a policy problem - we certainly don't need to use opus 4.6 for a simple test refactor... we need a way to measure cost / performance for agents on individual repos, with individual types of tasks..." (author bisonbear, id 47563774)

Hacker News (bhaviav100, OP): Ask HN: How are you keeping AI coding agents from burning money? 2026-03-29 (created_at 2026-03-29T00:22:20Z) practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost, CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

You cannot craft the perfect CLAUDE.md file immediately. Instead, treat it as a living document.

CLAUDE.md is a living document refined over time, not a one-shot artifact.

Claude Code agents have a context window, and the CLAUDE.md file gets added to the agent's context. Any unnecessary instructions and wordy sentences will consume more of that context.
Always review the CLAUDE.md file and correct any assumptions or missing details related to project architecture.

Ivan Kahl / Dometrain: Creating the Perfect CLAUDE.md for Claude Code January 15, 2026 practitioner partial

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

Frontier thinking LLMs can follow ~ 150-200 instructions with reasonable consistency.

There is a practical instruction ceiling — even frontier models only follow roughly 150-200 instructions consistently — so every line in CLAUDE.md competes for a finite budget.

Claude Code's system prompt contains ~50 individual instructions
Smaller models get MUCH worse, MUCH more quickly
LLMs bias towards instructions that are on the peripheries of the prompt

Kyle (HumanLayer): Writing a good CLAUDE.md November 25, 2025 practitioner

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

This isn't a hallucination. The retrieval worked perfectly. It just retrieved garbage.

Bad retrieval is a distinct silent failure mode from hallucination and has no built-in flag, so leaders must add one.

Silent retrieval failure. There's no mechanism to flag 'this retrieval returned low-confidence or low-credibility results.'
For 28 minutes, 55% of API requests to the platform failed
An agent with 85% accuracy per step only completes a 10-step workflow successfully 20% of the time

Medium (Paolo Perrone / Data Science Collective): Why AI Agents Keep Failing in Production April 12, 2026 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks
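
The 85%-per-step figure quoted in this entry follows from compounding. The sketch below assumes every step succeeds or fails independently with the same probability, which is a simplification; the function names are hypothetical.

```python
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


def required_step_accuracy(target_success: float, steps: int) -> float:
    """Per-step accuracy needed to hit a target end-to-end success rate."""
    return target_success ** (1 / steps)


print(round(workflow_success_rate(0.85, 10), 3))   # 0.197
print(round(required_step_accuracy(0.90, 10), 4))  # 0.9895
```

Read in reverse, the same arithmetic says a 10-step workflow needs roughly 99% per-step accuracy to finish successfully nine times out of ten.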

context engineering is the delicate art and science of filling the context window with just the right information for the next step...task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history

Context engineering, not prompt engineering, is the real discipline: filling the window with the right information environment for the next step.

the art of providing all the context for the task to be plausibly solvable by the LLM

Simon Willison's Weblog: Context engineering 27th June 2025 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks

The agent optimises locally. At each step, it asks, 'Do I have enough?' and when the answer is uncertain, it defaults to 'get more'. Without hard stopping rules, the default spirals.

Without a hard stop rule, an agent's local 'get more' default turns retrieval into an unbounded budget fire; capping cycles and abstaining is the control.

Three: cap retrieval cycles. After three failed passes, return a best-effort answer with a confidence disclaimer.
agents making 200 LLM calls in 10 minutes, burning $50–$200 before anyone noticed
costs spike 1,700% during a provider outage as retry logic spiralled out of control

Towards Data Science (Mostafa Ibrahim): Agentic RAG Failure Modes: Retrieval Thrash, Tool Storms, and Context Bloat (and How to Spot Them Early) March 20, 2026 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks
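
The stopping rule quoted in this entry (cap retrieval at three passes, then return a best-effort answer with a disclaimer) can be sketched as a bounded loop. This is an illustration under stated assumptions, not the article's code: `retrieve` and `is_sufficient` are hypothetical callables supplied by the caller, and the returned dictionary shape is invented for the example.

```python
from typing import Callable, Dict, List


def answer_with_retrieval_cap(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    is_sufficient: Callable[[List[str]], bool],
    max_cycles: int = 3,
) -> Dict[str, object]:
    """Run at most max_cycles retrieval passes, then stop and flag low confidence."""
    evidence: List[str] = []
    for cycle in range(1, max_cycles + 1):
        evidence.extend(retrieve(question, cycle))
        if is_sufficient(evidence):
            return {"evidence": evidence, "cycles": cycle, "low_confidence": False}
    # Cap reached: hand back what was found, marked as best effort.
    return {"evidence": evidence, "cycles": max_cycles, "low_confidence": True}


# Stub retriever that never finds enough, to show the cap firing.
result = answer_with_retrieval_cap(
    "where is the retry policy defined?",
    retrieve=lambda q, cycle: [f"weak hit from pass {cycle}"],
    is_sufficient=lambda ev: False,
)
print(result["cycles"], result["low_confidence"])  # 3 True
```

The point of the hard cap is that the loop cannot spiral: the worst case is a fixed number of retrieval calls followed by an answer that is honest about its confidence.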

The honest column in the ledger: JIT buys these at the price of retrieval latency on the steps that load (usually trivial next to a model call, but nonzero), a new failure mode (an unresolvable reference must surface as an honest error, not a hallucinated payload), and a dependency on description quality — the agent loads from the catalog's one-liners, so a bad stub hides a good payload.

JIT context introduces two specific liabilities leaders must design for: unresolvable references must fail loud as honest errors, and retrieval quality is capped by the quality of catalog descriptions.

in a loop, the window is re-sent every step, so a preloaded handbook isn't one payment but thirty
a preloaded copy is a snapshot that ages as the run proceeds, while a reference resolves to the current state of the file, the ticket, the database at the moment of use
long-context research and practitioner experience agree that models degrade as windows fill with low-relevance text

TrueFoundry: JIT Context: Why the Best Agents Load Late and Load Little June 18, 2026 practitioner

Cited in Where Just-in-Time Context Retrieval Silently Breaks
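
The failure mode named in this entry, an unresolvable reference that must surface as an honest error, can be sketched as a resolver that raises instead of guessing. Everything here is hypothetical: the exception name, the catalog shape (reference string mapped to a loader function), and the example paths are assumptions for illustration only.

```python
from typing import Callable, Dict


class UnresolvableReference(LookupError):
    """Raised when a lightweight reference cannot be loaded at time of use."""


def resolve_reference(ref: str, catalog: Dict[str, Callable[[], str]]) -> str:
    """Load the current payload for ref, or fail loudly; never substitute a guess."""
    loader = catalog.get(ref)
    if loader is None:
        raise UnresolvableReference(f"no loader registered for reference: {ref!r}")
    return loader()


catalog = {"docs/handbook.md": lambda: "current handbook text"}
print(resolve_reference("docs/handbook.md", catalog))  # current handbook text
try:
    resolve_reference("docs/missing.md", catalog)
except UnresolvableReference as err:
    print(f"surfaced to the agent as an error: {err}")
```

Because the loader runs at the moment of use, the payload reflects the current state of the file rather than a snapshot taken at the start of the run.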

negative instructions can be unreliable as user prompts

Negative 'don't do that' rules are unreliable in a user message like CLAUDE.md, so positive, runnable framing is preferable — reserving DO-NOT for hard safety boundaries.

Reddit user reported Claude Code created duplicate files despite explicit 'NEVER create duplicate files' rule
Gemini models have 'hit-or-miss' performance with negative commands
They are effective at preventing unethical or harmful behavior, especially when used in system prompts

Zhu Liang: The Pink Elephant Problem: Why 'Don't Do That' Fails with LLMs August 5, 2025 practitioner

Cited in CLAUDE.md Instruction Ceiling: Maintained Config, Not a README

3 · Evals & Verification

Knowing an agent’s output is actually correct, beyond a green build.

A 2-point lead on a leaderboard might reflect a genuine capability difference, or it might reflect that one eval ran on beefier hardware, or even at a luckier time of day, or both.

Small benchmark-score gaps between models can be swamped by infrastructure noise, so self-reported numbers can't be trusted at fine margins.

gap between the most- and least-resourced setups on Terminal-Bench 2.0 was 6 percentage points (p < 0.01)
leaderboard differences below 3 percentage points deserve skepticism until the eval configuration is documented
Two agents with different resource budgets and time limits aren't taking the same test.

Anthropic Research: Quantifying infrastructure noise in agentic coding evals February 05, 2026 data

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework
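
The 3-percentage-point guidance quoted in this entry can be applied as a simple screening rule when comparing two reported scores. A minimal sketch; the function name and the `config_documented` flag are hypothetical.

```python
def gap_needs_skepticism(score_a: float, score_b: float,
                         threshold_pp: float = 3.0,
                         config_documented: bool = False) -> bool:
    """Flag leaderboard gaps under the threshold when eval config is undocumented."""
    return abs(score_a - score_b) < threshold_pp and not config_documented


print(gap_needs_skepticism(72.0, 74.0))  # True
print(gap_needs_skepticism(70.0, 76.0))  # False
```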

Their hyper-productivity is revealing a significant 'speed vs. trust' gap. Recent, deeper examinations of agent-generated code and agent-driven PRs reveal that a large percentage of agent efforts fail to meet the quality bar of being truly 'merge-ready,' often containing subtle regressions, superficial fixes, or a general lack of engineering hygiene.

Agent hyper-productivity creates a speed-vs-trust gap where a large share of agent PRs aren't merge-ready, overwhelming review capacity.

29.6% of 'plausible' fixes introduced behavioral regressions or were incorrect upon rigorous retesting
True solve rates for GPT-4 patches dropped from 12.47% to 3.97% after detailed manual audits
Over 68% of agent-generated pull requests reportedly face long delays or remain unreviewed, creating an urgent need for scalable review automation.

arXiv (Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, Dong Qiu): Agentic Software Engineering: Foundational Pillars and a Research Roadmap 2025 data

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

"descriptions claim unimplemented changes" was the most common issue (45.4%); high-MCI PRs had 51.7% lower acceptance rates (28.3% vs. 80.0%)

The most common defect in agent-authored PRs is a description claiming changes the code never made, and those PRs are accepted far less often.

23,247 agentic PRs analyzed across five agents
High-MCI PRs took 3.5 times longer to merge (55.8 vs. 16.0 hours)
406 PRs (1.7%) exhibited high PR-MCI

arXiv (Jingzhi Gong, Giovanni Pinna, Yixin Bian, Jie M. Zhang): Analyzing Message-Code Inconsistency in AI Coding Agent-Authored Pull Requests Submitted January 8, 2026; revised January 26, 2026 data

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework
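
The acceptance figures quoted in this entry can be read two ways, and the arithmetic shows which one the 51.7 matches: it is the percentage-point difference between 80.0% and 28.3%, while the relative drop is larger. The function name below is hypothetical.

```python
def acceptance_gap(high_mci_rate: float, baseline_rate: float) -> tuple:
    """Return (percentage-point gap, relative drop in percent)."""
    point_gap = baseline_rate - high_mci_rate
    relative_drop = point_gap / baseline_rate * 100
    return point_gap, relative_drop


points, relative = acceptance_gap(28.3, 80.0)
print(round(points, 1), round(relative, 1))  # 51.7 64.6
```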

Across agents, test-containing PRs are more common over time and tend to be larger and take longer to complete, while merge rates remain largely similar.

Test adoption varies by agent and merge rates stay largely similar either way, so test presence is a signal to read, not proof of quality.

We observe variation across agents in both test adoption and the balance between test and production code within test PRs
Testing is a critical practice for ensuring software correctness and long-term maintainability

arXiv (Sabrina Haque, Sarvesh Ingale, Christoph Csallner): Do Autonomous Agents Contribute Test Code? A Study of Tests in Agentic Pull Requests Submitted January 7-8, 2026 data partial

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

22.7% of tracked AI-introduced issues still survive at the latest version of the repository. These findings show that AI-generated code can introduce long-term maintenance costs into real software projects.

Over a fifth of AI-introduced issues survive at HEAD, so AI code accrues durable technical debt at scale unless verification catches it.

302.6k verified AI-authored commits from 6,299 GitHub repositories
more than 15% of commits from every AI coding assistant introduce at least one issue
"code smells are by far the most common type" / "89.3% of all issues"

arXiv (Yue Liu, Ratnadira Widyasari, Yanjie Zhao, Ivana Clairine Irsan, Junkai Chen, David Lo): Debt Behind the AI Boom: A Large-Scale Empirical Study of AI-Generated Code in the Wild Submitted 30 March 2026 (v2 revised 26 April 2026) data partial

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

We built an automated scanning agent that systematically audited eight among the most prominent AI agent benchmarks — SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, and CAR-bench — and discovered that every single one can be exploited to achieve near-perfect scores without solving a single task.

Every benchmark the authors audited could be gamed to near-perfect scores without solving anything, so self-reported benchmark performance is structurally untrustworthy.

A conftest.py file with 10 lines of Python 'resolves' every instance on SWE-bench Verified.
SWE-bench Verified (500 tasks) — 100% score via pytest hooks
Benchmark scores are actively being gamed, inflated, or rendered meaningless, not in theory, but in practice.

Berkeley RDI (Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song): How We Broke Top AI Agent Benchmarks: And What Comes Next April 2026 data

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

If your team adopts AI coding tools without restructuring how code review works, expect slower releases, not faster ones

AI moves the bottleneck from writing to reviewing, so teams that don't restructure review ship slower, not faster.

"98 percent increase in PR volume" — attributed to Faros AI analysis of 10,000+ developers
"PR review time went up 91 percent" — same Faros AI study
68 percent of senior engineers report quality improvements from AI, but only 26 percent would ship AI-generated code without review

LogRocket Blog (Ikeh Akinyemi): Why AI coding tools shift the real bottleneck to review January 20, 2026 data

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

The bots created more logic and correctness errors (1.75x), more code quality and maintainability errors (1.64x), more security findings (1.57x), and more performance issues (1.42x).

AI-authored PRs carry more defects than human ones in every category, with the widest gaps in logic/correctness (1.75x) and code quality/maintainability (1.64x), so review depth should follow issue class.

On average, AI-generated pull requests (PRs) include about 10.83 issues each, compared with 6.45 issues in human-generated PRs.
AI-authored PRs contain 1.4x more critical issues and 1.7x more major issues on average than human-written PRs.
The report examined 470 open source pull requests.

Thomas Claburn, The Register: State of AI vs. Human Code Generation Report Report Dec 17, 2025; Register coverage Dec 17, 2025 data partial

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

So as not to unnecessarily punish creativity, it's often better to grade what the agent produced, not the path it took.

Grade the agent on the outcome it produced, not the procedure it followed, or you punish valid solutions.

As a rule, we do not take eval scores at face value until someone digs into the details of the eval and reads some transcripts.
CORE-Bench: "Opus 4.5 initially scored 42%... After fixing bugs... score jumped to 95%" — Anthropic's own eval-harness bug (Step 5)
Defining eval tasks is one of the best ways to stress-test whether the product requirements are concrete enough to start building

Anthropic Engineering Blog (Mikaela Grace, Jeremy Hadfield, Rodrigo Olivares, Jiri De Jonghe): Demystifying evals for AI agents January 9, 2026 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

The agent runs the build, sees green, and moves on. But 'build passes' and 'the output is production-ready' are different bars.

Agent self-verification confirms compilation and tests but not production-readiness, so quality attributes must be checked explicitly.

Developers consistently report agents declaring tasks complete while skipping accessibility attributes, test isolation, config externalization, dark mode, responsive layout, and meta tags.
The agent's own verification handles 'does it compile and do tests pass.' The orchestrator handles 'did it actually do what was asked, completely.'

DEV (Brad Kinnard): AI Coding Agents Can Verify Some of Their Work Now. Here's What They Still Miss. April 9, 2026 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

Don't ask the same agent to write code and verify it. That's like having students grade their own exams...The separation is what makes the gates trustworthy.

The agent that writes the code must not be the one that grades it; separated validation gates are what make verification trustworthy.

Eight quality gates required before production
Every commit is a known-good checkpoint. When something fails, the blast radius is one subtask, not an entire feature.
Agents are extremely literal. Give them vague instructions and they'll build something that technically matches what you said but misses what you meant.

DEV (Teemu Piirainen): How I Validate Quality When AI Agents Write My Code March 16, 2025 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

Generic evaluation metrics are everywhere...These metrics measure abstract qualities that may not matter for your use case. Good scores on them don't mean your system works.

Evals should be derived from error analysis of real traces, because good scores on generic metrics don't mean the system works.

Error analysis helps you decide what evals to write in the first place. It allows you to identify failure modes unique to your application and data.
Spend 60-80% of our development time on error analysis and evaluation
Binary evaluations force clearer thinking and more consistent labeling. Likert scales introduce significant challenges.

Hamel Husain and Shreya Shankar: LLM Evals: Everything You Need to Know January 15, 2026 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

on a real build, structured verification consistently found 30-40% of the specification unimplemented after the agent reported 'complete.' Not broken code. Missing code.

Agents routinely report 'complete' while 30-40% of the spec is unbuilt, a gap code review can't see because there is no diff.

Code review examines what was built...But if a feature wasn't built at all, there's no diff to review.
Verification works forward from the spec: 'given what was specified, was it built?'
5-6 passes to full completion is consistent enough to plan around

LoadSys (Lee Forkenbrock): How to Verify What Your AI Coding Agent Actually Built April 27, 2026 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework
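
The forward-from-the-spec check described in the LoadSys entry above can be pictured as a set difference. This is a minimal sketch, assuming spec requirements are tracked as string IDs and that some separate process (hypothetical here) records which IDs have implementation evidence; the function names are illustrative and do not come from the source.

```python
def unimplemented(spec_ids: set[str], implemented_ids: set[str]) -> list[str]:
    """Spec items with no implementation evidence, sorted for stable output."""
    return sorted(spec_ids - implemented_ids)


def completion_gap(spec_ids: set[str], implemented_ids: set[str]) -> float:
    """Fraction of the spec that is still unbuilt (0.0 for an empty spec)."""
    if not spec_ids:
        return 0.0
    return len(spec_ids - implemented_ids) / len(spec_ids)


# Hypothetical spec: five requirements, three of them built.
spec = {"login", "logout", "password-reset", "audit-log", "csv-export"}
built = {"login", "logout", "password-reset"}
missing = unimplemented(spec, built)
gap = completion_gap(spec, built)
```

A diff-based review never surfaces `audit-log` or `csv-export` here, because no diff mentions them; the spec-side check does.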

the biggest mistake engineers make in code review: only thinking about the code that was written, not the code that could have been written.

The core reviewer skill for agent output is architectural judgment about unwritten alternatives, not line-level nitpicking.

about once an hour I notice that the agent is doing something that looks suspicious, and when I dig deeper I'm able to set it on the right track and save hours of wasted effort.
If you're a nitpicky code reviewer, I think you will struggle to use AI tooling effectively.
Trying to make a badly-designed solution work costs time, tokens, and codebase complexity.

Sean Goedecke: If you are good at code review, you will be good at using AI agents September 20, 2025 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

Never assume that code generated by an LLM works until that code has been executed.

No agent-written code should be trusted until it has actually been run, because passing tests and plausibility are not proof.

Just because code passes tests doesn't mean it works as intended.
I've found that getting agents to manually test code is valuable as well, frequently revealing issues that weren't spotted by the automated tests.
Automated tests are no replacement for manual testing.

Simon Willison: Agentic manual testing 6th March 2026 practitioner

Cited in How to Verify AI Coding Agent Output: A Reviewer's Framework

4 · Production Operations

Running agents in production: cost, permissions, failure modes, guardrails.

More acute are the challenges of identifying the consumer of the model output, which is especially difficult when the consumers of the same model can be different interfaces/functional modules in the same user application (e.g., 'tech support chatbot' or 'new customer chatbot')

The hard, unsolved FinOps problem for AI is mapping model output back to the specific consumer; account-level billing is the wrong granularity and no accepted multi-agent allocation framework exists yet.

"Tokens! The meters, or elements of charge can be very different. For example, measuring the tokens at the user input vs. the compressed and semantic reduced or re-written actual prompt input token quantity that goes to the API endpoint that is charged."
"Lack of generally accepted frameworks for cost allocation across multi-agent workloads"

FinOps Foundation (finops.org): FinOps for AI Overview Last updated February 17, 2026 data

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost

Rather than supervising what the agent does, we supervise what it's able to do by enforcing access boundaries through, for example, sandboxes, virtual machines, and egress controls.

Safety comes from constraining what the agent can reach, not from watching what it does, because any model-layer check has a non-zero miss rate.

Any probabilistic defense has a non-zero miss rate.
Claude Code previously protected against agents taking unintended actions by asking users for permission at each turn... Our telemetry showed users approved roughly 93% of permission prompts.
The weakest layer is the one you built yourself

Anthropic: How we contain Claude across products 2026-05-25 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Attributing spend to specific skills, plugins, or subagent types via the `skill.name`, `plugin.name`, and `agent.name` attributes

Claude Code's OpenTelemetry export already tags every call with per-skill, per-plugin, per-subagent, and per-prompt attributes, so an (agent, task, user) schema can be committed at the call level today.

OpenTelemetry export to your backend is opt-in and requires explicit configuration.
`token.usage` `type` attribute allowed values: `"input"`, `"output"`, `"cacheRead"`, `"cacheCreation"`
`query_source` attribute allowed values: `"main"`, `"subagent"`, `"auxiliary"` — distinguishes main-loop calls from subagent/auxiliary calls.
`prompt.id`: "UUID v4 correlating a user prompt with all subsequent events until next prompt".

Anthropic (code.claude.com): Monitoring undated (references minimum version "Claude Code v2.1.193 or later") practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
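
To make the attribution idea in the Monitoring entry above concrete, here is a minimal sketch of rolling token usage up by `agent.name` and `type`. It assumes the exported data points have already been flattened into plain dicts with the keys `agent.name`, `type`, and `value`; that record shape is an assumption for illustration, not the export's wire format.

```python
from collections import defaultdict


def tokens_by_agent(points):
    """Sum token counts per agent and per token type."""
    totals = defaultdict(lambda: defaultdict(int))
    for point in points:
        # Calls with no agent attribute are kept visible rather than dropped.
        agent = point.get("agent.name", "unattributed")
        totals[agent][point["type"]] += point["value"]
    return {agent: dict(by_type) for agent, by_type in totals.items()}


# Hypothetical flattened data points.
points = [
    {"agent.name": "reviewer", "type": "input", "value": 1200},
    {"agent.name": "reviewer", "type": "output", "value": 300},
    {"agent.name": "planner", "type": "cacheRead", "value": 5000},
    {"type": "input", "value": 40},
]
usage = tokens_by_agent(points)
```

Keeping an explicit `unattributed` bucket is a design choice: it shows how much spend still escapes the schema.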

Values for a given date can be revised for up to 30 days as late events arrive and reconciliation runs. For invoicing-grade totals, query dates at least 30 days in the past.

Provider analytics are a post-hoc, reconciled reporting layer: figures keep moving for up to 30 days and are attributed per user, not per request, which makes them useless as a real-time per-task control.

Enterprise Analytics cost granularity: "per-user and organization-level token usage and cost over time (usage-based Enterprise plans)" — NOT per-request.
Cost data freshness: "Data is typically available within four hours of the underlying usage but may take up to 24 hours."
"Daily Claude Code metrics per user: sessions, lines of code, commits, pull requests, tool acceptance, and estimated cost by model"

Anthropic (platform.claude.com): Analytics APIs undated (data available "for dates on or after January 1, 2026") practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
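
The 30-day revision window in the Analytics APIs entry above implies a simple guard before any figure is treated as final. A minimal sketch, assuming the caller supplies both dates; the helper name is hypothetical.

```python
from datetime import date, timedelta

# Values can be revised for up to 30 days, per the quoted documentation.
RECONCILIATION_WINDOW = timedelta(days=30)


def is_invoicing_grade(usage_date: date, today: date) -> bool:
    """True once the usage date is at least 30 days in the past."""
    return today - usage_date >= RECONCILIATION_WINDOW


settled = is_invoicing_grade(date(2026, 6, 1), today=date(2026, 7, 1))
still_moving = is_invoicing_grade(date(2026, 6, 2), today=date(2026, 7, 1))
```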

Agents introduce a risk called *excessive agency*, where an agent determines the best solution to a problem is to take broader actions beyond its scope.

First-party cloud guidance names excessive agency as a High-risk gap and prescribes least-privilege boundaries plus user confirmation to contain it.

Level of risk exposed if this best practice is not established: High
Implement user confirmation for the agent, requiring users to confirm agent actions and mitigating the risk of excessive agency.
A permission boundary sets the maximum permissions which can be given to a role.

AWS: GENSEC05-BP01 Implement least privilege access and permissions boundaries for agentic workflows practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Autonomy is not a configuration decision that's decided once. Rather, it is more like a score that goes up or down, and that your system earns through demonstrated reliability in your specific environment and workflows.

Agent autonomy should be an earned, revocable score tied to measured reliability, not a one-time day-one setting.

Expansion of autonomy should happen as a consequence of earned trust, not as a deployment decision we make on day one.
Named trust-score inputs: percentage of agent actions completed without human override (30-day window); false escalation rate; override-correctness rate; time-to-revert
Conservative defaults with clear, earned expansion paths are the right architecture as the fastest route to durable autonomy at scale.

Barr Moses / Monte Carlo: Agentic Autonomy Is a Trust Score 2026-04-22 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk
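
The first trust-score input named in the Monte Carlo entry above, the share of agent actions completed without human override over a 30-day window, can be computed directly. This sketch covers only that one input; the source does not give a formula for combining inputs into a score, so none is invented here. The `(day, overridden)` tuple shape is an assumption.

```python
from datetime import date, timedelta


def override_free_rate(actions, today, window_days=30):
    """Share of actions in the window that no human overrode.

    actions: iterable of (day, overridden) tuples.
    Returns None when the window holds no actions, so "no data" is not
    mistaken for "perfect record".
    """
    cutoff = today - timedelta(days=window_days)
    recent = [overridden for day, overridden in actions if cutoff <= day <= today]
    if not recent:
        return None
    return sum(1 for overridden in recent if not overridden) / len(recent)


# Hypothetical action log; the January entry falls outside the window.
log = [
    (date(2026, 4, 20), False),
    (date(2026, 4, 21), True),
    (date(2026, 4, 22), False),
    (date(2026, 1, 1), True),
]
rate = override_free_rate(log, today=date(2026, 4, 22))
```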

This brief event was the result of user error — specifically misconfigured access controls — not AI.

Even the vendor's own defense of an agent-caused deletion frames it as an access-control misconfiguration, corroborating that these are authorization failures, not model failures.

The AI agent encountered a problem and determined that the optimal solution was to delete and recreate the entire environment.
Kiro requires two-person approval before pushing changes to production. But the deploying engineer had broader permissions than a typical employee, and Kiro inherited those elevated privileges.

Barrack AI: Amazon's AI deleted production. Then Amazon blamed the humans. 2026-02-22 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Enforcing least privilege requires control at the point of tool invocation, in real time, against a defined scope that reflects the agent's function, not its operator's credentials.

Least privilege for agents must be enforced at tool-invocation time and scoped to the agent's function, not inherited from its operator's broad credentials.

Authentication tells you who the agent is. It tells you nothing about what the agent should be allowed to do.
Gartner identifies approximately 40 tool definitions as the threshold beyond which agent latency and token cost increase measurably.

Cequence Security: Least Privilege Access for AI Agents: The Control You're Missing 2026-05-12 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk
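
Enforcement at the point of tool invocation, as the Cequence entry above describes it, can be sketched as a gate that every tool call must pass. This is a minimal illustration under stated assumptions: the scope is a fixed set of tool names tied to the agent's function, and `invoke_tool`, `ScopeError`, and the tool names are all hypothetical.

```python
class ScopeError(PermissionError):
    """Raised when an agent requests a tool outside its function scope."""


def invoke_tool(agent_scope, tools, tool_name, **kwargs):
    """Run a tool only if it is inside the agent's declared scope.

    The check uses the agent's own scope, never the operator's credentials.
    """
    if tool_name not in agent_scope:
        raise ScopeError(f"{tool_name!r} is outside this agent's scope")
    return tools[tool_name](**kwargs)


# Hypothetical tool registry; the operator could call both tools.
tools = {
    "read_ticket": lambda ticket_id: f"ticket {ticket_id}",
    "drop_table": lambda name: f"dropped {name}",
}
# A support agent's function needs read access only.
support_scope = frozenset({"read_ticket"})
result = invoke_tool(support_scope, tools, "read_ticket", ticket_id=7)
```

The registry contains `drop_table`, but the agent cannot reach it, which is the difference between inherited and function-scoped authority.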

Railway's CLI token created for managing custom domains had blanket permissions across the entire GraphQL API, including destructive operations on production volumes. There is no role-based access control (RBAC) for Railway API tokens.

The production database deletion happened because an over-broad, unscoped token authorized destructive operations, not because the model went rogue.

Tokens are not scoped by operation, by environment, or by resource. Every token is effectively root.
Soft guardrails are probabilistic controls that guess at intent instead of enforcing rules
The agent knew the rules, yet it violated every one of them

Chris Hughes / Zenity: System Prompts Are Not Security Controls: A Deleted Production Database Proves It 2026-04-28 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Tier 1 systems handling information retrieval need automated monitoring. Tier 2 workflows with reversible actions require real-time guardrails. Tier 3 systems involving financial transactions demand human-in-the-loop for all decisions.

Controls should be tiered in proportion to an action's risk, from monitoring for retrieval up to human-in-the-loop for high-stakes transactions.

15-20% of policy violations occur during tool execution before output generation
a single agent performing 1000+ actions per hour makes comprehensive human oversight untenable
Access control determines which resources your agents can touch, validation filters what they consume and produce, human oversight governs high-stakes decisions

Jackson Wells / Galileo: The Essential AI Agent Guardrails Framework for Autonomous Systems 2025-12-13 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk
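
The three tiers in the Galileo entry above map naturally onto a small lookup. A minimal sketch: the source names information retrieval, reversible actions, and financial transactions; routing irreversible non-financial actions to the top tier is an added assumption, and the names below are hypothetical.

```python
from enum import Enum


class Control(Enum):
    AUTOMATED_MONITORING = 1   # Tier 1: information retrieval
    REALTIME_GUARDRAILS = 2    # Tier 2: reversible actions
    HUMAN_IN_THE_LOOP = 3      # Tier 3: financial transactions


def control_for(*, mutates: bool, reversible: bool, financial: bool) -> Control:
    """Pick the minimum control for an action from its risk attributes."""
    # Assumption: irreversible changes are treated like financial ones.
    if financial or (mutates and not reversible):
        return Control.HUMAN_IN_THE_LOOP
    if mutates:
        return Control.REALTIME_GUARDRAILS
    return Control.AUTOMATED_MONITORING


lookup = control_for(mutates=False, reversible=True, financial=False)
edit = control_for(mutates=True, reversible=True, financial=False)
refund = control_for(mutates=True, reversible=True, financial=True)
```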

Safety was retrofitted at the infrastructure layer. It should have been enforced at the identity and access layer from the start.

Bolting safety onto infrastructure after the fact fails; access limits must be enforced at the identity layer before the agent runs.

Cursor didn't hack the PocketOS environment, it was handed the keys that only a highly privileged user should have.
Many of the guardrails being marketed today are not guardrails at all. They are suggestions, enforced only insofar as the model chooses to comply.
The question isn't why Claude did this — it's why anyone gave an AI agent production credentials without a circuit breaker.

Jordyn Alger / Security Magazine: Company Database Deleted by AI Agent: What Security Leaders Need to Know 2026-05-01 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Least privilege does not mean making the agent weak. It means giving the agent exactly enough power to complete the approved task, for the approved time, in the approved context.

Least privilege scopes an agent to exactly the task, time, and context approved, which defines the axes of an authority-by-task-class table.

Static roles like 'claims analyst' or 'support ops' are often far wider than the exact permissions a single agent run should have.
Read access can still expose sensitive personal data, trade secrets, or protected records.
Shared service accounts destroy attribution: one API key used by multiple automations cannot prove who did what later

KLA: AI Agent Permissions and Entitlements: Enforcing Least-Privilege Access in Regulated Enterprises 2026-03-10 practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

After the key crosses it's `max_budget`, requests fail

A proxy can enforce multi-level budgets by validating spend before a request is admitted and hard-failing once the ceiling is crossed: it terminates before the next call rather than alerting after the invoice.

"validates spend against the authoritative database before being admitted (covering key, team, user, organization, end-user, tag, and per-window budgets)"
"`fail_closed_budget_enforcement`" enables a hard ceiling "even while Redis is degraded"
Exceeded-budget response body: `"ExceededTokenBudget: Current spend for token: 7.2e-05; Max Budget for Token: 2e-07"`.

LiteLLM (docs.litellm.ai): Budgets, Rate Limits practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
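
The admit-before-spend pattern in the LiteLLM entry above can be sketched without any proxy at all. This is a generic illustration of fail-closed, multi-level budget checks, not LiteLLM's implementation or API; `admit`, `BudgetExceeded`, and the scope names are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a request must be rejected before it reaches the model."""


def admit(spend, budgets):
    """Check every budget scope before admitting a request.

    spend:   scope -> current spend, or missing/None if the store is unreachable.
    budgets: scope -> ceiling.
    """
    for scope, limit in budgets.items():
        current = spend.get(scope)
        if current is None:
            # Fail closed: unknown spend is treated as over budget.
            raise BudgetExceeded(f"{scope}: spend unknown, failing closed")
        if current > limit:
            raise BudgetExceeded(f"{scope}: spend {current} exceeds budget {limit}")


budgets = {"key": 1.0, "team": 10.0}
admit({"key": 0.5, "team": 3.0}, budgets)  # under every ceiling, returns quietly
```

Failing closed on unknown spend trades availability for a hard ceiling, which matches the intent of the quoted `fail_closed_budget_enforcement` setting.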

When agents run agentic loops, they can make unbounded LLM calls, causing unexpected costs.

Agentic loops make unbounded LLM calls by default, so the ceiling must be set per session — a hard iteration cap and a per-session dollar cap keyed to a trace/session id.

Control 1 — "Max Iterations": "Hard cap on the number of LLM calls per session".
Control 2 — "Max Budget Per Session": "Dollar cap per session (identified by `x-litellm-trace-id`)".
"When the counter exceeds `max_iterations`, the request receives a **429 Too Many Requests**".

LiteLLM (docs.litellm.ai): Agent Iteration Budgets practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
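
The two per-session controls in the Agent Iteration Budgets entry above, an iteration cap and a dollar cap, can be sketched as a small in-memory guard. This is not LiteLLM code. The 429 for an exceeded iteration counter follows the quoted documentation; returning 429 for an exhausted dollar budget is an assumption, as are the class and method names.

```python
from collections import defaultdict


class SessionGuard:
    """Per-session ceilings keyed by a trace or session id."""

    def __init__(self, max_iterations, max_budget_usd):
        self.max_iterations = max_iterations
        self.max_budget_usd = max_budget_usd
        self.calls = defaultdict(int)
        self.spend = defaultdict(float)

    def before_call(self, session_id):
        """Return an HTTP-style status: 200 to admit, 429 to reject."""
        self.calls[session_id] += 1
        if self.calls[session_id] > self.max_iterations:
            return 429
        if self.spend[session_id] >= self.max_budget_usd:
            return 429  # assumption: same status as the iteration cap
        return 200

    def record_cost(self, session_id, usd):
        self.spend[session_id] += usd


guard = SessionGuard(max_iterations=2, max_budget_usd=1.0)
statuses = [guard.before_call("s1") for _ in range(3)]
```

Because the counters are keyed by session, one runaway loop is stopped without affecting other sessions that share the same API key.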

Cost visibility tells you what your agents spent — through dashboards, cost traces, and budget alerts. Cost governance controls what they are permitted to spend, by enforcing per-session ceilings that terminate sessions before a threshold is exceeded.

Cost visibility (dashboards, alerts) is not cost control; governance means enforcing per-session ceilings that terminate the session before the threshold is crossed, and provider caps operate at the wrong (account/key) granularity.

"only 44% of organizations have adopted financial guardrails or AI FinOps practices" — attributed to Gartner, March 2026
"A 10-step agent with an average cost of $0.02 per step looks inexpensive in planning. That same agent entering a retry loop and executing 2,000 steps doesn't — that's $40 from a session that was supposed to cost $0.20."
"Provider-level controls operate at the API key or account level, not the individual session level. They cannot distinguish a single runaway session from many well-behaved sessions using the same key."

Logan Kelly, Waxell: The $400M AI FinOps Gap: Why Cost Visibility Isn't the Same as Cost Control April 9, 2026 practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
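
The retry-loop arithmetic quoted in the Waxell entry above is easy to reproduce, and the same numbers show what a per-session ceiling would have allowed. A minimal sketch using `Decimal` to keep the dollar amounts exact; the helper names are hypothetical.

```python
from decimal import Decimal


def session_cost(steps, cost_per_step):
    """Total cost of a session at a flat average cost per step."""
    return steps * cost_per_step


def max_steps_under(ceiling, cost_per_step):
    """How many steps a per-session dollar ceiling permits."""
    return int(ceiling // cost_per_step)


COST_PER_STEP = Decimal("0.02")                 # figure from the quoted example
planned = session_cost(10, COST_PER_STEP)       # the 10-step plan
runaway = session_cost(2000, COST_PER_STEP)     # the 2,000-step retry loop
allowed = max_steps_under(Decimal("0.20"), COST_PER_STEP)
```

With a ceiling equal to the planned cost, the session stops at step 10 instead of running 200 times over budget.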

Every request passes through it, which means budget enforcement happens in one place, consistently, regardless of which agent sent the request.

Infrastructure-level (proxy) budget enforcement is the only reliable guard against runaway costs because it enforces at one chokepoint, whereas application-level checks can be forgotten in a new agent.

"agent that takes 50 turns on a complex task hits 100,000 input tokens and 40,000 output tokens, costing roughly $0.90 per session. Run 100 of those sessions per hour, and you are looking at $90/hour, or over $2,100/day".
"developer on r/AI_Agents recently described watching their agent rack up $15 in API costs in under 10 minutes".
"If a developer forgets to add the check in a new agent, there is no safety net."

Matt Turley, RelayPlane: Agent Runaway Costs: How to Set LLM Budget Limits Before Costs Spiral March 24, 2026 practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
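
The projection in the RelayPlane entry above follows from the quoted per-session figure alone. A small sketch that takes the roughly $0.90 session cost as given (the source does not state the token prices behind it, so none are assumed here); the function name is hypothetical.

```python
from decimal import Decimal


def projected_spend(cost_per_session, sessions_per_hour, hours=24):
    """Return (hourly, total) spend at a steady session rate."""
    hourly = cost_per_session * sessions_per_hour
    return hourly, hourly * hours


hourly, daily = projected_spend(Decimal("0.90"), 100)
```

The daily result is $2,160, consistent with the quoted "over $2,100/day".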

Your employees ignore 96% of their permissions. Agents won't.

A broad permission grant is more dangerous for an agent than a human, because the agent will actually exercise every permission it holds.

Without mirroring these same permissions, an AI agent could expose protected data.
Developers should consider to use just-in-time access, human-in-the-loop verification
An agent that holds one of those tokens will keep answering requests even when the system has revoked

Oso: Setting Permissions for AI Agents 2025-10-28 (updated 2025-11-25) practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

Agent-level cost attribution starts with identity. When every agent has a unique, registered identity, every API call, token consumption event, and tool invocation can be tagged to that identity.

Agent-level cost attribution requires giving every agent a registered identity so every token and tool call can be tagged to it — but the field's default stops at alerts, not termination.

"Per-agent budgets define expected spend. Alerts fire when an agent approaches or exceeds its budget."
"Cloud cost management tools track compute and API spend at the account or service level — not at the agent level."

Prefactor: Implementing Agent-Level Cost Attribution Updated 9 April 2026 practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost

OpenAI and Anthropic API calls show up as a single line item per provider. There's no native breakdown by your customer, your feature, or your workflow.

Cloud FinOps tooling structurally fails on LLM workloads because cloud tags don't propagate to the API call and provider billing arrives as one line item — attribution must be a schema on the call itself.

"the company spent $87,000/month on Anthropic API calls that arrived as a single line item".
"two enterprise customers were responsible for 78% of LLM costs while paying for 12% of revenue".
"Tagging doesn't propagate to OpenAI/Anthropic API calls. The tag lives on the EC2 instance making the API call, not on the API call itself."

Ravi Kanani, LeanOps Technologies: FinOps for AI Workloads in 2026: Why Traditional Cloud FinOps Practices Fail On LLMs May 19, 2026 practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost
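
Attribution as "a schema on the call itself", as the LeanOps entry above puts it, might look like the following. A minimal sketch: the three dimensions (customer, feature, workflow) come from the quoted passage, while the record type, field names, and sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One LLM API call, tagged at the call rather than at the host."""
    customer: str
    feature: str
    workflow: str
    cost_usd: float


def cost_by(records, dimension):
    """Total cost grouped by one attribution dimension."""
    totals = defaultdict(float)
    for record in records:
        totals[getattr(record, dimension)] += record.cost_usd
    return dict(totals)


# Hypothetical calls; a provider invoice would show these as one line item.
records = [
    CallRecord("acme", "search", "triage", 0.5),
    CallRecord("acme", "summaries", "triage", 0.25),
    CallRecord("globex", "search", "onboarding", 0.25),
]
per_customer = cost_by(records, "customer")
per_feature = cost_by(records, "feature")
```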

Consumption dimensions tell you what was used, not who in your business used it. Allocation is the work of mapping that usage back to teams, budgets, and cost centers.

Aggregate token counts tell you what was used but not who used it; allocation to teams, budgets, and cost centers is the actual work, and centralized billing traded away the per-user and per-team visibility that individual seats used to provide.

"Aggregate token counts don't tell you which teams are driving spend."
"Centralized billing simplified procurement and security, but it traded away the per-user and per-team visibility teams used to get from individual seats."
"AI cost also scales differently than cloud cost. It moves with prompt size, fanout, retries, and agentic loops."

Scott Castle, Chief Product Officer at CloudZero: Anthropic Shipped An Enterprise Analytics API. We Shipped the Claude Adapter Today. May 15, 2026 practitioner

Cited in You Can't Cap What You Can't Attribute: Per-Task Cost

Identity logic doesn't belong in prompts or agent code. It belongs in a control plane.

Access enforcement belongs in a runtime control plane, not in prompts or agent code, because a bigger prompt cannot enforce permissions.

Designing least privilege up front for an agent is an exercise in guesswork
Overpermissioning isn't a failure of discipline. It's a predictable outcome
If access is static, privilege is wrong

Strata Identity / Eric Olden: Why Agentic AI Forces a Rethink of Least Privilege 2026-05-11 (updated) practitioner

Cited in Tier Your AI Agent's Production Authority by Task Risk

5 · Team & Process

Reviewing AI diffs, reviewer capacity, and how teams absorb agent output.

CRA-only reviewed PRs achieve a 45.20% merge rate, 23.17 percentage points lower than human-only PRs (68.37%)

Code-review agents left to review alone merge PRs at a far lower rate than humans, so removing human review capacity degrades outcomes.

"34.88%" abandonment (CRA-only) vs "21.60%" (human-only) — outcome distribution across reviewed categories
"60.2% of closed CRA-only PRs fall into the 0–30% signal range" — signal-to-noise analysis of 98 closed CRA-only PRs
"12 of 13 CRAs exhibit average signal ratios below 60%" — quality assessment across 13 unique code review agents

Chowdhury, Banik, Ferdous, Shamim: From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests April 3, 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents
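
The signal ranges quoted in the code review agent study above suggest a per-PR ratio. This sketch assumes the ratio is actionable comments divided by all agent comments on a PR; the paper's exact definition is not reproduced in this index, so treat the formula and the helper names as assumptions.

```python
def signal_ratio(actionable, total):
    """Share of review comments that were actionable; None if there were none."""
    if total == 0:
        return None
    return actionable / total


def in_low_signal_band(ratio, upper=0.30):
    """True when a PR falls in the 0-30% band quoted in the study."""
    return ratio is not None and ratio <= upper


# Hypothetical PR: 12 agent comments, 3 of them actionable.
ratio = signal_ratio(actionable=3, total=12)
low = in_low_signal_band(ratio)
```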

Senior engineers become the verification layer for product ambiguity. They are no longer just checking implementation quality. They are reconstructing intent from generated code, thin specs, incomplete Jira tickets, and edge cases nobody wrote down.

The unbudgeted review burden concentrates on senior engineers as intent-reconstructors, creating retention risk that throughput dashboards never show.

"Replacement cost of a senior software engineer at $150,000 to $300,000 in 2026, including recruiting, ramp time, and lost institutional knowledge." — Industry benchmarks cited
"25% of PRs are now reviewed by AI agents, up from 0% in 2025. But review times have increased nearly 200%." — AI Engineering Report 2026 caption
the burden "does not get measured in PR throughput dashboards"

Faros AI (Naomi Lurie): The hidden cost of AI code quality: Why senior engineers are paying the price May 21, 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents

Median time in review is up 441.5%

Telemetry across thousands of teams shows review time exploding under AI adoption while more PRs merge unreviewed and incidents rise.

"Code churn... has increased 861% under high AI adoption" — Takeaway 3 (Faros telemetry)
"Pull requests merged without any review... up 31.3%" — Takeaway 8 (Faros telemetry)
"Incidents-to-PR ratio is up 242.7%" — Takeaway 4 (Faros telemetry)

Faros Research: Ten takeaways from the AI Engineering Report 2026: The Acceleration Whiplash April 12, 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents

the review queue becomes the binding constraint on their delivery pipeline

When agents raise output, the human review queue — not code generation — becomes the constraint that caps delivery.

"developers at large organisations spend between ten and fifteen percent of their working hours reading and commenting on others' code" — attributed to Sadowski et al., Google study
"review latency between submitting a pull request and receiving actionable feedback routinely stretches over twenty-four hours" — Introduction
"reviews of agent-generated code become rubber-stamps: the human approves because the code looks correct"

Martin Monperrus: The End of Code Review: Coding Agents Supersede Human Inspection 11 Jun 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents

code review agents can exhibit a low signal-to-noise ratio when designed to identify all hidden issues, obscuring true progress and developer productivity

"Find everything" review agents drown the signal, so resolution/merge rate is the wrong yardstick and signal-to-noise proxies developer trust.

"CR-Bench...584 high-fidelity PR tasks" — Section 7.1
"average PR comments 41.03" per instance — Table 3
"high SNR serves as a primary proxy for developer trust by quantifying the ratio of actionable signal to distracting hallucinations"

Pereira, Sinha, Ghosh, Dutta (Nutanix, Inc.): CR-Bench: Evaluating the Real-World Utility of AI Code Review Agents 10 Mar 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents

The development time has been shortened but the team now needs to spend more time to review. Doesn't look like any benefit.

Individual AI speedups externalize review burden onto the team, making review a shared, exhaustible resource rather than a free step.

"30 PRs per day across 6 reviewers" — [R07] reviewer-burden example
"reviewer-burden" ranked among top 3 most frequent codes (226 instances) — coding frequency
"Individual developers and organizations benefit from AI-generated content, but the cumulative effect degrades the shared resources that collaborative development depends on."

Sebastian Baltes, Marc Cheong, Christoph Treude: "An Endless Stream of AI Slop": How Developers Discuss the Burden of AI-Assisted Software Development 09 Jun 2026 data

Cited in Review Capacity Is the Real Ceiling on Your Agents

Tier by risk, not by author. A config change earns a linter and a glance. A payments path earns the full stack

Review depth should be triaged by the risk class of the change, not by who authored it or the diff size.

"AI-written code produces 1.7x more issues than human code" — CodeRabbit (470 OSS PRs, December 2025)
"93.4% of findings caught by exactly one tool" — dev.to engineer (4 parallel reviewers, 146 PRs, 679 findings)
"A model cannot be paged and cannot be held responsible for what it shipped, so whoever clicks merge owns it"

Addy Osmani: Agentic Code Review (addyosmani.com) June 15, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents
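
"Tier by risk, not by author" from the entry above can be expressed as a routing rule that never looks at who wrote the change. A minimal sketch, assuming risk is inferred from file paths; the patterns, depth names, and functions are hypothetical.

```python
from fnmatch import fnmatchcase

# Review depths, ordered from lightest to heaviest.
DEPTHS = ["glance", "standard", "full"]

# First match wins, so high-risk patterns are listed first.
RISK_RULES = [
    ("payments/*", "full"),
    ("*.yaml", "glance"),
]


def depth_for_file(path):
    """Review depth for one changed file; 'standard' when no rule matches."""
    for pattern, depth in RISK_RULES:
        if fnmatchcase(path, pattern):
            return depth
    return "standard"


def depth_for_change(paths):
    """A change gets the heaviest depth required by any file it touches."""
    return max((depth_for_file(p) for p in paths), key=DEPTHS.index, default="standard")


config_only = depth_for_change(["config/app.yaml"])
mixed = depth_for_change(["config/app.yaml", "payments/charge.py"])
```

Neither function takes an author argument, so a human and an agent touching the same payments path get the same full review.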

We made writing cheap, and understanding stayed exactly as expensive as it has always been.

AI collapsed the cost of writing code but not the cost of understanding it, which is why review is now the ceiling.

"More than one in five reviews on the platform involves an agent" — GitHub
"4x the raw output of nonusers...only about 12% productivity gain" — GitClear
"The reasoning is usually thrown away rather than attached...reviewer has to reconstruct intent"

Addy Osmani / O'Reilly Radar: Agentic Code Review June 26, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents

The reviewer role is being automated. The review, understood as judgment about whether the software is correct for its purpose, is relocating to where the agent cannot follow.

Agents can take over diff inspection, but human judgment doesn't disappear — it relocates to intent specification up front and accountability at merge.

"An agent-assisted developer produces more pull requests per day than human review capacity can absorb." — Monperrus paper discussion
"Automate the checkpoint and the judgment does not evaporate. It relocates to intent specification on the way in and accountability on the way out"
"The human does not leave the loop. The human moves from the end of it to the start."

Blake Crosley: Agents Supersede the Reviewer, Not the Review June 24, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents

More code is entering the pipeline, but less of it is reaching production successfully. The bottleneck has moved from writing code to deciding whether code is safe to merge.

Third-party delivery data shows generation is not the wall — validation is, with feature throughput rising while main-branch throughput and success rates fall.

"feature branch throughput up 59% year over year, while main branch throughput for the median team actually fell" — CircleCI 2026 State of Software Delivery report
"main-branch throughput fell nearly 7%, and main-branch success rates dropped to 70.8%" — CircleCI 2026
"agentic AI PRs have a pickup time 5.3x longer than unassisted PRs. AI-assisted PRs wait 2.47x longer" — LinearB 2026 Software Engineering Benchmarks Report

Codacy: AI Is Breaking Code Review: How Engineering Teams Survive the PR Bottleneck June 24, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents

Precision metrics degrade because even high‑quality comments may be ignored simply due to volume.

Comment volume stops mapping to value once reviewers skim or bulk-dismiss, so review agents must be measured by load removed, not comments posted.

"Human reviewers are overwhelmed with feedback and cognitive load spikes." — same section
"Review behavior changes—comments are skimmed, bulk‑dismissed, or ignored" — same section
"You are no longer measuring how a tool performs in practice, but how reviewers cope with noise."

David Loker / CodeRabbit: How to evaluate AI code review tools: A practical framework January 9, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents

The bottleneck moves from generation to review queues, CI capacity, flaky environments, branch policy, cost ceilings, and the human attention needed to decide what should actually merge.

As agents get capable, the constraint shifts off code generation and onto the whole delivery surface — review bandwidth, CI, and human merge decisions.

"The model matters, but the delivery surface matters just as much."
"A team that cannot write crisp tasks will struggle to evaluate agents honestly."
"Reviewers do not need another wall of generated explanation. They need the shortest path to deciding whether the change should merge."

Developers Digest: AI Coding Agents Move the Bottleneck to Review Queues June 21, 2026 practitioner

Cited in Review Capacity Is the Real Ceiling on Your Agents

6 · Architecture Decisions

When agents help vs. hurt; single vs. multi-agent; build vs. buy.

Subagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent.

Subagents earn their place by isolating and compressing context — separate windows, not raw speed, are the reason to split.

agents typically use about 4× more tokens than chat interactions, and multi-agent systems use about 15× more tokens than chats
some domains that require all agents to share the same context or involve many dependencies between agents are not a good fit for multi-agent systems today
token usage by itself explains 80% of the variance

Anthropic: How we built our multi-agent research system June 13, 2025 data

Cited in When One Agent Stops Being Enough: The Isolation Gate
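
The token multipliers quoted above translate into a rough budget check. This is a sketch only: the source gives the figures as approximate ("about 4×", "about 15×"), and applying them to an arbitrary chat-sized baseline is an assumption made here for illustration.

```python
# Approximate multipliers relative to a chat interaction, from the quote above.
TOKEN_MULTIPLIER = {"chat": 1, "agent": 4, "multi_agent": 15}


def estimated_tokens(chat_tokens, mode):
    """Scale a chat-sized token budget to an agent or multi-agent run."""
    return chat_tokens * TOKEN_MULTIPLIER[mode]


def multi_agent_premium():
    """How many times more tokens a multi-agent run uses than a single agent."""
    return TOKEN_MULTIPLIER["multi_agent"] / TOKEN_MULTIPLIER["agent"]
```

On these figures, a task that would cost 10,000 tokens as a chat costs about 40,000 as a single agent and about 150,000 as a multi-agent system, so the split has to buy roughly 3.75 times the value of the single-agent run to break even on tokens.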

This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification.

Multi-agent failure is predominantly a system-design and coordination problem, not a model-quality problem, so the remedy is a readiness test rather than a model upgrade.

MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks
We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88).
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal.

Cemri, Pan, Yang, Agrawal, Chopra, Tiwari, Keutzer, Parameswaran, Klein, Ramchandran, Zaharia, Gonzalez, Stoica (UC Berkeley): Why Do Multi-Agent LLM Systems Fail? Submitted 17 Mar 2025; last revised 26 Oct 2025 (v3) data

Cited in When One Agent Stops Being Enough: The Isolation Gate

Mission-critical systems of record: Retain Buy as the primary option. Consider Make selectively for peripheral modules, extensions, or integration layers where the core system's integrity is not at risk.

Agentic AI shifts make-vs-buy by application type: commodity and differentiating apps move toward build, while regulated and mission-critical systems stay buy.

Commodity utilities: Default to Make. Evaluate Buy only where ecosystem integrations provide strong network value or where the firm's AI capability is below the viability threshold.
Where software development once required large teams working over months, small teams augmented by AI agents can now deliver functional applications in days or weeks.
AI-era Make demands skills in prompt engineering, agent orchestration, AI output validation, and governance of AI-generated artifacts.

David Klotz (IAAI, Media University Stuttgart): The Buy-or-Build Decision, Revisited: How Agentic AI Changes the Economics of Enterprise Software April 29, 2026 (arXiv:2604.26482v1) data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

Fully 60% of the life cycle costs of software systems come from maintenance, with a relatively measly 40% coming from development.

Maintenance dominates software lifecycle cost, and most of that maintenance is new enhancement work rather than bug-fixing.

During maintenance, 60% of the costs on average relate to user-generated enhancements (changing requirements), 23% to migration activities, and 17% to bug fixes.

David Wood (O'Reilly): The 60/60 Rule (ch. 34, 97 Things Every Project Manager Should Know) August 2009 (book publication) data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision
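
Combining the two percentages quoted above gives each activity's share of total lifecycle cost. The sketch below is arithmetic on the quoted averages, assuming the maintenance breakdown applies uniformly to the 60% maintenance share; the function and key names are illustrative.

```python
# Figures from the quotes above.
MAINTENANCE_SHARE = 0.60  # share of lifecycle cost spent on maintenance
MAINTENANCE_BREAKDOWN = {
    "enhancements": 0.60,
    "migration": 0.23,
    "bug_fixes": 0.17,
}


def lifecycle_shares():
    """Express each activity as a fraction of total lifecycle cost."""
    shares = {"development": 1 - MAINTENANCE_SHARE}
    for activity, fraction in MAINTENANCE_BREAKDOWN.items():
        shares[activity] = MAINTENANCE_SHARE * fraction
    return shares
```

On these numbers, enhancements account for about 36% of lifecycle cost, nearly as much as the 40% spent on initial development, while bug fixes account for only about 10%.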

AI adoption significantly increases individual productivity, flow, and job satisfaction. However, it also negatively impacts software delivery stability and throughput

AI helps the individual developer but hurts system-level delivery stability and throughput.

Unstable organizational priorities cause meaningful decreases in productivity and substantial increases in burnout.

DORA (Google Cloud): Accelerate State of DevOps Report 2024 2024 (page last updated April 13, 2026) data partial

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

the percentage of changed code lines (associated with refactoring) sunk from 25% of changed lines in 2021, to less than 10% in 2024, while lines classified as 'copy/pasted' (cloned) rose from 8.3% to 12.3%

AI-assisted development correlates with more code duplication and less refactoring, increasing long-term maintenance burden on code you own.

211 million changed lines from repos owned by Google, Microsoft, Meta, and enterprise C-Corps
4x more code cloning
'copy/paste' exceeds 'moved' code for first time in history

GitClear: AI Copilot Code Quality: 2025 Data Suggests 4x Growth in Code Clones January 2026 (research notation on page) data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

if AI adoption increases by 25%, estimated throughput delivery is expected to decrease by 1.5%

Individual AI productivity gains do not translate into system-level delivery throughput or stability, because code generation was never the bottleneck.

estimated delivery stability is expected to decrease by 7.2%
75.9% of respondents (of roughly 3,000 people surveyed) are relying on AI for at least part of their job responsibilities
if AI adoption increases by 25%, time spent doing valuable work is estimated to decrease 2.6%

Rachel Stephens (RedMonk): DORA Report 2024 – A Look at Throughput and Stability November 26, 2024 data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

Purchasing AI tools from specialized vendors and building partnerships succeed about 67% of the time, while internal builds succeed only one-third as often.

Most enterprise GenAI pilots fail; buying or partnering succeeds roughly three times as often as building internally.

95% failure rate for enterprise AI solutions
about 5% of AI pilot programs achieve rapid revenue acceleration
150 interviews with leaders, a survey of 350 employees, and an analysis of 300 public AI deployments

Sheryl Estrada (Fortune): MIT report: 95% of generative AI pilots at companies are failing August 18, 2025 data

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

Relative performance change compared to single-agent baseline ranges from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, demonstrating that architecture-task alignment determines collaborative success.

Whether a multi-agent split helps or hurts is decided by task decomposability — decomposable tasks gain sharply, sequential ones degrade sharply.

Across 260 configurations spanning six agentic benchmarks, five canonical architectures (Single-Agent and four Multi-Agent: Independent, Centralized, Decentralized, Hybrid), and three LLM families
The framework identifies the best-performing architecture for 87% of held-out configurations
architectures without centralized verification tend to propagate errors more than those with centralized coordination

Yubin Kim, Ken Gu, Chanwoo Park, et al. (MIT / Google): Towards a Science of Scaling Agent Systems Submitted 9 Dec 2025 data

Cited in When One Agent Stops Being Enough: The Isolation Gate

Three to five teammates is the sweet spot. Token costs scale linearly, and three focused teammates consistently outperform five scattered ones.

Fan-out has a practical ceiling around three to five focused agents; the real bottleneck becomes verification, not generation.

The bottleneck is no longer generation. It's verification.
Three focused agents consistently outperform one generalist agent working three times as long.
One agent can only hold so much information. Large codebases overwhelm a single context window.

Addy Osmani: The Code Agent Orchestra - what makes multi-agent coding work March 26, 2026 practitioner

Cited in When One Agent Stops Being Enough: The Isolation Gate

Use one when a side task would flood your main conversation with search results, logs, or file contents you won't reference again: the subagent does that work in its own context and returns only the summary.

Isolate a polluting side task in a subagent's own context first; that is the operational test for when a split is warranted, and it comes before splitting the whole job.

Each subagent runs in its own context window with a custom system prompt, specific tool access, and independent permissions.
Preserve context by keeping exploration and implementation out of your main conversation
Enforce constraints by limiting which tools a subagent can use

Anthropic (Claude Code Docs): Create custom subagents practitioner

Cited in When One Agent Stops Being Enough: The Isolation Gate

For a team of roughly 200 developers, an internal build typically costs around $1.4M in year one, requires 2–3 dedicated FTEs to maintain, and takes 12–18 months to reach a first real use case.

Building an internal agentic AI platform in regulated industries is a multi-year, multi-FTE commitment with governance surface most organizations underestimate.

Every engineer building the platform is an engineer not modernizing a legacy pipeline, remediating security debt, or accelerating a critical delivery program.
Building an internal agentic AI platform in banking or insurance is a multi-year platform engineering commitment with regulatory surface area most organizations underestimate

Bryan Ross (GitLab, Field CTO): The real cost of build vs. buy for agentic AI in regulated industries March 24, 2026 practitioner partial

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

The 2026 build-vs-buy question is less "can we afford to build it" and more "what happens to us if the vendor moves".

Cheaper agentic builds plus rising SaaS lock-in and repricing risk tilt the case toward owning differentiating workflows.

2,698 SaaS M&A transactions closed in 2025, up 28% year over year
68% of tech leaders plan vendor consolidation in 2026
organizations trapped in vendor lock-in face switching costs around 16 times higher

Digital Applied Team: Build vs Buy: The 2026 Case for Custom AI Tools July 1, 2026 practitioner partial

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

AI has dramatically reduced the cost of creating software, but it hasn't eliminated the cost of owning software.

AI lowers the cost to build software but not the ongoing cost of owning and operating it, which is where build-vs-buy now turns.

the last 20% (security, governance, observability, performance, reliability, data quality, change management) is still 80% of the effort
In 2026, most enterprises land on 'yes to both.' They buy the heavy core, build what differentiates, and use AI to accelerate the glue layer.
If the capability is your advantage, meaning revenue, margin, speed, or defensible differentiation (AI copilots, agentic workflows, decision support)

HatchWorks (Matt Paige): The Build vs Buy Framework in the Age of AI January 28, 2026 practitioner

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

If it's a core business function — do it yourself, no matter what.

Core, business-specific functions should be built in-house because that is where control and competitive advantage live.

There's no way it's going to be as flexible as what Amazon does with obidos, which they wrote themselves.
Pick your core business competencies and goals, and do those in house.

Joel Spolsky: In Defense of Not-Invented-Here Syndrome October 14, 2001 (update noted December 5, 2016) practitioner

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

for a strategic function you don't want the same software as your competitors because that would cripple your ability to differentiate.

Strategic, differentiating software should be built while commodity utility software should be bought, and the two demand different postures.

The 80/20 rule applies, except it may be more like 95/5
This is not a static dichotomy. Business activities that are strategic can become a utility as time passes.
For a utility function you buy the package and adjust your business process to match the software.

Martin Fowler: Utility Vs Strategic Dichotomy July 29, 2010 (updated April 7, 2016) practitioner

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

With such a layer in place, the build-versus-buy question fragments, and CIOs might buy a vendor's persona agent, build a specialized risk-management agent, purchase the foundation model, and orchestrate everything through a platform they control.

The industry consensus has shifted to hybrid: assemble build and buy across the AI stack under an orchestration layer you control.

Six months ago many were experimenting, but now they're scaling.
including cases where a senior executive's data surfaced in a junior employee's query.

Pat Brans (CIO.com): Your next big AI decision isn't build vs. buy — It's how to combine the two December 11, 2025 practitioner partial

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

The factor that stands out most to me is that these developers were all working in repositories they have a deep understanding of already, presumably on non-trivial issues since any trivial issues are likely to have been resolved in the past.

AI's edge is smallest exactly where you own and deeply understand a mature codebase long-term.

56% had never used Cursor before the study
Developers accepted less than 44% of AI generations
A quarter of the participants saw increased performance, 3/4 saw reduced performance

Simon Willison: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity July 12, 2025 practitioner

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

it's not about getting work done faster, it's about being able to ship projects that I wouldn't have been able to justify spending time on at all.

AI's clearest payoff is enabling marginal projects that were never worth building before, not accelerating core work.

I'm certain it would have taken me significantly longer without LLM assistance—to the point that I probably wouldn't have bothered to build it at all.

Simon Willison: Here's how I use LLMs to help me write code March 11, 2025 practitioner

Cited in Build vs. Buy Agentic AI: Ownership Is the New Decision

I can only focus on reviewing and landing one significant change at a time, but I'm finding an increasing number of tasks that can still be fired off in parallel without adding too much cognitive overhead to my primary work.

Human review-and-land throughput — one significant change at a time — is the real ceiling on how far parallel agents scale.

Code that started from your own specification is a lot less effort to review.

Simon Willison: Embracing the parallel coding agent lifestyle October 6, 2025 practitioner

Cited in When One Agent Stops Being Enough: The Isolation Gate

Actions carry implicit decisions, and conflicting decisions carry bad results.

When a parallel split backfires, the cause is conflicting implicit decisions between agents, not weak model quality.

Share context, and share full agent traces, not just individual messages
At the core of reliability is Context Engineering
The simplest way to follow the principles is to just use a single-threaded linear agent

Walden Yan (Cognition): Don't Build Multi-Agents June 12, 2025 practitioner

Cited in When One Agent Stops Being Enough: The Isolation Gate

multi-agent systems work best today when writes stay single-threaded and the additional agents contribute intelligence rather than actions

Only split once writes can stay single-threaded and added agents are read-only intelligence — parallel-writer swarms still fail.

most multi-agent setups in the world are limited to 'readonly' subagents
The practical shape is map-reduce-and-manage: a manager splits work, children execute, the manager synthesizes
an average of 2 bugs per PR, of which roughly 58% are severe

Walden Yan (Cognition): Multi-Agents: What's Actually Working April 22, 2026 practitioner

Cited in When One Agent Stops Being Enough: The Isolation Gate
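
The "map-reduce-and-manage" shape with single-threaded writes, described in the entry above, can be sketched as follows. This is a toy under stated assumptions, not Cognition's implementation: `explore` stands in for a read-only subagent call, `workspace` stands in for the repository, and the worker count of three is a hypothetical choice.

```python
from concurrent.futures import ThreadPoolExecutor


def explore(subtask):
    """Hypothetical read-only child: returns findings and never writes."""
    return f"findings for {subtask}"


def manage(task, subtasks, workspace):
    """Manager splits the work, children execute, manager synthesizes and writes."""
    # Map: children run in parallel, each contributing intelligence only.
    with ThreadPoolExecutor(max_workers=3) as pool:
        findings = list(pool.map(explore, subtasks))  # input order is preserved

    # Reduce: the manager condenses the children's findings.
    summary = "; ".join(findings)

    # Manage: the only write happens here, on the manager's single thread.
    workspace.append((task, summary))
    return summary
```

Because no child ever touches `workspace`, two children cannot make conflicting edits; any disagreement between them surfaces in the manager's synthesis step, where it can be resolved before anything is written.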