The State of AI Coding Agent Engineering

Frontier agents are getting more capable on a schedule you can graph: the length of task they finish autonomously has been doubling roughly every seven months for six years (Kwa, West, Becker et al., METR ). And yet, in the one controlled trial that measured it, experienced developers were 19% slower with AI on code they knew well — while believing they’d been sped up by 20%. Capability went up. Delivered throughput went down. Both numbers are real, and the gap between them is the whole story.

That gap is where this report lives. The cost of building software didn’t disappear when generation got cheap — it moved downstream, into the work that was always expensive and is now the bottleneck: verifying output you didn’t write, budgeting the context an agent reads, containing what it’s allowed to touch, absorbing its diffs through a review pool that didn’t grow, and owning the code for the five years after it merges. Every serious problem in agentic engineering right now is a facet of that one shift.

The SME move follows directly: stop optimizing generation, and start engineering the system around it. This report is the argument for that, built from the strongest primary evidence across six pillars — the synthesis behind eleven essays and a verified index of 101 sources .

The six pillars, in one argument

Task Design & Decomposition — Sizing an agent’s work is a context problem, not a line-count problem; oversight belongs on the rare irreversible action, not the average one.
Context Engineering — More context is not free reliability. Instruction-following degrades to 68% at 500 instructions , and models don’t read a long window uniformly .
Evals & Verification — Generation outran verification. A green build isn’t trust when 45% of agent PRs describe changes the code never made and every major agent benchmark can be gamed to a near-perfect score .
Production Operations — Safety comes from constraining what an agent can reach, not watching what it does — and you can’t cap a cost you can’t attribute to a task.
Team & Process — Reviewer-hours are the real ceiling on how many agents you can run, and the burden lands on the seniors you can least afford to lose. (Read this one in full below.)
Architecture Decisions — Split to many agents for context isolation, not speed; and build-vs-buy is now a five-year ownership call, because 60% of a system’s lifecycle cost is maintenance .

Pillar 5 — Team & Process: reviewer-hours are the real ceiling

Here’s one pillar in full, so you can judge the rest.

You didn’t hit a correctness wall when you gave your team five agents — you hit a reviewer-throughput wall. The thing that decides how many agents your team can actually run is not agent output quality; it’s how much your reviewers can absorb before the queue backs up and people start rubber-stamping. That number is finite, it’s measurable, and most teams have never once looked at it.

The number is already in your telemetry. Across two years of data on 22,000 developers, Faros AI found median time in review up 441.5%, with unreviewed merges up 31.3% . Review time doesn’t rise 441% because the code got worse — it rises because more of it arrives at a pool that didn’t grow. When agent output climbs and the reviewer denominator stays fixed, the queue is the constraint, and the review queue becomes the binding constraint on the whole delivery pipeline (Monperrus ).

The reflexive fix — bolt on a review agent — makes it worse if you measure it wrong. Studied across thousands of real PRs, code-review-agent-only PRs merged at 45.20%, 23 points below human-reviewed PRs (68.37%) (Chowdhury et al. ), because most of their comments weren’t the signal a reviewer needed. And the cost concentrates on your best people: Faros describes senior engineers becoming the verification layer , spending “their most valuable hours unraveling plausible-looking code” — a burden that “does not get measured in PR throughput dashboards, which is precisely why the cost is invisible until it isn’t.”

The judgment: budget review as a finite, protected resource, and split the delegable half from the half that can’t move. Line-by-line inspection is mechanical — delegate it to agents. Deciding what “correct” means for the product, and owning what shipped, is judgment that relocates rather than automates — it stays human, and named. Do the reviewer-hours math before you approve agent number seven, because past the crossover every added agent produces queue depth, not throughput. (The long version is Review Capacity Is the Real Ceiling on Your Agents .)

That’s the shape of every pillar: a measurable constraint, the strongest evidence for where it bites, and what to actually do about it.

The research index stays free and public — 101 verified primary sources, every row a verbatim figure or quote, grouped by these same six pillars. That’s the evidence appendix. The report above is the synthesis: the argument the evidence adds up to, and the judgment I’d want if I were putting agents into production on a real team.