The Efficient Frontier of Agentic Pipelines: a scientific approach
A field report applying a scientific lens to a practical question: where is the efficient frontier of AI in software engineering today? This is not a copy-paste method. It is a set of principles, drawn from information theory and control theory and tested against 600 million tokens of real runs, meant to help engineering leaders judge where agentic AI pays and where it does not. The theory does not prove the conclusions; the runs do. The theory explains why they hold.
The biggest challenge with AI today is locating the frontier line: the sweet spot between fanatics and detractors, between marketing and a rational fear of something genuinely new.
The industry is selling agentic pipelines as the ultimate tool for every company that wants to survive past 2026. The promise is old: for decades the IT industry has tried to automate itself out of existence, chasing the tool that would make time to market collapse. Today's candidate is a chain of autonomous, specialized LLMs able to replace an entire engineering team, starting from a good set of specifications.
There are plenty of demos, frameworks and companies built on this premise. They are genuinely impressive. The question nobody answers is the practical one: not whether agentic pipelines can work in a demo, but how to make them work in real life, and where it is worth even trying.
I burned 600 million tokens in five days chasing that answer, precisely to avoid burning a budget chasing the marketing instead.
This is not a detractor's piece. I went in wanting these pipelines to work, ran them hard, and came back with a map of where they pay and where they do not: the efficient frontier as it stands today, what works, what does not, and the scientific principles that help explain why. Engineering leaders moving AI into their teams deserve a map, not a pitch deck.
If the full story is too long, jump straight to "What actually works" near the end of the article.
What I was trying to build
The idea was straightforward on paper: feed a natural-language Product Requirement Document into a chain of specialized agents and get back production-ready, tested, documented code, with no human in the loop for the intermediate steps. The pipeline:
- Architect Agent reads the PRD, defines interfaces and project structure, breaks the work into stories and tasks for BDD and TDD, and produces a roadmap synchronizing backend and frontend around a progressively scoped MVP. The closest thing to a well-run human team, on paper.
- Coding Agent writes the code under a strict ruleset, following TDD. I constrained it to a functional style, on the hypothesis that pure functions and explicit boundaries would reduce context bleed between tasks. In practice the results were mixed and I have no clean proof it helped.
- Self-reflection inside the Coding Agent: it reviewed its own output before committing each task. This is one of the best-supported techniques in the literature: an 11-point gain on HumanEval, from 80% to 91% pass@1 ("Reflexion", Shinn et al., NeurIPS 2023), and roughly 20% on average across diverse tasks with iterative self-refinement ("Self-Refine", Madaan et al., 2023).
- Quality Agents handled security review, code quality, and visual testing.
- Human evaluation at the end.
It sounded reasonable. It did not work as expected.
The concrete setup was Opus 4.6 as the Architect, Claude Design and V0 for frontend specs and React mockups, and Composer 2 for the coding and quality agents. Over five days of intensive runs (launching tasks, reviewing outputs, adjusting prompts, repeating) I burned roughly 600 million tokens of inference. I did not go through the API, which would have been ruinous, but through subscription tooling that gave me the volume without the bill.
Where it broke, and why
Six failure modes follow. For each one: what I observed, the mechanism behind it, and what mitigates it. The mechanisms matter more than the anecdotes, because they are what make these failures reproducible rather than bad luck.
1. The collapse under fragmentation
The single clearest pattern across the whole experiment: the more I broke a project into separate tasks, the worse the result got. Not worse in one dimension, worse in all of them at once. Bugs, drift from the spec, components that contradicted each other, and correction loops where the pipeline kept trying to patch its own hallucinated implementation and dug deeper instead. The finer the decomposition, the faster it fell apart.
Part of this is the context degradation that the next sections cover, but the deeper cause is a leverage problem. A specification is short and the code it implies is long, on the order of 1 to 1000, not as a measurement but as a magnitude. A small task hands the model maybe one part of intent and asks it to produce a thousand. The other 999 come from its statistical prior, the average of every codebase it was trained on, so a small task is mostly the model inventing. Make the task bigger and the ratio shifts, say 5 to 1000: now your five parts carry more weight in fixing what the output becomes, and less is left to the prior. Bigger tasks are not just more efficient, they are more determined by you.
The second force runs the other way. The smaller you cut a task, the more you strip its surroundings. Ask the model to build a login form and it knows what that is, it brings the whole pattern with it. Ask it to build this one text field, then this one button, each as an isolated task, and it has lost the picture they belong to, so it fills the missing context with a fresh guess every time, and the guesses do not agree with each other. Isolation does not just remove information, it manufactures wrong information.
These two forces compound as the project grows. Early on, a small task can still be read against the whole codebase, so the model can reconstruct some of its context. Past a certain size the codebase no longer fits in the window, and a small task can never see enough of the system to be placed correctly. So this is not only a specification problem: even a perfect spec cannot rescue a task cut too small to carry its own context. Too little intent going in, too little environment around it.
There is a formal intuition underneath this, though not a proof. Kolmogorov complexity is the length of the shortest program that produces a given output ("Kolmogorov complexity", Kolmogorov, 1965). It is uncomputable, so you cannot turn it into a measurement, but it gives the right mental picture: a system has some irreducible amount of information that has to be specified somewhere, and whatever the spec does not pin, the model fills in by sampling rather than deriving. Sampling is where the bugs and the drift come from.
Mitigation: make tasks as large as the context window efficiently allows, and no smaller. Work at the level of a story, a whole feature described in one place, rather than a checklist of fragments. There is no clean formula for the ceiling; it is an empirical limit you find by pushing tasks wider until coherence starts to drop. The instinct to decompose, the one good engineers have trained for years, is exactly the instinct that breaks here.
2. Chain depth destroys signal
The last section was about task size. This one is about how many agents you chain to do a task, which turns out to follow opposite rules in two directions.
Chaining agents in sequence made things worse. Each agent hands its output to the next, and each handoff is a chance to lose or bend the original intent. The arithmetic is unforgiving: if one handoff keeps intent intact 90% of the time, two keep it 0.9 × 0.9 = 81% of the time, and five keep it about 59%. Errors compose, they do not cancel. This is the one place the theory is not just an analogy: when each agent sees only the previous agent's output and not the original source, the chain is Markovian, and Shannon's Data Processing Inequality applies directly ("Data processing inequality", Shannon, 1948). No step can hold more information about the original than the step before it, so processing can only lose signal, never restore it. A chain of N agents makes N−1 handoffs, and every one of them leaks.
It leaks twice, in fact, because the receiving agent also reads imperfectly. Liu et al. found that LLMs attend to long contexts unevenly: move the key information into the middle of the prompt and accuracy drops by double digits ("Lost in the Middle", Liu et al., 2023). Intent is lost once when an agent compresses it into a handoff, and again when the next agent fails to fully read it.
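The handoff arithmetic can be sketched in a few lines of Python. This is a minimal illustration, assuming every handoff keeps intent intact independently and with the same probability. The function name `intent_survival` and the separate `read_fidelity` parameter are hypothetical, introduced here only to show how the two leaks compound.

```python
def intent_survival(write_fidelity: float, handoffs: int,
                    read_fidelity: float = 1.0) -> float:
    """Probability that the original intent survives a serial chain.

    Assumption (illustrative only): each handoff is independent, loses
    intent once when the sender compresses it (write_fidelity) and once
    when the receiver reads it imperfectly (read_fidelity).
    """
    if handoffs < 0:
        raise ValueError("handoffs must be non-negative")
    for value in (write_fidelity, read_fidelity):
        if not 0.0 <= value <= 1.0:
            raise ValueError("fidelities must be probabilities in [0, 1]")
    # Independent losses multiply at every handoff.
    return (write_fidelity * read_fidelity) ** handoffs


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, "handoffs:", round(intent_survival(0.9, n), 2))
```

With a 90% handoff and perfect reading this reproduces the figures above (0.81 after two handoffs, about 0.59 after five). Setting `read_fidelity` below 1.0 shows how quickly the second leak drags the curve down further.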
Running agents in parallel does the reverse, but with a caveat that turns out to matter a lot. Several independent reviewers, each more often right than wrong, get more reliable together as you add them: that is Condorcet's Jury Theorem from 1785 ("Condorcet's jury theorem"). The theorem rests on two assumptions, that each voter beats chance and that the voters are independent. LLM agents fail the second one. Instances of the same model, or models trained on overlapping data, have correlated errors: when they are wrong, they tend to be wrong in the same way, so a second and third opinion add less than Condorcet's idealized count would promise. This is exactly why the parallel gain decays so fast. The empirical work bears this out without leaning on the theorem: Du et al. found multi-agent debate helps, with returns that flatten after a few agents ("Improving Factuality and Reasoning", Du et al., 2023), and self-consistency shows the same shape, more sampled reasoning paths helping less each time ("Self-Consistency", Wang et al., 2022). Condorcet explains the direction of the effect; the correlated errors explain its ceiling.
So the two directions are not symmetric. Depth in series compounds loss without limit. Width in parallel compounds confidence toward a ceiling. The design follows directly: keep the chain as short as possible, and add agents in parallel only up to the point where they still pay. Because the parallel gain decays toward zero while cost and latency keep rising, that point arrives early. In practice the magic number is two: one agent that writes, with self-reflection on its own output, and one that reviews. Past two, you are mostly paying for confidence you already have.
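The shape of the parallel gain, and its ceiling, can be made concrete with a toy calculation. The independent case is the standard binomial form of Condorcet's theorem. The correlated case is an assumption made here for illustration, not a model taken from the cited papers: with probability `rho` all reviewers share one draw and are right or wrong together, otherwise they vote independently. Both function names are hypothetical.

```python
from math import comb


def majority_correct(p: float, n: int) -> float:
    """P(majority is right) for n independent voters, each right with prob p."""
    if n < 1 or n % 2 == 0:
        raise ValueError("use an odd, positive number of voters")
    needed = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(needed, n + 1))


def majority_correct_correlated(p: float, n: int, rho: float) -> float:
    """Toy mixture model (an assumption, not a measured result).

    With probability rho every voter copies one shared draw, so the
    panel is only as good as a single voter; otherwise votes are
    independent and Condorcet's count applies.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    return rho * p + (1 - rho) * majority_correct(p, n)


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n,
              round(majority_correct(0.7, n), 3),
              round(majority_correct_correlated(0.7, n, 0.5), 3))
```

Under this toy model the correlated panel can never exceed `1 - rho * (1 - p)`, however many reviewers are added, while the independent panel keeps climbing toward 1. The specific values of `p` and `rho` are illustrative and were not measured in the runs described here.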
3. Validation only converges against a measurable signal
I expected the self-reflection and validator agents to clean up the output. They did, but only under one specific condition, and finding that condition was the useful part.
The rule is simple: iteration improves the work only when it has a measurable signal to push against. A failing test count is such a signal, every pass either lowers it or visibly does not. Reflexion gets its 11-point gain precisely because it reflects on test results, a ground truth the model cannot argue with (Shinn et al., 2023). Remove the external signal and the opposite happens: Huang et al. showed that a model correcting itself on its own judgment alone often makes the answer worse, not better ("Large Language Models Cannot Self-Correct Reasoning Yet", Huang et al., 2023). There is a clean analogy in control theory. Lyapunov's 1892 work on dynamical systems ("Lyapunov stability") gives the condition for a feedback system to settle toward a target: some measure of the error has to shrink at every step. He was describing physical systems, not LLM validation, but the framing transfers well. "Does this match what the product manager meant" is not a measurable error, so loops anchored to intent have nothing that provably shrinks, and they wander instead of closing. Mine did exactly that: anchored to tests they converged, anchored to judgment they circled.
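A loop anchored to a measurable signal can be sketched as follows. This is a minimal illustration, not the pipeline used in the runs: `count_failures` and `revise` are hypothetical callables standing in for a test runner and a generating agent, and the stopping rule (accept a revision only if the failing count strictly drops) is one simple way to borrow the Lyapunov framing.

```python
from typing import Callable


def refine_until_stalled(
    candidate: str,
    count_failures: Callable[[str], int],
    revise: Callable[[str, int], str],
    max_rounds: int = 5,
) -> tuple[str, int]:
    """Iterate only while the failing-test count strictly decreases."""
    best, best_failures = candidate, count_failures(candidate)
    for _ in range(max_rounds):
        if best_failures == 0:
            break  # target reached
        attempt = revise(best, best_failures)
        attempt_failures = count_failures(attempt)
        if attempt_failures >= best_failures:
            break  # no measurable progress: stop instead of circling
        best, best_failures = attempt, attempt_failures
    return best, best_failures


if __name__ == "__main__":
    # Stand-in stubs: "failures" fall as the candidate grows to 3 chars.
    failures = lambda code: max(0, 3 - len(code))
    improve = lambda code, _failures: code + "x"
    print(refine_until_stalled("", failures, improve))
```

The point of the design is what is absent: there is no branch where the loop continues on the model's own opinion of its output. Without a number that can go down, the loop has no reason to run.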
This splits validation cleanly in two. Deterministic checks have an external, complete reference: style guide, dependency rules, known vulnerability patterns, the test suite itself. Those should be industrialized, and they are the ideal job for a separate validator agent, because independence is pure upside there, it parallelizes cheaply and avoids the self-preference bias that makes models rate their own style of output too highly ("LLM Evaluators Recognize and Favor Their Own Generations", Panickssery et al., 2024). Conformance to intent has no such reference. It cannot be delegated to another agent, because that agent only receives the task description, already a compressed copy of the goal, and would validate against a degraded reference. It stays with the human, the one judge who holds the full intent.
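The split can be written down as a routing rule. The sketch below is illustrative only: the `Check` and `Reviewer` types and the example check names are hypothetical, and the single boolean is a deliberate simplification of "has an external, complete reference".

```python
from dataclasses import dataclass
from enum import Enum


class Reviewer(Enum):
    VALIDATOR_AGENT = "validator_agent"  # independent, parallelizable
    HUMAN = "human"                      # holds the full intent


@dataclass(frozen=True)
class Check:
    name: str
    # True when the check compares against something written down and
    # complete: a style guide, dependency rules, a test suite.
    has_external_reference: bool


def route(check: Check) -> Reviewer:
    """Send deterministic checks to an agent, intent checks to a human."""
    if check.has_external_reference:
        return Reviewer.VALIDATOR_AGENT
    return Reviewer.HUMAN


if __name__ == "__main__":
    checks = [
        Check("style guide", True),
        Check("known vulnerability patterns", True),
        Check("matches what the product manager meant", False),
    ]
    for check in checks:
        print(f"{check.name} -> {route(check).value}")
```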
4. Non-determinism and the missing referee
The same input produced structurally different code across runs. Different module boundaries, different state handling, different error strategies. Two runs of the same pipeline on the same PRD were two different codebases that happened to pass the same tests.
This is by design, not a bug. An LLM is a probability distribution over what comes next, and generation samples from it. Every decision the spec leaves open, the same gap from the first section, is a fork in the road, and the pipeline takes thousands of forks per feature. Two runs are two different paths through a tree the spec never pruned. The cost lands on maintainability: the next run does not inherit the previous run's implicit choices, so the consistency that refactoring depends on erodes run by run.
The harder problem is the referee. A human engineer judges code against a huge unwritten context: team conventions, an architecture decision someone made in 2021 and never documented, what "done" actually means here. Polanyi called this tacit knowledge: we know more than we can tell ("The Tacit Dimension", Polanyi, 1966). It is exactly the information the spec did not encode, it does not fit in a context window, and most of it lives in no document a model could read. The model is aiming at a target it cannot see, and the miss only becomes visible at review, the most expensive moment to find it.
Mitigation: stop pinning the internals, pin the behavior. You do not need two runs to produce the same code, only a component that behaves the same way at its boundary. Freeze the contract and the observable behavior, then treat the inside as a black box: if it has four legs and barks, it is a dog, and that is enough. An executable contract plus a test suite that asserts behavior turns non-determinism from a correctness problem into an implementation detail. Then budget review for the behavior the tests do not yet capture, because some of that gap is irreducible.
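Pinning behavior rather than internals can look like this in practice. The example is a minimal sketch with hypothetical names (`SlugMaker`, `slug_v1`, `slug_v2`, `check_contract`): two structurally different implementations, standing in for two runs of a pipeline, are accepted by the same frozen contract and the same behavioral assertions.

```python
import re
from typing import Protocol


class SlugMaker(Protocol):
    """The frozen contract: a title goes in, a URL slug comes out."""

    def __call__(self, title: str) -> str: ...


def slug_v1(title: str) -> str:
    # "Run 1": split on whitespace, then join.
    return "-".join(title.lower().split())


def slug_v2(title: str) -> str:
    # "Run 2": different internals, regex-based.
    return re.sub(r"\s+", "-", title.strip().lower())


def check_contract(make_slug: SlugMaker) -> None:
    """Behavioral assertions at the boundary; internals are a black box."""
    assert make_slug("Hello World") == "hello-world"
    assert make_slug("  Many   spaces ") == "many-spaces"
    assert make_slug("") == ""


if __name__ == "__main__":
    for implementation in (slug_v1, slug_v2):
        check_contract(implementation)
    print("both implementations satisfy the contract")
```

Any behavior these assertions do not cover (punctuation, non-ASCII titles) is exactly the gap that still needs human review.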
5. Layering beats vertical slicing
Vertical slicing, one feature end to end across frontend and backend, is how well-run human teams divide work. It produced my worst results. Working one layer at a time did clearly better.
The reason connects back to the first section. An end-to-end slice makes frontend, backend, data model and tests share one context budget, which forces each piece to be cut small to fit, and small pieces are exactly what collapse under fragmentation. Staying within one layer spends the whole budget on a single consistent vocabulary, so the model holds the full picture of what it is building instead of a quarter of four pictures at once.
What makes this safe is that layers talk through contracts, and a contract is a small, stable, human-reviewable artifact. Design the API surface first, by hand, with GraphQL schemas or OpenAPI specs that are executable and checkable. A frontend constrains the API it needs, which is exactly why contract design belongs at the start and has to be a human decision; GraphQL eases the residual coupling by letting the frontend ask for the shape it wants at query time.
The same logic runs between backend services, and it points at the real unit: the bounded context ("Bounded Context", Evans / Fowler, 2003), a perimeter with its own vocabulary and clear edges, small enough that the model can hold all of it at once. That is a constraint on what must fit in the window, not on how big a task is. Inside it, the task should be as large as section one argued, a whole story in one place. Sometimes the perimeter is a microservice, often just a module inside one.
One caveat, and a large one: all of the above is greenfield. Brownfield is a different problem. There the model must first reconstruct behavior that already exists, across code nobody fully remembers, and these rules bend in ways that deserve their own treatment.
6. The cost curve crosses early
Serial chains multiply inference: every stage re-reads what the stages before it produced, so total tokens grow faster than the number of stages, and every correction loop that does not converge multiplies the bill again. On the other side of the ledger, the quality gains follow the diminishing-returns curve from section two, each added agent helping less than the last. Rising cost against falling benefit means the two curves cross. In my runs they crossed early, usually after one generator and a small set of parallel critics.
Time behaves the same way. A five-stage serial chain means five inference passes before any human sees output, with loop iterations stacked on top. A senior engineer waiting on a pipeline is a cost too.
Then there is pricing. I ran this on subscription tooling precisely because the API math does not work: at consumption pricing the same volume costs orders of magnitude more, for results comparable to or only marginally better than disciplined single-agent use. The business case for autonomous agentic pipelines at API pricing does not close in most scenarios, and nothing published since I first wrote that sentence has contradicted it.
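The faster-than-linear token growth of serial chains described at the start of this section can be sketched with a toy model. The assumptions are illustrative, not measured: every stage re-reads the spec plus everything earlier stages produced, every stage writes the same amount, and there are no correction loops. The function name and the token figures are hypothetical.

```python
def serial_chain_tokens(stages: int, spec_tokens: int,
                        output_tokens_per_stage: int) -> int:
    """Total tokens processed by a serial chain under the toy model."""
    if stages < 1:
        raise ValueError("a chain needs at least one stage")
    total = 0
    context = spec_tokens  # what the next stage has to read
    for _ in range(stages):
        total += context + output_tokens_per_stage  # read, then write
        context += output_tokens_per_stage          # output joins the context
    return total


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(n, "stages:", serial_chain_tokens(n, 1_000, 2_000))
```

With a 1,000-token spec and 2,000 tokens written per stage, one stage processes 3,000 tokens and five stages process 35,000: five times the stages for more than eleven times the tokens, before any non-converging loop multiplies the bill again.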
What actually works
The failures above draw a boundary line, and the territory inside it is large. Five practices survived the 600 million tokens. Each is the mirror image of one failure mode.
Specs that carry decisions. The method now has a name, spec-driven development: a Specify, Plan, Tasks, Implement workflow with a human checkpoint at the end of each phase ("Spec-driven development"). GitHub's open-source Spec Kit passed 90,000 stars by May 2026 ("Spec Kit"), and its "constitution", a persistent ruleset fed to the agent at the start of every session, is the artifact I had arrived at on my own before I knew its name. One caution from the people who watched waterfall die the first time: over-formalized specs slow change and feedback the same way 1990s process did ("Spec-driven development", Thoughtworks, 2025). Specify the 20% where a wrong decision is expensive. Leave the rest to the loop.
Story-sized tasks inside bounded contexts. One layer at a time, the task as large as a whole story, the bounded context small enough to fit the window. Contracts designed by humans first and frozen, GraphQL or OpenAPI as the executable boundary. Features get assembled from layers; they are not the unit of generation. Greenfield only; brownfield is its own problem.
Short chains, parallel critics. One generating agent holding the full context. Review fans out in parallel, where Condorcet works for you, instead of stacking in series, where Shannon works against you.
Validation routed by determinism. Independent agents for style, security patterns, dependency rules and test execution: the reference is external and complete, the parallelism is cheap, and independence removes the self-preference bias. Self-reflection inside the generator, anchored to executable acceptance criteria. Intent conformance stays with the human, the only judge holding the tacit context.
IDE-first as the default. The autonomous factory earns its complexity only on high-volume, pattern-shaped work: migrations, codemods, test backfills. For everything else, a developer in the loop with one strong agent, plus on-demand validation, wins on both cost and outcome. The frontier confirms the shape. Anthropic reports that more than 80% of the code merged into its own codebase is now written by Claude, with engineers shipping roughly eight times more code than in 2024, and a human still gating every merge ("When AI builds itself", Anthropic, 2026). And the strongest available model resolves about 64% of tasks on SWE-bench Pro, the contamination-resistant benchmark built from real multi-file problems ("SWE-bench Pro leaderboard", 2026).
And the smallest rule of all: for four changes, two well-written prompts beat building a factory. Knowing when to industrialize and when not to is itself the judgment this whole article is about.
The map
The efficient frontier today is narrower than the marketing and wider than the skepticism. Autonomous chains that expand a thin spec into a full system sit outside it. The old results from Shannon, Condorcet, Kolmogorov and Lyapunov do not prove that, my runs did; but they give it a language and explain why the failures were structural rather than bad luck. Inside the frontier: humans deciding, contracts frozen early, story-sized tasks in bounded contexts, deterministic validation industrialized, and judgment kept where the information actually lives.
I burned 600 million tokens learning what the theorems already hinted at. This article is the cheaper path.