1 billion tokens later: what agentic pipelines actually do

I spent 1B tokens trying to find the real state of the art in the agentic AI software development lifecycle (SDLC) landscape, looking past both the hype and the skepticism. This article covers what I learned, what works, and what doesn't, so you don't have to run the experiment yourself.

I am not an AI skeptic. I use AI every day. I have rewritten workflows around it, and I have seen it genuinely change how fast a team can move. But I have also spent months and roughly a billion tokens trying to build something the industry keeps promising is just around the corner: a pipeline of AI agents that autonomously handles software development end to end. What I found is that the gap between the demo and the reality is not a bug. It is an architectural property of how these models work.

This is not a takedown. It is a field report from someone who tried to find the sweet spot between the fanatics and the detractors, and ended up with a more nuanced position than either camp would like.

What I was trying to build

The idea was straightforward on paper. Take a Product Requirements Document (PRD) written in natural language, feed it into a chain of specialized agents, and get back production-ready, tested, documented code. No human in the loop for the intermediate steps. An Architect agent would read the PRD and define the file structure and interfaces. A Developer agent would generate the code. An automated test suite would run. A feedback loop would send error logs back to the Developer for correction.
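To make that control flow concrete, here is a minimal sketch, not the actual implementation. The names `run_pipeline`, `fake_architect`, `fake_developer`, and `fake_tests` are hypothetical, and the stubs return canned values in place of model calls and a real test suite so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SuiteResult:
    passed: bool
    log: str


def run_pipeline(
    prd: str,
    architect: Callable[[str], str],
    developer: Callable[[str, str], str],
    run_tests: Callable[[str], SuiteResult],
    max_rounds: int = 3,
) -> tuple[str, bool, int]:
    """PRD -> design -> code -> tests, feeding error logs back to the developer."""
    design = architect(prd)      # file structure and interfaces
    feedback = ""                # no error log on the first round
    code = ""
    for round_no in range(1, max_rounds + 1):
        code = developer(design, feedback)
        result = run_tests(code)
        if result.passed:
            return code, True, round_no
        feedback = result.log    # error log goes back for correction
    return code, False, max_rounds


# Hypothetical stubs standing in for model calls and a real test suite.
def fake_architect(prd: str) -> str:
    return f"design for: {prd}"


def fake_developer(design: str, feedback: str) -> str:
    return "fixed code" if feedback else "buggy code"


def fake_tests(code: str) -> SuiteResult:
    ok = code == "fixed code"
    return SuiteResult(passed=ok, log="" if ok else "AssertionError in test_login")


code, ok, rounds = run_pipeline("login form", fake_architect, fake_developer, fake_tests)
print(ok, rounds)
```

With these stubs the loop converges on the second round. The sections below are about what happens when the real agents do not behave like the stubs.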

I also built context-awareness into the system, allowing agents to read parts of the existing codebase to maintain consistency with current dependencies and standards.

It sounded reasonable. It did not work as expected.

Where it broke

The failures were not random. They were consistent, reproducible, and each one pointed to the same underlying problem.

Context saturation on complex tasks. As the scope grew beyond a single isolated function, agents lost the coherent view of the system. They produced code that worked in isolation but broke dependencies elsewhere. This is the known context window problem, but the real insight is more subtle: it is not just about having too much context. It is also about having too little.

When you break a project into very small tasks and pass them sequentially through a pipeline, each agent receives a fragment of the original specification. The model works by completing what it is given, filling the gap between the 10% you specify and the 100% it needs to produce. When that 10% is fragmented into a hundred micro-tasks, each agent receives 0.1% of the original intent. That is not enough semantic signal to reason about what you are actually building. If you ask an AI to build a login form, it understands the problem. If you ask it to first create two input fields, then a button, then an API call, it has lost the context that makes those pieces coherent. The individual tasks are valid. The system they produce is not.
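The arithmetic behind that claim is simple enough to write down. This sketch treats the 10% and the hundred micro-tasks as the round illustrative figures they are, and assumes the specification is split evenly across tasks, which is a simplification.

```python
def intent_share_per_task(specified_fraction: float, num_tasks: int) -> float:
    """Share of the full solution that each task's spec carries, assuming an even split."""
    return specified_fraction / num_tasks


share = intent_share_per_task(0.10, 100)
print(f"{share:.1%}")  # 0.1%
```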

Validation overhead that erases the productivity gain. The time a senior engineer needed to verify that the generated code was safe, logically sound, and free of subtle bugs was equal to or greater than the time that same engineer would have spent writing it from scratch. This is the metric that matters most for any build-versus-buy calculation, and it is the one the demos never show.
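One way to frame that metric is as a net saving per task. The sketch below is an assumption about how to do the bookkeeping, and the hours are invented purely to show the sign of the result, not measurements from the experiment.

```python
def net_hours_saved(write_hours: float, review_hours: float, prompting_hours: float = 0.0) -> float:
    """Positive means the pipeline saved time; zero or negative means it did not."""
    return write_hours - (review_hours + prompting_hours)


# Invented figures: 4 hours to write by hand, 4.5 hours to verify generated code.
saved = net_hours_saved(write_hours=4.0, review_hours=4.5)
print(saved)  # -0.5
```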

Non-determinism at the architectural level. Identical inputs produced different architectures across different runs. This is not surprising if you understand how probabilistic models work, but it has a consequence that is rarely discussed: the output of an agentic pipeline is not maintainable in any meaningful sense. If the next run produces a different structure, you cannot reason about consistency over time. You do not have a codebase. You have a snapshot.
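A cheap way to observe this drift is to fingerprint the structure each run produces and compare across runs. This is a sketch under assumptions: the function name `architecture_fingerprint` and the shape of the structure map (file path to public symbol names) are hypothetical, and the two runs are made-up examples.

```python
import hashlib
import json


def architecture_fingerprint(structure: dict[str, list[str]]) -> str:
    """Hash a {file path: [public symbol names]} map, ignoring ordering."""
    canonical = json.dumps(
        {path: sorted(symbols) for path, symbols in structure.items()},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two hypothetical runs on the same PRD.
run_a = {"auth/service.py": ["login", "logout"], "auth/models.py": ["User"]}
run_b = {"auth.py": ["User", "login", "logout"]}

same_structure = architecture_fingerprint(run_a) == architecture_fingerprint(run_b)

# The same layout listed in a different order should not count as drift.
reordered = {"auth/models.py": ["User"], "auth/service.py": ["logout", "login"]}
stable_under_reordering = architecture_fingerprint(run_a) == architecture_fingerprint(reordered)

print(same_structure, stable_under_reordering)
```

A fingerprint like this only tells you that two runs diverged, not which one is right.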

Systematic failure on complex business logic. Agents handled standard patterns well: CRUD operations, boilerplate, well-documented integration patterns. But they failed consistently on the logic that actually defines a product: edge cases, multi-party business rules, and the "dirty" integration logic found in any real system that has been running in production for years. This is exactly where the leverage would matter most, and it is exactly where it collapses.

Infinite correction loops. Agents would enter cycles where they attempted to fix a failing test by generating more incorrect code or repeating the same error. The feedback loop that was supposed to create convergence created divergence instead.
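A guard against this failure mode is to stop as soon as an error repeats, instead of letting the loop run. The sketch below is one possible guard, not what the pipeline actually used. The names `error_signature` and `correct_until_stable` are hypothetical, and the fixer is a stub that deliberately oscillates between two errors.

```python
import hashlib
from typing import Callable, Optional


def error_signature(log: str) -> str:
    """Normalize whitespace so trivially different logs compare equal."""
    normalized = " ".join(log.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def correct_until_stable(
    attempt_fix: Callable[[str], Optional[str]],
    first_error: str,
    max_attempts: int = 5,
) -> tuple[str, int]:
    """Return (outcome, attempts); outcome is 'fixed', 'repeating', or 'exhausted'."""
    seen = {error_signature(first_error)}
    error = first_error
    for attempt in range(1, max_attempts + 1):
        next_error = attempt_fix(error)   # None means the tests now pass
        if next_error is None:
            return "fixed", attempt
        signature = error_signature(next_error)
        if signature in seen:             # same failure again: stop and escalate
            return "repeating", attempt
        seen.add(signature)
        error = next_error
    return "exhausted", max_attempts


# Stub fixer that swaps one error for another and back again.
def oscillating_fixer(error: str) -> Optional[str]:
    return "TypeError: bad arg" if error.startswith("KeyError") else "KeyError: 'user'"


outcome, attempts = correct_until_stable(oscillating_fixer, "KeyError: 'user'")
print(outcome, attempts)
```

Exact-match signatures will miss loops where the error text changes slightly each time, so the attempt cap still matters.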

The architectural reason this happens

Here is what I think is actually going on, and why it matters beyond my specific experiment.

The multi-agent team model is based on a human analogy: you divide the work, specialists handle their area, they pass results to the next person. This is how human teams work because human beings have persistent context, shared understanding built over time, and the ability to ask for clarification when something does not make sense.

Language models do not work this way. Each agent in a pipeline starts from scratch. When you pass the output of one agent to the next, you are compressing the context, not transferring it. Information is lost at every handoff, much as it is in the management hierarchies I write about elsewhere. The more agents you chain, the more signal you lose, and the more the final output drifts from the original intent.
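A toy model shows why chain length matters. It assumes every handoff keeps the same fraction of the original intent and that losses compound independently. Both are simplifying assumptions, and the 90% retention figure is invented for illustration, not measured.

```python
def retained_intent(retention_per_handoff: float, handoffs: int) -> float:
    """Fraction of the original intent left after a chain of lossy handoffs."""
    return retention_per_handoff ** handoffs


# Invented figure: each handoff keeps 90% of the signal.
for n in (1, 3, 5, 10):
    print(n, round(retained_intent(0.9, n), 3))
```

Under these assumptions, five handoffs leave roughly 59% of the original signal and ten leave roughly 35%.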

The horizontal versus vertical framing helped me understand this more clearly. When you cut a project vertically, by feature, you are forcing each agent to manage the full communication stack: API contracts, frontend flows, backend logic, data layer. Most of that surface is already highly standardized and does not require creative reasoning. You are burning context budget on the parts that matter least.

When you cut horizontally, by layer, each agent operates on a surface that has semantic coherence. Business logic is business logic. The model can reason about it as a unified problem. The interfaces between layers are stable contracts, and those can be defined and enforced independently.
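In Python, one way to express such a contract between layers is a `typing.Protocol` that is fixed before any code is generated. The names below (`LoginAttemptStore`, `InMemoryAttemptStore`, `is_locked_out`) and the lockout rule itself are hypothetical, chosen only to illustrate the split between a data layer and a business-logic layer.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class LoginAttemptStore(Protocol):
    """Data-layer contract, agreed before generation starts."""

    def failed_attempts(self, email: str) -> int: ...


class InMemoryAttemptStore:
    """One possible data-layer implementation; others must satisfy the same contract."""

    def __init__(self, counts: dict[str, int]) -> None:
        self._counts = dict(counts)

    def failed_attempts(self, email: str) -> int:
        return self._counts.get(email, 0)


def is_locked_out(store: LoginAttemptStore, email: str, limit: int = 3) -> bool:
    """Business-logic layer: depends only on the contract, not on storage details."""
    return store.failed_attempts(email) >= limit


store = InMemoryAttemptStore({"ann@example.com": 3})
print(isinstance(store, LoginAttemptStore))   # structural check against the contract
print(is_locked_out(store, "ann@example.com"), is_locked_out(store, "bob@example.com"))
```

The agent working on business logic only ever needs to see the protocol, which keeps its context focused on the rule rather than on storage.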

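This echoes Conway's Law, which says that systems end up mirroring the communication structure of the organizations that build them. Applied to AI-assisted development: the way you decompose the work shapes the quality of the output, and the decomposition that works for human teams does not work for language models.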

What actually works

The approach that produced the best results was also the simplest. Medium-sized tasks, not micro-tasks. Explicit contracts between components defined before generation begins. And what I started calling a code of conduct for the model: a small set of non-negotiable rules about how the output must be structured, verified explicitly at the end of every interaction, with the interaction invalidated and repeated if the rules were not followed.
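The verify-and-repeat step can be sketched as a set of named checks applied to every output. The three rules below are hypothetical examples, not the actual rule set, and `generate` stands in for a model call.

```python
from typing import Callable

Rule = Callable[[str], bool]

# Hypothetical rules; a real code of conduct would be project-specific.
RULES: dict[str, Rule] = {
    "no TODO placeholders": lambda out: "TODO" not in out,
    "starts with a module docstring": lambda out: out.lstrip().startswith('"""'),
    "no bare except": lambda out: "except:" not in out,
}


def violations(output: str) -> list[str]:
    """Names of the rules this output breaks."""
    return [name for name, rule in RULES.items() if not rule(output)]


def generate_with_conduct(generate: Callable[[], str], max_tries: int = 3) -> str:
    """Invalidate and repeat the interaction until every rule holds."""
    for _ in range(max_tries):
        output = generate()
        if not violations(output):
            return output
    raise RuntimeError(f"rules still violated after {max_tries} tries")


# Stub generator: the first draft breaks the rules, the second complies.
drafts = iter(["x = 1  # TODO", '"""Auth helpers."""\nx = 1\n'])
result = generate_with_conduct(lambda: next(drafts))
print(violations("x = 1  # TODO"))
```

The point of keeping the rules mechanical is that the check does not depend on another model's opinion of the output.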

This is not far from what good vibe coding looks like with proper structure. The difference is the explicitness of the rules and the verification step.

Agents in a CI/CD pipeline, handling atomic and verifiable tasks, also work well. Analyzing test logs, generating performance reports, flagging anomalies. These work because the task has clear boundaries and the output can be checked against an objective criterion. The agent is not being asked to reason about what to build. It is being asked to reason about what happened.
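"Checked against an objective criterion" can be as simple as comparing what the agent reports with what the log literally contains. In this sketch the log format is modeled loosely on pytest's short summary lines, which is an assumption about the test runner. The function names and the sample log are hypothetical.

```python
import re

FAILED_LINE = re.compile(r"^FAILED (\S+)", re.MULTILINE)


def failed_tests(log: str) -> set[str]:
    """Test IDs that the log itself marks as failed."""
    return set(FAILED_LINE.findall(log))


def summary_is_grounded(agent_reported: set[str], log: str) -> bool:
    """Accept the agent's summary only if it matches the log exactly."""
    return agent_reported == failed_tests(log)


log = "\n".join([
    "PASSED tests/test_auth.py::test_logout",
    "FAILED tests/test_auth.py::test_login - AssertionError: expected 200",
    "FAILED tests/test_cart.py::test_total - KeyError: 'price'",
])

accurate = {"tests/test_auth.py::test_login", "tests/test_cart.py::test_total"}
invented = accurate | {"tests/test_db.py::test_migrate"}

print(summary_is_grounded(accurate, log), summary_is_grounded(invented, log))
```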

The pattern that emerges is consistent: AI works when the task is semantically complete, the output is verifiable, and the context budget is used on the part of the problem that requires actual reasoning. It does not work when the task is fragmented, the output is evaluated subjectively, or the context is diluted across a handoff chain.

The economic argument that nobody talks about

If you use a subscription-based coding assistant, you get a very large token allowance, bounded by rate limits rather than per-token billing, for a fixed monthly cost. If you run the same workload through APIs at consumption pricing, you can spend orders of magnitude more for a result that is comparable or marginally better.
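The comparison is easy to run for your own workload. Every price in this sketch is a placeholder chosen only to make the arithmetic visible, and the three-month duration is also an assumption. Substitute your provider's actual rates and your plan's real limits.

```python
def api_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Consumption pricing: pay per token."""
    return tokens / 1_000_000 * usd_per_million_tokens


def subscription_cost(months: int, usd_per_month: float) -> float:
    """Flat pricing: pay per month regardless of usage within plan limits."""
    return months * usd_per_month


TOKENS = 1_000_000_000                 # the workload discussed in this article
HYPOTHETICAL_USD_PER_MILLION = 10.0    # placeholder blended rate, not a real price
HYPOTHETICAL_USD_PER_MONTH = 200.0     # placeholder plan price, not a real price
MONTHS = 3                             # assumed duration

api = api_cost(TOKENS, HYPOTHETICAL_USD_PER_MILLION)
sub = subscription_cost(MONTHS, HYPOTHETICAL_USD_PER_MONTH)
print(api, sub, round(api / sub, 1))
```

The ratio depends entirely on the inputs, and a flat plan may not be able to serve the same volume, so treat the output as a template rather than a result.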

The business case for agentic pipelines on API pricing does not close in most scenarios. This is not a knock on the technology. It is a capital allocation problem, and it is the kind of calculation that gets skipped when people are excited about a new paradigm.

Where this leaves us

The state of the art today is closer to AI as a very capable assistant than AI as an autonomous developer. A senior engineer who knows how to use these tools well can compress days of work into hours. That is real and significant leverage. But the automation of the full development cycle, with agents operating independently across complex business logic, is not reliably achievable with current architectures on real production tasks.

The models are improving fast. Context windows have expanded dramatically. Reasoning-first architectures are showing different behavior on multi-step tasks. I am not making a claim about what will be possible in two years. I am making a claim about what is true today, based on a lot of failed experiments and a substantial compute bill.

The honest position is this: the technology is genuinely powerful, the hype around autonomous agent teams is mostly ahead of the reality, and the sweet spot is using AI as high-leverage assistance on well-defined tasks rather than as a replacement for the engineering judgment that makes those tasks coherent in the first place.

That judgment is still ours to provide.