BUSINESS IMPACT

The architecture was not broken. The execution was.

A few years ago, I inherited a system that managed dozens of third-party supplier integrations and handled millions of transactions a year, with hard peaks concentrated in summer and the Christmas period. The market volume was in the tens of millions. Margins were thin, but the business line was strategically critical: it supported a tier-1 distribution partnership that represented a significant share of the company's revenue.

The system was not working well. High error rates, frequent P1 incidents, overload spikes during peak periods. Every failure had a direct cost: lost sales, operational overhead, reputation damage with partners. The business line was bleeding.

The pressure

The organization had reached a conclusion: the architecture was bad, the technology was legacy, the problems were structural. The proposed solution was a full rewrite from scratch.

The problem with that plan was the timing. The company was in a cost-cutting phase, not an investment phase. A full rewrite would have taken months, required significant budget, and delivered nothing in the meantime. The business line was already being discussed as a candidate for disinvestment. Committing to a long rewrite in that context was essentially a slow way to kill the project.

There was also a second problem, more subtle. Nobody had actually proven that the architecture was the root cause. It was an assumption. A convenient one, because it gave everyone a clean narrative: the tech is broken, we need to start over. But assumptions are not diagnoses.

The diagnosis

I sat down with the team manager. We both knew this system well. We knew there was real value underneath the noise, and we were not convinced the problems were architectural.

When we looked closely, the picture was different from the narrative.

The flows were not mapped end-to-end in any observable way. Nobody had a clear view of where exactly transactions were failing and why. The team was reacting to incidents as they came, chasing bugs without a systematic view of the problem space. The internal conversation was focused on the technology being legacy, not on specific, identifiable failure points.

The integrations were being treated as a uniform group, with the goal of standardizing behavior across all of them. But they were not uniform. Different suppliers had very different reliability profiles, volume patterns, and failure modes. Treating them the same meant optimizing for nothing in particular.

The problems looked enormous because nobody was looking at them with the right frame. Once you have a map, big problems tend to decompose into smaller ones. We did not have a map.

The decision was straightforward: do not rewrite. Instrument, diagnose, fix iteratively, with the small budget we had available.

The approach

The first move was observability. We instrumented everything in Datadog at an obsessive level of detail. Every flow, every integration, every failure point. For the first time, we had a complete end-to-end map of what was actually happening.
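To make "every flow, every integration, every failure point" concrete, here is a minimal sketch of the idea, not the setup we actually ran. It records one tagged event per stage of a flow into an in-memory list; the names traced_stage and EVENTS, the tag names, and the sample supplier and stage labels are all hypothetical, and a real system would forward these events to a metrics backend such as Datadog instead.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink, used here only to keep the sketch self-contained.
EVENTS = []

@contextmanager
def traced_stage(supplier, flow, stage):
    """Record one event per stage execution, tagged so it can be sliced later."""
    start = time.monotonic()
    outcome, error_type = "ok", None
    try:
        yield
    except Exception as exc:
        # Capture the failure type, then let the exception propagate unchanged.
        outcome, error_type = "error", type(exc).__name__
        raise
    finally:
        EVENTS.append({
            "supplier": supplier,
            "flow": flow,
            "stage": stage,
            "outcome": outcome,
            "error_type": error_type,
            "duration_s": time.monotonic() - start,
        })

# Usage: wrap each step of a flow so that failures carry their location with them.
with traced_stage("supplier_a", "booking", "price_check"):
    pass  # the call to the supplier would go here
```

The point of the design is that every event carries the same small set of tags, so failures can later be grouped by supplier, by flow, or by stage without touching the code again.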

With the map, patterns emerged. We grouped failures into clusters based on root cause and supplier. Some clusters were large but easy to fix. Others were complex but rare. We prioritized by impact, starting with the low-hanging fruit, which turned out to be larger than expected.
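The grouping and ranking step can be sketched in a few lines. This is an illustration under stated assumptions, not our actual tooling: the function name rank_failure_clusters and the record fields supplier and root_cause are hypothetical, and raw failure count stands in for impact, which in practice would also weigh lost sales and effort to fix.

```python
from collections import Counter

def rank_failure_clusters(failures):
    """Group failures by (supplier, root_cause) and rank clusters by size.

    `failures` is a list of dicts with hypothetical keys "supplier" and
    "root_cause". Failure count is used as a simple proxy for impact.
    """
    counts = Counter((f["supplier"], f["root_cause"]) for f in failures)
    total = sum(counts.values())
    ranked = []
    cumulative = 0
    for (supplier, root_cause), n in counts.most_common():
        cumulative += n
        ranked.append({
            "supplier": supplier,
            "root_cause": root_cause,
            "count": n,
            "share": n / total,
            "cumulative_share": cumulative / total,
        })
    return ranked

# Usage with made-up sample data.
sample = (
    [{"supplier": "supplier_a", "root_cause": "timeout"}] * 6
    + [{"supplier": "supplier_b", "root_cause": "schema_mismatch"}] * 3
    + [{"supplier": "supplier_a", "root_cause": "auth_expired"}] * 1
)
for cluster in rank_failure_clusters(sample):
    print(cluster)
```

The cumulative share column is what makes the prioritization visible: it shows how much of the total failure volume disappears if the top clusters are fixed first.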

The iteration was tight. Fix a cluster, measure the improvement, move to the next. As the biggest problems disappeared, the metrics started to improve visibly, and something else happened: smaller problems that had been invisible in the noise became detectable. The observability was paying compound interest.

One specific outcome surprised us. Once we had clean data on failure patterns, we could distinguish between our own issues and supplier-side issues. In the previous chaos, everything looked like our problem. With clear instrumentation, we could identify cases where the fault was on the supplier side and escalate with evidence. Some integrations improved significantly just from that.
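A rough sketch of how that separation can work once each failure records where it happened. The rules below are assumptions chosen for illustration, not the ones we used: the field names request_sent and supplier_status and the functions attribute_fault and fault_summary are hypothetical, and real attribution needs more nuance (a timeout, for instance, can also be caused by our own network).

```python
from collections import defaultdict

def attribute_fault(event):
    """Classify one failure as "ours", "supplier", or "unknown".

    Assumed rules, for illustration only:
    - the request never left our side            -> ours
    - request sent, no response (e.g. timeout)   -> supplier
    - supplier answered with a 5xx status        -> supplier
    - supplier answered with a 4xx status        -> ours (likely a bad request)
    """
    if not event.get("request_sent", False):
        return "ours"
    status = event.get("supplier_status")
    if status is None:
        return "supplier"
    if 500 <= status <= 599:
        return "supplier"
    if 400 <= status <= 499:
        return "ours"
    return "unknown"

def fault_summary(events):
    """Count failures per supplier, split by attributed side."""
    summary = defaultdict(lambda: {"ours": 0, "supplier": 0, "unknown": 0})
    for event in events:
        summary[event["supplier"]][attribute_fault(event)] += 1
    return dict(summary)
```

A per-supplier summary like this is the kind of evidence that turns an escalation from an opinion into a documented pattern.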

The investment was small. No rewrite, no new architecture, no long migration project. A focused engineering effort over a few months, with a clear method and tight feedback loops.

The result

P1 incidents stopped. Error rates went from bad to good to excellent. The partnership was no longer at risk.

But the more interesting outcome came later.

As our system became reliable, something shifted in how distribution partners treated us. Reliable partners get prioritized. Unreliable ones get deprioritized. We had been in the second category without fully realizing it. As our performance improved, our position in the distribution stack improved with it.

After a full fiscal year, the numbers were in. Our volume grew 30%, against a 15% benchmark for comparable business lines. Margins improved. The business line that was being considered for disinvestment became one of the better-performing areas.

The architecture had not been the problem. The execution had been.

What I took from this

The instinct to rewrite is almost always wrong when the real problem has not been diagnosed. Rewrites are expensive, slow, and they inherit the same organizational and operational problems that caused the original failures, just on newer code.

The more common failure mode is not bad architecture. It is the absence of observability, which makes every problem look bigger and more structural than it is. You cannot fix what you cannot see. And once you can see it, most problems turn out to be fixable without starting over.

This is not a story about a technical solution. It is a story about diagnosis before action, and about the cost of skipping that step.