Nobody had asked why the requests existed in the first place.

At some point, a team was created to manually process refund requests. Four people, full time, working through a queue of transactions that had gone wrong. It seemed like a reasonable operational solution to a known problem.

Nobody asked why the problem existed.

The context

The flow worked like this. A customer would purchase a product, the payment would go through, and then the system would attempt to confirm the actual booking with the inventory. In a portion of cases, that confirmation would fail: the product was no longer available, or the inventory system returned an error. The purchase had been charged, but the booking could not be completed.

The obvious fix was an automatic refund. But automatic refunds were not the right answer here, for a few reasons.

First, the error rate was 3% on average, reaching 5% on specific product types during peak periods. On a high-volume transactional system, that is a large absolute number. Refunding automatically at that scale meant losing a significant amount of revenue.

Second, the booking flow involved deferred confirmation. For integration and architectural reasons, the full cycle from payment to confirmed booking could take several hours. An immediate refund in that window would cause the customer to lose their slot and walk away, even in cases where the booking could have been recovered with a small adjustment, such as a time shift or a variant of the same product. The majority of failed transactions did not actually require a refund. They required a human to look at them and find an alternative.

So the team was built. A dedicated customer care unit, trained to triage these transactions manually, contact customers, and resolve each case individually. It worked, for a while.

The pressure

As volume grew, the queue grew with it. Four people managing an ever-increasing backlog of failed transactions. Customers waiting up to 30 days to see their money released or their booking confirmed. Money held in a suspended state, neither refunded nor converted into a completed sale.

The costs were stacking up on multiple fronts. The operational cost of the team itself. Banking transaction fees paid twice, once on the original payment and again on any refund or re-processing. Lost revenue from transactions that were never recovered. And a customer satisfaction problem that was becoming impossible to ignore: customers seeing their money frozen, getting nervous, leaving bad reviews.

Multiple solutions had been proposed. Some of them were complex architectural overhauls. None of them shipped. The problem kept growing.

The diagnosis

The core issue was a synchronization problem between systems, compounded by incorrect caching behavior.

What made it difficult was that the failure was not consistent. It happened in some cases and not others. The pattern was not obvious from the outside, and the problem touched multiple systems. That variability was exactly why the organization had built the manual workaround instead of fixing the root cause: it looked too complex, too distributed, too hard to pin down.

The first step was instrumentation. We added tracking to follow the exact behavior of transactions that were entering the error state, to understand precisely where the synchronization was breaking and under what conditions. We also went directly to the people doing the manual resolution, the customer care team, and mapped the real-world case types they were handling every day. Their knowledge of the failure patterns turned out to be invaluable. They had been living with this problem longer than anyone.
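To make the instrumentation step concrete, here is a minimal sketch of the kind of tracking that answers "where did this transaction stop, and for which product type?". It is an illustration, not the production code: the stage names, the `Event` shape, and the sample data are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical pipeline stages, in order. The article does not list the real ones.
STAGES = [
    "payment_captured",
    "availability_checked",
    "booking_requested",
    "booking_confirmed",
]


@dataclass(frozen=True)
class Event:
    transaction_id: str
    product_type: str
    stage: str


def last_stage_reached(events):
    """Map each transaction id to (rank of furthest stage reached, product type)."""
    furthest = {}
    for event in events:
        rank = STAGES.index(event.stage)
        current = furthest.get(event.transaction_id)
        if current is None or rank > current[0]:
            furthest[event.transaction_id] = (rank, event.product_type)
    return furthest


def failure_breakdown(events):
    """Count transactions that never reached the final stage,
    grouped by (product type, last stage reached)."""
    counts = Counter()
    final_rank = len(STAGES) - 1
    for rank, product_type in last_stage_reached(events).values():
        if rank < final_rank:
            counts[(product_type, STAGES[rank])] += 1
    return counts


# Made-up sample data: t2 completes, t1 and t3 stall at different stages.
events = [
    Event("t1", "tour", "payment_captured"),
    Event("t1", "tour", "availability_checked"),
    Event("t2", "tour", "payment_captured"),
    Event("t2", "tour", "availability_checked"),
    Event("t2", "tour", "booking_requested"),
    Event("t2", "tour", "booking_confirmed"),
    Event("t3", "ticket", "payment_captured"),
]
breakdown = failure_breakdown(events)
```

A breakdown like this is only the starting point. In our case, the customer care team's knowledge of the real case types was what gave the raw counts their meaning.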

Once the patterns were mapped, the problem became tractable.

The approach

The fix was not an architectural overhaul. It was a series of targeted interventions, each aimed at a specific failure pattern.

Caching was the biggest lever. The system was using a generic caching strategy that did not account for the differences between inventory types. Different suppliers had different characteristics, different consistency guarantees, different edge cases. We rebuilt the caching logic to be specific to each product type, which eliminated a large portion of the failures immediately.
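As a rough illustration of what "specific to each product type" can mean, the sketch below gives each product type its own time-to-live and refuses to cache the types where stale availability is most dangerous. The product type names, TTL values, and the `fetch` callable are assumptions made for the example, not details of the real system.

```python
import time

# Hypothetical TTLs in seconds per product type. Real values would come from
# each supplier's consistency behavior, which the article does not specify.
TTL_BY_PRODUCT_TYPE = {
    "timed_slot": 0,      # never cached: always ask the supplier
    "daily_ticket": 30,
    "open_voucher": 600,
}
DEFAULT_TTL = 0  # unknown product types are treated as uncacheable


class AvailabilityCache:
    """Availability cache whose expiry policy depends on the product type."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable(product_type, product_id) -> availability
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # (product_type, product_id) -> (expires_at, value)

    def get(self, product_type, product_id):
        ttl = TTL_BY_PRODUCT_TYPE.get(product_type, DEFAULT_TTL)
        key = (product_type, product_id)
        now = self._clock()
        entry = self._entries.get(key)
        if ttl > 0 and entry is not None and entry[0] > now:
            return entry[1]  # still fresh for this product type
        value = self._fetch(product_type, product_id)
        if ttl > 0:
            self._entries[key] = (now + ttl, value)
        return value
```

Defaulting unknown types to "do not cache" is a deliberately conservative choice in this sketch: a new supplier costs extra lookups until someone decides how stale its data is allowed to be.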

Concurrent purchase handling was another gap. Some inventory systems did not manage concurrent bookings on the same slot correctly. We introduced our own concurrency control layer for those specific cases, operating on a defined pool of slots we controlled, which gave us the certainty we needed without depending on the supplier to fix their side.
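The idea of a concurrency control layer over a pool of slots we own can be sketched as below. This is a simplified, single-process illustration using a lock. The class and method names are hypothetical, and a real deployment across several servers would presumably need a shared store with atomic updates rather than an in-memory dictionary.

```python
import threading


class SlotPool:
    """Tracks remaining capacity for slots we control, so two concurrent
    purchases can never both take the last place on the same slot."""

    def __init__(self, capacity_by_slot):
        self._remaining = dict(capacity_by_slot)
        self._holders = {}  # slot_id -> set of order ids holding a place
        self._lock = threading.Lock()

    def reserve(self, slot_id, order_id):
        """Return True if the order holds a place on the slot after this call."""
        with self._lock:
            holders = self._holders.setdefault(slot_id, set())
            if order_id in holders:
                return True  # retry of the same order: do not double count
            if self._remaining.get(slot_id, 0) <= 0:
                return False  # slot is full or unknown
            self._remaining[slot_id] -= 1
            holders.add(order_id)
            return True

    def release(self, slot_id, order_id):
        """Give the place back, for example when the payment is abandoned."""
        with self._lock:
            holders = self._holders.get(slot_id, set())
            if order_id in holders:
                holders.remove(order_id)
                self._remaining[slot_id] += 1

    def remaining(self, slot_id):
        with self._lock:
            return self._remaining.get(slot_id, 0)
```

The check and the decrement happen under the same lock, which is the whole point: the supplier's system is no longer the component that has to get the race right.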

We also introduced a payment reservation system, holding funds rather than immediately charging, which gave us a time window to confirm availability before the charge became final. This removed the most damaging failure mode: the customer being charged for something that could not be delivered.
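The reservation pattern is commonly known as authorize-then-capture. The sketch below shows its shape with a stand-in gateway; `FakeGateway`, `purchase`, and `confirm_booking` are hypothetical names and do not correspond to any real payment provider's API. It also simplifies the failure path to an immediate release of the hold, whereas a real flow could first try an alternative slot or product.

```python
from enum import Enum


class HoldState(Enum):
    AUTHORIZED = "authorized"  # funds held, not yet charged
    CAPTURED = "captured"      # charge is final
    VOIDED = "voided"          # hold released, customer never charged


class FakeGateway:
    """In-memory stand-in for a payment provider, for illustration only."""

    def __init__(self):
        self.holds = {}  # order_id -> (amount_cents, HoldState)

    def authorize(self, order_id, amount_cents):
        self.holds[order_id] = (amount_cents, HoldState.AUTHORIZED)

    def capture(self, order_id):
        amount, _ = self.holds[order_id]
        self.holds[order_id] = (amount, HoldState.CAPTURED)

    def void(self, order_id):
        amount, _ = self.holds[order_id]
        self.holds[order_id] = (amount, HoldState.VOIDED)


def purchase(gateway, confirm_booking, order_id, amount_cents):
    """Hold the funds, try to confirm the booking, then capture or release."""
    gateway.authorize(order_id, amount_cents)
    try:
        confirmed = confirm_booking(order_id)
    except Exception:
        # An inventory error is treated like a failed confirmation.
        confirmed = False
    if confirmed:
        gateway.capture(order_id)
    else:
        gateway.void(order_id)
    return gateway.holds[order_id][1]
```

With this ordering, a failed confirmation ends in a released hold rather than in a charge that someone has to refund later.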

Throughout the process, the approach was the same as in the first case: map the problem space, cluster the failure types, fix in order of impact, measure continuously.
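"Fix in order of impact" can be as simple as a ranked table. The sketch below totals failed transactions per cluster and sorts by the money at stake. The cluster names and amounts are invented for the example, and using the amount at stake as the impact measure is one possible choice among several.

```python
from collections import defaultdict


def rank_failure_clusters(failures):
    """failures: iterable of (cluster_name, amount_cents) pairs.
    Returns (cluster_name, totals) pairs, largest amount at stake first."""
    totals = defaultdict(lambda: {"count": 0, "amount_cents": 0})
    for cluster, amount_cents in failures:
        totals[cluster]["count"] += 1
        totals[cluster]["amount_cents"] += amount_cents
    return sorted(
        totals.items(),
        key=lambda item: item[1]["amount_cents"],
        reverse=True,
    )


# Made-up sample: three failure clusters with different volumes and amounts.
sample_failures = [
    ("stale_cache", 12000),
    ("stale_cache", 8000),
    ("concurrent_booking", 15000),
    ("supplier_timeout", 3000),
    ("stale_cache", 5000),
]
ranking = rank_failure_clusters(sample_failures)
```

Re-running the same ranking after each fix is a cheap way to keep measuring continuously: the top of the list should change as interventions land.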

The result

Error rates dropped from between 3% and 5% to under 1%, with only a residual fraction of edge cases remaining. Those last cases involved occasional, low-frequency failure modes that would have required disproportionate investment to resolve fully. We left them.

The dedicated customer care team was dissolved. The four people who had been spending their days managing a queue of technical errors went back to doing actual customer support work, handling issues that required human judgment rather than compensating for a system bug.

The duplicated banking transaction fees disappeared. Revenue that had previously been lost on unrecovered transactions was now retained. The customer experience of having money frozen for weeks was eliminated.

The total financial impact across recovered revenue, eliminated operational costs, and removed transaction fees ran into the millions annually.

What I took from this

Organizations are good at building workarounds. They are much worse at asking why the workaround exists.

A team of four people processing manual refunds every day is not a customer care solution. It is a very expensive symptom of an undiagnosed technical problem. The cost of that symptom, the salaries, the transaction fees, the lost revenue, the customer dissatisfaction, was far higher than the cost of fixing the root cause.

The reason it took so long to fix was not technical complexity. The failure patterns were real but findable, once someone decided to look at them systematically. The reason was that the workaround had become normalized. It was budgeted, staffed, and managed as if it were a permanent feature of the operation.

The most useful thing I did in this case was not the technical fix. It was refusing to accept the workaround as a given, and going to find out why the queue existed in the first place.