I once joined a team with a product on fire. The “payment error rate” was high (20%+), partners complained the product “never worked”, and churn was at boiling point. The team was exasperated, but their frustration was nothing compared to that of marketing, who felt partners were falling through a sieve, and partner support, who could hardly raise issues fast enough.

Rough landing

On my first day, the team set up a meeting about “payment errors”: issues that prevented the customer from completing their purchase. The frustration on the topic was clear. The team cared deeply about these issues, but they were stuck and demotivated.

The problem: Low morale and lack of structured support prevent teams from performing.

The solution: Introduce a structure the team can get behind — building visibility, taking things step-by-step — and target low-hanging fruit to build morale.

Bird’s eye view

How much of a problem were these payment errors, really? It was hard to tell. Various dashboards and logging already existed, but they were dispersed and rarely agreed. When troubleshooting a particular issue, the strand of spaghetti could hardly be followed.

The problem: Diverging dashboards made it hard to know when an outage was occurring.

The solution: Create a headline red-amber-green dashboard that clearly showed whether business operations were within acceptable parameters.
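
As an illustration only, the logic behind a red-amber-green headline can be very small. The sketch below assumes the dashboard is driven by the payment error rate and uses made-up thresholds (`AMBER_THRESHOLD`, `RED_THRESHOLD`); the real values would be whatever the business considers acceptable, and the function name is hypothetical.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    AMBER = "amber"
    RED = "red"


# Hypothetical thresholds, chosen for illustration only.
AMBER_THRESHOLD = 0.01  # 1% of payment attempts failing
RED_THRESHOLD = 0.05    # 5% of payment attempts failing


def payment_status(attempts: int, errors: int) -> Status:
    """Classify the payment error rate for the headline dashboard."""
    if attempts == 0:
        # Assumption: no traffic at all on a live payment flow is itself a red flag.
        return Status.RED
    error_rate = errors / attempts
    if error_rate >= RED_THRESHOLD:
        return Status.RED
    if error_rate >= AMBER_THRESHOLD:
        return Status.AMBER
    return Status.GREEN
```

With these assumed thresholds, `payment_status(1000, 200)` is red, which is where a 20% error rate would land.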

Don’t panic

With the issues now clearly visible on overhead monitors, senior leadership snapped to attention, asking questions like, ‘isn’t this an everyone-stays-up-all-night situation?’ I managed these senior stakeholders by explaining that multiple systemic issues were at play, and that resolving them required long-term dedicated focus.

The problem: Leadership panicked and looked for short-term solutions.

The solution: Explain the long-term approach while delivering short-term wins and show progress.

Unify reporting

Fine-grained logging and metrics still disagreed, making troubleshooting nigh impossible. I defined a single place in the code for all errors to flow through. The single “catch” was responsible for logging and emitting metrics to all relevant systems.

The problem: Inconsistent instrumentation hid issues.

The solution: Implement a unified error handling mechanism.
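
A minimal sketch of the idea, not the original implementation: every error passes through one function that both logs it and counts it, so logs and metrics cannot drift apart. The names `report_error` and `run_reported` are hypothetical, and the in-memory `error_counts` counter stands in for whatever metrics client is actually in use.

```python
import logging
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

logger = logging.getLogger("payments")

# Stand-in for a real metrics client (hypothetical); counts errors by operation and type.
error_counts: Counter = Counter()


def report_error(operation: str, error: Exception) -> None:
    """The single place where every payment error is logged and counted."""
    error_type = type(error).__name__
    logger.error("payment error in %s: %s: %s", operation, error_type, error)
    error_counts[(operation, error_type)] += 1


def run_reported(operation: str, func: Callable[[], T]) -> T:
    """Run func, routing any exception through report_error before re-raising it."""
    try:
        return func()
    except Exception as error:
        report_error(operation, error)
        raise
```

Because the exception is re-raised, callers keep their existing behaviour; only the reporting is centralised.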

Stop the bleeding

The state of play could now be seen. However, this was a live system with many dependencies, and further outages continued to occur. I introduced a DevOps-style on-call rotation to tackle new outages and maintain a steady state of business performance. Two important techniques that I used here were:

Using projection graphs based on exponentially-weighted averages to compare against expected volumes and rates
Discriminating between user and system errors, i.e., actionable and unactionable errors (4xxs versus 5xxs)
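
The first technique can be sketched roughly as follows. This is an assumption-laden illustration rather than the actual projection graphs: the smoothing factor `alpha`, the `tolerance`, and both function names are hypothetical, and a real projection would likely also account for time-of-day and day-of-week patterns.

```python
from typing import Iterable, Optional, Sequence


def ewma(values: Iterable[float], alpha: float = 0.3) -> Optional[float]:
    """Exponentially weighted moving average; recent values weigh more."""
    average: Optional[float] = None
    for value in values:
        if average is None:
            average = value  # seed with the first observation
        else:
            average = alpha * value + (1 - alpha) * average
    return average


def is_below_expected(
    history: Sequence[float],
    current: float,
    alpha: float = 0.3,
    tolerance: float = 0.5,
) -> bool:
    """Flag the current volume if it falls well below the projected volume."""
    expected = ewma(history, alpha)
    if expected is None:
        return False  # no history yet, so nothing to compare against
    return current < expected * (1 - tolerance)
```

Here `history` would be recent purchase volumes per interval, and `current` the volume for the interval being checked.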

The problem: Outages continued to occur, driving performance even lower.

The solution: Introduce an on-call rotation with actionable alerting to maintain performance.
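
One way to keep alerting actionable is to page only on system errors. The sketch below assumes that 5xx responses are the errors the team can act on and that 4xx responses are tracked but do not page; the `threshold` value and function names are hypothetical.

```python
from typing import Iterable


def is_user_error(status_code: int) -> bool:
    """4xx: the request itself was rejected (client-side problem)."""
    return 400 <= status_code <= 499


def is_system_error(status_code: int) -> bool:
    """5xx: the system failed to handle a request (server-side problem)."""
    return 500 <= status_code <= 599


def should_page(status_codes: Iterable[int], threshold: float = 0.02) -> bool:
    """Page on-call only when the share of system errors exceeds the threshold."""
    codes = list(status_codes)
    if not codes:
        return False
    system_errors = sum(1 for code in codes if is_system_error(code))
    return system_errors / len(codes) > threshold
```

User errors are still worth charting separately, since a sudden rise in them can point to a defect rather than to user behaviour.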

Prioritise the work

Prioritising the list of payment errors by occurrence gave the team a structured backlog to burn down to zero. Lumping together related issues (same topic, same subsystem) helped to expedite them. Over time, all other kinds of issues were added to this backlog, and explicitly prioritised.

The problem: The team was overwhelmed with many errors to fix.

The solution: Help the team to focus with a visible board of prioritised issues.
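
Ranking by occurrence needs little more than a counter. In this sketch the `ErrorEvent` fields and the example values in the usage note are hypothetical; in practice they would come from whatever the unified error reporting records.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class ErrorEvent(NamedTuple):
    subsystem: str   # hypothetical grouping key, e.g. "card-form"
    error_code: str  # hypothetical error identifier


def prioritised_backlog(events: Iterable[ErrorEvent]) -> List[Tuple[ErrorEvent, int]]:
    """Distinct errors ordered by number of occurrences, most frequent first."""
    return Counter(events).most_common()


def occurrences_by_subsystem(events: Iterable[ErrorEvent]) -> List[Tuple[str, int]]:
    """Total occurrences per subsystem, to spot groups of related issues."""
    return Counter(event.subsystem for event in events).most_common()
```

For example, feeding in five `ErrorEvent("booking", "timezone_mismatch")` events and three `ErrorEvent("card-form", "invalid_expiry")` events puts the time zone error at the top of the backlog.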

Validate fixes

As the major issues and low-hanging fruit cleared, themes became evident, one amongst them being time zones. I found an old bug fix which actually made the situation worse. I fixed this and proved the fix by emitting metrics, showing traffic recovering as it shifted from the old flow to the new. This became a general practice.

The problem: Some “fixes” made the problem worse.

The solution: Instrument fixes with metrics and logging.
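
A minimal sketch of instrumenting a fix, assuming the old and new code paths can each be labelled with a flow name. The in-memory counter stands in for a real metrics client, and `record_outcome` and `success_rate` are hypothetical names.

```python
from collections import Counter
from typing import Optional

# Stand-in for a real metrics client (hypothetical); counts outcomes per flow.
flow_outcomes: Counter = Counter()


def record_outcome(flow: str, succeeded: bool) -> None:
    """Record one purchase attempt against the flow ("old" or "new") that served it."""
    flow_outcomes[(flow, "success" if succeeded else "failure")] += 1


def success_rate(flow: str) -> Optional[float]:
    """Share of successful attempts for a flow, or None if it has seen no traffic."""
    successes = flow_outcomes[(flow, "success")]
    total = successes + flow_outcomes[(flow, "failure")]
    return successes / total if total else None
```

Comparing `success_rate("old")` with `success_rate("new")` as traffic shifts shows whether the fix helped or made things worse.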

Single source of truth

Another theme of issues was data inconsistencies through the customer funnel, causing purchases to fail at the last step. This was due to different pages being powered by different back-end systems. The solution here was to migrate all the pages to use a common back-end, resolving a wide range of data consistency issues.

The problem: Inconsistent data sources through the product resulted in invalid data.

The solution: Migrate the product to use a single source of truth.

Identify hot spots

The next major theme was apparent ‘user’ errors during credit card form submission: errors which “ought” not to be possible given the software implementation. However, this area of the codebase had excessive tech debt and was due for replatforming. So a colleague and I rewrote it, completing most of the work in a two-day hackathon.

The problem: Tech debt caused critical functionality to malfunction in unexpected ways which were difficult to detect or debug.

The solution: Address tech debt in malfunctioning critical functionality.

Jettison intractable problems

With the majority of issues now resolved, we noticed that some combinations of purchase options never succeeded. We met with the teams behind our dependencies, and discovered together that these combinations were not actually supported, and that adding support was utterly infeasible. After exploring all available alternatives, we managed the situation by contacting partners, informing them with regret that this functionality was not available after all, and helping them find alternative solutions.

The problem: Some issues may be infeasible to solve.

The solution: Inform and help the partner, after exhausting other options.

The long tail

The payment error rate was now below 1%, and the fire was out. To achieve this result, the tech strategy I authored with help from my colleagues, “stop starting, and start finishing”, had grown to encompass the entire area of 3 teams and 20 staff. Tech debt still remained, but it was no longer burning partners and customers.

Conclusion

A methodical approach helped us see through the blaze to the root causes. The time zone bug, the data inconsistencies, and the broken credit card form, clear in retrospect, were not visible amidst the raging fire. Much effort was needed to tame the inferno, and withstanding the heat wasn’t easy! But remember: The hottest fire forges the strongest steel.

Ronen Agranat Consulting

Fire jumping: How I reduced a 20%+ payment error rate to <1%
