The incident review exemplar
When I re-joined Atlassian in December 2025 I was given the opportunity to become a Post Incident Review Champion for the team. In practice that meant being available to review documentation about how teams within our department were addressing problems in their software, and providing oversight of my own team's approach to working through incident follow-up work.
To get up to speed with the championing expectations, I made a point of attending the department's regularly scheduled Zoom call for examining recent post incident review documents, adding comments and flagging any opportunities for improvement.
The calendar invite for the meetings included a link to a review document for one incident as an exemplar of what a good review should look like. When I clicked the link and read through the details, I recognised that it was from an incident that some of my previous team's colleagues had been involved in back in 2024. By the time I finished reading I realised that it didn't even touch on the analysis I had done shortly after the incident, so it was missing the root cause of how a small change to one feature managed to balloon out into broad impact on other functionality.
Ironically, my new manager had read through my analysis of this incident and used that as a basis for determining that I would be a suitable person to get involved as a PIR champion.
It was a minor change, with safeguards in place
I won't go into too much detail here because:
- I wasn't directly involved, and it was long enough ago that I can't remember all the specifics
- This is a blog post, not a book
- The technical details are more important than the business processes
Three microservices and an endpoint migration
Before:
Microservice A exposes an HTTP endpoint and, in serving it, calls Microservice B to look up some data
After:
Microservice A still exposes the same HTTP endpoint, but calls on Microservice C to look up the data.
A minor detail is that Microservice C happens to sit behind a proxy server that has some responsibilities for access control and rate limiting between client services and the target endpoints.
The Safeguards
The change is controlled by a feature flag in Microservice A, meaning that we can control what percentage of requests transition across to hitting the new endpoint, and can switch back to using the original endpoint on Microservice B if anything goes wrong during the rollout.
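The percentage rollout described above can be sketched roughly like this. The function and flag names are hypothetical, not the actual implementation; the key property real feature flag systems share is deterministic bucketing, so a given request always gets the same routing decision at a given percentage:

```python
import hashlib

def in_rollout(request_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a request into [0, 100) using a stable
    hash, so the same request id always gets the same routing decision."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent

def look_up_data(request_id: str, rollout_percent: int) -> str:
    # Route a slice of traffic to the new dependency; dialling the
    # percentage back to 0 is the instant rollback path.
    if in_rollout(request_id, rollout_percent):
        return "call Microservice C via the proxy"
    return "call Microservice B"
```

Dialling the percentage up gradually, while watching logs and metrics, is exactly the safeguard the team relied on here.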
The rate limit configuration in the proxy server has been updated to specify the expected peak load between Microservice A and the new endpoint on Microservice C.
The HTTP client in Microservice A already has several endpoints that it accesses via the proxy server, so there isn't any new complexity involved in having these systems connected up.
Logs and metrics are in place, with nothing showing up as problematic after a significant percentage of requests got directed to the new endpoint.
Feature rollout
Everything looked fine for switching over, so the feature flag got turned up to route 100% of requests via the new endpoint.
Things go bump in the night
Something went wrong.
Microservice A started to return error responses for multiple endpoints - not just the one involved in this migration. Automated monitoring picked up that there was an issue and sent out an alert to the person on call.
The person on call wasn't the person who had been rolling the new feature out, so it would take some time for them to get up to speed. It was the middle of the night in the dev team's timezone, so there weren't many colleagues online to help with the investigation.
It took a while to track down some error responses in the logs and metrics. The calls to the new endpoint were being rate limited, but why were other endpoints being impacted?
The load on Microservice A was going up and up as clients of its various endpoints hit it with more and more requests.
Additional instances of Microservice A were provisioned as auto-scaling policies kicked in to cope with the additional traffic.
The feature flag got turned off but the system was struggling to keep up with the load. Unhealthy instances terminated and new instances came online to replace them.
On call personnel from multiple other teams impacted by the unavailability of the endpoints joined the online incident war room call to try to establish what was going on.
It took some time - maybe half an hour or more - but eventually the service seemed to have recovered. It was about four in the morning in the dev team's timezone, but a senior manager based in another part of the world had joined the discussion in the online war room and wanted to get into some root cause analysis.
I joined the war room call at around the time the system was starting to stabilise. It was almost dawn in New Zealand, so it would have been about 4am in the part of Australia where the members of the development team were based. My main contribution to the discussion was to point out the signals indicating that the system was back to its normal steady state, and that there might not be much benefit in people continuing to poke at it in the middle of the night. The root cause analysis could wait. Our US based colleagues were now in the middle of their working day, but my teammates in Australia could benefit from some sleep.
What went wrong?
The first problem was that the rate limiting configuration in the proxy server had an edge case that resulted in the newly specified limit not being applied the way it was expected to be. This meant that when Microservice A came under peak load, its calls to the endpoint on Microservice C were rate limited by the proxy server.
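The PIR material I saw doesn't spell out the exact edge case, so the following is purely a hypothetical illustration of the general failure mode: a limit rule that exists in the configuration but never matches the key the proxy actually looks up, with a silent fallback to a much smaller default. The route keys and numbers are invented:

```python
DEFAULT_LIMIT = 100  # requests/sec applied when no specific rule matches

RATE_LIMITS = {
    # The new rule was added, but its key doesn't quite match what the
    # proxy resolves at request time (hypothetical trailing-slash mismatch).
    "service-a:/lookup/": 5000,
}

def effective_limit(route_key: str) -> int:
    # Falling back silently to a small default, rather than failing loudly,
    # means the misconfiguration only shows up under peak load.
    return RATE_LIMITS.get(route_key, DEFAULT_LIMIT)
```

A config validation step that fails loudly on unmatched rules would surface this class of problem before peak traffic does.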
The second problem involved the configuration of the HTTP client in Microservice A. Although it had circuit-breaking in place to handle problematic responses, that circuit-breaker was not configured to propagate a suitable HTTP response to the callers of Microservice A. So, when the proxy server returned a 429 response for some calls, it was bubbled up as a 500 response to the callers of Microservice A. When those callers got back the 500 response, they would automatically retry the request.
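A minimal sketch of that status-code collapse and the retry behaviour it triggers - the wrapper and caller functions here are invented stand-ins, not the real client code:

```python
def call_via_proxy(do_request) -> int:
    """Microservice A's wrapper around its proxied HTTP client: any
    failure, including a 429 from the proxy's rate limiter, is collapsed
    into a generic 500 for A's own callers."""
    status = do_request()
    if status >= 400:
        return 500  # the 'back off' meaning of the 429 is lost here
    return status

def caller_of_a(do_request, max_attempts: int = 3) -> int:
    """A's callers treat 500 as retryable and retry immediately,
    multiplying the load on an already-throttled system."""
    status = 500
    for _ in range(max_attempts):
        status = call_via_proxy(do_request)
        if status != 500:
            break
    return status
```

Had the 429 been propagated as-is, callers could have backed off instead of retrying immediately; the translation to 500 turned a throttling signal into an invitation to send more traffic.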
The third problem - which I find to be the most interesting, and something that should have fed into the PIR document - was that the circuit-breaking configuration on the HTTP client within Microservice A was applied across all of the endpoints fronted by the proxy server. This is what dramatically expanded the blast radius of the issue.
What should have been a neatly isolated issue involving a single endpoint blew out to cover multiple endpoints that were much more business critical.
One endpoint's 429 responses resulted in multiple endpoints returning an error response code that callers treated as suitable for immediate retries, creating additional load.
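The blast-radius problem can be shown with a toy count-based circuit breaker. The breaker mechanics are deliberately simplified; what matters is the scoping: one breaker instance shared by every proxied route versus one instance per route. (The route names are invented.)

```python
class CircuitBreaker:
    """Toy breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

# The incident setup: one breaker shared by every route behind the proxy.
# Failures on the migrated endpoint alone are enough to trip it...
shared = CircuitBreaker()
for _ in range(3):
    shared.record(success=False)
# ...and now the breaker is open for the business-critical routes too.

# Scoping a breaker per route confines the failure to the endpoint
# that actually caused it:
per_route = {"lookup": CircuitBreaker(), "critical": CircuitBreaker()}
for _ in range(3):
    per_route["lookup"].record(success=False)
```

With per-route scoping, `per_route["critical"]` stays closed while `per_route["lookup"]` opens, which is the isolation the migrated endpoint should have had.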
Summary
A combination of rate limiting and circuit-breaking made a minor incident into a major one.
This was one case where a more gradual rollout of the feature flag may or may not have prevented the incident, because of the thundering herd of retries - which, ironically, is exactly the kind of cascade that circuit-breaking is intended to prevent.
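As an aside, the standard caller-side mitigation for this sort of retry storm is exponential backoff with jitter, which spreads the herd out rather than letting retries synchronise. A minimal sketch of the full-jitter variant (the function and parameters are illustrative, not from the incident):

```python
import random

def backoff_delays(max_attempts: int = 5, base: float = 0.1, cap: float = 5.0):
    """Full-jitter exponential backoff: each retry waits a random duration
    up to an exponentially growing (and capped) ceiling, so a crowd of
    failed callers doesn't retry in lockstep."""
    return [random.uniform(0.0, min(cap, base * (2 ** attempt)))
            for attempt in range(max_attempts)]
```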