Friday, 20 March 2026

How combining service resilience measures elevated incident severity

The incident review exemplar

When I re-joined Atlassian in December 2025 I was given the opportunity to become a Post Incident Review Champion for the team. Basically that meant that I would be available to review documentation around how teams within our department were working to address problems in their software, and provide oversight of my team's approach to working through incident follow-up work.

To get up to speed with the championing expectations I made a point of attending the department's regularly scheduled Zoom call for examining some recent post incident review documents, adding additional comments and flagging up any opportunities to improve.

The calendar invite for the meetings included a link to a review document for an incident as an exemplar of what a good review should look like. When I clicked the link and read into the details I recognised that it was from an incident that some of my previous team's colleagues had been involved in back in 2024. When I finished reading the full content I realised that it didn't even touch on the analysis that I had done shortly after the incident, so it did not include the root cause of how a small change to one feature managed to balloon out to have broad impact on other functionality.

Ironically, my new manager had read through my analysis of this incident and used that as a basis for determining that I would be a suitable person to get involved as a PIR champion.

It was a minor change, with safeguards in place

I won't go into too much detail here because:

  • I wasn't directly involved, and it's long enough ago that I can't remember all the details
  • This is a blog post, not a book
  • The technical details are more important than the business processes

Three microservices and an endpoint migration

Before:

Microservice A exposes an HTTP endpoint and in turn calls on Microservice B to look up some data.

After:

Microservice A still exposes the same HTTP endpoint, but calls on Microservice C to look up the data.

A minor detail is that Microservice C happens to sit behind a proxy server that has some responsibilities for access control and rate limiting between client services and the target endpoints.

The Safeguards

The change is controlled by a feature flag in Microservice A, meaning that we can control what percentage of requests transition across to hitting the new endpoint, and can switch back to using the original endpoint on Microservice B if anything goes wrong during the rollout.
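Percentage rollouts like this are often implemented with stable hashing, so a given request key consistently routes to the same side of the flag as the percentage is dialled up. A minimal sketch of the idea (illustrative only, not Atlassian's actual flag system):

```python
import hashlib

def routes_to_new_endpoint(request_key: str, rollout_percent: int) -> bool:
    """Stable bucketing: hash the key into 0-99 and compare to the rollout percentage."""
    digest = hashlib.sha256(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent

# At 0% nothing routes to the new endpoint; at 100% everything does,
# and any key that routes at 10% still routes at every higher percentage.
```

The useful property is monotonicity: dialling the flag from 10% to 50% only adds traffic to the new endpoint, it never flips keys back and forth.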

The rate limit configuration in the proxy server has been updated to specify the expected peak load between Microservice A and the new endpoint on Microservice C.
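Proxy-level rate limiting of this kind is commonly a token bucket: each request consumes a token, tokens refill at a fixed rate, and an empty bucket turns into a 429. A rough sketch of the mechanism (illustrative, not the actual proxy's implementation):

```python
class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the proxy would translate this into a 429 response
```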

The HTTP client in Microservice A already has several endpoints that it accesses via the proxy server, so there isn't any new complexity involved in having these systems connected up. 

Logs and metrics are in place, with nothing showing up as problematic after a significant percentage of requests got directed to the new endpoint.

Feature rollout

Everything looked fine for switching over, so the feature flag got turned up so that 100% of requests were routed via the new endpoint.

Things go bump in the night

Something went wrong.

Microservice A started to return error responses for multiple endpoints - not just the one involved in this migration. Automated monitoring picked up that there was an issue and sent out an alert to the person on call.

The person on call wasn't the person who was rolling this new feature out, so it would take some time to get up to speed. It was the middle of the night in the dev team's timezone, so there weren't many colleagues online to help with the investigation.

It took a while to track down some error responses in the logs and metrics. The calls to the new endpoint were being rate limited, but why were other endpoints being impacted?

The load on Microservice A was going up and up as clients of its various endpoints hit it with more and more requests.

Additional instances of Microservice A were provisioned as auto-scaling policies kicked in to cope with the additional traffic.

The feature flag got turned off but the system was struggling to keep up with the load. Unhealthy instances terminated and new instances came online to replace them.

On call personnel from multiple other teams impacted by the unavailability of the endpoints joined the online incident war room call to try to establish what was going on.

It took some time - maybe half an hour or more - but eventually the service seemed to have recovered. It was about four in the morning in the dev team's timezone, but a senior manager based in another part of the world had joined the discussion in the online war room and wanted to get into some root cause analysis.

I joined the war room call at around the time that the system was starting to stabilise. It was almost dawn in New Zealand, so it would be about 4am in the part of Australia where the members of the development team were based. My main contribution to the discussion was to point out the signals that indicated that the system was back to its normal steady state, and that there may not be much benefit in people continuing to try to look at it in the middle of the night. The root cause analysis could wait. Our US based colleagues were now in the middle of their working day, but my teammates in Australia could benefit from some sleep.

What went wrong?

There were multiple layers to this situation that I was able to unpick by looking into the metrics and ultimately the code.

The first problematic aspect of this particular situation was that the rate limiting configuration in the proxy server had an edge case that resulted in the newly specified limit not being applied as expected. This meant that when Microservice A reached peak load, its calls to the endpoint on Microservice C were rate limited by the proxy server.

The second problem involved the configuration of the HTTP client in Microservice A. Although it had circuit-breaking in place to handle problematic responses, that circuit-breaker was not configured to propagate a suitable HTTP response to the callers of Microservice A. So, when the proxy server returned a 429 response for some calls, it would be bubbled up as a 500 response to the callers of Microservice A. When callers of Microservice A got back the 500 response, they would automatically retry the request.

The third problem - which I find to be the most interesting, and something that should have been fed into the PIR document - was that the circuit breaking configuration on the HTTP client within Microservice A was being applied across all of the endpoints that were fronted by the proxy service. This is what dramatically expanded the blast radius of the issue.

What should have been a neatly isolated issue involving a single endpoint blew out to cover multiple endpoints that were much more business critical.

One endpoint's 429 responses resulted in multiple endpoints returning an error response code that callers treated as suitable for immediate retries, creating additional load.
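The blast-radius problem comes down to how the circuit breaker is keyed. A toy sketch of the difference between one breaker shared across every endpoint behind the proxy and one breaker per endpoint (illustrative only; real clients such as Resilience4j let you configure a separate breaker instance per named backend):

```python
from collections import defaultdict

class Breaker:
    """Trips open after `threshold` consecutive failures."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

# Shared keying: every endpoint behind the proxy feeds one breaker,
# so failures on /new-endpoint block calls to unrelated endpoints too.
shared = Breaker()
# Per-endpoint keying: each endpoint gets its own failure count.
per_endpoint = defaultdict(Breaker)

for _ in range(3):
    shared.record(False)                         # /new-endpoint being rate limited
    per_endpoint["/new-endpoint"].record(False)

# With the shared breaker, /critical-endpoint is now blocked as collateral;
# with per-endpoint breakers, the blast radius stays isolated.
```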

Summary

A combination of rate limiting and circuit-breaking made a minor incident into a major one.

This was one case where a more gradual rollout of the feature flag may or may not have prevented the incident, because of the thundering herd effect - which, ironically, is something that circuit-breaking is intended to prevent.
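The retry storm at the heart of this is exactly what exponential backoff with jitter is meant to dampen: instead of every caller retrying immediately on an error, each one waits a randomised, growing delay. A minimal sketch of a "full jitter" delay schedule (illustrative; parameter names are my own):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng=random.random) -> list[float]:
    """Full-jitter backoff: delay i is uniform in [0, min(cap, base * 2**i))."""
    return [rng() * min(cap, base * (2 ** i)) for i in range(attempts)]
```

Spreading retries out like this would not have fixed the shared circuit breaker, but it would have softened the load spike that the immediate retries created.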

Monday, 16 March 2026

Some initial sidenotes about the redundancy

Context

It's easy to get distracted and side-tracked by other things that you happen to notice when applying a change to a non-trivial software system. I've gotten into the discipline of making a note to address that later, then moving on to maintain focus on the work that is the current priority. 

A couple of examples include:

- One of our older services includes an alerting library that references OpsGenie, which is going to be phased out soon (by Atlassian)

- The dependencies of this service include a DataDog library; we haven't been using DataDog for metrics for as long as I have been at the company

Last Thursday I was made redundant along with about 1600 colleagues at Atlassian, so here I am keeping a few notes about what I have been learning from this experience.  

Things I've Been Contemplating Since Starting Garden Leave

Twitter / X Accounts Spouting Nonsense, That Spreads To LinkedIn...

Some Twitter accounts post made up nonsense claiming to be "inside information" about what had been going on in the lead up to redundancies and layoffs at companies. While I can't claim to know about every detail of what has been going on across the entire company, I can smell bullshit in at least one post that claims to be about Atlassian.

A few days after seeing the conspicuous post I came across a reputable account that had called out similar false information from the same account about a different company, so that made it clear to me that I hadn't missed something that was happening at Atlassian.

While holding off on pushing "publish" on this post, I came across someone on LinkedIn repeating the same nonsense. Sure enough at the bottom of their post they cited the dodgy Twitter / X account as their source.

Conspicuous Survey Questions

With the benefit of hindsight, I am wondering whether the answers provided in a recent internal survey may have fed in as data points counting for or against individuals when assessing who was less suited to the increased application of AI.

If I recall correctly, the questions included topics such as:

  • "Do you see yourself working at Atlassian in 12 months time?" 
  • "How likely would you be to recommend Atlassian as a place to work?"
  • "In the last 30 days, how much time has AI saved in your day to day work?"

I think the answers to those first 2 questions would be quite different today, as I was in quite a positive frame of mind up until last Thursday.

Based on the shocked reactions from my team mates, their responses may also shift - if the surveys continue. 

The range of people impacted is significant

It has only been a few days since the announcement, but so far I have heard about quite a few people that I knew who have been caught up in this round of redundancies: 

  • Four of my former team mates
  • A senior recruiter who was involved in hiring me during my first stint at Atlassian
  • The head of department of my former team
  • The main incident response trainer

Saturday, 14 March 2026

Two truths, and an AI

When I joined a company a few years ago there was an established part of the introduction to the wider team, where each new person had their turn to describe three interesting facts about themselves - where one of the "facts" would actually be made up. Then there would be a bit of voting to see which statement was least believable.

Some of my recent experiences with AI reminded me of that "Two truths, and a lie" experience.

Earlier this week I spent a couple of hours delving into what performance characteristics we should expect to get out of a particular configuration of an AWS service. The AI agent surfaced the top handful of performance optimisation recommendations, followed by some tables of estimated performance differences.

Given that our use case was mainly going to involve finding matches between two data sources, I figured that there would almost certainly be further performance benefits available if we worked with sorted data. So, I gave the agent a concise prompt to determine whether that would be worthwhile. What it came back with looked promising, but then looked too good to be true.

The first few paragraphs of the explanation of the benefits available from working with sorted data appeared to make sense, but later in the response there was a single-sentence statement in bold which gave the impression that it was an established fact - along the lines of "The report data is guaranteed to already be sorted". That turned out to be complete nonsense, directly contradicting a statement that was highlighted in the documentation that the AI had been using as a significant reference.

Unlike the icebreaker activity, when an AI agent presents information we don't know whether to have confidence that it will be accurate.

Friday, 13 March 2026

It wasn't a surprise, but it was a surprise

It wasn't totally unexpected

It's Friday 13th March 2026, the news is still sinking in that I have been made redundant due to Atlassian's decision to reduce costs by cutting 1600 from the workforce.

I've been telling friends and family that I'm okay and that it wasn't entirely unexpected. My teammates had been speculating and half-joking about the prospect of layoffs in the last couple of weeks, particularly as one of the developers had been unlucky enough to go through a couple of rounds of redundancies at his previous companies.

Getting mixed signals

Earlier in the week I had received an invitation from the talent acquisition team to participate in interviewing a candidate for a software engineering position next Tuesday, so I quietly thought, "hmm, maybe we're not in a total hiring freeze after all", and also, "oh crap, that goal related to participating in interviews is still going to be relevant this quarter so I'm gonna need to complete that refresher about the interviewing process".

It's just business, numbers matter

The other number being used to describe this round of redundancies is 10% of the Atlassian staff, so being on a team with about ten developers I can speculate that we would be expected to have one person be cut. I'd like to believe in this instance it may have come down to the "last in, first out" approach, as I was the most recent person on the team to join the company.

I've even seen at least one person mention, "On Monday I was promoted, on Wednesday I was made redundant", so performance hasn't been the driver for these cuts.

So far I haven't heard of anyone else in my little corner of the world being caught up in this round of redundancies. As New Zealand doesn't have a particularly significant number of Atlassian employees, I suppose our headcount reduction may fall under "Australia".


The morning of the redundancy announcement email

A morning in the life of a developer

Code reviews

Blocking for required changes

Checked Bitbucket for fresh pull requests on my team's repositories that required approvals before they could be merged in and included in a deploy.

One of the changes involved a private function for calculating some dates that included logic based on the current date. The documentation comment appeared to be incomplete so I couldn't quite tell what it was intended to do.

I added a couple of comments, mainly proposing that the date calculation logic should be extracted out to its own component with a Clock being provided as a dependency so that we could cover it with tests and have control over what the value of "now" would be for the Clock component.
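The Clock-as-dependency idea translates to any language: pass "now" in rather than calling the system clock directly, so tests can pin it. The review comment proposed a java.time-style Clock on a JVM codebase; here's the same idea sketched in Python with an injected callable (the date logic itself is hypothetical, just for illustration):

```python
from datetime import date, timedelta

def next_billing_date(now) -> date:
    """Hypothetical date logic: bill on the 1st of the month after `now()`."""
    today = now()
    first_of_month = today.replace(day=1)
    # Jump safely past the end of this month, then snap back to the 1st.
    return (first_of_month + timedelta(days=32)).replace(day=1)

# In production: next_billing_date(date.today)
# In tests: next_billing_date(lambda: date(2026, 3, 13)) - fully deterministic.
```

Because "now" is a parameter, the tests control the current date and the logic can be covered for month and year boundaries without any clock trickery.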

On this particular day I made the decision to be a little bit stricter in my feedback, so I clicked the "Changes required" option meaning that I would have to come back to re-review the pull request before it could be progressed.

Checking out a branch to see the code in context

For another code review I decided to check out the branch to give it a proper inspection on my work laptop.

This particular codebase was a bit older, so had the potential to include some legacy stuff in there. Specifically, I noticed the configuration of the dependencies mentioned a client for metrics that we haven't been using during either of my stints at Atlassian. 

There were no references to the dependency in the code, but seeing it in the gradle file means it's probably being bundled into the service and deployed.

I didn't even get around to making a little "note to self" for that, so maybe someone on the team will pick up on that some time later.

Post incident reviews

My department has a regularly scheduled session for developers to build up experience and awareness of how to go about addressing incidents that arise across our services. This is a chance to learn from others' mistakes and to give input so that the appropriate mitigations are applied, ensuring that the specific system(s) involved in the recent incident don't get impacted again.

As I had a meeting clash on this particular day, I decided to have an early read-through of the incident summaries and provide my comments in advance of the normal meeting schedule.

My main contribution on this occasion was to query whether the feature flag lookup had been keyed by an id value, as this should have ensured that multiple feature flag lookups during the same processing would return the same result.
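The concern behind that comment is easy to illustrate: if a percentage flag is evaluated without a stable key, two lookups during one piece of processing can disagree mid-flight. Evaluating once per id and reusing the answer keeps the processing consistent. A toy sketch (illustrative, not the actual flag client):

```python
import random

def flag_without_id(percent: int, rng=random.random) -> bool:
    # Evaluated fresh each call: two lookups in one request can disagree.
    return rng() * 100 < percent

class FlagPerRequest:
    """Evaluate the flag once per request id and reuse the answer."""
    def __init__(self, percent: int):
        self.percent = percent
        self._cache: dict[str, bool] = {}

    def enabled(self, request_id: str) -> bool:
        if request_id not in self._cache:
            self._cache[request_id] = flag_without_id(self.percent)
        return self._cache[request_id]
```

Real flag clients usually achieve the same consistency by hashing the id into a bucket, so no cache is needed, but the property is the same: one id, one answer.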

Preparing to upload synthetic data to S3

The story that I was currently working on involved assessing the performance of Amazon Athena for different data formats when a large volume of data is involved.

The previous day I had guided an AI agent to create some Python scripts that would generate files to resemble an S3 inventory report as CSV or Parquet.

The functionality was quite impressive, even if I do say so myself:
- Random generation of the object key (file path)
- Control of the seed for the random generator, so CSV and Parquet would have like-for-like data
- Parallelism to support generation of multiple files at the same time
- Generation of manifest file with representative path
- Script for uploading to S3, also supporting parallelism
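The seeded-generator point is worth showing: if the CSV and Parquet writers draw rows from generators seeded identically, the two datasets are row-for-row comparable. A stdlib-only sketch of the CSV side (the real scripts were AI-generated and also wrote Parquet; the field names here are illustrative, loosely shaped like S3 inventory rows):

```python
import csv
import io
import random
import string

def generate_rows(seed: int, count: int):
    """Deterministic fake inventory rows: same seed, same rows every run."""
    rng = random.Random(seed)
    for _ in range(count):
        key = "/".join(
            "".join(rng.choices(string.ascii_lowercase, k=8)) for _ in range(3)
        )
        yield {"bucket": "demo-bucket", "key": key, "size": rng.randint(1, 10**9)}

def write_csv(seed: int, count: int) -> str:
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["bucket", "key", "size"])
    writer.writeheader()
    writer.writerows(generate_rows(seed, count))
    return out.getvalue()

# A Parquet writer fed by generate_rows(seed, count) with the same seed
# would produce like-for-like data for a fair format comparison.
```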

This particular experiment was deemed to be necessary because our AI agents were not able to give us solid data for their estimates of the performance differences between the data formats. Moving from CSV to Parquet would be a blocker on an existing implementation being usable.

Then things got interesting

A team mate posted a link to Mike's blog post in one of our team Slack channels.
I started reading the blog post, then checked my email...

Ruh roh...

The initial sentences mentioned "may be impacted" in bold, then the content mentioned that access to various systems would be cut off, so it dawned on me that this time my employment with Atlassian would be over.

I didn't get a chance to check in those Python scripts because the company had made the understandable decision to cut off access to Bitbucket.

I had access to the Atlassian Slack from my personal device for most of the rest of the day, so I could let my team know about my situation.

I got several nice mentions in the team channel and in direct messages from my teammates and managers. It came as much of a shock to them as it did to me.

When I went on LinkedIn I started to see a few familiar names showing up mentioning how they too had been caught up in this round of redundancies.

It was kind of reassuring to see the names of people who have been significant contributors - so it's not as though I have been identified as dead wood and cut out based on performance. Apparently there are about 1600 of us in this situation.

Wednesday, 3 December 2025

Having a go at learning some Kotlin

What's this about? 

The year 2025 is almost over, so that means that it has been a bit over a decade since my old colleague Filippo gave a presentation to the development team of ScienceDirect covering the merits of the Kotlin programming language. So, it's about time that I had a proper go at using it.

This blog post is intended to trace what the experience has been like, covering surprises that I encounter along the way.

Getting started

The programming language that I am most experienced with is Java, so I have chosen to try out implementing some functionality in Kotlin from a recent hobby project that I developed in Java involving spinning up a database in a Docker container and running some queries.

JVM version support

IntelliJ IDEA includes some automation for creating a new project, so I selected the relevant options to use the latest LTS version of the Java virtual machine with Spring Boot, Kotlin, PostgreSQL and Testcontainers.

After a few seconds I had a new project in place ready for me to start development, but it didn't quite match up with what I had specified. In the auto-generated HELP.md file there was a mention of:

* The JVM level was changed from '25' to '24' as the Kotlin version does not support Java 25 yet.

Hmm, that's not a limitation that I was expecting to encounter. Java 25 has been out for a couple of months now, and as it is a long term support (LTS) release I would expect that providers of languages and frameworks would prioritise compatibility.

What does it actually mean?

After some brief checking it turns out that Kotlin won't currently compile to JVM 25 bytecode, but that doesn't mean that we can't deploy and run on JVM 25.

Some JUnit annotations are not a clean fit

Static methods?

This may be a bit subjective, but having to set up a companion object and an additional annotation leaves me feeling that Kotlin isn't a great fit for JUnit.

I wanted to specify some particular set up logic to run prior to the tests in my first test class. In Java it would just involve declaring a static method and annotating it, but static methods aren't a directly applicable concept in Kotlin, so it ends up looking like the following:

companion object {
    @BeforeAll
    @JvmStatic
    internal fun setUp() {
        // Here's where the magic happens
    }
}

So it seems like I should also look into what test support libraries are best for idiomatic Kotlin. 

In conclusion...

I'm going to hit "publish" on this post before I have any code to share, as I want it to still be relevant.

Kotlin 2.3.0-RC2 has just been released, and it includes support for Java 25:

https://kotlinlang.org/docs/whatsnew-eap.html#kotlin-jvm-support-for-java-25 

"Starting with Kotlin 2.3.0-RC2, the compiler can generate classes containing Java 25 bytecode."

I may come back later with some follow up post(s) about how I get on with learning some more about Kotlin - preferably the language features rather than adoption rough edges.


Wednesday, 12 November 2025

Redis website had some out of date references

I picked the wrong place to refresh my Redis know-how 

I started this post with the intention of making some notes about my experiences with learning more about how to apply Redis to solve a few common problems, but I stumbled across some out of date content and odd behaviour of the redis.io website so I'm making some notes about that instead.

Rate limiting

Java sample code missing 

This was of interest as it sometimes comes up in interviews, so I wanted to look into an elegant solution.

https://redis.io/learn/howtos/ratelimiting

As I am mainly a Java developer, I followed the instructions for Java but didn't get very far as the repository that is mentioned seems to have been cleared out.

The commit history seems to indicate that the repository has not been updated for a couple of years, so this wasn't a great start.

Python, NodeJS and Ruby all show as still having code in their GitHub repositories.

Website connection timeout

The references section on the rate limiting howto page linked to pages that result in timeouts.
 

I don't know the details, so I'm speculating that somewhere in the configuration between Cloudflare and redis.io the "not found" situation goes into a timeout black hole. To test but not necessarily prove that theory I have attempted to access a non-existent page under the redis.io site and observed that it results in the same timed out error.

What have we learnt?

Not much about Redis, but a little bit about some grey areas of the redis.io website.
 
If you're applying a content delivery network (CDN) to front for your website then be sure to pay attention to how errors will be handled.
 
Handling "not found" URLs isn't difficult, and it's not an optional extra feature - it should be in place.

Reported to the AI Chatbot

Unsurprisingly, the AI chatbot on the redis.io site wasn't able to reason about the Cloudflare configuration or the empty GitHub repository when I mentioned the issues.

Update

I've gone back to check on how the site is behaving now. The rate limiting tutorial page still points to an empty Github repository, but the handling of 404s has improved.