Tuesday, 24 March 2026

Dipping into Claude Code - false start

I've seen a few recommendations to get set up with Claude Code, so I'm checking the website to find out what it has to offer and what would be most suitable for my current needs.

I'm not sure if I have just been unlucky, or if this is normal. The documentation is unavailable due to a configuration error on the site.


Okay, so I suppose that tells us what platform they are hosted on. What does the error code represent?

That's saying that there is a "Hostname/SSL certificate mismatch".

So, one of the main platforms for AI agents has had its documentation website misconfigured.

Given that the maintainers will presumably be using their own technology to assist with their setup, this does not inspire confidence.

My AI assisted development journey

Level 1 - Code completion

If you have written code with a modern editor then you have almost certainly used some form of AI.

Within a given context, the editor will have some awareness of what functions, methods, classes or objects are available in the current scope. As you start to type a name the range of potential options narrows down and the editor may present you with a drop down of options to choose from to finish the current token.

Some examples of editors that will default to offering code completion are:

Visual Studio, IntelliJ IDEA, Eclipse. 

With plugins even Emacs and vim can be configured to provide code completion.

This isn't really AI, as there is nothing providing contextual guidance towards the most appropriate next token.

Level 2 - Generative AI

My first exposure to this was the GitHub Copilot plugin for IntelliJ IDEA.

Around 2024 I joined a pilot program within the company that I was working for at the time, to have access via SSO and GitHub credentials.

The codebase that I was working in was Java with the popular Spring Boot framework, but as it involved a bespoke workflow configuration and an unconventional data store, the agent was of limited use when it came time to try to introduce any new functionality.

I tried out prompting it with a couple of lines of description about the desired method and it would attempt to generate some code.

A couple of years have passed now, but I seem to recall that it would also struggle with the version of the API that was available for the data store.

There wasn't much of a buzz around the company about how productive the tooling was.

Level 3 - Agentic AI

The good

In December 2025 I rejoined Atlassian and tried out the Rovo tooling.

Rovodev had a command line interface that I would start up within the base directory of my project codebase and provide a sentence or two of instructions about what I wanted to achieve.

One of my first development tasks on this new team was to update the handling of a particular exception case so that an appropriate alert would be emitted.

There was an unusual amount of complexity involved, as this particular codebase was several years old but involved a combination of:

AWS Lambdas, Step function flows, Spring configuration, alerting library and tests.

I made an initial attempt at applying the required change, then asked the AI agent to verify the functionality.

I was surprised and delighted that this new tooling was able to pick up on the context involved and flagged up that my implementation would have resulted in the step function configuration retrying - which would have had the undesirable side-effect of also triggering multiple alerts.

After a couple of further iterations of prompting and examination of the configuration, and some reading up of the documentation about step functions I was able to set up the desired functionality in a clean and elegant way.

The bad and/or the ugly

A very different experience of interacting with an AI agent involved prompting it to accurately describe the implications of applying a different approach for storing data at scale.

Looking back on it now, this reminds me of the classic "compare and contrast" style of question that was a particular favourite of one of my Computer Science lecturers.

After about a dozen prompts attempting to clarify the specifics of the performance differences of the data stores, I introduced an additional consideration about whether it would make a difference if the data was stored in a pre-sorted configuration. The AI agent went through its typical cycle of reviewing the context and considering the information available and proceeded to present back a sensible looking evaluation of the implications of using sorted data - but then it confidently asserted that the components involved would already have the data arranged in sorted order. This was a complete contradiction of the documentation.

Important Disclaimer: Rovodev wasn't some new custom Agentic AI, it just delegated to other mainstream AI LLMs.

With that disclaimer out of the way, if it had been a human developer confidently asserting that the best case scenario was the default behaviour of the system, then they would not be popular among their colleagues.

Summary

As of early 2026, there are several types of artificial intelligence systems that can aid software development.

In this post I have given some dumbed-down examples of how far things have progressed in the last year or so.

This is a point-in-time snapshot of some use cases; it is not intended to indicate the state of the art, though aspects can be taken as statements of fact.

Friday, 20 March 2026

When combining service resilience measures elevated incident severity

The incident review exemplar

When I re-joined Atlassian in December 2025 I was given the opportunity to become a Post Incident Review Champion for the team. Basically that meant that I would be available to review documentation around how teams within our department were working to address problems in their software, and provide oversight of my team's approach to working through incident follow-up work.

To get up to speed with the championing expectations I made a point of attending the department's regularly scheduled Zoom call for examining some recent post incident review documents, adding additional comments and flagging up any opportunities to improve.

The calendar invite for the meetings included a link to a review document for an incident as an exemplar of what a good review should look like. When I clicked the link and read into the details I recognised that it was from an incident that some of my previous team's colleagues had been involved in back in 2024. When I finished reading the full content I realised that it didn't even touch on the analysis that I had done shortly after the incident, so it did not include the root cause of how a small change to one feature managed to balloon out to have broad impact on other functionality.

Ironically, my new manager had read through my analysis of this incident and used that as a basis for determining that I would be a suitable person to get involved as a PIR champion.

It was a minor change, with safeguards in place

I won't go into too much detail here because:

  • I wasn't directly involved and it's long enough ago that I can't remember
  • This is a blog post, not a book
  • The technical details are more important than the business processes

Three microservices and an endpoint migration

Before:

Microservice A exposes an HTTP endpoint that in turn calls on Microservice B to look up some data from a data source owned by Microservice B.

After:

Microservice A still exposes the same HTTP endpoint, but calls on Microservice C to look up the data.

A minor detail is that Microservice C happens to sit behind a proxy server that has some responsibilities for access control and rate limiting between client services and the target endpoints.

The Safeguards

The change is controlled by a feature flag in Microservice A, meaning that we can control what percentage of requests transition across to hitting the new endpoint, and can switch back to using the original endpoint on Microservice B if anything goes wrong during the rollout.
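A percentage-based rollout like this is typically implemented by bucketing each request deterministically. A minimal sketch of the idea in Python (the function and names here are illustrative, not the actual flag system that was used):

```python
import hashlib

def use_new_endpoint(request_id: str, rollout_percent: int) -> bool:
    """Deterministically place a request into a bucket from 0-99 and
    route it to the new endpoint if the bucket falls under the current
    rollout percentage. The same request id always lands in the same
    bucket, so routing stays stable as the percentage is dialled up."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

At 100% every bucket passes the check, which is the state the system was in when things went wrong.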

The rate limit configuration in the proxy server has been updated to specify the expected peak load between Microservice A and the new endpoint on Microservice C.

The HTTP client in Microservice A already has several endpoints that it accesses via the proxy server, so there isn't any new complexity involved in having these systems connected up. 

Logs and metrics are in place, with nothing showing up as problematic after a significant percentage of requests got directed to the new endpoint.

Feature rollout

Everything looked fine for switching over, so the feature flag got turned up to 100% of requests to be routed to go via the new endpoint.

Things go bump in the night

Something went wrong.

Microservice A started to return error responses for multiple endpoints - not just the one involved in this migration. Automated monitoring picked up that there was an issue and sent out an alert to the person on call.

The person on call wasn't the person who was rolling this new feature out, so would take some time to get up to speed. It was the middle of the night in the dev team's timezone, so there wasn't a bunch of colleagues online to help the investigation.

It took a while to track down some error responses in the logs and metrics. The calls to the new endpoint were being rate limited, but why were other endpoints being impacted?

The load on Microservice A was going up and up as clients of its various endpoints hit it with more and more requests.

Additional instances of Microservice A were provisioned as auto-scaling policies kicked in to cope with the additional traffic.

The feature flag got turned off but the system was struggling to keep up with the load. Unhealthy instances terminated and new instances came online to replace them.

On call personnel from multiple other teams impacted by the unavailability of the endpoints joined the online incident war room call to try to establish what was going on.

It took some time - maybe half an hour or more - but eventually the service seemed to have recovered. It was about four in the morning in the dev team's timezone, but a senior manager based in another part of the world had joined the discussion in the online war room and wanted to get into some root cause analysis.

I joined the war room call at around the time that the system was starting to stabilise. It was almost dawn in New Zealand, so would be about 4am in some part of Australia where the members of the development team were based. My main contribution to the discussion was to point out the signals that indicated that the system was back to its normal steady state, and that there may not be much benefit in people continuing to try to look at it in the middle of the night. The root cause analysis could wait. Our US based colleagues were now in the middle of their working day, but my teammates in Australia could benefit from some sleep.

What went wrong?

There were multiple layers to this situation that I was able to unpick by looking into the metrics and ultimately the code.

The first problematic aspect of this particular situation was that the rate limiting configuration in the proxy server had an edge case that resulted in the newly specified limit not being applied the way that it was expected to be. This meant that when Microservice A started to receive peak load it would be rate limited by the proxy server when it attempted to call the endpoint on Microservice C.

The second problem involved the configuration of the HTTP client in Microservice A. Although it had circuit-breaking in place to handle problematic responses, that circuit-breaker was not configured to propagate a suitable HTTP response to the callers of Microservice A. So, when the proxy server presented out a 429 response for some calls it would be bubbled up as a 500 response to the callers of Microservice A. When callers of Microservice A got back the 500 response they would automatically retry the request.
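The response-mapping problem can be illustrated with a small sketch (hypothetical, not the real client configuration): the naive mapping collapses every upstream failure into a 500, while the fix preserves the 429 so callers know to back off rather than retry.

```python
def map_upstream_status(upstream_status: int) -> int:
    """The problematic behaviour: every upstream failure, including a
    429 from the proxy's rate limiter, is surfaced as a generic 500,
    which callers treat as retryable."""
    return 200 if upstream_status < 400 else 500

def map_upstream_status_fixed(upstream_status: int) -> int:
    """Propagate 429 unchanged so callers can back off instead of
    immediately retrying and adding load."""
    if upstream_status == 429:
        return 429
    return 200 if upstream_status < 400 else 500
```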

The third problem - which I find to be the most interesting, and something that should have been fed into the PIR document - was that the circuit breaking configuration on the HTTP client within Microservice A was being applied across all of the endpoints that were fronted by the proxy service. This is what dramatically expanded the blast radius of the issue.
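One way to avoid that blast radius expansion is to key the circuit breaker per endpoint rather than sharing a single breaker across everything behind the proxy. A toy sketch of the distinction (a real client would use a proper resilience library; the threshold and names here are made up):

```python
from collections import defaultdict

class CircuitBreaker:
    """Trivial count-based breaker: opens after `threshold` recorded
    failures and then rejects further calls."""
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    @property
    def is_open(self) -> bool:
        return self.failures >= self.threshold

    def record_failure(self) -> None:
        self.failures += 1

# One breaker per endpoint path, so a rate-limited endpoint trips only
# its own breaker instead of taking every proxied endpoint down with it.
per_endpoint = defaultdict(CircuitBreaker)

def call(path: str, failed: bool) -> bool:
    """Returns True if the call was allowed through the breaker."""
    breaker = per_endpoint[path]
    if breaker.is_open:
        return False
    if failed:
        breaker.record_failure()
    return True
```

With a single shared breaker, the failures recorded for the rate-limited path would have opened the circuit for the business-critical paths too.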

What should have been a neatly isolated issue involving a single endpoint, blew out to cover multiple endpoints that were much more business critical.

One endpoint's 429 (Too Many Requests) responses resulted in multiple endpoints presenting out an error response code that callers treated as suitable for immediate retries, creating additional load.

Summary

A combination of rate limiting and circuit-breaking made a minor incident into a major one.

This was one case where a more gradual rollout of the feature flag may or may not have prevented the incident, because of the thundering herd effect - which, ironically, is something that circuit-breaking is intended to prevent.

Monday, 16 March 2026

Some initial sidenotes about the redundancy

Context

It's easy to get distracted and side-tracked by other things that you happen to notice when applying a change to a non-trivial software system. I've gotten into the discipline of making a note to address that later, then moving on to maintain focus on the work that is the current priority. 

A couple of examples include:

- One of our older services includes an alerting library that references OpsGenie, which is going to be phased out soon (by Atlassian)

- The dependencies of this service include a DataDog library, even though we haven't been using DataDog for metrics for as long as I have been at the company

Last Thursday I was made redundant along with about 1600 colleagues at Atlassian, so here I am keeping a few notes about what I have been learning from this experience.  

Things I've Been Contemplating Since Starting Garden Leave

Twitter / X Accounts Spouting Nonsense, That Spreads To LinkedIn...

Some Twitter accounts post made up nonsense claiming to be "inside information" about what had been going on in the lead up to redundancies and layoffs at companies. While I can't claim to know about every detail of what has been going on across the entire company, I can smell bullshit in at least one post that claims to be about Atlassian.

A few days after seeing the conspicuous post I came across a reputable account that had called out similar false information from the same account about a different company, so that made it clear to me that I hadn't missed something that was happening at Atlassian.

While holding off on pushing "publish" on this post, I came across someone on LinkedIn repeating the same nonsense. Sure enough at the bottom of their post they cited the dodgy Twitter / X account as their source.

Conspicuous Survey Questions

With the benefit of hindsight, I am wondering whether the answers provided in a recent internal survey may have fed in as data points for or against individuals when assessing who was less suited to the increased application of AI.

If I recall correctly, the questions included topics such as:

  • "Do you see yourself working at Atlassian in 12 months time?" 
  • "How likely would you be to recommend Atlassian as a place to work?"
  • "In the last 30 days, how much time has AI saved in your day to day work?"

I think the answers to those first 2 questions would be quite different today, as I was in quite a positive frame of mind up until last Thursday.

Based on the shocked reactions from my team mates, their responses may also shift - if the surveys continue.

An internal blog post with more em dashes than I have seen in my life

Perhaps a slight exaggeration, but there was a blog post that went around pushing for major changes and heavy investment in setting up environments for AI to run in. The post did have a lot of em dash characters, which is something that has been described as a symptom of content that was generated using AI.

The range of people impacted is significant

It has only been a few days since the announcement, but so far I have heard about quite a few people that I knew who have been caught up in this round of redundancies: 

  • Four of my former team mates
  • A senior recruiter who was involved in hiring me my first time at Atlassian
  • The head of department of my former team
  • The main incident response trainer

What's not in the headlines

It's kind of odd that a major technology company can part ways with the CTO and that doesn't make any headlines. In most articles it doesn't even get a mention (maybe I'm just basing this on the commentary on LinkedIn, as opposed to news articles).


Saturday, 14 March 2026

Two truths, and an AI

When I joined a company a few years ago there was an established part of the introduction to the wider team, where each new person had their turn to describe three interesting facts about themselves - where one of the "facts" would actually be made up. Then there would be a bit of voting to see which statement was least believable.

Some of my recent experiences with AI reminded me of that "Two truths, and a lie" experience.

Earlier this week I spent a couple of hours delving into what performance characteristics we should expect to get out of a particular configuration of an AWS service. The AI agent surfaced up the top handful of performance optimisation recommendations, followed up with some tables of numbers for estimated performance differences involved.

Given that our use case was mainly going to involve finding matches between two data sources, I figured that there would almost certainly be further performance benefits available if we worked with sorted data. So, I gave the agent a concise prompt to determine whether that would be worthwhile. What it came back with looked promising, but then looked too good to be true.
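The intuition about sorted data is sound: when the workload is matching keys between two sources, pre-sorted inputs allow a single-pass merge join instead of building a hash table over one side. A minimal illustration of the algorithmic idea (nothing Athena-specific):

```python
def merge_join(left: list, right: list) -> list:
    """Find keys present in both *sorted* sequences in one O(n + m)
    pass, advancing whichever side currently holds the smaller key."""
    matches = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matches
```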

The first few paragraphs of the explanation of the benefits available from working with sorted data appeared to make sense, but later in the response there was a single sentence statement with bold formatting which gave the impression that it was an established fact - along the lines of "The report data is guaranteed to already be sorted". That turned out to be complete nonsense, directly contradicting a statement that was highlighted in the documentation that the AI had been using as a significant reference.

Unlike the icebreaker activity, when an AI agent presents information we don't know whether to have confidence that it will be accurate.

Friday, 13 March 2026

It wasn't a surprise, but it was a surprise

It wasn't totally unexpected

It's Friday 13th March 2026, the news is still sinking in that I have been made redundant due to Atlassian's decision to reduce costs by cutting 1600 from the workforce.

I've been telling friends and family that I'm okay and that it wasn't entirely unexpected. My teammates had been speculating and half-joking about the prospect of layoffs in the last couple of weeks, particularly as one of the developers had been unlucky enough to go through a couple of rounds of redundancies at his previous companies.

Getting mixed signals

Earlier in the week I had received an invitation from the talent acquisition team to participate in interviewing a candidate for a software engineering position next Tuesday, so I quietly thought, "hmm, maybe we're not in a total hiring freeze after all", and also, "oh crap, that goal related to participating in interviews is still going to be relevant this quarter so I'm gonna need to complete that refresher about the interviewing process".

It's just business, numbers matter

The other number being used to describe this round of redundancies is 10% of the Atlassian staff, so being on a team with about ten developers I can speculate that we would be expected to have one person be cut. I'd like to believe in this instance it may have come down to the "last in, first out" approach, as I was the most recent person on the team to join the company.

I've even seen at least one person mention, "On Monday I was promoted, on Wednesday I was made redundant", so performance hasn't been the driver for these cuts.

So far I haven't heard of anyone else in my little corner of the world being caught up in this round of redundancies. As New Zealand doesn't have a particularly significant number of Atlassian employees, I suppose our headcount reduction may fall under "Australia".


The morning of the redundancy announcement email

A morning in the life of a developer

Code reviews

Blocking for required changes

Checked Bitbucket for fresh pull requests on my team's repositories that require approvals before they could be merged in and included in a deploy.

One of the changes involved a private function for calculating some dates that included logic based on the current date. The documentation comment appeared to be incomplete so I couldn't quite tell what it was intended to do.

I added a couple of comments, mainly proposing that the date calculation logic should be extracted out to its own component with a Clock being provided as a dependency so that we could cover it with tests and have control over what the value of "now" would be for the Clock component.
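The codebase in question was JVM-based (where java.time.Clock fills this role), but the idea translates to any language: inject the source of "now" so tests can pin it. A sketch in Python, with an invented date rule standing in for the actual logic under review:

```python
from datetime import date, timedelta
from typing import Callable

class DueDateCalculator:
    """Date logic with the clock injected as a dependency, so tests
    control what 'now' is instead of depending on the real system date."""
    def __init__(self, today: Callable[[], date] = date.today):
        self._today = today

    def next_business_day(self) -> date:
        d = self._today() + timedelta(days=1)
        while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            d += timedelta(days=1)
        return d
```

A test can pass `lambda: date(2026, 3, 13)` (a Friday) and assert the result is Monday the 16th, with no dependence on when the test actually runs.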

On this particular day I made the decision to be a little bit stricter in my feedback, so I clicked the "Changes required" option meaning that I would have to come back to re-review the pull request before it could be progressed.

Checking out a branch to see the code in context

For another code review I decided to check out the branch to give it a proper inspection on my work laptop.

This particular codebase was a bit older, so had the potential to include some legacy stuff in there. Specifically, I noticed the configuration of the dependencies mentioned a client for metrics that we haven't been using during either of my stints at Atlassian. 

There were no references to the dependency in the code, but seeing it in the Gradle file means it's probably being bundled into the service and deployed.

I didn't even get around to making a little "note to self" for that, so maybe someone on the team will pick up on that some time later.

Post incident reviews

My department has a regularly scheduled session for developers to build up experience and awareness of how to go about addressing incidents that arise across our services. This is a chance to learn from others' mistakes and to give input to ensure that the appropriate mitigations are applied to ensure that the specific system(s) involved in the recent incident don't get impacted again.

As I had a meeting clash on this particular day, I decided to have an early read-through of the incident summaries and provide my comments in advance of the normal meeting schedule.

My main contribution on this occasion was to query whether a feature flag had been evaluated with an id value, as this should have ensured that the same result was returned when multiple feature flag lookups were involved in the processing.
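The underlying pattern is to evaluate the flag once per entity id and reuse the cached decision for the rest of the processing, so repeated lookups cannot disagree mid-flight. A hypothetical sketch (not the actual flag client):

```python
class ConsistentFlagEvaluator:
    """Cache flag decisions per (flag, entity id) so that repeated
    lookups during one unit of processing always return the same result,
    even if the underlying flag state changes partway through."""
    def __init__(self, evaluate):
        self._evaluate = evaluate  # e.g. a call out to the flag service
        self._cache = {}

    def is_enabled(self, flag: str, entity_id: str) -> bool:
        key = (flag, entity_id)
        if key not in self._cache:
            self._cache[key] = self._evaluate(flag, entity_id)
        return self._cache[key]
```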

Preparing to upload synthetic data to S3

The story that I was currently working on involved assessing the performance of Amazon Athena for different data formats when a large volume of data is involved.

The previous day I had guided an AI agent to create some Python scripts that would generate files to resemble an S3 inventory report as CSV or Parquet.

The functionality was quite impressive, even if I do say so myself:
- Random generation of the object key (file path)
- Control of the seed for random generator, so CSV and Parquet would have like-for-like data
- Parallelism to support generation of multiple files at the same time
- Generation of manifest file with representative path
- Script for uploading to S3, also supporting parallelism
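The seed control is the piece that makes like-for-like comparison possible: both the CSV and Parquet writers consume the same deterministic row stream. A simplified sketch (the field names merely resemble an S3 inventory report; the real scripts had more columns plus parallelism):

```python
import random

def generate_inventory_rows(seed: int, count: int) -> list:
    """Produce deterministic synthetic rows. Re-running with the same
    seed yields identical rows, so files written in different formats
    (CSV, Parquet) contain exactly the same data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        rows.append({
            "key": "data/%08x/object-%06d"
                   % (rng.getrandbits(32), rng.randrange(10**6)),
            "size": rng.randrange(1, 10**9),
        })
    return rows
```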

This particular experiment was deemed to be necessary because our AI agents were not able to give us solid data for their estimation of performance differences between the data formats. Moving from CSV to Parquet would be a blocker on an existing implementation being usable.

Then things got interesting

A team mate posted a link to Mike's blog post in one of our team Slack channels.
I started reading the blog post, then checked my email...

Ruh roh...

The initial sentences mentioned "may be impacted" in bold, then the content mentioned that access to various systems would be cut off, so it dawned on me that this time my employment with Atlassian would be over.

I didn't get a chance to check in those Python scripts because the company had made the understandable decision to cut off access to Bitbucket.

I had access to the Atlassian Slack from my personal device for most of the rest of the day, so I could let my team know about my situation.

I got several nice mentions in the team channel and in direct messages from my teammates and managers. It came as much of a shock to them as it did to me.

When I went on LinkedIn I started to see a few familiar names showing up mentioning how they too had been caught up in this round of redundancies.

It was kind of reassuring to see the names of people who have been significant contributors - so it's not as though I have been identified as dead wood and cut out based on performance. Apparently there are about 1600 of us in this situation.