Wednesday, 25 March 2026

Preventing credential leakage through compromised build pipeline

Context

Recently a sophisticated attack injected malicious behaviour into some Trivy image scanning software, resulting in pipelines running malicious code that extracted sensitive content and injected malicious behaviour into artifacts.

I'd like to believe that the organisations that I have worked for have had robust safeguards in place that would have prevented them from being vulnerable to this type of attack, but to keep myself busy and "never let a good crisis go to waste" I am jotting down some thoughts around how such a compromise could be detected and prevented from inflicting damage.

I will base my evaluation on the assumption that the artifacts involved here are a scanning tool running within a build pipeline, and a Docker image that has been generated as part of an earlier stage of the pipeline process. 

Clean separation of concerns

Do not include any sensitive data in the container image

If there is no sensitive data - such as API keys, tokens, credentials or other secrets - in the artifact being scanned, then a compromised tool cannot find it to expose it.

The build environment should be completely isolated from the environment(s) that the container will ultimately be running in, so even if the mechanism for obtaining credentials was discovered it should not be possible to leak secrets.

Don't allow arbitrary network egress

By locking down the build environment to only have access to the resources that are relevant to the processes involved we can reduce the scope of what a compromised system could do.

If the malicious software cannot phone home then it cannot expose what it has found, or update itself with further actions, such as being triggered as part of a botnet.
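As a sketch of what that lockdown could look like: if the build agents happened to run on Kubernetes, an egress allowlist might be expressed as a NetworkPolicy. The namespace, labels and CIDR below are hypothetical, not from any actual pipeline:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: build-egress-allowlist
  namespace: ci              # hypothetical namespace for build agents
spec:
  podSelector:
    matchLabels:
      role: build-agent      # hypothetical label
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 10.0.20.0/24   # internal artifact registry only
      ports:
        - protocol: TCP
          port: 443
```

Anything not explicitly allowed - including a scanner trying to phone home - gets dropped, and the dropped traffic itself becomes a detection signal.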

Alongside prevention, we would ideally also want detection to pick up if and when an unexpected attempt is made to call out to the network.

Read-only access

Given that the responsibility of the scanner should be limited to reading the content of the container image, it is reasonable not to permit write access to the files involved.

As an additional safeguard, the image generation stage should write a checksum, so that the integrity of the container image can be verified as intact between build and deploy.
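A minimal Python sketch of that checksum idea (the file naming convention is an assumption for illustration, not any actual pipeline's layout):

```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Stream the file through SHA-256 so large images don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_checksum(image_path: str) -> None:
    """At build time: record the checksum alongside the image artifact."""
    Path(image_path + ".sha256").write_text(sha256_of(image_path))

def verify_checksum(image_path: str) -> bool:
    """At deploy time: a mismatch means the image was altered after build."""
    expected = Path(image_path + ".sha256").read_text().strip()
    return sha256_of(image_path) == expected
```

If the scanner (or anything else in between) tampers with the image, the deploy-time verification fails and the release can be halted.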

Observability

The sophistication and frequency of software supply chain attacks is only going to continue to increase, so it is more vital than ever that we treat our build and deploy environments with the same level of scrutiny and attention to detail as our top tier production services.

Logs, metrics and alerts must be in place in the build pipeline to enable teams to stay ahead of attackers and continuously improve the resilience and integrity of the release processes. 

Does size matter for AI context files?

Context 

The projects that worked best with AI agents had been set up with files to provide context for the AI to apply when responding to prompts.

My team wasn't paying for the use of the underlying service, so we didn't need to concern ourselves with the cost / benefit analysis.

Now that I have a bit of independence and time to myself, I'm going to look into how those guiding references impact the cost of using AI agents.

Basically, I want to understand a bit more about how files like AGENTS.md work - are they somehow applied locally, or do they get fed into the interactions with the model?

Findings 

I went to Google "agents.md", but landed on a website with that as its domain name instead. Luckily enough, that site was specifically set up to document what the AGENTS.md file is all about.

(Sidenote: I no longer rely on Google for searching the Internet, as they are so dysfunctional that they cannot index this Blog, even though Google owns the Blogger platform).

Digging around a little further, I came across the following that does a great job of summarising what I suspected would be some pros and cons of specifying context in the AGENTS.md file:

https://www.aihero.dev/a-complete-guide-to-agents-md

The basic gist is:

  • The full content of the file is pulled in every time that the AI agent comes to respond to a prompt, so if the file grows then tokens get consumed.
  • By establishing separate reference documents for the agent to link through to for specific purposes we can narrow down the scope of what is relevant, and reduce token consumption.
  • In larger code repositories, you can specify an appropriate hierarchy of AGENTS.md files, so the agent can pick up the context that is most relevant to the current location.
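To make that hierarchy idea concrete, a repository might be laid out along these lines (a hypothetical layout for illustration, not something mandated by the agents.md documentation):

```
repo-root/
├── AGENTS.md          # broad conventions: build commands, test commands, style
├── docs/
│   └── testing.md     # linked from AGENTS.md, pulled in only when relevant
├── frontend/
│   └── AGENTS.md      # frontend-specific context
└── backend/
    └── AGENTS.md      # backend-specific context
```

The root file stays small and links out, while the nested files only contribute tokens when the agent is working in that part of the tree.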

Summary

As of March 2026, interactions with AI models involve tokens as the unit of consumption.

When we provide additional context for agents to respond to our prompts we need to be budget conscious about the token cost that is involved.

By carefully structuring AGENTS.md and supporting documentation we can gain the benefits of agents having sufficient context, without queries incurring excessive token costs from superfluous context.

Friday, 20 March 2026

When combining service resilience measures elevated incident severity

The incident review exemplar

When I re-joined Atlassian in December 2025 I was given the opportunity to become a Post Incident Review Champion for the team. Basically that meant that I would be available to review documentation around how teams within our department were working to address problems in their software, and provide oversight of my team's approach to working through incident follow-up work.

To get up to speed with the championing expectations I made a point of attending the department's regularly scheduled Zoom call for examining some recent post incident review documents, adding additional comments and flagging up any opportunities to improve.

The calendar invite for the meetings included a link to a review document for an incident as an exemplar of what a good review should look like. When I clicked the link and read into the details I recognised that it was from an incident that some of my previous team's colleagues had been involved in back in 2024. When I finished reading the full content I realised that it didn't even touch on the analysis that I had done shortly after the incident, so it did not include the root cause of how a small change to one feature managed to balloon out to have broad impact on other functionality.

Ironically, my new manager had read through my analysis of this incident and used that as a basis for determining that I would be a suitable person to get involved as a PIR champion.

It was a minor change, with safeguards in place

I won't go into too much detail here because:

  • I wasn't directly involved and it's long enough ago that I can't remember
  • This is a blog post, not a book
  • The technical details are more important than the business processes

Three microservices and an endpoint migration

Before:

Microservice A exposes an HTTP endpoint that in turn calls on Microservice B to look up some data from a data source owned by Microservice B.

After:

Microservice A still exposes the same HTTP endpoint, but calls on Microservice C to look up the data.

A minor detail is that Microservice C happens to sit behind a proxy server that has some responsibilities for access control and rate limiting between client services and the target endpoints.

The Safeguards

The change is controlled by a feature flag in Microservice A, meaning that we can control what percentage of requests transition across to hitting the new endpoint, and can switch back to using the original endpoint on Microservice B if anything goes wrong during the rollout.

The rate limit configuration in the proxy server has been updated to specify the expected peak load between Microservice A and the new endpoint on Microservice C.

The HTTP client in Microservice A already has several endpoints that it accesses via the proxy server, so there isn't any new complexity involved in having these systems connected up. 

Logs and metrics are in place, with nothing showing up as problematic after a significant percentage of requests got directed to the new endpoint.

Feature rollout

Everything looked fine for switching over, so the feature flag got turned up to route 100% of requests via the new endpoint.

Things go bump in the night

Something went wrong.

Microservice A started to return error responses for multiple endpoints - not just the one involved in this migration. Automated monitoring picked up that there was an issue and sent out an alert to the person on call.

The person on call wasn't the person who was rolling this new feature out, so would take some time to get up to speed. It was the middle of the night in the dev team's timezone, so there wasn't a bunch of colleagues online to help the investigation.

It took a while to track down some error responses in the logs and metrics. The calls to the new endpoint were being rate limited, but why were other endpoints being impacted?

The load on Microservice A was going up and up as clients of its various endpoints hit it with more and more requests.

Additional instances of Microservice A were provisioned as auto-scaling policies kicked in to cope with the additional traffic.

The feature flag got turned off but the system was struggling to keep up with the load. Unhealthy instances terminated and new instances came online to replace them.

On call personnel from multiple other teams impacted by the unavailability of the endpoints joined the online incident war room call to try to establish what was going on.

It took some time - maybe half an hour or more - but eventually the service seemed to have recovered. It was about four in the morning in the dev team's timezone, but a senior manager based in another part of the world had joined the discussion in the online war room and wanted to get into some root cause analysis.

I joined the war room call at around the time that the system was starting to stabilise. It was almost dawn in New Zealand, so would be about 4am in some part of Australia where the members of the development team were based. My main contribution to the discussion was to point out the signals that indicated that the system was back to its normal steady state, and that there may not be much benefit in people continuing to try to look at it in the middle of the night. The root cause analysis could wait. Our US based colleagues were now in the middle of their working day, but my teammates in Australia could benefit from some sleep.

What went wrong?

There were multiple layers to this situation that I was able to unpick by looking into the metrics and ultimately the code.

The first problematic aspect of this particular situation was that the rate limiting configuration in the proxy server had an edge case that resulted in the newly specified limit not being applied the way that it was expected to. This meant that when Microservice A started to receive peak load, it would be rate limited by the proxy server when it attempted to call the endpoint on Microservice C.

The second problem involved the configuration of the HTTP client in Microservice A. Although it had circuit-breaking in place to handle problematic responses, that circuit-breaker was not configured to propagate a suitable HTTP response to the callers of Microservice A. So, when the proxy server presented out a 429 response for some calls it would be bubbled up as a 500 response to the callers of Microservice A. When callers of Microservice A got back the 500 response they would automatically retry the request.

The third problem - which I find to be the most interesting, and something that should have been fed into the PIR document - was that the circuit breaking configuration on the HTTP client within Microservice A was being applied across all of the endpoints that were fronted by the proxy service. This is what dramatically expanded the blast radius of the issue.

What should have been a neatly isolated issue involving a single endpoint, blew out to cover multiple endpoints that were much more business critical.

One endpoint's 429 (Too Many Requests) responses resulted in multiple endpoints presenting out an error response code that callers treated as suitable for immediate retries, creating additional load.
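To illustrate how the scoping of the breaker changed the blast radius, here is a minimal Python sketch - a hypothetical breaker class, not the actual HTTP client involved - contrasting one breaker shared across all proxied endpoints with one breaker per endpoint:

```python
from collections import defaultdict

class CircuitBreaker:
    """Trips open after a threshold of consecutive failures."""
    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def record(self, success: bool) -> None:
        self.failures = 0 if success else self.failures + 1

# One breaker shared by every endpoint behind the proxy (the incident's setup).
shared = CircuitBreaker()

# One breaker per endpoint (what would have contained the blast radius).
per_endpoint = defaultdict(CircuitBreaker)

# The newly migrated endpoint gets rate limited (429s) repeatedly.
for _ in range(5):
    shared.record(success=False)
    per_endpoint["/new-endpoint"].record(success=False)

assert shared.open                            # shared breaker now rejects ALL endpoints
assert per_endpoint["/new-endpoint"].open     # only the migrated endpoint is blocked
assert not per_endpoint["/other-endpoint"].open
```

With the shared breaker, failures on one rate-limited endpoint flip the breaker for every business-critical endpoint routed through the same proxy, which is exactly the expansion of scope described above.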

Summary

A combination of rate limiting and circuit-breaking made a minor incident into a major one.

This was one case where a more gradual rollout of the feature flag may or may not have prevented the incident, due to the thundering herd effect - which is ironically something that circuit-breaking is intended to prevent.

Monday, 16 March 2026

Some initial sidenotes about the redundancy

Context

It's easy to get distracted and side-tracked by other things that you happen to notice when applying a change to a non-trivial software system. I've gotten into the discipline of making a note to address that later, then moving on to maintain focus on the work that is the current priority. 

A couple of examples include:

- One of our older services includes an alerting library that references OpsGenie, which is going to be phased out soon (by Atlassian)

- The dependencies of this service include a DataDog library, but we haven't been using DataDog for metrics for as long as I have been at the company

Last Thursday I was made redundant along with about 1600 colleagues at Atlassian, so here I am keeping a few notes about what I have been learning from this experience.  

Things I've Been Contemplating Since Starting Garden Leave

Twitter / X Accounts Spouting Nonsense, That Spreads To LinkedIn...

Some Twitter accounts post made up nonsense claiming to be "inside information" about what had been going on in the lead up to redundancies and layoffs at companies. While I can't claim to know about every detail of what has been going on across the entire company, I can smell bullshit in at least one post that claims to be about Atlassian.

A few days after seeing the conspicuous post I came across a reputable account that had called out similar false information from the same account about a different company, so that made it clear to me that I hadn't missed something that was happening at Atlassian.

While holding off on pushing "publish" on this post, I came across someone on LinkedIn repeating the same nonsense. Sure enough at the bottom of their post they cited the dodgy Twitter / X account as their source.

Conspicuous Survey Questions

With the benefit of hindsight, I am wondering whether the answers provided in a recent internal survey may have fed in as data points counting for or against individuals when assessing who was less suited to the increased application of AI.

If I recall correctly, the questions included topics such as:

  • "Do you see yourself working at Atlassian in 12 months time?" 
  • "How likely would you be to recommend Atlassian as a place to work?"
  • "In the last 30 days, how much time has AI saved in your day to day work?"

I think the answers to those first two questions would be quite different today, as I was in quite a positive frame of mind up until last Thursday.

Based on the shocked reactions from my team mates, their responses may also shift - if the surveys continue.

An internal blog post with more em dashes than I have seen in my life

Perhaps a slight exaggeration, but there was a blog post that went around pushing for major changes and heavy investment in setting up environments for AI to run in. The post did have a lot of em dash characters, which is something that has been described as a symptom of content that was generated using AI.

The range of people impacted is significant

It has only been a few days since the announcement, but so far I have heard about quite a few people that I knew who have been caught up in this round of redundancies: 

  • Four of my former team mates
  • A senior recruiter who was involved in hiring me the first time I joined Atlassian
  • The head of department of my former team
  • The main incident response trainer

What's not in the headlines

It's kind of odd that a major technology company can part ways with the CTO and that doesn't make any headlines. In most articles it doesn't even get a mention (maybe I'm just basing this on the commentary on LinkedIn, as opposed to news articles).


Saturday, 14 March 2026

Two truths, and an AI

When I joined a company a few years ago there was an established part of the introduction to the wider team, where each new person had their turn to describe three interesting facts about themselves - where one of the "facts" would actually be made up. Then there would be a bit of voting to see which statement was least believable.

Some of my recent experiences with AI reminded me of that "Two truths, and a lie" experience.

Earlier this week I spent a couple of hours delving into what performance characteristics we should expect to get out of a particular configuration of an AWS service. The AI agent surfaced up the top handful of performance optimisation recommendations, followed up with some tables of numbers for estimated performance differences involved.

Given that our use case was mainly going to involve finding matches between two data sources, I figured that there would almost certainly be further performance benefits available if we worked with sorted data. So, I gave the agent a concise prompt to determine whether that would be worthwhile. What it came back with looked promising, but then looked too good to be true.

The first few paragraphs of the explanation of the benefits available from working with sorted data appeared to make sense, but later in the response there was a single sentence statement with bold formatting which gave the impression that it was an established fact - along the lines of "The report data is guaranteed to already be sorted". That turned out to be complete nonsense, directly contradicting a statement that was highlighted in the documentation that the AI had been using as a significant reference.

Unlike the icebreaker activity, when an AI agent presents information we don't know whether to have confidence that it will be accurate.

Friday, 13 March 2026

It wasn't a surprise, but it was a surprise

It wasn't totally unexpected

It's Friday 13th March 2026, the news is still sinking in that I have been made redundant due to Atlassian's decision to reduce costs by cutting 1600 from the workforce.

I've been telling friends and family that I'm okay and that it wasn't entirely unexpected. My teammates had been speculating and half-joking about the prospect of layoffs in the last couple of weeks, particularly as one of the developers had been unlucky enough to go through a couple of rounds of redundancies at his previous companies.

Getting mixed signals

Earlier in the week I had received an invitation from the talent acquisition team to participate in interviewing a candidate for a software engineering position next Tuesday, so I quietly thought, "hmm, maybe we're not in a total hiring freeze after all", and also, "oh crap, that goal related to participating in interviews is still going to be relevant this quarter so I'm gonna need to complete that refresher about the interviewing process".

It's just business, numbers matter

The other number being used to describe this round of redundancies is 10% of the Atlassian staff, so being on a team with about ten developers I can speculate that we would be expected to have one person be cut. I'd like to believe in this instance it may have come down to the "last in, first out" approach, as I was the most recent person on the team to join the company.

I've even seen at least one person mention, "On Monday I was promoted, on Wednesday I was made redundant", so performance hasn't been the driver for these cuts.

So far I haven't heard of anyone else in my little corner of the world being caught up in this round of redundancies. As New Zealand doesn't have a particularly significant number of Atlassian employees, I suppose our headcount reduction may fall under "Australia".


The morning of the redundancy announcement email

A morning in the life of a developer

Code reviews

Blocking for required changes

I checked Bitbucket for fresh pull requests on my team's repositories that required approvals before they could be merged in and included in a deploy.

One of the changes involved a private function for calculating some dates that included logic based on the current date. The documentation comment appeared to be incomplete so I couldn't quite tell what it was intended to do.

I added a couple of comments, mainly proposing that the date calculation logic should be extracted out to its own component with a Clock being provided as a dependency so that we could cover it with tests and have control over what the value of "now" would be for the Clock component.
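The actual codebase is JVM-based, but the idea I was proposing can be sketched in Python (all names here are hypothetical): inject the clock as a dependency so that tests control what "now" is.

```python
from datetime import date, timedelta
from typing import Callable

class DueDateCalculator:
    """Date logic with the clock injected, so 'today' is controllable in tests."""
    def __init__(self, today: Callable[[], date] = date.today):
        self._today = today

    def due_in(self, days: int) -> date:
        return self._today() + timedelta(days=days)

# Production uses the real clock by default; tests pin it to a fixed date.
fixed = DueDateCalculator(today=lambda: date(2026, 3, 11))
assert fixed.due_in(7) == date(2026, 3, 18)
```

Without the injected clock, any test of the calculation would silently depend on the date it happened to run on.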

On this particular day I made the decision to be a little bit stricter in my feedback, so I clicked the "Changes required" option meaning that I would have to come back to re-review the pull request before it could be progressed.

Checking out a branch to see the code in context

For another code review I decided to check out the branch to give it a proper inspection on my work laptop.

This particular codebase was a bit older, so had the potential to include some legacy stuff in there. Specifically, I noticed the configuration of the dependencies mentioned a client for metrics that we haven't been using during either of my stints at Atlassian. 

There were no references to the dependency in the code, but seeing it in the Gradle file means it's probably being bundled into the service and deployed.

I didn't even get around to making a little "note to self" for that, so maybe someone on the team will pick up on that some time later.

Post incident reviews

My department has a regularly scheduled session for developers to build up experience and awareness of how to go about addressing incidents that arise across our services. This is a chance to learn from others' mistakes and to give input to ensure that the appropriate mitigations are applied to ensure that the specific system(s) involved in the recent incident don't get impacted again.

As I had a meeting clash on this particular day, I decided to have an early read-through of the incident summaries and provide my comments in advance of the normal meeting schedule.

My main contribution on this occasion was to query whether a feature flag lookup had included an id value, as this should have ensured that when multiple feature flag lookups were involved in the processing, the same result would have been returned.

Preparing to upload synthetic data to S3

The story that I was currently working on involved assessing the performance of Amazon Athena for different data formats when a large volume of data is involved.

The previous day I had guided an AI agent to create some Python scripts that would generate files to resemble an S3 inventory report as CSV or Parquet.

The functionality was quite impressive, even if I do say so myself:
- Random generation of the object key (file path)
- Control of the seed for the random generator, so CSV and Parquet would have like-for-like data
- Parallelism to support generation of multiple files at the same time
- Generation of a manifest file with a representative path
- Script for uploading to S3, also supporting parallelism
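The seeding trick can be sketched like this (hypothetical field names; this is not the actual scripts, just the core idea of deterministic generation):

```python
import csv
import random
import uuid

def inventory_rows(seed: int, count: int):
    """Deterministic synthetic rows resembling an S3 inventory report.

    The same seed yields identical rows, so a CSV writer and a Parquet
    writer fed from the same generator produce like-for-like data.
    """
    rng = random.Random(seed)
    for _ in range(count):
        key = "/".join([f"prefix-{rng.randrange(100)}",
                        uuid.UUID(int=rng.getrandbits(128)).hex])
        yield {"bucket": "example-bucket", "key": key,
               "size": rng.randrange(1, 10**9)}

def write_csv(path: str, seed: int, count: int) -> None:
    """Write one synthetic inventory file; a Parquet writer would reuse the same generator."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["bucket", "key", "size"])
        writer.writeheader()
        writer.writerows(inventory_rows(seed, count))
```

Because each worker gets its own seed, parallel file generation stays reproducible across both output formats.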

This particular experiment was deemed to be necessary because our AI agents were not able to give us solid data for their estimation of performance differences between the data formats. Moving from CSV to Parquet would be a blocker on an existing implementation being usable.

Then things got interesting

A team mate posted a link to Mike's blog post in one of our team Slack channels.

I started reading the blog post, then checked my email...

(Scooby-Doo voice) Ruh roh...

The initial sentences mentioned "may be impacted" in bold, then the content mentioned that access to various systems would be cut off, so it dawned on me that this time my employment with Atlassian would be over.

I didn't get a chance to check in those Python scripts because the company had made the understandable decision to cut off access to Bitbucket.

I had access to the Atlassian Slack from my personal device for most of the rest of the day, so I could let my team know about my situation.

I got several nice mentions in the team channel and in direct messages from my teammates and managers. It came as much of a shock to them as it did to me.

When I went on LinkedIn I started to see a few familiar names showing up mentioning how they too had been caught up in this round of redundancies.

It was kind of reassuring to see the names of people who have been significant contributors - so it's not as though I have been identified as dead wood and cut out based on performance. Apparently there are about 1600 of us in this situation.