Monday, 9 June 2025

Automation won't pick up some version upgrades

Introduction

This post is about situations where software components that are commonly imported as part of assembling production systems can slip outside of the normal, expected path for detecting available version upgrades and applying them.

A couple of examples of systems that can be set up to detect when new versions of dependencies are available are:

Renovate

Dependabot

Examples of dependency changes

When a base Docker image went distroless

When new versions stopped being released for the Alpine-based distribution of the envoyproxy Docker image, automation had nothing in place to detect that and raise it as a potential issue.

I came across this when a production issue came up in another team's core infrastructure service. Since my team was going to be blocked until the incident was resolved, I followed the online chat discussion, checked some logs, did some Googling, and established that the error being seen should have been resolved by a version of envoy that had been available for several months.

It took an hour or so to join the dots and establish that the "latest" version being deployed was several versions behind envoy's releases, because it had not been updated to align with a decision to stop supporting a particular Linux distribution in favour of a distroless approach.

Change of maven artifact name

For Java applications, Maven packaging has become a de facto standard for managing the libraries that are brought in to support functionality within a service or application.

An example of an artifact that changed its name as part of a major version upgrade is Apache commons-lang, which moved over to commons-lang3.

I can't recall any particular problem arising from running with commons-lang, but I wouldn't like to see commons-lang as a dependency in my codebase - given that its most recent release was back in 2011, more than 14 years ago.
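
The rename matters for upgrade automation because the new major version lives at a completely different set of coordinates, so a tool watching the old commons-lang artifact will never surface a 3.x release as an upgrade. In code the change is just the package name - a minimal before/after, purely for illustration:

// before: commons-lang:commons-lang (no releases since 2011)
import org.apache.commons.lang.StringUtils;

// after: org.apache.commons:commons-lang3 (actively maintained)
import org.apache.commons.lang3.StringUtils;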

So how can we stay up to date?

In my view, the best way to reduce dependency management overhead is to minimise dependencies in the first place. Carefully weigh up the value that is being added when you bring in any dependency:

  • Does it bring along a bunch of transitive dependencies? Is it worth it?
  • Could the same be achieved with a couple of extra classes directly in our codebase? (See the sketch below.)
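
For example, if the only thing a library is providing is a couple of null-safe string helpers, a small class of our own may be all that is needed. A minimal sketch - the class and method names here are just for illustration:

public final class Strings {

    private Strings() {
    }

    // null-safe blank check, roughly what commons-lang's StringUtils.isBlank offers
    public static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // null-safe default, roughly StringUtils.defaultString
    public static String orEmpty(String value) {
        return value == null ? "" : value;
    }
}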

As software bills of materials gain adoption and greater attention is focussed on the software supply chain, I believe it will become more common for organisations to have centralised tooling in place to surface the use of out-of-date artifacts.

Thursday, 5 June 2025

Mechanical sympathy mindset

During my time in London I socialised a bit with some of the members of the team behind the LMAX Disruptor, which is where I first recall hearing of the concept of Mechanical Sympathy and its application to the design of software.

Something that was initially counterintuitive to me about the Disruptor was the approach of including padding in data structures to exploit caching at the hardware level.
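
The rough idea is to surround a heavily written field with otherwise unused fields, so it ends up alone on its cache line and writes from other threads to neighbouring data don't keep invalidating it. A much simplified sketch of the shape - not the Disruptor's actual code, which goes to greater lengths to stop the JIT optimising the padding away:

class PaddedCounter {
    // unused fields either side push 'value' onto its own 64-byte cache line
    long p1, p2, p3, p4, p5, p6, p7;
    volatile long value;
    long p8, p9, p10, p11, p12, p13, p14;
}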

I'm on a break from working in tech at the moment as I keep myself available to help my family adjust to some health issues that come with aging, but my mechanical sympathy mindset is still active.

Today as I was vacuuming the floor of my living room I wondered about whether height makes a difference to the suction power of the vacuum cleaner - as taller people will typically extend the length of the pipe between the suction mechanism and the floor. Based on some Google search results, the short answer is "yes".

I was going to posit that as an explanation for why my mother always seemed to do a better job of vacuuming than me when I was younger, but I wasn't taller then, so that doesn't really apply. 

 

Friday, 30 May 2025

Severance and FedRAMP - keep them separate

Not long before starting my career break to help out with family, I had a social catch up with some work colleagues via Zoom and the conversation looped into what TV shows people have been watching lately... (do we still refer to things as TV?). Severance got a mention, so a few days later I got drawn into starting an Apple TV subscription to find out what it was all about.

This post is going to be short and sweet, which may limit it to an inside joke or a "Were you thinking what I was thinking?", as I don't want to offer up any spoilers about Severance, or FedRAMP.

I just hope nobody takes inspiration for life to imitate art by having the Severance approach to work/life balance applied to meet a level of FedRAMP isolation.

Thursday, 15 May 2025

Applying AI to software development can be like following SatNav

Trying out a different navigation system

A month or so ago I upgraded to a car that has a SatNav system included, so I have been trying to use that instead of the Maps app on my phone. My experiences with it so far have generally been good, but it is far from flawless - a bit like Artificial Intelligence (AI) in software development.

As context, my previous vehicle was not too old to include SatNav, it just hadn't been set up with English language or New Zealand maps - one of the downsides of having a second-hand vehicle that originated in Japan.

Flawed or incomplete information

Driving around central Christchurch can be a bit challenging at times as various roadworks are underway, leaving streets closed off or narrowed down to a single lane. It could be reasonable to expect that a basic navigation system might not have up to the minute awareness of those closures and restrictions. However, something that I did not expect to encounter was the navigation system advising me to expect a roundabout where no roundabout exists. This hasn't just been a one off anomaly, so I am having to glance down at the map from time to time to understand whether I should be preparing to turn at these traffic lights or further along.

As I become more accustomed to the prompts coming from the SatNav, I expect to develop a better sense of how far ahead, in time and distance, each piece of guidance relates to.

Routing with unclear preferences

I'm not sure why, but on a recent long distance drive the guidance system wanted me to turn off the main highway to go inland to some unfamiliar roads. Normally I would be curious enough to take a little detour and see a different part of the countryside, but on this occasion I had a kitten as a passenger in the back seat so I wanted to stay on the familiar and direct route.

There may have been some marginal difference of total road distance, but I don't know if the inland roads would have been of the same sealed quality as State Highway 1. The only time that I have needed to detour inland around that part of the country before involved gravel roads, where my comfortable driving speed is much lower.

Lower involvement results in lower recall

If I am not actively engaged in the navigation process then I am less likely to remember details of the route for future journeys. If I didn't make an active decision to turn off at a particular place, I will be less likely to absorb that information and have it available for use in the future.

Generating code with AI is like using SatNav

When it comes to software development, I believe that we should be treating artificial intelligence systems with a suitable amount of caution and awareness of the potential limitations and flaws.

I have used AI to generate small one-off utility apps, as well as to produce snippets of code for production systems that process millions of data records every day. Just like we have checks and balances in place to test and review code produced by human developers, I would not allow fully AI generated code to drive my production systems...  ...yet.

Between the chair and the keyboard

The person driving the AI still needs to stay aware of whether what is being produced will be fit for purpose - meeting the "ity" considerations, such as:

  • functionality 
  • scalability
  • security
  • stability
  • durability
  • flexibility
  • ...

Ethics, and compliance with laws and standards, are also aspects that will continue to require people to be involved and held accountable.

 

Tuesday, 29 April 2025

Restricting concurrent updates

Introduction

Just jotting down some thoughts about what might be involved in addressing an issue faced in my last project.

The problematic situation

Multiple sources of updates arrive concurrently and are picked up for inclusion in an aggregated representation of the data.
There are multiple worker processes, each with multiple worker threads that pick up changes and apply them without any awareness or consideration of what other work is underway.

Potential solution approaches

Debouncing of updates

Using Redis as a store for keeping track of the identifiers of records that are currently being processed has been successfully applied to reduce race conditions for updates of some attributes, so the pattern could be applied more broadly.
The debouncing approach can be thought of as a type of transaction lock restricting access to the record that is required to contain the full representation of state.
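
A rough sketch of that pattern, here using the Jedis client purely for illustration - a worker only proceeds with a record if it manages to claim the identifier first, and the expiry guards against a crashed worker holding the claim forever. The key prefix and timeout are assumptions, not values from the real system:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RecordClaim {

    private final Jedis redis;

    public RecordClaim(Jedis redis) {
        this.redis = redis;
    }

    // SET key value NX EX 30 - succeeds only if nobody else currently holds the record
    public boolean tryClaim(String recordId) {
        String result = redis.set("processing:" + recordId, "locked",
                SetParams.setParams().nx().ex(30));
        return "OK".equals(result);
    }

    // called once the update has been applied
    public void release(String recordId) {
        redis.del("processing:" + recordId);
    }
}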

Partitioning workload processing

This would most likely involve switching to a different message processing technology to enable isolation of workers, with a single update per identifier in flight at a time.
If Kafka were applied here, we would need to pay more attention to how keys are balanced across partitions to ensure that we preserve scalability of throughput.
To benefit from the change, the processing would probably also need to switch to a single thread per partition to achieve the goal of eliminating concurrent updates to records that share the same identifier.
In my opinion, this would be more trouble than it is worth.
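
For anyone curious what the keyed approach involves, here is a minimal sketch using the standard kafka-clients producer - records that share a key are always routed to the same partition, so a single consumer thread per partition would see every update for a given identifier in order. The topic name and serializer choices below are illustrative only:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UpdatePublisher {

    private final KafkaProducer<String, String> producer;

    public UpdatePublisher() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // using the record identifier as the key keeps all updates for that record on one partition
    public void publish(String recordId, String updatePayload) {
        producer.send(new ProducerRecord<>("record-updates", recordId, updatePayload));
    }
}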

Monday, 7 April 2025

A self-imposed career break to spend more time with family

It's 2025 and I am back to blogging on my personal site, so what has been happening?

For the past few years I have been posting some of my ponderings on another site - a blog that is internal to Atlassian, where I have been working as a senior developer.

That brings me around to why I am going to resume posting here. I have decided to dedicate my time to helping out with a senior family member, stepping completely away from work commitments for a while.

Personal progress

On my first fresh non-work day I have found myself unblocked from making progress on something that I have been contemplating for a couple of months - I have purchased a more modern car. A 2020 hybrid seems like a nice step up from a 2007 regular petrol car.

Family health progress

Without going into any detail, the family member who had been facing some challenging medical needs has made a remarkable come-back. We are taking things one day at a time.

Things to do

Don't expect much technical output from me for a while, as there is still a need for me to help out with gardening and household maintenance.

Not having any Slack notifications for a month or three seems quite appealing.

Tuesday, 27 December 2022

2022 - A year in review

Just a look back over the last 12 months.

January

I moved back to Christchurch to live, after having spent a few months further south since moving back from London.

Work was mainly about balancing other people's understanding and expectations around our use of Kafka.

February

I decided that it would be worthwhile to have a year's subscription for streaming Sky Sports, as some rugby matches that I would want to watch would be on at times when venues wouldn't be open.

Having moved to Christchurch to be close to an office, I now found myself working from home as Covid restrictions came back into effect across New Zealand.

March

Got back into some actual coding at work - as opposed to mainly reviewing pull requests for configuration changes for Kafka topics.  This became urgent, as the command line interface tool that our provisioning system was dependent on had been marked for deprecation.

April 

Had my first direct experience with Covid-19.  I only went for a test because a friend had mentioned that a runny nose was his first symptom.

May

Managed to roll my ankle as I was leaving the house for the evening.  I thought about going back inside and resting it up with ice etc. - but then decided to try walking off the pain and carried on to a pub quiz.  My team won the pub quiz, and I got an Uber home.

A couple of days later my ankle swelled up so much that it was too painful to walk on.  This lasted a few weeks.

June

Heard from a recruiter who was working as a local sourcer for Atlassian, now that they are fully open to remote working.

Had a family member come to visit for a few days.  On the second day they seemed a bit ill so I ordered some more Covid tests - I was okay, but they tested positive and needed to isolate for a week or so. 

July

A few stages of interviews with Atlassian.

Went down south for a weekend, including watching The All Blacks versus Ireland at Forsyth Barr Stadium in Dunedin.

Attended a comedy show by Rhys Darby - he's the comedian / actor who played the character Murray from the Flight of the Conchords television series.

August

Final stage of interviews with Atlassian.

Received and accepted offer to join Atlassian as a Senior Software Engineer.

New laptop arrived - glad it was a Mac, as I wasn't sure whether that had been something that I had asked about during the interview process.

September

Properly started the new job.

Purchased a decent office chair, and standing desk - all within budget for being expensed back.  A welcome improvement over sitting at the kitchen table.

October

More learning about Atlassian systems, and familiarising myself with the services that my team is responsible for.

November

Learning about another existing service that will be moving across to my team for further development and maintenance.

December

Went to Sydney to meet up with the rest of my work teammates - several of whom also had to travel across from other parts of Australia.

Enjoyed my first experience of an escape room, which was the team building exercise that we chose.

Went to Otautahi Smoke - an afternoon and evening of live music, BBQ food and beers in Hagley Park.

Wednesday, 13 July 2022

Designing systems - The "ity"s That Limit or Enable Profitability

Introduction

This started off as a little aide-mémoire to get my head into the right space while preparing for an interview. It's not an exhaustive list, and it twists terminology that has been used to represent other things (see the note under Velocity below), so don't treat it as a textbook reference to work from.

Most of the listed points can be associated back to so-called "non-functional requirements" - NFRs. I don't like that particular terminology, so alternatively we might consider them as dimensions of the quality of the system.

Usability

"If you build it, they will come" should come with a provisor, "... but if it's awkward to use they'll soon go away, and might not come back."

Security

All of the aspects that combine to protect data from being seen or manipulated by anyone other than the intended recipient or sender, and also assuring users that the data has originated from the intended source.

Velocity

Here I'm cheating a bit by trying to come up with a term to represent the speed at which a system can respond to user input - not development velocity.

Accessibility

This has multiple dimensions to it, ranging from the devices that can present the interface, to the affordances offered for people with disabilities.

Reliability

The system does what it is intended and expected to do, in a consistent manner over a significant period of time. 

Elasticity / Scalability / Capacity

How well the system can cope when it becomes popular enough to attract a lot of users.

Likewise, how well it can scale back down to a level that is suitable when there is less demand - and less need for potentially expensive resources to be available.

Adaptability / Flexibility

It's not necessarily always the case, but given a range of possible technologies to choose from, they will often have an associated time or money cost for applying changes.

Not all roads lead to Profitability

In a commercial product, these are intended to combine to lead to profitability.

Not all products will consider these as being equally high priority, so you may find it a valuable exercise to get your team together and agree on a relative ranking so that you can focus on what is important for your business to succeed with the challenges and opportunities in the current environment.

Visualise Priority Ranking

I'd even go so far as to suggest having a visual representation of the value rankings so that there can be little doubt what to prioritise when making changes - in the days of office working this might be something like a poster on the wall or an A4 printout in the top corner of the whiteboard where the team has their stand-up meetings.


Sunday, 3 July 2022

Running Java with Preview Features in the Cloud - Part One

Introduction

I've been catching up on some features that have been added in recent versions of Java. The 6 month release cadence of new versions of Java is great, but can lead to a build up of new things to learn about.

The support for pattern matching in switch statements - JEP 406 - is particularly appealing, but for now it is still only available as a preview feature, meaning that we need to explicitly enable preview features at both compile time and run time.
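
As a small taste of JEP 406, the following sketch only compiles and runs on Java 17 with --enable-preview passed to both javac and java (the class name and example values are mine, just for illustration):

// javac --enable-preview --release 17 Describe.java
// java  --enable-preview Describe
public class Describe {

    static String describe(Object obj) {
        return switch (obj) {
            case Integer i -> "an Integer with value " + i;
            case Long l -> "a Long with value " + l;
            case String s -> "a String of length " + s.length();
            default -> "something else: " + obj;
        };
    }

    public static void main(String[] args) {
        System.out.println(describe(42));
        System.out.println(describe("preview features"));
    }
}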

A shallow view of the main cloud providers

A lot of online applications these days will run in some sort of cloud runtime environment, such as the function-as-a-service offerings from the main cloud providers.

According to what the documentation currently specifies, AWS Lambda's pre-packaged Java environments only support versions 8 and 11 unless you bring your own Docker container. Similarly, Azure Functions only offer versions 8 and 11. This leaves us to consider Google Cloud Functions which supports and recommends Java 17.

What can we try out?

As far as I can tell, the Google Cloud Function way of running Java doesn't allow us to control command line arguments to the Java runtime, so we cannot simply specify --enable-preview that way.

This leaves us to try out customizing AWS Lambda to:

  • set up a Docker container including the Java 17 runtime
  • set up a wrapper script to pass --enable-preview as a command line parameter to make the lambda initialize with the functionality that we want.

Tuesday, 21 June 2022

Speeding up Software Builds for Continuous Integration

Downloading the Internet

Can you remember the last time you started out on a clean development environment and ran the build of some software using Maven or Gradle for dependency management? It takes ages to download all of the necessary third party libraries from one or more remote repositories, leading to expressions like, "Just waiting for Maven to download the Internet".

Once your development environment has been used for building a few projects, the range of dependencies that needs to be downloaded for other builds reduces, as the previously referenced ones will now be cached and found locally on your computer's hard drive.

What happens on the Continuous Integration environment?

Now consider what goes on when Jenkins or your other preferred Continuous Integration server comes to build your software. If it doesn't have a local copy of the libraries that have been referenced then it is going to pay the cost of that slow "download the Internet" process every single time that it comes to check out your latest changes and run a build.

What are the main costs involved here?

  • Developer time waiting on the build to complete before moving on to the next change
  • Data transfer charges for sourcing from external repositories

Cutting down costs - saving time

What options do we have available for reducing these costs?

  1. Localise the artifact repository, acting as a pass-through cache
  2. Pre-download the most common artifacts into a build container image

Option 1 would involve the selection and setup of an appropriate artifact repository manager such as Nexus or Artifactory. There's a reasonable chance that if your organisation happens to write its own reusable libraries then this will already be in place for supporting the distribution of those artifacts anyway, so it may just be a matter of re-configuring the setup to support mirroring of third party libraries from external repositories.

Option 2 may seem a bit counter-intuitive, as it would go against the current trend of trying to minimise container sizes, and to be generally useful it would need to contain a broader range of artifacts than any one project's build would require.

Keep it local

For both options the performance improvement comes down to locality of reference. The builds should be able to obtain most, if not all, dependencies without having to go beyond the organisation's private build environment's network - whether that be a Virtual Private Cloud or a data centre.

With this type of setup in place builds should be able to spend less time on initial setup, and be more focussed on compilation, running tests, and ultimately making the new known good version of the code available for use.

If you want to understand the potential time savings on offer here, just try temporarily moving the content of your local development environment's build cache away and see how long a build takes. For a typical Java microservice I would not be at all surprised if the build time doubles or even triples for having to obtain the build plugin libraries, the application's direct dependencies, and all of the transitive dependencies.

Monday, 20 June 2022

Docker SBOM - Software Bill Of Materials

In an earlier post on this blog I was curious about comparing Docker images to try to track down the differences that might be causing performance problems. Since then I have had a play with the sbom Docker command for listing out what is included in the image.

Following the documentation at: https://docs.docker.com/engine/sbom/

Below is an example of the output of a run of a locally built app:

> docker sbom hello-world-alpine-jlink:latest

 

Syft v0.43.0
 ✔ Loaded image            
 ✔ Parsed image            
 ✔ Cataloged packages      [16 packages]

NAME                    VERSION       TYPE         
alpine-baselayout       3.2.0-r20     apk           
alpine-baselayout-data  3.2.0-r20     apk           
alpine-keys             2.4-r1        apk           
apk-tools               2.12.9-r3     apk           
busybox                 1.35.0-r13    apk           
ca-certificates-bundle  20211220-r0   apk           
docker-comparison       1.0-SNAPSHOT  java-archive  
jrt-fs                  11.0.15       java-archive  
libc-utils              0.7.2-r3      apk           
libcrypto1.1            1.1.1o-r0     apk           
libssl1.1               1.1.1o-r0     apk           
musl                    1.2.3-r0      apk           
musl-utils              1.2.3-r0      apk           
scanelf                 1.3.4-r0      apk           
ssl_client              1.35.0-r13    apk           
zlib                    1.2.12-r1     apk   

 

This is a much more detailed listing of the components that are included in the docker image than we would get from looking at the Dockerfile or image history, so I would recommend it as a way of checking what you are including in an image. The main feature request that I have is to separate the artifacts by type, though in this trivial example that is simple enough to do by just looking at the listing.


Tuesday, 14 June 2022

The Importance of Segmenting Infrastructure

Kafka for Logging

I was recently poking around in the source code of a few technologies that I have been using for a few years when I came across KafkaLog4jAppender. It enables you to use Kafka as a place to capture application logs. The thing that caught my eye was the latest commit associated with that particular class, "KafkaLog4jAppender deadlocks when idempotence is enabled".

In the context of Kafka, idempotence is intended to enable the system to avoid producing duplicate records when a producer may need to retry sending events due to some - hopefully - intermittent connectivity problem between the producer and the receiving broker.
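
For reference, enabling it is just a producer configuration setting - and in recent client versions it is on by default. The class wrapper below is purely illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotenceSettings {

    static Properties producerProps() {
        Properties props = new Properties();
        // the producer attaches sequence numbers so the broker can discard duplicate sends on retry
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        return props;
    }
}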

The unfortunate situation that arises here is that the Kafka client code itself uses Log4j, so it can result in the application being blocked from sending its logs via a Kafka topic because the Kafka client Producer gets deadlocked waiting on transaction state.

Kafka For Metrics - But Not For Kafka Metrics

This reminded me of a similar scenario where an organisation might choose to use Kafka as their mechanism for sending out notifications of metrics for their microservices and associated infrastructure. If Kafka happens to be part of the infrastructure that you are interested in being able to monitor, then you need to keep those resources isolated from the metrics Kafka - otherwise you run the risk of an incident impacting Kafka which prevents the metrics from being transmitted.

Keeping Things Separated

A real world example of keeping infrastructure isolated from itself can be seen in the way Confluent Cloud handles audit logs. I found it a little confusing at first, as the organisation that I was working for at the time only had Kafka clusters in a single region, but the audit logs were on completely separate infrastructure in another region and even another cloud provider.

Sometimes You're Using A Service Indirectly

A slightly different - but no less significant - example of the need for isolating resources can arise when a particular type of infrastructure is being used for different types of workload. Rather than having a "big bang" release of changes to all of the systems, a phased rollout approach can be taken. One of my earliest involvements with using AWS came shortly after their 2015 DynamoDB outage, which had a ripple out impact for a range of other AWS services because behind the scenes those other services were themselves utilising DynamoDB.

It's my understanding that AWS subsequently moved to isolating their internal services' DynamoDB resource from general consumers' DynamoDB infrastructure - but don't quote me on that.

Friday, 10 June 2022

Docker Images - Size matters, But So Does Performance

Introduction

I recently went through the exercise of re-building a Docker image based on what was supposed to be a stable, well-known application codebase. Along the way I observed an unexpected performance issue.

The application contained within the Docker image was just a Java command line utility for parsing some yaml files to provision Kafka resources on our hosted development clusters. The code had not been changed for several months, so this was supposed to just be a matter of setting up a local copy of the Docker image instead of pulling down a trusted third party's image from Docker Hub.

The application was bundled within a Docker container whose Dockerfile was alongside the code, so it should have been a simple matter of using that to produce the image, pushing it to our own repo, and then pulling that down for our runtime use.

It's the same, so why's it different?

We had been running with the existing third party Docker image for several months, so there was a well established history of how long each stage of the deployment pipeline should typically take to run.

When the new Docker image ran it took noticeably longer to complete each stage. I don't have the exact figures in front of me, but I can recall that it was on the order of a double-digit percentage slower - so a six minute build might now be taking longer than seven minutes.

Examining the Docker images

The third party's build process for the original Docker image wasn't available for examination, so to compare the Docker images we need to use something like

> docker history --no-trunc <full image name>

From this I was quickly able to establish that there were a couple of significant differences between the application's specified Dockerfile and the Dockerfile that would have been used for building the faster running established version:

  • The base image
    • CentOS Linux versus Alpine Linux
       
  • The Java runtime
    • Full Java SDK versus jlink with specific modules

Getting back up to speed

Since the purpose of this setup was to be a lift and shift of the existing setup, I adjusted the Dockerfile to involve CentOS Linux as its base image and adjusted it to use a full JDK instead of the clever jlink minimised Java runtime environment.

At this point we were where we wanted to be as our baseline for migrating off the third party Docker image. Our image has the same base OS and Java runtime and performs close enough to the same - without taking the double digit percentage of time longer than our starting point.

What was the issue?

While I was working on this particular setup there was a pressing deadline, so I was not free to play around with tuning it and isolating whether the issue was due to the OS or the jlink runtime (or something else).

Based on what I have seen mentioned online, I suspect that there may have been some aspect of the application that involved heavy use of system calls that do not run as efficiently against Alpine's musl library. For now that is just a theory, and not something that I have managed to reproduce on a simplified locally built application.

If the runtime environment had involved inputs from external systems I would have been more motivated to try to keep us on Alpine, to minimise potential vulnerabilities - it tends to ship with fewer services and libraries that could carry CVEs representing security vulnerabilities.