Sunday, 28 September 2025

Looking outside the service boundary

Help others to help yourself

This post is about how it sometimes pays to take a look beyond the services your team owns, so that you have a deeper understanding of the operating context and can have confidence in the performance and robustness of the implementation.

I wouldn't claim to be an expert in anything, but sometimes my extra pair of eyes picks up on an opportunity to make a small change and get a significant benefit.

Database queries 

Back when I was operating in an environment where teams had access to the logs and metrics of other teams' services, I could dip into what was going on when my login service was hitting timeouts from a dependency.

Based on the details in the logs, the culprit seemed to be delays from a database query. Surprisingly enough, the database was missing an index for the most common query pattern, so as we scaled up from a few hundred users to a few thousand, query times degraded until the timeouts appeared.

Default configuration options don't always match what has been in place in previous setups. Migration from a hand-configured setup to one using infrastructure as code...

Logging of ElasticSearch slow queries

In AWS at least there is an option to have ElasticSearch log the slowest queries that it encounters. This can feed into an evaluation of whether the data or query needs to be adjusted to reach acceptable and / or optimal performance.

I was involved in a project where an existing ElasticSearch setup needed to be migrated to a new cluster with a less click-ops approach to the configuration. When I took a look into the new setup I noticed that slow query logging was not enabled, so I alerted the other team and they were able to adjust the config before we needed it.

Finding a root cause during a production incident 

On one occasion I followed the incident call and chat while an incident was underway in a system that impacted the workflow of every developer in the company, and no root cause had been established.

There were a few pages of logs to look at, so it took a while to isolate what was relevant to the situation.

Without going into too much detail, it turned out that a Docker container that was part of the deployment process was not up to date. This was not a simple case of a team not keeping up with the latest available updates, but actually a situation where the third party developers had switched away from the particular distribution that was involved.

Automation of picking up updated versions and applying them would not have helped in that situation, as the distribution switch could be regarded as a type of fork, so it would not have been easy to detect automatically.

Thursday, 25 September 2025

Improve performance, without introducing failure seams

Learning from the mistakes of others

A few years ago a team that worked across the office from my team came up with a neat way of introducing a cache to speed up the performance of the most business critical pages on the company's main money earning website.

Page load time had been identified as a particularly significant aspect of how search engines would rank the value of websites, so getting a few milliseconds off that metric was a great achievement.

Fast forward several months, and the unthinkable happened: the infrastructure component at the heart of the caching implementation had a temporary outage, making it impossible to load the pages at all.

It was quite a while ago, and I wasn't directly involved in the recovery process but I expect that it would have taken a stressful hour or three to recover.

Avoid introducing points of failure

As inevitably happens, the project that I was working on required some performance improvements in order for it to be considered worthy of going to production.

From being aware of the potential failure modes of the caching infrastructure, we were able to design our implementation to gracefully degrade if and when the cache was unavailable. Instead of becoming completely unavailable our pages would fall back to obtain the data directly from the source of truth.

The cost involved here would be additional latency for the end users, and some additional cost from needing to auto-scale up the read capacity provisioning of the data store. 
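The fallback pattern described above can be sketched in a few lines. This is a minimal illustration only: the cache and the source of truth here are in-memory stand-ins, and all class and method names are made up, not the actual implementation from the project.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a read path that degrades gracefully when the cache is down.
// In a real system the cache would be e.g. a Redis client and the source
// of truth a database client; both are simulated with maps here.
class GracefulReadPath {
    private final Map<String, String> sourceOfTruth = new ConcurrentHashMap<>();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    volatile boolean cacheAvailable = true; // toggled to simulate an outage

    GracefulReadPath() {
        sourceOfTruth.put("page-1", "content-1");
    }

    String read(String key) {
        try {
            return readThroughCache(key);
        } catch (RuntimeException cacheOutage) {
            // Cache unavailable: fall back to the source of truth.
            // The cost is extra latency and extra read load on the data store.
            return sourceOfTruth.get(key);
        }
    }

    private String readThroughCache(String key) {
        if (!cacheAvailable) {
            throw new RuntimeException("cache unavailable");
        }
        // Serve from the cache, populating it from the source on a miss.
        return cache.computeIfAbsent(key, sourceOfTruth::get);
    }
}
```

When the cache is healthy, reads are served (and populated) through it; when it is not, the same call still succeeds, just more slowly.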

Wednesday, 24 September 2025

Java 25 Hello World, almost unrecognisable as Java

Hello world in Java 25 can be very different to any earlier version of Java.

void main() {
    IO.println("Hello world");
}

That's it, no package, no import, not even having to specify a class.

Even that IO is a class that was only introduced in Java 25.

If you had seen that code in a multiple-choice lineup of code that should and should not compile in the Java of 10 years ago, you wouldn't have dreamed of regarding it as valid Java.

The main method is also so different to traditional Java. There's no "public", no "static", and no parameter list.

It's still early days, but I'm not a fan. 

Tuesday, 23 September 2025

When to avoid, or allow upserts

Introduction 

A few recent posts on this blog have outlined how we could achieve version aware upserting of data into various databases. In this post let's consider situations where that might be an unsuitable approach. 

An assumption about Id uniqueness

When we attempt to take an entity and write it into a database, we have an expectation that the attribute or attributes that are used to uniquely identify that entity are safe to handle as trustworthy within the business domain. Let's consider a situation where that assumption has been known to fall down in real production systems.

Generation of a value to be used as the primary key in a relational database can seem to be a solved problem, given that we now have UUIDs that can be generated and passed around for use in our applications and services.

Some earlier implementations of UUID generation combined the MAC address of the machine's network device with the current time, to produce a value that would not clash with values generated on other machines.

There turned out to be limitations to the uniqueness guarantee when the processing speed and concurrency of the machine resulted in multiple "unique" values being generated at effectively the same time.

Another unfortunate way of producing colliding UUID values could occur when virtual machines happened to have been set up specifying the same MAC address, further increasing the risk of collisions.

For situations where the identifier generator cannot be trusted, we should focus our efforts on recognizing inserts and updates as clearly differentiated operations. 
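One way to keep the two operations clearly separated is to make the insert path refuse to overwrite, so that an identifier collision surfaces as an error instead of silently replacing existing data. Here is a minimal in-memory sketch of that idea; the class and method names are made up, and the map is a stand-in for a real database.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Insert and update as clearly differentiated operations: each one
// fails loudly when its precondition about the id does not hold.
class StrictStore {
    private final Map<String, String> records = new ConcurrentHashMap<>();

    // Insert only: refuses to overwrite if the id is already taken.
    void insert(String id, String value) {
        if (records.putIfAbsent(id, value) != null) {
            throw new IllegalStateException("duplicate id: " + id);
        }
    }

    // Update only: refuses to create if the record does not exist yet.
    void update(String id, String value) {
        if (records.computeIfPresent(id, (k, old) -> value) == null) {
            throw new IllegalStateException("no such id: " + id);
        }
    }

    String get(String id) {
        return records.get(id);
    }
}
```

With this shape, a duplicate identifier from an untrusted generator shows up as an explicit failure at insert time, rather than one entity silently clobbering another via an upsert.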

Trusting a source of truth

It is common for alternative representations of data to exist downstream from an originating system. These may asynchronously apply some transformations or aggregations to produce an entity that is intended for reporting or any manner of follow-on business processing.

In this situation the system is far enough removed from the original data creation that there is little point in expecting inserts and updates to arrive in order.

One size does not fit all

I have made some broad generalizations, but the universal consideration applies: "it depends". You may find yourself needing a pipeline that can and must differentiate between creating and updating, in which case you can also expect to need firm control over the ordering of those events - perhaps involving Kafka with suitable partitioning and concurrency controls. That's a topic for another post.

Designing APIs for use by AI, and others

Introduction 

I recently listened to a podcast episode that offered an introduction to some of the core concepts of retrieval-augmented generation (RAG) for artificial intelligence systems. One of the many points covered was the opportunity to prompt the system to inform the user if it could not determine a correct answer. In this post I will share some real world experience of how attention to detail in API design can help or hinder this capability.

Example of an unintended limitation 

"A chain is only as strong as its weakest link".

Databases and search engines have different use cases and so have different associated expectations around what type of behavior is expected when the system is not fully operational.

Most databases are based on a concept of indexes and records, where it is only appropriate to present back a query result if the authoritative data store has successfully been contacted and produced a result.

Search engines can be a little bit looser, on the basis that producing some results will be more useful than presenting back no results at all.

Partial results, unknown

When using a search system such as ElasticSearch, the API can provide context around whether the search was able to run over a completely representative range of nodes on the cluster. In a situation where one or more nodes was unable to produce a timely response, the search response can include an indication that the result is only partially complete.
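For context, that signal appears in the `_shards` section of an ElasticSearch search response. The field values below are illustrative; the indication of a partial result is a `successful` count below `total`, or a non-zero `failed` count:

```json
{
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "skipped": 0,
    "failed": 1
  },
  "hits": { "total": { "value": 42, "relation": "eq" } }
}
```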

The problem that we faced was that an API layer sat between my service and the ElasticSearch implementation, abstracting away the possibility of partial results. This effectively hid the possibility of data being incomplete.

To compound the issue, the structure of some of the data involved included a nested list, which ElasticSearch can index separately from the core document. That meant that in addition to not finding a match for a given ID, we could also face a situation where a document was found but was missing a subset of its data.

An unusual CAP theorem trade-off

As a consumer of an API, I want to have some awareness of how trustworthy the data will be.

If I get back a representation of the state of an entity, I'd like to know if it is potentially incomplete so that I have the option of retrying, or presenting the consumer with that context so that they can make informed decisions around how to utilise the information.

In the ElasticSearch situation, the possibility of a nested list having some items missing could result in seeing the entity in a state that it has actually never been in - i.e. it's not a case of eventual consistency where we're seeing a slightly out of date representation of the data.

Sidenote - When can it happen?

In my limited experience, the partial results situation was only seen when the ElasticSearch cluster was under unusually high load, such as when an additional nested structure was introduced without appropriate corresponding indexing configuration.

Summary

In a world of microservices and everything as a service we have a responsibility to detect when edge cases are encountered, and to minimise the possibility of unintentionally disrupting systems that rely on the data that we are making available.

As I see it, in the scenario described in this post there were two main alternative options to choose from:

1. Propagate the possibility of partial results in the API response, along with suitable caveat information in the documentation

2. Treat partial results as the system being temporarily unavailable, removing the risk that consumers of the data miss the more nuanced implementation detail related to the nested structure.


Always learning - Consistent hashing to reduce impact of database shard rebalancing

Introduction

This post will describe an approach to balancing the distribution and redistribution of data in horizontally scaled databases.

It is something that I came across as part of preparing for system design interviews, which often involve discussions weighing up the pros and cons and risks of applying particular approaches to storing data.

Storing structured data at scale

What is structured data?

Structured data is at the core of most applications on the Internet.

Without going into too much detail, let's just say that it is stored in a way that enables us to consistently retrieve it later, in a particular known shape.

How does retrieval work? 

Retrieval of structured data basically involves taking an identifier, such as a primary key, and being able to quickly look up an index to the one location that is expected to give an authoritative perspective of the current state of the corresponding record.

In the simplest case of a single physical server acting as the database, it has full responsibility for all records, but that is a severe limitation on scalability.

Simple hash for sharding to scale horizontally

If we scale up to store data across multiple physical servers, we can avoid redundant duplicate processing of lookups by distributing the storage based on the primary key. When we only have two servers it may be fine to apply a very simple hash and modulo arithmetic model to consistently balance the distribution of data.

Store or retrieve data with key 235723

Hashing the key might give a value such as 5821725.

Modulo 2 of the hashed value gets us to 1, so that record is stored on node 1, and is retrievable from there.

That works fine, until we need to scale up further. Once we add a third node, the modulo of the hashed key drastically changes where most of the existing lookups map to.
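The routing above is just a modulo over the hash, as in this small sketch (the hash value 5821725 is the illustrative number from the worked example, not the output of any particular hash function):

```java
class ModuloSharding {
    // floorMod rather than % so that negative hash values still map
    // to a valid node index.
    static int nodeFor(long hashedKey, int nodeCount) {
        return Math.floorMod(hashedKey, nodeCount);
    }

    public static void main(String[] args) {
        long hashed = 5_821_725L;                  // illustrative hash of key 235723
        System.out.println(nodeFor(hashed, 2));    // 2 nodes: record lands on node 1
        System.out.println(nodeFor(hashed, 3));    // adding a third node changes the mapping
    }
}
```

With two nodes the record maps to node 1; with three nodes the same record maps elsewhere, which is exactly the remapping problem that consistent hashing addresses.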

An extra layer of indirection

Knowing about a limitation in advance gives us an opportunity to prepare for the future and mitigate it.

Rather than the coarse grained approach of hashing directly down to the number of nodes that are available, we can choose a wide range of potential hash values and define our own distribution of nodes that sub-ranges should be mapped onto. This is commonly represented as a ring of hashed key ranges which map out to a specific node. Different segments of the range of potential values can be allocated to the same destination node.

As nodes are added we can have more control over the redistribution of sub-sets of the data by adjusting the distribution to include new nodes. So instead of having to take the entire database offline to step through every item for potential redistribution we can restrict the blast radius down to a small sub-set at a time as the existing data gets migrated.

Likewise, if we have a situation where we want to scale down for some reason, we can apply a mechanism to update the mapping to different nodes in a gradual and balanced way.
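A minimal sketch of such a ring can be built on a sorted map. The node names and ring positions below are made up, and a real implementation would place many virtual positions per node to smooth out the distribution:

```java
import java.util.TreeMap;

// A ring of hashed key ranges mapping out to specific nodes.
// Each entry marks the END of a sub-range owned by that node.
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(String node, long... positions) {
        for (long p : positions) {
            ring.put(p, node);
        }
    }

    // The owner of a hash is the next position clockwise around the ring;
    // wrap to the first position if we run off the end of the range.
    String nodeFor(long keyHash) {
        var entry = ring.ceilingEntry(keyHash);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        HashRing ring = new HashRing();
        ring.addNode("node-a", 100, 500);
        ring.addNode("node-b", 300, 900);
        System.out.println(ring.nodeFor(250)); // owned by the segment ending at 300
        System.out.println(ring.nodeFor(950)); // wraps around to the segment ending at 100
    }
}
```

Adding a node means claiming some positions on the ring, so only the keys in the sub-ranges just below those positions need to move, instead of nearly everything being remapped as happens with plain modulo.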

A tip for simplifying rebalancing

When we store our data we can include some metadata to represent the hashed value that determines what node it is allocated to. This enables us to more quickly identify the objects that will need to be migrated when a hash range needs to be moved to a different node.

Without the metadata we would be stuck with iterating through all keys from the preceding node in the ring, re-calculating each hash to determine whether it happened to fall within the range of values that we are evaluating, to decide whether the data should stay or be moved.

Monday, 22 September 2025

Upserts with version awareness in DynamoDB

Introduction

ElasticSearch has a concurrency control capability that enables writes to use a version value to determine whether to apply an update or discard it as being stale in comparison to some existing data.

As part of considering a migration away from ElasticSearch as a data store, I was interested in how other databases could be made to achieve the same type of version aware upsert capability.

In some earlier posts on this blog I have shared how the version aware upserting can be done with PostgreSQL, MariaDB and TitanDB.

This post is to share how the same capability can be achieved with AWS's DynamoDB document database, as an example of a non-relational database.

What does the DynamoDB API offer?

Insert, or update 

DynamoDB has putItem for creating an item in a DynamoDB table, and updateItem for updating an existing item.

On first look, we might expect some combination of putItem and updateItem to need to be applied, as that would resemble how the relational databases had to detect conflict and fall back to attempt the second type of operation.

It turns out that we can just use putItem, as the API documentation states:

"If an item with the same key already exists in the table, it is replaced with the new item."

So, that takes care of the insert otherwise update aspect of the implementation, but what about version awareness?

Conditional writing 

The putItem API offers us the option of specifying some conditional logic that includes the ability to compare existing data against the data being sent.

As this is not involving the updateItem functionality, I decided that it was inappropriate to have "updating" in the heading. Here we are either creating a new item or overwriting an existing one, whereas the updateItem call would be for updating an existing item. 

If we have a table called event, containing an id and a version then we can have a call like the following:

import java.util.Map;

import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
import software.amazon.awssdk.services.dynamodb.model.ReturnValuesOnConditionCheckFailure;

String idAsString = "event-123";
String version = "456";

Map<String, AttributeValue> eventDataUpdating = Map.of(
        "id", AttributeValue.builder().s(idAsString).build(),
        "version", AttributeValue.builder().n(version).build());

dbClient.putItem(
        PutItemRequest.builder().tableName("event")
                .item(eventDataUpdating)
                .conditionExpression("attribute_not_exists(id) OR (version < :version)")
                .expressionAttributeValues(Map.of(":version", AttributeValue.builder().n(version).build()))
                .returnValuesOnConditionCheckFailure(ReturnValuesOnConditionCheckFailure.ALL_OLD)
                .build());

In that code, conditionExpression and expressionAttributeValues combine to express the two situations that determine whether the content should be written into the event table:

  • attribute_not_exists covers the insert case, as there is no existing record with the specified id;
  • version < :version covers the situation where an item already exists but has a lower version value than what is being provided now.

Try it out

I've been experimenting in Java, using the Localstack Docker container as a standalone environment for interacting with DynamoDB, so you can grab the code and try running it for yourself.

At the time of this post, it is just a single class that:

  • creates the table
  • writes an initial low version
  • sets up 100 randomly ordered version values
  • spins up virtual threads that each pick up one of the version values and concurrently attempt to apply the update using the condition check
  • prints out when a conflict has prevented an attempted update (as expected) 
  • verifies that when the dust has settled we ultimately end up with the highest version being written

Code in GitHub  

(You'll need Java 21 or later, Maven, and Docker). 

Disclaimer

So far I have only scratched the surface of how to achieve the desired functionality.

I would not recommend applying this approach without also diving deep into the documentation for further layers of potential limitations and situations where eventual consistency may make this less appropriate than it appears.

Follow-on Curiosity

Is this how global tables address conflict?

I wonder if DynamoDB global tables apply similar logic when writes to the same item occur concurrently in different regions.

From the documentation about multi-region strong consistency, "Conditional write operations always evaluate the condition expression against the latest version of an item. Updates always operate against the latest version of an item."

I suppose the ReplicatedWriteConflictException could simply be a mapping from when ConditionalCheckFailedException is encountered.

Thursday, 18 September 2025

How to include AI in the developer interview process

What problem are we trying to solve?

In an earlier post I put forward an observation that, as of September 2025 (the month of posting), companies that are looking to hire professional software developers include a setup in the interviewing process that takes away common code assistance - in stark contrast to the way that developers joining the company would be expected to operate in their day to day work.

I've had a little bit of time to contemplate this, and while I do not have a solution that involves keeping the AI features enabled, I do have an idea for an alternative approach that includes some consideration of AI.

Look what the AI made us do

Instead of getting developers to cram revision of computer science fundamentals by re-reading Cracking The Coding Interview, to fit a cookie-cutter template of what a good developer should be expected to produce for a given coding challenge, we could instead present examples of code that has been generated by AI, and have the candidate evaluate the quality and / or suggest refinements to the prompt.

This could get us closer to aligning the interview process with the new reality of how software development is increasingly being assisted by the application of artificial intelligence systems.

For a limited time only 

I fully expect that cleaning up after loosely applied AI generated code will become less of a requirement as systems evolve, but at this point in time it should be considered as a valuable skill to have.

The amount of buzz on LinkedIn mentioning Vibe coding is either a sign that a lot of people are getting into it, or that AI companies are doing well at loading up the hype train (or both).

Not the gotcha I was expecting

So there I was, starting a tech interview... 

I recently attended a couple of online technical coding challenge interviews (apologies to anyone reading my posts in chronological order, this is a fast-follow to the previous one).

To give myself a better chance of success in the second one I thought to ask for the initial requirements as text after I had made my initial attempt at interpreting what the interviewer had described.

To my astonishment, the interviewer said that he wasn't sure whether that was allowed, and that he would be able to clarify any points if I wanted to raise them.

Why was I so surprised?

I've attended a few online interviews over the course of my career, and it has never been about listening comprehension and recall.

Not so long ago I was on the other side of this particular type of interviewing process at the same company, so I thought that I had a solid grasp on how the interview works.

Does it matter?

I was a bit caught off guard, as I hadn't expected to need to capture much information as notes.

As this was towards the start of the interview, it left me a little bit flustered, wondering whether I may have missed any detail from the initial requirements description.

Whether I make it through to the next round of the interview process or not, I may provide feedback to the company to ensure that future candidates - and interviewers - know whether the requirements can be shared as text, for consistency and fairness to candidates who may parse information better from a textual representation.

Developer interviews out of step with reality

Disclaimer 

I am in the process of trying to get back into the workforce, after taking a few months out to be more available for family while dealing with a medical situation.

After some time away from coding my muscle memory for solving little software problems is a bit slower than normal, so I fear that I may not be coming across as being a relatively intelligent and capable professional software developer with over two decades of industry experience.

I'm feeling a bit glum, as shortly after each interview I identified some aspect that I had not covered with my implementation - which can be a showstopper for progressing to the next round of the interview process.

The coding interview process is out-dated

Disable default editor functionality

I have attended a couple of online coding interviews in the last couple of weeks, and found myself having to disable some of the default features in my integrated development environment so that I could show my own capabilities, separate from the auto-completion suggestions.

On the positive side, at least by coding in my own editor I was able to choose some familiar and relatively up to date test libraries as dependencies - unlike an in browser development environment for another interview, where I found myself stuck with JUnit 4 that has been "in maintenance" for about half a decade.

Meanwhile, in the real world

Day to day software development within the types of company that I am interested in joining now generally involves making the most of the suggestions offered up by editing tools. Additionally, there is an increasingly common expectation that developers will lean on AI systems where their employer has paid for a subscription to have access to those coding assistants.

Turn on AI and let her rip?

No. I don't have a solution to offer, as I still believe that the assessment is meant to be of the developer's capability.

My underlying gripe is quite tangential to this particular aspect of the state of the approach to coding interviews. From having been on both sides of the current interview process, I am of the opinion that there is too much time pressure on candidates and interviewers to squeeze in a coding implementation and try to gain a realistic impression of the candidate's ability to do the job.

Wednesday, 17 September 2025

Sequenced Collections, Look before you leap

I have been giving myself a refresher on the Java Collections APIs as part of preparation for coding interviews. It's not all reading dry documentation, as we now have podcasts, YouTube videos and all manner of consumable media available.

On one particular podcast episode I heard about Sequenced Collections, which was something that was introduced with Java 21 a couple of years ago.

An example of some functionality introduced by Sequenced Collections is the option to obtain a reversed view of a collection. The key word to pay attention to there is, "view".

So, if we take an ArrayList and call reversed(), what we get back is a reverse-ordered view backed by the original ArrayList. Through the SequencedCollection API we can then call addLast(e) to add the specified object, e, onto the end of that view.

The gotcha

If the ArrayList contains many objects then we will be faced with the performance overhead of shifting every existing entry, as behind the scenes the implementation is still the original ArrayList, and an ArrayList does not offer good performance for inserting items at the start.
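A small demonstration of the view semantics (requires Java 21 or later; the class name is just for the example): appending to the reversed view inserts at the front of the backing list, which is the O(n) shift described above.

```java
import java.util.ArrayList;
import java.util.List;

class ReversedViewDemo {
    public static void main(String[] args) {
        var list = new ArrayList<>(List.of("a", "b", "c"));
        List<String> view = list.reversed(); // a view, not a copy: [c, b, a]

        // addLast on the view is effectively addFirst on the backing
        // ArrayList, shifting every existing element one position along.
        view.addLast("z");

        System.out.println(list); // [z, a, b, c]
        System.out.println(view); // [c, b, a, z]
    }
}
```

The same call on the reversed view of a LinkedList or ArrayDeque would be cheap, which is exactly why it pays to know what sits behind the view.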

So, just like any other API, be careful how you approach methods that seem to be offering a more convenient way of working - there can be a hidden cost to the untrained eye.


Another consideration, for really optimal performance

I won't claim credit for this, as it was mentioned on the podcast... ( https://youtu.be/gTBb7LxTBbE?si=AvuwcUSl5XGTKczB )

When we iterate through a data structure we can benefit from CPU cache lines as the processor can pre-fetch data around the location in memory - that's not necessarily going to be the case if our iteration is in a reversed order.

A time for cool heads