Monday, 27 October 2025

Caching successful DNS in modern Java

Things have moved on a bit since I set up a custom DNS resolver for microservices to enable them to continue to use known successfully resolved addresses when DNS resolution subsequently fails.

Since Java 22 we can set a security property to opt in to continuing to use successful lookup results even after the TTL expires.

The property name is: 

networkaddress.cache.stale.ttl
https://bugs.openjdk.org/browse/JDK-8306653
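The property is a security property, so it can be set in the JDK's java.security file or programmatically before the first lookup happens. A minimal sketch, assuming a one-hour (3600 second) stale window is acceptable; the value is just an example:

```java
import java.security.Security;

public class StaleDnsConfig {
    public static void main(String[] args) {
        // Allow stale DNS entries to be served for up to an hour after their
        // TTL expires. This needs to run before the first name lookup, because
        // the networking layer reads the property when it initialises.
        Security.setProperty("networkaddress.cache.stale.ttl", "3600");

        System.out.println(Security.getProperty("networkaddress.cache.stale.ttl"));
    }
}
```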

The compromise 

A trade-off to consider when applying that approach instead of a customised DNS resolver is that we lose the option of detecting when normal DNS resolution has failed.

Hypothetical situation

Let's suppose that a minor adjustment was applied to the deployment approach for rolling out a new version of microservices within the organisation's VPC. We might start to notice intermittent errors showing in our logs, where some unintended temporary DNS clear-out results in requests continuing to be sent to the old address(es). Without visibility of when the fallback to the last known good value is being applied, it would take longer to track down the underlying issue.

Mitigating DNS issues with custom DNS resolver

Context 

The recent AWS outage that involved DNS lookups for DynamoDB failing to resolve to any address reminded me of a situation that sometimes showed up in the error logs of some microservices that I used to maintain.

I'll spare you the details of the issue - partly because I can't remember them almost a decade later.

How I went about reducing the impact of a temporary blip in DNS availability is the focus of this post, as I believe that the same approach could have reduced the impact of the AWS incident (with caveats, because "it depends" is universal).

A bit about microservices

Most microservices only communicate with a limited range of other services to perform their functionality. For this post we could consider DynamoDB as an example of a system that our microservice needs to be able to interact with, but others might typically include:

  • metrics service
  • logging service
  • some service developed for data lookup
  • feature flagging service
  • configuration management service

To de-couple services from each other we may rely on DNS to specify where we expect to be able to find each system that needs to be called upon.

In most horizontally scalable implementations we can expect the services to be behind some API gateway and/or load balancer system, which means that the services could be deploying new versions, scaling up, or scaling down without our microservice having any need to adjust where it addresses its outgoing requests.

DNS resolution

A DNS lookup in its simplest form is a way of specifying a hostname and getting back an IP address. Once we have that address we can send requests to it and interpret responses as being associated with the API of the service that we are expecting to be interacting with.
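In Java the simplest form of that lookup is a call to InetAddress. This sketch uses "localhost" so that it works without external network access; a real service would pass its dependency's hostname:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsLookupDemo {
    public static void main(String[] args) throws UnknownHostException {
        // Resolve a hostname to all of its addresses. A multi-homed record
        // can return more than one entry here.
        InetAddress[] addresses = InetAddress.getAllByName("localhost");
        for (InetAddress address : addresses) {
            System.out.println(address.getHostAddress());
        }
    }
}
```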

DNS records can specify a "time to live" (TTL) as a way to signal how long systems can expect to be able to cache the lookup result. For a stable website on the open Internet that is not due to undergo any hosting changes a sensible default might range from 1 hour to 24 hours, but applying DNS for service discovery within a virtual private cloud (VPC) can expect to involve much lower time for caching of DNS.

Horizontal scaling of load balancers

I deliberately used the ambiguous terminology of "system" when referring to load balancing, as there can be - and should be - more than one physical server running the load balancing service for a given service.

Just like most other services, the load balancers can be expected to go in and out of service, which may sometimes introduce a different IP address as a candidate destination for clients to receive from the service discovery DNS lookup.

When there are multiple destinations available for a given host record, DNS resolution can expose multiple IP addresses as the answer to a lookup.

In order to avoid broken connectivity between clients and destination load balancer instances, it is reasonable to expect that coordinating the timing of updates to DNS records with the shutting down of old instances is a safe approach. This assumes that clients are honouring the TTL details being exposed.

Client lookup -> prod.somehost.somenetwork

DNS server response: prod.somehost.somenetwork is now available at IP addresses: 10.9.8.7, 10.9.7.8, 10.9.6.5 with TTL of 60 seconds

Shortly afterwards our infrastructure monitoring may detect that the ELB instance at IP address 10.9.7.8 is due for maintenance that can involve unavailability, so it spins up a fresh instance at IP address 10.9.5.6 and adds the IP address for that to the DNS record:

DNS server record for prod.somehost.somenetwork now has IP addresses: 10.9.8.7, 10.9.6.5, 10.9.5.6 with TTL of 60 seconds

Allowing for some reasonable window of time to complete serving traffic and draining connections, the load balancer instance at 10.9.7.8 can be safely shut down.

Additional layer of DNS caching

Covering for specific temporary glitches

Finally, this is the interesting bit...

In a situation where our microservice application has been running for a while it will have successfully established connections to each of its dependencies and be steadily processing data and emitting logs and metrics. Now let's consider what would happen if the next request destined for DynamoDB happens to occur just after the resolved DNS data has passed its time to live. We fire off a fresh lookup to obtain the latest representation of the address(es) to direct requests to, but the AWS race condition for updating records has wiped out the record containing the correct address data...

Client lookup -> dynamodb.us-east-1.amazonaws.com

DNS server response -> something empty, no addresses found for the specified hostname

An obedient, naïve application would take this as meaning that now there is no address available for DynamoDB in us-east-1, meaning that we have to halt all interactions involving DynamoDB. That's a sad situation to be in. On the bright side, at least it's only one dependency involved, so we would still be able to log that an issue has occurred, and emit a metric counting that we've encountered a DNS problem.

The alternative approach that I have applied in production systems involves configuring a customised DNS resolver within the microservice that simply keeps track of the last known good DNS response, so that we can continue to attempt to direct our requests somewhere if DNS is flaky.
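A minimal sketch of the idea, with hypothetical class and method names (the production version would plug into the HTTP client's or the JVM's resolver hooks rather than being called directly):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * "Last known good" resolver sketch. On a successful lookup the result is
 * remembered; on failure the previous result is served instead of
 * propagating the error.
 */
public class LastKnownGoodResolver {
    private final Map<String, List<InetAddress>> lastKnownGood = new ConcurrentHashMap<>();
    private final Function<String, List<InetAddress>> delegate;

    public LastKnownGoodResolver(Function<String, List<InetAddress>> delegate) {
        this.delegate = delegate;
    }

    public List<InetAddress> resolve(String host) throws UnknownHostException {
        try {
            List<InetAddress> fresh = delegate.apply(host);
            if (fresh != null && !fresh.isEmpty()) {
                lastKnownGood.put(host, fresh); // remember the good answer
                return fresh;
            }
        } catch (RuntimeException e) {
            // Fall through to the cached answer. A real implementation would
            // log and emit a metric here so the flakiness stays visible.
        }
        List<InetAddress> cached = lastKnownGood.get(host);
        if (cached == null) {
            throw new UnknownHostException(host);
        }
        return cached;
    }
}
```

The key detail is that a failed or empty lookup does not evict the cached entry, which is exactly the opposite of what an obedient TTL-honouring cache would do.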

Limitations

The ability to carry on using cached DNS details is only of limited use:

  • The microservice could continue to attempt to send requests to an instance that has gone out of service, as we are no longer obeying the TTL
  • New instances of our microservice won't have a value to use
    • so our microservice can't scale up and be fully productive for the time being
    • if our microservice is re-deployed the new instances won't be useful
      • This is a situation where checks for usefulness need to be in place before we rotate out old instances and expect new instances to take on the load.

Why does it matter?

There are some errors that you won't see in your logs or in your metrics, because the issue prevents the logs and metrics from even going out. Just because you haven't seen any errors or exceptions being logged involving DNS doesn't mean that your services are not encountering them.

Resilience isn't just about retrying HTTP requests after timeouts and connection failures; DNS has just as much potential to bring your systems down.

Useful? 

Let me know if you've found this useful, or have applied a similar or superior approach in your code. 

Friday, 24 October 2025

AWS outage - It wasn't just DNS

The AWS outage that temporarily took down so much of the Internet this week has been reported as being due to a DNS issue, but that wasn't the root cause.

AWS have posted details of the incident explaining how a race condition in the mechanism for updating the DNS records ultimately caused the problem.

I suspect that the possibility of the race condition occurring may have been increased by changes required for the use of IPv6 for DynamoDB, as that would have involved expanding the amount of data involved.

It will be interesting to see if AWS offers up more details of their approach to eliminating the race condition. As much as anything, I'm curious to find out what is similar to the race condition that I worked through earlier this year.

Tuesday, 21 October 2025

How UFO protected from DynamoDB outages

Context - The October 21 2025 AWS Outage 

I've just read a news article about Amazon Web Services (AWS) having a problem with their DynamoDB service which impacted a lot of their customers in the popular / vital US-EAST-1 region. This reminded me of another time that relying on DynamoDB caused an outage for one of the companies that I was working for when I lived in London, and how we went about preventing that from happening again.

"It's always DNS"

"Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1." 

It's a cliché that outages on the Internet are often caused by some issue involving DNS, and that happened to match the root cause this time around.

What's the UFO?

My team's core service was for capturing metadata about customers' usage of various components of the website. Because the service relied on DynamoDB for storing that information, we came up with a fallback service for the rare situations of DynamoDB being unavailable for any reason. At the time we were using short names for our services, typically based on the acronym of the service name - so I retro-fitted a service name to match with UFO as acronym, Usage Fallback Option.

How did it work?

The UFO service would write to an alternative data store that was independent of DynamoDB.

Clients of the normal usage service applied a circuit-breaker approach to interactions with the regular usage service, so the fallback service would automatically start to receive traffic if an outage occurred. 
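I don't have the original client code to hand, so here is a deliberately tiny sketch of the circuit-breaker idea with hypothetical names; production systems would typically reach for a library such as Resilience4j, with half-open probing and time windows:

```java
import java.util.function.Supplier;

/**
 * Toy circuit breaker: after `threshold` consecutive failures the primary
 * service is skipped and the fallback is used directly. Illustrative only.
 */
public class FallbackCircuit<T> {
    private final Supplier<T> primary;
    private final Supplier<T> fallback;
    private final int threshold;
    private int consecutiveFailures = 0;

    public FallbackCircuit(Supplier<T> primary, Supplier<T> fallback, int threshold) {
        this.primary = primary;
        this.fallback = fallback;
        this.threshold = threshold;
    }

    public synchronized T call() {
        if (consecutiveFailures >= threshold) {
            return fallback.get(); // circuit open: go straight to the fallback
        }
        try {
            T result = primary.get();
            consecutiveFailures = 0; // success closes the circuit again
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return fallback.get();
        }
    }
}
```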

Recovery scripts were available for either re-feeding the data into DynamoDB once it came back into service, or for progressing to the next phase of the data processing pipeline.

Should this approach be broadly applicable?

The short answer is "No."

In this particular situation we had a piece of functionality that was not visible to customers and was just an implementation detail of an asynchronous data processing pipeline.

On other projects I have worked on services that relied on DynamoDB as the primary data source involved in serving content on a website, so there was no similar fallback mechanism available there. Caches could have hidden some of the outage, but they were not intended to keep the site up.

It made sense to invest in the fallback service for usage as usage data acted as the transaction measurement of value being delivered when it came to contract negotiations with customers.

As the most recent outage only impacted a single region, I expect that the team would have been able to switch traffic to 0% in the impacted region. This would have been possible due to the regions being structured to be fully independent.

Addendum

Blast radius

From a brief skim-read of the AWS status page it is apparent that EC2 launching had a dependency on DynamoDB. I take that to imply that the UFO service would not have been able to scale up to pick up the full load of usage events.

Sunday, 19 October 2025

Records in Java

What are they?

On the surface, records seem to just be syntactic sugar for producing a class that can hold some specific combination of data. They can be used to make code less verbose, but there's more to them than that. 

What do we get for free?

equals, hashCode, and toString methods are automatically implemented. If we need to deviate from the default behaviour of these methods then we can still specify the implementation in the same way that we would override those methods in a normal class.
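A small sketch of what the compiler gives us, using a made-up Point record:

```java
public class RecordDemo {
    // A record declares its components once; the compiler generates the
    // constructor, accessors, equals, hashCode, and toString.
    record Point(int x, int y) { }

    public static void main(String[] args) {
        Point a = new Point(1, 2);
        Point b = new Point(1, 2);
        System.out.println(a.equals(b)); // true: component-wise equality
        System.out.println(a);           // Point[x=1, y=2]
    }
}
```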

What else is special?

Records are implicitly final, so you cannot extend from them with a sub-class. If you want to combine them with some other data then declare another Record that contains the original as a component.
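For example, instead of extending an existing record we wrap it; the names here are hypothetical:

```java
public class RecordComposition {
    record Address(String street, String city) { }

    // Records cannot be extended, so extra data is added by declaring a new
    // record that contains the original as a component.
    record Customer(String name, Address address) { }

    public static void main(String[] args) {
        Customer c = new Customer("Ada", new Address("1 High St", "London"));
        System.out.println(c.address().city());
    }
}
```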

That doesn't prevent us from including generics or annotations in the same way that we do with regular classes. 

Why are they?

Records are a special type of class in Java that enables developers to represent their data with less of an extensible mindset, which aligns with "Favor composition over inheritance" from Joshua Bloch's Effective Java book.

JDK Enhancement Proposal (JEP) 395 goes into more detail about the motivation and history of introducing records as a feature of the JDK.

Considerations for use

Records have been a part of Java since version 16, with Java 17 as the first long term support (LTS) version to include them.

Java 17 was released in September of 2021, so at the time of writing this post we are at a point where we can expect any stable and mature libraries to be well and truly up to speed with operating with records.

One situation where records cannot be applied is as an @Entity in the Java persistence API (JPA).

Caveat

This post has mainly been created as a refresher / reminder to myself about where records fit in Java development. I applied them in an interview about three years ago, but was a little rusty on the pros and cons when it came to a similar interview more recently.

If you want to get deeper, go and check out the JEP or other documentation. 

 

Saturday, 11 October 2025

Java best practices, for features introduced from Java 9 to Java 25

I've just watched a recording of a presentation by Stephen Colebourne from Devoxx Belgium, "The New Java Best Practices by Stephen Colebourne".

Some of them I agreed with, some I am yet to warm to.

  • I haven't made much use of "var" in place of specifying a variable type, so I still find code more readable when the type is specified.
  • I am fully onboard with the best practice about the use of modules, "arguably the biggest change since Java 8".
  • unnamed variables came in Java 22, which might explain why I haven't encountered them so far, as companies tend to stick to long term support (LTS) versions of the language and I only got to Java 21.
  • markdown docs. Check out the Inside Java podcast about this ( https://inside.java/2025/01/21/podcast-034/ )
  • Optional and null - the code style example shows how Optional can flow as more readable

This is not an exhaustive breakdown of what Colebourne covered, so I recommend that you take a look for yourself.


Thursday, 9 October 2025

AI Wars - attack of the clones

I've been checking in some of my experimental code with GitHub repositories over the last while, and noticed that the "Traffic" section under "Insights" shows a dozen or so clones on each repo.

I don't have any real followers, so I'm going with the assumption that the cloning is by some automated systems picking up my code for use by artificial intelligence code assistants.

Just a pity that GitHub does not offer any way of monitoring the origin of these clones.

There's also a possibility that some of the traffic may relate to Dependabot, but in the interests of keeping this post with its amusing title I'm going to check for that after posting.

 

(Disclaimer: This post has no connection to Star Wars, or Disney's trademark on "Attack of the Clones", and is not promoting any product or service)

Tuesday, 7 October 2025

Constructors - another way Java 25 can be subtly different

Constructors are still super - eventually 

Java 25 has introduced a change to the way that constructors work in the inheritance hierarchy.

In previous versions of Java, every constructor of a subclass had to begin by invoking a base class constructor: either an explicit super(...) call as the first operation, an implicit call to the zero-argument constructor, or a this(...) call to another constructor that eventually invokes one.

In Java 25 it is now valid to call super() later in the constructor.

So, we could have the following as valid code...

public class SomeBaseClass {
    SomeBaseClass() {
        System.out.println("Hello from SomeBaseClass constructor");
    }
}

public class SomeSubClass extends SomeBaseClass {
    SomeSubClass() {
        System.out.println("Hello from SomeSubClass constructor");

        super();
    }
}

public class ConstructorDemo {
    static void main() {
        SomeSubClass someSubClass = new SomeSubClass();
    }
}
 
The output from running ConstructorDemo would be:
Hello from SomeSubClass constructor
Hello from SomeBaseClass constructor

Note the order: the statements placed before super() run before the base class constructor does.

In older versions of Java we would have gotten a compilation error for the SomeSubClass constructor.

See https://openjdk.org/jeps/513 for the details.

It makes sense for validation and similar use cases, but could trigger a few sideways glances from developers reviewing pull requests.
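A hedged sketch of the validation use case, with made-up class names; this only compiles on Java 25 or later:

```java
public class ValidationDemo {
    static class Connection {
        Connection(String host) {
            System.out.println("Connecting to " + host);
        }
    }

    static class CheckedConnection extends Connection {
        CheckedConnection(String host) {
            // Java 25: fail fast before any base class work happens, so an
            // invalid value never reaches the Connection constructor.
            if (host == null || host.isBlank()) {
                throw new IllegalArgumentException("host must not be blank");
            }
            super(host);
        }
    }

    public static void main(String[] args) {
        new CheckedConnection("db.internal");
        try {
            new CheckedConnection("");
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

Before Java 25 the same effect needed a static helper method smuggled into the super(...) argument list, which was harder to read.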

A time for cool heads