How UFO protected us from DynamoDB outages

Context - The October 21, 2025 AWS Outage

I've just read a news article about Amazon Web Services (AWS) having a problem with their DynamoDB service, which impacted a lot of their customers in the popular and vital US-EAST-1 region. It reminded me of another time that relying on DynamoDB caused an outage at one of the companies I worked for when I lived in London, and how we went about preventing that from happening again.

"It's always DNS"

"Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1." 

It's a cliché that outages on the Internet are often caused by some issue involving DNS, and that cliché happened to match the root cause this time around.

What's the UFO?

My team's core service captured metadata about customers' usage of various components of the website. Because the service relied on DynamoDB for storing that information, we came up with a fallback service for the rare situations in which DynamoDB was unavailable for any reason. At the time we were using short names for our services, typically based on the acronym of the service name - so I retrofitted a name to match the acronym UFO: Usage Fallback Option.

How did it work?

The UFO service would write to an alternative data store that was independent of DynamoDB.

Clients applied a circuit-breaker approach to their interactions with the regular usage service, so the fallback service would automatically start to receive traffic if an outage occurred.
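
The post doesn't include the original code, but a minimal sketch of that circuit-breaker pattern in Java might look like the following. All the names here (UsageRecorder, UsageSink, UfoClient, the thresholds) are my own invention rather than the real service's API:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Minimal circuit-breaker sketch: after enough consecutive failures
 * against the primary usage service, divert writes to the fallback
 * (UFO) store until the breaker's open period has elapsed.
 */
public class UsageRecorder {

    /** A usage event; the fields are illustrative only. */
    public record UsageEvent(String customerId, String component, Instant at) {}

    /** Abstraction over the two possible destinations. */
    public interface UsageSink {
        void record(UsageEvent event) throws Exception;
    }

    private final UsageSink primary;   // normal usage service (DynamoDB-backed)
    private final UsageSink fallback;  // UFO service (independent data store)

    private final int failureThreshold;
    private final Duration openDuration;

    private int consecutiveFailures = 0;
    private Instant openedAt = null;

    public UsageRecorder(UsageSink primary, UsageSink fallback,
                         int failureThreshold, Duration openDuration) {
        this.primary = primary;
        this.fallback = fallback;
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    public synchronized void record(UsageEvent event) throws Exception {
        if (circuitOpen()) {
            // Circuit is open: route straight to the fallback store.
            fallback.record(event);
            return;
        }
        try {
            primary.record(event);
            consecutiveFailures = 0; // healthy again
        } catch (Exception e) {
            if (++consecutiveFailures >= failureThreshold) {
                openedAt = Instant.now(); // trip the breaker
            }
            // Don't lose the event: divert this one to the fallback too.
            fallback.record(event);
        }
    }

    private boolean circuitOpen() {
        if (openedAt == null) return false;
        if (Instant.now().isAfter(openedAt.plus(openDuration))) {
            // Half-open: let the next call probe the primary again.
            openedAt = null;
            consecutiveFailures = 0;
            return false;
        }
        return true;
    }
}
```

In practice a library such as Resilience4j could provide the breaker itself, but the shape is the same: trip after repeated failures, divert to the fallback, and periodically probe the primary.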

Recovery scripts were available either for re-feeding the data into DynamoDB once it came back into service, or for progressing directly to the next phase of the data processing pipeline.
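
Again, the actual scripts aren't shown in the post, but a re-feed step could look roughly like this sketch using the AWS SDK for Java v2. The table name, attribute names, and FallbackStore interface are all hypothetical:

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

/**
 * Sketch of a re-feed script: drain the events captured by the
 * fallback store and write them back into DynamoDB once it has
 * recovered.
 */
public class UsageRefeed {

    public record UsageEvent(String customerId, String component, Instant at) {}

    /** Hypothetical read side of whatever store UFO wrote to. */
    public interface FallbackStore {
        List<UsageEvent> nextBatch(int maxEvents);
        void acknowledge(List<UsageEvent> events);
    }

    public static void refeed(FallbackStore store, DynamoDbClient dynamo, String table) {
        List<UsageEvent> batch;
        while (!(batch = store.nextBatch(100)).isEmpty()) {
            for (UsageEvent e : batch) {
                dynamo.putItem(PutItemRequest.builder()
                        .tableName(table)
                        .item(Map.of(
                                "customerId", AttributeValue.builder().s(e.customerId()).build(),
                                "component",  AttributeValue.builder().s(e.component()).build(),
                                "at",         AttributeValue.builder().s(e.at().toString()).build()))
                        .build());
            }
            // Only remove events from the fallback store once DynamoDB
            // has accepted them, so a crash mid-run just re-feeds a batch.
            store.acknowledge(batch);
        }
    }
}
```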

Is this approach broadly applicable?

The short answer is "No."

In this particular situation we had a piece of functionality that was not visible to customers and was just an implementation detail of an asynchronous data processing pipeline.

On other projects I have worked on services that relied on DynamoDB as the primary data source for serving content on a website, so no similar fallback mechanism was available there. Caches could have hidden some of the outage, but they were not intended to keep the site up.

It made sense to invest in a fallback for the usage service in particular, as usage data acted as the measure of the value being delivered when it came to contract negotiations with customers.

As the most recent outage only impacted a single region, I expect that the team would have been able to dial traffic in the impacted region down to 0%. This would have been possible because the regions were structured to be fully independent.

Addendum

Blast radius

From a brief skim-read of the AWS status page it is apparent that launching new EC2 instances had a dependency on DynamoDB. I take that to imply that the UFO service would not have been able to scale up to pick up the full load of usage events.
