Skip to main content

The Importance of Segmenting Infrastructure

Kafka for Logging

I was recently poking around in the source code of a few technologies that I have been using for a few years when I came across KafkaLog4jAppender. It enables you to use Kafka as a place to capture application logs. The thing that caught my eye was the latest commit associated with that particular class, "KafkaLog4jAppender deadlocks when idempotence is enabled".

In the context of Kafka, idempotence is intended to enable the system to avoid producing duplicate records when a producer may need to retry sending events due to some - hopefully - intermittent connectivity problem between the producer and the receiving broker.

The unfortunate situation that arises here is that the Kafka client code itself uses Log4j, so it can result in the application being blocked from sending its logs via a Kafka topic because the Kafka client Producer gets deadlocked waiting on transaction state.

Kafka For Metrics - But Not For Kafka Metrics

This reminded me of a similar scenario where an organisation might choose to use Kafka as their mechanism for sending out notifications of metrics for their microservices and associated infrastructure. If Kafka happens to be part of the infrastructure that you are interested in being able to monitor, then you need to keep those resources isolated from the metrics Kafka - otherwise you run the risk of an incident impacting Kafka which prevents the metrics from being transmitted.

Keeping Things Separated

A real world example of keeping infrastructure isolated from itself can be seen in the way Confluent Cloud handles audit logs. I found it a little confusing at first, as the organisation that I was working for at the time only had Kafka clusters in a single region, but the audit logs were on completely separate infrastructure in another region and even another cloud provider.

Sometimes You're Using A Service Indirectly

A slightly different - but no less significant - example of the need for isolating resources can arise when a particular type of infrastructure is being used for different types of workload. Rather than having a "big bang" release of changes to all of the systems, a phased rollout approach can be taken. One of my earliest involvements with using AWS came shortly after their 2015 DynamoDB outage, which had a ripple out impact for a range of other AWS services because behind the scenes those other services were themselves utilising DynamoDB.

It's my understanding that AWS subsequently moved to isolating their internal services' DynamoDB resource from general consumers' DynamoDB infrastructure - but don't quote me on that.

Comments

Popular posts from this blog

Speeding up Software Builds for Continuous Integration

Downloading the Internet Can you remember the last time you started out on a clean development environment and ran the build of some software using Maven or Gradle for dependency management? It takes ages to download all of the necessary third party libraries from one or more remote repositories, leading to expressions like, "Just waiting for Maven to download the Internet". Once your development environment has been used for building a few projects the range of dependencies that will need to be downloaded for other builds reduces down as the previously referenced ones will now be cached and found locally on your computer's hard drive. What happens on the Continuous Integration environment? Now consider what goes on when Jenkins or your other preferred Continuous Integration server comes to build your software. If it doesn't have a local copy of the libraries that have been referenced then it is going to pay the cost of that slow " download the Internet" p...

2022 - A year in review

Just a look back over the last 12 months. January I moved back to Christchurch to live, after having spent a few months further south since moving back from London. Work was mainly around balancing other peoples' understanding and expectations around our use of Kafka. February I decided that it would be worthwhile to have a year's subscription for streaming Sky Sports, as some rugby matches that I would want to watch would be on at time when venues wouldn't be open. Having moved to Christchurch to be close to an office, now found myself working from home as Covid restrictions came back into effect across New Zealand. March Got back into some actual coding at work - as opposed to mainly reviewing pull requests for configuration changes for Kafka topics.  This became urgent, as the command line interface tool that our provisioning system was dependent on had been marked for deprecation. April   Had my first direct experience with Covid-19.  I only went for a test because ...

Applying AI to software development can be like following SatNav

Trying out a different navigation system A month or so ago I upgraded to a car that has a SatNav system included, so I have been trying to use that instead of the Maps app on my phone. My experiences with it so far have generally been good, but it is far from flawless - a bit like Artificial Intelligence (AI) in software development. As context, my previous vehicle was not too old to include SatNav, it just hadn't been set up with English language or New Zealand maps - one of the down sides of having a second hand vehicle that originated in Japan. Flawed or incomplete information Driving around central Christchurch can be a bit challenging at times as various roadworks are underway, leaving streets closed off or narrowed down to a single lane. It could be reasonable to expect that a basic navigation system might not have up to the minute awareness of those closures and restrictions. However, something that I did not expect to encounter was the navigation system advising me to expec...