
Mitigating DNS issues with a custom DNS resolver

Context 

The recent AWS outage that involved DNS lookups for DynamoDB failing to resolve to any address reminded me of a situation that sometimes showed up in the error logs of some microservices that I used to maintain.

I'll spare you the details of the issue - partly because I can't remember them almost a decade later.

How I went about reducing the impact of a temporary blip in DNS availability is the focus of this post, as I believe that the same approach could have reduced the impact of the AWS incident (with caveats, because "it depends" is universal).

A bit about microservices

Most microservices only communicate with a limited range of other services to perform their functionality. For this post we could consider DynamoDB as an example of a system that our microservice needs to be able to interact with, but others might typically include:

  • metrics service
  • logging service
  • some service developed for data lookup
  • feature flagging service
  • configuration management service

To decouple services from each other we can rely on DNS to tell us where to find each system that needs to be called upon.

In most horizontally scalable implementations we can expect the services to be behind some API gateway and/or load balancer system, which means that the services could be deploying new versions, scaling up, or scaling down without our microservice having any need to adjust where it addresses its outgoing requests.

DNS resolution

A DNS lookup in its simplest form is a way of specifying a hostname and getting back an IP address. Once we have that address we can send requests to it and interpret responses as being associated with the API of the service that we are expecting to be interacting with.

DNS records can specify a "time to live" (TTL) as a way to signal how long systems can expect to be able to cache the lookup result. For a stable website on the open Internet that is not due to undergo any hosting changes, a sensible default might range from 1 hour to 24 hours, but DNS used for service discovery within a virtual private cloud (VPC) typically carries much shorter TTLs.
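To make that concrete, here is a minimal sketch of such a lookup from a JVM-based service (an assumption about the stack on my part; every platform exposes an equivalent call). It asks for every address currently advertised for a hostname, and the JVM layers its own caching on top (controlled by the networkaddress.cache.ttl security property):

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    public class LookupExample {
        public static void main(String[] args) throws UnknownHostException {
            // Ask the resolver for every address currently advertised for the host.
            InetAddress[] addresses = InetAddress.getAllByName("prod.somehost.somenetwork");
            for (InetAddress address : addresses) {
                System.out.println(address.getHostAddress());
            }
        }
    }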

Horizontal scaling of load balancers

I deliberately used the ambiguous terminology of "system" when referring to load balancing, as there can be - and should be - more than one physical server running the load balancing service for a given service.

Just like most other services, the load balancers can be expected to go in and out of service, which may sometimes involve introducing a different IP address as a candidate destination for clients to receive from the service discovery DNS lookup.

When there are multiple destinations available for a given host record, DNS resolution can expose multiple IP addresses as the answer to a lookup.

In order to avoid broken connectivity between clients and destination load balancer instances, it would be reasonable to expect that coordinating the timing of updates to DNS records with the shutting down of old instances should be a safe approach. This assumes that clients are honouring the TTL details being exposed.

Client lookup -> prod.somehost.somenetwork

DNS server response: prod.somehost.somenetwork is now available at IP addresses: 10.9.8.7, 10.9.7.8, 10.9.6.5 with TTL of 60 seconds

Shortly afterwards our infrastructure monitoring may detect that the ELB instance at IP address 10.9.7.8 is due for maintenance that may involve unavailability, so it spins up a fresh instance at IP address 10.9.5.6 and adds that address to the DNS record:

DNS server record for prod.somehost.somenetwork now has IP addresses: 10.9.8.7, 10.9.6.5, 10.9.5.6 with TTL of 60 seconds

Allowing for some reasonable window of time to complete serving traffic and draining connections, the load balancer instance at 10.9.7.8 can be safely shut down.

Additional layer of DNS caching

Covering for specific temporary glitches

Finally, this is the interesting bit...

In a situation where our microservice application has been running for a while it will have successfully established connections to each of its dependencies and be steadily processing data and emitting logs and metrics. Now let's consider what would happen if the next request destined for DynamoDB happens to occur just after the resolved DNS data has passed its time to live. We fire off a fresh lookup to obtain the latest representation of the address(es) to direct requests to, but the AWS race condition for updating records has wiped out the record containing the correct address data...

Client lookup -> dynamodb.us-east-1.amazonaws.com

DNS server response -> something empty, no addresses found for the specified hostname

An obedient, naïve application would take this to mean that there is now no address available for DynamoDB in us-east-1, so we have to halt all interactions involving DynamoDB. That's a sad situation to be in. On the bright side, at least it's only one dependency involved, so we would still be able to log that an issue has occurred, and emit a metric counting that we've encountered a DNS problem.
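For illustration, on the JVM (again an assumption about the stack) that empty answer surfaces as an UnknownHostException, and the naive path simply lets it propagate:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    class NaiveResolution {
        // Any failure to resolve becomes a hard failure for the caller:
        // if the record has vanished this throws, and the request to the
        // dependency cannot even be attempted.
        static InetAddress[] resolve(String host) throws UnknownHostException {
            return InetAddress.getAllByName(host);
        }
    }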

The alternative approach that I have applied in production systems involves configuring a customised DNS resolver within the microservice that simply keeps track of the last known good DNS response, so that we can continue to attempt to direct our requests somewhere if DNS is flaky.
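As a concrete sketch of the idea, here it is expressed against Apache HttpClient's DnsResolver extension point - an assumption on my part about which HTTP client is in play, and the class name below is made up for illustration; the same pattern fits any client library that lets you plug in name resolution:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.http.conn.DnsResolver;
    import org.apache.http.impl.conn.SystemDefaultDnsResolver;

    // Remembers the last successful answer per hostname and falls back to it
    // when a fresh lookup fails or comes back empty.
    public class LastKnownGoodDnsResolver implements DnsResolver {

        private final DnsResolver delegate = SystemDefaultDnsResolver.INSTANCE;
        private final Map<String, InetAddress[]> lastKnownGood = new ConcurrentHashMap<>();

        @Override
        public InetAddress[] resolve(String host) throws UnknownHostException {
            try {
                InetAddress[] addresses = delegate.resolve(host);
                if (addresses != null && addresses.length > 0) {
                    lastKnownGood.put(host, addresses); // record the good answer
                    return addresses;
                }
            } catch (UnknownHostException e) {
                // fall through and try the last known good answer instead
            }
            InetAddress[] cached = lastKnownGood.get(host);
            if (cached != null) {
                return cached; // stale, but better than refusing to send anything
            }
            throw new UnknownHostException(host);
        }
    }

Wiring it in is then a single call when building the client, e.g. HttpClientBuilder.create().setDnsResolver(new LastKnownGoodDnsResolver()).build(). Note that normal, TTL-respecting lookups still happen on every resolution; the cached answer is only used when the fresh one is missing.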

Limitations

The ability to carry on using cached DNS details is only of limited use:

  • The microservice could continue to attempt to send requests to an instance that has gone out of service, as we are no longer obeying the TTL
  • New instances of our microservice won't have a value to use
    • so our microservice can't scale up and be fully productive for the time being
    • if our microservice is re-deployed the new instances won't be useful
      • This is a situation where checks for usefulness need to be in place before we rotate out old instances and expect new instances to take on the load.

Why does it matter?

There are some errors that you won't see in your logs or in your metrics, because the issue prevents the logs and metrics from even going out. Just because you haven't seen any errors or exceptions being logged involving DNS doesn't mean that your services are not encountering them.

Resilience isn't just about retrying HTTP requests after timeouts and connection failures; DNS has just as much potential to bring your systems down.

Useful? 

Let me know if you've found this useful, or have applied a similar or superior approach in your code. 
