Help others to help yourself
This post is about how it sometimes pays to take a look beyond the services your team owns, so that you have a deeper understanding of the operating context and can have confidence in the performance and robustness of the implementation.
I wouldn't claim to be an expert in anything, but sometimes my extra pair of eyes picks up on an opportunity to make a small change and get a significant benefit.
Database queries
Back when I was operating in an environment where teams had access to each other's logs and metrics, I could dig into what was going on when my login service was hitting timeouts from a dependency.
Based on the details in the logs, the culprit seemed to be delays from a database query. Surprisingly enough, the database was missing an index for the most common query pattern, so as we scaled up from a few hundred users to a few thousand the query slowed down to the point where it started triggering those timeouts.
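As a minimal illustration of the effect (using SQLite in place of the actual database involved, with made-up table and column names), a query plan shows whether the common lookup hits an index or falls back to a full table scan, and how adding the missing index changes that:

import sqlite3

# In-memory database standing in for the real one; the schema is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, last_login TEXT)")

def explain(query):
    # EXPLAIN QUERY PLAN reports whether SQLite will scan the whole table or use an index.
    for row in conn.execute("EXPLAIN QUERY PLAN " + query):
        print(row)

lookup = "SELECT id FROM users WHERE email = 'someone@example.com'"
explain(lookup)   # -> SCAN users (full table scan, gets slower as the table grows)

conn.execute("CREATE INDEX idx_users_email ON users (email)")
explain(lookup)   # -> SEARCH users USING INDEX idx_users_email (email=?)

The same kind of check against the real database's query planner is usually enough to confirm whether a slow query is down to a missing index or something more involved.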
Default configuration options don't always match what was in place in previous setups, and a migration from a hand-configured setup to one managed with infrastructure as code can quietly drop settings that were only ever applied by hand.
Logging of ElasticSearch slow queries
In AWS at least, there is an option to have ElasticSearch log queries that take longer than a configured threshold. This can feed into an evaluation of whether the data or the query needs to be adjusted to reach acceptable and / or optimal performance.
I was involved in a project where an existing ElasticSearch setup needed to be migrated to a new cluster with a less click-ops approach to the configuration. When I took a look into the new setup I noticed that slow query logging was not enabled, so I alerted the other team and they were able to adjust the config before we needed it.
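As a sketch of what that setting looks like at the index level (the endpoint, index name and thresholds below are placeholders, authentication is omitted, and on AWS the slow logs also need to be published to CloudWatch through the domain configuration), the search slow log thresholds can be set via the index settings API:

import requests

# Hypothetical cluster endpoint and index name, for illustration only.
ENDPOINT = "https://my-es-domain.example.com"
INDEX = "my-index"

# Thresholds at which ElasticSearch records a search in the slow log.
settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "2s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

resp = requests.put(f"{ENDPOINT}/{INDEX}/_settings", json=settings, timeout=10)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} on success

Once thresholds like these are in place, any query that exceeds them shows up in the slow log and can be reviewed before it becomes a production problem.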
Finding a root cause during a production incident
On one occasion I followed the incident call and chat while an outage of a system that impacted the workflow of every developer in the company was underway and no root cause had been established.
There were a few pages of logs to look at, so it took a while to isolate what was relevant to the situation.
Without going into too much detail, it turned out that a Docker container that was part of the deployment process was not up to date. This was not a simple case of a team not keeping up with the latest available updates, but a situation where the third-party developers had switched away from the particular distribution that was involved.
Automatically picking up updated versions and applying them would not have helped in that situation: the distribution switch was effectively a fork, so it would not have been easy to detect automatically.
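As a rough sketch of the kind of check that could surface this (the image name, expected distribution and comparison below are assumptions for illustration), reading /etc/os-release from inside an image shows which distribution it is actually built on, which is exactly the detail a plain version-bump automation would not report:

import subprocess

# Hypothetical image reference; in practice this would be the image used in the deployment pipeline.
IMAGE = "example/builder:latest"
EXPECTED_DISTRO = "debian"   # the distribution the pipeline was originally built around

# Read the distribution identifier from inside the image.
result = subprocess.run(
    ["docker", "run", "--rm", IMAGE, "cat", "/etc/os-release"],
    capture_output=True, text=True, check=True,
)

os_release = dict(
    line.split("=", 1) for line in result.stdout.splitlines() if "=" in line
)
distro = os_release.get("ID", "").strip('"')

if distro != EXPECTED_DISTRO:
    print(f"Base distribution changed: expected {EXPECTED_DISTRO}, found {distro}")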