Looking outside the service boundary

Help others to help yourself

This post is about how it sometimes pays to look beyond the services your team owns, so that you gain a deeper understanding of the operating context and can have confidence in the performance and robustness of the implementation.

I wouldn't claim to be an expert in anything, but sometimes my extra pair of eyes picks up on an opportunity to make a small change and get a significant benefit.

Database queries 

Back when I was operating in an environment where teams had access to the logs and metrics of other teams' services, I could dip into what was going on when my login service was hitting timeouts from a dependency.

Based on the details in the logs, the culprit seemed to be delays from a database query. Surprisingly enough, the database was missing an index for the most common query pattern, so as we scaled up from a few hundred users to a few thousand, that query slowed to the point of triggering those timeouts.
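The effect is easy to reproduce in miniature. Here is a sketch using SQLite (the table and column names are hypothetical, not the ones from the actual incident): the query planner falls back to a full table scan until an index covering the common lookup column exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows carry the human-readable detail in the last column.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT name FROM users WHERE email = 'user42@example.com'"

plan_before = plan(query)  # without an index this reports a scan of the table

conn.execute("CREATE INDEX idx_users_email ON users (email)")

plan_after = plan(query)   # now the planner can search via idx_users_email

print(plan_before)
print(plan_after)
```

On a table of a few thousand rows a scan is barely noticeable, which is exactly why the missing index only showed up as the user base grew; the cost of a scan grows linearly with the table while an index lookup stays close to constant.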

Default configuration options don't always match what had been in place in previous setups. A migration from a hand-configured setup to one using infrastructure as code can silently lose settings that were only ever applied by hand.

Logging of ElasticSearch slow queries

In AWS at least, there is an option to have ElasticSearch log the slowest queries that it encounters. This can feed into an evaluation of whether the data or the query needs to be adjusted to reach acceptable and/or optimal performance.

I was involved in a project where an existing ElasticSearch setup needed to be migrated to a new cluster with a less click-ops approach to configuration. When I looked into the new setup I noticed that slow query logging was not enabled, so I alerted the other team and they were able to adjust the config before we needed it.
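For reference, search slow logging is controlled by per-index threshold settings. A minimal sketch of the settings body (the threshold values here are made up, and on AWS you additionally have to enable publishing of the slow logs to CloudWatch on the domain itself, which is the part that is easy to miss in a migration):

```python
import json

# Hypothetical thresholds - tune them to your own latency targets.
# These are standard ElasticSearch index settings for the search slow log.
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "2s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

# This is the JSON body you would send as `PUT /<index>/_settings`.
body = json.dumps(slowlog_settings, indent=2)
print(body)
```

Because these are index-level settings rather than cluster defaults, a freshly created cluster starts with no slow logging at all, which is how the new setup in this story ended up without it.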

Finding a root cause during a production incident 

On one occasion I followed the incident call and chat while an incident was underway in a system that impacted the workflow of every developer in the company, and no root cause had been established.

There were a few pages of logs to look through, so it took a while to isolate what was relevant to the situation.

Without going into too much detail, it turned out that a Docker container that was part of the deployment process was not up to date. This was not a simple case of a team failing to keep up with the latest available updates, but a situation where the third-party developers had switched away from the particular distribution involved.

Automatically picking up and applying updated versions would not have helped in that situation: the distribution switch could be regarded as a kind of fork, so it would not have been easy to detect automatically.
