Learning from the mistakes of others
A few years ago a team that worked across the office from my team came up with a neat way of introducing a cache to speed up the performance of the most business critical pages on the company's main money earning website.
Page load time had been identified as a particularly significant aspect of how search engines would rank the value of websites, so getting a few milliseconds off that metric was a great achievement.
Fast forward several months, the unthinkable happened as the infrastructure component that was at the heart of the caching implementation had a temporary outage, taking away the possibility of loading pages at all.
It was quite a while ago, and I wasn't directly involved in the recovery process but I expect that it would have taken a stressful hour or three to recover.
Avoid introducing points of failure
As inevitably happens, the project that I was working on required some performance improvements in order for it to be considered worthy of going to production.
From being aware of the potential failure modes of the caching infrastructure, we were able to design our implementation to gracefully degrade if and when the cache was unavailable. Instead of becoming completely unavailable our pages would fall back to obtain the data directly from the source of truth.
The cost involved here would be additional latency for the end users, and some additional cost from needing to auto-scale up the read capacity provisioning of the data store.
Comments
Post a Comment