Posted On: 2022-01-10
As promised in the update to the previous post, today's post is a mini postmortem of this site's recent outage and technical issues. The bulk of the issues were avoidable, and so I hope that by sharing what I've learned from the experience, I may help others avoid similar issues of their own.
The primary issue at the heart of the outage was a misconfigured Certificate Authority Authorization (CAA) record. When set correctly, a CAA record tells Certificate Authorities (CAs) which of them have permission to issue new certificates for that particular domain. If an unauthorized CA receives a request for a new certificate, it is obligated to reject or ignore the request. This is useful, as it makes it more difficult (though not impossible) for an attacker to acquire a valid certificate and thereby impersonate a website. When misconfigured, however, it can make it quite difficult to get any certificate at all.
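A CAA record is, in the end, just another DNS record. As a quick illustration (using a made-up domain and CA, not my actual configuration), a zone-file entry authorizing a single CA might look something like this:

```
; Only the hypothetical CA "ca.example" may issue certificates
; (including wildcard certificates) for example.com.
example.com.  3600  IN  CAA  0 issue     "ca.example"
example.com.  3600  IN  CAA  0 issuewild "ca.example"
```

If you're curious what a given domain publishes, `dig example.com CAA +short` will show its CAA records (assuming a reasonably recent version of dig).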
My mistake pertaining to the misconfigured CAA record was four-fold:

1. I didn't read the correct documentation before making the change.
2. I didn't test the change once I'd made it.
3. I let far too much time pass between implementing the change and actually using it.
4. I took far too long to ask support for help once things went wrong.
In hindsight, it's easy to see those mistakes and how to avoid them. Reading the (correct) documentation, testing my changes, and minimizing the time between implementing a change and actually using it are all fundamental practices for software development - and no doubt for server management as well. If I'd "just done it right" in the first place, there simply wouldn't have been an outage*.
The last mistake - taking too long to ask support for help - is perhaps the most valuable lesson for me personally. Sure, I may be able to try out several dozen potential solutions in the time it takes to write a message clearly articulating my problem - but there are a lot of problems that others simply know more about than I do, and taking the time to ask may well save me a lot of time and headache (especially if I'm stuck chasing the wrong problem).
Beyond the CAA issues, there was also the problem that my previous post, despite being authored and uploaded in time for its release on December 27th, wasn't available until after I'd sorted out the certificate troubles. There were two important reasons why this happened:

1. The smoke tests that run before each deployment failed (because of the certificate issues), so the new post was never actually published.
2. I never received a notification about that failure, so I assumed the deployment had gone through as planned.
The lesson here is a simple one: when it comes to automated processes, one should avoid a "no news is good news" approach. To that end, I've since adjusted my deployment notifications so that I also receive success messages - and all that remains is to develop the habit of watching for those confirmations (so that, should the automated messages fail unexpectedly again, their absence will make me curious enough to check things out manually).
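To give a rough sense of what that adjustment looks like (this is a sketch with a made-up deploy script and webhook URL, not my actual pipeline), the key change is reporting both outcomes rather than only failures:

```python
#!/usr/bin/env python3
"""Sketch of a deploy wrapper that reports both success and failure.

Everything here is illustrative: the deploy command, the webhook URL,
and the message format are stand-ins, not my real setup.
"""
import json
import subprocess
import urllib.request

WEBHOOK_URL = "https://notify.example/deploy-hook"  # hypothetical endpoint


def notify(message: str) -> None:
    """POST a short status message to the notification webhook."""
    body = json.dumps({"text": message}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request, timeout=10)


def main() -> None:
    # Run the (hypothetical) deploy script and report however it turns out.
    result = subprocess.run(["./deploy.sh"])
    if result.returncode == 0:
        notify("Deploy succeeded.")  # the new part: good news is reported too
    else:
        notify(f"Deploy FAILED (exit code {result.returncode}).")


if __name__ == "__main__":
    main()
```

With the success message in place, silence itself becomes a signal worth investigating.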
One thing that's worth mentioning is that the first "problem" is actually not an issue at all. The smoke tests exist to catch failures early and prevent them from being silently deployed to the site. In the event of certificate issues, I want the smoke tests to fail: the certificate's state is just as important as the code's stability. I even have a process in place that allows me to manually override such failures (one which I could have used for that deployment - if only I had been aware that it had failed). So, while the smoke tests are technically the reason the post didn't reach you on time, I have no intention of changing them.
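To make "the certificate's state is just as important as the code's stability" a bit more concrete, here's a minimal sketch of the kind of certificate check a smoke test might perform - the hostname and threshold are placeholders, and my own tests don't necessarily look like this:

```python
import socket
import ssl
import time


def check_certificate(host: str, port: int = 443, min_days_left: int = 14) -> None:
    """Fail if the TLS handshake fails or the certificate is close to expiring."""
    context = ssl.create_default_context()  # verifies the chain and the hostname
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    days_left = (expires_at - time.time()) / 86400
    if days_left < min_days_left:
        raise RuntimeError(f"Certificate for {host} expires in {days_left:.0f} days")


if __name__ == "__main__":
    check_certificate("example.com")  # placeholder hostname
```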
This wraps up my retrospective on this site's December 2021 outage. At the heart of everything was a configuration error, but a combination of failing to test things properly up front, unfortunate timing for the outage, and delays in asking for assistance meant that a problem that could have been wrapped up in a day or less stretched on for almost two weeks.