I had already spent roughly a week and a half on an impromptu milestone in the project to increase the reliability and stability of the site, and had been greenlit to apply hours to better build, test, and deployment processes. This was a good thing, and I still think so. The site wasn't fragile before, but a couple of incidents understandably raised concern about long-term quality. We had a few instances of corrupt MySQL logs, we ran out of space on our EBS volume, and, embarrassingly, I had on occasion deployed code and found bugs, even a broken page, despite testing locally and trying to be careful. The choice to spend time specifically on a better foundation was a good one.
This isn't about that time I spent, but another post may be.
Thursday we flipped the switch to the new system: all new instances on EC2, a migration to PostgreSQL, and a whole new deployment process that includes spawning a new "staging" instance that clones our production web server and lets us test new versions before rolling them out to the public. Everything looked good. I spent some time correcting a couple of hiccups, and at the end of the day, when things had been running and seemed stable and golden, I declared the milestone complete (and in this arrangement, that means invoicing for a payment, so it's not just an ego issue).
I woke up the next morning to find the site had been down for a few hours. It was unavailable about a dozen times throughout the rest of the day, and I clocked about 7.5 hours today getting everything in line. It has been running for longer than that now, without problem, and we seem to be in the clear.
Situations like this require us to look inward and ask what we could have done differently to keep a problem from escalating into a crisis, and I've spent much of today, both while working on the issues and afterwards, trying to understand this. Much of what I can do now is speculate. While there are many things I could have or should have done, there are few that I can say with certainty would have been "the" things to make a difference.
Priorities are the one area where I am confident a change could have avoided what happened today. A service should not run without thorough watchdogs. Websites should be exposed to realistic traffic tests. I can test my code and comment it well, but the upfront work needs to be in place to ensure that my new code is actually servicing requests.
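To make "watchdog" concrete, here is a minimal sketch in Python. It is not what we actually run; the names `watch`, `is_healthy`, `restart`, and `notify` are all hypothetical, and the real health check, restart command, and alerting channel would be plugged in by whoever uses it.

```python
import time
from typing import Callable


def watch(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    notify: Callable[[str], None],
    checks: int,
    interval: float = 0.0,
) -> int:
    """Run `checks` health checks, restarting and notifying on each failure.

    All three callables are placeholders (assumptions for this sketch).
    Returns the number of restarts that were triggered.
    """
    restarts = 0
    for _ in range(checks):
        if not is_healthy():
            # Bring the service back first, then tell a human about it.
            restart()
            notify("service was unhealthy and has been restarted")
            restarts += 1
        time.sleep(interval)
    return restarts
```

In practice `is_healthy` might request a known page and `restart` might shell out to the process manager; the point is only that the loop exists and tells someone when it has had to act.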
Can you always make these claims?
- Our site's resources are tested automatically and report broken pages and other issues to us
- We can test our production environment before it is actually production for new code
- If something goes wrong, our server processes are restarted and we are informed, before the users know and even if they never know
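The first two claims can be approached with something as small as the following Python sketch, which requests a list of pages and reports the ones that fail. It is an illustration under assumptions, not our actual tooling: the function names and the example.com URLs are made up, and it uses only the standard library's `urllib`.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for `url`, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return 0


def find_broken(
    urls: Iterable[str],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, int]:
    """Map each URL that did not answer with a 2xx/3xx status to its status."""
    broken = {}
    for url in urls:
        status = fetch(url)
        if not 200 <= status < 400:
            broken[url] = status
    return broken


if __name__ == "__main__":
    # Hypothetical staging URLs; substitute the pages that matter to you.
    pages = ["https://staging.example.com/", "https://staging.example.com/about"]
    for url, status in find_broken(pages).items():
        print(f"BROKEN {status}: {url}")
```

Pointed at the staging instance before a release and at production on a schedule, a check like this would at least catch a page that has stopped rendering, though it says nothing about whether the content on the page is correct.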