
How To Own Your Mistakes

Today was a very troubling and frustrating day for both me and one of my best clients. This is my declaration of ownership for my own failure to keep today from happening. The short story is that right after I declared the "make the site more stable" milestone complete and shipped out an invoice, the site spent its most unstable day ever being frantically put on stilts and duct-taped to the wall by me. For the long version, read on.

I had already spent roughly a week and a half working on an impromptu milestone in the project to increase the reliability and stability of the site, as well as being greenlit to apply hours to better build, test, and deployment processes. This is a good thing and it still stands as such. Now, the site wasn't fragile before, but a couple of incidents understandably gave concern about long-term quality. We had a few instances of corrupt MySQL logs, ran out of space on our EBS volume, and, embarrassingly, I've had occasion to deploy code and find bugs, even a broken page, despite testing locally and trying to be careful. The choice to spend time specifically on a better foundation was a good one.

This isn't about that time I spent, but another post may be.

Thursday we flipped the switch to the new system: running all new instances on EC2, migrated to PostgreSQL, and using a whole new deployment process that includes spawning a new "staging" instance that clones our production web server and lets us test new versions before rolling them out to the public. Everything looked good, I spent some time correcting a couple of hiccups, and at the end of the day, when things had been running and seemed stable and golden, I declared the milestone complete (and in this arrangement, that means invoicing for a payment, so it's not just an ego issue).

I woke up the next morning to find the site had been down for a few hours. It was unavailable about a dozen times throughout the rest of the day, and I clocked about 7.5 hours today getting everything in line. It has been running for longer than that now, without problem, and we seem to be in the clear.

Situations like this require us to look inward and ask what we could have done differently to avoid the escalation of a problem into a crisis, and I've spent much of today, while working on the issues and afterwards, trying to understand this. Much of what I can do now is speculation. While there are many things I could have or should have done, there are few of them that I can know for a certainty would have been "the" things to make a difference.

Priorities are the one area where I am confident a change would have avoided what happened today. A service should not run without thorough watchdogs. Websites should be tested against realistic traffic. I can test my code and comment it well, but the upfront work needs to be in place to ensure that my new code is actually servicing requests.

Can you always make these claims?
  • Our site's resources are tested automatically and report broken pages and other issues to us
  • We can test our production environment before it is actually production for new code
  • If something goes wrong, our server processes are restarted and we are informed, before the users know and even if they never know
I know, from now on, I will.
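
As a rough illustration of the first claim, here is a minimal sketch of an automated page check in Python. It is not the system I actually run: the function names, the URLs, and the rule that anything other than a 200 counts as broken are all assumptions made for the example.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response came back."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404 or 500.
        return err.code
    except OSError:
        # Covers URLError, refused connections, DNS failures, and timeouts.
        return None


def find_broken_pages(urls, fetch=fetch_status):
    """Return (url, status) pairs for every page that did not answer 200."""
    broken = []
    for url in urls:
        status = fetch(url)
        if status != 200:
            broken.append((url, status))
    return broken


if __name__ == "__main__":
    # Hypothetical page list; a real check would cover the site's own pages.
    pages = ["https://example.com/", "https://example.com/about"]
    for url, status in find_broken_pages(pages):
        print(f"BROKEN: {url} (status: {status})")
```

Run on a schedule and wired up to email or some other alert, something like this would report a broken page before a user does. The `fetch` parameter is there so the check itself can be tested without touching the network.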
