Skip to main content

How To Own Your Mistakes

Today was a very troubling and frustrating day for both myself and one of my best clients. This is my declaration of ownership for the my own failure to make today not happen. The short story is right after declaring the "make the site more stable" milestone complete and shipping out an invoice, the site spent its most unstable day ever being frantically put on stilts and duct taped to the wall by myself. For the long version, read on.

I had already spent roughly a week and a half working on an impromptu milestone in the project to increase the reliability and stability of the site, as well as beinggreenlit to apply hours to better build, test, and deployment processes. This is a good thing and it still stands as such. Now, the site wasn't fragile before, but a couple incidences understandably gave concern about long term quality. We had a few instances of corrupt MySQL logs, ran out of space on ourEBS volume, and embarrassingly I've had occasion to deploy code and find bugs, even a broken page, even testing locally and trying to be careful. The choice to spend time specifically on a better foundation was a good one.

This isn't about that time I spent, but another post may be.

Thursday we flipped the switch to the new system, running all new instances on EC2, migrated to Postresql, and with a whole new deployment process that includes spawning a new "staging" instance that clones our production web server and lets us test new versions before rolling it out to the public. Everything looked good, I spent some time correcting a couple hiccups, and at the end of the day when things had been running and seemed stable and golden, I declared the milestone complete (and in this arrangement, that means invoicing for a payment, so its not just an ego issue).

I woke up the next morning to find the site had been down for a few hours. It was unavailable about a dozen times throughout the rest of the day, and I clocked about 7.5 hours today getting everything in line. It has been running for longer than that now, without problem, and we seem to be in the clear.

Situations like this require us to look inward and ask what we could have done differently to avoid the escalation of a problem into a crisis, and I've spent much of today, while working on the issues and afterwards, trying to understand this. Much of what I can do now is speculation. While there are many things I could have or should have done, there are few of them that I can know for a certainty would have been "the" things to make a difference.

Priorities are one area I can be confident in believing able to avoid what happened today. A service should not run without thorough watchdogs. Websites should be given realistic traffic test exposures. I can test my code and comment it well, but the upfront work needs to be in place to ensure that my new code is actually servicing requests.

Can you always make these claims?
  • Our site's resources are tested automatically and report broken pages and other issues to us
  • We can test our production environment before it is actually production for new code
  • If something goes wrong, our server processes are restarted and we are informed, before the users know and even if they never know
I know, from now on, I will.

Comments

Popular posts from this blog

CARDIAC: The Cardboard Computer

I am just so excited about this. CARDIAC. The Cardboard Computer. How cool is that? This piece of history is amazing and better than that: it is extremely accessible. This fantastic design was built in 1969 by David Hagelbarger at Bell Labs to explain what computers were to those who would otherwise have no exposure to them. Miraculously, the CARDIAC (CARDboard Interactive Aid to Computation) was able to actually function as a slow and rudimentary computer.  One of the most fascinating aspects of this gem is that at the time of its publication the scope it was able to demonstrate was actually useful in explaining what a computer was. Could you imagine trying to explain computers today with anything close to the CARDIAC? It had 100 memory locations and only ten instructions. The memory held signed 3-digit numbers (-999 through 999) and instructions could be encoded such that the first digit was the instruction and the second two digits were the address of memory to operat...

Statement Functions

At a small suggestion in #python, I wrote up a simple module that allows the use of many python statements in places requiring statements. This post serves as the announcement and documentation. You can find the release here . The pattern is the statement's keyword appended with a single underscore, so the first, of course, is print_. The example writes 'some+text' to an IOString for a URL query string. This mostly follows what it seems the print function will be in py3k. print_("some", "text", outfile=query_iostring, sep="+", end="") An obvious second choice was to wrap if statements. They take a condition value, and expect a truth value or callback an an optional else value or callback. Values and callbacks are named if_true, cb_true, if_false, and cb_false. if_(raw_input("Continue?")=="Y", cb_true=play_game, cb_false=quit) Of course, often your else might be an error case, so raising an exception could be useful...

Announcing Feet, a Python Runner

I've been working on a problem that's bugged me for about as long as I've used Python and I want to announce my stab at a solution, finally! I've been working on the problem of "How do i get this little thing I made to my friend so they can try it out?" Python is great. Python is especially a great language to get started in, when you don't know a lot about software development, and probably don't even know a lot about computers in general. Yes, Python has a lot of options for tackling some of these distribution problems for games and apps. Py2EXE was an early option, PyInstaller is very popular now, and PyOxide is an interesting recent entry. These can be great options, but they didn't fit the kind of use case and experience that made sense to me. I'd never really been about to put my finger on it, until earlier this year: Python needs LÖVE . LÖVE, also known as "Love 2D", is a game engine that makes it super easy to build ...