How NOT to Inform Your Customers of an Outage

Monday, December 8th, 2008

There are a number of different ways to inform your customers of an outage. I’ve previously discussed how 365main and Amazon Web Services did this fairly well. Unfortunately, Limelight Networks customers are hearing about issues with their CDN via GigaOM.

(more…)

Complexity and the 4 a.m. test

Sunday, September 14th, 2008

With most technology, it’s a given that there’s almost always More Than One Way To Do It (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some try to minimize “cost,” either up-front development cost or long-term engineering cost. The smarter ones have recognized the concept of “Technology Debt” as addressed by several observers. As a leader in Operations, however, I tend to subscribe to my own rule: the 4 a.m. rule.

(more…)

The Art of the Post-Mortem

Saturday, July 26th, 2008

I’ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work, and at almost every other large installation I’ve seen or been part of, the learnings from these inquisitions are shared for educational reasons. The name for this differs from company to company: some call it an RFO (reason for outage) or an After-Action Report, but for whatever reason the name for this at AOL is a Post-Mortem.

(more…)