Things Fall Apart, Datacenter Edition

The pursuit of 100% uptime by Operations staff has always struck me as something more than just a job: it’s a battle against the relentless forces of nature. Everything ultimately breaks down (systems, buildings, even people), and attempting to maintain 100% availability is the Ops equivalent of trying to cheat death. Sooner or later, despite our best efforts, our number will be up. Most recently in the news, self-proclaimed World’s Finest Data Center operator 365 Main suffered an approximately 45-minute power outage at their San Francisco facility. Much to their credit, and unlike most of their competitors, 365 Main has been extremely open about their investigation. I’ll examine this a bit today, as it’s a rare public glimpse into what goes on inside a large data center facility.

One afternoon, a transformer owned by the supplying power utility failed and caused a power surge. Normally, a power problem like this should trigger an automated transition from utility power to data center generated power. The company that makes 365 Main’s generators has a pretty cool animation showing how the transition happens. Unfortunately, this time when the utility power was interrupted, three of the generators failed due to a software bug, and 365 Main’s design could only survive two failures.

So, 365 Main screwed up big time? Not really. Their design was horribly flawed? Not so much. They have eight rooms full of servers, and ten generators: enough for every room to have one, plus two extra generators for contingencies. The UPS they use is a Diesel Rotary UPS: utility power makes a flywheel spin, the flywheel runs a generator, the generator supplies power to the computers. When utility power goes away, there’s a brief (to humans) pause while the diesel engine starts up to make power for the computers. As long as the diesel spins up before the flywheel spins down, power keeps flowing. Proponents of this design like to emphasize that it’s simple, and thus pretty reliable, and in its defense, no part of the Rotary system seemed to fail in this case. For completeness, the other kind of UPS uses lots and lots of batteries.
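To make the redundancy arithmetic concrete, here’s a rough sketch in Python. The room and generator counts come from the description above; the function name, and the simplification that each room needs exactly one working generator and any generator can serve any room, are my own assumptions for illustration, not details of 365 Main’s actual design.

```python
def rooms_without_power(rooms: int, generators: int, failed: int) -> int:
    """Return how many rooms end up without a working generator.

    Illustrative assumptions: every room needs exactly one generator,
    and any surviving generator can be assigned to any room.
    """
    if not 0 <= failed <= generators:
        raise ValueError("failed must be between 0 and the generator count")
    working = generators - failed
    return max(0, rooms - working)


ROOMS = 8        # from the post: eight rooms full of servers
GENERATORS = 10  # from the post: ten generators, i.e. two spares

for failed in range(4):
    dark = rooms_without_power(ROOMS, GENERATORS, failed)
    print(f"{failed} generator(s) failed -> {dark} room(s) without power")
```

Under these assumptions, zero, one, or two failures leave every room covered, and the third failure is the one that leaves a room dark.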

What did fail was the diesel engine’s controller. The electricity comes from an enormous diesel engine, so naturally there’s quite a bit of support equipment keeping it running, all of which had to be checked. During the investigation, issues with other systems (exhaust) were uncovered. Ultimately, though, they found a reproducible flaw in a critical part of a system that should never fail. Yet it did fail in 30% of the cases, and that was enough to bring down all sorts of different products, not to mention causing a highly public issue for 365 Main.
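For a sense of why a 30% failure rate swamps a two-spare design, here’s a back-of-the-envelope sketch. It rests on two loud assumptions of mine: that 30% can be read as the chance any one generator fails to start, and that generators fail independently of each other. The second is probably too generous, since a shared software bug is exactly the kind of fault that hits several units at once. The function name is hypothetical.

```python
from math import comb


def prob_too_many_failures(units: int, p_fail: float, tolerated: int) -> float:
    """Probability that more than `tolerated` of `units` fail.

    Assumes (for illustration only) that each unit fails independently
    with probability `p_fail`, i.e. a plain binomial model.
    """
    p_survivable = sum(
        comb(units, k) * p_fail**k * (1 - p_fail) ** (units - k)
        for k in range(tolerated + 1)
    )
    return 1 - p_survivable


# 10 generators, design tolerates 2 failures, assumed 30% per-unit failure rate
risk = prob_too_many_failures(units=10, p_fail=0.30, tolerated=2)
print(f"Chance of 3 or more failures: {risk:.2f}")  # roughly 0.62
```

Even with the optimistic independence assumption, that works out to worse than a coin flip, which is why a flaw like this matters so much more than the two spare generators would suggest.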

I’m not affiliated in any way with 365 Main, but I do have to say I’m impressed with how they have handled the aftermath of the incident. They were open and honest about what happened and provided lots of public information about their investigation, even to non-customers like me. Heck, they even ignored the idiotic rumors that Valleywag was tricked into posting as “news.” Nicely done, guys.

One Response to “Things Fall Apart, Datacenter Edition”

  1. Michaela Says:

    Nice explanation sir. Thanks. And yes, we do believe that operations guys can even cheat death :)
