I’ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work, and at almost every other large installation I’ve seen or been part of, the learnings from these inquisitions are shared for educational reasons. The name for this differs from company to company: some call it an RFO (reason for outage) or an After-Action Report, but for whatever reason the name for this at AOL is a Post-Mortem.
In general, these sorts of documents contain all of the super-secret (or just embarrassing) details that make up daily life in Operations. They’re almost never distributed very far — even large service providers (say, Verizon) tend to have a sanitized version they give their customers. Interestingly, however, a sanitized but pretty juicy example emerged from Amazon in response to their recent S3 outage.
Here’s a breakdown by phase. First comes the “detection” phase: someone, likely in a Network Operations Center (since this is Sunday morning), starts seeing big red lights. Detection is all about finding out that something is wrong, determining how serious it is, and deciding who needs to fix it.
At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.
At this point, it’s pretty clear that they had a major system event going on. I’d imagine cell phones or pagers (depending on how retro they are out in Seattle) were ruining Sunday morning all over Washington state. The next phase is “investigation” — basically, determining the proximate cause of the problem.
At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer’s request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn’t able to successfully process many customer requests.
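To make the gossip idea a little more concrete, here’s a toy simulation in Python. It is only a sketch under my own assumptions, not Amazon’s protocol (the message above doesn’t describe it in that much detail): the function name `gossip_rounds`, the "tell one random peer per round" rule, and the single piece of state are all made up for illustration.

```python
import random


def gossip_rounds(num_servers: int, seed: int = 0) -> int:
    """Count rounds until a rumor started at server 0 reaches every server.

    Toy push gossip (hypothetical, not S3's real protocol): each round,
    every server that already knows the rumor tells one peer chosen
    uniformly at random. A server may pick itself or a peer that already
    knows, which simply wastes that message.
    """
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < num_servers:
        # Snapshot the informed set so servers told during this round
        # only start gossiping in the next round.
        for _ in list(informed):
            informed.add(rng.randrange(num_servers))
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "servers:", gossip_rounds(n), "rounds")
```

The informed set can at most double each round, so 1,000 servers need at least 10 rounds, and in practice not many more. That speed is the whole appeal of gossip, and it cuts both ways: a piece of incorrect state spreads through the fleet just as quickly as a correct one.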
I notice that the times moved from 5-minute rounding to 1-minute rounding. You get that level of detail from log analysis, and from the sort of really clever network and system monitoring technology that used to be the domain of really big players with lots of money. So, we’re an hour into a major outage and it’s likely that this has been escalated both technically (to the most senior engineers who know the system) and to the management of the business that owns the system (Amazon Web Services).
At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system’s state, and then reactivate the request processing components.
Okay, almost another hour is gone and I’d imagine all the “easy” options are exhausted. Now, they’re trying more high-impact solutions. This one sounds suspiciously like “bounce the <insert process here> and see if it comes up clean,” which is one of those embarrassing-but-effective solutions you end up using when you just don’t know what else to do sometimes.
By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system’s state cleared. By 2:20pm PDT, we’d restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.
At 2:57pm PDT, Amazon S3’s EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3’s US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.
So, a full bounce of that subsystem took almost 4 hours to show results. You can imagine those were 4 pretty tense hours. Some companies use a conference bridge to manage big incidents; others use web chats or VoIP systems. I’m sure a bunch of people were working very hard to move this along quickly, and it still took quite a while, but you can almost imagine the relief that flooded the whole team at 2:57pm when EU came back up. By around 5pm, the whole system was back up and normal, and there wasn’t much left to do except the paperwork.
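For anyone who wants to check the arithmetic, here’s a small Python sketch that computes the gaps between the timestamps quoted above. The event labels and the `minutes_between` helper are my own; the times are taken straight from Amazon’s message and are assumed to fall on the same day.

```python
from datetime import datetime

# Timestamps as quoted in Amazon's message (all PDT, same Sunday).
# The labels are my own shorthand, not Amazon's wording.
TIMELINE = [
    ("alarms go off", "8:40am"),
    ("gossip problem identified", "9:41am"),
    ("full restart decided", "10:32am"),
    ("system state cleared", "11:05am"),
    ("EU completing requests", "2:57pm"),
    ("US completing requests", "4:02pm"),
    ("US back to normal", "4:58pm"),
]


def minutes_between(start: str, end: str) -> int:
    """Minutes from start to end, both like '8:40am', on the same day."""
    fmt = "%I:%M%p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    first = TIMELINE[0][1]
    for label, stamp in TIMELINE:
        print(f"{stamp:>7}  +{minutes_between(first, stamp):3d} min  {label}")
```

From the state being cleared at 11:05am to the EU completing requests at 2:57pm is 232 minutes (3 hours 52 minutes), and from the first alarm to the US returning to normal is 498 minutes (8 hours 18 minutes).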
Which brings us to the last part of their message:
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
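Here’s a minimal Python sketch of why a single flipped bit can slip through, and how a checksum catches it. Everything about the message format is my assumption (a JSON payload with an MD5 hex digest on the first line), and `seal`, `open_sealed`, and `flip_bit` are hypothetical helpers; Amazon’s message doesn’t say what their state messages actually look like. I’ve used MD5 only because it’s the checksum they mention.

```python
import hashlib
import json
from typing import Optional


def seal(state: dict) -> bytes:
    """Serialize state and prefix it with an MD5 hex digest (assumed format)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def open_sealed(message: bytes) -> Optional[dict]:
    """Return the state, or None if the checksum does not match."""
    digest, _, payload = message.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        return None  # corrupted: reject (and, in real life, log it)
    return json.loads(payload)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    state = {"failed_peers": 2, "server": "node-a"}  # made-up state

    # Without a checksum: flip the lowest bit of the ASCII digit "2"
    # (0x32), which turns it into "3" (0x33). The message still parses.
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    bad_payload = flip_bit(payload, payload.index(b"2") * 8)
    print(json.loads(bad_payload))  # intelligible, but wrong

    # With a checksum: the same kind of flip is detected and rejected.
    sealed = seal(state)
    print(open_sealed(sealed))
    print(open_sealed(flip_bit(sealed, len(sealed) * 8 - 20)))
```

The unprotected message comes back as perfectly valid JSON claiming 3 failed peers instead of 2, which is the "still intelligible, but incorrect" case described above. The sealed version returns `None` for the corrupted copy, so the receiver can drop it instead of passing it along.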
You can be sure that as soon as they were on a path to get the system stable, they were investigating how it got unstable in the first place. This is a pretty in-depth problem statement, and it’s impressive how transparent Amazon is about it. I doubt I’d ever be allowed to report to the public on one of my outages like this on my own, and I suspect Amazon is no different: at some point during that investigation, Amazon’s PR people started working on wording for public announcements. At the same time, technical teams were working to figure out how to make sure it never happens again.
During our post-mortem analysis we’ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we’re taking: (a) we’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we’ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we’ve added additional monitoring and alarming of gossip rates and failures; and, (d) we’re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
This is fascinating. Certainly, (a) is an obvious concern: I know I’d be pulling out my hair if a key system I managed took 4 hours to restart. I’d speculate that this was simply a contingency nobody had thought about yet, and thus was probably a pretty painfully manual process. Amazon (and all the large operators) rely on quite a bit of automation to manage large fleets of servers, but when the unexpected happens, it doesn’t always work as planned. Item (b) looks like it’s addressing the root cause, and item (d) may also play into this. Item (c) is that perennial favorite of ops: monitoring and alarming. These are all just the sort of things I’d expect to see if I were in that situation.
In summary, very interesting data about an increasingly important service on the internet. At the same time, a rare view into what goes on in the dark where SysAdmins and Network Engineers roam. Incidentally, there are technical details about the system that failed, which is a technology Amazon calls Dynamo.