<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>JacobRosenberg.net</title>
	<atom:link href="http://www.jacobrosenberg.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.jacobrosenberg.net</link>
	<description>Technology making the world better. Except when it doesn&#039;t.</description>
	<lastBuildDate>Tue, 27 Mar 2012 22:14:54 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>The Google IO Games</title>
		<link>http://www.jacobrosenberg.net/2012/03/27/the-google-io-games/</link>
		<comments>http://www.jacobrosenberg.net/2012/03/27/the-google-io-games/#comments</comments>
		<pubDate>Tue, 27 Mar 2012 22:13:57 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=64</guid>
		<description><![CDATA[Last year, Google&#8217;s IO conference sold out in under an hour of painful and awkward page refreshes against a clunky Cold Fusion-based system. This year, they increased the price and brought the system in-house. The event sold out in minutes, &#8230; <a href="http://www.jacobrosenberg.net/2012/03/27/the-google-io-games/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Last year, Google&#8217;s IO conference sold out in under an hour of painful and awkward page refreshes against a clunky Cold Fusion-based system. This year, they increased the price and brought the system in-house. The event sold out in minutes, with many insisting they never had a chance.</p>
<p>In the spirit of the recently released movie adaptation of the popular book <span style="text-decoration: underline;">The Hunger Games</span>, I have a suggestion for how Google can structure registration next year.</p>
<p>There will be an open registration period (perhaps a week) where developers can put their names into a drawing. If Google wishes to provide multiple entries for those who have previously attended, have contributed to projects, or are otherwise important to them, they can do so.</p>
<p>At a given date and time, Google will conduct the Google IO Reaping. A certain number of names will be drawn and receive email notification that they have an opportunity to purchase a ticket&#8211;I&#8217;d suggest they have 24-48 hours, so there&#8217;s no rush. Everyone who receives an invitation from the reaping gets a ticket if they pay for one.</p>
<p>If seats don&#8217;t get sold as tickets, they go back into the pool and are re-reaped. The goal is that only the people who signed up for the reaping can be named as Google Tributes. Of course, people sometimes will need to cancel, and we&#8217;ll put those seats back into the pool to be re-reaped, and if someone else takes the seat the original purchaser gets a refund.</p>
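<p>For the curious, here is a rough sketch of how the drawing and re-reaping could work. It is only an illustration under my own assumptions: the function names (<code>reap</code>, <code>run_registration</code>) and the <code>will_pay</code> callback are hypothetical, extra entries are modeled as simple integer weights, and cancellations and refunds are left out.</p>

```python
import random


def reap(entries, seats, rng=None):
    """Draw up to `seats` distinct names; `entries` maps name -> number of entries."""
    rng = rng or random.Random()
    pool = dict(entries)  # copy so the caller's registrations are untouched
    winners = []
    while pool and len(winners) < seats:
        names = list(pool)
        weights = [pool[name] for name in names]
        # More entries means a proportionally better chance in this draw.
        pick = rng.choices(names, weights=weights, k=1)[0]
        winners.append(pick)
        del pool[pick]  # each person can be drawn at most once
    return winners, pool


def run_registration(entries, seats, will_pay, rng=None):
    """Repeat the reaping until every seat is sold or the pool is empty."""
    ticket_holders = []
    pool = dict(entries)
    while pool and len(ticket_holders) < seats:
        open_seats = seats - len(ticket_holders)
        invited, pool = reap(pool, open_seats, rng)
        # Unsold seats stay open and are re-reaped on the next pass.
        ticket_holders.extend(name for name in invited if will_pay(name))
    return ticket_holders
```

<p>Because nobody is drawn twice, the loop always ends: either the seats run out or the pool does.</p>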
<p>This process avoids a huge sign-up rush. It also gives everyone in every time zone an equal chance to make it in. And, finally, it restores an opportunity for Google to give previous participants an edge while keeping things fair.</p>
<p>What do you think, Google? May the odds be EVER in IO&#8217;s favor?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2012/03/27/the-google-io-games/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Content Delivery Summit &#8211; CDN and Cloud Convergence</title>
		<link>http://www.jacobrosenberg.net/2011/06/05/content-delivery-summit-cdn-and-cloud-convergence/</link>
		<comments>http://www.jacobrosenberg.net/2011/06/05/content-delivery-summit-cdn-and-cloud-convergence/#comments</comments>
		<pubDate>Sun, 05 Jun 2011 23:44:59 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[CDN]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=60</guid>
		<description><![CDATA[I moderated a panel at the Content Delivery Summit (affiliated with the Streaming Media East event in New York last month) and they were nice enough to record the video. Very cool stuff! The subject was about the convergence between &#8230; <a href="http://www.jacobrosenberg.net/2011/06/05/content-delivery-summit-cdn-and-cloud-convergence/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I moderated a panel at the Content Delivery Summit (affiliated with the Streaming Media East event in New York last month) and they were nice enough to record the <a href="http://www.streamingmedia.com/Articles/Editorial/Featured-Articles/CDNs-and-Cloud-Computing-Converging-But-Still-Distinct-75389.aspx">video</a>. Very cool stuff!</p>
<p>The subject was the convergence between CDN services and Cloud Computing services. Enjoy!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2011/06/05/content-delivery-summit-cdn-and-cloud-convergence/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>AOL Cloud wins Uptime Institute Green IT Innovation Award</title>
		<link>http://www.jacobrosenberg.net/2011/06/01/aol-cloud-wins-uptime-institute-green-it-innovation-award/</link>
		<comments>http://www.jacobrosenberg.net/2011/06/01/aol-cloud-wins-uptime-institute-green-it-innovation-award/#comments</comments>
		<pubDate>Wed, 01 Jun 2011 23:22:51 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=58</guid>
		<description><![CDATA[It isn&#8217;t every day that a project of your conception takes off and gains enough traction to change things for everyone at your company. It&#8217;s even less common that such a project gets recognized by an outside body when it&#8217;s &#8230; <a href="http://www.jacobrosenberg.net/2011/06/01/aol-cloud-wins-uptime-institute-green-it-innovation-award/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>It isn&#8217;t every day that a project of your conception takes off and gains enough traction to change things for everyone at your company. It&#8217;s even less common that such a project gets recognized by an outside body when it&#8217;s an infrastructure effort, which companies like AOL don&#8217;t discuss that openly. The combination of these two is why I&#8217;m especially pleased to share that the AOL Cloud project, which I&#8217;d been working to make a reality since 2007, was recognized by the <a href="http://symposium.uptimeinstitute.com/geit-awards/1213-2011-green-enterprise-it-award-winners">Uptime Institute at their 2011 Green Enterprise IT awards</a>. We were recognized in the &#8220;IT Innovation&#8221; category on May 11th at the Uptime Symposium. Aaron Lake and I provided a <a href="http://symposium.uptimeinstitute.com/images/stories/symposium_2011_files/2011presentations_public/GEIT_AOL_p.pdf">presentation of our effort</a> &#8211; Will Stevens couldn&#8217;t join us, as he had left AOL. </p>
<p>No effort of this nature happens as an individual effort, and this was no exception. Given the challenging circumstances at AOL in 2010, it was especially exciting to see people band together to work on such an exciting project. I&#8217;m glad we were able to embrace some of the best of current technology at AOL, and make it ours. I&#8217;m looking forward to seeing our participation in <a href="http://corp.aol.com/2011/05/19/aol-announces-participation-in-world-ipv6-day/">World IPv6 Day</a> as well. </p>
<p>Go AOL!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2011/06/01/aol-cloud-wins-uptime-institute-green-it-innovation-award/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>CDN World Summit</title>
		<link>http://www.jacobrosenberg.net/2010/09/26/cdn-world-summit/</link>
		<comments>http://www.jacobrosenberg.net/2010/09/26/cdn-world-summit/#comments</comments>
		<pubDate>Sun, 26 Sep 2010 21:58:53 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[CDN]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Internet]]></category>
		<category><![CDATA[Technology]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=53</guid>
		<description><![CDATA[I&#8217;ll be speaking on Tuesday at the CDN World Summit in London. If you&#8217;re attending, feel free to look me up while I&#8217;m there &#8211; the event is somewhat more vendor-focused than content-provider focused.]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ll be speaking on Tuesday at the <a href="http://www.cdnworldsummit.com">CDN World Summit</a> in London. If you&#8217;re attending, feel free to look me up while I&#8217;m there &#8211; the event is somewhat more vendor-focused than content-provider focused. <a href="http://www.jacobrosenberg.net/wp-content/uploads/2010/09/1660_CDN_2010_168x125.gif"><img src="http://www.jacobrosenberg.net/wp-content/uploads/2010/09/1660_CDN_2010_168x125.gif" alt="" title="1660_CDN_2010_168x125" width="168" height="125" class="alignnone size-full wp-image-54" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2010/09/26/cdn-world-summit/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MySQL DBA (Database Administrator) Opening at AOL</title>
		<link>http://www.jacobrosenberg.net/2009/08/31/mysql-dba-database-administrator-opening-at-aol/</link>
		<comments>http://www.jacobrosenberg.net/2009/08/31/mysql-dba-database-administrator-opening-at-aol/#comments</comments>
		<pubDate>Mon, 31 Aug 2009 12:20:38 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[aol]]></category>
		<category><![CDATA[Job]]></category>
		<category><![CDATA[administrator]]></category>
		<category><![CDATA[aim]]></category>
		<category><![CDATA[database]]></category>
		<category><![CDATA[dba]]></category>
		<category><![CDATA[dulles]]></category>
		<category><![CDATA[mysql]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=50</guid>
		<description><![CDATA[Full details: http://bit.ly/19Ea7I Contact: Carl.Coppadge at corp.aol.com or AIM: carlcoppadge11 Location: Dulles, VA MySQL Database Administrator AOL&#8217;s People Networks Operations, which operates AIM and ICQ, has an opening for a Sr. MySQL Database Administrator. This position would be responsible for managing &#8230; <a href="http://www.jacobrosenberg.net/2009/08/31/mysql-dba-database-administrator-opening-at-aol/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Full details: <a href="http://bit.ly/19Ea7I">http://bit.ly/19Ea7I</a></p>
<p>Contact: Carl.Coppadge at corp.aol.com or AIM: carlcoppadge11</p>
<p>Location: Dulles, VA</p>
<p>MySQL Database Administrator</p>
<p>AOL&#8217;s People Networks Operations, which operates AIM and ICQ, has an opening for a Sr. MySQL Database Administrator. This position would be responsible for managing large, highly scaled production MySQL databases as well as pre-production (QA/Dev/Staging) environments.</p>
<p>Key Responsibilities:</p>
<ul>
<li>All aspects of MySQL deployment, operation, and design to ensure high reliability and performance</li>
<li>Collaborate with developers and architects to create high-performance, cost-effective designs</li>
<li>Participate in an on-call rotation as part of a 24&#215;7 operations team to resolve urgent production issues</li>
<li>Ensuring data integrity with good process and a keen eye for detecting errors and misuses of data</li>
</ul>
<p>Desired Skills for this Position:</p>
<ul>
<li>Knowledge of database architecture concepts as well as MySQL-specific implementations</li>
<li>Significant experience with MySQL in a high-volume production environment</li>
<li>Experience with open source ETL packages and methods for large scale data migration</li>
<li>Diverse technical background with awareness of concepts in networking, Linux, and storage</li>
<li>Bachelor-level degree in Engineering, Computing, or Sciences or equivalent experience</li>
<li>Replication: managing replication delay, scaling replication throughput, and designing resilient systems</li>
<li>Configuration: tuning InnoDB, managing memory usage, and tuning file systems for maximum throughput</li>
<li>Reliability: backups under load, failover strategies, and recovery from replication issues</li>
<li>MySQL: familiarity with Percona and Google patches</li>
</ul>
<p>Our perfect candidate can manage many rapidly changing projects while maintaining professionalism and poise. Our environment is both highly exciting and highly demanding &#8212; those who thrive here adopt a &#8220;work smarter, not harder&#8221; attitude.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2009/08/31/mysql-dba-database-administrator-opening-at-aol/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How NOT to Inform Your Customers of an Outage</title>
		<link>http://www.jacobrosenberg.net/2008/12/08/how-not-to-inform-your-customers-of-an-outage/</link>
		<comments>http://www.jacobrosenberg.net/2008/12/08/how-not-to-inform-your-customers-of-an-outage/#comments</comments>
		<pubDate>Mon, 08 Dec 2008 05:02:47 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cdn]]></category>
		<category><![CDATA[operations]]></category>
		<category><![CDATA[outages]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=47</guid>
		<description><![CDATA[There are a number of different ways to inform your customers of an outage. I&#8217;ve previously discussed how 365main and Amazon Web Services did this fairly well in the past. Unfortunately, Limelight Networks customers are hearing about issues with their &#8230; <a href="http://www.jacobrosenberg.net/2008/12/08/how-not-to-inform-your-customers-of-an-outage/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>There are a number of different ways to inform your customers of an outage. I&#8217;ve previously discussed how <a href="http://www.jacobrosenberg.net/2007/08/02/things-fall-apart-datacenter-edition/">365main</a> and <a href="http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem">Amazon Web Services</a> did this fairly well in the past. Unfortunately, Limelight Networks customers are <a href="http://gigaom.com/2008/12/05/is-limelight-networks-down-in-asia-europe/">hearing about issues with their CDN via GigaOM</a>.</p>
<p><span id="more-47"></span></p>
<p>There&#8217;s much that can be said about how to do this right, but a few tips I use myself in these situations:</p>
<ul>
<li>Timely: tell your customers what you&#8217;re certain of as early as reasonable</li>
<li>Unbiased: don&#8217;t speculate or editorialize, just the facts</li>
<li>Clear: avoid using unnecessary jargon or technical terms</li>
<li>Regular: keep communicating until the issue is over</li>
</ul>
<div>Remember, this sort of issue happens from time to time with any service &#8212; concentrate on supporting your customers and your people. </div>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/12/08/how-not-to-inform-your-customers-of-an-outage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Complexity and the 4 a.m. test</title>
		<link>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/</link>
		<comments>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/#comments</comments>
		<pubDate>Mon, 15 Sep 2008 03:04:51 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[4am test]]></category>
		<category><![CDATA[complexity]]></category>
		<category><![CDATA[outages]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=42</guid>
		<description><![CDATA[  With most technology, it&#8217;s a given that there&#8217;s almost always More Than One Way To Do It (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some &#8230; <a href="http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p> </p>
<p>With most technology, it&#8217;s a given that there&#8217;s almost always <a href="http://www.perl.com/pub/a/1999/03/pm.html">More Than One Way To Do It</a> (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some try to minimize &#8220;cost,&#8221; either up-front development cost or long-term engineering cost. The smarter ones have recognized the concept of &#8220;Technology Debt&#8221; as addressed by <a href="http://onstartups.com/home/tabid/3339/bid/165/Development-Short-Cuts-Are-Not-Free-Understanding-Technology-Debt.aspx">several</a> <a href="http://www.dharmesh.com/Blog/bid/524/Understanding-Technology-Debt">observers</a>. As a leader in Operations, however, I tend to subscribe to my own rule: the 4 a.m. rule.</p>
<p><span id="more-42"></span>Simply put, the 4 a.m. rule is this: </p>
<blockquote>
<p style="text-align: left;"><strong>Never adopt any solution which you couldn&#8217;t understand immediately upon being awoken to fix it at 4 a.m.</strong> </p>
</blockquote>
<p>There&#8217;s a very simple reason to adhere to this rule whenever possible &#8212; as I&#8217;ve previously mentioned, things fall apart. Systems all break: complex ones and simple ones alike. Sooner or later, people need to fix them and the more byzantine the operation of the system, the harder it will be. </p>
<p>The simplest way possible to survive the 4 a.m. test is to only build very simple systems. A totally simple system is sometimes just the ticket to solve the problem, and where it is adequate, go with it. Interesting problems occasionally have extremely elegant solutions, and making them more complex is just bad mojo. </p>
<p>Still, you&#8217;ll much more often find a place where more complexity is necessary to achieve your desired goal. In these circumstances, it can be tricky to pass the 4 a.m. test. This is where two strategies are necessary: documentation and transparency.</p>
<p>Documentation deserves a whole separate discussion, but the part that&#8217;s important at 4 a.m. is a complete lack of subtlety: </p>
<ul>
<li>Recovery instructions: you&#8217;ll have bleary eyes, so these must be as simple as &#8220;if this, do that&#8221;</li>
<li>Architecture diagrams: simple pictures with bright colors and clearly labeled lines detailing what talks to what and why. And don&#8217;t make me load Visio at 4 a.m. ever. </li>
<li>If it&#8217;s needed, and can fail, it should be mentioned, but in as few words as necessary. This is not the time for flowery prose.</li>
</ul>
<p>Transparency is quite a bit harder. This is about exposing as much as possible to someone observing the system. A few places are crucial:</p>
<ul>
<li>Error messages: For the love of god people, make sure every message requires absolutely no prior knowledge and is clear and unambiguous even out of context. </li>
<li>Simple dependencies: Nothing is harder to discover than extremely complex webs of services. If you ever see a design with a recursive dependency, run like heck.</li>
<li>Change logging: the first question you should ask when something is broken is &#8220;what changed.&#8221; Keep a record of even the boring stuff &#8211; you never know when it&#8217;ll save your bacon.</li>
</ul>
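<p>To make the error-message and change-logging points concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the function names, the one-JSON-object-per-line log format, and the example service names are hypothetical, not taken from any particular tool.</p>

```python
import json
from datetime import datetime, timezone


def describe_failure(service, action, target, reason, next_step):
    """Build an error message that makes sense with no prior context."""
    return f"{service} could not {action} {target}: {reason}. Next step: {next_step}"


def record_change(log_path, who, what, why):
    """Append one change record as a single JSON line, boring stuff included."""
    entry = {
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "what": what,
        "why": why,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def recent_changes(log_path, limit=5):
    """Answer the 4 a.m. question: what changed most recently?"""
    with open(log_path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return entries[-limit:]
```

<p>A message built this way names the service, the thing it was trying to do, why it failed, and what to try next, so it still reads clearly when pasted out of context into a pager alert.</p>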
<p>Remember as a cardinal rule: </p>
<blockquote>
<p style="text-align: left;"><span> </span><strong>complexity is a vice: use it sparingly and explain it simply enough for 4 a.m.</strong></p>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>The Art of the Post-Mortem</title>
		<link>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/</link>
		<comments>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/#comments</comments>
		<pubDate>Sun, 27 Jul 2008 02:13:15 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Internet]]></category>
		<category><![CDATA[amazon s3]]></category>
		<category><![CDATA[operations]]></category>
		<category><![CDATA[outages]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=33</guid>
		<description><![CDATA[I&#8217;ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work &#8212; and almost every other &#8230; <a href="http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work &#8212; and at almost every other large installation I&#8217;ve seen or been part of &#8212; the learnings from these inquisitions are shared for educational reasons. The name for this differs from company to company: some call it an RFO (reason for outage) or an After-Action Report, but for whatever reason the name for this at AOL is a Post-Mortem.</p>
<p><span id="more-33"></span>In general, these sorts of documents contain all of the super-secret (or just embarrassing) details that make up daily life in Operations. They&#8217;re almost never distributed very far &#8212; even large service providers (say, Verizon) tend to have a sanitized version they give their customers. Interestingly, however, a <a href="http://status.aws.amazon.com/s3-20080720.html">sanitized but pretty juicy example</a> emerged from Amazon in response to their recent S3 outage. </p>
<p>Here&#8217;s a break-down by phase. This is the &#8220;detection&#8221; phase &#8212; someone, likely someone in a Network Operations Center (since this is Sunday morning) &#8212; starts seeing big red lights. Detection is all about finding out something is wrong, and defining how serious it is and who needs to fix it.</p>
<blockquote><p>At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.</p></blockquote>
<p>At this point, it&#8217;s pretty clear that they had a major system event going on. I&#8217;d imagine cell phones or pagers (depending on how retro they are out in Seattle) were ruining Sunday morning all over Washington state. The next phase is &#8220;investigation&#8221; &#8212; basically, determining the <strong>proximate cause</strong> of the problem.</p>
<blockquote><p>At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer&#8217;s request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn&#8217;t able to successfully process many customer requests.</p></blockquote>
<p>I notice that the times moved from 5-minute rounding to 1-minute rounding. You get that level of detail from log analysis, and from the sort of really clever network and system monitoring technology that used to be the domain of really big players with lots of money. So, we&#8217;re an hour into a major outage and it&#8217;s likely that this has been escalated both technically (to the most senior engineers who know the system) and to the management of the business that owns the system (Amazon Web Services). </p>
<blockquote><p>At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system&#8217;s state, and then reactivate the request processing components.</p></blockquote>
<p>Okay, almost another hour is gone and I&#8217;d imagine all the &#8220;easy&#8221; options are exhausted. Now, they&#8217;re trying more high-impact solutions. This one sounds suspiciously like &#8220;bounce the &lt;insert process here&gt; and see if it comes up clean,&#8221; which is one of those embarrassing-but-effective solutions you end up using when you just don&#8217;t know what else to do sometimes.</p>
<blockquote><p>By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system&#8217;s state cleared. By 2:20pm PDT, we&#8217;d restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.</p>
<p>At 2:57pm PDT, Amazon S3&#8217;s EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3&#8217;s US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.</p></blockquote>
<p>So, a full bounce of that subsystem took almost 4 hours to show results. You can imagine those were 4 pretty tense hours. Some companies use a conference bridge to manage big incidents, others use web chats or VoIP systems. I&#8217;m sure a bunch of people were all working very hard to move this along quickly, and it still took quite a while, but you can almost imagine the relief that flooded the whole team at 2:57pm when EU came back up. By around 5pm, the whole system was back up and normal, and there wasn&#8217;t much left to do except the paperwork.</p>
<p>Which brings us to the last part of their message:</p>
<blockquote><p>We&#8217;ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers&#8217; objects. However, we didn&#8217;t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn&#8217;t detect it and it spread throughout the system causing the symptoms described above. We hadn&#8217;t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.</p></blockquote>
<p>You can be sure that as soon as they were on a path to get the system stable, they were investigating how it got unstable in the first place. This is a pretty in-depth problem statement, and it&#8217;s impressive how transparent Amazon is about it. I doubt I&#8217;d ever be allowed to report to the public on any of my outages like this alone, and I suspect Amazon is no different: at some point during that investigation, Amazon&#8217;s PR people started working on wording for public announcements. At the same time, technical teams were working to figure out how to make sure it never happens again.</p>
<blockquote><p>During our post-mortem analysis we&#8217;ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we&#8217;re taking: (a) we&#8217;ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we&#8217;ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we&#8217;ve added additional monitoring and alarming of gossip rates and failures; and, (d) we&#8217;re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.</p></blockquote>
<p>This is fascinating. Certainly, (a) is an obvious concern &#8212; I know I&#8217;d be pulling out my hair if a key system I managed took 4 hours to restart. I&#8217;d speculate that this was simply a contingency nobody had thought about yet, and thus probably was a pretty painfully manual process. Amazon (and all the large operators) rely on quite a bit of automation to manage large fleets of servers, but when the unexpected happens, it doesn&#8217;t always work as planned. Item (b) looks like it&#8217;s addressing the root cause, and item (d) may also play into this. Item (c) is that perennial favorite of ops: monitoring and alarming. These are all just the sort of things I&#8217;d expect to see if I were in that situation.</p>
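<p>To illustrate what item (d) might look like, here is a small sketch of checksumming a state message so that a single flipped bit is detected and the message rejected. This is not Amazon&#8217;s wire format or code: the message layout (an MD5 hex digest, a newline, then a JSON payload) and the function names are assumptions of mine. MD5 is used only because the quoted report mentions it, and it catches accidental corruption rather than deliberate tampering.</p>

```python
import hashlib
import json


def pack_state(state):
    """Serialize a state dict and prefix it with an MD5 digest of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def unpack_state(message):
    """Return the state dict, or None if the checksum does not match."""
    digest, _, payload = message.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        return None  # corrupted: the caller should log it and drop it
    return json.loads(payload)


def flip_bit(message, byte_index, bit):
    """Simulate the failure: flip a single bit somewhere in the message."""
    corrupted = bytearray(message)
    corrupted[byte_index] ^= 1 << bit
    return bytes(corrupted)
```

<p>Without the digest check, a one-bit change inside a string value would still parse as valid JSON and would be passed along as if it were true, which matches the "still intelligible, but incorrect" behavior the report describes.</p>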
<p>In summary, very interesting data about an increasingly important service on the internet. At the same time, a rare view into what goes on in the dark where SysAdmins and Network Engineers roam. Incidentally, there are technical details about the system that failed, which is a technology Amazon calls <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Velocity and Structure08</title>
		<link>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/</link>
		<comments>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/#comments</comments>
		<pubDate>Sat, 21 Jun 2008 23:05:17 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Web Technology]]></category>
		<category><![CDATA[cdn]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[structure08]]></category>
		<category><![CDATA[velocity]]></category>
		<category><![CDATA[web performance]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=32</guid>
		<description><![CDATA[A whole lot of conferences are happening this week, and I&#8217;ll be attending two of them. On Monday and Tuesday of this week I&#8217;ll be attending O&#8217;Reilly&#8217;s Velocity conference, where I&#8217;ll be moderating a panel entitled &#8220;Everything You Ever Wanted &#8230; <a href="http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>A whole lot of conferences are happening this week, and I&#8217;ll be attending two of them. On Monday and Tuesday I&#8217;ll be at O&#8217;Reilly&#8217;s Velocity conference, where I&#8217;ll be moderating a panel entitled &#8220;<a href="http://en.oreilly.com/velocity2008/public/schedule/detail/2213">Everything You Ever Wanted to Know about CDNs (but were afraid to ask).</a>&#8221; I&#8217;m hoping that will be fun, but there ought to be a lot of <strong>other</strong> interesting people I&#8217;d like to see while there as well, including two other very smart folks from AOL (<a href="http://en.oreilly.com/velocity2008/public/schedule/speaker/24541">Mandi Walls</a> and <a href="http://en.oreilly.com/velocity2008/public/schedule/speaker/3308">Eric Goldsmith</a>). I&#8217;ve been thinking about this as &#8220;Web 2.0 Expo without all that boring UI and Business Stuff&#8221;.</p>
<p><a href="http://conferences.oreilly.com/velocity/"><br />
<img title="Velocity, the Web Performance and Operations Conference 2008" src="http://conferences.oreillynet.com/banners/velocity/speaker/468x60.gif" border="0" alt="Velocity, the Web Performance and Operations Conference 2008" width="468" height="60" /></a></p>
<p>The second event I&#8217;ll be at will be <a href="http://events.gigaom.com/structure/08/">GigaOM&#8217;s Structure 08</a>. Cloud computing is really leveling the playing field, giving small start-ups access to world-class operational assets&#8230; which to me only underscores the importance of having brilliant Ops folks to run those systems. I&#8217;m eager to see what sort of discussions emerge.</p>
<p>If you happen to be at either, give me a buzz in the comments, and I&#8217;ll try and catch up with you. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Really Big Data Centers for Lease</title>
		<link>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/</link>
		<comments>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/#comments</comments>
		<pubDate>Mon, 22 Oct 2007 04:26:29 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
				<category><![CDATA[Data Centers]]></category>
		<category><![CDATA[Infrastructure]]></category>
		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/</guid>
		<description><![CDATA[This past Friday, DuPont Fabros Technology (DFT) raised $640 million in an IPO. DFT is a Real Estate Investment Trust (REIT) which specializes in large-scale commercial data centers. More to the point, they specialize in the sort of facilities which &#8230; <a href="http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>This past Friday, DuPont Fabros Technology (DFT) <a href="http://www.reuters.com/article/newIssuesNews/idUSN1837411420071018">raised $640 million in an IPO</a>. DFT is a Real Estate Investment Trust (REIT) which specializes in large-scale commercial data centers. More to the point, they specialize in the sort of facilities which are desired by the largest technology companies. I&#8217;ve mentioned before that building and operating facilities is often desirable for larger players, but when it isn&#8217;t, they increasingly turn to DFT.</p>
<p><span id="more-29"></span></p>
<p>DFT operates quite a few facilities in Northern Virginia. Most familiar to me would be the former <a href="http://www.dft.com/data_centers/va4.shtml">AOL Gainesville Technology Center</a>, <a href="http://www.washingtonpost.com/wp-dyn/content/article/2005/07/27/AR2005072702539.html">sold to them in 2005</a>. While this facility was big, the real capacity of a data center is less a matter of space (&#8220;raised floor square feet&#8221;) than of power (&#8220;megawatts of critical load&#8221;). DFT has subsequently acquired several other sites, and <a href="http://www.dft.com/data_centers/acc4.shtml">will open a new facility</a> roughly twice the size and with four times the power of the former AOL site, with <a href="http://www.dft.com/data_centers/development_pipeline.shtml">several more in their pipeline</a>.</p>
<p>That&#8217;s quite a bit of new hosting capacity, and a clear sign that the large facility shortage may be over. To put it in perspective, the Lenoir, NC or Quincy, WA sites mentioned in <a href="http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/">my earlier post</a> as &#8220;really big&#8221; are in the same ballpark, size-wise, as one of their new centers. And, while we don&#8217;t really know whether Google or Microsoft have filled out their space, DFT brags that all their earlier sites are leased with terms averaging 8 years. Their still-under-construction ACC4 site was 43.8% pre-leased in August.</p>
<p>So, who&#8217;s using all that space? Their IPO reveals a few of these normally secret details. Facebook is making a big new expansion, <a href="http://www.datacenterknowledge.com/archives/2007/Oct/18/facebook_expands_data_center_space.html">leasing 10,000 square feet of space</a>. Most of the rest of the space goes to a few major users, according to their <a href="http://www.sec.gov/Archives/edgar/data/1407739/000119312507177663/ds11.htm">registration statement</a>. The same statement indicates that their undeveloped property represents 187MW of additional capacity. That said, they probably want to acquire some new customers, because &#8220;As of August 1, 2007, our two largest tenants, Microsoft and Yahoo!, accounted for 86.0% of our annualized rent&#8221;.</p>
<p>This is a distinctly different business model from that of either the major network providers (Level3 or Verizon), or the major Carrier Neutral providers (Equinix or Switch &amp; Data). They consider themselves wholesalers to major providers, rather than retail providers, because they sell the sort of large capacity blocks that otherwise can&#8217;t be had without building a facility yourself.</p>
<p>So, will they ultimately suffer the fate of Exodus or Genuity? They do seem to be in the right geographic locations (NoVA, Chicago, Santa Clara). They seem to have the right sizes and scale (new facilities over 30MW), with individual units of capacity in the 2-4MW range, for their target market. I&#8217;m not sure there are 10 more Microsofts, Yahoos, or Googles out there looking for space, but if they&#8217;re opening up their doors to Facebook at reasonable pricing, perhaps there&#8217;s enough large scale demand left. With Amazon chasing the small end, and DFT chasing the high end, this may make for a very interesting business outlook in the ever-turbulent retail data center business.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

