<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.jacobrosenberg.net/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.jacobrosenberg.net/~d/styles/itemcontent.css" type="text/css" media="screen"?><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>JacobRosenberg.net</title>
	
	<link>http://www.jacobrosenberg.net</link>
	<description>The view from AOL's Operations</description>
	<pubDate>Mon, 15 Sep 2008 03:06:04 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.2</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.jacobrosenberg.net/JacobRosenbergsBlog" type="application/rss+xml" /><item>
		<title>Complexity and the 4 a.m. test</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/392819550/</link>
		<comments>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/#comments</comments>
		<pubDate>Mon, 15 Sep 2008 03:04:51 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[4am test]]></category>

		<category><![CDATA[complexity]]></category>

		<category><![CDATA[outages]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=42</guid>
		<description><![CDATA[ 
With most technology, it&#8217;s a given that there&#8217;s almost always More Than One Way To Do It (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some try to minimize &#8220;cost,&#8221; either up-front development cost or long-term engineering cost. The smarter ones have [...]]]></description>
			<content:encoded><![CDATA[
<p>With most technology, it&#8217;s a given that there&#8217;s almost always <a href="http://www.perl.com/pub/a/1999/03/pm.html">More Than One Way To Do It</a> (unless you worship Python). There are always those situations where choices must be made, and different people use different yardsticks to decide. Some try to minimize &#8220;cost,&#8221; either up-front development cost or long-term engineering cost. The smarter ones have recognized the concept of &#8220;Technology Debt&#8221; as addressed by <a href="http://onstartups.com/home/tabid/3339/bid/165/Development-Short-Cuts-Are-Not-Free-Understanding-Technology-Debt.aspx">several</a> <a href="http://www.dharmesh.com/Blog/bid/524/Understanding-Technology-Debt">observers</a>. As a leader in Operations, however, I tend to subscribe to my own rule: the 4 a.m. rule.</p>
<p><span id="more-42"></span>Simply put, the 4 a.m. rule is this: </p>
<blockquote>
<p style="text-align: left;"><strong>Never adopt any solution which you couldn&#8217;t understand immediately upon being awoken to fix it at 4 a.m.</strong> </p>
</blockquote>
<p>There&#8217;s a very simple reason to adhere to this rule whenever possible &#8212; as I&#8217;ve previously mentioned, things fall apart. Systems all break: complex ones and simple ones alike. Sooner or later, people need to fix them and the more byzantine the operation of the system, the harder it will be. </p>
<p>The simplest way possible to survive the 4 a.m. test is to only build very simple systems. A totally simple system is sometimes just the ticket to solve the problem, and where it is adequate, go with it. Interesting problems occasionally have extremely elegant solutions, and making them more complex is just bad mojo. </p>
<p>Still, you&#8217;ll much more often find a place where more complexity is necessary to achieve your desired goal. In these circumstances, it can be tricky to pass the 4 a.m. test. This is where two strategies are necessary: documentation and transparency.</p>
<p>Documentation deserves a whole separate discussion, but the part that&#8217;s important at 4 a.m. is a complete lack of subtlety: </p>
<ul>
<li>Recovery instructions: you&#8217;ll have bleary eyes, so these must be as simple as &#8220;if this, do that&#8221;</li>
<li>Architecture diagrams: simple pictures with bright colors and clearly labeled lines detailing what talks to what and why. And don&#8217;t make me load Visio at 4 a.m. ever. </li>
<li>If it&#8217;s needed, and can fail, it should be mentioned, but in as few words as necessary. This is not the time for flowery prose.</li>
</ul>
<p>Transparency is quite a bit harder. This is about exposing as much as possible to someone observing the system. A few places are crucial:</p>
<ul>
<li>Error messages: For the love of god people, make sure every message requires absolutely no prior knowledge and is clear and unambiguous even out of context. </li>
<li>Simple dependencies: Nothing is harder to discover than extremely complex webs of services. If you ever see a design with recursive dependencies, run like heck.</li>
<li>Change logging: the first question you should ask when something is broken is &#8220;what changed.&#8221; Keep a record of even the boring stuff - you never know when it&#8217;ll save your bacon.</li>
</ul>
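<p>To make the error-message and change-logging bullets concrete, here is a minimal Python sketch. Everything in it is hypothetical: the function names, the host names, and the message layout are made up for illustration and are not taken from any real system. The idea is only that each message names what failed, where, what it affects, and what to try next.</p>
<pre>
```python
from datetime import datetime, timezone


def describe_failure(service, host, what_failed, impact, next_step):
    """Build an error message that stands alone: no prior context required."""
    return (
        f"{service} on {host}: {what_failed}. "
        f"Impact: {impact}. "
        f"Next step: {next_step}."
    )


def change_record(who, what, when=None):
    """Format one change-log line so 'what changed' can be answered quickly."""
    when = when or datetime.now(timezone.utc)
    return f"{when.strftime('%Y-%m-%d %H:%M UTC')} {who}: {what}"


# Hypothetical example values, for illustration only.
message = describe_failure(
    service="mail-relay",
    host="relay03.example.net",
    what_failed="cannot connect to the user database at db01.example.net:5432",
    impact="new logins fail; existing sessions keep working",
    next_step="check that db01 is up, then restart mail-relay",
)
print(message)
print(change_record(
    "jacob",
    "raised relay03 connection pool from 50 to 100",
    datetime(2008, 9, 14, 23, 30, tzinfo=timezone.utc),
))
```
</pre>
<p>A message built this way can be pasted into a ticket or read over the phone without losing meaning, which is roughly the bar at 4 a.m.</p>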
<p>Remember as a cardinal rule: </p>
<blockquote>
<p style="text-align: left;"><span> </span><strong>complexity is a vice: use it sparingly and explain it simply enough for 4 a.m.</strong></p>
</blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2008/09/14/complexity-and-the-4-am-test/</feedburner:origLink></item>
		<item>
		<title>SimpleCDN: Revolutionary Business Model or Just a Pyramid Scheme?</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/360817098/</link>
		<comments>http://www.jacobrosenberg.net/2008/08/09/simplecdn-revolutionary-business-model-or-just-a-pyramid-scheme/#comments</comments>
		<pubDate>Sun, 10 Aug 2008 04:18:57 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Internet]]></category>

		<category><![CDATA[akamai]]></category>

		<category><![CDATA[cdn]]></category>

		<category><![CDATA[content distribution networks]]></category>

		<category><![CDATA[internet business]]></category>

		<category><![CDATA[limelight]]></category>

		<category><![CDATA[s3]]></category>

		<category><![CDATA[simplecdn]]></category>

		<category><![CDATA[web 2.0]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=38</guid>
		<description><![CDATA[ 
Just the other day, I stumbled on Yet Another Startup CDN Provider, SimpleCDN. The CDN business has become very crowded as facilities-based carriers (AT&#38;T and Level 3), start-ups (Limelight, Peer1, Panther Express), and the established leader (Akamai) engage in a free-for-all over price. SimpleCDN&#8217;s value proposition is straightforward: pay a one-time up-front fee based on object [...]]]></description>
			<content:encoded><![CDATA[
<p>Just the other day, I stumbled on Yet Another Startup CDN Provider, SimpleCDN. The CDN business has become very crowded as facilities-based carriers (AT&amp;T and Level 3), start-ups (Limelight, Peer1, Panther Express), and the established leader (Akamai) engage in a free-for-all over price. SimpleCDN&#8217;s value proposition is straightforward: pay a one-time up-front fee based on object size, instead of monthly based on usage. The big question I want answered is: how could this possibly work as a business?</p>
<p><span id="more-38"></span></p>
<p>This is a pretty unusual concept, so I decided to check it out myself. They let anyone sign up on their portal and start delivering without so much as a sales call. I wanted to ask some questions, so I went looking for real-world contact information. </p>
<p>Here&#8217;s where the first problems became obvious: </p>
<p>- The only address they provide (the Highland Park, IL one) <a href="http://www.mbe.com/hpgen/CenterPage.asp?strCenterNum=MBE1714">appears to be a Mailboxes Etc location</a>. </p>
<p>- They also list some &#8220;NOC&#8221; mailing addresses on <a href="http://www.simplecdn.com/contact">their contact pages</a>. If they seem familiar, it&#8217;s because they&#8217;re major carrier neutral data centers. Perhaps they host something at these locations, but they sure don&#8217;t have people there. </p>
<p>- There was, however, a 24/7 support address. So, I sent them some email.</p>
<p>I received an almost-immediate response (on a Sunday night) from Frank Wilson, who described himself as &#8220;Chief Engineer of SimpleCDN, and also a co-founder.&#8221; He offered to answer any questions, so I asked:</p>
<blockquote><p><span> </span>I&#8217;m assuming your data center bills you for power and space every month, and I&#8217;m sure you&#8217;re paying someone for transit monthly, so how exactly does this work over the long run? On the surface, it seems like you&#8217;re banking on the demand for content dropping dramatically over time, and the marginal cost to store it approaching zero. </p>
<p><span> </span>Assuming your cost for delivery is the 7 cents per GB transferred (which implies Cogent-level transit costs). At the 1MB = 1 credit = ~14GB xfer&#8217;d price point you&#8217;ve set, you&#8217;re collecting that one-time fee sufficiently to cover 14000 times while at the 1GB = 35 credits = 500GB which would only cover around 500 downloads. </p></blockquote>
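<p>The arithmetic behind that question can be sketched in a few lines of Python. The 7 cents per GB delivery cost is my assumption from the email, not a figure SimpleCDN has confirmed, and the function names are made up for this example. Units are decimal (1 GB = 1000 MB), which is what the 14000 figure implies.</p>
<pre>
```python
GB_IN_MB = 1000  # decimal units, matching the 14000 figure in the email


def downloads_covered(object_size_mb, transfer_allowance_gb):
    """How many full downloads the prepaid transfer allowance pays for."""
    return (transfer_allowance_gb * GB_IN_MB) / object_size_mb


def delivery_cost_usd(transfer_allowance_gb, cents_per_gb=7):
    """Provider-side cost of delivering the whole allowance.

    cents_per_gb is an assumed transit cost, not a published number.
    """
    return transfer_allowance_gb * cents_per_gb / 100


# 1 MB object: 1 credit buys roughly 14 GB of transfer.
small = downloads_covered(object_size_mb=1, transfer_allowance_gb=14)
# 1 GB object: 35 credits buy 500 GB of transfer.
large = downloads_covered(object_size_mb=1000, transfer_allowance_gb=500)

print("1 MB object, downloads covered:", small)
print("1 GB object, downloads covered:", large)
print("cost to deliver 14 GB:", delivery_cost_usd(14))
print("cost to deliver 500 GB:", delivery_cost_usd(500))
```
</pre>
<p>Under those assumptions the small object is covered for thousands of downloads while the large one is covered for only a few hundred, which is the asymmetry the email is asking about.</p>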
<p>Days later, with no answer to my question, I decided to do a bit of poking around. Here are a few things I discovered:</p>
<p>1) The entire SimpleCDN operation seems to be hosted in a single Chicago location. Checking looking glass servers in Europe, Asia, South America, and Australia reveals the exact same set of servers in Chicago serving everything. They own <a href="http://www.cidr-report.org/cgi-bin/as-report?as=AS46177&amp;v=4&amp;view=2.0">AS46177</a>, but don&#8217;t seem to be using it.</p>
<p>2) All of their routes come via a self-declared ISP called <a href="http://www.resisoft.com">Resisoft</a> (<a href="http://www.cidr-report.org/cgi-bin/as-report?as=AS32205&amp;v=4&amp;view=2.0">AS32205</a>). Don&#8217;t bother trying to click that link &#8212; they don&#8217;t seem to have a working web site. It isn&#8217;t clear if they&#8217;re troubled, or just host <a href="http://www.ripoffreport.com/reports/0/266/RipOff0266401.htm">troublesome companies</a>. It seems like most of their transit is Hurricane Electric, which isn&#8217;t exactly premium bandwidth. </p>
<p>3) Frank Wilson seems to be the only visible employee of SimpleCDN&#8211;he answers their support email, manages their network, and markets their services. Yes, you heard that right: on sites ranging from <a href="http://www.webhostingtalk.com/showthread.php?t=693479">technical web-hosting forums</a> to <a href="http://messages.finance.yahoo.com/Stocks_(A_to_Z)/Stocks_A/threadview?m=tm&amp;bn=700&amp;tid=184169&amp;mid=184169&amp;tof=40&amp;so=R&amp;frt=2">stock boards</a> to <a href="http://gigaom.com/2007/08/06/cdn-price-wars/">the comment threads of industry sites</a>, he pops up, promoting SimpleCDN. Sometimes he identifies as an employee; other times he talks about SimpleCDN in the third person, as if he weren&#8217;t affiliated. I find this very shifty.</p>
<p>Here&#8217;s what I think is happening: SimpleCDN is using current revenue from those initial &#8220;one-time&#8221; payments to cover their current bandwidth and hosting bills. As long as usage remains reasonable, and they keep growing their customer base, they can probably keep up with their costs for a little while. </p>
<p>Except, this just isn&#8217;t a very good business model. Most of these customers will pay a few dollars a few times a year. Transaction charges will eat a small company alive &#8212; even Amazon&#8217;s EC2 products aren&#8217;t really designed around someone using 5 instance-hours a month. And all it takes is one really big download (Firefox 4, maybe?) to call their bluff.</p>
<p>So, worst case, they&#8217;re some sort of scam that will collect a bunch of pre-paid hosting charges and disappear. Best case, they have a naive business model which will cause their collapse if they get even moderately popular. I suppose I can understand why someone currently using S3 as a CDN (seriously, guys, why?) might consider this an appealing alternative, but it&#8217;s pretty hard to imagine that Frank&#8217;s business will be here in 2 years. </p>
<p>Oh, and to Frank: I&#8217;d love to see a business plan that proves me wrong. I&#8217;m always interested in seeing legitimate competitors in the space, so show the world some evidence that SimpleCDN is more than just a great big promise that can&#8217;t go the distance!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/08/09/simplecdn-revolutionary-business-model-or-just-a-pyramid-scheme/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2008/08/09/simplecdn-revolutionary-business-model-or-just-a-pyramid-scheme/</feedburner:origLink></item>
		<item>
		<title>The Art of the Post-Mortem</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/347068732/</link>
		<comments>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/#comments</comments>
		<pubDate>Sun, 27 Jul 2008 02:13:15 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Internet]]></category>

		<category><![CDATA[amazon s3]]></category>

		<category><![CDATA[operations]]></category>

		<category><![CDATA[outages]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=33</guid>
		<description><![CDATA[I&#8217;ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work &#8212; and almost every other large installation I&#8217;ve seen or been part of &#8212; the learnings from these inquisitions are [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve mentioned in the past that the failure of complex systems is an inevitable fact of nature. The corresponding act of human inquisition into the reasons for that failure is equally inevitable. Where I work &#8212; and almost every other large installation I&#8217;ve seen or been part of &#8212; the learnings from these inquisitions are shared for educational reasons. The name for this differs from company to company: some call it an RFO (reason for outage) or an After-Action Report, but for whatever reason the name for this at AOL is a Post-Mortem.</p>
<p><span id="more-33"></span>In general, these sorts of documents contain all of the super-secret (or just embarrassing) details that make up daily life in Operations. They&#8217;re almost never distributed very far &#8212; even large service providers (say, Verizon) tend to have a sanitized version they give their customers. Interestingly, however, a <a href="http://status.aws.amazon.com/s3-20080720.html">sanitized but pretty juicy example</a> emerged from Amazon in response to their recent S3 outage. </p>
<p>Here&#8217;s a break-down by phase. First comes the &#8220;detection&#8221; phase: someone, likely in a Network Operations Center (since this is Sunday morning), starts seeing big red lights. Detection is all about finding out that something is wrong, how serious it is, and who needs to fix it.</p>
<blockquote><p>At 8:40am PDT, error rates in all Amazon S3 datacenters began to quickly climb and our alarms went off. By 8:50am PDT, error rates were significantly elevated and very few requests were completing successfully. By 8:55am PDT, we had multiple engineers engaged and investigating the issue. Our alarms pointed at problems processing customer requests in multiple places within the system and across multiple data centers. While we began investigating several possible causes, we tried to restore system health by taking several actions to reduce system load. We reduced system load in several stages, but it had no impact on restoring system health.</p></blockquote>
<p>At this point, it&#8217;s pretty clear that they had a major system event going on. I&#8217;d imagine cell phones or pagers (depending on how retro they are out in Seattle) were ruining Sunday morning all over Washington state. The next phase is &#8220;investigation&#8221; &#8212; basically, determining the <strong>proximate cause</strong> of the problem.</p>
<blockquote><p>At 9:41am PDT, we determined that servers within Amazon S3 were having problems communicating with each other. As background information, Amazon S3 uses a gossip protocol to quickly spread server state information throughout the system. This allows Amazon S3 to quickly route around failed or unreachable servers, among other things. When one server connects to another as part of processing a customer&#8217;s request, it starts by gossiping about the system state. Only after gossip is completed will the server send along the information related to the customer request. On Sunday, we saw a large number of servers that were spending almost all of their time gossiping and a disproportionate amount of servers that had failed while gossiping. With a large number of servers gossiping and failing while gossiping, Amazon S3 wasn&#8217;t able to successfully process many customer requests.</p></blockquote>
<p>I notice that the times moved from 5-minute rounding to 1-minute rounding. You get that level of detail from log analysis, and from the sort of really clever network and system monitoring technology that used to be the domain of really big players with lots of money. So, we&#8217;re an hour into a major outage and it&#8217;s likely that this has been escalated both technically (to the most senior engineers who know the system) and to the management of the business that owns the system (Amazon Web Services). </p>
<blockquote><p>At 10:32am PDT, after exploring several options, we determined that we needed to shut down all communication between Amazon S3 servers, shut down all components used for request processing, clear the system&#8217;s state, and then reactivate the request processing components.</p></blockquote>
<p>Okay, almost another hour is gone and I&#8217;d imagine all the &#8220;easy&#8221; options are exhausted. Now, they&#8217;re trying more high-impact solutions. This one sounds suspiciously like &#8220;bounce the &lt;insert process here&gt; and see if it comes up clean,&#8221; which is one of those embarrassing-but-effective solutions you end up using when you just don&#8217;t know what else to do.</p>
<blockquote><p>By 11:05am PDT, all server-to-server communication was stopped, request processing components shut down, and the system&#8217;s state cleared. By 2:20pm PDT, we&#8217;d restored internal communication between all Amazon S3 servers and began reactivating request processing components concurrently in both the US and EU.</p>
<p>At 2:57pm PDT, Amazon S3&#8217;s EU location began successfully completing customer requests. The EU location came back online before the US because there are fewer servers in the EU. By 3:10pm PDT, request rates and error rates in the EU had returned to normal. At 4:02pm PDT, Amazon S3&#8217;s US location began successfully completing customer requests, and request rates and error rates had returned to normal by 4:58pm PDT.</p></blockquote>
<p>So, a full bounce of that subsystem took almost 4 hours to show results. You can imagine those were 4 pretty tense hours. Some companies use a conference bridge to manage big incidents, others use web chats or VoIP systems. I&#8217;m sure a bunch of people were all working very hard to move this along quickly, and it still took quite a while, but you can almost imagine the relief that flooded the whole team at 2:57pm when EU came back up. By around 5pm, the whole system was back up and normal, and there wasn&#8217;t much left to do except the paperwork.</p>
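<p>For anyone who wants to check the phase lengths, here is a small Python sketch that computes them from the timestamps in the quoted report. The event labels are my own shorthand, not wording from the Amazon report, and the code assumes everything happened on the same day in the same time zone.</p>
<pre>
```python
from datetime import datetime

# Timestamps (PDT) taken from the quoted report; labels are shorthand.
TIMELINE = [
    ("alarms fire", "8:40am"),
    ("gossip problem identified", "9:41am"),
    ("full restart decided", "10:32am"),
    ("system state cleared", "11:05am"),
    ("internal communication restored", "2:20pm"),
    ("EU serving requests", "2:57pm"),
    ("US back to normal", "4:58pm"),
]


def parse(clock):
    """Parse a clock string such as '8:40am' (same day assumed)."""
    return datetime.strptime(clock, "%I:%M%p")


def minutes_between(start, end):
    """Whole minutes elapsed between two clock strings."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


previous = TIMELINE[0]
for event in TIMELINE[1:]:
    elapsed = minutes_between(previous[1], event[1])
    print(f"{previous[0]} to {event[0]}: {elapsed} min")
    previous = event

print("state cleared to EU recovery:", minutes_between("11:05am", "2:57pm"), "min")
print("alarms to US back to normal:", minutes_between("8:40am", "4:58pm"), "min")
```
</pre>
<p>The gap between clearing state at 11:05am and the EU recovering at 2:57pm is 3 hours 52 minutes, which is where &#8220;almost 4 hours&#8221; comes from.</p>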
<p>Which brings us to the last part of their message:</p>
<blockquote><p>We&#8217;ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers&#8217; objects. However, we didn&#8217;t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn&#8217;t detect it and it spread throughout the system causing the symptoms described above. We hadn&#8217;t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.</p></blockquote>
<p>You can be sure that as soon as they were on a path to get the system stable, they were investigating how it got unstable in the first place. This is a pretty in-depth problem statement, and it&#8217;s impressive how transparent Amazon is about it. I doubt I&#8217;d ever be allowed to report to the public on any of my outages like this alone, and I suspect Amazon is no different: at some point during that investigation, Amazon&#8217;s PR people started working on wording for public announcements. At the same time, technical teams were working to figure out how to make sure it never happens again.</p>
<blockquote><p>During our post-mortem analysis we&#8217;ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we&#8217;re taking: (a) we&#8217;ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we&#8217;ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we&#8217;ve added additional monitoring and alarming of gossip rates and failures; and, (d) we&#8217;re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.</p></blockquote>
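<p>Item (d) is easy to illustrate. The Python sketch below attaches an MD5 digest to a state message and rejects any message whose digest no longer matches, which is enough to catch a single flipped bit. To be clear, this is a toy under my own assumptions: the message fields, the JSON encoding, and the function names are invented here, and nothing about Amazon&#8217;s actual message format is public in the report.</p>
<pre>
```python
import hashlib
import json


def seal(state):
    """Serialize a state message and compute an MD5 digest of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return payload, hashlib.md5(payload).hexdigest()


def accept(payload, digest):
    """Return the decoded state, or None if the payload fails its checksum."""
    if hashlib.md5(payload).hexdigest() != digest:
        return None  # log and reject rather than gossip bad state onward
    return json.loads(payload)


def flip_bit(payload, bit_index):
    """Simulate single-bit corruption in transit."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 2 ** (bit_index % 8)
    return bytes(corrupted)


# Hypothetical state message, for illustration only.
payload, digest = seal({"server": "node-17", "status": "healthy"})

print("clean message:", accept(payload, digest))
print("corrupted message:", accept(flip_bit(payload, 3), digest))
```
</pre>
<p>MD5 is used here only as an integrity check against accidental corruption, matching the checksum the report mentions; it is not a defense against deliberate tampering.</p>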
<p>This is fascinating. Certainly, (a) is an obvious concern &#8212; I know I&#8217;d be pulling out my hair if a key system I managed took 4 hours to restart. I&#8217;d speculate that this was simply a contingency nobody had thought about yet, and thus was probably a painfully manual process. Amazon (and all the large operators) rely on quite a bit of automation to manage large fleets of servers, but when the unexpected happens, it doesn&#8217;t always work as planned. Item (b) looks like it&#8217;s addressing the root cause; item (d) may also play into this. Item (c) is that perennial favorite of ops: monitoring and alarming. These are all just the sort of things I&#8217;d expect to see if I were in that situation.</p>
<p>In summary, very interesting data about an increasingly important service on the internet. At the same time, a rare view into what goes on in the dark where SysAdmins and Network Engineers roam. Incidentally, there are technical details about the system that failed, which is a technology Amazon calls <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2008/07/26/the-art-of-the-post-mortem/</feedburner:origLink></item>
		<item>
		<title>Velocity and Structure08</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/317120094/</link>
		<comments>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/#comments</comments>
		<pubDate>Sat, 21 Jun 2008 23:05:17 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Web Technology]]></category>

		<category><![CDATA[cdn]]></category>

		<category><![CDATA[cloud computing]]></category>

		<category><![CDATA[conferences]]></category>

		<category><![CDATA[structure08]]></category>

		<category><![CDATA[velocity]]></category>

		<category><![CDATA[web performance]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/?p=32</guid>
		<description><![CDATA[A whole lot of conferences are happening this week, and I&#8217;ll be attending two of them. On Monday and Tuesday of this week I&#8217;ll be attending O&#8217;Reilly&#8217;s Velocity conference, where I&#8217;ll be moderating a panel entitled &#8220;Everything You Ever Wanted to Know about CDNs (but were afraid to ask).&#8221; I&#8217;m hoping that turns out to be [...]]]></description>
			<content:encoded><![CDATA[<p>A whole lot of conferences are happening this week, and I&#8217;ll be attending two of them. On Monday and Tuesday of this week I&#8217;ll be attending O&#8217;Reilly&#8217;s Velocity conference, where I&#8217;ll be moderating a panel entitled &#8220;<a href="http://en.oreilly.com/velocity2008/public/schedule/detail/2213">Everything You Ever Wanted to Know about CDNs (but were afraid to ask).</a>&#8221; I&#8217;m hoping that turns out to be fun, and there ought to be a lot of <strong>other</strong> interesting people I&#8217;d like to see while there as well, including two other very smart folks from AOL (<a href="http://en.oreilly.com/velocity2008/public/schedule/speaker/24541">Mandi Walls</a> and <a href="http://en.oreilly.com/velocity2008/public/schedule/speaker/3308">Eric Goldsmith</a>). I&#8217;ve been thinking about this as &#8220;Web 2.0 Expo without all that boring UI and Business Stuff&#8221;. </p>
<p><a href="http://conferences.oreilly.com/velocity/"><br />
<img title="Velocity, the Web Performance and Operations Conference 2008" src="http://conferences.oreillynet.com/banners/velocity/speaker/468x60.gif" border="0" alt="Velocity, the Web Performance and Operations Conference 2008" width="468" height="60" /></a></p>
<p>The second event I&#8217;ll be at will be <a href="http://events.gigaom.com/structure/08/">GigaOM&#8217;s Structure 08</a>. Cloud computing is really leveling the playing field, giving small start-ups access to world-class operational assets&#8230; which to me only underscores the importance of having brilliant Ops folks to run those systems. I&#8217;m eager to see what sort of discussions emerge.</p>
<p>If you happen to be at either, give me a buzz in the comments, and I&#8217;ll try and catch up with you. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2008/06/21/velocity-and-structure08/</feedburner:origLink></item>
		<item>
		<title>Really Big Data Centers for Lease</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332222/</link>
		<comments>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/#comments</comments>
		<pubDate>Mon, 22 Oct 2007 04:26:29 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Data Centers]]></category>

		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/</guid>
		<description><![CDATA[This past Friday, DuPont Fabros Technology (DFT) raised $640 million in an IPO. DFT is a Real Estate Investment Trust (REIT) which specializes in large-scale commercial data centers. More to the point, they specialize in the sort of facilities which are desired by the largest technology companies. I&#8217;ve mentioned before that building and operating facilities [...]]]></description>
			<content:encoded><![CDATA[<p>This past Friday, DuPont Fabros Technology (DFT) <a href="http://www.reuters.com/article/newIssuesNews/idUSN1837411420071018">raised $640 million in an IPO</a>. DFT is a Real Estate Investment Trust (REIT) which specializes in large-scale commercial data centers. More to the point, they specialize in the sort of facilities which are desired by the largest technology companies. I&#8217;ve mentioned before that building and operating facilities is often desirable for larger players, but when it isn&#8217;t, they increasingly turn to DFT.</p>
<p><span id="more-29"></span></p>
<p>DFT operates quite a few facilities in Northern Virginia. Most familiar to me would be the former <a href="http://www.dft.com/data_centers/va4.shtml">AOL Gainesville Technology Center</a>, <a href="http://www.washingtonpost.com/wp-dyn/content/article/2005/07/27/AR2005072702539.html">sold to them in 2005</a>. While this facility was big, the real measure of a data center&#8217;s capacity is less space (&#8220;raised floor square feet&#8221;) than power (&#8220;megawatts of critical load&#8221;). DFT has subsequently acquired several other sites, and <a href="http://www.dft.com/data_centers/acc4.shtml">will open a new facility</a> roughly twice the size and with four times the power of the former AOL site, with <a href="http://www.dft.com/data_centers/development_pipeline.shtml">several more in their pipeline</a>.</p>
<p>That&#8217;s quite a bit of new hosting capacity, and a clear sign that the large facility shortage may be over. To put it in perspective, the Lenoir, NC or Quincy, WA sites mentioned in <a href="http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/">my earlier post</a> as &#8220;really big&#8221; are in the same ballpark, size-wise, as one of their new centers. And, while we don&#8217;t really know whether Google or Microsoft have filled out their space, DFT brags that all their earlier sites are leased with terms averaging 8 years. Their still-under-construction ACC4 site was 43.8% pre-leased in August.</p>
<p>So, who&#8217;s using all that space? Their IPO filing reveals some of these normally secret details. Facebook is making a big new expansion, <a href="http://www.datacenterknowledge.com/archives/2007/Oct/18/facebook_expands_data_center_space.html">leasing 10,000 square feet of space</a>. Most of the rest of the space goes to a few major users, according to their <a href="http://www.sec.gov/Archives/edgar/data/1407739/000119312507177663/ds11.htm">registration statement</a>. The same statement indicates that their undeveloped property represents 187MW of additional capacity. That said, they probably want to acquire some new customers, because &#8220;As of August 1, 2007, our two largest tenants, Microsoft and Yahoo!, accounted for 86.0% of our annualized rent&#8221;.</p>
<p>This is a distinctly different business model than that of either the major network providers (Level3 or Verizon), or the major Carrier Neutral providers (Equinix or Switch &amp; Data). They consider themselves wholesalers to major providers, rather than a retail provider, because they allow the sort of large capacity blocks to be purchased that otherwise can&#8217;t be gotten without building it yourself.</p>
<p>So, will they ultimately suffer the fate of Exodus or Genuity? They do seem to be in the right geographic locations (NoVA, Chicago, Santa Clara). They seem to have the right sizes and scale (new facilities over 30MW), with individual units of capacity in the 2-4MW range, for their target market. I&#8217;m not sure there are 10 more Microsofts, Yahoos, or Googles out there looking for space, but if they&#8217;re opening up their doors to Facebook at reasonable pricing, perhaps there&#8217;s enough large scale demand left. With Amazon chasing the small end, and DFT chasing the high end, this may make a very interesting business outlook for the ever-turbulent retail data center business.</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332222" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/10/21/really-big-data-centers-for-lease/</feedburner:origLink></item>
		<item>
		<title>Be nice</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332223/</link>
		<comments>http://www.jacobrosenberg.net/2007/10/16/be-nice/#comments</comments>
		<pubDate>Tue, 16 Oct 2007 22:16:26 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Navel-Gazing]]></category>

		<category><![CDATA[aol]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/10/16/be-nice/</guid>
		<description><![CDATA[It&#8217;s been well-reported that AOL made cuts today. While I wasn&#8217;t among those affected, naturally with any event this large, quite a few people I knew and worked with were amongst those impacted.
If there&#8217;s any one thing that&#8217;s been disappointing about this time around, it&#8217;s been the continual stream of nastiness by people who claim [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been well-reported that <a href="http://www.nytimes.com/2007/10/16/business/media/16aol.html">AOL made cuts today</a>. While I wasn&#8217;t among those affected, naturally with any event this large, quite a few people I knew and worked with were amongst those impacted.<span id="more-28"></span></p>
<p>If there&#8217;s any one thing that&#8217;s been disappointing about this time around, it&#8217;s been the continual stream of nastiness by people who claim to be current or former AOL employees in public forums and online rumor sites. Isn&#8217;t it bad enough that we&#8217;re going through this &#8212; do we need to rip on each other as well? Difficult situations produce different responses from people, and I understand the desire to vent; I just wish it were less self-destructive.</p>
<p>And that&#8217;s all I have to say about the topic.  I&#8217;ll be back to my regularly-scheduled &#8230; uhm &#8230; &#8220;posting schedule&#8221; when the grumpy wears off.</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332223" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/10/16/be-nice/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/10/16/be-nice/</feedburner:origLink></item>
		<item>
		<title>Things Fall Apart, Datacenter Edition</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332224/</link>
		<comments>http://www.jacobrosenberg.net/2007/08/02/things-fall-apart-datacenter-edition/#comments</comments>
		<pubDate>Fri, 03 Aug 2007 03:53:12 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[365 main]]></category>

		<category><![CDATA[Data Centers]]></category>

		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Internet]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/08/02/things-fall-apart-datacenter-edition/</guid>
		<description><![CDATA[The relentless pursuit by Operations staff of 100% uptime has always struck me as something more than just a job: a battle against the relentless forces of nature. Everything ultimately breaks down &#8212; systems, buildings, even people &#8212; and attempting to maintain 100% availability is the Ops equivalent of trying to cheat death. Sooner [...]]]></description>
			<content:encoded><![CDATA[<p>The relentless pursuit by Operations staff of 100% uptime has always struck me as something more than just a job: a battle against the relentless forces of nature. Everything ultimately breaks down &#8212; systems, buildings, even people &#8212; and attempting to maintain 100% availability is the Ops equivalent of trying to cheat death. Sooner or later, despite our best efforts, our number will ultimately be up. Most recently in the news, self-proclaimed World&#8217;s Finest Data Center operator <a href="http://www.365main.com">365 Main</a> suffered an approximately <a href="http://365main.com/press_releases/pr_8_1_07_365_main_report.html">45 minute power outage</a> at their San Francisco facility. Much to their credit, and unlike most of their <a href="http://blogs.feedburner.com/feedburner/archives/001280.html">competitors</a>, 365 Main has been extremely open about their investigation. I&#8217;ll examine this a bit today, as it&#8217;s a rare public glimpse into what goes on inside a large data center facility.<br />
<span id="more-27"></span><br />
One afternoon, a transformer owned by the supplying power utility failed and caused a power surge. Normally, a power problem like this should trigger an automated transition from utility power to data center generated power. There&#8217;s a <a href="http://www.hitecusa.com/upsoperation.html">pretty cool animation</a> from the company that makes the generator that 365 Main uses that shows how the transition happens. Except, unfortunately, this time when the utility power was interrupted, three of the generators failed due to a software bug, and 365 Main&#8217;s design could only survive two failures.</p>
<p>So, 365 Main screwed up big time? Not really. Their design was horribly flawed? Not so much. They have eight rooms full of servers, and ten generators &#8212; enough for every room to have one, plus two extra generators for contingencies. This type of UPS is a Diesel Rotary UPS: utility power makes a flywheel spin, the flywheel runs a generator, the generator supplies power to the computers. When utility power goes away, there&#8217;s a brief (to humans) pause while the diesel engine starts up to make power for the computers. As long as the diesel spins up before the flywheel spins down, power keeps flowing. Proponents of this design like to emphasize how it&#8217;s simple, and thus pretty reliable, and in its defense, no part of the Rotary system seemed to fail in this case. For completeness, the other kind of UPS uses lots and lots of batteries.</p>
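<p>To make the redundancy arithmetic concrete, here&#8217;s a back-of-the-envelope sketch. It is my own simplification, not 365 Main&#8217;s actual control logic: it assumes each room needs exactly one working generator and that any generator can serve any room. The function name is made up for illustration; only the counts (eight rooms, ten generators, three failures) come from the incident.</p>

```python
# Simplified N+2 capacity check. Assumption: one working generator per room,
# and any generator can back up any room. Not 365 Main's real control logic.

def survives(rooms: int, generators: int, failed: int) -> bool:
    """Return True if the remaining generators can still carry every room."""
    working = generators - failed
    return working >= rooms

ROOMS = 8
GENERATORS = 10

if __name__ == "__main__":
    # Walk through zero to three simultaneous generator failures.
    for failed in range(4):
        status = "ok" if survives(ROOMS, GENERATORS, failed) else "outage"
        print(f"{failed} failed, {GENERATORS - failed} working for {ROOMS} rooms: {status}")
```

<p>Under those assumptions, two failures still leave eight generators for eight rooms, while the third failure leaves only seven, which is where the lights go out.</p>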
<p>What did fail was the diesel engine&#8217;s controller. What makes the electricity is an enormous diesel engine, so naturally there&#8217;s quite a bit of support equipment to keep it running that had to be checked. During the investigation, issues with other systems (exhaust) were uncovered. Ultimately, though, they found a flaw that could be reproduced in a critical part of a system that should never fail. Yet, it did fail in 30% of the cases, and that was enough to bring down all sorts of different products, not to mention causing a highly public issue for 365 Main.</p>
<p>I&#8217;m not affiliated in any way with 365 Main, but I do have to say I&#8217;m impressed with how they have handled the aftermath of the incident. They were open and honest about the incident, and provided lots of public information about their investigation, even to non-customers like me. Heck, they even ignored the idiotic rumors that Valleywag was tricked into posting as &#8220;news.&#8221; Nicely done, guys.</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332224" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/08/02/things-fall-apart-datacenter-edition/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/08/02/things-fall-apart-datacenter-edition/</feedburner:origLink></item>
		<item>
		<title>PRESENTATION: Geographic Distribution for Global Web Application Performance</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332225/</link>
		<comments>http://www.jacobrosenberg.net/2007/04/17/presentation-geographic-distribution-for-global-web-application-performance/#comments</comments>
		<pubDate>Wed, 18 Apr 2007 04:26:02 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[web2expo]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/04/17/presentation-geographic-distribution-for-global-web-application-performance/</guid>
		<description><![CDATA[As promised, the presentation from Geographic Distribution for Global Web Application Performance. This was presented today at Web 2.0 Expo.
]]></description>
			<content:encoded><![CDATA[<p>As promised, the presentation from <a href="http://www.jacobrosenberg.net/wp-content/uploads/2007/04/geodist_web20.pdf" title="Geographic Distribution for Global Web Application Performance">Geographic Distribution for Global Web Application Performance</a>. This was presented today at Web 2.0 Expo.</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332225" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/04/17/presentation-geographic-distribution-for-global-web-application-performance/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/04/17/presentation-geographic-distribution-for-global-web-application-performance/</feedburner:origLink></item>
		<item>
		<title>Geographic Distribution for Global Web Application Performance</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332226/</link>
		<comments>http://www.jacobrosenberg.net/2007/04/16/geographic-distribution-for-global-web-application-performance/#comments</comments>
		<pubDate>Mon, 16 Apr 2007 22:28:27 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Internet]]></category>

		<category><![CDATA[Technology]]></category>

		<category><![CDATA[web2expo]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/04/16/geographic-distribution-for-global-web-application-performance/</guid>
		<description><![CDATA[I&#8217;m pleased to announce that on Tuesday, April 17th, I&#8217;ll be presenting a brief  discussion of Geographic Distribution at Web 2.0 Expo in San Francisco. As the web matures, performance has become a tremendous issue, especially when deploying an application for a global audience. One important way to improve performance is the geographic distribution [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m pleased to announce that on Tuesday, April 17th, I&#8217;ll be presenting a <a href="http://conferences.oreillynet.com/cs/webex2007/view/e_sess/11043">brief discussion of Geographic Distribution</a> at Web 2.0 Expo in San Francisco. As the web matures, performance has become a tremendous issue, especially when deploying an application for a global audience. One important way to improve performance is the geographic distribution of application delivery. Join me at 8:30am tomorrow in Room 2018, or check out my slides, which will be posted shortly after the discussion.</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332226" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/04/16/geographic-distribution-for-global-web-application-performance/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/04/16/geographic-distribution-for-global-web-application-performance/</feedburner:origLink></item>
		<item>
		<title>Really Big Data Centers</title>
		<link>http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~3/256332227/</link>
		<comments>http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/#comments</comments>
		<pubDate>Wed, 11 Apr 2007 17:34:03 +0000</pubDate>
		<dc:creator>jacob</dc:creator>
		
		<category><![CDATA[Data Centers]]></category>

		<category><![CDATA[Infrastructure]]></category>

		<category><![CDATA[Internet]]></category>

		<category><![CDATA[aol]]></category>

		<guid isPermaLink="false">http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/</guid>
		<description><![CDATA[While most of my time these days is spent contemplating software and application considerations, I&#8217;d like to take a moment to address a topic which only occasionally gets the attention it deserves: the role of a high quality data center. While a few folks may think that networking and data center infrastructure are dead arts, [...]]]></description>
			<content:encoded><![CDATA[<p>While most of my time these days is spent contemplating software and application considerations, I&#8217;d like to take a moment to address a topic which only occasionally gets the attention it deserves: the role of a high quality data center. While a few folks may think that <a href="http://gigaom.com/2007/04/10/web-20-death-of-the-network-engineer/">networking and data center infrastructure are dead arts</a>, I&#8217;m quite confident that there is still significant work going on in this space. Case in point: <a href="http://yodel.yahoo.com/2006/11/27/powering-the-yahoo-network/">Yahoo</a>, <a href="http://wenatcheeworld.com/apps/pbcs.dll/article?AID=/20070401/NEWS04/704010448/0/FRONTPAGE">Microsoft</a>, <a href="http://web2.commongate.com/post/Photos_Google_s_Secret_data_center">Google</a>, <a href="http://www.computerworld.com/action/article.do?command=viewArticleBasic&amp;articleId=287832&amp;source=rss_news50">Google</a>, and (shockingly) <a href="http://www.infoworld.com/article/07/04/05/HNgoogledatacentersc_1.html">Google</a> are building massive new data centers taking advantage of all of the latest features to increase density and automation and reduce cost. At the end of the day, scale wins, and these facilities (which have price tags in the half-a-billion dollar range) have scale. Not to be outdone, incidentally, AOL has <a href="http://www.internetnews.com/xSP/article.php/75071">built a few</a> <a href="http://www.gatewayva.com/biz/virginiabusiness/magazine/yr2001/june01/deals.html">big data centers</a> &#8212; and <a href="http://www.datacenterknowledge.com/archives/2005/Jul/29/dupont_fabros_pays_58_million_for_aol_center.html">sold them too</a>. <span id="more-23"></span></p>
<p>So, what makes companies like Google, Yahoo, Microsoft, and AOL build their own data center facilities, when the vast majority of companies end up leasing space from carrier-neutral colocation providers like <a href="http://www.equinix.com">Equinix</a>, or telecommunications providers like <a href="http://www.verizon.com">Verizon Business</a> or <a href="http://www.level3.com">Level3</a>? The tongue in cheek answer would be &#8216;because they can&#8217;, but building a data center facility is as much about control as it is about anything else. Being able to control key elements such as physical security, power and network access, space assignment policy, and general access to the space makes it compelling to own and operate a space. In addition, in recent years, the market for large spaces in the leasing market has dried up, adding both an availability benefit (when you build 250,000 square feet of space, you know it&#8217;s there for you) and a cost benefit (no competing with Google for the last big cage in a facility). Of course, owning a space locks in your cost basis in a way which leasing doesn&#8217;t, but for a business on the grow, there&#8217;s not really a question of whether the space will be used or not.</p>
<p>And yet, the Googles, Microsofts, Yahoos, and AOLs of the world don&#8217;t exclusively use owned space to host their servers: at least some of their footprint remains in leased space. Part of this is necessary in order to build out network connectivity in a desired way. You might not be able to get every peer you want to follow you to the middle-of-nowhere, but you can certainly get to them in San Jose or Washington D.C. There are also certain situations where being present in a specific place is the most important consideration for technical, legal, or contractual reasons.</p>
<p>So, what drives the massive centers out to the boonies? It certainly isn&#8217;t a proliferation of talent in those areas. In the past, land prices were a primary consideration, but in the last two years, the single most important factor seems to be power. Let&#8217;s take a look at a little something from the Department of Energy about <a href="http://www.eia.doe.gov/neic/brochure/electricity/electricity.html">Residential Electricity Prices</a>, understanding that commercial/industrial pricing follows the same trend, but is likely a bit lower. The following map illustrates average prices in cents per kWh (caveat: this data is 4 years old; here&#8217;s <a href="http://www.eia.doe.gov/cneaf/electricity/epm/table5_6_a.html">an ugly but new table of energy prices by State</a>):</p>
<p><img src="http://www.eia.doe.gov/neic/brochure/electricity/images/us%20map.gif" /></p>
<p>You&#8217;ll see some interesting facts &#8212; the states with the largest populations also have some of the highest energy costs. Compare 5.81 cents per kWh in Kentucky against New York&#8217;s pricey 14.31 cents, and you&#8217;ll see that there&#8217;s a tremendous incentive to locate in the lower cost states: Washington, Idaho, North Dakota, Nebraska, Tennessee, Kentucky, West Virginia. Of course, these are averages &#8212; most of those big high profile deals involved low-cost energy. The Microsoft deal, for example, was rumored to include sub-2 cent power. Add to this that several of these states receive power from hydroelectric sources (Bonneville Power Administration, or Tennessee Valley Authority), which have highly fixed costs, rather than from burning a fuel whose cost could increase markedly in the future.</p>
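<p>To get a feel for what that spread means at data center scale, here&#8217;s a rough sketch. The 1 MW constant load and the function name are my own assumptions for illustration; the two rates are the residential averages quoted above, so real commercial contracts would likely come in lower.</p>

```python
# Rough annual electricity cost for a constant load at a flat rate.
# Assumptions: a hypothetical 1 MW (1000 kW) load running around the clock,
# priced at the residential average rates quoted in the post.

HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years

def annual_cost_dollars(load_kw: float, cents_per_kwh: float) -> float:
    """Yearly cost in dollars for a constant load at a flat per-kWh rate."""
    kwh_per_year = load_kw * HOURS_PER_YEAR
    return kwh_per_year * cents_per_kwh / 100.0

if __name__ == "__main__":
    load_kw = 1000.0  # hypothetical 1 MW of critical load
    kentucky = annual_cost_dollars(load_kw, 5.81)
    new_york = annual_cost_dollars(load_kw, 14.31)
    print(f"Kentucky: ${kentucky:,.0f} per year")
    print(f"New York: ${new_york:,.0f} per year")
    print(f"Spread:   ${new_york - kentucky:,.0f} per year per MW")
```

<p>Multiply that spread by the tens of megawatts these facilities draw, and the trip to the boonies starts to pay for itself.</p>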
<p>Cheap power is great, but without fiber to connect into, a data center is just a gigantic resistance heater. There are certainly plenty of public networks out there, but the larger providers seem to get especially interested in so-called Dark Fiber. A well-documented story about Google discussed their appetite for Dark Fiber and expected they&#8217;d start their own ISP. What&#8217;s much more likely is they just wanted to build their own backbone to connect up their numerous facilities. Dark fiber consists of individual fibers within an already laid cable that haven&#8217;t been connected or used yet. While the market has tightened with the consolidation to a few providers (okay, <a href="http://www.level3.com">Level3 bought everybody:</a> Wiltel, Progress, ICG, Telcove, Looking Glass, and Broadwing), major providers still can get access to the raw material they need to build their own network, using technologies such as Dense Wavelength Division Multiplexing to cram breathtaking amounts of traffic onto a single fiber. Trace the lines on, say, the <a href="http://www.level3.com/images/global_map/Level_3_Network_map.pdf">Level3 network map</a> and you can plot locations for almost all of those new gigantic datacenters. It&#8217;s no coincidence that Montana or North Dakota have no fiber and hence no data centers.</p>
<p>Next time: so, what&#8217;s actually in one of those gigantic Google data centers?</p>
<img src="http://feeds.jacobrosenberg.net/~r/JacobRosenbergsBlog/~4/256332227" height="1" width="1"/>]]></content:encoded>
			<wfw:commentRss>http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/feed/</wfw:commentRss>
		<feedburner:origLink>http://www.jacobrosenberg.net/2007/04/11/really-big-data-centers/</feedburner:origLink></item>
	</channel>
</rss>
