In this episode: Codero's Phoenix data center suffers lengthy outage, Wordpress.com goes down affecting millions of sites, Microsoft's Live ID servers wobble after loss of a brother, Google gives honest assessment of February app outage, and Ubisoft's new DRM system falls down and locks out paying customers.
Published: March 23, 2010
A review of recent network-based service outages and issues.
Backfire - Codero Addresses Lengthy Power Outage:
"The incident began at about 8 a.m. Central time, when the facility lost utility power. The backup generators started properly, but an automatic transfer switch (ATS) failed to switch the power to generator power, leaving the data center operating on the battery banks of its uninterruptible power supply (UPS) units. 'Unfortunately, time ran out and our facility went dark,' said Codero chief operating officer Ryan Elledge. The outage also damaged a power distribution unit (PDU) that supported the core network router, which delayed resumption of service after power was restored to the data center. A small number of servers remained offline late Monday evening due to hardware problems associated with the power issue."
"There was a latent misconfiguration, specifically a cable plugged someplace it shouldn't have been, from a few months ago. Something called the spanning tree protocol kicked in and started trying to route all of our private network traffic to a public network over a link that was much too small and slow to handle even 10% of our traffic which caused high packet loss. This "sort of working" state was much worse than if it had just gone down and confused our systems team and our failsafe systems. It is not clear yet why the misconfiguration bit us yesterday and not earlier. Even though the network issue was unfortunate, we responded too slowly in pinpointing the issue and taking steps to resolve it using alternate routes, extending the downtime 3-4x longer than it should have been."
"Due to the failure of one server, Windows Live ID logins were failing for some customers, and this increased the load on our remaining servers. We took the problematic server offline and brought a new server into rotation. We identified the root cause and fixed it in less than an hour, but it took a while to resolve the logjam that had built up in the meantime, and to redistribute the load to normal levels."
"There was confusion about the instructions for switching to a back-up data centre and the decision-maker for the crossover could not be found. The team then received data suggesting that the data centre was recovering and that a changeover was not necessary. However, the data turned out to be inaccurate and this extended the outage considerably. By the time the move to the backup servers had been made, Google Apps had been down for more than two hours."
"Having recently implemented a wildly unpopular new form of digital rights management for its PC titles, (which requires a constant connection to its DRM servers) over the last few days Ubisoft released two key games for the platform, Assassin's Creed II and Silent Hunter V. Thing is, over the weekend, Ubisoft's DRM servers went down. And at time of posting are still down. Meaning many users had trouble installing games, saving games and in some cases even playing those two titles. As a means of rewarding those remaining customers loyal enough to stick with the publisher despite the outrageous demands of the DRM, it's...hardly what you'd call a success. Especially when it only affects paying customers, with pirates bypassing the DRM enjoying the games all weekend long."
*All views and opinions expressed in ESG blog posts are intended to be those of the post's author and do not necessarily reflect the views of Enterprise Strategy Group, Inc., or its clients. ESG bloggers do not and will not engage in any form of paid-for blogging.