Amazon Explains its Massive S3 Outage (Mar 2, 2017)

Amazon’s S3 service went down in part of the US on Tuesday, something I commented on at the time. We now have an official explanation: an employee attempting to debug an issue with the S3 billing system accidentally took down more servers than intended, which in turn had a knock-on effect on several other systems that manage other aspects of S3 (and on the dashboard which reports whether the service is performing as expected). Restarting several of the servers took far longer than anyone had expected, which meant Amazon’s contingency planning turned out to be inadequate. It sounds like Amazon has now put protections in place to prevent similar incidents in future, but once again this is a reminder of how vulnerable big chunks of the Internet are to an AWS outage, something we discussed in depth on this week’s Beyond Devices Podcast, recorded earlier today shortly after the announcement was made.

via Amazon

