The Long Tail of the AWS Outage


A sprawling Amazon Web Services cloud outage that began early Monday morning exposed the fragile interdependencies of the internet as communication, financial, health care, education, and government platforms around the world were disrupted. Later that day, AWS acknowledged the problem and began working to correct it. The issue originated in the company's critical US-EAST-1 region, based in northern Virginia. But the cascade of effects took time to fully resolve.

Researchers reflecting on the incident particularly highlighted the length of the outage, which began around 3 am ET on Monday, October 20. AWS said in status updates that “all AWS services were back to normal operation” as of 6:01 pm ET on Monday. The outage stemmed directly from problems with the APIs of Amazon’s DynamoDB database and, according to the company, affected 141 other AWS services. Several network engineers and infrastructure experts told WIRED that failures are understandable and inevitable for so-called “hyperscalers” like AWS, Microsoft Azure, and Google Cloud Platform, given their complexity and sheer size. But they also noted that this reality shouldn’t simply absolve cloud providers of prolonged downtime.

“The word ‘hindsight’ is key. It’s easy to see what went wrong, but AWS’s overall reliability shows how difficult it is to prevent every failure.”

AWS did not respond to WIRED’s questions about the long recovery tail for customers. An AWS spokesperson says the company plans to release one of its “post-event summaries” about the incident.

“I don’t think it’s just a ‘random’ outage,” says Jake Williams, vice president of research and development at Hunter Strategy. “So this is to their credit. But it’s very easy to get into the mindset of giving these companies a pass, and we shouldn’t forget that they’re creating this situation by actively trying to attract more customers to their infrastructure, customers who may have no oversight of that infrastructure.”

The incident was caused by a familiar culprit in web outages: “domain name system” resolution problems. DNS is essentially the internet’s phone book, the mechanism that directs web browsers to the appropriate servers. As a result, DNS problems are a common source of outages, because they can cause requests to fail and prevent content from loading.
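
For readers who want to see what a failed lookup looks like from a client's side, here is a minimal Python sketch using the standard library's `socket.getaddrinfo`. It is an illustration only, not a reconstruction of the AWS failure: the `resolve` helper, its `resolver` parameter, and the `example.com` hostname are assumptions made for this example.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or None if the DNS lookup fails.

    `resolver` defaults to the system resolver; it is injectable (a hypothetical
    convenience for this sketch) so the failure path can be exercised offline.
    """
    try:
        # Ask for TCP endpoints on port 443, as an HTTPS client would.
        records = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, so no connection can be attempted.
        return None
    # Each record is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({record[4][0] for record in records})


if __name__ == "__main__":
    # "example.com" is a placeholder hostname chosen for illustration.
    addresses = resolve("example.com")
    if addresses is None:
        print("DNS lookup failed: the request never reaches a server")
    else:
        print("Resolved to:", ", ".join(addresses))
```

When the lookup fails, the client has no address to connect to, so a service can appear to be down even while its servers are healthy.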
