What the massive AWS outage shows about the Internet
A major outage in Amazon Web Services’ key US-EAST-1 region, based in northern Virginia near the US capital, caused widespread disruption to websites and platforms around the world on Monday morning. Amazon’s main e-commerce platform and other offerings, including Ring doorbells and the Alexa smart assistant, suffered interruptions throughout the morning, as did Meta’s communication platform WhatsApp, OpenAI’s ChatGPT, PayPal’s Venmo payment platform, several web services from Epic Games, several UK government sites, and many others.
The outages stemmed from Amazon’s DynamoDB database APIs in US-EAST-1, and AWS said in status updates that the problem was specifically related to DNS resolution. The Domain Name System is a foundational Internet service that essentially acts as an automated phone book, translating web addresses such as www.wired.com into numeric server IP addresses so that web browsers can display the right content to users. DNS resolution problems occur when DNS servers fail to connect the dots accurately and, to extend the phone book analogy, provide the wrong number for a particular name, or vice versa.
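The phone book lookup described above can be sketched in a few lines of Python. This is only an illustration of what a client does when it resolves a name, not a description of how AWS or any particular service implements it. The function name `resolve` and its `lookup` parameter are assumptions made for this example; `socket.getaddrinfo` is the standard-library call that asks the operating system's resolver for addresses.

```python
import socket
from typing import Callable, List


def resolve(hostname: str, lookup: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IP addresses found for a hostname.

    `lookup` is a hypothetical hook that defaults to the system resolver,
    so the function can be exercised without network access.
    """
    try:
        # Each record is (family, type, proto, canonname, sockaddr);
        # the IP address is the first element of sockaddr.
        records = lookup(hostname, None)
    except socket.gaierror as exc:
        # This is the failure a client sees when a name cannot be resolved.
        raise LookupError(f"DNS resolution failed for {hostname!r}") from exc
    return sorted({record[4][0] for record in records})
```

Calling `resolve("www.wired.com")` on a machine with working DNS returns that site's current addresses. When resolution breaks, the call raises an error instead, and the client has no server to connect to even if the server itself is healthy.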
“Based on our investigation, this issue appears to be related to the DNS resolution of the DynamoDB API endpoint in US-EAST-1,” AWS wrote in Monday’s status updates. Shortly thereafter, the company added, “If you’re still experiencing problems resolving DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS cache.”
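To see why flushing a DNS cache can help after such an incident, consider a minimal sketch of a caching resolver. It assumes a simple time-to-live cache; the class name `CachingResolver`, the hostname `db.example.test`, and the addresses are all hypothetical, and real operating-system and library caches are more elaborate.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachingResolver:
    """Toy resolver that remembers answers until their time-to-live expires."""

    def __init__(self, lookup: Callable[[str], List[str]],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup          # upstream source of truth (assumed)
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def resolve(self, hostname: str) -> List[str]:
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[0]:
            # Cached answer is still "fresh", even if upstream has changed.
            return entry[1]
        addresses = self._lookup(hostname)
        self._cache[hostname] = (now + self._ttl, addresses)
        return addresses

    def flush(self) -> None:
        """Drop every cached answer so the next lookup goes upstream."""
        self._cache.clear()


# Usage: upstream starts with a bad answer, then gets corrected.
upstream = {"db.example.test": ["192.0.2.99"]}
resolver = CachingResolver(lambda host: upstream[host], ttl_seconds=300)
stale = resolver.resolve("db.example.test")
upstream["db.example.test"] = ["192.0.2.10"]          # provider fixes the record
still_stale = resolver.resolve("db.example.test")     # cache hides the fix
resolver.flush()
fresh = resolver.resolve("db.example.test")           # fix is now visible
```

In this sketch the client keeps using the bad answer after the record is corrected upstream, until the cached entry expires or is flushed.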
An AWS spokesperson did not immediately respond to a request for details on the nature of the failure. DNS resolution issues can be caused by malicious activity, known as DNS hijacking, but there’s no indication that Monday’s AWS outages were the result of an attack.
“Cascading outages knocked out services all over the Internet when the system couldn’t correctly determine which server to connect to,” says Davi Ottenheimer, a longtime director of security and compliance operations and a vice president at the data infrastructure company Inrupt. “Today’s AWS outage is a classic availability problem, and we should see it more as a data integrity failure.”
The problems started around 3 am ET. By 5:22 am, AWS said it had applied “initial mitigations” that were starting to take effect. At 6:35 am, Amazon said it had fully addressed the underlying technical issues, but that “some services have a backlog of work to do, which may take longer to fully process.”
AWS has suffered other significant outages, including a major incident in 2023. Reliance on centralized cloud services from giants such as AWS, Microsoft Azure, and Google Cloud Platform has in many ways improved cybersecurity and stability around the world by creating a baseline of guardrails and best practices for all customers. But this standardization comes with major trade-offs, as each platform becomes a single point of failure for large swaths of critical services.
“Failures increasingly trace back to integrity,” says Ottenheimer. “Corrupted data, failed validation, or in this case, broken name resolution poisoning every downstream dependency. Until we better understand and protect integrity, our overall focus on uptime is an illusion.”