In the early hours of Monday, 20 October 2025, Amazon Web Services experienced a disruption that affected a considerable portion of the global internet. The event originated in the US East 1 region (us-east-1) in Northern Virginia. This region holds a position of particular importance within the AWS global network, for it is one of the oldest and most heavily utilised regions. A vast number of services, commercial applications, public websites and private systems rely upon it either directly or indirectly.

The first sign of trouble appeared shortly after three o’clock in the morning Eastern Time. Monitoring systems began recording increases in error rates, delayed responses to service requests and instances in which requests returned no response at all. These conditions affected consumer applications, enterprise systems, automation networks and a wide array of hosted services. The official AWS service health dashboard soon confirmed that multiple services in US East 1 were experiencing significant issues related to elevated error rates and overloaded service endpoints.

What failed

AWS later clarified that the disruption could be traced to two internal faults that occurred in close succession. The first involved the Domain Name System resolution layer for the DynamoDB service endpoint. DNS serves as the directory of the internet, converting human-readable names into the numeric addresses that machines use to reach one another. If a DNS record cannot be resolved, requests cannot be routed to the appropriate destination, and in this case applications attempting to reach DynamoDB were simply unable to locate the service. The second fault arose in the internal subsystem responsible for monitoring the health of load balancers within the EC2 network fabric. These load balancers distribute requests among healthy servers so that no single resource becomes overwhelmed. When the health-monitoring subsystem failed to relay correct signals, the network could not determine which targets were healthy and available.
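To make the DNS point concrete, the short sketch below resolves a regional DynamoDB endpoint name and reports when resolution fails. It uses only the Python standard library; the endpoint name follows AWS's published regional pattern, and the failure branch simply illustrates what clients experience when lookups stop returning usable answers.

    import socket

    def resolve(hostname: str) -> list[str]:
        """Return the IPv4 addresses a hostname resolves to; raises socket.gaierror on DNS failure."""
        infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    endpoint = "dynamodb.us-east-1.amazonaws.com"  # the regional DynamoDB endpoint name
    try:
        print(endpoint, "->", resolve(endpoint))
    except socket.gaierror as exc:
        # When resolution fails, the service may be running but clients cannot find it.
        print("DNS resolution failed:", exc)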

When these two faults occurred together, the disruption intensified. In normal circumstances the failure of one subsystem may be mitigated by internal fallback behaviour. In this situation, however, the failure of the DNS layer prevented applications from routing correctly, while the failure of the load balancer health signalling prevented AWS from restoring service paths with the speed usually expected. The result was a broad and persistent interruption.
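One way applications can protect themselves from this kind of compound failure is to fall back to a secondary region when the primary endpoint cannot be reached. The sketch below assumes a DynamoDB global table replicated to us-west-2, which is an assumption about the application's own setup rather than anything AWS provides by default; it uses boto3 with short timeouts so the fallback engages quickly.

    import boto3
    from botocore.config import Config
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGIONS = ("us-east-1", "us-west-2")  # primary first, then an assumed replica region

    def get_item_with_fallback(table_name: str, key: dict) -> dict:
        """Read from the primary region; if it cannot be reached, try the replica."""
        last_error = None
        for region in REGIONS:
            client = boto3.client(
                "dynamodb",
                region_name=region,
                config=Config(connect_timeout=2, read_timeout=2,
                              retries={"max_attempts": 2, "mode": "standard"}),
            )
            try:
                return client.get_item(TableName=table_name, Key=key)
            except (EndpointConnectionError, ClientError) as exc:
                last_error = exc  # record the failure and move on to the next region
        raise last_error

    # Example call, assuming a table named "sessions" with a string partition key "pk".
    # item = get_item_with_fallback("sessions", {"pk": {"S": "user-123"}})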

What it broke

The observable effects extended well beyond Northern Virginia. Numerous applications and services that reside in regions outside the United States became inaccessible to users in Europe, Asia and South America. This occurred because many applications rely upon US East 1 for either primary operations or central coordination tasks such as authentication, identity services or database calls. Even if the majority of an application’s content is physically hosted in other parts of the world, dependency chains may still reach into this region. When the chain breaks at one critical point, the entire system may be brought to a halt.

Reports appeared regarding interruptions in banking applications, mobile payment services, e-commerce platforms and media streaming services. Popular consumer communication platforms encountered login failures, message retrieval issues or an inability to load user feeds. Enterprises that rely upon AWS for internal data processing reported delays in batch operations, order handling and internal service automation. Home automation products, particularly those that depend upon continuous cloud connectivity to respond to voice commands or remote actions, also experienced disruption. This served as a reminder that the modern household, no less than the modern workplace, now depends upon cloud services in ways that are often invisible until failure occurs.

How recovery progressed

AWS engineers began work on mitigation shortly after the first faults were detected. The initial step involved isolating the incorrect DNS resolution behaviour and bringing alternative resolution paths online. At the same time, they undertook efforts to restore accurate health signalling within the EC2 network. During this phase, AWS temporarily limited the rate at which new EC2 instances could be launched in the affected region. This measure was intended to prevent additional strain on the recovering infrastructure and to maintain a stable environment for service restoration. Such throttling is not undertaken lightly, for it disrupts systems that scale dynamically. However, it is sometimes necessary to preserve overall system integrity while repair work proceeds.
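For systems that scale dynamically, launch throttling of this kind surfaces as transient request failures, and the usual remedy is capped exponential backoff with jitter rather than immediate retries. The helper below is a generic sketch; the exception it catches is a placeholder and should be replaced with the throttling error your SDK actually raises.

    import random
    import time

    def call_with_backoff(operation, max_attempts: int = 6,
                          base_delay: float = 1.0, max_delay: float = 60.0):
        """Retry `operation` with capped exponential backoff and full jitter."""
        for attempt in range(max_attempts):
            try:
                return operation()
            except Exception:  # placeholder: catch your SDK's throttling exception here
                if attempt == max_attempts - 1:
                    raise
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(random.uniform(0.0, delay))  # full jitter avoids synchronised retries

Spreading retries in this way keeps clients from hammering a recovering control plane all at once.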

From the first spike in error rates to general recovery, the disruption lasted approximately fifteen hours. During the later portion of that window, many services began gradually returning to stability. Full recovery required time due to the need to clear backlog queues, re-initialise certain service relationships and restore consistent data replication. Some organisations observed lingering effects several hours beyond the official resolution as their systems synchronised and load patterns normalised.

Lessons for architects and operators

  • Reduce concentration risk. Do not allow a single region to anchor identity, control planes or critical storage.
  • Design true multi-region behaviour. Replication is not enough. Address identity, session state, routing logic and write strategy.
  • Plan graceful degradation. Permit read-only modes, fallback caches and deferred writes so that service remains useful during partial failure (a minimal sketch follows this list).
  • Map dependencies. Maintain a live inventory of upstream and downstream services, including DNS, queueing, storage and control endpoints.
  • Drill failover in production-like conditions. Test live traffic migration, capacity headroom and throttling effects.
  • Monitor the invisible layers. Track DNS response quality, resolver health, load balancer status and internal health signals in addition to user-facing metrics.
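The graceful degradation point above can be made concrete with a read-through cache that serves stale data when the backing store fails. The sketch below is illustrative: `fetch` stands in for any read against a primary data store, and the time-to-live and error handling would need tuning for a real service.

    import time

    class StaleOnErrorCache:
        """Read-through cache that serves stale entries when the backing store fails."""

        def __init__(self, fetch, ttl_seconds: float = 60.0):
            self._fetch = fetch            # callable: key -> value, may raise on failure
            self._ttl = ttl_seconds
            self._entries = {}             # key -> (value, stored_at)

        def get(self, key):
            entry = self._entries.get(key)
            now = time.monotonic()
            if entry and now - entry[1] < self._ttl:
                return entry[0]            # fresh hit
            try:
                value = self._fetch(key)   # normal read path
            except Exception:
                if entry is not None:
                    return entry[0]        # degrade: serve stale data rather than fail
                raise                      # nothing cached; surface the error
            self._entries[key] = (value, now)
            return value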

What to fix next

Teams should begin with an audit that names every dependency that points into US East 1. Catalogue authentication systems, data stores, queues and control services. For each item, document its alternate path; if none exists, define one. Where possible, separate reads from writes so that losing the write path does not take down the entire service. Introduce caching with clear timeouts and limits. Keep error messages simple and useful so that users understand whether to retry, to wait, or to try a reduced feature path.
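A dependency audit need not be elaborate to be useful. The sketch below shows one way to keep such an inventory as code so that gaps are easy to query; every service name and alternate path shown is illustrative.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Dependency:
        name: str                      # illustrative names only
        kind: str                      # "auth", "datastore", "queue" or "control"
        primary_region: str
        alternate_path: Optional[str]  # None marks a single point of failure

    INVENTORY = [
        Dependency("user-auth", "auth", "us-east-1", None),
        Dependency("orders-table", "datastore", "us-east-1", "replica in eu-west-1"),
        Dependency("email-queue", "queue", "us-east-1", "buffer locally and drain later"),
    ]

    # Anything without an alternate path belongs at the top of the remediation list.
    single_points = [d.name for d in INVENTORY if d.alternate_path is None]
    print("No alternate path:", single_points)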

From there, build a failover runbook that any trained engineer can follow. Include conditions for invocation, traffic migration steps, a communications plan and rollback steps. Practise on a schedule. Measure the time from incident start to user confirmation of recovery. After each drill, adjust thresholds and steps so that the plan improves.
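A runbook kept as structured data is easier to drill against and to measure. The skeleton below is a hypothetical example of that idea; the invocation conditions, steps and timestamps are placeholders to be replaced with an organisation's own.

    from datetime import datetime, timezone

    RUNBOOK = {
        "invoke_when": [
            "regional error rate exceeds the agreed threshold for ten minutes",
            "provider status page confirms a regional incident",
        ],
        "steps": [
            "freeze deployments",
            "shift traffic to the secondary region at DNS or the load balancer",
            "verify reads and writes in the secondary region",
            "publish a status update to users",
        ],
        "rollback": [
            "return traffic once the primary region is stable",
            "reconcile any deferred writes",
        ],
    }

    def minutes_to_recovery(started_at: datetime, user_confirmed_at: datetime) -> float:
        """The metric suggested above: time from incident start to user-confirmed recovery."""
        return (user_confirmed_at - started_at).total_seconds() / 60.0

    # Example drill measurement with placeholder timestamps.
    print(minutes_to_recovery(datetime(2025, 11, 3, 9, 0, tzinfo=timezone.utc),
                              datetime(2025, 11, 3, 9, 47, tzinfo=timezone.utc)), "minutes")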

Closing note

The October 2025 AWS outage serves as a sober reminder that our networks run on layers that are often unseen until they fail. DNS and load balancer health signals do not call attention to themselves during normal hours, yet they hold much of the structure together. When they falter, consequences arrive quickly and without ceremony. The sensible response is not panic but craft. Map the system. Remove silent single points of failure. Teach the service to bend without breaking. Do the work before the next storm.

Cloud AWS Resilience DNS Operations