When the Cloud Sneezes: a look at the ‘Outage Season’

The past few months have been a bruising reminder that even the biggest cloud providers can stumble. AWS, Microsoft Azure, and Cloudflare have all suffered major outages, disrupting the services that rely on them, from websites and shopping sites to CRM and finance systems and AI tools. For businesses (and their customers and users) these outages cause huge problems: they lower confidence, disrupt services and damage reputation. Many are asking what happened, why, and what can be done about it.

The Outages – What Happened?

There have been a fair few this autumn, some causing performance issues, others having a huge impact and bringing down lots of services. The most recent examples include:

  • Amazon Web Services (20 October 2025): A DNS automation bug in the US‑East‑1 region corrupted internal records for DynamoDB, cascading into failures across Lambda, API Gateway, and thousands of downstream apps which then replicated across other regions. 
  • Microsoft Azure (29 October 2025): A faulty configuration in Azure Front Door’s CDN nodes caused global downtime, impacting Microsoft 365, Xbox Live, and airline systems. 
  • Cloudflare (two recent ones – 18 November & 5 December 2025): According to their support pages, the first outage was caused by a bot‑management configuration file that grew beyond expected limits, crashing traffic‑handling software worldwide. The more recent one, last week, was different: a firewall update triggered a bug, disrupting almost a third (29%) of HTTP traffic globally.


Whilst we hear a lot about cyber threats, these outages (much like the CrowdStrike problem a couple of years ago) were not cyber-attacks or capacity failures. They were internal configuration and metadata errors. The underlying cause may have been poor change management, inadequate failback and validation controls, or something else.

What are these Providers doing about it?

As you’d imagine, these major cloud providers need to tell their customers what is going on and provide assurance as to their stability and ability to recover. Following an outage, each is expected to update customers on the root cause and on its long-term strategy to improve resilience (and remember, every cloud service suffers outages from time to time, for a number of reasons). As an example:

  • AWS have said they are rolling out their “Route 53 Accelerated Recovery” and will be partnering with Google Cloud for multi‑cloud failover. This will be part of their premium resiliency services.
  • Microsoft have committed to strengthening their change‑management processes, checks and controls and are publishing detailed post‑mortems.
  • Cloudflare, in a similar vein to Microsoft, are adding extra guardrails that monitor more controls and configuration parameters to prevent issues such as oversized configuration files. They have also committed to improving their web application firewall testing and failover services.
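
To make the idea of a configuration guardrail concrete, here is a minimal sketch of a pre-rollout check that rejects an oversized generated file and keeps the last known good version instead. This is not how Cloudflare or any other provider actually implements it; the function names and the limits (`max_entries`, `max_bytes`) are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    ok: bool
    reason: str


def validate_config(entries, max_entries=200, max_bytes=64_000):
    """Check a generated config against hard limits before it is rolled out.

    max_entries and max_bytes are hypothetical limits, not real provider values.
    """
    if len(entries) > max_entries:
        return ValidationResult(
            False, f"too many entries: {len(entries)} > {max_entries}"
        )
    size = sum(len(entry.encode("utf-8")) for entry in entries)
    if size > max_bytes:
        return ValidationResult(
            False, f"config too large: {size} > {max_bytes} bytes"
        )
    return ValidationResult(True, "within limits")


def choose_config(candidate, last_known_good, **limits):
    """Return the candidate if it passes validation, else the last known good."""
    result = validate_config(candidate, **limits)
    return (candidate if result.ok else last_known_good), result


if __name__ == "__main__":
    current = ["rule-a", "rule-b"]
    oversized = [f"rule-{i}" for i in range(500)]
    chosen, result = choose_config(oversized, current)
    print(result.reason, "| entries in use:", len(chosen))
```

The point of the design is that a bad file fails closed at validation time, so the software consuming it never sees input outside the limits it was built for.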

How can organisations be better prepared?

It’s always tough when a business relies on cloud providers to host and power its operations: its hands can be tied, and it can be left a little helpless when outages or performance issues occur. Whilst the benefits of cloud are not being disputed here, businesses can find themselves in tricky situations, having to answer to their customers, the board and shareholders, as outages can cause everything from mild inconvenience to loss of value and reputational damage.

A multi-cloud strategy can of course help by mitigating some of the impact of an outage at one provider, and a hybrid (on-prem/cloud) approach can do the same. Each, however, adds a huge amount of complexity and cost to infrastructure management, so it’s about weighing the cost of the mitigation against the likely cost and impact of an outage.
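
One rough way to frame that trade-off is as simple arithmetic: estimate the expected yearly cost of outages with and without the mitigation, and compare the saving with what the mitigation costs to run. The sketch below assumes a deliberately simple model (outages per year × hours per outage × cost per hour), and every figure in it is made up for illustration.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Very simple expected-loss model: frequency x duration x hourly impact."""
    return outages_per_year * hours_per_outage * cost_per_hour


def mitigation_worthwhile(baseline_cost, residual_cost, mitigation_cost):
    """True if the saving from the mitigation exceeds its yearly cost."""
    return (baseline_cost - residual_cost) > mitigation_cost


if __name__ == "__main__":
    # Hypothetical figures: 2 outages a year, 4 hours each, 10,000 lost per hour.
    baseline = expected_annual_outage_cost(2, 4, 10_000)
    # Assume failover cuts each outage to half an hour of impact.
    residual = expected_annual_outage_cost(2, 0.5, 10_000)
    print("saving:", baseline - residual)
    print("worth it at 120,000 a year?", mitigation_worthwhile(baseline, residual, 120_000))
    print("worth it at 50,000 a year?", mitigation_worthwhile(baseline, residual, 50_000))
```

A real assessment would also need to account for reputational damage, contractual penalties and the operational overhead of running a second platform, none of which this model captures.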

So, what else can organisations do then?

Visibility and Early Warnings

With any cloud service, it is often the visibility (or lack of it) of what is happening that is most daunting for organisations and for IT. Just like tsunami warnings, or those sniffles that let you know a cold or flu is coming, there are services that provide a holistic view across all your cloud services and help IT understand their current, predicted and historical performance and reliability.

Cisco ThousandEyes, as an example, is an incredibly powerful service that provides end‑to‑end visibility across Internet Service Providers (ISPs), SaaS, and cloud providers.

It helps detect outages early, can pinpoint root causes (DNS, CDN, SaaS, routing), and can prove whether performance issues or service outages are external or internal. There are also plug-ins for desktop devices and browsers that can do this for remote and cloud users.

ThousandEyes also offers digital experience monitoring for apps like Microsoft 365, ServiceNow, Salesforce, and Zoom, and it has AI‑driven insights to flag risks before they cascade.

It can’t fix the problem as such, but if you do have failovers, contingencies or DR in place, it can help organisations prepare, understand where the issue lies, communicate with leadership, customers and other stakeholders, and soothe (slightly) the stress of diagnosing the issue.
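
The "external or internal" question comes down to comparing what different vantage points can see. The sketch below shows that reasoning in its simplest form. It does not use or represent the ThousandEyes API; the function, the vantage point names and the classification labels are all hypothetical.

```python
def classify_outage(results, internal_vantage_points):
    """Classify an outage from per-vantage-point probe results.

    results maps a vantage point name to True (service reachable) or False.
    internal_vantage_points names the probes that sit inside your own network.
    """
    failed = {name for name, ok in results.items() if not ok}
    if not failed:
        return "healthy"

    external = set(results) - set(internal_vantage_points)
    if not external:
        # With no outside view there is nothing to compare against.
        return "inconclusive"

    external_failed = failed & external
    if not external_failed:
        return "likely internal"
    if external_failed == external:
        return "likely external"
    return "partial or regional"


if __name__ == "__main__":
    probes = {"head-office": False, "london-cloud": True, "frankfurt-cloud": True}
    print(classify_outage(probes, {"head-office"}))
```

Even a crude classification like this changes the conversation with stakeholders: "the provider is down everywhere" and "only our office cannot reach it" call for very different responses.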

Failover Services

Of course, visibility and understanding are great, but depending on the risk and cost of these outages, many organisations (including many giant content providers) are looking at ways to add more contingency and failover.

Where the issues lie with DNS or content delivery providers, many organisations look at using multiple CDN/DNS providers with automatic failover. This can help keep traffic flowing by rerouting when one provider fails.
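
As a rough illustration of how automatic failover decides where to send traffic, the sketch below keeps a priority-ordered list of providers and switches away from one only after several consecutive failed health checks, which avoids flapping on a single blip. The class, the provider names and the threshold are hypothetical; real multi-CDN and DNS failover products have their own health-check and routing logic.

```python
class FailoverRouter:
    """Pick the highest-priority provider that has not failed too often."""

    def __init__(self, providers, failure_threshold=3):
        # Providers are listed in priority order, primary first.
        self.providers = list(providers)
        self.failure_threshold = failure_threshold
        self.failures = {provider: 0 for provider in self.providers}

    def record(self, provider, healthy):
        """Record one health-check result; a success resets the counter."""
        if healthy:
            self.failures[provider] = 0
        else:
            self.failures[provider] += 1

    def active(self):
        """Return the provider traffic should use, or None if all are down."""
        for provider in self.providers:
            if self.failures[provider] < self.failure_threshold:
                return provider
        return None


if __name__ == "__main__":
    router = FailoverRouter(["cdn-a", "cdn-b"], failure_threshold=2)
    router.record("cdn-a", healthy=False)
    router.record("cdn-a", healthy=False)
    print("routing via:", router.active())
```

Because a successful check resets the counter, traffic returns to the primary as soon as it recovers; a production setup would usually add a hold-down period before failing back.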

These tools often provide their own native cloud monitoring. Whilst not as in-depth as ThousandEyes, they typically offer dashboards showing telemetry and the status of these services, and can instigate failover automatically or on demand.

Cloud providers also offer their own status and performance monitoring tools, often with in-depth telemetry, but limited of course to their own platforms.

Chaos Engineering

This is an approach organisations will typically employ as part of their DR/BC planning. Chaos Engineering is essentially the practice of simulating outages so that recovery plans can be validated (and improved) before the real thing happens. It needs cultural buy-in, a continuous-improvement approach and awareness across different aspects of the environment to understand the impact such outages can have. Again, visibility tools can help.
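
At its smallest scale, a chaos experiment means deliberately breaking a dependency and checking that the fallback behaves as the recovery plan says it should. The sketch below is a toy version of that idea, not a substitute for a proper chaos engineering tool: `FlakyDependency`, `fetch_price_with_fallback` and the pricing example are all hypothetical.

```python
import random


class FlakyDependency:
    """Wrap a call and make it fail with a given probability."""

    def __init__(self, real_call, failure_rate, rng=None):
        self.real_call = real_call
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args):
        # random() is in [0, 1), so a rate of 1.0 always fails and 0.0 never does.
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return self.real_call(*args)


def fetch_price_with_fallback(primary, cache, item):
    """Try the primary service; on failure, serve the last cached value."""
    try:
        price = primary(item)
        cache[item] = price
        return price
    except ConnectionError:
        return cache.get(item)


if __name__ == "__main__":
    cache = {"widget": 9.99}
    outage = FlakyDependency(lambda item: 10.50, failure_rate=1.0)
    # During the simulated outage a cached item is still served,
    # while an item that was never cached exposes a gap in the plan.
    print(fetch_price_with_fallback(outage, cache, "widget"))
    print(fetch_price_with_fallback(outage, cache, "gadget"))
```

The value of the exercise is in the second case: the experiment surfaces a gap (no cached value) in a controlled setting rather than during a real outage.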

Insurance Safety Net

Even with resilience tools, redundancy and well-architected failover solutions in place, outages will always happen, and this can mean lost revenue. Many “tech” insurers offer cyber insurance, with many also offering “business interruption cover”, which compensates for lost earnings due to outages caused by IT failures. I’ve not gone into detail here, but some of these include:

  • Hiscox UK: Offers optional cyber business interruption cover, including income loss if systems fail. 
  • Clarke Williams Insurance Brokers: Provides Dependent Business Interruption (DBI) coverage for third‑party service failures, such as internet or cloud outages. 
  • Grove & Dean Insurance Brokers: Offers cyber insurance with explicit business interruption protection, covering lost income due to downtime from cyber incidents. 

This means UK businesses can not only plan for resilience but also insure against the financial impact of outages. Of course, a continued increase in outages may push these insurance premiums up.

Conclusion

The recent cloud outages prove that cloud isn’t “always on”, despite the various uptime SLAs (sometimes financially backed) that providers promise. Providers are tightening controls, but businesses must assume failure and design for resilience. Whilst observability tools like Cisco ThousandEyes can’t stop an outage, they do help you see it faster, understand it better, and recover more effectively.

Combined with multi‑cloud strategies, failover planning, chaos testing, and cyber insurance with business interruption cover, organisations can turn inevitable outages into manageable events, protecting both operations and earnings.