When the Cloud Sneezes: a look at the ‘Outage Season’

The past few months have been a bruising reminder that even the biggest cloud providers can stumble. AWS, Microsoft Azure, and Cloudflare have all suffered major outages, disrupting the services that rely on them, from websites and shopping sites to CRM and finance systems and AI tools. For businesses (and their customers and users) these outages cause huge problems, lower confidence, and damage services and reputation. Many are asking what happened, why, and what can be done about it.

The Outages – What Happened?

There have been a fair few this autumn, some causing performance issues, others having a huge impact and bringing down lots of services. The most recent examples include:

  • Amazon Web Services (20 October 2025): A DNS automation bug in the US‑East‑1 region corrupted internal records for DynamoDB, cascading into failures across Lambda, API Gateway, and thousands of downstream apps, with the impact then spreading to other regions.
  • Microsoft Azure (29 October 2025): A faulty configuration in Azure Front Door’s CDN nodes caused global downtime, impacting Microsoft 365, Xbox Live, and airline systems. 
  • Cloudflare (two recent ones – 18 November & 5 December 2025): According to their support pages, the first outage was caused by a bot‑management configuration file that grew beyond expected limits, crashing traffic‑handling software worldwide. The more recent one, last week, was different: a firewall update triggered a bug, disrupting almost a third (29%) of HTTP traffic globally.


Whilst we hear a lot about cyber threats, these outages (much like the CrowdStrike problem a couple of years ago) were not cyber-attacks or capacity failures. They were internal configuration and metadata errors. The cause may have been poor change management, inadequate failback and validation controls, or something else!

What are these Providers doing about it?

As you’d imagine, these major cloud providers need to tell their customers what is going on and provide assurance as to their stability and ability to recover. Following these outages (and remember, every cloud service suffers outages from time to time, for a number of reasons), each provider is required to update customers on the root cause and on its long-term strategy to improve resilience. For example:

  • AWS have said they are rolling out their “Route 53 Accelerated Recovery” and will be partnering with Google Cloud for multi‑cloud failover. This will be part of their premium resiliency services.
  • Microsoft have committed to strengthening their change‑management processes, checks and controls and are publishing detailed post‑mortems.
  • Cloudflare, in a similar vein to Microsoft, are adding extra guardrails that monitor more controls and configuration parameters to prevent issues such as oversized configuration files. They have also committed to improving their web application firewall testing and failover services.
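
To make the idea of a configuration guardrail a little more concrete, here is a minimal sketch of a pre-rollout check that rejects a file that is too large or has too many entries. It is not how any of these providers actually do it: the `validate_config` function, the JSON list format and both limits are assumptions made purely for illustration.

```python
import json

# Hypothetical limits: real values depend on what the consuming software can handle.
MAX_BYTES = 1_000_000
MAX_ENTRIES = 200


def validate_config(raw: bytes, max_bytes: int = MAX_BYTES,
                    max_entries: int = MAX_ENTRIES) -> list[str]:
    """Return a list of problems; an empty list means the file may be rolled out."""
    problems = []
    if len(raw) > max_bytes:
        problems.append(f"file is {len(raw)} bytes, limit is {max_bytes}")
    try:
        entries = json.loads(raw)
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        return problems + [f"not valid JSON: {exc}"]
    if not isinstance(entries, list):
        problems.append("expected a JSON list of entries")
    elif len(entries) > max_entries:
        problems.append(f"{len(entries)} entries, limit is {max_entries}")
    return problems
```

The point of the design is that the check runs before the file is distributed, so an oversized file is stopped at the gate rather than discovered by the software that has to load it.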

How can organisations be better prepared?

It’s always tough when a business relies on cloud providers to host and power its operations: its hands can be tied, leaving it a little helpless when outages or performance issues occur. Often it’s the lack of visibility of what is happening that is most daunting for organisations and for IT. Whilst the benefits of cloud are not being disputed here, businesses can find themselves in tricky situations, having to answer to their customers, the board and shareholders, as outages can cause everything from mild inconvenience to lost value and reputational damage.

A multi-cloud strategy can of course help by mitigating some of the impact when one provider has an outage, and a hybrid (on-prem/cloud) approach can do the same. Each of these, however, adds a huge amount of complexity and cost to infrastructure management, so it’s about weighing the cost of that extra resilience against the likely cost of an outage.
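
As a rough illustration of that weighing-up, the sketch below compares the expected yearly loss from outages with and without a mitigation in place. Every figure in it (outage frequency, duration, hourly cost, cost of the mitigation) is hypothetical, and the function names are made up for this example.

```python
def expected_annual_outage_cost(outages_per_year: float,
                                hours_per_outage: float,
                                cost_per_hour: float) -> float:
    """Expected yearly loss: frequency x duration x hourly cost of downtime."""
    return outages_per_year * hours_per_outage * cost_per_hour


def mitigation_pays_off(baseline_cost: float,
                        mitigated_cost: float,
                        annual_mitigation_cost: float) -> bool:
    """True when the loss avoided is larger than what the mitigation costs to run."""
    return (baseline_cost - mitigated_cost) > annual_mitigation_cost


# Hypothetical figures, for illustration only.
baseline = expected_annual_outage_cost(2, 4.0, 10_000)       # 80,000
with_failover = expected_annual_outage_cost(2, 0.5, 10_000)  # 10,000
print(mitigation_pays_off(baseline, with_failover, annual_mitigation_cost=50_000))  # True
```

A real assessment would also need to account for reputational damage and the operational overhead of the extra complexity, neither of which reduces neatly to a single number.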

So, what else can organisations do then?

Visibility and Early Warnings

With any cloud service, it is often the lack of visibility of what is happening that is most daunting for organisations and for IT. Just like tsunami warnings, or those sniffles that let you know a cold or flu is coming, there are services that provide a holistic view across all your cloud services and help IT understand their current, predicted and past performance and reliability.

Cisco ThousandEyes, as an example, is an incredibly powerful service that provides end‑to‑end visibility across Internet Service Providers (ISPs), SaaS, and all cloud providers.

It helps detect outages early, can pinpoint root causes (DNS, CDN, SaaS, routing), and can prove whether performance issues or service outages are external or internal. There are also plug-ins for desktop devices and browsers that can do this for remote and cloud users.

ThousandEyes also offers digital experience monitoring for apps like Microsoft 365, ServiceNow, Salesforce, and Zoom, and has AI‑driven insights to flag risks before they cascade.

It can’t fix the problem as such, but if you do have failovers, contingencies or DR in place, it can help organisations prepare, understand where the issue is, communicate with leadership, customers and other stakeholders, and soothe the stress (slightly) of diagnosing the issue.

Failover Services

Of course, visibility and understanding are great, but depending on the risk and cost of these outages, many organisations (including many giant content providers) are looking at ways to add more contingency and failover.

Where the issues are with DNS or content delivery providers, many organisations look at using multi‑CDN or multi‑DNS providers with automatic failover. This can help keep traffic flowing by rerouting when one provider fails.

These tools often provide their own native cloud monitoring. Whilst not as in-depth as ThousandEyes, they will typically offer dashboards showing telemetry and the status of these services, and can instigate failover automatically or on demand.
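
The failover decision itself can be sketched in a few lines. This is a simplified illustration, not the logic of any particular product: the `Provider` class, the priority-ordered list and the threshold of three consecutive failed health checks are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed threshold: fail over after this many consecutive failed health checks.
FAILURE_THRESHOLD = 3


@dataclass
class Provider:
    name: str
    consecutive_failures: int = 0


def record_check(provider: Provider, healthy: bool) -> None:
    """Update the failure counter; a healthy check resets it."""
    provider.consecutive_failures = 0 if healthy else provider.consecutive_failures + 1


def pick_provider(providers: list[Provider]) -> Optional[Provider]:
    """Return the first provider, in priority order, that is still considered healthy."""
    for provider in providers:
        if provider.consecutive_failures < FAILURE_THRESHOLD:
            return provider
    return None  # every provider is down


primary, secondary = Provider("cdn-a"), Provider("cdn-b")  # hypothetical names
for _ in range(3):
    record_check(primary, healthy=False)
print(pick_provider([primary, secondary]).name)  # cdn-b
```

Requiring several consecutive failures before switching is a common way to avoid flapping between providers on a single dropped probe, at the cost of reacting a little more slowly.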

Cloud providers also offer their own status and performance monitoring tools, often with in-depth telemetry, but limited of course to their own platforms.

Chaos Engineering

This is an approach organisations will typically employ as part of their DR/BC planning. Chaos Engineering is essentially the process of simulating outages so that recovery plans can be validated (and improved) before the real thing happens. This of course needs cultural buy-in, a continuous improvement approach, and awareness across different aspects of the environment to understand the impact such outages can have. Again, visibility tools can help.
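
At its simplest, a chaos experiment injects failures into a dependency and checks that the fallback still serves users. The sketch below is a toy version under stated assumptions: the pricing service, the cache fallback and all the names (`flaky`, `get_price`, `run_experiment`) are invented for illustration, and real experiments run against real infrastructure with far more care.

```python
import random


class DependencyDown(Exception):
    """Raised when the simulated dependency is made to fail."""


def flaky(call, failure_rate: float, rng: random.Random):
    """Wrap a call so that it fails with the given probability (fault injection)."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown("injected failure")
        return call(*args, **kwargs)
    return wrapped


def get_price(fetch, cache: dict, item: str):
    """Fetch a price, falling back to the last cached value if the dependency is down."""
    try:
        cache[item] = fetch(item)
        return cache[item]
    except DependencyDown:
        return cache.get(item)  # None if there is nothing to fall back on


def run_experiment(failure_rate: float, warm_cache: bool,
                   requests: int = 1000, seed: int = 42) -> float:
    """Return the fraction of requests that were still served."""
    rng = random.Random(seed)
    fetch = flaky(lambda item: 9.99, failure_rate, rng)
    cache = {"widget": 9.99} if warm_cache else {}
    served = sum(get_price(fetch, cache, "widget") is not None for _ in range(requests))
    return served / requests


print(run_experiment(failure_rate=1.0, warm_cache=True))   # 1.0
print(run_experiment(failure_rate=1.0, warm_cache=False))  # 0.0
```

The second result is the useful one: it shows that the fallback only works if the cache has already been populated, which is the kind of gap these exercises are meant to surface before a real outage does.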

Insurance Safety Net

Even with resilience tools, redundancy and well-architected failover solutions in place, outages will still happen, and this can mean lost revenue. Many “tech” insurers offer cyber insurance, with many also offering “business interruption cover” that compensates for lost earnings due to outages caused by IT failures. I’ve not gone into detail here, but some of these include:

  • Hiscox UK: Offers optional cyber business interruption cover, including income loss if systems fail. 
  • Clarke Williams Insurance Brokers: Provides Dependent Business Interruption (DBI) coverage for third‑party service failures, such as internet or cloud outages. 
  • Grove & Dean Insurance Brokers: Offers cyber insurance with explicit business interruption protection, covering lost income due to downtime from cyber incidents. 

This means UK businesses can not only plan for resilience but also insure against the financial impact of outages. Of course, a continued rise in outages may push these insurance premiums up.

Conclusion

The recent cloud outages prove that cloud isn’t “always on”, despite the various uptime SLAs (sometimes financially backed) that providers promise. Providers are tightening controls, but businesses must assume failure and design for resilience. Whilst observability tools like Cisco ThousandEyes can’t stop an outage, they do help you see it faster, understand it better, and recover more effectively.
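
To put those SLA percentages into perspective, the short calculation below converts an uptime figure into the downtime it still permits. The 30-day period is an assumption; providers define their own measurement windows and exclusions.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime an uptime SLA still permits over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.0, 99.9, 99.99):
    print(f"{sla}% uptime allows about {allowed_downtime_minutes(sla):.2f} minutes per 30 days")
# 99.0%  -> about 432.00 minutes
# 99.9%  -> about 43.20 minutes
# 99.99% -> about 4.32 minutes
```

Even a “three nines” service can therefore be down for over 40 minutes a month and remain within its SLA.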

Combined with multi‑cloud strategies, failover planning, chaos testing, and cyber insurance with business interruption cover, organisations can turn inevitable outages into manageable events, protecting both operations and earnings.

The end of Windows Server 2012 – Band-aid it or innovate it?

What has happened?

Support for Windows Server 2012 and Windows Server 2012 R2 ended on 10 October 2023.

This means that the security updates rolled out in this month’s Patch Tuesday were the last for Windows Server 2012: there will be no more security updates, non-security updates, bug fixes or technical support.

What are my options?

With any end-of-support milestone, there are always options. In short, these can be summarised as:

  1. Do nothing [not the best idea]
  2. Upgrade to a supported version of Windows Server [this means upgrading to Windows Server 2022]
  3. Purchase Extended Security Updates (ESUs) for Windows Server 2012 [these provide one to three years of security updates only, with no new features or bug fixes]
  4. Migrate the on-prem 2012 servers to Azure [by doing this you receive up to three years of Extended Security Updates (ESUs) for free]

Option four is a logical choice for most from an operational, cost and sustainability perspective, besides of course mitigating the immediate increased security risk (with free security updates for three years).

So why is now the right time to migrate and modernise with Azure?

Shifting on-prem servers to Microsoft Azure provides many benefits, including reduced maintenance and support costs, little or no on-site power usage (good for your CO2 numbers), flexible and predictable pricing, and an opportunity to migrate and modernise the workloads running on these servers to platform-as-a-service (PaaS), for example Azure SQL or Azure App Service. You can of course migrate to Azure and still upgrade to Server 2022 if you are not ready to move to PaaS 😊

Your Azure / Cloud Partner can help

Many organisations are eligible for “migration assistance”, usually in the form of funded assistance from their Azure Migrate partner or directly from Microsoft. Depending on where you are on your cloud journey, the Azure Migrate and Modernisation Program is designed to simplify and accelerate an organisation’s cloud migration and modernisation projects by working with a certified Azure partner.

Working with Microsoft and your Azure partner (like Cisilion) can help you by providing:

  • A Proven Approach: We use best practices based on the Microsoft Cloud Adoption Framework for Azure and Well-Architected Framework at every stage of your cloud adoption journey.
  • Expert Assistance: We provide industry and hands-on guidance direct from certified Azure engineers. We help by assessing your environment, planning migrations, and supporting your transition.
  • Inclusive support: If you choose to use us as your Azure partner, we can also provide your Azure licensing through the Cloud Solution Provider (CSP) programme, which includes 24/7 support at no extra cost.
  • Cost Savings: Our expertise in cost optimisation, platform design and FinOps means that not only can we help you minimise migration costs (with funding assistance), but we also ensure right-sizing, the right licensing models and the right terms, typically saving organisations more than 38%.

Microsoft’s made Azure Single Sign-On and MFA free*.

Microsoft have announced that any customer with a subscription to one of their commercial online services (Azure, Dynamics, Office 365 etc.) can connect all their cloud applications to Azure AD for single sign-on (SSO), and protect this access with multi-factor authentication (MFA), as a huge additional security benefit at no extra cost, other than the internal (or partner) resource needed to configure and test it. Using MFA alone is proven to reduce the attack surface and prevent over 99% of breaches caused by credential theft.

Using SSO reduces the number of sign-in prompts for employees, reduces the number of different user ID and password combinations needed, and enables one-click access to the most used line-of-business applications. It should also make working remotely even easier and more secure, since user access control can be managed centrally under the protection and safeguard of Azure AD.

Microsoft has also added several other Azure AD enhancements which will help simplify identity and access management and improve the experience for all those working remotely. These include the following:

  • Streamlined identity management
  • Improved application configuration and security for Azure AD SSO
  • Seamless and secure collaboration
  • Safeguard identities with industry-leading security
  • App gallery integration