Navigating the Aftermath of the CrowdStrike Cybersecurity Outage: Insights and Strategies

I run a monthly fireside chat panel discussion with IT and business leaders from a handful of our Cisilion customers. Today, we talked about the outage and reflected on what we, the industry, and our vendors can and should do to minimise the chance of this vast impact happening again.

If you missed the "show" - you can watch it below.
September 2024 – Cisilion Fireside Chat

In our September 2024 fireside chat, our panel and I delved into the significant impact of, and lessons that can be learned from, the CrowdStrike outage in July. The outage is estimated to have cost more than $10B US and affected more than 8.5 million Windows devices when CrowdStrike distributed a faulty configuration update for its Falcon sensor software running on Windows PCs and servers.

This update featured a “modification” to a configuration file responsible for screening named pipes [Channel File 291]. The faulty update caused an out-of-bounds memory read in the Windows sensor client that resulted in an invalid page fault, leaving machines either stuck in a boot loop or booting into recovery mode.

Today’s fireside chat conversation covered a range of topics, from the immediate effects of the outage to long-term strategies for enhancing cybersecurity resilience.

The Immediate Impact of the CrowdStrike Outage

The panel began by addressing the widespread disruption caused by the CrowdStrike outage. We discussed the outage’s extensive reach, affecting millions of devices and various sectors, including healthcare, finance, and transportation. In my intro to the episode, I mentioned that “It was really hard to believe…such a small relatively trivial and small update could impact so many people, devices and organisations“. This set the stage for a deeper exploration of the outage’s implications on cybersecurity practices.

As we kicked off, I praised the collaboration between Microsoft and CrowdStrike in addressing the outage. I mentioned that despite initial blame-shifting in the media, there was a concerted effort to resolve the issue, showcasing the importance of vendor cooperation in crisis management. The panel, in short, didn’t think there was much more Microsoft could have done – the key was updates and openness, which is so critical in a global issue like this, as people and businesses need updates and answers as well as help in restoring systems, which both Microsoft and CrowdStrike provided in droves.

Vendor Reliance and Preparedness

Ken Dickie (Chief Information and Transformation Officer at Leathwaite) emphasised the importance of incident management and the world’s reliance on third-party and cloud providers. He shared his insights into the challenges of controlling the fix and how the outage revealed technology’s utility-like nature to leadership teams, noting it can be hard to explain “how little control we had over the actual fix”. Matthew Wallbridge (Chief Digital and Information Officer at Hillingdon Council) echoed the sentiment, stressing the need for preparedness and the role of people in cybersecurity, stating, “It’s less about the technology, it’s more about people.”

Supply Chain Risks

Matthew raised concerns about supply chain risks, highlighting recent attacks on media and the need for better understanding and mitigation strategies. This part of the discussion underscored the interconnected nature of cybersecurity and the potential vulnerabilities within the supply chain.

Goher Mohammed (Group Head of InfoSec at L&Q Group) mentioned that vendor reliance in the supply chain degraded their ITSM service. Whilst L&Q were not directly affected, they did experience “degraded service due to supply chain impacts”, emphasising the need for resilience and contingency plans and a review of their supply chain(s). This led to further discussion of how important supply chain validation is in our security and disaster recovery planning and co-ordination. Matt talked frequently about “control the controllable”, but urged us to ask the right questions of the vendors we can’t control.

Resilience and Disaster Recovery Planning

The conversation then shifted to strategies for enhancing resilience. Here I discussed how we at Cisilion are revisiting our own disaster recovery plans to include scenarios like the CrowdStrike outage.

We discussed at length the cost of resilience and the fact that there is a “limit” to what you can mitigate against before the cost skyrockets with very little further reduction in risk. It was agreed that many things can’t “easily” be mitigated in this particular scenario, but that we can be better prepared.

The panel talked about various strategies that “could be considered”, including recovering to “on-prem”, revisiting the considerations around multi-cloud strategies, and the potential benefits of edge computing in mitigating risks associated with device reliance.

We also discussed whether technologies such as Cloud PCs and Virtual Desktops have a part to play in recovery and preparation, as well as whether Bring Your Own Device would/could/should be a bigger part of our IT and desktop strategy, along with, of course, SASE technology to secure access.

Goher advised “do a real audit, understand the most critical assets, the impact they have further down the line and whether there is more that can be done to mitigate against outage/failure/issue“. This led us into an interesting side discussion around Secure Access Service Edge (SASE) – emphasising the “importance of not relying on trusted devices alone”.

The Human Aspect of IT Incidents

David Maskell (Head of IT and Information Security at Thatcham Research) brought a crucial perspective to the discussion, focusing on the human aspect of IT incidents. He reminded the audience of the importance of supporting IT teams during crises, highlighting the stress and pressure they face. The panel agreed with David, emphasising the importance of ensuring teams are looked after – especially when things are not directly controllable (such as with cloud outages) – and the need for good, solid communications to the business.

Ken also reflected on leadership’s reaction to the outage, emphasising the “gap in understanding the reliance on technology” that many business leaders (especially those not from a techy background) have. The days of “it’s with IT to fix” are clearly not as simple as they once were!

Conclusion: The Path Forward

As we concluded the discussion, the panel reflected on the lessons and tips to offer viewers, each other, and the industry.

In general, the guidance across the panel centred on:

  1. The importance of regular security reviews, external audits, and business continuity testing.
  2. The need to adopt a proactive stance on cyber security and technology outages, ensuring that teams are prepared (for example, by running testing and attack/outage simulations).
  3. Ask more questions of your supply chains – they may be your weakest link. Are they secure, and are their recovery plans robust?
  4. Map your critical systems and know the impact of an outage – what is the continuity plan? If devices are affected, how can people access your technology? Look at Cloud PCs (such as Windows 365), and consider whether you can support the use of personal devices (with SASE technologies such as Cisco Secure Connect).
  5. Review your technology dependencies. It’s not necessarily about multi-vendor but this might be a consideration – even for backup.

In summary, the CrowdStrike outage serves as a stark reminder of the vulnerabilities inherent in our reliance on technology and the critical need for comprehensive cybersecurity strategies.

Microsoft wants to lock down the kernel after CrowdStrike hiccup knocks out millions of Windows devices.

Windows Kernel Security - Image by Designer (AI)

Microsoft is reviewing its options and looking to push for significant changes to the Windows security architecture in the aftermath of the major outage caused by a “faulty” CrowdStrike update a couple of weeks back. The faulty update is thought to have affected around 8.5 million Windows devices and services, causing Windows devices to reboot and enter their protected recovery mode.

Microsoft acknowledges the inherent ‘tradeoff’ kernel-level cybersecurity solutions pose and confirms the root cause of the global outage.

This has prompted Microsoft to reassess the level of control that third-party security vendors have over the deepest parts of its operating system, and it is considering limiting kernel-level access for these vendors.

“This incident shows clearly that Windows must prioritize change and innovation in the area of end-to-end resilience.” | John Cable | Microsoft blog post


Time to bring control back?

John Cable, Microsoft’s VP of program management for Windows servicing and delivery, passionately set out their viewpoint in a blog post titled “Windows resiliency: Best practices and the path forward”. In this post, he emphasised the need for “end-to-end resilience” and discussed potential changes Microsoft are reviewing that could mean restricting kernel access for third-party security vendors such as CrowdStrike.

Snippet from John Cable’s blog post | July 2024


The CrowdStrike update bug, which resulted in widespread system crashes, has clearly highlighted the risks associated with allowing third-party security apps and services to operate at the kernel level – a new approach is needed.

Privileged access, though advantageous for detecting threats, can result in disastrous failures if mishandled. Microsoft is investigating alternatives that circumvent future kernel access issues, including VBS enclaves and the Azure Attestation service. Employing Zero Trust methodologies, these solutions aim to bolster security without incurring the dangers inherent in kernel-level operations.

Why do Microsoft let third parties access the kernel?

In short, they don’t have much choice (see below).

While Microsoft may be looking to further restrict access to its Windows kernel going forward, they have used this event to explain why third-party antivirus and security vendors are allowed to access the “core of Windows” in the first place.

The Windows kernel is the deepest layer of the operating system. Kernel-level cybersecurity lets developers do more to protect machines, can perform better, and can be harder for threat actors to alter or disable.

When a kernel-level cybersecurity solution loads at the earliest possible time, it gives users (and companies) the most data and context possible when threats arise, and ensures protection can kick in at the earliest stage of the operating system’s boot-up rather than waiting for the OS to load and then running as a normal system process.

The EU may prevent changes over anti-trust claims

Whilst this makes common sense to most – after all, why shouldn’t Microsoft be able to restrict access to ensure the stability of an operating system used by more than a billion users? – their push for change is likely to face resistance from both cybersecurity vendors and regulators.

Back in 2006, Microsoft tried to restrict kernel access around the release of Windows Vista, but was met with opposition and a ruling preventing them from doing so, citing anti-competition concerns. In contrast, Apple successfully managed to lock down kernel-level access in macOS in 2020. The market for Windows software is of course far larger than Apple’s macOS, and Microsoft is an open platform for developers to build upon, so any changes will need to be done in a way that makes this possible without preventing developers’ software from doing what it is supposed to do!

Microsoft has attributed part of the CrowdStrike outage to the 2009 European Union antitrust agreement, which mandates that Microsoft must provide kernel-level access to third-party software vendors. Conversely, Apple started to phase out kernel extensions in macOS in 2020, encouraging software vendors to adopt the “system extension framework” due to its reliability and security advantages.

It is not the first time, and won’t be the last, that the EU has played the anti-trust card. Microsoft has recently had to decouple Teams from Microsoft 365 in response to competitors such as Zoom citing that Microsoft has an unfair advantage, and there have been similar claims against them over Internet Explorer and Edge.

Zero Trust kernel protection may be the way forward

The blog post indicates that Microsoft is not proposing a complete shutdown of access to the Windows kernel. Rather, it highlights alternatives like the newly introduced VBS enclaves, which offer an isolated computing environment that doesn’t necessitate kernel mode drivers for tamper resistance.

“These examples use modern Zero Trust approaches and show what can be done to encourage development practices that do not rely on kernel access…We will continue to develop these capabilities, harden our platform, and do even more to improve the resiliency of the Windows ecosystem, working openly and collaboratively with the broad security community”.
John Cable | Microsoft Windows VP

Trade-off between “anti-compete” and stability

Microsoft acknowledges that the trade-off of kernel-level cybersecurity products is that if one glitches, it can’t be easily fixed, saying in their blog that “all code operating at kernel level requires extensive validation because it cannot fail and restart like a normal user application.”

As such, companies have to demonstrate strict quality and testing controls over their software. The CrowdStrike issue occurred not with a new product but with “simply” a software patch by CrowdStrike that… well, went wrong.

Microsoft can’t vet every patch and every update released by their “trusted” ISVs/third parties, especially when it comes to security updates, which these security vendors need to roll out frequently.

“There is a tradeoff that security vendors must rationalise when it comes to kernel drivers. Since kernel drivers run at the most trusted level of Windows, where containment and recovery capabilities are by nature constrained, security vendors must carefully balance needs like visibility and tamper resistance with the risk of operating within kernel mode.” | Microsoft

Whatever happens – businesses still need to have backup and remediation processes in place.

In response to the CrowdStrike incident, Microsoft deployed over 5,000 support engineers to aid affected organizations and provided continuous updates via the Windows release health dashboard. They rapidly developed recovery tools to assist companies in their recovery efforts, while emphasising the significance of business continuity planning, secure data backups, and the adoption of cloud-native strategies for managing Windows devices to bolster resilience against future incidents.

Further whitepapers and guidance will be released in the coming months and I expect this will lead to Microsoft, and their third party vendors releasing more recovery tools and guidance.


Summary

Microsoft “confirmed CrowdStrike’s analysis that this was a read-out-of-bounds memory safety error in the CrowdStrike-developed CSagent.sys driver,” they explained in their technical analysis of the crash – and of why the impact was so huge – published last week.

Reviewing the security architecture and access to the kernel is definitely needed, but Microsoft’s approach and desire to prevent future issues with third-party glitches will likely bear the brunt of complaints from third-party security vendors and the EU anti-competition regulators.

Apple “seem” to have a much easier ride when it comes to doing what they want – they say “jump” and developers say “how high”. Microsoft repeatedly have to “please” regulators far more. This recent huge global impact may, however, work in Microsoft’s favour, bringing some control and governance in the name of system and business stability, which I am sure will get the backing of everyone and every organisation impacted.

One thing is for certain – Microsoft won’t take this sitting down. They will work hard to continue to protect their OS, which runs on billions of devices and is used by almost all corporations, education and critical infrastructure. Change will happen!

CrowdStrike Update caused “Global IT Outage” with “Blue Recovery Screen” Issue on older Windows devices

BSOD - Crowdstrike

We have seen a social media frenzy this morning following a triple whammy of issues impacting Azure Virtual Machines (running Windows 10 and Server 2016) and Windows devices across hundreds of organisations, with devices rebooting to the Windows Recovery screen on Windows 10 and servers running older versions.

19/7/24 11:00am: The impacts of the issue are still on-going, although the root cause is known and CrowdStrike are working with Microsoft on getting a patch out…

19/7/24 15:00: CrowdStrike have updated their sites to take accountability for the issue (Microsoft still helping) that has impacted devices due to a “bug” in their software update which caused the BSOD. They have pulled and fixed the update and are working with their customers to remediate the impact. Microsoft have also offered guidance on what can be done to reverse the issue – links to this below.

29/7/2024 18:00: This is not a Microsoft problem (yet I imagine they will be blamed), but it affected millions of Windows systems… Read to the bottom to see why.


Summary

Since the early hours of the morning, several media companies, airlines, transport companies, tech companies, and schools/universities have been reporting a Blue Screen (actually a safety recovery screen) issue on Windows 10.

The issue is impacting Windows 10 devices that are running the CrowdStrike Falcon agent – the company’s flagship Extended Detection and Response (XDR) security platform.

Impacted devices are crashing following this Falcon Client update and then getting stuck at the “Recovery” screen due to a critical system driver failure that is preventing the device from starting back up.

CrowdStrike and Microsoft are actively working on a permanent fix; in the meantime, workarounds are available which require manually preventing the faulty driver from loading on affected devices.

The issue is not known to be affecting devices running Windows 11 or Windows Server 2019 and beyond.

What is CrowdStrike?

CrowdStrike, a cybersecurity firm based in the US, assists organisations in securing their IT environments, which encompasses all internet-connected resources.

Their mission is to “safeguard businesses from data breaches, ransomware, and cyberattacks” and they position themselves as having leading offerings that compete with other vendors including Microsoft themselves, SentinelOne, and Palo Alto Networks. Their client base is extensive and includes legal, banking, finance, travel firms, airlines, educational institutions, and retail customers.

A key offering from CrowdStrike is their Falcon XDR tool, touted on their website for delivering “real-time indicators of attack, hyper-accurate detection, and automated protection” against cybersecurity threats.

Root Cause

Information available from CrowdStrike and Microsoft states that the issue is caused by a “faulty” version of the csagent.sys file, a key system start-up file used by the new sensor update for CrowdStrike’s Falcon agent. It is this file that is responsible for the BSOD errors on Windows 10 devices and on many servers running older Windows Server OS versions in private and public data centres such as Microsoft Azure.

George Kurtz, the CEO of the global cybersecurity firm CrowdStrike, stated that the issues were due to a “defect” in a “content update” for Microsoft Windows devices.

“The issue has been identified, isolated, and a fix has been deployed,” he said, clarifying that the problems did not impact operating systems other than Windows 10 and Windows Server 2016 and older, and emphasising, “This is not a security incident or cyber-attack.”

Impact

  • Windows 10 devices are primarily affected.
  • Devices running Windows Server 2016 and older in Azure are also impacted if they run the CrowdStrike Falcon agent.
  • Limited/less impact on devices running Windows 11 or Windows 2019 and later.

Note: Windows 10 enters end of support in October 2025.

Is there a fix?

Updated: 21/7/2024: Microsoft have updated their guidance and provided additional support for fixing these issues using managed devices via Intune. This can be found here.

The formal advice if this issue is affecting your organisation is to contact your CrowdStrike Support representative – CrowdStrike and Microsoft are actively working to address the issue both as a response to the issue and preventative to ensure more devices are not impacted.

Since the issue is known to be caused by the csagent.sys file, there are ways to manually prevent this file from being loaded, allowing the device to boot. There are a couple of ways to do this.

  1. Use Safe Mode and remove the affected file:
    • Boot the device into Safe Mode
    • Open Command Prompt and navigate to the CrowdStrike directory, which should be C:\Windows\System32\drivers\CrowdStrike
    • Locate the file matching the pattern C-00000291*.sys – you can list it using the wildcard: dir C-00000291*.sys
    • Delete or rename the file.
  2. Use Registry Editor to block the CrowdStrike CSAgent service:
    • Boot to Safe Mode
    • Open Windows Registry Editor.
    • Navigate to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\CSAgent
    • Change the Start value to 4 to disable the service.
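Where admins scripted the file-matching step at scale, the logic looks something like the sketch below. This is a hypothetical, illustrative helper (the function name is mine, and Python would not be available in the recovery console itself – the official guidance is the manual Safe Mode steps above); it simply finds any Channel File 291 variant and renames it so it is not loaded at boot.

```python
from pathlib import Path

def quarantine_channel_files(driver_dir, pattern="C-00000291*.sys"):
    """Rename any Channel File 291 variants so they are not loaded at boot.

    Renaming (rather than deleting) keeps a copy for later analysis.
    Returns the new file names.
    """
    renamed = []
    for f in sorted(Path(driver_dir).glob(pattern)):
        target = f.with_name(f.name + ".bak")  # e.g. C-00000291xxx.sys.bak
        f.rename(target)
        renamed.append(target.name)
    return renamed
```

Note the deliberately narrow wildcard: only files beginning C-00000291 are touched, so the other (healthy) channel files in the CrowdStrike directory are left in place.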

Dan Card, of BCS, The Chartered Institute for IT, and a cyber security expert, said: “People should remain calm whilst organisations respond to this global issue. It’s affecting a very wide range of services from banks to stores to air travel.”

He also said that whilst the cause is now known, it is still causing worldwide issues and impacts on consumer services, banking, healthcare and travel and will take some time to remediate.

“Companies should make sure their IT teams are well supported as it will be a difficult and highly stressful weekend for them as they help customers of all kinds. People often forget the people that are running around fixing things.”


Conclusion

CrowdStrike has acknowledged the issue and is investigating the cause. Users can follow the steps above to resolve the recovery screen issue and boot their PCs normally.

CrowdStrike and Microsoft worked tirelessly to resolve this issue and prevent further widespread impact.


Devices running Microsoft’s latest Operating Systems seem to be less impacted (though information still being collated).


How did Microsoft allow this to happen?

Many people are asking why Microsoft are shifting blame to CrowdStrike (who have admitted fault), and why and how Microsoft allowed this to happen.

In short, it’s not their fault, and there really wasn’t anything they could have done to prevent it. Here’s why…

Many security products, such as the XDR products made by CrowdStrike, Palo Alto, and even Microsoft’s own Defender XDR, are what are known as “kernel mode products”. Whilst this issue affected Windows, the same update error could equally have affected other operating systems such as macOS and Linux, where these products run as kernel extensions. Had CrowdStrike made the same mistake in the updates for those OSs, the same failure would have occurred.

In an ideal world, all applications and services would run in user mode rather than kernel mode, but many security and AV products have a need (a legitimate one) to monitor at the lowest levels of the OS in order to detect attacks. This is not possible when running in user mode, as the kernel is protected.

The Blue Recovery Screen – which most mistook for the Blue Screen of Death (BSoD), which it actually was not – is the Windows OS’s safety net.

As such, there is not much more Microsoft can do here. These are third-party applications, not managed, developed or controlled/updated by Microsoft. If Microsoft were to manually vet every update and change to an application, Microsoft would be classed as control hogs and the world would crucify them for it!

Microsoft cannot legally wall off its operating system in the same way Apple does because of an understanding it reached with the European Commission following a complaint. In 2009, Microsoft agreed it would give makers of security software the same level of access to Windows that Microsoft gets.

The outage is awful and has impacted so many organisations, including critical services, but it’s also not fair, in my opinion, that Microsoft and Windows have been dragged through the dirt simply because it’s their OS that was impacted by the poor updates and issues another third-party application caused.

It’s not the first time this has happened… to other OSs

According to a report by Neowin, “similar problems have been occurring for months without much awareness, despite the fact that many may view this as an isolated incident. Users of Debian and Rocky Linux also experienced significant disruptions as a result of CrowdStrike updates, raising serious concerns about the company’s software update and testing procedures. These occurrences highlight potential risks for customers who rely on their products daily.

In April, a CrowdStrike update caused all Debian Linux servers in a civic tech lab to crash simultaneously and refuse to boot. The update proved incompatible with the latest stable version of Debian, despite the specific Linux configuration being supposedly supported. The lab’s IT team discovered that removing CrowdStrike allowed the machines to boot and reported the incident.”

What this shows is the vital importance of update testing and deployment rings.
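The idea behind deployment rings can be sketched in a few lines: push an update to a small canary group first, gate on a health check, and only then promote it to progressively larger rings. This is a minimal illustration, not any vendor's actual pipeline; the ring names and the `apply_update`/`health_check` callbacks are hypothetical stand-ins.

```python
def deploy_in_rings(update, rings, apply_update, health_check):
    """Roll `update` out ring by ring, smallest ring first.

    rings: ordered list of (ring_name, devices) pairs.
    apply_update: callable(update, devices) - pushes the update.
    health_check: callable(devices) -> bool - True if the ring is healthy.
    Returns the list of ring names that actually received the update.
    """
    updated = []
    for name, devices in rings:
        apply_update(update, devices)
        updated.append(name)
        if not health_check(devices):
            # Halt the rollout: larger rings never receive the faulty update.
            break
    return updated
```

Had the faulty Channel File been staged this way, a failing canary ring would have stopped the rollout long before it reached millions of devices.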