I run a monthly fireside chat panel discussion with IT and Business leaders from a handful of our Cisilion customers. Today, we talked about the outage and reflected on whether, and how, we, the industry and our vendors can minimise or prevent an impact of this scale happening again.
If you missed the "show" - you can watch it below.
In our September 2024 fireside chat, our panel and I delved into the significant impact of, and lessons that can be learned from, the July CrowdStrike outage, which is estimated to have cost more than $10 billion (US) and affected more than 8.5 million Windows devices when CrowdStrike distributed a faulty configuration update for its Falcon sensor software running on Windows PCs and servers.
The update featured a “modification” to a configuration file responsible for screening named pipes (Channel File 291). The faulty file caused an out-of-bounds memory read in the Windows sensor client, resulting in an invalid page fault that left affected machines either stuck in a boot loop or booting into recovery mode.
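To make that failure mode concrete, the sketch below illustrates the general class of bug described above: a parser that trusts a count field supplied by an update file and reads past the end of a fixed-size parameter table. It is purely illustrative and is not CrowdStrike’s actual code; the structure names, sizes and counts are assumptions for demonstration only.

```c
/* Illustrative sketch only -- not CrowdStrike's code. It demonstrates an
 * out-of-bounds read: a content interpreter trusts a count supplied by a
 * configuration file and reads beyond its fixed-size parameter table. In
 * user mode this is undefined behaviour; in a kernel-mode driver the wild
 * pointer dereference raises an invalid page fault and crashes the machine. */
#include <stdio.h>
#include <stdint.h>

#define MAX_PARAMS 20                    /* hypothetical fixed-size table */

typedef struct {
    uint32_t    param_count;             /* count claimed by the update file */
    const char *params[MAX_PARAMS];      /* rule parameters actually loaded  */
} channel_file_t;

static void evaluate_rules(const channel_file_t *cf)
{
    /* Bug: the loop bound comes from the file, not from MAX_PARAMS.
     * If the file claims more entries than the table holds, params[i]
     * is read out of bounds and the garbage pointer is dereferenced. */
    for (uint32_t i = 0; i < cf->param_count; i++) {
        printf("rule parameter %u: %s\n", (unsigned)i, cf->params[i]);
    }
}

int main(void)
{
    channel_file_t cf = { .param_count = MAX_PARAMS + 1 };  /* faulty update */
    for (int i = 0; i < MAX_PARAMS; i++) {
        cf.params[i] = "named-pipe screening rule";
    }
    evaluate_rules(&cf);  /* last iteration reads past the array: undefined behaviour */
    return 0;
}
```

A defensive version would validate `param_count` against `MAX_PARAMS` before iterating and reject the file if it did not match.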
Today’s fireside chat conversation covered a range of topics, from the immediate effects of the outage to long-term strategies for enhancing cybersecurity resilience.
The Immediate Impact of the CrowdStrike Outage
The panel began by addressing the widespread disruption caused by the CrowdStrike outage. We discussed the outage’s extensive reach, affecting millions of devices and various sectors, including healthcare, finance, and transportation. In my intro to the episode, I mentioned that “It was really hard to believe…such a small relatively trivial and small update could impact so many people, devices and organisations“. This set the stage for a deeper exploration of the outage’s implications for cybersecurity practices.
As we kicked off, I praised the collaboration between Microsoft and CrowdStrike in addressing the outage. Despite some initial blame-shifting in the media, there was a concerted effort to resolve the issue, showcasing the importance of vendor cooperation in crisis management. In short, the panel didn’t think there was much more Microsoft could have done. The key was timely updates and openness, which are so critical in a global issue like this: people and businesses need updates and answers, as well as help in restoring systems, which both Microsoft and CrowdStrike provided in droves.
Vendor Reliance and Preparedness
Ken Dickie (Chief Information and Transformation Officer at Leathwaite) emphasised the importance of incident management and the world’s reliance on third-party and cloud providers. He shared his insights into the challenges of controlling the fix, and how the outage revealed technology’s utility-like nature to leadership teams, stating that it can be hard to explain to “IT” just “how little control we had over the actual fix“. Matthew Wallbridge (Chief Digital and Information Officer at Hillingdon Council) echoed the sentiment, stressing the need for preparedness and the role of people in cybersecurity, stating, “It’s less about the technology, it’s more about people.”
Supply Chain Risks
Matthew raised concerns about supply chain risks, highlighting recent attacks on media and the need for better understanding and mitigation strategies. This part of the discussion underscored the interconnected nature of cybersecurity and the potential vulnerabilities within the supply chain.
Goher Mohammed (Group Head of InfoSec at L&Q Group) mentioned the impact on their ITSM service due to vendor reliance in the supply chain: whilst L&Q were not directly affected, they did experience “degraded service due to supply chain impacts“, emphasising the need for resilience and contingency plans and a review of those of their supply chain(s). This led to further discussion about how important supply chain validation is in our security and disaster recovery planning and co-ordination. Matt repeatedly advised us to “control the controllable”, but to ask the right questions of the vendors we can’t control.
Resilience and Disaster Recovery Planning
The conversation then shifted to strategies for enhancing resilience. Here I discussed how we at Cisilion are revisiting our own disaster recovery plans to include scenarios like the CrowdStrike outage.
We talked at length about the cost of resilience and the fact that there is a “limit” to what you can mitigate against before the cost skyrockets with very little further reduction in risk. It was agreed that many things can’t “easily” be mitigated in this particular scenario, but that we can be better prepared.
The panel talked about various strategies that “could be considered”, including recovering to “on-prem”, revisiting multi-cloud strategies, and the potential benefits of edge computing in mitigating risks associated with device reliance.
We also discussed whether technologies such as Cloud PCs and Virtual Desktops have a part to play in recovery and preparation, whether Bring Your Own Device (BYOD) could or should be a bigger part of our IT and desktop strategy, and, of course, the role of SASE technology in securing access.
Goher advised “do a real audit, understand the most critical assets, the impact they have further down the line and whether there is more that can be done to mitigate against outage/failure/issue“. This led us into an interesting side discussion around Secure Access Service Edge (SASE) – emphasising the “importance of not relying on trusted devices alone”.
The Human Aspect of IT Incidents
David Maskell (Head of IT and Information Security at Thatcham Research) brought a crucial perspective to the discussion, focusing on the human aspect of IT incidents. He reminded the audience of the importance of supporting IT teams during crises, highlighting the stress and pressure they face. The panel agreed with David, all emphasising the importance of ensuring teams are looked after, especially when things are not directly controllable (such as with cloud outages), and the need for good, solid communication with the business.
Ken also reflected on leadership’s reaction to the outage, emphasising the “gap in understanding the reliance on technology” that many business leaders (especially those not from a techy background) have. The days of “it’s with IT to fix” are clearly not as simple as they once were!
Conclusion: The Path Forward
As we concluded the discussion, the panel reflected on the lessons and tips they could offer viewers, each other and the industry.
In general, the guidance across the panel centred on:
- The importance of regular security reviews, external audits, and business continuity testing.
- The need to adopt a proactive stance on cyber security and technology outages, ensuring that teams are prepared (for example, by running regular testing and attack/outage simulations).
- Ask more questions of your supply chains – they may be your weakest link. Are they secure, and are their recovery plans robust?
- Map your critical systems and know the impact of an outage. What is the continuity plan? If devices are affected, how will people access your technology? Look at Cloud PCs (such as Windows 365) and consider whether you can support the use of personal devices (look at SASE technologies such as Cisco Secure Connect).
- Review your technology dependencies. It’s not necessarily about multi-vendor but this might be a consideration – even for backup.
In summary, the CrowdStrike outage serves as a stark reminder of the vulnerabilities inherent in our reliance on technology and the critical need for comprehensive cybersecurity strategies.