This is an ongoing event, and so all thoughts here reflect the current situation and my understanding of the incident.
It does not reflect the views of my employer in any way.
#Background
Today (July 19th, 2024) there was an incident that affected a lot of critical infrastructure worldwide.
People and businesses everywhere are affected, and teams are working hard to remediate and mitigate the
damage caused by it.
I don't envy the folks working on systems that are affected, nor the CrowdStrike employees dealing with this incident - big #HugOps to them.
It goes without saying that it's painful to witness the impact this is having on people worldwide.
The security product at the center of this, CrowdStrike Falcon, is, among other things, responsible for detecting and alerting on suspicious activity on endpoints (whether production servers or business laptops/desktops),
and it's also responsible for collecting and analyzing data on that activity. This is a highly business-critical product in the security industry, designed to help keep
your systems safe and running, and so an outage caused non-maliciously by the product itself is a big deal.
Technically, from various sources I gather this was an update that
affected the code in a kernel driver used by CrowdStrike's EDR product.
I believe there may be an extra complication here, where endpoints used to connect to remote systems may be affected as well
(so a workstation with CrowdStrike installed that is used to remotely connect to production systems may itself be bricked too).
##What is a kernel driver?
Drivers in an operating system are components that handle low-level interactions with underlying devices (mostly hardware, but these can be virtual devices or other low-level services or even other drivers as well), and they are essential for the operation of the operating system.
These can run with lower privilege ("user-mode drivers"), which have limited access and are isolated, so that when they crash, they won't bring down your machine.
Alternatively, they can run with high privilege ("kernel-mode drivers"), which have elevated access to the underlying hardware and are not isolated like user-mode drivers,
such that an issue in one of these drivers can disrupt other kernel drivers or crash the entire system (i.e., a "kernel panic").
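To make the distinction concrete, here is a minimal sketch of what a Windows kernel-mode driver looks like - just the boilerplate DriverEntry/DriverUnload entry points from the Windows Driver Kit, purely illustrative and in no way related to CrowdStrike's actual code:

```c
// Minimal sketch of a Windows kernel-mode (WDM) driver, built against the
// Windows Driver Kit. Purely illustrative - not related to CrowdStrike's code.
#include <ntddk.h>

// Called by the kernel when the driver is unloaded.
VOID DriverUnload(_In_ PDRIVER_OBJECT DriverObject)
{
    UNREFERENCED_PARAMETER(DriverObject);
    KdPrint(("sample: unloading\n"));
}

// Entry point, called by the kernel when the driver is loaded.
// Everything here runs with full kernel privileges: an unhandled fault
// (e.g., dereferencing an invalid pointer) doesn't just kill a process,
// it crashes the whole machine with a bugcheck (the "Blue Screen of Death").
NTSTATUS DriverEntry(_In_ PDRIVER_OBJECT DriverObject,
                     _In_ PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(RegistryPath);
    DriverObject->DriverUnload = DriverUnload;
    KdPrint(("sample: loaded\n"));
    return STATUS_SUCCESS;
}
```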
Kernel drivers, to no surprise, need to be properly vetted before being distributed and loaded into an operating system's kernel.
There is often a signing process involved (the code is signed by a trusted authority, so you know that it was not tampered with, came from a source you trust, and is safe to run), and this can be done
by either the operating system vendor (e.g., Microsoft for Windows) or third parties (e.g., CrowdStrike).
##How did a kernel driver cause this?
I believe that what happened in this incident was that CrowdStrike's kernel driver was updated to a version that, for some reason (we're still waiting for a root cause analysis to be published,
as is customary in these types of incidents), had such a severe bug that it crashed the machines it was running on.
Seemingly, this was an issue with the Windows kernel driver distributed as part of CrowdStrike's Falcon® product, and so it affected production servers running Windows
(Microsoft's Windows-based server offering is called "Windows Server").
So, CrowdStrike deployed a fix - why are systems still down?
From my understanding, in this case the system fails to boot properly, and so you have a circular problem: the product needs to be updated,
but the machine is bricked by the crash, and every time the server boots it crashes again before it can fetch the update with the fix.
Presumably, this was a memory safety issue in the driver.
CrowdStrike's fix involves booting into Windows Safe Mode (where the faulty driver does not load) and removing the offending update file, which works around the issue.
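To make the "circular problem" concrete, here is a deliberately simplified, hypothetical sketch of the class of bug being described: an update/content file that fails to parse, and a pointer that gets used without validation. The names (parse_rule and friends) are made up and this is plain user-mode C, not CrowdStrike's actual code; it only illustrates why such a bug in a driver that loads at boot turns into a crash on every boot, before the machine stays up long enough to receive a fix.

```c
/* Hypothetical, heavily simplified illustration - NOT CrowdStrike's code.
 * It shows how an unvalidated pointer coming out of a malformed
 * update/content file leads straight to a crash. In user mode this kills
 * one process; in a boot-start kernel driver the same class of bug
 * bugchecks the machine on every boot, before the update mechanism that
 * could deliver a fix ever gets a chance to run. */
#include <stdio.h>
#include <string.h>

typedef struct Rule {
    char pattern[64];                     /* detection pattern from a content file */
} Rule;

static Rule g_rule;

/* Stand-in for parsing a content/update file shipped to the endpoint.
 * A malformed file makes the parser fail and return NULL. */
static Rule *parse_rule(const char *raw)
{
    if (raw == NULL || strncmp(raw, "RULE:", 5) != 0) {
        return NULL;                      /* malformed update */
    }
    snprintf(g_rule.pattern, sizeof(g_rule.pattern), "%s", raw + 5);
    return &g_rule;
}

int main(void)
{
    /* Imagine this runs on every boot. A bad update means 'rule' is NULL. */
    Rule *rule = parse_rule("GARBAGE");   /* simulate the malformed file */

    /* Missing NULL check: this dereference crashes the program. In kernel
     * mode, that crash takes the whole system down - hence the boot loop,
     * and hence why the workaround runs from Safe Mode, where the driver
     * isn't loaded in the first place. */
    printf("loaded pattern (%zu bytes): %s\n", strlen(rule->pattern), rule->pattern);
    return 0;
}
```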
#What has Microsoft got to do with this?
From what I understand, this incident was not related to any changes that Microsoft made.
The kernel driver that was deployed by CrowdStrike was probably not tested properly and reached production systems with an issue that caused them to crash (often evident as the
Windows "Blue Screen of Death").
An argument could be made that Microsoft (and other kernel/OS vendors) could have protections against rogue kernel drivers, but I think this may be a straw man argument.
#Prevention
Such global outages are usually a chain of events and cannot be pinned on one single culprit, but I think this one can be narrowed down to two things:
Critical software distribution:
Critical software should be treated as critical. I have my sympathies for CrowdStrike, and bear in mind that they have a long history
of deploying critical software without such incidents (unfortunately, it takes years to build trust and only a second to ruin it, though I
doubt that will be the case here), but such critical software should have a better QA and testing process.
Disaster recovery and redundancy:
These issues are hard to predict, but it's important that high-stakes business software has a clear process for recovering from
incidents. Servers can go down for any reason (even to the point of the system not being able to boot at all), which is exactly why
disaster recovery processes exist.
Chaos engineering is an entire field that deals with this.
Airlines, banks and other companies that were severely affected by this incident could have had guardrails in place to detect and respond
to such incidents.
#Thoughts on public communications
I am far from an expert on this topic, but it is interesting to see how companies handle public communications when incidents like this happen.
After watching an interview with the CrowdStrike CEO, I think that a crucial framing was not communicated properly; it may be easy for people to take
away from this incident "CrowdStrike deploy bug, CrowdStrike bad", but I think the right framing for the public should be that CrowdStrike deploys
critical security software that, for its operation, needs high privileges in the operating system, which carries the risk of causing business disruption
if it doesn't work properly.
#Closing
This is an unfortunate incident, and I do hope that lessons are learned from it and that measures are taken to prevent it from happening again,
both on the vendor side and on the end users' side.
I've seen that industries like electrical or mechanical engineering sometimes have very long processes for manufacturing new products,
where fixes are hard and expensive to roll out (e.g., they may require a recall), and so the end quality is taken very seriously.
In software, this is sometimes taken for granted, since updates are often delivered frequently and over the air. That is not necessarily the case here,
but it's something to keep in mind.
Remember - critical software is critical software.