The faulty update triggered an out-of-bounds memory read, which in turn caused an “unrecoverable exception.”
In a post incident review (PIR), CrowdStrike blamed faulty testing software for the bug-ridden update that crashed 8.5 million Windows computers worldwide. “Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data,” the company said. It pledged a series of new measures to prevent a recurrence.
The blue screen of death (BSOD) outage hit a wide range of businesses worldwide, including international airlines, broadcasters, the London Stock Exchange and many other organizations. The problem forced Windows PCs into a boot loop they could not escape, so technicians needed local access to the machines to recover them. Apple and Linux devices were not affected. Many businesses, including Delta Air Lines, are still recovering.
CrowdStrike makes a tool called the Falcon Sensor that protects against distributed denial of service (DDoS) and other types of attacks. It ships with content that operates at the kernel level, called Sensor Content, which uses a “Template Type” to define how it defends against threats. When something new comes along, CrowdStrike ships “Rapid Response Content” in the form of “Template Instances.”
A Template Type for a new sensor was released on March 5, 2024, and it performed as expected. On July 19, however, two new Template Instances were released, and one of them, just 40KB in size, passed validation despite having “problematic data,” CrowdStrike said. When the sensor received that instance and loaded it into the Content Interpreter, it triggered an out-of-bounds memory read and, in turn, an exception. Because this unexpected exception could not be handled gracefully, the Windows operating system crashed (the blue screen of death).
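To make the failure mode concrete, here is a minimal Python sketch of the general idea: an interpreter that checks incoming content against what it expects and rejects a bad update instead of reading past the end of its data. Every name in it (TemplateType, validate_instance, read_field, load_instance) is hypothetical, and the field-count mismatch is an illustrative assumption; none of it is taken from CrowdStrike's code or its review.

```python
from dataclasses import dataclass


class ContentError(Exception):
    """Raised when a content update does not match what the interpreter expects."""


@dataclass(frozen=True)
class TemplateType:
    # Hypothetical stand-in for a template definition.
    name: str
    field_count: int  # number of input fields the interpreter will read


def validate_instance(template: TemplateType, fields: list[str]) -> None:
    """Reject an instance whose field count differs from the template's."""
    if len(fields) != template.field_count:
        raise ContentError(
            f"{template.name}: expected {template.field_count} fields, "
            f"got {len(fields)}"
        )


def read_field(fields: list[str], index: int) -> str:
    """Bounds-checked read: raise ContentError rather than read out of range."""
    if not 0 <= index < len(fields):
        raise ContentError(
            f"field index {index} out of range (have {len(fields)})"
        )
    return fields[index]


def load_instance(template: TemplateType, fields: list[str]) -> bool:
    """Return True if the instance loaded, False if it was rejected."""
    try:
        validate_instance(template, fields)
        for i in range(template.field_count):
            read_field(fields, i)
    except ContentError:
        return False  # skip the bad update and keep running
    return True
```

In this sketch a short instance is simply refused: load_instance(TemplateType("example", 3), ["a", "b"]) returns False rather than raising. Kernel-level code written in C or C++ gets no such safety net by default, which is why an unchecked read there can bring down the whole operating system.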
To prevent a repeat of the incident, CrowdStrike has promised several measures. The first is more thorough testing of Rapid Response Content, including local developer testing, content update and rollback testing, stress testing and stability testing, among others. It is also adding validation checks and improving error handling.
To avoid a repeat of the global outage, the company will also begin a staggered rollout of Rapid Response Content. In addition, it will give customers greater control over the delivery of this content and provide release notes for future updates.
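A staggered rollout can be sketched in a few lines of Python. The version below is an illustration under stated assumptions, not CrowdStrike's plan: the ring sizes (1, 10, 50 and 100 percent), the 1 percent failure threshold, and the bucket and staged_rollout functions are all hypothetical. Each host is hashed into a stable bucket, the update goes out in widening rings, and the rollout halts as soon as one ring fails too often.

```python
import hashlib


def bucket(host_id: str) -> int:
    """Map a host ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def staged_rollout(hosts, deploy, stages=(1, 10, 50, 100), max_failure_rate=0.01):
    """Deploy in widening rings and stop when a ring fails too often.

    `deploy` is a caller-supplied function that returns True on success.
    Returns (attempted_hosts, completed).
    """
    attempted = set()
    for percent in stages:
        # Hosts whose bucket falls inside this ring and were not tried yet.
        ring = [h for h in hosts if bucket(h) < percent and h not in attempted]
        failures = sum(1 for h in ring if not deploy(h))
        attempted.update(ring)
        if ring and failures / len(ring) > max_failure_rate:
            return attempted, False  # halt before reaching the wider rings
    return attempted, True
```

With an update that crashes every machine it touches, this sketch stops after the first non-empty ring, so only a small share of hosts is ever exposed.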
Some analysts and engineers, however, argue that the company should have had such safeguards in place from the start. “CrowdStrike must have been aware that these updates are interpreted by the drivers and could lead to problems,” developer Florian Roth wrote in a post on X. “They should have implemented a staggered deployment strategy for Rapid Response Content from the start.”