Global cyber/software security giant Crowdstrike sent out a software update for its enterprise-class Falcon anti-malware software on Friday, July 19, and ended the world as we know it. The software update caused Windows to crash on an estimated 8.5 million PCs and servers worldwide and brought up blue screens of death (BSODs) on the PC displays of nearly all worldwide subscribers to Crowdstrike’s services including airlines, logistics companies such as FedEx and UPS, financial institutions, hospitals, pharmacies, 911 call centers and other emergency services, network television broadcast studios including Sky News in the UK and Canal+ in France, retailers, grocery stores, and Microsoft Cloud Services including Azure and Office 365. Technically speaking, the crash is so severe that Windows must be restarted manually on the affected machines. That’s fine for PCs with keyboards and displays on people’s desks, but it’s a migraine-sized headache for physically inaccessible PCs being used as embedded controllers. (Mac and Linux machines are unaffected, so the cloud was relatively safe from this disaster.) Apparently, Falcon protects the PC from all sorts of malware that might crash the machine, except itself.
By my estimation, this crash cost the world economy several billion dollars, at a minimum. According to the Statista Web site, the annual worldwide airline revenue was $996 billion in 2023. That’s about $2.7 billion per day. So, the lost revenue could amount to several billion dollars for the lost airline flights alone on July 19 and on several subsequent days as the global airline industry recovered. I’ve seen estimates putting the airlines’ losses at $5 billion. By Monday following the outage, several hundred daily flights were still being cancelled in the US alone. For some reason, US carrier Delta Airlines seems to have been the hardest hit and the cancellations for that airline stretched through Wednesday.
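The per-day figure above is simple division, and a short sketch makes it easy to vary the assumptions. Only the $996 billion annual figure comes from the Statista number quoted above; the function name and its two inputs (days disrupted, fraction of flights lost) are hypothetical knobs for illustration, not measured data.

```python
# Back-of-the-envelope check of the airline revenue figure quoted above.
ANNUAL_AIRLINE_REVENUE_USD = 996e9  # 2023 worldwide airline revenue, per Statista
DAYS_PER_YEAR = 365

daily_revenue_usd = ANNUAL_AIRLINE_REVENUE_USD / DAYS_PER_YEAR


def revenue_at_risk(days_disrupted: float, fraction_of_flights_lost: float) -> float:
    """Revenue exposed during a disruption window.

    Both arguments are assumptions supplied by the caller; the article
    gives no measured values for either one.
    """
    return daily_revenue_usd * days_disrupted * fraction_of_flights_lost


if __name__ == "__main__":
    print(f"Average daily revenue: ${daily_revenue_usd / 1e9:.2f} billion")
```

Calling revenue_at_risk() with your own guesses for the two inputs shows how quickly a multi-day disruption approaches the billions of dollars discussed above.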
Numerous other industries also suffered losses, some in my direct experience. My wife and I happened to check into a major chain hotel in Fernley, Nevada on the day of the crash, and the hotel desk clerk could not issue keys to new guests because of the Crowdstrike crash. Earlier that same day, we paused at a truck stop for lunch, and the desk cashier could accept only cash because the POS terminals were displaying BSODs. Multiply these experiences by the tens of thousands of gas stations and hotels across the US and around the world. However, the direct loss-of-service costs may just be the tip of the iceberg. Bad actors have set up scam sites offering to help with fixing the Crowdstrike problems or outright repairs, in the hopes of netting a big payday or harvesting monetizable information from desperate, affected Windows users. Meanwhile, Crowdstrike is reported to have sent $10 Uber Eats gift cards to “teammates and partners” as compensation for wrecking their weekend. Seems like mighty low pay for mission critical IT repairs, at least to me.
In the outstanding 1951 science fiction movie, The Day the Earth Stood Still, an alien visitor named Klaatu (played by Michael Rennie) landed on a baseball field in Washington DC in a silvery flying saucer with his giant police/enforcer robot companion Gort (silently portrayed by the imposing 7-foot, 7-inch Lock Martin) and later shut down the world for an hour by cutting off all electricity – even in vehicles – except for hospitals and emergency services, to demonstrate his galactic organization’s power over life on Earth. He also spared aircraft in flight. They weren’t allowed to fall out of the sky. Crowdstrike’s crashing of approximately 8.5 million Windows PCs on July 19 did the same thing just as effectively to nearly everyone, everywhere including hospitals and emergency services. Fortunately, aircraft avionics don’t run on Windows, at least not to my knowledge.
Billy Rose Theatre Division, The New York Public Library. “Audience queue at Brandt’s Mayfair Theatre to see the motion picture The Day the Earth Stood Still.” Image credit: The New York Public Library Digital Collections. 1951. https://digitalcollections.nypl.org/items/a15fc5c4-8411-f36d-e040-e00a18062fdc
Airlines around the world cancelled thousands of flights on that Friday and continued to cancel flights over the next several days because their ticketing, customer service, and crew-tracking systems went down, hard. Southwest Airlines was spared, apparently, because the company is still running Windows 3.1 and therefore is not (and cannot be) a Crowdstrike customer. Southwest gets no Crowdstrike updates simply because Southwest’s OS of choice is too old to be served by Crowdstrike’s Falcon software, which works with Windows 7, 8.1, 10, and 11. Alaska Airlines, Frontier Airlines, and JetBlue were similarly unaffected.
The problem’s cause has been traced to an ill-formed Crowdstrike .sys configuration file that contains nothing but zeroes. The ill-formed configuration file causes Crowdstrike’s Falcon to throw a page fault when attempting to access a non-paged area of the PC’s memory map using a dereferenced null pointer. Because Falcon is installed as a Windows driver that runs at the kernel level, Ring 0, Windows cannot shut Falcon down, so it halts the machine and displays a BSOD. This behavior is caused by Crowdstrike’s decision to install Falcon as a driver so that it can intercept certain kernel-level system calls. (If you want to understand the problem in great detail, I recommend a 14-minute YouTube episode from the “Dave’s Garage” channel titled “CrowdStrike IT Outage Explained by a Windows Developer.”)
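To illustrate the kind of defensive check that would reject an all-zero file before anything tries to use it, here is a minimal sketch. It is not Crowdstrike’s code: the 4-byte MAGIC header, the function names, and the fall-back-to-last-known-good policy are all assumptions made for the example, because the real file format is not described in this article.

```python
from pathlib import Path

MAGIC = b"CFG1"  # hypothetical header; NOT the real Falcon file format


class InvalidConfigError(ValueError):
    """Raised when a configuration blob fails basic sanity checks."""


def validate_config_blob(blob: bytes) -> bytes:
    """Return the payload after the header, or raise InvalidConfigError."""
    if not blob:
        raise InvalidConfigError("file is empty")
    if not any(blob):
        # Every byte is zero: the failure mode described above.
        raise InvalidConfigError("file contains nothing but zeroes")
    if not blob.startswith(MAGIC):
        raise InvalidConfigError("missing expected header")
    return blob[len(MAGIC):]


def load_config_or_fallback(path: Path, last_known_good: bytes) -> bytes:
    """Use the new file only if it validates; otherwise keep the old config."""
    try:
        return validate_config_blob(path.read_bytes())
    except (OSError, InvalidConfigError):
        return last_known_good
```

The point of the sketch is the ordering: validate first, and keep running on the previous configuration if validation fails, rather than trusting whatever arrived in the update.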
Recovering an affected PC requires the following steps:
- Reboot Windows into Safe Mode or the Windows Recovery Environment (hold the shift key down while rebooting)
- Navigate to the C:\Windows\System32\drivers\CrowdStrike directory
- Locate and delete a file or files matching the file name “C-0000029xxx.sys” (or something similar)
- Reboot the PC normally
As one wag put it on Reddit: “How am I supposed to tell this to Sally in Accounting?” Also, if you’re using Bitlocker, things aren’t this “easy.”
The boot into safe mode prevents the Crowdstrike Falcon driver from loading. Deleting the offending file does not remove the Falcon driver code; it deletes a configuration file used by the Falcon driver. Once the configuration file is deleted, the Falcon driver will no longer throw page faults.
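For readers who would rather see the locate-and-delete step as code, here is a rough sketch. The function name and the dry-run default are my own additions, and the wildcard pattern is an assumption standing in for the “C-0000029xxx.sys (or something similar)” name in the steps above. It would have to be run from a session that can actually see and modify the directory.

```python
from pathlib import Path


def remove_offending_files(
    driver_dir: Path,
    pattern: str = "C-0000029*.sys",  # assumed wildcard for the name given above
    dry_run: bool = True,
) -> list[Path]:
    """List (and optionally delete) files in driver_dir matching pattern.

    With dry_run=True nothing is deleted; the matches are only reported.
    """
    matches = sorted(driver_dir.glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Directory taken from the recovery steps above.
    target = Path(r"C:\Windows\System32\drivers\CrowdStrike")
    for path in remove_offending_files(target, dry_run=True):
        print("would delete:", path)
```

Defaulting to a dry run is deliberate: deleting files from a drivers directory is not something to do on a guess.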
Now, multiply this process by 8.5 million for the estimated number of affected PCs.
This recovery process requires physical access to the affected PC’s keyboard and display. Considering that many of the affected PCs are used as fully embedded CPUs (kiosks, retail and information displays, etc.), some recovery attempts will require a lot more effort than others.
Instead of a malware or ransomware attack, the Crowdstrike debacle appears to be the result of a software update that was issued without the proper controls using a flawed release procedure. An ill-formed .sys file was installed in approximately 8.5 million computers. However, the root cause of this problem, along with malware and ransomware attacks like the 2021 attack on Colonial Pipeline that triggered gasoline panics in the Eastern US or the WannaCry ransomware attack that took out chunks of the UK’s National Health Service in 2017, is the Internet’s original sin: anonymity. The reason that bad actors can strike our public networking systems with aplomb, and the reason we need anti-malware software in the first place, is that the original designers of the Internet opted for anonymity.
The lesson in all of this for EEJournal readers is that embedded systems cannot be based on operating systems capable of halting due to any sort of application error. Microsoft Windows certainly is taking the heat for the Crowdstrike affair, but Linux and Mac OS can also halt and throw up death screens. Linux has a black screen of death that reads “Kernel Panic” and the Mac OS also has a screen of death.
Embedded developers must find ways to prevent their systems from locking up in this manner, whether it’s through watchdog timers or significantly better kernel error handling. This problem will remain for as long as we have over-the-air or over-the-internet software updates.
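As a rough illustration of the watchdog idea, here is a software model of one. On real embedded hardware the watchdog is normally a hardware timer that resets the processor; the class name, the timeout value, and the on_timeout callback below are all hypothetical stand-ins for that behavior.

```python
import threading
import time


class SoftwareWatchdog:
    """Calls on_timeout() if kick() is not called within timeout_s seconds."""

    def __init__(self, timeout_s: float, on_timeout) -> None:
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._lock = threading.Lock()
        self._timer = None

    def start(self) -> None:
        self.kick()

    def kick(self) -> None:
        """Restart the countdown; healthy code calls this periodically."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout_s, self._on_timeout)
            self._timer.daemon = True
            self._timer.start()

    def stop(self) -> None:
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None


if __name__ == "__main__":
    wd = SoftwareWatchdog(0.5, lambda: print("watchdog expired: trigger recovery"))
    wd.start()
    for _ in range(3):
        time.sleep(0.1)  # simulated healthy work
        wd.kick()
    time.sleep(1.0)      # simulated hang: no more kicks, so the callback fires
```

A reset by itself does not help if the same bad file is loaded again on the next boot, so in practice the recovery action would need to include something like falling back to a previous known-good configuration.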
I emailed my friend Jack Ganssle, firmware and embedded expert extraordinaire, for the last word on this subject. He wrote back:
“We have so many tools at our disposal to prevent these sorts of problems – code inspections, analysis tools, use of standards, testing, and more. It’s inexcusable that such mission-critical code got released without the care we know how to exercise. When software can disrupt the lives of so many people, how awful that companies continue to do such a poor job of releasing updates! We know how to do better… but choose not to.”
Fixing the problem is easy if…the CrowdStrike directory and offending C-0000029xxx.sys file don’t require admin/elevated privileges.
I do remote work for an enterprise aircraft electronics company. There was an hour+ wait for IT to answer, then they had to go up the IT chain to generate a temp admin password to log in to Win 10 in Safe Mode. Only then was the file visible and able to be deleted.
I initially tried to delete the file in a recovery cmd session, but it wasn’t visible since I was at a user privilege level.
To quote my former TI electronic tech, Paul C, “What about the farmer in Iowa…” who encounters a similar esoteric IT issue – what does he do?
urbite, the only thing that saved Crowdstrike from disabling hundreds of millions of Windows PCs with its anti-malware update instead of “just” 8.5 million was that it’s an enterprise/government tool. Farmers in Iowa are unlikely Crowdstrike subscribers as are other SME businesses. Otherwise, the economic damage could have been far, far more severe. The affected enterprises and institutions bear partial blame (legally, contributory negligence) by allowing updates to go through without testing. It’s a lesson about needing a good IT department that I fear few will learn.
Two words:-
due diligence
As noted by Fireship (https://odysee.com/@fireship:6/real-men-test-in-production%E2%80%A6-the-truth:7) the current CEO of CrowdStrike was the CTO of McAfee back in 2010 when a similar scourge was sprinkled on “the net” with similar results. Why learn when you can lead?
Failing upwards isn’t just a fad.
While the buck certainly stops with the CEO, ericwertz, I’d pin this debacle on the head of QA. The software development and release procedures seem to have been sorely lacking here. Nothing so disastrous and so easily tested should have escaped from Crowdstrike. At the same time, the number of incidents of people sending me the wrong files over the years, or no files, as promised attachments is larger than I can count. However, those people did not have release procedures that any company like Crowdstrike must and does have. I don’t think this event was caused by a lack of procedures; I think it was a failure to follow those procedures.
Hi Steven,
I absolutely agree. Why invest the time and effort in creating procedures if they are not followed!
I would like to draw your attention to OpenBSD and the development process followed by that team. Like all good firewalls, the default strategy is deny – if a new feature is proposed as a good idea by someone, it won’t get into the kernel or base system unless it passes intense scrutiny and can prove it has real value to many users, not just the proposer. The ring 0 feature would certainly not have got past the development team. The OpenBSD team put a real emphasis on security and privilege separation. There is also a rigorous release strategy that IS followed. Interestingly the project is an autocracy. Also, these people “eat their own dog food”! I am not part of the team nor am I advocating it but I am following up on Jack Ganssle’s comment about learning lessons and doing better. Really good point Jack. (I do have the book and have been aware of Jack Ganssle for more years than I care to admit publicly.) Steven, as you point out, this DOES matter because it has a very real financial impact as well as other detrimental effects. Thank you for starting this thread. I also follow the seL4 project which I am sure you and your readership are aware of. So, in summary, what OS would I like at the core of any critical embedded system? seL4. What OS would I like running on my home router/firewall? OpenBSD. I know that Linux and other BSDs are available but where is the security focus? I could go on but this reply is far too long already!