A Global Microsoft Outage

by Zain Jaffer

On July 19, 2024, a massive global outage caused millions of computers around the world running Microsoft Windows to go down [https://www.nytimes.com/2024/07/19/business/microsoft-outage-cause-azure-crowdstrike.html]. The outage caused widespread disruption and outrage because it brought to a halt thousands of services, including airlines, banks, restaurants, retail stores, hospitals and emergency rooms, and a host of others. The cause was traced to a botched update to security software from CrowdStrike, a third-party cybersecurity vendor whose Falcon product runs on many Windows machines.

This event highlights just how dependent our daily lives are on computers and networks. Although the bug has since been fixed with an updated patch, it shows that single points of failure exist in our computers and networks, many of which we do not even know about until we trigger them. That ignorance includes the experts.

Unfortunately, because these systems have become so complex, it is probably impossible to know every possible point of failure. So the obvious question is: what can we do about it?

Well, there is a joke circulating on social media in various forms saying that Apple and Linux users are wondering what all the fuss is about [https://www.sfgate.com/tech/article/linux-mac-users-crowdstrike-memes-microsoft-outage-19584184.php]. But behind those jokes lies a possible solution, one that many Information Technology (IT) experts know about but tend to avoid for cost reasons.

Let’s face it: it is easier to manage a global corporate IT network if all your systems are standardized. In reality, probably no company runs a network that is 100% Microsoft or 100% Intel, because companies and people purchase computers at various points over the years and end up with a mix of gadgets and software. Still, companies do standardize their systems every few years. Remember how employees used to carry BlackBerrys? Now they probably have Apple or Android smartphones.

The problem is that while standardized networks are easier to update and maintain, you leave yourself open to vulnerabilities specific to the hardware and software you have standardized on. So you end up with system downtime when a bug hits that platform, as happened with Windows machines in July 2024.

On the other hand, if a company spreads its IT workloads across Amazon Web Services, Microsoft Azure, and Google Cloud, it will probably pay more. But think about it: what is the likelihood that all three (or even two) will go down at the same time? Far lower than the likelihood of any one of them going down individually.
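To make that intuition concrete, here is a minimal back-of-the-envelope sketch. The outage probability used below is a made-up illustrative figure, not a real availability number for any provider, and it assumes the providers fail independently, which shared dependencies in the real world can violate.

```python
# Illustrative sketch: probability of simultaneous outages across providers,
# assuming independent failures and a made-up per-provider outage rate.

p_single = 0.01  # assumed chance any one provider is down at a given moment (1%)

p_two_down = p_single ** 2    # both of two providers down at once
p_three_down = p_single ** 3  # all three providers down at once

print(f"One provider down:    {p_single:.4%}")    # 1.0000%
print(f"Two down together:    {p_two_down:.4%}")  # 0.0100%
print(f"Three down together:  {p_three_down:.6%}")  # 0.000100%
```

Under these assumed numbers, losing all three providers at once is ten thousand times less likely than losing any single one, which is the whole argument for paying extra for redundancy.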

These technology providers will of course argue that they have teams of experts and consultants who comb their products for weaknesses. But in reality these systems have become so complex that spotting every problem is probably impossible. One can only hope that the issues capable of bringing down a network for a long time are extremely rare and can be fixed quickly. And if you do not know what to look for, anticipating such an issue is hard or even impossible.

Those who manage IT systems know that it is a balancing act. They want to keep their lives simple, go home at 5pm, and not spend more of their company's money than is required.

Redundancy, after all, is expensive. But it is like insurance: it pays off when other companies' systems go down and yours stay up.

In the end, it is a balancing act between managing risks and managing costs.