The notion that non-tech-industry people should care about cloud computing continues to weasel its way into mainstream media and popular culture, most recently with a slew of stories about how last week's East Coast storms were to blame for the unscheduled downtime that hit a bunch of popular web services like Instagram, Pinterest, and Netflix. Who knew there was actually such a thing as a derecho or a shelf cloud (sinister photo above)? I digress. Anyway, as people continue to come to terms with their reliance on the availability and uptime of the unseen back ends of their favorite web destinations, the discussion rages on about what we can learn from this and what we'd do differently next time. A few have posed the question, but it's kind of the same old same old ... a discussion of computing architecture, availability zones, replication, failover, and other IT-specific stuff. It's useful, for sure, but it raises the most fundamental question of all: how do complex systems fail? If you ask the question in the context of cloud computing, you'll get to comb through lots of tech papers by tech people talking all sorts of tech jargon about stuff that only other tech people would understand. I guess that's fine, but it puts a tech lens on something that should be more basic and fundamental than that.
So I went poking around and found a proverbial diamond in the rough ... a paper written over 12 years ago by a doctor, and not a guy with a doctoral degree in computer science, but a medical doctor. Richard Cook is a director at the Cognitive Technologies Laboratory at the University of Chicago who has done a bunch of work examining the impact of health IT on patient safety, and in 2000 he published a paper he described as a "short treatise" on the nature of failure, titled How Complex Systems Fail. Reading it, you'd think he'd had the web on his mind, even back then, but no ... this short 4-pager rises above any one industry's context and gets to the heart of how stuff breaks. It's very, very good. Reading it last week, while news of the storm's impact on the web was making headlines here in the U.S., made for an obvious connection back to how we need to think about these new architectures and design points. I thought I knew a thing or two about this stuff, but this paper was pretty eye-opening.