How Complex Systems Fail

A shelf cloud over Enschede, Netherlands
(Photo by John Kerstholt)

The notion that non-tech-industry people should care about cloud computing continues to weasel its way into mainstream media and popular culture, most recently with a slew of stories about how last week's East Coast storms were to blame for the unscheduled downtime that affected a bunch of popular web services like Instagram, Pinterest, and Netflix. Who knew there was actually such a thing as a derecho or a shelf cloud (sinister photo above)? I digress. Anyway, as people continue to come to terms with their reliance on the availability and uptime of the unseen back ends of their favorite web destinations, the discussion rages on about what we can learn from this and what we'd do differently next time. A few have posed the question, but it's kind of the same old same old ... a discussion of computing architecture, availability zones, replication, failover, and other IT-specific stuff. It's useful, for sure, but it raises the most fundamental question of all: how do complex systems fail? If you ask the question in the context of cloud computing, you'll get to comb through lots of tech papers by tech people talking all sorts of tech jargon about stuff that only other tech people would understand. I guess that's fine, but it puts a tech lens on something that should be more basic and fundamental than that.

So I went poking around and found a proverbial diamond in the rough ... a paper written over 12 years ago by a doctor, and not a guy with a doctoral degree in computer science, but a medical doctor. Richard Cook is a director at the Cognitive Technologies Laboratory at the University of Chicago who has done a bunch of work examining the impact of health IT on patient safety, and in 2000 he published a paper he described as a "short treatise" on the nature of failure, entitled "How Complex Systems Fail." In reading this paper, you'd think he'd had the web on his mind, even back then, but no ... this short 4-pager rises above any industry-specific context and gets to the heart of how stuff breaks. It's very, very good. Reading it last week, while news of the storm's impact on the web was making headlines here in the U.S., made for an obvious connection back to how we need to think about these new architectures and design points. I thought I knew a thing or two about this stuff, but this paper was pretty eye-opening.

- Tim


Follow the discussion

  • Great post, Tim. While I read Dr. Cook's essay some time ago, your post inspired me to get a refresher, look a bit deeper, and do my own write-up: http://blogs.msdn.com/agile/archive/2012/07/03/thinking-about-complex-systems-and-cloud-availability.aspx

  • JohnAskew

    I'm most intrigued by the repeated call from Dr. Cook that the tendency to blame people when complex systems fail is both wrong and very destructive.

    It is a pain point, and it is nearly impossible to resist blaming human error. I see that in comments about the article describing the outage of Azure on leap day via Gregori's blog.

    What I was left with was a desire for recommended approaches to best handle the pitfalls Dr. Cook describes... but I guess that is a challenge left to the reader.
