Blog Post

Inside the Azure Storage Outage of November 18th

Description

On November 18, Microsoft Azure customers experienced a service interruption that impacted Azure Storage and a few other services, including Virtual Machines. Azure Chief Technology Officer Mark Russinovich sits down with Channel 9 to give customers a look at what happened and at how the team is actively working to improve the customer experience on the Azure platform as a result.

Tag: Azure

The Discussion

  • mrpaulb

    Good to see a clear and detailed explanation. The key to this is people and risk assessment: systems will fail, operators will fail to follow process; these things happen and always will.

    The only way to stop issues like this is to have separate people (along with separate systems) for separate regions; you clearly cannot rely on systems or process alone. If different people were accountable for (and limited to) different regions, these multi-region issues could be eliminated.

    Paul

  • compupc1

    Thanks for the update. I am glad to see you are taking a look at your social media strategy -- it's a fantastic plan B for when something like the service dashboard itself goes down. Even more, it's a great way to address specific questions not covered by service dashboard updates; I hope the new policy will permit this type of customer interaction where it makes sense.

  • MikeH2011

    Are there some official links to the social media sites?

    Infinite loop? OMG.

    Do you always start the roll-out in the same region?

    What I always wanted to know: how do you test your datacenter? That one is easy: build a second one to test with. But how do you test the cloud? For a real-world test I guess one would need a second planet Earth. Isn't this something that cannot be tested 100%?

  • HansOlavS

    My trust in Azure is 100% restored. Nuff said!

  • Mark Russinovich

    @MikeH2011 Thanks, Mike, for the questions. For social media sites, we are investigating methods of communication for our customers and will be rolling out policies to ensure they receive updates in a timely fashion. For your question on the roll-out, we do not always start in the same region for the flighting. Our standard deployment and test policies and principles are outlined in the public RCA.

  • codedj

    Thanks for the update. Has the original update (that caused the failure) been re-deployed? For the network fix, did you follow the same rollout procedure or a different one? That is, is there a different rollout procedure for fixes during downtime vs. a feature release?

  • Mark Russinovich

    @codedj Good questions. We have not rolled out the original update yet, and we avoid updating over holiday periods, so it will go out after the New Year.

  • Will Jones

    An incredibly frank and open account of the Azure outage; thank you, Mark. I applaud how honest you are being about this. Mistakes will always happen; learning from them and explaining them is key to keeping cloud customers happy.

  • John

    Office 365 was DOWN too.

  • Jack

    Stay with On-Premises; the Public Cloud is shaky.
    Once your Data is in the Public Cloud, the Public Cloud Vendor (Microsoft) OWNS your Data, so remember: “If you don't hold your Data, you don't own your Data”...

  • Lee

    I have heard North Korea has also hacked Office 365 servers lately, after the attack on Sony.

  • pcgeek86

    Glad to see Mark Russinovich putting his face out in front of Azure and owning up to human error. It's okay - I realize that people make mistakes and that this is being taken seriously. I still trust Azure and will continue to fully support it in front of my customers.

    Cheers,
    Trevor Sullivan
    Microsoft MVP: PowerShell
    @pcgeek86

  • Steven

    He didn't mention anything about the VM disk corruption that happened in West Europe.

    We had to manually attach the disks to another VM, restore the Windows registry, and run chkdsk: http://blogs.msdn.com/b/mast/archive/2014/11/20/recovering-azure-vm-by-attaching-os-disk-to-another-azure-vm.aspx

    Our VMs were down for 4 days because we had no way of diagnosing the issue ourselves without console access.
    Apparently console access is coming soon ...

  • Rob Casey

    +1 for Mark's candor in terms of what happened, the details of what failed, and the plan to improve based on what was learned. This is very good PR for an otherwise embarrassing incident. These are complex systems, and the tiniest mistakes can have large-scale consequences.

  • Mihran Dars

    Human error sources in IT = a*(ProcessError) + b*(SoftwareError) + c*(ControlError) + d*(ViolationOfTrust)

  • simianmonkey

    I find it really hilarious that some dude has so much power over there to decide that a standard, procedural roll-out is to be superseded by his own overly optimistic views... human stupidity and the Universe are infinite.
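
For anyone who ran into the situation Steven describes above, the article he links walks through the recovery with PowerShell. Purely as an illustration, here is a minimal Python sketch of the same idea using the classic (Service Management) Azure SDK of that era: attach the failed VM's OS disk to a healthy recovery VM as a data disk, repair it from inside that VM, then detach it. The subscription ID, certificate path, and all resource names below are placeholders, and the exact SDK parameters may differ from this sketch.

    # Rough sketch only: attach a failed VM's OS disk to a recovery VM as a
    # data disk so it can be repaired (registry restore, chkdsk) from a
    # working OS. Classic Service Management Azure SDK for Python; every
    # name below is a placeholder.
    from azure.servicemanagement import ServiceManagementService

    SUBSCRIPTION_ID = "<subscription-id>"        # placeholder
    CERTIFICATE_PATH = "management-cert.pem"     # placeholder management certificate

    sms = ServiceManagementService(SUBSCRIPTION_ID, CERTIFICATE_PATH)

    # 1. Attach the retained OS disk of the broken VM to a healthy recovery
    #    VM at LUN 0 (the broken VM is deleted but its disk is kept).
    sms.add_data_disk(
        "recovery-service",             # cloud service hosting the recovery VM
        "recovery-deployment",          # deployment name
        "recovery-vm",                  # role (VM) name
        0,                              # LUN
        disk_name="broken-vm-os-disk",  # existing disk to import
    )

    # 2. RDP into the recovery VM and repair the attached volume there, e.g.
    #    "chkdsk F: /f" plus the registry restore described in the linked post.

    # 3. Detach the disk so it can be used to recreate the original VM.
    sms.delete_data_disk(
        "recovery-service",
        "recovery-deployment",
        "recovery-vm",
        0,
    )

The point of the workaround is that the repair runs from a VM whose own OS is healthy, which is also why the console access mentioned above would have shortened the diagnosis considerably.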
