One of the most fabulous aspects of running your apps on top of Platform-as-a-Service components is that someone else is running and watching these components for you.
That doesn't let you off the hook from watching your app, but it saves you a lot of the in-depth troubleshooting you'd otherwise have to deal with if, for instance, you were running middleware like a messaging system yourself.
For this episode I sat down with my colleague Mohamed F. Ahmed, who organizes, amongst other important things, our servicing strategy.
We talk about how we proactively monitor our own systems and the other platform features we depend on, and how we actively observe logs to catch reliability issues while addressing privacy concerns as we do so.
Mohamed and I also discuss the layered structure of our worldwide 24h/365d servicing team: a first-level live-site operations crew that works from a constantly refined operations handbook for known behaviors, backed by on-call product team crews who investigate, at 3 in the morning if needed, and either fix issues, guide customers to solutions, or provide new procedures for the operations handbook.
We talk about the various issue classifications, including the ones that would lead us to give up on an entire cluster or even a datacenter facility and fail over to a different one, and about how we're structurally set up to learn from past mistakes and improve our processes, which includes weekly reviews with executive leadership.
If you run a service yourself, in Windows Azure or elsewhere, you may want to make some time to watch this.