Comments
Interview with Munil Shah (Safe Deployment Questions Answered)
@Niner447018, there is usually urgency to get hotfixes out, so we ship them through the Release branch.
Interview with Munil Shah (Quality)
@James: Hotfix rollout depends on the severity. Sev2s and Sev3s (i.e. lower severity) go through the same deployment ring model. For Sev1s, we deploy through ring 0 and can then skip to the impacted scale unit if the issue is isolated to a particular scale unit; otherwise they also follow the normal ring model. For Sev0s (highest severity), we deploy as fast as possible to alleviate the problem. If another release is already in the pipeline, the hotfix may take priority, as explained here. We need to craft different versions of a hotfix depending on where particular scale units are. Regarding rollback: yes, it is automated. After updating the binaries, we check a set of health monitors. Based on specific thresholds, the binaries are rolled back if things don't look right. Hope this helps.
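For readers who want to picture the rules above, here is a rough sketch in Python. It is only an illustration, not the actual deployment system: the ring layout, the scale unit names, the metric names, and the functions `rollout_plan` and `should_roll_back` are all hypothetical. It also assumes that a Sev0 can be modeled as a single stage covering every scale unit, and that a missing health metric counts as a failure.

```python
from typing import Dict, List, Optional

# Hypothetical ring layout; the real rings and scale units are not described here.
RINGS: List[List[str]] = [
    ["su0"],                # ring 0
    ["su1"],
    ["su2", "su3"],
    ["su4", "su5", "su6"],
]


def rollout_plan(severity: int, impacted_unit: Optional[str] = None) -> List[List[str]]:
    """Return the ordered deployment stages for a hotfix of the given severity."""
    all_units = [su for ring in RINGS for su in ring]
    if severity == 0:
        # Sev0: deploy as fast as possible (assumption: one stage, all units).
        return [all_units]
    if severity == 1 and impacted_unit in all_units:
        # Sev1 isolated to one scale unit: ring 0 first, then skip to that unit.
        if impacted_unit in RINGS[0]:
            return [list(RINGS[0])]
        return [list(RINGS[0]), [impacted_unit]]
    # Non-isolated Sev1, Sev2 and Sev3: the normal ring-by-ring progression.
    return [list(ring) for ring in RINGS]


def should_roll_back(metrics: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """Roll back if any monitored metric exceeds its threshold.

    Assumption: a metric that is missing from `metrics` is treated as unhealthy.
    """
    return any(
        metrics.get(name, float("inf")) > limit
        for name, limit in thresholds.items()
    )


if __name__ == "__main__":
    print(rollout_plan(1, impacted_unit="su4"))
    print(should_roll_back({"error_rate": 0.05}, {"error_rate": 0.01}))
```

In this sketch the health check runs after each stage, so a bad hotfix is rolled back before it reaches the next ring.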
Interview with Munil Shah (Safe Deployment Questions Answered)
@Kevin: Yes, we have a very rich set of telemetry and dashboards to monitor tons of metrics, including the ones you mentioned. It's an internal system that our friends in Azure have built, and it includes some parts that are externally available, like Azure Log Analytics and Application Insights. Maybe we should do another video of our live site dashboards! :)
Interview with Munil Shah (Safe Deployment)
@Donovan: @Manuel Wellman: Unfortunately, this change swapped out a compile-time dependency, so there isn't a good way to introduce a change like this behind a feature flag. It's a good question.