Interview with Aaron Bjork (Consistency)

In this interview, Principal DevOps Manager Donovan Brown interviews Director of Engineering Munil Shah about Safe Deployment of Visual Studio Team Services.
[00:40] Scale Units
[03:54] Safe Deployment Practices
[04:28] Rings
[09:20] Performance Testing
[10:24] Go Big Environment
[12:40] Feature Flags & Stages
[15:45] Preview Flag
[19:25] Testing
[24:29] DB or Binaries: Who Goes First?
[27:52] DB Schema Management
[32:35] API Versions
[18:50] Feature Flags!
Blog: DonovanBrown.com
Follow @DonovanBrown
Follow @munilsh
ring 4a 'special db', I want to hear more about that!
@spottedmah: Munil will be returning to the show and we will dive deeper for you.
Very interesting and informative. Thanks for the video!
Awesome! Looking Forward...
Munil kept mentioning deployments taking place over days or like 24 hours. How are you optimizing the process to get down to hours or minutes?
Could we actually get a video showing the actual deployment process? I assume you must have a 24x7 war room staffed during a deployment?
How in the world do you track bugs in real time against so many deployments/versions without huge overhead in trying to trace each bug back to when the individual developer committed the change?
Munil and I will be recording another show and we will be sure to cover these questions. Thanks for watching.
How do you monitor the operational impact of deployment rollouts with dashboards like Grafana (to see things like CPU utilization, RAM utilization, and disk throughput)?
@Dave Wilson: I don't think they meant the actual deployment takes 24 hours. It sounds like the feature(s) sit in a ring for 24 hours before the next ring gets the feature(s).
That is correct. We actually let the code bake in Ring 0 for 48 hours before the deployment to the next ring. Munil and I just recorded another episode today (being produced) where we go into more detail.
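For readers following the ring discussion, the bake-time idea can be sketched roughly as below. This is only an illustration, not the VSTS team's actual tooling: the `BAKE_HOURS` table and the function names are hypothetical, and only the 48-hour Ring 0 figure comes from the reply above (the other values are placeholders).

```python
from datetime import datetime, timedelta

# Hypothetical bake times per ring, in hours. Only the 48-hour value for
# Ring 0 is taken from the thread; the rest are illustrative assumptions.
BAKE_HOURS = {0: 48, 1: 24, 2: 24, 3: 24}


def eligible_at(ring: int, deployed_at: datetime) -> datetime:
    """Earliest time the build may move on from `ring` to the next ring."""
    return deployed_at + timedelta(hours=BAKE_HOURS[ring])


def can_promote(ring: int, deployed_at: datetime, now: datetime) -> bool:
    """True once the build has baked in `ring` for the required time."""
    return now >= eligible_at(ring, deployed_at)


if __name__ == "__main__":
    deployed = datetime(2017, 1, 1, 9, 0)
    print(can_promote(0, deployed, datetime(2017, 1, 2, 9, 0)))  # 24 h in
    print(can_promote(0, deployed, datetime(2017, 1, 3, 9, 0)))  # 48 h in
```

The point is that the deployment itself can be quick; the elapsed days come from waiting in each ring before promoting.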
Question for the next interview / conversation:
Why couldn't this outage be mitigated by instantly turning off a feature flag?
https://blogs.msdn.microsoft.com/vsoservice/?p=15276
@Manuel Wellman: The issue was in a dependency.
"In that deployment, we pulled in version 5.1.4 of the System.IdentityModel.Tokens.Jwt library. This library had a performance regression in it that caused CPU utilization on the SPS web roles to spike to 100%."
Taking a dependency is not like adding a feature you can turn on or off. Either you take the dependency or you do not.
@Donovan:@Manuel Wellman: Unfortunately this change was swapping out a compile-time dependency. So there isn't a good way to introduce a change like this behind a feature flag. It's a good question.
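To make the distinction concrete, here is a minimal sketch of what a runtime feature flag can and cannot gate. The `FLAGS` dictionary, the flag name, and the functions are all hypothetical and are not the VSTS feature flag system.

```python
# Hypothetical in-memory flag store; a real service would read flags from
# configuration or a flag service at runtime.
FLAGS = {"new_dashboard": False}


def is_enabled(name: str) -> bool:
    """Look up a flag, treating unknown flags as off."""
    return FLAGS.get(name, False)


def render_dashboard() -> str:
    # A flag selects between code paths that both ship in the same build,
    # so flipping it changes behavior without a redeploy.
    if is_enabled("new_dashboard"):
        return "new dashboard"
    return "old dashboard"


if __name__ == "__main__":
    print(render_dashboard())
    FLAGS["new_dashboard"] = True
    print(render_dashboard())
```

Both branches run inside the same build, against whatever library versions that build was compiled with. A flag can switch the code path, but it cannot switch which version of a compile-time dependency is loaded, which is why the regression described above could not simply be turned off.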
Fascinating. Would love to understand the metrics you've put in place to measure your deployments consistently rather than treating each one as a unique snowflake. Time is one consistent metric (how long it took), but so is how many man-hours went into the process, etc. It would also be really helpful to understand how you keep up morale when it seems the team is *always* on a diet... err, what I mean is always running full steam ahead to the next deployment. How do you catch your breath? How do you avoid burnout among team members? What's next on your list to go and solve? Can anyone deploy automatically to production? If not, why can't they, and what's stopping anyone from being able to deploy easily (i.e., why isn't the process 100% automated? You have ALL of the data, so surely it's possible)?