Interview with Munil Shah (Safe Deployment)

The Discussion

  • spottedmah

    ring 4a 'special db', I want to hear more about that!

  • Donovan

    @spottedmah: Munil will be returning to the show, and we will dive deeper for you.

  • GregCobb

    Very interesting and informative.  Thanks for the video!

  • Chris Richner

    Awesome! Looking Forward...

  • Dave Wilson

    Munil kept mentioning deployments taking place over days or like 24 hours. How are you optimizing the process to get down to hours or minutes?

  • Tadn

    Could we actually get a video showing the actual deployment process? I assume you must have a 24x7 war room staffed during a deployment?

  • Ellen

    How in the world do you track bugs in real time against so many deployments/versions without huge overhead in trying to trace each bug back to when the individual developer committed the change?

  • Donovan

    Munil and I will be recording another show and we will be sure to cover these questions. Thanks for watching. 

  • Kevin Harn

    How do you monitor the operational impact of deployment rollouts with dashboards like Grafana (to see things like CPU utilization, RAM utilization, and disk throughput)?

  • nmenegay

    @Dave Wilson: I don't think they meant the actual deployment takes 24 hours. It sounds like the feature(s) sit in a ring for 24 hours before the next ring gets the feature(s).

  • Donovan

    "@Dave Wilson: I don't think they meant the actual deployment takes 24 hours. It sounds like the feature(s) sit in a ring for 24 hours before the next ring gets the feature(s)."

    That is correct. We actually let the code bake in Ring 0 for 48 hours before the deployment to the next ring. Munil and I just recorded another episode today (being produced) where we go into more detail. 
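
The bake-time rule described here can be sketched as a simple promotion gate. This is only an illustration, not the team's actual tooling: the ring names, the `healthy` signal, and the 24-hour default for later rings are assumptions; the thread only confirms a 48-hour bake in Ring 0.

```python
from datetime import datetime, timedelta

# Hypothetical ring names, in rollout order.
RINGS = ["ring0", "ring1", "ring2"]

# Only the 48-hour Ring 0 bake comes from the discussion above.
BAKE_TIME = {"ring0": timedelta(hours=48)}
DEFAULT_BAKE = timedelta(hours=24)  # assumed default for the other rings


def next_ring(current):
    """Return the ring after `current`, or None if it is the last ring."""
    i = RINGS.index(current)
    return RINGS[i + 1] if i + 1 < len(RINGS) else None


def can_promote(current, deployed_at, now, healthy=True):
    """Allow promotion only after the build has baked long enough
    in the current ring and no health alarm has been raised."""
    if next_ring(current) is None:
        return False  # nothing to promote to
    bake = BAKE_TIME.get(current, DEFAULT_BAKE)
    return healthy and (now - deployed_at) >= bake
```

With this gate the deployment itself can still take minutes; the 48 hours is the waiting period before the same build is allowed into the next ring.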

  • Manuel Wellman

    Question for the next interview / conversation:
    Why couldn't this outage be mitigated by instantly turning off a feature flag?
    https://blogs.msdn.microsoft.com/vsoservice/?p=15276

  • Donovan

    @Manuel Wellman: The issue was in a dependency.

    "In that deployment, we pulled in version 5.1.4 of the System.IdentityModel.Tokens.Jwt library. This library had a performance regression in it that caused CPU utilization on the SPS web roles to spike to 100%."

    Taking a dependency is not like adding a feature you can turn on or off. Either you take the dependency or you do not. 

  • munils

    @Donovan: @Manuel Wellman: Unfortunately, this change swapped out a compile-time dependency, so there isn't a good way to introduce a change like this behind a feature flag. It's a good question.
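
To illustrate the distinction, here is a minimal sketch of what a feature flag can do. All names (`FLAGS`, `new_token_validation`, the two validators) are hypothetical and not taken from the service in question. A flag selects between code paths that both ship in the same build; the library version those paths are built against is fixed when the build is produced, so flipping the flag does not change it.

```python
# Hypothetical in-memory flag store; a real service would read this at runtime
# from configuration so it can change without a redeploy.
FLAGS = {"new_token_validation": False}


def validate_token_v1(token):
    """Old code path (illustrative logic only)."""
    return token.startswith("jwt:")


def validate_token_v2(token):
    """New code path (illustrative logic only): also rejects empty payloads."""
    return token.startswith("jwt:") and len(token) > len("jwt:")


def validate(token):
    # Both paths are present in the deployed build, so the flag can switch
    # between them instantly. It cannot swap out a library that both paths
    # were built against.
    if FLAGS.get("new_token_validation", False):
        return validate_token_v2(token)
    return validate_token_v1(token)
```

Turning `new_token_validation` off reverts the behavior at once, which is the kind of mitigation the question had in mind; a regression inside a shared dependency would affect both paths regardless of the flag.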

  • Nigel R

    Fascinating. Would love to understand the metrics you've put in place to measure your deployments consistently and not treat each one as a unique snowflake. Time is one consistent metric: how long it took, but also how many man-hours went into the process, etc. Would be really helpful to understand how you keep up morale when it seems the team is *always* on a diet...err, what I mean is always running full steam ahead to the next deployment. How do you catch your breath? How do you avoid burnout of team members? What's next on your list to go and solve? Can anyone deploy automatically to production? If not, why can't they, and what's stopping anyone from being able to deploy easily (i.e. why isn't the process 100% automated? You have ALL of the data, so surely it's possible)?