@wkempf: @MichaelCzepiel: Hey Michael (and wkempf), I'm the Channel 9 guy you can beat up about yesterday's outage. So, it wasn't the conference Wi-Fi; it was entirely us. Scaling up in Azure is easy, and we did scale up before the day started, and we definitely scaled up again once the CPU usage started to spike. It was that CPU usage spiking to 100% that caused the outage.
The key point, though, is that we couldn't determine the *cause* of this CPU increase, and increasing the number of instances was *not* reducing the CPU usage significantly; we were just bringing up new nodes, which would then quickly ramp up to 100% themselves. The load on our site was actually higher before the CPU started to spike, and even when the load went back down, the CPU usage stayed at 100%.
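To be clear, we still don't know the exact culprit, but as a purely hypothetical sketch of the *kind* of bug that behaves this way (CPU decoupled from load, every new node pegging itself), imagine a request handler with an uncapped, no-backoff retry loop. None of this is our actual code; it's just an illustration:

```python
import threading
import time

def fetch_event_data():
    """Hypothetical backend call that starts failing under event load."""
    raise TimeoutError("backend overloaded")

def handle_event_page_request():
    # BUG (hypothetical): retry forever, with no backoff and no cap.
    # Every failed request leaves a thread spinning flat out, so CPU
    # stays at 100% even after the traffic that triggered it is gone.
    while True:
        try:
            return fetch_event_data()
        except TimeoutError:
            continue  # immediate retry: a tight, CPU-bound loop

# Simulate a brief burst of event traffic; the workers never exit.
for _ in range(8):
    threading.Thread(target=handle_event_page_request, daemon=True).start()

time.sleep(2)  # the "load" is long gone, but all 8 workers still burn CPU
print("load ended, yet worker threads are still spinning at 100%")
```

Once a handful of requests get stuck in a loop like that, the node sits at 100% no matter how far traffic falls, and every instance you add just accumulates its own stuck workers.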
So, scaling up (easy as it was) was not solving our problem. Our theory is that some particular bit of code, one being hit only by TechEd attendees, was causing the issue, not the overall site load. We spent most of the day and last night digging through code trying to figure out which event-specific feature was to blame, and we found quite a few areas where improvement was possible. We are watching the CPU load carefully today and have several people still digging into code, running traces and profiles.
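If you're curious what "running traces and profiles" means in practice: you capture CPU samples while exercising the suspect code and rank functions by time spent. Our stack and tools are different, but here's a minimal, self-contained sketch of the technique using Python's built-in cProfile, with an invented, accidentally quadratic page renderer standing in as the hot spot:

```python
import cProfile
import pstats

def render_event_schedule(sessions):
    # Hypothetical hot spot: an accidentally O(n^2) lookup per render.
    html = []
    for s in sessions:
        same_room = [x for x in sessions if x["room"] == s["room"]]
        html.append(f"{s['title']} ({len(same_room)} in room)")
    return "".join(html)

sessions = [{"title": f"TechEd session {i}", "room": i % 20}
            for i in range(2000)]

profiler = cProfile.Profile()
profiler.enable()
render_event_schedule(sessions)
profiler.disable()

# Rank by cumulative CPU time; the hot function floats to the top.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

The point of the exercise is that the expensive function floats to the top of the report, which is exactly how we're narrowing down which event-specific code path is burning the CPU.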
Essentially, our failure is that our pre-event load testing was not a true representation of the load the real event puts on our site. We definitely feel your pain on this; trust me, we were actively engaged in trying to fix it and stayed on site until after midnight last night attempting to debug.
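The obvious fix for that gap is to drive future pre-event load tests with the traffic mix attendees actually generate, weighted heavily toward the event-specific pages, rather than our normal day-to-day mix. A rough sketch of the idea; the endpoints, weights, and host below are all made up for illustration:

```python
import random
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

BASE = "http://localhost:8080"  # assumption: a staging copy of the site

# Weight endpoints by what attendees actually hit during the event,
# not by average day-to-day traffic. (These weights are invented.)
TRAFFIC_MIX = [
    ("/Events/TechEd/Schedule", 50),
    ("/Events/TechEd/Live", 30),
    ("/", 15),
    ("/Search?q=azure", 5),
]
PATHS = [path for path, weight in TRAFFIC_MIX for _ in range(weight)]

def one_request():
    path = random.choice(PATHS)
    try:
        with urlopen(BASE + path, timeout=10) as resp:
            resp.read()
    except Exception as exc:
        print(f"{path}: {exc}")

# Hammer the event-specific pages with concurrent clients while
# watching per-instance CPU, the signal we missed before the event.
with ThreadPoolExecutor(max_workers=50) as pool:
    for _ in range(1000):
        pool.submit(one_request)
```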
Hopefully today, and the rest of the conference, will proceed more smoothly.