Others and I were trying to get onto Channel 9 yesterday during TechEd to do our evaluations, and the system was pretty much unusable. At times I couldn't even bring up my schedule on my phone. I found it ironic that Microsoft's keynote and other sessions talked about how easy it is to spin up additional resources to meet the needs of a business, yet it seems Microsoft can't manage their own workloads.
I didn't see anything on the site commenting on the performance issues. Maybe MS will comment in this thread.
I had no issues with C9 at all yesterday. Far more likely the problem was with your access. The WiFi at conferences is almost always overloaded. That's at ALL conferences, and there's not much they can do about it.
@wkempf: @MichaelCzepiel: Hey Michael (and wkempf), I'm the Channel 9 guy you can beat up about yesterday's outage. It wasn't the conference Wi-Fi; it was completely us. Scaling up in Azure is easy, and we did scale up before the day started, and we definitely scaled up once the CPU usage started to spike. It was that CPU usage spiking up to 100% that caused the outage.
The key thing, though, is that we couldn't determine the *cause* of this CPU usage increase, and increasing the # of instances was *not* reducing the CPU usage significantly; we were just bringing up new nodes, which would then quickly ramp up to 100%. The load on our site was actually higher before the CPU started to spike, and even when the load went down, the CPU usage stayed at 100%.
So, scaling up (easy as it was) was not solving our problem. Our theory is that some particular bit of code, one being hit by TechEd attendees only, was causing the issue, not the overall site load. We spent most of the day and last night digging through code trying to figure out which event-specific feature was causing the issue, and we found quite a few areas where improvement was possible. We are watching the CPU load carefully today and have several people still digging into code, running traces and profiles.
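The behavior described above, where new nodes immediately peg at 100% regardless of how the load is spread, is the classic signature of a runaway code path rather than genuine overload. A toy model (not Channel 9's actual code; `per_node_cpu` and its parameters are purely illustrative) shows why scale-out helps in one case but not the other:

```python
def per_node_cpu(requests_per_sec, nodes, cpu_per_request, runaway_load=0.0):
    """Rough per-node CPU fraction (0.0-1.0, capped at 100%).

    cpu_per_request: CPU-seconds each request costs a node.
    runaway_load:    CPU burned on every node by a buggy hot path,
                     independent of how many requests that node serves.
    """
    share = requests_per_sec / nodes          # requests handled per node
    return min(1.0, share * cpu_per_request + runaway_load)

# Normal overload: doubling the node count halves per-node CPU.
print(per_node_cpu(1000, 4, 0.002))                    # 0.5  -> 50%
print(per_node_cpu(1000, 8, 0.002))                    # 0.25 -> 25%

# Runaway code path: every node pegs at 100% no matter how many you add.
print(per_node_cpu(1000, 4, 0.002, runaway_load=1.0))  # 1.0
print(per_node_cpu(1000, 8, 0.002, runaway_load=1.0))  # 1.0
```

In the first case autoscaling works as advertised; in the second, each new instance just gives the bug another CPU to saturate, which is why profiling the hot path, not adding instances, was the fix.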
Essentially, our failure is that our pre-event load testing was not a true representation of the load the real event puts on our site. We definitely feel your pain on this; trust me, we were actively engaged in trying to fix it and stayed on site until after midnight last night attempting to debug.
Hopefully today, and the rest of the conference, will proceed more smoothly.
Ah, good story from the trenches.
Hope your code gets better for BUILD.
@Duncanma: Thanks for the insight. While we don't like to see anyone fail, it's good to hear that you were able to scale up and that the issue was truly deeper. It's good to see, too, that the MS folks are human like the rest of us and that things happen. I, and I'm sure others, appreciate the honesty.