[Sorry for not hopping into this conversation earlier. I'm on paternity leave right now and not checking email/web very frequently.]
Wow, thanks for taking the time to watch and make your detailed comments. I appreciate it.
First, I'll apologize: this is my first time talking to the Channel 9 audience, and I obviously misjudged the degree of technical detail that you're interested in seeing.
Second, analyzing the patterns of behavior in a system without fully understanding the system's architecture/topology is exactly the challenge! The issue is that the large-scale systems I'm looking at (primarily Internet services, not enterprise systems) change frequently enough that any assumptions you make about base patterns and system characteristics might not be valid for long! And certainly, making assumptions about the characteristics of failures is a very bad idea!
The overall goal, then, is to develop scalable techniques that help operators understand their system's behavior (both its correct and incorrect operation) by summarizing patterns, detecting anomalies, and so on, while making only minimal assumptions (not "no assumptions") about the system and its behavior. On the machine learning side of things, this involves leaning heavily on non-parametric methods and robust statistics. On the systems side of things, the main questions are which key instrumentation points to choose and how best to respond to automated analyses (alerting operators, rebooting machines, etc.) to achieve the higher-level goal of maintaining a reliable system.
What makes this partially tractable right now is that 1) with a large system, you really get to take advantage of the statistical properties of seeing lots of data points; and 2) these techniques are not looking for anything very subtle; people can see these patterns pretty easily. Unfortunately, if you're trying to monitor 10k+ streams of data, you can't find enough people to look at them all individually, and you certainly can't flood a single person with all the data to get a sense of a "global" view.
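To make the "robust statistics" point a bit more concrete, here's a minimal sketch of the flavor of thing I mean. It's an illustration only, not the actual techniques from the talk: it compares each stream's latest value against the median of its peers, using the median absolute deviation (MAD) as the yardstick, so a few broken nodes can't drag the baseline around. The function name, the `latest` dictionary, and the 3.5 cutoff (a common rule of thumb for this score) are all my own assumptions for the example.

```python
from statistics import median


def flag_anomalous_streams(latest, threshold=3.5):
    """Return the names of streams whose latest value is far from the peer median.

    `latest` maps a (hypothetical) stream name to its most recent reading.
    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    values = list(latest.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median([abs(v - med) for v in values])

    if mad == 0:
        # More than half the peers agree exactly, so any other value stands out.
        return sorted(name for name, v in latest.items() if v != med)

    return sorted(
        name
        for name, v in latest.items()
        if 0.6745 * abs(v - med) / mad > threshold
    )


if __name__ == "__main__":
    # Twenty made-up nodes reporting similar values, one of them misbehaving.
    readings = {f"node{i}": 100 + (i % 5) for i in range(20)}
    readings["node7"] = 400
    print(flag_anomalous_streams(readings))
```

Note that there's no model of the system's topology here and no assumption about what a failure looks like; the only assumption is that most peers are behaving similarly at any given moment, which is exactly the kind of minimal assumption a large system lets you lean on.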
Feel free to email me if you want to chat more.
JParrish: Today, *people* definitely see and use these patterns all the time. Unfortunately, people don't scale well as systems grow even to 10k nodes (much less the larger Internet services!). There are some automated monitoring systems out there today (e.g., I think Zenprise is neat), but they're very specialized and have a lot of built-in knowledge about the systems they monitor.
Shoshan: Ugh, that's horrible! Did I actually do that?
Thanks again to all of you for taking the time to comment.