Emre Kiciman: Reliable Computing for Large Scale Distributed Systems
- Posted: Oct 11, 2006 at 3:48 PM
- 19,727 Views
- 11 Comments
I'm not trying to step on toes or anything, but this seems to be a waste of the bandwidth of OrcasWeb, and not a very informative Channel 9 piece at all, IMHO.
Wow. Sorry you feel so strongly about this... Emre can't discuss the specifics of what he's working on for hopefully obvious reasons...
The point of the video was to discuss the problem space. In fact, we do discuss how the low level processing works. Further, we discuss the relationship between Machine Learning (pattern recognition) and systems management. What do you mean by self-promotion?
C
Charles, is there a video glitch a minute and forty seconds into the interview after you asked about figuring the density of the Universe?
@ 00:16:45
This stuff is actually really interesting. So from what I'm gathering, Emre is simply a WAN-management mastermind? A lot of the stuff he's discussing is actually very relevant, because since I started my new job, I've been experiencing many new technologies which blow me away. Some of which are what he's talking about. The company I work for is a medical-related facility with offices littering the South-East Coast. I get automated emails from every one of the offices whenever their server detects any type of intrusion, or related warnings. This type of technology is awesome! Not to mention, our servers fix themselves most of the time and then email the "Techies" a summary of the activity.
Excellent Interview, Charles. Maybe he didn't get as deep into a specific subject as some others would have preferred, but I am a sucker for tech-talk, no matter how broad the conversation is - kudos.
No glitch. I edited out my inadvertent explanation of the meaning of everything.
C
This video was titled incorrectly. Previous material and, I'm sorry, common practices within distributed design are farther along than this video. I think that is all I really need to say.
What would a better title be in this case? This is a research video, not a product or how-to. I'm not certain I understand the problem here. Can you help me understand the issue(s)? Will help the next time around...
Thanks for the feedback,
C
First, 45 minute talking head videos aren't exactly exciting to begin with, right?! I see a young guy who has an amazing amount of statistics and probability knowledge, now-drowning in data, with some algorithms and passion about machine-learning. Great!
But, how has he leveraged some things that we have seen at Redmond and elsewhere (in past channel 9 videos), like interpretative machine-learning of sales trends data, and stuff like the great new instrumentation and perf counters sliding downhill in Vista?
You began to question Emre more on the "how's and why's" 35 minutes or so in..
First, let me say doing analysis on reliability with a large-scale distributed system without knowing its functions or base patterns is near-suicidal. It is akin to trying to analyze water not knowing its source.. things could be regular, irregular, irregularly regular, regularly irregular, and the whole gamut in-between (or maybe not even water!)
Nowadays, code-(wh*re)s like me [I'm a code-writing senior-architect for a large company with offices in 63 countries..] are trying to "bake-in" the concept of better instrumentation, exceptional-condition handling, reliable messaging, increased trust and security, better profile integrations etc. into our distributed apps, using service-oriented architortures (haha) and implementing plethoras of polymorphic use-cases and workflows. (Whew!)
How does Emre's research in pattern analysis identify clusters at the machine-level of a distributed system statistically? How could data be affected, for example (as Emre related to), with a distributed edge-network for content-delivery, or a distinct protocol sub-system (like some large-scale distributed source control networks)?
How is this machine-pattern analysis affected by denial-of-service attacks to various parts of the system? What about time-zone and multicultural (read:internationalization) issues in the data pools?? What kind of control statistics can we compare patterns analysis to if we do not know normal patterns vs. abnormal...
What kind of Microsoft technologies becoming available could assist in mitigating failures? (E.g., reliable messaging in WCF, profile services in ASP.NET for those checkouts, building loosely-granular interfaces, utilizing generics and strong-typing features with large data pools, CardSpace features for enterprise users, etc.) I would be less worried about the PR guys if you guys could find a way to connect more with us Microsoft-friendly developers using Microsoft technologies, and not just (as I said) some conceptual talk about pattern analysis and how much work a PhD is at Stanford. (And I'm certain pricey too.. so Emre, if you get down to Portland look me up and I'll buy the coffee.. lol..)
I guess what I would like to see is a second round.. something where we expound on how this can bring value to what we are doing at the enterprise level. How pieces of the large-scale distributed system may return valuable trends data (such as Christmas season vs. football season shopping)
What kind of tactics, scoping, span-of-control and rules affect the base statistical means in determining problem X can be made obsolete to increase reliability.
(As you pointed out, "What about the OS?".. obviously, there are some inherent reverse-scalability issues in trying to solve statistical problems without large numbers, etc.)
The bottom line is, I thought "This could have been so much more". Although I am glad I can even criticize it at all. It will just make things that much better here at Channel 9 as we all trudge ahead.
Thanks and have phun
-Eric
Don't [ever] put your hand over your mouth or nose.
It's really bad body language.
Btw: I liked this video, and I think MSR videos are really important!
Thank you for the feedback. I appreciate it. I will ask some of your questions in future interviews.
We will continue to cover systems management work going on at Microsoft. There is some very exciting stuff going on that we are not ready to talk about publicly. When we are, you can bet that we will have deep dives into the technologies behind the management products here on C9. This video was primarily just a conversation with a researcher working on problems in the management/reliability space. We can't discuss the gory details of Emre's work in a public setting at this point. I hope you understand...
Thanks again and keep posting!
C
[sorry for not hopping into this conversation earlier, I'm on paternity leave right now, and not checking email/web very frequently]
Hi Adenocard,
Wow, thanks for taking the time to watch and make your detailed comments. I appreciate it.
First, I'll apologize: this is my first time talking to the channel 9 audience. I obviously misjudged the degree of technical detail that you're interested in seeing.
Secondly, analyzing the patterns of behavior in a system without fully understanding the system's architecture/topology is exactly the challenge! The issue is that with the large-scale systems that I'm looking at (primarily Internet services, not enterprise systems), the system changes frequently enough that any assumptions you make about base patterns and system characteristics might not be valid for long! And certainly, making assumptions about the characteristics of failures is a very bad idea!
The overall goal is then to develop scalable techniques that help operators understand their system's behavior (both its correct and incorrect operation) by summarizing patterns/detecting anomalies/etc while making only minimal assumptions (not "no assumptions") about the system and its behavior. On the machine learning side of things, this involves leaning heavily on non-parametric methods and robust statistics. On the systems side of things, the main question is choosing key instrumentation points and figuring out how to best respond to automated analyses (alerting operators, rebooting machines, etc) to achieve the higher-level goal of maintaining a reliable system.
What makes this partially tractable is that right now, 1) with a large system, you really get to take advantage of statistical properties of seeing lots of data points; and 2) these techniques are not looking for anything very subtle --- people can see these patterns pretty easily. Unfortunately, if you're trying to monitor 10k+ streams of data, you can't find enough people to look at them all individually, and you certainly can't flood a single person with all the data to get a sense of a "global" view.
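To make the robust-statistics idea above concrete, here is a minimal sketch of one common non-parametric check: compare each stream's latest value against its peers using the median and the median absolute deviation (MAD), and flag only the streams that stand far apart. This is an illustration under stated assumptions, not the method used in the research discussed here. The function name `mad_outliers`, the stream names, and the sample values are hypothetical, and the 0.6745 scale factor with a 3.5 cutoff is a conventional choice for the modified z-score rather than anything from the interview.

```python
from statistics import median


def mad_outliers(latest, threshold=3.5):
    """Return the names of streams whose latest value is far from its peers.

    `latest` maps a (hypothetical) stream name to its most recent reading.
    The median and MAD are used instead of the mean and standard deviation
    because a few failing nodes barely move them.
    """
    values = list(latest.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)

    if mad == 0:
        # At least half the streams agree exactly; flag any that differ.
        return [name for name, v in latest.items() if v != med]

    # Modified z-score: 0.6745 rescales the MAD so it is comparable to a
    # standard deviation when the data is roughly normal (an assumption).
    return [
        name
        for name, v in latest.items()
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical request latencies (ms) reported by five nodes.
readings = {"node1": 100, "node2": 102, "node3": 98, "node4": 101, "node5": 250}
flagged = mad_outliers(readings)
print(flagged)
```

The check assumes nothing about what a "normal" value is beyond "most peers look alike right now", which is why it keeps working as the system changes. An operator, or an automated response such as a reboot, then only needs to look at the flagged streams rather than all 10k+.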
Feel free to email me if you want to chat more.
---
JParrish: Today, *people* definitely see and use these patterns all the time. Unfortunately, people don't scale well as systems grow even to 10k nodes (much less the larger Internet services!). There are some automated monitoring systems (e.g., I think zenprise is neat) out there today, but they're very specialized and have a lot of in-built knowledge about the systems they monitor.
---
Shoshan: Ugh, that's horrible! Did I actually do that?
---
Thanks again to all of you for taking the time to comment.
Cheers,
Emre
Greetings Emre Abi, I hope we hear more from you...
PS: About body language: he is putting his hand to his chest while listening. In Turkish culture (I think Europeans do it as well) this means "I am listening to you carefully"...