Emre Kiciman: Reliable Computing for Large Scale Distributed Systems

41 minutes, 31 seconds


Emre Kiciman is a researcher in MSR focusing on solving hard problems in the field of automated systems management of large scale distributed systems. Basically, he works on algorithms and systems that can be used to understand how systems are working, independent of complexity, by analyzing low-level data captured at runtime.


Follow the discussion

  • OK, I got to 35 minutes in and I stopped watching.  I guess this could have been a useful video on exactly what Emre is working on. Instead, it seemed to me to be a self-promotional video of allocentric concepts in relation to the 'big picture' of a large-scale system.  I would have expected more than concepts from a Stanford PhD doing something within Microsoft.. like "okay, I learned these concepts and I want to introduce this really cool low-level implementation of xyz I'm working on for product x"

    I'm not trying to step on toes or anything, but this seems to be a waste of the bandwidth of OrcasWeb, and not a very informative Channel 9 piece at all, IMHO.
  • Charles Welcome Change

    Wow. Sorry you feel so strongly about this... Emre can't discuss the specifics of what he's working on for hopefully obvious reasons...

    The point of the video was to discuss the problem space. In fact, we do discuss how the low level processing works. Further, we discuss the relationship between Machine Learning (pattern recognition) and systems management. What do you mean by self-promotion?
  • jsampsonPC SampsonBlog.com SampsonVideos.com
    @ 00:01:40

        Charles, is there a video glitch a minute and forty seconds into the interview after you asked about figuring the density of the Universe?

    @ 00:16:45

        This stuff is actually really interesting. So what I'm gathering is that Emre is simply a WAN-management mastermind? A lot of the stuff he's discussing is actually very relevant, because since I joined my new job, I've been experiencing many new technologies which blow me away. Some of which are what he's talking about. The company I work for is a medical-related facility with offices littering the South-East Coast. I get automated emails from every one of the offices whenever their server detects any type of intrusion, or related warnings. This type of technology is awesome! Not to mention, our servers fix themselves most of the time and then email the "Techies" a summary of the activity.

    Excellent Interview, Charles. Maybe he didn't get as deep into a specific subject as some others would have preferred, but I am a sucker for tech-talk, no matter how broad the conversation is - kudos.
  • Charles Welcome Change
    jsampsonPC wrote:
    @ 00:01:40

        Charles, is there a video glitch a minute and forty seconds into the interview after you asked about figuring the density of the Universe?

    No glitch. I edited out my inadvertent explanation of the meaning of everything. :) (Just culled some of my rambling. Nothing more.)
  • I have to say I agree very much with what Adenocard said. Reason being that every day software designers are faced with very significant issues, and occasionally a pattern comes forward from multiple inputs to help determine the "right" way to approach a problem. This is not such a case.

    This video was titled incorrectly. Previous material and, I'm sorry, common practices within distributed design are further along than this video. I think that is all I really need to say. :(
  • Charles Welcome Change
    JParrish wrote:
    I have to say I agree very much with what Adenocard said. Reason being that every day software designers are faced with very significant issues, and occasionally a pattern comes forward from multiple inputs to help determine the "right" way to approach a problem. This is not such a case.

    This video was titled incorrectly. Previous material and, I'm sorry, common practices within distributed design are further along than this video. I think that is all I really need to say.

    What would a better title be in this case? This is a research video, not a product or how-to. I'm not certain I understand the problem here. Can you help me understand the issue(s)? Will help the next time around...

    Thanks for the feedback,
  • OK Charles, I'm glad you asked!   I will expound on why I snickered a bit at this video. I went back and watched the whole thing again.

    First, 45-minute talking-head videos aren't exactly exciting to begin with, right?! I see a young guy who has an amazing amount of statistics and probability knowledge, now drowning in data, with some algorithms and passion about machine learning. Great!

    But how has he leveraged some things that we have seen at Redmond and elsewhere (in past Channel 9 videos), like interpretative machine learning of sales trends data, and stuff like the great new instrumentation and perf counters sliding downhill in Vista?

    You began to question Emre more on the "hows and whys" 35 minutes or so in..
    First, let me say doing analysis on reliability with a large-scale distributed system without knowing its functions or base patterns is near-suicidal. It is akin to trying to analyze water not knowing its source.. things could be regular, irregular, irregularly regular, regularly irregular, and the whole gamut in between (or maybe not even water!)

    Nowadays, code-(wh*re)s like me [I'm a code-writing senior architect for a large company with offices in 63 countries..] are trying to "bake in" the concept of better instrumentation, exceptional-condition handling, reliable messaging, increased trust and security, better profile integrations etc. into our distributed apps, using service-oriented architortures (haha) and implementing plethoras of polymorphic use cases and workflows. (Whew!)

    How does Emre's research in pattern analysis identify clusters at the machine level of a distributed system statistically?  How could data be affected, for example (as Emre alluded to), with a distributed edge network for content delivery, or a distinct protocol sub-system (like some large-scale distributed source control networks)?

    How is this machine-pattern analysis affected by denial-of-service attacks on various parts of the system? What about time-zone and multicultural (read: internationalization) issues in the data pools? What kind of control statistics can we compare pattern analysis to if we do not know normal patterns vs. abnormal...

    What kind of Microsoft technologies becoming available could assist in mitigating failures? (E.g., reliable messaging in WCF, profile services in ASP.NET for those checkouts, building loosely-granular interfaces, utilizing generics and strong-typing features with large data pools. What about CardSpace features for enterprise users, etc. etc. etc.) I would be less worried about the PR guys if you guys could find a way to connect more with us Microsoft-friendly developers using Microsoft technologies, and not just (as I said) some conceptual talk about pattern analysis and how much work a PhD is at Stanford. (And I'm certain pricey too.. so Emre, if you get down to Portland look me up and I'll buy the coffee.. lol.. )

    I guess what I would like to see is a second round.. something where we expound on how this can bring value to what we are doing at the enterprise level.  How pieces of the large-scale distributed system may return valuable trends data (such as Christmas season vs. football season shopping)

    What kind of tactics, scoping, span-of-control and rules affect the base statistical means in determining whether problem X can be made obsolete to increase reliability?
    (As you pointed out, "What about the OS?".. obviously, there are some inherent reverse-scalability issues in trying to solve statistical problems without large numbers, etc. )

    The bottom line is, I thought "This could have been so much more". Although I am glad I can even criticize it at all.  It will just make things that much better here at Channel 9 as we all trudge ahead.

    Thanks and have phun :)

    -Eric
  • Don't [ever] put your hand over your mouth or nose.

    It's really bad body language.

    I liked this video, and I think MSR videos are really important!
  • Charles Welcome Change

    Thank you for the feedback. I appreciate it. I will ask some of your questions in future interviews.

    We will continue to cover systems management work going on at Microsoft. There is some very exciting stuff going on that we are not ready to talk about publicly. When we are, you can bet that we will have deep dives into the technologies behind the management products here on C9. This video was primarily just a conversation with a researcher working on problems in the management/reliability space. We can't discuss the gory details of Emre's work in a public setting at this point. I hope you understand...

    Thanks again and keep posting!

  • [sorry for not hopping into this conversation earlier, I'm on paternity leave right now, and not checking email/web very frequently]


    Hi Adenocard,

    Wow, thanks for taking the time to watch and make your detailed comments.  I appreciate it.

    First, I'll apologize:  this is my first time talking to the Channel 9 audience.  I obviously misjudged the degree of technical detail that you're interested in seeing.

    Secondly, analyzing the patterns of behavior in a system without fully understanding the system's architecture/topology is exactly the challenge!  The issue is that with the large-scale systems that I'm looking at (primarily Internet services, not enterprise systems), the system changes frequently enough that any assumptions you make about base patterns and system characteristics might not be valid for long!  And certainly, making assumptions about the characteristics of failures is a very bad idea!

    The overall goal is then to develop scalable techniques that help operators understand their system's behavior (both its correct and incorrect operation) by summarizing patterns/detecting anomalies/etc while making only minimal assumptions (not "no assumptions") about the system and its behavior.  On the machine learning side of things, this involves leaning heavily on non-parametric methods and robust statistics.  On the systems side of things, the main question is choosing key instrumentation points and figuring out how to best respond to automated analyses (alerting operators, rebooting machines, etc) to achieve the higher-level goal of maintaining a reliable system.

    What makes this partially tractable is that right now, 1) with a large system, you really get to take advantage of statistical properties of seeing lots of data points; and 2) these techniques are not looking for anything very subtle --- people can see these patterns pretty easily.  Unfortunately, if you're trying to monitor 10k+ streams of data, you can't find enough people to look at them all individually, and you certainly can't flood a single person with all the data to get a sense of a "global" view.

    Feel free to email me if you want to chat more.


    JParrish: Today, *people* definitely see and use these patterns all the time.  Unfortunately, people don't scale well as systems grow even to 10k nodes (much less the larger Internet services!).  There are some automated monitoring systems (e.g., I think zenprise is neat) out there today, but they're very specialized and have a lot of in-built knowledge about the systems they monitor.


    Shoshan:  Ugh, that's horrible!  Did I actually do that?


    Thanks again to all of you for taking the time to comment.


  • xenonysf merhaba
    First of all, it was a very interesting video for me... And I loved that a Turkish person is speaking on C9; from his name and the way he says it I can tell that he is Turkish like me as well :) I hope we will hear more about his research. Actually it is a very hard topic for me, but it gave me some new ideas about such large-scale systems. I am studying computer science engineering at www.deu.edu.tr and hope that I'll be a professional like him.

    Greetings Emre Abi, I hope we hear more from you... :P

    PS: About body language: He is putting his hand to his chest while listening. In Turkish culture (I think Europeans do it as well) this means "I am listening to you carefully"...
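
For readers who wanted something more concrete: Emre's reply above mentions leaning on robust statistics to spot anomalies across many data streams while making only minimal assumptions about the system. The sketch below is one common robust technique (the median and median absolute deviation, or MAD), not Emre's actual method, which the thread does not describe. The function name, the 3.5 threshold, and the sample error rates are all illustrative assumptions.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values that sit far from the median.

    Uses the modified z-score 0.6745 * |x - median| / MAD. The median and
    MAD are barely moved by a few extreme points, so no model of what a
    failure looks like is needed. Threshold 3.5 is a conventional default,
    assumed here rather than taken from the discussion.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # At least half the values equal the median; as a simple fallback,
        # flag anything that differs from it at all.
        return [i for i, d in enumerate(abs_dev) if d > 0]
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical per-node error rates; node 5 is misbehaving.
error_rates = [0.010, 0.012, 0.011, 0.009, 0.013, 0.250, 0.010, 0.011]
print(mad_outliers(error_rates))
```

Comparing each node against its peers like this only works when there are many similar nodes to compare, which matches the point in the reply that large systems let you take advantage of seeing lots of data points.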

