Tech Off Thread

19 posts

Longhorn back to basics: Error messages that make sense!

  • lars

    I have a suggestion for an improvement that I think is long overdue. And this is going to be a real rant, so hold on to your hats.

    Short version: Something should be done to make system errors more understandable to the end user.

    Rant: Right now there are two levels of errors.
     
    The ones you get in Windows that are pointless because they don't give enough information, like "Unable to print", and then you know that the Wizard is just going to ask you if the cable is plugged in and then give up.

    And then there is the other kind, the kind that is horribly undecipherable. The information you get when Windows blue screens is practically useless for the average Joe, and for most computer professionals outside Microsoft as well. When my computer reboots and I find an entry in the event log that says "event ID 1003, category 202, source: unknown, System Error", what am I supposed to do with that information? First stop is the Knowledge Base. And I'm grateful for that. But if that turns up a blank? Call Microsoft Technical Support and start handing over money? And to make it even more useless, it's even translated. Probably by someone who didn't know what it meant either...

    It's like being right back in the good old days when the Amiga coughed up a [guru meditation] with some strange number you could look up to find out which subsystem had crashed. At least that was cute. Now it's 20 years later, and I still have to go to www.softwaretipsandtricks.com or www.eventid.net or some other strange website. And if I'm lucky, and that's a big IF, they have a listing relevant to my case. Is it so hard to put something understandable in the event log? Or is it obscure by design to generate support call revenue? Talk to me. But don't tell me to analyze the crash dump. Normal end users don't do that. And I shouldn't have to do it either.

    Get some guys over from the Longhorn translucent spinning 3D prettywindows team and put them to work writing real error messages, and get the basics right first.

    /Lars.

  • Knute

    Lars,

    I am afraid you will need to face the fact that the majority of people who use computers couldn't care less what the error ID is. They just want the computer to work. Not everyone is willing to troubleshoot their own computer.

    Because of that fact, this will never happen. Farley T. User wants innovation and cool whiz-bang features before he is going to part with his hard-earned money for the next version of Windows.

    ~ Knute

  • lars

    Knute wrote:
    They just want the computer to work.


    That's even better. Then I won't even need the error messages. Just make it work. Please.

    /Lars.

  • Knute

    Well it appears to be working now, unless you are somehow magically accessing this site another way.

    ;)

    ~ Knute

  • lars

    In that respect using Windows is much like driving a Ferrari. You need to have two of them: one to drive while the other one is in the shop. Or at least so richer folk than me say. :)

    Anywho, this is more of a general rant for the benefit of mankind than a sneaky way to get free support with my specific case. (IMHO, of course)

    Am I the only one who thinks this is frustrating?

    How do you guys at Microsoft handle problems with your own computers? Do you just give up and reinstall like the rest of us when the going gets tough? What do you tell your parents when they call and ask for your advice* on some mysterious error with code X in the event log?


    /Lars.

    *No, referring to the NDA is not a valid answer :)

  • Charles

    Ranting is fine, Lars. Keep it up! 

    Sometimes, either due directly to some user action or more often by some process (say, some app with a poor memory management implementation that's leaking memory...) the system can become unstable or a process will crash and an exception message will be surfaced by the OS.

    The causes of unhandled exceptions are often in and of themselves cryptic in nature. Memory reading exceptions are not easy to make understandable to the layman. When an app can't read from memory it expected to be able to read from, then the app may crash on you. When a process tries to write to a place it doesn't have authority to write to, or that is being used by some other process, an exception will be raised.

    Typically, things like memory address space allocation/de-allocation and file IO are concepts that do not make a lot of sense to the average user. That said, we are working on this problem, and with every iteration of Windows, thinking about and constructing meaningful error messages, even for the most arcane of exceptions, is always on the radar.

    Remember, for exceptions that cause system stability issues, reliability problems, app crash, app hang, etc, when you see the prompt to send the error details to Microsoft you can rest assured that the details of the exception will be read by somebody who really understands what it means. So, keep on sending the data to us!

    As to your question about what we Microsoft people do when confronted with system stability/reliability problems, it depends on the individual. A system developer, for example, will most likely hook up a kernel debugger to her system and try to identify the cause, then pass the info on to the appropriate team. The less super technical type of person, and there are many, would call help desk or back up data and re-install... I think the latter is extreme and often unnecessary.

    If you find yourself re-installing an operating system all of the time to combat system stability problems, I'd suggest contacting the manufacturer's support division and getting some help figuring out what is wrong.

    Nailing system reliability and stability is a huge task in any complex operating system. We have come a long way with Windows and even though we are far from perfect, we are making steady progress.


    Keep on posting,

    Charles

  • lars


    Thank you for taking my raving rants seriously, Charles. Great to see that someone is listening!

    Charles wrote:

    a process will crash and an exception message will be surfaced by the OS.


    When this happens, what I as a developer or computer admin want to know is the name of the process/program owning the thread that malfunctioned. Or if it's a driver, the name of the driver. If it always reported this, it would be so much easier to work out where to start looking. The "Source: unknown" entries drive me nuts.

    Charles wrote:

    Typically, things like memory address space allocation/de-allocation, file IO are concepts that do not make a lot of sense to the average user. 


    I agree. But with that somewhat cryptic information they have a basis to start formulating a question to someone who does understand how software works. Turning out "Event ID: 1003 (202)" gives very little clue as to who or what to ask.


    Charles wrote:

    That said, we are working on this problem, and with every iteration of Windows, thinking about and constructing meaningful error messages, even for the most arcane of exceptions, is always on the radar.


    Heck, I would even settle for an error code -> geekspeak dictionary. Like the event ID database site I mentioned in my first post. As long as it covered just about every event ID there is, and isn't just best guesses by other sysadmins. I assume some of this information is already in the KB. The problem is how to find the appropriate and relevant articles every time.
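
    A minimal sketch of what such an error code -> geekspeak dictionary could look like, in Python. Everything here is hypothetical: the EventExplanation type, the explain function and the single table entry are illustrative placeholders, and the wording of the entry is not taken from any Microsoft documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventExplanation:
    summary: str    # plain-language description of the event
    next_step: str  # where a technically minded user could start looking


# Hypothetical table keyed by (event source, event ID).
# The entry below is an illustrative placeholder, not official text.
EVENT_DICTIONARY = {
    ("System Error", 1003): EventExplanation(
        summary="The system restarted after a stop error (blue screen).",
        next_step="Note the stop code recorded with the event and search for it.",
    ),
}


def explain(source: str, event_id: int) -> str:
    """Return a readable explanation, or a clear fallback when none is on file."""
    entry = EVENT_DICTIONARY.get((source, event_id))
    if entry is None:
        return (f"{source} event {event_id}: no explanation on file; "
                "search for the source and ID together.")
    return f"{source} event {event_id}: {entry.summary} Next step: {entry.next_step}"


print(explain("System Error", 1003))
print(explain("unknown", 202))
```

    The point of keying on the source as well as the ID is that the same number can mean different things for different components, which is exactly why a bare "Event ID: 1003" is so hard to search for.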

    /Lars.

  • Cider

    Charles,

    Charles wrote:

    Remember, for exceptions that cause system stability issues, reliability problems, app crash, app hang, etc, when you see the prompt to send the error details to Microsoft you can rest assured that the details of the exception will be read by somebody who really understands what it means. So, keep on sending the data to us!

    Charles


    I've often wondered about this. What exactly do you do with these error details? I am not coming at this from a privacy viewpoint or whatever, but more logistically.

    Obviously, Microsoft must get millions of these reports sent to them. I am presuming that people do not sit down at each issue and work through them like in a call management system or suchlike. So what happens? Are they compiled statistically and then addressed in terms of what is most prevalent or what? Or some other method?

    Mark

  • eddwo

    I think you could do more community type stuff with the crash data you collect. Well over half of my 21 logged OCA reports say "This error was caused by a device driver" but none of them tell me which one it is. Do I need to start blaming Nvidia, Creative, Via or Microsoft? If I knew which was causing the problems I could vote with my wallet and choose a different IHV, or at least keep complaining to them till they fix their buggy drivers. (Actually a couple of them do now say "Resolution Found" and blame Nvidia, thanks for that.)
    It would be good to see how many other people's systems crashed with the same error in the same device driver, program or with the same bit of hardware. That way we could have a Hall of Shame feature to highlight the worst offenders.
    I guess Longhorn error reporting will include the complete .NET stack trace; it should be much easier to determine the problems that way.

    Any hints on how I can go about troubleshooting my current problem? I can't get S3 Suspend-to-RAM to work properly. It will go into standby fine, but when I want to wake it up all I get is a screenful of coloured character blocks. Something is not reinitialising right, but what is it? The only thing I can do is reboot, but there is no error report from this. It's a really great feature when it works, it just doesn't seem to go on working for more than a couple of months at a time.

    I've often wondered what people at MS do about performance degradation issues. Most big corporations seem to just re-image the disk at the first sign of trouble. You must have thousands of non-technical users, finance and marketing people etc. What happens when they start to experience problems? Is there anyone who can work out why explorer keeps crashing, or why they keep seeing the flashlight waving when they open the control panel? For every problem there must be a solution and a reinstall or image should never be required.
    I hope someone dog-foods Longhorn for at least 6 months before it is released, to really iron out all the long-term issues that will no doubt arise.

  • Charles

    lars wrote:


    Heck, I would even settle for an error code -> geekspeak dictionary. Like the event ID database site I mentioned in my first post. As long as it covered just about every event ID there is, and isn't just best guesses by other sysadmins. I assume some of this information is already in the KB. The problem is how to find the appropriate and relevant articles every time.

    /Lars.


    All I can say, Lars, is that all of this stuff will get much better in Longhorn. I know that's hand waving, but it is true and I have talked to the reliability people several times, including forwarding them some of your related posts.

    Longhorn will be better at this. We promise. :)

    Charles

  • Charles

    Cider wrote:

    So what happens? Are they compiled statistically and then addressed in terms of what is most prevalent or what? Or some other method?


    Good questions. You are correct in that not every single blob of crash data is perused by a human being. This would be a daunting task. Then again, this doesn't mean that data is ignored. In fact, as you guessed, statistics are run on the data and the most prevalent type of problem is addressed first. Now, priority is based on the type of problem. That is, an app hang will garner less attention by definition than an app crash or, especially, a system crash.
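
    As a rough illustration of the triage described above (count identical reports, then rank by problem type before prevalence), here is a small Python sketch. It is not how Microsoft's actual pipeline works; the prioritise function, the SEVERITY_WEIGHT values and the report format are all assumptions made for the example.

```python
from collections import Counter

# Assumed ranking, following the post: system crash > app crash > app hang.
SEVERITY_WEIGHT = {"system_crash": 3, "app_crash": 2, "app_hang": 1}


def prioritise(reports):
    """Group identical (kind, signature) reports and order the groups.

    Groups are sorted by problem type first, then by how often they occur,
    so a rare system crash still outranks a very common app hang.
    """
    counts = Counter(reports)
    return sorted(
        counts.items(),
        key=lambda item: (SEVERITY_WEIGHT[item[0][0]], item[1]),
        reverse=True,
    )


# Hypothetical sample data: (kind, faulting module) pairs.
reports = (
    [("app_hang", "a.exe")] * 5
    + [("system_crash", "drv.sys")] * 2
    + [("app_crash", "b.exe")] * 3
    + [("app_crash", "c.exe")]
)

for (kind, module), count in prioritise(reports):
    print(f"{kind:13} {module:8} {count} report(s)")
```

    Sorting on a (severity, count) pair is just one possible policy; a real system might instead multiply the two or cap the influence of either.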

    It's always a good idea to send crash data to Microsoft. Now, if an app hangs because of something you've done (like writing infinite recursion into one of your algorithms) and you kill the process immediately, which in turn invokes the error reporting wizard, then the data doesn't really reflect a problem with the system or an app running on it, so it's not a big deal if you choose not to send it... But, again, when an app just crashes for seemingly no reason or, of course, the system crashes, fire that data off to us please!


    Charles

  • lars

    eddwo wrote:

    Is there anyone who can work out why explorer keeps crashing, or why they keep seeing the flashlight waving when they open the control panel? For every problem there must be a solution and a reinstall or image should never be required.


    Very well put! I asked about that class of errors in this thread:
    Give me your best Windows diagnostics tips!
     
    Somehow all roads lead right back to jumping in the deep end and getting all wet with the kernel debugger.

    /Lars.

  • lars

    Charles wrote:
    I have talked to the reliability people several times, including forwarding them some of your related posts.


    Great! Thank you!

    /Lars.
     

  • Charles

    eddwo wrote:
    For every problem there must be a solution and a reinstall or image should never be required.

    Amen.
    eddwo wrote:

    I hope someone dog-foods Longhorn for at least 6 months before it is released, to really iron out all the long-term issues that will no doubt arise.

    Everybody at Microsoft will be dog-fooding Longhorn well before it is released. This was the case with XP. Still, problems do get by us. I think the reason for this is that testing every single possible combination of device drivers is next to impossible as is successfully testing every possible combination of applications and services. 

    It's very suspect in my opinion to guarantee no crashes ever again. Remember, the Windows kernel is a monolithic one. This means that drivers that go crazy can crash your machine. (This is also the case for operating systems like Linux and Unix) This doesn't mean we aren't trying as hard as we can to make Longhorn our most stable and reliable OS ever. Just the opposite is true. We thrive on challenges and this is a really big fish.


    Charles

  • lars

    Charles wrote:
    Remember, the Windows kernel is a monolithic one. This means that drivers that go crazy can crash your machine.


    Was it ever considered for Longhorn to change this? To add memory protection to kernel mode code? Or is the performance penalty so high that it is impossible?

    /Lars.

  • eddwo

    Well, they are doing some stuff. Like taking GDI for display and printers out of kernel mode. Looks like all the traditional GDI stuff will be done entirely in user mode, with the result sent as a texture through Direct3D. This should significantly reduce the complexity of graphics drivers and increase overall stability.

    But other stuff seems to be being added to kernel mode, like the HTTP handler to speed up Indigo. I'm a bit concerned that the real-time stuff for "glitch-free" video is going to unnecessarily increase the kernel complexity.

    The NGSCB stuff is supposed to provide a level of memory protection over and above kernel mode, but I don't think it would stop things going wrong since it is not in overall control of the computer.

    Theoretically, would it be possible to move to using more protection modes? The original reason for only using rings 0 and 3 was for portability to other processor architectures that only supported 2 modes. Except soon all the other architectures were dropped and only x86 was left. How many protection modes do IA64 and x86-64 support? I guess it would take too much work to make such a fundamental change.

  • amg

    I will take cryptic, "google-able" error codes any day of the week over, for example, what Macs spit out. For 10+ years I've seen Macs misbehave (not as much in the past few years...doctor thinks it best if I stick to Windows).

    When a Mac misbehaves it gives an error code a layman can understand. Of course, it could just beep and draw a pretty picture (no text) and accomplish the exact same thing. The error dialogs provide near zero to go on from a troubleshooting perspective...thus making Macs a platform of almost voodoo-like troubleshooting shenanigans. An example would be...whilst I was typing this paragraph I switched over to another browser window and helped someone over at Apple's Discussions understand what a "-69 error" is with her iPod.

    The point is...as long as we can finally eliminate the "BSOD legacy" I'd love to have descriptive errors going forward. :)

  • androidi

    One possible "solution":

    Keep the error messages in English and make some special hidden window that can always be brought up with some key combo or whatever. This window would log all the more cryptic data, but no hex dumps. I mean stuff like where the error occurred and what was being done just before that (in such a way that you can understand it if you are a techie, not only if you are an MS guru or the developer of the crashed driver).

    I imagine that for most 3rd party code running on the average user's computer, it would not be too high a performance penalty to include more "debug data" in the retail builds, allowing such a log window to show a bit more detail than what the event viewer is showing.

    Kind of like when you start up Linux, you get a lot of tech data which can help diagnose small problems. The average Joe may not understand it, but there are a lot more people who understand it than there are people who understand what some hex dump of memory contains. Even though I have been heavily involved in computers since the C64, the only information I usually understand in the Windows error technical data is which driver the problem occurred in, if it is even shown. In these cases I usually have to apply my long experience in computers to make an educated guess at what might have caused the error. A detailed log window could save hours of guesswork.
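
    The log window idea above could be backed by something as simple as a fixed-size buffer of recent activity that is only shown when something goes wrong. Below is a minimal Python sketch under that assumption; the BreadcrumbLog class, its method names and the sample component names are all made up for illustration and do not correspond to any real Windows facility.

```python
from collections import deque


class BreadcrumbLog:
    """Keeps the last few readable "what was happening" notes in memory."""

    def __init__(self, capacity: int = 50):
        # A bounded deque drops the oldest note automatically, so the
        # cost of logging stays small and constant.
        self._entries = deque(maxlen=capacity)

    def note(self, component: str, action: str) -> None:
        self._entries.append(f"{component}: {action}")

    def report(self, component: str, error: str) -> str:
        """Build the text a hidden log window could display after a failure."""
        lines = [f"Error in {component}: {error}",
                 "Recent activity (oldest first):"]
        lines.extend(f"  {entry}" for entry in self._entries)
        return "\n".join(lines)


# Hypothetical usage: component names and messages are invented.
log = BreadcrumbLog(capacity=3)
log.note("print spooler", "job queued")
log.note("usb driver", "device reset requested")
log.note("usb driver", "no response from device")
log.note("print spooler", "sending page 1")
print(log.report("print spooler", "unable to print"))
```

    Because the buffer is bounded, the first note in this example has already been dropped by the time the report is built, which is the trade-off for keeping the overhead low in retail builds.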
