Windows, Part I - Dave Probert

The Discussion

  • User profile image
    Notice he doesn't visit the UK...

    Microsoft _ALWAYS_ skips over the UK Sad

Very good video (set)... although I must admit some of
the concepts are a little over my head. I am still trying
to get my head around concurrency and this only confused
me more... So yeah...

I wish someone would talk about the sub-systems; of
particular interest to me is the GUI sub-system
(drawing cr@p to the screen).
  • User profile image
Always interesting to hear from another UCSB grad in the sciences. He makes a great point about how "closed source" company secrets can be a bad thing as well as a bling-bling thang...
  • User profile image
    Manip wrote:
    I am still trying to
    get my head around concurrency and this only confused me
    more... So yeah...

You need to trudge through Larry Osterman's 15(!)-part posting about concurrency.  Long but well worth it.

    Here's part one.
  • User profile image
Dave's explanation of Hyperthreading is really ham-fisted. Let's get dual cores out of the way first: that's just two processors on the same piece of silicon, literally in the case of the Pentium D. Sometimes those processors share a level 2 cache to save space, but that is no different from the days when the L2 cache was external to the chip and, on multi-CPU machines, may have been shared by two or more processors. Whichever way you look at it, though, with dual core you are getting two actual processors, and two threads will execute in parallel.

Hyperthreading (HT) is not like that at all. With HT you still only have one thread executing at a time, just like on a conventional single-processor machine. The difference is that with HT the processor has the ability to hold two process contexts at the same time. This is crucial because the OS multi-tasks by switching process contexts every so often, and a context switch takes a long time. With modern heavily pipelined CPUs, sometimes the CPU needs code or data that isn't in the on-chip cache and has to wait for main memory. That wait is an eternity in modern CPU terms, and in a normal CPU the time is simply wasted, because switching to another process and doing work there would take even longer. With HT the CPU can switch instantly to another process that already has its context loaded and use the dead time to do work there (assuming that process isn't also waiting for off-chip code/data). Ultimately it's a way of using processor resources more efficiently - you add maybe 15% complexity to the chip design for a 30% increase in throughput (but only when there are multiple processes trying to make use of the CPU).

Windows virtualizes a Hyperthreaded CPU as two processors because it deals in process contexts, and there are two on an HT CPU even though only one can execute at a time.

Now, Dave's explanation was simpler and shorter, but really it's wrong, because there aren't two CPUs that sometimes have to wait because they share resources like the floating point unit. Instead there is one CPU that happens to duplicate some of its resources (like the register file), but not the ones that actually do the work.

    You could argue that it's just another way of looking at it, but it's a pet hate of mine when people try to simplify explanations past the point where they actually make sense - it's what leads to bad science.

By the way, Hyperthreading is just Intel's trademark name for this mechanism. It was originally proposed for the stillborn Alpha EV8 architecture. Insider Update has a series of articles on the subject that explain it in its full gory detail.

    Great video though. I really liked the story of how Dave's kids don't like computers because of his Sparc Station1. Oh and btw. does Charles work out? Smiley
  • User profile image
This was an extremely interesting video; I'll be looking forward to the next ones.
Dual-core solutions are obviously the way the world is going at the moment, and I would be interested to know what kind of updates in the kernel space are planned for Longhorn.

If, in the future, we are heading towards massively parallel computing in the home, it's difficult to see how the pace of application development will keep up. With simple core speed increases, one's application always runs faster. To take advantage of 2 cores, however, needs some multithreading work, as is being done now in limited amounts; but go over 2, and the only things that benefit today are distributed rendering and science applications - hardly home-use stuff. If our desktop in 2010 has 8 cores, for instance, one's dual-core (2-thread split) multithreading-aware game or application is unlikely to see a performance boost. One would need to code with an 8-way load split in mind. So each generation of application is going to become increasingly targeted to its generation of processors (2-way, 4-way, 8-way, etc.), because coding for an architecture more parallel than the one you are using would be a waste of time for developers, who often only foresee their application being used within the time frame of the current generation of processors. I also think it would be very difficult to split anything other than a rendering or science app into something that executes simultaneously on 8 cores, especially if the onus is on the developer.

To be truly future-aware, you would need a system that could take any task marked as parallelizable and split it into a number of threads matching the cores of the processors on that system. Otherwise, like I said above, you will end up with a lot of apps targeted at dual core, then quad core, etc., with no application able to take advantage of a processor with more cores than it was originally written for. The parallelism needs to be abstracted away from programmer-written threads into some new structure which can be automatically split into however many threads the cores on the system support.

So, I ask, wouldn't that be a task for the kernel in the future?

I hope that made sense. I'll push Post and hope someone understands it Smiley
  • User profile image
| With HT the CPU can switch instantly to another process that already
| has its context loaded and use the dead time to do work there
| (assuming that process isn't also waiting for off-chip code/data).

rhm, you're wrong. Dave's explanation is correct;
you're confusing HT (which is Intel's name for SMT) with other thread-level-parallelism methods.
With SMT you do have two separate threads running at the same time; they compete for the same set of execution units (ALUs, FP units, SIMD units, branching units, ...), which is what Dave meant by resources.
SMT works like this: every clock cycle, the out-of-order logic of the CPU must figure out how to fill all its execution units with the instructions coming in. It looks at a window of the instructions waiting to be executed, figures out which ones must or can happen now, and which can be executed in parallel. Worst case: only one instruction can be executed, i.e. one unit is used and all the others have to idle. If several instructions can be executed in parallel, the situation is better, because several execution units are used and instruction-level parallelism (ILP) is exploited.
HT/SMT is just a way of improving this, by simply offering two streams of instructions (two threads) that the out-of-order logic can use to fill the execution units. So if one thread has only one instruction that can be executed, the out-of-order logic simply looks at the second stream and chooses some instructions from there. In the ideal case this looks like: thread 1 needs one ALU, thread 2 needs an ALU and an FP unit, and so on.

    The problems with this approach are: if both threads need all the ALU units they can get, then they obviously can't run at the same time.
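That issue-slot picture, including the contention case, can be turned into a toy simulation (the model and all numbers here are made up purely for illustration; real out-of-order cores are vastly more complicated):

```python
def cycles_to_drain(threads, issue_width=2):
    """Toy SMT model. Each thread is (instructions, ilp), where ilp
    is how many of that thread's instructions can issue per cycle.
    Each cycle the core fills up to issue_width slots; slots one
    thread cannot fill are offered to the next thread."""
    work = [[n, ilp] for n, ilp in threads]
    cycles = 0
    while any(n > 0 for n, _ in work):
        slots = issue_width
        for t in work:
            take = min(t[0], t[1], slots)
            t[0] -= take
            slots -= take
        cycles += 1
    return cycles
```

A single low-ILP thread (8 instructions, 1 issuable per cycle) wastes half of a 2-wide core and takes 8 cycles; add a second such thread and the spare slots soak it up, so 16 instructions still finish in 8 cycles. But two threads that each want the full issue width get no overlap at all - they take as long as running back-to-back, which is exactly the contention problem above.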

    What you're thinking of (the CPU internally switching threads on a data access or cache miss) is talked about in this article:

  • User profile image
    You can tell what I'm thinking - I'm reporting you to PsiCorps.

Actually, I've never even heard of "ThreadLevelParallelism". And you'd think that since I linked a bunch of articles that explain SMT in detail, I would know the difference. I see your explanation and it doesn't sound any different from mine. Both of them sound very different from Mr. Probert's Discovery Channel-esque explanation, though.
  • User profile image

    MarkPerris --

Applications can have more threads than are available in hardware. There are plenty of applications that benefit from multithreading even on single-core/non-HT systems. Servers spin up new threads for each client request. Many Windows applications use separate threads so that long-running operations don't affect GUI responsiveness.

On a single-core, single-hardware-thread CPU, the CPU runs one thread at a time and switches between the available threads after a certain amount of time. Even though only one thread is running at any given instant, performance can be gained because execution time is yielded to other operations rather than each operation having to fully complete (which could take a long time) before the next one starts.

Using this same multithreaded application on a system with multiple hardware threads (some combination of multi-core/multi-CPU, etc.) can provide further performance benefits to the same, unchanged application. Instead of running one thread at a time and switching between threads at a given interval, a multi-core/CPU system can run each of those threads in parallel, one on each core/CPU simultaneously.

    There can be some architectural details that may require extra tuning to maximize performance on a given architecture, but the above holds true for the general case.
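The GUI-responsiveness pattern mentioned above can be sketched in a few lines (the names `long_running_job` and `results` are made up for illustration; Win32 threads, pthreads, or .NET follow the same shape):

```python
import threading
import queue

results = queue.Queue()

def long_running_job(n):
    # Stand-in for a slow operation: a big computation, file or network I/O.
    results.put(sum(range(n)))

# The "GUI" thread hands the work off and stays free to pump messages.
worker = threading.Thread(target=long_running_job, args=(1000000,))
worker.start()

# ... the message loop keeps running here, so the UI stays responsive ...

worker.join()              # a real GUI would poll the queue instead of blocking
answer = results.get()
```

The same unchanged program gets a further boost on a multi-core box, because the worker thread and the UI thread can then truly run in parallel.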

  • User profile image

What you say is most certainly correct, and I do agree with it, but it's not quite the point I wanted to make in my original message. You are right in saying that even on single-core systems multithreaded apps will get a performance benefit because of better handling of blocking and such (and all the more on HT), but that doesn't really make much use of the power of dual cores. The same goes for applications which run the GUI in one thread and the processing in another; having owned an Athlon MP system, the difference is very small. Very rarely does any multithreaded app utilise more than 50% of the dual processors' time (i.e. 100% of the timeframe of one CPU), showing that little work is being spread onto the second processing unit (unless one runs a specifically SMP-aware app, of which there are frighteningly few).

For dual cores, any significant speed-up is not going to come from multithreaded I/O queueing or a GUI thread/app thread split, as this load level is already well handled in the context of a single CPU, so to speak. As you correctly say, in the server environment this is great - one can create many threads for client requests, etc. - but on the desktop the only benefit is likely to come from multitasking in certain situations, for example compressing CD audio to WMA while trying to play a video. And I think that's how dual cores will be marketed. That's not a bad thing, but it's not going to give any single app a definite speed increase, which is what most home users are looking for.

Eventually, multi-core-aware applications will arrive that make use of both CPUs in one application context and that aren't high-end video encoders or renderers, but until that time comes, most dual cores are likely to go rather underused.

This leads to my second argument: because even programming 2 threads is difficult enough, software will likely be written with only 2 cores in mind. When highly multi-core processors arrive, in 2010 or sometime around then, we're back to the same problem: on a 4-core system, for example, only 50% of the CPU time would be used (2 CPUs' worth) unless the developer specifically recodes the app to create 4 threads.

What is needed is a way to mark, in code, any task as parallelizable, without going down to the thread level in the program. The kernel should be able to create enough threads to occupy all the processors based on some abstracted definition of the parallel task. That would mean that however many cores one's CPU has, an application with an explicitly parallelizable section automatically uses all of the available computing power on the host system. I don't know how it would be done in practice, but it's just an idea.
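For what it's worth, the "split to match the machine" part of that idea can already be sketched in user mode (`parallel_sum` is a hypothetical example; note that in CPython the GIL means OS threads only illustrate the chunking logic - real CPU-bound speedup needs processes or a runtime without that limit):

```python
import os
import threading

def parallel_sum(data):
    """Split a reduction into as many chunks as the machine has
    logical processors, discovered at run time instead of being
    hard-coded by the programmer."""
    nworkers = os.cpu_count() or 1
    chunk = (len(data) + nworkers - 1) // nworkers
    partials = [0] * nworkers

    def work(i):
        partials[i] = sum(data[i * chunk:(i + 1) * chunk])

    threads = [threading.Thread(target=work, args=(i,))
               for i in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)
```

The key point is that the thread count is not written into the program: the same binary would fan out 2 ways on a dual core and 8 ways on an 8-core machine.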
  • User profile image
    This is the interview I've been waiting for since Channel 9 began Smiley. I think I brought up Windows internals some time back on the forum. Did you have it planned all along, or did you just put it up because I asked for it (because I'd be flattered if you did)? Wink

    This is great guys. I'm downloading the video right now. Thanks.

  • User profile image
    Glad you like this, Gandalf.

This is the first set in a new C9 video series, Going Deep. Windows will be covered for a while, given its size and complexity. Then other applications will be the target of deep investigation and analysis. That's the idea, anyway. Basically, I wanted us to get a bit more heavy on the tech side, so I thought this up as a way to achieve that.

    BTW, Niners' feedback is always taken into consideration around here.


  • User profile image
    "Bob" is making a comeback? All right! I remember reading that Melinda Gates was somehow involved with "Bob". Is that right? The phone numbers on a black board is also very funny. I guess that's where Windows 3 gets its task scheduler from. Smiley
  • User profile image
    Minh wrote:
    "Bob" is making a comeback? All right! I remember reading that Melinda Gates was somehow involved with "Bob". Is that right? The phone numbers on a black board is also very funny. I guess that's where Windows 3 gets its task scheduler from.

Bob making a comeback!  Ha, BG would be happy.
  • User profile image
    Beer28 wrote:
    I would really like to see the NT kernel open sourced if at all possible.

No worries, mate. If Longhorn does not come in 2006, Apple could drive MS out of business, and after a few hundred years it could be legal to "open" the leaked Win2k source. If your presence is still with us in the collective cybermind, you may take a look Wink

    Disclaimer/MS-Special Removal Unit: This does not imply that I have the leaked source, however it does imply that in collective cybermind it is likely that someone has the source and therefore it would be publicly available.

    For more information about "Special Removal", see

And another interesting piece of news that just came in:
    The Next Chapter In The Patriot Act

Legal: Yes, they do show the same 60 Minutes here too, albeit a month later. So no, I did not download the 60 Minutes episode; I just happen to be a light-month away from Earth right now! I mean, seriously.
  • User profile image
    Mike Dimmick
    Beer28 wrote:
On Linux/x86, user-to-kernel calls for I/O to devices through the kernel are done by executing the CPU interrupt instruction INT 0x80 with the kernel sys-function number in eax and the params for the kernel function in ebx-edx, like you would do a fastcall from VC++, except with an INT instruction rather than a call/ret to the start address of the function.

    Does it work the same on NT?

    XP and 2003 use the SYSENTER/SYSEXIT instructions. IIRC earlier versions of NT used interrupt 0x2e. The user->kernel transitions are isolated in NTDLL.DLL apart from some places where gdi32.dll and user32.dll call into the win32k.sys driver directly.

    In fact it appears that the system call instruction might be dynamically generated! NtWriteFile, for example, loads edx with the contents of SharedUserData!SystemCallStub then performs an indirect call to that address. Since this is an Intel P4 system it uses SYSENTER.

The arguments appear to be retrieved from the stack directly; the only values passed in registers are the user stack pointer (passed in edx) and the system call number to execute (passed in eax).

    I would expect x64 and Itanium to pass parameters in registers rather than on the stack, since their calling conventions are register-based.
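For comparison, the Linux gate Beer28 describes can be poked from user mode through libc's generic syscall(2) wrapper (a sketch assuming x86-64 Linux, where getpid is syscall number 39; the number differs on other architectures, so this is illustration rather than portable code):

```python
import ctypes
import os

# On Linux, CDLL(None) exposes the symbols of the already-loaded C library.
libc = ctypes.CDLL(None, use_errno=True)

SYS_getpid = 39   # x86-64 Linux only; i386 uses a different number

# Both os.getpid() and this raw call end up at the same kernel entry
# point (SYSENTER/SYSCALL on modern x86, INT 0x80 historically) - the
# user-mode stub is isolated in libc, much as NT isolates its
# transitions in NTDLL.DLL.
raw_pid = libc.syscall(SYS_getpid)
```

If the raw call and the library wrapper disagree, something is very wrong: they are two doors into the same room.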

    Beer28 wrote:
Is it always passed through the registers, or can you pass stuff via pointers to memory from user to kernel and back without drawing an access violation (like maybe stack address space)?
Once it passes the context switch to kernel mode, the page protection is gone at ring 0, right? Is it up to the kernel exported function to make sure you're not giving it a bad user-space address (passed in the ebx-edx registers) in NT also?

Page protections still apply in ring 0. One of the bits in the Page Table Entry is the User/Supervisor bit, which governs whether a page is accessible from user mode or only from supervisor/kernel mode. On the x86, code running in rings 0, 1, or 2 can access both supervisor and user pages; ring 3 can only access user pages (the processor will raise an access fault if ring 3 code tries to access a supervisor page).

    NT breaks, for each process, the virtual address space into a user region and a system region. The split point is normally at 2GB (first system address is 0x80000000), however if the system is booted with /3GB that changes to 3GB user, 1GB kernel (first system address 0xC0000000). Finally XP and 2003 also offer the /USERVA switch which when combined with /3GB allows the system address start point to be tweaked further.
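That split reduces to a one-line address check (illustrative constants only; real code would query the running system, e.g. via MmHighestUserAddress, rather than hard-code the bases, and /USERVA fine-tuning is ignored here):

```python
DEFAULT_KERNEL_BASE = 0x80000000   # standard 2GB user / 2GB system split
THREE_GB_KERNEL_BASE = 0xC0000000  # after booting with /3GB

def is_system_address(addr, three_gb=False):
    """True if addr falls in the x86 NT system region for the
    given boot configuration."""
    base = THREE_GB_KERNEL_BASE if three_gb else DEFAULT_KERNEL_BASE
    return addr >= base
```

So 0x80000000 is the first system address by default, but under /3GB the same address is a perfectly ordinary user address.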

    The system address space is identical across all processes. Because the page tables are the same after the user/kernel transition (a user/kernel transition is not generally termed a context switch - the same thread is running, only now it's using its kernel stack, and it's running at a higher privilege level), the system code can access anything in the user-mode part of the address space that the thread's process can.

    Interrupt-handling code can, and will, be called with arbitrary process context - the process of whichever thread was last executing. It can't therefore write directly into a user-mode buffer. Instead it must queue an Asynchronous Procedure Call (APC) to the thread that initiated the I/O. When the APC is dispatched Windows performs a context switch to that thread, so now the correct process page tables are referenced and the operation can go ahead. (I've left out Deferred Procedure Calls [DPCs], which also occur in arbitrary process context).

There are some threads in the system which don't run in any particular process's context - they're worker threads. Instead they run in pseudo-processes, which in Task Manager (and Process Explorer) are shown as "System Idle Process" and "System". The "System Idle Process" contains one idle thread per processor; an idle thread runs only when nothing else is runnable, and it halts the processor. The zero-page thread, which is responsible only for zeroing out free pages, runs in "System" at the lowest priority in the system, never gets dynamic boosts, and will never pre-empt any other thread. All the other worker threads run in "System" as well.

    The Structured Exception Handling mechanism is also supported in kernel mode; drivers should always wrap accesses to user-mode buffers in __try/__except blocks.

    At this point I have to confess I've done no kernel-mode programming. I've found out all I have from "Windows Internals, 4th Edition" (and its predecessor "Inside Windows 2000"), and from OSR's NT Insider.
  • User profile image
Very fine presentation!  The old mainframe that I used to work on had processes (called "runs") and threads (called "activities").  The thread dispatcher maintained context for each thread in a structure called a "switch list" (SWL), which was a misnomer because the actual switch list was a priority table of linked lists of SWLs.  Paired with the SWL was the Activity Save Area (ASA), which contained processor state (relocation information for program instructions and data, which were separated) and the contents of the CPU registers the last time the thread halted in favor of a different thread.

There was no paged memory.  Programs were either entirely in memory or entirely out on the swapping drum (yes, drum).  That decision was taken to avoid thrashing; it was possible because working-set theory didn't exist yet, and it would have been overkill for this machine's small memory of iron cores.  The OS was written to run equally well, and in parallel, on all available CPUs.  By 1970, this machine only crashed once or twice a day because of bugs in the OS.  It supported dozens of users in interactive mode using teletypewriters (Model 33, Model 35), and it also ran jobs in the background as "batch" processing.  A huge backlog of batch jobs would ordinarily accumulate during the day and would be worked off at night.  Many trees died to afford users something to look at as output.

The whole thing was royally poo-pooed by sophisticated faculty from universities "of quality" where they had adopted Unix wholesale (and Multics before that) as the sine qua non of operating systems.  The thing they didn't like was that the interactive mode was exactly the same as the batch mode in user interface.  (The text-based user interface was nothing like JCL: it was NOT compiled, it comprised simple commands.)  That was quite an advantage when creating production code and for testing out production run control language.
But hell, what good is an operating system without redirection and piping and a tree-structured file system, etc., etc.?  Anyway, my point is that Windows NT is a very good operating system, beats the pants off Linux in terms of out-of-box usability, and builds on the valuable legacy of the OS I described above, which was very good for its time -- and still exists -- and can still run binary programs written for it 40 years ago, if you can figure out how to read in the deck.

When I first read about hyperthreading in 2002, I decided that Intel had built a chip that was able to hold context for two threads at the same time.  From what I have read in response to Dave Probert's talk, I was right.  Windows must somehow schedule the right two threads on the chip so that the fast context switch in the chip can be used; otherwise, HT is of no value.  I imagine the top two threads on the priority queue would ordinarily be a good choice, assuming they aren't already scheduled on some other chip.  Then, when one of the two threads blocks waiting for, say, an I/O completion, the other thread can instantly be restarted using context already onboard the chip.  There are a lot of CPU cycles to be saved by avoiding the slow context switch!
  • User profile image
    > Actually I've never even heard of "ThreadLevelParallelism".

In the articles you linked, its acronym TLP is used.  It just means that the CPU wants to extract parallelism from threads; it's analogous to ILP (instruction-level parallelism, where the parallelism is found within a single stream of instructions).

    > And you'd think that since I linked a bunch of articles that explained
    > SMT in detail that I would know the difference.

    Indeed, you seem to think that.

    > I see your explanation and it doesn't sound any different to mine

It does.
You described CMT (coarse-grained multithreading); I described SMT (simultaneous multithreading).
CMT means having several contexts in your CPU, but only *one* is active at a time; they're switched as soon as the CPU stalls waiting for memory or I/O.

SMT means having two contexts active in your CPU at the same time.

    Again, your linked articles explain exactly that.

  • User profile image
I'm down and out with the flu today. This is the perfect video to watch while soaking out feverish muscle pains in a hot tub. Dave is a great storyteller. I really liked the little personal tidbits, like the SPARCstation story and his experiences presenting to college kids. The tech stuff is broad and high-level, and does not require too much concentration to follow, while at the same time it is not too handwavy. Part II is 61% down. I hope my battery lasts long enough to pull it in completely.

    Nice job Dave. Thanks C9. 

(Another recommendation for those in a similar discomfort: the Jim Gray series is also highly recommended, as is hot lemon and camomile tea.)
  • User profile image
    I'm often amazed by talks like this, but then I realize everyone comes at things from their own perspectives as well as struggling to get a set of concepts across in a limited amount of time - and sometimes on the fly.

    The idea that Unix and VMS are the only significant OS family lines has my sides aching from laughing though.  This may be true in some narrow sense, but believe it or not there are OS families with much lengthier heritage and as much or more "success" within their markets.

    The hoary old "we loaded stuff from cards on a machine with little or no OS, no disk, in a single-tasking environment" went out pretty early on.  The most primitive box I ever worked on was a very early 60s IBM 1620.  While primitive, even there we had disk and disk-resident compilers.  True, the card-resident compilers were still there to be used, but almost nobody did this.

As for things like virtual memory, protected address spaces for processes, and the like - commercial implementations go back at least to the Burroughs B5000 (1961).  That machine didn't even offer an assembler, the OS itself being written in a high-level language.  The descendants of this platform are in use today and indeed are still actively marketed by Burroughs' successor organization, Unisys.

    Developers dealt with concurrency and "threading" frequently, since multiprocessor machines were quite common along with a complement of sophisticated I/O and communications processors that operated asynchronously.  Such "servers" routinely supported tens of thousands of simultaneous users through OLTP, often in regional, national, and international multi-site networks.

    The minicomputer (and later microcomputer) world was a very simplistic place by comparison.  Crude things like the Unix "fork" were something other people shook their heads at.

    What the mini/micro ecosystem did do however was democratize computing.  These systems were cheap in relative terms, and stayed so as they grew in power and sophistication.  This meant that more and more people were exposed to computing, and exposed to more sophisticated software.

    But the VMS/Unix family lines are still rediscovering things that were old hat by the 1970s elsewhere in computing.

    Everyone seems to be getting excited that application developers should be learning to deal with multithreading now.  Have we forgotten that most machines - even desktops - are running numerous asynchronous processes and threads all day long?  Pop open your Task Manager, gee.

    And in a server environment I can't believe people really find themselves running a single application.  Didn't "got an app, get a box" go out of style years ago, even in the NT world?

    Multithreading "because I can" is not a sensible way to architect applications.  It is also unnecessary to ensure that hyperthread/multicore/multiprocessors get fully utilized.  That's why you have environmental system software between your application code and the OS.  You let that middle layer manage worker threads and instances of your application code - which typically should remain "single threaded."
  • User profile image
    I couldn't find a complete summary of this excellent series of interviews anywhere on the web, so for any "collectors" out there, I've listed all the interviews in one place (I hope I found them all; please tell me if I missed any):
  • User profile image

    Dave P is a great raconteur and these vids are pure pleasure.

    A few demurs:

POSIX meant more than just software portability (write once, run anywhere).  It was the feds' way of specifying UNIX without using the protected brand name belonging to AT&T at the time.  POSIX promised that buyers of hardware systems would have a choice of vendors as long as its specifications were followed as a guideline.  The US DOD made POSIX certification mandatory for hardware acquisitions.

People seem to think UNIX started in 1973.  In fact UNIX started in 1969 (the Space Travel game, the PDP-7, and all that).  In 1973 Ken Thompson introduced the UNIX time-sharing system at an ACM Symposium on Operating Systems.  It was Version 3 at the time.


I was privileged to see Dave Cutler present the NT architecture in 1992 at a Usenix operating-systems workshop in Seattle.  That crowd was NOT nice to him.  I reckon he had a thick skin.

    It was after that workshop that I began to wonder if UNIX would really be the once and future system. 


  • User profile image

    Is the video unavailable to anyone else? It doesn't seem to be working for me:


    "Media Failure. Try reloading the page or visiting the main site for assistance."

  • User profile image

Great series of discussions with Dave Probert. He is an amazing guy with a unique (well, almost Tongue Out) perspective, having a lot of experience at the kernel level on Unix as well as on Windows NT. Like he says, these are the most important systems out there - VMS/WNT and Unix/Linux - and it's great to be able to talk to, or listen to, somebody who can really give a deep insight into that, with none of the FUD or subjectivity that is rampant out there. I am in no position to judge what he says, but I happen to agree with most of it. Unix is about 40 years old, VMS around 30, Linux about 20, and WinNT around 15 (broadly speaking, as platforms), and yet VMS and WNT have done a better job in the IT industry, because those systems did not start out as hobbies but as well-designed systems that had clear targets and strategies behind them. The history and philosophy of operating systems is a fascinating subject, and it's interesting to see how and why a lot of the things we deal with today as consumers or IT professionals have their roots in the way developers approached these things at the outset.
