Posted By: The Channel 9 Team | Mar 31st, 2005 @ 1:51 PM | 91,587 Views | 22 Comments
Dave Probert is an architect on the Windows Kernel team. This is the first time we've had an architect from the kernel team on Channel 9.

You'll hear from Dave over the next week or so (we split the interview up into four parts).

This is also the first of a series of deep looks inside Windows.
Tags: Kernel, OS
Notice he doesn't visit the UK...

Microsoft _ALWAYS_ skips over the UK :(

Very good video (set)... although I must admit some of the concepts are a little over my head. I am still trying to get my head around concurrency and this only confused me more... So yeah...

I wish someone would talk about the sub-systems; of particular interest to me is the GUI sub-system (drawing cr@p to screen).
rasx
Programmer/Analyst III, Emperor of String.Empty
Always interesting to hear from another UCSB grad in the sciences. He makes a great point about how "closed source" company secrets can be a bad thing as well as a Bling, Bling thang...
barlo_mung
w00t
Manip wrote:
I am still trying to
get my head around concurrency and this only confused me
more... So yeah...


You need to trudge through Larry Osterman's 15(!)-part posting about concurrency. Long but well worth it.

Here's part one.
http://blogs.msdn.com/larryosterman/archive/2005/02/14/372508.aspx
rhm
Dave's explanation of Hyperthreading is really ham-fisted. Let's get Dual Cores out of the way first: that's just two processors on the same piece of silicon, literally in the case of the Pentium D. Sometimes those processors share a level 2 cache to save space, but that is no different from the days when the L2 cache was external to the chip itself, and on multi-CPU machines that cache may have been shared by 2 or more processors. Whichever way you look at it, though, with Dual Core you are getting 2 actual processors and 2 threads will execute in parallel.

Hyperthreading (HT) is not like that at all. With HT you still only have one thread executing at a time, just like you do with a conventional single processor machine. The difference is that with HT the processor has the ability to hold 2 process contexts at the same time. This is crucial because the OS multi-tasks by switching process contexts every so often. A context switch takes a long time. With modern heavily pipelined CPUs, sometimes the CPU needs to access some code or data that isn't in the on-chip cache and it has to wait for main memory. This wait is a long time in modern CPU terms, and in a normal CPU the time is just wasted because switching to another process and doing work there would take even longer. With HT the CPU can switch instantly to another process that already has its context loaded and use the dead time to do work there (assuming that process isn't also waiting for off-chip code/data). Ultimately it's a way of using processor resources more efficiently: you add maybe 15% complexity to the chip design for a 30% increase in throughput (but only in the case where there are multiple processes trying to make use of the CPU).

Windows virtualizes a Hyperthreaded CPU as two processors because it deals in process contexts and there are two on a HT CPU even though only one can execute at a time.

Now Dave's explanation was simpler and shorter, but really it's wrong, because there aren't two CPUs that sometimes have to wait because they share resources like the floating point unit. Instead there is one CPU that happens to duplicate some of its resources (like the register file) but not the ones that actually do the work.

You could argue that it's just another way of looking at it, but it's a pet hate of mine when people try to simplify explanations past the point where they actually make sense - it's what leads to bad science.

By the way, Hyperthreading is just Intel's trademark name for this mechanism. It was originally proposed for the still-born Alpha EV8 architecture. Insider Update has a series of articles on the subject that explain it in its full gory detail.

Great video though. I really liked the story of how Dave's kids don't like computers because of his SPARCstation 1. Oh, and btw, does Charles work out? :)
This was an extremely interesting video; I'll be looking forward to the next ones.
Dual Core solutions are obviously the way the world is going at the moment, and I would be interested to know what kind of updates in the kernel space are planned for Longhorn.

If, in the future, we are heading towards massively parallel computing in the home, it's difficult to see how the pace of application development will keep up. For instance, with simple core speed increases, one's application will always run faster. However, taking advantage of 2 cores needs some multithreading work, as is being done now in limited amounts, but go over 2, and the only things that benefit now are distributed rendering and science applications. Hardly home use stuff. If our desktop in 2010 has 8 cores, for instance, one's dual-core (2 thread split) multithreading-aware game or application is unlikely to see a performance boost. One would need to code with an 8-way load split in mind. So each generation of application is going to become increasingly targeted at its generation of processors (2-way, 4-way, 8-way, etc.), because coding for an architecture more parallel than the one you are using would be a waste of time for developers, who often only foresee their application being used within the time frame of the current generation of processors. I think it would also be very difficult to split something other than a rendering app or science app into something that would execute simultaneously on 8 cores, especially if the onus has to come from the developer.

To be truly future aware, you would need a system that could take any task that is marked as parallelizable and split it into a number of threads matching the number of cores you have on that system. Otherwise, like I said above, you will end up with a lot of apps targeted at dual core, then quad core, etc., with no application able to take advantage of a processor with more cores than it was originally written for. The parallelization needs to be abstracted from programmer-written threads into some new structure which can be automatically split into the number of threads required to occupy the cores on the system.
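
As a rough illustration of that idea (and not anything Windows or Longhorn is confirmed to provide), here is a minimal Python sketch in which the caller only says "apply this function to these items" and a helper picks the worker count from the machine. The names `split_evenly` and `parallel_map` are made up for this example. It uses threads for simplicity; in standard CPython the global interpreter lock keeps threads from running Python bytecode on several cores at once, so real CPU-bound work would want a process pool instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def split_evenly(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    parts = max(1, min(parts, len(items)))
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one leftover item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_map(func, items, workers=None):
    """Apply func to every item, with one worker per available core.

    The caller never names a thread count; it is read from the machine
    unless `workers` is given explicitly.
    """
    items = list(items)
    if not items:
        return []
    workers = workers or os.cpu_count() or 1
    chunks = split_evenly(items, workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map keeps chunk order, so results come back in input order.
        mapped = pool.map(lambda chunk: [func(x) for x in chunk], chunks)
        return [y for chunk in mapped for y in chunk]


if __name__ == "__main__":
    print(parallel_map(lambda x: x * x, range(10)))
```

The same call would use 2 workers on a dual-core box and 8 on an 8-core one without the application code changing, which is the property being asked for.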

So, I ask, wouldn't that be a task for the kernel in the future?

I hope that made sense. I'll push post and hope someone understands it :)
|With HT the CPU can switch instantly to another process that already has its context loaded
|and use the dead time to do work there (assuming that process isn't also waiting for off-chip code/data).

rhm, you're wrong. Dave's explanation is correct.
You're confusing HT (which is Intel's name for SMT) with other ThreadLevelParallelism methods.
With SMT you do have two separate threads running at the same time. They compete for the same set of execution units (ALUs, FP units, SIMD units, branch units, ...); this is what Dave meant by resources.
HT/SMT works like this: every clock cycle, the OutOfOrder logic of the CPU must figure out how to fill all its execution units with the instructions coming in, so it looks at a couple of the instructions that are to be executed, figures out which ones must happen or can happen now, and which can be executed in parallel. Worst case: only one instruction can be executed, i.e. one unit is used and all the others have to idle. If several instructions can be executed in parallel, then the situation is better, because several execution units are used and instruction-level parallelism (ILP) is exploited.
HT/SMT is just a way of improving this, by simply offering two streams of instructions (two threads) that the OutOfOrder logic can use to fill the execution units. So, if one thread has only one instruction that can be executed, the OutOfOrder logic simply looks at the second stream and chooses some instructions from there. In the ideal case, this looks, for instance, like this: Thread 1 needs one ALU unit, Thread 2 needs an ALU unit and an FP unit, ...

The problem with this approach: if both threads need all the ALU units they can get, then they obviously can't run at the same time.
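
To make the slot-filling picture concrete, here is a toy Python model. It is only a sketch under heavy assumptions: the core has two ALUs and one FP unit (made-up numbers), each thread's instructions are already grouped into "bundles" that it could issue in one cycle, and a thread issues its next bundle only if every unit it needs is still free. Real out-of-order hardware is far more involved; `cycles_to_drain` and the example threads are hypothetical.

```python
from collections import Counter, deque

# Hypothetical core: two integer ALUs and one floating-point unit.
UNITS = {"alu": 2, "fp": 1}


def cycles_to_drain(streams, units=UNITS):
    """Count the cycles needed to issue every bundle in every stream.

    A stream is one thread's instructions, pre-grouped into bundles:
    tuples of unit names that the thread could issue in the same cycle.
    Each cycle a thread issues its next bundle only if all the units it
    needs are still free; otherwise it waits for the next cycle.
    """
    queues = [deque(Counter(bundle) for bundle in stream) for stream in streams]
    for queue in queues:
        for bundle in queue:
            if any(units.get(unit, 0) < need for unit, need in bundle.items()):
                raise ValueError("bundle needs more units than the core has")

    cycles = 0
    while any(queues):
        free = Counter(units)
        # Rotate which thread gets first pick so neither one starves.
        first = cycles % len(queues)
        for queue in queues[first:] + queues[:first]:
            if queue and all(free[unit] >= need for unit, need in queue[0].items()):
                free.subtract(queue.popleft())
        cycles += 1
    return cycles


thread1 = [("alu",)] * 4        # needs one ALU per cycle
thread2 = [("alu", "fp")] * 4   # needs one ALU and the FP unit per cycle
greedy = [("alu", "alu")] * 4   # needs both ALUs every cycle

print(cycles_to_drain([thread1]) + cycles_to_drain([thread2]))  # one after the other: 8
print(cycles_to_drain([thread1, thread2]))                      # sharing the core: 4
print(cycles_to_drain([greedy, greedy]))                        # fighting over ALUs: 8
```

In this model the two complementary threads finish in half the time when they share the core, while the two ALU-hungry threads gain nothing, which is the ideal case and the problem case described above.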

What you're thinking of (the CPU internally switching threads on a data access or cache miss) is talked about in this article:
http://developers.sun.com/solaris/articles/chip_multi_thread.html

rhm
You can tell what I'm thinking - I'm reporting you to PsiCorps.

Actually I've never even heard of "ThreadLevelParallelism". And you'd think that since I linked a bunch of articles that explained SMT in detail that I would know the difference. I see your explanation and it doesn't sound any different to mine. Both of them sound very different to Mr. Probert's Discovery Channel-esque explanation though.

MarkPerris --

Applications can have more threads than are available in hardware. There are plenty of applications that benefit from multithreading even on single-core/non-HT systems. Servers spin up new threads for each client request. Many Windows applications use separate threads so that long-running operations don't affect GUI responsiveness.

On a single-core/single hardware thread CPU, the CPU runs one thread at a time, and switches between the available threads after a certain amount of time. Even though only one thread is running at any given time, performance can be gained because execution time is yielded to other operations rather than having to wait for each operation to fully complete (could take a long time) before starting the next operation.

Running this same multithreaded application on a system with multiple hardware threads (some combination of multi-core/multi-CPU, etc.) can provide further performance benefits to the same, unchanged application. Instead of having to run one thread at a time, switching between threads at a given interval, a multi-core/CPU system can run those threads in parallel, one on each core/CPU simultaneously.

There can be some architectural details that may require extra tuning to maximize performance on a given architecture, but the above holds true for the general case.
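
The first point, that threads help even when only one can run at a time, can be sketched in Python. This is an illustration under assumptions, not a benchmark: `fake_request` is a made-up stand-in for a blocking operation such as a disk or network call, and the 0.05 second delay is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_request(n, delay=0.05):
    """Stand-in for a blocking operation: wait, then return a result."""
    time.sleep(delay)
    return n * 2


def timed(fn):
    """Run fn() and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_sequential(jobs):
    # Each request must finish before the next one starts.
    return [fake_request(n) for n in jobs]


def run_threaded(jobs):
    # One thread per request; while one thread is blocked, others proceed.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        return list(pool.map(fake_request, jobs))


if __name__ == "__main__":
    jobs = [1, 2, 3, 4]
    seq, seq_time = timed(lambda: run_sequential(jobs))
    thr, thr_time = timed(lambda: run_threaded(jobs))
    print(seq == thr, f"sequential {seq_time:.2f}s, threaded {thr_time:.2f}s")
```

Both versions return the same results, but the threaded one spends the waits overlapped rather than back to back, so the gain does not depend on having more than one core.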

@n4cer:

What you say is most certainly correct, and I do agree with it, but it's not quite the point I wanted to make in my original message. You are right in saying that even on single core systems multithreaded apps will get a performance benefit because of better handling of blocking and suchlike (and all the more on HT), but that doesn't really make much use of the power of dual cores. The same goes for applications which run the GUI in one thread and the processing in the other: having owned an Athlon MP system, I found the difference is very small. Very rarely does any multithreaded app utilise more than 50% of the dual processors' time (i.e. 100% of the time of one CPU), showing that little work is being spread onto the 2nd processing unit (unless one runs a specifically SMP-aware app, of which there are frighteningly few).

For dual cores, any significant speed-up is not going to come from multithreaded IO queueing or a GUI thread/app thread split, as this load level is already well handled in the context of a single CPU, so to speak. As you correctly say, in the server environment this is great: one can create many threads for client requests, etc. But on the desktop, the only benefit is likely to come from multitasking in some situations, for example, compressing CD audio to WMA while trying to play a video. And I think that's how dual cores will be marketed. That's not a bad thing, but it's not going to give any single app a definite speed increase, which is what most home users are looking for.

Eventually, multi-core aware applications will arrive that make use of both CPUs in one application context and that aren't either high end video encoders or renderers, but until that time comes, most dual cores are likely to go rather underused.

This leads to my second argument: because even programming 2 threads is difficult enough, software will likely be written with only 2 cores in mind. When highly multi-core processors arrive in 2010 or sometime around then, we're back to the same problem: on a 4 core system, for example, only 50% of the CPU time would be used (2 CPUs' worth), unless the developer specifically recodes the app to create 4 threads.
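
The arithmetic behind those percentages is simple enough to write down. The helper below is a hypothetical illustration that assumes each thread can keep at most one core busy and ignores everything else running on the machine.

```python
def peak_utilization(app_threads, cores):
    """Upper bound on the share of total CPU time an app can use,
    assuming each of its threads keeps at most one core busy."""
    if app_threads < 1 or cores < 1:
        raise ValueError("need at least one thread and one core")
    return min(app_threads, cores) / cores


if __name__ == "__main__":
    # An application written with a fixed 2-thread split.
    for cores in (1, 2, 4, 8):
        print(f"{cores} cores: at most {peak_utilization(2, cores):.0%}")
```

For the fixed 2-thread app this gives at most 100% on 1 or 2 cores, 50% on 4 and 25% on 8, and a single-threaded app on a dual processor tops out at the 50% figure mentioned earlier.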

What is needed is a way to mark, in code, any task as parallelizable, without going down to the thread level in the program. The kernel should be able to create enough threads to occupy all the processors based on some abstracted definition of the parallel task. That would mean that however many cores one's CPU has, an application with an explicitly parallelizable section is automatically interpreted to use all of the available computing power on the host system. I don't know how it would be done in practice, but it's just an idea.