Posted By: Charles | Jan 14th @ 11:39 AM | 674,224 Views | 47 Comments
How has Windows evolved, as a general purpose operating system and at the lowest levels, in Windows 7? Who better to talk to than Technical Fellow and Windows Kernel guru Mark Russinovich? Here, Mark enlightens us on the new kernel constructs in Windows 7 (and, yeah, we do wander up into user mode, but only briefly). One very important change in the Windows 7 kernel is the dismantling of the dispatcher spin lock and redesign and implementation of its functionality. This great work was done by Arun Kishan (you've met him here on C9 last year). EDIT: You can learn exactly what Arun did in eliminating the dispatcher lock and replacing it with a set of synchronization primitives and a new "pre-wait" thread state, here. The direct result of the reworking of the dispatcher lock is that Windows 7 can scale to 256 processors. Further, this enabled the great Landy Wang to tune the Windows Memory Manager to be even more efficient than it already is. Mark also explains (again) what MinWin really is (heck, even I was confused. Not anymore...). MinWin is present in Windows 7. Native support for VHD (boot from VHD anyone?) is another very cool addition to our next general purpose OS. Yes, and there's more!

Tune in. This is a great conversation (if you're into operating systems). It's always great to chat with Mark.
The topic summary refers to "the Spin Lock Dispatcher" -- i.e., a component that dispatches spin locks.  That is meaningless.  The talk in fact refers correctly to "the Dispatcher Spin Lock" -- i.e., the spin lock that protects the dispatcher (or rather its data).

This is a technical talk. These details matter.

Thanks.
There are definitely things that Windows could do better with disk I/O unless my experience is atypical and due to something wrong with my system.

Consider this example which I experienced with quite simple Win32 code on my Vista machine:

I had two uncompressed BMP files on the HDD, about 50MB each. I needed to read both files into memory and process them and they had to be loaded completely before processing could begin. There was plenty of memory and it was a Core2Duo system with 32-bit Vista.

If I used two threads to load the 50MB files in parallel (one file per thread) then, no matter how I wrote it, it consistently took *twice* as long as using a single thread and reading the files in series. Not the same length of time but twice as long. This was true even when both threads allocated the full 50MB each and read their respective files in a single 50MB ReadFile call each. No data was being written to disk and no other processes were using significant resources.
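
For anyone who wants to try the same comparison, here is a minimal sketch of the measurement in Python. It is not the original Win32 code (which isn't shown in this post); the function names are hypothetical, and each file is read with a single call to mirror the one-big-ReadFile-per-thread case. Timings will be affected by the OS file cache, so treat any numbers it produces with caution.

```python
import threading
import time


def read_whole_file(path, results, index):
    """Read one file completely in a single call and store its bytes."""
    with open(path, "rb") as f:
        results[index] = f.read()


def time_serial(paths):
    """Read the files one after another on the calling thread."""
    results = [None] * len(paths)
    start = time.perf_counter()
    for i, path in enumerate(paths):
        read_whole_file(path, results, i)
    return time.perf_counter() - start, results


def time_parallel(paths):
    """Read the files concurrently, one thread per file."""
    results = [None] * len(paths)
    threads = [
        threading.Thread(target=read_whole_file, args=(path, results, i))
        for i, path in enumerate(paths)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results
```

Calling time_serial and time_parallel on the same list of paths returns the elapsed seconds and the loaded data for each approach.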

That cannot be right. The OS is being asked by two threads in the same process to do two reads and it's allowing them to compete with each other to the extent that it takes twice as long. It makes no sense for those reads to be done in parallel as, even in the impossible best case of zero seek times, the result would still be both threads waiting until the full 100MB of data was read. Better, when the OS knows both threads are reading 50MB of data in a single ReadFile call, to let one thread read 50MB and move on, then let the other thread read its 50MB. That would mean one thread is ready after 50MB and the other after 100MB. (i.e. Compared to the impossible best case of the other method, one thread takes no longer to be ready while the other thread is ready twice as quickly. Win.)
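
To put numbers on that argument, here is a small sketch of the idealized model in Python. It assumes zero seek time and a fixed disk throughput shared evenly between active reads; the function names and the 50MB/s figure used below are made up for illustration, not measurements.

```python
def serialized_ready_times(sizes_mb, throughput_mb_per_s):
    """Each read runs to completion before the next one starts."""
    ready, elapsed = [], 0.0
    for size in sizes_mb:
        elapsed += size / throughput_mb_per_s
        ready.append(elapsed)
    return ready


def interleaved_ready_times(sizes_mb, throughput_mb_per_s):
    """All active reads share the throughput evenly, with no seek cost."""
    order = sorted(range(len(sizes_mb)), key=lambda i: sizes_mb[i])
    ready = [0.0] * len(sizes_mb)
    elapsed, done_mb = 0.0, 0.0
    active = len(sizes_mb)
    for i in order:
        # Every active read advances by the same amount until read i finishes.
        elapsed += (sizes_mb[i] - done_mb) * active / throughput_mb_per_s
        done_mb = sizes_mb[i]
        ready[i] = elapsed
        active -= 1
    return ready
```

With two 50MB files and an assumed 50MB/s, the serialized model gives ready times of 1.0 and 2.0 seconds, while the interleaved best case gives 2.0 and 2.0 seconds.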

I realise that doing that could be complex given the way the system is layered. Some interleaving may be inevitable but what happens now has a lot of room for improvement. Neither thread is ready until about as much time has passed as a single thread would need to read 200MB of data, yet only 100MB of data is being read.

Even if you can refactor your own process to have a single "data loading" thread (which is very difficult with many 3rd party libraries and/or workloads which mix loading and processing), you have no way to synchronize your data loading with that of any other process on the system.
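
For what it's worth, the single "data loading" thread idea might look something like this in Python. This is only a sketch with hypothetical names (load_async and the module-level _loader), and as noted it only serializes reads within one process.

```python
from concurrent.futures import ThreadPoolExecutor


def _read_whole_file(path):
    """Read one file completely in a single call."""
    with open(path, "rb") as f:
        return f.read()


# One worker means submitted reads run strictly one at a time, in order.
_loader = ThreadPoolExecutor(max_workers=1)


def load_async(path):
    """Queue a whole-file read; returns a Future holding the file's bytes."""
    return _loader.submit(_read_whole_file, path)
```

Any thread can call load_async(path) and later call .result() on the returned Future; the reads themselves never overlap because there is only one worker.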

I did a test and see no significant difference between serial and parallel reads on two 50 MB files. Parallel is a bit faster even, when I don't use the FILE_FLAG_NO_BUFFERING flag, but that would be cheating since it probably comes from the disk cache (buffered vs unbuffered, test project).
Hi gabest (nice to see you here; I use MPC all the time!)

I ran your test project and got similar results to what I described on my system, although there was some variance with just 2 files.

I then modified your project to be more like my real example (which, unlike what I described, loads more than 2 files) in case that helps magnify what's going on (and make it less likely that disk caching is skewing things). The program still uses 2 threads but each thread reads 5 files, then the serial version reads all 10 files.

I got some interesting results, especially when buffered and unbuffered are compared.

The files were copies of the same 23meg file. They were read off a standard NTFS partition (my system drive). Real-time antivirus was disabled. (NOD32 installed but turned off for the tests.) Vista 32-bit. Core2Duo. NVidia motherboard and NVidia SATA drivers. 2gig of RAM with 40% free.

With FILE_FLAG_SEQUENTIAL_SCAN the parallel reads were consistently 2 to 3 times slower than the serial reads (the real exe reports more detail than pasted here):

Parallel 23484567924; Serial 10204840629; Parallel was worse. 230% as long as serial.
Parallel 32899454271; Serial 10073167110; Parallel was worse. 326% as long as serial.
Parallel 33913801872; Serial 10052993466; Parallel was worse. 337% as long as serial.

With FILE_FLAG_NO_BUFFERING things are even BUT both are as slow as the buffered parallel case above. i.e. Now everything is 2-3 times slower than the buffered serial reads. It seems like reading data in parallel disables or cancels out read buffering, on my system at least:

Parallel 34752228822; Serial 34509786165; Parallel was worse. 100% as long as serial.
Parallel 33359695965; Serial 34333134759; Serial was worse. 102% as long as parallel.
Parallel 32994485361; Serial 33712713216; Serial was worse. 102% as long as parallel.
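
For reference, the percentages in these lines can be reproduced from the two counts with a few lines of Python. This is a sketch, not the test project's actual code: the function name is hypothetical, and the use of truncating integer division is an inference from the pasted figures (e.g. 326% rather than 327%).

```python
def compare(parallel_count, serial_count):
    """Format the slower side's time as a truncated percentage of the faster."""
    if parallel_count > serial_count:
        pct = parallel_count * 100 // serial_count
        return f"Parallel was worse. {pct}% as long as serial."
    pct = serial_count * 100 // parallel_count
    return f"Serial was worse. {pct}% as long as parallel."
```

For example, compare(32899454271, 10073167110) returns "Parallel was worse. 326% as long as serial."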

I also made versions which read the entire files in one go, instead of multiple small read operations. (There being a big difference in these cases is why I think there is a problem. The OS appears to be allowing two large reads to compete with each other with the result that they both lose horribly.) These show more variable speed when buffering is enabled, I guess due to the files being cached in memory on subsequent reads (both in the same execution and between executions), but the parallel reads are still consistently slower for me even with the variance. (Perhaps it would be worth trying the tests in reverse order but I've spent too long on this for now and everything so far has confirmed what I saw in the past in cases where there was too much data, and too much time between tests of serial vs parallel builds, for disk caching to have been the only factor.)

Here's my version of the project:
http://nudel.dopus.com/par_read_proj.zip

Pre-built exes of the four versions (buffered/non-buffered and small-reads vs one-big-read):
http://nudel.dopus.com/par_read_bins.zip

Finally, in case it matters, the 23meg test file. Create 10 copies of it named 0.tmp to 9.tmp in the same dir that the exe is run from:
http://nudel.dopus.com/par_read_file.zip

Quick follow-up: I just shut down all my apps and freed up a bit more RAM, then tried again. Now it's running through the files very quickly after the first test has run, suggesting that everything is being done in memory now. (The results are all over the place now, but it's going through the files so fast that nothing is really being tested.)

To properly test this stuff I think you need to make sure enough data is being read, or memory is low enough, that it isn't all being cached.

Or to clear the disk cache between each test (not just each run of the exe, of course, but each test within it). I don't know a way to do that, though.

Is that Mark's iPhone calendar reminder beeping at 31:47 or the reporter's?

Either way, at least one of them knows Windows Mobile needs first aid and a hospital, fast!
aL_
Rx ftw
The VHD stuff sounds really awesome :O
Being a guy who installs a lot of beta stuff, could I install Windows on a VHD and just have it as a backup, and if I just want to reset, could I just point the boot manager at a different VHD and be done?

Also, I could do a lengthy OS install on a VHD image and then boot from the image just to finish the installation :) Really, really cool :)