Parallel Computing in Native Code: New Trends and Old Friends

How has Windows evolved, as a general-purpose operating system and at the lowest levels, in Windows 7? Who better to talk to than Technical Fellow and Windows kernel guru Mark Russinovich? Here, Mark enlightens us on the new kernel constructs in Windows 7 (and, yeah, we do wander up into user mode, but only briefly).

One very important change in the Windows 7 kernel is the dismantling of the dispatcher spinlock and the redesign and reimplementation of its functionality. This great work was done by Arun Kishan (you met him here on C9 last year). EDIT: You can learn exactly what Arun did in eliminating the dispatcher lock and replacing it with a set of synchronization primitives and a new "pre-wait" thread state, here. The direct result of the reworking of the dispatcher lock is that Windows 7 can scale to 256 processors. Further, this enabled the great Landy Wang to tune the Windows Memory Manager to be even more efficient than it already is.

Mark also explains (again) what MinWin really is (heck, even I was confused. Not anymore...). MinWin is present in Windows 7. Native support for VHD (boot from VHD, anyone?) is another very cool addition to our next general-purpose OS. Yes, and there's more!
Tune in. This is a great conversation (if you're into operating systems). It's always great to chat with Mark.
The updated tools are only available through the Microsoft Desktop Optimization Pack (MDOP). MDOP is an add-on subscription to Windows Client Software Assurance. MDOP also contains a lot of other cool tools, like Application Virtualization, which used to be the Softricity SoftGrid product.
Mark also talks about the next version of Sysinternals in the interview I did with him on TechNet Edge:
Interview with Mark Russinovich: the future of Sysinternals, Security, Windows
Actually, R2 will contain PowerShell as an optional install. Read about it here.
C
I'd like Windows to not do a context switch on a thread when it's doing disk I/O. I want it to hold onto the thread for, say, 500ms instead of 20ms, as this gives the disk more time to read/write.
If each thread is accessing a file, the whole thing slows to a crawl as the hard disk's read head has to jump to each file every 20ms. It would be MUCH better if the operating system could allocate more time to read a file before it allowed a context switch.
Say 500ms. That would allow more data to be retrieved from the hard disk, less head thrash, less time waiting for the head to move, and performance would go up greatly.
Just try creating 2 or more zip archives at the same time, then time it again but only doing 1 at a time. WinRAR has a feature where it will wait (probably using a global mutex; a sketch of the idea follows below) for other WinRAR windows to finish before the next one starts.
You can context-switch CPU threads till the cows come home, but a physical device needs more time to read/write once the head arrives.
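I don't know how WinRAR actually implements its wait, but serializing disk-bound jobs across processes with a named mutex takes only a few lines of Win32. A minimal sketch; the mutex name is made up for illustration, and every instance that opens the same name shares the same lock:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Any process that opens the same name shares the same mutex.
       The name here is hypothetical, just for illustration. */
    HANDLE hMutex = CreateMutexW(NULL, FALSE, L"Global\\MyArchiverSingleJob");
    if (hMutex == NULL) {
        fprintf(stderr, "CreateMutex failed: %lu\n", GetLastError());
        return 1;
    }

    /* Block until every earlier instance has finished its disk work. */
    WaitForSingleObject(hMutex, INFINITE);

    /* ... do the disk-heavy archiving here ... */
    puts("this instance owns the disk now");

    ReleaseMutex(hMutex);   /* let the next waiting instance run */
    CloseHandle(hMutex);
    return 0;
}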
IO is extremely costly. In most cases, if a thread requests an IO operation, it's going to be hanging around for quite a while waiting for the IO to complete, so it makes sense to curtail the thread's quantum and move on to the next thread awaiting CPU time (i.e. context switch).
When the IO operation completes (most likely a DMA operation these days), the CPU is interrupted, the interrupt handler fires, unblocking the interrupt service thread (IST), and the CPU is released. The CPU then works out which thread to run next. Because the IST is a high-priority thread, it'll most likely get the next quantum and complete the IO operation. Your IO-requesting thread will then be reactivated and return.
Forcing the IO-requesting thread's quantum to extend (to HALF A SECOND???) will only slow down the machine, as the CPU will be able to execute FEWER threads per second because of the largely dormant thread hogging the CPU's time.
The reason that creating two Zip archives simultaneously might be slower has many factors, including the rotational, seek, and data-transfer capabilities of the storage device itself, how fragmented your storage device is, whether your device implements some kind of write buffering, etc. And that's not to mention whether you're running single/multiple processors and what else is running on your box.
If it takes longer to create two zips at the same time vs. doing it serially, that indicates to me that you may be suffering from a slow disk and/or high disk fragmentation, forcing your Zip tool to create and extend its file in many small chunks, causing lots of disk seeking and therefore slowing you down.
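To make the "requesting thread parks while the disk works" flow concrete, here's a minimal sketch using overlapped I/O; the file name and buffer size are arbitrary, and real code would check more errors:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileW(L"0.tmp", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    static char buf[64 * 1024];
    OVERLAPPED ov = {0};
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    /* Issue the read and return immediately; the device (likely via DMA)
       transfers the data while this thread -- and the CPU -- are free. */
    if (!ReadFile(hFile, buf, sizeof(buf), NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING) return 1;

    /* Block here: the thread is off the CPU until the interrupt-driven
       completion path signals the event. */
    DWORD bytes = 0;
    GetOverlappedResult(hFile, &ov, &bytes, TRUE);
    printf("read %lu bytes\n", bytes);

    CloseHandle(ov.hEvent);
    CloseHandle(hFile);
    return 0;
}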
Parallel 23484567924; Serial 10204840629; Parallel was worse. 230% as long as serial.
Parallel 32899454271; Serial 10073167110; Parallel was worse. 326% as long as serial.
Parallel 33913801872; Serial 10052993466; Parallel was worse. 337% as long as serial.
With FILE_FLAG_NO_BUFFERING things are even, BUT both are as slow as the buffered parallel case above, i.e. now everything is 2-3 times slower than the buffered serial reads. It seems like reading data in parallel disables or cancels out read buffering, on my system at least:
Parallel 34752228822; Serial 34509786165; Parallel was worse. 100% as long as serial.
Parallel 33359695965; Serial 34333134759; Serial was worse. 102% as long as parallel.
Parallel 32994485361; Serial 33712713216; Serial was worse. 102% as long as parallel.
I also made versions which read the entire files in one go, instead of multiple small read operations. (There being a big difference in these cases is why I think there is a problem: the OS appears to be allowing two large reads to compete with each other, with the result that they both lose horribly.) These show more variable speed when buffering is enabled, I guess due to the files being cached in memory on subsequent reads (both in the same execution and between executions), but the parallel reads are still consistently slower for me, even with the variance. (Perhaps it would be worth trying the tests in reverse order, but I've spent too long on this for now, and everything so far has confirmed what I saw in the past, in cases where there was too much data, and too much time between tests of serial vs. parallel builds, for disk caching to have been the only factor.)
Here's my version of the project:
http://nudel.dopus.com/par_read_proj.zip
Pre-built exes of the four versions (buffered/non-buffered and small-reads vs. one-big-read):
http://nudel.dopus.com/par_read_bins.zip
Finally, in case it matters, the 23meg test file. Create 10 copies of it named 0.tmp to 9.tmp in the same dir that the exe is run from:
http://nudel.dopus.com/par_read_file.zip
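For anyone who doesn't want to download the project, below is a stripped-down sketch of the shape of the serial, buffered, small-reads case; the raw numbers above look like QueryPerformanceCounter ticks, and this is my own simplification, not the actual project code. The parallel variant would run the same per-file loop on one thread per file (CreateThread plus WaitForMultipleObjects), and the unbuffered variant would add FILE_FLAG_NO_BUFFERING, which also requires sector-aligned buffer addresses, sizes, and file offsets (VirtualAlloc memory satisfies the alignment):

#include <windows.h>
#include <stdio.h>

/* Read one file start to finish in small chunks, as in the small-reads build. */
static void ReadWholeFile(const wchar_t *path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return;

    static char buf[4096];
    DWORD got;
    while (ReadFile(h, buf, sizeof(buf), &got, NULL) && got != 0)
        ;  /* discard the data; only the I/O pattern matters */
    CloseHandle(h);
}

int main(void)
{
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);

    /* Serial case: 0.tmp through 9.tmp, one after another. */
    wchar_t name[] = L"0.tmp";
    for (int i = 0; i < 10; i++) {
        name[0] = (wchar_t)(L'0' + i);
        ReadWholeFile(name);
    }

    QueryPerformanceCounter(&end);
    printf("Serial %lld\n", end.QuadPart - start.QuadPart);
    return 0;
}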
Well, it'll always take longer to create two zips at the same time than in serial, since they are definitely I/O-bound operations (assuming your CPU is fast enough) and the two sets of concurrent reads and writes force the disk to seek back and forth between the two files (this has nothing to do with the fragmentation of those files).
But you're right that his solution is not really the correct one, as it makes no sense for the OS to keep the CPU waiting on an I/O operation when it could be executing some other thread that will actually use the CPU.
What might make sense (but you'd have to experiment to be sure, I think) would be to find a way to switch to another thread which wasn't disk-bound.
WinRAR is very smart to do this, and it makes sense for its scenario, but the OS can't enforce this generally, since users expect the system to be responsive, which means programs can't be forced to wait for other programs' disk-bound operations to complete before they can run (or even perform their own disk I/O).
Perhaps solid-state devices will make all of this moot, since they don't have seek time. It would be interesting to run the zip experiment against a flash drive.
Why all the videos??? A lot of the time I just want to read the paper. I don't have time to go through the video.
Thanks ICON58
Windows 7! Hurray!
Does that mean Windows Explorer will no longer hang all tasks and go into the 'Not Responding' state while trying to read a CD inserted in the CD-ROM drive?
There are rumors that it can also calculate the estimated time correctly while copying files and doesn't hang while browsing the network... If that's true, then Win7 is my dream OS!
Ha ha... Why don't you download and install the RC build? Then your need to troll will be greatly reduced and your appreciation for Windows greatly increased.
C
I had a conversation with Landy a while back about memory management in XP, and we were discussing changing how Windows proactively writes pages to the pagefile in Vista.
He said there was no queue for that function, so there might be a hard-drive bottleneck between pagefile writes and user writes.
Landy said they were thinking about giving this page-write a queue policy but hadn't done it in Vista (it was still Longhorn at the time of this conversation).
I'm wondering if it was ever done for Vista and, if not, whether they addressed the issue in Seven.
Hi,
As of Vista and Win7, MM’s policies for determining when to write pages out to the paging file do not take into account other disk activity.
C
Shame, Charles, as this is a hit people do happen to notice when they have an abundance of memory.
I understand when pages are being written for cause; however, NT writes pages proactively, long before they're even considered a candidate for release.
As I mentioned with Landy (who did say this was going to be considered), this proactive writing to the pagefile when the system is not under pressure should enjoy a different policy than when pages are being written actively.
I understand that only unique pages in RAM not represented on disk or the network are written to the pagefile; however, if there is no memory pressure yet there is hard-drive use, I think it would smooth things out if there were a "no memory pressure" policy for pages that aren't even a candidate for release.
no?
"no?"
Consideration is not definitive.
C
I'd like to add the idea of an I/O queue to this discussion
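For what it's worth, Vista did add a per-handle form of exactly this: an application can hint that a handle's I/O is background work, and the storage stack will queue it behind normal-priority traffic. The Memory Manager's own pagefile writes aren't controllable from user mode, but here's a minimal sketch of the documented hint; the file name is illustrative:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hFile = CreateFileW(L"background.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (hFile == INVALID_HANDLE_VALUE) return 1;

    /* Hint that I/O on this handle is background work; the storage stack
       queues it behind normal-priority (foreground) requests. */
    FILE_IO_PRIORITY_HINT_INFO hint = { IoPriorityHintLow };
    if (!SetFileInformationByHandle(hFile, FileIoPriorityHintInfo,
                                    &hint, sizeof(hint)))
        printf("hint failed: %lu\n", GetLastError());

    /* ... perform the bulk writes here ... */

    CloseHandle(hFile);
    return 0;
}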
Thank You very cool
Yeah, Windows 7 is just wonderful, and very user-friendly, if I may say so.
Thanks again
Even if there apparently aren't many changes from Vista, at least not major ones, the really cool stuff is that, as systems got bigger and bigger, the acquisition of a global lock, called the dispatcher lock, became a bottleneck for a lot of threads, and eliminating it is the solution. I'll probably be switching to Windows 7 soon.
I feel smart after watching this interview.
I think this video has helped me understand better how programs work. I would definitely recommend it to a beginner programmer, even though it isn't about that at all. It's just the way he explains the tasks and how Win7 manages them now.