Mark Russinovich: Inside Windows 7
- Posted: Jan 14, 2009 at 11:39AM
- 764,459 views
- 52 comments
How has Windows evolved, as a general purpose operating system and at the lowest levels, in Windows 7? Who better to talk to than Technical Fellow and Windows Kernel guru Mark Russinovich? Here, Mark enlightens us on the new kernel constructs in Windows 7 (and, yeah, we do wander up into user mode, but only briefly). One very important change in the Windows 7 kernel is the dismantling of the dispatcher spin lock and redesign and implementation of its functionality. This great work was done by Arun Kishan (you've met him here on C9 last year). EDIT: You can learn exactly what Arun did in eliminating the dispatcher lock and replacing it with a set of synchronization primitives and a new "pre-wait" thread state, here. The direct result of the reworking of the dispatcher lock is that Windows 7 can scale to 256 processors. Further, this enabled the great Landy Wang to tune the Windows Memory Manager to be even more efficient than it already is. Mark also explains (again) what MinWin really is (heck, even I was confused. Not anymore...). MinWin is present in Windows 7. Native support for VHD (boot from VHD anyone?) is another very cool addition to our next general purpose OS. Yes, and there's more!
Tune in. This is a great conversation (if you're into operating systems). It's always great to chat with Mark.
Comments have been closed since this content was published more than 30 days ago, but if you'd like to continue the conversation, please create a new thread in our Forums, or Contact Us and let us know.
The updated tools are only available through the Microsoft Desktop Optimization Pack (MDOP). MDOP is an add-on subscription to Windows Client Software Assurance. MDOP also contains a lot of other cool tools, like Application Virtualization, which used to be the Softricity SoftGrid product.
Mark also talks about the next version of Sysinternals in the interview I did with him on TechNet Edge:
Interview with Mark Russinovich: the future of Sysinternals, Security, Windows
Actually, R2 will contain PowerShell as an optional install. Read about it here
C
I'd like Windows to not do a context switch on a thread when it's doing disk I/O. I want it to hold onto the thread for, say, 500ms instead of 20ms, as this gives the disk more time to read/write.
If each thread is accessing a file, the whole thing slows to a crawl as the hard disk read head has to jump to each file every 20ms. It would be MUCH better if the operating system could allocate more time to read a file before it allowed a context switch: say, 500ms. That would allow more data to be retrieved from the hard disk, less head thrash, less time waiting for the head to move, and performance would go up greatly.
Just try creating 2 or more zip archives at the same time, then time it again but only doing 1 at a time. Winrar has a feature where it will wait (probably using a global mutex) for other winrar windows to finish before the next one starts.
You can context switch CPU threads till the cows come home, but a physical device needs more time to read/write when the head arrives.
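The WinRAR behavior described above (waiting for other archive jobs to finish before starting) can be sketched in a few lines. This is an illustrative single-process sketch using a plain lock; a real cross-process version on Windows would presumably use a named mutex (CreateMutex), as the commenter guesses.

```python
import threading

# Sketch: serialize disk-heavy jobs so only one touches the disk at a
# time, letting the drive head stream each file instead of seeking
# back and forth between two archives. All names here are made up.
disk_lock = threading.Lock()
results = []

def archive_job(name, work):
    # Acquire the shared lock before doing any disk-bound work.
    with disk_lock:
        results.append((name, work()))

threads = [
    threading.Thread(target=archive_job, args=("a", lambda: "a done")),
    threading.Thread(target=archive_job, args=("b", lambda: "b done")),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))
```

The jobs still run on separate threads, but the lock guarantees their disk work never interleaves.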
IO is extremely costly. In most cases, if a thread requests an IO operation, it's going to be hanging around for quite a while for the IO to complete, so it makes sense to curtail the thread's quantum and move onto the next thread awaiting CPU time (i.e. context switch).
When the IO operation returns (most likely a DMA operation these days), the CPU will be interrupted and the interrupt handler fires, unblocking the interrupt service thread (IST) and releasing the CPU. The CPU then works out which thread to run next. Because the IST is a high-priority thread, it'll most likely get the next quantum and complete the IO operation. Your IO requesting thread will then be reactivated and return.
Forcing the IO requesting thread's quantum to extend (to HALF A SECOND???) will only slow down the machine as the CPU will be able to execute FEWER threads per second because of the largely dormant thread hogging the CPU's time.
The reason that creating two Zip archives simultaneously might be slower has many factors, including the rotational, seek and data transfer capabilities of the storage device itself, how fragmented your storage device is, whether your device implements some kind of write buffering, etc. And that's not to mention whether you're running single/multiple processors and what else is running on your box.
If it takes longer to create two zips at the same time vs. doing it serially, that indicates to me that you may be suffering from a slow disk and/or high disk fragmentation forcing your Zip tool to create and extend its file in many small chunks, causing lots of disk seeking and therefore slowing you down.
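The seek-overhead argument can be made concrete with a back-of-envelope model. All the numbers below are assumptions for illustration, not measurements: suppose the scheduler alternates between two files every quantum, and each switch costs one seek.

```python
# Assumed figures for a spinning disk (illustration only):
seek_ms = 10.0          # average seek + rotational latency per switch
quantum_ms = 20.0       # time spent streaming before switching files
transfer_mb_per_s = 60  # sequential transfer rate

mb_per_quantum = transfer_mb_per_s * quantum_ms / 1000.0

def time_to_read_ms(total_mb, interleaved):
    quanta = total_mb / mb_per_quantum
    # A serial read streams with no extra seeks; an interleaved read
    # pays one seek per quantum as the head jumps between files.
    seeks = quanta if interleaved else 0
    return quanta * quantum_ms + seeks * seek_ms

serial = time_to_read_ms(100, interleaved=False)
parallel = time_to_read_ms(100, interleaved=True)
print(round(parallel / serial, 2))  # 1.5: each seek adds 50% overhead
```

With these made-up numbers the interleaved case is 1.5x slower; a longer quantum or shorter seek shrinks the penalty, which is the trade-off the whole thread is arguing about.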
This is a technical talk. These details matter.
Thanks.
Consider this example which I experienced with quite simple Win32 code on my Vista machine:
I had two uncompressed BMP files on the HDD, about 50MB each. I needed to read both files into memory and process them and they had to be loaded completely before processing could begin. There was plenty of memory and it was a Core2Duo system with 32-bit Vista.
If I used two threads to load the 50MB files in parallel (one file per thread) then, no matter how I wrote it, it consistently took *twice* as long as using a single thread and reading the files in series. Not the same length of time but twice as long. This was true even when both threads allocated the full 50MB each and read their respective files in a single 50MB ReadFile call each. No data was being written to disk and no other processes were using significant resources.
That cannot be right. The OS is being asked by two threads in the same process to do two reads and it's allowing them to compete with each other to the extent that it takes twice as long. It makes no sense for those reads to be done in parallel as, even in the impossible best case of zero seek times, the result would still be both threads waiting until the full 100MB of data was read. Better, when the OS knows both threads are reading 50MB of data in a single ReadFile call, to let one thread read 50MB and move on, then let the other thread read its 50MB. That would mean one thread is ready after 50MB and the other after 100MB. (i.e. Compared to the impossible best case of the other method, one thread takes no longer to be ready while the other thread is ready twice as quickly. Win.)
I realise that doing that could be complex given the way the system is layered. Some interleaving may be inevitable but what happens now has a lot of room for improvement. Neither thread is ready until the amount of time it would take a single thread to read 200MB of data, yet only 100MB of data is being read.
Even if you can refactor your own process to have a single "data loading" thread (which is very difficult with many 3rd party libraries and/or workloads which mix loading and processing), you have no way to synchronize your data loading with that of any other process on the system.
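The experiment described above is easy to reproduce in outline. Here is a minimal harness sketch (not the commenter's Win32 code): read two files serially, then with one thread per file, and compare wall-clock time. On modern hardware with small files the OS cache hides the effect, so treat this as a harness, not a benchmark result.

```python
import os
import tempfile
import threading
import time

def make_file(size):
    # Create a throwaway test file of the given size.
    f = tempfile.NamedTemporaryFile(delete=False)
    f.write(os.urandom(size))
    f.close()
    return f.name

def read_all(path):
    with open(path, "rb") as f:
        return len(f.read())

paths = [make_file(1 << 20) for _ in range(2)]  # two 1 MB test files

t0 = time.perf_counter()
serial_bytes = sum(read_all(p) for p in paths)  # one file after the other
serial_s = time.perf_counter() - t0

t0 = time.perf_counter()
threads = [threading.Thread(target=read_all, args=(p,)) for p in paths]
for t in threads:
    t.start()
for t in threads:
    t.join()
parallel_s = time.perf_counter() - t0

print(serial_bytes, round(serial_s, 4), round(parallel_s, 4))
for p in paths:
    os.unlink(p)
```

To see the disk-seek effect the thread describes, the files would need to be large enough (and the cache cold enough) that the reads actually hit the platter.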
C
I ran your test project and got similar results to what I described on my system, although there was some variance with just 2 files.
I then modified your project to be more like my real example (which, unlike what I described, loads more than 2 files) in case that helps magnify what's going on (and make it less likely that disk caching is skewing things). The program still uses 2 threads but each thread reads 5 files, then the serial version reads all 10 files.
I got some interesting results, especially when buffered and unbuffered are compared.
The files were copies of the same 23meg file. They were read off a standard NTFS partition (my system drive). Real-time antivirus was disabled. (NOD32 installed but turned off for the tests.) Vista 32-bit. Core2Duo. NVidia motherboard and NVidia SATA drivers. 2gig of RAM with 40% free.
With FILE_FLAG_SEQUENTIAL_SCAN the parallel reads were consistently 2 to 3 times slower than the serial reads (the real exe reports more detail than pasted here):
Parallel 23484567924; Serial 10204840629; Parallel was worse. 230% as long as serial.
Parallel 32899454271; Serial 10073167110; Parallel was worse. 326% as long as serial.
Parallel 33913801872; Serial 10052993466; Parallel was worse. 337% as long as serial.
With FILE_FLAG_NO_BUFFERING things are even BUT both are as slow as the buffered parallel case above. i.e. Now everything is 2-3 times slower than the buffered serial reads. It seems like reading data in parallel disables or cancels out read buffering, on my system at least:
Parallel 34752228822; Serial 34509786165; Parallel was worse. 100% as long as serial.
Parallel 33359695965; Serial 34333134759; Serial was worse. 102% as long as parallel.
Parallel 32994485361; Serial 33712713216; Serial was worse. 102% as long as parallel.
I also made versions which read the entire files in one go, instead of multiple small read operations. (There being a big difference in these cases is why I think there is a problem. The OS appears to be allowing two large reads to compete with each other with the result that they both lose horribly.) These show more variable speed when buffering is enabled, I guess due to the files being cached in memory on subsequent reads (both in the same execution and between executions), but the parallel reads are still consistently slower for me even with the variance. (Perhaps it would be worth trying the tests in reverse order but I've spent too long on this for now and everything so far has confirmed what I saw in the past in cases where there was too much data, and too much time between tests of serial vs parallel builds, for disk caching to have been the only factor.)
Here's my version of the project:
http://nudel.dopus.com/par_read_proj.zip
Pre-built exes of the four versions (buffered/non-buffered and small-reads vs. one-big-read):
http://nudel.dopus.com/par_read_bins.zip
Finally, in case it matters, the 23meg test file. Create 10 copies of it named 0.tmp to 9.tmp in the same dir that the exe is run from:
http://nudel.dopus.com/par_read_file.zip
To properly test this stuff I think you need to make sure enough data is being read, or memory is low enough, that it isn't all being cached.
Or to clear the disk cache between each test (not just each run of the exe, of course, but each test within it). I don't know a way to do that, though.
Windows 7 promises to be a really solid general purpose OS. I'm running the pre-beta build released to PDC2008 attendees and it's impressive.
C
Being a guy who installs a lot of beta stuff, could I install Windows on a VHD and just have it as a backup, and if I just want to reset, could I just point the boot manager at a different VHD and be done?
Also, I could do a lengthy OS install on a VHD image and then boot the image just to finish the installation.
You are confusing real time passing with thread quantum time, which is local to the thread and only counting while the thread is making forward progress. Blocked threads are not making forward progress.
C
Well, it'll always take longer to create two zips at the same time than in serial since they are definitely I/O bound operations (assuming your cpu is fast enough) and the two sets of concurrent reads and writes force the disk to seek back and forth between the two files (this has nothing to do with the fragmentation of those files).
But you're right that his solution is not really the correct one, as it makes no sense for the OS to keep the CPU waiting on an I/O operation when it could be executing some other thread that will actually just use the cpu.
What might make sense (but you'd have to experiment to be sure I think) would be to try to find a way so that you switched to another thread which wasn't disk bound.
WinRAR is very smart to do this, and it makes sense for its scenario, but the OS can't enforce this generally, since users expect the system to be responsive, which means programs can't be forced to wait for other programs' disk-bound operations to complete before they can run (or even perform their own disk I/O).
Perhaps solid-state devices will make all of this moot, since they don't have seek time. It would be interesting to run the zip experiment against a flash drive.
Thanks again. Windows' future is in safe hands.
The haircut + shirt + time has changed but the desktop + [wall] storage has not changed between interviews !!
:^ )
When discs with scratches or of poor quality are inserted into the drive, the system sometimes hangs completely and the user has to perform a hard reset; even after the disc is taken out of the drive, Windows stays unresponsive! I can't tell you how many times this has happened to me and my friends.
I'd love to see improvements in Windows 7 in how it handles attached devices; it should not get into a state where the user has to do a hard reset due to a device delay, specifically with CD/DVD drives.
// chall3ng3r //
chall3ng3r's entry is interesting, but I think it's much more of a generic issue than even Windows specific. Damaged CDs have the same effect on just about every OS I've tried, including OS/2, Linux, and Windows. I rather suspect it's down to the drive's firmware getting stuck in a tight loop and becoming unresponsive and/or locking the bus (DMA?).
Can anyone comment?
Mike
I remember seeing something last year that the goal was to get MinWin to 4MB. How big is it now? Does MS have a website where we could follow up on the advances on this product?
Thank you very much --
C
PS. I am seeing a bug on this page. It's showing 7 pages and I keep getting a popup saying that a toolbar could not be found. Weird.
Why all the videos??? A lot of the time I just want to read the paper. I don't have time to go through the video.
Thanks ICON58
I've often been completely surprised at the behaviour of Windows in relation to how it handles removable devices.
It has gotten a little better over the years with the release of XP and eventually Vista, but I have seen on numerous occasions Windows completely lock up the entire system because it either couldn't read a CD or a user attempted to access a removable device that did not have any media in it.
I fail to see how, in this day and age, the entire OS needs to be affected, be it slowing to a crawl or completely locking up Explorer, because it hasn't got a response from removable media. This also goes for accessing a network resource that is no longer available: the entire Explorer window will lock up and become unresponsive until either contact is restored or a network timeout occurs.
Unforgivable that the user's GUI will just lock up in this way, surely keep the processing / waiting for this action invisible to the user and keep all the interactive components responsive. Maybe I'm asking for something here that isn't possible, but I've never seen this behaviour on any linux distro.
I've seen similar lockups while trying to read damaged media in OS/2 and Linux. I wonder whether it's due to the DMA access to the damaged disc "locking the bus" and preventing other activity. I have a strong suspicion that this is the case, since when I've turned DMA off and used the same disc, the lockup doesn't occur (although obviously without DMA, for normal discs, you get very suboptimal performance).
Mike
I hope they fix this! Today's Core i7 965/975s don't use just the FSB/QPI to identify the BCLK of a processor; Windows also needs to use the multiplier settings. It appears to be a kernel issue, and I've sent this to the team.
The multiplier on my Core i7 965 EE is 24x, with a base clock of 133MHz. That's seen in the BIOS, and in Windows, as 3.2GHz. However, many users, including myself, overclock our processors. So when I set my multiplier to 34x with the same base clock, I'm at 4.5GHz in the BIOS. But in Windows it remains 3.184GHz. This is because Windows is using the new processor's QPI settings to determine the BCLK only, when in reality it should also be using the multiplier setting to determine the clock speed of the processor. I bring this up because I sent feedback on this issue and have yet to see an update to address it. As you know, most new computers will have the Intel Core i7 processors in them. It would be nice to have Windows read them correctly, versus having to use third-party software to get correct processor speed ratings.
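The arithmetic behind this complaint is simple: effective core clock = base clock (BCLK) x multiplier, so reading only the stock ratio misses an overclock. The values below are taken from the comment; the results line up with the ~3.2GHz and ~4.5GHz figures mentioned.

```python
# Values from the comment above (Core i7 965 EE example):
bclk_mhz = 133

stock_multiplier = 24        # the ratio Windows appears to assume
overclocked_multiplier = 34  # the ratio actually set in the BIOS

stock_ghz = bclk_mhz * stock_multiplier / 1000.0         # what Windows reports
actual_ghz = bclk_mhz * overclocked_multiplier / 1000.0  # what the BIOS shows

print(round(stock_ghz, 3), round(actual_ghz, 3))
```

With a fixed 24x assumption, the reported 3.192GHz never changes no matter how far the multiplier is raised, which is exactly the mismatch the commenter describes.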
Thanks!
Windows 7! Hurray!
Does that mean Windows Explorer will no longer hang all tasks and go into a 'not responding' state while trying to read a CD inserted in the CD-ROM drive?
There are rumors that it can also calculate the estimated time correctly while copying files and does not hang while browsing the network... If that's true then Win7 is my dream OS!
Ha ha.... Why don't you download and install the RC build. Then your need to troll will be greatly reduced and your appreciation for Windows greatly increased.
C
I had a conversation with Landy a while back about memory management in XP, and we were discussing changing how Windows proactively writes pages to the pagefile in Vista.
He said there was no queue for that function, so there might be a hard-drive bottleneck between pagefile writes and user writes.
Landy said they were thinking about giving this page write a queue policy but hadn't done it in Vista (it was still Longhorn at the time of this conversation).
I'm wondering if it was ever done for Vista and, if not, if they addressed that issue in Seven.
Hi,
As of Vista and Win7, MM’s policies for determining when to write pages out to the paging file do not take into account other disk activity.
C
Shame, Charles, as this is a hit people do happen to notice when they have an abundance of memory.
I understand when pages are being written for cause; however, NT writes pages proactively, long before they're even considered a candidate for release.
As I mentioned with Landy (who did say this was going to be considered), this proactive writing to the pagefile when the system is not under pressure should enjoy a different policy than when the pages are being written actively.
I understand only unique images in RAM not represented on disk or the network are written to the pagefile; however, if there is no memory pressure yet there is hard-drive use, I think it would smooth things out if there was a "no memory pressure" policy for pages that aren't even a candidate for release.
no?
"no?"
Consideration is not definitive.
C
I'd like to add the idea of an I/O queue to this discussion
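The I/O queue idea floated in this thread can be sketched as a priority queue in which proactive (background) pagefile writes only issue when no higher-priority user I/O is pending. All names below are made up for illustration; this is not how the Windows I/O stack is actually structured.

```python
import heapq

# Lower number = higher priority.
FOREGROUND, BACKGROUND = 0, 1

class IoQueue:
    """Toy prioritized disk-write queue (illustrative sketch)."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority, label):
        heapq.heappush(self._heap, (priority, self._seq, label))
        self._seq += 1

    def drain(self):
        # Issue requests in priority order: all foreground user I/O
        # drains before any background pagefile write is attempted.
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order

q = IoQueue()
q.submit(BACKGROUND, "pagefile write A")
q.submit(FOREGROUND, "user read 1")
q.submit(BACKGROUND, "pagefile write B")
q.submit(FOREGROUND, "user read 2")
order = q.drain()
print(order)
```

Under this policy, the proactive pagefile writes the commenter complains about would only hit the disk once user I/O has gone quiet.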
Thank You very cool
Yeah, Windows 7 is just wonderful, and very user friendly, I may say.
Thanks again
Even if apparently there aren't many changes from Vista, at least major ones, the really cool thing is that, because systems keep getting bigger and bigger, removing the acquisition of a global lock, called the dispatcher lock, can be the solution for a lot of threads. I'll probably be switching to Windows 7 soon.
I feel smart after watching this interview.
I think this video has helped me understand better how programs work. I would definitely recommend it to a beginner programmer, even though it isn't about that at all. It's just the way that he explains the tasks and how Win7 manages them now.