Leo Davidson

Niner since 2008


  • Mark Russinovich: Inside Windows 7

    Quick follow-up: I just shut down all my apps and freed up a bit more RAM, then tried again. Now it's running through the files very quickly after the first test has run, suggesting that everything is being served from memory now. (The results are all over the place now, but it's going through the files so fast that nothing is really being tested.)

    To properly test this stuff I think you need to make sure enough data is being read, or memory is low enough, that it isn't all being cached.

    Alternatively, you could clear the disk cache between each test (not just each run of the exe, of course, but each test within it). I don't know a way to do that, though.

  • Mark Russinovich: Inside Windows 7

    Hi gabest (nice to see you here; I use MPC all the time!)

    I ran your test project and got similar results to what I described on my system, although there was some variance with just 2 files.

    I then modified your project to be more like my real example (which, unlike what I described, loads more than 2 files) in case that helps magnify what's going on (and make it less likely that disk caching is skewing things). The program still uses 2 threads but each thread reads 5 files, then the serial version reads all 10 files.
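    The structure of that test (2 threads reading 5 files each, then one thread reading all 10) can be sketched roughly as below. This is not the original Win32 project: it is a portable Python approximation, and the names `read_files`, `time_serial`, `time_parallel` and the 64KB chunk size are all hypothetical. It also goes through the normal OS file cache, so it does not reproduce the FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING behaviour.

```python
import threading
import time

CHUNK = 64 * 1024  # hypothetical small-read size; the original project's size is not stated


def read_files(paths, chunk=CHUNK):
    """Read each file to the end in chunk-sized pieces; return total bytes read."""
    total = 0
    for p in paths:
        # buffering=0 skips Python's own buffer only; the OS cache is still used.
        with open(p, "rb", buffering=0) as f:
            while True:
                block = f.read(chunk)
                if not block:
                    break
                total += len(block)
    return total


def time_serial(paths):
    """Read every file on the calling thread; return (elapsed_ns, bytes_read)."""
    start = time.perf_counter_ns()
    total = read_files(paths)
    return time.perf_counter_ns() - start, total


def time_parallel(paths, n_threads=2):
    """Split the files across n_threads threads; return (elapsed_ns, bytes_read)."""
    # With 10 files and 2 threads, each thread gets 5 files.
    groups = [paths[i::n_threads] for i in range(n_threads)]
    totals = [0] * n_threads

    def worker(i):
        totals[i] = read_files(groups[i])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    start = time.perf_counter_ns()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter_ns() - start, sum(totals)
```

    Calling `time_parallel(paths)` and then `time_serial(paths)` with `paths` set to `0.tmp` through `9.tmp` gives the two durations to compare. The caching caveat still applies: whichever test runs second may be reading from memory rather than from disk.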

    I got some interesting results, especially when buffered and unbuffered are compared.

    The files were copies of the same 23MB file. They were read off a standard NTFS partition (my system drive). Real-time antivirus was disabled. (NOD32 installed but turned off for the tests.) Vista 32-bit. Core2Duo. NVidia motherboard and NVidia SATA drivers. 2GB of RAM with 40% free.

    With FILE_FLAG_SEQUENTIAL_SCAN the parallel reads were consistently 2 to 3 times slower than the serial reads (the real exe reports more detail than pasted here):

    Parallel 23484567924; Serial 10204840629; Parallel was worse. 230% as long as serial.
    Parallel 32899454271; Serial 10073167110; Parallel was worse. 326% as long as serial.
    Parallel 33913801872; Serial 10052993466; Parallel was worse. 337% as long as serial.

    With FILE_FLAG_NO_BUFFERING things are even BUT both are as slow as the buffered parallel case above. i.e. Now everything is 2-3 times slower than the buffered serial reads. It seems like reading data in parallel disables or cancels out read buffering, on my system at least:

    Parallel 34752228822; Serial 34509786165; Parallel was worse. 100% as long as serial.
    Parallel 33359695965; Serial 34333134759; Serial was worse. 102% as long as parallel.
    Parallel 32994485361; Serial 33712713216; Serial was worse. 102% as long as parallel.
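    For anyone checking the percentages against the raw counters, the summary text in both sets of results is consistent with a truncating integer division of the slower time by the faster one. The sketch below is a guess at that calculation, not code taken from the test project; `compare` is a hypothetical name and the unit of the counters is not stated in the output.

```python
def compare(parallel, serial):
    """Summarise two timings the way the pasted results appear to.

    Assumption: the percentage is slower * 100 // faster, truncated to an
    integer. That matches every line pasted above, but the real exe may
    compute it differently.
    """
    if parallel >= serial:
        return f"Parallel was worse. {parallel * 100 // serial}% as long as serial."
    return f"Serial was worse. {serial * 100 // parallel}% as long as parallel."


# First buffered result and second unbuffered result from the lists above.
print(compare(23484567924, 10204840629))
print(compare(33359695965, 34333134759))
```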

    I also made versions which read the entire files in one go, instead of multiple small read operations. (There being a big difference in these cases is why I think there is a problem. The OS appears to be allowing two large reads to compete with each other with the result that they both lose horribly.) These show more variable speed when buffering is enabled, I guess due to the files being cached in memory on subsequent reads (both in the same execution and between executions), but the parallel reads are still consistently slower for me even with the variance. (Perhaps it would be worth trying the tests in reverse order but I've spent too long on this for now and everything so far has confirmed what I saw in the past in cases where there was too much data, and too much time between tests of serial vs parallel builds, for disk caching to have been the only factor.)

    Here's my version of the project:

    Pre-built exes of the four versions (buffered/non-buffered and small-reads vs one-big-read):

    Finally, in case it matters, the 23MB test file. Create 10 copies of it named 0.tmp to 9.tmp in the same dir that the exe is run from:

  • Mark Russinovich: Inside Windows 7

    There are definitely things that Windows could do better with disk I/O unless my experience is atypical and due to something wrong with my system.

    Consider this example which I experienced with quite simple Win32 code on my Vista machine:

    I had two uncompressed BMP files on the HDD, about 50MB each. I needed to read both files into memory and process them and they had to be loaded completely before processing could begin. There was plenty of memory and it was a Core2Duo system with 32-bit Vista.

    If I used two threads to load the 50MB files in parallel (one file per thread) then, no matter how I wrote it, it consistently took *twice* as long as using a single thread and reading the files in series. Not the same length of time but twice as long. This was true even when both threads allocated the full 50MB each and read their respective files in a single 50MB ReadFile call each. No data was being written to disk and no other processes were using significant resources.

    That cannot be right. The OS is being asked by two threads in the same process to do two reads and it's allowing them to compete with each other to the extent that it takes twice as long. It makes no sense for those reads to be done in parallel as, even in the impossible best case of zero seek times, the result would still be both threads waiting until the full 100MB of data was read. Better, when the OS knows both threads are reading 50MB of data in a single ReadFile call, to let one thread read 50MB and move on, then let the other thread read its 50MB. That would mean one thread is ready after 50MB and the other after 100MB. (i.e. Compared to the impossible best case of the other method, one thread takes no longer to be ready while the other thread is ready twice as quickly. Win.)

    I realise that doing that could be complex given the way the system is layered. Some interleaving may be inevitable but what happens now has a lot of room for improvement. Neither thread is ready until the amount of time it would take a single thread to read 200MB of data, yet only 100MB of data is being read.
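    The arithmetic behind that comparison can be laid out as a small model. Time is measured in "MB read at full sequential speed", so it is independent of the actual drive. The strategy names and the `ready_times` function are made up for illustration, and the "observed" row simply encodes the doubling described above rather than any new measurement.

```python
FILE_MB = 50  # size of each BMP in the example above


def ready_times(strategy, file_mb=FILE_MB):
    """Return (thread A ready, thread B ready), in MB-of-sequential-read units."""
    if strategy == "serialized":
        # One 50MB read completes, then the other starts.
        return (file_mb, 2 * file_mb)
    if strategy == "ideal_interleaved":
        # Impossible best case: zero seek cost, both finish when all data is in.
        return (2 * file_mb, 2 * file_mb)
    if strategy == "observed":
        # What was seen: parallel takes twice as long as reading in series.
        return (4 * file_mb, 4 * file_mb)
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("serialized", "ideal_interleaved", "observed"):
    a, b = ready_times(name)
    print(f"{name}: A ready after {a}MB, B ready after {b}MB")
```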

    Even if you can refactor your own process to have a single "data loading" thread (which is very difficult with many 3rd party libraries and/or workloads which mix loading and processing), you have no way to synchronize your data loading with that of any other process on the system.
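    As a rough illustration of the single "data loading" thread idea, a one-worker executor is enough to force whole-file reads to happen one at a time inside a process while other threads carry on with processing. This is a Python sketch rather than Win32 code, `load_async` and `_read_whole` are hypothetical names, and it does nothing about reads issued by other processes or by libraries that open files themselves.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker means queued reads run strictly one after another.
_loader = ThreadPoolExecutor(max_workers=1)


def _read_whole(path):
    """Read the entire file in one call and return its bytes."""
    with open(path, "rb") as f:
        return f.read()


def load_async(path):
    """Queue a whole-file read on the loader thread; returns a Future."""
    return _loader.submit(_read_whole, path)
```

    A worker thread that needs a file calls `load_async(path).result()`, which blocks only that thread until its read reaches the front of the queue and completes.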