Coffeehouse Thread

61 posts

Does Parallel.ForEach only return when ALL iterations have completed?

  • cbae

    , BitFlipper wrote

    *snip*

     

    @cbae: I can look into such a solution of first deleting all files then folders.

    BTW, shouldn't the file filter be "*" as opposed to "*.*"? Will it pick up files with no extensions?

    I think there's no difference between "*" and "*.*". They both pick up files with no extensions. You can test this from cmd.exe. If you want files with no extensions specifically, you'd use "*.".
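
    If you'd rather check from managed code than cmd.exe, a quick throwaway snippet like this works too (the folder path is just a placeholder):

        // Point this at a test folder containing files both with and
        // without extensions (placeholder path).
        var folder = @"C:\temp\wildcardtest";

        var star = Directory.GetFiles(folder, "*");
        var starDotStar = Directory.GetFiles(folder, "*.*");

        // Both patterns match extensionless files, so the counts agree.
        Console.WriteLine("\"*\" matched {0}, \"*.*\" matched {1}",
            star.Length, starDotStar.Length);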

  • eddwo

    Oops, never mind...

    Perhaps this sheds some light on what is happening

  • BitFlipper

    , eddwo wrote

    Oops, never mind...

    Perhaps this sheds some light on what is happening

    Well, reading through that, it seems it should be perfectly fine to use Parallel.ForEach in this way. Did you pick up something from that blog that points to what might be causing the issue I'm seeing?

  • davewill

    @BitFlipper: This stood out ... "Second, they're concerned that the outer loop's threads will end up blocking waiting for the inner parallel loops to finish."

    Implying that the parallel inner loops may or may not have completed before the outer loop moves on to the next iteration.

    In a nested directory delete that could mean a higher level directory delete is executed before the inner directory delete has executed.

    At least it seems like that could occur.

  • Bas

    , BitFlipper wrote

    *snip*

    But if I'm not calling Stop or Break anywhere, and there were no exceptions, I would expect that when Parallel.ForEach returns, all iterations should have completed, correct?

     

     

    I would assume so, but the only way to be sure is to check the IsCompleted property of the ParallelLoopResult it returns.
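
    Something like this sketch (the loop body stands in for whatever work you're already doing):

        ParallelLoopResult result = Parallel.ForEach(files, file =>
        {
            File.Delete(file);
        });

        // IsCompleted is false only if some iteration called Stop or
        // Break; an exception would instead surface as an
        // AggregateException thrown from ForEach itself.
        if (!result.IsCompleted)
        {
            Console.WriteLine("Stopped early; lowest break iteration: {0}",
                result.LowestBreakIteration);
        }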

  • BitFlipper

    , davewill wrote

    @BitFlipper: This stood out ... "Second, they're concerned that the outer loop's threads will end up blocking waiting for the inner parallel loops to finish."

    Implying that the parallel inner loops may or may not have completed before the outer loop moves on to the next iteration.

    In a nested directory delete that could mean a higher level directory delete is executed before the inner directory delete has executed.

    At least it seems like that could occur.

    That's not how I read it. While each nested level of Parallel.ForEach can run in parallel with that same level's sibling iterations, once ForEach returns it means that all of that level's iterations have completed. So deleting the contents of the folder and then the folder itself is supposed to be serialized.

    For the case of the recursive delete operation, any folder will only be deleted once ALL of its subfolders and files have been deleted (let's assume there were no exceptions along the way - I verified the very first exception is when an attempt is made to delete the supposedly empty folder - the very last line in the method).

    So logically, this should work since any folder will only be deleted when the subfolder and file delete operations have supposedly completed.

    Remember that the tree structure is being deleted from the inside out. Recursion continues all the way to the leaves, and only then are those files/folders deleted as it steps back out. Nowhere does the code attempt to delete a higher-level folder before its contents are supposedly deleted, as you suggest. If you still believe this is possible, can you elaborate on how that can happen?
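
    To make that concrete, the algorithm has roughly this shape (a sketch of it, not the exact code from my earlier posts):

    public static void DeleteFolderRecursive(string folder)
    {
        // Recurse into all subfolders in parallel. ForEach should only
        // return once every subfolder and its contents are gone.
        Parallel.ForEach(Directory.GetDirectories(folder), subfolder =>
        {
            DeleteFolderRecursive(subfolder);
        });

        // Then delete this folder's own files.
        foreach (var file in Directory.GetFiles(folder))
        {
            File.Delete(file);
        }

        // Finally delete the folder itself, which should now be empty.
        // This last line is where the IOException is thrown.
        Directory.Delete(folder);
    }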

     

    , Bas wrote

    *snip* 

    I would assume so, but the only way to be sure is to check the IsCompleted property of the ParallelLoopResult it returns.

    If there were no exceptions during any of the iterations, is it then possible that it did not complete all iterations when it returns? Is there a case where this could happen if Stop or Break was never called?

  • Bass

    There is a more interesting question in all of this, why in the world would manually iterating through the filesystem and deleting things one-by-one be faster than the single call you'd typically do to accomplish this?

    Even more interesting, why is multithreading I/O improving things? The only situation I could think of where that would be plausible is if the filesystem hierarchy exists on multiple physical devices. It seems like a big no-no to use threads to break up I/O-heavy tasks, and there are entire systems built around the idea that using threads for parallelism in I/O-heavy tasks is a bad idea (nginx, node.js, etc.).

    Basically, this entire situation seems entirely f*cked up to me and it would be nice to have an idea of what the hell is actually going on. Smiley

  • MasterPi

    , Bass wrote

    Even more interesting, why is multithreading I/O improving things?

    I was wondering about this too... Like, couldn't you just have a background worker run the synchronous delete if it were slowing things down elsewhere (e.g. blocking the UI)?
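
    Something like this sketch, say, where folder stands for whatever path is being cleaned up:

        // Push the blocking recursive delete off the UI thread.
        Task.Run(() => Directory.Delete(folder, true));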

  • evildictaitor

    , Bass wrote

    There is a more interesting question in all of this, why in the world would manually iterating through the filesystem and deleting things one-by-one be faster than the single call you'd typically do to accomplish this?

    How do you think Directory.Delete(filename, true) is implemented? (hint: it is not implemented by NTFS).

    There's an argument that perhaps Directory.Delete(_x, true) should be implemented with Parallel.For if it's faster to do so, but you also have to remember that Directory.Delete predates Parallel by some number of years, and there are all sorts of reasons why you might want your .NET base libraries to not be strongly coupled to a parallelization framework.

    Even more interesting, why is multithreading I/O improving things?

    Because you are making the mistake of thinking that getting information from the filesystem or deleting files is in fact causing IO to happen. The information being requested is highly clustered and after the first call will almost all be in memory rather than needing a disk access to obtain, and deleting the file also doesn't go back to disk until the NTFS cache flushes.

    By parallelising it, what you do get is all of the NTFS structure traversal and the expensive syscalls (which happen in memory) running on different CPUs, which is exactly what Parallel.For is good for.
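
    For illustration, a recursive Directory.Delete boils down to roughly this kind of single-threaded walk (a simplification, not the actual BCL source):

        static void RecursiveDelete(string path)
        {
            // Depth-first: children before the parent, one call at a time.
            foreach (var dir in Directory.GetDirectories(path))
                RecursiveDelete(dir);

            foreach (var file in Directory.GetFiles(path))
                File.Delete(file);

            Directory.Delete(path);
        }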

  • cbae

    @BitFlipper: I did some testing and I discovered that Directory.Delete(folder, true) is simply inefficient as hell, even if the folders that you delete are empty.

    I tested your original code using a test folder containing around 28K files totaling 2.6 GB and distributed over 7200 subfolders. I wasn't able to reproduce the IO exception, but the code completed in around 40 seconds.

    I created the following routine that enumerates all files and deletes them in parallel. Afterward, it enumerates all folders, grouping them by nesting level, and then deletes all folders within a nesting level in parallel to ensure that a parent folder can never be deleted before a child.

    The following code completed in 28 seconds:

            public static void DeleteFolder(string folder)
            {
                if (!Directory.Exists(folder))
                {
                    return;
                }

                // First delete every file in the tree, in parallel.
                DeleteAllFiles(folder);

                // Group the now-empty subfolders by nesting depth (the
                // number of '\' separators) and reverse, so the deepest
                // level comes first. This relies on GetDirectories listing
                // a parent before its children, which makes the groups
                // come out shallowest-first.
                var fgroups = Directory.GetDirectories(folder, "*", SearchOption.AllDirectories)
                    .GroupBy(d => d.Count(s => s == '\\'),
                             d => d,
                             (g, d) => new { Level = g, Folders = d })
                    .Reverse();

                foreach (var fgroup in fgroups)
                {
                    // Folders at the same nesting level can't contain one
                    // another, so they can safely be deleted in parallel.
                    Parallel.ForEach(fgroup.Folders, f =>
                    {
                        Directory.Delete(f);
                    });
                }
                Directory.Delete(folder);
            }

            public static void DeleteAllFiles(string folder)
            {
                if (!Directory.Exists(folder))
                    return;

                // One flat list of every file in the tree, so the parallel
                // loop has plenty of items to spread across threads.
                var files = Directory.GetFiles(folder, "*", SearchOption.AllDirectories);

                Parallel.ForEach(files, (file) =>
                {
                    File.Delete(file);
                });
            }

  • BitFlipper

    @cbae: Yes, what I found is that the original problem is only reproducible if the files are not cached. Meaning if I boot up and run DeleteFolderRecursive on an existing folder, it will most likely reproduce. But if, as in my test code, I first copy the folder from a source folder and then call DeleteFolderRecursive on it, it almost never reproduces, because the OS has cached all the files (or at least the metadata).

    You should also try this very simple implementation to see if your implementation is really faster:

    public static void DeleteFolderRecursive(string folder)
    {
        if (!Directory.Exists(folder))
            return;

        // Delete every file in the entire tree in parallel first...
        var files = Directory.GetFiles(folder, "*", SearchOption.AllDirectories);

        Parallel.ForEach(files, (file) =>
        {
            File.Delete(file);
        });

        // ...then let the framework remove the now file-less folder tree.
        Directory.Delete(folder, true);
    }
    

    I found that to be the fastest by far: 6-7 times faster than Directory.Delete(xxx, true).
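
    A plain Stopwatch (System.Diagnostics) around each call is enough to compare them, roughly like this; the paths are placeholders, and each run should start from an identical copy of the test tree:

    var sw = Stopwatch.StartNew();
    DeleteFolderRecursive(@"C:\temp\copy1");   // the method above
    Console.WriteLine("Parallel: {0:F1}s", sw.Elapsed.TotalSeconds);

    sw.Restart();
    Directory.Delete(@"C:\temp\copy2", true);  // plain recursive delete
    Console.WriteLine("Directory.Delete: {0:F1}s", sw.Elapsed.TotalSeconds);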

  • cbae

    @BitFlipper: My implementation differs from your updated routine only in the folder-deletion methodology (i.e. it still deletes the files first). When I first tested the approach of deleting all files first and then enumerating the empty folders and deleting them separately, I still used the recursive flag (i.e. Directory.Delete(folder, true)) for the reason I mentioned in my original post. Nevertheless, I thought it might still be relatively fast, since it would be deleting just an empty directory tree.

    As it turns out, it actually took 50 seconds (i.e. 10 seconds slower than your original code).

    I broke up the two parts (i.e. deleting files and then deleting folders) and found that deleting the files took about 10 seconds, and deleting the folders, despite being empty, took 40 seconds.

    I changed the code to simply delete the root folder using the recursion flag (as in your most recent code), and the folder deletion took about the same amount of time (50 seconds) as enumerating first and deleting the folders. The main reason was the way the folders are enumerated:

    • root\sub1\
    • root\sub2\
    • root\sub3\
    • root\sub1\suba\
    • root\sub1\subb\
    • root\sub2\suba\
    • root\sub2\subb\
    • root\sub3\suba\
    • root\sub3\subb\

    Do you see the problem there? If you iterate through that list in order, you'd be deleting the entire root\sub1 tree before you even get to root\sub1\suba\ and root\sub1\subb\. So only 3 separate calls to Directory.Delete(subfolder, true) are being made, which is not that different from just deleting the root folder.

    In order to avoid having to use the recursion flag, I reversed the order of the subfolders and then deleted them using a regular foreach loop. The time improved to 41 seconds, which is around what your original method produced.

    To try to benefit from parallel processing, I used the reverse order of the subfolders and ran it in parallel without the recursive flag. This, unfortunately, did result in the "directory not empty" IO error.

    If you look at the list of subfolders above again, you can see how the problem might occur. I have no idea how .NET partitions a collection for parallel processing, but if Thread A starts deleting from root\sub3\subb\ going up that list and Thread B starts with root\sub3\ and works its way up, then Thread B may try to delete root\sub1\ before Thread A has completed root\sub1\suba\.

    So, to benefit from both parallel processing and being able to call Directory.Delete() without the recursion flag, I added the grouping by nesting level as I posted in my last message.
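
    As an aside on the partitioning question: playing with a toy loop suggests that, for arrays at least, the default partitioner hands each worker a contiguous range of indices:

        // With an array, Parallel.ForEach gives each worker a contiguous
        // chunk, so thread IDs cluster over runs of adjacent items.
        var items = Enumerable.Range(0, 20).ToArray();

        Parallel.ForEach(items, i =>
        {
            Console.WriteLine("item {0,2} on thread {1}",
                i, Thread.CurrentThread.ManagedThreadId);
        });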

     

  • BitFlipper

    , cbae wrote

    @BitFlipper: When I first tested the method of deleting all files first and then enumerating the empty folders and deleting them separately, I still used the recursive flag (i.e. Directory.Delete(folder, true)) for the reason I mentioned in my original post. Nevertheless, I thought it might still be relatively fast since it would be deleting just an empty directory tree.

    As it turns out, it actually took 50 seconds (i.e. 10 seconds slower than your original code).

    That seems to be different from what I'm seeing. The last version I posted is by far the fastest for me at least. I'll play around with your implementation tomorrow and see if that is faster.

    I broke up the two parts (i.e. deleting files and then deleting folders) and found that deleting the files took about 10 seconds, and deleting the folders, despite being empty, took 40 seconds.

    For me deleting the tree of empty folders didn't seem to take very long. I'll do some more testing tomorrow.

    If you look at the list of subfolders above again, you can see how the problem might occur. I have no idea how .NET partitions a collection for parallel processing, but if Thread A starts deleting from root\sub3\subb\ going up that list and Thread B starts with root\sub3\ and works its way up, then Thread B may try to delete root\sub1\ before Thread A has completed root\sub1\suba\.

    I'm not sure if you are referring to the way you implemented your algorithm or to my original algorithm, but there is no way the original algorithm can get into a state where a parent folder is deleted before all of its contents are deleted (in spite of the FS race condition issue I'm seeing; that is a different issue altogether).

    If you visualize the tree being deleted, with each folder enumerating through its list of subfolders, then at any particular point that part of the code is blocked by that very same ForEach call, which (in theory) should only return once all subfolders have been deleted. Only when the ForEach call returns will that folder have its files, and then itself, deleted. Similarly, all parent folders are blocked by their own level's ForEach call until it returns. If the FS doesn't have any race issues, this should work perfectly.

  • cbae

    , BitFlipper wrote

    *snip*

    For me deleting the tree of empty folders didn't seem to take very long. I'll do some more testing tomorrow.

    It probably has to do with the distribution of the files in my test folder. I have a lot of deeply nested subfolders (some 19 levels deep), with files pretty evenly distributed throughout the directory tree. IOW, there aren't many folders with a great number of files. In fact, when I first tested your original code, I changed the foreach loop you used for file deletions to Parallel.ForEach, and it ran even slower. You can't benefit from parallel processing at the subfolder level if you have at most 5 files in a folder. OTOH, enumerating all of the files first and then deleting them can benefit greatly from parallel processing, which is why deleting 28K files was faster than deleting 7K empty folders in my test scenario.

    *snip*

    I'm not sure if you are referring to the way you implemented your algorithm or my original algorithm,

    I'm talking about one of my implementations, as described here:

    To try to benefit from parallel processing, I used the reverse order of the subfolders and ran it in parallel without the recursive flag. This, unfortunately, did result in the "directory not empty" IO error.

  • JoshRoss

    This is one of the best threads in a while. It's hard not to nerd-out while talking about who can delete files the fastest.

    -Josh

  • kriskdf

    I haven't tested this, but it seems like it might be faster to do rmdir /S /Q <folder>.

    40 seconds seems like a long time.

  • cbae

    , kriskdf wrote

    I haven't tested this, but it seems like it might be faster to do rmdir /S /Q <folder>.

    40 seconds seems like a long time.

    Deleting this folder from the command line was actually slower.

  • davewill

    @BitFlipper: In the original code, if the Directory.Delete is wrapped in a try/catch like this ...

                try
                {
                    Directory.Delete(folder);
                }
                catch (Exception)
                {
                    // Dump what is still inside the folder before rethrowing,
                    // to see whether it really is non-empty at this point.
                    var dc = Directory.GetDirectories(folder).Length;
                    var fc = Directory.GetFiles(folder).Length;
                    Console.WriteLine("Folder=" + folder);
                    Console.WriteLine("Dir count=" + dc);
                    Console.WriteLine("File count=" + fc);
                    throw;
                }
    

     

    what does it show? Zero for dir and file count or something greater?

     
