Coffeehouse Thread

61 posts


Does Parallel.ForEach only return when ALL iterations have completed?

  • BitFlipper

    I have a situation where I need to automate the deletion of a folder tree that contains hundreds of thousands of files. The problem is that this is SLOW, taking more than a few minutes (why does the same operation on Linux take only about 2 seconds???).

    Anyway, so I've been playing around with some recursive folder delete code. It uses Parallel.ForEach to delete folders, and at first I thought it might be slower due to IO contention; however, tests have shown that it is about 40% faster than Directory.Delete(folder, true).

    However, I've been running into a problem where I get an IOException saying "The directory is not empty". When I then check that directory, it IS empty, and running the delete again succeeds. It seems that either Parallel.ForEach returns before all iterations have completed, or the Windows filesystem is not thread-safe (or I have a bug in my code, but it seems so simple).

    Below is the code. If I switch to the non-parallel loop (commented out in the code), the problem goes away. What's going on?

    Note the following:

    • I have removed all try/catch blocks since this is just sample code
    • No files have read-only attribute set
    • No files are open in any other process

     

        public void DeleteFolderRecursive(string folder)
        {
            if (!Directory.Exists(folder))
                return;
    
            var subFolders = Directory.GetDirectories(folder);
    
            Parallel.ForEach(subFolders, (subFolder) =>
            {
                DeleteFolderRecursive(subFolder);
            });
    
            //foreach (var subFolder in subFolders)
            //{
            //    DeleteFolderRecursive(subFolder);
            //}
    
            var files = Directory.GetFiles(folder);
    
            foreach (var file in files)
            {
                File.Delete(file);
            }
    
            Directory.Delete(folder);
        }
    

  • Bas

    Does it block the calling thread: yes. Does it only return when all iterations are complete: no. After all, somebody may have called Stop or Break during some iteration. Parallel.ForEach returns a ParallelLoopResult structure which you can check to see if all iterations have run to completion.

    Not sure what's going on in your code though.
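
    For illustration, checking that result might look something like this (an untested sketch; the method name is made up, and it assumes the usual System / System.IO / System.Threading.Tasks usings):

        // Sketch only: Parallel.ForEach blocks until the loop finishes, and the
        // ParallelLoopResult it returns says whether every iteration ran to completion.
        public static void DeleteSubFoldersInParallel(string root)
        {
            var subFolders = Directory.GetDirectories(root);

            ParallelLoopResult result = Parallel.ForEach(subFolders, subFolder =>
            {
                Directory.Delete(subFolder, true);
            });

            // IsCompleted is false only if Stop() or Break() ended the loop early;
            // an exception inside an iteration surfaces as an AggregateException instead.
            Console.WriteLine("All iterations completed: " + result.IsCompleted);
        }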

  • MasterPi

    I've had strange issues with Parallel.ForEach when recursively reading a directory. When I last used it, it didn't end up reading all the files. Making it sequential (but non-blocking) worked just fine. Maybe there's an issue with file IO and Parallel.ForEach?

    In terms of your code, do you know if File.Delete always succeeds? Like, you're not swallowing up the exception there, right?

    Another thing I'm wondering is if there is some sort of a race condition. Like, you obtain a lock on the directory the moment you manipulate its contents (like delete). Then after all the deletes occur, you try to delete the directory before the lock is returned. If this is the case, then maybe let the directory delete spin in a while condition to check whether it CanDelete().  

  • vesuvius

    , Bas wrote

    Not sure what's going on in your code though.

    He is deleting a folder and all the files and folders that that folder contains.

    I would create an additional Queue of directories "that are not empty", then attempt to delete these again at the end
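
    Something along those lines might look roughly like this (an untested sketch; the method and parameter names are made up, and it assumes the usual System.Collections.Generic / System.IO usings):

        // Sketch: collect directories whose delete fails with "not empty"
        // and try them once more after everything else has been processed.
        public static void DeleteWithRetryQueue(IEnumerable<string> directoriesDeepestFirst)
        {
            var retryQueue = new Queue<string>();

            foreach (var dir in directoriesDeepestFirst)
            {
                try { Directory.Delete(dir); }
                catch (IOException) { retryQueue.Enqueue(dir); } // likely "directory is not empty"
            }

            // Second pass at the end: by now the children really should be gone.
            while (retryQueue.Count > 0)
            {
                Directory.Delete(retryQueue.Dequeue());
            }
        }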

  • MasterPi

    @vesuvius: I think Bas intended to say, "I'm not sure what's wrong with your code."

  • kettch

    , vesuvius wrote

    *snip*

    I would create an additional Queue of directories "that are not empty", then attempt to delete these again at the end

    Indeed. File and directory locking can come from a lot of sources. It can also be transient, and is very hard to track down if it is.

  • evildictaitor

    What happens if you run this code on a machine with no antivirus? AVs have a nasty habit of getting handles to files when you expect no one to have a handle open, which could be causing your problems.

     

  • vesuvius

    , kettch wrote

    *snip*

    Indeed. File and directory locking can come from a lot of sources. It can also be transient, and is very hard to track down if it is.

    ++

  • ScanIAm

    One debug option is to add a check in your code to confirm that the directory is actually empty prior to trying to delete it. Recursion has a purpose, but like regex, it can make things much more complicated. If you use a queue of FileSystemInfos and process them that way, you can use Tasks to process things in parallel while still keeping things easier to follow (IMHO).

    If you dequeue a DirectoryInfo, check if it's empty.

    If so, delete it.

    If not, enqueue all of its inner FileSystemInfos and then enqueue the DirectoryInfo again.

    If you dequeue a FileInfo, delete it.

    Repeat until the queue is empty.
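
    A minimal sketch of that loop might be (untested; ProcessDeleteQueue is a made-up name, and it assumes the usual System.Collections.Generic / System.IO usings):

        // Sketch of the queue-based approach described above (sequential skeleton;
        // the file deletions could be handed off to Tasks for parallelism).
        public static void ProcessDeleteQueue(string root)
        {
            var queue = new Queue<FileSystemInfo>();
            queue.Enqueue(new DirectoryInfo(root));

            while (queue.Count > 0)
            {
                var item = queue.Dequeue();

                var dir = item as DirectoryInfo;
                if (dir == null)
                {
                    item.Delete();              // it's a FileInfo
                    continue;
                }

                var children = dir.GetFileSystemInfos();
                if (children.Length == 0)
                {
                    dir.Delete();               // empty, safe to remove
                }
                else
                {
                    foreach (var child in children)
                        queue.Enqueue(child);   // children first...

                    queue.Enqueue(dir);         // ...then revisit the directory itself
                }
            }
        }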

    On a side note, both your method and mine will loop endlessly if we come in contact with a looped path structure (e.g. a junction or symlink that points back up the tree). You'll need to check for that.

     

  • Bass

    , BitFlipper wrote

    why does this same operation on Linux only take about 2 seconds???

      

    Depends on the filesystem, I think. But handling lots of files at once is something that was optimized long ago because of Linux's use as a mail server: many mail server implementations store every e-mail of every account as a single file. Wouldn't be very nice if someone nuked their entire inbox and locked up the server for several minutes, now would it? Smiley

  • Sven Groot

    , Bass wrote

    *snip*

    Depends on the filesystem I think. Smiley

    Yeah, it depends. Ext3 isn't particularly good at deleting huge numbers of files either, if memory serves. A bit faster than NTFS maybe, but there are definitely better options out there.

  • Bas

    , MasterPi wrote

    @vesuvius: I think Bas intended to say, "I'm not sure what's wrong with your code."

    I figured that would be obvious, since BitFlipper's code sample is extremely straightforward and he already explained what it did in his post. Smiley

  • felix9

    this MIGHT be relevant ?

    The file / folder is NOT guaranteed to be deleted right after DeleteFile returns and succeeds.

    http://blogs.msdn.com/b/oldnewthing/archive/2012/09/07/10347136.aspx

    However, the NT model for file deletion is that a file is deleted when the last open handle is closed.¹ If DeleteFile returns and the file still exists, then it means that somebody else still has an open handle to the file.

    I think maybe you can trace the exact problem with ProcessMonitor.

  • JoshRoss

    Why not call System.IO.Delete(path,true)? I don't know how many concurrent deletes $bitmap can take anyways. I suppose the answer to any scientific question is "Let's do an experiment."

    -Josh

  • cbae

    I'm guessing this issue has to do with recursion. Still, there's really no reason to use a recursive method for this. When enumerating the files, you can use the SearchOption.AllDirectories flag when calling Directory.GetFiles():

    var files = Directory.GetFiles(folder, "*.*", SearchOption.AllDirectories);

    After all the files have been deleted, you can use the same flag to enumerate all the folders:

    var folders = Directory.GetDirectories(folder, "*", SearchOption.AllDirectories);

    Of course, there could be an issue with the order in which the folders are enumerated (unless you sort the list by length of folder name in descending order to ensure the deepest-nested folders are deleted first). So, you might be forced to delete folders using the recursive flag (I'm guessing it's going to be much faster with empty folders) and check the existence of the folder prior to deleting.

                if (Directory.Exists(folder))
                {
                    Directory.Delete(folder, true);
                }

    Edit: Come to think of it, if deleting empty folders is faster than populated folders, you might be able to get away with just deleting the root folder with the recursion flag.
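
    For what it's worth, the sort-by-depth variant mentioned above might look roughly like this (an untested sketch; the method name is made up, and it assumes System.IO plus System.Linq):

        // Untested sketch: delete every file first, then remove directories
        // deepest-first so that parents are empty by the time they are deleted.
        public static void DeleteDeepestFirst(string folder)
        {
            foreach (var file in Directory.GetFiles(folder, "*.*", SearchOption.AllDirectories))
            {
                File.Delete(file);
            }

            var folders = Directory.GetDirectories(folder, "*", SearchOption.AllDirectories)
                                   .OrderByDescending(f => f.Length); // longest path ~ deepest

            foreach (var dir in folders)
            {
                Directory.Delete(dir);
            }

            Directory.Delete(folder);
        }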

  • BitFlipper

    , Bas wrote

    Does it block the calling thread: yes. Does it only return when all iterations are complete: no. After all, somebody may have called Stop or Break during some iteration. Parallel.ForEach returns a ParallelLoopResult structure which you can check to see if all iterations have run to completion.

    Not sure what's going on in your code though.

    But if I'm not calling Stop or Break anywhere, and there were no exceptions, I would expect that when Parallel.ForEach returns, all iterations should have completed, correct?

     

    , MasterPi wrote

    In terms of your code, do you know if File.Delete always succeeds? Like, you're not swallowing up the exception there, right?

    I set VS to break when exceptions are thrown, and the very first exception that is thrown is at the last line (line 25 in the sample code). This implies all previous operations succeeded with no exceptions, which in turn implies all files and folders were successfully deleted.

    Another thing I'm wondering is if there is some sort of a race condition. Like, you obtain a lock on the directory the moment you manipulate its contents (like delete). Then after all the deletes occur, you try to delete the directory before the lock is returned. If this is the case, then maybe let the directory delete spin in a while condition to check whether it CanDelete().

    Yeah, that is what I'm trying to figure out. One would think that, from user code, the FS would be thread-safe, or at least that if one thread returns from performing an FS operation, the operation would be completely done, no matter whether another thread then accesses the FS.

     

    @evildictaitor: Yes the AV angle is a possibility. Didn't think of that. But why would the non-parallel version of the code then not have this issue?

     

    @ScanIAm: I can look into such an algorithm, but it seems such a simple task should not require going to such lengths?

     

    @Bass: Makes sense. I don't know what Windows is doing but it is SLOW in many cases compared to Linux.

     

    , felix9 wrote

    this MIGHT be relevant ?

    The file / folder is NOT guaranteed to be deleted right after DeleteFile returns and succeeds.

    http://blogs.msdn.com/b/oldnewthing/archive/2012/09/07/10347136.aspx

    *snip*

    I think maybe you can trace the exact problem with ProcessMonitor.

    It could very well be related to AV software, but then how do you explain the parallel version failing but the sequential version succeeding in an otherwise identical environment?

     

    , JoshRoss wrote

    Why not call System.IO.Delete(path,true)? I don't know how many concurrent deletes $bitmap can take anyways. I suppose the answer to any scientific question is "Let's do an experiment."

    -Josh

    Are you talking about System.IO.Directory.Delete? If so, I did that originally, but that is what was slow. When the parallel version of my delete function works without running into this issue, it is ~40% faster; in some cases I've measured >50% faster. So there is a possibility to speed things up; I'm just trying to get to the bottom of what is causing the failures.

     

    @cbae: I can look into such a solution of first deleting all files then folders.

    BTW, shouldn't the file filter be "*" as opposed to "*.*"? Will it pick up files with no extensions?

     

    In the actual implementation, the way I currently deal with this issue is to have N retries, so when it fails, I do a Thread.Sleep(100) and retry the operation up to N times. But out of curiosity I'm trying to understand why this is happening.
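
    For reference, that retry wrapper is roughly this shape (an untested sketch; the helper name is made up, and it assumes the usual System / System.IO / System.Threading usings). It would be called like RetryOnIOException(() => Directory.Delete(folder), 5):

        // Rough shape of the retry workaround described above (sketch only).
        private static void RetryOnIOException(Action operation, int maxRetries)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    operation();
                    return;
                }
                catch (IOException)
                {
                    if (attempt >= maxRetries)
                        throw;

                    Thread.Sleep(100); // give whatever still holds a handle a moment to let go
                }
            }
        }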

  • ScanIAm

    @BitFlipper: it could be as simple as the latency of caching to the drive causing it to report, incorrectly, that the directory isn't empty.

  • BitFlipper

    , ScanIAm wrote

    @BitFlipper: it could be as simple as the latency of caching to the drive causing it to report, incorrectly, that the directory isn't empty.

    So it is a race condition? One would think you wouldn't see such things from the user's POV, no?

    OK I've looked into cbae's suggestion of first deleting all files, then Directory.Delete(folder, true), and got some interesting results:

    [Grrr... C9 code formatting screwing up again. All code on one line... Switching to quoted blocks...]

    public static void DeleteFolderRecursive(string folder)
    {
        if (!Directory.Exists(folder))
            return;

        var files = Directory.GetFiles(folder, "*", SearchOption.AllDirectories);

        foreach (var file in files)
        {
            File.Delete(file);
        }

       Directory.Delete(folder, true);
    }

    Things get quite a bit faster. I'm now seeing a 3x speed improvement over a simple Directory.Delete(..., true). However, things can be sped up even more:

    public static void DeleteFolderRecursive(string folder)
    {
        if (!Directory.Exists(folder))
            return;

        var files = Directory.GetFiles(folder, "*", SearchOption.AllDirectories);

        Parallel.ForEach(files, (file) =>
        {
            File.Delete(file);
        });

        Directory.Delete(folder, true);
    }

    The above results in a performance improvement of 6 to 7 times over a simple Directory.Delete(..., true) call. I just have to test it more to see if it also suffers from the "Directory is not empty" problem. If so, a retry loop would probably solve that.
