Poor performance of .NET Regex

  • glebd

    Recently, in a C# application I was writing, I had to process many thousands of lines of text using regular expressions. Initially I used the .NET Regex class for that. Processing the entire data set took a little over 4 minutes on a 3 GHz P4 machine (Windows XP Professional). After profiling, I saw that almost all of the time (99%) was spent in the Regex.Match() function. I tried pre-compiling the Regex object into a separate DLL to improve speed, but it made no noticeable difference: the whole data set still took about 4 minutes to process.

    Not satisfied with this (we had to go through the whole process quite often and it was very bothersome), I downloaded a Windows port of the PCRE C library (Perl Compatible Regular Expressions, http://www.pcre.org/). I then wrote a very thin managed wrapper around the plain C API of that library using C++/CLI, and called the wrapper from my C# application, replacing Regex.Match() with the appropriate PCRE function. The regular expression, however inefficient it may have been (I'm no regex expert), stayed the same.

    The next time I ran the application with a stopwatch, I couldn't believe my eyes. The whole data processing took... wait for it... 8 SECONDS!

    So, using the same regular expression and the same data: .NET Regex: 4 minutes, PCRE wrapped in C++/CLI: 8 seconds.

    Can anyone explain this kind of difference? Is PCRE a miracle? Or, more likely, is the .NET Regex implementation really THAT bad?

  • JChung2006

    Did you use a compiled or uncompiled regular expression in your C# version?

  • glebd

    JChung2006 wrote:
    Did you use a compiled or uncompiled regular expression in your C# version?


    Tried both, couldn't notice any significant difference. That's when I pre-compiled the Regex object into a DLL (which didn't help either, quite surprisingly.)
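
    For anyone who wants to try the same comparison, here is a minimal sketch of timing an interpreted Regex against one built with RegexOptions.Compiled. The pattern, the synthetic input and the names RegexModeTiming and CountMatches are all made up for illustration; the real pattern and data from this thread were never posted, so the numbers it prints say nothing about the original problem.

    ```csharp
    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    static class RegexModeTiming
    {
        // Hypothetical pattern; the one used in the thread was never posted.
        const string Pattern = @"^(\w+)=(\d+);";

        // Counts the lines that match and reports how long the loop took.
        public static int CountMatches(Regex regex, string[] lines, out TimeSpan elapsed)
        {
            Stopwatch watch = Stopwatch.StartNew();
            int hits = 0;
            foreach (string line in lines)
            {
                if (regex.Match(line).Success)
                {
                    hits++;
                }
            }
            watch.Stop();
            elapsed = watch.Elapsed;
            return hits;
        }

        static void Main()
        {
            // Synthetic input standing in for "many thousands" of lines.
            string[] lines = new string[100000];
            for (int i = 0; i < lines.Length; i++)
            {
                lines[i] = (i % 2 == 0) ? "key" + i + "=" + i + ";" : "no match here " + i;
            }

            // Each Regex is constructed once, outside the matching loop.
            Regex interpreted = new Regex(Pattern, RegexOptions.None);
            Regex compiled = new Regex(Pattern, RegexOptions.Compiled);

            TimeSpan interpretedTime, compiledTime;
            int interpretedHits = CountMatches(interpreted, lines, out interpretedTime);
            int compiledHits = CountMatches(compiled, lines, out compiledTime);

            Console.WriteLine("interpreted: {0} hits in {1} ms", interpretedHits, interpretedTime.TotalMilliseconds);
            Console.WriteLine("compiled:    {0} hits in {1} ms", compiledHits, compiledTime.TotalMilliseconds);
        }
    }
    ```

    If both modes come out about equally slow on the real data, that would suggest the time goes into how the pattern itself is matched rather than into interpretation overhead.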

  • JohnnyAwesome

    Are you able to post any code samples with the difference between implementation of your compiled C# RegEx class and your wrapper call?

  • glebd

    JohnnyAwesome wrote:
    Are you able to post any code samples with the difference between implementation of your compiled C# RegEx class and your wrapper call?


    Unfortunately no, as all of this code is client-owned. However, I only used simple matching, there is nothing fancy there.

  • ScanIAm

    glebd wrote:
    
    JohnnyAwesome wrote:
    Are you able to post any code samples with the difference between implementation of your compiled C# RegEx class and your wrapper call?


    Unfortunately no, as all of this code is client-owned. However, I only used simple matching, there is nothing fancy there.


    I've heard of this anecdotally from other devs at my last job, too.  We were working on server software and one of our benchmarks was how many transactions we could process in 1 second.  By replacing the regex with plain old code, we seriously cut down on the amount of time it took to get through a transaction.  I didn't think it was .Net's implementation, but assumed it was just that regex is slow.
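
    To illustrate what "replacing the regex with plain old code" can look like, here is a small sketch that parses a key=value line both ways. The line format and the names PlainVsRegex, TryParseWithRegex and TryParsePlain are hypothetical; the actual server code was not shown. It assumes each input is a single line with no embedded newline.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    static class PlainVsRegex
    {
        // Hypothetical format: everything before the first '=' is the key.
        static readonly Regex KeyValue = new Regex(@"^([^=]+)=(.*)$");

        // Regex version: allocates a Match and its groups on every call.
        public static bool TryParseWithRegex(string line, out string key, out string value)
        {
            Match m = KeyValue.Match(line);
            key = m.Success ? m.Groups[1].Value : null;
            value = m.Success ? m.Groups[2].Value : null;
            return m.Success;
        }

        // Plain version: one scan for '=' and two substrings.
        public static bool TryParsePlain(string line, out string key, out string value)
        {
            int eq = line.IndexOf('=');
            if (eq <= 0)
            {
                // No '=' at all, or an empty key: treat as no match.
                key = null;
                value = null;
                return false;
            }
            key = line.Substring(0, eq);
            value = line.Substring(eq + 1);
            return true;
        }

        static void Main()
        {
            string k, v;
            Console.WriteLine(TryParseWithRegex("name=glebd", out k, out v) + " " + k + " " + v);
            Console.WriteLine(TryParsePlain("name=glebd", out k, out v) + " " + k + " " + v);
        }
    }
    ```

    Whether the plain version is worth the loss of flexibility depends on how hot the code path is; only a profiler run on the real workload can tell.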

    It might be interesting to see if this holds true for the next version of .net so we can see what is causing it to take so long (i.e. when we look at the source of their regex libraries.)

  • W3bbo

    ScanIAm wrote:
    It might be interesting to see if this holds true for the next version of .net so we can see what is causing it to take so long (i.e. when we look at the source of their regex libraries.)


    And we can't with Reflector?

    Note that I doubt the imminent release of the source (under a "look, don't touch" license) will include the InternalCall stuff, which some of regex uses.

  • AdrianJMartin

    glebd wrote:
    
    JohnnyAwesome wrote:
    Are you able to post any code samples with the difference between implementation of your compiled C# RegEx class and your wrapper call?


    Unfortunately no, as all of this code is client-owned. However, I only used simple matching, there is nothing fancy there.



    Without seeing the code it would be almost pointless to investigate the issue. Can you not recreate the issue in a sample app?

    One huge difference is that the body of text you are searching will have been pushed into unmanaged memory...

  • glebd

    AdrianJMartin wrote:
    One huge difference is that the body of text you are searching will have been pushed into unmanaged memory...


    Not really, as the text is split into lines first in the C# program before matching each line with regex. So I don't think managed/non-managed memory difference comes into play here.

    The only difference in the code was that the initial version used Regex.Match(), and the new version uses the PCRE wrapper function to do the same on the same strings using the same regular expression. The surrounding code is exactly the same. The difference in speed must be in the way regular expressions are implemented in PCRE and .NET.

  • ScanIAm

    W3bbo wrote:
    
    ScanIAm wrote:
    It might be interesting to see if this holds true for the next version of .net so we can see what is causing it to take so long (i.e. when we look at the source of their regex libraries.)


    And we can't with Reflector?

    Note that I doubt the imminent release of the source (under a "look, don't touch" license) will include the InternalCall stuff, which some of regex uses.


    Does reflector compile?  i.e. the method I used to figure out that regex was slow was to run it in a profiler.  The profiler could not see into the guts of the .net regex code because there wasn't a 'debug' version of it that it could map up.

  • rhm

    We've not had any problems with the performance of .NET's regexes in our software at work, and I think they're generally regarded as quick. You could have found some edge case where they suck, but it's impossible to tell without seeing your code. One possibility is that your regex creates huge numbers of group objects and is thus slow because you're spanking the GC.
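
    If group objects are the suspect, one way to check is to see how many groups a match exposes and whether the caller needs them at all. This is only a sketch with a hypothetical pattern and made-up names (CaptureCost, GroupCount, OnlyNeedsYesNo); it shows that RegexOptions.ExplicitCapture stops unnamed parentheses from capturing, and that Regex.IsMatch answers a yes/no question without handing back a Match. Whether either change helps with the regex in this thread is unknown.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    static class CaptureCost
    {
        // Hypothetical pattern with two unnamed capturing groups.
        const string Pattern = @"(\d+)-(\d+)";

        // Number of groups on the resulting Match, including group 0 (the whole match).
        public static int GroupCount(RegexOptions options, string input)
        {
            Match m = new Regex(Pattern, options).Match(input);
            return m.Groups.Count;
        }

        // When only success matters, IsMatch avoids handling Match/Group objects in caller code.
        public static bool OnlyNeedsYesNo(string input)
        {
            return Regex.IsMatch(input, Pattern);
        }

        static void Main()
        {
            Console.WriteLine(GroupCount(RegexOptions.None, "10-20"));            // prints 3
            Console.WriteLine(GroupCount(RegexOptions.ExplicitCapture, "10-20")); // prints 1
            Console.WriteLine(OnlyNeedsYesNo("10-20"));                           // prints True
        }
    }
    ```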

  • ScanIAm

    rhm wrote:
    We've not had any problems with the performance of .NET's regexes in our software at work, and I think they're generally regarded as quick. You could have found some edge case where they suck, but it's impossible to tell without seeing your code. One possibility is that your regex creates huge numbers of group objects and is thus slow because you're spanking the GC.


    I don't want to imply that the .net regex was that horrible speedwise, either.  We were shaving off milliseconds in an effort to speed things up.

  • Nikster

    If you had at least a small test case, I could run it through dotTrace to see where the bottleneck really is. I don't have time to try to recreate the case you're using and hope it makes a valid benchmark, though I am curious about the results.

  • TheSteve

    You could use the benchmark feature in Regex Hero (you'll need Silverlight) to see if the slowness is somehow related to your implementation or the regular expression itself.  The only thing about it is that it will also try to highlight matches as you type.  And since you have such a massive target string, that may make working with it rather slow.  So you may want to scale down the experiment with a smaller target string and then set the number of iterations however you'd like.

    By the way, the code for the benchmarking feature is right here:

    http://regexhero.net/blog/2009/05/revised-benchmarking.html

    EDIT -- I've added the ability to disable real-time highlighting. This should allow you to insert a huge target string and benchmark it more easily.
