Tech Off Thread

7 posts

Generic Regular Expressions

  • Blue Ink

    Hello,
    I'm using RegEx for some file processing. All's well except that the expression is large, with a small portion that needs to change at each call. This means that the expression can no longer be compiled, at least from my understanding of the documentation.
    Two questions...
    1) is there a workaround to compile a regular expression on the fly AND be able to unload the assembly afterwards?
    2) is there any rough figure for the performance difference between a compiled expression and one that is not?

    Thanks in advance
    --m

  • footballism

    MSDN wrote:

    Regular Expression Compilation

    In the Regular Expression (Regex) space, there is the option to specify that a regular expression is compiled through use of the RegexOptions.Compiled setting. This switch hints at an underlying aspect of regular expressions in .NET: they have three modes. Let's look at each of those modes and discuss the relative performance trade-offs of each.

    By default, regular expressions in .NET are interpreted. That is, you create a Regex with no compiled switch. Interpreted Regexes have the smallest impact to the startup performance of an application, but they have the slowest run-time speed since all of the processing remains to be conducted at run time. They are a good option if your Regex will be used rarely, since a rarely used expression won't impact overall run-time speed much (by definition), and taking any kind of startup hit probably isn't worth it:

    Regex r = new Regex("abc*");
    Regex.Match("1234bar", @"(\d*)bar");

    The second option is regular expressions that are compiled on the fly. This is simple to achieve by passing in RegexOptions.Compiled when performing your Regex operation on the Regex (or by passing it into the Regex constructor). In this scenario, the regular expression engine does initial work to parse the expression into opcodes. Then, the engine turns those opcodes into Microsoft intermediate language (MSIL) using Reflection.Emit. Basically, it front loads as much work as it can to the startup of your app, effectively trading run-time speed for startup speed. In practice, compilation takes about an order of magnitude longer to start up, but on average yields 30 percent better run-time performance. This is a good solution if you have a regular expression you're going to be performing often. The overall run-time savings will add up and become significant compared to the hit you have to take at startup time.

    There is a caution here, however. Emitting MSIL with Reflection.Emit loads a lot of code and uses a lot of memory, and that's not memory you'll ever get back. Generated MSIL cannot be unloaded, or at least, it couldn't in the .NET Framework 1.x. The only way to unload code is to unload an entire application domain. This is a general rule, and is simply the sacrifice you have to make. The good news is that part of this problem is solved in the .NET Framework 2.0—regular expressions are compiled using Lightweight Code Generation, which allows the generated MSIL to be garbage collected. But even then, you may need to release more of the functionality faster. Therefore, the general approach to compiled Regexes should be to only use this mode for a finite set of expressions that you know will be used repeatedly. Even more specifically, avoid this mode except for one or two key Regexes, and for the rest use either interpreted or, the final option, precompiled. Here's how RegexOptions looks:

    Regex r = new Regex("abc*", RegexOptions.Compiled);
    Regex.Match("1234bar", @"(\d*)bar", RegexOptions.Compiled);
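
To get a rough figure for your own pattern rather than relying on the averages quoted above, a small timing harness can help. The following is only a sketch: the class and method names (RegexModeTiming, TimeConstruction, TimeMatches, Compare) are made up for illustration, and Stopwatch timings like these are coarse and vary by runtime version and machine.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper (not from the article): compares interpreted and
// compiled modes for one pattern on one input.
public static class RegexModeTiming
{
    // Times construction of a Regex with the given options.
    public static TimeSpan TimeConstruction(string pattern, RegexOptions options, out Regex regex)
    {
        Stopwatch sw = Stopwatch.StartNew();
        regex = new Regex(pattern, options);
        sw.Stop();
        return sw.Elapsed;
    }

    // Times `iterations` IsMatch calls on an already-constructed Regex.
    public static TimeSpan TimeMatches(Regex regex, string input, int iterations)
    {
        regex.IsMatch(input); // warm-up call so first-use costs are not counted
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.IsMatch(input);
        }
        sw.Stop();
        return sw.Elapsed;
    }

    // Builds a one-line report of construction and matching times for both modes.
    public static string Compare(string pattern, string input, int iterations)
    {
        Regex interpreted, compiled;
        TimeSpan ctorInterpreted = TimeConstruction(pattern, RegexOptions.None, out interpreted);
        TimeSpan ctorCompiled = TimeConstruction(pattern, RegexOptions.Compiled, out compiled);
        TimeSpan runInterpreted = TimeMatches(interpreted, input, iterations);
        TimeSpan runCompiled = TimeMatches(compiled, input, iterations);
        return string.Format(
            "construct: interpreted {0:F3} ms, compiled {1:F3} ms; " +
            "{2} matches: interpreted {3:F3} ms, compiled {4:F3} ms",
            ctorInterpreted.TotalMilliseconds, ctorCompiled.TotalMilliseconds,
            iterations,
            runInterpreted.TotalMilliseconds, runCompiled.TotalMilliseconds);
    }
}
```

For example, Console.WriteLine(RegexModeTiming.Compare(@"(\d*)bar", "1234bar", 100000)); prints one line of timings. The first construction in a process also pays one-time type-loading costs, so running the comparison more than once gives a fairer picture.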

    Regex precompilation solves many of the problems associated with compiling on the fly, and retains all of the performance benefits. Precompilation means you do all of the work of parsing and generating MSIL when you compile your app, ending up with a custom class derived from Regex. Run-time performance of this technique is identical to what you get when you compile on the fly.

    There is, of course, a trade-off: the whole meaning of precompiled is that the Regex is put into its own assembly before you run your application. You therefore can't have the code for constructing the Regex inline, and you'll have to write other code (a tool perhaps) to construct your precompiled Regex ahead of time. You can do this with Regex.CompileToAssembly:

    Dim rci As RegexCompilationInfo = New RegexCompilationInfo( _
    "abc*", RegexOptions.None, "standard", "Sample.Regex", True)
    Regex.CompileToAssembly( _
    New RegexCompilationInfo() {rci}, New AssemblyName("foo.dll"))

    Once in its own assembly, your calling code can reference this Regex directly. There's the startup hit of loading the assembly, but the performance benefits of not having to interpret or compile the expression at run time are preserved. This option has an additional benefit for performance. Your startup time reduces to loading and JITing your class, which should be comparable to the startup cost of interpreted mode, making this option the best of both worlds.

    Overall, I strongly suggest you consider precompilation of Regexes, particularly for expressions you use a lot, and from multiple applications or assemblies.



  • footballism

    BTW, why does your regex need to change at each call?

    I'm pretty curious about what you're going to do with that sort of regex. Could you please post it here?

    Sheva

  • footballism

    You should probably read the whole article, because this awesome article covers other aspects of the BCL that every developer should know about:
    Base Class Library Performance Tips and Tricks

    Sheva

  • Blue Ink

    Hey, I didn't know we can now really get rid of assemblies! Great! Unfortunately, the bottom line of the article seems to be: "yes, now you could, but you'd still better not". Too bad.

    Thanks a lot, footballism.
    --m

    Edit: I didn't see your question.
    The task at hand consists of updating cross references in a document base whenever one of the documents is changed.
    Each document describes a sub-assembly (in the mechanical sense), each composed of several parts. In certain sections of the document structure, I need to find the part number and update some info.
    90% of the entire expression deals with the document structure, so it never changes. The only part that changes each time is the part number affected.

    If you really want, I can post the expression, but it would be worthless without a description of the document structure.

    --m
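
One possible way around the changing portion, sketched here under heavy assumptions: keep a single compiled expression in which the part number is a capture group, and compare the captured value in code instead of baking it into the pattern. The real document structure was not posted, so the "part: ...; info: ..." line format and the names PartReferenceFinder, PartLine and FindInfo below are purely hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch: the fixed structural pattern is compiled once and reused;
// the part number is captured instead of being substituted into the pattern.
public static class PartReferenceFinder
{
    // Assumed line format (not from the thread): "part: <number>; info: <text>"
    private static readonly Regex PartLine = new Regex(
        @"^part:[ \t]*(?<part>[A-Z0-9-]+);[ \t]*info:[ \t]*(?<info>.*)$",
        RegexOptions.Compiled | RegexOptions.Multiline);

    // Returns the info text of every line whose captured part number equals partNumber.
    public static List<string> FindInfo(string document, string partNumber)
    {
        var results = new List<string>();
        foreach (Match m in PartLine.Matches(document))
        {
            if (m.Groups["part"].Value == partNumber)
            {
                // With Multiline, '$' matches before '\n', so a '\r' may be captured.
                results.Add(m.Groups["info"].Value.TrimEnd('\r'));
            }
        }
        return results;
    }
}
```

The trade-off is that the engine visits every part entry and the filtering happens afterwards, which may or may not be cheaper than rebuilding an interpreted expression per call. It only applies if the variable portion can be expressed as a group.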

  • Blue Ink

    Great article. Thanks
    --m

  • dzCepheus

    As far as #1 goes, the only way to unload an assembly from memory is to unload the entire appdomain, which means that if you want the program to continue running, you'll need to run the regex code in a separate appdomain. I haven't messed with this sort of thing in quite a while, so I wouldn't be the best person to show you how to do this.
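
For reference, the separate-appdomain approach could look roughly like the sketch below. It assumes the .NET Framework, where AppDomain.CreateDomain is supported (it is not on .NET Core and later). The names IsolatedRegexSketch, MatchInDomain and IsMatchIsolated, and the "RegexSandbox" domain name, are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch (.NET Framework only): build and use a compiled Regex in a
// secondary application domain, then unload that domain to release its code.
public static class IsolatedRegexSketch
{
    // Runs inside the secondary domain; inputs and result travel via domain data slots.
    private static void MatchInDomain()
    {
        AppDomain current = AppDomain.CurrentDomain;
        string pattern = (string)current.GetData("pattern");
        string input = (string)current.GetData("input");
        Regex regex = new Regex(pattern, RegexOptions.Compiled);
        current.SetData("result", regex.IsMatch(input));
    }

    // Creates the domain, runs one match there, and always unloads the domain.
    public static bool IsMatchIsolated(string pattern, string input)
    {
        AppDomain domain = AppDomain.CreateDomain("RegexSandbox");
        try
        {
            domain.SetData("pattern", pattern);
            domain.SetData("input", input);
            domain.DoCallBack(MatchInDomain);
            return (bool)domain.GetData("result");
        }
        finally
        {
            AppDomain.Unload(domain);
        }
    }
}
```

Creating and unloading a domain per call is expensive, so in practice one would batch many matches per domain. Whether that beats simply using an interpreted expression would need measuring.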
