Coffeehouse Thread

22 posts

Has anyone considered writing an MSIL interpreter in C#?

  • BitFlipper

    Of course the first question would be "why?", but I say why not! Or put another way, it would be a good way to learn how IL works and also learn how to write an IL interpreter. And I'm not talking about a JITter, I'm talking about purely interpreting the IL on the fly.

    One thing I want to learn is what patterns are available for an interpreter, and which would give the best performance. I was looking at the .Net Micro Framework source code, and it is purely interpreted (the JITter is disabled). It seems to me that their approach is what causes managed code to run 100 to 1000 times slower than equivalent native code, since each instruction is dispatched through a giant switch statement, with a huge amount of overhead per instruction. My idea is to have an array of 256 delegates, each of which executes the corresponding instruction. For 2- or 4-byte instructions, the initial handler would then use a second delegate table that ultimately maps to the correct call.

    The advantage of such an approach is that there is no switch statement. I have no idea whether that would actually be faster, but it seems that it should be.
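
    To make it concrete, here is roughly the shape I have in mind. This is purely a sketch - Interpreter, Run and the handler names are made up, and I'm ignoring the inline operands (branch targets, tokens, etc.) that follow many opcodes - but it shows the two-level delegate table:

    using System;

    class Interpreter
    {
        private readonly Action[] primary = new Action[256];   // one handler per opcode byte
        private readonly Action[] extended = new Action[256];  // handlers for 0xFE-prefixed opcodes

        private readonly byte[] il;   // IL stream of the method being interpreted
        private int ip;               // instruction pointer

        public Interpreter(byte[] ilStream)
        {
            il = ilStream;

            // Fill every slot so an unimplemented opcode fails loudly.
            for (int i = 0; i < 256; i++)
            {
                primary[i] = () => { throw new NotSupportedException("unimplemented opcode"); };
                extended[i] = () => { throw new NotSupportedException("unimplemented opcode"); };
            }

            primary[0x00] = () => { };          // nop
            primary[0x58] = ExecuteAdd;         // add
            primary[0xFE] = ExecuteExtended;    // prefix byte for two-byte opcodes
        }

        private void ExecuteExtended()
        {
            extended[il[ip++]]();               // second byte indexes the second table
        }

        private void ExecuteAdd()
        {
            // pop two values off the evaluation stack, push the sum (omitted here)
        }

        public void Run()
        {
            while (ip < il.Length)
                primary[il[ip++]]();            // fetch a byte, call its delegate - no switch
        }
    }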

    Has anybody tried this before, or know of a project like this? I searched but couldn't find anything similar.

  • evildictaitor

    So a similar project would be BOCHS, which emulates the x86 instruction set (including 16-bit real-mode) for the purpose of emulating an OS. It's a different instruction set but it's basically the same idea.

    Their implementation, just like you say, uses a big array of delegates (function pointers, really), so you just read the next byte and call the corresponding offset in the table. Multibyte instructions go via a "thunk" delegate which pulls out the second byte and indexes a secondary delegate table.

    In C/C++ land this is an improvement over a switch statement, but not as much as you might think (the compiler will actually turn large switch statements with nearby cases into jump-tables). In C# land switch statements are not very cleverly optimised, but then delegates have much more overhead too.

    As with most things, the correct way to tell is to profile and test, but there's an old saying that I find becomes more and more true the older I get which is "First make it work. Then make it work fast". It's easy to change a switch statement to a delegate array once you've implemented the first few instructions Smiley

  • bdesmet

    @evildictaitor: Don't underestimate the C# compiler wrt the switch statement. A few examples:

    • Switches with a low number of cases can become if statements.
    • Switches with reasonably contiguous ranges can be turned into the switch instruction. Holes can be patched with a jump target to the default case.
    • Switches with a non-zero case can get rebased using an add or sub instruction in order to use the switch instruction again.
    • Switches with sparse case values can be turned into nested/hierarchical switch instructions, using any of the above techniques applied.
    • Switches on string cases can become if statements, but if there are more than a threshold number (currently 6 IIRC), a Dictionary<string, int> is lazily constructed, using the int values for switch instruction use.

    Just a little example:

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Bar(int i, out int j)
    {
        switch (i)
        {
            case 17:
                j = 1;
                break;
            case 19:
                j = 2;
                break;
            case 20:
                j = 3;
                break;
            case 25:
                j = 4;
                break;
            default:
                j = 0;
                break;
        }
    }

    The code above was compiled without /o+, so you'll see redundant nop instructions and excessive st/ld pairs in the IL. Using SOS, we dump the IL and the generated assembler:

    0:004> !name2ee switch.exe Program.Bar
    Module:      000007ff000330d8
    Assembly:    switch.exe
    Token:       0000000006000002
    MethodDesc:  000007ff000340e0
    Name:        Program.Bar(Int32, Int32 ByRef)
    JITTED Code Address: 000007ff001501d0
    0:004> !dumpil 000007ff000340e0
    ilAddr = 000000000137207c
    IL_0000: nop 
    IL_0001: ldarg.0 
    IL_0002: stloc.0 
    IL_0003: ldloc.0 
    IL_0004: ldc.i4.s 17
    IL_0006: sub 
    IL_0007: switch (IL_0023, IL_0037, IL_0028, IL_002d)
    IL_001c: ldloc.0 
    IL_001d: ldc.i4.s 25
    IL_001f: beq.s IL_0032
    IL_0021: br.s IL_0037
    IL_0023: ldarg.1 
    IL_0024: ldc.i4.1 
    IL_0025: stind.i4 
    IL_0026: br.s IL_003c
    IL_0028: ldarg.1 
    IL_0029: ldc.i4.2 
    IL_002a: stind.i4 
    IL_002b: br.s IL_003c
    IL_002d: ldarg.1 
    IL_002e: ldc.i4.3 
    IL_002f: stind.i4 
    IL_0030: br.s IL_003c
    IL_0032: ldarg.1 
    IL_0033: ldc.i4.4 
    IL_0034: stind.i4 
    IL_0035: br.s IL_003c
    IL_0037: ldarg.1 
    IL_0038: ldc.i4.0 
    IL_0039: stind.i4 
    IL_003a: br.s IL_003c
    IL_003c: ret 

    Notice the sub instruction used to rebase the switch, and how the missing case 18 is mapped to the default case by a jump target to IL_0037 in the switch instruction's table. Case 25, which lies further away from the switch range encountered here, is handled by fall-through after the switch instruction and a jump to IL_0032. If you had another cluster of cases close to 25, you'd see another switch instruction being born.

    I'll leave it as a good SOS exercise for the reader to perform a !U on the JITTED code address for different platform architectures and unravel the assembler generated...

    0:004> !u 000007ff001501d0
    Normal JIT generated code
    Program.Bar(Int32, Int32 ByRef)
    Begin 000007ff001501d0, size 62
    ...

    Also don't forget order of cases in a switch statement doesn't matter, so the techniques described above can be applied even if cases don't appear in a sorted lexical order.

  • CreamFilling512

    Most compilers turn switch statements into jump tables.  So it's just one indirect jump in the best case.

  • BitFlipper

    Well ultimately I would want to port it to C++ once I get it to work in C# (I prefer doing my experimentation in C#). And even beyond that, I would like to play around with replacing the .Net Micro Framework's interpreter with my own if I ever get it to work.

    I looked at the MF's code and it looks like it is painfully unoptimized. It is not just the giant switch statement I mentioned. If you follow the code path of a single instruction, there are many places where one could imagine speeding things up. I like playing with the Netduino microcontroller but managed code just runs so unbelievably slow on it. One solution being thrown around is to pre-compile some methods into native ARM code before deploying, but that has the big disadvantage that you lose multithreading support while it is executing native code. I think a better solution is to speed up the interpreter so that all of the managed code runs faster.

    Ultimately such a project is probably too big for one person, but worst case I might learn some interesting stuff about virtual machines and MSIL, so I'm still ahead.

    BTW the compiler used to compile .Net MF supposedly costs $6000, so it is out of the question for someone doing this as a side project. Apparently one can use GCC but it produces larger binaries which is not ideal on a microcontroller, plus, who in their right mind would want to debug using GCC/GDB? If it ever gets to that, I remember there was a VS plugin that allows development and debugging right in VS, but using GCC/GDB as a backend. I might need to look into that. Anyone used that, or know something about it?

    EDIT: This is the plugin I was talking about ($89).

  • evildictaitor

    @BitFlipper: Be careful when you say things like "painfully unoptimised". I've seen lots of cases where a developer being smart and "optimising" caused it to either:

    a) Break entirely in certain circumstances requiring a senior developer to revert the change
    b) Introduce a security hole which either comes out (hopefully) during review or occasionally after release or
    c) Just make the program outright slower, harder to maintain and more difficult to port.

    The rule I now impose both on myself and on my team is that code should always be easy to read and never do anything "clever". You can improve speed later when you find a bottleneck, but frankly a correct, bug-free and non-hotpath bubble-sort is always better than a shoddily written quick-sort.

  • BitFlipper

    @evildictaitor:

    Yes, I know what you are saying, and I agree. But do realize that the MF interpreter runs 100 to 1000 times slower than native code. I find it hard to believe that that is the best you can do with an interpreter. My theory is that when this was developed for SPOT watches, there just wasn't much need for high performance. The operations were so simple that it didn't matter that they took a little while to complete. Now that people are trying to create quad-copters with it, performance becomes a big issue.

    In addition, since on a microcontroller you will only be running code you deployed there yourself (you can actually set a security bit to disable any future deployments to it), I think one can simplify the VM when it comes to security checks, etc. I know saying that is probably opening a can of worms, but if the goal is just to use the microcontroller for my own projects, security isn't something I need to worry about too much.

    Of course there would be other complicated parts that need implementing like GC, etc.

  • PerfectPhase

    @BitFlipper: Strikes me that when you only have 60 KB of RAM, on-device JIT is not a good idea. Could you not do something with the Mono AOT compiler to produce a native image before deployment? Portability of the compiled code is really a non-issue.

  • Blue Ink

    @BitFlipper: I suspect one of the sources of the poor performance you are witnessing is that you are not just comparing native code vs interpreted IL; you are also missing out on all the optimizations the JIT is in charge of.

    Another issue is that MSIL is not very interpreter friendly. For instance, while Java bytecode contains different opcodes depending on the type of the operands (e.g. dadd, fadd, iadd), MSIL does not, as it relies on the fact that the JIT can always determine the stack state statically at any given point and emit the correct sequence of machine instructions.

    Unfortunately, a pure interpreter cannot do that, which means it must keep track of the stack contents somehow and that's expensive both in terms of memory and CPU cycles.
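
    Just to illustrate that last point, the C# compiler emits the very same add opcode for both of the methods below (roughly ldarg.0, ldarg.1, add, ret when optimized); only the statically known stack state tells you whether an integer or a floating-point addition is meant, which is exactly the information a pure interpreter has to reconstruct at runtime:

    static int AddInt32(int a, int b)
    {
        return a + b;
    }

    static double AddDouble(double a, double b)
    {
        return a + b;
    }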

    So, maybe you can shave a few cycles here and there in the interpreter, but I suspect you won't get significant performance gains that way; at least not the kind of improvements you couldn't get by upgrading your hardware.

    I would probably try either to translate MSIL to a different bytecode (trading code size for performance) or, better, to compile chunks of code to native. This doesn't necessarily rule out cooperative multithreading: you could always inject code that keeps track of some sort of timeslice and traps to the VM when it runs out.
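
    To sketch what I mean by the timeslice idea (all names here are made up, and this is not how the MF does it): the IL-to-native compiler would drop a call like this at every loop back-edge, so a thread running natively compiled code still hands control back cooperatively once its budget runs out.

    static class TimesliceGuard
    {
        private const int SliceLength = 1000;    // arbitrary budget of back-edges per slice
        private static int remaining = SliceLength;

        // Injected by the (hypothetical) compiler at every loop back-edge.
        public static void BackEdge()
        {
            if (--remaining <= 0)
            {
                remaining = SliceLength;
                YieldToVm();
            }
        }

        // Placeholder for whatever mechanism hands control back to the VM scheduler.
        private static void YieldToVm()
        {
            System.Threading.Thread.Yield();
        }
    }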

  • W3bbo

    Blue Ink wrote:

    I would probably try either to translate MSIL to a different bytecode (trading code size for performance) or, better, to compile chunks of code to native.

    It's off-topic, but I'd like to chime in that J#'s runtime came with a Java bytecode loader which converted it to CIL. I wonder how that was implemented.

  • Charles

    The RiSE group's Spur CIL compiler is rather capable in this context. It even includes (I think, at this point) an embedded theorem prover (Z3) that, among other things, is helpful at maximizing solution efficiency and effectiveness for complex algorithmic problems.

    Go Wolfram et al! Go RiSE!

    C

  • BitFlipper

    PerfectPhase wrote:

    @BitFlipper: Strikes me that when you only have 60 KB of RAM, on-device JIT is not a good idea. Could you not do something with the Mono AOT compiler to produce a native image before deployment? Portability of the compiled code is really a non-issue.

    Mono AOT was actually one of the first things I looked at. The problem with it is that I can't find a lot of documentation about it (it is open source of course so I can always dig into the code), so I could not figure out whether you can specify a target CPU in the command line arguments. It looks to me as if it just compiles to the underlying architecture, so it won't work as is for an embedded ARM processor.

    My understanding is that MonoTouch, which is used to create pure native versions of .Net applications for the iPhone, is a modified version of AOT (Apple forbids any interpreted code). But it isn't open source as far as I can tell so I won't learn much there. But yes, something like MonoTouch is exactly what is needed in this case. In the end I decided that trying to build my own interpreter was more interesting, even if the chance for something successful was probably less than if I tried to modify the Mono AOT code.

     

    @Blue Ink

    I see what you are saying about MSIL not being very interpreter friendly. Your example of the multiply operation won't be too hard to solve, though. For instance, for operations that push values onto the stack, the operation can store flags in the thread class that indicate the types of the last two values that were pushed. Then if you come across a multiply, or similar operation, you can use those flags to determine which types they are, and hence how to do the operation.

    EDIT: Maybe a better solution is to store metadata with each pushed value, so that it is easy to determine how to do the operations on them.
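
    Something along these lines is what I'm picturing (again just a sketch with made-up names; each slot carries a type tag alongside the value):

    enum SlotKind { Int32, Int64, Float64, ObjectRef }

    struct StackSlot
    {
        public SlotKind Kind;   // metadata carried along with every pushed value
        public long I;          // integer payload (int32/int64/native int)
        public double R;        // floating-point payload
        public object O;        // object references
    }

    static class Ops
    {
        // A single untyped "add" opcode can then dispatch on the recorded kinds.
        public static StackSlot Add(StackSlot a, StackSlot b)
        {
            switch (a.Kind)
            {
                case SlotKind.Int32:
                    return new StackSlot { Kind = SlotKind.Int32, I = (int)a.I + (int)b.I };
                case SlotKind.Int64:
                    return new StackSlot { Kind = SlotKind.Int64, I = a.I + b.I };
                case SlotKind.Float64:
                    return new StackSlot { Kind = SlotKind.Float64, R = a.R + b.R };
                default:
                    throw new System.NotSupportedException();
            }
        }
    }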

    But I am sure there are other instructions that would be more problematic for an interpreter, so you have a good point. Remember though that the bar is pretty low - it just needs to be less than 100 times slower than the native equivalent. And specifically I'm talking about a C++ version of the interpreter, not a managed version; I'm only starting with a C# version because I prefer it over C++. I guess without really testing it, it would be hard to know what gains you can have.

  • Blue Ink

    @BitFlipper: I didn't claim it was difficult to solve, just that it was expensive. As you correctly deduced you need to keep track of the type of every value that gets pushed onto the stack, either using a separate stack or by "decorating" each stack slot (memory alignment might make this quite wasteful). Either way you are consuming more precious RAM and more CPU cycles. Might not seem much, but once you consider the sheer number of stack operations involved you see how this piles up.

    A second interesting point is that MSIL essentially describes a pure stack machine (as in "without registers"). That was a good choice, as it allows high level compilers to generate code that is CPU agnostic, and because it yields compact code. The obvious disadvantage of stack machines is that they require roughly twice as many operations as the equivalent register machine. On the desktop this is ok as the JIT is CPU-specific and knows how to handle registers, but an interpreter will be stuck with the inefficiency.

    Just to give you a sense of what I mean: as you probably know, in order to execute a simple "ADD" instruction you need to:

    pop the first operand from the stack and store it into a register
    pop the second operand from the stack and store it into another register
    perform the addition
    push the result onto the stack

    Seems harmless, but once you consider that executing the next operation will most likely start with "pop the first operand into a register", it screams bloody murder. Unfortunately, an interpreter cannot predict what the next operation will do, so it has to go through the dance of pushing and popping values uselessly.
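
    In interpreter terms (handler names invented for the example), even a statement as simple as j = i + k turns into four handler calls, and the value that Add pushes is popped straight back off by Stloc - where the JIT would simply add two registers into a third:

    class EvalStack
    {
        private readonly int[] stack = new int[64];
        private readonly int[] locals = new int[16];
        private int top;

        private void Push(int v) { stack[top++] = v; }
        private int Pop() { return stack[--top]; }

        private void Ldloc(int index) { Push(locals[index]); }
        private void Add() { int b = Pop(); int a = Pop(); Push(a + b); }
        private void Stloc(int index) { locals[index] = Pop(); }

        // "j = i + k"  =>  ldloc i, ldloc k, add, stloc j
        public void Demo() { Ldloc(1); Ldloc(2); Add(); Stloc(0); }
    }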

    There are several other concerns (field access, reevaluation of the "this" pointer, parameter passing etc.) and optimizations that cannot be performed on a stack machine, but I'll leave them for another time.

    These are shortcomings of the whole idea of interpreting MSIL, not flaws in the implementation of the MF interpreter which I'm assuming to be quite good (MS has far more brilliant developers than it's generally credited for). So, if you are in for big performance improvements, I think it would be more productive to try and change the whole game, not trying to shave off cycles from the interpreter. The only issue I can foresee is that you may end up trading code size for RAM and performance; while this is generally ok as RAM and CPU muscle are scarcer than Flash space, it might be an interesting challenge to fit all the libraries on your device. You may have to go for a hybrid solution.

    At any rate, even considering all the inefficiencies I listed, I still cannot figure out how you might get the 1:1000 figure you mention. What are you comparing the performance to, exactly?

  • BitFlipper

    Blue Ink wrote:

    @BitFlipper:

    ...

    At any rate, even considering all the inefficiencies I listed, I still cannot figure out how you might get the 1:1000 figure you mention. What are you comparing the performance to, exactly?

    You make some interesting points, and I'm sure everything you say will be an issue to some degree. I think it would still be an interesting learning experience, if nothing else. Since I'm at the point in the project right now where I'm parsing through the PE file to get at all the managed tables etc, I'm already learning things I didn't know before.

    As for your comment about why .Net MF is 1000 times slower, I base my comment on the fact that I did a quick performance test for someone (they wanted to buy a Netduino but were wondering about performance). What I did was create a simple loop that toggles a pin on the Netduino. It looked something like this:

    var outPin = new OutputPort(Pins.GPIO_PIN_5, false);

    while (true)
    {
        outPin.Write(true);
        outPin.Write(false);
    }

    I then measured the frequency at the pin and it was roughly 9kHz. At 48MHz, that is 5333 machine cycles for one loop. I was told by others that a similar loop in C++ would easily reach 3-4MHz or something in that range. That is why I'm saying it appears to be 100-1000 times slower. See this thread for a discussion about this.

    BTW, it is estimated that on the Netduino's ARM processor, one instruction takes roughly 1.5 machine cycles. So that means it takes ~3555 instructions to perform one loop. So I'm wondering, would an optimized interpreter still need 3555 machine instructions to execute just one loop? It really seems excessive.

    What do you think Blue Ink, since you seem to have a good handle on this stuff? Does 3555 instructions per loop seem ballpark for an optimized interpreter?

    [EDIT: What is UP with C9's code formatting? It used to be you get double lines for everything (unless you used Shift+Enter), and now I get NO newlines, no matter how hard I try. Everything is on the same line...]

    [EDIT EDIT: I just removed the code block altogether and put it inside a quote block. It insisted on putting everything on the same line, no matter how many newlines I put between lines]

    EDIT EDIT EDIT: Of course it depends on what goes on inside OutputPort.Write. Well, ILSpy tells me the method looks like this:

    .method public hidebysig instance void Write (bool state) cil managed internalcall
    {
    } // end of method OutputPort::Write

    ...so it is essentially a straight call into native code, but I don't know what it does in that function. Must be very similar to what you would be doing if you wrote the code purely in C++, I assume.

  • Bass

    You might want to actually test that with a C++ version instead of relying on random hearsay. Just a thought.

  • Blue Ink

    @BitFlipper: actually, I asked a similar question on this forum and you kindly provided that figure, thanks again for that.

    While that was more than good enough for my purposes, I don't think that number can be extended to the general case.

    It's impossible to answer your question correctly just comparing small chunks of code... they are in no way representative of a real scenario. Anyway, just to get a general idea, let's look at the issue a little closer... I already know that I'm going to be as inaccurate as it gets, bear with me.

    Let's assume we need to flip the first pin of some IO port. The fastest native code loop would be composed of three instructions, something like:

    loop:
    move GPIOAddress, 0x1
    move GPIOAddress, 0x0
    jump loop

    Most ARM instructions execute in just one cycle (on average) except that jumps flush the three-stage pipeline. This means our little program takes five cycles to execute each loop, which gets us a 9.6MHz frequency on a 48MHz processor. Awful duty cycle and all that, but that's fine for now.

    The IL code of the main loop you used would require more than just three instructions, I'd expect something like this:

    ... // initialization skipped as it is just a one-time tax

    IL_0001 ldloc.0 // load the instance of the IO pin class
    IL_0002 ldc.i4.1 // push "true" on the stack
    IL_0003 call IO.Write // call the native method
    IL_0004 ldloc.0 // load the instance of the IO pin class
    IL_0005 ldc.i4.0 // push "false" on the stack
    IL_0006 call IO.Write // call the native method
    IL_0007 br.s IL_0001

    That's seven IL instructions. The code expansion is partially due to the use of a stack machine, but the main difference is that we used an OO approach, where the pin is represented by an instance of some IO class. The implementation of the Write method would require something like this:

    save registers
    retrieve the bool argument from the stack
    retrieve the "this" pointer from the stack
    get the address of the port this pin instance is linked to
    get the bit mask for this pin
    read the current status of the port
    if the value is true, OR the bit mask with the port status
    otherwise, AND it with the complement
    write back the new status
    restore registers
    return

    I don't have the ARM specs on hand now, but let's assume that all instructions take just one cycle... that's 11 instructions, the last of which flushes the pipeline (so it ends up costing 3 cycles).

    Let's make a little thought experiment here... let's assume that each IL instruction can be executed as fast as an ARM opcode. That would take our code up to some (theoretical) 39 cycles which would yield a frequency of 1.23 MHz. That's just 128 times faster than the 9.6kHz you reported for managed code... even excluding the native method (that would stay native), this would mean that the interpreter is using less than 400 cycles on average per IL instruction. That's not too bad all considered...

    (sorry for the long rambling... it's just a subject I happen to be quite fond of)

  • BitFlipper

    Bass wrote:

    You might want to actually test that with a C++ version instead of relying on random hearsay. Just a thought.

    You do realize that this isn't a small project that one can quickly test, right? I would love to test the theory, but realistically it will be weeks at the earliest before I would be close to having even the simplest interpreter up and running. So unfortunately all we have until then is "hearsay".

  • BitFlipper

    Blue Ink wrote:

    @BitFlipper:

    ...

    (sorry for the long rambling... it's just a subject I happen to be quite fond of)

    It is good hearing your input since you obviously know a lot about this subject.

    This is a big project to tackle. For instance, I'm realizing that the format of the stack needs to be very specific. Since there will be internal calls made to native code that is part of mscorlib.dll etc., those native functions expect the stack to be in some specific format. Here is where the native method for OutputPort::Write is. Notice it does some weird stuff with the stack.

    So I would need to take a very close look at the .Net Micro Framework implementation to see what they do.

    Maybe Bass is right after all. Maybe I should forget about starting with the hard parts first, like reading the raw PE file format, and just write a very simple interpreter that only interprets a small set of instructions (like a simple managed loop, with no allocations, etc). That way I could try to get some sort of ballpark performance figure.
