@BitFlipper: actually, I asked a similar question on this forum and you kindly provided that figure, thanks again for that.
While that was more than good enough for my purposes, I don't think that number can be extended to the general case.
It's impossible to answer your question correctly by just comparing small chunks of code... they are in no way representative of a real scenario. Anyway, just to get a general idea, let's look at the issue a little closer... I already know that I'm going to be as inaccurate as it gets, so bear with me.
Let's assume we need to flip the first pin of some IO port. The fastest native code loop would be composed of three instructions, something like:
loop:
move GPIOAddress, 0x1
move GPIOAddress, 0x0
jump loop
Most ARM instructions execute in just one cycle (on average), except that jumps flush the three-stage pipeline and so cost about three. This means our little program takes five cycles per loop (1 + 1 + 3), which gets us a 9.6 MHz frequency on a 48 MHz processor. Awful duty cycle and all that, but that's fine for now.
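Just to make that arithmetic checkable, here is a minimal C sketch. The names (native_loop_cycles, toggle_frequency_hz) are made up for illustration, and the cycle costs are only the rough assumptions from this post, not measured values.

```c
#include <stdint.h>

/* Assumed costs from the estimate above, not measured values. */
enum {
    CYCLES_PER_STORE = 1, /* each "move" to the GPIO register */
    CYCLES_PER_JUMP  = 3  /* the jump refills the three-stage pipeline */
};

/* Cycles for one full high/low period: store 1, store 0, jump back. */
uint32_t native_loop_cycles(void)
{
    return 2u * CYCLES_PER_STORE + CYCLES_PER_JUMP;
}

/* Output frequency in Hz for a given core clock and cycles per period. */
uint32_t toggle_frequency_hz(uint32_t core_clock_hz, uint32_t cycles_per_period)
{
    return core_clock_hz / cycles_per_period;
}
```

Calling toggle_frequency_hz(48000000u, native_loop_cycles()) returns 9600000, i.e. the 9.6 MHz quoted above.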
The IL code of the main loop you used would require more than just three instructions, I'd expect something like this:
... // initialization skipped as it is just a one-time tax
IL_0001 ldloc.0 // load the instance of the IO pin class
IL_0002 ldc.i4.1 // push "true" on the stack
IL_0003 call IO.Write // call the native method
IL_0004 ldloc.0 // load the instance of the IO pin class
IL_0005 ldc.i4.0 // push "false" on the stack
IL_0006 call IO.Write // call the native method
IL_0007 br.s IL_0001
That's seven IL instructions. The code expansion is partially due to the use of a stack machine, but the main difference is that we used an OO approach, where the pin is represented by an instance of some IO class. The implementation of the Write method would require something like this:
retrieve the bool argument from the stack
retrieve the "this" pointer from the stack
get the address of the port this pin instance is linked to
get the bit mask for this pin
read the current status of the port
if the value is true, OR the bit mask with the port status
otherwise, AND it with the complement
write back the new status
I don't have the ARM specs on hand now, but let's assume that all instructions take just one cycle and that the steps above come to 11 instructions, the last of which flushes the pipeline (so it ends up costing 3 cycles). That's 13 cycles per call.
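For what it's worth, the steps listed above could be sketched in C roughly like this. It is not the actual firmware code: the io_pin struct and io_pin_write are hypothetical names, and I'm assuming a simple memory-mapped 32-bit data register per port.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pin object: the port register it lives on, and its bit. */
typedef struct {
    volatile uint32_t *port; /* address of the port's data register */
    uint32_t mask;           /* single bit selecting this pin */
} io_pin;

/* Read-modify-write of one pin, following the steps listed above. */
void io_pin_write(const io_pin *pin, bool value)
{
    volatile uint32_t *port = pin->port; /* address of the port */
    uint32_t mask = pin->mask;           /* bit mask for this pin */
    uint32_t status = *port;             /* current status of the port */

    if (value) {
        status |= mask;   /* set the pin */
    } else {
        status &= ~mask;  /* clear the pin, leave the others alone */
    }
    *port = status;       /* write back the new status */
}
```

On a PC you can try it by pointing port at an ordinary uint32_t variable instead of a real register. Compare this with the native loop, which got away with a single store per edge because it knew the address and the value up front.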
Let's do a little thought experiment here... let's assume that each IL instruction can be executed as fast as an ARM opcode. That would take our code up to some (theoretical) 39 cycles per loop, which would yield a frequency of 1.23 MHz. That's just 128 times faster than the 9.6 kHz you reported for managed code... even excluding the native method (that would stay native), this would mean that the interpreter is using less than 400 cycles on average per IL instruction. That's not too bad, all things considered...
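Here is one way to redo those numbers in C. The split of the 39 cycles is my assumption about how it adds up: 4 single-cycle loads, 3 cycles for each of the two calls and for the branch, plus 13 cycles for each of the two Write calls. All the names are made up for this sketch.

```c
#include <stdint.h>

/* Figures from the post; the per-instruction costs are assumptions. */
enum {
    CORE_CLOCK_HZ   = 48000000, /* 48 MHz processor */
    MANAGED_LOOP_HZ = 9600,     /* 9.6 kHz reported for managed code */
    WRITE_CYCLES    = 13,       /* 11 instructions, the last one costing 3 */
    IL_INSTRUCTIONS = 7,        /* instructions in the IL loop */
    IL_LOADS        = 4,        /* ldloc.0 / ldc.i4.x, assumed 1 cycle each */
    IL_JUMPS        = 3         /* two calls plus br.s, assumed 3 cycles each */
};

/* Cycles for the IL instructions themselves if each ran like an ARM opcode. */
uint32_t ideal_il_cycles(void)
{
    return IL_LOADS * 1u + IL_JUMPS * 3u;
}

/* One full loop: the IL part plus two native Write calls. */
uint32_t ideal_loop_cycles(void)
{
    return ideal_il_cycles() + 2u * WRITE_CYCLES;
}

/* Cycles per loop actually spent by the managed code. */
uint32_t managed_loop_cycles(void)
{
    return CORE_CLOCK_HZ / MANAGED_LOOP_HZ;
}

/* Managed cycles left once the two native Write calls are subtracted. */
uint32_t interpreter_cycles(void)
{
    return managed_loop_cycles() - 2u * WRITE_CYCLES;
}
```

With these assumptions the ideal loop is 39 cycles, the managed loop is 5000 cycles, and the ratio is about 128. Dividing the 4974 interpreter cycles by the 13 "ideal" IL cycles gives about 383 interpreter cycles per ideal cycle, which is where a figure under 400 comes from; dividing by the 7 IL instructions instead gives about 710 per instruction.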
(sorry for the long rambling... it's just a subject I happen to be quite fond of)