@BitFlipper: I didn't claim it was difficult to solve, just that it was expensive. As you correctly deduced, you need to keep track of the type of every value that gets pushed onto the stack, either using a separate stack or by "decorating" each stack slot (memory alignment might make this quite wasteful). Either way you are consuming more precious RAM and more CPU cycles. It might not seem like much, but once you consider the sheer number of stack operations involved, you see how it piles up.
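To put rough numbers on the alignment remark, here is a minimal sketch using Python's ctypes to mimic C struct layout. The slot layouts, the 1-byte tag, and the 256-slot depth are all made up for illustration; this is not how the MF interpreter actually lays out its stack.

```python
import ctypes

class RawSlot(ctypes.Structure):
    # Hypothetical untyped slot: just a 32-bit value.
    _fields_ = [("value", ctypes.c_uint32)]

class TaggedSlot(ctypes.Structure):
    # Hypothetical "decorated" slot: a 1-byte type tag plus the value.
    # With native alignment the compiler pads the tag so the value
    # stays 4-byte aligned.
    _fields_ = [("tag", ctypes.c_uint8), ("value", ctypes.c_uint32)]

STACK_DEPTH = 256  # arbitrary depth, chosen only for the arithmetic

raw_bytes = ctypes.sizeof(RawSlot) * STACK_DEPTH
tagged_bytes = ctypes.sizeof(TaggedSlot) * STACK_DEPTH
# Alternative: keep the values packed and store the tags in a separate stack.
parallel_bytes = raw_bytes + ctypes.sizeof(ctypes.c_uint8) * STACK_DEPTH

print(ctypes.sizeof(RawSlot), ctypes.sizeof(TaggedSlot))
print(raw_bytes, tagged_bytes, parallel_bytes)
```

On mainstream platforms where 32-bit integers are 4-byte aligned, the decorated slot takes 8 bytes instead of 4, so the padding alone doubles the stack. The separate tag stack is cheaper in RAM but costs a second push or pop on every operation.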
A second interesting point is that MSIL essentially describes a pure stack machine (as in "without registers"). That was a good choice, as it lets high-level compilers generate CPU-agnostic code and yields compact code. The obvious disadvantage of stack machines is that they need roughly twice as many operations as the equivalent register machine to do the same work. On the desktop this is OK, as the JIT is CPU-specific and knows how to handle registers, but an interpreter is stuck with the inefficiency.
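A toy illustration of the operation count, assuming a made-up expression tree format and made-up instruction names (only "ldloc" is borrowed from MSIL). It compiles the same expression for a stack machine and for a three-address register machine and counts the instructions.

```python
# Expression tree: (op, left, right) for an operation, or a str naming a local.

def compile_stack(node, out):
    """Emit stack code: operands are pushed, operators take no arguments."""
    if isinstance(node, str):
        out.append(("ldloc", node))
    else:
        op, left, right = node
        compile_stack(left, out)
        compile_stack(right, out)
        out.append((op,))
    return out

def compile_register(node, out):
    """Emit three-address code; return the name that holds the result."""
    if isinstance(node, str):
        return node  # assumption: locals already live in registers
    op, left, right = node
    a = compile_register(left, out)
    b = compile_register(right, out)
    dst = f"t{len(out)}"  # fresh temporary, never reused in this sketch
    out.append((op, dst, a, b))
    return dst

# a*b + c*d
expr = ("add", ("mul", "a", "b"), ("mul", "c", "d"))

stack_code = compile_stack(expr, [])
register_code = []
compile_register(expr, register_code)

print(len(stack_code), len(register_code))
```

For this expression the stack version needs 7 instructions against 3 for the register version. The sketch flatters the register machine by assuming every local already sits in a register, so take the exact ratio with a grain of salt; it depends heavily on the code being compiled.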
Just to give you a sense of what I mean: as you probably know, in order to execute a simple "ADD" instruction you need to:
1. pop the first operand from the stack and store it into a register
2. pop the second operand from the stack and store it into another register
3. perform the addition
4. push the result onto the stack
Seems harmless, but once you consider that the next operation will most likely start with "pop the first operand into a register", it screams bloody murder. Unfortunately, an interpreter cannot predict what the next operation will do, so it has to go through the whole dance, pushing and popping values uselessly.
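Here is a minimal sketch of that dance: a toy interpreter loop that counts every stack access while evaluating a + b + c. The bytecode format and the run function are hypothetical and only loosely modeled on MSIL; the real MF interpreter is certainly more involved.

```python
def run(program, local_vars):
    """Interpret a tiny stack bytecode, counting every push and pop."""
    stack = []
    pushes = pops = 0
    for instr in program:
        op = instr[0]
        if op == "ldloc":
            stack.append(local_vars[instr[1]])
            pushes += 1
        elif op == "add":
            rhs = stack.pop()  # pop one operand into a "register"
            lhs = stack.pop()  # pop the other operand
            pops += 2
            stack.append(lhs + rhs)  # push the result back
            pushes += 1
        else:
            raise ValueError(f"unknown opcode: {op}")
    # The final result is read from the top of the stack, not counted.
    return stack[-1], pushes, pops

# a + b + c
program = [("ldloc", "a"), ("ldloc", "b"), ("add",), ("ldloc", "c"), ("add",)]
result, pushes, pops = run(program, {"a": 1, "b": 2, "c": 3})
print(result, pushes, pops)
```

That is 5 pushes and 4 pops for two additions. The result of the first "add" is pushed only to be popped again by the second one; a JIT would keep it in a register, but the interpreter handles one opcode at a time and cannot know that.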
There are several other concerns (field access, reevaluation of the "this" pointer, parameter passing etc.) and optimizations that cannot be performed on a stack machine, but I'll leave them for another time.
These are shortcomings of the whole idea of interpreting MSIL, not flaws in the implementation of the MF interpreter, which I'm assuming to be quite good (MS has far more brilliant developers than it's generally credited for). So, if you are after big performance improvements, I think it would be more productive to try and change the whole game rather than shave cycles off the interpreter. The only issue I can foresee is that you may end up trading code size for RAM and performance; while this is generally OK, as RAM and CPU muscle are scarcer than Flash space, it might be an interesting challenge to fit all the libraries on your device. You may have to go for a hybrid solution.
At any rate, even considering all the inefficiencies I listed, I still cannot figure out how you might get the 1:1000 figure you mention. What are you comparing the performance to, exactly?