magicalclick wrote:

Assembly > C++ > C# in raw performance.

I'd add the caveat that:

Optimal assembly > optimal C++ > optimal C# in raw performance.

With very few exceptions, people who write assembly by hand write worse assembly (in terms of both correctness and performance) than what comes out the back of the C++ compiler, and worse assembly than what comes out of the C# JITter.

And people who use __asm in C++ not only make their code non-portable to platforms other than x86 (including ARM and x64), they also cause the compiler to turn off inlining for that function, turn off optimisations for that function, and save and restore all of the registers it heuristically thinks the block touches. It can also cause the compiler to make really suboptimal use of registers in the surrounding code.

Consequently the inline code inside a C++ function:

__asm {
    xor ebx, ebx
    mov [_local], ebx
}

is technically equivalent to _local = 0, but can be thousands of times slower in practice. The compiler now has to save EBX over the block, no longer knows that _local is zero for optimisations later in the function, and can't initialize _local at the same time as it initializes all of the other local variables (using STOSDs rather than MOVs). It also can't inline the function, can't make the function EBP-less, can't shuffle the asm block around to get better store/fetch performance on the processor, can't use SIMD, and can't perform any compile-time checks on the code.
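To make the "no longer knows that _local is zero" point concrete, here is a minimal sketch of what the plain C++ version buys you. The function name scaled_offset and its arithmetic are made up for illustration; the only point is that a plain assignment stays visible to the optimiser, to the extent that the whole function can be evaluated at compile time. An __asm block in the same place would rule that out.

```cpp
#include <cstdint>

// Hypothetical helper, not from the post above: the plain-C++ counterpart
// of zeroing a local with an asm block.
constexpr std::uint32_t scaled_offset(std::uint32_t base) {
    std::uint32_t local = 0;   // the compiler knows this value is zero
    return base + local * 4;   // so this can fold down to just `base`
}

// Compile-time check: only possible because the compiler can see through
// every statement in the function (requires C++14 or later).
static_assert(scaled_offset(10) == 10, "folded at compile time");
```

Whether a given compiler actually folds a non-constexpr version of this depends on the optimisation level, but it is at least allowed to, which is not the case once the store is hidden inside inline assembly.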

Even worse - if you dare to use that code in a __declspec(naked) function without saving and restoring EBX yourself, you might find that EBX is a register that you shouldn't be blindly destroying - and if you do it in kernel mode there's an exploitable error in there too.

It's also much harder to do algorithmic improvement in lower level languages. For example, writing a quicksort in assembler is so difficult that in practice people fall back on algorithms that are easier to implement and less likely to go wrong, but have worse big-O performance. Consequently, two days' work on a handwritten assembler routine is likely to yield slower code than the same two days spent optimising the algorithm in a higher level language.
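As a rough illustration of that trade-off (my own sketch, with made-up function names), compare the sort that is easy to get right by hand with what the higher level language hands you for free. The insertion sort below is the sort of O(n^2) routine that is feasible to hand-code in assembler; the std::sort call is O(n log n) and costs one line.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Easy to implement and hard to get wrong, but O(n^2) in the worst case.
// This is the kind of algorithm people tend to settle for in assembler.
void insertion_sort(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        int key = v[i];
        std::size_t j = i;
        // Shift larger elements one slot right to make room for key.
        while (j > 0 && v[j - 1] > key) {
            v[j] = v[j - 1];
            --j;
        }
        v[j] = key;
    }
}

// O(n log n), already tuned, and a single line to call.
void library_sort(std::vector<int>& v) {
    std::sort(v.begin(), v.end());
}
```

Both produce the same result; the difference only shows up as the input grows, which is exactly where the algorithm starts to matter far more than the instruction selection.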

Moral of the story is that optimal assembler > optimal C++, but for nearly all values of "you", your assembler <<< Microsoft C++ compiler release output of your C++.