A friend of mine wrote... conditionally used MMX, SSE, or SSE2.
The best approach is to have the run-time startup code determine the CPU's capabilities and select the correct routine once, at startup. Write the address of the preferred routine(s) to a function pointer so that later calls go straight to it (yes, there are other routines that benefit from memory move/init optimizations).
Regarding MMX, SSE, SSE2: these would only be used if the CPU supports them. The FPU can also be used to move/init data faster.
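To make the startup-time selection concrete, here is a minimal sketch in C. It assumes GCC or Clang on x86 for the CPU check (`__builtin_cpu_init` and `__builtin_cpu_supports`) and falls back to a plain byte loop everywhere else. The names `init_copy_dispatch`, `fast_copy`, `copy_generic`, and `copy_sse2` are made up for illustration, and the SSE2 loop is deliberately simple, not a tuned routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the CPU check below relies on GCC/Clang builtins on x86. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <emmintrin.h>
#define HAVE_X86_DISPATCH 1
#endif

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Portable fallback: one byte at a time. */
static void *copy_generic(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--) *d++ = *s++;
    return dst;
}

#ifdef HAVE_X86_DISPATCH
/* SSE2 variant: 16 bytes per iteration with unaligned loads/stores. */
__attribute__((target("sse2")))
static void *copy_sse2(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 16) {
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        d += 16; s += 16; n -= 16;
    }
    while (n--) *d++ = *s++;   /* tail bytes */
    return dst;
}
#endif

/* Address of the preferred routine, written once at startup. */
static copy_fn preferred_copy = NULL;

void init_copy_dispatch(void) {
    preferred_copy = copy_generic;
#ifdef HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        preferred_copy = copy_sse2;
#endif
}

/* Callers pay for one indirect call, not a CPU check per call. */
void *fast_copy(void *dst, const void *src, size_t n) {
    assert(preferred_copy != NULL);  /* init_copy_dispatch() must run first */
    return preferred_copy(dst, src, n);
}
```

Call `init_copy_dispatch()` once from the startup path; the same pointer-per-routine pattern would extend to fill/init routines.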
This definitely should be in the managed runtimes (Jxxx, and .NET CLR/CLI).
During a research project I did on this topic, I found that judicious use of PREFETCH can further accelerate these memory optimizations.
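As a rough illustration of the PREFETCH idea (not the code from that project), the sketch below copies in 64-byte chunks and hints the cache about source data a few chunks ahead. It assumes GCC or Clang for `__builtin_prefetch`; on other compilers the hint compiles to nothing. The function name `copy_with_prefetch` and the `CHUNK` and `PREFETCH_AHEAD` values are placeholders, and the right distance depends heavily on the CPU and the access pattern.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: GCC/Clang builtin; rw=0 (read), locality=0 (streaming data). */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Illustrative values only; tune per target. */
enum { CHUNK = 64, PREFETCH_AHEAD = 256 };

void *copy_with_prefetch(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i = 0;

    for (; i + CHUNK <= n; i += CHUNK) {
        /* Hint only while the prefetch address is still inside the source. */
        if (i + PREFETCH_AHEAD < n)
            PREFETCH_READ(s + i + PREFETCH_AHEAD);
        memcpy(d + i, s + i, CHUNK);
    }
    memcpy(d + i, s + i, n - i);   /* remaining tail, possibly 0 bytes */
    return dst;
}
```

Whether this helps at all has to be measured: on small or already-cached buffers the extra instructions can cost more than they save, which is why the use has to be judicious.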