Oh - I forgot to edit a couple of things - the suggested approach adds up to 5 clocks, not 6, which makes it 2.5x slower, not 3x.
Interesting question. I am not as good with perf as I would like, but in a situation like this, we're essentially counting instructions. Here's how it breaks down:
Method 1 (suggested) (11 instructions) 32-bit, not optimized
unsigned char tmp = (unsigned char)i;
009A143E mov al,byte ptr [i]
009A1441 mov byte ptr [tmp],al
if(tmp != i)
009A1444 movzx eax,byte ptr [tmp]
009A1448 cmp eax,dword ptr [i]
009A144B je IntToUchar2+31h (9A1451h)
009A144D xor al,al
009A144F jmp IntToUchar2+3Bh (9A145Bh)
uc = (unsigned char)i;
009A1451 mov eax,dword ptr [uc]
009A1454 mov cl,byte ptr [i]
009A1457 mov byte ptr [eax],cl
009A1459 mov al,1
Method 2 (original) (10 instructions)
if(i < 0 || i > 255)
009A13DE cmp dword ptr [i],0
009A13E2 jl IntToUchar+2Dh (9A13EDh)
009A13E4 cmp dword ptr [i],0FFh
009A13EB jle IntToUchar+31h (9A13F1h)
009A13ED xor al,al
009A13EF jmp IntToUchar+3Bh (9A13FBh)
uc = (unsigned char)i;
009A13F1 mov eax,dword ptr [uc]
009A13F4 mov cl,byte ptr [i]
009A13F7 mov byte ptr [eax],cl
009A13F9 mov al,1
Suggested method in release build:
01041050 0F B6 C8 movzx ecx,al (3 clock)
01041053 3B C8 cmp ecx,eax (1 clock)
01041055 75 03 jne main+5Ah (104105Ah) (1 clock)
Current method in release build:
01041018 3D FF 00 00 00 cmp eax,0FFh (1 clock)
0104101D 77 06 ja main+25h (1041025h) (1 clock)
I also checked this in 64-bit, and it is the same instructions. In the suggested approach, we have an extra movzx, and it consumes a register, which would tend to cause surrounding code to optimize less efficiently on x86 (probably negligble on x64, but same effect). Due to the movzx being expensive (relative to cmp), the suggested approach is 6 cycles to the current approach's 2 cycles, or 3x worse. If the surrounding code were register constrained, as x86 often is, then you might need to push something on the stack to free a register, which would be more overhead.
To be fair, I chalk this up to pure, blind luck. While I did try to write the code to be as efficient as possible, readability and correctness were of higher importance. It is entirely possible that other operations may not be as efficient as possible. I think I did try to avoid temporary variables where possible, for just this reason, though in some cases it couldn't be avoided, and the effect would often be diluted by how expensive large multiplication and division operations are.
A large chunk of work that has not yet been done is to take all 64 combinations for each and every operation type and see exactly how they optimize, and dink with it to see if it can be better. As an aside, we considered using intrinsics for x64 multiplication, which would have been a lot faster, but decided not to in this release due to schedule constraints.
In addition, unless you happened to be using SafeInt inside of a rendering engine, or something else where a couple of cycles matter, a couple of cycles here and there are not going to add up to anything, considering that an allocation could cost several 10's of thousands of cycles, loading a COM object is much, much bigger, and so on.
Thanks for the question - I got to learn a bit about assembly that I didn't know before today.