I always find it interesting to listen to people who focus on improving performance instead of waiting for more CPU power! I remember the days of the C64: programming in machine code, I was doing a sideborder DYXPP (Different Y and X Pixel Position, a big thing in the old days).

I remember the main "core" of the routine could have been done in around 50 looping lines, but that just took too much processing out of the huge C64 processor to maintain the framerate. Instead, I had to manually program all the DYXPP manipulations for all characters. I still remember breaking the 8192-line barrier in the assembler... Man, it was a lot of lines, but hell, it ran much faster.

By the way, you can see the result in the old demo archives that exist for the C64 all over the web; look in the PAL part for demos from Cross and HTBFD (Hulter til bulter files demo). Some nostalgia there, hehe.