@evildictaitor: I found this other document that talks about the architecture of Singularity, including background information on how they came up with those cycle calculations.
One of the problems I usually have with benchmarks like this is that what someone thinks of as a valid comparison often isn't. For example, in this case they've chosen ProcessService.GetCyclesPerSecond() in Singularity as their benchmark for a fast "syscall", but chosen SetFilePointer for the Windows side of the comparison.
The problem is that unless you actually understand what Windows is doing, you'd probably not realize quite how unfair that comparison really is, particularly if you own ProcessService.GetCyclesPerSecond() and can tweak it for the benchmark.
Let's look at what SetFilePointer needs to do in contrast, taking a 32-bit process running on 64-bit Win7, simply because that's the default C++ project that Visual Studio gives me on my machine.
So first of all, SetFilePointer doesn't call into the kernel at all to begin with - it runs in usermode inside kernel32.dll. It first checks whether the handle is a console handle, via a comparison, so it can fail quickly. Next, it inspects the move method chosen (the benchmarker doesn't tell us which value they used). If either FILE_CURRENT or FILE_END was chosen, this triggers a call to NtQueryInformationFile in the 32-bit ntdll to get the base position we can add our offset to - the kernel has no concept of relative moves, so we need to do this stage first.
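In other words, the user-mode stub has to turn a relative move into an absolute one before the kernel ever sees it - and for relative moves, that costs an extra query round-trip. A rough, portable sketch of that logic (the `fake_file` struct and the function names here are mine, not Windows'):

```c
#include <stdint.h>

enum move_method { MOVE_BEGIN, MOVE_CURRENT, MOVE_END }; /* ~ FILE_BEGIN etc. */

struct fake_file { int64_t pos; int64_t size; }; /* stand-in for the handle's state */

/* Stand-in for the NtQueryInformationFile round-trip needed to learn
   the current (or end-of-file) position. */
static int64_t query_position(const struct fake_file *f, enum move_method m) {
    return (m == MOVE_END) ? f->size : f->pos;
}

/* The kernel only understands absolute positions, so relative moves
   need the extra query first - a whole extra transition. */
static int64_t set_pointer(struct fake_file *f, int64_t offset, enum move_method m) {
    int64_t base = (m == MOVE_BEGIN) ? 0 : query_position(f, m);
    f->pos = base + offset;
    return f->pos;
}
```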
But wait - we're on a 64-bit kernel, so what actually happens is a call into the Wow64 thunking layer, which switches the processor into 64-bit mode, triggering a natural flush of the pipeline and cache and an expensive reload of the entire processor's GDT, before eventually landing us in wow64.dll.
wow64.dll then runs through a big switch statement to reach Wow64!NtQueryInformationFileImpl, which calls into the kernel via a syscall. That goes through the syscall dispatch table and lands in nt!NtQueryInformationFile, which calls nt!ZwQueryInformationFile. That in turn checks the handle via nt!ObReferenceObjectByHandle to turn it into a kernelmode pointer, and then calls nt!IoQueryInformationFile, which pulls the file pointer out of the file object - though not before recording that all of this happened, because the system collects IO usage statistics for things like Task Manager and entropy information for things like CryptGenRandom.
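The "syscall switch statement" mentioned above is essentially a table of service routines indexed by syscall number. A minimal sketch of that dispatch shape - the numbers, names, and return values here are illustrative, not the real NT service table:

```c
#include <stdint.h>
#include <stddef.h>

typedef int64_t (*service_fn)(int64_t arg);

/* Illustrative service routines - stand-ins for nt!Nt* entry points. */
static int64_t svc_query_info(int64_t arg) { return arg; }     /* ~ NtQueryInformationFile */
static int64_t svc_set_info(int64_t arg)   { return arg * 2; } /* ~ NtSetInformationFile   */

static service_fn service_table[] = { svc_query_info, svc_set_info };

/* ~ the dispatcher: bounds-check, index into the table, call through. */
static int64_t dispatch(size_t syscall_no, int64_t arg) {
    if (syscall_no >= sizeof service_table / sizeof service_table[0])
        return -1; /* ~ STATUS_INVALID_SYSTEM_SERVICE */
    return service_table[syscall_no](arg);
}
```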
The call then returns back down the chain to usermode, handing wow64 a FILE_POSITION_INFORMATION64, which gets mashed into a FILE_POSITION_INFORMATION32 for our 32-bit process. We then do a full processor switch back into 32-bit, with all of the associated cost of doing so.
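That "mashing" is a thunk that repacks the record between the two views of the world. The struct layouts and names below are mine for illustration - the real definitions live in ntdll/wow64:

```c
#include <stdint.h>

/* Illustrative stand-ins for the two views of the position record. */
struct pos_info64 { int64_t position; };          /* the 64-bit side's view   */
struct pos_info32 { int32_t low; int32_t high; }; /* a split 32-bit-word view */

/* ~ the down-thunk: repack the 64-bit record into the 32-bit layout. */
static struct pos_info32 thunk_down(struct pos_info64 in) {
    struct pos_info32 out;
    out.low  = (int32_t)(uint32_t)(in.position & 0xFFFFFFFFu);
    out.high = (int32_t)(in.position >> 32);
    return out;
}

/* ~ the up-thunk: reassemble the 64-bit record from the 32-bit halves. */
static struct pos_info64 thunk_up(struct pos_info32 in) {
    struct pos_info64 out;
    out.position = ((int64_t)in.high << 32) | (uint32_t)in.low;
    return out;
}
```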
Yay, now we're back in 32-bit land, and we've resolved the FILE_CURRENT / FILE_END question, so we do a 64-bit add to apply our offset (we don't have 64-bit registers anymore, so this is more expensive too). Now we need to actually set the file pointer, so we call into ntdll!NtSetInformationFile. That means another processor switch, a wow64 jump table lookup, a mash of our FILE_POSITION_INFORMATION32 back into a FILE_POSITION_INFORMATION64, then a syscall, a syscall table lookup, nt!NtSetInformationFile, nt!ZwSetInformationFile, nt!ObReferenceObjectByHandle, and finally nt!IoSetInformationHandle.
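For the record, that 64-bit add on the 32-bit side has to be synthesized from two 32-bit adds with an explicit carry, roughly like this (a sketch of what the compiler emits, written out as C):

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit words, as a 32-bit process sees it. */
struct u64_halves { uint32_t lo; uint32_t hi; };

/* Add with explicit carry propagation - two ops where x64 needs one. */
static struct u64_halves add64(struct u64_halves a, struct u64_halves b) {
    struct u64_halves r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* carry out of the low word */
    return r;
}
```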
But wait - what if someone else is using the handle?
So what happens then is that we have to lock the file via IopLockFileObject() - a full-blown acquire of a critical section.
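The shape of that cost is a lock acquire and release around the pointer update, because the handle may be shared. In portable terms - a pthread mutex standing in for the kernel's lock, with names of my own invention:

```c
#include <pthread.h>
#include <stdint.h>

/* ~ the file object: position guarded by a lock, since other threads
   or processes may be using the same handle concurrently. */
struct locked_file {
    pthread_mutex_t lock; /* stand-in for the lock IopLockFileObject takes */
    int64_t pos;
};

static int64_t set_position_locked(struct locked_file *f, int64_t pos) {
    pthread_mutex_lock(&f->lock);   /* ~ IopLockFileObject() */
    f->pos = pos;                   /* the actual work is one store */
    pthread_mutex_unlock(&f->lock);
    return pos;
}
```

Note the imbalance: the store itself is a single instruction, and everything around it is overhead.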
Only after all of this jiggery-pokery can we actually go ahead and set the value on the handle and move the file pointer. And then we can only return our success code by unwinding the whole way back out - through wow64, through another processor switch into 32-bit, through kernel32 - until we eventually land back in our tiny C program.
Which feels like a bit of an unfair comparison to me.