Beer28 wrote:
On Linux on x86, user2kernel calls for IO to devices through the kernel are done by calling a CPU interrupt instruction 0x80 with the type of kernel sys function in eax, then the params for the kernel function in ebx-edx like you would do a fastcall from VC++,
except with an INT instruction, not a call/ret to the start addy of the function,
Does it work the same on NT?
XP and 2003 use the SYSENTER/SYSEXIT instructions. IIRC earlier versions of NT used interrupt 0x2e. The user->kernel transitions are isolated in NTDLL.DLL apart from some places where gdi32.dll and user32.dll call into the win32k.sys driver directly.
In fact it appears that the system call instruction might be dynamically generated! NtWriteFile, for example, loads edx with the contents of SharedUserData!SystemCallStub then performs an indirect call to that address. Since this is an Intel P4 system it uses SYSENTER.
The arguments appear to be retrieved from the stack directly, the only arguments passed in registers are the user stack pointer (passed in edx) and the system call to execute (passed in eax).
I would expect x64 and Itanium to pass parameters in registers rather than on the stack, since their calling conventions are register-based.
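To make the register/stack split concrete, here is a rough user-mode simulation of that convention: a service number (what would arrive in eax) indexes a table, and the routine reads its arguments through a pointer to the caller's stack (what would arrive in edx). All the names here (service_table, dispatch_system_call, STATUS_INVALID_SERVICE) are made up for illustration, and the real dispatcher does considerably more, so treat this as a sketch rather than a description of NT's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical service routine type: it receives a pointer to the caller's
   stacked arguments instead of getting them in registers. */
typedef int32_t (*service_fn)(const uint32_t *args);

static int32_t svc_add(const uint32_t *args)    { return (int32_t)(args[0] + args[1]); }
static int32_t svc_negate(const uint32_t *args) { return -(int32_t)args[0]; }

/* Illustrative service table, indexed by the service number. */
static const service_fn service_table[] = { svc_add, svc_negate };
#define SERVICE_COUNT (sizeof service_table / sizeof service_table[0])
#define STATUS_INVALID_SERVICE (-1) /* made-up error code for this sketch */

/* eax: service number; edx: pointer to the arguments on the "user stack". */
static int32_t dispatch_system_call(uint32_t eax, const uint32_t *edx)
{
    if (eax >= SERVICE_COUNT || edx == NULL)
        return STATUS_INVALID_SERVICE;   /* reject unknown service numbers */
    assert(service_table[eax] != NULL);  /* table entries are always populated */
    return service_table[eax](edx);      /* arguments are read from the stack */
}
```

Calling dispatch_system_call(0, stack) with a two-element stack array runs svc_add on those two values; a service number outside the table gets the error code back.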
Beer28 wrote:
Is it always passed through the registers or can you pass stuff with pointers to memory from user2kernel and back without drawing an access violation(like maybe stack address space)?
Once it passes the context switch to kernel mode the page protection is gone at ring 0 right? It's up to the kernel exported function to make sure you're not giving it a bad user space address(passed in the ebx-edx register) in NT also?
Page protections still apply in ring 0. One of the bits in the Page Table Entry is the User/Supervisor bit, which governs whether a page is accessible from user mode or only from supervisor/kernel mode. On the x86, code running in rings 0, 1, or 2 can access both supervisor and user pages; ring 3 can only access user pages (the processor raises a page fault if ring 3 code tries to access a supervisor page).
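The User/Supervisor rule can be written down as a small predicate. This is a simplified sketch: it looks only at the Present bit (bit 0) and the User/Supervisor bit (bit 2) of a 32-bit PTE, and ignores the Read/Write bit, the matching bits in the page directory entry, and anything newer processors add. The function name page_accessible is my own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u /* bit 0: page is mapped */
#define PTE_USER    0x4u /* bit 2: set = user page, clear = supervisor page */

/* Returns true if code at the given privilege level (0..3) may touch the page. */
static bool page_accessible(uint32_t pte, unsigned cpl)
{
    assert(cpl <= 3);
    if (!(pte & PTE_PRESENT))
        return false;              /* not mapped: faults at any ring */
    if (cpl < 3)
        return true;               /* rings 0-2 see supervisor and user pages */
    return (pte & PTE_USER) != 0;  /* ring 3 sees user pages only */
}
```

So page_accessible(PTE_PRESENT, 3) is false (a supervisor page touched from ring 3), while the same PTE with cpl 0 is accessible.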
NT breaks, for each process, the virtual address space into a user region and a system region. The split point is normally at 2GB (first system address is 0x80000000). However, if the system is booted with /3GB that changes to 3GB user, 1GB kernel (first system address 0xC0000000). Finally, XP and 2003 also offer the /USERVA switch, which when combined with /3GB allows the system address start point to be tweaked further.
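Given that split, the first thing kernel code has to ask about a pointer it was handed is whether the whole buffer lies below the first system address. A minimal sketch of such a range check follows, using the two boundaries mentioned above. is_user_range is a hypothetical helper of mine, not a Windows routine, and it ignores details such as any reserved area just below the boundary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SYSTEM_START_DEFAULT 0x80000000u /* 2GB user / 2GB system */
#define SYSTEM_START_3GB     0xC0000000u /* booted with /3GB */

/* True if [addr, addr + length) lies entirely in the user region. */
static bool is_user_range(uint32_t addr, uint32_t length, uint32_t system_start)
{
    assert(system_start != 0);
    /* Compare against the remaining room rather than computing
       addr + length, which could wrap around past 0xFFFFFFFF. */
    return addr < system_start && length <= system_start - addr;
}
```

The address 0x90000000 fails the check with the default split but passes when the boundary is 0xC0000000, and a range that would wrap around the top of the address space is rejected in both cases.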
The system address space is identical across all processes. Because the page tables are the same after the user/kernel transition (a user/kernel transition is not generally termed a context switch - the same thread is running, only now it's using its kernel stack, and it's running at a higher privilege level), the system code can access anything in the user-mode part of the address space that the thread's process can.
Interrupt-handling code can, and will, be called with arbitrary process context - the process of whichever thread was last executing. It can't therefore write directly into a user-mode buffer. Instead it must queue an Asynchronous Procedure Call (APC) to the thread that initiated the I/O. When the APC is dispatched Windows performs a context switch to that thread, so now the correct process page tables are referenced and the operation can go ahead. (I've left out Deferred Procedure Calls [DPCs], which also occur in arbitrary process context).
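The "record it now, write it later in the right context" idea can be modelled in ordinary user-mode C. This toy model is only an analogy: struct thread, queue_apc, run_thread and current_process are invented names, and nothing here corresponds to the actual kernel data structures. The interrupt-time function only queues the work; the buffer is written once the owning thread is the one running.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_APCS 8

struct apc { int *user_buffer; int value; };

struct thread {
    int process_id;              /* process that owns the buffers */
    struct apc queue[MAX_APCS];  /* pending APCs for this thread */
    size_t queued;
};

/* Stands in for "whose page tables are loaded right now". */
static int current_process = -1;

/* Interrupt time: arbitrary process context, so only record the work. */
static int queue_apc(struct thread *t, int *user_buffer, int value)
{
    if (t->queued == MAX_APCS)
        return -1;               /* queue full */
    t->queue[t->queued].user_buffer = user_buffer;
    t->queue[t->queued].value = value;
    t->queued++;
    return 0;
}

/* Switch to the thread, then deliver its APCs in its own process context. */
static void run_thread(struct thread *t)
{
    current_process = t->process_id;
    for (size_t i = 0; i < t->queued; i++) {
        assert(current_process == t->process_id); /* right address space */
        *t->queue[i].user_buffer = t->queue[i].value;
    }
    t->queued = 0;
}
```

After queue_apc the buffer is still untouched; it only receives the value once run_thread has switched to the thread that asked for the I/O.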
There are some threads in the system which don't run in a particular process's context - they're worker threads. Instead they run in pseudo-processes, which in Task Manager (and Process Explorer) are shown as "System Idle Process" and "System". The "System Idle Process" contains one idle thread per processor; it runs only when no other thread is ready, and halts the processor when there's nothing to do. The zero-page thread, which is responsible only for zeroing out free pages, runs in "System" at priority 0, the lowest priority in the system: it does not get dynamic boosts and will never pre-empt any other thread. All other worker threads also run in "System".
The Structured Exception Handling mechanism is also supported in kernel mode; drivers should always wrap accesses to user-mode buffers in __try/__except blocks.
At this point I have to confess I've done no kernel-mode programming. Everything I know comes from "Windows Internals, 4th Edition" (and its predecessor "Inside Windows 2000"), and from OSR's NT Insider.