Beer28 wrote:
I thought most devices handled their own memory, and the OS interaction was doing IN/OUT to ports which the device is registered on through the driver ?
So you're saying some devices have their memory offboard on the ram sticks as virtual and it's shared by the OS?
Very little I/O is now done through explicit IN or OUT instructions - basically only legacy devices use them. These instructions are a quirk of the x86 architecture anyway, and are a lot slower than ordinary memory reads and writes (this is effect rather than cause - they're slower because they're used less frequently). The 68000, PowerPC, MIPS, SPARC, ARM, you name it, don't have IN or OUT. In those architectures, all devices are memory-mapped. That is, their control registers occupy space in the regular memory map.
Any given physical address can map to ROM, RAM, or a device's control registers (or nothing). You can see roughly what's where in your system by opening Device Manager and selecting the Resources By Type view, then expanding the Memory node. You'll see a number of items below 0x00100000 (1MB), then a big block allocated to 'System Board' between that address and some large number. On this system I have 512MB, which is 0x20000000 in hex, so this block ends at 0x1FFFFFFF. Then there's a big gap - on this system - up to 0xF2000000, which is where the graphics card's resources start.
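To make the idea of one address space shared by ROM, RAM and devices concrete, here is a minimal sketch in C that classifies a physical address against a small table. The table is only loosely based on the system described above: the entries below 1MB follow the conventional PC layout and are an assumption, as is the end of the graphics card's range, and all the names (region, classify and so on) are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative region kinds; the names are invented for this sketch. */
typedef enum { REGION_NONE, REGION_RAM, REGION_ROM, REGION_DEVICE } region_kind;

typedef struct {
    uint32_t start;    /* first physical address in the region */
    uint32_t end;      /* last physical address, inclusive */
    region_kind kind;
} region;

/* Simplified map. Only the 512MB RAM limit and the 0xF2000000 start of
   the graphics card come from the post; the rest is assumed. */
static const region memory_map[] = {
    { 0x00000000u, 0x0009FFFFu, REGION_RAM },    /* conventional memory (assumed) */
    { 0x000A0000u, 0x000BFFFFu, REGION_DEVICE }, /* legacy VGA window (assumed) */
    { 0x000C0000u, 0x000FFFFFu, REGION_ROM },    /* option ROMs and BIOS (assumed) */
    { 0x00100000u, 0x1FFFFFFFu, REGION_RAM },    /* 'System Board' block, up to 512MB */
    { 0xF2000000u, 0xF2FFFFFFu, REGION_DEVICE }, /* graphics card; end is assumed */
};

/* Return what a physical address decodes to, or REGION_NONE for a gap. */
region_kind classify(uint32_t addr)
{
    size_t count = sizeof memory_map / sizeof memory_map[0];
    for (size_t i = 0; i < count; i++) {
        assert(memory_map[i].start <= memory_map[i].end);
        if (addr >= memory_map[i].start && addr <= memory_map[i].end)
            return memory_map[i].kind;
    }
    return REGION_NONE;
}
```

With this table, classify(0x1FFFFFFF) is the last byte of RAM, classify(0x20000000) falls in the gap, and classify(0xF2000000) lands on the graphics card. Real hardware does this decoding in the chipset, not in software.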
The processor has special registers which tell it which areas of 'memory' space are cacheable. You don't want the processor to cache any device registers, nor to combine writes to them, because the values at those locations can change without the processor noticing, and the order in which writes to the registers occur may be significant. (Ordinary RAM doesn't have the first problem: DMA writes to it are 'snooped' by the processor to keep the caches up to date.)
You can't normally write directly to all of the video card's memory. AGP supports an 'aperture' of memory address space that can be mapped to areas of the video card RAM using the Graphics Address Remapping Table [GART]. The default configuration is normally 64MB, which may not be enough any more.
I'm surprised that the original poster indicated that only 3GB was available, but perhaps the address decoding logic on his motherboard can only cope with decoding to a whole bank of RAM.
Most I/O for modern devices is performed using Direct Memory Access transfers, where the device is told where to transfer the data from or to, and how much to transfer. The device does this by asking to become the Bus Master. Once it is the bus master, it drives the memory address and read/write lines directly (at least, logically - with HyperTransport and inter-hub transport the physical configuration is point-to-point with multiplexing, but you get the idea). On completion it returns the bus to the processor. In the meantime, the processor can get on with performing useful work.
The system software still has to tell the device where and how to transfer data. It does this by programming the device's registers, which as I said above are now often memory-mapped.
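As a rough illustration of what 'programming the device's registers' looks like, here is a small simulation in C. The register layout, the names (dma_regs, dma_start, CTRL_START, STATUS_DONE) and the fake_device_run stand-in are all hypothetical; a real device defines its own registers, and on real hardware the struct would live at a device-specific physical address mapped uncached rather than in an ordinary variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical memory-mapped register block. 'volatile' stops the
   compiler from caching or dropping accesses to these fields. */
typedef struct {
    volatile uintptr_t dest;    /* where the device should put the data */
    volatile uint32_t  length;  /* how many bytes to transfer */
    volatile uint32_t  control; /* write CTRL_START to begin */
    volatile uint32_t  status;  /* device sets STATUS_DONE when finished */
} dma_regs;

enum { CTRL_START = 1u, STATUS_DONE = 1u };

/* Driver side: tell the device where and how much to transfer.
   The start bit is written last, because register write order matters. */
void dma_start(dma_regs *regs, void *buf, uint32_t len)
{
    assert(buf != NULL);
    regs->status  = 0;
    regs->dest    = (uintptr_t)buf;
    regs->length  = len;
    regs->control = CTRL_START;
}

/* Stand-in for the hardware: acting as bus master, it copies its own
   data straight into memory without the CPU touching each byte. */
void fake_device_run(dma_regs *regs, const uint8_t *device_data)
{
    if (regs->control & CTRL_START) {
        memcpy((void *)regs->dest, device_data, regs->length);
        regs->control = 0;
        regs->status  = STATUS_DONE;
    }
}
```

The driver only writes three or four registers per transfer, however large the transfer is. In a real driver the completion would normally be signalled by an interrupt rather than by calling a function.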
This is why Programmed I/O is so much slower than DMA. For Programmed I/O, the system software sets the values of device registers to indicate where to read - either with a memory write or an OUT, the difference is simply whether the IO/memory line is set or unset - then reads from the device registers into the CPU registers with a memory read or an IN. The software then writes the values read from the device into main memory. The data has to travel further (device to CPU, then CPU to memory), the processor has to wait for the device registers to become ready, and this code often can't be interrupted.
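The Programmed I/O loop described above can be sketched as a simulation in C. The device model and every name in it (pio_device, pio_read_status, pio_read_data, pio_read) are hypothetical, and the simulated device is always ready immediately, which a real one would not be. The point is only to show that the CPU touches a device register at least twice for every byte moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model for a Programmed I/O read. */
typedef struct {
    const uint8_t *data;    /* bytes the simulated device will deliver */
    size_t pos;             /* next byte to deliver */
    size_t len;             /* total bytes available */
    unsigned reg_accesses;  /* how many times the CPU touched a register */
} pio_device;

/* Simulated status register: nonzero when a byte is ready to read. */
uint8_t pio_read_status(pio_device *dev)
{
    dev->reg_accesses++;
    return dev->pos < dev->len;
}

/* Simulated data register: hands over one byte per read.
   Only call this after the status register reported ready. */
uint8_t pio_read_data(pio_device *dev)
{
    dev->reg_accesses++;
    return dev->data[dev->pos++];
}

/* The CPU does all the work: check status, read a byte into a CPU
   register, then store it to main memory. Returns bytes copied. */
size_t pio_read(pio_device *dev, uint8_t *dst, size_t count)
{
    assert(dst != NULL);
    size_t copied = 0;
    while (copied < count && pio_read_status(dev)) {
        dst[copied++] = pio_read_data(dev);
    }
    return copied;
}
```

Compare this with DMA, where the number of register accesses is fixed per transfer instead of growing with the amount of data.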