It's clearly not possible, because of what I said in my previous post. Loading contiguous memory is fast because of the way Intel processors aggressively cache both page and bank requests, but there is no way to stream memory from discontiguous parts of memory. So there is no way to stream two different 64-bit addresses to registers on any Intel processor, as a consequence of the way memory loads work in the processor.
Because these caching mechanisms are not programmable, there's no way to tell the processor to preemptively load pages, or to ignore cache requests (believe me, I've tried), so no, there is no way to stream 64-bit addresses to registers.
As I said in my previous post, if you're performing a functional mapping over contiguous memory, then a GPU is your best bet, and CUDA is the best programming language to do it in. If you're performing a generalized lookup over non-contiguous data, then a programmable FPGA is the fastest method.
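To make the two access patterns concrete, here is a minimal sketch in plain C (not CUDA or an FPGA design, just an illustration of the memory behaviour). The function names map_contiguous and lookup_gather and the doubling operation are made up for this example and are not from the earlier posts.

```c
#include <assert.h>
#include <stddef.h>

/* Functional mapping over contiguous memory: element i of the output depends
   only on element i of the input, so the loads walk memory sequentially. */
void map_contiguous(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = 2.0f * in[i];   /* hypothetical per-element operation */
}

/* Generalized lookup over non-contiguous data: each load address comes from
   an index array, so consecutive iterations may touch unrelated locations. */
void lookup_gather(const float *table, const size_t *idx, float *out, size_t n)
{
    assert(table != NULL && idx != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = table[idx[i]];  /* caller must keep idx[i] inside table */
}
```

Both loops do the same amount of arithmetic. The difference is only in where the next load address comes from: the loop counter in the first case, another load in the second.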
So basically the answer is no, unless you're willing to go rather outside of what I would expect from an undergrad. The question then remains as to _why_ you would want to do such a thing, given that any speed issues are usually due to:
1. Bad Algorithm
2. Lots of overhead (perhaps due to an overuse of library or OS functions)
3. Bank or cache misses (as detailed earlier)
4. Limitation ultimately at the memory bus speed
So now I've repeated what I said before but in a wordier way. The answer is probably no, but it's qualified because there are cases where it isn't no. Happy now?
Dexter said:
What you want to be "vectorized" are 4 memory reads at different addresses
Are there any examples where you would want to vectorize 4 memory reads at the same address?
mov xmm1, [xmm0]
mov xmm2, xmm1
mov xmm3, xmm1
mov xmm4, xmm1
and the vectorized version (using the fact that we get a cache and bank bonus) for contiguous addresses
mov xmm1, [xmm0]
mov xmm2, [xmm0 + 64]
mov xmm3, [xmm0 + 128]
mov xmm4, [xmm0 + 192]
has one bank miss if xmm0 is aligned on a 128 byte boundary
and for non-contiguous addresses
mov xmm1, [xmm0]
mov xmm2, [xmm0 + 64]
mov xmm3, [xmm0 + 128]
mov xmm4, [xmm0 + 192]
mov xmm1, [xmm1]
mov xmm2, [xmm2]
mov xmm3, [xmm3]
mov xmm4, [xmm4]
has 7-9 bank misses (4 from the direct accesses, one from the leas and 2-4 from the page table directories during the page table misses) and four cache misses, and is a whole lot slower (this loads four addresses stored as size_t* at xmm0 and then dereferences each of them)
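The snippets above are pseudo-assembly, so here is a rough C equivalent of the last two cases, as a sketch only. The names sum_direct and sum_indirect are hypothetical, and summing the four values is just a stand-in for whatever you would do with them once they are in registers.

```c
#include <assert.h>
#include <stddef.h>

/* Contiguous case: four values read straight from consecutive slots. */
size_t sum_direct(const size_t *base)
{
    assert(base != NULL);
    return base[0] + base[1] + base[2] + base[3];
}

/* Non-contiguous case: base holds four addresses (size_t*), and each one
   has to be loaded before the value it points at can be loaded. */
size_t sum_indirect(const size_t *const *ptrs)
{
    assert(ptrs != NULL);
    size_t total = 0;
    for (int i = 0; i < 4; ++i) {
        const size_t *p = ptrs[i];  /* first load: fetch the address */
        total += *p;                /* second load: depends on the first */
    }
    return total;
}
```

The second function does eight loads instead of four, and the last four cannot be issued until the first four have come back, which is where the extra misses in the count above come from.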