page 1 of 1
Comments: 5 | Views: 902
I am working on a project (where I had to rewrite code to alleviate a known bottleneck) and I went through some extra trouble to find an algorithm that does no branching to replace the only loop in my code (which happens to be inside a loop that fills an array) so that in theory, the entire computation would be vectorizable. The algorithm that replaces the loop does sixteen operations (consisting of right logical shifts, bitwise ors, one addition and one multiplication) to compute the index of the value in a lookup table that contains the result of the loop it replaced.

I have the loop that it replaced commented out and I would have liked to mention in a comment that the code that replaced the loop is theoretically simdizable on X architecture. The computation of the array index is easily vectorizable, but I was not quite so sure if its use of a look-up table is vectorizable, so I started reading through Intel's manuals. I found out on page 237 of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual" that table look-up operations are not vectorizable.

There are workarounds that you can use with IA-32, but I am curious, is there any technical reason why IA-32 does not have an instruction that allows table look-ups to be vectorized? If there is no technical reason why IA-32 does not have an instruction that allows table look-ups to be vectorized, are there other instruction set architectures that do have an instruction that allows table look-ups to be vectorized and if they exist, what would those be? On a more extreme paradigm, do stream processors such as Nvidia's graphics processors or ATI's/AMD's graphics processors have instructions that allow this to be done?
I'm not an expert in SIMD but I'll try to guess:

- What you call table look-up is basically indexed memory access
- What you want to be "vectorized" are 4 memory reads at different addresses

To vectorize some arithmetic or logical operation is easy, you put N ALUs in the CPU and you have them executing the same operation on different bit ranges of the input registers.

To vectorize memory reads what do you need? Well, I assume you need N memory access units, N buses and a N-ported memory. Tough job having all those and if you don't have them you'll end up serializing all memory reads. Maybe it could work if all reads would be on the same cacheline but that doesn't always happen.

evildictaitor
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
What are you talking about? Intel chipsets come with an onboard (nonprogrammable) hardware page and bank cache, which means so long as you're accessing contiguous memory you ARE vectorising memory reads, whether you like it or not.

Moreover, where were you going to vectorize these memory reads to? If you're applying a functional map over an array then to get high levels of parallelism you should look at using a GPU to take the body of work (see CUDA), or if you're looking at a generalized lookup table, there are programmable FPGAs, but if your program is really running slowly, it's probably due to (in order):

1. Bad algorithm.
2. Lots of overhead - overuse of standard libraries?
3. Bank or cache misses
4. Memory bus line speed.

What YOU are talking about? Smiley

Maybe his original post wasn't clear enough but in the second post he clearly mentioned different addresses. Basically he asks about an instruction like

mov xmm0, [xmm1]

which could load 2 values in xmm0 from 2 different addresses stored in xmm1. That means doing two 64 bit loads from 2 different addresses. As far as I know current Intel chips can do one 128 bit load from a single address.

evildictaitor
evildictaitor
if( !succeed( try() ) ) { while(true) try(); }
It's clearly not possible because of what I said in my previous post - while loading contiguous memory is possible due to the way intel processors agressively cache both page and bank requests, there is no way to stream memory from discontiguous parts of memory - so there is no way to stream two different 64bit addresses to registers under any intel processor, as a consequence of the way memory loads work in the processor.

Because these are nonprogrammable there's no way to tell the processor to preemptively load pages to the processor, or to ignore cache requests (believe me, I've tried), so no there is no way to stream 64bit addresses to registers.

As I said in my previous post, if you're performing a functional mapping over contiguous memory, then a GPU is your best bet, and CUDA is the best programming language to do it in. If you're performing a generalized lookup over non-contiguous data, then a programmable FPGA is the fastest method.

So basically the answer is no, unless you're willing to go rather outside of an what I would be expecting from an undergrad. The question then remains as to _why_ you would be wanting to do such a thing - given that any speed issues are usually due to

1. Bad Algorithm
2. Lots of overhead (perhaps due to an overuse of library or OS functions)
3. Bank or cache misses (as detailed earlier)
4. Limitation ultimately at the memory bus speed


So now I've repeated what I said before but in a wordier way. The answer is probably no, but it's qualified because there are cases where it isn't no. Happy now?

Dexter said:
 What you want to be "vectorized" are 4 memory reads at different addresses
Are there any examples where you would want to vectorize 4 memory reads at the same address?
mov xmm1, [xmm0]
mov xmm2, xmm1
mov xmm3, xmm1 
mov xmm4, xmm1

and the vectorized (using the fact that we get a cache and bank bonus) for contiguous addresses
mov xmm1, [xmm0]
mov xmm2, [xmm0 + 64]
mov xmm3, [xmm0 + 128]
mov xmm4, [xmm0 + 192]

has one bank miss if xmm0 is aligned on a 128 byte boundary

and for non continguous addresses
mov xmm1, [xmm0]
mov xmm2, [xmm0 + 64]
mov xmm3, [xmm0 + 128]
mov xmm4, [xmm0 + 192]
mov xmm1, [xmm1]
mov xmm2, [xmm2]
mov xmm3, [xmm3]
mov xmm4, [xmm4]

has 7-9 bank misses (4 from the direct accesses, one from the leas and 2-4 from the page table directories during the page table misses) and four cache fails, and is a whole lot slower (loading in four addresses stored as size_t* on xmm0)
page 1 of 1
Comments: 5 | Views: 902
Microsoft Communities