David Tarditi wrote:[...]
The problem there is that the intention here - to perform a 5x5 convolve by repeatedly compositing an offset image - is right for the GPU, but wrong for the CPU.
Now, if you want good performance, you need to traverse the output pixels in the correct order to preserve spatial locality.
You're quite right, rhm, that different target platforms have different issues, and that you have to adapt the structure of your program to your processor if you want the comparison to be meaningful. I can assure you that in our convolution benchmark, the CPU version we compare against is quite clever about how it iterates.
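For anyone following along, here is a minimal sketch of the two iteration orders under discussion. It is plain Python written purely for illustration, not the code from the benchmark. The function names, the zero-padded borders, and the unflipped kernel (strictly a correlation) are all assumptions made for the example. The first function makes one full pass over the image per kernel tap, which is the "composite an offset image" structure. The second visits each output pixel once, in row-major order, and gathers all of its taps there.

```python
def convolve_by_shifted_passes(image, kernel):
    """One pass over the whole image per kernel tap (offset-compositing style)."""
    h, w = len(image), len(image[0])
    k = len(kernel)          # assumed square with odd size, e.g. 5
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for ky in range(k):
        for kx in range(k):
            weight = kernel[ky][kx]
            dy, dx = ky - r, kx - r
            # Accumulate the image shifted by (dy, dx) into the output.
            for y in range(h):
                sy = y + dy
                if not 0 <= sy < h:
                    continue  # outside the image: treated as zero
                for x in range(w):
                    sx = x + dx
                    if 0 <= sx < w:
                        out[y][x] += weight * image[sy][sx]
    return out


def convolve_per_output_pixel(image, kernel):
    """Visit each output pixel once, in row-major order, gathering all taps."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(k):
                sy = y + ky - r
                if not 0 <= sy < h:
                    continue  # outside the image: treated as zero
                for kx in range(k):
                    sx = x + kx - r
                    if 0 <= sx < w:
                        acc += kernel[ky][kx] * image[sy][sx]
            out[y][x] = acc
    return out
```

Both functions compute the same result. The difference is that a 5x5 kernel makes the first one read and write the whole output image 25 times, while the second touches each output pixel once and only reads a small neighbourhood around it. Python will not show the cache effect itself; the sketch is only meant to make the two loop structures explicit.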
For our multi-core backend, we are indeed being as ambitious as you suggest. Our goal is to tailor the loop ordering to suit the machine. There have been decades of research into automatic loop transformations (strip-mining, tiling, skewing, ...), so the idea of doing this in a compiler isn't novel. As David points out, the advantage we have is that the program is specified at a higher level, so we don't have to burn cycles trying to figure out which transformations we can legally apply without breaking a data dependency.
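To give a flavour of what one of those transformations looks like, here is a small sketch of tiling applied to a traversal of output pixels. It is a hand-written illustration and says nothing about how the backend actually implements it; `row_major_order`, `tiled_order`, and the tile sizes are hypothetical names and parameters chosen for the example.

```python
def row_major_order(height, width):
    """Plain traversal: every pixel of a row, then the next row."""
    for y in range(height):
        for x in range(width):
            yield (y, x)


def tiled_order(height, width, tile_h, tile_w):
    """Tiled traversal: finish one tile_h x tile_w block before moving on."""
    for ty in range(0, height, tile_h):
        for tx in range(0, width, tile_w):
            # min() handles partial tiles at the right and bottom edges.
            for y in range(ty, min(ty + tile_h, height)):
                for x in range(tx, min(tx + tile_w, width)):
                    yield (y, x)


# Example: a 2x4 image visited in 2x2 tiles.
# list(tiled_order(2, 4, 2, 2))
# -> [(0, 0), (0, 1), (1, 0), (1, 1), (0, 2), (0, 3), (1, 2), (1, 3)]
```

Both generators visit exactly the same set of pixels, only in a different order. Reordering like this is legal whenever each output pixel can be computed independently of the others, which is the property a compiler for a lower-level language has to prove from the loop nest before it can apply the transformation.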