Two things to consider:

1) PCIe is much better than AGP at transferring data back to the host.  In effect, the asymmetry has been erased.

2) On the smaller benchmarks, the communication time is largely irrelevant.  If convolving with a 5x5 filter is your entire computation, then it’s probably not worth parallelizing anyway.  For the larger benchmarks, take StereoMatch as an example: the communication time adds less than 25%, which is pretty small compared to a 4x speedup over C++.
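
To put rough numbers on the StereoMatch point, here is a back-of-the-envelope sketch. It assumes (and this is my reading, not something measured here) that the 4x figure is for the computation alone and that the 25% is extra time relative to that computation. The function name and parameters are made up for illustration.

```python
def effective_speedup(kernel_speedup: float, comm_overhead: float) -> float:
    """Speedup over the CPU baseline once transfer time is included.

    kernel_speedup: CPU time divided by GPU compute time (transfers excluded).
    comm_overhead:  transfer time as a fraction of GPU compute time
                    (0.25 means communication adds 25%).
    Both interpretations are assumptions made for this sketch.
    """
    return kernel_speedup / (1.0 + comm_overhead)


# Worst case under the stated assumptions: 4x kernel speedup, 25% overhead.
worst_case = effective_speedup(4.0, 0.25)
print(f"effective speedup: {worst_case:.1f}x")
```

Under those assumptions the overhead still leaves a speedup of a bit over 3x. If the 4x figure already includes the transfers, there is nothing to adjust.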