At 15:59 08/05/2012, you wrote:
Yep, you are correct. I did the same and it worked. When I have more
than 3 MPI tasks there is a lot of overhead on the GPU.
But on the CPU there is no overhead. All three machines have 4 quad-core
processors with 3.8 GB RAM.
Just wondering why there is no performance degradation on the CPU?
Your GPU is saturated. It has more work than it can handle, so its
performance drops.
If your kernel code is the one you posted some days ago, you can
divide the number of threads by some factor and multiply the work done
in each one by the same factor, so you do the same work (maybe faster)
without using up the whole thread pool and SM bandwidth.
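
A rough sketch of what I mean, assuming an element-wise kernel. The kernel
name scale, the constant WORK_PER_THREAD and the sizes are made up for
illustration and are not taken from your code, so adapt them and measure;
whether it is actually faster depends on your kernel and your card.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: out[i] = in[i] * factor.
// Each thread strides over the array, so the result is correct for any
// grid size, including one much smaller than n.
__global__ void scale(const float *in, float *out, int n, float factor)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        out[i] = in[i] * factor;
}

int main()
{
    const int n = 1 << 20;
    const int THREADS = 256;
    const int WORK_PER_THREAD = 8;  // assumption: a tuning knob, try 2, 4, 8...

    // WORK_PER_THREAD times fewer blocks than the one-element-per-thread launch.
    const int per_block = THREADS * WORK_PER_THREAD;
    const int blocks = (n + per_block - 1) / per_block;

    size_t bytes = n * sizeof(float);
    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    if (!h_in || !h_out) return 1;
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    if (cudaMalloc((void **)&d_in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&d_out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    scale<<<blocks, THREADS>>>(d_in, d_out, n, 2.0f);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    // Verify that every element was processed exactly as before.
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != 2.0f * h_in[i]) ++bad;
    printf("blocks=%d mismatches=%d\n", blocks, bad);

    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return bad != 0;
}
```

The strided loop keeps neighbouring threads on neighbouring elements, so
memory accesses stay coalesced even though each thread now does more work.
With several MPI tasks sharing one card, the smaller grid from each task
leaves more room for the others.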
HTH