Hi,

Are the recent peer-to-peer capabilities of CUDA leveraged by Open MPI when, e.g., you're running one rank per GPU on a single workstation?
In my testing I only get on the order of 1 GB/s, as per http://www.open-mpi.org/community/lists/users/2011/03/15823.php, whereas NVIDIA's simpleP2P test indicates ~6 GB/s.

Also, I ran into a problem just trying to test. It seems you have to call cudaSetDevice/cuCtxCreate with the appropriate GPU id, which I wanted to derive from the rank. However, you don't know the rank until after MPI_Init(), and CUDA needs to be initialized before that. Is there a standard way to do this? I have a workaround at the moment.

Thanks,
Chris
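
For reference, here is a minimal sketch of one possible way to pick a GPU before MPI_Init(). It is not necessarily the workaround mentioned above. It assumes the job is launched with Open MPI's mpirun and that the launcher exports OMPI_COMM_WORLD_LOCAL_RANK into each process's environment; check that this holds for your version. The helper names device_for_local_rank and device_from_environment are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Map a local-rank string to a device index in [0, device_count).
 * Returns -1 if the string is missing or is not a non-negative integer. */
int device_for_local_rank(const char *local_rank_str, int device_count)
{
    assert(device_count > 0);
    if (local_rank_str == NULL || *local_rank_str == '\0')
        return -1;

    char *end = NULL;
    errno = 0;
    long rank = strtol(local_rank_str, &end, 10);
    if (errno != 0 || *end != '\0' || rank < 0 || rank > INT_MAX)
        return -1;

    /* Wrap around if there are more local ranks than GPUs. */
    return (int)(rank % device_count);
}

/* Assumption: the launcher sets OMPI_COMM_WORLD_LOCAL_RANK before the
 * process starts, so it can be read without calling MPI_Init(). */
int device_from_environment(int device_count)
{
    return device_for_local_rank(getenv("OMPI_COMM_WORLD_LOCAL_RANK"),
                                 device_count);
}
```

In a real program you would get device_count from cudaGetDeviceCount(), pass the result of device_from_environment() to cudaSetDevice() if it is not -1, and only then call MPI_Init(). If the variable is not set (for example when the program is started without mpirun), the helper returns -1 and you need some other fallback.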