Re: [OMPI users] Stream interactions in CUDA

2012-12-13 Thread Shamis, Pavel
vel. Thanks, Justin From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart [rvandeva...@nvidia.com] Sent: Thursday, December 13, 2012 6:18 AM To: Open MPI Users Subject: Re: [OMPI users] S

Re: [OMPI users] Stream interactions in CUDA

2012-12-13 Thread Justin Luitjens
PI Users Subject: Re: [OMPI users] Stream interactions in CUDA Hi Justin: I assume you are running on a single node. In that case, Open MPI is supposed to take advantage of the CUDA IPC support. This will be used only when messages are larger than 4K, which yours are. In that case, I would hav

Re: [OMPI users] Stream interactions in CUDA

2012-12-13 Thread Rolf vandeVaart
.@open-mpi.org] >On Behalf Of Jens Glaser >Sent: Wednesday, December 12, 2012 8:12 PM >To: Open MPI Users >Subject: Re: [OMPI users] Stream interactions in CUDA > >Hi Justin > >from looking at your code it seems you are receiving more bytes from the >processors than you se

Re: [OMPI users] Stream interactions in CUDA

2012-12-12 Thread Jens Glaser
Hi Justin, from looking at your code, it seems you are receiving more bytes from the processors than you send (I assume MAX_RECV_SIZE_PER_PE > send_sizes[p]). I don't think this is valid. Your transfers should have matched sizes on the sending and receiving side. To achieve this, either communicat

Re: [OMPI users] Stream interactions in CUDA

2012-12-12 Thread Dmitry N. Mikushin
Hi Justin, A quick grep reveals several cuMemcpy calls in Open MPI. Some of them are even synchronous, meaning they go to stream 0. I think the best way of exploring this sort of behavior is to run the Open MPI runtime (thanks to its open-source nature!) under a debugger. Rebuild Open MPI with -g -O0, add some

[OMPI users] Stream interactions in CUDA

2012-12-12 Thread Justin Luitjens
Hello, I'm working on an application using Open MPI with CUDA and GPUDirect. I would like to get the MPI transfers to overlap with computation on the CUDA device. To do this I need to ensure that no memory transfers go to stream 0. In this application I have one step that performs an