Hi Rolf,

I double-checked the flag just now. It was set correctly, but the hang is still there. However, I found another way to get rid of it: simply setting the environment variable CUDA_NIC_INTEROP=1 solves the issue.
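For anyone else who runs into this, the variable just needs to be present in the environment of every MPI rank, e.g. by exporting CUDA_NIC_INTEROP=1 in the job script or by forwarding it through the launcher with mpirun's -x option (mpirun -x CUDA_NIC_INTEROP=1 ...); how exactly it reaches the ranks will of course depend on your setup.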
Thanks,
Fengguang

On Jun 6, 2011, at 10:44 AM, Rolf vandeVaart wrote:

> Hi Fengguang:
>
> That is odd that you see the problem even when running with the openib flags set as Brice indicated. Just to be extra sure there are no typos in your flag settings, maybe you can verify with the ompi_info command like this?
>
> ompi_info -mca btl_openib_flags 304 -param btl openib | grep btl_openib_flags
>
> When running with the 304 setting, all communications travel through a regular send/receive protocol on IB. The message is broken up into a 12K fragment, followed by however many 64K fragments it takes to move the message.
>
> I will try to find time to reproduce the other 1 MB issue that Brice reported.
>
> Rolf
>
> PS: Not sure if you are interested, but in the trunk you can configure in support for sending and receiving GPU buffers directly. There are still many performance issues to be worked out, but I just thought I would mention it.
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Fengguang Song
> Sent: Sunday, June 05, 2011 9:54 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Program hangs when using OpenMPI and CUDA
>
> Hi Brice,
>
> Thank you! I saw your previous discussion and have actually tried "--mca btl_openib_flags 304". Unfortunately, it didn't solve the problem. In our case, the MPI buffer is different from the cudaMemcpy buffer and we manually copy between them. I'm still trying to figure out how to configure Open MPI's MCA parameters to solve the problem...
>
> Thanks,
> Fengguang
>
> On Jun 5, 2011, at 2:20 AM, Brice Goglin wrote:
>
>> On 05/06/2011 00:15, Fengguang Song wrote:
>>> Hi,
>>>
>>> I'm running into a problem when using Open MPI 1.5.1 on a GPU cluster. My program uses MPI to exchange data between nodes, and uses cudaMemcpyAsync to exchange data between the host and the GPU within a node. When the MPI message size is less than 1 MB, everything works fine. However, when the message size is greater than 1 MB, the program hangs (i.e., an MPI send never reaches its destination, based on my trace).
>>>
>>> The issue may be related to locked-memory contention between Open MPI and CUDA. Does anyone have experience solving this problem? Which MCA parameters should I tune so that messages larger than 1 MB get through without hanging? Any help would be appreciated.
>>>
>>> Thanks,
>>> Fengguang
>>
>> Hello,
>>
>> I may have seen the same problem when testing GPUDirect. Do you use the same host buffer for copying from/to the GPU and for sending/receiving on the network? If so, you need a GPUDirect-enabled kernel and Mellanox drivers, but that only helps below 1 MB.
>>
>> You can work around the problem with one of the following solutions:
>> * add --mca btl_openib_flags 304 to force OMPI to always send/receive through an intermediate (internal) buffer, but that decreases performance below 1 MB too
>> * use different host buffers for the GPU and the network and manually copy between them
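For what it's worth, this second workaround is what we already do in our code: the buffer passed to cudaMemcpy and the buffer passed to MPI are separate, with a manual copy in between. A stripped-down sketch of the pattern on the send side, with names invented and error checking omitted:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stage a device buffer through two separate host buffers:
     * h_stage is the only buffer CUDA sees, h_mpi is the only buffer MPI sees. */
    static void send_from_gpu(const void *d_buf, size_t bytes,
                              int dest, int tag, MPI_Comm comm)
    {
        void *h_stage, *h_mpi;
        cudaMallocHost(&h_stage, bytes);   /* pinned buffer used only for cudaMemcpy */
        h_mpi = malloc(bytes);             /* plain buffer handed to MPI_Send */

        cudaMemcpy(h_stage, d_buf, bytes, cudaMemcpyDeviceToHost);
        memcpy(h_mpi, h_stage, bytes);     /* manual copy between the two buffers */
        MPI_Send(h_mpi, (int)bytes, MPI_BYTE, dest, tag, comm);

        free(h_mpi);
        cudaFreeHost(h_stage);
    }

The receive side mirrors this: MPI_Recv into the plain buffer, memcpy into the pinned buffer, then cudaMemcpy host-to-device. Even with the buffers kept separate like this, we still saw the hang for messages over 1 MB until we set CUDA_NIC_INTEROP=1.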
>> I never got any reply from NVIDIA/Mellanox/here when I reported this problem with GPUDirect and messages larger than 1 MB:
>> http://www.open-mpi.org/community/lists/users/2011/03/15823.php
>>
>> Brice