Hi Rolf,

I double-checked the flag just now. It was set correctly, but the hang is still there. However, I found another way to get rid of it: simply setting the environment variable CUDA_NIC_INTEROP=1 solves the issue.
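For anyone else who runs into this, the variable just needs to be present in the environment of every MPI rank, e.g. by exporting CUDA_NIC_INTEROP=1 in the job script or by forwarding it through the launcher with mpirun's -x option (mpirun -x CUDA_NIC_INTEROP=1 ...); how exactly it reaches the ranks will of course depend on your setup.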
Thanks,
Fengguang

On Jun 6, 2011, at 10:44 AM, Rolf vandeVaart wrote:

> Hi Fengguang:
>
> That is odd that you see the problem even when running with the openib flags set as Brice indicated. Just to be extra sure there are no typos in your flag settings, maybe you can verify with the ompi_info command like this?
>
> ompi_info -mca btl_openib_flags 304 -param btl openib | grep btl_openib_flags
>
> When running with the 304 setting, all communications travel through a regular send/receive protocol on IB. The message is broken up into a 12K fragment, followed by however many 64K fragments it takes to move the message.
>
> I will try to find time to reproduce the other 1 MB issue that Brice reported.
>
> Rolf
>
> PS: Not sure if you are interested, but in the trunk you can configure in support for sending and receiving GPU buffers directly. There are still many performance issues to be worked out, but I just thought I would mention it.
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Fengguang Song
> Sent: Sunday, June 05, 2011 9:54 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Program hangs when using OpenMPI and CUDA
>
> Hi Brice,
>
> Thank you! I saw your previous discussion and have actually tried "--mca btl_openib_flags 304". Unfortunately, it didn't solve the problem. In our case, the MPI buffer is different from the cudaMemcpy buffer and we manually copy between them. I'm still trying to figure out how to configure Open MPI's MCA parameters to solve the problem...
>
> Thanks,
> Fengguang
>
> On Jun 5, 2011, at 2:20 AM, Brice Goglin wrote:
>
>> On 05/06/2011 00:15, Fengguang Song wrote:
>>> Hi,
>>>
>>> I'm running into a problem when using Open MPI 1.5.1 on a GPU cluster. My program uses MPI to exchange data between nodes, and uses cudaMemcpyAsync to exchange data between the host and the GPU within a node. When the MPI message size is less than 1 MB, everything works fine. However, when the message size is greater than 1 MB, the program hangs (i.e., an MPI send never reaches its destination, based on my trace).
>>>
>>> The issue may be related to locked-memory contention between Open MPI and CUDA. Does anyone have experience solving this problem? Which MCA parameters should I tune so that messages larger than 1 MB get through without hanging? Any help would be appreciated.
>>>
>>> Thanks,
>>> Fengguang
>>
>> Hello,
>>
>> I may have seen the same problem when testing GPUDirect. Do you use the same host buffer for copying from/to the GPU and for sending/receiving on the network? If so, you need a GPUDirect-enabled kernel and Mellanox drivers, but that only helps below 1 MB.
>>
>> You can work around the problem with one of the following solutions:
>> * add --mca btl_openib_flags 304 to force OMPI to always send/receive through an intermediate (internal) buffer, but that decreases performance below 1 MB too
>> * use different host buffers for the GPU and the network and manually copy between them
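For what it's worth, this second workaround is what we already do in our code: the buffer passed to cudaMemcpy and the buffer passed to MPI are separate, with a manual copy in between. A stripped-down sketch of the pattern on the send side, with names invented and error checking omitted:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>
    #include <string.h>

    /* Stage a device buffer through two separate host buffers:
     * h_stage is the only buffer CUDA sees, h_mpi is the only buffer MPI sees. */
    static void send_from_gpu(const void *d_buf, size_t bytes,
                              int dest, int tag, MPI_Comm comm)
    {
        void *h_stage, *h_mpi;
        cudaMallocHost(&h_stage, bytes);   /* pinned buffer used only for cudaMemcpy */
        h_mpi = malloc(bytes);             /* plain buffer handed to MPI_Send */

        cudaMemcpy(h_stage, d_buf, bytes, cudaMemcpyDeviceToHost);
        memcpy(h_mpi, h_stage, bytes);     /* manual copy between the two buffers */
        MPI_Send(h_mpi, (int)bytes, MPI_BYTE, dest, tag, comm);

        free(h_mpi);
        cudaFreeHost(h_stage);
    }

The receive side mirrors this: MPI_Recv into the plain buffer, memcpy into the pinned buffer, then cudaMemcpy host-to-device. Even with the buffers kept separate like this, we still saw the hang for messages over 1 MB until we set CUDA_NIC_INTEROP=1.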
>> I never got any reply from NVIDIA/Mellanox/here when I reported this problem with GPUDirect and messages larger than 1 MB:
>> http://www.open-mpi.org/community/lists/users/2011/03/15823.php
>>
>> Brice