Hi Rolf,

Thanks very much for the info! So with a CUDA-aware build, Open MPI still has to copy all the data into host memory first, and then do the send/recv on the host memory? I thought Open MPI would use GPUDirect and RDMA to send/recv GPU memory directly.
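For concreteness, here is a minimal sketch of the two call patterns I have in mind (buffer names and sizes are placeholders, error checking omitted):

/* Sketch only: compares passing a device pointer straight to MPI_Allreduce
   under a CUDA-aware build with staging through host memory by hand. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (40 * 1000 * 1000)   /* roughly my case: ~40 million floats */

/* Path 1: CUDA-aware build, device pointers go directly into MPI.
   (My assumption was that this would use GPUDirect RDMA end to end.) */
void allreduce_sum_cuda_aware(float *d_send, float *d_recv)
{
    MPI_Allreduce(d_send, d_recv, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
}

/* Path 2: stage the data into host memory explicitly, reduce there,
   and copy the result back to the device. */
void allreduce_sum_staged(float *d_buf)
{
    float *h_buf = malloc(N * sizeof(float));
    cudaMemcpy(h_buf, d_buf, N * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Allreduce(MPI_IN_PLACE, h_buf, N, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    cudaMemcpy(d_buf, h_buf, N * sizeof(float), cudaMemcpyHostToDevice);
    free(h_buf);
}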
I will try a debug build and see what it says. Thanks!

Best,
Yang

------------------------------------------------------------------------
Yang ZHANG
PhD candidate
Networking and Wide-Area Systems Group
Computer Science Department
New York University
715 Broadway Room 705
New York, NY 10003

> On Sep 25, 2015, at 11:07 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
>
> Hello Yang:
> It is not clear to me whether you are asking about a CUDA-aware build of
> Open MPI, where you call MPI_Allreduce() directly on the GPU buffer, or
> whether you are staging the GPU data into host memory yourself and then
> calling MPI_Allreduce(). Either way, they are somewhat similar. With
> CUDA-aware, the MPI_Allreduce() of GPU data simply first copies the data
> into a host buffer and then calls the underlying implementation.
>
> Depending on how you have configured your Open MPI, the underlying
> implementation may vary. I would suggest you compile a debug version
> (--enable-debug) and then run some tests with --mca coll_base_verbose 100,
> which will give you some insight into what is actually happening under
> the covers.
>
> Rolf
>
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>> Sent: Thursday, September 24, 2015 11:41 PM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] How does MPI_Allreduce work?
>>
>> Hello Open MPI users,
>>
>> Is there any documentation on the MPI_Allreduce() implementation? I’m
>> using it to do summation on GPU data. I wonder whether Open MPI first
>> does the summation among processes on the same node and then sums the
>> intermediate results across nodes. That would be preferable, since it
>> reduces cross-node communication and should be faster.
>>
>> I’m using Open MPI 1.10.0 and CUDA 7.0. I need to sum 40 million floats
>> on 6 nodes, each node running 4 processes. The nodes are connected via
>> InfiniBand.
>>
>> Thanks very much!
>>
>> Best,
>> Yang
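As an aside, the "sum within each node first, then across nodes" pattern from the original question can be written out by hand with MPI-3 communicator splitting. The sketch below is only an illustration of that pattern on host buffers (the function and variable names are made up, and the GPU-to-host copy is omitted for brevity); it is not a description of what Open MPI's coll components actually do internally. The --mca coll_base_verbose 100 output Rolf suggests is the way to see which algorithm really gets selected.

/* Illustration only: a hand-rolled "reduce within the node, then across
   nodes" allreduce on host data. Not necessarily what Open MPI does. */
#include <mpi.h>
#include <stddef.h>

void hierarchical_allreduce(float *host_buf, int count)
{
    int world_rank, node_rank;
    MPI_Comm node_comm, leader_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* One communicator per node (MPI-3). */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* Step 1: sum within the node onto the local rank 0. */
    if (node_rank == 0)
        MPI_Reduce(MPI_IN_PLACE, host_buf, count, MPI_FLOAT, MPI_SUM, 0, node_comm);
    else
        MPI_Reduce(host_buf, NULL, count, MPI_FLOAT, MPI_SUM, 0, node_comm);

    /* Step 2: sum the intermediate results across the node leaders only. */
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(MPI_IN_PLACE, host_buf, count, MPI_FLOAT, MPI_SUM,
                      leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Step 3: broadcast the final sum back to every rank on the node. */
    MPI_Bcast(host_buf, count, MPI_FLOAT, 0, node_comm);
    MPI_Comm_free(&node_comm);
}

Whether such a two-level scheme beats a single MPI_Allreduce over MPI_COMM_WORLD depends on the coll component and algorithm Open MPI selects for the given message size and topology.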