Hi Rolf,

Thanks very much for the info! So with a CUDA-aware build, Open MPI still
has to copy all the data into host memory first, and then do the send/recv
on the host memory? I thought Open MPI would use GPUDirect and RDMA to
send/recv GPU memory directly.
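
For context, here is roughly what I am doing. This is just a minimal sketch with placeholder names and sizes, not my exact code:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Roughly 40 million floats, allocated in device memory. */
    const size_t n = 40 * 1000 * 1000;
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    /* ... fill d_buf on the GPU ... */

    /* With a CUDA-aware build the device pointer is passed directly;
     * any staging through host memory happens inside the library. */
    MPI_Allreduce(MPI_IN_PLACE, d_buf, (int)n, MPI_FLOAT, MPI_SUM,
                  MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}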

I will try a debug build and see what it says. Thanks!
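
In case the exact commands matter, I plan to rebuild and rerun roughly like this (./allreduce_test is a placeholder for my test program, and -np 24 matches my 6 nodes x 4 processes per node):

  shell$ ./configure --enable-debug [my usual options]
  shell$ make && make install
  shell$ mpirun -np 24 --mca coll_base_verbose 100 ./allreduce_test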

Best,
Yang

------------------------------------------------------------------------

Sent by Apple Mail

Yang ZHANG

PhD candidate

Networking and Wide-Area Systems Group
Computer Science Department
New York University

715 Broadway Room 705
New York, NY 10003

> On Sep 25, 2015, at 11:07 AM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:
> 
> Hello Yang:
> It is not clear to me whether you are asking about a CUDA-aware build of Open MPI 
> where you call MPI_Allreduce() on the GPU buffer directly, or whether you are 
> staging the GPU data into host memory yourself and then calling MPI_Allreduce().  
> Either way, the two cases are quite similar.  With CUDA-aware support, an 
> MPI_Allreduce() on GPU data simply copies the data into a host buffer first and 
> then calls the underlying implementation.
> 
> Depending on how you have configured your Open MPI, the underlying 
> implementation may vary.  I would suggest you compile a debug version 
> (--enable-debug) and then run some tests with --mca coll_base_verbose 100 
> which will give you some insight into what is actually happening under the 
> covers.
> 
> Rolf
> 
>> -----Original Message-----
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>> Sent: Thursday, September 24, 2015 11:41 PM
>> To: us...@open-mpi.org
>> Subject: [OMPI users] How does MPI_Allreduce work?
>> 
>> Hello OpenMPI users,
>> 
>> Is there any documentation on the MPI_Allreduce() implementation? I’m using it
>> to sum data that lives in GPU memory. I wonder whether Open MPI first does the
>> summation among the processes on the same node, and then sums the intermediate
>> results across nodes. That would be preferable, since it reduces cross-node
>> communication and should be faster.
>> 
>> I’m using Open MPI 1.10.0 and CUDA 7.0. I need to sum 40 million floats across
>> 6 nodes, with each node running 4 processes. The nodes are connected via
>> InfiniBand.
>> 
>> Thanks very much!
>> 
>> Best,
>> Yang
>> 
