On 5/30/2016 11:09 PM, Saliya Ekanayake wrote:
So, you mean that it guarantees the value received after the bcast call is consistent with the value sent from the root, but it doesn't have to wait until all the ranks have received it?

This is what I believe; double-checking the standard might not hurt though ...
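A quick way to see this for yourself (a minimal sketch I wrote for illustration, not code from the standard or the benchmark): delay the non-root ranks before they enter MPI_Bcast() and time how long each rank spends in the call. For a short message, the root typically returns almost immediately, long before the sleeping ranks have even entered the call:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank;
        char buf[8] = "hello";
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0)
            sleep(2);                 /* non-root ranks enter the bcast late */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, sizeof(buf), MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        /* with an eager send, rank 0 usually reports microseconds, not ~2 seconds */
        printf("rank %d spent %g s in MPI_Bcast\n", rank, t1 - t0);
        MPI_Finalize();
        return 0;
    }

(whether the root really returns early depends on the message size and the implementation, so treat this as a demonstration, not a guarantee)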

Still, in this benchmark, shouldn't the max time for bcast be equal to that of the barrier?

No.

First, you should find out which algorithms are used for MPI_Barrier() and MPI_Bcast().

This choice is based on the communicator size and, for MPI_Bcast(), also on the message length.

Keep in mind the algorithm choice is likely not optimized for your network, and it is not topology-aware: it is based only on the communicator size, not on the number of tasks per node, so inter-node and intra-node communications are treated as equal.


Here is what osu_bcast does:

        timer = 0.0;
        for (i = 0; i < options.iterations + options.skip; i++) {
            t_start = MPI_Wtime();
            MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
            t_stop = MPI_Wtime();

            /* the first options.skip iterations are warm-up and are not timed */
            if (i >= options.skip) {
                timer += t_stop - t_start;
            }
            /* this barrier separates iterations but is not part of the timing */
            MPI_Barrier(MPI_COMM_WORLD);
        }


MPI_Bcast() for a short message does not take long, and since all tasks do not exit MPI_Barrier() at the same time, t_start is a local time, not a global one (in other words, t_stop - t_start is already an approximation ...)
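If you want something closer to a global measurement, a common (still imperfect) trick is to barrier before each timed iteration and then reduce the per-rank totals with MPI_MAX, so the slowest rank defines the reported time. A hedged sketch of just the measurement loop, reusing buffer, size and the iteration count from the benchmark above:

    double local = 0.0, global_max = 0.0;
    for (i = 0; i < options.iterations; i++) {
        MPI_Barrier(MPI_COMM_WORLD);      /* (approximately) align all ranks */
        t_start = MPI_Wtime();
        MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        local += MPI_Wtime() - t_start;
    }
    /* the slowest rank determines the effective broadcast time */
    MPI_Reduce(&local, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

(the barrier itself does not release all ranks at exactly the same instant, so this is still an approximation, just a tighter one)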

If MPI_Bcast() and MPI_Barrier() are implemented with a tree-based algorithm, then MPI_Bcast() only has to go down the tree, whereas MPI_Barrier() has to go down and then all the way back up. In this specific case, I would expect (once again, assuming all processes update t_start at the same time, which is not true) max(MPI_Barrier) ~= 2 * max(MPI_Bcast).
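As a very rough model (assuming a binary tree over P tasks, a uniform per-hop latency L, and no contention, which are all simplifications):

    MPI_Bcast   ~ L * log2(P)        (one traversal of the tree)
    MPI_Barrier ~ 2 * L * log2(P)    (down the tree and all the way back up)

    e.g. P = 64, L = 1 us  ->  bcast ~ 6 us, barrier ~ 12 us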

I recommend you evaluate all the algorithms for MPI_Bcast() and MPI_Barrier() and compare only the best ones.
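With Open MPI you can do that by forcing a specific algorithm in the tuned coll component, for example (the algorithm numbers below are placeholders; list the valid values with ompi_info first):

    # list the available algorithms and their numbers
    ompi_info --param coll tuned --level 9 | grep algorithm

    # force a given bcast / barrier algorithm
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_bcast_algorithm 5 \
           --mca coll_tuned_barrier_algorithm 3 \
           ./osu_bcast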

Keep in mind the result also depends on how tasks are mapped to nodes
(e.g. tasks [0-23] on node 0 vs. tasks {0,24,48,...} on node 0)
and on how tasks are pinned within a node
(e.g. tasks [0-11] on socket 0 vs. tasks {0,2,4,...} on socket 0);
see the example mpirun options below.
Also, if you are using a fat-tree network, the result will depend on which nodes are used
(because of the InfiniBand routing tables).
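With Open MPI, the mapping and binding can be controlled (and verified) from the mpirun command line, for example:

    # pack consecutive ranks on a node, bind each rank to a core
    mpirun --map-by core --bind-to core --report-bindings ./osu_bcast
    # round-robin ranks across nodes instead
    mpirun --map-by node --bind-to core --report-bindings ./osu_bcast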

Cheers,

Gilles
On Mon, May 30, 2016 at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

    These are very different algorithms, so performance might differ
    (greatly).

    For example, MPI_Bcast() on the root rank can simply MPI_Send() and
    return; if the message is short, this is likely an eager send, which is
    very fast. That means MPI_Bcast() can return before all ranks have
    received the data, or have even entered MPI_Bcast().

    On the other hand, MPI_Barrier() cannot return before all ranks
    have entered the barrier.

    also, you might find https://github.com/open-mpi/ompi/issues/1713
    useful.

    Cheers,

    Gilles


    On Monday, May 30, 2016, Saliya Ekanayake <esal...@gmail.com> wrote:

        Hi,

        I ran the OSU micro-benchmarks with Open MPI and noticed that a
        broadcast of a small number of bytes is faster than a barrier:
        2 us vs. 120 us.

        I am trying to understand how this can happen.

        Thank you
        Saliya






--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


