On 5/30/2016 11:09 PM, Saliya Ekanayake wrote:
So, you mean that it guarantees the value received after the bcast call is consistent with the value sent from the root, but it doesn't have to wait until all the ranks have received it?

This is what I believe; double-checking the standard might not hurt though ...
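A quick way to see this for yourself (a minimal sketch I wrote for illustration, not code from the standard or the benchmark): delay the non-root ranks before they enter MPI_Bcast() and time how long each rank spends in the call. For a short message, the root typically returns almost immediately, long before the sleeping ranks have even entered the call:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank;
        char buf[8] = "hello";
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0)
            sleep(2);                 /* non-root ranks enter the bcast late */
        double t0 = MPI_Wtime();
        MPI_Bcast(buf, sizeof(buf), MPI_CHAR, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();
        /* with an eager send, rank 0 usually reports microseconds, not ~2 seconds */
        printf("rank %d spent %g s in MPI_Bcast\n", rank, t1 - t0);
        MPI_Finalize();
        return 0;
    }

(whether the root really returns early depends on the message size and the implementation, so treat this as a demonstration, not a guarantee)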

Still, in this benchmark, shouldn't the max time for bcast be equal to that of the barrier?

No.

First, you should find out which algorithms are used for MPI_Barrier() and MPI_Bcast().

This choice is based on the communicator size and, for MPI_Bcast(), also on the message length.

Keep in mind the algorithm choice is likely not optimized for your network, and it is not topology-aware: it is based only on the communicator size, not on the number of tasks per node, so inter-node and intra-node communications are treated as equal.


Here is what osu_bcast does:

        timer = 0.0;
        for (i = 0; i < options.iterations + options.skip; i++) {
            t_start = MPI_Wtime();
            MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
            t_stop = MPI_Wtime();

            /* the first options.skip iterations are warm-up and are not timed */
            if (i >= options.skip) {
                timer += t_stop - t_start;
            }
            /* this barrier separates iterations but is not part of the timing */
            MPI_Barrier(MPI_COMM_WORLD);
        }


MPI_Bcast() for a short message does not take long, and since all tasks do not exit MPI_Barrier() at the same time, t_start is a local time, not a global one (in other words, t_stop - t_start is already an approximation ...)
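If you want something closer to a global measurement, a common (still imperfect) trick is to barrier before each timed iteration and then reduce the per-rank totals with MPI_MAX, so the slowest rank defines the reported time. A hedged sketch of just the measurement loop, reusing buffer, size and the iteration count from the benchmark above:

    double local = 0.0, global_max = 0.0;
    for (i = 0; i < options.iterations; i++) {
        MPI_Barrier(MPI_COMM_WORLD);      /* (approximately) align all ranks */
        t_start = MPI_Wtime();
        MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        local += MPI_Wtime() - t_start;
    }
    /* the slowest rank determines the effective broadcast time */
    MPI_Reduce(&local, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

(the barrier itself does not release all ranks at exactly the same instant, so this is still an approximation, just a tighter one)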

If MPI_Bcast() and MPI_Barrier() are implemented with a tree-based algorithm, then MPI_Bcast() only has to go down the tree, whereas MPI_Barrier() has to go down and then all the way back up. In this specific case, I would expect (once again, assuming all processes update t_start at the same time, which is not true) max(MPI_Barrier) ~= 2 * max(MPI_Bcast).
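As a very rough model (assuming a binary tree over P tasks, a uniform per-hop latency L, and no contention, which are all simplifications):

    MPI_Bcast   ~ L * log2(P)        (one traversal of the tree)
    MPI_Barrier ~ 2 * L * log2(P)    (down the tree and all the way back up)

    e.g. P = 64, L = 1 us  ->  bcast ~ 6 us, barrier ~ 12 us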

I recommend you evaluate all the algorithms for MPI_Bcast() and MPI_Barrier() and compare only the best ones.
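With Open MPI you can do that by forcing a specific algorithm in the tuned coll component, for example (the algorithm numbers below are placeholders; list the valid values with ompi_info first):

    # list the available algorithms and their numbers
    ompi_info --param coll tuned --level 9 | grep algorithm

    # force a given bcast / barrier algorithm
    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_bcast_algorithm 5 \
           --mca coll_tuned_barrier_algorithm 3 \
           ./osu_bcast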

Keep in mind the result also depends on how tasks are mapped to nodes
(e.g. tasks [0-23] on node 0 vs. tasks {0,24,48,...} on node 0)
and on how tasks are pinned within a node
(e.g. tasks [0-11] on socket 0 vs. tasks {0,2,4,...} on socket 0);
see the example mpirun options below.
Also, if you are using a fat-tree network, the result will depend on which nodes are used
(because of the InfiniBand routing tables).
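With Open MPI, the mapping and binding can be controlled (and verified) from the mpirun command line, for example:

    # pack consecutive ranks on a node, bind each rank to a core
    mpirun --map-by core --bind-to core --report-bindings ./osu_bcast
    # round-robin ranks across nodes instead
    mpirun --map-by node --bind-to core --report-bindings ./osu_bcast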

Cheers,

Gilles
On Mon, May 30, 2016 at 9:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

    These are very different algorithms, so performance might differ
    (greatly).

    For example, MPI_Bcast() on the root rank can simply MPI_Send() and
    return; if the message is short, this is likely an eager send, which is
    very fast. That means MPI_Bcast() can return before all ranks have
    received the data, or have even entered MPI_Bcast().

    On the other hand, MPI_Barrier() cannot return before all ranks
    have entered the barrier.

    also, you might find https://github.com/open-mpi/ompi/issues/1713
    useful.

    Cheers,

    Gilles


    On Monday, May 30, 2016, Saliya Ekanayake <esal...@gmail.com> wrote:

        Hi,

        I ran the OSU micro-benchmarks with Open MPI and noticed that a
        broadcast of a small number of bytes is faster than a barrier:
        2 us vs. 120 us.

        I am trying to understand how this can happen.

        Thank you
        Saliya






--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington


