On 5/30/2016 11:09 PM, Saliya Ekanayake wrote:
So, you mean that it guarantees the value received after the bcast
call is consistent with the value sent from the root, but it does not
have to wait until all the ranks have received it?
This is what I believe; double-checking the standard might not hurt,
though...
Still, in this benchmark shouldn't the max time for bcast be equal to
that of barrier?
No.
First, you should find out which algorithms are used for MPI_Barrier() and MPI_Bcast().
The choice is based on the communicator size and, for MPI_Bcast(), the message length.
Keep in mind the algorithm choice is likely not optimized for your network,
and it is not topology aware
(e.g. it is only based on the communicator size, not on the number of tasks per node,
so inter-node and intra-node communications are treated as equal).
Here is what osu_bcast does:
    /* timer, t_start, t_stop are doubles; buffer, size and options come
       from the benchmark's setup code */
    timer = 0.0;
    for (i = 0; i < options.iterations + options.skip; i++) {
        t_start = MPI_Wtime();
        MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
        t_stop = MPI_Wtime();
        /* only accumulate after the warm-up (skip) iterations */
        if (i >= options.skip) {
            timer += t_stop - t_start;
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
MPI_Bcast() for a short message does not take long, and since all tasks do
not exit MPI_Barrier() at exactly the same time, t_start is a local time,
not a global one (in other words, t_stop - t_start is already an approximation...).
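A minimal sketch of a more pessimistic measurement (my own example, not part
of osu_bcast, and it assumes buffer and size are set up as in the excerpt
above): synchronize first, time the call on each rank, then take the maximum
over all ranks so the slowest rank is not hidden:

    double t_start, t_local, t_max;
    MPI_Barrier(MPI_COMM_WORLD);          /* rough common starting point */
    t_start = MPI_Wtime();
    MPI_Bcast(buffer, size, MPI_CHAR, 0, MPI_COMM_WORLD);
    t_local = MPI_Wtime() - t_start;      /* this rank's view of the cost */
    /* report the slowest rank's time instead of only the root's */
    MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

Even this is only an approximation, since ranks do not leave the barrier at
exactly the same time.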
If MPI_Bcast() and MPI_Barrier() are implemented with a tree-based
algorithm, then MPI_Bcast() only has to go down the tree, whereas
MPI_Barrier() has to go down and then all the way back up.
For example, with 64 tasks and a binomial tree, a broadcast needs about
log2(64) = 6 latency steps, while a barrier needs about 12 (6 one way
plus 6 back).
In this specific case, I would expect (once again, assuming all
processes update t_start at the same time, which is not true)
max(MPI_Barrier) ~= 2 * max(MPI_Bcast)
I recommend you evaluate all the available algorithms for MPI_Bcast() and
MPI_Barrier() and compare only the best ones; see the example command line below.
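For example, with the tuned coll module you can force a given algorithm from
the command line (I am quoting the parameter names from memory, so double
check them and the list of algorithm ids with ompi_info --all):

    mpirun --mca coll_tuned_use_dynamic_rules true \
           --mca coll_tuned_bcast_algorithm 6 \
           --mca coll_tuned_barrier_algorithm 4 \
           -np 64 ./osu_bcast

Loop over the algorithm ids for both collectives and keep the best timing of each.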
Keep in mind the result also depends on how tasks are mapped to nodes
(e.g. tasks [0-23] on node 0, vs tasks {0, 24, 48, ...} on node 0),
and on how tasks are pinned within a node
(e.g. tasks [0-11] on socket 0, vs tasks {0, 2, 4, ...} on socket 0);
see the mpirun options below.
Also, if you are using a fat-tree network, the result will depend
on which nodes are used
(because of the InfiniBand routing tables).
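With Open MPI you can control and verify the mapping and pinning from the
mpirun command line, for example (option names as I remember them in the
1.8 and later series, check mpirun --help):

    # fill node 0 first vs round-robin the ranks across nodes
    mpirun --map-by core --bind-to core --report-bindings -np 48 ./osu_bcast
    mpirun --map-by node --bind-to core --report-bindings -np 48 ./osu_bcast

--report-bindings prints where each rank ended up, so you can confirm the
layout you think you are benchmarking.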
Cheers,
Gilles
On Mon, May 30, 2016 at 9:33 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
These are very different algorithms, so performance might differ (greatly).
For example, MPI_Bcast() on the root rank can simply MPI_Send() and return;
if the message is short, this is likely an eager send, which is very fast.
That means MPI_Bcast() can return on the root before all ranks have received
the data, or have even entered MPI_Bcast(); the sketch below illustrates this.
On the other hand, MPI_Barrier() cannot return before all ranks
have entered the barrier.
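Here is a small self-contained sketch (my own illustration, not from the
benchmark, and it relies on typical eager-send behavior rather than anything
the standard guarantees): the non-root ranks are delayed before entering
MPI_Bcast(), yet for a short message the root's elapsed time stays in the
microsecond range because it does not wait for the receivers:

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        int rank;
        char buf[8] = "hello";
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank != 0)
            sleep(2);             /* non-root ranks enter the bcast late */
        double t = MPI_Wtime();
        MPI_Bcast(buf, sizeof(buf), MPI_CHAR, 0, MPI_COMM_WORLD);
        t = MPI_Wtime() - t;
        /* rank 0 typically reports a few microseconds even though the
           other ranks only entered MPI_Bcast() about two seconds later */
        printf("rank %d spent %f s in MPI_Bcast()\n", rank, t);
        MPI_Finalize();
        return 0;
    }

A barrier placed in the same spot would instead make every rank wait for the
slowest one.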
Also, you might find https://github.com/open-mpi/ompi/issues/1713
useful.
Cheers,
Gilles
On Monday, May 30, 2016, Saliya Ekanayake <esal...@gmail.com> wrote:
Hi,
I ran the OSU (Ohio State) micro-benchmarks with Open MPI and noticed that a
broadcast of a small number of bytes is faster than a barrier - about
2 us vs 120 us.
I'm trying to understand how this could happen?
Thank you
Saliya
--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington