On Oct 28, 2006, at 6:51 PM, Tony Ladd wrote:
George
Thanks for the references. However, I was not able to figure out whether
what I am asking is so trivial that it is simply passed over or so subtle
that it's been overlooked (I suspect the former).
No. The answer to your question was in the articles. We have more
than just the Rabenseifner reduce and all-reduce algorithms. Some of
the most common collective communication calls have up to 15
different implementations in Open MPI. Of course, each of these
implementations gives the best performance under some particular
conditions. Unfortunately, there is no single algorithm that gives
the best performance in all cases. As we have to deal with multiple
algorithms for each collective, we have to figure out which one is
better and where. This usually depends on the number of nodes in the
communicator, the message size, and the network properties. In
short, it's difficult to choose the best one without prior
knowledge of the networks you're trying to use. This is something
we're working on right now in Open MPI. Until then, it might
happen that at some particular points the collective
communications will not deliver the best possible
performance. However, a slowdown by a factor of 10 is quite
unbelievable. There might be something else going on there...
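
To illustrate why the best algorithm depends on the communicator size, the message size, and the network, here is a minimal sketch that picks between two textbook all-reduce strategies using a simple latency/bandwidth model. It is not Open MPI's actual decision logic. The function names, the default latency `alpha` and per-byte cost `beta`, and the use of ceil(log2(p)) for process counts that are not powers of two are all assumptions made for illustration.

```python
import math


def allreduce_costs(p, n_bytes, alpha, beta):
    """Estimated time in seconds for two all-reduce strategies.

    alpha is the per-message latency and beta the per-byte transfer time.
    The cost of the reduction arithmetic itself is ignored (assumption).
    """
    steps = math.ceil(math.log2(p))
    return {
        # Recursive doubling: log2(p) steps, each exchanging the full vector.
        "recursive_doubling": steps * (alpha + n_bytes * beta),
        # Reduce-scatter followed by allgather: twice as many messages,
        # but only about 2*n bytes moved in total per process.
        "reduce_scatter_allgather": 2 * steps * alpha
        + 2 * (p - 1) / p * n_bytes * beta,
    }


def pick_algorithm(p, n_bytes, alpha=50e-6, beta=1e-8):
    """Return the name of the cheaper strategy under the model.

    The defaults (50 microseconds, 100 Mbytes/sec) are hypothetical values,
    not measurements of any real network.
    """
    costs = allreduce_costs(p, n_bytes, alpha, beta)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for size in (1024, 16 * 1024, 1024 * 1024):
        print(size, pick_algorithm(32, size))
```

Changing `alpha` and `beta` moves the crossover point between the two strategies, which is why a choice tuned for one network can be a poor one on another.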
Thanks,
george.
PS: BTW, which version of Open MPI are you using? The one that delivers
the best performance for the collective communications (at least on
high performance networks) is the nightly release of the 1.2 branch.
The binary tree algorithm in MPI_Allreduce takes a time proportional to
2*N*log_2(M), where N is the vector length and M is the number of
processes. There is a divide and conquer strategy
(http://www.hlrs.de/organization/par/services/models/mpi/myreduce.html)
that mpich uses to do an MPI_Reduce in a time proportional to N. Is this
algorithm or something equivalent in OpenMPI at present? If so, how do I
turn it on?
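
The two scaling laws above can be compared with a short sketch. It counts only the data moved and ignores latency. The constant 2*(M-1)/M for the divide and conquer strategy is an assumption taken from the usual reduce-scatter plus gather cost model; the message itself only says "proportional to N". The function names are illustrative.

```python
import math


def binary_tree_time(n, m):
    """Time proportional to 2*N*log_2(M), as stated for the binary tree."""
    return 2 * n * math.log2(m)


def divide_and_conquer_time(n, m):
    """Time proportional to N; the factor 2*(M-1)/M is an assumed constant."""
    return 2 * n * (m - 1) / m


def speedup(m):
    """Ratio of the two estimates; independent of the vector length N."""
    return binary_tree_time(1, m) / divide_and_conquer_time(1, m)


if __name__ == "__main__":
    for m in (16, 24, 32, 48):
        print(f"M={m}: estimated speedup {speedup(m):.2f}")
```

Under this model the ratio grows roughly like log_2(M), so the gap between the two strategies widens as processes are added.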
I also found that OpenMPI is sometimes very slow on MPI_Allreduce using
TCP. Things are OK up to 16 processes, but at 24 or more the rates
(message length divided by time) are as follows:
Message size    Throughput (Mbytes/sec)
(Kbytes)        M=24    M=32    M=48
     1          1.38    1.30    1.09
     2          2.28    1.94    1.50
     4          2.92    2.35    1.73
     8          3.56    2.81    1.99
    16          3.97    1.94    0.12
    32          0.34    0.24    0.13
    64          3.07    2.33    1.57
   128          3.70    2.80    1.89
   256          4.10    3.10    2.08
   512          4.19    3.28    2.08
  1024          4.36    3.36    2.17
Around 16-32 Kbytes there is a pronounced slowdown, roughly a factor
of 10, which seems too much. Any idea what's going on?
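
One way to quantify the dip is sketched below. For each process count it finds the message size where throughput falls furthest below the best rate already reached at a smaller size. The numbers are copied from the table above; the helper name `worst_dip` and this definition of "slowdown" are illustrative choices, not a standard metric.

```python
# Throughput in Mbytes/sec from the table above.
# Keys are message sizes in Kbytes; values are (M=24, M=32, M=48).
throughput = {
    1: (1.38, 1.30, 1.09),
    2: (2.28, 1.94, 1.50),
    4: (2.92, 2.35, 1.73),
    8: (3.56, 2.81, 1.99),
    16: (3.97, 1.94, 0.12),
    32: (0.34, 0.24, 0.13),
    64: (3.07, 2.33, 1.57),
    128: (3.70, 2.80, 1.89),
    256: (4.10, 3.10, 2.08),
    512: (4.19, 3.28, 2.08),
    1024: (4.36, 3.36, 2.17),
}
process_counts = (24, 32, 48)


def worst_dip(column):
    """Return (size_kb, slowdown) for the deepest dip in one table column.

    slowdown is the best throughput seen at any smaller message size
    divided by the throughput at size_kb.
    """
    best_so_far = 0.0
    worst = (None, 1.0)
    for size in sorted(throughput):
        rate = throughput[size][column]
        if best_so_far > 0 and best_so_far / rate > worst[1]:
            worst = (size, best_so_far / rate)
        best_so_far = max(best_so_far, rate)
    return worst


if __name__ == "__main__":
    for column, m in enumerate(process_counts):
        size_kb, slowdown = worst_dip(column)
        print(f"M={m}: deepest dip at {size_kb} Kbytes, factor {slowdown:.1f}")
```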
Tony
-------------------------------
Tony Ladd
Chemical Engineering
University of Florida
PO Box 116005
Gainesville, FL 32611-6005
Tel: 352-392-6509
FAX: 352-392-9513
Email: tl...@che.ufl.edu
Web: http://ladd.che.ufl.edu