v1.1 does not have the tuned collectives (I think, but now I'm not 100% sure anymore), or at least they were not active by default. The first version with the tuned collectives will be 1.2. The current decision function (from the nightly builds) targets high-performance networks with two characteristics: low latency (4-5 microseconds) and high bandwidth (over 1 Gb/s).

There are several implementations for each of the algorithms. Some are wired in and some are not. The most difficult part is making sure each of these implementations is correct (from the MPI point of view) and gives the expected answer in all circumstances. The more functions we have, the more tests we have to perform, and right now that's the main limitation. We have other algorithms implemented which are not in Open MPI right now. They will come as soon as they are tested well enough for us to feel confident about their correctness.

Here are the answers:
1. Not all algorithms are wired to be shown by ompi_info. Anything out of range is set to the default value, which means the current decision function.
2. The allreduce algorithms are coming soon. Btw, all algorithms inside Open MPI support segmentation, and all of the tree-based ones support a fanout input (number of children).
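For example, to force a specific bcast algorithm together with a segment size and a tree fanout, you can set the tuned component's MCA parameters on the mpirun command line. The parameter names below are from memory, so please check the output of "ompi_info --param coll tuned" for the exact spelling and the valid ranges:

  mpirun -np 16 \
      --mca coll_tuned_use_dynamic_rules 1 \
      --mca coll_tuned_bcast_algorithm 3 \
      --mca coll_tuned_bcast_algorithm_segmentsize 8192 \
      --mca coll_tuned_bcast_algorithm_tree_fanout 4 \
      ./your_app

As described above, an out-of-range algorithm number simply falls back to the default decision function.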

Time is the only thing we're missing right now ... i.e. the weeks (now without the s) before SC.

  george.


On Nov 2, 2006, at 11:00 PM, Tony Ladd wrote:

George

I found the info I think you were referring to. Thanks. I then experimented essentially randomly with different algorithms for allreduce. But the issue with really bad performance for certain message sizes persisted with v1.1. The good news is that the upgrade to 1.2 fixed my worst problem. Now the performance is reasonable for all message sizes. I will test the tuned algorithms again ASAP.

I had a couple of questions:

1) ompi_info lists only 3 or 4 algorithms for allreduce and reduce and about 5 for bcast, but you can use higher numbers as well. Are these additional undocumented algorithms (you mentioned a number like 15), or is it ignoring out-of-range parameters?
2) It seems that for allreduce you can select a tuned reduce and a tuned bcast instead of the binary tree. But there is a faster allreduce which is order 2N rather than the 4N of Reduce + Bcast (N is the message size). It segments the vector and distributes the reduction among the nodes; in an allreduce there is no need to gather the reduced vector to one processor and then broadcast it again. I wrote a simple version for powers of 2 (MPI_SUM). Any chance of it being implemented in OMPI?
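
To make the idea concrete, here is a rough sketch of the kind of thing I mean (MPI_SUM on doubles, a power-of-two number of processes, and a vector length divisible by the number of processes). This is not the exact code I wrote, and certainly not Open MPI internals, just an illustration of the reduce-scatter + allgather scheme:

#include <mpi.h>
#include <stdlib.h>

/* buf: count doubles, summed in place across all ranks of comm */
void allreduce_sum_pow2(double *buf, int count, MPI_Comm comm)
{
    int rank, size, blk, lo, len, dist, half, keep_lo, send_lo;
    int partner, own_lo, partner_lo, i;
    double *tmp;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    blk = count / size;               /* elements per block (one per rank) */
    tmp = (double *) malloc((count / 2) * sizeof(double));

    /* Phase 1: recursive-halving reduce-scatter.  After log2(size) steps
       rank r holds block r of the fully reduced vector. */
    lo = 0;
    len = size;
    for (dist = size / 2; dist >= 1; dist /= 2) {
        partner = rank ^ dist;
        half = len / 2;
        keep_lo = (rank < partner) ? lo : lo + half;  /* half I keep      */
        send_lo = (rank < partner) ? lo + half : lo;  /* half I send away */
        MPI_Sendrecv(buf + send_lo * blk, half * blk, MPI_DOUBLE, partner, 0,
                     tmp, half * blk, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        for (i = 0; i < half * blk; i++)
            buf[keep_lo * blk + i] += tmp[i];
        lo = keep_lo;
        len = half;
    }

    /* Phase 2: recursive-doubling allgather of the reduced blocks. */
    for (dist = 1; dist < size; dist *= 2) {
        partner = rank ^ dist;
        own_lo = (rank / dist) * dist;          /* start of my run        */
        partner_lo = (partner / dist) * dist;   /* start of partner's run */
        MPI_Sendrecv(buf + own_lo * blk, dist * blk, MPI_DOUBLE, partner, 1,
                     buf + partner_lo * blk, dist * blk, MPI_DOUBLE, partner, 1,
                     comm, MPI_STATUS_IGNORE);
    }

    free(tmp);
}

Each process sends and receives roughly N elements in each of the two phases, so about 2N in total, compared with roughly 4N for a reduce followed by a bcast.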

Tony

