You can add the following MCA parameters either on the command line or
in the $HOME/.openmpi/mca-params.conf file.
On Nov 2, 2009, at 08:52, George Markomanolis wrote:
Dear all,
I would like to ask about collective communication. With debug mode
enabled, I can see a lot of information during execution, such as
which algorithm is used. However, I would like to use a specific
algorithm (the simplest one, I suppose). I am profiling some
applications and want to simulate them with another program, so I must
know, for example, what MPI_Allreduce is doing. I saw many algorithms
that depend on the message size and the number of processors, so I
would like to ask:
1) How can I tell Open MPI to use a simple algorithm for allreduce
(is there any way to request the simplest algorithm for all
collective communication?). Basically I would like to know the root
CPU for every collective communication. What are the disadvantages of
requesting the simplest algorithm?
Set coll_tuned_use_dynamic_rules=1 to allow you to manually select the
algorithms to be used.
Set coll_tuned_allreduce_algorithm=*something between 0 and 5* to
select the algorithm to be used. For the simplest algorithm I guess
you will want to use 1 (star-based fan-in fan-out).
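As a sketch of how these two settings might look in practice, assuming
algorithm 1 is the one you want: the parameter names and values come
from the reply above, while ./my_app and the process count of 4 are
placeholders. Lines starting with # are comments.

```
# $HOME/.openmpi/mca-params.conf
# Let manually selected algorithms override the built-in decision rules.
coll_tuned_use_dynamic_rules = 1
# 1 = the simplest allreduce algorithm (star-based fan-in fan-out).
coll_tuned_allreduce_algorithm = 1

# Equivalent command line (./my_app and -np 4 are placeholders):
#   mpirun --mca coll_tuned_use_dynamic_rules 1 \
#          --mca coll_tuned_allreduce_algorithm 1 -np 4 ./my_app
```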
The main disadvantage is that the cost of the allreduce will rise,
which will negatively impact the overall performance of the application.
2) Is there any overhead because I installed Open MPI in debug mode,
even if I just run a program without any --mca flag?
There is considerable overhead because you compiled in debug mode. We
do a lot of extra tracking of internally allocated memory, checks on
most/all internal objects, and so on. Based on previous results I
would say your latency increases by about 2-3 microseconds, but the
impact on bandwidth is minimal.
3) How would you describe allreduce in words? Can we say that the
root CPU does a reduce and then a broadcast? I mean, is that right for
your implementation? I saw that which CPU is the root depends on the
algorithm, so is it possible to use an algorithm where I know that
the CPU with rank 0 is always the root?
Exactly, allreduce = reduce + bcast (and btw this is what the basic
algorithm will do). However, there is no root in an allreduce, as
all processors execute symmetric work. Of course, if one sees the
allreduce as a reduce followed by a broadcast, then one has to select a
root (I guess we pick rank 0 in our implementation).
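For the simulation use case, the "reduce followed by broadcast" view
can be sketched in plain Python. This is only an illustrative model
under stated assumptions, not Open MPI's code: the function names are
hypothetical, the root is assumed to be rank 0, values are combined in
rank order, and each message is recorded as a (sender, receiver) pair.

```python
from functools import reduce as fold
import operator


def linear_reduce(values, op, root=0):
    """Fan-in: every non-root rank sends its value to the root."""
    messages = [(rank, root) for rank in range(len(values)) if rank != root]
    # Assumption: the root combines contributions in rank order.
    result = fold(op, values)
    return result, messages


def linear_bcast(value, size, root=0):
    """Fan-out: the root sends the reduced value to every other rank."""
    messages = [(root, rank) for rank in range(size) if rank != root]
    return [value] * size, messages


def simulated_allreduce(values, op=operator.add, root=0):
    """Model allreduce as a linear reduce followed by a linear bcast."""
    reduced, fan_in = linear_reduce(values, op, root)
    results, fan_out = linear_bcast(reduced, len(values), root)
    return results, fan_in + fan_out


if __name__ == "__main__":
    # values[i] is the contribution of rank i.
    results, messages = simulated_allreduce([1, 2, 3, 4])
    print(results)   # one entry per rank, all holding the reduced value
    print(messages)  # fan-in messages first, then fan-out messages
```

In this model an allreduce over N ranks costs 2 * (N - 1) point-to-point
messages, all passing through the root, which is one way to see why the
simplest algorithm tends to scale poorly.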
george.
Thanks a lot,
George