Thanks for your answer. Your example addresses one possible situation: a parallel application spawned by a driver with MPI_Comm_spawn, or several parallel applications spawned at the same time with MPI_Comm_spawn_multiple, over a set of processors described in the machinefile. It is also fine if the next spawn occurs after some processes at the beginning of the machinefile have stopped.
However, I have another case at hand where the spawned processes are truly dynamic over time. Any child process can stop (not necessarily one of the first in the machinefile), which frees processors on which newly spawned processes must then run. With LAM/MPI this situation has a satisfactory solution through the info argument of MPI_Comm_spawn: it lets you specify a "local" machinefile for the spawned processes, instead of always reusing the same machinefile from its beginning, as in your example.

Do you know whether this specific feature will be implemented in Open MPI (I hope it will),
and possibly when?
Dynamic applications really need it.
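
A minimal sketch, in C and under stated assumptions, of the bookkeeping this dynamic case needs on the driver side: track which machinefile hosts are busy, mark a host free when its child stops, and build a comma-separated list of free hosts for the next spawn. The names host_pool, pool_release and pool_reserve are hypothetical. How the list is then handed to MPI_Comm_spawn is implementation-specific: the MPI standard reserves a "host" info key for spawn, but whether Open MPI 1.0.1 honours it, and whether it accepts a comma-separated list, is an assumption to check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical driver-side record of which machinefile hosts are in use. */
typedef struct {
    const char **hosts; /* host names, in machinefile order */
    bool *busy;         /* busy[i] is true while a child runs on hosts[i] */
    size_t n_hosts;
} host_pool;

/* Mark the host on which a finished child was running as free again. */
void pool_release(host_pool *p, size_t idx)
{
    assert(idx < p->n_hosts);
    p->busy[idx] = false;
}

/* Reserve up to `want` free hosts, scanning in machinefile order, and write
 * them into buf as "hostA,hostB,...". Returns the number of hosts reserved.
 * Stops early (keeping buf NUL-terminated) if buf is too small. */
size_t pool_reserve(host_pool *p, size_t want, char *buf, size_t buflen)
{
    size_t got = 0, used = 0;
    if (buflen == 0)
        return 0;
    buf[0] = '\0';
    for (size_t i = 0; i < p->n_hosts && got < want; i++) {
        if (p->busy[i])
            continue;
        int n = snprintf(buf + used, buflen - used, "%s%s",
                         got ? "," : "", p->hosts[i]);
        if (n < 0 || (size_t)n >= buflen - used) {
            buf[used] = '\0'; /* drop the truncated entry */
            break;
        }
        used += (size_t)n;
        p->busy[i] = true;
        got++;
    }
    return got;
}
```

For example, with all five hosts of Edgar's machinefile busy, releasing linux13 and linux15 and then reserving two hosts produces the string "linux13,linux15", which the driver could pass in the info object of the next spawn (again, assuming the implementation supports it).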

Best Regards,
Jean Latour

Edgar Gabriel wrote:

So in my tests, Open MPI did follow the machinefile (see the output
further below). However, for each spawn operation it starts again from the very
beginning of the machinefile.

The following example spawns 5 child processes (with a single
MPI_Comm_spawn), and each child prints its rank and the hostname.

gabriel@linux12 ~/dyncomm $ mpirun -hostfile machinefile -np 3 ./dyncomm_spawn_father
 Checking for MPI_Comm_spawn.....................working
Hello world from child 0 on host linux12
Hello world from child 1 on host linux13
Hello world from child 3 on host linux15
Hello world from child 4 on host linux16
     Testing Send/Recv on the intercomm..........working
Hello world from child 2 on host linux14


with the machinefile being:
gabriel@linux12 ~/dyncomm $ cat machinefile
linux12
linux13
linux14
linux15
linux16

In your code, you always spawn one process at a time, and that is why they all end up on the same node.
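
To make the difference concrete, here is a small C model of the two placement policies discussed in this thread. It is not Open MPI or LAM source code: place_from_start, rr_sched and place_round_robin are hypothetical names, and wrapping around to the first host when there are more children than hosts is an assumption. The first function restarts at the top of the machinefile on every call, as described above; the second keeps a cursor across calls, which is roughly what a round-robin info key of the kind mentioned in this thread would be expected to provide (an assumption, not LAM's documented semantics).

```c
#include <assert.h>
#include <stddef.h>

/* Policy observed in the test above: every spawn call starts again at
 * hosts[0]; child i goes to hosts[i % n_hosts] (wrap-around assumed). */
void place_from_start(const char *hosts[], size_t n_hosts,
                      size_t n_children, const char *out[])
{
    assert(n_hosts > 0);
    for (size_t i = 0; i < n_children; i++)
        out[i] = hosts[i % n_hosts];
}

/* Hypothetical alternative: a cursor that survives between spawn calls. */
typedef struct {
    const char **hosts;
    size_t n_hosts;
    size_t next; /* index of the host the next child will use */
} rr_sched;

void place_round_robin(rr_sched *s, size_t n_children, const char *out[])
{
    assert(s->n_hosts > 0);
    for (size_t i = 0; i < n_children; i++) {
        out[i] = s->hosts[s->next];
        s->next = (s->next + 1) % s->n_hosts;
    }
}
```

With the five-host machinefile above, one call spawning 5 children maps them to linux12 through linux16, whereas repeated single-child calls to place_from_start always return linux12; place_round_robin would instead place successive single children on linux12, linux13, and so on.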

Hope this helps...
Edgar


Edgar Gabriel wrote:

As far as I know, Open MPI should follow the machinefile for spawn operations; however, every spawn starts again at the beginning of the machinefile. An info key such as 'lam_sched_round_robin' is currently not available/implemented. Let me look into this...

Jean Latour wrote:


Hello,

While testing the MPI_Comm_spawn function of Open MPI version 1.0.1, I have an example that works, except that the spawned processes do not follow the processor assignment given in the "machinefile". In this example, a master process first spawns 2 processes, then disconnects from them and spawns 2 more. Running on Quad Opteron nodes, all processes run on the same node, although the machinefile specifies that the slaves should run on different nodes.

With the current version of Open MPI, is it possible to direct the spawned processes to a specific node? (The node distribution could be given in the "machinefile", as with LAM/MPI.)

The Fortran 90 code for this example and the makefile are attached as a tar file.

Thank you very much

Jean Latour


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

