Hi,
I'm developing MPI support for XtreemOS (www.xtreemos.eu) so that an MPI
program is managed as a single XtreemOS job.
To manage all processes as a single XtreemOS job, I've developed the
program xos-createProcess, which plays the role of the rsh agent
(replacing ssh/rsh) and starts a process on a remote machine that is
among those reserved for the current job.
I'm running a simple hello world MPI program in which each process sends
a string to process 0, which prints them on standard output.
When using OpenMPI with ssh, this program works perfectly on several
machines.
When using OpenMPI with my launcher xos-createProcess, it works with an
MPI program of 2 processes on 2 different machines.
However, I cannot get past the following error, which occurs when
running an MPI program of 3 processes on 3 different machines (or any n
processes on n different machines with n >= 3).
A process started by xos-createProcess on a remote machine ends with the
following error:
[paradent-5.rennes.grid5000.fr:08191] [[50627,0],2] routed:binomial:
Connection to lifeline [[50627,0],0] lost
But process 0 is still running! The lifeline should not have been lost!
In fact, process 0 is still waiting for the remote processes to terminate
(checked with gdb: the initial process is blocked in libc's poll()).
The run command is:
-bash -c '(mpirun --mca orte_rsh_agent xos-createProcess
--leave-session-attached -np 2 -host `xreservation -a $XOS_RSVID`
mpi/hello_world_MPI < /dev/null > mpirun.out) >& mpirun.err'
The problem is the same with or without the --leave-session-attached
option.
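
To make the invocation easier to read and to vary the process count when
reproducing the problem, here is a minimal Python sketch that assembles
the same mpirun arguments. The helper name build_mpirun_argv and the host
names are hypothetical; the sketch assumes the reservation tool yields a
list of hosts, which is passed to -host as a comma-separated string. It
only builds and prints the command; it does not launch anything.

```python
import shlex


def build_mpirun_argv(agent, num_procs, hosts, program, leave_attached=True):
    """Assemble the mpirun argument list matching the run command above.

    agent          -- program used as the rsh agent (here xos-createProcess)
    num_procs      -- value passed to -np
    hosts          -- list of host names, joined with commas for -host
    program        -- path of the MPI executable
    leave_attached -- whether to add --leave-session-attached
    """
    argv = ["mpirun", "--mca", "orte_rsh_agent", agent]
    if leave_attached:
        argv.append("--leave-session-attached")
    argv += ["-np", str(num_procs), "-host", ",".join(hosts), program]
    return argv


if __name__ == "__main__":
    # Hypothetical host names standing in for the reserved machines.
    hosts = ["node-a", "node-b", "node-c"]
    argv = build_mpirun_argv(
        "xos-createProcess", len(hosts), hosts, "mpi/hello_world_MPI"
    )
    # Print a shell-quoted command line for inspection.
    print(shlex.join(argv))
```

Calling it with three hosts gives the failing configuration (3 processes
on 3 machines); passing leave_attached=False drops the
--leave-session-attached option.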
So, how is the lifeline implemented? Why does it work with 2 processes
but fail with 3 or more?
I'm using Open MPI 1.6.
Thanks for your help.
--
Yann Radenac
Research Engineer, INRIA
Myriads research team, INRIA Rennes - Bretagne Atlantique