Hi,

I'm developing MPI support for XtreemOS (www.xtreemos.eu) so that an MPI program is managed as a single XtreemOS job. To that end, I've developed a program, xos-createProcess, that plays the role of the rsh agent (replacing ssh/rsh): it starts a process on a remote machine that is among those reserved for the current job.

I'm running a simple hello world MPI program in which each process sends a string to process 0, which prints them on standard output.

When using OpenMPI with ssh, this program works perfectly on several machines.

When using OpenMPI with my launcher xos-createProcess, it works with an MPI program of 2 processes on 2 different machines.

However, I cannot get past the following error, which occurs when running an MPI program of 3 processes on 3 different machines (or any n processes on n different machines with n >= 3).

A process started by xos-createProcess on a remote machine ends with the following error:

[paradent-5.rennes.grid5000.fr:08191] [[50627,0],2] routed:binomial: Connection to lifeline [[50627,0],0] lost

But process 0 is still running, so the lifeline should not have been lost!
In fact, process 0 is still waiting for the remote processes to terminate (checked with gdb: the initial process is blocked in libc's poll()).


The run command is:

-bash -c '(mpirun --mca orte_rsh_agent xos-createProcess --leave-session-attached -np 2 -host `xreservation -a $XOS_RSVID` mpi/hello_world_MPI < /dev/null > mpirun.out) >& mpirun.err'

The problem is the same with or without the --leave-session-attached option.



So, how is the lifeline implemented? Why does it work with 2 processes but start failing with 3 or more?


I'm using Open MPI 1.6.


Thanks for your help.

--
Yann Radenac
Research Engineer, INRIA
Myriads research team, INRIA Rennes - Bretagne Atlantique
