You're not doing anything wrong; Open MPI simply doesn't [yet] handle
node failures well.  The spawn will probably *eventually* give up with a
timeout (and therefore fail).

You might want to run a real resource manager on your cluster, such as
SLURM, Torque, or one of several commercial solutions.  These typically
run a daemon on each node, so they get fairly reliable notifications when
a node goes down.
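
If you want a quick pre-flight check before calling
MPI_Comm_spawn_multiple, one option is to probe each node over TCP with a
short timeout and only spawn on the nodes that answer.  The sketch below
is just that, a sketch: the helper names (is_reachable, reachable_hosts)
are made up for illustration, and it assumes each node accepts TCP
connections on port 22 (ssh), which may not match your setup.  It only
shows that a node is up on the network, not that it can actually run MPI
jobs.

```python
import socket
import sys


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Port 22 is an assumption (ssh); use whatever service your nodes expose.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # name-resolution errors) if the host cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reachable_hosts(hosts, port=22, timeout=2.0):
    """Return the subset of `hosts` that answered, preserving order."""
    return [h for h in hosts if is_reachable(h, port, timeout)]


if __name__ == "__main__":
    # Usage: python check_nodes.py node1 node2 ...
    # Prints one reachable host per line.
    for host in reachable_hosts(sys.argv[1:]):
        print(host)
```

The output could be used to build the host list you hand to the spawn
call, so that a dead node is skipped up front instead of stalling the
supervisor.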



> -----Original Message-----
> From: users-boun...@open-mpi.org 
> [mailto:users-boun...@open-mpi.org] On Behalf Of 
> laurent.po...@fr.thalesgroup.com
> Sent: Tuesday, April 25, 2006 4:58 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] Checking the cluster status with
> MPI_Comm_spawn_multiple
> 
> Hi, 
> 
> Before starting programs on my cluster, I want to check on 
> every CPU if it is up and able to run MPI applications.
> 
> For this, I use a kind of 'ping' program that just sends a 
> message saying 'I'm OK' to a supervisor program.
> The 'ping' program is started by the supervisor on each CPU with 
> the MPI_Comm_spawn_multiple command.
> 
> It works fine when every CPU is up, but when one is down, my 
> supervisor stops when calling the MPI_Comm_spawn_multiple command.
> 
> So the questions are: 
> * 'What am I doing wrong?'
> * 'Is there another way to check my CPUs?'
> 
> Thanks for your help.
> 
>       Laurent.
