You're not doing anything wrong; Open MPI just doesn't handle failures well [yet]. The call will probably time out *eventually* (and therefore fail).
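Until then, one possible workaround is to probe each node outside of MPI before calling MPI_Comm_spawn_multiple, so that a dead node is detected with a timeout you control. The following is only a minimal sketch in Python. It assumes every node accepts TCP connections on some known port (port 22 for sshd here, which is an assumption; the original post does not say how the nodes are reached), and the names `is_reachable` and `check_hosts` are made up for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def is_reachable(host, port=22, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_hosts(hosts, port=22, timeout=2.0):
    """Probe all hosts in parallel and return a dict {host: reachable}."""
    hosts = list(hosts)
    if not hosts:
        return {}
    # Assumption: 32 concurrent probes is plenty for a small cluster.
    with ThreadPoolExecutor(max_workers=min(32, len(hosts))) as pool:
        results = pool.map(lambda h: is_reachable(h, port, timeout), hosts)
        return dict(zip(hosts, results))
```

The supervisor could then hand only the hosts that answered to MPI_Comm_spawn_multiple. Note that an open port only shows the node is up on the network; it does not prove that an MPI process can actually be started there.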
You might want to run a real resource manager to manage your cluster, such as SLURM, Torque, or one of several commercial solutions. These typically run a daemon on each node, so they get fairly good notification when a node goes down.

> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of laurent.po...@fr.thalesgroup.com
> Sent: Tuesday, April 25, 2006 4:58 AM
> To: us...@open-mpi.org
> Subject: [OMPI users] Checking the cluster status with MPI_Comm_spawn_multiple
>
> Hi,
>
> Before starting programs on my cluster, I want to check on
> every CPU if it is up and able to run MPI applications.
>
> For this, I use a kind of 'ping' program that just sends a
> message saying 'I'm OK' to a supervisor program.
> The 'ping' program is started by the supervisor on each CPU with
> the MPI_Comm_spawn_multiple command.
>
> It works fine when every CPU is up, but when one is down, my
> supervisor hangs when calling the MPI_Comm_spawn_multiple command.
>
> So the questions are:
> * 'What am I doing wrong?'
> * 'Is there another way to check my CPUs?'
>
> Thanks for your help.
>
> Laurent.