I have an MPI program that aggregates data from multiple SQL systems. It all runs fine. To test fault tolerance, I switch one of the machines off while the job is running. The result is always a hang, i.e. mpirun never completes. To try to avoid this, I replaced the send and receive calls with immediate calls (i.e. MPI_Isend, MPI_Irecv) so that I could trap long-waiting sends and receives, but it makes no difference. My requirement is that either all processes complete or mpirun exits with an error, no matter where they are in their execution when a failure occurs. The system must be able to carry on when a machine dies: the job should fail, so that I can regroup and re-cast it over the remaining nodes.
I am running FC10, gcc 4.3.2 and Open MPI 1.4.1 on dual-core Intel machines with 4 GB RAM, all x86_64.

============================================================
The commands I have tried:

mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"
mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

============================================================
The results:

recv returned 0 with status 0
waited # 2000002 tiumes - now status is 0 flag is -1976147192
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate

============================================================
As you can see, my trap can signal an abort and the TCP layer can time out, but mpirun just keeps on running.

Any help greatly appreciated.

Vlad
ompi.info.tar.gz
Description: GNU Zip compressed data