I have an MPI program that aggregates data from multiple SQL systems.  It all 
runs fine.  To test fault tolerance I switch one of the machines off while it 
is running.  The result is always a hang, i.e. mpirun never completes.
 
To try to avoid this I have replaced the send and receive calls with immediate 
(non-blocking) calls, i.e. MPI_Isend and MPI_Irecv, so that I can trap sends 
and receives that wait too long, but it makes no difference.

My requirement is that either all processes complete or mpirun exits with an 
error, no matter where they are in their execution when a failure occurs.  If 
a machine dies, the job must fail rather than hang, so that the system can 
regroup and re-cast the job over the remaining nodes.
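
For reference, here is a minimal sketch of the kind of trap described above. It is not the actual ingsprinkle code. It polls a completion check against a wall-clock deadline instead of counting iterations. The names `poll_fn`, `wait_with_deadline`, `poll_countdown`, `poll_never`, `poll_mpi_request`, `recv_or_abort` and the `USE_MPI` guard are all hypothetical, and the MPI part is only compiled when `USE_MPI` is defined. The abort code 5 is the one that appears in the results further down.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime under strict -std= modes */
#endif

#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns nonzero once the operation is done. */
typedef int (*poll_fn)(void *ctx);

/* Monotonic wall-clock time in seconds. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Poll until the callback reports completion or timeout_sec elapses.
 * Returns 0 on completion, -1 on timeout. */
int wait_with_deadline(poll_fn poll, void *ctx, double timeout_sec)
{
    assert(poll != NULL);
    double deadline = now_sec() + timeout_sec;
    for (;;) {
        if (poll(ctx))
            return 0;
        if (now_sec() >= deadline)
            return -1;
    }
}

/* Stand-in callback: completes on the N-th call, where N is the int
 * that ctx points to. Lets the timeout logic be tried without a cluster. */
int poll_countdown(void *ctx)
{
    int *remaining = (int *)ctx;
    return --(*remaining) <= 0;
}

/* Stand-in callback: never completes, like a receive from a dead node. */
int poll_never(void *ctx)
{
    (void)ctx;
    return 0;
}

#ifdef USE_MPI
#include <mpi.h>

/* Adapter: one MPI_Test call on the request passed through ctx. */
static int poll_mpi_request(void *ctx)
{
    int flag = 0;  /* initialised, so it is well defined before MPI_Test sets it */
    MPI_Test((MPI_Request *)ctx, &flag, MPI_STATUS_IGNORE);
    return flag;
}

/* Post a non-blocking receive and abort the whole job if it does not
 * complete in time. Whether mpirun then exits is exactly the open
 * question in this post. */
void recv_or_abort(void *buf, int count, int src, int tag, double timeout_sec)
{
    MPI_Request req;
    MPI_Irecv(buf, count, MPI_BYTE, src, tag, MPI_COMM_WORLD, &req);
    if (wait_with_deadline(poll_mpi_request, &req, timeout_sec) != 0)
        MPI_Abort(MPI_COMM_WORLD, 5);
}
#endif
```

Built without `-DUSE_MPI`, the two stand-in callbacks exercise the timeout logic on a single machine. Built with `mpicc -DUSE_MPI`, the adapter wraps `MPI_Test` around the same loop.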

I am running FC10, gcc 4.3.2 and Open MPI 1.4.1 on dual-core Intel machines, 
all x86_64, with 4 GB RAM each.


===============================================================================================================
The commands I have tried:
mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"



===============================================================================================================

The results:
recv returned 0 with status 0
waited  # 2000002 tiumes - now status is  0 flag is -1976147192
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)

^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate


===============================================================================================================

As you can see, my trap can signal an abort and the TCP layer can time out, 
but mpirun just keeps on running.

Any help greatly appreciated.
Vlad





      

Attachment: ompi.info.tar.gz
Description: GNU Zip compressed data
