That is effectively what I have done: I changed to immediate sends/receives 
and wait in a loop a finite number of times for the transfers to complete, 
calling MPI_Abort if they do not complete in a set time (roughly as in the 
sketch below).
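
A simplified sketch of the trap (not the actual ingsprinkle code; the buffer
arguments, tag and POLL_LIMIT are placeholders):

#include <mpi.h>

#define POLL_LIMIT 2000000L   /* finite number of polls before giving up */

/* Post a non-blocking receive and poll it a bounded number of times;
   abort the whole job if the transfer never completes. */
static int recv_or_abort(void *buf, int count, int src, int tag)
{
    MPI_Request req;
    MPI_Status  status;
    int flag = 0;
    long i;

    MPI_Irecv(buf, count, MPI_BYTE, src, tag, MPI_COMM_WORLD, &req);
    for (i = 0; i < POLL_LIMIT && !flag; i++)
        MPI_Test(&req, &flag, &status);    /* poll instead of blocking in MPI_Recv */

    if (!flag)
        MPI_Abort(MPI_COMM_WORLD, 5);      /* transfer never completed: give up */
    return flag;
}

The send side does the same thing with MPI_Isend in place of MPI_Irecv.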
It is not clear how I can kill mpirun in a manner consistent with the API.
Are you implying I should call exit() rather than MPI_Abort?

--- On Wed, 23/6/10, David Zhang <solarbik...@gmail.com> wrote:

From: David Zhang <solarbik...@gmail.com>
Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun
To: "Open MPI Users" <us...@open-mpi.org>
Received: Wednesday, 23 June, 2010, 4:37 PM

Since you turned the machine off instead of just killing one of the processes, 
no signals could be sent to the other processes.  Perhaps you could institute 
some sort of handshaking in your software that periodically checks for the 
attendance of all machines, and times out if not all are present within some 
allotted time?
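
For example, something along these lines (just a rough sketch; the tag,
timeout value and function name are made up):

#include <mpi.h>

#define ROLLCALL_TAG      99
#define ROLLCALL_TIMEOUT  30.0   /* seconds to wait before declaring a node missing */

/* Every rank reports in to rank 0; rank 0 returns 0 if anyone fails to
   answer within the timeout. */
static int check_attendance(MPI_Comm comm)
{
    int rank, size, i, dummy = 1;
    double start;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != 0) {
        /* Blocking send here for brevity; this too could be an Isend
           with its own timeout. */
        MPI_Send(&dummy, 1, MPI_INT, 0, ROLLCALL_TAG, comm);
        return 1;
    }

    start = MPI_Wtime();
    for (i = 1; i < size; i++) {
        MPI_Request req;
        int flag = 0;

        MPI_Irecv(&dummy, 1, MPI_INT, i, ROLLCALL_TAG, comm, &req);
        while (!flag && MPI_Wtime() - start < ROLLCALL_TIMEOUT)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);

        if (!flag) {                  /* rank i never answered the roll call */
            MPI_Cancel(&req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            return 0;
        }
    }
    return 1;
}

The caller could then decide whether to abort or to re-cast the job over the
ranks that did answer.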



On Tue, Jun 22, 2010 at 10:43 PM, Randolph Pullen 
<randolph_pul...@yahoo.com.au> wrote:

I have an MPI program that aggregates data from multiple SQL systems.  It all 
runs fine.  To test fault tolerance I switch one of the machines off while it 
is running.  The result is always a hang, i.e. mpirun never completes.

To try to avoid this I have replaced the send and receive calls with immediate 
calls (i.e. MPI_Isend, MPI_Irecv) to try to trap long-waiting sends and 
receives, but it makes no difference.
My requirement is that either all processes complete or mpirun exits with an 
error, no matter where they are in their execution when a failure occurs.  This 
system must continue (i.e. fail) if a machine dies, then regroup and re-cast 
the job over the remaining nodes.



I am running FC10, gcc 4.3.2 and Open MPI 1.4.1 on dual-core Intel x86_64 
machines with 4 GB RAM each.


===============================================================================================================
The commands I have tried:
mpirun -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca btl ^sm -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

mpirun -mca orte_forward_job_control 1 -hostfile ~/mpd.hosts -np 6 ./ingsprinkle test t3 "select * from tab"

===============================================================================================================

The results:
recv returned 0 with status 0
waited  # 2000002 tiumes - now status is  0 flag is -1976147192


--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 5.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.


You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------


mpirun has exited due to process rank 0 with PID 29141 on
node bd01 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

[*** wait a long time ***]
[bd01:29136] [[55293,0],0]-[[55293,0],1] mca_oob_tcp_msg_recv: readv failed: 
Connection reset by peer (104)



^Cmpirun: abort is already in progress...hit ctrl-c again to forcibly terminate


===============================================================================================================

As you can see, my trap can signal an abort and the TCP layer can time out, 
but mpirun just keeps on running...



Any help greatly appreciated.
Vlad

-- 
David Zhang
University of California, San Diego