Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-24 Thread Ralph Castain
? > > Regards, > Randolph > > PS: excellent product, keep up the good work > --- On Thu, 24/6/10, Ralph Castain wrote: > > From: Ralph Castain > Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun > To: "Open MPI Users" > Received: Thursday,

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
-start failed processes on backup nodes without losing the current query. What are your thoughts? Regards, Randolph PS: excellent product, keep up the good work --- On Thu, 24/6/10, Ralph Castain wrote: From: Ralph Castain Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun To: "Ope

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Ralph Castain
de to receive or send a signal. > > > --- On Wed, 23/6/10, Jeff Squyres wrote: > > From: Jeff Squyres > Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun > To: "Open MPI Users" > Received: Wednesday, 23 June, 2010, 9:10 PM > > Open MP

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
node is powered off and can never exit as it appears to wait indefinitely for the missing node to receive or send a signal. --- On Wed, 23/6/10, Jeff Squyres wrote: From: Jeff Squyres Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun To: "Open MPI Users" Received: Wed

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Jeff Squyres
PI_abort? > > --- On Wed, 23/6/10, David Zhang wrote: > > From: David Zhang > Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun > To: "Open MPI Users" > Received: Wednesday, 23 June, 2010, 4:37 PM > > Since you turned the machine off instead of

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
. Are you implying I should call exit() rather than MPI_Abort? --- On Wed, 23/6/10, David Zhang wrote: From: David Zhang Subject: Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun To: "Open MPI Users" Received: Wednesday, 23 June, 2010, 4:37 PM Since you turned the machine off inste

Re: [OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread David Zhang
Since you turned the machine off instead of just killing one of the processes, no signals could be sent to the other processes. Perhaps you could institute some sort of handshaking in your software that periodically checks that all machines are present, and times out if not all are present within so
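
The handshake-with-timeout idea suggested above could look roughly like the following. This is only a sketch of the timeout logic, not code from the thread: the names wait_for_heartbeat, poll_fn, clock_fn and fake_peer are hypothetical. It is written against two callbacks so that it compiles without an MPI installation; in a real program the poll callback would presumably wrap MPI_Test() on a request posted with MPI_Irecv(), and the clock callback would wrap MPI_Wtime().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types (assumptions, not part of MPI):
 * poll_fn  returns true once the peer's heartbeat message has arrived;
 * clock_fn returns the current time in seconds. */
typedef bool   (*poll_fn)(void *ctx);
typedef double (*clock_fn)(void *ctx);

typedef enum { PEER_ALIVE, PEER_TIMED_OUT } peer_status;

/* Poll for a heartbeat until it arrives or timeout_s seconds have passed.
 * The caller never blocks indefinitely on a peer that has vanished. */
peer_status wait_for_heartbeat(poll_fn poll, clock_fn now, void *ctx,
                               double timeout_s)
{
    assert(poll != NULL && now != NULL);
    double deadline = now(ctx) + timeout_s;
    do {
        if (poll(ctx))
            return PEER_ALIVE;
    } while (now(ctx) < deadline);
    return PEER_TIMED_OUT;
}

/* Simulated peer, for demonstration only: the fake clock advances by one
 * second per reading, and the heartbeat "arrives" once the clock reaches
 * arrives_at (a negative value means the node is powered off). */
typedef struct {
    double t;
    double arrives_at;
} fake_peer;

double fake_now(void *ctx)
{
    fake_peer *p = ctx;
    double t = p->t;
    p->t += 1.0;
    return t;
}

bool fake_poll(void *ctx)
{
    fake_peer *p = ctx;
    return p->arrives_at >= 0.0 && p->t >= p->arrives_at;
}

/* Convenience wrapper: run the wait loop against a simulated peer. */
peer_status check_fake_peer(double arrives_at, double timeout_s)
{
    fake_peer p = { 0.0, arrives_at };
    return wait_for_heartbeat(fake_poll, fake_now, &p, timeout_s);
}
```

With the simulated peer, check_fake_peer(2.0, 5.0) reports PEER_ALIVE and check_fake_peer(-1.0, 5.0) reports PEER_TIMED_OUT. What the application does after a timeout (abort, or continue without the missing node) is a separate decision that this sketch does not cover.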

[OMPI users] more Bugs in MPI_Abort() -- mpirun

2010-06-23 Thread Randolph Pullen
I have an MPI program that aggregates data from multiple SQL systems. It all runs fine. To test fault tolerance I switch one of the machines off while it is running. The result is always a hang, i.e. mpirun never completes. To try to avoid this I have replaced the send and receive calls with