Re: [OMPI users] mpiexec option for node failure

2011-09-18 Thread Ralph Castain
(sigh) let me clarify this to resolve some offlist chatter. It would be rather simple to implement an option that allowed an MPI job to continue executing after the failure of one or more processes. The problem is that OMPI's MPI layer does not yet know how to handle that situation. As Josh ind

Re: [OMPI users] mpiexec option for node failure

2011-09-16 Thread Ralph Castain
Actually, I honestly don't remember even having that discussion. In looking at it, this would be relatively easy to implement if someone really wanted it. Only issue: user would bear full responsibility for OMPI not cleaning up failed jobs since we wouldn't terminate upon seeing a proc fail. Def

Re: [OMPI users] mpiexec option for node failure

2011-09-16 Thread Josh Hursey
Though I do not share George's pessimism about acceptance by the Open MPI community, it has been slightly difficult to add such a non-standard feature to the code base for various reasons. At ORNL, I have been developing a prototype for the MPI Forum Fault Tolerance Working Group [1] of the Run-Th

Re: [OMPI users] mpiexec option for node failure

2011-09-15 Thread George Bosilca
Rob, The Open MPI community did consider such an option, but deemed it uninteresting. However, we (the UTK team) have a patched version supporting several fault-tolerant modes, including the one you described in your email. If you are interested, please contact me directly. Thanks, geor

Re: [OMPI users] mpiexec option for node failure

2011-09-13 Thread Reuti
On 13.09.2011 at 02:43, Ralph Castain wrote: > We don't have anything similar in OMPI. There are fault tolerance modes, but > not like the one you describe. You can join mpi3-ft at http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft; there is also an archive at http://lists.mpi-forum.org/mpi

Re: [OMPI users] mpiexec option for node failure

2011-09-12 Thread Ralph Castain
We don't have anything similar in OMPI. There are fault tolerance modes, but not like the one you describe. On Sep 12, 2011, at 5:52 PM, Rob Stewart wrote: > Hi, > > I have implemented a simple fault tolerant ping pong C program with MPI, > here: http://pastebin.com/7mtmQH2q > > MPICH2 offers

[OMPI users] mpiexec option for node failure

2011-09-12 Thread Rob Stewart
Hi, I have implemented a simple fault tolerant ping pong C program with MPI, here: http://pastebin.com/7mtmQH2q MPICH2 offers a parameter with mpiexec: $ mpiexec -disable-auto-cleanup .. as described here: http://trac.mcs.anl.gov/projects/mpich2/ticket/1421 It is fault tolerant in the respec