Hi Ralph, George,

Thanks very much for getting back to me.  Alas, neither of these
options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
recent master (7002535), with slurm's "--no-kill" and openmpi's
"--enable-recovery", once the node reboots one gets the following
error:

```
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[58323,0],0] on node pnod0330
  Remote daemon: [[58323,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
[pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
```

I haven't yet tried the hard reboot case with ULFM (these nodes take
forever to come back up), but earlier experiments SIGKILLing the orted
on a compute node led to a message very similar to the one above, so
at this point I'm not optimistic...
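
(For context, what I mean by "trying with ULFM" is roughly the usual
revoke/shrink recovery pattern on the master side.  The sketch below
is purely illustrative: it assumes an Open MPI build with the ULFM
MPIX_* extensions, and the buffer, tag, and "reissue the work" step
are placeholders rather than the real application.)

```
/* Rough sketch, assuming a ULFM-enabled Open MPI (mpi-ext.h, MPIX_* calls). */
#include <mpi.h>
#include <mpi-ext.h>

/* Master-side send that survives the death of a slave rank. */
int send_to_slave(void *buf, int count, int dest, int tag, MPI_Comm *comm)
{
    /* Errors must be returned rather than fatal (normally set once at startup). */
    MPI_Comm_set_errhandler(*comm, MPI_ERRORS_RETURN);

    int rc = MPI_Send(buf, count, MPI_BYTE, dest, tag, *comm);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED) {
            MPI_Comm shrunk;
            MPIX_Comm_revoke(*comm);           /* make the survivors notice */
            /* Shrink is collective: the surviving slaves must call it too,
             * e.g. after their own operations fail with MPIX_ERR_REVOKED. */
            MPIX_Comm_shrink(*comm, &shrunk);
            *comm = shrunk;
            /* ...reissue the lost work item to a surviving slave... */
        }
    }
    return rc;
}
```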

I think my next step is to try several separate mpiruns and use
MPI_Comm_{connect,accept} to plumb everything together before the
application starts.  I notice this is the subject of some recent work
on ompi master.  Even though the mpiruns will all be associated with
the same ompi-server, do you think this could be sufficient to isolate
the failures?
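
For concreteness, the plumbing I have in mind is roughly the sketch
below.  It's only a sketch: the service name "master-port", the
master/worker roles, and the launch lines in the comment are
placeholders, not the real application.

```
/* Assumed launch: one mpirun per job, all pointed at the same ompi-server:
 *
 *   ompi-server --report-uri uri.txt
 *   mpirun -np 1 --ompi-server file:uri.txt ./app master
 *   mpirun -np N --ompi-server file:uri.txt ./app        # once per worker job
 */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    if (argc > 1 && strcmp(argv[1], "master") == 0) {
        /* Master job: open and publish a port, then accept one worker job
         * (in practice, loop over MPI_Comm_accept, once per worker job). */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("master-port", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {
        /* Worker job: look up the published port and connect to the master. */
        MPI_Lookup_name("master-port", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* ...master/slave traffic goes over the intercommunicator "inter"... */

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

The thinking is that each worker job would then share only its own
intercommunicator with the master, which is what I'm hoping keeps a
failure in one job from pulling down the others.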

Cheers,
Tim



On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
> It has been a while since I tested it, but I believe the --enable-recovery
> option might do what you want.
>
>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>
>> Hi!
>>
>> So I know from searching the archive that this is a repeated topic of
>> discussion here, and apologies for that, but since it's been a year or
>> so I thought I'd double-check whether anything has changed before
>> really starting to tear my hair out too much.
>>
>> Is there a combination of MCA parameters or similar that will prevent
>> ORTE from aborting a job when it detects a node failure?  This is
>> using the tcp btl, under slurm.
>>
>> The application, not written by us and too complicated to re-engineer
>> at short notice, has a strictly master-slave communication pattern.
>> The master never blocks on communication from individual slaves, and
>> apparently can itself detect slaves that have silently disappeared and
>> reissue the work to those remaining.  So from an application
>> standpoint I believe we should be able to handle this.  However, in
>> all my testing so far the job is aborted as soon as the runtime system
>> figures out what is going on.
>>
>> If not, do any users know of another MPI implementation that might
>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>> quiet the last couple of years?
>>
>> Thanks in advance,
>>
>> Tim
>