Hi Ralph,

Thanks for the quick response.

I just tried again, not under slurm this time, but got the same result...
(though this time I simulated the failure by doing kill -9 on the orted
on the remote node rather than rebooting it)

Any ideas?  Do you think my multiple-mpirun idea is worth trying?  (I've
put a rough sketch of what I mean below, after the log.)

Cheers,
Tim


```
[user@bud96 mpi_resilience]$
/d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
--host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
--debug-daemons $(pwd)/test
( some output from job here )
( I then do kill -9 `pgrep orted`  on pnod0331 )
bash: line 1: 161312 Killed
/d/home/user/2017/openmpi-master-20170608/bin/orted -mca
orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
-mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
"bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
"581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
-mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[8878,0],0] on node bud96
  Remote daemon: [[8878,0],1] on node pnod0331

This is usually due to either a failure of the TCP network
connection to the node, or possibly an internal failure of
the daemon itself. We cannot recover from this failure, and
therefore will terminate the job.
--------------------------------------------------------------------------
[bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
[bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
```
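
For reference, here is roughly what I mean by the multiple-mpirun idea: one
mpirun per "failure domain", wired together with MPI_Comm_accept /
MPI_Comm_connect before the application starts.  This is only a sketch of the
plumbing, not tested code; the service name is made up, and I'm assuming the
port name gets exchanged through ompi-server via MPI_Publish_name /
MPI_Lookup_name.

```
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* One mpirun is launched with an extra "master" argument; the others
     * connect to it.  "mpi_resilience_test" is a made-up service name. */
    int is_master = (argc > 1 && strcmp(argv[1], "master") == 0);
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    if (is_master) {
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("mpi_resilience_test", MPI_INFO_NULL, port);
        /* Accept one connecting mpirun; repeat for each additional job. */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        MPI_Unpublish_name("mpi_resilience_test", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        MPI_Lookup_name("mpi_resilience_test", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* ... master-slave traffic would then go over the intercommunicator ... */

    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

The hope is that, since each mpirun would have its own set of orteds, losing a
node should only take down the mpirun that owns it rather than the whole job.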

On 27 June 2017 at 12:19, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Ah - you should have told us you are running under slurm. That does indeed 
> make a difference. When we launch the daemons, we do so with "srun
> --kill-on-bad-exit" - this means that slurm automatically kills the job if
> any daemon terminates. We take that measure to avoid leaving zombies behind 
> in the event of a failure.
>
> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
> launcher instead of the slurm one, which gives you more control.
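>
> For example, something along these lines (just a sketch; the host names and
> application path are placeholders):
>
> ```
> mpirun --mca plm rsh --host nodeA,nodeB -np 2 --npernode 1 \
>     --enable-recovery ./my_app
> ```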
>
>> On Jun 26, 2017, at 6:59 PM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>
>> Hi Ralph, George,
>>
>> Thanks very much for getting back to me.  Alas, neither of these
>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>> "--enable-recovery", once the node reboots one gets the following
>> error:
>>
>> ```
>> --------------------------------------------------------------------------
>> ORTE has lost communication with a remote daemon.
>>
>>  HNP daemon   : [[58323,0],0] on node pnod0330
>>  Remote daemon: [[58323,0],1] on node pnod0331
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
>> ```
>>
>> I haven't yet tried the hard reboot case with ULFM (these nodes take
>> forever to come back up), but in earlier experiments, SIGKILLing the orted
>> on a compute node led to a message very similar to the one above, so at
>> this point I'm not optimistic...
>>
>> I think my next step is to try with several separate mpiruns and use
>> mpi_comm_{connect,accept} to plumb everything together before the
>> application starts.  I notice this is the subject of some recent work
>> on ompi master.  Even though the mpiruns will all be associated with the
>> same ompi-server, do you think this could be sufficient to isolate the
>> failures?
>>
>> Cheers,
>> Tim
>>
>>
>>
>> On 10 June 2017 at 00:56, r...@open-mpi.org <r...@open-mpi.org> wrote:
>>> It has been awhile since I tested it, but I believe the --enable-recovery 
>>> option might do what you want.
>>>
>>>> On Jun 8, 2017, at 6:17 AM, Tim Burgess <ozburgess+o...@gmail.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> So I know from searching the archive that this is a repeated topic of
>>>> discussion here, and apologies for that, but since it's been a year or
>>>> so I thought I'd double-check whether anything has changed before
>>>> really starting to tear my hair out too much.
>>>>
>>>> Is there a combination of MCA parameters or similar that will prevent
>>>> ORTE from aborting a job when it detects a node failure?  This is
>>>> using the tcp btl, under slurm.
>>>>
>>>> The application, not written by us and too complicated to re-engineer
>>>> at short notice, has a strictly master-slave communication pattern.
>>>> The master never blocks on communication from individual slaves, and
>>>> apparently can itself detect slaves that have silently disappeared and
>>>> reissue the work to those remaining.  So from an application
>>>> standpoint I believe we should be able to handle this.  However, in
>>>> all my testing so far the job is aborted as soon as the runtime system
>>>> figures out what is going on.
>>>>
>>>> If not, do any users know of another MPI implementation that might
>>>> work for this use case?  As far as I can tell, FT-MPI has been pretty
>>>> quiet for the last couple of years?
>>>>
>>>> Thanks in advance,
>>>>
>>>> Tim