Re: [OMPI users] Node failure handling

2017-06-27 Thread George Bosilca
I would also be interested in having slurm keep the remaining processes
around; we have been struggling with this on many of the NERSC machines.
That being said, the error message comes from orted, and it suggests that
they are giving up because they lost connection to a peer. I was not aware
that this capability exists in the master version of ORTE, but if it does,
then it makes our life easier.

  George.


On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org  wrote:

> Let me poke at it a bit tomorrow - we should be able to avoid the abort.
> It’s a bug if we can’t.
>
> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
> >
> > Hi Ralph,
> >
> > Thanks for the quick response.
> >
> > Just tried again not under slurm, but the same result... (though I
> > just did kill -9 orted on the remote node this time)
> >
> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> >
> > Cheers,
> > Tim
> >
> >
> > ```
> > [user@bud96 mpi_resilience]$
> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> > --debug-daemons $(pwd)/test
> > ( some output from job here )
> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
> > bash: line 1: 161312 Killed
> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> > "581828608.0;tcp://172.16.251.96,172.31.1.254:58250" -mca plm "rsh"
> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> > --
> > ORTE has lost communication with a remote daemon.
> >
> >  HNP daemon   : [[8878,0],0] on node bud96
> >  Remote daemon: [[8878,0],1] on node pnod0331
> >
> > This is usually due to either a failure of the TCP network
> > connection to the node, or possibly an internal failure of
> > the daemon itself. We cannot recover from this failure, and
> > therefore will terminate the job.
> > --
> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> > ```
> >
> > On 27 June 2017 at 12:19, r...@open-mpi.org  wrote:
> >> Ah - you should have told us you are running under slurm. That does
> indeed make a difference. When we launch the daemons, we do so with "srun
> --kill-on-bad-exit” - this means that slurm automatically kills the job if
> any daemon terminates. We take that measure to avoid leaving zombies behind
> in the event of a failure.
> >>
> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the
> rsh launcher instead of the slurm one, which gives you more control.
> >>
> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
> >>>
> >>> Hi Ralph, George,
> >>>
> >>> Thanks very much for getting back to me.  Alas, neither of these
> >>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
> >>> "--enable-recovery", once the node reboots one gets the following
> >>> error:
> >>>
> >>> ```
> >>> --
> >>> ORTE has lost communication with a remote daemon.
> >>>
> >>> HNP daemon   : [[58323,0],0] on node pnod0330
> >>> Remote daemon: [[58323,0],1] on node pnod0331
> >>>
> >>> This is usually due to either a failure of the TCP network
> >>> connection to the node, or possibly an internal failure of
> >>> the daemon itself. We cannot recover from this failure, and
> >>> therefore will terminate the job.
> >>> --
> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> >>> ```
> >>>
> >>> I haven't yet tried the hard reboot case with ULFM (these nodes take
> >>> forever to come back up), but earlier experiments SIGKILLing the orted
> >>> on a compute node led to a very similar message as above, so at this
> >>> point I'm not optimistic...
> >>>
> >>> I think my next step is to try with several separate mpiruns and use
> >>> mpi_comm_{connect,accept} to plumb everything together before the
> >>> application starts.  I notice this is the subject of some recent work
> >>> on ompi master.  Even though the mpiruns will all be associated to the
> >>> same ompi-server, do you think this could be sufficient to isolate the
> >>> failures?
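
For reference, a minimal launch sketch of the multiple-mpirun arrangement described above, with the separate mpiruns tied together through a single ompi-server, might look roughly like the following. The hostnames and URI file path are hypothetical, and the ompi-server / --ompi-server options should be checked against the man pages for your Open MPI version; the MPI_Comm_connect/accept plumbing itself still has to happen inside the application.

```
# one rendezvous server, shared by every mpirun
ompi-server --report-uri /shared/ompi-server.uri &

# one mpirun per failure domain, all pointed at the same server, so the loss
# of a node should only take down the mpirun that owns it
mpirun -np 1 --host bud96    --ompi-server file:/shared/ompi-server.uri ./test &
mpirun -np 1 --host pnod0331 --ompi-server file:/shared/ompi-server.uri ./test &
wait
```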
> >>>
> >>> Cheers,
> >>> Tim
> >>>
> >>>
> >>>
> >>> On 10 June 2017 at 00:56, r...@open-mpi.org  wrote:
>  It has been awhile since I tested it, but I believe the
> --enable-recovery option might do what you want.
> 
> > On Jun 8, 2017, at 6:17 AM, Tim Burgess wrote:
> >
> >

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Actually, the error message is coming from mpirun to indicate that it lost 
connection to one (or more) of its daemons. This happens because slurm only 
knows about the remote daemons - mpirun was started outside of “srun”, and so 
slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills the 
daemons on the compute nodes, not mpirun. As a result, we always see that error 
message.

The capability should exist as an option - it used to, but probably has fallen 
into disrepair. I’ll see if I can bring it back.
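
For anyone following along, the split described above is easy to see from inside an allocation: ORTE starts its daemons as a slurm job step, while mpirun itself remains an ordinary process on the node where it was typed. The commands below are an illustrative sketch; only the srun --kill-on-bad-exit detail comes from the messages in this thread.

```
# while the job is running, from inside the allocation:
squeue -s                      # shows the step slurm knows about, i.e. the orteds
                               # launched via "srun --kill-on-bad-exit ... orted ..."
ps -o pid,ppid,cmd -C mpirun   # mpirun is not part of that step, so when slurm
                               # kills the step it reaps the orteds but not mpirun
```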

> On Jun 27, 2017, at 3:35 AM, George Bosilca  wrote:
> 
> I would also be interested in having slurm keep the remaining processes
> around; we have been struggling with this on many of the NERSC machines.
> That being said, the error message comes from orted, and it suggests that
> they are giving up because they lost connection to a peer. I was not aware
> that this capability exists in the master version of ORTE, but if it does,
> then it makes our life easier.
> 
>   George.
> 
> 
> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org wrote:
> Let me poke at it a bit tomorrow - we should be able to avoid the abort. It’s 
> a bug if we can’t.
> 
> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
> >
> > Hi Ralph,
> >
> > Thanks for the quick response.
> >
> > Just tried again not under slurm, but the same result... (though I
> > just did kill -9 orted on the remote node this time)
> >
> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
> >
> > Cheers,
> > Tim
> >
> >
> > ```
> > [user@bud96 mpi_resilience]$
> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
> > --debug-daemons $(pwd)/test
> > ( some output from job here )
> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
> > bash: line 1: 161312 Killed
> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
> > "581828608.0;tcp://172.16.251.96 ,172.31.1.254:58250 
> > " -mca plm "rsh"
> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
> > --
> > ORTE has lost communication with a remote daemon.
> >
> >  HNP daemon   : [[8878,0],0] on node bud96
> >  Remote daemon: [[8878,0],1] on node pnod0331
> >
> > This is usually due to either a failure of the TCP network
> > connection to the node, or possibly an internal failure of
> > the daemon itself. We cannot recover from this failure, and
> > therefore will terminate the job.
> > --
> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
> > ```
> >
> > On 27 June 2017 at 12:19, r...@open-mpi.org wrote:
> >> Ah - you should have told us you are running under slurm. That does indeed 
> >> make a difference. When we launch the daemons, we do so with "srun 
> >> --kill-on-bad-exit” - this means that slurm automatically kills the job if 
> >> any daemon terminates. We take that measure to avoid leaving zombies 
> >> behind in the event of a failure.
> >>
> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
> >> launcher instead of the slurm one, which gives you more control.
> >>
> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
> >>>
> >>> Hi Ralph, George,
> >>>
> >>> Thanks very much for getting back to me.  Alas, neither of these
> >>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
> >>> "--enable-recovery", once the node reboots one gets the following
> >>> error:
> >>>
> >>> ```
> >>> --
> >>> ORTE has lost communication with a remote daemon.
> >>>
> >>> HNP daemon   : [[58323,0],0] on node pnod0330
> >>> Remote daemon: [[58323,0],1] on node pnod0331
> >>>
> >>> This is usually due to either a failure of the TCP network
> >>> connection to the node, or possibly an internal failure of
> >>> the daemon itself. We cannot recover from this failure, and
> >>> therefore will terminate the job.
> >>> --
> >>> [pnod0330:110442] [[58323,0],0] orted_cmd: received halt_vm cmd
> >>> [pnod0332:56161] [[58323,0],2] orted_cmd: received halt_vm cmd
> >>> ```
> >>>
> >>> I haven't yet tried the hard reboot case with ULFM (these nodes take
> >>> fo

Re: [OMPI users] Node failure handling

2017-06-27 Thread r...@open-mpi.org
Okay, this should fix it - https://github.com/open-mpi/ompi/pull/3771 


> On Jun 27, 2017, at 6:31 AM, r...@open-mpi.org wrote:
> 
> Actually, the error message is coming from mpirun to indicate that it lost 
> connection to one (or more) of its daemons. This happens because slurm only 
> knows about the remote daemons - mpirun was started outside of “srun”, and so 
> slurm doesn’t know it exists. Thus, when slurm kills the job, it only kills 
> the daemons on the compute nodes, not mpirun. As a result, we always see that 
> error message.
> 
> The capability should exist as an option - it used to, but probably has 
> fallen into disrepair. I’ll see if I can bring it back.
> 
>> On Jun 27, 2017, at 3:35 AM, George Bosilca wrote:
>> 
>> I would also be interested in having slurm keep the remaining processes
>> around; we have been struggling with this on many of the NERSC machines.
>> That being said, the error message comes from orted, and it suggests that
>> they are giving up because they lost connection to a peer. I was not aware
>> that this capability exists in the master version of ORTE, but if it does,
>> then it makes our life easier.
>> 
>>   George.
>> 
>> 
>> On Tue, Jun 27, 2017 at 6:14 AM, r...@open-mpi.org wrote:
>> Let me poke at it a bit tomorrow - we should be able to avoid the abort. 
>> It’s a bug if we can’t.
>> 
>> > On Jun 26, 2017, at 7:39 PM, Tim Burgess wrote:
>> >
>> > Hi Ralph,
>> >
>> > Thanks for the quick response.
>> >
>> > Just tried again not under slurm, but the same result... (though I
>> > just did kill -9 orted on the remote node this time)
>> >
>> > Any ideas?  Do you think my multiple-mpirun idea is worth trying?
>> >
>> > Cheers,
>> > Tim
>> >
>> >
>> > ```
>> > [user@bud96 mpi_resilience]$
>> > /d/home/user/2017/openmpi-master-20170608/bin/mpirun --mca plm rsh
>> > --host bud96,pnod0331 -np 2 --npernode 1 --enable-recovery
>> > --debug-daemons $(pwd)/test
>> > ( some output from job here )
>> > ( I then do kill -9 `pgrep orted`  on pnod0331 )
>> > bash: line 1: 161312 Killed
>> > /d/home/user/2017/openmpi-master-20170608/bin/orted -mca
>> > orte_debug_daemons "1" -mca ess "env" -mca ess_base_jobid "581828608"
>> > -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex
>> > "bud[2:96],pnod[4:331]@0(2)" -mca orte_hnp_uri
>> > "581828608.0;tcp://172.16.251.96 
>> > ,172.31.1.254:58250 " 
>> > -mca plm "rsh"
>> > -mca rmaps_ppr_n_pernode "1" -mca orte_enable_recovery "1"
>> > --
>> > ORTE has lost communication with a remote daemon.
>> >
>> >  HNP daemon   : [[8878,0],0] on node bud96
>> >  Remote daemon: [[8878,0],1] on node pnod0331
>> >
>> > This is usually due to either a failure of the TCP network
>> > connection to the node, or possibly an internal failure of
>> > the daemon itself. We cannot recover from this failure, and
>> > therefore will terminate the job.
>> > --
>> > [bud96:20652] [[8878,0],0] orted_cmd: received halt_vm cmd
>> > [bud96:20652] [[8878,0],0] orted_cmd: all routes and children gone - exiting
>> > ```
>> >
>> > On 27 June 2017 at 12:19, r...@open-mpi.org wrote:
>> >> Ah - you should have told us you are running under slurm. That does 
>> >> indeed make a difference. When we launch the daemons, we do so with "srun 
>> >> --kill-on-bad-exit” - this means that slurm automatically kills the job 
>> >> if any daemon terminates. We take that measure to avoid leaving zombies 
>> >> behind in the event of a failure.
>> >>
>> >> Try adding “-mca plm rsh” to your mpirun cmd line. This will use the rsh 
>> >> launcher instead of the slurm one, which gives you more control.
>> >>
>> >>> On Jun 26, 2017, at 6:59 PM, Tim Burgess wrote:
>> >>>
>> >>> Hi Ralph, George,
>> >>>
>> >>> Thanks very much for getting back to me.  Alas, neither of these
>> >>> options seems to accomplish the goal.  Both in OpenMPI v2.1.1 and on a
>> >>> recent master (7002535), with slurm's "--no-kill" and openmpi's
>> >>> "--enable-recovery", once the node reboots one gets the following
>> >>> error:
>> >>>
>> >>> ```
>> >>> --
>> >>> ORTE has lost communication with a remote daemon.
>> >>>
>> >>> HNP daemon   : [[58323,0],0] on node pnod0330
>> >>> Remote daemon: [[58323,0],1] on node pnod0331
>> >>>
>> >>> This is usually due to either a failure of the TCP network
>> >>> connection to the node, or possibly an internal failure of
>> >>> the daemon itself. We cannot recover from this failure, and
>> >>> therefore will

Re: [OMPI users] MPI_ABORT, indirect execution of executables by mpirun, Open MPI 2.1.1

2017-06-27 Thread Ted Sussman
Hello Ralph,

Thanks for your quick reply and bug fix.  I have obtained the update and tried 
it in my simple 
example, and also in the original program from which the simple example was 
extracted.  
The update works as expected :) 

Sincerely,

Ted Sussman

On 27 Jun 2017 at 12:13, r...@open-mpi.org wrote:

> 
> Oh my - I finally tracked it down. A simple one character error.
> 
> Thanks for your patience. Fix is https://github.com/open-mpi/ompi/pull/3773 and will be ported to 2.x and 3.0
> Ralph
> 
> On Jun 27, 2017, at 11:17 AM, r...@open-mpi.org wrote:
> 
> Ideally, we should be delivering the signal to all procs in the process 
> group of each dum.sh. 
> Looking at the code in the head of the 2.x branch, that does indeed 
> appear to be what we 
> are doing, assuming that we found setpgid in your system: 
> 
> static int odls_default_kill_local(pid_t pid, int signum)
> {
>     pid_t pgrp;
> 
> #if HAVE_SETPGID
>     pgrp = getpgid(pid);
>     if (-1 != pgrp) {
>         /* target the lead process of the process
>          * group so we ensure that the signal is
>          * seen by all members of that group. This
>          * ensures that the signal is seen by any
>          * child processes our child may have
>          * started
>          */
>         pid = pgrp;
>     }
> #endif
>     if (0 != kill(pid, signum)) {
>         if (ESRCH != errno) {
>             OPAL_OUTPUT_VERBOSE((2, orte_odls_base_framework.framework_output,
>                                  "%s odls:default:SENT KILL %d TO PID %d GOT ERRNO %d",
>                                  ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), signum, (int)pid, errno));
>             return errno;
>         }
>     }
>     OPAL_OUTPUT_VERBOSE((2, orte_odls_base_framework.framework_output,
>                          "%s odls:default:SENT KILL %d TO PID %d SUCCESS",
>                          ORTE_NAME_PRINT(ORTE_PROC_MY_NAME), signum, (int)pid));
>     return 0;
> }
> 
> For some strange reason, it appears that you aren't seeing this. I'm
> building the branch now and will see if I can reproduce it.
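
One quick way to check from the outside whether the wrapper script and the MPI executable really end up in the same process group, so that the pgid-targeted kill() above reaches both, is a ps listing on the node in question. This is only a diagnostic sketch; the process names are taken from this thread and the exact comm values may differ.

```
# while the job is running, on the node being tested:
ps -eo pid,pgid,ppid,comm | grep -E 'orted|mpirun|dum|aborttest'
# dum.sh and aborttest.exe should report the same PGID if setpgid took effect;
# if aborttest ends up in its own group, signals sent to dum.sh's group will
# never reach it, which would match the hang described below.
```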
> 
> On Jun 27, 2017, at 10:58 AM, Ted Sussman  wrote:
> 
> Hello all,
> 
> Thank you for your help and advice.  It has taken me several days to 
> understand what 
> you were trying to tell me.  I have now studied the problem in more 
> detail, using a 
> version of Open MPI 2.1.1 built with --enable-debug.
> 
> -
> 
> Consider the following scenario in Open MPI 2.1.1:
> 
> mpirun --> dum.sh --> aborttest.exe  (rank 0)
>        --> dum.sh --> aborttest.exe  (rank 1)
> 
> aborttest.exe calls MPI_Bcast several times, then aborttest.exe rank 0 calls
> MPI_Abort.
> 
> As far as I can figure out, this is what happens after aborttest.exe rank 0
> calls MPI_Abort.
> 
> 1) aborttest.exe for rank 0 exits.  aborttest.exe for rank 1 is polling
> (waiting for a message from MPI_Bcast).
> 
> 2) mpirun (or maybe orted?) sends the signals SIGCONT, SIGTERM, SIGKILL to
> both dum.sh processes.
> 
> 3) Both dum.sh processes are killed.
> 
> 4) aborttest.exe for rank 1 continues to poll. mpirun never exits.
> 
> 
> 
> Now suppose that dum.sh traps SIGCONT, and that the trap handler in dum.sh
> sends signal SIGINT to $PPID.  This is what seems to happen after
> aborttest.exe rank 0 calls MPI_Abort:
> 
> 1) aborttest.exe for rank 0 exits.  aborttest.exe for rank 1 is polling
> (waiting for a message from MPI_Bcast).
> 
> 2) mpirun (or maybe orted?) sends the signals SIGCONT, SIGTERM, SIGKILL to
> both dum.sh processes.
> 
> 3) dum.sh for rank 0 catches SIGCONT and sends SIGINT to its parent.  dum.sh
> for rank 1 appears to be killed (I don't understand this: why doesn't dum.sh
> for rank 1 also catch SIGCONT?)
> 
> 4) mpirun catches the SIGINT and kills aborttest.exe for rank 1, then mpirun
> exits.
> 
> So adding the trap handler to dum.sh solves my problem.
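
A minimal sketch of that wrapper, assuming the shape described in this thread: the trap-SIGCONT-then-SIGINT-the-parent idea and the aborttest.exe name come from the messages above, everything else is illustrative.

```
#!/bin/sh
# dum.sh (sketch): run the real MPI program; if mpirun/orted starts its abort
# sequence (SIGCONT arrives first), forward it to our parent as SIGINT so that
# mpirun tears down the remaining ranks instead of hanging.
trap 'kill -INT $PPID' CONT
./aborttest.exe "$@" &   # run in the background so the shell can handle the trap
pid=$!
wait $pid                # wait returns early when the trapped signal arrives
```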
> 
> Is this the preferred solution to my problem?  Or is there a more elegant 
> solution?
> 
> Sincerely,
> 
> Ted Sussman
> 
> 
> 
> 
> 
> 
> 
> 
> On 19 Jun 2017 at 11:19, r...@open-mpi.org wrote:
> 
> >
> >
> >
> > On Jun 19, 2017, at 10:53 AM, Ted Sussman wrote:
> >
> > For what it's worth, the problem might be related to the following:
> >
> > mpirun: -np 2 ... dum.sh
> > dum.sh: Invoke aborttest11.exe
> > aborttest11.exe: Call  MPI_Init, go into an infinite loop.
> >
> > Now when mpirun is running, send signals at the processes, as 
> follows:
> >
> > 1) kill -9 (pid for one of the aborttest11.exe processes)
> >
> > The shell for this aborttest11.exe continues. Once this shell 
>