Hi,

I am currently trying to learn about fault tolerance in MPI, so I
experimented a bit with what happens when I kill various components of my
MPI setup. In some situations I see unexpected hangs.

I use the following MPI script:

    #!/usr/bin/env python

    from mpi4py import MPI
    import time
    import sys
    import os
    import signal

    comm = MPI.COMM_WORLD

    for i in range(100):
        print("Hello @ %d! I'm rank %d from %d running in total..." % (i,
comm.rank, comm.size))
        time.sleep(2)
        if comm.rank == 1 and i == 2:
            os.system("pstree -p")
            # TRY VARIOUS THINGS IN THE LINE BELOW
            os.kill(os.getpid(), signal.SIGTERM)

    comm.Barrier()

When I run the script above on three nodes, I see the following output:

    Hello @ 0! I'm rank 0 from 3 running in total...
    Hello @ 0! I'm rank 1 from 3 running in total...
    Hello @ 0! I'm rank 2 from 3 running in total...
    Hello @ 1! I'm rank 0 from 3 running in total...
    Hello @ 1! I'm rank 1 from 3 running in total...
    Hello @ 1! I'm rank 2 from 3 running in total...
    Hello @ 2! I'm rank 0 from 3 running in total...
    Hello @ 2! I'm rank 1 from 3 running in total...
    Hello @ 2! I'm rank 2 from 3 running in total...
    Hello @ 3! I'm rank 0 from 3 running in total...
    Hello @ 3! I'm rank 2 from 3 running in total...

    timeout(1)---sshd(8)---sshd(18)---orted(19)-+-python3(23)-+-sh(26)---pstree(27)
                                                |             |-{python3}(24)
                                                |             `-{python3}(25)
                                                |-{orted}(20)
                                                |-{orted}(21)
                                                `-{orted}(22)
    Hello @ 4! I'm rank 2 from 3 running in total...

    --------------------------------------------------------------------------
    mpiexec noticed that process rank 1 with PID 23 on node 8f528c301215
    exited on signal 15 (Terminated).
    --------------------------------------------------------------------------
    [program exit]

(Note that each rank runs in its own Docker container, which is what I use
as a "node" here; as the pstree output shows, `mpiexec` starts the remote
rank through an `orted` daemon launched over ssh, so these are in fact all
the processes visible to my program.)

This behaves as I would expect, but to find out what happens when a whole
node or the network fails, I need to kill more than just the one rank. I
therefore changed `os.kill(os.getpid(), signal.SIGTERM)` to
`os.kill(1, signal.SIGTERM)`, so that all processes on that particular node
die; I assume this is very similar to what would happen if the node were
rebooted. The failure-injection part of the loop then looks like this
(everything else is unchanged):
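
        if comm.rank == 1 and i == 2:
            os.system("pstree -p")
            # Kill PID 1 so that every process in this rank's container
            # dies, approximating a node failure / reboot.
            os.kill(1, signal.SIGTERM)

The output is as follows: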

    Hello @ 0! I'm rank 1 from 3 running in total...
    Hello @ 0! I'm rank 0 from 3 running in total...
    Hello @ 0! I'm rank 2 from 3 running in total...
    Hello @ 1! I'm rank 1 from 3 running in total...
    Hello @ 1! I'm rank 0 from 3 running in total...
    Hello @ 1! I'm rank 2 from 3 running in total...
    Hello @ 2! I'm rank 1 from 3 running in total...
    Hello @ 2! I'm rank 0 from 3 running in total...
    Hello @ 2! I'm rank 2 from 3 running in total...

    timeout(1)---sshd(6)---sshd(16)---orted(17)-+-python3(21)-+-sh(24)---pstree(25)
                                                |             |-{python3}(22)
                                                |             `-{python3}(23)
                                                |-{orted}(18)
                                                |-{orted}(19)
                                                `-{orted}(20)
    Hello @ 3! I'm rank 1 from 3 running in total...
    Connection to 43982adfb734 closed by remote host.

    --------------------------------------------------------------------------
    ORTE was unable to reliably start one or more daemons.
    This usually is caused by:

    * not finding the required libraries and/or binaries on
      one or more nodes. Please check your PATH and LD_LIBRARY_PATH
      settings, or configure OMPI with --enable-orterun-prefix-by-default

    * lack of authority to execute on one or more specified nodes.
      Please verify your allocation and authorities.

    * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
      Please check with your sys admin to determine the correct location to use.

    * compilation of the orted with dynamic libraries when static are required
      (e.g., on Cray). Please check your configure cmd line and consider using
      one of the contrib/platform definitions for your system type.

    * an inability to create a connection back to mpirun due to a
      lack of common network interfaces and/or no route found between
      them. Please check network connectivity (including firewalls
      and network routing requirements).
    --------------------------------------------------------------------------
    Hello @ 3! I'm rank 0 from 3 running in total...
    Hello @ 3! I'm rank 2 from 3 running in total...
    Hello @ 4! I'm rank 2 from 3 running in total...
    [program hangs]

I ran this several times, and sometimes I would instead see the following
output:

    Hello @ 0! I'm rank 2 from 3 running in total...
    Hello @ 0! I'm rank 0 from 3 running in total...
    Hello @ 0! I'm rank 1 from 3 running in total...
    Hello @ 1! I'm rank 2 from 3 running in total...
    Hello @ 1! I'm rank 0 from 3 running in total...
    Hello @ 1! I'm rank 1 from 3 running in total...
    Hello @ 2! I'm rank 2 from 3 running in total...
    Hello @ 2! I'm rank 0 from 3 running in total...
    Hello @ 2! I'm rank 1 from 3 running in total...
    Hello @ 3! I'm rank 2 from 3 running in total...
    Hello @ 3! I'm rank 0 from 3 running in total...

    timeout(1)---sshd(7)---sshd(17)---orted(18)-+-python3(22)-+-sh(25)---pstree(26)
                                                |             |-{python3}(23)
                                                |             `-{python3}(24)
                                                |-{orted}(19)
                                                |-{orted}(20)
                                                `-{orted}(21)
    Hello @ 3! I'm rank 1 from 3 running in total...

    --------------------------------------------------------------------------
    ORTE has lost communication with a remote daemon.

      HNP daemon   : [[18620,0],0] on node c971706813c7
      Remote daemon: [[18620,0],1] on node 626989823da6

    Connection to 626989823da6 closed by remote host.
    This is usually due to either a failure of the TCP network
    connection to the node, or possibly an internal failure of
    the daemon itself. We cannot recover from this failure, and
    therefore will terminate the job.

    --------------------------------------------------------------------------
    [program hangs]

The unexpected behavior is that in both cases `mpiexec` does not terminate
but hangs. On the node that runs the `mpiexec` command, I can see two
`ssh` processes and one `python3` process in the <defunct> (zombie) state.

Can you please let me know what I can do so that the `mpiexec` process
terminates when one of the worker nodes goes down?
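
If `mpiexec` itself cannot be made to give up in this situation, I would
also be happy with a way to notice the failure from inside the application.
Below is a rough sketch of what I have in mind (using MPI error handlers so
that a failed call raises an exception instead of aborting the job), though
I am not sure plain Open MPI even reports an error to the surviving ranks
once a remote daemon is gone:

    # Sketch only: make MPI calls raise MPI.Exception instead of aborting,
    # so a surviving rank could at least notice that the Barrier failed.
    comm.Set_errhandler(MPI.ERRORS_RETURN)
    try:
        comm.Barrier()
    except MPI.Exception as e:
        print("Barrier failed on rank %d: %s" % (comm.rank, e))
        sys.exit(1)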

Thank you,
Tobias