Hi, I am currently trying to learn about fault tolerance in MPI, so I experimented a bit with what happens when I kill various components of my MPI setup, but in some situations I see unexpected hangs.
I use the following MPI script:

#!/usr/bin/env python
from mpi4py import MPI
import time
import sys
import os
import signal

comm = MPI.COMM_WORLD

for i in range(100):
    print("Hello @ %d! I'm rank %d from %d running in total..." % (i, comm.rank, comm.size))
    time.sleep(2)
    if comm.rank == 1 and i == 2:
        os.system("pstree -p")
        # TRY VARIOUS THINGS IN THE LINE BELOW
        os.kill(os.getpid(), signal.SIGTERM)
    comm.Barrier()

When I run the script above on three nodes, I see the following output:

Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...
Hello @ 3! I'm rank 0 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...
timeout(1)---sshd(8)---sshd(18)---orted(19)-+-python3(23)-+-sh(26)---pstree(27)
                                            |             |-{python3}(24)
                                            |             `-{python3}(25)
                                            |-{orted}(20)
                                            |-{orted}(21)
                                            `-{orted}(22)
Hello @ 4! I'm rank 2 from 3 running in total...
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 23 on node 8f528c301215 exited on signal 15 (Terminated).
--------------------------------------------------------------------------
[program exit]

(Note that each process runs in a Docker container, so these are in fact all the processes visible to my program.)

This is nice, but I also want to know what happens if a whole node or the network fails, so I need to kill more than just the Python process. I therefore changed `os.kill(os.getpid(), signal.SIGTERM)` to `os.kill(1, signal.SIGTERM)` so that all processes on that particular node die. I guess this is very similar to what would happen if I rebooted the system. The output is as follows:

Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...
timeout(1)---sshd(6)---sshd(16)---orted(17)-+-python3(21)-+-sh(24)---pstree(25)
                                            |             |-{python3}(22)
                                            |             `-{python3}(23)
                                            |-{orted}(18)
                                            |-{orted}(19)
                                            `-{orted}(20)
Hello @ 3! I'm rank 1 from 3 running in total...
Connection to 43982adfb734 closed by remote host.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on one or more nodes.
  Please check your PATH and LD_LIBRARY_PATH settings, or configure OMPI
  with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a lack of
  common network interfaces and/or no route found between them.
  Please check network connectivity (including firewalls and network
  routing requirements).
--------------------------------------------------------------------------
Hello @ 3! I'm rank 0 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...
Hello @ 4! I'm rank 2 from 3 running in total...
[program hangs]

I ran this several times and sometimes I would also see the following output:

Hello @ 0! I'm rank 2 from 3 running in total...
Hello @ 0! I'm rank 0 from 3 running in total...
Hello @ 0! I'm rank 1 from 3 running in total...
Hello @ 1! I'm rank 2 from 3 running in total...
Hello @ 1! I'm rank 0 from 3 running in total...
Hello @ 1! I'm rank 1 from 3 running in total...
Hello @ 2! I'm rank 2 from 3 running in total...
Hello @ 2! I'm rank 0 from 3 running in total...
Hello @ 2! I'm rank 1 from 3 running in total...
Hello @ 3! I'm rank 2 from 3 running in total...
Hello @ 3! I'm rank 0 from 3 running in total...
timeout(1)---sshd(7)---sshd(17)---orted(18)-+-python3(22)-+-sh(25)---pstree(26)
                                            |             |-{python3}(23)
                                            |             `-{python3}(24)
                                            |-{orted}(19)
                                            |-{orted}(20)
                                            `-{orted}(21)
Hello @ 3! I'm rank 1 from 3 running in total...
--------------------------------------------------------------------------
ORTE has lost communication with a remote daemon.

  HNP daemon   : [[18620,0],0] on node c971706813c7
  Remote daemon: [[18620,0],1] on node 626989823da6

Connection to 626989823da6 closed by remote host.
This is usually due to either a failure of the TCP network connection to
the node, or possibly an internal failure of the daemon itself. We cannot
recover from this failure, and therefore will terminate the job.
--------------------------------------------------------------------------
[program hangs]

The unexpected behavior is that in both cases `mpiexec` does not terminate, but hangs. On the node that runs the `mpiexec` command, I see that two `ssh` processes and one `python3` process are in <defunct> state.

Can you please let me know what I can do so that the `mpiexec` process terminates when one of the worker nodes goes down?

Thank you,
Tobias
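P.S. In case it helps to know what I am planning to try next: since I apparently cannot rely on `mpiexec` noticing the dead node, my current idea is a watchdog inside the script itself, roughly like the sketch below. It runs the barrier in a helper thread and calls `comm.Abort()` if the barrier has not completed after a deadline (the 60 seconds are an arbitrary value I picked, and `barrier_with_deadline` is just my own helper name). I have not verified this against the failing scenarios above, and it assumes the MPI library actually provides full thread support (MPI_THREAD_MULTIPLE), which mpi4py requests by default but a given build may not grant.

#!/usr/bin/env python
from mpi4py import MPI
import threading
import time

comm = MPI.COMM_WORLD

def barrier_with_deadline(comm, seconds):
    # Run the barrier in a helper thread so the main thread can keep
    # watching the clock even while MPI_Barrier blocks inside C code.
    done = threading.Event()

    def _barrier():
        comm.Barrier()
        done.set()

    t = threading.Thread(target=_barrier, daemon=True)
    t.start()
    if not done.wait(timeout=seconds):
        # The barrier did not finish in time, presumably because a peer
        # (or its whole node) is gone. Ask the runtime to tear the job
        # down; Abort should not return.
        comm.Abort(1)

for i in range(100):
    print("Hello @ %d! I'm rank %d from %d running in total..." % (i, comm.rank, comm.size))
    time.sleep(2)
    barrier_with_deadline(comm, 60)   # arbitrary 60-second deadline

Of course, if there is a supported way to get the same effect from `mpiexec` or the MPI library itself, I would much prefer that.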
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users