I think I might have fixed this, but I still don't really understand it. In setting up the RPi machines, I followed a config guide that suggested switching the SSH service in systemd to "ssh.socket" instead of "ssh.service". It's supposed to be lighter-weight and give you cleaner shutdowns, and I've used this trick on other machines without really understanding the implications.
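For reference, undoing that change on each node amounts to something like this (unit names as shipped with the Debian/Raspbian openssh-server package):

  # stop socket-activated SSH and keep it from coming back at boot
  sudo systemctl stop ssh.socket
  sudo systemctl disable ssh.socket

  # restore the conventional always-running daemon
  sudo systemctl enable ssh.service
  sudo systemctl start ssh.service

  # sanity check: ssh.service should now be active, ssh.socket inactive
  systemctl status ssh.service ssh.socket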
In the course of a config audit to try to figure this out, I backed that change out, restoring the "ssh.service" link and removing the "ssh.socket" one. Now MPI works, and I also get clean disconnections at exit time, so apparently there's no reason at all to do this. The behavior has survived two reboot cycles, so I think it's real. I'm not sure whether this is specific to Raspbian, or shows up on all architectures of Debian Jessie, or on all systemd init systems, or what.

-- A.

On Sat, May 14, 2016 at 3:27 PM Andrew Reid <andrew.ce.r...@gmail.com> wrote:

> Hi all --
>
> I am having a weird problem on a cluster of Raspberry Pi model 2 machines
> running the Debian/Raspbian version of OpenMPI, 1.6.5.
>
> I apologize for the length of this message; I am trying to include all the
> pertinent details, but of course can't reliably discriminate between the
> pertinent and the irrelevant ones.
>
> I am actually a fairly long-time user of OpenMPI in various environments,
> and have never had any trouble with it, but in configuring my "toy"
> cluster, this came up.
>
> The basic issue is that a sample MPI executable runs with "mpirun -d" or
> under "slurm" resource allocation, but not directly from the command line --
> in the direct command-line case, it just hangs, apparently forever.
>
> What is even weirder is that, earlier today, while backing out a private
> domain configuration (see below), it actually started working for a while,
> but after reboots, the problem behavior has returned.
>
> It seems overwhelmingly likely that this is some kind of network transport
> configuration problem, but it eludes me.
>
>
> More details about the problem:
>
> The Pis are all quad-core, and are named pi (head node), pj, pk, and pl
> (work nodes). They're connected by ethernet. They all have a single
> non-privileged user, named "pi".
>
> There's a directory on my account containing an MPI executable, the "cpi"
> example from the OpenMPI package, and a list of machines to run on, named
> "machines", with the following contents:
>
>   pj slots=4
>   pk slots=4
>   pl slots=4
>
>   mpirun --hostfile machines ./cpi
>
> ... hangs forever, but
>
>   mpirun -d --hostfile machines ./cpi
>
> ... runs correctly, if somewhat verbosely.
>
> Also:
>
>   salloc -n 12 /bin/bash
>   mpirun ./cpi
>
> ... also runs correctly. The "salloc" command is a slurm directive to
> allocate CPU resources and start an interactive shell with a bunch of
> environment variables set to give mpirun the clues it needs, of course.
> The work CPUs are allocated correctly on my "work" nodes when salloc is
> run from the head node.
>
>
> Config details and diagnostic efforts:
>
> The outputs of the ompi_info runs are attached.
>
> The cluster of four Raspberry Pi model 2 computers runs the Jessie
> distribution of Raspbian, which is essentially Debian. They differ a bit:
> the "head node", creatively named "pi", has an older static network config,
> with everything specified in /etc/network/interfaces. The "cluster nodes",
> equally creatively named pj, pk, and pl, all have the newer DHCPCD client
> daemon configured for static interfaces, via /etc/dhcpcd.conf (NB this is
> *not* the DHCP *server*; these machines do not use DHCP services). The
> dhcpcd configuration tool is the new scheme for Raspbian, and has been
> modified from the "as-shipped" set-up to have a static IPv4 address on
> eth0, and to remove some ipv6 functionality (router solicitation) that
> pollutes the log files.
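
A note for anyone reading along: the static part of /etc/dhcpcd.conf on the work nodes looks roughly like this. It's a sketch rather than a verbatim copy -- pj's address is shown, the gateway line is illustrative, and I believe "noipv6rs" is the option doing the router-solicitation suppression:

  # /etc/dhcpcd.conf additions on pj (sketch)
  interface eth0
  static ip_address=192.168.0.12/24
  static routers=192.168.0.1
  static domain_name_servers=8.8.8.8 8.8.4.4

  # suppress IPv6 router solicitations (the log noise mentioned above)
  noipv6rs
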
> mDNS is turned off in /etc/nsswitch.conf; "hosts" are resolved via
> "files", then "dns". The DNS name servers are statically configured to be
> 8.8.8.8 and 8.8.4.4. None of the machines involved in the OpenMPI
> operation are in DNS.
>
> For slightly complicated reasons, all four machines were initially
> configured as members of a local, non-DNS-resolvable domain named ".gb".
> This was done because slurm requires e-mail, and my first crack at e-mail
> config seemed to require a domain. All the hostnames were statically
> configured through /etc/hosts. I realized later that I had misunderstood
> the mail config, and have backed out the domain configuration; the
> machines all have non-dotted names now.
>
> This seemed to briefly change the behavior: it worked several times after
> this, but then on reboot it stopped working again, making me think I am
> perhaps losing my mind.
>
> The system is *not* running nscd, so some kind of name-service cache is
> not a good explanation here.
>
> The whole cluster is set up for host-based SSH authentication for the
> default user, "pi". This works for all possible host pairs, tested via:
>
>   ssh -o PreferredAuthentications=hostbased pi@<target>
>
> The network config looks OK. I can ping and ssh every way I want to, and
> it all works. The pis are all wired to the same Netgear 10/100 switch,
> which in turn goes to my household switch, which in turn goes to my cable
> modem. "ifconfig" shows eth0 and lo configured. "ifconfig -a" does not
> show any additional unconfigured interfaces.
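
(Aside: "all possible host pairs" above amounts to a nested loop like the one below, run from the head node -- a sketch, not a transcript. BatchMode is there so a broken pair fails immediately instead of falling back to a password prompt.)

  for src in pi pj pk pl; do
    for dst in pi pj pk pl; do
      # hop to $src, then attempt a host-based hop from $src to $dst
      if ssh pi@$src "ssh -o PreferredAuthentications=hostbased -o BatchMode=yes pi@$dst true"; then
        echo "$src -> $dst ok"
      else
        echo "$src -> $dst FAILED"
      fi
    done
  done
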
> Ifconfig output is, in order, for pi, pj, pk, and pl:
>
>   eth0      Link encap:Ethernet  HWaddr b8:27:eb:16:0a:70
>             inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
>             inet6 addr: ::ba27:ebff:fe16:a70/64 Scope:Global
>             inet6 addr: fe80::ba27:ebff:fe16:a70/64 Scope:Link
>             UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>             RX packets:164 errors:0 dropped:23 overruns:0 frame:0
>             TX packets:133 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:1000
>             RX bytes:15733 (15.3 KiB)  TX bytes:13756 (13.4 KiB)
>
>   lo        Link encap:Local Loopback
>             inet addr:127.0.0.1  Mask:255.0.0.0
>             inet6 addr: ::1/128 Scope:Host
>             UP LOOPBACK RUNNING  MTU:65536  Metric:1
>             RX packets:7 errors:0 dropped:0 overruns:0 frame:0
>             TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:0
>             RX bytes:616 (616.0 B)  TX bytes:616 (616.0 B)
>
>   eth0      Link encap:Ethernet  HWaddr b8:27:eb:27:4d:17
>             inet addr:192.168.0.12  Bcast:192.168.0.255  Mask:255.255.255.0
>             inet6 addr: ::4c5c:1329:f1b6:1169/64 Scope:Global
>             inet6 addr: fe80::6594:bfad:206:1191/64 Scope:Link
>             UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>             RX packets:237 errors:0 dropped:31 overruns:0 frame:0
>             TX packets:131 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:1000
>             RX bytes:28966 (28.2 KiB)  TX bytes:18841 (18.3 KiB)
>
>   lo        Link encap:Local Loopback
>             inet addr:127.0.0.1  Mask:255.0.0.0
>             inet6 addr: ::1/128 Scope:Host
>             UP LOOPBACK RUNNING  MTU:65536  Metric:1
>             RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>             TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:0
>             RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
>   eth0      Link encap:Ethernet  HWaddr b8:27:eb:f4:ec:03
>             inet addr:192.168.0.13  Bcast:192.168.0.255  Mask:255.255.255.0
>             inet6 addr: fe80::ba08:3c9:67c3:a2a1/64 Scope:Link
>             inet6 addr: ::8e5a:32a5:ab50:d955/64 Scope:Global
>             UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>             RX packets:299 errors:0 dropped:57 overruns:0 frame:0
>             TX packets:138 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:1000
>             RX bytes:34334 (33.5 KiB)  TX bytes:19909 (19.4 KiB)
>
>   lo        Link encap:Local Loopback
>             inet addr:127.0.0.1  Mask:255.0.0.0
>             inet6 addr: ::1/128 Scope:Host
>             UP LOOPBACK RUNNING  MTU:65536  Metric:1
>             RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>             TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:0
>             RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
>   eth0      Link encap:Ethernet  HWaddr b8:27:eb:da:c6:7f
>             inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
>             inet6 addr: ::a8db:7245:458f:2342/64 Scope:Global
>             inet6 addr: fe80::3c5f:7092:578a:6c10/64 Scope:Link
>             UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>             RX packets:369 errors:0 dropped:76 overruns:0 frame:0
>             TX packets:165 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:1000
>             RX bytes:38040 (37.1 KiB)  TX bytes:22788 (22.2 KiB)
>
>   lo        Link encap:Local Loopback
>             inet addr:127.0.0.1  Mask:255.0.0.0
>             inet6 addr: ::1/128 Scope:Host
>             UP LOOPBACK RUNNING  MTU:65536  Metric:1
>             RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>             TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>             collisions:0 txqueuelen:0
>             RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
> There are no firewalls on any of the machines. I checked this via
> "iptables-save", which dumps the system firewall state in a way that
> allows it to be re-loaded by a script, and whose output is reasonably
> human-readable. It shows all tables with no rules and a default "accept"
> state.
>
> The OpenMPI installation is the current Raspbian version, freshly
> installed (via "apt-get install openmpi-bin libopenmpi-dev") from the
> repos. The OpenMPI is version 1.6.5; the package version is
> 1.6.5-9.1+rpi1. No configuration options have been modified.
>
> There is no ".openmpi" directory on the pi user account on any of the
> machines.
>
> When I run the problem case, I can sometimes catch the "orted" daemon
> spinning up on the pj machine; it looks something like this (the port
> number on the tcp uri varies from run to run):
>
>   1 S pi 4895 1 0 80 0 - 1945 poll_s 20:23 ? 00:00:00 orted --daemonize
>     -mca ess env -mca orte_ess_jobid 1646002176 -mca orte_ess_vpid 1
>     -mca orte_ess_num_procs 4
>     --hnp-uri 1646002176.0;tcp://192.168.0.11:59646 -mca plm rsh
>
> (192.168.0.11 is indeed the correct address of the launching machine,
> hostname pi. The first "pi" in column 3 is the name of the user who owns
> the process.)
>
> If I run "telnet 192.168.0.11 59646", it connects. I can send some
> garbage into the connection, but this does not cause the orted to exit,
> nor does it immediately blow up the launching process on the launch
> machine. I have not investigated in detail, but it seems that if you
> molest the TCP connection in this way, the launching process eventually
> reports an error, but if you don't, it will hang forever.
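
(Another thing worth capturing while it hangs, for anyone reproducing this: on the work node, check that orted is actually up and whether its connection back to the head node ever reaches ESTABLISHED. This is a sketch; the port is whatever shows up in --hnp-uri for that particular run.)

  # on pj, while mpirun is hanging on the head node
  ps -ef | grep [o]rted
  netstat -tn | grep 192.168.0.11
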
> One additional oddity: when I run the job in "debug" mode, the clients
> generate the following dmesg traffic:
>
>   [ 1002.404021] sctp: [Deprecated]: cpi (pid 13770) Requested SCTP_SNDRCVINFO event.
>   Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
>   [ 1002.412423] sctp: [Deprecated]: cpi (pid 13772) Requested SCTP_SNDRCVINFO event.
>   Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
>   [ 1002.427621] sctp: [Deprecated]: cpi (pid 13771) Requested SCTP_SNDRCVINFO event.
>   Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
>
> I have tried:
>
> - Adding or removing the domain suffix from the hosts in the machines
>   file.
> - Checking that the clocks on all four machines match.
> - Changing the host names in the machines file to invalid names -- this
>   causes the expected failure, reassuring me that the file is being read.
>   Note that the hanging behavior also occurs with the "-H" option in
>   place of a machines file.
> - Running with "-mca btl tcp,self -mca btl_tcp_if_include eth0" in case
>   it's having device problems. When I do this, I see this argument echoed
>   on the orted process on pj, but the behavior is the same; it still
>   hangs.
> - Removing the "slots=" directive from the machines file.
> - Disabling IPv6 (via sysctl).
> - Turning off the SLURM daemons (via systemctl, not by uninstalling them).
> - Trying different host combinations in the machines file. This changes
>   things in weird ways, which I have not systematically explored. It
>   seems that if pk is first in line, the run eventually times out, but if
>   pj or pl is first, it hangs forever. The willingness of orted to appear
>   in the client process table seems inconsistent, but it may be that it
>   always runs and I am just not consistently catching it.
> - Adding/removing "multi on" from /etc/host.conf.
>
> None of these have changed the behavior, except, as noted, briefly after
> backing out the private domain configuration (which involves editing the
> hosts file, which motivates the focus on DNS in some of this).
>
> All configurations work with "-d", with "--debug-daemons", or with no
> arguments inside a slurm allocation, but hang in the "ordinary" case.
>
> I am stumped. I am totally willing to believe I have mucked up the
> network config, but where? How? What's different about debug mode?
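
P.S. Re the "btl_tcp_if_include" item in the list above: what I ran was along these lines, and the second variant is a further thing to try that I have not yet tested -- it pins the out-of-band daemon wire-up traffic as well, which matters here because the hang happens before any MPI traffic flows. The "oob_tcp_if_include" parameter name is my assumption for this OpenMPI version; check "ompi_info --param oob tcp" before relying on it.

  # what I ran (still hangs):
  mpirun --hostfile machines -mca btl tcp,self -mca btl_tcp_if_include eth0 ./cpi

  # untested variant: also restrict the orted/HNP wire-up channel to eth0
  # (parameter name unverified for 1.6.5)
  mpirun --hostfile machines -mca btl tcp,self -mca btl_tcp_if_include eth0 \
         -mca oob_tcp_if_include eth0 ./cpi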