Hi all -- I am having a weird problem on a cluster of Raspberry Pi model 2 machines running the Debian/Raspbian build of OpenMPI 1.6.5.
I apologize for the length of this message; I am trying to include all the pertinent details, but of course I can't reliably discriminate between pertinent and irrelevant ones. I am actually a fairly long-time user of OpenMPI in various environments and have never had any trouble with it, but this came up while configuring my "toy" cluster.

The basic issue: a sample MPI executable runs with "mpirun -d" or under a slurm resource allocation, but not directly from the command line -- in the direct command-line case it just hangs, apparently forever. What is even weirder is that, earlier today, while backing out a private domain configuration (see below), it actually started working for a while, but after reboots the problem behavior has returned. It seems overwhelmingly likely that this is some kind of network transport configuration problem, but it eludes me.

More details about the problem: The Pis are all quad-core, and are named pi (head node), pj, pk, and pl (work nodes). They're connected by ethernet. They all have a single non-privileged user, named "pi". There's a directory on my account containing an MPI executable, the "cpi" example from the OpenMPI package, and a list of machines to run on, named "machines", with the following contents:

> pj slots=4
> pk slots=4
> pl slots=4

> mpirun --hostfile machines ./cpi
... hangs forever, but

> mpirun -d --hostfile machines ./cpi
... runs correctly, if somewhat verbosely. Also:

> salloc -n 12 /bin/bash
> mpirun ./cpi
... also runs correctly.

The "salloc" command is a slurm directive to allocate CPU resources and start an interactive shell with a bunch of environment variables set to give mpirun the clues it needs, of course. The work CPUs are allocated correctly on my "work" nodes when salloc is run from the head node.

Config details and diagnostic efforts:

The outputs of the ompi_info runs are attached. The cluster of four Raspberry Pi model 2 computers runs the Jessie distribution of Raspbian, which is essentially Debian. The machines differ a bit: the "head node", creatively named "pi", has an older static network config, with everything specified in /etc/network/interfaces. The "cluster nodes", equally creatively named pj, pk, and pl, all have the newer DHCPCD client daemon configured for static interfaces, via /etc/dhcpcd.conf (NB this is *not* the DHCP *server*; these machines do not use DHCP services). The dhcpcd configuration tool is the new scheme for Raspbian, and has been modified from the "as-shipped" set-up to have a static IPv4 address on eth0 and to remove some IPv6 functionality (router solicitation) that pollutes the log files. MDNS is turned off in /etc/nsswitch.conf; "hosts" are resolved via "files", then "dns". The DNS name servers are statically configured to be 8.8.8.8 and 8.8.4.4. None of the machines involved in the OpenMPI operation are in DNS.

For slightly complicated reasons, all four machines were initially configured as members of a local, non-DNS-resolvable domain named ".gb". This was done because slurm requires e-mail, and my first crack at the e-mail config seemed to require a domain. All the hostnames were statically configured through /etc/hosts. I realized later that I had misunderstood the mail config and have backed out the domain configuration; the machines all have non-dotted names now. This seemed to briefly change the behavior -- it worked several times after this, but then on reboot it stopped working again, making me think I am perhaps losing my mind.
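For concreteness, the name-resolution and static-address setup looks roughly like the sketch below. This is a paraphrase from memory rather than a verbatim copy of the files (the stock Raspbian localhost lines are omitted, and the router address in particular is just illustrative):

> # /etc/hosts (same on all four machines, after backing out the ".gb" domain)
> 127.0.0.1     localhost
> 192.168.0.11  pi
> 192.168.0.12  pj
> 192.168.0.13  pk
> 192.168.0.14  pl

> # "hosts" line in /etc/nsswitch.conf, with mdns removed
> hosts: files dns

> # /etc/dhcpcd.conf fragment on a work node (pj shown; pk and pl differ only in the address)
> interface eth0
> static ip_address=192.168.0.12/24
> static routers=192.168.0.1          # illustrative; whatever the household gateway actually is
> static domain_name_servers=8.8.8.8 8.8.4.4
> noipv6rs                            # suppress the IPv6 router solicitations that were polluting the logs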
The system is *not* running nscd, so some kind of name-service cache is not a good explanation here. The whole cluster is set up for host-based SSH authentication for the default user, "pi". This works for all possible host pairs, tested via:

> ssh -o PreferredAuthentications=hostbased pi@<target>
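Spelled out, "all possible host pairs" means a check equivalent to the loop below; I actually ran the combinations by hand, so this is a sketch of the test rather than a transcript:

> # From the head node: ssh into each node, then from there into every other node,
> # using host-based auth only, and report which host we landed on.
> for src in pi pj pk pl; do
>   for dst in pi pj pk pl; do
>     echo "== $src -> $dst =="
>     ssh -o PreferredAuthentications=hostbased pi@$src \
>         ssh -o PreferredAuthentications=hostbased pi@$dst hostname
>   done
> done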
The network config looks OK. I can ping and ssh every way I want to, and it all works. The Pis are all wired to the same Netgear 10/100 switch, which in turn goes to my household switch, which in turn goes to my cable modem. "ifconfig" shows eth0 and lo configured; "ifconfig -a" does not show any additional unconfigured interfaces. The ifconfig output for pi, pj, pk, and pl, in that order, is:

[pi]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:16:0a:70
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::ba27:ebff:fe16:a70/64 Scope:Global
          inet6 addr: fe80::ba27:ebff:fe16:a70/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:164 errors:0 dropped:23 overruns:0 frame:0
          TX packets:133 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15733 (15.3 KiB)  TX bytes:13756 (13.4 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:7 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:616 (616.0 B)  TX bytes:616 (616.0 B)

[pj]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:27:4d:17
          inet addr:192.168.0.12  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::4c5c:1329:f1b6:1169/64 Scope:Global
          inet6 addr: fe80::6594:bfad:206:1191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:237 errors:0 dropped:31 overruns:0 frame:0
          TX packets:131 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28966 (28.2 KiB)  TX bytes:18841 (18.3 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

[pk]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:f4:ec:03
          inet addr:192.168.0.13  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::ba08:3c9:67c3:a2a1/64 Scope:Link
          inet6 addr: ::8e5a:32a5:ab50:d955/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:299 errors:0 dropped:57 overruns:0 frame:0
          TX packets:138 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:34334 (33.5 KiB)  TX bytes:19909 (19.4 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

[pl]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:da:c6:7f
          inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::a8db:7245:458f:2342/64 Scope:Global
          inet6 addr: fe80::3c5f:7092:578a:6c10/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:369 errors:0 dropped:76 overruns:0 frame:0
          TX packets:165 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:38040 (37.1 KiB)  TX bytes:22788 (22.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

There are no firewalls on any of the machines. I checked this via "iptables-save", which dumps the system firewall state in a way that allows it to be re-loaded by a script, and whose output is reasonably human-readable. It shows all tables with no rules and a default "accept" policy.

The OpenMPI installation is the current Raspbian version, freshly installed from the repos via "apt-get install openmpi-bin libopenmpi-dev". The OpenMPI version is 1.6.5; the package version is 1.6.5-9.1+rpi1. No configuration options have been modified. There is no ".openmpi" directory in the pi user account on any of the machines.

When I run the problem case, I can sometimes catch the "orted" daemon spinning up on the pj machine. It looks something like this (the port number in the tcp URI varies from run to run):

> 1 S pi 4895 1 0 80 0 - 1945 poll_s 20:23 ? 00:00:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1646002176 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri 1646002176.0;tcp://192.168.0.11:59646 -mca plm rsh

(192.168.0.11 is indeed the correct address of the launching machine, hostname pi. The first "pi" in column 3 is the name of the user who owns the process.)

If I run "telnet 192.168.0.11 59646", it connects. I can send some garbage into the connection, but this does not cause the orted to exit, nor does it immediately blow up the launching process on the launch machine. I have not investigated in detail, but it seems that if you molest the TCP connection in this way, the launching process eventually reports an error, but if you don't, it will hang forever.

One additional oddity: when I run the job in "debug" mode, the clients generate the following dmesg traffic:

> [ 1002.404021] sctp: [Deprecated]: cpi (pid 13770) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> [ 1002.412423] sctp: [Deprecated]: cpi (pid 13772) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> [ 1002.427621] sctp: [Deprecated]: cpi (pid 13771) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.

I have tried:

- Adding or removing the domain suffix from the hosts in the machines file.
- Checking that the clocks on all four machines match.
- Changing the host names in the machines file to invalid names -- this causes the expected failure, reassuring me that the file is being read. Note that the hanging behavior also occurs with the "-H" option in place of a machines file.
- Running with "-mca btl tcp,self -mca btl_tcp_if_include eth0" in case it's having device problems (the full command line is written out after this list). When I do this, I see the arguments echoed on the orted process on pj, but the behavior is the same: it still hangs.
- Removing the "slots=" directive from the machines file.
- Disabling IPv6 (via sysctl).
- Turning off the SLURM daemons (via systemctl, not by uninstalling them).
- Trying different host combinations in the machines file. This changes things in weird ways, which I have not systematically explored. It seems that if pk is first in line, the thing eventually times out, but if pj or pl is first, it hangs forever. The willingness of orted to appear in the client process table seems inconsistent, but it may be that it always runs and I am just not consistently catching it.
- Adding/removing "multi on" from /etc/host.conf.
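Written out in full, that btl/interface experiment was a command line of essentially this form (same hostfile and executable as above):

> mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0 --hostfile machines ./cpi

It hangs just like the plain invocation; adding "-d" to the same line makes it run, as with every other variant.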
None of these have changed the behavior, except, as noted, briefly after backing out the private domain configuration (which involved editing the hosts file, and which motivates the focus on DNS in some of this). All configurations work with "-d", or with "--debug-daemons", or with no arguments inside a slurm allocation, but hang in the "ordinary" case.

I am stumped. I am totally willing to believe I have mucked up the network config, but where? How? What's different about debug mode?
Attachments:
- ompi_info.pk.gb.bz2
- ompi_info.pl.gb.bz2
- ompi_info.pj.gb.bz2
- ompi_info.pi.gb.bz2
- ompi_info_all.pi.bz2