I think I might have fixed this, but I still don't really understand it.

In setting up the RPi machines, I followed a config guide that suggested
switching the SSH service in systemd to "ssh.socket" instead of
"ssh.service". It's supposed to be lighter weight and get you cleaner
shut-downs, and I've used this trick on other machines, without really
knowing the implications.

For completeness in my config audit while trying to figure this out, I
backed this change out, restoring the "ssh.service" link and removing the
"ssh.socket" one. Now MPI works, and I also get clean disconnections at
exit time, so apparently there's no reason at all to use the socket version.
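
For the record, the equivalent switch via systemctl would be roughly the
following (a sketch, assuming the stock Debian unit names):

  sudo systemctl stop ssh.socket
  sudo systemctl disable ssh.socket
  sudo systemctl enable ssh.service
  sudo systemctl start ssh.service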

This behavior has survived two reboot cycles, so I think it's real. I'm not
sure whether it's specific to Raspbian, or common to all architectures of
Debian Jessie, or to all systemd init systems, or what.

         -- A.

On Sat, May 14, 2016 at 3:27 PM Andrew Reid <andrew.ce.r...@gmail.com>
wrote:

> Hi all --
>
> I am having a weird problem on a cluster of Raspberry Pi model 2 machines
> running the Debian/Raspbian version of OpenMPI, 1.6.5.
>
> I apologize for the length of this message; I am trying to include all
> the pertinent details, but of course can't reliably tell the pertinent
> ones from the irrelevant ones.
>
> I am actually a fairly long-time user of OpenMPI in various environments,
> and have never had any trouble with it, but in configuring my "toy"
> cluster, this came up.
>
> The basic issue is, a sample MPI executable runs with "mpirun -d" or under
> "slurm" resource allocation, but not directly from the command line -- in
> the direct command-line case, it just hangs, apparently forever.
>
> What is even weirder is that, earlier today, while backing out a private
> domain configuration (see below), it actually started working for a while,
> but after reboots, the problem behavior has returned.
>
> It seems overwhelmingly likely that this is some kind of network transport
> configuration problem, but it eludes me.
>
>
> More details about the problem:
>
>
> The Pis are all quad-core, and are named pi (head node), pj, pk, and pl
> (work nodes).  They're connected by ethernet.  They all have a single
> non-privileged user, named "pi".
>
> There's a directory on my account containing an MPI executable, the "cpi"
> example from the OpenMPI package, and a list of machines to run on, named
> "machines", with the following contents:
>
> > pj slots=4
> > pk slots=4
> > pl slots=4
>
>
> > mpirun --hostfile machines ./cpi
>
>   ... hangs forever, but
>
> > mpirun -d --hostfile machines ./cpi
>
>   ... runs correctly, if somewhat verbosely.
>
> Also:
>
> > salloc -n 12 /bin/bash
> > mpirun ./cpi
>
>    ... also runs correctly.  The "salloc" command is the slurm directive
> that allocates CPU resources and starts an interactive shell with a bunch
> of environment variables set to give mpirun the clues it needs.  The work
> CPUs are allocated correctly on my "work" nodes when salloc is run from
> the head node.
>
>
>
>   Config details and diagnostic efforts:
>
> The outputs of the ompi_info runs are attached.
>
> The cluster of four Raspberry Pi model 2 computers runs the Jessie
> distribution of Raspbian, which is essentially Debian.  They differ a bit:
> the "head node", creatively named "pi", has an older static network config,
> with everything specified in /etc/network/interfaces.  The "cluster nodes",
> equally creatively named pj, pk, and pl, all have the newer dhcpcd client
> daemon configured for static interfaces, via /etc/dhcpcd.conf (NB this is
> *not* the DHCP *server*; these machines do not use DHCP services).  The
> dhcpcd configuration tool is the new scheme for Raspbian, and has been
> modified from the "as-shipped" set-up to have a static IPv4 address on
> eth0, and to remove some IPv6 functionality (router solicitation) that
> pollutes the log files.
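>
> For illustration, the static stanza in /etc/dhcpcd.conf looks roughly like
> this (a sketch, not a copy-paste of the actual file: the router address is
> a stand-in, the node address shown is pj's, and the name-server line
> assumes the DNS servers are set here rather than directly in resolv.conf):
>
> > interface eth0
> > static ip_address=192.168.0.12/24
> > static routers=192.168.0.1
> > static domain_name_servers=8.8.8.8 8.8.4.4
> > noipv6rs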
>
>
> mDNS is turned off in /etc/nsswitch.conf; "hosts" are resolved via
> "files", then "dns".  The DNS name servers are statically configured to be
> 8.8.8.8 and 8.8.4.4.  None of the machines involved in the OpenMPI
> operation are in DNS.
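>
> Concretely, the hosts line in /etc/nsswitch.conf on all four machines is
> along the lines of:
>
> > hosts:          files dns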
>
> For slightly complicated reasons, all four machines were initially
> configured as members of a local, non-DNS-resolvable domain, named ".gb".
> This was done because slurm requires e-mail, and my first crack at e-mail
> config seemed to require a domain.  All the hostnames were statically
> configured through /etc/hosts.  I realized later that I had misunderstood
> the mail config, and have backed out the domain configuration; the machines
> all have non-dotted names now.
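>
> After backing the domain out, the relevant /etc/hosts entries on each node
> are essentially (addresses as in the ifconfig output below):
>
> > 192.168.0.11    pi
> > 192.168.0.12    pj
> > 192.168.0.13    pk
> > 192.168.0.14    pl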
>
> Backing out the domain seemed to change the behavior briefly: it worked
> several times after this, but then, on reboot, stopped working again,
> making me think I am perhaps losing my mind.
>
> The system is *not* running nscd, so some kind of name-service cache is
> not a good explanation here.
>
>
> The whole cluster is set up for host-based SSH authentication for the
> default user, "pi".  This works for all possible host pairs, tested via:
>
> > ssh -o PreferredAuthentications=hostbased pi@<target>
>
> The network config looks OK.  I can ping and ssh every way I want to, and
> it all works.  The pis are all wired to the same Netgear 10/100 switch,
> which in turn goes to my household switch, which in turn goes to my cable
> modem.  "ifconfig" shows eth0 and lo configured. "ifconfig -a" does not
> show any additional unconfigured interfaces.
>
> Ifconfig output follows, in order, for pi, pj, pk, and pl:
>
>
>
> eth0      Link encap:Ethernet  HWaddr b8:27:eb:16:0a:70
>           inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: ::ba27:ebff:fe16:a70/64 Scope:Global
>           inet6 addr: fe80::ba27:ebff:fe16:a70/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:164 errors:0 dropped:23 overruns:0 frame:0
>           TX packets:133 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:15733 (15.3 KiB)  TX bytes:13756 (13.4 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:7 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:616 (616.0 B)  TX bytes:616 (616.0 B)
>
>
>
>
> eth0      Link encap:Ethernet  HWaddr b8:27:eb:27:4d:17
>           inet addr:192.168.0.12  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: ::4c5c:1329:f1b6:1169/64 Scope:Global
>           inet6 addr: fe80::6594:bfad:206:1191/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:237 errors:0 dropped:31 overruns:0 frame:0
>           TX packets:131 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:28966 (28.2 KiB)  TX bytes:18841 (18.3 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
>
>
> eth0      Link encap:Ethernet  HWaddr b8:27:eb:f4:ec:03
>           inet addr:192.168.0.13  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: fe80::ba08:3c9:67c3:a2a1/64 Scope:Link
>           inet6 addr: ::8e5a:32a5:ab50:d955/64 Scope:Global
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:299 errors:0 dropped:57 overruns:0 frame:0
>           TX packets:138 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:34334 (33.5 KiB)  TX bytes:19909 (19.4 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
>
>
> eth0      Link encap:Ethernet  HWaddr b8:27:eb:da:c6:7f
>           inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
>           inet6 addr: ::a8db:7245:458f:2342/64 Scope:Global
>           inet6 addr: fe80::3c5f:7092:578a:6c10/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:369 errors:0 dropped:76 overruns:0 frame:0
>           TX packets:165 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:38040 (37.1 KiB)  TX bytes:22788 (22.2 KiB)
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:65536  Metric:1
>           RX packets:136 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)
>
>
>
>
> There are no firewalls on any of the machines.  I checked this via
> "iptables-save", which dumps the system firewall state in a way that allows
> it to be re-loaded by a script, and the output is reasonably
> human-readable.  It shows all tables with no rules and a default "accept"
> state.
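>
> For reference, the iptables-save output on each node is essentially the
> empty-ruleset form, something like (counters and the other tables elided):
>
> > *filter
> > :INPUT ACCEPT [0:0]
> > :FORWARD ACCEPT [0:0]
> > :OUTPUT ACCEPT [0:0]
> > COMMIT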
>
>
> The OpenMPI installation is the current Raspbian version, freshly
> installed (via "apt-get install openmpi-bin libopenmpi-dev") from the
> repos.  OpenMPI itself is version 1.6.5; the package version is
> 1.6.5-9.1+rpi1.  No configuration options have been modified.
>
> There is no ".openmpi" directory on the pi user account on any of the
> machines.
>
> When I run the problem case, I can sometimes catch the "orted" daemon
> spinning up on the pj machine.  It looks something like this (the port
> number on the tcp uri varies from run to run):
>
> > 1 S pi        4895     1  0  80   0 -  1945 poll_s 20:23 ?
>  00:00:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1646002176
> -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri 1646002176.0;tcp://
> 192.168.0.11:59646 -mca plm rsh
>
> (192.168.0.11 is indeed the correct address of the launching machine,
> hostname pi.  The first "pi" in column 3 is the name of the user who owns
> the process.)
>
> If I run "telnet 192.168.0.11 59646", it connects.  I can send some
> garbage into the connection, but this does not cause the orted to exit, nor
> does it immedately blow up the launching process on the launch machine.  I
> have not investigated in detail, but it seems that if you molest the TCP
> connection in this way, the launching process eventually reports an error,
> but if you don't, it will hang forever.
>
>
> One additional oddity: when I run the job in "debug" mode, the clients
> generate the following dmesg traffic:
>
> > [ 1002.404021] sctp: [Deprecated]: cpi (pid 13770) Requested SCTP_SNDRCVINFO event.
> > Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> > [ 1002.412423] sctp: [Deprecated]: cpi (pid 13772) Requested SCTP_SNDRCVINFO event.
> > Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> > [ 1002.427621] sctp: [Deprecated]: cpi (pid 13771) Requested SCTP_SNDRCVINFO event.
> > Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
>
>
>
>   I have tried:
>
>  - Adding or removing the domain suffix from the hosts in the machines
> file.
>  - Checking that the clocks on all four machines match.
>  - Changing the host names in the machines file to invalid names -- this
> causes the expected failure, reassuring me that the file is being read.
> Note that the hanging behavior also occurs with the "-H" option in place of
> a machine file.
>  - Running with "-mca btl tcp,self -mca btl_tcp_if_include eth0" in case
> it's having device problems (the exact invocation is sketched after this
> list).  When I do this, I see these arguments echoed on the orted process
> on pj, but the behavior is the same: it still hangs.
>  - Removing the "slots=" directive from the machines file.
>  - Disabling IPv6 (via sysctl; also sketched after this list).
>  - Turning off the SLURM daemons (via systemctl, not by uninstalling them.)
>  - Different host combinations in the machines file.  This changes things
> in weird ways, which I have not systematically explored.
>    It seems that if pk is the first in line, then the thing eventually
> times out, but if pj or pl is first, it hangs forever.  The willingness of
> orted to appear in the client process table seems inconsistent, though it
> may be that it always runs and I am just not consistently catching it.
>  - Adding/removing "multi on" from /etc/host.conf.
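>
> For concreteness, the explicit-BTL run and the IPv6 sysctl mentioned above
> were along these lines (a sketch, not a transcript):
>
> > mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0 --hostfile machines ./cpi
> > sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1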
>
> None of these have changed the behavior, except, as noted, briefly after
> backing out the private domain configuration (which involved editing the
> hosts file, which motivates the focus on DNS in some of this).
>
>
> All configurations work with "-d", or with "--debug-daemons", or with no
> arguments inside a slurm allocation, but hang in the "ordinary" case.
>
> I am stumped.  I am totally willing to believe I have mucked up the
> network config, but where? How? What's different about debug mode?
>
>
