Hi all -- I am having a weird problem on a cluster of Raspberry Pi model 2 machines running the Debian/Raspbian build of OpenMPI 1.6.5.
I apologize for the length of this message; I am trying to include all the pertinent details, but of course I can't reliably discriminate between pertinent and irrelevant ones. I am actually a fairly long-time user of OpenMPI in various environments and have never had any trouble with it, but this came up while configuring my "toy" cluster.

The basic issue: a sample MPI executable runs with "mpirun -d" or under a slurm resource allocation, but not directly from the command line -- in the direct command-line case it just hangs, apparently forever. What is even weirder is that, earlier today, while backing out a private domain configuration (see below), it actually started working for a while, but after reboots the problem behavior has returned. It seems overwhelmingly likely that this is some kind of network transport configuration problem, but it eludes me.

More details about the problem: The Pis are all quad-core, and are named pi (head node), pj, pk, and pl (work nodes). They're connected by ethernet. They all have a single non-privileged user, named "pi". There's a directory on my account containing an MPI executable, the "cpi" example from the OpenMPI package, and a list of machines to run on, named "machines", with the following contents:

> pj slots=4
> pk slots=4
> pl slots=4

> mpirun --hostfile machines ./cpi
... hangs forever, but

> mpirun -d --hostfile machines ./cpi
... runs correctly, if somewhat verbosely. Also:

> salloc -n 12 /bin/bash
> mpirun ./cpi
... also runs correctly.

The "salloc" command is a slurm directive to allocate CPU resources and start an interactive shell with a bunch of environment variables set to give mpirun the clues it needs, of course. The work CPUs are allocated correctly on my "work" nodes when salloc is run from the head node.

Config details and diagnostic efforts:

The outputs of the ompi_info runs are attached. The cluster of four Raspberry Pi model 2 computers runs the Jessie distribution of Raspbian, which is essentially Debian. The machines differ a bit: the "head node", creatively named "pi", has an older static network config, with everything specified in /etc/network/interfaces. The "cluster nodes", equally creatively named pj, pk, and pl, all have the newer DHCPCD client daemon configured for static interfaces, via /etc/dhcpcd.conf (NB this is *not* the DHCP *server*; these machines do not use DHCP services). The dhcpcd configuration tool is the new scheme for Raspbian, and has been modified from the "as-shipped" set-up to have a static IPv4 address on eth0 and to remove some IPv6 functionality (router solicitation) that pollutes the log files. MDNS is turned off in /etc/nsswitch.conf; "hosts" are resolved via "files", then "dns". The DNS name servers are statically configured to be 8.8.8.8 and 8.8.4.4. None of the machines involved in the OpenMPI operation are in DNS.

For slightly complicated reasons, all four machines were initially configured as members of a local, non-DNS-resolvable domain named ".gb". This was done because slurm requires e-mail, and my first crack at the e-mail config seemed to require a domain. All the hostnames were statically configured through /etc/hosts. I realized later that I had misunderstood the mail config and have backed out the domain configuration; the machines all have non-dotted names now. This seemed to briefly change the behavior -- it worked several times after this, but then on reboot it stopped working again, making me think I am perhaps losing my mind.
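For concreteness, the name-resolution and static-address setup looks roughly like the sketch below. This is a paraphrase from memory rather than a verbatim copy of the files (the stock Raspbian localhost lines are omitted, and the router address in particular is just illustrative):

> # /etc/hosts (same on all four machines, after backing out the ".gb" domain)
> 127.0.0.1     localhost
> 192.168.0.11  pi
> 192.168.0.12  pj
> 192.168.0.13  pk
> 192.168.0.14  pl

> # "hosts" line in /etc/nsswitch.conf, with mdns removed
> hosts: files dns

> # /etc/dhcpcd.conf fragment on a work node (pj shown; pk and pl differ only in the address)
> interface eth0
> static ip_address=192.168.0.12/24
> static routers=192.168.0.1          # illustrative; whatever the household gateway actually is
> static domain_name_servers=8.8.8.8 8.8.4.4
> noipv6rs                            # suppress the IPv6 router solicitations that were polluting the logs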
The system is *not* running nscd, so some kind of name-service cache is not a good explanation here. The whole cluster is set up for host-based SSH authentication for the default user, "pi". This works for all possible host pairs, tested via:

> ssh -o PreferredAuthentications=hostbased pi@<target>
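Spelled out, "all possible host pairs" means a check equivalent to the loop below; I actually ran the combinations by hand, so this is a sketch of the test rather than a transcript:

> # From the head node: ssh into each node, then from there into every other node,
> # using host-based auth only, and report which host we landed on.
> for src in pi pj pk pl; do
>   for dst in pi pj pk pl; do
>     echo "== $src -> $dst =="
>     ssh -o PreferredAuthentications=hostbased pi@$src \
>         ssh -o PreferredAuthentications=hostbased pi@$dst hostname
>   done
> done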
The network config looks OK. I can ping and ssh every way I want to, and it all works. The Pis are all wired to the same Netgear 10/100 switch, which in turn goes to my household switch, which in turn goes to my cable modem. "ifconfig" shows eth0 and lo configured; "ifconfig -a" does not show any additional unconfigured interfaces. The ifconfig output for pi, pj, pk, and pl, in that order, is:

[pi]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:16:0a:70
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::ba27:ebff:fe16:a70/64 Scope:Global
          inet6 addr: fe80::ba27:ebff:fe16:a70/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:164 errors:0 dropped:23 overruns:0 frame:0
          TX packets:133 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:15733 (15.3 KiB)  TX bytes:13756 (13.4 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:7 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:616 (616.0 B)  TX bytes:616 (616.0 B)

[pj]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:27:4d:17
          inet addr:192.168.0.12  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::4c5c:1329:f1b6:1169/64 Scope:Global
          inet6 addr: fe80::6594:bfad:206:1191/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:237 errors:0 dropped:31 overruns:0 frame:0
          TX packets:131 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:28966 (28.2 KiB)  TX bytes:18841 (18.3 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

[pk]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:f4:ec:03
          inet addr:192.168.0.13  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::ba08:3c9:67c3:a2a1/64 Scope:Link
          inet6 addr: ::8e5a:32a5:ab50:d955/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:299 errors:0 dropped:57 overruns:0 frame:0
          TX packets:138 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:34334 (33.5 KiB)  TX bytes:19909 (19.4 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

[pl]
eth0      Link encap:Ethernet  HWaddr b8:27:eb:da:c6:7f
          inet addr:192.168.0.14  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: ::a8db:7245:458f:2342/64 Scope:Global
          inet6 addr: fe80::3c5f:7092:578a:6c10/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:369 errors:0 dropped:76 overruns:0 frame:0
          TX packets:165 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:38040 (37.1 KiB)  TX bytes:22788 (22.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:136 errors:0 dropped:0 overruns:0 frame:0
          TX packets:136 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11664 (11.3 KiB)  TX bytes:11664 (11.3 KiB)

There are no firewalls on any of the machines. I checked this via "iptables-save", which dumps the system firewall state in a way that allows it to be re-loaded by a script, and whose output is reasonably human-readable. It shows all tables with no rules and a default "accept" policy.

The OpenMPI installation is the current Raspbian version, freshly installed from the repos via "apt-get install openmpi-bin libopenmpi-dev". The OpenMPI version is 1.6.5; the package version is 1.6.5-9.1+rpi1. No configuration options have been modified. There is no ".openmpi" directory in the pi user account on any of the machines.

When I run the problem case, I can sometimes catch the "orted" daemon spinning up on the pj machine. It looks something like this (the port number in the tcp URI varies from run to run):

> 1 S pi 4895 1 0 80 0 - 1945 poll_s 20:23 ? 00:00:00 orted --daemonize -mca ess env -mca orte_ess_jobid 1646002176 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 4 --hnp-uri 1646002176.0;tcp://192.168.0.11:59646 -mca plm rsh

(192.168.0.11 is indeed the correct address of the launching machine, hostname pi. The first "pi" in column 3 is the name of the user who owns the process.)

If I run "telnet 192.168.0.11 59646", it connects. I can send some garbage into the connection, but this does not cause the orted to exit, nor does it immediately blow up the launching process on the launch machine. I have not investigated in detail, but it seems that if you molest the TCP connection in this way, the launching process eventually reports an error, but if you don't, it will hang forever.

One additional oddity: when I run the job in "debug" mode, the clients generate the following dmesg traffic:

> [ 1002.404021] sctp: [Deprecated]: cpi (pid 13770) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> [ 1002.412423] sctp: [Deprecated]: cpi (pid 13772) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.
> [ 1002.427621] sctp: [Deprecated]: cpi (pid 13771) Requested SCTP_SNDRCVINFO event. Use SCTP_RCVINFO through SCTP_RECVRCVINFO option instead.

I have tried:

- Adding or removing the domain suffix from the hosts in the machines file.
- Checking that the clocks on all four machines match.
- Changing the host names in the machines file to invalid names -- this causes the expected failure, reassuring me that the file is being read. Note that the hanging behavior also occurs with the "-H" option in place of a machines file.
- Running with "-mca btl tcp,self -mca btl_tcp_if_include eth0" in case it's having device problems (the full command line is written out after this list). When I do this, I see the arguments echoed on the orted process on pj, but the behavior is the same: it still hangs.
- Removing the "slots=" directive from the machines file.
- Disabling IPv6 (via sysctl).
- Turning off the SLURM daemons (via systemctl, not by uninstalling them).
- Trying different host combinations in the machines file. This changes things in weird ways, which I have not systematically explored. It seems that if pk is first in line, the thing eventually times out, but if pj or pl is first, it hangs forever. The willingness of orted to appear in the client process table seems inconsistent, but it may be that it always runs and I am just not consistently catching it.
- Adding/removing "multi on" from /etc/host.conf.
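Written out in full, that btl/interface experiment was a command line of essentially this form (same hostfile and executable as above):

> mpirun -mca btl tcp,self -mca btl_tcp_if_include eth0 --hostfile machines ./cpi

It hangs just like the plain invocation; adding "-d" to the same line makes it run, as with every other variant.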
None of these have changed the behavior, except, as noted, briefly after backing out the private domain configuration (which involved editing the hosts file, and which motivates the focus on DNS in some of this). All configurations work with "-d", or with "--debug-daemons", or with no arguments inside a slurm allocation, but hang in the "ordinary" case.

I am stumped. I am totally willing to believe I have mucked up the network config, but where? How? What's different about debug mode?
Attachments:
- ompi_info.pk.gb.bz2
- ompi_info.pl.gb.bz2
- ompi_info.pj.gb.bz2
- ompi_info.pi.gb.bz2
- ompi_info_all.pi.bz2