Hi,

sorry for the delay in replying -- pretty busy week :-(


On 28 June 2013 21:54, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Here's what we think we know (I'm using the name "foo" instead of
> your actual hostname because it's easier to type):
>
> 1. When you run "hostname", you get foo.local back

Yes.


> 2. In your /etc/hosts file, foo.local is listed on two lines:
>    127.0.1.1
>    10.1.255.201
>

Yes:

    [rmurri@nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
    127.0.1.1   nh64-5-9.local nh64-5-9
    10.1.255.194    nh64-5-9.local nh64-5-9
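
For reference, a duplicate mapping like this can also be spotted programmatically. The following is only a minimal Python sketch, not anything taken from Open MPI; the function name `addresses_by_name` is hypothetical. It parses hosts-file text and prints every name that is listed under more than one address:

```python
from collections import defaultdict


def addresses_by_name(hosts_text):
    """Map each hostname or alias to the list of addresses it appears under."""
    mapping = defaultdict(list)
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # First field is the address, the rest are names and aliases.
        address, *names = line.split()
        for name in names:
            if address not in mapping[name]:
                mapping[name].append(address)
    return dict(mapping)


# The two lines reported by fgrep above.
sample = """\
127.0.1.1   nh64-5-9.local nh64-5-9
10.1.255.194    nh64-5-9.local nh64-5-9
"""

for name, addresses in addresses_by_name(sample).items():
    if len(addresses) > 1:
        print(name, "->", ", ".join(addresses))
```

On a real node the text would come from reading `/etc/hosts` instead of the inline sample.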


> 3. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo", Open MPI incorrectly thinks that the local machine is
> not foo, and therefore tries to ssh to it (and things go downhill from
> there).
>

Yes.


> 4. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo.local" (you said "FQDN", but never said exactly what you
> meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then
> Open MPI behaves properly.
>

Yes.

FQDN = foo.local.  (This is a compute node in a cluster that does not
have any public IP address nor a DNS entry -- it only has an interface
to the cluster-private network.  I presume this is not relevant to
Open MPI as long as all names are correctly resolved via `/etc/hosts`.)


> Is that all correct?

Yes, all correct.


> We have some followup questions for you:
>
> 1. What happens when you try to resolve "foo"? (e.g., via the "dig" program
> -- "dig foo")

Here's what happens with `dig`:

    [rmurri@nh64-5-9 ~]$ dig nh64-5-9

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9.                  IN      A

    ;; AUTHORITY SECTION:
    .                   3600    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2013070200 1800 900 604800 86400

    ;; Query time: 17 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:47:57 2013
    ;; MSG SIZE  rcvd: 101

However, `getent hosts` gives a different answer:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1       nh64-5-9.local nh64-5-9
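
The two tools look in different places: `dig` only queries the DNS server, while `getent hosts` goes through the system's name service switch, which normally consults `/etc/hosts` as well. A program that resolves names through the C library gets the `getent` view rather than the `dig` one. Below is a minimal Python sketch of such a lookup; the helper name `lookup` and its `resolver` parameter are hypothetical, and nothing here is meant to describe what Open MPI does internally:

```python
import socket


def lookup(name, resolver=socket.gethostbyname_ex):
    """Resolve `name` through the system resolver (hosts file, DNS, ...).

    Returns a (canonical_name, aliases, addresses) tuple,
    or None if the name cannot be resolved.
    """
    try:
        return resolver(name)
    except OSError:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return None


if __name__ == "__main__":
    # Resolve whatever `hostname` reports for the local machine.
    print(lookup(socket.gethostname()))
```

The `resolver` argument exists only so that the function can be exercised without touching the real resolver.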


> 2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")

Here's what happens with `dig`:

    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.local

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

    ;; QUESTION SECTION:
    ;nh64-5-9.local.                    IN      A

    ;; ANSWER SECTION:
    nh64-5-9.local.             259200  IN      A       10.1.255.194

    ;; AUTHORITY SECTION:
    local.                      259200  IN      NS      ns.local.

    ;; ADDITIONAL SECTION:
    ns.local.           259200  IN      A       127.0.0.1

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:48:50 2013
    ;; MSG SIZE  rcvd: 81

Same query resolved via `getent hosts`:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
    127.0.1.1       nh64-5-9.local nh64-5-9
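
Note that the two lookups disagree on the address: `getent hosts` reports 127.0.1.1, while DNS reports 10.1.255.194. The first lies in the loopback range 127.0.0.0/8, so it is only meaningful on the node itself. A small Python sketch using the standard `ipaddress` module tells the two kinds of address apart; it only illustrates the distinction and says nothing about how Open MPI treats these addresses:

```python
import ipaddress

# The two addresses that /etc/hosts lists for the same name.
for text in ("127.0.1.1", "10.1.255.194"):
    address = ipaddress.ip_address(text)
    kind = "loopback" if address.is_loopback else "non-loopback"
    print(text, "is", kind)
```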


> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig
> foo.yourdomain.com")

This yields no result from either `dig` or `getent hosts`, as the node
is only attached to a private network and is not registered in DNS:

    [rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch
    [rmurri@nh64-5-9 ~]$ dig nh64-5-9.uzh.ch

    ; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
    ;; global options:  printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;nh64-5-9.uzh.ch.           IN      A

    ;; AUTHORITY SECTION:
    uzh.ch.                     8921    IN      SOA     ns1.uzh.ch. hostmaster.uzh.ch. 384627811 3600 1800 3600000 10800

    ;; Query time: 0 msec
    ;; SERVER: 10.1.1.1#53(10.1.1.1)
    ;; WHEN: Tue Jul  2 15:50:54 2013
    ;; MSG SIZE  rcvd: 84


> 4. Please apply the attached patch to your Open MPI 1.6.5 build (please note
> that it adds diagnostic output; do *not* put this patch into production)
> and:
>    4a. Run with one of your "bad" cases and send us the output
>    4b. Run with one of your "good" cases and send us the output

Please find the outputs attached.  The exact `mpiexec` invocation and
the machines file are at the beginning of each file.

Note that I allocated 8 slots (on 4 nodes), but only use 2 slots (on 1 node).

Thanks,
Riccardo

Attachment: exam01.out.BAD
Description: Binary data

Attachment: exam01.out.GOOD
Description: Binary data
