Hi, sorry for the delay in replying -- pretty busy week :-(
On 28 June 2013 21:54, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> Here's what we think we know (I'm using the name "foo" instead of
> your actual hostname because it's easier to type):
>
> 1. When you run "hostname", you get foo.local back

Yes.

> 2. In your /etc/hosts file, foo.local is listed on two lines:
>    127.0.1.1
>    10.1.255.201

Yes:

[rmurri@nh64-5-9 ~]$ fgrep nh64-5-9 /etc/hosts
127.0.1.1     nh64-5-9.local nh64-5-9
10.1.255.194  nh64-5-9.local nh64-5-9

> 3. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo", Open MPI incorrectly thinks that the local machine is
> not foo, and therefore tries to ssh to it (and things go downhill from
> there).

Yes. (See also the sketch in the P.S. at the end of this message.)

> 4. When you login to the "foo" server and execute mpirun with a hostfile
> that contains "foo.local" (you said "FQDN", but never said exactly what you
> meant by that -- I'm assuming "foo.local", not "foo.yourdomain.com"), then
> Open MPI behaves properly.

Yes. FQDN = foo.local.

(This is a compute node in a cluster that has neither a public IP address
nor a DNS entry -- it only has an interface to the cluster-private network.
I presume this is not relevant to Open MPI as long as all names are
correctly resolved via `/etc/hosts`.)

> Is that all correct?

Yes, all correct.

> We have some followup questions for you:
>
> 1. What happens when you try to resolve "foo"? (e.g., via the "dig"
> program -- "dig foo")

Here's what happens with `dig`:

[rmurri@nh64-5-9 ~]$ dig nh64-5-9

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 4373
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;nh64-5-9.                      IN      A

;; AUTHORITY SECTION:
.                       3600    IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2013070200 1800 900 604800 86400

;; Query time: 17 msec
;; SERVER: 10.1.1.1#53(10.1.1.1)
;; WHEN: Tue Jul  2 15:47:57 2013
;; MSG SIZE  rcvd: 101

However, `getent hosts` has a different reply -- presumably because `dig`
queries the DNS directly, while `getent hosts` follows the NSS lookup order
from `/etc/nsswitch.conf` (typically "files dns") and therefore finds the
`/etc/hosts` entry first:

[rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
127.0.1.1       nh64-5-9.local nh64-5-9

> 2. What happens when you try to resolve "foo.local"? (e.g., "dig foo.local")

Here's what happens with `dig`:

[rmurri@nh64-5-9 ~]$ dig nh64-5-9.local

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.local
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62092
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;nh64-5-9.local.                IN      A

;; ANSWER SECTION:
nh64-5-9.local.         259200  IN      A       10.1.255.194

;; AUTHORITY SECTION:
local.                  259200  IN      NS      ns.local.

;; ADDITIONAL SECTION:
ns.local.               259200  IN      A       127.0.0.1

;; Query time: 0 msec
;; SERVER: 10.1.1.1#53(10.1.1.1)
;; WHEN: Tue Jul  2 15:48:50 2013
;; MSG SIZE  rcvd: 81

Same query resolved via `getent hosts`:

[rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9
127.0.1.1       nh64-5-9.local nh64-5-9

> 3. What happens when you try to resolve "foo.yourdomain.com"? (e.g., "dig
> foo.yourdomain.com")

This yields an empty reply from both `dig` (status NXDOMAIN) and
`getent hosts`, as the node is only attached to a private network and is
not registered in the public DNS:

[rmurri@nh64-5-9 ~]$ getent hosts nh64-5-9.uzh.ch
[rmurri@nh64-5-9 ~]$ dig nh64-5-9.uzh.ch

; <<>> DiG 9.3.6-P1-RedHat-9.3.6-4.P1.el5 <<>> nh64-5-9.uzh.ch
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 61801
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;nh64-5-9.uzh.ch.               IN      A

;; AUTHORITY SECTION:
uzh.ch.                 8921    IN      SOA     ns1.uzh.ch. hostmaster.uzh.ch. 384627811 3600 1800 3600000 10800

;; Query time: 0 msec
;; SERVER: 10.1.1.1#53(10.1.1.1)
;; WHEN: Tue Jul  2 15:50:54 2013
;; MSG SIZE  rcvd: 84
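(Aside: the `dig` vs. `getent` difference above is easy to reproduce
programmatically. Below is a minimal C sketch -- my own illustration, not
Open MPI code -- that prints whatever the standard `getaddrinfo()` resolver
returns for a name; since `getaddrinfo()` honours the NSS lookup order, it
sees the `/etc/hosts` entries that `dig` bypasses.)

    /* resolve.c -- print the address(es) the standard resolver returns
     * for a given name.  Compile with:  gcc -o resolve resolve.c
     */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        struct addrinfo hints, *res, *p;
        char buf[INET6_ADDRSTRLEN];
        int rc;

        if (argc != 2) {
            fprintf(stderr, "usage: %s HOSTNAME\n", argv[0]);
            return 1;
        }

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;
        hints.ai_socktype = SOCK_STREAM;

        /* Unlike `dig`, getaddrinfo() consults the sources listed in
         * /etc/nsswitch.conf in order (usually "files dns"), so a match
         * in /etc/hosts shadows whatever the DNS would say. */
        rc = getaddrinfo(argv[1], NULL, &hints, &res);
        if (rc != 0) {
            fprintf(stderr, "%s: %s\n", argv[1], gai_strerror(rc));
            return 1;
        }
        for (p = res; p != NULL; p = p->ai_next) {
            const void *addr =
                (p->ai_family == AF_INET)
                ? (const void *)&((struct sockaddr_in *)p->ai_addr)->sin_addr
                : (const void *)&((struct sockaddr_in6 *)p->ai_addr)->sin6_addr;
            printf("%s -> %s\n", argv[1],
                   inet_ntop(p->ai_family, addr, buf, sizeof(buf)));
        }
        freeaddrinfo(res);
        return 0;
    }

On this node, `./resolve nh64-5-9` should succeed with the 127.0.1.1 entry
from `/etc/hosts`, even though `dig` gets NXDOMAIN for the very same name.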
> 4. Please apply the attached patch to your Open MPI 1.6.5 build (please note
> that it adds diagnostic output; do *not* put this patch into production)
> and:
> 4a. Run with one of your "bad" cases and send us the output
> 4b. Run with one of your "good" cases and send us the output

Please find the outputs attached. The exact `mpiexec` invocation and the
machines file are at the beginning of each file. Note that I allocated 8
slots (on 4 nodes), but only use 2 slots (on 1 node).

Thanks,
Riccardo
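P.S. For anyone who wants to poke at the "is this host me?" question without
rebuilding Open MPI: below is a small C sketch of one *plausible* locality
test -- resolve the name from the hostfile, then compare the result against
the addresses configured on the local interfaces. This is only my guess at
the kind of check involved (the patched Open MPI output should show the real
comparison), but it illustrates why 127.0.1.1 is troublesome: it lies inside
the loopback network, yet it is usually not assigned to any interface ("lo"
carries 127.0.0.1), so an exact-match test concludes the node is remote.

    /* islocal.c -- does NAME resolve to an address configured on one of
     * this machine's network interfaces?  (Purely illustrative: NOT the
     * actual Open MPI test, just one plausible "local vs. remote" check.)
     * Compile with:  gcc -o islocal islocal.c
     */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <ifaddrs.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        struct addrinfo hints, *res;
        struct ifaddrs *ifap, *ifa;
        struct in_addr target, local;

        if (argc != 2) {
            fprintf(stderr, "usage: %s HOSTNAME\n", argv[0]);
            return 2;
        }

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_INET;        /* keep the sketch IPv4-only */
        hints.ai_socktype = SOCK_STREAM;
        if (getaddrinfo(argv[1], NULL, &hints, &res) != 0) {
            fprintf(stderr, "cannot resolve %s\n", argv[1]);
            return 2;
        }
        target = ((struct sockaddr_in *)res->ai_addr)->sin_addr;
        printf("%s resolves to %s\n", argv[1], inet_ntoa(target));
        freeaddrinfo(res);

        if (getifaddrs(&ifap) != 0) {
            perror("getifaddrs");
            return 2;
        }
        for (ifa = ifap; ifa != NULL; ifa = ifa->ifa_next) {
            if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
                continue;
            local = ((struct sockaddr_in *)ifa->ifa_addr)->sin_addr;
            if (local.s_addr == target.s_addr) {
                printf("local: %s is configured on interface %s\n",
                       argv[1], ifa->ifa_name);
                freeifaddrs(ifap);
                return 0;
            }
        }
        freeifaddrs(ifap);
        /* 127.0.1.1 lands here: it is inside the loopback *network*, but
         * it is not the address configured on "lo" (that is 127.0.0.1),
         * so an exact-match test declares the node remote. */
        printf("remote: no local interface has that address\n");
        return 1;
    }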
Attachments:
  exam01.out.BAD   (binary data)
  exam01.out.GOOD  (binary data)