Hi Jeff,

Thank you for your suggestions. I followed your steps verbatim. Unfortunately, there is a bit of a problem. Here's what I did:
  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-184-73-62-72.compute-1.amazonaws.com
  The authenticity of host 'ec2-184-73-62-72.compute-1.amazonaws.com (184.73.62.72)' can't be established.
  RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)

  [tsakai@ip-10-194-215-32 ~]$ # this is instance A
  [tsakai@ip-10-194-215-32 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   ip-10-194-215-32.ec2.internal
  Address: 10.194.215.32

  [tsakai@ip-10-194-215-32 ~]$ rm -rf $HOME/.ssh
  [tsakai@ip-10-194-215-32 ~]$ ssh-keygen -t dsa
  Generating public/private dsa key pair.
  Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
  Created directory '/home/tsakai/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/tsakai/.ssh/id_dsa.
  Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
  The key fingerprint is:
  54:eb:bd:e7:f2:52:24:49:94:7b:7a:9e:e4:b7:0b:04 tsakai@ip-10-194-215-32
  The key's randomart image is:
  +--[ DSA 1024]----+
  |  ....           |
  |   . .o          |
  |  . .E o         |
  | . . .= o        |
  |    S . .*       |
  |        o.+      |
  |        .B..     |
  |         oo= .   |
  |          +o+o   |
  +-----------------+

  [tsakai@ip-10-194-215-32 ~]$ cd $HOME/.ssh
  [tsakai@ip-10-194-215-32 .ssh]$ ll
  total 8
  -rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub
  [tsakai@ip-10-194-215-32 .ssh]$ cp id_dsa.pub authorized_keys
  [tsakai@ip-10-194-215-32 .ssh]$ chmod 644 authorized_keys
  [tsakai@ip-10-194-215-32 .ssh]$ ll
  total 12
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:16 authorized_keys
  -rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub

Now the next step is to go to instance B via ssh. This doesn't work for me, because at this point the id_dsa on instance A has no counterpart (id_dsa.pub) in the authorized_keys kept on instance B. Here is what happens:

  [tsakai@ip-10-194-215-32 .ssh]$ ssh ip-10-196-61-219.ec2.internal
  The authenticity of host 'ip-10-196-61-219.ec2.internal (10.196.61.219)' can't be established.
  RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-196-61-219.ec2.internal,10.196.61.219' (RSA) to the list of known hosts.
  Permission denied (publickey).

I got onto instance B directly from my local machine and did the same as what I did on A:

  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-67-202-49-161.compute-1.amazonaws.com
  The authenticity of host 'ec2-67-202-49-161.compute-1.amazonaws.com (67.202.49.161)' can't be established.
  RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
  :-)

  [tsakai@ip-10-196-61-219 ~]$ # this is instance B
  [tsakai@ip-10-196-61-219 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   ip-10-196-61-219.ec2.internal
  Address: 10.196.61.219

  [tsakai@ip-10-196-61-219 ~]$ rm -rf $HOME/.ssh
  [tsakai@ip-10-196-61-219 ~]$ ssh-keygen -t dsa
  Generating public/private dsa key pair.
  Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
  Created directory '/home/tsakai/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/tsakai/.ssh/id_dsa.
  Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
  The key fingerprint is:
  dd:c1:73:97:50:eb:d1:ad:84:94:0f:98:51:b2:8d:4a tsakai@ip-10-196-61-219
  The key's randomart image is:
  +--[ DSA 1024]----+
  |   o=oo..        |
  |   oBo.. =       |
  |  E o *oo++      |
  |  . o . =oo.     |
  |   S . . ..      |
  |                 |
  |                 |
  |                 |
  |                 |
  +-----------------+

Now comes another failure from instance B:

  [tsakai@ip-10-196-61-219 ~]$ scp @ip-10-194-215-32.ec2.internal:.ssh/id_rsa\* .
  The authenticity of host 'ip-10-194-215-32.ec2.internal (10.194.215.32)' can't be established.
  RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
  Are you sure you want to continue connecting (yes/no)? Host key verification failed.
  [tsakai@ip-10-196-61-219 ~]$

I have seen these problems many times over the last few days, and I have worked them out. The failure occurs because, in order to authenticate silently, ssh wants to see an identity for the destination machine in the known_hosts file in the .ssh directory. One way to get around this is to use the -i flag of ssh (which requires the private key) once. If that is done in both directions, then ssh can authenticate silently.

Essentially, I had done exactly the same thing as your instructions indicate; only I didn't use dsa, I used rsa. I don't think that is a roadblock, is it?
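On the known_hosts point: the entry ssh wants can also be created without an interactive first login. One way is `ssh-keyscan <host> >> ~/.ssh/known_hosts`; and the entry format itself is simple enough to build by hand from a copy of the remote host's public key file, as this sketch shows (the host name, IP, and key file below are stand-ins for illustration):

```shell
# Sketch: pre-populate known_hosts so the first ssh/scp between instances
# does not stop at the interactive "yes/no" host-key prompt.
# A known_hosts entry is simply: "hostname,ip keytype base64key"
make_known_hosts_line() {
    host=$1; ip=$2; pubkey_file=$3
    # keep only the key type and the key itself, dropping any trailing comment
    printf '%s,%s %s\n' "$host" "$ip" "$(cut -d' ' -f1-2 "$pubkey_file")"
}

# demonstrate with a stand-in for the remote /etc/ssh/ssh_host_rsa_key.pub
printf 'ssh-rsa AAAAB3NzaExampleKey root@remote\n' > sample_host_key.pub
make_known_hosts_line ip-10-194-215-32.ec2.internal 10.194.215.32 sample_host_key.pub
```

Appending the produced line to ~/.ssh/known_hosts on the client has the same effect as answering "yes" once at the prompt.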
  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-50-17-48-206.compute-1.amazonaws.com
  The authenticity of host 'ec2-50-17-48-206.compute-1.amazonaws.com (50.17.48.206)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)

  [tsakai@ip-10-110-10-137 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   ip-10-110-10-137.ec2.internal
  Address: 10.110.10.137

  [tsakai@ip-10-110-10-137 ~]$ cd .ssh
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 12
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ # there is no known_hosts file, which we need.
  [tsakai@ip-10-110-10-137 .ssh]$ # to create it, we need to hide config
  [tsakai@ip-10-110-10-137 .ssh]$ mv config __config
  [tsakai@ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai@ip-10-110-10-137.ec2.internal
  The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137' (RSA) to the list of known hosts.
  Last login: Fri Feb 18 04:20:29 2011 from 63.193.205.1

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
  :-)

  [tsakai@ip-10-110-10-137 ~]$ cd .ssh
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ # I ssh'ed to the same instance
  [tsakai@ip-10-110-10-137 .ssh]$ who
  tsakai   pts/0        2011-02-18 04:20 (63.193.205.1)
  tsakai   pts/1        2011-02-18 04:22 (ip-10-110-10-137.ec2.internal)
  [tsakai@ip-10-110-10-137 .ssh]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
  [tsakai@ip-10-110-10-137 .ssh]$ who
  tsakai   pts/0        2011-02-18 04:20 (63.193.205.1)
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ # known_hosts file got made
  [tsakai@ip-10-110-10-137 .ssh]$ # what's in it?
  [tsakai@ip-10-110-10-137 .ssh]$ wc known_hosts
    1   3 425 known_hosts
  [tsakai@ip-10-110-10-137 .ssh]$ cat known_hosts
  ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWOiIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7ALT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRqPzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
  [tsakai@ip-10-110-10-137 .ssh]$ # now go to instance B
  [tsakai@ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai@domU-12-31-39-16-C6-70.compute-1.internal
  The authenticity of host 'domu-12-31-39-16-c6-70.compute-1.internal (10.96.197.154)' can't be established.
  RSA key fingerprint is 2e:8b:83:39:02:9f:48:d6:fd:49:2f:82:96:0b:84:35.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'domu-12-31-39-16-c6-70.compute-1.internal,10.96.197.154' (RSA) to the list of known hosts.
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
  :-)

  [tsakai@domU-12-31-39-16-C6-70 ~]$ # I am on instance B
  [tsakai@domU-12-31-39-16-C6-70 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   domU-12-31-39-16-C6-70.compute-1.internal
  Address: 10.96.197.154

  [tsakai@domU-12-31-39-16-C6-70 ~]$ cd .ssh
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 12
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # the same trick
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ mv config __config
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ssh -i tsakai tsakai@ip-10-110-10-137.ec2.internal
  The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137' (RSA) to the list of known hosts.
  Last login: Fri Feb 18 04:22:24 2011 from ip-10-110-10-137.ec2.internal

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)

  [tsakai@ip-10-110-10-137 ~]$ # I am on instance A
  [tsakai@ip-10-110-10-137 ~]$ # go back to instance B
  [tsakai@ip-10-110-10-137 ~]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # known_hosts got made
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ cat known_hosts
  ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWOiIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7ALT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRqPzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ mv __config config
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # go back to instance A
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ exit
  logout
  Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ mv __config config
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ # now show I can ssh without the -i flag, silently
  [tsakai@ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal
  Last login: Fri Feb 18 04:25:56 2011 from ip-10-110-10-137.ec2.internal

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)

  [tsakai@domU-12-31-39-16-C6-70 ~]$ # and to instance A
  [tsakai@domU-12-31-39-16-C6-70 ~]$ ssh ip-10-110-10-137.ec2.internal
  Last login: Fri Feb 18 04:27:36 2011 from domu-12-31-39-16-c6-70.compute-1.internal

         __|  __|_  )   Amazon Linux AMI
         _|  (     /    Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)

  [tsakai@ip-10-110-10-137 ~]$ # OK
  [tsakai@ip-10-110-10-137 ~]$ # go back to instance B
  [tsakai@ip-10-110-10-137 ~]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
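Incidentally, the rename dance with config can be avoided: ssh's -o options override the per-user config file for a single invocation, so BatchMode can be switched off just for the first, host-key-collecting login. A sketch using the host names from the transcript above (this is untested here, but the options are standard OpenSSH ones):

```shell
# One-off interactive login: -o overrides ~/.ssh/config, so "BatchMode yes"
# stays in place for all normal, silent logins afterwards.
ssh -o BatchMode=no -o StrictHostKeyChecking=ask \
    -i ~/.ssh/tsakai tsakai@domU-12-31-39-16-C6-70.compute-1.internal true
```

After this one command has added the host key to known_hosts, plain `ssh domU-12-31-39-16-C6-70.compute-1.internal` works silently, exactly as in the session above.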
  [tsakai@domU-12-31-39-16-C6-70 ~]$ env | grep -i path
  LD_LIBRARY_PATH=:/usr/local/lib
  PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
  AWS_PATH=/opt/aws
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # check firewall
  [tsakai@domU-12-31-39-16-C6-70 ~]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@domU-12-31-39-16-C6-70 ~]$ ll -t /usr/local/lib | head
  total 4100
  -rw-r--r-- 1 root root 385864 Feb 16 01:33 libvt.a
  -rw-r--r-- 1 root root 154950 Feb 16 01:33 libvt.fmpi.a
  -rw-r--r-- 1 root root 567848 Feb 16 01:33 libvt.mpi.a
  -rw-r--r-- 1 root root 462838 Feb 16 01:33 libvt.omp.a
  -rw-r--r-- 1 root root 643482 Feb 16 01:33 libvt.ompi.a
  -rw-r--r-- 1 root root 231278 Feb 16 01:33 libotf.a
  -rwxr-xr-x 1 root root    891 Feb 16 01:33 libotf.la
  drwxr-xr-x 2 root root   4096 Feb 16 01:33 openmpi
  -rwxr-xr-x 1 root root    991 Feb 16 01:33 libmca_common_sm.la
  [tsakai@domU-12-31-39-16-C6-70 ~]$ sudo find / -name mpirun
  /usr/local/bin/mpirun
  [tsakai@domU-12-31-39-16-C6-70 ~]$ cat .ssh/config
  Host *
          IdentityFile /home/tsakai/.ssh/tsakai
          IdentitiesOnly yes
          BatchMode yes
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # try mpirun without the other machine
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host `hostname` -np 2 hostname
  domU-12-31-39-16-C6-70
  domU-12-31-39-16-C6-70
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host domU-12-31-39-16-C6-70.compute-1.internal -np 2 hostname
  domU-12-31-39-16-C6-70
  domU-12-31-39-16-C6-70
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # now add an extra host
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host \
  > domU-12-31-39-16-C6-70.compute-1.internal,ip-10-110-10-137.ec2.internal \
  > -np 2 \
  > hostname
  # it is hanging
  # let me issue control-c
  ^Cmpirun: killing job...
  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to the
  "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
          ip-10-110-10-137.ec2.internal - daemon did not report back when launched
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # go back to machine A
  [tsakai@domU-12-31-39-16-C6-70 ~]$ exit
  logout
  Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.

  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@ip-10-110-10-137 .ssh]$ sudo find / -name mpirun
  /usr/local/bin/mpirun
  [tsakai@ip-10-110-10-137 .ssh]$ env | grep -i path
  LD_LIBRARY_PATH=:/usr/local/lib
  PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
  AWS_PATH=/opt/aws
  [tsakai@ip-10-110-10-137 .ssh]$ cat config
  Host *
          IdentityFile /home/tsakai/.ssh/tsakai
          IdentitiesOnly yes
          BatchMode yes
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host `hostname` -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host ip-10-110-10-137.ec2.internal -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$ # add the other instance
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host \
  > ip-10-110-10-137.ec2.internal,domU-12-31-39-16-C6-70.compute-1.internal \
  > -np 2 \
  > hostname
  # again hanging; issuing control-c
  ^Cmpirun: killing job...
  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to the
  "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
          domU-12-31-39-16-C6-70.compute-1.internal - daemon did not report back when launched
  [tsakai@ip-10-110-10-137 .ssh]$ # try with IP
  [tsakai@ip-10-110-10-137 .ssh]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   ip-10-110-10-137.ec2.internal
  Address: 10.110.10.137

  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host 10.110.10.137 -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal 'nslookup domU-12-31-39-16-C6-70'
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   domU-12-31-39-16-C6-70.compute-1.internal
  Address: 10.96.197.154

  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host \
  > 10.110.10.137,10.96.197.154 \
  > -np 2 hostname
  # hanging also; get out by control-c
  ^Cmpirun: killing job...
  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to the
  "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
          10.96.197.154 - daemon did not report back when launched
  [tsakai@ip-10-110-10-137 .ssh]$ # I can't figure out what more to do....
  [tsakai@ip-10-110-10-137 .ssh]$ exit
  logout
  [tsakai@vixen ec2]$

Do you see anything incorrect in what I am doing? Thank you.
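One pattern in the transcripts above may be worth noting: ssh between the instances works, single-host mpirun works, and the failure is always "daemon did not report back when launched". That suggests orted is launched over ssh but its TCP connection back to mpirun never arrives, which on EC2 often points at the security group rather than iptables (mpirun and orted pick callback ports dynamically, and EC2 filters inter-instance traffic even with the OS firewall off). A rough checklist, sketched with the host names from the transcript; the MCA parameter names in step 4 are the commonly cited ones but should be verified against the local 1.4.3 build with ompi_info before relying on them:

```shell
# 1. Confirm that non-interactive ssh finds the Open MPI helper daemon
#    (this is the launch path mpirun uses under the covers):
ssh domU-12-31-39-16-C6-70.compute-1.internal which orted

# 2. While a two-host mpirun is hanging, check whether orted is alive on
#    the far side (if it is, the launch worked and only the TCP callback
#    is blocked):
ssh domU-12-31-39-16-C6-70.compute-1.internal ps -ef | grep orted

# 3. List the TCP-related parameters this build actually supports:
ompi_info --param btl tcp | grep -i port
ompi_info --param oob tcp | grep -i port

# 4. If port-range parameters exist in this build, pin the ports and open
#    just that range (plus ssh) between the instances in the security
#    group; simpler still, put both instances in the same security group
#    and allow all traffic from that group to itself:
mpirun --mca btl_tcp_port_min_v4 10000 --mca btl_tcp_port_range_v4 64 \
       --mca oob_tcp_port_min_v4 10100 --mca oob_tcp_port_range_v4 64 \
       --host ip-10-110-10-137.ec2.internal,domU-12-31-39-16-C6-70.compute-1.internal \
       -np 2 hostname
```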
Regards,

Tena

On 2/17/11 6:52 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

> On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:
>
>> For now, may I point out something I noticed out of the
>> DEBUG3 output last night?
>>
>> I found this line:
>>
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>
> What this means is that ssh sent the "orted ..." command to the remote side.
>
> As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" --
> it's a helper thingy that mpirun launches on the remote nodes before launching
> your actual application. All those parameters (from --daemonize through
> "...:56064") are options for orted.
>
> All of that gorp is considered internal to Open MPI -- most people never see
> that stuff.
>
>> Followed by:
>>
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>
> This is just more status information about the ssh connection; it doesn't
> really have any direct relation to Open MPI.
>
> I don't know offhand if ssh displays the ack that a command successfully ran.
> If you're not convinced that it did, then login to the other node while the
> command is hung and run a ps to see if the orted is actually running or not.
> I *suspect* that it is running, but that it's just hung for some reason.
>
> -----
>
> Here's some suggestions to try debugging:
>
> On your new linux AMI instances (some of this may be redundant with what you
> did already):
>
> - ensure that firewalling is disabled on all instances
>
> - ensure that your .bashrc (or whatever startup file is relevant to your
> shell) is set to prefix PATH and LD_LIBRARY_PATH with your Open MPI
> installation. Ensure that you *PREFIX* these variables to guarantee that you
> don't get interference from already-installed versions of Open MPI (e.g., if
> Open MPI is installed by default on your AMI and you weren't aware of it)
>
> - setup a simple, per-user SSH key, perhaps something like this:
>
> A$ rm -rf $HOME/.ssh
>    (remove what you had before; let's just start over)
>
> A$ ssh-keygen -t dsa
>    (hit enter to accept all defaults and set no passphrase)
>
> A$ cd $HOME/.ssh
> A$ cp id_dsa.pub authorized_keys
> A$ chmod 644 authorized_keys
> A$ ssh othernode
>    (login to node B)
>
> B$ ssh-keygen -t dsa
>    (hit enter to accept all defaults and set no passphrase; just to create
>    $HOME/.ssh with the right permissions, etc.)
>
> B$ scp @firstnode:.ssh/id_dsa\* .
>    (enter your password on A -- we're overwriting all the files here)
>
> B$ cp id_dsa.pub authorized_keys
> B$ chmod 644 authorized_keys
>
> Now you should be able to ssh from one node to the other without passwords:
>
> A$ ssh othernode hostname
> B
> A$
>
> and
>
> B$ ssh firstnode hostname
> A
> B$
>
> Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to
> ensure that non-interactive logins work properly. That's what Open MPI will
> use under the covers.
>
> - Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh
> sessions (i.e., some .bashrc's will exit "early" if they detect that it is a
> non-interactive session). For example:
>
> A$ ssh othernode env | grep -i path
>
> Ensure that the output shows the path and ld_library_path locations for Open
> MPI at the beginning of those variables.
> To go for the gold, you can try this, too:
>
> A$ ssh othernode which ompi_info
>    (if all paths are set right, this should show the ompi_info of your 1.4.3
>    install)
> A$ ssh othernode ompi_info
>    (should show all the info about your 1.4.3 install)
>
> - If all the above works, then test with a simple, non-MPI application across
> both nodes:
>
> A$ mpirun --host firstnode,othernode -np 2 hostname
> A
> B
> A$
>
> - When that works, you should be able to test with a simple MPI application
> (e.g., the examples/ring_c.c file in the Open MPI distribution):
>
> A$ cd /path/to/open/mpi/source
> A$ cd examples
> A$ make
> ...
> A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
> ...
> A$ mpirun --host firstnode,othernode -np 4 ring_c
>
> Make sense?