Hi Jeff,

Thank you for your suggestions.  I followed your steps verbatim.
Unfortunately, there is a bit of a problem.  Here's what I did:

  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-184-73-62-72.compute-1.amazonaws.com
  The authenticity of host 'ec2-184-73-62-72.compute-1.amazonaws.com (184.73.62.72)' can't be established.
  RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-194-215-32 ~]$ # this is instance A
  [tsakai@ip-10-194-215-32 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name:   ip-10-194-215-32.ec2.internal
  Address: 10.194.215.32

  [tsakai@ip-10-194-215-32 ~]$
  [tsakai@ip-10-194-215-32 ~]$ rm -rf $HOME/.ssh
  [tsakai@ip-10-194-215-32 ~]$ ssh-keygen -t dsa
  Generating public/private dsa key pair.
  Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
  Created directory '/home/tsakai/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/tsakai/.ssh/id_dsa.
  Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
  The key fingerprint is:
  54:eb:bd:e7:f2:52:24:49:94:7b:7a:9e:e4:b7:0b:04 tsakai@ip-10-194-215-32
  The key's randomart image is:
  +--[ DSA 1024]----+
  |          ....   |
  |         . .o    |
  |        . .E o   |
  |       . . .= o  |
  |        S . .*   |
  |            o.+  |
  |            .B.. |
  |            oo= .|
  |             +o+o|
  +-----------------+
  [tsakai@ip-10-194-215-32 ~]$
  [tsakai@ip-10-194-215-32 ~]$ cd $HOME/.ssh
  [tsakai@ip-10-194-215-32 .ssh]$ ll
  total 8
  -rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub
  [tsakai@ip-10-194-215-32 .ssh]$
  [tsakai@ip-10-194-215-32 .ssh]$ cp id_dsa.pub authorized_keys
  [tsakai@ip-10-194-215-32 .ssh]$ chmod 644 authorized_keys
  [tsakai@ip-10-194-215-32 .ssh]$
  [tsakai@ip-10-194-215-32 .ssh]$ ll
  total 12
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:16 authorized_keys
  -rw------- 1 tsakai tsakai 668 Feb 18 02:15 id_dsa
  -rw-r--r-- 1 tsakai tsakai 613 Feb 18 02:15 id_dsa.pub
  [tsakai@ip-10-194-215-32 .ssh]$

Now the next step is to go to instance B via ssh.  This doesn't
work for me because, at this point, the private key (id_dsa) on
instance A has no matching public key (id_dsa.pub) in the
authorized_keys kept on instance B.  Here is what happens:

  [tsakai@ip-10-194-215-32 .ssh]$ ssh ip-10-196-61-219.ec2.internal
  The authenticity of host 'ip-10-196-61-219.ec2.internal (10.196.61.219)'
  can't be established.
  RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-196-61-219.ec2.internal,10.196.61.219'
  (RSA) to the list of known hosts.
  Permission denied (publickey).
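
I suppose the missing step is to get instance A's new public key into
instance B's authorized_keys before attempting this hop.  Done by
hand, it would look something like this (an untested sketch; the key
text is abbreviated):

  # on instance A: show the public half of the newly generated key
  [tsakai@ip-10-194-215-32 ~]$ cat $HOME/.ssh/id_dsa.pub
  # on instance B: append that single line to authorized_keys
  [tsakai@ip-10-196-61-219 ~]$ echo 'ssh-dss AAAA...== tsakai@ip-10-194-215-32' >> $HOME/.ssh/authorized_keys
  [tsakai@ip-10-196-61-219 ~]$ chmod 644 $HOME/.ssh/authorized_keys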

I got onto instance B directly from my local machine and did the
same thing I had done on A:

  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-67-202-49-161.compute-1.amazonaws.com
  The authenticity of host 'ec2-67-202-49-161.compute-1.amazonaws.com (67.202.49.161)' can't be established.
  RSA key fingerprint is e5:ab:5b:d1:67:2c:ec:7e:33:3c:b8:b3:8a:73:5e:e9.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-196-61-219 ~]$
  [tsakai@ip-10-196-61-219 ~]$ # this is instance B
  [tsakai@ip-10-196-61-219 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name: ip-10-196-61-219.ec2.internal
  Address: 10.196.61.219

  [tsakai@ip-10-196-61-219 ~]$
  [tsakai@ip-10-196-61-219 ~]$ rm -rf $HOME/.ssh
  [tsakai@ip-10-196-61-219 ~]$ ssh-keygen -t dsa
  Generating public/private dsa key pair.
  Enter file in which to save the key (/home/tsakai/.ssh/id_dsa):
  Created directory '/home/tsakai/.ssh'.
  Enter passphrase (empty for no passphrase):
  Enter same passphrase again:
  Your identification has been saved in /home/tsakai/.ssh/id_dsa.
  Your public key has been saved in /home/tsakai/.ssh/id_dsa.pub.
  The key fingerprint is:
  dd:c1:73:97:50:eb:d1:ad:84:94:0f:98:51:b2:8d:4a tsakai@ip-10-196-61-219
  The key's randomart image is:
  +--[ DSA 1024]----+
  |          o=oo.. |
  |          oBo.. =|
  |        E o *oo++|
  |       . o . =oo.|
  |        S . . .. |
  |                 |
  |                 |
  |                 |
  |                 |
  +-----------------+
  [tsakai@ip-10-196-61-219 ~]$

Now comes another failure, this time from instance B:

  [tsakai@ip-10-196-61-219 ~]$ scp @ip-10-194-215-32.ec2.internal:.ssh/id_rsa\* .
  The authenticity of host 'ip-10-194-215-32.ec2.internal (10.194.215.32)'
  can't be established.
  RSA key fingerprint is cb:52:71:49:63:c2:52:58:9c:2e:04:46:f7:4e:b9:13.
  Are you sure you want to continue connecting (yes/no)?
  Host key verification failed.
  [tsakai@ip-10-196-61-219 ~]$

I have seen these problems many times over the last few days, and I
have worked out what is going on.  The failure occurs because, in
order to do silent authentication, ssh wants to see the identity of
the destination machine in the known_hosts file in the .ssh
directory.  One way to get around this is to use ssh's -i flag
(which requires the private key) once; if that is done from both
directions, ssh can then authenticate silently.  Essentially, I had
done exactly the same thing your instructions indicate, except that
I used rsa rather than dsa.  I don't think that is a roadblock, is it?
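
If it is only the known_hosts entry that is missing, I imagine it
could also be pre-seeded without ever answering the yes/no prompt,
along these lines (a sketch I have not yet tried on these instances):

  # fetch the destination's host key and append it to known_hosts
  [tsakai@ip-10-194-215-32 ~]$ ssh-keyscan -t rsa ip-10-196-61-219.ec2.internal >> $HOME/.ssh/known_hosts
  # or tell ssh to accept and record an unknown host key by itself
  [tsakai@ip-10-194-215-32 ~]$ ssh -o StrictHostKeyChecking=no ip-10-196-61-219.ec2.internal hostname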

  [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-50-17-48-206.compute-1.amazonaws.com
  The authenticity of host 'ec2-50-17-48-206.compute-1.amazonaws.com (50.17.48.206)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-110-10-137 ~]$
  [tsakai@ip-10-110-10-137 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name: ip-10-110-10-137.ec2.internal
  Address: 10.110.10.137

  [tsakai@ip-10-110-10-137 ~]$
  [tsakai@ip-10-110-10-137 ~]$ cd .ssh
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 12
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # there is no known_hosts file, which we need.
  [tsakai@ip-10-110-10-137 .ssh]$ # to create it, we need to hide config
  [tsakai@ip-10-110-10-137 .ssh]$ mv config __config
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai@ip-10-110-10-137.ec2.internal
  The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137' (RSA) to the list of known hosts.
  Last login: Fri Feb 18 04:20:29 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-110-10-137 ~]$
  [tsakai@ip-10-110-10-137 ~]$ cd .ssh
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # I ssh'ed to the same instance
  [tsakai@ip-10-110-10-137 .ssh]$ who
  tsakai   pts/0        2011-02-18 04:20 (63.193.205.1)
  tsakai   pts/1        2011-02-18 04:22 (ip-10-110-10-137.ec2.internal)
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ who
  tsakai   pts/0        2011-02-18 04:20 (63.193.205.1)
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:22 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # known_hosts file got made
  [tsakai@ip-10-110-10-137 .ssh]$ # what's in it?
  [tsakai@ip-10-110-10-137 .ssh]$ wc known_hosts
    1   3 425 known_hosts
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ cat known_hosts
  ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWO
iIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7A
LT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRq
Pzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1
QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # now go to instance B
  [tsakai@ip-10-110-10-137 .ssh]$ ssh -i tsakai tsakai@domU-12-31-39-16-C6-70.compute-1.internal
  The authenticity of host 'domu-12-31-39-16-c6-70.compute-1.internal (10.96.197.154)' can't be established.
  RSA key fingerprint is 2e:8b:83:39:02:9f:48:d6:fd:49:2f:82:96:0b:84:35.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'domu-12-31-39-16-c6-70.compute-1.internal,10.96.197.154' (RSA) to the list of known hosts.
  Last login: Wed Feb 16 21:20:01 2011 from 63.193.205.1

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # I am on instance B
  [tsakai@domU-12-31-39-16-C6-70 ~]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name: domU-12-31-39-16-C6-70.compute-1.internal
  Address: 10.96.197.154

  [tsakai@domU-12-31-39-16-C6-70 ~]$ cd .ssh
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 12
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # the same trick
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ mv config __config
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ssh -i tsakai tsakai@ip-10-110-10-137.ec2.internal
  The authenticity of host 'ip-10-110-10-137.ec2.internal (10.110.10.137)' can't be established.
  RSA key fingerprint is b4:4b:e3:74:42:d9:9c:82:21:0e:7d:d6:e3:13:4b:dd.
  Are you sure you want to continue connecting (yes/no)? yes
  Warning: Permanently added 'ip-10-110-10-137.ec2.internal,10.110.10.137' (RSA) to the list of known hosts.
  Last login: Fri Feb 18 04:22:24 2011 from ip-10-110-10-137.ec2.internal

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-110-10-137 ~]$
  [tsakai@ip-10-110-10-137 ~]$ # I am on instance A
  [tsakai@ip-10-110-10-137 ~]$ # go back to instance B
  [tsakai@ip-10-110-10-137 ~]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # known_hosts got made
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ cat known_hosts
  ip-10-110-10-137.ec2.internal,10.110.10.137 ssh-rsa
AAAAB3NzaC1yc2EAAAABIwAAAQEAyEMhrftyAg637XzteErroLE2Uf2PgrPz7S/Hs0Tyedk9ooWO
iIzlpTq3fEGXeZIZ4sMMiwuFQuF60TSkCUKSx9sZi8ce2Tvck1uTNrki/rlP11gY/aJ1oFW9Gg7A
LT2B8xPFThoSZntjMXYwRxxHwqVza0ELCxMV+kk6bdGeTPvFjl3tnyKEQJsdy8/HZy8v2jvFaWRq
Pzc6JIACEdkZ2AArN8Xh33yHFlOQ6XGwf86ZIqwWrbBH4Cvo6058rs9VDjzdBKcdM1D7K5ea5lF1
QGGEzfsUl7dVq6Z1UWnZoI9bqc1Mw+tpW08T2VCm0Dhz7V/UUHRtVGljQmaucpx9aw==
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ mv __config config
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 425 Feb 18 04:27 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ # go back to instance A
  [tsakai@domU-12-31-39-16-C6-70 .ssh]$ exit
  logout
  Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 __config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ mv __config config
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # now show I can ssh without -i flag silently
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal
  Last login: Fri Feb 18 04:25:56 2011 from ip-10-110-10-137.ec2.internal

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # and to instance A
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ ssh ip-10-110-10-137.ec2.internal
  Last login: Fri Feb 18 04:27:36 2011 from domu-12-31-39-16-c6-70.compute-1.internal

         __|  __|_  )  Amazon Linux AMI
         _|  (     /     Beta
        ___|\___|___|

  See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
  [tsakai@ip-10-110-10-137 ~]$
  [tsakai@ip-10-110-10-137 ~]$ # OK
  [tsakai@ip-10-110-10-137 ~]$ # go back to instance B
  [tsakai@ip-10-110-10-137 ~]$ exit
  logout
  Connection to ip-10-110-10-137.ec2.internal closed.
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ env | grep -i path
  LD_LIBRARY_PATH=:/usr/local/lib
  PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
  AWS_PATH=/opt/aws
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # check firewall
  [tsakai@domU-12-31-39-16-C6-70 ~]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ ll -t /usr/local/lib | head
  total 4100
  -rw-r--r-- 1 root root 385864 Feb 16 01:33 libvt.a
  -rw-r--r-- 1 root root 154950 Feb 16 01:33 libvt.fmpi.a
  -rw-r--r-- 1 root root 567848 Feb 16 01:33 libvt.mpi.a
  -rw-r--r-- 1 root root 462838 Feb 16 01:33 libvt.omp.a
  -rw-r--r-- 1 root root 643482 Feb 16 01:33 libvt.ompi.a
  -rw-r--r-- 1 root root 231278 Feb 16 01:33 libotf.a
  -rwxr-xr-x 1 root root    891 Feb 16 01:33 libotf.la
  drwxr-xr-x 2 root root   4096 Feb 16 01:33 openmpi
  -rwxr-xr-x 1 root root    991 Feb 16 01:33 libmca_common_sm.la
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ sudo find / -name mpirun
  /usr/local/bin/mpirun
  [tsakai@domU-12-31-39-16-C6-70 ~]$ cat .ssh/config
  Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # try mpirun without the other machine
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host `hostname` -np 2 hostname
  domU-12-31-39-16-C6-70
  domU-12-31-39-16-C6-70
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host domU-12-31-39-16-C6-70.compute-1.internal -np 2 hostname
  domU-12-31-39-16-C6-70
  domU-12-31-39-16-C6-70
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # now add an extra host
  [tsakai@domU-12-31-39-16-C6-70 ~]$ mpirun --host \
  > domU-12-31-39-16-C6-70.compute-1.internal,ip-10-110-10-137.ec2.internal \
  >                                  -np 2 \
  >                                  hostname
  # it is hanging
  # let me issue control-c
  ^Cmpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        ip-10-110-10-137.ec2.internal - daemon did not report back when launched
  [tsakai@domU-12-31-39-16-C6-70 ~]$
  [tsakai@domU-12-31-39-16-C6-70 ~]$ # go back to machine A
  [tsakai@domU-12-31-39-16-C6-70 ~]$ exit
  logout
  Connection to domU-12-31-39-16-C6-70.compute-1.internal closed.
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ll
  total 16
  -rw------- 1 tsakai tsakai 232 Feb 16 04:00 authorized_keys
  -rw------- 1 tsakai tsakai  81 Feb 16 04:10 config
  -rw-r--r-- 1 tsakai tsakai 862 Feb 18 04:25 known_hosts
  -rw------- 1 tsakai tsakai 887 Feb 16 04:07 tsakai
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ sudo service iptables status
  iptables: Firewall is not running.
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ sudo find / -name mpirun
  /usr/local/bin/mpirun
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ env | grep -i path
  LD_LIBRARY_PATH=:/usr/local/lib
  PATH=/usr/local/bin:/bin:/usr/bin:/opt/aws/bin:/home/tsakai/bin
  AWS_PATH=/opt/aws
  [tsakai@ip-10-110-10-137 .ssh]$ cat config
  Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host `hostname` -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host ip-10-110-10-137.ec2.internal -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$ # add the other instance
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host \
  > ip-10-110-10-137.ec2.internal,domU-12-31-39-16-C6-70.compute-1.internal \
  >                               -np 2 \
  >                               hostname
  # again hanging; issuing control-c
  ^Cmpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        domU-12-31-39-16-C6-70.compute-1.internal - daemon did not report back when launched
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # try with IP
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ nslookup `hostname`
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name: ip-10-110-10-137.ec2.internal
  Address: 10.110.10.137

  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host 10.110.10.137 -np 2 hostname
  ip-10-110-10-137
  ip-10-110-10-137
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ ssh domU-12-31-39-16-C6-70.compute-1.internal 'nslookup domU-12-31-39-16-C6-70'
  Server:         172.16.0.23
  Address:        172.16.0.23#53

  Non-authoritative answer:
  Name: domU-12-31-39-16-C6-70.compute-1.internal
  Address: 10.96.197.154

  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ mpirun --host \
  >                               10.110.10.137,10.96.197.154 \
  >                               -np 2 hostname
  # hanging also, get out by control-c
  ^Cmpirun: killing job...

  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --------------------------------------------------------------------------
  --------------------------------------------------------------------------
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --------------------------------------------------------------------------
        10.96.197.154 - daemon did not report back when launched
  [tsakai@ip-10-110-10-137 .ssh]$
  [tsakai@ip-10-110-10-137 .ssh]$ # I can't figure out what more to do....
  [tsakai@ip-10-110-10-137 .ssh]$ exit
  logout
  [tsakai@vixen ec2]$
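
One more check I intend to make, following your advice about
non-interactive logins, is whether orted is reachable over a
non-interactive ssh (a sketch I have not yet run):

  # should print /usr/local/bin/orted if PATH is right for
  # non-interactive sessions
  [tsakai@ip-10-110-10-137 ~]$ ssh domU-12-31-39-16-C6-70.compute-1.internal which orted
  # and show the PATH/LD_LIBRARY_PATH the daemon would actually see
  [tsakai@ip-10-110-10-137 ~]$ ssh domU-12-31-39-16-C6-70.compute-1.internal 'env | grep -i path'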

Do you see anything incorrect in what I am doing?

Thank you.

Regards,

Tena


On 2/17/11 6:52 AM, "Jeff Squyres" <jsquy...@cisco.com> wrote:

> On Feb 16, 2011, at 6:17 PM, Tena Sakai wrote:
>
>> For now, may I point out something I noticed out of the
>> DEBUG3 Output last night?
>>
>> I found this line:
>>
>>>  debug1: Sending command:  orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>
> What this means is that ssh sent the "orted ..." command to the remote side.
>
> As Gus mentioned, "orted" is the "Open MPI Run-Time Environment daemon" --
> it's a helper thingy that mpirun launches on the remote nodes before launching
> your actual application.  All those parameters (from --daemonize through
> ...:56064") are options for orted.
>
> All of that gorp is considered internal to Open MPI -- most people never see
> that stuff.
>
>> Followed by:
>>
>>>  debug2: channel 0: request exec confirm 1
>>>  debug2: fd 3 setting TCP_NODELAY
>>>  debug2: callback done
>>>  debug2: channel 0: open confirm rwindow 0 rmax 32768
>>>  debug3: Wrote 272 bytes for a total of 1893
>>>  debug2: channel 0: rcvd adjust 2097152
>>>  debug2: channel_input_status_confirm: type 99 id 0
>
> This is just more status information about the ssh connection; it doesn't
> really have any direct relation to Open MPI.
>
> I don't know offhand if ssh displays the ack that a command successfully ran.
> If you're not convinced that it did, then login to the other node while the
> command is hung and run a ps to see if the orted is actually running or not.
> I *suspect* that it is running, but that it's just hung for some reason.
>
> -----
>
> Here's some suggestions to try debugging:
>
> On your new linux AMI instances (some of this may be redundant with what you
> did already):
>
> - ensure that firewalling is disabled on all instances
>
> - ensure that your .bashrc (or whatever startup file is relevant to your
> shell) is set to prefix PATH and LD_LIBRARY_PATH to your Open MPI
> installation.  Ensure that you *PREFIX* these variables to guarantee that you don't
> get interference from already-installed versions of Open MPI (e.g., if Open
> MPI is installed by default on your AMI and you weren't aware of it)
>
> - setup a simple, per-user SSH key, perhaps something like this:
>
>      A$ rm -rf $HOME/.ssh
>    (remove what you had before; let's just start over)
>
>      A$ ssh-keygen -t dsa
>    (hit enter to accept all defaults and set no passphrase)
>
>      A$ cd $HOME/.ssh
>      A$ cp id_dsa.pub authorized_keys
>      A$ chmod 644 authorized_keys
>      A$ ssh othernode
>    (login to node B)
>
>      B$ ssh-keygen -t dsa
>    (hit enter to accept all defaults and set no passphrase; just to create
> $HOME/.ssh with the right permissions, etc.)
>
>      B$ scp @firstnode:.ssh/id_dsa\* .
>    (enter your password on A -- we're overwriting all the files here)
>
>      B$ cp id_dsa.pub authorized_keys
>      B$ chmod 644 authorized_keys
>
> Now you should be able to ssh from one node to the other without passwords:
>
>      A$ ssh othernode hostname
>      B
>      A$
>
> and
>
>      B$ ssh firstnode hostname
>      A
>      B$
>
> Don't just test with "ssh othernode" -- test with "ssh othernode <command>" to
> ensure that non-interactive logins work properly.  That's what Open MPI will
> use under the covers.
>
> - Now ensure that PATH and LD_LIBRARY_PATH are set for non-interactive ssh
> sessions (i.e., some .bashrc's will exit "early" if they detect that it is a
> non-interactive session).  For example:
>
>      A$ ssh othernode env | grep -i path
>
> Ensure that the output shows the path and ld_library_path locations for Open
> MPI at the beginning of those variables.  To go for the gold, you can try
> this, too:
>
>      A$ ssh othernode which ompi_info
>      (if all paths are set right, this should show the ompi_info of your 1.4.3
> install)
>      A$ ssh othernode ompi_info
>      (should show all the info about your 1.4.3 install)
>
> - If all the above works, then test with a simple, non-MPI application across
> both nodes:
>
>      A$ mpirun --host firstnode,othernode -np 2 hostname
>      A
>      B
>      A$
>
> - When that works, you should be able to test with a simple MPI application
> (e.g., the examples/ring_c.c file in the Open MPI distribution):
>
>      A$ cd /path/to/open/mpi/source
>      A$ cd examples
>      A$ make
>      ...
>      A$ scp ring_c @othernode:/path/to/open/mpi/source/examples
>      ...
>      A$ mpirun --host firstnode,othernode -np 4 ring_c
>
> Make sense?

