Hi Gus,

Thank you for your reply and suggestions.

I will follow up on these in a bit and will give you an
update.  Looking at what vixen and/or dasher generates
from DEBUG3 would be interesting.

For now, may I point out something I noticed in the DEBUG3 output
last night?

I found this line:

>   debug1: Sending command:  orted --daemonize -mca ess env -mca
> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"

Followed by:

>   debug2: channel 0: request exec confirm 1
>   debug2: fd 3 setting TCP_NODELAY
>   debug2: callback done
>   debug2: channel 0: open confirm rwindow 0 rmax 32768
>   debug3: Wrote 272 bytes for a total of 1893
>   debug2: channel 0: rcvd adjust 2097152
>   debug2: channel_input_status_confirm: type 99 id 0

It appears, to my untrained eye, that a directive was issued from
instance A to instance B, and then what happened?  I don't see any
sign that instance B honored it.

Can you please comment on this?
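
In case it is useful, here is what I am thinking of trying next,
right after mpirun hangs.  The IP address and port below are simply
the ones from the hnp-uri in the log above, and the exact checks are
my guesses, so please correct me if they make no sense:

   # on instance B: did an orted process actually start?
   ps -ef | grep [o]rted

   # on instance B: can it reach back to mpirun's callback port
   # on instance A?
   nc -vz 10.96.118.236 56064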

Thank you.

Regards,

Tena

On 2/16/11 1:34 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> I hope somebody more knowledgeable in ssh
> takes a look at the debug3 session log that you included.
>
> I can't see if/where/why ssh is failing for you in EC2.
>
> See other answers inline, please.
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you again for your reply.
>>
>>> A slight difference is that on vixen and dasher you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general, most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>
>> Yes, between vixen and dasher I was doing the test as user tsakai,
>> not as root.  But the reason I wanted to do this test as root is to
>> show that it fails as a regular user (generating a "pipe system call
>> failed" error), whereas as root it would succeed, as it did on
>> Friday.
>
> Sorry again.
> I even wrote "root can and Tena cannot", then I forgot.
> Too many tasks at the same time, too much context-switching ...
>
>> The AMI has not changed.  The last change to the AMI was made last
>> Tuesday.  As such, I don't understand this inconsistent behavior.  I
>> have lots of notes from previous sessions, and I consulted several
>> successful session logs to replicate what I saw on Friday, but with
>> no success.
>>
>> Having spent days and gotten nowhere, I decided to take a different
>> approach.  I instantiated a Linux AMI built by Amazon, which feels
>> CentOS/Fedora-based.  I downloaded gcc and g++, plus Open MPI 1.4.3.
>> After I got Open MPI running, I created an account for user tsakai,
>> uploaded my public key, logged back in as user tsakai, and ran the
>> same test.  Surprisingly (or not?) it generated the same result.
>> That is, I cannot run the same mpirun command when a remote instance
>> is involved, but by itself mpirun runs fine.  So I am feeling this
>> has to be an ssh authentication problem.  I looked at the man pages
>> for ssh and ssh_config and cannot figure out what I am doing wrong.
>> I put in a "LogLevel DEBUG3" line and it generated lots of output,
>> in which I found this line:
>>   debug1: Authentication succeeded (publickey).
>> Then I see a bunch of lines that look like:
>>   debug3: Ignored env XXXXXXX
>> and mpirun hangs.  Here is the session log:
>>
>
> SSH on our clusters uses host-based authentication.
> I think Reuti sent you his page about it:
> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>
> However, I believe OpenMPI shouldn't care which ssh authentication
> mechanism is used, as long as it works passwordless.
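>
> (A quick sanity check, independent of OpenMPI, is to make sure a
> non-interactive ssh from A to B succeeds; this is just my usual
> test, nothing official:
>
>    ssh -o BatchMode=yes domU-12-31-39-16-4E-4C.compute-1.internal true
>    echo $?
>
> If that prints anything other than 0, or prompts for input, the
> mpirun launch would likely fail the same way.)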
>
> As for ssh configuration, ours is pretty standard:
>
> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
> but use standard identity file names id_rsa, etc.
> I think that just tells ssh to use the specific identity file
> you named.
> I don't know whether that could cause the problem, but who knows?
>
> 2) We don't have 'BatchMode yes' set.
>
> 3) We have GSSAPI authentication set:
>
> GSSAPIAuthentication yes
>
> 4) The locale environment variables are also passed
> (may not be crucial):
>
>         SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
>         SendEnv LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE
>         SendEnv LC_MEASUREMENT LC_IDENTIFICATION LC_ALL
>
> 5) And X forwarding (you're not doing any X stuff, I suppose):
>
> ForwardX11Trusted yes
>
> 6) However, you may want to check what is in your
> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
> because some options may be already set there.
>
> 7) Take a look at 'man ssh[d]' and  'man ssh[d]_config' too.
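>
> For what it's worth, put together, the client-side items above would
> look roughly like this in ~/.ssh/config (a sketch from memory, not a
> literal copy of our file):
>
>    Host *
>        GSSAPIAuthentication yes
>        ForwardX11Trusted yes
>        SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
>        SendEnv LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE
>        SendEnv LC_MEASUREMENT LC_IDENTIFICATION LC_ALL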
>
> ***
>
> Finally, if you are willing, it may be worth running the same
> experiment (with debug3) on vixen and dasher, just to compare the
> verbose ssh messages there with what you see in EC2.
> Perhaps that will help nail down the reason for the failure.
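>
> Something along these lines would capture the chatter for a
> side-by-side look (hypothetical log names; ssh's debug messages go
> to stderr, so redirecting mpirun's stderr should catch them):
>
>    mpirun -app app.ac 2> ec2.log     # on the EC2 instance
>    mpirun -app app.ac 2> local.log   # on vixen or dasher
>    diff ec2.log local.log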
>
> Gus Correa
>
>
>
>>   [tsakai@vixen ec2]$
>>   [tsakai@vixen ec2]$ ssh -i $MYKEY
>> tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
>>   Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>
>>          __|  __|_  )  Amazon Linux AMI
>>          _|  (     /     Beta
>>         ___|\___|___|
>>
>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
>>   -bash: service: command not found
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>>   iptables: Firewall is not running.
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
>> password authentication
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ ssh
>> domU-12-31-39-16-4E-4C.compute-1.internal
>>   Last login: Wed Feb 16 06:53:14 2011 from
>> domu-12-31-39-16-75-1e.compute-1.internal
>>
>>          __|  __|_  )  Amazon Linux AMI
>>          _|  (     /     Beta
>>         ___|\___|___|
>>
>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ ssh
>> domU-12-31-39-16-75-1E.compute-1.internal
>>   Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>
>>          __|  __|_  )  Amazon Linux AMI
>>          _|  (     /     Beta
>>         ___|\___|___|
>>
>>   See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # OK
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>   logout
>>   Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>>   LD_LIBRARY_PATH=:/usr/local/lib
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>>   iptables: Firewall is not running.
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>>   [tsakai@domU-12-31-39-16-4E-4C ~]$ exit
>>   logout
>>   Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>>   LD_LIBRARY_PATH=:/usr/local/lib
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>   -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>   -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
>> bottom 2 are remote inst (inst B)
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>   ^Cmpirun: killing job...
>>
>>   --------------------------------------------------------------------------
>>   mpirun noticed that the job aborted, but has no info as to the process
>>   that caused that situation.
>>   --------------------------------------------------------------------------
>>   --------------------------------------------------------------------------
>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>   below. Additional manual cleanup may be required - please refer to
>>   the "orte-clean" tool for assistance.
>>   --------------------------------------------------------------------------
>>         domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>> back when launched
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
>> launched ***
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>   -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>>   domU-12-31-39-16-75-1E
>>   domU-12-31-39-16-75-1E
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>   Host *
>>         IdentityFile /home/tsakai/.ssh/tsakai
>>         IdentitiesOnly yes
>>         BatchMode yes
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>   -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>   Host *
>>         IdentityFile /home/tsakai/.ssh/tsakai
>>         IdentitiesOnly yes
>>         BatchMode yes
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>>         LogLevel DEBUG3
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>   Host *
>>         IdentityFile /home/tsakai/.ssh/tsakai
>>         IdentitiesOnly yes
>>         BatchMode yes
>>         LogLevel DEBUG3
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>   -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>   [tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>   debug2: ssh_connect: needpriv 0
>>   debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
>> [10.96.77.182] port 22.
>>   debug1: Connection established.
>>   debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>>   debug2: key_type_from_name: unknown key type '-----BEGIN'
>>   debug3: key_read: missing keytype
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug3: key_read: missing whitespace
>>   debug2: key_type_from_name: unknown key type '-----END'
>>   debug3: key_read: missing keytype
>>   debug1: identity file /home/tsakai/.ssh/tsakai type -1
>>   debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>>   debug1: match: OpenSSH_5.3 pat OpenSSH*
>>   debug1: Enabling compatibility mode for protocol 2.0
>>   debug1: Local version string SSH-2.0-OpenSSH_5.3
>>   debug2: fd 3 setting O_NONBLOCK
>>   debug1: SSH2_MSG_KEXINIT sent
>>   debug3: Wrote 792 bytes for a total of 813
>>   debug1: SSH2_MSG_KEXINIT received
>>   debug2: kex_parse_kexinit:
>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>   debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>   debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
>> iu.se
>>   debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
>> iu.se
>>   debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
>> .com,hmac-sha1-96,hmac-md5-96
>>   debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
>> .com,hmac-sha1-96,hmac-md5-96
>>   debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>   debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>   debug2: kex_parse_kexinit:
>>   debug2: kex_parse_kexinit:
>>   debug2: kex_parse_kexinit: first_kex_follows 0
>>   debug2: kex_parse_kexinit: reserved 0
>>   debug2: kex_parse_kexinit:
>> diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diff
>> ie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>   debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>   debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
>> iu.se
>>   debug2: kex_parse_kexinit:
>> aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,b
>> lowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.l
>> iu.se
>>   debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
>> .com,hmac-sha1-96,hmac-md5-96
>>   debug2: kex_parse_kexinit:
>> hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh
>> .com,hmac-sha1-96,hmac-md5-96
>>   debug2: kex_parse_kexinit: none,z...@openssh.com
>>   debug2: kex_parse_kexinit: none,z...@openssh.com
>>   debug2: kex_parse_kexinit:
>>   debug2: kex_parse_kexinit:
>>   debug2: kex_parse_kexinit: first_kex_follows 0
>>   debug2: kex_parse_kexinit: reserved 0
>>   debug2: mac_setup: found hmac-md5
>>   debug1: kex: server->client aes128-ctr hmac-md5 none
>>   debug2: mac_setup: found hmac-md5
>>   debug1: kex: client->server aes128-ctr hmac-md5 none
>>   debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>>   debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>>   debug3: Wrote 24 bytes for a total of 837
>>   debug2: dh_gen_key: priv key bits set: 125/256
>>   debug2: bits set: 489/1024
>>   debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>>   debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>>   debug3: Wrote 144 bytes for a total of 981
>>   debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>   debug3: check_host_in_hostfile: match line 1
>>   debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>   debug3: check_host_in_hostfile: match line 1
>>   debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
>> matches the RSA host key.
>>   debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>>   debug2: bits set: 491/1024
>>   debug1: ssh_rsa_verify: signature correct
>>   debug2: kex_derive_keys
>>   debug2: set_newkeys: mode 1
>>   debug1: SSH2_MSG_NEWKEYS sent
>>   debug1: expecting SSH2_MSG_NEWKEYS
>>   debug3: Wrote 16 bytes for a total of 997
>>   debug2: set_newkeys: mode 0
>>   debug1: SSH2_MSG_NEWKEYS received
>>   debug1: SSH2_MSG_SERVICE_REQUEST sent
>>   debug3: Wrote 48 bytes for a total of 1045
>>   debug2: service_accept: ssh-userauth
>>   debug1: SSH2_MSG_SERVICE_ACCEPT received
>>   debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>>   debug3: Wrote 64 bytes for a total of 1109
>>   debug1: Authentications that can continue: publickey
>>   debug3: start over, passed a different list publickey
>>   debug3: preferred gssapi-with-mic,publickey
>>   debug3: authmethod_lookup publickey
>>   debug3: remaining preferred: ,publickey
>>   debug3: authmethod_is_enabled publickey
>>   debug1: Next authentication method: publickey
>>   debug1: Trying private key: /home/tsakai/.ssh/tsakai
>>   debug1: read PEM private key done: type RSA
>>   debug3: sign_and_send_pubkey
>>   debug2: we sent a publickey packet, wait for reply
>>   debug3: Wrote 384 bytes for a total of 1493
>>   debug1: Authentication succeeded (publickey).
>>   debug2: fd 4 setting O_NONBLOCK
>>   debug1: channel 0: new [client-session]
>>   debug3: ssh_session2_open: channel_new: 0
>>   debug2: channel 0: send open
>>   debug1: Requesting no-more-sessi...@openssh.com
>>   debug1: Entering interactive session.
>>   debug3: Wrote 128 bytes for a total of 1621
>>   debug2: callback start
>>   debug2: client_session2_setup: id 0
>>   debug1: Sending environment.
>>   debug3: Ignored env HOSTNAME
>>   debug3: Ignored env TERM
>>   debug3: Ignored env SHELL
>>   debug3: Ignored env HISTSIZE
>>   debug3: Ignored env EC2_AMITOOL_HOME
>>   debug3: Ignored env SSH_CLIENT
>>   debug3: Ignored env SSH_TTY
>>   debug3: Ignored env USER
>>   debug3: Ignored env LD_LIBRARY_PATH
>>   debug3: Ignored env LS_COLORS
>>   debug3: Ignored env EC2_HOME
>>   debug3: Ignored env MAIL
>>   debug3: Ignored env PATH
>>   debug3: Ignored env INPUTRC
>>   debug3: Ignored env PWD
>>   debug3: Ignored env JAVA_HOME
>>   debug1: Sending env LANG = en_US.UTF-8
>>   debug2: channel 0: request env confirm 0
>>   debug3: Ignored env AWS_CLOUDWATCH_HOME
>>   debug3: Ignored env AWS_IAM_HOME
>>   debug3: Ignored env SHLVL
>>   debug3: Ignored env HOME
>>   debug3: Ignored env AWS_PATH
>>   debug3: Ignored env AWS_AUTO_SCALING_HOME
>>   debug3: Ignored env LOGNAME
>>   debug3: Ignored env AWS_ELB_HOME
>>   debug3: Ignored env SSH_CONNECTION
>>   debug3: Ignored env LESSOPEN
>>   debug3: Ignored env AWS_RDS_HOME
>>   debug3: Ignored env G_BROKEN_FILENAMES
>>   debug3: Ignored env _
>>   debug3: Ignored env OLDPWD
>>   debug3: Ignored env OMPI_MCA_plm
>>   debug1: Sending command:  orted --daemonize -mca ess env -mca
>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>   debug2: channel 0: request exec confirm 1
>>   debug2: fd 3 setting TCP_NODELAY
>>   debug2: callback done
>>   debug2: channel 0: open confirm rwindow 0 rmax 32768
>>   debug3: Wrote 272 bytes for a total of 1893
>>   debug2: channel 0: rcvd adjust 2097152
>>   debug2: channel_input_status_confirm: type 99 id 0
>>   debug2: exec request accepted on channel 0
>>   debug2: channel 0: read<=0 rfd 4 len 0
>>   debug2: channel 0: read failed
>>   debug2: channel 0: close_read
>>   debug2: channel 0: input open -> drain
>>   debug2: channel 0: ibuf empty
>>   debug2: channel 0: send eof
>>   debug2: channel 0: input drain -> closed
>>   debug3: Wrote 32 bytes for a total of 1925
>>   debug2: channel 0: rcvd eof
>>   debug2: channel 0: output open -> drain
>>   debug2: channel 0: obuf empty
>>   debug2: channel 0: close_write
>>   debug2: channel 0: output drain -> closed
>>   debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>>   debug2: channel 0: rcvd close
>>   debug3: channel 0: will not send data after close
>>   debug2: channel 0: almost dead
>>   debug2: channel 0: gc: notify user
>>   debug2: channel 0: gc: user detached
>>   debug2: channel 0: send close
>>   debug2: channel 0: is dead
>>   debug2: channel 0: garbage collecting
>>   debug1: channel 0: free: client-session, nchannels 1
>>   debug3: channel 0: status: The following connections are open:
>>     #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>
>>   debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>>   debug3: Wrote 32 bytes for a total of 1957
>>   debug3: Wrote 64 bytes for a total of 2021
>>   debug1: fd 0 clearing O_NONBLOCK
>>   Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>>   Bytes per second: sent 18384.8, received 18944.3
>>   debug1: Exit status 0
>>   # it is hanging; I am about to issue control-C
>>   ^Cmpirun: killing job...
>>
>>   --------------------------------------------------------------------------
>>   mpirun noticed that the job aborted, but has no info as to the process
>>   that caused that situation.
>>   --------------------------------------------------------------------------
>>   --------------------------------------------------------------------------
>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>   below. Additional manual cleanup may be required - please refer to
>>   the "orte-clean" tool for assistance.
>>   --------------------------------------------------------------------------
>>         domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>> back when launched
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
>> launched
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
>>   [tsakai@domU-12-31-39-16-75-1E ~]$
>>   [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>   logout
>>   [tsakai@vixen ec2]$
>>   [tsakai@vixen ec2]$
>>
>> Do you see anything strange?
>>
>> One final question: the ssh man page mentions a few environment
>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc.  Do any
>> of these matter as far as Open MPI is concerned?
>>
>> Thank you, Gus.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Tena Sakai wrote:
>>>> Hi,
>>>>
>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>> EC2 instances, but I am having a problem.  What I was able to show last
>>>> Friday as root was with this command:
>>>>   mpirun -app app.ac
>>>> with app.ac being:
>>>>   -H dns-entry-A -np 1 (linux command)
>>>>   -H dns-entry-A -np 1 (linux command)
>>>>   -H dns-entry-B -np 1 (linux command)
>>>>   -H dns-entry-B -np 1 (linux command)
>>>>
>>>> Here's the config file in root's .ssh directory:
>>>>   Host *
>>>>         IdentityFile /root/.ssh/.derobee/.kagi
>>>>         IdentitiesOnly yes
>>>>         BatchMode yes
>>>>
>>>> Yesterday and today I can't get this to work.  I made the last part
>>>> of the app.ac file simpler (it now says /bin/hostname).  Below is
>>>> the session:
>>>>
>>>>   -bash-3.2#
>>>>   -bash-3.2# # I am on instance A, host name for inst A is:
>>>>   -bash-3.2# hostname
>>>>   domU-12-31-39-09-CD-C2
>>>>   -bash-3.2#
>>>>   -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>>   Server:               172.16.0.23
>>>>   Address:      172.16.0.23#53
>>>>
>>>>   Non-authoritative answer:
>>>>   Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>>   Address: 10.210.210.48
>>>>
>>>>   -bash-3.2# cd .ssh
>>>>   -bash-3.2#
>>>>   -bash-3.2# cat config
>>>>   Host *
>>>>           IdentityFile /root/.ssh/.derobee/.kagi
>>>>           IdentitiesOnly yes
>>>>           BatchMode yes
>>>>   -bash-3.2#
>>>>   -bash-3.2# ll config
>>>>   -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>>   -bash-3.2#
>>>>   -bash-3.2# chmod 600 config
>>>>   -bash-3.2#
>>>>   -bash-3.2# # show I can go to inst B without password/passphrase
>>>>   -bash-3.2#
>>>>   -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>>   Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>>   -bash-3.2#
>>>>   -bash-3.2# hostname
>>>>   domU-12-31-39-09-E6-71
>>>>   -bash-3.2#
>>>>   -bash-3.2# nslookup `hostname`
>>>>   Server:               172.16.0.23
>>>>   Address:      172.16.0.23#53
>>>>
>>>>   Non-authoritative answer:
>>>>   Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>>   Address: 10.210.233.123
>>>>
>>>>   -bash-3.2# # and back to inst A is also no problem
>>>>   -bash-3.2#
>>>>   -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>>   Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>>   -bash-3.2#
>>>>   -bash-3.2# hostname
>>>>   domU-12-31-39-09-CD-C2
>>>>   -bash-3.2#
>>>>   -bash-3.2# # log out twice to go back to inst A
>>>>   -bash-3.2# exit
>>>>   logout
>>>>   Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>>   -bash-3.2#
>>>>   -bash-3.2# exit
>>>>   logout
>>>>   Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>>   -bash-3.2#
>>>>   -bash-3.2# hostname
>>>>   domU-12-31-39-09-CD-C2
>>>>   -bash-3.2#
>>>>   -bash-3.2# cd ..
>>>>   -bash-3.2#
>>>>   -bash-3.2# pwd
>>>>   /root
>>>>   -bash-3.2#
>>>>   -bash-3.2# ll
>>>>   total 8
>>>>   -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>>   -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>>   -bash-3.2#
>>>>   -bash-3.2# cat app.ac
>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>   -bash-3.2#
>>>>   -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>>   -bash-3.2# mpirun -app app.ac
>>>>   mpirun: killing job...
>>>>
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun noticed that the job aborted, but has no info as to the process
>>>>   that caused that situation.
>>>>
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>>   mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>   below. Additional manual cleanup may be required - please refer to
>>>>   the "orte-clean" tool for assistance.
>>>>
>>>> --------------------------------------------------------------------------
>>>>         domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>> report back when launched
>>>>   -bash-3.2#
>>>>   -bash-3.2# cat app.ac2
>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>   -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>   -bash-3.2#
>>>>   -bash-3.2# # when there is no remote machine, then mpirun works:
>>>>   -bash-3.2# mpirun -app app.ac2
>>>>   domU-12-31-39-09-CD-C2
>>>>   domU-12-31-39-09-CD-C2
>>>>   -bash-3.2#
>>>>   -bash-3.2# hostname
>>>>   domU-12-31-39-09-CD-C2
>>>>   -bash-3.2#
>>>>   -bash-3.2# # this gotta be ssh problem....
>>>>   -bash-3.2#
>>>>   -bash-3.2# # show no firewall is used
>>>>   -bash-3.2# iptables --list
>>>>   Chain INPUT (policy ACCEPT)
>>>>    target     prot opt source               destination
>>>>
>>>>   Chain FORWARD (policy ACCEPT)
>>>>   target     prot opt source               destination
>>>>
>>>>   Chain OUTPUT (policy ACCEPT)
>>>>   target     prot opt source               destination
>>>>   -bash-3.2#
>>>>   -bash-3.2# exit
>>>>   logout
>>>>   [tsakai@vixen ec2]$
>>>>
>>>> Would someone please point out what I am doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>> Hi Tena
>>>
>>> Nothing wrong that I can see.
>>> Just another couple of suggestions,
>>> based on somewhat vague possibilities.
>>>
>>> A slight difference is that on vixen and dasher you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general, most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>>
>>> This may make no difference either,
>>> but you could do a 'service iptables status',
>>> to see if the service is running, even though there are
>>> no explicit iptables rules (as per your email).
>>> If the service is not running you get
>>> 'Firewall is stopped.' (in CentOS).
>>> I *think* 'iptables --list' loads the iptables module into the
>>> kernel, as a side effect, whereas the service command does not.
>>> So, it may be cleaner (safer?) to use the service version
>>> instead of 'iptables --list'.
>>> I don't know if it will make any difference,
>>> but just in case, if the service is running,
>>> why not do 'service iptables stop',
>>> and perhaps also 'chkconfig iptables off' to be completely
>>> free of iptables?
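>>>
>>> For instance (assuming the usual CentOS service scripts; run as
>>> root):
>>>
>>>    service iptables status    # is the firewall service running?
>>>    service iptables stop      # stop it for this boot
>>>    chkconfig iptables off     # keep it from starting at boot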
>>>
>>> Gus Correa
>>
>>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

