Hi Gus,
I am starting to see the light at the end of the tunnel.
As I wrote in reply to Jeff, it was not an ssh problem. It was
a setting of the user-configurable firewall that Amazon calls a
security group. I need to expand my small tests to a wider
set, but I think I can do that. I will keep you posted in
the coming days/weeks.
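[For reference: the rule involved is one that lets the instances reach each other on the TCP ports orted/MPI pick at run time. A minimal sketch, assuming the current AWS CLI and a placeholder group name "mpi" -- both illustrative, not necessarily what was actually used:

  # allow all TCP traffic between members of the same security group
  aws ec2 authorize-security-group-ingress \
      --group-name mpi --protocol tcp --port 0-65535 --source-group mpi
]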
Many thanks for the dialog. I really appreciate
your help and explanations.
Thank you!
Regards,
Tena
On 2/16/11 4:31 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
Hi Tena
Again, I think your EC2 session log with ssh debug3 level (below)
should be looked at by somebody more knowledgeable in OpenMPI
and in ssh than me.
There must be some clue to what is going on there.
Ssh experts, Jeff, Ralph, please help!
Anyway ...
AFAIK, 'orted' in the first line you selected/highlighted below
is the OpenMPI run-time environment daemon ( ... the OpenMPI pros
are authorized to send me to the galleys if it is not ...).
So, orted is trying to do its thing, to create the conditions for your
job to run across the two EC2 'instances'. (Gone are the naive
days when these things were computers, each one in its own box ...)
This master-of-ceremonies work of orted is done via TCP, and I guess
10.96.118.236 is the IP (of computer B?),
and 56064 is probably the port,
where orted may be trying to open a socket.
The bunch of -mca parameters are just what they are: MCA parameters
(MCA = Modular Component Architecture of OpenMPI, and here I am risking
being shanghaied or ridiculed again ...).
(You can learn more about the MCA parameters with 'ompi_info -help'.)
That is how, in my ignorance, I parse that line.
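[For reference, two standard commands that can help dig further here; this is only a sketch, since the exact parameter names and output vary with the OpenMPI version, and app.ac is just the appfile used later in this thread:

  # list the MCA parameters (here, the TCP out-of-band ones orted uses)
  ompi_info --param all all | grep oob_tcp

  # keep the remote orted output attached, so launch failures are visible
  mpirun --debug-daemons -app app.ac
]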
So, from the computer/instance-A side orted gives the first kick,
but somehow the ball never comes back from computer/instance-B.
It's ping- without -pong.
The same frustrating feeling I had when I was a kid and kicked the
soccer ball on the neighbor's side and would never see it again.
Cheers,
Gus
Tena Sakai wrote:
Hi Gus,
Thank you for your reply and suggestions.
I will follow up on these in a bit and will give you an
update. Looking at what vixen and/or dasher generates
from DEBUG3 would be interesting.
For now, may I point out something I noticed in the
DEBUG3 output last night?
I found this line:
debug1: Sending command: orted --daemonize -mca ess env -mca
orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
--hnp-uri "125566976.0;tcp://10.96.118.236:56064"
Followed by:
debug2: channel 0: request exec confirm 1
debug2: fd 3 setting TCP_NODELAY
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: Wrote 272 bytes for a total of 1893
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
It appears, to my untrained eye/mind, that a directive from instance A
to B was issued, and then what happened? I don't see that it was
honored by instance B.
Can you please comment on this?
Thank you.
Regards,
Tena
On 2/16/11 1:34 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
Hi Tena
I hope somebody more knowledgeable in ssh
takes a look at the debug3 session log that you included.
I can't see if/where/why ssh is failing for you in EC2.
See other answers inline, please.
Tena Sakai wrote:
Hi Gus,
Thank you again for your reply.
A slight difference is that on vixen and dasher you ran the
MPI hostname tests as a regular user, not as root, right?
Not sure if this will make much of a difference,
but it may be worth trying to run it as a regular user in EC2 also.
In general, most people avoid running user applications (MPI programs
included) as root.
Mostly for safety, but I wonder if there are any
implications in the 'rootly powers'
regarding the under-the-hood processes that OpenMPI
launches along with the actual user programs.
Yes, between vixen and dasher I was doing the test as user tsakai,
not as root. But the reason I wanted to do this test as root is
to show that it fails as a regular user (generating a "pipe system
call failed" error), whereas as root it would succeed, as it did
on Friday.
Sorry again.
I even wrote "root can and Tena cannot", then I forgot.
Too many tasks at the same time, too much context-switching ...
The AMI has not changed. The last change to the AMI
was last Tuesday. As such, I don't understand this inconsistent
behavior. I have lots of notes from previous sessions and I
consulted different successful session logs to replicate what I
saw Friday, but with no success.
Having spent days and not getting anywhere, I decided to take a
different approach. I instantiated a Linux AMI built by
Amazon, which feels CentOS/Fedora-based. I downloaded gcc
and the C++ compiler, plus OpenMPI 1.4.3. After I got OpenMPI running, I
created an account for user tsakai, uploaded my public key, logged back
in as user tsakai, and ran the same test. Surprisingly (or not?) it
generated the same result. That is, I cannot run the same mpirun
command when there is a remote instance involved, but by itself
mpirun runs fine. So, I am feeling that this has to be an ssh
authentication problem. I looked at the man pages for ssh and ssh_config
and cannot figure out what I am doing wrong. I put in a "LogLevel
DEBUG3" line and it generated lots of output, in which I found this
line:
debug1: Authentication succeeded (publickey).
Then I see a bunch of lines that look like:
debug3: Ignored env XXXXXXX
and mpirun hangs. Here is the session log:
Ssh on our clusters uses host-based authentication.
I think Reuti sent you his page about it:
http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
However, I believe OpenMPI shouldn't care which ssh authentication
mechanism is used, as long as it works without a password.
As for ssh configuration, ours is pretty standard:
1) We don't have 'IdentitiesOnly yes' (default is 'no'),
but use standard identity file names id_rsa, etc.
I think you are just telling ssh to use the specific identity
file you named.
I don't know whether this causes the problem, but who knows?
2) We don't have 'BatchMode yes' set.
3) We have the GSS authentication set
GSSAPIAuthentication yes
4) The locale environment variables are also passed
(may not be crucial):
SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
LC_MESSAGES
SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
SendEnv LC_IDENTIFICATION LC_ALL
5) And X forwarding (you're not doing any X stuff, I suppose):
ForwardX11Trusted yes
6) However, you may want to check what is in your
/etc/ssh/ssh_config and /etc/ssh/sshd_config,
because some options may be already set there.
7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
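[Pulling items 1-5 together, the client-side ~/.ssh/config being described would look roughly like the sketch below -- just a restatement of the list above, not a verified working configuration:

  Host *
      GSSAPIAuthentication yes
      ForwardX11Trusted yes
      SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
      SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
      SendEnv LC_IDENTIFICATION LC_ALL
]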
***
Finally, if you are willing to, it may be worth running the same
experiment (with debug3) on vixen and dasher, just to compare what
comes out of the verbose ssh messages with what you see in EC2.
Perhaps it may help nail down the reason for the failure.
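[One simple way to get the same debug3-level detail on vixen/dasher without editing ~/.ssh/config is ssh's -vvv flag, which corresponds to LogLevel DEBUG3; the hostname below is only illustrative:

  ssh -vvv dasher /bin/hostname 2> ssh-debug3.log
]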
Gus Correa
[tsakai@vixen ec2]$
[tsakai@vixen ec2]$ ssh -i $MYKEY
tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
[tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
-bash: service: command not found
[tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no
password authentication
[tsakai@domU-12-31-39-16-75-1E ~]$ ssh
domU-12-31-39-16-4E-4C.compute-1.internal
Last login: Wed Feb 16 06:53:14 2011 from
domu-12-31-39-16-75-1e.compute-1.internal
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ ssh
domU-12-31-39-16-75-1E.compute-1.internal
Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
__| __|_ ) Amazon Linux AMI
_| ( / Beta
___|\___|___|
See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
:-)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # OK
[tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
[tsakai@domU-12-31-39-16-75-1E ~]$ exit
logout
Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
LD_LIBRARY_PATH=:/usr/local/lib
[tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
[tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
[tsakai@domU-12-31-39-16-4E-4C ~]$ exit
logout
Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
LD_LIBRARY_PATH=:/usr/local/lib
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine);
bottom 2 are remote inst (inst B)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
back when launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when
launched ***
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
domU-12-31-39-16-75-1E
domU-12-31-39-16-75-1E
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
-rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
LogLevel DEBUG3
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
IdentityFile /home/tsakai/.ssh/tsakai
IdentitiesOnly yes
BatchMode yes
LogLevel DEBUG3
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
-rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
debug2: ssh_connect: needpriv 0
debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal
[10.96.77.182] port 22.
debug1: Connection established.
debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file /home/tsakai/.ssh/tsakai type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.3
debug2: fd 3 setting O_NONBLOCK
debug1: SSH2_MSG_KEXINIT sent
debug3: Wrote 792 bytes for a total of 813
debug1: SSH2_MSG_KEXINIT received
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: mac_setup: found hmac-md5
debug1: kex: server->client aes128-ctr hmac-md5 none
debug2: mac_setup: found hmac-md5
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug3: Wrote 24 bytes for a total of 837
debug2: dh_gen_key: priv key bits set: 125/256
debug2: bits set: 489/1024
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug3: Wrote 144 bytes for a total of 981
debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
debug3: check_host_in_hostfile: match line 1
debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
debug3: check_host_in_hostfile: match line 1
debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and
matches the RSA host key.
debug1: Found key in /home/tsakai/.ssh/known_hosts:1
debug2: bits set: 491/1024
debug1: ssh_rsa_verify: signature correct
debug2: kex_derive_keys
debug2: set_newkeys: mode 1
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: Wrote 16 bytes for a total of 997
debug2: set_newkeys: mode 0
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug3: Wrote 48 bytes for a total of 1045
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug2: key: /home/tsakai/.ssh/tsakai ((nil))
debug3: Wrote 64 bytes for a total of 1109
debug1: Authentications that can continue: publickey
debug3: start over, passed a different list publickey
debug3: preferred gssapi-with-mic,publickey
debug3: authmethod_lookup publickey
debug3: remaining preferred: ,publickey
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Trying private key: /home/tsakai/.ssh/tsakai
debug1: read PEM private key done: type RSA
debug3: sign_and_send_pubkey
debug2: we sent a publickey packet, wait for reply
debug3: Wrote 384 bytes for a total of 1493
debug1: Authentication succeeded (publickey).
debug2: fd 4 setting O_NONBLOCK
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug1: Requesting no-more-sessi...@openssh.com
debug1: Entering interactive session.
debug3: Wrote 128 bytes for a total of 1621
debug2: callback start
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env HOSTNAME
debug3: Ignored env TERM
debug3: Ignored env SHELL
debug3: Ignored env HISTSIZE
debug3: Ignored env EC2_AMITOOL_HOME
debug3: Ignored env SSH_CLIENT
debug3: Ignored env SSH_TTY
debug3: Ignored env USER
debug3: Ignored env LD_LIBRARY_PATH
debug3: Ignored env LS_COLORS
debug3: Ignored env EC2_HOME
debug3: Ignored env MAIL
debug3: Ignored env PATH
debug3: Ignored env INPUTRC
debug3: Ignored env PWD
debug3: Ignored env JAVA_HOME
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 0: request env confirm 0
debug3: Ignored env AWS_CLOUDWATCH_HOME
debug3: Ignored env AWS_IAM_HOME
debug3: Ignored env SHLVL
debug3: Ignored env HOME
debug3: Ignored env AWS_PATH
debug3: Ignored env AWS_AUTO_SCALING_HOME
debug3: Ignored env LOGNAME
debug3: Ignored env AWS_ELB_HOME
debug3: Ignored env SSH_CONNECTION
debug3: Ignored env LESSOPEN
debug3: Ignored env AWS_RDS_HOME
debug3: Ignored env G_BROKEN_FILENAMES
debug3: Ignored env _
debug3: Ignored env OLDPWD
debug3: Ignored env OMPI_MCA_plm
debug1: Sending command: orted --daemonize -mca ess env -mca
orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
--hnp-uri "125566976.0;tcp://10.96.118.236:56064"
debug2: channel 0: request exec confirm 1
debug2: fd 3 setting TCP_NODELAY
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: Wrote 272 bytes for a total of 1893
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: read<=0 rfd 4 len 0
debug2: channel 0: read failed
debug2: channel 0: close_read
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug3: Wrote 32 bytes for a total of 1925
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: channel 0: close_write
debug2: channel 0: output drain -> closed
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
#0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e 6 c -1
debug3: Wrote 32 bytes for a total of 1957
debug3: Wrote 64 bytes for a total of 2021
debug1: fd 0 clearing O_NONBLOCK
Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
Bytes per second: sent 18384.8, received 18944.3
debug1: Exit status 0
# it is hanging; I am about to issue control-C
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
back when launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
[tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when
launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
[tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ exit
logout
[tsakai@vixen ec2]$
[tsakai@vixen ec2]$
Do you see anything strange?
One final question: the ssh man page mentions a few environment
variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
any of these matter as far as OpenMPI is concerned?
Thank you, Gus.
Regards,
Tena
On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
Tena Sakai wrote:
Hi,
I am trying to reproduce what I was able to show last Friday on Amazon
EC2 instances, but I am having a problem. What I was able to show last
Friday as root was with this command:
mpirun -app app.ac
with app.ac being:
-H dns-entry-A -np 1 (linux command)
-H dns-entry-A -np 1 (linux command)
-H dns-entry-B -np 1 (linux command)
-H dns-entry-B -np 1 (linux command)
Here's the config file in root's .ssh directory:
Host *
IdentityFile /root/.ssh/.derobee/.kagi
IdentitiesOnly yes
BatchMode yes
Yesterday and today I can't get this to work. I made the last part of
the app.ac file simpler (it now says /bin/hostname). Below is the session:
-bash-3.2#
-bash-3.2# # I am on instance A, host name for inst A is:
-bash-3.2# hostname
domU-12-31-39-09-CD-C2
-bash-3.2#
-bash-3.2# nslookup domU-12-31-39-09-CD-C2
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: domU-12-31-39-09-CD-C2.compute-1.internal
Address: 10.210.210.48
-bash-3.2# cd .ssh
-bash-3.2#
-bash-3.2# cat config
Host *
IdentityFile /root/.ssh/.derobee/.kagi
IdentitiesOnly yes
BatchMode yes
-bash-3.2#
-bash-3.2# ll config
-rw-r--r-- 1 root root 103 Feb 15 17:18 config
-bash-3.2#
-bash-3.2# chmod 600 config
-bash-3.2#
-bash-3.2# # show I can go to inst B without password/passphrase
-bash-3.2#
-bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
-bash-3.2#
-bash-3.2# hostname
domU-12-31-39-09-E6-71
-bash-3.2#
-bash-3.2# nslookup `hostname`
Server: 172.16.0.23
Address: 172.16.0.23#53
Non-authoritative answer:
Name: domU-12-31-39-09-E6-71.compute-1.internal
Address: 10.210.233.123
-bash-3.2# # and back to inst A is also no problem
-bash-3.2#
-bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
-bash-3.2#
-bash-3.2# hostname
domU-12-31-39-09-CD-C2
-bash-3.2#
-bash-3.2# # log out twice to go back to inst A
-bash-3.2# exit
logout
Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
-bash-3.2#
-bash-3.2# exit
logout
Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
-bash-3.2#
-bash-3.2# hostname
domU-12-31-39-09-CD-C2
-bash-3.2#
-bash-3.2# cd ..
-bash-3.2#
-bash-3.2# pwd
/root
-bash-3.2#
-bash-3.2# ll
total 8
-rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
-rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
-bash-3.2#
-bash-3.2# cat app.ac
-H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
-bash-3.2#
-bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
-bash-3.2# mpirun -app app.ac
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
report back when launched
-bash-3.2#
-bash-3.2# cat app.ac2
-H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
-bash-3.2#
-bash-3.2# # when there is no remote machine, then mpirun works:
-bash-3.2# mpirun -app app.ac2
domU-12-31-39-09-CD-C2
domU-12-31-39-09-CD-C2
-bash-3.2#
-bash-3.2# hostname
domU-12-31-39-09-CD-C2
-bash-3.2#
-bash-3.2# # this gotta be ssh problem....
-bash-3.2#
-bash-3.2# # show no firewall is used
-bash-3.2# iptables --list
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
-bash-3.2#
-bash-3.2# exit
logout
[tsakai@vixen ec2]$
Would someone please point out what I am doing wrong?
Thank you.
Regards,
Tena
Hi Tena
Nothing wrong that I can see.
Just another couple of suggestions,
based on somewhat vague possibilities.
A slight difference is that on vixen and dasher you ran the
MPI hostname tests as a regular user, not as root, right?
Not sure if this will make much of a difference,
but it may be worth trying to run it as a regular user in EC2 also.
In general, most people avoid running user applications (MPI programs
included) as root.
Mostly for safety, but I wonder if there are any
implications in the 'rootly powers'
regarding the under-the-hood processes that OpenMPI
launches along with the actual user programs.
This may make no difference either,
but you could do a 'service iptables status'
to see if the service is running, even though there are
no explicit iptables rules (as per your email).
If the service is not running, you get
'Firewall is stopped.' (in CentOS).
I *think* 'iptables --list' loads the iptables module into the
kernel, as a side effect, whereas the service command does not.
So, it may be cleaner (safer?) to use the service version
instead of 'iptables --list'.
I don't know if it will make any difference,
but just in case, if the service is running,
why not do 'service iptables stop',
and perhaps also 'chkconfig iptables off' to be completely
free of iptables?
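[In other words, the CentOS-style commands suggested above, consolidated:

  service iptables status    # is the firewall service running?
  service iptables stop      # stop it for the current boot
  chkconfig iptables off     # keep it from starting at boot
]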
Gus Correa