Hi Gus,

Thank you for the explanation. Your analogy (ping without pong) makes sense,
and it is largely congruent with what goes on in my mind.

Regards,

Tena
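(A concrete reading of the "ping without pong" picture discussed below: the
missing pong is orted's TCP connection back from instance B to mpirun on
instance A, at the address mpirun advertises via --hnp-uri --
"tcp://10.96.118.236:56064" in the log, though the port changes on every run.
A rough way to test that return path, assuming the netcat (nc) and net-tools
packages happen to be installed on the instances:

    # On instance A, while mpirun is hanging (the port only exists during the run):
    $ netstat -tln | grep 56064       # is mpirun listening?

    # On instance B:
    $ nc -vz 10.96.118.236 56064      # can B reach that port at all?

If nc cannot connect, a firewall or network restriction between the instances,
rather than ssh, would be the likely culprit.)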
On 2/16/11 4:31 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> Again, I think your EC2 session log with ssh debug3 level (below)
> should be looked at by somebody more knowledgeable in OpenMPI
> and in ssh than me.
> There must be some clue to what is going on there.
>
> Ssh experts, Jeff, Ralph, please help!
>
> Anyway ...
> AFAIK, 'orted' in the first line you selected/highlighted below
> is the OpenMPI Run-Time Environment daemon, the ORTE daemon
> (... the OpenMPI pros are authorized to send me to the galleys
> if it is not ...).
> So, orted is trying to do its thing, to create the conditions for your
> job to run across the two EC2 'instances'. (Gone are the naive
> days when these things were computers, each one in its own box ...)
> This master-of-ceremonies work of orted is done via tcp, and I guess
> 10.96.118.236 is the IP of the machine where mpirun runs (instance A?),
> and 56064 is probably the port where the remote orted is expected
> to connect back.
> The bunch of -mca parameters are just what they are: MCA parameters
> (MCA = Modular Component Architecture of OpenMPI, and here I am risking
> being shanghaied or ridiculed again ...).
> (You can learn more about the mca parameters with 'ompi_info -help';
> a short sketch follows further down.)
> That is how, in my ignorance, I parse that line.
>
> So, from the computer/instance-A side orted gives the first kick,
> but somehow the ball never comes back from computer/instance-B.
> It's ping without pong.
> The same frustrating feeling I had when I was a kid and kicked the
> soccer ball onto the neighbor's side, never to see it again.
>
> Cheers,
> Gus
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you for your reply and suggestions.
>>
>> I will follow up on these in a bit and will give you an
>> update. Looking at what vixen and/or dasher generates
>> from DEBUG3 would be interesting.
>>
>> For now, may I point out something I noticed in the
>> DEBUG3 output last night?
>>
>> I found this line:
>>
>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>
>> Followed by:
>>
>>> debug2: channel 0: request exec confirm 1
>>> debug2: fd 3 setting TCP_NODELAY
>>> debug2: callback done
>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>> debug3: Wrote 272 bytes for a total of 1893
>>> debug2: channel 0: rcvd adjust 2097152
>>> debug2: channel_input_status_confirm: type 99 id 0
>>
>> It appears, to my untrained eye/mind, that a directive from instance A
>> to B was issued -- and then what happened? I don't see that it was
>> honored by instance B.
>>
>> Can you please comment on this?
>>
>> Thank you.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/16/11 1:34 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Hi Tena
>>>
>>> I hope somebody more knowledgeable in ssh
>>> takes a look at the debug3 session log that you included.
>>>
>>> I can't see if/where/why ssh is failing for you in EC2.
>>>
>>> See other answers inline, please.
>>>
>>> Tena Sakai wrote:
>>>> Hi Gus,
>>>>
>>>> Thank you again for your reply.
>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
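(Since MCA parameters come up in Gus's note above, here is a short sketch of
how to inspect them with the Open MPI 1.4-era ompi_info tool; the framework
and component names below are real, but which parameters matter for this hang
is an open question:

    $ ompi_info --help              # general usage
    $ ompi_info --param all all     # dump every MCA parameter and its current value
    $ ompi_info --param plm rsh     # just the ssh/rsh launcher ("plm") parameters

The OMPI_MCA_plm variable visible in the DEBUG3 log further below is the
environment-variable spelling of the same mechanism: setting OMPI_MCA_<name>
overrides the MCA parameter <name> for that run.)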
>>>> Yes, between vixen and dasher I was doing the test as user tsakai,
>>>> not as root. But the reason I wanted to do this test as root is
>>>> to show that it fails as a regular user (generating a 'pipe system
>>>> call failed' error), whereas as root it would succeed, as it did
>>>> on Friday.
>>>
>>> Sorry again.
>>> I even wrote "root can and Tena cannot", then I forgot.
>>> Too many tasks at the same time, too much context-switching ...
>>>
>>>> The ami has not changed. The last change on the ami
>>>> was last Tuesday. As such, I don't understand this inconsistent
>>>> behavior. I have lots of notes from previous sessions and I
>>>> consulted different successful session logs to replicate what I
>>>> saw Friday, but with no success.
>>>>
>>>> Having spent days and not getting anywhere, I decided to take a
>>>> different approach. I instantiated a linux ami that was built by
>>>> Amazon, which feels like it is centos/fedora-based. I downloaded gcc
>>>> and c++, plus OpenMPI 1.4.3. After I got OpenMPI running, I
>>>> created an account for user tsakai, uploaded my public key, re-logged
>>>> in as user tsakai, and ran the same test. Surprisingly (or not?) it
>>>> generated the same result. I.e., I cannot run the same mpirun
>>>> command when there is a remote instance involved, but by itself
>>>> mpirun runs fine. So, I am feeling that this has to be an ssh
>>>> authentication problem. I looked at the man pages for ssh and
>>>> ssh_config and cannot figure out what I am doing wrong. I put in a
>>>> "LogLevel DEBUG3" line and it generated lots of output, in which I
>>>> found this line:
>>>> debug1: Authentication succeeded (publickey).
>>>> Then I see a bunch of lines that look like:
>>>> debug3: Ignored env XXXXXXX
>>>> and mpirun hangs. Here is the session log:
>>>>
>>> Ssh on our clusters uses host-based authentication.
>>> I think Reuti sent you his page about it:
>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>>>
>>> However, I believe OpenMPI shouldn't care which ssh authentication
>>> mechanism is used, as long as it works passwordless.
>>>
>>> As for ssh configuration, ours is pretty standard:
>>>
>>> 1) We don't have 'IdentitiesOnly yes' (the default is 'no'),
>>> but use the standard identity file names id_rsa, etc.
>>> I think you are just telling ssh to use the specific identity
>>> file you named.
>>> I don't know if this may cause the problem, but who knows?
>>>
>>> 2) We don't have 'BatchMode yes' set.
>>>
>>> 3) We have GSS authentication set:
>>>
>>> GSSAPIAuthentication yes
>>>
>>> 4) The locale environment variables are also passed
>>> (may not be crucial):
>>>
>>> SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
>>> LC_MESSAGES
>>> SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
>>> SendEnv LC_IDENTIFICATION LC_ALL
>>>
>>> 5) And X forwarding (you're not doing any X stuff, I suppose):
>>>
>>> ForwardX11Trusted yes
>>>
>>> 6) However, you may want to check what is in your
>>> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
>>> because some options may already be set there.
>>>
>>> 7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
>>>
>>> ***
>>>
>>> Finally, if you are willing, it may be worth running the same
>>> experiment (with debug3) on vixen and dasher, just to compare what
>>> comes out of the verbose ssh messages with what you see in EC2.
>>> Perhaps it may help nail down the reason for failure.
>>>
>>> Gus Correa
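(Pulling items 1-5 of Gus's list together: a minimal ~/.ssh/config in the same
spirit might look like the sketch below. It is illustrative only -- option
sets are site-specific, and none of these lines is known to fix the EC2 hang:

    Host *
        GSSAPIAuthentication yes
        ForwardX11Trusted yes
        SendEnv LANG LC_*
        # No IdentitiesOnly or BatchMode lines: both default to 'no', and
        # the default identity files (~/.ssh/id_rsa, ...) are used.

Note the contrast with the config in the session below, which pins a
non-default identity file and turns IdentitiesOnly and BatchMode on.)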
>>>>
>>>> [tsakai@vixen ec2]$
>>>> [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
>>>> Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>>>
>>>>        __|  __|_  )  Amazon Linux AMI
>>>>        _|  (     /       Beta
>>>>       ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
>>>> -bash: service: command not found
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>>>> iptables: Firewall is not running.
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no password authentication
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ ssh domU-12-31-39-16-4E-4C.compute-1.internal
>>>> Last login: Wed Feb 16 06:53:14 2011 from domu-12-31-39-16-75-1e.compute-1.internal
>>>>
>>>>        __|  __|_  )  Amazon Linux AMI
>>>>        _|  (     /       Beta
>>>>       ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ ssh domU-12-31-39-16-75-1E.compute-1.internal
>>>> Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>>>
>>>>        __|  __|_  )  Amazon Linux AMI
>>>>        _|  (     /       Beta
>>>>       ___|\___|___|
>>>>
>>>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # OK
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>>> logout
>>>> Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>>>> LD_LIBRARY_PATH=:/usr/local/lib
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>>>> iptables: Firewall is not running.
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>>>> [tsakai@domU-12-31-39-16-4E-4C ~]$ exit
>>>> logout
>>>> Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>>>> LD_LIBRARY_PATH=:/usr/local/lib
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine); bottom 2 are remote inst (inst B)
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>> ^Cmpirun: killing job...
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when launched ***
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
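(On "daemon did not report back when launched": it typically means the orted
that ssh was asked to start on instance B either never ran or never phoned
home. Two quick checks from instance A, using only stock commands and the
hostname from this session:

    # 1. Can a non-interactive ssh find orted on instance B?
    $ ssh domU-12-31-39-16-4E-4C.compute-1.internal which orted

    # 2. What PATH and LD_LIBRARY_PATH does a non-interactive shell on B get?
    $ ssh domU-12-31-39-16-4E-4C.compute-1.internal 'echo $PATH; echo $LD_LIBRARY_PATH'

If both look right, the remaining suspect is the TCP connection back from B to
mpirun, as in the note near the top of this message.)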
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>>>> domU-12-31-39-16-75-1E
>>>> domU-12-31-39-16-75-1E
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>>> -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>>>> LogLevel DEBUG3
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>>>> Host *
>>>> IdentityFile /home/tsakai/.ssh/tsakai
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> LogLevel DEBUG3
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>>>> -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>>>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>>>> debug2: ssh_connect: needpriv 0
>>>> debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal [10.96.77.182] port 22.
>>>> debug1: Connection established.
>>>> debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>>>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>>>> debug3: key_read: missing keytype
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug3: key_read: missing whitespace
>>>> debug2: key_type_from_name: unknown key type '-----END'
>>>> debug3: key_read: missing keytype
>>>> debug1: identity file /home/tsakai/.ssh/tsakai type -1
>>>> debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>>>> debug1: match: OpenSSH_5.3 pat OpenSSH*
>>>> debug1: Enabling compatibility mode for protocol 2.0
>>>> debug1: Local version string SSH-2.0-OpenSSH_5.3
>>>> debug2: fd 3 setting O_NONBLOCK
>>>> debug1: SSH2_MSG_KEXINIT sent
>>>> debug3: Wrote 792 bytes for a total of 813
>>>> debug1: SSH2_MSG_KEXINIT received
>>>> debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>>> debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>>> debug2: kex_parse_kexinit: reserved 0
>>>> debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>>>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>>>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>>>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>>>> debug2: kex_parse_kexinit: none,z...@openssh.com
>>>> debug2: kex_parse_kexinit: none,z...@openssh.com
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit:
>>>> debug2: kex_parse_kexinit: first_kex_follows 0
>>>> debug2: kex_parse_kexinit: reserved 0
>>>> debug2: mac_setup: found hmac-md5
>>>> debug1: kex: server->client aes128-ctr hmac-md5 none
>>>> debug2: mac_setup: found hmac-md5
>>>> debug1: kex: client->server aes128-ctr hmac-md5 none
>>>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>>>> debug3: Wrote 24 bytes for a total of 837
>>>> debug2: dh_gen_key: priv key bits set: 125/256
>>>> debug2: bits set: 489/1024
>>>> debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>>>> debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>>>> debug3: Wrote 144 bytes for a total of 981
>>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>> debug3: check_host_in_hostfile: match line 1
>>>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>>>> debug3: check_host_in_hostfile: match line 1
>>>> debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and matches the RSA host key.
>>>> debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>>>> debug2: bits set: 491/1024
>>>> debug1: ssh_rsa_verify: signature correct
>>>> debug2: kex_derive_keys
>>>> debug2: set_newkeys: mode 1
>>>> debug1: SSH2_MSG_NEWKEYS sent
>>>> debug1: expecting SSH2_MSG_NEWKEYS
>>>> debug3: Wrote 16 bytes for a total of 997
>>>> debug2: set_newkeys: mode 0
>>>> debug1: SSH2_MSG_NEWKEYS received
>>>> debug1: SSH2_MSG_SERVICE_REQUEST sent
>>>> debug3: Wrote 48 bytes for a total of 1045
>>>> debug2: service_accept: ssh-userauth
>>>> debug1: SSH2_MSG_SERVICE_ACCEPT received
>>>> debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>>>> debug3: Wrote 64 bytes for a total of 1109
>>>> debug1: Authentications that can continue: publickey
>>>> debug3: start over, passed a different list publickey
>>>> debug3: preferred gssapi-with-mic,publickey
>>>> debug3: authmethod_lookup publickey
>>>> debug3: remaining preferred: ,publickey
>>>> debug3: authmethod_is_enabled publickey
>>>> debug1: Next authentication method: publickey
>>>> debug1: Trying private key: /home/tsakai/.ssh/tsakai
>>>> debug1: read PEM private key done: type RSA
>>>> debug3: sign_and_send_pubkey
>>>> debug2: we sent a publickey packet, wait for reply
>>>> debug3: Wrote 384 bytes for a total of 1493
>>>> debug1: Authentication succeeded (publickey).
>>>> debug2: fd 4 setting O_NONBLOCK
>>>> debug1: channel 0: new [client-session]
>>>> debug3: ssh_session2_open: channel_new: 0
>>>> debug2: channel 0: send open
>>>> debug1: Requesting no-more-sessi...@openssh.com
>>>> debug1: Entering interactive session.
>>>> debug3: Wrote 128 bytes for a total of 1621
>>>> debug2: callback start
>>>> debug2: client_session2_setup: id 0
>>>> debug1: Sending environment.
>>>> debug3: Ignored env HOSTNAME
>>>> debug3: Ignored env TERM
>>>> debug3: Ignored env SHELL
>>>> debug3: Ignored env HISTSIZE
>>>> debug3: Ignored env EC2_AMITOOL_HOME
>>>> debug3: Ignored env SSH_CLIENT
>>>> debug3: Ignored env SSH_TTY
>>>> debug3: Ignored env USER
>>>> debug3: Ignored env LD_LIBRARY_PATH
>>>> debug3: Ignored env LS_COLORS
>>>> debug3: Ignored env EC2_HOME
>>>> debug3: Ignored env MAIL
>>>> debug3: Ignored env PATH
>>>> debug3: Ignored env INPUTRC
>>>> debug3: Ignored env PWD
>>>> debug3: Ignored env JAVA_HOME
>>>> debug1: Sending env LANG = en_US.UTF-8
>>>> debug2: channel 0: request env confirm 0
>>>> debug3: Ignored env AWS_CLOUDWATCH_HOME
>>>> debug3: Ignored env AWS_IAM_HOME
>>>> debug3: Ignored env SHLVL
>>>> debug3: Ignored env HOME
>>>> debug3: Ignored env AWS_PATH
>>>> debug3: Ignored env AWS_AUTO_SCALING_HOME
>>>> debug3: Ignored env LOGNAME
>>>> debug3: Ignored env AWS_ELB_HOME
>>>> debug3: Ignored env SSH_CONNECTION
>>>> debug3: Ignored env LESSOPEN
>>>> debug3: Ignored env AWS_RDS_HOME
>>>> debug3: Ignored env G_BROKEN_FILENAMES
>>>> debug3: Ignored env _
>>>> debug3: Ignored env OLDPWD
>>>> debug3: Ignored env OMPI_MCA_plm
>>>> debug1: Sending command: orted --daemonize -mca ess env -mca
>>>> orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2
>>>> --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>>>> debug2: channel 0: request exec confirm 1
>>>> debug2: fd 3 setting TCP_NODELAY
>>>> debug2: callback done
>>>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>>>> debug3: Wrote 272 bytes for a total of 1893
>>>> debug2: channel 0: rcvd adjust 2097152
>>>> debug2: channel_input_status_confirm: type 99 id 0
>>>> debug2: exec request accepted on channel 0
>>>> debug2: channel 0: read<=0 rfd 4 len 0
>>>> debug2: channel 0: read failed
>>>> debug2: channel 0: close_read
>>>> debug2: channel 0: input open -> drain
>>>> debug2: channel 0: ibuf empty
>>>> debug2: channel 0: send eof
>>>> debug2: channel 0: input drain -> closed
>>>> debug3: Wrote 32 bytes for a total of 1925
>>>> debug2: channel 0: rcvd eof
>>>> debug2: channel 0: output open -> drain
>>>> debug2: channel 0: obuf empty
>>>> debug2: channel 0: close_write
>>>> debug2: channel 0: output drain -> closed
>>>> debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>>>> debug2: channel 0: rcvd close
>>>> debug3: channel 0: will not send data after close
>>>> debug2: channel 0: almost dead
>>>> debug2: channel 0: gc: notify user
>>>> debug2: channel 0: gc: user detached
>>>> debug2: channel 0: send close
>>>> debug2: channel 0: is dead
>>>> debug2: channel 0: garbage collecting
>>>> debug1: channel 0: free: client-session, nchannels 1
>>>> debug3: channel 0: status: The following connections are open:
>>>>   #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>>>
>>>> debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>>>> debug3: Wrote 32 bytes for a total of 1957
>>>> debug3: Wrote 64 bytes for a total of 2021
>>>> debug1: fd 0 clearing O_NONBLOCK
>>>> Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>>>> Bytes per second: sent 18384.8, received 18944.3
>>>> debug1: Exit status 0
>>>> # it is hanging; I am about to issue control-C
>>>> ^Cmpirun: killing job...
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>>>> back when launched
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when launched
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$
>>>> [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>>>> logout
>>>> [tsakai@vixen ec2]$
>>>> [tsakai@vixen ec2]$
>>>>
>>>> Do you see anything strange?
>>>>
>>>> One final question: the ssh man page mentions a few environment
>>>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
>>>> any of these matter as far as OpenMPI is concerned?
>>>>
>>>> Thank you, Gus.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>>> On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>>>
>>>>> Tena Sakai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>>>> EC2 instances, but I am having a problem. What I was able to show last
>>>>>> Friday as root was with this command:
>>>>>> mpirun -app app.ac
>>>>>> with app.ac being:
>>>>>> -H dns-entry-A -np 1 (linux command)
>>>>>> -H dns-entry-A -np 1 (linux command)
>>>>>> -H dns-entry-B -np 1 (linux command)
>>>>>> -H dns-entry-B -np 1 (linux command)
>>>>>>
>>>>>> Here's the config file in root's .ssh directory:
>>>>>> Host *
>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>> IdentitiesOnly yes
>>>>>> BatchMode yes
>>>>>>
>>>>>> Yesterday and today I can't get this to work. I made the last part of
>>>>>> the app.ac file simpler (it now says /bin/hostname). Below is the session:
>>>>>>
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # I am on instance A, host name for inst A is:
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>>>> Server:  172.16.0.23
>>>>>> Address: 172.16.0.23#53
>>>>>>
>>>>>> Non-authoritative answer:
>>>>>> Name:    domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>> Address: 10.210.210.48
>>>>>>
>>>>>> -bash-3.2# cd .ssh
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat config
>>>>>> Host *
>>>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>>>> IdentitiesOnly yes
>>>>>> BatchMode yes
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ll config
>>>>>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# chmod 600 config
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # show I can go to inst B without password/passphrase
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>>>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-E6-71
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# nslookup `hostname`
>>>>>> Server:  172.16.0.23
>>>>>> Address: 172.16.0.23#53
>>>>>>
>>>>>> Non-authoritative answer:
>>>>>> Name:    domU-12-31-39-09-E6-71.compute-1.internal
>>>>>> Address: 10.210.233.123
>>>>>>
>>>>>> -bash-3.2# # and back to inst A is also no problem
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>>>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # log out twice to go back to inst A
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cd ..
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# pwd
>>>>>> /root
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# ll
>>>>>> total 8
>>>>>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>>>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat app.ac
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>>>> -bash-3.2# mpirun -app app.ac
>>>>>> mpirun: killing job...
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>> that caused that situation.
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>>>> below. Additional manual cleanup may be required - please refer to
>>>>>> the "orte-clean" tool for assistance.
>>>>>> --------------------------------------------------------------------------
>>>>>> domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>>>> report back when launched
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# cat app.ac2
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # when there is no remote machine, then mpirun works:
>>>>>> -bash-3.2# mpirun -app app.ac2
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# hostname
>>>>>> domU-12-31-39-09-CD-C2
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # this gotta be ssh problem....
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# # show no firewall is used
>>>>>> -bash-3.2# iptables --list
>>>>>> Chain INPUT (policy ACCEPT)
>>>>>> target     prot opt source               destination
>>>>>>
>>>>>> Chain FORWARD (policy ACCEPT)
>>>>>> target     prot opt source               destination
>>>>>>
>>>>>> Chain OUTPUT (policy ACCEPT)
>>>>>> target     prot opt source               destination
>>>>>> -bash-3.2#
>>>>>> -bash-3.2# exit
>>>>>> logout
>>>>>> [tsakai@vixen ec2]$
>>>>>>
>>>>>> Would someone please point out what I am doing wrong?
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Tena
>>>>>>
>>>>> Hi Tena
>>>>>
>>>>> Nothing wrong that I can see.
>>>>> Just another couple of suggestions,
>>>>> based on somewhat vague possibilities.
>>>>>
>>>>> A slight difference is that on vixen and dasher you ran the
>>>>> MPI hostname tests as a regular user, not as root, right?
>>>>> Not sure if this will make much of a difference,
>>>>> but it may be worth trying to run it as a regular user in EC2 also.
>>>>> In general most people avoid running user applications (MPI programs
>>>>> included) as root.
>>>>> Mostly for safety, but I wonder if there are any
>>>>> implications in the 'rootly powers'
>>>>> regarding the under-the-hood processes that OpenMPI
>>>>> launches along with the actual user programs.
>>>>>
>>>>> This may make no difference either,
>>>>> but you could do a 'service iptables status',
>>>>> to see if the service is running, even though there are
>>>>> no explicit iptables rules (as per your email).
>>>>> If the service is not running you get
>>>>> 'Firewall is stopped.' (in CentOS).
>>>>> I *think* 'iptables --list' loads the iptables module into the
>>>>> kernel as a side effect, whereas the service command does not.
>>>>> So, it may be cleaner (safer?) to use the service version
>>>>> instead of 'iptables --list'.
>>>>> I don't know if it will make any difference,
>>>>> but just in case, if the service is running,
>>>>> why not do 'service iptables stop',
>>>>> and perhaps also 'chkconfig iptables off' to be completely
>>>>> free of iptables?
>>>>>
>>>>> Gus Correa
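(Condensing the iptables suggestion above into commands -- run as root on each
instance; service and chkconfig are the stock CentOS-era tools Gus mentions:

    $ service iptables status    # check status, without iptables --list's module-loading side effect
    $ service iptables stop      # stop the firewall for this boot
    $ chkconfig iptables off     # keep it off across reboots

All three commands appear in Gus's message; this only lines them up in order.)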
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users