Hi Gus,

I am sorry for the delay. A busy day kept me from doing what I wanted
to do, namely to figure out the problem with mpirun on EC2 instances.
What I did just now was to:

 1) run mpirun with the same app.ac file as before (locally, between
    dasher and vixen) with the DEBUG3 setting in the .ssh/config file
 2) use the config settings you suggested, without DEBUG3, between
    dasher and vixen
 3) the same as 2), but with DEBUG3
 4) launch 2 EC2 instances
 5) ship the config file from 3) to one instance (instance A)
 6) log onto instance A
 7) verify that I can ssh between instances A and B without a passphrase
 8) run mpirun with the shipped config file (the one you suggested;
    see the P.S. below)

The result: in 8) I see slightly different debug info, but I don't
think it means much; behavior-wise, mpirun has been essentially the
same in all cases since Monday (though different from last Friday).

The gory detail is in the attached file. When you have a chance,
would you please give it a glance? Thank you very much.

Regards,

Tena
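P.S. For the record, the config file from 3), which gets shipped to
instance A in 5), is the one you suggested plus the debug line (it
also appears verbatim in the attached log):

    Host *
        IdentityFile /home/tsakai/.ssh/tsakai
        IdentitiesOnly yes
        BatchMode yes
        LogLevel DEBUG3

and the test itself is just 'mpirun -app app.ac', where app.ac asks
for two copies of /bin/hostname on instance A and two on instance B.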
On 2/16/11 1:34 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> I hope somebody more knowledgeable in ssh
> takes a look at the debug3 session log that you included.
>
> I can't see if/where/why ssh is failing for you in EC2.
>
> See other answers inline, please.
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you again for your reply.
>>
>>> A slight difference is that on vixen and dasher you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>
>> Yes, between vixen and dasher I was doing the test as user tsakai,
>> not as root. But the reason I wanted to do this test as root is
>> to show that it fails as a regular user (generating a "pipe system
>> call failed" error), whereas as root it would succeed, as it did
>> on Friday.
>
> Sorry again.
> I even wrote "root can and Tena cannot", then I forgot.
> Too many tasks at the same time, too much context-switching ...
>
>> The ami has not changed. The last change to the ami
>> was last Tuesday. As such, I don't understand this inconsistent
>> behavior. I have lots of notes from previous sessions, and I
>> consulted several successful session logs to replicate what I
>> saw Friday, but with no success.
>>
>> Having spent days and gotten nowhere, I decided to take a
>> different approach. I instantiated a linux ami built by
>> Amazon, which feels centos/fedora-based. I downloaded gcc
>> and c++, plus openMPI 1.4.3. After I got openMPI running, I
>> created an account for user tsakai, uploaded my public key,
>> re-logged in as user tsakai, and ran the same test. Surprisingly
>> (or not?) it generated the same result. I.e., I cannot run the
>> same mpirun command when a remote instance is involved, but on
>> the instance by itself mpirun runs fine. So I am feeling that
>> this has to be an ssh authentication problem. I looked at the
>> man pages for ssh and ssh_config and cannot figure out what I am
>> doing wrong. I put in a "LogLevel DEBUG3" line and it generated
>> lots of output, in which I found the line:
>>   debug1: Authentication succeeded (publickey).
>> Then I see a bunch of lines that look like:
>>   debug3: Ignored env XXXXXXX
>> and mpirun hangs. Here is the session log:
>
> Ssh on our clusters uses host-based authentication.
> I think Reuti sent you his page about it:
> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>
> However, I believe OpenMPI shouldn't care which ssh authentication
> mechanism is used, as long as it works passwordless.
>
> As for ssh configuration, ours is pretty standard:
>
> 1) We don't have 'IdentitiesOnly yes' (default is 'no'),
> but use standard identity file names id_rsa, etc.
> I think you are just telling ssh to use the specific identity
> file you named.
> I don't know if this may cause the problem, but who knows?
>
> 2) We don't have 'BatchMode yes' set.
>
> 3) We have GSS authentication set:
>
> GSSAPIAuthentication yes
>
> 4) The locale environment variables are also passed
> (may not be crucial):
>
> SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY
> LC_MESSAGES
> SendEnv LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT
> SendEnv LC_IDENTIFICATION LC_ALL
>
> 5) And X forwarding (you're not doing any X stuff, I suppose):
>
> ForwardX11Trusted yes
>
> 6) However, you may want to check what is in your
> /etc/ssh/ssh_config and /etc/ssh/sshd_config,
> because some options may already be set there.
>
> 7) Take a look at 'man ssh[d]' and 'man ssh[d]_config' too.
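> Spelled out as one config stanza, items 1)-5) amount to roughly
> the following (a sketch, not a copy of our actual file, so
> double-check the option names against 'man ssh_config'):
>
>     Host *
>         GSSAPIAuthentication yes
>         ForwardX11Trusted yes
>         SendEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE
>         SendEnv LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS
>         SendEnv LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL
>         # note: no IdentitiesOnly, no BatchMode, default IdentityFile names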
> ***
>
> Finally, if you are willing, it may be worth running the same
> experiment (with debug3) on vixen and dasher, just to compare what
> comes out of the verbose ssh messages with what you see in EC2.
> Perhaps it may help nail down the reason for the failure.
>
> Gus Correa
>
>> [tsakai@vixen ec2]$
>> [tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
>> Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1
>>
>>    __|  __|_  )  Amazon Linux AMI
>>    _|  (     /       Beta
>>   ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
>> [tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
>> -bash: service: command not found
>> [tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
>> iptables: Firewall is not running.
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no password authentication
>> [tsakai@domU-12-31-39-16-75-1E ~]$ ssh domU-12-31-39-16-4E-4C.compute-1.internal
>> Last login: Wed Feb 16 06:53:14 2011 from domu-12-31-39-16-75-1e.compute-1.internal
>>
>>    __|  __|_  )  Amazon Linux AMI
>>    _|  (     /       Beta
>>   ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ ssh domU-12-31-39-16-75-1E.compute-1.internal
>> Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1
>>
>>    __|  __|_  )  Amazon Linux AMI
>>    _|  (     /       Beta
>>   ___|\___|___|
>>
>> See /usr/share/doc/amzn-ami/image-release-notes for latest release notes.
>> :-)
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # OK
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
>> [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>> logout
>> Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
>> LD_LIBRARY_PATH=:/usr/local/lib
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
>> iptables: Firewall is not running.
>> [tsakai@domU-12-31-39-16-4E-4C ~]$
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
>> [tsakai@domU-12-31-39-16-4E-4C ~]$ exit
>> logout
>> Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
>> LD_LIBRARY_PATH=:/usr/local/lib
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine); bottom 2 are remote inst (inst B)
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>> ^Cmpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report back when launched
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when launched ***
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
>> domU-12-31-39-16-75-1E
>> domU-12-31-39-16-75-1E
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>> -rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
>> LogLevel DEBUG3
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
>> Host *
>> IdentityFile /home/tsakai/.ssh/tsakai
>> IdentitiesOnly yes
>> BatchMode yes
>> LogLevel DEBUG3
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
>> -rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$
>> [tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
>> debug2: ssh_connect: needpriv 0
>> debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal [10.96.77.182] port 22.
>> debug1: Connection established.
>> debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
>> debug2: key_type_from_name: unknown key type '-----BEGIN'
>> debug3: key_read: missing keytype
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug3: key_read: missing whitespace
>> debug2: key_type_from_name: unknown key type '-----END'
>> debug3: key_read: missing keytype
>> debug1: identity file /home/tsakai/.ssh/tsakai type -1
>> debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
>> debug1: match: OpenSSH_5.3 pat OpenSSH*
>> debug1: Enabling compatibility mode for protocol 2.0
>> debug1: Local version string SSH-2.0-OpenSSH_5.3
>> debug2: fd 3 setting O_NONBLOCK
>> debug1: SSH2_MSG_KEXINIT sent
>> debug3: Wrote 792 bytes for a total of 813
>> debug1: SSH2_MSG_KEXINIT received
>> debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>> debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit: first_kex_follows 0
>> debug2: kex_parse_kexinit: reserved 0
>> debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
>> debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>> debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
>> debug2: kex_parse_kexinit: none,z...@openssh.com
>> debug2: kex_parse_kexinit: none,z...@openssh.com
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit:
>> debug2: kex_parse_kexinit: first_kex_follows 0
>> debug2: kex_parse_kexinit: reserved 0
>> debug2: mac_setup: found hmac-md5
>> debug1: kex: server->client aes128-ctr hmac-md5 none
>> debug2: mac_setup: found hmac-md5
>> debug1: kex: client->server aes128-ctr hmac-md5 none
>> debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
>> debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
>> debug3: Wrote 24 bytes for a total of 837
>> debug2: dh_gen_key: priv key bits set: 125/256
>> debug2: bits set: 489/1024
>> debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
>> debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
>> debug3: Wrote 144 bytes for a total of 981
>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>> debug3: check_host_in_hostfile: match line 1
>> debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
>> debug3: check_host_in_hostfile: match line 1
>> debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and matches the RSA host key.
>> debug1: Found key in /home/tsakai/.ssh/known_hosts:1
>> debug2: bits set: 491/1024
>> debug1: ssh_rsa_verify: signature correct
>> debug2: kex_derive_keys
>> debug2: set_newkeys: mode 1
>> debug1: SSH2_MSG_NEWKEYS sent
>> debug1: expecting SSH2_MSG_NEWKEYS
>> debug3: Wrote 16 bytes for a total of 997
>> debug2: set_newkeys: mode 0
>> debug1: SSH2_MSG_NEWKEYS received
>> debug1: SSH2_MSG_SERVICE_REQUEST sent
>> debug3: Wrote 48 bytes for a total of 1045
>> debug2: service_accept: ssh-userauth
>> debug1: SSH2_MSG_SERVICE_ACCEPT received
>> debug2: key: /home/tsakai/.ssh/tsakai ((nil))
>> debug3: Wrote 64 bytes for a total of 1109
>> debug1: Authentications that can continue: publickey
>> debug3: start over, passed a different list publickey
>> debug3: preferred gssapi-with-mic,publickey
>> debug3: authmethod_lookup publickey
>> debug3: remaining preferred: ,publickey
>> debug3: authmethod_is_enabled publickey
>> debug1: Next authentication method: publickey
>> debug1: Trying private key: /home/tsakai/.ssh/tsakai
>> debug1: read PEM private key done: type RSA
>> debug3: sign_and_send_pubkey
>> debug2: we sent a publickey packet, wait for reply
>> debug3: Wrote 384 bytes for a total of 1493
>> debug1: Authentication succeeded (publickey).
>> debug2: fd 4 setting O_NONBLOCK
>> debug1: channel 0: new [client-session]
>> debug3: ssh_session2_open: channel_new: 0
>> debug2: channel 0: send open
>> debug1: Requesting no-more-sessi...@openssh.com
>> debug1: Entering interactive session.
>> debug3: Wrote 128 bytes for a total of 1621
>> debug2: callback start
>> debug2: client_session2_setup: id 0
>> debug1: Sending environment.
>> debug3: Ignored env HOSTNAME
>> debug3: Ignored env TERM
>> debug3: Ignored env SHELL
>> debug3: Ignored env HISTSIZE
>> debug3: Ignored env EC2_AMITOOL_HOME
>> debug3: Ignored env SSH_CLIENT
>> debug3: Ignored env SSH_TTY
>> debug3: Ignored env USER
>> debug3: Ignored env LD_LIBRARY_PATH
>> debug3: Ignored env LS_COLORS
>> debug3: Ignored env EC2_HOME
>> debug3: Ignored env MAIL
>> debug3: Ignored env PATH
>> debug3: Ignored env INPUTRC
>> debug3: Ignored env PWD
>> debug3: Ignored env JAVA_HOME
>> debug1: Sending env LANG = en_US.UTF-8
>> debug2: channel 0: request env confirm 0
>> debug3: Ignored env AWS_CLOUDWATCH_HOME
>> debug3: Ignored env AWS_IAM_HOME
>> debug3: Ignored env SHLVL
>> debug3: Ignored env HOME
>> debug3: Ignored env AWS_PATH
>> debug3: Ignored env AWS_AUTO_SCALING_HOME
>> debug3: Ignored env LOGNAME
>> debug3: Ignored env AWS_ELB_HOME
>> debug3: Ignored env SSH_CONNECTION
>> debug3: Ignored env LESSOPEN
>> debug3: Ignored env AWS_RDS_HOME
>> debug3: Ignored env G_BROKEN_FILENAMES
>> debug3: Ignored env _
>> debug3: Ignored env OLDPWD
>> debug3: Ignored env OMPI_MCA_plm
>> debug1: Sending command: orted --daemonize -mca ess env -mca orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
>> debug2: channel 0: request exec confirm 1
>> debug2: fd 3 setting TCP_NODELAY
>> debug2: callback done
>> debug2: channel 0: open confirm rwindow 0 rmax 32768
>> debug3: Wrote 272 bytes for a total of 1893
>> debug2: channel 0: rcvd adjust 2097152
>> debug2: channel_input_status_confirm: type 99 id 0
>> debug2: exec request accepted on channel 0
>> debug2: channel 0: read<=0 rfd 4 len 0
>> debug2: channel 0: read failed
>> debug2: channel 0: close_read
>> debug2: channel 0: input open -> drain
>> debug2: channel 0: ibuf empty
>> debug2: channel 0: send eof
>> debug2: channel 0: input drain -> closed
>> debug3: Wrote 32 bytes for a total of 1925
>> debug2: channel 0: rcvd eof
>> debug2: channel 0: output open -> drain
>> debug2: channel 0: obuf empty
>> debug2: channel 0: close_write
>> debug2: channel 0: output drain -> closed
>> debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
>> debug2: channel 0: rcvd close
>> debug3: channel 0: will not send data after close
>> debug2: channel 0: almost dead
>> debug2: channel 0: gc: notify user
>> debug2: channel 0: gc: user detached
>> debug2: channel 0: send close
>> debug2: channel 0: is dead
>> debug2: channel 0: garbage collecting
>> debug1: channel 0: free: client-session, nchannels 1
>> debug3: channel 0: status: The following connections are open:
>>   #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
>>
>> debug3: channel 0: close_fds r -1 w -1 e 6 c -1
>> debug3: Wrote 32 bytes for a total of 1957
>> debug3: Wrote 64 bytes for a total of 2021
>> debug1: fd 0 clearing O_NONBLOCK
>> Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
>> Bytes per second: sent 18384.8, received 18944.3
>> debug1: Exit status 0
>> # it is hanging; I am about to issue control-C
>> ^Cmpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>> domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report
>> back when launched
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when launched
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
>> [tsakai@domU-12-31-39-16-75-1E ~]$
>> [tsakai@domU-12-31-39-16-75-1E ~]$ exit
>> logout
>> [tsakai@vixen ec2]$
>> [tsakai@vixen ec2]$
>>
>> Do you see anything strange?
>>
>> One final question: the ssh man page mentions a few environment
>> variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do
>> any of these matter as far as openMPI is concerned?
>>
>> Thank you, Gus.
>>
>> Regards,
>>
>> Tena
>>
>> On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Tena Sakai wrote:
>>>> Hi,
>>>>
>>>> I am trying to reproduce what I was able to show last Friday on Amazon
>>>> EC2 instances, but I am having a problem. What I was able to show last
>>>> Friday as root was with this command:
>>>>   mpirun -app app.ac
>>>> with app.ac being:
>>>>   -H dns-entry-A -np 1 (linux command)
>>>>   -H dns-entry-A -np 1 (linux command)
>>>>   -H dns-entry-B -np 1 (linux command)
>>>>   -H dns-entry-B -np 1 (linux command)
>>>>
>>>> Here's the config file in root's .ssh directory:
>>>>   Host *
>>>>   IdentityFile /root/.ssh/.derobee/.kagi
>>>>   IdentitiesOnly yes
>>>>   BatchMode yes
>>>>
>>>> Yesterday and today I can't get this to work. I made the last part
>>>> of the app.ac file simpler (it now says /bin/hostname).
>>>> Below is the session:
>>>>
>>>> -bash-3.2#
>>>> -bash-3.2# # I am on instance A, host name for inst A is:
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>>>> Server:   172.16.0.23
>>>> Address:  172.16.0.23#53
>>>>
>>>> Non-authoritative answer:
>>>> Name: domU-12-31-39-09-CD-C2.compute-1.internal
>>>> Address: 10.210.210.48
>>>>
>>>> -bash-3.2# cd .ssh
>>>> -bash-3.2#
>>>> -bash-3.2# cat config
>>>> Host *
>>>> IdentityFile /root/.ssh/.derobee/.kagi
>>>> IdentitiesOnly yes
>>>> BatchMode yes
>>>> -bash-3.2#
>>>> -bash-3.2# ll config
>>>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>>>> -bash-3.2#
>>>> -bash-3.2# chmod 600 config
>>>> -bash-3.2#
>>>> -bash-3.2# # show I can go to inst B without password/passphrase
>>>> -bash-3.2#
>>>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>>>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-E6-71
>>>> -bash-3.2#
>>>> -bash-3.2# nslookup `hostname`
>>>> Server:   172.16.0.23
>>>> Address:  172.16.0.23#53
>>>>
>>>> Non-authoritative answer:
>>>> Name: domU-12-31-39-09-E6-71.compute-1.internal
>>>> Address: 10.210.233.123
>>>>
>>>> -bash-3.2# # and back to inst A is also no problem
>>>> -bash-3.2#
>>>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>>>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# # log out twice to go back to inst A
>>>> -bash-3.2# exit
>>>> logout
>>>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>>>> -bash-3.2#
>>>> -bash-3.2# exit
>>>> logout
>>>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# cd ..
>>>> -bash-3.2#
>>>> -bash-3.2# pwd
>>>> /root
>>>> -bash-3.2#
>>>> -bash-3.2# ll
>>>> total 8
>>>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>>>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>>>> -bash-3.2#
>>>> -bash-3.2# cat app.ac
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>>>> -bash-3.2#
>>>> -bash-3.2# # when there is a remote machine (bottom 2 lines) it hangs
>>>> -bash-3.2# mpirun -app app.ac
>>>> mpirun: killing job...
>>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>>>> report back when launched
>>>> -bash-3.2#
>>>> -bash-3.2# cat app.ac2
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>>>> -bash-3.2#
>>>> -bash-3.2# # when there is no remote machine, then mpirun works:
>>>> -bash-3.2# mpirun -app app.ac2
>>>> domU-12-31-39-09-CD-C2
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# hostname
>>>> domU-12-31-39-09-CD-C2
>>>> -bash-3.2#
>>>> -bash-3.2# # this gotta be an ssh problem....
>>>> -bash-3.2#
>>>> -bash-3.2# # show no firewall is used
>>>> -bash-3.2# iptables --list
>>>> Chain INPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>> Chain FORWARD (policy ACCEPT)
>>>> target     prot opt source               destination
>>>>
>>>> Chain OUTPUT (policy ACCEPT)
>>>> target     prot opt source               destination
>>>> -bash-3.2#
>>>> -bash-3.2# exit
>>>> logout
>>>> [tsakai@vixen ec2]$
>>>>
>>>> Would someone please point out what I am doing wrong?
>>>>
>>>> Thank you.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>> Hi Tena
>>>
>>> Nothing wrong that I can see.
>>> Just another couple of suggestions,
>>> based on somewhat vague possibilities.
>>>
>>> A slight difference is that on vixen and dasher you ran the
>>> MPI hostname tests as a regular user, not as root, right?
>>> Not sure if this will make much of a difference,
>>> but it may be worth trying to run it as a regular user in EC2 also.
>>> In general most people avoid running user applications (MPI programs
>>> included) as root.
>>> Mostly for safety, but I wonder if there are any
>>> implications in the 'rootly powers'
>>> regarding the under-the-hood processes that OpenMPI
>>> launches along with the actual user programs.
>>>
>>> This may make no difference either,
>>> but you could do a 'service iptables status'
>>> to see if the service is running, even though there are
>>> no explicit iptables rules (as per your email).
>>> If the service is not running you get
>>> 'Firewall is stopped.' (in CentOS).
>>> I *think* 'iptables --list' loads the iptables module into the
>>> kernel as a side effect, whereas the service command does not.
>>> So it may be cleaner (safer?) to use the service version
>>> instead of 'iptables --list'.
>>> I don't know if it will make any difference,
>>> but just in case, if the service is running,
>>> why not do 'service iptables stop',
>>> and perhaps also 'chkconfig iptables off', to be completely
>>> free of iptables?
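>>> I.e., something along these lines (as root, or via sudo):
>>>
>>>   service iptables status
>>>   service iptables stop
>>>   chkconfig iptables off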
>>>
>>> Gus Correa

> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Attachment: moreMpirunTestswithDEBUG3.text