Hi Gus,

Thank you again for your reply.
> A slight difference is that on vixen and dashen you ran the
> MPI hostname tests as a regular user, not as root, right?
> Not sure if this will make much of a difference,
> but it may be worth trying to run it as a regular user in EC2 also.
> I general most people avoid running user applications (MPI programs
> included) as root.
> Mostly for safety, but I wonder if there are any
> implications in the 'rootly powers'
> regarding the under-the-hood processes that OpenMPI
> launches along with the actual user programs.

Yes, between vixen and dasher I was doing the test as user tsakai, not as
root. But the reason I wanted to do this test as root is to show that it
fails as a regular user (generating a "pipe system call failed" error),
whereas as root it would succeed, as it did on Friday. The AMI has not
changed; the last change to it was last Tuesday. As such, I don't
understand this inconsistent behavior. I have lots of notes from previous
sessions, and I consulted several successful session logs to replicate
what I saw Friday, but with no success.

Having spent days and not getting anywhere, I decided to take a different
approach. I instantiated a Linux AMI built by Amazon, which feels
CentOS/Fedora-based. I installed gcc and g++, plus Open MPI 1.4.3. After
I got Open MPI running, I created an account for user tsakai, uploaded my
public key, logged back in as user tsakai, and ran the same test.
Surprisingly (or not?), it produced the same result: I cannot run the
same mpirun command when a remote instance is involved, but mpirun runs
fine on the local instance alone.

So I am feeling that this has to be an ssh authentication problem. I
looked at the man pages for ssh and ssh_config and cannot figure out what
I am doing wrong. I put in a "LogLevel DEBUG3" line and it generated lots
of output, in which I found the line:

  debug1: Authentication succeeded (publickey).

Then I see a bunch of lines that look like:

  debug3: Ignored env XXXXXXX

and mpirun hangs.
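In case it is useful, here is how I scripted the config change rather than editing by hand. This is only a sketch run in a scratch directory so it is safe to try anywhere; the file contents and the chmod mirror what I do in the session log that follows:

```shell
# Sketch: build the ~/.ssh/config used in the session below in a scratch
# directory (contents copied from the log; adjust paths for real use).
tmp=$(mktemp -d)
cat > "$tmp/config" <<'EOF'
Host *
 IdentityFile /home/tsakai/.ssh/tsakai
 IdentitiesOnly yes
 BatchMode yes
 LogLevel DEBUG3
EOF

# ssh ignores a config (or private key) that group/others can read,
# so tighten the permissions before relying on it:
chmod 600 "$tmp/config"

# BatchMode=yes makes ssh fail instead of prompting for a password,
# which is effectively what mpirun's non-interactive launch of orted
# requires; LogLevel DEBUG3 is what produces the debug output below.
stat -c '%a' "$tmp/config"   # prints 600
```

With LogLevel DEBUG3 in place, the ssh client's debug chatter shows up interleaved with mpirun's output, which is where the transcript below comes from.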
Here is the session log:

[tsakai@vixen ec2]$
[tsakai@vixen ec2]$ ssh -i $MYKEY tsa...@ec2-50-17-24-195.compute-1.amazonaws.com
Last login: Wed Feb 16 06:50:08 2011 from 63.193.205.1

       __|  __|_  )  Amazon Linux AMI
       _|  (     /       Beta
      ___|\___|___|

See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # show firewall is off
[tsakai@domU-12-31-39-16-75-1E ~]$ service iptables status
-bash: service: command not found
[tsakai@domU-12-31-39-16-75-1E ~]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai@domU-12-31-39-16-75-1E ~]$ # show I can go to inst B with no password authentication
[tsakai@domU-12-31-39-16-75-1E ~]$ ssh domU-12-31-39-16-4E-4C.compute-1.internal
Last login: Wed Feb 16 06:53:14 2011 from domu-12-31-39-16-75-1e.compute-1.internal

       __|  __|_  )  Amazon Linux AMI
       _|  (     /       Beta
      ___|\___|___|

See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ # also back to inst A
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ ssh domU-12-31-39-16-75-1E.compute-1.internal
Last login: Wed Feb 16 06:58:33 2011 from 63.193.205.1

       __|  __|_  )  Amazon Linux AMI
       _|  (     /       Beta
      ___|\___|___|

See /usr/share/doc/amzn-ami/image-release-notes for latest release notes. :-)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # OK
[tsakai@domU-12-31-39-16-75-1E ~]$ # back to inst B
[tsakai@domU-12-31-39-16-75-1E ~]$ exit
logout
Connection to domU-12-31-39-16-75-1E.compute-1.internal closed.
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ env | grep LD_LIB
LD_LIBRARY_PATH=:/usr/local/lib
[tsakai@domU-12-31-39-16-4E-4C ~]$ # show no firewall on inst B
[tsakai@domU-12-31-39-16-4E-4C ~]$ sudo service iptables status
iptables: Firewall is not running.
[tsakai@domU-12-31-39-16-4E-4C ~]$
[tsakai@domU-12-31-39-16-4E-4C ~]$ # go back to inst A
[tsakai@domU-12-31-39-16-4E-4C ~]$ exit
logout
Connection to domU-12-31-39-16-4E-4C.compute-1.internal closed.
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ env | grep LD_LIB
LD_LIBRARY_PATH=:/usr/local/lib
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-4E-4C.compute-1.internal -np 1 /bin/hostname
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # top 2 are inst A (this machine); bottom 2 are remote inst (inst B)
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report back when launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # *** daemon did not report back when launched ***
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cat app.ac2
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
-H domU-12-31-39-16-75-1E.compute-1.internal -np 1 /bin/hostname
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # they refer to this instance (inst A)
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac2
domU-12-31-39-16-75-1E
domU-12-31-39-16-75-1E
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # that's no problem
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ cd .ssh
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
 IdentityFile /home/tsakai/.ssh/tsakai
 IdentitiesOnly yes
 BatchMode yes
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ mv config config.svd
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config.svd > config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
-rw-rw-r-- 1 tsakai tsakai 81 Feb 16 07:06 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ chmod 600 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
 IdentityFile /home/tsakai/.ssh/tsakai
 IdentitiesOnly yes
 BatchMode yes
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat - >> config
 LogLevel DEBUG3
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cat config
Host *
 IdentityFile /home/tsakai/.ssh/tsakai
 IdentitiesOnly yes
 BatchMode yes
 LogLevel DEBUG3
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ ll config
-rw------- 1 tsakai tsakai 98 Feb 16 07:07 config
[tsakai@domU-12-31-39-16-75-1E .ssh]$
[tsakai@domU-12-31-39-16-75-1E .ssh]$ cd ..
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ mpirun -app app.ac
debug2: ssh_connect: needpriv 0
debug1: Connecting to domU-12-31-39-16-4E-4C.compute-1.internal [10.96.77.182] port 22.
debug1: Connection established.
debug3: Not a RSA1 key file /home/tsakai/.ssh/tsakai.
debug2: key_type_from_name: unknown key type '-----BEGIN'
debug3: key_read: missing keytype
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug3: key_read: missing whitespace
debug2: key_type_from_name: unknown key type '-----END'
debug3: key_read: missing keytype
debug1: identity file /home/tsakai/.ssh/tsakai type -1
debug1: Remote protocol version 2.0, remote software version OpenSSH_5.3
debug1: match: OpenSSH_5.3 pat OpenSSH*
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_5.3
debug2: fd 3 setting O_NONBLOCK
debug1: SSH2_MSG_KEXINIT sent
debug3: Wrote 792 bytes for a total of 813
debug1: SSH2_MSG_KEXINIT received
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
debug2: kex_parse_kexinit: none,z...@openssh.com,zlib
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: kex_parse_kexinit: diffie-hellman-group-exchange-sha256,diffie-hellman-group-exchange-sha1,diffie-hellman-group14-sha1,diffie-hellman-group1-sha1
debug2: kex_parse_kexinit: ssh-rsa,ssh-dss
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc,blowfish-cbc,cast128-cbc,aes192-cbc,aes256-cbc,arcfour,rijndael-cbc@lysator.liu.se
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: hmac-md5,hmac-sha1,umac...@openssh.com,hmac-ripemd160,hmac-ripemd160@openssh.com,hmac-sha1-96,hmac-md5-96
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit: none,z...@openssh.com
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit:
debug2: kex_parse_kexinit: first_kex_follows 0
debug2: kex_parse_kexinit: reserved 0
debug2: mac_setup: found hmac-md5
debug1: kex: server->client aes128-ctr hmac-md5 none
debug2: mac_setup: found hmac-md5
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
debug3: Wrote 24 bytes for a total of 837
debug2: dh_gen_key: priv key bits set: 125/256
debug2: bits set: 489/1024
debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
debug3: Wrote 144 bytes for a total of 981
debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
debug3: check_host_in_hostfile: match line 1
debug3: check_host_in_hostfile: filename /home/tsakai/.ssh/known_hosts
debug3: check_host_in_hostfile: match line 1
debug1: Host 'domu-12-31-39-16-4e-4c.compute-1.internal' is known and matches the RSA host key.
debug1: Found key in /home/tsakai/.ssh/known_hosts:1
debug2: bits set: 491/1024
debug1: ssh_rsa_verify: signature correct
debug2: kex_derive_keys
debug2: set_newkeys: mode 1
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug3: Wrote 16 bytes for a total of 997
debug2: set_newkeys: mode 0
debug1: SSH2_MSG_NEWKEYS received
debug1: SSH2_MSG_SERVICE_REQUEST sent
debug3: Wrote 48 bytes for a total of 1045
debug2: service_accept: ssh-userauth
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug2: key: /home/tsakai/.ssh/tsakai ((nil))
debug3: Wrote 64 bytes for a total of 1109
debug1: Authentications that can continue: publickey
debug3: start over, passed a different list publickey
debug3: preferred gssapi-with-mic,publickey
debug3: authmethod_lookup publickey
debug3: remaining preferred: ,publickey
debug3: authmethod_is_enabled publickey
debug1: Next authentication method: publickey
debug1: Trying private key: /home/tsakai/.ssh/tsakai
debug1: read PEM private key done: type RSA
debug3: sign_and_send_pubkey
debug2: we sent a publickey packet, wait for reply
debug3: Wrote 384 bytes for a total of 1493
debug1: Authentication succeeded (publickey).
debug2: fd 4 setting O_NONBLOCK
debug1: channel 0: new [client-session]
debug3: ssh_session2_open: channel_new: 0
debug2: channel 0: send open
debug1: Requesting no-more-sessi...@openssh.com
debug1: Entering interactive session.
debug3: Wrote 128 bytes for a total of 1621
debug2: callback start
debug2: client_session2_setup: id 0
debug1: Sending environment.
debug3: Ignored env HOSTNAME
debug3: Ignored env TERM
debug3: Ignored env SHELL
debug3: Ignored env HISTSIZE
debug3: Ignored env EC2_AMITOOL_HOME
debug3: Ignored env SSH_CLIENT
debug3: Ignored env SSH_TTY
debug3: Ignored env USER
debug3: Ignored env LD_LIBRARY_PATH
debug3: Ignored env LS_COLORS
debug3: Ignored env EC2_HOME
debug3: Ignored env MAIL
debug3: Ignored env PATH
debug3: Ignored env INPUTRC
debug3: Ignored env PWD
debug3: Ignored env JAVA_HOME
debug1: Sending env LANG = en_US.UTF-8
debug2: channel 0: request env confirm 0
debug3: Ignored env AWS_CLOUDWATCH_HOME
debug3: Ignored env AWS_IAM_HOME
debug3: Ignored env SHLVL
debug3: Ignored env HOME
debug3: Ignored env AWS_PATH
debug3: Ignored env AWS_AUTO_SCALING_HOME
debug3: Ignored env LOGNAME
debug3: Ignored env AWS_ELB_HOME
debug3: Ignored env SSH_CONNECTION
debug3: Ignored env LESSOPEN
debug3: Ignored env AWS_RDS_HOME
debug3: Ignored env G_BROKEN_FILENAMES
debug3: Ignored env _
debug3: Ignored env OLDPWD
debug3: Ignored env OMPI_MCA_plm
debug1: Sending command: orted --daemonize -mca ess env -mca orte_ess_jobid 125566976 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri "125566976.0;tcp://10.96.118.236:56064"
debug2: channel 0: request exec confirm 1
debug2: fd 3 setting TCP_NODELAY
debug2: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug3: Wrote 272 bytes for a total of 1893
debug2: channel 0: rcvd adjust 2097152
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
debug2: channel 0: read<=0 rfd 4 len 0
debug2: channel 0: read failed
debug2: channel 0: close_read
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug3: Wrote 32 bytes for a total of 1925
debug2: channel 0: rcvd eof
debug2: channel 0: output open -> drain
debug2: channel 0: obuf empty
debug2: channel 0: close_write
debug2: channel 0: output drain -> closed
debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: almost dead
debug2: channel 0: gc: notify user
debug2: channel 0: gc: user detached
debug2: channel 0: send close
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open: #0 client-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e 6 c -1
debug3: Wrote 32 bytes for a total of 1957
debug3: Wrote 64 bytes for a total of 2021
debug1: fd 0 clearing O_NONBLOCK
Transferred: sent 1840, received 1896 bytes, in 0.1 seconds
Bytes per second: sent 18384.8, received 18944.3
debug1: Exit status 0

# it is hanging; I am about to issue control-C
^Cmpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
        domU-12-31-39-16-4E-4C.compute-1.internal - daemon did not report back when launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # it says the same thing, i.e.,
[tsakai@domU-12-31-39-16-75-1E ~]$ # daemon did not report back when launched
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # what does that mean?
[tsakai@domU-12-31-39-16-75-1E ~]$ # ssh doesn't say anything alarming...
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ # I give up
[tsakai@domU-12-31-39-16-75-1E ~]$
[tsakai@domU-12-31-39-16-75-1E ~]$ exit
logout
[tsakai@vixen ec2]$
[tsakai@vixen ec2]$

Do you see anything strange?

One final question: the ssh man page mentions a few environment
variables: SSH_ASKPASS, SSH_AUTH_SOCK, SSH_CONNECTION, etc. Do any of
these matter as far as Open MPI is concerned?

Thank you, Gus.

Regards,

Tena

On 2/15/11 5:09 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Tena Sakai wrote:
>> Hi,
>>
>> I am trying to reproduce what I was able to show last Friday on Amazon
>> EC2 instances, but I am having a problem. What I was able to show last
>> Friday as root was with this command:
>>    mpirun app app.ac
>> with app.ac being:
>>    -H dns-entry-A np 1 (linux command)
>>    -H dns-entry-A np 1 (linux command)
>>    -H dns-entry-B np 1 (linux command)
>>    -H dns-entry-B np 1 (linux command)
>>
>> Here's the config file in root's .ssh directory:
>>    Host *
>>     IdentityFile /root/.ssh/.derobee/.kagi
>>     IdentitiesOnly yes
>>     BatchMode yes
>>
>> Yesterday and today I can't get this to work. I made the last part of
>> app.ac file simpler (it now says /bin/hostname).
>> Below is the session:
>>
>> -bash-3.2#
>> -bash-3.2# # I am on instance A, host name for inst A is:
>> -bash-3.2# hostname
>> domU-12-31-39-09-CD-C2
>> -bash-3.2#
>> -bash-3.2# nslookup domU-12-31-39-09-CD-C2
>> Server:   172.16.0.23
>> Address:  172.16.0.23#53
>>
>> Non-authoritative answer:
>> Name:     domU-12-31-39-09-CD-C2.compute-1.internal
>> Address:  10.210.210.48
>>
>> -bash-3.2# cd .ssh
>> -bash-3.2#
>> -bash-3.2# cat config
>> Host *
>>  IdentityFile /root/.ssh/.derobee/.kagi
>>  IdentitiesOnly yes
>>  BatchMode yes
>> -bash-3.2#
>> -bash-3.2# ll config
>> -rw-r--r-- 1 root root 103 Feb 15 17:18 config
>> -bash-3.2#
>> -bash-3.2# chmod 600 config
>> -bash-3.2#
>> -bash-3.2# # show I can go to inst B without password/passphrase
>> -bash-3.2#
>> -bash-3.2# ssh domU-12-31-39-09-E6-71.compute-1.internal
>> Last login: Tue Feb 15 17:18:46 2011 from 10.210.210.48
>> -bash-3.2#
>> -bash-3.2# hostname
>> domU-12-31-39-09-E6-71
>> -bash-3.2#
>> -bash-3.2# nslookup `hostname`
>> Server:   172.16.0.23
>> Address:  172.16.0.23#53
>>
>> Non-authoritative answer:
>> Name:     domU-12-31-39-09-E6-71.compute-1.internal
>> Address:  10.210.233.123
>>
>> -bash-3.2# # and back to inst A is also no problem
>> -bash-3.2#
>> -bash-3.2# ssh domU-12-31-39-09-CD-C2.compute-1.internal
>> Last login: Tue Feb 15 17:36:19 2011 from 63.193.205.1
>> -bash-3.2#
>> -bash-3.2# hostname
>> domU-12-31-39-09-CD-C2
>> -bash-3.2#
>> -bash-3.2# # log out twice to go back to inst A
>> -bash-3.2# exit
>> logout
>> Connection to domU-12-31-39-09-CD-C2.compute-1.internal closed.
>> -bash-3.2#
>> -bash-3.2# exit
>> logout
>> Connection to domU-12-31-39-09-E6-71.compute-1.internal closed.
>> -bash-3.2#
>> -bash-3.2# hostname
>> domU-12-31-39-09-CD-C2
>> -bash-3.2#
>> -bash-3.2# cd ..
>> -bash-3.2#
>> -bash-3.2# pwd
>> /root
>> -bash-3.2#
>> -bash-3.2# ll
>> total 8
>> -rw-r--r-- 1 root root 260 Feb 15 17:24 app.ac
>> -rw-r--r-- 1 root root 130 Feb 15 17:34 app.ac2
>> -bash-3.2#
>> -bash-3.2# cat app.ac
>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-09-E6-71.compute-1.internal -np 1 /bin/hostname
>> -bash-3.2#
>> -bash-3.2# # when there is a remote machine (bottome 2 lines) it hangs
>> -bash-3.2# mpirun -app app.ac
>> mpirun: killing job...
>>
>> --------------------------------------------------------------------------
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> --------------------------------------------------------------------------
>>         domU-12-31-39-09-E6-71.compute-1.internal - daemon did not
>> report back when launched
>> -bash-3.2#
>> -bash-3.2# cat app.ac2
>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>> -H domU-12-31-39-09-CD-C2.compute-1.internal -np 1 /bin/hostname
>> -bash-3.2#
>> -bash-3.2# # when there is no remote machine, then mpirun works:
>> -bash-3.2# mpirun -app app.ac2
>> domU-12-31-39-09-CD-C2
>> domU-12-31-39-09-CD-C2
>> -bash-3.2#
>> -bash-3.2# hostname
>> domU-12-31-39-09-CD-C2
>> -bash-3.2#
>> -bash-3.2# # this gotta be ssh problem....
>> -bash-3.2#
>> -bash-3.2# # show no firewall is used
>> -bash-3.2# iptables --list
>> Chain INPUT (policy ACCEPT)
>> target     prot opt source               destination
>>
>> Chain FORWARD (policy ACCEPT)
>> target     prot opt source               destination
>>
>> Chain OUTPUT (policy ACCEPT)
>> target     prot opt source               destination
>> -bash-3.2#
>> -bash-3.2# exit
>> logout
>> [tsakai@vixen ec2]$
>>
>> Would someone please point out what I am doing wrong?
>>
>> Thank you.
>>
>> Regards,
>>
>> Tena
>>
> Hi Tena
>
> Nothing wrong that I can see.
> Just another couple of suggestions,
> based on somewhat vague possibilities.
>
> A slight difference is that on vixen and dashen you ran the
> MPI hostname tests as a regular user, not as root, right?
> Not sure if this will make much of a difference,
> but it may be worth trying to run it as a regular user in EC2 also.
> I general most people avoid running user applications (MPI programs
> included) as root.
> Mostly for safety, but I wonder if there are any
> implications in the 'rootly powers'
> regarding the under-the-hood processes that OpenMPI
> launches along with the actual user programs.
>
> This may make no difference either,
> but you could do a 'service iptables status',
> to see if the service is running, even though there are
> no explicit iptable rules (as per your email).
> If the service is not running you get
> 'Firewall is stopped.' (in CentOS).
> I *think* 'iptables --list' loads the iptables module into the
> kernel, as a side effect, whereas the service command does not.
> So, it may be cleaner (safer?) to use the service version
> instead of 'iptables --list'.
> I don't know if it will make any difference,
> but just in case, if the service is running,
> why not do 'service iptables stop',
> and perhaps also 'chkconfig iptables off' to be completely
> free of iptables?
>
> Gus Correa
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users