Hi Gus,

Thank you for your response.
I have verified that
 1) the /etc/hosts files on both machines, vixen and dasher, are identical
 2) both machines have nothing but comments in hosts.allow and hosts.deny

Regarding the firewall, they are different. On vixen, this is how it looks:

[root@vixen ec2]# cat /etc/sysconfig/iptables
cat: /etc/sysconfig/iptables: No such file or directory
[root@vixen ec2]#
[root@vixen ec2]# /sbin/iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
[root@vixen ec2]#

On dasher:

[tsakai@dasher Rmpi]$ sudo cat /etc/sysconfig/iptables
# Firewall configuration written by system-config-securitylevel
# Manual customization of this file is not recommended.
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -p 50 -j ACCEPT
-A RH-Firewall-1-INPUT -p 51 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
[tsakai@dasher Rmpi]$
[tsakai@dasher Rmpi]$ sudo /sbin/iptables --list
[sudo] password for tsakai:
Chain INPUT (policy ACCEPT)
target               prot opt source      destination
RH-Firewall-1-INPUT  all  --  anywhere    anywhere

Chain FORWARD (policy ACCEPT)
target               prot opt source      destination
RH-Firewall-1-INPUT  all  --  anywhere    anywhere

Chain OUTPUT (policy ACCEPT)
target               prot opt source      destination

Chain RH-Firewall-1-INPUT (2 references)
target     prot opt source       destination
ACCEPT     all  --  anywhere     anywhere
ACCEPT     icmp --  anywhere     anywhere     icmp any
ACCEPT     esp  --  anywhere     anywhere
ACCEPT     ah   --  anywhere     anywhere
ACCEPT     udp  --  anywhere     224.0.0.251  udp dpt:mdns
ACCEPT     udp  --  anywhere     anywhere     udp dpt:ipp
ACCEPT     tcp  --  anywhere     anywhere     tcp dpt:ipp
ACCEPT     all  --  anywhere     anywhere     state RELATED,ESTABLISHED
ACCEPT     tcp  --  anywhere     anywhere     state NEW tcp dpt:ssh
ACCEPT     tcp  --  anywhere     anywhere     state NEW tcp dpt:http
REJECT     all  --  anywhere     anywhere     reject-with icmp-host-prohibited
[tsakai@dasher Rmpi]$

I don't understand what they mean. Can you see any clue as to why vixen can,
and dasher cannot, run mpirun with this app file?

-H dasher.egcrc.org -np 1 hostname
-H dasher.egcrc.org -np 1 hostname
-H vixen.egcrc.org -np 1 hostname
-H vixen.egcrc.org -np 1 hostname

Many thanks.

Tena
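P.S.  After staring at the listings a bit longer, one difference stands out:
dasher accepts new TCP connections only on ports 22 (ssh) and 80 (http) and
rejects everything else, while vixen filters nothing.  If the daemon that
mpirun starts on the remote machine has to open a TCP connection back to the
machine where mpirun runs (as Jeff explains further down in this thread), then
a call-back from vixen to dasher would arrive on an arbitrary port and be
rejected, which would match the one-way failure.  If that reading is right,
this is a sketch of what I could try on dasher, as root; the address below is
made up, and I would substitute vixen's real address from /etc/hosts:

  # accept everything from vixen (192.168.1.10 is a placeholder), inserted
  # ahead of the final REJECT rule
  /sbin/iptables -I RH-Firewall-1-INPUT 1 -s 192.168.1.10 -j ACCEPT

  # persist the rule to /etc/sysconfig/iptables (RHEL/CentOS style)
  /sbin/service iptables save

  # or, just to test whether the firewall is the culprit, drop it temporarily
  /sbin/service iptables stop

Undoing the rule would just be "/sbin/iptables -D RH-Firewall-1-INPUT 1", or
"/sbin/service iptables start" to bring the saved rules back.  If opening
everything from vixen is too broad, OpenMPI may also allow pinning its
call-back ports to a fixed range via MCA parameters; "ompi_info --param oob
tcp" and "ompi_info --param btl tcp" would show whether such knobs exist in
this 1.4.3 build.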
On 2/14/11 11:15 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Tena Sakai wrote:
>> Hi Reuti,
>>
>>> a) can you ssh from dasher to vixen?
>> Yes, no problem.
>> [tsakai@dasher Rmpi]$
>> [tsakai@dasher Rmpi]$ hostname
>> dasher.egcrc.org
>> [tsakai@dasher Rmpi]$
>> [tsakai@dasher Rmpi]$ ssh vixen
>> Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
>> [tsakai@vixen ~]$
>> [tsakai@vixen ~]$ hostname
>> vixen.egcrc.org
>> [tsakai@vixen ~]$
>>
>>> b) firewall on vixen?
>> There is no firewall on vixen that I know of, but I don't
>> know how I can definitively show it one way or the other.
>> Can you please suggest how I can do this?
>>
>> Regards,
>>
>> Tena
>>
>
> Hi Tena
>
> Besides Reuti's suggestions:
>
> Check the consistency of /etc/hosts on both machines.
> Check if there are restrictions on /etc/hosts.allow and
> /etc/hosts.deny on both machines.
> Check if both the MPI directories and your home/work directory
> are mounted/available on both machines.
> (We may have been through this checklist before, sorry if I forgot.)
>
> Firewall info (not very friendly syntax ...):
>
> iptables --list
>
> or maybe better:
>
> cat /etc/sysconfig/iptables
>
> I hope it helps,
> Gus Correa
>
>> On 2/14/11 4:38 AM, "Reuti" <re...@staff.uni-marburg.de> wrote:
>>
>>> Hi,
>>>
>>> Am 14.02.2011 um 04:54 schrieb Tena Sakai:
>>>
>>>> I have digressed and started downward descent...
>>>>
>>>> I was trying to make a simple and clear case.  Everything
>>>> I write in this very mail is about local machines.  There
>>>> are no virtual machines involved.  I am talking about two
>>>> machines, vixen and dasher, which share the same file
>>>> structure.  Vixen is an nfs server and dasher is an nfs
>>>> client.  I have just installed openmpi 1.4.3 on dasher,
>>>> which is the same version I have on vixen.
>>>>
>>>> I have a file app.ac3, which looks like:
>>>> [tsakai@vixen Rmpi]$ cat app.ac3
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> -H dasher.egcrc.org -np 1 hostname
>>>> -H vixen.egcrc.org -np 1 hostname
>>>> -H vixen.egcrc.org -np 1 hostname
>>>> [tsakai@vixen Rmpi]$
>>>>
>>>> Vixen can run this without any problem:
>>>> [tsakai@vixen Rmpi]$ mpirun -app app.ac3
>>>> vixen.egcrc.org
>>>> vixen.egcrc.org
>>>> dasher.egcrc.org
>>>> dasher.egcrc.org
>>>> [tsakai@vixen Rmpi]$
>>>>
>>>> But I can't run this very command from dasher:
>>>> [tsakai@vixen Rmpi]$
>>>> [tsakai@vixen Rmpi]$ ssh dasher
>>>> Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
>>>> [tsakai@dasher ~]$
>>>> [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
>>>> [tsakai@dasher Rmpi]$
>>>> [tsakai@dasher Rmpi]$ mpirun -app app.ac3
>>>> mpirun: killing job...
>>>
>>> a) can you ssh from dasher to vixen?
>>>
>>> b) firewall on vixen?
>>>
>>> -- Reuti
>>>
>>>
>>>> --------------------------------------------------------------------------
>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>> that caused that situation.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>> below. Additional manual cleanup may be required - please refer to
>>>> the "orte-clean" tool for assistance.
>>>> --------------------------------------------------------------------------
>>>> vixen.egcrc.org - daemon did not report back when launched
>>>> [tsakai@dasher Rmpi]$
>>>>
>>>> After I issue the mpirun command, it hangs and I had to Control-C out
>>>> of it, at which point it generated all the lines "mpirun: killing job..."
>>>> and below.
>>>>
>>>> A strange thing is that dasher has no problem executing the same
>>>> thing via ssh:
>>>> [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
>>>> vixen.egcrc.org
>>>> [tsakai@dasher Rmpi]$
>>>>
>>>> In fact, dasher can run it via mpirun so long as no foreign machine
>>>> is present in the app file.
Ie., >>>> [tsakai@dasher Rmpi]$ cat app.ac4 >>>> -H dasher.egcrc.org -np 1 hostname >>>> -H dasher.egcrc.org -np 1 hostname >>>> # -H vixen.egcrc.org -np 1 hostname >>>> # -H vixen.egcrc.org -np 1 hostname >>>> [tsakai@dasher Rmpi]$ >>>> [tsakai@dasher Rmpi]$ mpirun -app app.ac4 >>>> dasher.egcrc.org >>>> dasher.egcrc.org >>>> [tsakai@dasher Rmpi]$ >>>> >>>> Can you please tell me why I can go one way (from vixen to dasher) >>>> and not the other way (dasher to vixen)? >>>> >>>> Thank you. >>>> >>>> Tena >>>> >>>> >>>> On 2/12/11 9:42 PM, "Gustavo Correa" <g...@ldeo.columbia.edu> wrote: >>>> >>>>> Hi Tena >>>>> >>>>> Thank you for taking the time to explain the details of >>>>> the EC2 procedure. >>>>> >>>>> I am afraid everything in my bag of tricks was used. >>>>> As Ralph and Jeff suggested, this seems to be a very specific >>>>> problem with EC2. >>>>> >>>>> The difference in behavior when you run as root vs. when you >>>>> run as Tena, tells that there is some use restriction to regular users >>>>> in EC2 that isn't present in common machines (Linux or other), I guess. >>>>> This may be yet another 'stone to turn', as you like to say. >>>>> It also suggests that there is nothing wrong in principle with your >>>>> openMPI setup or with your program, otherwise root would not be able to >>>>> run >>>>> it. >>>>> >>>>> Besides Ralph suggestion of trying the EC2 mailing list archive, >>>>> I wonder if EC2 has any type of user support where you could ask >>>>> for help. >>>>> After all, it is a paid sevice, isn't it? >>>>> (OpenMPI is not paid and has a great customer service, doesn't it? :) ) >>>>> You have a well documented case to present, >>>>> and the very peculiar fact that the program fails for normal users but >>>>> runs >>>>> for root. >>>>> This should help the EC2 support to start looking for a solution. >>>>> >>>>> I am running out of suggestions of what you could try on your own. >>>>> But let me try: >>>>> >>>>> 1) You may try to reduce the problem to its less common denominator, >>>>> perhaps by trying to run non-R based MPI programs on EC2, maybe the >>>>> hello_c.c, >>>>> ring_c.c, and connectivity_c.c programs in the OpenMPi examples directory. >>>>> This would be to avoid the extra layer of complexity introduced by R. >>>>> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 >>>>> hostname). >>>>> I.e. go in a progression of increasing complexity, see where you hit the >>>>> wall. >>>>> This may shed some light on what is going on. >>>>> >>>>> I don't know if this suggestion may really help, though. >>>>> It is not clear to me where the thing fails, whether it is during program >>>>> execution, >>>>> or while mpiexec is setting up the environment for the program to run. >>>>> If it is very early in the process, before the program starts, my >>>>> suggestion >>>>> won't work. >>>>> Jeff and Ralph, who know OpenMPI inside out, may have better advice in >>>>> this >>>>> regard. >>>>> >>>>> 2) Another thing would be to try to run R on E2C in serial mode, without >>>>> mpiexec, >>>>> interactively or via script, to see who EC2 doesn't like: R or OpenMPI >>>>> (but >>>>> maybe it's both). >>>>> >>>>> Gus Correa >>>>> >>>>> On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote: >>>>> >>>>>> Hi Gus, >>>>>> >>>>>> Thank you for your tips. >>>>>> >>>>>> I didn't find any smoking gun or anything comes close. 
>>>>>> Here's the upshot: >>>>>> >>>>>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a >>>>>> core file size (blocks, -c) 0 >>>>>> data seg size (kbytes, -d) unlimited >>>>>> scheduling priority (-e) 0 >>>>>> file size (blocks, -f) unlimited >>>>>> pending signals (-i) 61504 >>>>>> max locked memory (kbytes, -l) 32 >>>>>> max memory size (kbytes, -m) unlimited >>>>>> open files (-n) 1024 >>>>>> pipe size (512 bytes, -p) 8 >>>>>> POSIX message queues (bytes, -q) 819200 >>>>>> real-time priority (-r) 0 >>>>>> stack size (kbytes, -s) 8192 >>>>>> cpu time (seconds, -t) unlimited >>>>>> max user processes (-u) 61504 >>>>>> virtual memory (kbytes, -v) unlimited >>>>>> file locks (-x) unlimited >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>>>>> bash-3.2# >>>>>> bash-3.2# ulimit -a >>>>>> core file size (blocks, -c) 0 >>>>>> data seg size (kbytes, -d) unlimited >>>>>> scheduling priority (-e) 0 >>>>>> file size (blocks, -f) unlimited >>>>>> pending signals (-i) 61504 >>>>>> max locked memory (kbytes, -l) 32 >>>>>> max memory size (kbytes, -m) unlimited >>>>>> open files (-n) 1024 >>>>>> pipe size (512 bytes, -p) 8 >>>>>> POSIX message queues (bytes, -q) 819200 >>>>>> real-time priority (-r) 0 >>>>>> stack size (kbytes, -s) 8192 >>>>>> cpu time (seconds, -t) unlimited >>>>>> max user processes (-u) unlimited >>>>>> virtual memory (kbytes, -v) unlimited >>>>>> file locks (-x) unlimited >>>>>> bash-3.2# >>>>>> bash-3.2# >>>>>> bash-3.2# ulimit -a > root_ulimit-a >>>>>> bash-3.2# exit >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a >>>>>> 14c14 >>>>>> < max user processes (-u) unlimited >>>>>> --- >>>>>>> max user processes (-u) 61504 >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr >>>>>> /proc/sys/fs/file-max >>>>>> 480 0 762674 >>>>>> 762674 >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ sudo su >>>>>> bash-3.2# >>>>>> bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>>>>> 512 0 762674 >>>>>> 762674 >>>>>> bash-3.2# exit >>>>>> exit >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max >>>>>> -bash: sysctl: command not found >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ /sbin/!! >>>>>> /sbin/sysctl -a |grep fs.file-max >>>>>> error: permission denied on key 'kernel.cad_pid' >>>>>> error: permission denied on key 'kernel.cap-bound' >>>>>> fs.file-max = 762674 >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max >>>>>> fs.file-max = 762674 >>>>>> [tsakai@ip-10-114-239-188 ~]$ >>>>>> >>>>>> I see a bit of difference between root and tsakai, but I cannot >>>>>> believe such small difference results in somewhat a catastrophic >>>>>> failure as I have reported. Would you agree with me? >>>>>> >>>>>> Regards, >>>>>> >>>>>> Tena >>>>>> >>>>>> On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote: >>>>>> >>>>>>> Hi Tena >>>>>>> >>>>>>> Please read one answer inline. >>>>>>> >>>>>>> Tena Sakai wrote: >>>>>>>> Hi Jeff, >>>>>>>> Hi Gus, >>>>>>>> >>>>>>>> Thanks for your replies. >>>>>>>> >>>>>>>> I have pretty much ruled out PATH issues by setting tsakai's PATH >>>>>>>> as identical to that of root. 
In that setting I reproduced the >>>>>>>> same result as before: root can run mpirun correctly and tsakai >>>>>>>> cannot. >>>>>>>> >>>>>>>> I have also checked out permission on /tmp directory. tsakai has >>>>>>>> no problem creating files under /tmp. >>>>>>>> >>>>>>>> I am trying to come up with a strategy to show that each and every >>>>>>>> programs in the PATH has "world" executable permission. It is a >>>>>>>> stone to turn over, but I am not holding my breath. >>>>>>>> >>>>>>>>> ... you are running out of file descriptors. Are file descriptors >>>>>>>>> limited on a per-process basis, perchance? >>>>>>>> I have never heard there is such restriction on Amazon EC2. There >>>>>>>> are folks who keep running instances for a long, long time. Whereas >>>>>>>> in my case, I launch 2 instances, check things out, and then turn >>>>>>>> the instances off. (Given that the state of California has a huge >>>>>>>> debts, our funding is very tight.) So, I really doubt that's the >>>>>>>> case. I have run mpirun unsuccessfully as user tsakai and immediately >>>>>>>> after successfully as root. Still, I would be happy if you can tell >>>>>>>> me a way to tell number of file descriptors used or remmain. >>>>>>>> >>>>>>>> Your mentioned file descriptors made me think of something under >>>>>>>> /dev. But I don't know exactly what I am fishing. Do you have >>>>>>>> some suggestions? >>>>>>>> >>>>>>> 1) If the environment has anything to do with Linux, >>>>>>> check: >>>>>>> >>>>>>> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max >>>>>>> >>>>>>> >>>>>>> or >>>>>>> >>>>>>> sysctl -a |grep fs.file-max >>>>>>> >>>>>>> This max can be set (fs.file-max=whatever_is_reasonable) >>>>>>> in /etc/sysctl.conf >>>>>>> >>>>>>> See 'man sysctl' and 'man sysctl.conf' >>>>>>> >>>>>>> 2) Another possible source of limits. >>>>>>> >>>>>>> Check "ulimit -a" (bash) or "limit" (tcsh). >>>>>>> >>>>>>> If you need to change look at: >>>>>>> >>>>>>> /etc/security/limits.conf >>>>>>> >>>>>>> (See also 'man limits.conf') >>>>>>> >>>>>>> ** >>>>>>> >>>>>>> Since "root can but Tena cannot", >>>>>>> I would check 2) first, >>>>>>> as they are the 'per user/per group' limits, >>>>>>> whereas 1) is kernel/system-wise. >>>>>>> >>>>>>> I hope this helps, >>>>>>> Gus Correa >>>>>>> >>>>>>> PS - I know you are a wise and careful programmer, >>>>>>> but here we had cases of programs that would >>>>>>> fail because of too many files that were open and never closed, >>>>>>> eventually exceeding the max available/permissible. >>>>>>> So, it does happen. >>>>>>> >>>>>>>> I wish I could reproduce this (weired) behavior on a different >>>>>>>> set of machines. I certainly cannot in my local environment. Sigh! >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Tena >>>>>>>> >>>>>>>> >>>>>>>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> It is concerning if the pipe system call fails - I can't think of why >>>>>>>>> that >>>>>>>>> would happen. Thats not usually a permissions issue but rather a >>>>>>>>> deeper >>>>>>>>> indication that something is either seriously wrong on your system or >>>>>>>>> you >>>>>>>>> are >>>>>>>>> running out of file descriptors. Are file descriptors limited on a >>>>>>>>> per-process >>>>>>>>> basis, perchance? >>>>>>>>> >>>>>>>>> Sent from my PDA. No type good. 
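(For reference, the per-process descriptor limits Jeff asks about can be read
directly.  The pid below is a placeholder, and /proc/<pid>/limits only exists
on newer kernels, roughly 2.6.24 and later:

  # soft and hard "open files" limits of the current shell
  ulimit -Sn
  ulimit -Hn

  # the same information for an already-running process, where available
  grep 'open files' /proc/12345/limits

Raising them would go through /etc/security/limits.conf, with lines of the
form "tsakai soft nofile 4096" and "tsakai hard nofile 8192"; the values here
are only illustrative.)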
>>>>>>>>> >>>>>>>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Tena >>>>>>>>>> >>>>>>>>>> Since root can but you can't, >>>>>>>>>> is is a directory permission problem perhaps? >>>>>>>>>> Check the execution directory permission (on both machines, >>>>>>>>>> if this is not NFS mounted dir). >>>>>>>>>> I am not sure, but IIRR OpenMPI also uses /tmp for >>>>>>>>>> under-the-hood stuff, worth checking permissions there also. >>>>>>>>>> Just a naive guess. >>>>>>>>>> >>>>>>>>>> Congrats for all the progress with the cloudy MPI! >>>>>>>>>> >>>>>>>>>> Gus Correa >>>>>>>>>> >>>>>>>>>> Tena Sakai wrote: >>>>>>>>>>> Hi, >>>>>>>>>>> I have made a bit more progress. I think I can say ssh authenti- >>>>>>>>>>> cation problem is behind me now. I am still having a problem >>>>>>>>>>> running >>>>>>>>>>> mpirun, but the latest discovery, which I can reproduce, is that >>>>>>>>>>> I can run mpirun as root. Here's the session log: >>>>>>>>>>> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com >>>>>>>>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195 >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll >>>>>>>>>>> total 8 >>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ll .ssh >>>>>>>>>>> total 16 >>>>>>>>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys >>>>>>>>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config >>>>>>>>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts >>>>>>>>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal >>>>>>>>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31 >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ hostname >>>>>>>>>>> ip-10-100-243-195 >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ ll >>>>>>>>>>> total 8 >>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac >>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ cat app.ac >>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>>>>>>>> [tsakai@ip-10-100-243-195 ~]$ exit >>>>>>>>>>> logout >>>>>>>>>>> Connection to ip-10-100-243-195.ec2.internal closed. 
>>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ hostname >>>>>>>>>>> ip-10-195-198-31 >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac >>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> mpirun was unable to launch the specified application as it >>>>>>>>>>> encountered >>>>>>>>>>> an >>>>>>>>>>> error: >>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>>>>> subsystem >>>>>>>>>>> Node: ip-10-195-198-31 >>>>>>>>>>> while attempting to start process rank 0. >>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it as root >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ sudo su >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# pwd >>>>>>>>>>> /home/tsakai >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# ls -l /root/.ssh/config >>>>>>>>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# cat /root/.ssh/config >>>>>>>>>>> Host * >>>>>>>>>>> IdentityFile /root/.ssh/.derobee/.kagi >>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>> BatchMode yes >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# pwd >>>>>>>>>>> /home/tsakai >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# ls -l >>>>>>>>>>> total 8 >>>>>>>>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>>>>>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# # now is the time for mpirun >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# mpirun --app ./app.ac >>>>>>>>>>> 13 ip-10-100-243-195 >>>>>>>>>>> 21 ip-10-100-243-195 >>>>>>>>>>> 5 ip-10-195-198-31 >>>>>>>>>>> 8 ip-10-195-198-31 >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# # It works (being root)! >>>>>>>>>>> bash-3.2# >>>>>>>>>>> bash-3.2# exit >>>>>>>>>>> exit >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac >>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> mpirun was unable to launch the specified application as it >>>>>>>>>>> encountered >>>>>>>>>>> an >>>>>>>>>>> error: >>>>>>>>>>> Error: pipe function call failed when setting up I/O forwarding >>>>>>>>>>> subsystem >>>>>>>>>>> Node: ip-10-195-198-31 >>>>>>>>>>> while attempting to start process rank 0. >>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ # I don't get it. >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>>>>>>>> [tsakai@ip-10-195-198-31 ~]$ exit >>>>>>>>>>> logout >>>>>>>>>>> [tsakai@vixen ec2]$ >>>>>>>>>>> So, why does it say "pipe function call failed when setting up >>>>>>>>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ? >>>>>>>>>>> The node it is referring to is not the remote machine. It is >>>>>>>>>>> What I call machine A. I first thought maybe this is a problem >>>>>>>>>>> With PATH variable. 
But I don't think so. I compared root's >>>>>>>>>>> Path to that of tsaki's and made them identical and retried. >>>>>>>>>>> I got the same behavior. >>>>>>>>>>> If you could enlighten me why this is happening, I would really >>>>>>>>>>> Appreciate it. >>>>>>>>>>> Thank you. >>>>>>>>>>> Tena >>>>>>>>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>>>>> Hi jeff, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for the firewall tip. I tried it while allowing all tip >>>>>>>>>>>> traffic >>>>>>>>>>>> and got interesting and preplexing result. Here's what's >>>>>>>>>>>> interesting >>>>>>>>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ./ssh/config on this >>>>>>>>>>>> run): >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>>>>> Host key verification failed. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> A daemon (pid 2743) died unexpectedly with status 255 while >>>>>>>>>>>> attempting >>>>>>>>>>>> to launch so we are aborting. >>>>>>>>>>>> >>>>>>>>>>>> There may be more information reported by the environment (see >>>>>>>>>>>> above). >>>>>>>>>>>> >>>>>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>>>>> shared >>>>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>>>>> have >>>>>>>>>>>> the >>>>>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to >>>>>>>>>>>> /usr/local/lib >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ export >>>>>>>>>>>> LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as >>>>>>>>>>>> well >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159 >>>>>>>>>>>> Warning: Identity file tsakai not accessible: No such file or >>>>>>>>>>>> directory. 
>>>>>>>>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132 >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ export >>>>>>>>>>>> LD_LIBRARY_PATH='/usr/local/lib' >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB >>>>>>>>>>>> LD_LIBRARY_PATH=/usr/local/lib >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A >>>>>>>>>>>> [tsakai@ip-10-195-171-159 ~]$ exit >>>>>>>>>>>> logout >>>>>>>>>>>> Connection to ip-10-195-171-159 closed. >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ hostname >>>>>>>>>>>> ip-10-203-21-132 >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>>>>>>>> Host key verification failed. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> A daemon (pid 2789) died unexpectedly with status 255 while >>>>>>>>>>>> attempting >>>>>>>>>>>> to launch so we are aborting. >>>>>>>>>>>> >>>>>>>>>>>> There may be more information reported by the environment (see >>>>>>>>>>>> above). >>>>>>>>>>>> >>>>>>>>>>>> This may be because the daemon was unable to find all the needed >>>>>>>>>>>> shared >>>>>>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to >>>>>>>>>>>> have >>>>>>>>>>>> the >>>>>>>>>>>> location of the shared libraries on the remote nodes and this will >>>>>>>>>>>> automatically be forwarded to the remote nodes. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>> ------------------------------------------------------------------------- >>>>>> >> >>>>>>>> >>>>>> - >>>>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in >>>>>>>>>>>> /usr/local/lib... 
>>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less >>>>>>>>>>>> total 16604 >>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> >>>>>>>>>>>> libfuse.so.2.8.5 >>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> >>>>>>>>>>>> libfuse.so.2.8.5 >>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> >>>>>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 >>>>>>>>>>>> -> >>>>>>>>>>>> libmca_common_sm.so.1.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> >>>>>>>>>>>> libmpi.so.0.0.2 >>>>>>>>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> >>>>>>>>>>>> libmpi.so.0.0.2 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> >>>>>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> >>>>>>>>>>>> libmpi_cxx.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> >>>>>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> >>>>>>>>>>>> libmpi_f77.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> >>>>>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> >>>>>>>>>>>> libmpi_f90.so.0.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> >>>>>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> >>>>>>>>>>>> libopen-pal.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> >>>>>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> >>>>>>>>>>>> libopen-rte.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> >>>>>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 >>>>>>>>>>>> -> >>>>>>>>>>>> libopenmpi_malloc.so.0.0.0 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> >>>>>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> >>>>>>>>>>>> libulockmgr.so.1.0.1 >>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> >>>>>>>>>>>> libxml2.so.2.7.2 >>>>>>>>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> >>>>>>>>>>>> libxml2.so.2.7.2 >>>>>>>>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused... >>>>>>>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>>>>>>>> >>>>>>>>>>>> Do you know why it's complaining about shared libraries? >>>>>>>>>>>> >>>>>>>>>>>> Thank you. >>>>>>>>>>>> >>>>>>>>>>>> Tena >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Your prior mails were about ssh issues, but this one sounds like >>>>>>>>>>>> you >>>>>>>>>>>> might >>>>>>>>>>>> have firewall issues. >>>>>>>>>>>> >>>>>>>>>>>> That is, the "orted" command attempts to open a TCP socket back to >>>>>>>>>>>> mpirun >>>>>>>>>>>> for >>>>>>>>>>>> various command and control reasons. If it is blocked from doing >>>>>>>>>>>> so >>>>>>>>>>>> by >>>>>>>>>>>> a >>>>>>>>>>>> firewall, Open MPI won't run. 
In general, you can either disable >>>>>>>>>>>> your >>>>>>>>>>>> firewall or you can setup a trust relationship for TCP connections >>>>>>>>>>>> within >>>>>>>>>>>> your >>>>>>>>>>>> cluster. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Reuti, >>>>>>>>>>>> >>>>>>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete >>>>>>>>>>>> session is captured in the attached file. >>>>>>>>>>>> >>>>>>>>>>>> What I did is much similar to what I have done before: verify >>>>>>>>>>>> that ssh works and then run mpirun command. In my a bit lengthy >>>>>>>>>>>> session log, there are two responses from "LogLevel DEBUG3." First >>>>>>>>>>>> from an scp invocation and then from mpirun invocation. They both >>>>>>>>>>>> say >>>>>>>>>>>> debug1: Authentication succeeded (publickey). >>>>>>>>>>>> >>>>>>>>>>>> From mpirun invocation, I see a line: >>>>>>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca >>>>>>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca >>>>>>>>>>>> orte_ess_num_procs >>>>>>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256" >>>>>>>>>>>> The IP address at the end of the line is indeed that of machine B. >>>>>>>>>>>> After that there was hanging and I controlled-C out of it, which >>>>>>>>>>>> gave me more lines. But the lines after >>>>>>>>>>>> debug1: Sending command: orted bla bla bla >>>>>>>>>>>> doesn't look good to me. But, in truth, I have no idea what they >>>>>>>>>>>> mean. >>>>>>>>>>>> >>>>>>>>>>>> If you could shed some light, I would appreciate it very much. >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> >>>>>>>>>>>> Tena >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai: >>>>>>>>>>>> >>>>>>>>>>>> your local machine is Linux like, but the execution hosts >>>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output. >>>>>>>>>>>> No, my environment is entirely linux. The path to my home >>>>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai, >>>>>>>>>>>> despite it is an nfs mount from vixen (which is known to >>>>>>>>>>>> itself as /home/tsakai). For historical reasons, I have >>>>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /Home, >>>>>>>>>>>> so that I can use consistent path for both vixen and blitzen. >>>>>>>>>>>> okay. Sometimes the protection of the home directory must be >>>>>>>>>>>> adjusted >>>>>>>>>>>> too, >>>>>>>>>>>> but >>>>>>>>>>>> as you can do it from the command line this shouldn't be an issue. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? >>>>>>>>>>>> It would also be an option to use hostbased authentication, >>>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless >>>>>>>>>>>> ssh-keys for each user. >>>>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I >>>>>>>>>>>> Ssh from my local machine (vixen) I use its public interface, >>>>>>>>>>>> but to address from one amazon cluster node to the other I >>>>>>>>>>>> use nodes' private dns names: domU-12-31-39-07-35-21 and >>>>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names >>>>>>>>>>>> change from a launch to another. 
I am using passphrasesless >>>>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to >>>>>>>>>>>> Amazon node A, from amazon node A to amazon node B, and from >>>>>>>>>>>> Amazon node B back to A. (Please see my initail post. There >>>>>>>>>>>> is a session dialogue for this.) They all work without authen- >>>>>>>>>>>> tication dialogue, except a brief initial dialogue: >>>>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)' >>>>>>>>>>>> can't be established. >>>>>>>>>>>> RSA key fingerprint is >>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? >>>>>>>>>>>> to which I say "yes." >>>>>>>>>>>> But I am unclear with what you mean by "hostbased authentication"? >>>>>>>>>>>> Doesn't that mean with password? If so, it is not an option. >>>>>>>>>>>> No. It's convenient inside a private cluster as it won't fill each >>>>>>>>>>>> users' >>>>>>>>>>>> known_hosts file and you don't need to create any ssh-keys. But >>>>>>>>>>>> when >>>>>>>>>>>> the >>>>>>>>>>>> hostname changes every time it might also create new hostkeys. It >>>>>>>>>>>> uses >>>>>>>>>>>> hostkeys (private and public), this way it works for all users. >>>>>>>>>>>> Just >>>>>>>>>>>> for >>>>>>>>>>>> reference: >>>>>>>>>>>> >>>>>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html >>>>>>>>>>>> >>>>>>>>>>>> You could look into it later. >>>>>>>>>>>> >>>>>>>>>>>> == >>>>>>>>>>>> >>>>>>>>>>>> - Can you try to use a command when connecting from A to B? E.g. >>>>>>>>>>>> ssh >>>>>>>>>>>> `domU-12-31-39-06-74-E2 ls`. Is this working too? >>>>>>>>>>>> >>>>>>>>>>>> - What about putting: >>>>>>>>>>>> >>>>>>>>>>>> LogLevel DEBUG3 >>>>>>>>>>>> >>>>>>>>>>>> In your ~/.ssh/config. Maybe we can see what he's trying to >>>>>>>>>>>> negotiate >>>>>>>>>>>> before >>>>>>>>>>>> it fails in verbose mode. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- Reuti >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Regards, >>>>>>>>>>>> >>>>>>>>>>>> Tena >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs? >>>>>>>>>>>> I >>>>>>>>>>>> saw >>>>>>>>>>>> the >>>>>>>>>>>> /Users/tsakai/... in your output. >>>>>>>>>>>> >>>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh >>>>>>>>>>>> domU-12-31-39-07-35-21 ls >>>>>>>>>>>> >>>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I have made a bit of progress(?)... >>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks >>>>>>>>>>>> like: >>>>>>>>>>>> # machine A >>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal >>>>>>>>>>>> This is just an abbreviation or nickname above. To use the >>>>>>>>>>>> specified >>>>>>>>>>>> settings, >>>>>>>>>>>> it's necessary to specify exactly this name. When the settings are >>>>>>>>>>>> the >>>>>>>>>>>> same >>>>>>>>>>>> anyway for all machines, you can use: >>>>>>>>>>>> >>>>>>>>>>>> Host * >>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>> BatchMode yes >>>>>>>>>>>> >>>>>>>>>>>> instead. >>>>>>>>>>>> >>>>>>>>>>>> Is this a private cluster (or at least private interfaces)? 
It >>>>>>>>>>>> would >>>>>>>>>>>> also >>>>>>>>>>>> be >>>>>>>>>>>> an option to use hostbased authentication, which will avoid setting >>>>>>>>>>>> any >>>>>>>>>>>> known_hosts file or passphraseless ssh-keys for each user. >>>>>>>>>>>> >>>>>>>>>>>> -- Reuti >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> HostName domU-12-31-39-07-35-21 >>>>>>>>>>>> BatchMode yes >>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>> >>>>>>>>>>>> # machine B >>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal >>>>>>>>>>>> HostName domU-12-31-39-06-74-E2 >>>>>>>>>>>> BatchMode yes >>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>> >>>>>>>>>>>> This file exists on both machine A and machine B. >>>>>>>>>>>> >>>>>>>>>>>> Now When I issue mpirun command as below: >>>>>>>>>>>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2 >>>>>>>>>>>> >>>>>>>>>>>> It hungs. I control-C out of it and I get: >>>>>>>>>>>> mpirun: killing job... >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> - >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> - >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> - >>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes >>>>>>>>>>>> shown >>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to >>>>>>>>>>>> the "orte-clean" tool for assistance. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> - >>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not >>>>>>>>>>>> report >>>>>>>>>>>> back when launched >>>>>>>>>>>> >>>>>>>>>>>> Am I making progress? >>>>>>>>>>>> >>>>>>>>>>>> Does this mean I am past authentication and something else is the >>>>>>>>>>>> problem? >>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? There >>>>>>>>>>>> are >>>>>>>>>>>> so >>>>>>>>>>>> many keyword-argument paris for this config file and I would like >>>>>>>>>>>> to >>>>>>>>>>>> look >>>>>>>>>>>> at >>>>>>>>>>>> some very basic one that works. >>>>>>>>>>>> >>>>>>>>>>>> Thank you. 
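(A very small ~/.ssh/config that matches what Reuti suggests above would look
like the sketch below.  StrictHostKeyChecking is an addition of mine, meant to
avoid the interactive "Are you sure you want to continue connecting?" prompt
on freshly launched instances, and it does trade away host-key verification:

  Host *
      IdentityFile ~/.ssh/tsakai
      IdentitiesOnly yes
      BatchMode yes
      StrictHostKeyChecking no

Every option here is standard ssh_config; "man ssh_config" describes each one.)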
>>>>>>>>>>>> >>>>>>>>>>>> Tena Sakai >>>>>>>>>>>> tsa...@gallo.ucsf.edu >>>>>>>>>>>> >>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi >>>>>>>>>>>> >>>>>>>>>>>> I have an app.ac1 file like below: >>>>>>>>>>>> [tsakai@vixen local]$ cat app.ac1 >>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5 >>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6 >>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7 >>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8 >>>>>>>>>>>> >>>>>>>>>>>> The program I run is >>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x >>>>>>>>>>>> Where x is [5..8]. The machines vixen and blitzen each run 2 runs. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s the program fib.R: >>>>>>>>>>>> [ tsakai@vixen local]$ cat fib.R >>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively >>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11) >>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 >>>>>>>>>>>> >>>>>>>>>>>> fib <- function( n ) { >>>>>>>>>>>> a <- 0 >>>>>>>>>>>> b <- 1 >>>>>>>>>>>> for ( i in 1:n ) { >>>>>>>>>>>> t <- b >>>>>>>>>>>> b <- a >>>>>>>>>>>> a <- a + t >>>>>>>>>>>> } >>>>>>>>>>>> a >>>>>>>>>>>> >>>>>>>>>>>> arg <- commandArgs( TRUE ) >>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE ) >>>>>>>>>>>> cat( fib(arg), myHost, '\n' ) >>>>>>>>>>>> >>>>>>>>>>>> It reads an argument from command line and produces a fibonacci >>>>>>>>>>>> number >>>>>>>>>>>> that >>>>>>>>>>>> corresponds to that index, followed by the machine name. Pretty >>>>>>>>>>>> simple >>>>>>>>>>>> stuff. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s the run output: >>>>>>>>>>>> [tsakai@vixen local]$ mpirun -app app.ac1 >>>>>>>>>>>> 5 vixen.egcrc.org >>>>>>>>>>>> 8 vixen.egcrc.org >>>>>>>>>>>> 13 blitzen.egcrc.org >>>>>>>>>>>> 21 blitzen.egcrc.org >>>>>>>>>>>> >>>>>>>>>>>> Which is exactly what I expect. So far so good. >>>>>>>>>>>> >>>>>>>>>>>> Now I want to run the same thing on cloud. 
I launch 2 instances of >>>>>>>>>>>> the >>>>>>>>>>>> same >>>>>>>>>>>> virtual machine, to which I get to by: >>>>>>>>>>>> [tsakai@vixen local]$ ssh A I ~/.ssh/tsakai >>>>>>>>>>>> machine-instance-A-public-dns >>>>>>>>>>>> >>>>>>>>>>>> Now I am on machine A: >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B >>>>>>>>>>>> without >>>>>>>>>>>> password authentication, >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4 >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine >>>>>>>>>>>> A >>>>>>>>>>>> without using password >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' >>>>>>>>>>>> can't >>>>>>>>>>>> be established. >>>>>>>>>>>> RSA key fingerprint is >>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes >>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the >>>>>>>>>>>> list >>>>>>>>>>>> of >>>>>>>>>>>> known hosts. >>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit >>>>>>>>>>>> logout >>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed. >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit >>>>>>>>>>>> logout >>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed. >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> >>>>>>>>>>>> As you can see, neither machine uses password for authentication; >>>>>>>>>>>> it >>>>>>>>>>>> uses >>>>>>>>>>>> public/private key pairs. There is no problem (that I can see) for >>>>>>>>>>>> ssh >>>>>>>>>>>> invocation >>>>>>>>>>>> from one machine to the other. This is so because I have a copy of >>>>>>>>>>>> public >>>>>>>>>>>> key >>>>>>>>>>>> and a copy of private key on each instance. 
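(One extra check that mimics how a non-interactive launcher invokes ssh: force
BatchMode and run a single remote command.  The host name below is the one
from the session above:

  ssh -o BatchMode=yes -i ~/.ssh/tsakai domU-12-31-39-0C-C8-01 hostname

If this prints the remote host name without any prompt, key-based login is set
up the way mpirun needs it; if it asks for anything, mpirun would stall or fail
at the same point.)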
>>>>>>>>>>>> >>>>>>>>>>>> The app.ac file is identical, except the node names: >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1 >>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>>>>> >>>>>>>>>>>> Here¹s what happens with mpirun: >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1 >>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: >>>>>>>>>>>> Permission denied, please try again. >>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job... >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> >>>>>>>>>>> - >>>>>>>>>>>> -- >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> -------------------------------------------------------------------- >>>>>>>>>>> -- >>>>>>>>>>> -> >>>>>>>>>>>> >>>>>>>>>>> - >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> >>>>>>>>>>>> Mpirun (or somebody else?) asks me password, which I don¹t have. >>>>>>>>>>>> I end up typing control-C. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s my question: >>>>>>>>>>>> How can I get past authentication by mpirun where there is no >>>>>>>>>>>> password? >>>>>>>>>>>> >>>>>>>>>>>> I would appreciate your help/insight greatly. >>>>>>>>>>>> >>>>>>>>>>>> Thank you. >>>>>>>>>>>> >>>>>>>>>>>> Tena Sakai >>>>>>>>>>>> tsa...@gallo.ucsf.edu > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users