Hi Gus,

Thank you for your tips.
I didn't find any smoking gun or anything that comes close. Here's the upshot:

[tsakai@ip-10-114-239-188 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 61504
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 61504
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo su
bash-3.2#
bash-3.2# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 61504
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
bash-3.2#
bash-3.2#
bash-3.2# ulimit -a > root_ulimit-a
bash-3.2# exit
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
14c14
< max user processes (-u) unlimited
---
> max user processes (-u) 61504
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
480 0 762674
762674
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo su
bash-3.2#
bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
512 0 762674
762674
bash-3.2# exit
exit
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
-bash: sysctl: command not found
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ /sbin/!!
/sbin/sysctl -a |grep fs.file-max
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'kernel.cap-bound'
fs.file-max = 762674
[tsakai@ip-10-114-239-188 ~]$
[tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
fs.file-max = 762674
[tsakai@ip-10-114-239-188 ~]$

I see a bit of difference between root and tsakai, but I cannot believe such a
small difference results in the rather catastrophic failure I have reported.
Would you agree with me?

Regards,

Tena
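For reference, the "open files (-n) 1024" line above is the per-process descriptor limit Jeff asked about, and a failing pipe() call is exactly how hitting that limit usually shows up; the file-nr readings also show only about 500 of 762674 system-wide handles in use, so the kernel-wide limit is clearly not the problem. Below is a minimal bash sketch (purely illustrative, not taken from the instances above) of how per-process descriptor exhaustion produces that class of error:

# Sketch only: exhaust the per-process open-files limit and watch pipe creation fail.
# The lowered limit applies only inside this subshell.
(
  ulimit -n 7                                            # highest usable fd number is now 6
  exec 3</dev/null 4</dev/null 5</dev/null 6</dev/null   # occupy fds 3-6 (0-2 are the std streams)
  echo hello | cat                                       # the pipeline needs pipe(), i.e. two fresh fds
)
# expected: an error along the lines of "pipe error: Too many open files" (EMFILE)

If that limit ever did turn out to be the culprit, it can be raised per user in /etc/security/limits.conf, e.g. with lines such as "tsakai soft nofile 4096" and "tsakai hard nofile 4096", although 1024 descriptors is normally far more than a four-process mpirun needs.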
On 2/11/11 6:06 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> Please read one answer inline.
>
> Tena Sakai wrote:
>> Hi Jeff,
>> Hi Gus,
>>
>> Thanks for your replies.
>>
>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>> as identical to that of root. In that setting I reproduced the
>> same result as before: root can run mpirun correctly and tsakai
>> cannot.
>>
>> I have also checked out permissions on the /tmp directory. tsakai has
>> no problem creating files under /tmp.
>>
>> I am trying to come up with a strategy to show that each and every
>> program in the PATH has "world" executable permission. It is a
>> stone to turn over, but I am not holding my breath.
>>
>>> ... you are running out of file descriptors. Are file descriptors
>>> limited on a per-process basis, perchance?
>>
>> I have never heard of such a restriction on Amazon EC2. There
>> are folks who keep running instances for a long, long time. Whereas
>> in my case, I launch 2 instances, check things out, and then turn
>> the instances off. (Given that the state of California has huge
>> debts, our funding is very tight.) So, I really doubt that's the
>> case. I have run mpirun unsuccessfully as user tsakai and immediately
>> after successfully as root. Still, I would be happy if you can tell
>> me a way to tell the number of file descriptors used or remaining.
>>
>> Your mention of file descriptors made me think of something under
>> /dev. But I don't know exactly what I am fishing for. Do you have
>> some suggestions?
>>
>
> 1) If the environment has anything to do with Linux,
> check:
>
> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>
> or
>
> sysctl -a |grep fs.file-max
>
> This max can be set (fs.file-max=whatever_is_reasonable)
> in /etc/sysctl.conf
>
> See 'man sysctl' and 'man sysctl.conf'
>
> 2) Another possible source of limits.
>
> Check "ulimit -a" (bash) or "limit" (tcsh).
>
> If you need to change it, look at:
>
> /etc/security/limits.conf
>
> (See also 'man limits.conf')
>
> **
>
> Since "root can but Tena cannot",
> I would check 2) first,
> as those are the 'per user/per group' limits,
> whereas 1) is kernel/system-wide.
>
> I hope this helps,
> Gus Correa
>
> PS - I know you are a wise and careful programmer,
> but here we had cases of programs that would
> fail because of too many files that were open and never closed,
> eventually exceeding the max available/permissible.
> So, it does happen.
>
>> I wish I could reproduce this (weird) behavior on a different
>> set of machines. I certainly cannot in my local environment. Sigh!
>>
>> Regards,
>>
>> Tena
>>
>>
>> On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote:
>>
>>> It is concerning if the pipe system call fails - I can't think of why
>>> that would happen. That's not usually a permissions issue but rather a
>>> deeper indication that something is either seriously wrong on your
>>> system or you are running out of file descriptors. Are file descriptors
>>> limited on a per-process basis, perchance?
>>>
>>> Sent from my PDA. No type good.
>>>
>>> On Feb 11, 2011, at 10:08 AM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>>
>>>> Hi Tena
>>>>
>>>> Since root can but you can't,
>>>> is it a directory permission problem perhaps?
>>>> Check the execution directory permission (on both machines,
>>>> if this is not an NFS-mounted dir).
>>>> I am not sure, but IIRR OpenMPI also uses /tmp for
>>>> under-the-hood stuff, worth checking permissions there also.
>>>> Just a naive guess.
>>>>
>>>> Congrats for all the progress with the cloudy MPI!
>>>>
>>>> Gus Correa
>>>>
>>>> Tena Sakai wrote:
>>>>> Hi,
>>>>> I have made a bit more progress. I think I can say the ssh
>>>>> authentication problem is behind me now. I am still having a problem
>>>>> running mpirun, but the latest discovery, which I can reproduce, is
>>>>> that I can run mpirun as root.
Here's the session log: >>>>> [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com >>>>> Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195 >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ ll >>>>> total 8 >>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ ll .ssh >>>>> total 16 >>>>> -rw------- 1 tsakai tsakai 232 Feb 5 23:19 authorized_keys >>>>> -rw------- 1 tsakai tsakai 102 Feb 11 00:34 config >>>>> -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts >>>>> -rw------- 1 tsakai tsakai 887 Feb 8 22:03 tsakai >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal >>>>> Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31 >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ # I am on machine B >>>>> [tsakai@ip-10-100-243-195 ~]$ hostname >>>>> ip-10-100-243-195 >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ ll >>>>> total 8 >>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac >>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ cat app.ac >>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5 >>>>> -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6 >>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7 >>>>> -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8 >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ # go back to machine A >>>>> [tsakai@ip-10-100-243-195 ~]$ >>>>> [tsakai@ip-10-100-243-195 ~]$ exit >>>>> logout >>>>> Connection to ip-10-100-243-195.ec2.internal closed. >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ hostname >>>>> ip-10-195-198-31 >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac >>>>> >>>>> -------------------------------------------------------------------------- >>>>> mpirun was unable to launch the specified application as it encountered >>>>> an >>>>> error: >>>>> Error: pipe function call failed when setting up I/O forwarding subsystem >>>>> Node: ip-10-195-198-31 >>>>> while attempting to start process rank 0. 
>>>>> >>>>> -------------------------------------------------------------------------- >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ # try it as root >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ sudo su >>>>> bash-3.2# >>>>> bash-3.2# pwd >>>>> /home/tsakai >>>>> bash-3.2# >>>>> bash-3.2# ls -l /root/.ssh/config >>>>> -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config >>>>> bash-3.2# >>>>> bash-3.2# cat /root/.ssh/config >>>>> Host * >>>>> IdentityFile /root/.ssh/.derobee/.kagi >>>>> IdentitiesOnly yes >>>>> BatchMode yes >>>>> bash-3.2# >>>>> bash-3.2# pwd >>>>> /home/tsakai >>>>> bash-3.2# >>>>> bash-3.2# ls -l >>>>> total 8 >>>>> -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac >>>>> -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R >>>>> bash-3.2# >>>>> bash-3.2# # now is the time for mpirun >>>>> bash-3.2# >>>>> bash-3.2# mpirun --app ./app.ac >>>>> 13 ip-10-100-243-195 >>>>> 21 ip-10-100-243-195 >>>>> 5 ip-10-195-198-31 >>>>> 8 ip-10-195-198-31 >>>>> bash-3.2# >>>>> bash-3.2# # It works (being root)! >>>>> bash-3.2# >>>>> bash-3.2# exit >>>>> exit >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac >>>>> >>>>> -------------------------------------------------------------------------- >>>>> mpirun was unable to launch the specified application as it encountered >>>>> an >>>>> error: >>>>> Error: pipe function call failed when setting up I/O forwarding subsystem >>>>> Node: ip-10-195-198-31 >>>>> while attempting to start process rank 0. >>>>> >>>>> -------------------------------------------------------------------------- >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ # I don't get it. >>>>> [tsakai@ip-10-195-198-31 ~]$ >>>>> [tsakai@ip-10-195-198-31 ~]$ exit >>>>> logout >>>>> [tsakai@vixen ec2]$ >>>>> So, why does it say "pipe function call failed when setting up >>>>> I/O forwarding subsystem Node: ip-10-195-198-31" ? >>>>> The node it is referring to is not the remote machine. It is >>>>> What I call machine A. I first thought maybe this is a problem >>>>> With PATH variable. But I don't think so. I compared root's >>>>> Path to that of tsaki's and made them identical and retried. >>>>> I got the same behavior. >>>>> If you could enlighten me why this is happening, I would really >>>>> Appreciate it. >>>>> Thank you. >>>>> Tena >>>>> On 2/10/11 4:12 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>> Hi jeff, >>>>>> >>>>>> Thanks for the firewall tip. I tried it while allowing all tip traffic >>>>>> and got interesting and preplexing result. Here's what's interesting >>>>>> (BTW, I got rid of "LogLevel DEBUG3" from ./ssh/config on this run): >>>>>> >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>> Host key verification failed. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> A daemon (pid 2743) died unexpectedly with status 255 while attempting >>>>>> to launch so we are aborting. >>>>>> >>>>>> There may be more information reported by the environment (see above). >>>>>> >>>>>> This may be because the daemon was unable to find all the needed shared >>>>>> libraries on the remote node. 
You may set your LD_LIBRARY_PATH to have >>>>>> the >>>>>> location of the shared libraries on the remote nodes and this will >>>>>> automatically be forwarded to the remote nodes. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> mpirun noticed that the job aborted, but has no info as to the process >>>>>> that caused that situation. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> mpirun: clean termination accomplished >>>>>> >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to >>>>>> /usr/local/lib >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as well >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159 >>>>>> Warning: Identity file tsakai not accessible: No such file or >>>>>> directory. >>>>>> Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132 >>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>> [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib' >>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>> [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB >>>>>> LD_LIBRARY_PATH=/usr/local/lib >>>>>> [tsakai@ip-10-195-171-159 ~]$ >>>>>> [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A >>>>>> [tsakai@ip-10-195-171-159 ~]$ exit >>>>>> logout >>>>>> Connection to ip-10-195-171-159 closed. >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ hostname >>>>>> ip-10-203-21-132 >>>>>> [tsakai@ip-10-203-21-132 ~]$ # try mpirun again >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2 >>>>>> Host key verification failed. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> A daemon (pid 2789) died unexpectedly with status 255 while attempting >>>>>> to launch so we are aborting. >>>>>> >>>>>> There may be more information reported by the environment (see above). >>>>>> >>>>>> This may be because the daemon was unable to find all the needed shared >>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have >>>>>> the >>>>>> location of the shared libraries on the remote nodes and this will >>>>>> automatically be forwarded to the remote nodes. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> mpirun noticed that the job aborted, but has no info as to the process >>>>>> that caused that situation. >>>>>> >>>>>> ------------------------------------------------------------------------->>>>>> - >>>>>> mpirun: clean termination accomplished >>>>>> >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in >>>>>> /usr/local/lib... 
>>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less >>>>>> total 16604 >>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so -> >>>>>> libfuse.so.2.8.5 >>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libfuse.so.2 -> >>>>>> libfuse.so.2.8.5 >>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so -> >>>>>> libmca_common_sm.so.1.0.0 >>>>>> lrwxrwxrwx 1 root root 25 Feb 8 23:06 libmca_common_sm.so.1 -> >>>>>> libmca_common_sm.so.1.0.0 >>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so -> >>>>>> libmpi.so.0.0.2 >>>>>> lrwxrwxrwx 1 root root 15 Feb 8 23:06 libmpi.so.0 -> >>>>>> libmpi.so.0.0.2 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so -> >>>>>> libmpi_cxx.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_cxx.so.0 -> >>>>>> libmpi_cxx.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so -> >>>>>> libmpi_f77.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f77.so.0 -> >>>>>> libmpi_f77.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so -> >>>>>> libmpi_f90.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 19 Feb 8 23:06 libmpi_f90.so.0 -> >>>>>> libmpi_f90.so.0.0.1 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so -> >>>>>> libopen-pal.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-pal.so.0 -> >>>>>> libopen-pal.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so -> >>>>>> libopen-rte.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libopen-rte.so.0 -> >>>>>> libopen-rte.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so -> >>>>>> libopenmpi_malloc.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 26 Feb 8 23:06 libopenmpi_malloc.so.0 -> >>>>>> libopenmpi_malloc.so.0.0.0 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so -> >>>>>> libulockmgr.so.1.0.1 >>>>>> lrwxrwxrwx 1 root root 20 Feb 8 23:06 libulockmgr.so.1 -> >>>>>> libulockmgr.so.1.0.1 >>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so -> >>>>>> libxml2.so.2.7.2 >>>>>> lrwxrwxrwx 1 root root 16 Feb 8 23:06 libxml2.so.2 -> >>>>>> libxml2.so.2.7.2 >>>>>> -rw-r--r-- 1 root root 385912 Jan 26 01:00 libvt.a >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> [tsakai@ip-10-203-21-132 ~]$ # Now, I am really confused... >>>>>> [tsakai@ip-10-203-21-132 ~]$ >>>>>> >>>>>> Do you know why it's complaining about shared libraries? >>>>>> >>>>>> Thank you. >>>>>> >>>>>> Tena >>>>>> >>>>>> >>>>>> On 2/10/11 1:05 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote: >>>>>> >>>>>>> Your prior mails were about ssh issues, but this one sounds like you >>>>>>> might >>>>>>> have firewall issues. >>>>>>> >>>>>>> That is, the "orted" command attempts to open a TCP socket back to >>>>>>> mpirun >>>>>>> for >>>>>>> various command and control reasons. If it is blocked from doing so by >>>>>>> a >>>>>>> firewall, Open MPI won't run. In general, you can either disable your >>>>>>> firewall or you can setup a trust relationship for TCP connections >>>>>>> within >>>>>>> your >>>>>>> cluster. >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote: >>>>>>> >>>>>>>> Hi Reuti, >>>>>>>> >>>>>>>> Thanks for suggesting "LogLevel DEBUG3." I did so and complete >>>>>>>> session is captured in the attached file. >>>>>>>> >>>>>>>> What I did is much similar to what I have done before: verify >>>>>>>> that ssh works and then run mpirun command. In my a bit lengthy >>>>>>>> session log, there are two responses from "LogLevel DEBUG3." 
First >>>>>>>> from an scp invocation and then from mpirun invocation. They both >>>>>>>> say >>>>>>>> debug1: Authentication succeeded (publickey). >>>>>>>> >>>>>>>>> From mpirun invocation, I see a line: >>>>>>>> debug1: Sending command: orted --daemonize -mca ess env -mca >>>>>>>> orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs >>>>>>>> 2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256" >>>>>>>> The IP address at the end of the line is indeed that of machine B. >>>>>>>> After that there was hanging and I controlled-C out of it, which >>>>>>>> gave me more lines. But the lines after >>>>>>>> debug1: Sending command: orted bla bla bla >>>>>>>> doesn't look good to me. But, in truth, I have no idea what they >>>>>>>> mean. >>>>>>>> >>>>>>>> If you could shed some light, I would appreciate it very much. >>>>>>>> >>>>>>>> Regards, >>>>>>>> >>>>>>>> Tena >>>>>>>> >>>>>>>> >>>>>>>> On 2/10/11 10:57 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>> >>>>>>>>> Hi, >>>>>>>>> >>>>>>>>> Am 10.02.2011 um 19:11 schrieb Tena Sakai: >>>>>>>>> >>>>>>>>>>> your local machine is Linux like, but the execution hosts >>>>>>>>>>> are Macs? I saw the /Users/tsakai/... in your output. >>>>>>>>>> No, my environment is entirely linux. The path to my home >>>>>>>>>> directory on one host (blitzen) has been known as /Users/tsakai, >>>>>>>>>> despite it is an nfs mount from vixen (which is known to >>>>>>>>>> itself as /home/tsakai). For historical reasons, I have >>>>>>>>>> chosen to give a symbolic link named /Users to vixen's /Home, >>>>>>>>>> so that I can use consistent path for both vixen and blitzen. >>>>>>>>> okay. Sometimes the protection of the home directory must be adjusted >>>>>>>>> too, >>>>>>>>> but >>>>>>>>> as you can do it from the command line this shouldn't be an issue. >>>>>>>>> >>>>>>>>> >>>>>>>>>>> Is this a private cluster (or at least private interfaces)? >>>>>>>>>>> It would also be an option to use hostbased authentication, >>>>>>>>>>> which will avoid setting any known_hosts file or passphraseless >>>>>>>>>>> ssh-keys for each user. >>>>>>>>>> No, it is not a private cluster. It is Amazon EC2. When I >>>>>>>>>> Ssh from my local machine (vixen) I use its public interface, >>>>>>>>>> but to address from one amazon cluster node to the other I >>>>>>>>>> use nodes' private dns names: domU-12-31-39-07-35-21 and >>>>>>>>>> domU-12-31-39-06-74-E2. Both public and private dns names >>>>>>>>>> change from a launch to another. I am using passphrasesless >>>>>>>>>> ssh-keys for authentication in all cases, i.e., from vixen to >>>>>>>>>> Amazon node A, from amazon node A to amazon node B, and from >>>>>>>>>> Amazon node B back to A. (Please see my initail post. There >>>>>>>>>> is a session dialogue for this.) They all work without authen- >>>>>>>>>> tication dialogue, except a brief initial dialogue: >>>>>>>>>> The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)' >>>>>>>>>> can't be established. >>>>>>>>>> RSA key fingerprint is >>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>> Are you sure you want to continue connecting (yes/no)? >>>>>>>>>> to which I say "yes." >>>>>>>>>> But I am unclear with what you mean by "hostbased authentication"? >>>>>>>>>> Doesn't that mean with password? If so, it is not an option. >>>>>>>>> No. It's convenient inside a private cluster as it won't fill each >>>>>>>>> users' >>>>>>>>> known_hosts file and you don't need to create any ssh-keys. 
But when >>>>>>>>> the >>>>>>>>> hostname changes every time it might also create new hostkeys. It uses >>>>>>>>> hostkeys (private and public), this way it works for all users. Just >>>>>>>>> for >>>>>>>>> reference: >>>>>>>>> >>>>>>>>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html >>>>>>>>> >>>>>>>>> You could look into it later. >>>>>>>>> >>>>>>>>> == >>>>>>>>> >>>>>>>>> - Can you try to use a command when connecting from A to B? E.g. ssh >>>>>>>>> `domU-12-31-39-06-74-E2 ls`. Is this working too? >>>>>>>>> >>>>>>>>> - What about putting: >>>>>>>>> >>>>>>>>> LogLevel DEBUG3 >>>>>>>>> >>>>>>>>> In your ~/.ssh/config. Maybe we can see what he's trying to negotiate >>>>>>>>> before >>>>>>>>> it fails in verbose mode. >>>>>>>>> >>>>>>>>> >>>>>>>>> -- Reuti >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> >>>>>>>>>> Tena >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On 2/10/11 2:27 AM, "Reuti" <re...@staff.uni-marburg.de> wrote: >>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> your local machine is Linux like, but the execution hosts are Macs? >>>>>>>>>>> I >>>>>>>>>>> saw >>>>>>>>>>> the >>>>>>>>>>> /Users/tsakai/... in your output. >>>>>>>>>>> >>>>>>>>>>> a) executing a command on them is also working, e.g.: ssh >>>>>>>>>>> domU-12-31-39-07-35-21 ls >>>>>>>>>>> >>>>>>>>>>> Am 10.02.2011 um 07:08 schrieb Tena Sakai: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>> I have made a bit of progress(?)... >>>>>>>>>>>> I made a config file in my .ssh directory on the cloud. It looks >>>>>>>>>>>> like: >>>>>>>>>>>> # machine A >>>>>>>>>>>> Host domU-12-31-39-07-35-21.compute-1.internal >>>>>>>>>>> This is just an abbreviation or nickname above. To use the specified >>>>>>>>>>> settings, >>>>>>>>>>> it's necessary to specify exactly this name. When the settings are >>>>>>>>>>> the >>>>>>>>>>> same >>>>>>>>>>> anyway for all machines, you can use: >>>>>>>>>>> >>>>>>>>>>> Host * >>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>> BatchMode yes >>>>>>>>>>> >>>>>>>>>>> instead. >>>>>>>>>>> >>>>>>>>>>> Is this a private cluster (or at least private interfaces)? It would >>>>>>>>>>> also >>>>>>>>>>> be >>>>>>>>>>> an option to use hostbased authentication, which will avoid setting >>>>>>>>>>> any >>>>>>>>>>> known_hosts file or passphraseless ssh-keys for each user. >>>>>>>>>>> >>>>>>>>>>> -- Reuti >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> HostName domU-12-31-39-07-35-21 >>>>>>>>>>>> BatchMode yes >>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>> >>>>>>>>>>>> # machine B >>>>>>>>>>>> Host domU-12-31-39-06-74-E2.compute-1.internal >>>>>>>>>>>> HostName domU-12-31-39-06-74-E2 >>>>>>>>>>>> BatchMode yes >>>>>>>>>>>> IdentityFile /home/tsakai/.ssh/tsakai >>>>>>>>>>>> ChallengeResponseAuthentication no >>>>>>>>>>>> IdentitiesOnly yes >>>>>>>>>>>> >>>>>>>>>>>> This file exists on both machine A and machine B. >>>>>>>>>>>> >>>>>>>>>>>> Now When I issue mpirun command as below: >>>>>>>>>>>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2 >>>>>>>>>>>> >>>>>>>>>>>> It hungs. I control-C out of it and I get: >>>>>>>>>>>> mpirun: killing job... >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>> -------------------------------------------------------------------------> >>>>> >> >>>>>> - >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. 
>>>>>>>>>>>> >>>>>>>>>>>> >>>>> -------------------------------------------------------------------------> >>>>> >> >>>>>> - >>>>> -------------------------------------------------------------------------> >>>>> >> >>>>>> - >>>>>>>>>>>> mpirun was unable to cleanly terminate the daemons on the nodes >>>>>>>>>>>> shown >>>>>>>>>>>> below. Additional manual cleanup may be required - please refer to >>>>>>>>>>>> the "orte-clean" tool for assistance. >>>>>>>>>>>> >>>>>>>>>>>> >>>>> -------------------------------------------------------------------------> >>>>> >> >>>>>> - >>>>>>>>>>>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not >>>>>>>>>>>> report >>>>>>>>>>>> back when launched >>>>>>>>>>>> >>>>>>>>>>>> Am I making progress? >>>>>>>>>>>> >>>>>>>>>>>> Does this mean I am past authentication and something else is the >>>>>>>>>>>> problem? >>>>>>>>>>>> Does someone have an example .ssh/config file I can look at? There >>>>>>>>>>>> are >>>>>>>>>>>> so >>>>>>>>>>>> many keyword-argument paris for this config file and I would like >>>>>>>>>>>> to >>>>>>>>>>>> look >>>>>>>>>>>> at >>>>>>>>>>>> some very basic one that works. >>>>>>>>>>>> >>>>>>>>>>>> Thank you. >>>>>>>>>>>> >>>>>>>>>>>> Tena Sakai >>>>>>>>>>>> tsa...@gallo.ucsf.edu >>>>>>>>>>>> >>>>>>>>>>>> On 2/9/11 7:52 PM, "Tena Sakai" <tsa...@gallo.ucsf.edu> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi >>>>>>>>>>>> >>>>>>>>>>>> I have an app.ac1 file like below: >>>>>>>>>>>> [tsakai@vixen local]$ cat app.ac1 >>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5 >>>>>>>>>>>> -H vixen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6 >>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7 >>>>>>>>>>>> -H blitzen.egcrc.org -np 1 Rscript >>>>>>>>>>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8 >>>>>>>>>>>> >>>>>>>>>>>> The program I run is >>>>>>>>>>>> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x >>>>>>>>>>>> Where x is [5..8]. The machines vixen and blitzen each run 2 runs. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s the program fib.R: >>>>>>>>>>>> [ tsakai@vixen local]$ cat fib.R >>>>>>>>>>>> # fib() computes, given index n, fibonacci number iteratively >>>>>>>>>>>> # here's the first dozen sequence (indexed from 0..11) >>>>>>>>>>>> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89 >>>>>>>>>>>> >>>>>>>>>>>> fib <- function( n ) { >>>>>>>>>>>> a <- 0 >>>>>>>>>>>> b <- 1 >>>>>>>>>>>> for ( i in 1:n ) { >>>>>>>>>>>> t <- b >>>>>>>>>>>> b <- a >>>>>>>>>>>> a <- a + t >>>>>>>>>>>> } >>>>>>>>>>>> a >>>>>>>>>>>> >>>>>>>>>>>> arg <- commandArgs( TRUE ) >>>>>>>>>>>> myHost <- system( 'hostname', intern=TRUE ) >>>>>>>>>>>> cat( fib(arg), myHost, '\n' ) >>>>>>>>>>>> >>>>>>>>>>>> It reads an argument from command line and produces a fibonacci >>>>>>>>>>>> number >>>>>>>>>>>> that >>>>>>>>>>>> corresponds to that index, followed by the machine name. Pretty >>>>>>>>>>>> simple >>>>>>>>>>>> stuff. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s the run output: >>>>>>>>>>>> [tsakai@vixen local]$ mpirun -app app.ac1 >>>>>>>>>>>> 5 vixen.egcrc.org >>>>>>>>>>>> 8 vixen.egcrc.org >>>>>>>>>>>> 13 blitzen.egcrc.org >>>>>>>>>>>> 21 blitzen.egcrc.org >>>>>>>>>>>> >>>>>>>>>>>> Which is exactly what I expect. So far so good. >>>>>>>>>>>> >>>>>>>>>>>> Now I want to run the same thing on cloud. 
I launch 2 instances of >>>>>>>>>>>> the >>>>>>>>>>>> same >>>>>>>>>>>> virtual machine, to which I get to by: >>>>>>>>>>>> [tsakai@vixen local]$ ssh A I ~/.ssh/tsakai >>>>>>>>>>>> machine-instance-A-public-dns >>>>>>>>>>>> >>>>>>>>>>>> Now I am on machine A: >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B >>>>>>>>>>>> without >>>>>>>>>>>> password authentication, >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>> Last login: Wed Feb 9 20:51:48 2011 from 10.254.214.4 >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-0C-C8-01 >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine >>>>>>>>>>>> A >>>>>>>>>>>> without using password >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' >>>>>>>>>>>> can't >>>>>>>>>>>> be established. >>>>>>>>>>>> RSA key fingerprint is >>>>>>>>>>>> e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81. >>>>>>>>>>>> Are you sure you want to continue connecting (yes/no)? yes >>>>>>>>>>>> Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the >>>>>>>>>>>> list >>>>>>>>>>>> of >>>>>>>>>>>> known hosts. >>>>>>>>>>>> Last login: Wed Feb 9 20:49:34 2011 from 10.215.203.239 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ exit >>>>>>>>>>>> logout >>>>>>>>>>>> Connection to domU-12-31-39-00-D1-F2 closed. >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-0C-C8-01 ~]$ exit >>>>>>>>>>>> logout >>>>>>>>>>>> Connection to domU-12-31-39-0C-C8-01 closed. >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # back at machine A >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname >>>>>>>>>>>> domU-12-31-39-00-D1-F2 >>>>>>>>>>>> >>>>>>>>>>>> As you can see, neither machine uses password for authentication; >>>>>>>>>>>> it >>>>>>>>>>>> uses >>>>>>>>>>>> public/private key pairs. There is no problem (that I can see) for >>>>>>>>>>>> ssh >>>>>>>>>>>> invocation >>>>>>>>>>>> from one machine to the other. This is so because I have a copy of >>>>>>>>>>>> public >>>>>>>>>>>> key >>>>>>>>>>>> and a copy of private key on each instance. 
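The interactive "Are you sure you want to continue connecting (yes/no)?" prompt quoted just above is the same thing that turns into "Host key verification failed" once BatchMode is on, because mpirun's non-interactive ssh cannot answer it. A minimal ~/.ssh/config along these lines, written identically on every instance, covers both the key and the host-key prompt (the Host patterns are only examples and must match the actual private DNS names; StrictHostKeyChecking no trades host-key safety for convenience on throwaway instances):

# sketch: append a minimal ssh client config on each instance
cat >> ~/.ssh/config <<'EOF'
Host *.ec2.internal domU-* ip-10-*
  IdentityFile ~/.ssh/tsakai
  IdentitiesOnly yes
  BatchMode yes
  StrictHostKeyChecking no
EOF
chmod 600 ~/.ssh/config

With that in place, a plain "ssh <hostname>" with no -i flag should log in non-interactively, which matters because mpirun runs ssh itself and never passes -i.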
>>>>>>>>>>>> >>>>>>>>>>>> The app.ac file is identical, except the node names: >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ cat app.ac1 >>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 5 >>>>>>>>>>>> -H domU-12-31-39-00-D1-F2 -np 1 Rscript /home/tsakai/fib.R 6 >>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 7 >>>>>>>>>>>> -H domU-12-31-39-0C-C8-01 -np 1 Rscript /home/tsakai/fib.R 8 >>>>>>>>>>>> >>>>>>>>>>>> Here¹s what happens with mpirun: >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ mpirun -app app.ac1 >>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: >>>>>>>>>>>> Permission denied, please try again. >>>>>>>>>>>> tsakai@domu-12-31-39-0c-c8-01's password: mpirun: killing job... >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>> ----------------------------------------------------------------------->>> >>>>> >> >>>>> - >>>>>>>>>>>> -- >>>>>>>>>>>> mpirun noticed that the job aborted, but has no info as to the >>>>>>>>>>>> process >>>>>>>>>>>> that caused that situation. >>>>>>>>>>>> >>>>>>>>>>>> >>>>> ----------------------------------------------------------------------->>> >>>>> >> >>>>> - >>>>>>>>>>>> -- >>>>>>>>>>>> >>>>>>>>>>>> mpirun: clean termination accomplished >>>>>>>>>>>> >>>>>>>>>>>> [tsakai@domU-12-31-39-00-D1-F2 ~]$ >>>>>>>>>>>> >>>>>>>>>>>> Mpirun (or somebody else?) asks me password, which I don¹t have. >>>>>>>>>>>> I end up typing control-C. >>>>>>>>>>>> >>>>>>>>>>>> Here¹s my question: >>>>>>>>>>>> How can I get past authentication by mpirun where there is no >>>>>>>>>>>> password? >>>>>>>>>>>> >>>>>>>>>>>> I would appreciate your help/insight greatly. >>>>>>>>>>>> >>>>>>>>>>>> Thank you. >>>>>>>>>>>> >>>>>>>>>>>> Tena Sakai >>>>>>>>>>>> tsa...@gallo.ucsf.edu > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users
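As for the final question in the quoted thread (how to get past mpirun asking for a password): the prompt appears because mpirun launches its remote daemon with a bare ssh command, which never sees the -i flag that was typed interactively. Besides a ~/.ssh/config entry as above, one alternative is to hand Open MPI's ssh launcher a complete command; the parameter name below is the 1.3/1.4-era spelling (older releases called it pls_rsh_agent), so treat it as a sketch to check against the installed version:

# sketch: make Open MPI's rsh/ssh launcher use an explicit identity file
mpirun --mca plm_rsh_agent "ssh -i /home/tsakai/.ssh/tsakai" -app app.ac1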