Hooray!! Great to hear - I was running out of ideas :-)

On Dec 19, 2012, at 2:01 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
> I figured this out.
>
> ssh was working, but scp was not, due to an MTU mismatch between the systems.
> Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the problem.
>
> Dan
>
> On 12/17/2012 04:12 PM, Daniel Davidson wrote:
>> Yes, it does.
>>
>> Dan
>>
>> [root@compute-2-1 ~]# ssh compute-2-0
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
>> [root@compute-2-0 ~]# ssh compute-2-1
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
>> [root@compute-2-1 ~]#
>>
>> On 12/17/2012 03:39 PM, Doug Reeder wrote:
>>> Daniel,
>>>
>>> Does passwordless ssh work? You need to make sure that it does.
>>>
>>> Doug
>>>
>>> On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:
>>>
>>>> I would also add that scp seems to be creating the file in the /tmp directory of compute-2-0, and that /var/log/secure is showing ssh connections being accepted. Is there anything in ssh that can limit connections that I need to look out for? My guess is that it is part of the client prefs and not the server prefs, since I can initiate the mpi command from another machine and it works fine, even when it uses compute-2-0 and 2-1.
>>>>
>>>> Dan
>>>>
>>>> [root@compute-2-1 /]# date
>>>> Mon Dec 17 15:11:50 CST 2012
>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>>> [compute-2-1.local:70237] mca:base:select:( plm) Querying component [rsh]
>>>> [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
>>>>
>>>> [root@compute-2-0 tmp]# ls -ltr
>>>> total 24
>>>> -rw-------. 1 root root 0 Nov 28 08:42 yum.log
>>>> -rw-------. 1 root root 5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx
>>>> drwx------. 3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0
>>>> drwx------. 3 root root 4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0
>>>> drwx------ 18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0
>>>> drwx------ 44 root root 4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0
>>>>
>>>> [root@compute-2-0 tmp]# tail -10 /var/log/secure
>>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 10.1.255.226 port 49483 ssh2
>>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session opened for user root by (uid=0)
>>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 10.1.255.226: 11: disconnected by user
>>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session closed for user root
>>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 10.1.255.226 port 49484 ssh2
>>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session opened for user root by (uid=0)
>>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 10.1.255.226: 11: disconnected by user
>>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session closed for user root
>>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 10.1.255.226 port 49485 ssh2
>>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session opened for user root by (uid=0)
>>>>
>>>> On 12/17/2012 11:16 AM, Daniel Davidson wrote:
>>>>> After a very long time (15 minutes or so) I finally received the following in addition to what I just sent earlier:
>>>>>
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
>>>>> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit commands
>>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD
>>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD
>>>>>
>>>>> Firewalls are down:
>>>>>
>>>>> [root@compute-2-1 /]# iptables -L
>>>>> Chain INPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain FORWARD (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain OUTPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> [root@compute-2-0 ~]# iptables -L
>>>>> Chain INPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain FORWARD (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> Chain OUTPUT (policy ACCEPT)
>>>>> target prot opt source destination
>>>>>
>>>>> On 12/17/2012 11:09 AM, Ralph Castain wrote:
>>>>>> Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way.
>>>>>>
>>>>>> Looking at the ssh line, we are going to attempt to send a message from node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it?
>>>>>> >>>>>> >>>>>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> >>>>>> wrote: >>>>>> >>>>>>> These nodes have not been locked down yet so that jobs cannot be >>>>>>> launched from the backend, at least on purpose anyway. The added >>>>>>> logging returns the information below: >>>>>>> >>>>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host >>>>>>> compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca >>>>>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying component >>>>>>> [rsh] >>>>>>> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent >>>>>>> ssh : rsh path NULL >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Query of component >>>>>>> [rsh] set priority to 10 >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying component >>>>>>> [slurm] >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Skipping component >>>>>>> [slurm]. Query failed to return a module >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Querying component >>>>>>> [tm] >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Skipping component >>>>>>> [tm]. Query failed to return a module >>>>>>> [compute-2-1.local:69655] mca:base:select:( plm) Selected component >>>>>>> [rsh] >>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 >>>>>>> nodename hash 3634869988 >>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341 >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : >>>>>>> rsh path NULL >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm >>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Querying component >>>>>>> [default] >>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Query of component >>>>>>> [default] set priority to 1 >>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Selected component >>>>>>> [default] >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map >>>>>>> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged >>>>>>> allocation >>>>>>> [compute-2-1.local:69655] [[32341,0],0] using dash_host >>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0 >>>>>>> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list >>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new >>>>>>> daemon [[32341,0],1] >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new >>>>>>> daemon [[32341,0],1] to node compute-2-0 >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash) >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote >>>>>>> shell as local shell >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash) >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv: >>>>>>> /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH >>>>>>> ; export PATH ; >>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export >>>>>>> LD_LIBRARY_PATH ; >>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; >>>>>>> export 
DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess >>>>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid <template> -mca >>>>>>> orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca >>>>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose >>>>>>> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached >>>>>>> 1 >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a >>>>>>> child of mine >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node >>>>>>> compute-2-0 to launch list >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of >>>>>>> daemon [[32341,0],1] >>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: >>>>>>> (//usr/bin/ssh) [/usr/bin/ssh compute-2-0 >>>>>>> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; >>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export >>>>>>> LD_LIBRARY_PATH ; >>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; >>>>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess >>>>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca >>>>>>> orte_ess_num_procs 2 -mca orte_hnp_uri >>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca >>>>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose >>>>>>> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached >>>>>>> 1] >>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not >>>>>>> generated >>>>>>> Warning: No xauth data; using fake authentication data for X11 >>>>>>> forwarding. >>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Querying component >>>>>>> [rsh] >>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : >>>>>>> rsh path NULL >>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Query of component >>>>>>> [rsh] set priority to 10 >>>>>>> [compute-2-0.local:24659] mca:base:select:( plm) Selected component >>>>>>> [rsh] >>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Querying component >>>>>>> [default] >>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Query of component >>>>>>> [default] set priority to 1 >>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Selected component >>>>>>> [default] >>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : >>>>>>> rsh path NULL >>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> On 12/17/2012 10:37 AM, Ralph Castain wrote: >>>>>>>> ?? That was all the output? If so, then something is indeed quite >>>>>>>> wrong as it didn't even attempt to launch the job. >>>>>>>> >>>>>>>> Try adding -mca plm_base_verbose 5 to the cmd line. >>>>>>>> >>>>>>>> I was assuming you were using ssh as the launcher, but I wonder if you >>>>>>>> are in some managed environment? If so, then it could be that launch >>>>>>>> from a backend node isn't allowed (e.g., on gridengine). >>>>>>>> >>>>>>>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> This looks to be having issues as well, and I cannot get any number >>>>>>>>> of processors to give me a different result with the new version. 
>>>>>>>>> >>>>>>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host >>>>>>>>> compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca >>>>>>>>> odls_base_verbose 5 hostname >>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying component >>>>>>>>> [default] >>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of component >>>>>>>>> [default] set priority to 1 >>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected component >>>>>>>>> [default] >>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying component >>>>>>>>> [default] >>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of component >>>>>>>>> [default] set priority to 1 >>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected component >>>>>>>>> [default] >>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working >>>>>>>>> on WILDCARD >>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working >>>>>>>>> on WILDCARD >>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working >>>>>>>>> on WILDCARD >>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working >>>>>>>>> on WILDCARD >>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working >>>>>>>>> on WILDCARD >>>>>>>>> >>>>>>>>> However from the head node: >>>>>>>>> >>>>>>>>> [root@biocluster openmpi-1.7rc5]# >>>>>>>>> /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v >>>>>>>>> -np 50 hostname >>>>>>>>> >>>>>>>>> Displays 25 hostnames from each system. >>>>>>>>> >>>>>>>>> Thank you again for the help so far, >>>>>>>>> >>>>>>>>> Dan >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote: >>>>>>>>>> I will give this a try, but wouldn't that be an issue as well if the >>>>>>>>>> process was run on the head node or another node? So long as the >>>>>>>>>> mpi job is not started on either of these two nodes, it works fine. >>>>>>>>>> >>>>>>>>>> Dan >>>>>>>>>> >>>>>>>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote: >>>>>>>>>>> It must be making contact or ORTE wouldn't be attempting to launch >>>>>>>>>>> your application's procs. Looks more like it never received the >>>>>>>>>>> launch command. Looking at the code, I suspect you're getting >>>>>>>>>>> caught in a race condition that causes the message to get "stuck". >>>>>>>>>>> >>>>>>>>>>> Just to see if that's the case, you might try running this with the >>>>>>>>>>> 1.7 release candidate, or even the developer's nightly build. Both >>>>>>>>>>> use a different timing mechanism intended to resolve such >>>>>>>>>>> situations. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Thank you for the help so far. Here is the information that the >>>>>>>>>>>> debugging gives me. Looks like the daemon on on the non-local >>>>>>>>>>>> node never makes contact. If I step NP back two though, it does. 
>>>>>>>>>>>> >>>>>>>>>>>> Dan >>>>>>>>>>>> >>>>>>>>>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host >>>>>>>>>>>> compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca >>>>>>>>>>>> odls_base_verbose 5 hostname >>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying >>>>>>>>>>>> component [default] >>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of >>>>>>>>>>>> component [default] set priority to 1 >>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected >>>>>>>>>>>> component [default] >>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying >>>>>>>>>>>> component [default] >>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of >>>>>>>>>>>> component [default] set priority to 1 >>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected >>>>>>>>>>>> component [default] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info >>>>>>>>>>>> updating nidmap >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list >>>>>>>>>>>> unpacking data to launch job [49524,1] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list >>>>>>>>>>>> adding new jobdat for job [49524,1] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list >>>>>>>>>>>> unpacking 1 app_contexts >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],0] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],1] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],1] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],2] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],3] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],3] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],4] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],5] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],5] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],6] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],7] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],7] for me! 
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],8] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],9] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],9] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],10] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],11] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],11] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],12] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],13] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],13] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],14] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],15] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],15] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],16] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],17] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],17] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],18] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],19] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],19] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],20] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],21] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],21] for me! 
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],22] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],23] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],23] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],24] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],25] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],25] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],26] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],27] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],27] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],28] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],29] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],29] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],30] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],31] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],31] for me! >>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],32] on daemon 1 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - checking proc [[49524,1],33] on daemon 0 >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child >>>>>>>>>>>> list - found proc [[49524,1],33] for me! 
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my >>>>>>>>>>>> local list >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 >>>>>>>>>>>> processors for 17 children and locally set oversubscribed to false >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],1] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],3] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],5] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],7] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],9] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],11] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],13] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],15] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],17] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],19] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],21] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],23] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],25] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],27] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],29] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],31] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>>>>>>>> [[49524,1],33] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job >>>>>>>>>>>> [49524,1] launch status >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging >>>>>>>>>>>> launch report to myself >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting >>>>>>>>>>>> waitpids >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44857 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44858 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44859 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44860 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44861 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44862 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44863 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44865 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44866 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 
44867 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44869 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44870 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44871 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44872 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44873 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44874 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>>>>>>>> process 44875 terminated >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],33] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],31] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],29] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],27] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],25] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],23] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],21] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],19] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],17] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],15] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],13] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],11] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],9] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],7] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],5] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],3] terminated normally >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired >>>>>>>>>>>> checking abort file >>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child >>>>>>>>>>>> process [[49524,1],1] terminated normally >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],25] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],15] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] 
odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],11] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],13] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],19] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],9] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],17] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],31] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],7] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],21] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],5] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],33] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],23] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],3] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],29] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],27] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete >>>>>>>>>>>> for child [[49524,1],1] >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete >>>>>>>>>>>> reporting all procs in [49524,1] terminated >>>>>>>>>>>> ^Cmpirun: killing job... >>>>>>>>>>>> >>>>>>>>>>>> Killed by signal 2. >>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc >>>>>>>>>>>> working on WILDCARD >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote: >>>>>>>>>>>>> Sorry - I forgot that you built from a tarball, and so debug >>>>>>>>>>>>> isn't enabled by default. You need to configure --enable-debug. >>>>>>>>>>>>> >>>>>>>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson >>>>>>>>>>>>> <dani...@igb.uiuc.edu> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Oddly enough, adding this debugging info, lowered the number of >>>>>>>>>>>>>> processes that can be used down to 42 from 46. 
When I run the >>>>>>>>>>>>>> MPI, it fails giving only the information that follows: >>>>>>>>>>>>>> >>>>>>>>>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun >>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -v -np 44 >>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 hostname >>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying >>>>>>>>>>>>>> component [default] >>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of >>>>>>>>>>>>>> component [default] set priority to 1 >>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected >>>>>>>>>>>>>> component [default] >>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying >>>>>>>>>>>>>> component [default] >>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of >>>>>>>>>>>>>> component [default] set priority to 1 >>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected >>>>>>>>>>>>>> component [default] >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> compute-2-1.local >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote: >>>>>>>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done >>>>>>>>>>>>>>> to each node (to start the local daemon). The only difference >>>>>>>>>>>>>>> is the number of fork/exec's being done on each node, and the >>>>>>>>>>>>>>> number of file descriptors being opened to support those >>>>>>>>>>>>>>> fork/exec's. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> It certainly looks like your limits are high enough. When you >>>>>>>>>>>>>>> say it "fails", what do you mean - what error does it report? >>>>>>>>>>>>>>> Try adding: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> to your cmd line - this will report all the local proc launch >>>>>>>>>>>>>>> debug and hopefully show you a more detailed error report. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson >>>>>>>>>>>>>>> <dani...@igb.uiuc.edu> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I have had to cobble together two machines in our rocks >>>>>>>>>>>>>>>> cluster without using the standard installation, they have efi >>>>>>>>>>>>>>>> only bios on them and rocks doesnt like that, so it is the >>>>>>>>>>>>>>>> only workaround. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Everything works great now, except for one thing. MPI jobs >>>>>>>>>>>>>>>> (openmpi or mpich) fail when started from one of these nodes >>>>>>>>>>>>>>>> (via qsub or by logging in and running the command) if 24 or >>>>>>>>>>>>>>>> more processors are needed on another system. However if the >>>>>>>>>>>>>>>> originator of the MPI job is the headnode or any of the >>>>>>>>>>>>>>>> preexisting compute nodes, it works fine. 
>>>>>>>>>>>>>>>> Right now I am guessing ssh client or ulimit problems, but I cannot find any difference. Any help would be greatly appreciated.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Examples:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This does not work, prints 24 hostnames for compute-2-1:
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> These both work, print 64 hostnames from each node:
>>>>>>>>>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>>>>>>>>>> core file size (blocks, -c) 0
>>>>>>>>>>>>>>>> data seg size (kbytes, -d) unlimited
>>>>>>>>>>>>>>>> scheduling priority (-e) 0
>>>>>>>>>>>>>>>> file size (blocks, -f) unlimited
>>>>>>>>>>>>>>>> pending signals (-i) 16410016
>>>>>>>>>>>>>>>> max locked memory (kbytes, -l) unlimited
>>>>>>>>>>>>>>>> max memory size (kbytes, -m) unlimited
>>>>>>>>>>>>>>>> open files (-n) 4096
>>>>>>>>>>>>>>>> pipe size (512 bytes, -p) 8
>>>>>>>>>>>>>>>> POSIX message queues (bytes, -q) 819200
>>>>>>>>>>>>>>>> real-time priority (-r) 0
>>>>>>>>>>>>>>>> stack size (kbytes, -s) unlimited
>>>>>>>>>>>>>>>> cpu time (seconds, -t) unlimited
>>>>>>>>>>>>>>>> max user processes (-u) 1024
>>>>>>>>>>>>>>>> virtual memory (kbytes, -v) unlimited
>>>>>>>>>>>>>>>> file locks (-x) unlimited
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>>>>>>>> Host *
>>>>>>>>>>>>>>>>         CheckHostIP no
>>>>>>>>>>>>>>>>         ForwardX11 yes
>>>>>>>>>>>>>>>>         ForwardAgent yes
>>>>>>>>>>>>>>>>         StrictHostKeyChecking no
>>>>>>>>>>>>>>>>         UsePrivilegedPort no
>>>>>>>>>>>>>>>>         Protocol 2,1
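For anyone who finds this thread with the same symptom, here is a minimal sketch of how to spot an MTU mismatch like the one Dan describes above. The interface name (eth2) and the 1500-byte MTU are taken from his fix; the commands themselves are an assumption to adapt to your own cluster, not something that was run in this thread.

  # Compare the configured MTU on both nodes (run on compute-2-0 and on compute-2-1)
  ip link show eth2 | grep -o 'mtu [0-9]*'

  # Probe the path with a full-size packet that may not be fragmented:
  # 1472 = 1500 - 20 (IP header) - 8 (ICMP header); drops here point at an MTU problem
  ping -c 3 -M do -s 1472 compute-2-0

  # Persist the fix on a RHEL/CentOS-style node, then bounce the interface
  echo 'MTU=1500' >> /etc/sysconfig/network-scripts/ifcfg-eth2
  ifdown eth2 && ifup eth2

The pattern of small packets getting through while large ones stall is consistent with what was reported here: interactive ssh kept working while scp, and apparently the orted's callback traffic, never made it back.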