Hooray!! Great to hear - I was running out of ideas :-)

On Dec 19, 2012, at 2:01 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> I figured this out.
> 
> ssh was working, but scp was not, due to an MTU mismatch between the systems.  
> Adding MTU=1500 to my /etc/sysconfig/network-scripts/ifcfg-eth2 fixed the 
> problem.
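> 
> For reference, a minimal sketch of how to spot and fix that kind of mismatch 
> (assuming eth2 is the relevant private-network interface, as it was here):
> 
>   # compare the MTU currently in use on each node
>   ip link show eth2 | grep mtu
> 
>   # make the setting persistent on RHEL/CentOS-style systems, then bounce the interface
>   echo "MTU=1500" >> /etc/sysconfig/network-scripts/ifcfg-eth2
>   ifdown eth2 && ifup eth2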
> 
> Dan
> 
> On 12/17/2012 04:12 PM, Daniel Davidson wrote:
>> Yes, it does.
>> 
>> Dan
>> 
>> [root@compute-2-1 ~]# ssh compute-2-0
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
>> [root@compute-2-0 ~]# ssh compute-2-1
>> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>> Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
>> [root@compute-2-1 ~]#
>> 
>> 
>> 
>> On 12/17/2012 03:39 PM, Doug Reeder wrote:
>>> Daniel,
>>> 
>>> Does passwordless ssh work? You need to make sure that it does.
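>>> 
>>> A quick non-interactive check (a sketch using the node names from this 
>>> thread) is:
>>> 
>>>   ssh -o BatchMode=yes compute-2-0 hostname
>>> 
>>> If that fails instead of printing the hostname, key-based login is not set up.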
>>> 
>>> Doug
>>> On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:
>>> 
>>>> I would also add that scp seems to be creating the file in the /tmp 
>>>> directory of compute-2-0, and that /var/log/secure shows the ssh 
>>>> connections being accepted.  Is there anything in ssh that can limit 
>>>> connections that I should look out for?  My guess is that it is part of 
>>>> the client configuration rather than the server configuration, since I can 
>>>> initiate the mpi command from another machine and it works fine, even when 
>>>> it uses compute-2-0 and compute-2-1.
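>>>> 
>>>> (For what it is worth, the server-side knobs in OpenSSH that can throttle or 
>>>> cap connections are MaxStartups and MaxSessions in /etc/ssh/sshd_config.  A 
>>>> quick way to see whether either is set, as a sketch:
>>>> 
>>>>   grep -Ei 'maxstartups|maxsessions' /etc/ssh/sshd_config
>>>> 
>>>> Whether they are actually involved here is only a guess.)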
>>>> 
>>>> Dan
>>>> 
>>>> 
>>>> [root@compute-2-1 /]# date
>>>> Mon Dec 17 15:11:50 CST 2012
>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>>> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
>>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>>> [compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
>>>> [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh 
>>>> : rsh path NULL
>>>> 
>>>> [root@compute-2-0 tmp]# ls -ltr
>>>> total 24
>>>> -rw-------.  1 root    root       0 Nov 28 08:42 yum.log
>>>> -rw-------.  1 root    root    5962 Nov 29 10:50 
>>>> yum_save_tx-2012-11-29-10-50SRba9s.yumtx
>>>> drwx------.  3 danield danield 4096 Dec 12 14:56 
>>>> openmpi-sessions-danield@compute-2-0_0
>>>> drwx------.  3 root    root    4096 Dec 13 15:38 
>>>> openmpi-sessions-root@compute-2-0_0
>>>> drwx------  18 danield danield 4096 Dec 14 09:48 
>>>> openmpi-sessions-danield@compute-2-0.local_0
>>>> drwx------  44 root    root    4096 Dec 17 15:14 
>>>> openmpi-sessions-root@compute-2-0.local_0
>>>> 
>>>> [root@compute-2-0 tmp]# tail -10 /var/log/secure
>>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
>>>> 10.1.255.226 port 49483 ssh2
>>>> Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
>>>> opened for user root by (uid=0)
>>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
>>>> 10.1.255.226: 11: disconnected by user
>>>> Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
>>>> closed for user root
>>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
>>>> 10.1.255.226 port 49484 ssh2
>>>> Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
>>>> opened for user root by (uid=0)
>>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
>>>> 10.1.255.226: 11: disconnected by user
>>>> Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
>>>> closed for user root
>>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
>>>> 10.1.255.226 port 49485 ssh2
>>>> Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session 
>>>> opened for user root by (uid=0)
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 12/17/2012 11:16 AM, Daniel Davidson wrote:
>>>>> After a very long time (15 minutes or so), I finally received the following 
>>>>> in addition to what I just sent earlier:
>>>>> 
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>>>>> WILDCARD
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>>>>> WILDCARD
>>>>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>>>>> WILDCARD
>>>>> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
>>>>> orted_exit commands
>>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>>>>> WILDCARD
>>>>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>>>>> WILDCARD
>>>>> 
>>>>> Firewalls are down:
>>>>> 
>>>>> [root@compute-2-1 /]# iptables -L
>>>>> Chain INPUT (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> 
>>>>> Chain FORWARD (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> 
>>>>> Chain OUTPUT (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> [root@compute-2-0 ~]# iptables -L
>>>>> Chain INPUT (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> 
>>>>> Chain FORWARD (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> 
>>>>> Chain OUTPUT (policy ACCEPT)
>>>>> target     prot opt source               destination
>>>>> 
>>>>> On 12/17/2012 11:09 AM, Ralph Castain wrote:
>>>>>> Hmmm...and that is ALL the output? If so, then it never succeeded in 
>>>>>> sending a message back, which leads one to suspect some kind of firewall 
>>>>>> in the way.
>>>>>> 
>>>>>> Looking at the ssh line, we are going to attempt to send a message from 
>>>>>> node 2-0 to node 2-1 on the 10.1.255.226 address.  Is that going to 
>>>>>> work?  Anything preventing it?
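>>>>>> 
>>>>>> (A quick sanity check of that path, assuming the port from the orte_hnp_uri 
>>>>>> in the launch output below is still listening, would be something like:
>>>>>> 
>>>>>>   ssh compute-2-0 'nc -w 2 10.1.255.226 46314 < /dev/null; echo exit=$?'
>>>>>> 
>>>>>> A non-zero exit would point at something blocking the TCP connection back 
>>>>>> to the mpirun node.)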
>>>>>> 
>>>>>> 
>>>>>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>> wrote:
>>>>>> 
>>>>>>> These nodes have not yet been locked down to prevent jobs from being 
>>>>>>> launched from the backend, at least not on purpose.  The added 
>>>>>>> logging returns the information below:
>>>>>>> 
>>>>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>>>>>> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
>>>>>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
>>>>>>> [rsh]
>>>>>>> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent 
>>>>>>> ssh : rsh path NULL
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Query of component 
>>>>>>> [rsh] set priority to 10
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
>>>>>>> [slurm]
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component 
>>>>>>> [slurm]. Query failed to return a module
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
>>>>>>> [tm]
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component 
>>>>>>> [tm]. Query failed to return a module
>>>>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Selected component 
>>>>>>> [rsh]
>>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 
>>>>>>> nodename hash 3634869988
>>>>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : 
>>>>>>> rsh path NULL
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
>>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Querying component 
>>>>>>> [default]
>>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Query of component 
>>>>>>> [default] set priority to 1
>>>>>>> [compute-2-1.local:69655] mca:base:select:( odls) Selected component 
>>>>>>> [default]
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged 
>>>>>>> allocation
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] using dash_host
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new 
>>>>>>> daemon [[32341,0],1]
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new 
>>>>>>> daemon [[32341,0],1] to node compute-2-0
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote 
>>>>>>> shell as local shell
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
>>>>>>>        /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH 
>>>>>>> ; export PATH ; 
>>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
>>>>>>> LD_LIBRARY_PATH ; 
>>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
>>>>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess 
>>>>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid <template> -mca 
>>>>>>> orte_ess_num_procs 2 -mca orte_hnp_uri 
>>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
>>>>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
>>>>>>> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 
>>>>>>> 1
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a 
>>>>>>> child of mine
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node 
>>>>>>> compute-2-0 to launch list
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of 
>>>>>>> daemon [[32341,0],1]
>>>>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: 
>>>>>>> (//usr/bin/ssh) [/usr/bin/ssh compute-2-0 
>>>>>>> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
>>>>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
>>>>>>> LD_LIBRARY_PATH ; 
>>>>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
>>>>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess 
>>>>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca 
>>>>>>> orte_ess_num_procs 2 -mca orte_hnp_uri 
>>>>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
>>>>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
>>>>>>> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 
>>>>>>> 1]
>>>>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>>>>>> generated
>>>>>>> Warning: No xauth data; using fake authentication data for X11 
>>>>>>> forwarding.
>>>>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Querying component 
>>>>>>> [rsh]
>>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : 
>>>>>>> rsh path NULL
>>>>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Query of component 
>>>>>>> [rsh] set priority to 10
>>>>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Selected component 
>>>>>>> [rsh]
>>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Querying component 
>>>>>>> [default]
>>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Query of component 
>>>>>>> [default] set priority to 1
>>>>>>> [compute-2-0.local:24659] mca:base:select:( odls) Selected component 
>>>>>>> [default]
>>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : 
>>>>>>> rsh path NULL
>>>>>>> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 12/17/2012 10:37 AM, Ralph Castain wrote:
>>>>>>>> ?? That was all the output? If so, then something is indeed quite 
>>>>>>>> wrong as it didn't even attempt to launch the job.
>>>>>>>> 
>>>>>>>> Try adding -mca plm_base_verbose 5 to the cmd line.
>>>>>>>> 
>>>>>>>> I was assuming you were using ssh as the launcher, but I wonder if you 
>>>>>>>> are in some managed environment? If so, then it could be that launch 
>>>>>>>> from a backend node isn't allowed (e.g., on gridengine).
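>>>>>>>> 
>>>>>>>> (One quick way to tell whether the shell you are launching from is inside 
>>>>>>>> a managed allocation is to look for the scheduler's environment variables, 
>>>>>>>> e.g.:
>>>>>>>> 
>>>>>>>>   env | grep -E 'SGE|PBS|SLURM'
>>>>>>>> 
>>>>>>>> No output suggests a plain ssh-based launch.)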
>>>>>>>> 
>>>>>>>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> This looks to be having issues as well; I cannot get any number of 
>>>>>>>>> processes to give a different result with the new version.
>>>>>>>>> 
>>>>>>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>>>>>>>> compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
>>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of component 
>>>>>>>>> [default] set priority to 1
>>>>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of component 
>>>>>>>>> [default] set priority to 1
>>>>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> 
>>>>>>>>> However from the head node:
>>>>>>>>> 
>>>>>>>>> [root@biocluster openmpi-1.7rc5]# 
>>>>>>>>> /home/apps/openmpi-1.7rc5/bin/mpirun -host compute-2-0,compute-2-1 -v 
>>>>>>>>>  -np 50  hostname
>>>>>>>>> 
>>>>>>>>> Displays 25 hostnames from each system.
>>>>>>>>> 
>>>>>>>>> Thank you again for the help so far,
>>>>>>>>> 
>>>>>>>>> Dan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>>>>>>>>>> I will give this a try, but wouldn't that be an issue as well if the 
>>>>>>>>>> process was run on the head node or another node?  So long as the 
>>>>>>>>>> mpi job is not started on either of these two nodes, it works fine.
>>>>>>>>>> 
>>>>>>>>>> Dan
>>>>>>>>>> 
>>>>>>>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>>>>>>>>>> It must be making contact or ORTE wouldn't be attempting to launch 
>>>>>>>>>>> your application's procs. Looks more like it never received the 
>>>>>>>>>>> launch command. Looking at the code, I suspect you're getting 
>>>>>>>>>>> caught in a race condition that causes the message to get "stuck".
>>>>>>>>>>> 
>>>>>>>>>>> Just to see if that's the case, you might try running this with the 
>>>>>>>>>>> 1.7 release candidate, or even the developer's nightly build. Both 
>>>>>>>>>>> use a different timing mechanism intended to resolve such 
>>>>>>>>>>> situations.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thank you for the help so far.  Here is the information that the 
>>>>>>>>>>>> debugging gives me.  It looks like the daemon on the non-local 
>>>>>>>>>>>> node never makes contact.  If I step -np back by two, though, it does.
>>>>>>>>>>>> 
>>>>>>>>>>>> Dan
>>>>>>>>>>>> 
>>>>>>>>>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
>>>>>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying 
>>>>>>>>>>>> component [default]
>>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of 
>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected 
>>>>>>>>>>>> component [default]
>>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying 
>>>>>>>>>>>> component [default]
>>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of 
>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected 
>>>>>>>>>>>> component [default]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info 
>>>>>>>>>>>> updating nidmap
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>>>>> unpacking data to launch job [49524,1]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>>>>> adding new jobdat for job [49524,1]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>>>>> unpacking 1 app_contexts
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],0] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],1] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],1] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],2] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],3] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],3] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],4] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],5] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],5] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],6] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],7] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],7] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],8] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],9] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],9] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],10] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],11] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],11] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],12] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],13] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],13] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],14] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],15] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],15] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],16] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],17] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],17] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],18] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],19] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],19] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],20] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],21] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],21] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],22] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],23] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],23] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],24] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],25] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],25] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],26] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],27] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],27] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],28] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],29] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],29] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],30] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],31] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],31] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],32] on daemon 1
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - checking proc [[49524,1],33] on daemon 0
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child 
>>>>>>>>>>>> list - found proc [[49524,1],33] for me!
>>>>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my 
>>>>>>>>>>>> local list
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 
>>>>>>>>>>>> processors for 17 children and locally set oversubscribed to false
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],1]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],3]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],5]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],7]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],9]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],11]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],13]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],15]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],17]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],19]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],21]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],23]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],25]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],27]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],29]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],31]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>>>>> [[49524,1],33]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job 
>>>>>>>>>>>> [49524,1] launch status
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging 
>>>>>>>>>>>> launch report to myself
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting 
>>>>>>>>>>>> waitpids
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44857 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44858 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44859 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44860 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44861 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44862 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44863 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44865 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44866 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44867 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44869 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44870 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44871 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44872 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44873 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44874 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>>>>> process 44875 terminated
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],33] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],31] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],29] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],27] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],25] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],23] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],21] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],19] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],17] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],15] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],13] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],11] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],9] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],7] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],5] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],3] terminated normally
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
>>>>>>>>>>>> checking abort file 
>>>>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort 
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>>>>> process [[49524,1],1] terminated normally
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],25]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],15]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],11]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],13]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],19]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],9]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],17]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],31]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],7]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],21]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],5]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],33]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],23]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],3]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],29]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],27]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete 
>>>>>>>>>>>> for child [[49524,1],1]
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete 
>>>>>>>>>>>> reporting all procs in [49524,1] terminated
>>>>>>>>>>>> ^Cmpirun: killing job...
>>>>>>>>>>>> 
>>>>>>>>>>>> Killed by signal 2.
>>>>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc 
>>>>>>>>>>>> working on WILDCARD
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>>>>>>>>>> Sorry - I forgot that you built from a tarball, so debug 
>>>>>>>>>>>>> isn't enabled by default.  You need to configure with --enable-debug.
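>>>>>>>>>>>>> 
>>>>>>>>>>>>> (A minimal rebuild along those lines, assuming the install prefix used 
>>>>>>>>>>>>> elsewhere in this thread, would look roughly like:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>   ./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
>>>>>>>>>>>>>   make -j8 && make install
>>>>>>>>>>>>> )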
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson 
>>>>>>>>>>>>> <dani...@igb.uiuc.edu> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Oddly enough, adding this debugging info lowered the number of 
>>>>>>>>>>>>>> processes that can be used from 46 down to 42.  When I run the 
>>>>>>>>>>>>>> MPI job, it fails, giving only the information that follows:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun 
>>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -v  -np 44 
>>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5 hostname
>>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying 
>>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of 
>>>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected 
>>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying 
>>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of 
>>>>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected 
>>>>>>>>>>>>>> component [default]
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done 
>>>>>>>>>>>>>>> to each node (to start the local daemon). The only difference 
>>>>>>>>>>>>>>> is the number of fork/exec's being done on each node, and the 
>>>>>>>>>>>>>>> number of file descriptors being opened to support those 
>>>>>>>>>>>>>>> fork/exec's.
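>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> (A quick way to compare the limits that matter for that - open files and 
>>>>>>>>>>>>>>> max user processes - on both nodes is, as a sketch using the node names 
>>>>>>>>>>>>>>> from this thread:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>   for h in compute-2-0 compute-2-1 ; do
>>>>>>>>>>>>>>>     echo "== $h" ; ssh $h 'ulimit -n; ulimit -u'
>>>>>>>>>>>>>>>   done
>>>>>>>>>>>>>>> )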
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> It certainly looks like your limits are high enough. When you 
>>>>>>>>>>>>>>> say it "fails", what do you mean - what error does it report? 
>>>>>>>>>>>>>>> Try adding:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> to your cmd line - this will report all the local proc launch 
>>>>>>>>>>>>>>> debug and hopefully show you a more detailed error report.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson 
>>>>>>>>>>>>>>> <dani...@igb.uiuc.edu> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> I have had to cobble together two machines in our Rocks 
>>>>>>>>>>>>>>>> cluster without using the standard installation; they have an EFI-only 
>>>>>>>>>>>>>>>> BIOS, which Rocks does not like, so this was the only workaround.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Everything works great now, except for one thing.  MPI jobs 
>>>>>>>>>>>>>>>> (Open MPI or MPICH) fail when started from one of these nodes 
>>>>>>>>>>>>>>>> (via qsub, or by logging in and running the command) if 24 or 
>>>>>>>>>>>>>>>> more processes are needed on another system.  However, if the 
>>>>>>>>>>>>>>>> originator of the MPI job is the head node or any of the 
>>>>>>>>>>>>>>>> preexisting compute nodes, it works fine.  Right now I am 
>>>>>>>>>>>>>>>> guessing an ssh client or ulimit problem, but I cannot find any 
>>>>>>>>>>>>>>>> difference.  Any help would be greatly appreciated.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Examples:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun 
>>>>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This does not work, prints 24 hostnames for compute-2-1
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun 
>>>>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> These both work, print 64 hostnames from each node
>>>>>>>>>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun 
>>>>>>>>>>>>>>>> -host compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>>>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>>>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>>>>>>>>>> scheduling priority             (-e) 0
>>>>>>>>>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>>>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>>>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>>>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>>>>>>>>>> open files                      (-n) 4096
>>>>>>>>>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>>>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>>>>>>>>>> real-time priority              (-r) 0
>>>>>>>>>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>>>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>>>>>>>>>> max user processes              (-u) 1024
>>>>>>>>>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>>>>>>>>>> file locks                      (-x) unlimited
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>>>>>>>> Host *
>>>>>>>>>>>>>>>>        CheckHostIP             no
>>>>>>>>>>>>>>>>        ForwardX11              yes
>>>>>>>>>>>>>>>>        ForwardAgent            yes
>>>>>>>>>>>>>>>>        StrictHostKeyChecking   no
>>>>>>>>>>>>>>>>        UsePrivilegedPort       no
>>>>>>>>>>>>>>>>        Protocol                2,1
>>>>>>>>>>>>>>>> 

