Hmmm...and that is ALL the output? If so, then it never succeeded in sending a 
message back, which leads one to suspect some kind of firewall in the way.

Looking at the ssh line, we are going to attempt to send a message from node 
compute-2-0 back to node compute-2-1 on the 10.1.255.226 address. Is that 
going to work? Is anything preventing it?
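One quick way to answer that (a sketch, not something from the log above; it 
assumes netcat/`nc` is available on the nodes) is to pull the tcp host:port 
pairs out of the orte_hnp_uri that the verbose output printed and probe each 
one from compute-2-0:

```shell
#!/bin/sh
# HNP contact info copied from the verbose output above; the port is picked
# at run time, so substitute whatever your own mpirun invocation prints.
uri='2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314'

# Split the URI on ';', keep the tcp:// entries, and turn each into "host port".
pairs=$(echo "$uri" | tr ';' '\n' | grep '^tcp://' \
        | sed -e 's|^tcp://||' -e 's|:| |')
echo "$pairs"

# For each printed pair you would then run, from the remote node:
#   ssh compute-2-0 "nc -z -w 2 HOST PORT" && echo reachable || echo blocked
```

If the probes succeed when run on compute-2-1 itself but fail from 
compute-2-0, a host firewall (e.g. iptables) on compute-2-1 is the likely 
suspect.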


On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> These nodes have not been locked down to prevent jobs from being launched 
> from the backend, at least not on purpose.  The added logging returns the 
> information below:
> 
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
> rsh path NULL
> [compute-2-1.local:69655] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [slurm]
> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename 
> hash 3634869988
> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path 
> NULL
> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
> [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69655] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation
> [compute-2-1.local:69655] [[32341,0],0] using dash_host
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new 
> daemon [[32341,0],1] to node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell 
> as local shell
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
>        /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; 
> export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH 
> ; export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export 
> DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca 
> orte_ess_jobid 2119499776 -mca orte_ess_vpid <template> -mca 
> orte_ess_num_procs 2 -mca orte_hnp_uri 
> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 
> -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child 
> of mine
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to 
> launch list
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon 
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh) 
> [/usr/bin/ssh compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; 
> export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export 
> DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca 
> orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca 
> orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" 
> -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> [compute-2-0.local:24659] mca:base:select:(  plm) Querying component [rsh]
> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : rsh 
> path NULL
> [compute-2-0.local:24659] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [compute-2-0.local:24659] mca:base:select:(  plm) Selected component [rsh]
> [compute-2-0.local:24659] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:24659] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-0.local:24659] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : rsh path 
> NULL
> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
> 
> 
> 
> 
> On 12/17/2012 10:37 AM, Ralph Castain wrote:
>> ?? That was all the output? If so, then something is indeed quite wrong as 
>> it didn't even attempt to launch the job.
>> 
>> Try adding -mca plm_base_verbose 5 to the cmd line.
>> 
>> I was assuming you were using ssh as the launcher, but I wonder if you are 
>> in some managed environment? If so, then it could be that launch from a 
>> backend node isn't allowed (e.g., on gridengine).
>> 
>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>> 
>>> This looks to be having issues as well, and no matter how many processes 
>>> I request, the new version gives me the same result.
>>> 
>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>> compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
>>> odls_base_verbose 5 hostname
>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying component 
>>> [default]
>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of component 
>>> [default] set priority to 1
>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected component 
>>> [default]
>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying component 
>>> [default]
>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of component 
>>> [default] set priority to 1
>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected component 
>>> [default]
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>> WILDCARD
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>> WILDCARD
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>> WILDCARD
>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
>>> WILDCARD
>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
>>> WILDCARD
>>> 
>>> However from the head node:
>>> 
>>> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun 
>>> -host compute-2-0,compute-2-1 -v  -np 50  hostname
>>> 
>>> Displays 25 hostnames from each system.
>>> 
>>> Thank you again for the help so far,
>>> 
>>> Dan
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>>>> I will give this a try, but wouldn't that be an issue as well if the 
>>>> process were run on the head node or another node?  As long as the MPI 
>>>> job is not started on either of these two nodes, it works fine.
>>>> 
>>>> Dan
>>>> 
>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>>>> It must be making contact or ORTE wouldn't be attempting to launch your 
>>>>> application's procs. Looks more like it never received the launch 
>>>>> command. Looking at the code, I suspect you're getting caught in a race 
>>>>> condition that causes the message to get "stuck".
>>>>> 
>>>>> Just to see if that's the case, you might try running this with the 1.7 
>>>>> release candidate, or even the developer's nightly build. Both use a 
>>>>> different timing mechanism intended to resolve such situations.
>>>>> 
>>>>> 
>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>>> 
>>>>>> Thank you for the help so far.  Here is the information that the 
>>>>>> debugging gives me.  It looks like the daemon on the non-local node 
>>>>>> never makes contact.  If I step -np back by two, though, it does.
>>>>>> 
>>>>>> Dan
>>>>>> 
>>>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
>>>>>> odls_base_verbose 5 hostname
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating 
>>>>>> nidmap
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>> unpacking data to launch job [49524,1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding 
>>>>>> new jobdat for job [49524,1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>> unpacking 1 app_contexts
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],0] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],1] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],1] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],2] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],3] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],3] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],4] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],5] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],5] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],6] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],7] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],7] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],8] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],9] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],9] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],10] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],11] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],11] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],12] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],13] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],13] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],14] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],15] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],15] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],16] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],17] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],17] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],18] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],19] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],19] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],20] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],21] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],21] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],22] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],23] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],23] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],24] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],25] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],25] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],26] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],27] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],27] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],28] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],29] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],29] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],30] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],31] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],31] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],32] on daemon 1
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> checking proc [[49524,1],33] on daemon 0
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>>>> found proc [[49524,1],33] for me!
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local 
>>>>>> list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors 
>>>>>> for 17 children and locally set oversubscribed to false
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],3]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],5]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],7]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],9]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],11]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],13]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],15]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],17]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],19]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],21]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],23]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],25]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],27]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],29]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],31]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],33]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job 
>>>>>> [49524,1] launch status
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch 
>>>>>> report to myself
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44857 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44858 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44859 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44860 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44861 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44862 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44863 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44865 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44866 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44867 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44869 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44870 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44871 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44872 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44873 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44874 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>> process 44875 terminated
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],33] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],31] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],29] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],27] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],25] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],23] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],21] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],19] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],17] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],15] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],13] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],11] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],9] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],7] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],5] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],3] terminated normally
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file 
>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>>>> [[49524,1],1] terminated normally
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],25]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],15]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],11]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],13]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],19]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],9]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],17]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],31]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],7]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],21]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],5]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],33]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],23]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],3]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],29]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],27]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>> child [[49524,1],1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all 
>>>>>> procs in [49524,1] terminated
>>>>>> ^Cmpirun: killing job...
>>>>>> 
>>>>>> Killed by signal 2.
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> 
>>>>>> 
>>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>>>> Sorry - I forgot that you built from a tarball, and so debug isn't 
>>>>>>> enabled by default. You need to configure --enable-debug.
>>>>>>> 
>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Oddly enough, adding this debugging info lowered the number of 
>>>>>>>> processes that can be used from 46 down to 42.  When I run the MPI 
>>>>>>>> job, it fails, giving only the information that follows:
>>>>>>>> 
>>>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>> compute-2-0,compute-2-1 -v  -np 44 --leave-session-attached -mca 
>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying component 
>>>>>>>> [default]
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of component 
>>>>>>>> [default] set priority to 1
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected component 
>>>>>>>> [default]
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying component 
>>>>>>>> [default]
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of component 
>>>>>>>> [default] set priority to 1
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected component 
>>>>>>>> [default]
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to 
>>>>>>>>> each node (to start the local daemon). The only difference is the 
>>>>>>>>> number of fork/exec's being done on each node, and the number of file 
>>>>>>>>> descriptors being opened to support those fork/exec's.
>>>>>>>>> 
>>>>>>>>> It certainly looks like your limits are high enough. When you say it 
>>>>>>>>> "fails", what do you mean - what error does it report? Try adding:
>>>>>>>>> 
>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>>> 
>>>>>>>>> to your cmd line - this will report all the local proc launch debug 
>>>>>>>>> and hopefully show you a more detailed error report.
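[Editor's note: on the file-descriptor point above, a quick way to watch how many descriptors the launch actually consumes on a node is to read `/proc`. This is a sketch; `mpirun` as the process name to match is taken from the commands in this thread, and `orted` daemons could be matched the same way.]

```shell
# Count the open file descriptors held by each running mpirun process
# and compare against the per-process 'open files' limit (ulimit -n).
# Linux-specific: relies on /proc/<pid>/fd.
for pid in $(pgrep -f mpirun); do
    nfds=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l)
    echo "pid $pid: $nfds open fds (limit: $(ulimit -n))"
done
```

If the count approaches the limit as `-np` grows, that points at the descriptor limit rather than ssh.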
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I have had to cobble together two machines in our Rocks cluster 
>>>>>>>>>> without using the standard installation; they have EFI-only BIOS 
>>>>>>>>>> and Rocks doesn't like that, so this was the only workaround.
>>>>>>>>>> 
>>>>>>>>>> Everything works great now, except for one thing.  MPI jobs 
>>>>>>>>>> (Open MPI or MPICH) fail when started from one of these nodes 
>>>>>>>>>> (via qsub, or by logging in and running the command) if 24 or 
>>>>>>>>>> more processes are needed on another system.  However, if the 
>>>>>>>>>> originator of the MPI job is the head node or any of the 
>>>>>>>>>> preexisting compute nodes, it works fine.  Right now I am 
>>>>>>>>>> guessing at an ssh client or ulimit problem, but I cannot find 
>>>>>>>>>> any difference between the nodes.  Any help would be greatly 
>>>>>>>>>> appreciated.
>>>>>>>>>> 
>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>>> 
>>>>>>>>>> Examples:
>>>>>>>>>> 
>>>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>> 
>>>>>>>>>> This does not work, prints 24 hostnames, all from compute-2-1:
>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>> 
>>>>>>>>>> These both work, printing 64 hostnames from each node:
>>>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>> 
>>>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>>>> scheduling priority             (-e) 0
>>>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>>>> open files                      (-n) 4096
>>>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>>>> real-time priority              (-r) 0
>>>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>>>> max user processes              (-u) 1024
>>>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>>>> file locks                      (-x) unlimited
>>>>>>>>>> 
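[Editor's note: two of the limits above stand out for a 44-rank launch: `open files (-n) 4096` and, especially, `max user processes (-u) 1024`. A rough back-of-the-envelope check is sketched below; the per-rank figures are illustrative assumptions, not measured values.]

```shell
# Rough sanity check of the limits shown above against a planned launch.
# Assumption: each local rank costs one process, plus a handful of
# pipes/descriptors in the daemon that forks it.
ranks_per_node=24                   # e.g. -np 48 split over two nodes
fds_per_rank=4                      # assumed: stdio pipes + control socket
need_fds=$(( ranks_per_node * fds_per_rank + 64 ))  # + daemon headroom

have_fds=$(ulimit -n)
have_procs=$(ulimit -u)

echo "open files: need ~$need_fds, have $have_fds"
echo "user procs: need >$ranks_per_node, have $have_procs"
```

Limits are applied per session, so a limit that looks fine interactively can differ under the non-interactive ssh session mpirun uses to start the remote daemon; running `ssh compute-2-0 'ulimit -a'` from the launching node is worth comparing against the interactive output.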
>>>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>> Host *
>>>>>>>>>>        CheckHostIP             no
>>>>>>>>>>        ForwardX11              yes
>>>>>>>>>>        ForwardAgent            yes
>>>>>>>>>>        StrictHostKeyChecking   no
>>>>>>>>>>        UsePrivilegedPort       no
>>>>>>>>>>        Protocol                2,1
>>>>>>>>>> 
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> us...@open-mpi.org
>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users