?? That was all the output? If so, then something is indeed quite wrong as it 
didn't even attempt to launch the job.

Try adding -mca plm_base_verbose 5 to the cmd line.
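
Something along these lines is what I have in mind. It is only a sketch: the 
mpirun path, host list and -np value are copied from your last message, and 
the RUN variable is a made-up switch for this example (not an Open MPI 
option), so the script just prints the command unless you set RUN=1.

```shell
#!/bin/sh
# Sketch only: path, hosts and -np are taken from the earlier message.
MPIRUN=/home/apps/openmpi-1.7rc5/bin/mpirun
HOSTS=compute-2-0,compute-2-1

# Same command line as before, with plm (launcher) verbosity added
# alongside the existing odls verbosity.
CMD="$MPIRUN -host $HOSTS -v -np 50 --leave-session-attached"
CMD="$CMD -mca odls_base_verbose 5 -mca plm_base_verbose 5 hostname"

# RUN is a hypothetical switch for this sketch, not an Open MPI option.
if [ "${RUN:-0}" = "1" ]; then
    # Unquoted on purpose so the string splits back into arguments;
    # fine here because none of the arguments contain spaces.
    $CMD
else
    echo "$CMD"
fi
```

The extra output should show which launcher component gets selected and 
whether the daemon launch is attempted at all.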

I was assuming you were using ssh as the launcher, but I wonder if you are in 
some managed environment? If so, then it could be that launch from a backend 
node isn't allowed (e.g., on gridengine).
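
A quick way to check is to look at the environment in the shell you launch 
from. The sketch below is a rough guess, not a definitive test: it assumes 
the usual job variables (PE_HOSTFILE and JOB_ID for gridengine, PBS_JOBID for 
Torque, SLURM_JOB_ID for Slurm), and the function name is invented for this 
example.

```shell
#!/bin/sh
# Rough guess at whether this shell is running inside a resource
# manager's job. guess_managed_env is a made-up name for this sketch.
guess_managed_env() {
    if [ -n "${PE_HOSTFILE:-}" ] && [ -n "${JOB_ID:-}" ]; then
        echo gridengine
    elif [ -n "${PBS_JOBID:-}" ]; then
        echo torque
    elif [ -n "${SLURM_JOB_ID:-}" ]; then
        echo slurm
    else
        # None of the job variables are set: most likely a plain login
        # shell, in which case ssh would be the launcher.
        echo none
    fi
}

guess_managed_env
```

If it reports anything other than "none" on compute-2-1, the launch is 
probably going through the resource manager rather than ssh.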

On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> This looks to be having issues as well, and I cannot get any number of 
> processors to give me a different result with the new version.
> 
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
> odls_base_verbose 5 hostname
> [compute-2-1.local:69417] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69417] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:69417] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-0.local:24486] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
> WILDCARD
> 
> However from the head node:
> 
> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 50  hostname
> 
> Displays 25 hostnames from each system.
> 
> Thank you again for the help so far,
> 
> Dan
> 
> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>> I will give this a try, but wouldn't that be an issue as well if the process 
>> was run on the head node or another node?  So long as the mpi job is not 
>> started on either of these two nodes, it works fine.
>> 
>> Dan
>> 
>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>> It must be making contact or ORTE wouldn't be attempting to launch your 
>>> application's procs. Looks more like it never received the launch command. 
>>> Looking at the code, I suspect you're getting caught in a race condition 
>>> that causes the message to get "stuck".
>>> 
>>> Just to see if that's the case, you might try running this with the 1.7 
>>> release candidate, or even the developer's nightly build. Both use a 
>>> different timing mechanism intended to resolve such situations.
>>> 
>>> 
>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>> 
>>>> Thank you for the help so far.  Here is the information that the debugging 
>>>> gives me.  Looks like the daemon on the non-local node never makes 
>>>> contact.  If I step NP back by two, though, it does.
>>>> 
>>>> Dan
>>>> 
>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
>>>> odls_base_verbose 5 hostname
>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component 
>>>> [default]
>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
>>>> [default] set priority to 1
>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component 
>>>> [default]
>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component 
>>>> [default]
>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
>>>> [default] set priority to 1
>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component 
>>>> [default]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating 
>>>> nidmap
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>> unpacking data to launch job [49524,1]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding 
>>>> new jobdat for job [49524,1]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>> unpacking 1 app_contexts
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],0] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],1] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],1] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],2] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],3] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],3] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],4] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],5] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],5] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],6] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],7] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],7] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],8] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],9] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],9] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],10] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],11] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],11] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],12] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],13] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],13] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],14] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],15] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],15] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],16] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],17] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],17] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],18] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],19] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],19] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],20] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],21] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],21] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],22] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],23] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],23] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],24] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],25] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],25] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],26] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],27] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],27] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],28] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],29] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],29] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],30] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],31] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],31] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],32] on daemon 1
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> checking proc [[49524,1],33] on daemon 0
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
>>>> found proc [[49524,1],33] for me!
>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local list
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors 
>>>> for 17 children and locally set oversubscribed to false
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],1]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],3]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],5]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],7]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],9]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],11]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],13]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],15]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],17]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],19]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],21]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],23]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],25]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],27]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],29]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],31]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>> [[49524,1],33]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job 
>>>> [49524,1] launch status
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch report 
>>>> to myself
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44857 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44858 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44859 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44860 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44861 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44862 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44863 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44865 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44866 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44867 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44869 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44870 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44871 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44872 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44873 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44874 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
>>>> 44875 terminated
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],33] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],31] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],29] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],27] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],25] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],23] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],21] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],19] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],17] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],15] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],13] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],11] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],9] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],7] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],5] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],3] terminated normally
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort 
>>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
>>>> [[49524,1],1] terminated normally
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> compute-2-1.local
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],25]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],15]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],11]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],13]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],19]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],9]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],17]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],31]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],7]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],21]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],5]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],33]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],23]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],3]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],29]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],27]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
>>>> [[49524,1],1]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all 
>>>> procs in [49524,1] terminated
>>>> ^Cmpirun: killing job...
>>>> 
>>>> Killed by signal 2.
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on 
>>>> WILDCARD
>>>> 
>>>> 
>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>> Sorry - I forgot that you built from a tarball, and so debug isn't 
>>>>> enabled by default. You need to configure --enable-debug.
>>>>> 
>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>>> 
>>>>>> Oddly enough, adding this debugging info lowered the number of 
>>>>>> processes that can be used from 46 down to 42.  When I run the MPI job, 
>>>>>> it fails, giving only the information that follows:
>>>>>> 
>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>> compute-2-0,compute-2-1 -v  -np 44 --leave-session-attached -mca 
>>>>>> odls_base_verbose 5 hostname
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> 
>>>>>> 
>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to each 
>>>>>>> node (to start the local daemon). The only difference is the number of 
>>>>>>> fork/exec's being done on each node, and the number of file descriptors 
>>>>>>> being opened to support those fork/exec's.
>>>>>>> 
>>>>>>> It certainly looks like your limits are high enough. When you say it 
>>>>>>> "fails", what do you mean - what error does it report? Try adding:
>>>>>>> 
>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>> 
>>>>>>> to your cmd line - this will report all the local proc launch debug and 
>>>>>>> hopefully show you a more detailed error report.
>>>>>>> 
>>>>>>> 
>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I have had to cobble together two machines in our Rocks cluster 
>>>>>>>> without using the standard installation.  They have EFI-only BIOS, 
>>>>>>>> which Rocks doesn't like, so this is the only workaround.
>>>>>>>> 
>>>>>>>> Everything works great now, except for one thing.  MPI jobs (openmpi 
>>>>>>>> or mpich) fail when started from one of these nodes (via qsub or by 
>>>>>>>> logging in and running the command) if 24 or more processors are 
>>>>>>>> needed on another system.  However if the originator of the MPI job is 
>>>>>>>> the headnode or any of the preexisting compute nodes, it works fine.  
>>>>>>>> Right now I am guessing ssh client or ulimit problems, but I cannot 
>>>>>>>> find any difference.  Any help would be greatly appreciated.
>>>>>>>> 
>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>> 
>>>>>>>> Examples:
>>>>>>>> 
>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>> 
>>>>>>>> This does not work, prints 24 hostnames for compute-2-1
>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>> 
>>>>>>>> These both work, print 64 hostnames from each node
>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>> 
>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>> scheduling priority             (-e) 0
>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>> open files                      (-n) 4096
>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>> real-time priority              (-r) 0
>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>> max user processes              (-u) 1024
>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>> file locks                      (-x) unlimited
>>>>>>>> 
>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>> Host *
>>>>>>>>        CheckHostIP             no
>>>>>>>>        ForwardX11              yes
>>>>>>>>        ForwardAgent            yes
>>>>>>>>        StrictHostKeyChecking   no
>>>>>>>>        UsePrivilegedPort       no
>>>>>>>>        Protocol                2,1
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>>> 
>> 
> 

