?? That was all the output? If so, then something is indeed quite wrong as it didn't even attempt to launch the job.
Try adding -mca plm_base_verbose 5 to the cmd line. I was assuming you were using ssh as the launcher, but I wonder if you are in some managed environment? If so, then it could be that launch from a backend node isn't allowed (e.g., on gridengine).

On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> This looks to be having issues as well, and I cannot get any number of
> processors to give me a different result with the new version.
>
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
> compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca
> odls_base_verbose 5 hostname
> [compute-2-1.local:69417] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69417] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [compute-2-1.local:69417] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [compute-2-0.local:24486] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
> WILDCARD
>
> However from the head node:
>
> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
> compute-2-0,compute-2-1 -v -np 50 hostname
>
> Displays 25 hostnames from each system.
>
> Thank you again for the help so far,
>
> Dan
>
> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>> I will give this a try, but wouldn't that be an issue as well if the process
>> was run on the head node or another node? So long as the MPI job is not
>> started on either of these two nodes, it works fine.
>>
>> Dan
>>
>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>> It must be making contact or ORTE wouldn't be attempting to launch your
>>> application's procs. Looks more like it never received the launch command.
>>> Looking at the code, I suspect you're getting caught in a race condition
>>> that causes the message to get "stuck".
>>>
>>> Just to see if that's the case, you might try running this with the 1.7
>>> release candidate, or even the developer's nightly build. Both use a
>>> different timing mechanism intended to resolve such situations.
>>>
>>>
>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>
>>>> Thank you for the help so far. Here is the information that the debugging
>>>> gives me. Looks like the daemon on the non-local node never makes
>>>> contact. If I step NP back two though, it does.
>>>> >>>> Dan >>>> >>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host >>>> compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca >>>> odls_base_verbose 5 hostname >>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component >>>> [default] >>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component >>>> [default] set priority to 1 >>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component >>>> [default] >>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component >>>> [default] >>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component >>>> [default] set priority to 1 >>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component >>>> [default] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating >>>> nidmap >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list >>>> unpacking data to launch job [49524,1] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding >>>> new jobdat for job [49524,1] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list >>>> unpacking 1 app_contexts >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],0] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],1] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],1] for me! 
>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],2] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],3] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],3] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],4] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],5] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],5] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],6] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],7] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],7] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],8] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],9] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],9] for me! 
>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],10] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],11] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],11] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],12] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],13] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],13] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],14] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],15] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],15] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],16] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],17] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],17] for me! 
>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],18] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],19] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],19] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],20] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],21] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],21] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],22] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],23] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],23] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],24] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],25] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],25] for me! 
>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],26] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],27] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],27] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],28] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],29] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],29] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],30] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],31] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],31] for me! >>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],32] on daemon 1 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> checking proc [[49524,1],33] on daemon 0 >>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>> found proc [[49524,1],33] for me! 
>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local list >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors >>>> for 17 children and locally set oversubscribed to false >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],1] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],3] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],5] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],7] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],9] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],11] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],13] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],15] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],17] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],19] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],21] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],23] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],25] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],27] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],29] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],31] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>> [[49524,1],33] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job >>>> [49524,1] launch status >>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch report >>>> to myself >>>> [compute-2-1.local:44855] 
[[49524,0],0] odls:launch setting waitpids >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44857 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44858 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44859 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44860 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44861 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44862 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44863 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44865 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44866 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44867 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44869 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44870 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44871 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44872 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44873 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44874 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process >>>> 44875 terminated >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired 
child process >>>> [[49524,1],33] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],31] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],29] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],27] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],25] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],23] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],21] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],19] terminated normally >>>> 
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],17] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],15] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],13] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],11] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],9] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],7] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],5] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],3] terminated normally >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort >>>> file /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort >>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>> [[49524,1],1] terminated normally >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> compute-2-1.local >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],25] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],15] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],11] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],13] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],19] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],9] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],17] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],31] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],7] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],21] >>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child >>>> [[49524,1],5] >>>> [compute-2-1.local:44855] [[49524,0],0] 
odls:notify_iof_complete for child
>>>> [[49524,1],33]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child
>>>> [[49524,1],23]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child
>>>> [[49524,1],3]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child
>>>> [[49524,1],29]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child
>>>> [[49524,1],27]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child
>>>> [[49524,1],1]
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all
>>>> procs in [49524,1] terminated
>>>> ^Cmpirun: killing job...
>>>>
>>>> Killed by signal 2.
>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on
>>>> WILDCARD
>>>>
>>>>
>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>> Sorry - I forgot that you built from a tarball, and so debug isn't
>>>>> enabled by default. You need to configure with --enable-debug.
>>>>>
>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>>>
>>>>>> Oddly enough, adding this debugging info lowered the number of
>>>>>> processes that can be used down to 42 from 46. When I run the MPI job, it
>>>>>> fails, giving only the information that follows:
>>>>>>
>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>> compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
>>>>>> odls_base_verbose 5 hostname
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying component
>>>>>> [default]
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of component
>>>>>> [default] set priority to 1
>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected component
>>>>>> [default]
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying component
>>>>>> [default]
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of component
>>>>>> [default] set priority to 1
>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected component
>>>>>> [default]
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>> compute-2-1.local
>>>>>>
>>>>>>
>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to each
>>>>>>> node (to start the local daemon). The only difference is the number of
>>>>>>> fork/exec's being done on each node, and the number of file descriptors
>>>>>>> being opened to support those fork/exec's.
>>>>>>>
>>>>>>> It certainly looks like your limits are high enough. When you say it
>>>>>>> "fails", what do you mean - what error does it report? Try adding:
>>>>>>>
>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>
>>>>>>> to your cmd line - this will report all the local proc launch debug and
>>>>>>> hopefully show you a more detailed error report.
>>>>>>>
>>>>>>>
>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have had to cobble together two machines in our Rocks cluster
>>>>>>>> without using the standard installation. They have EFI-only BIOS on
>>>>>>>> them and Rocks doesn't like that, so this is the only workaround.
>>>>>>>>
>>>>>>>> Everything works great now, except for one thing. MPI jobs (openmpi
>>>>>>>> or mpich) fail when started from one of these nodes (via qsub or by
>>>>>>>> logging in and running the command) if 24 or more processors are
>>>>>>>> needed on another system. However, if the originator of the MPI job is
>>>>>>>> the headnode or any of the preexisting compute nodes, it works fine.
>>>>>>>> Right now I am guessing ssh client or ulimit problems, but I cannot
>>>>>>>> find any difference. Any help would be greatly appreciated.
>>>>>>>>
>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>
>>>>>>>> Examples:
>>>>>>>>
>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>
>>>>>>>> This does not work, prints 24 hostnames for compute-2-1
>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>
>>>>>>>> These both work, print 64 hostnames from each node
>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>
>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>> scheduling priority             (-e) 0
>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>> open files                      (-n) 4096
>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>> real-time priority              (-r) 0
>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>> max user processes              (-u) 1024
>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>> file locks                      (-x) unlimited
>>>>>>>>
>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>> Host *
>>>>>>>>     CheckHostIP no
>>>>>>>>     ForwardX11 yes
>>>>>>>>     ForwardAgent yes
>>>>>>>>     StrictHostKeyChecking no
>>>>>>>>     UsePrivilegedPort no
>>>>>>>>     Protocol 2,1
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> users mailing list
>>>>>>>> us...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
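
For anyone following along, the debug options suggested in this thread can be combined on a single command line. What follows is only a rough sketch, assuming the install path, host names and process count from the quoted messages. The variable names, the build_cmd helper and the RUN guard are illustrative and not part of Open MPI. By default the script just prints the command so it can be checked before running it on a node.

```shell
#!/bin/sh
# Sketch: assemble the diagnostic mpirun command line discussed in this thread.
# MPIRUN, HOSTS and NP default to values quoted in the thread; override them
# in the environment for your own site. These names are hypothetical helpers.
MPIRUN="${MPIRUN:-/home/apps/openmpi-1.7rc5/bin/mpirun}"
HOSTS="${HOSTS:-compute-2-0,compute-2-1}"
NP="${NP:-50}"

build_cmd() {
    # --leave-session-attached keeps the daemons attached so their output
    # reaches the terminal; plm_base_verbose adds launcher debug output and
    # odls_base_verbose adds local process launch debug output.
    printf '%s -host %s -v -np %s --leave-session-attached -mca plm_base_verbose 5 -mca odls_base_verbose 5 hostname\n' \
        "$MPIRUN" "$HOSTS" "$NP"
}

CMD=$(build_cmd)
echo "$CMD"

# Execute only when explicitly requested (RUN=1) and mpirun is present.
if [ "${RUN:-0}" = "1" ] && [ -x "$MPIRUN" ]; then
    $CMD
fi
```

Running it with RUN=1 from the node that fails (compute-2-1 in the messages above) and again from the head node should make it easier to compare the launcher output side by side.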