The 1.7 release candidate looks to be having issues as well, and no
number of processors I try gives me a different result with the new
version.
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:69417] mca:base:select:( odls) Querying component
[default]
[compute-2-1.local:69417] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-1.local:69417] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Querying component
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-0.local:24486] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
WILDCARD
However, from the head node:
[root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun
-host compute-2-0,compute-2-1 -v -np 50 hostname
Displays 25 hostnames from each system.
Thank you again for the help so far,
Dan
On 12/17/2012 08:31 AM, Daniel Davidson wrote:
I will give this a try, but wouldn't that be an issue as well if the
process were run on the head node or another node? So long as the MPI
job is not started on either of these two nodes, it works fine.
Dan
On 12/14/2012 11:46 PM, Ralph Castain wrote:
It must be making contact or ORTE wouldn't be attempting to launch
your application's procs. Looks more like it never received the
launch command. Looking at the code, I suspect you're getting caught
in a race condition that causes the message to get "stuck".
Just to see if that's the case, you might try running this with the
1.7 release candidate, or even the developer's nightly build. Both
use a different timing mechanism intended to resolve such situations.
On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu>
wrote:
Thank you for the help so far. Here is the information that the
debugging gives me. Looks like the daemon on the non-local node
never makes contact. If I step NP back by two, though, it does.
Dan
[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:44855] mca:base:select:( odls) Querying component
[default]
[compute-2-1.local:44855] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-1.local:44855] mca:base:select:( odls) Selected component
[default]
[compute-2-0.local:29282] mca:base:select:( odls) Querying component
[default]
[compute-2-0.local:29282] mca:base:select:( odls) Query of component
[default] set priority to 1
[compute-2-0.local:29282] mca:base:select:( odls) Selected component
[default]
[compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info
updating nidmap
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
unpacking data to launch job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
adding new jobdat for job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
unpacking 1 app_contexts
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],0] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],1] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],1] for me!
[compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],2] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],3] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],3] for me!
[compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],4] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],5] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],5] for me!
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],6] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],7] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],7] for me!
[compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],8] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],9] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],9] for me!
[compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],10] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],11] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],11] for me!
[compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],12] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],13] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],13] for me!
[compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],14] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],15] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],15] for me!
[compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],16] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],17] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],17] for me!
[compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],18] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],19] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],19] for me!
[compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],20] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],21] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],21] for me!
[compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],22] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],23] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],23] for me!
[compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],24] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],25] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],25] for me!
[compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],26] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],27] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],27] for me!
[compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],28] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],29] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],29] for me!
[compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],30] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],31] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],31] for me!
[compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],32] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- checking proc [[49524,1],33] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
- found proc [[49524,1],33] for me!
[compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my
local list
[compute-2-1.local:44855] [[49524,0],0] odls:launch found 384
processors for 17 children and locally set oversubscribed to false
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child
[[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job
[49524,1] launch status
[compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch
report to myself
[compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44857 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44858 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44859 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44860 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44861 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44862 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44863 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44865 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44866 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44867 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44869 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44870 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44871 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44872 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44873 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44874 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child
process 44875 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],33] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],31] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],29] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],27] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],25] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],23] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],21] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],19] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],17] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],15] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],13] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],11] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],9] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],7] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],5] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],3] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking
abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child
process [[49524,1],1] terminated normally
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for
child [[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting
all procs in [49524,1] terminated
^Cmpirun: killing job...
Killed by signal 2.
[compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working
on WILDCARD
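One way to summarize a verbose log like the one above, offered as a rough sketch and not as an Open MPI tool: tally how many procs were mapped to each daemon and which ranks the logging daemon claimed for itself. The regular expressions assume the odls_base_verbose line format quoted in this thread, and the function name tally_odls_log is hypothetical.

```python
import re
from collections import Counter

# Patterns assume the odls_base_verbose output format quoted in this thread.
CHECKING = re.compile(r"checking proc \[\[\d+,\d+\],(\d+)\] on daemon (\d+)")
FOUND = re.compile(r"found proc \[\[\d+,\d+\],(\d+)\] for me!")

def tally_odls_log(text):
    """Return (procs mapped per daemon, ranks claimed by the logging daemon)."""
    per_daemon = Counter(int(m.group(2)) for m in CHECKING.finditer(text))
    claimed = sorted(int(m.group(1)) for m in FOUND.finditer(text))
    return per_daemon, claimed

# A few lines in the same shape as the log above, for illustration only.
sample = """\
- checking proc [[49524,1],0] on daemon 1
- checking proc [[49524,1],1] on daemon 0
- found proc [[49524,1],1] for me!
- checking proc [[49524,1],2] on daemon 1
- checking proc [[49524,1],3] on daemon 0
- found proc [[49524,1],3] for me!
"""
per_daemon, claimed = tally_odls_log(sample)
print(dict(per_daemon), claimed)
```

Run over the full -np 34 log, this should report 17 procs mapped to each daemon, while only daemon 0 (on compute-2-1) logs any launches. That would be consistent with daemon 1 never acting on the launch command.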
On 12/14/2012 04:11 PM, Ralph Castain wrote:
Sorry - I forgot that you built from a tarball, and so debug isn't
enabled by default. You need to configure with --enable-debug.
On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu>
wrote:
Oddly enough, adding this debugging info lowered the number of
processes that can be used from 46 down to 42. When I run the
MPI job, it fails, giving only the information that follows:
[root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
odls_base_verbose 5 hostname
[compute-2-1.local:44374] mca:base:select:( odls) Querying
component [default]
[compute-2-1.local:44374] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-1.local:44374] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Querying
component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-0.local:28950] mca:base:select:( odls) Selected
component [default]
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
On 12/14/2012 03:18 PM, Ralph Castain wrote:
It wouldn't be ssh - in both cases, only one ssh is being done to
each node (to start the local daemon). The only difference is the
number of fork/exec's being done on each node, and the number of
file descriptors being opened to support those fork/exec's.
It certainly looks like your limits are high enough. When you say
it "fails", what do you mean - what error does it report? Try
adding:
--leave-session-attached -mca odls_base_verbose 5
to your cmd line - this will report all the local proc launch
debug and hopefully show you a more detailed error report.
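Since the difference comes down to fork/exec's and file descriptors, it may help to print the limits a process actually inherits instead of relying on an interactive `ulimit -a`. The sketch below is only an illustration using Python's standard `resource` module; the labels and the function names `show` and `current_limits` are made up for this example.

```python
import resource

# Limits most relevant to forking many local procs and opening their pipes.
LIMITS = {
    "open files (-n)": resource.RLIMIT_NOFILE,
    "max user processes (-u)": resource.RLIMIT_NPROC,
}

def show(value):
    """Render a limit the way ulimit does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)

def current_limits():
    """Return {label: (soft, hard)} as seen by this process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}

if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={show(soft)} hard={show(hard)}")
```

Saved under a hypothetical name such as limits.py, it could be run on each node through a non-interactive ssh command (for example `ssh compute-2-0 python limits.py`). Limits in a non-interactive ssh session can differ from those of a login shell, so that is the closer comparison for the daemon.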
On Dec 14, 2012, at 12:29 PM, Daniel Davidson
<dani...@igb.uiuc.edu> wrote:
I have had to cobble together two machines in our Rocks cluster
without using the standard installation. They have EFI-only BIOS
on them and Rocks doesn't like that, so it is the only workaround.
Everything works great now, except for one thing. MPI jobs
(openmpi or mpich) fail when started from one of these nodes
(via qsub or by logging in and running the command) if 24 or
more processors are needed on another system. However, if the
originator of the MPI job is the head node or any of the
preexisting compute nodes, it works fine. Right now I am
guessing ssh client or ulimit problems, but I cannot find any
difference. Any help would be greatly appreciated.
compute-2-1 and compute-2-0 are the new nodes.
Examples:
This works, prints 23 hostnames from each machine:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 46 hostname
This does not work; it prints 24 hostnames for compute-2-1:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 48 hostname
These both work and print 64 hostnames from each node:
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 128 hostname
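The examples above find the threshold by hand. A small script could step through process counts and count hostnames per node, treating a run that does not finish in time as hung. This is a sketch under assumptions: the mpirun path, hosts, and flags are the ones used in this thread, while the 30-second timeout, the sweep range, and the function names are arbitrary choices for illustration.

```python
import subprocess
from collections import Counter

MPIRUN = "/home/apps/openmpi-1.6.3/bin/mpirun"  # path taken from the thread
HOSTS = "compute-2-0,compute-2-1"               # the two new nodes

def count_hosts(output):
    """Count hostname lines, skipping blank lines and bracketed debug lines."""
    lines = (line.strip() for line in output.splitlines())
    return Counter(l for l in lines if l and not l.startswith("["))

def run_once(np, timeout=30):
    """Run `hostname` on np procs; return (finished, per-host counts)."""
    cmd = [MPIRUN, "-host", HOSTS, "-np", str(np), "hostname"]
    try:
        done = subprocess.run(cmd, stdout=subprocess.PIPE, timeout=timeout)
        return True, count_hosts(done.stdout.decode(errors="replace"))
    except subprocess.TimeoutExpired as exc:
        # Keep whatever was printed before the run was killed.
        partial = (exc.stdout or b"").decode(errors="replace")
        return False, count_hosts(partial)

def sweep(start=40, stop=50, step=2):
    """Print one summary line per process count (assumed range)."""
    for np in range(start, stop + 1, step):
        finished, counts = run_once(np)
        print(np, "ok" if finished else "hung", dict(counts))
```

Calling `sweep()` from one of the new nodes and again from the head node would give a side-by-side view of where the launch stops completing.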
[root@compute-2-1 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16410016
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
CheckHostIP no
ForwardX11 yes
ForwardAgent yes
StrictHostKeyChecking no
UsePrivilegedPort no
Protocol 2,1
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users