These nodes have not yet been locked down to prevent jobs from being
launched from the backend, at least not on purpose. The added
logging returns the information below:
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:69655] mca:base:select:( plm) Querying
component [rsh]
[compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on
agent ssh : rsh path NULL
[compute-2-1.local:69655] mca:base:select:( plm) Query of
component [rsh] set priority to 10
[compute-2-1.local:69655] mca:base:select:( plm) Querying
component [slurm]
[compute-2-1.local:69655] mca:base:select:( plm) Skipping
component [slurm]. Query failed to return a module
[compute-2-1.local:69655] mca:base:select:( plm) Querying
component [tm]
[compute-2-1.local:69655] mca:base:select:( plm) Skipping
component [tm]. Query failed to return a module
[compute-2-1.local:69655] mca:base:select:( plm) Selected
component [rsh]
[compute-2-1.local:69655] plm:base:set_hnp_name: initial bias
69655 nodename hash 3634869988
[compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
[compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent
ssh : rsh path NULL
[compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
[compute-2-1.local:69655] mca:base:select:( odls) Querying
component [default]
[compute-2-1.local:69655] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-1.local:69655] mca:base:select:( odls) Selected
component [default]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
creating map
[compute-2-1.local:69655] [[32341,0],0] setup:vm: working
unmanaged allocation
[compute-2-1.local:69655] [[32341,0],0] using dash_host
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
[compute-2-1.local:69655] [[32341,0],0] checking node
compute-2-1.local
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new
daemon [[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
assigning new daemon [[32341,0],1] to node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0
(bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same
remote shell as local shell
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0
(bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template
argv:
/usr/bin/ssh <template>
PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ;
export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted
-mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid
<template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314"
-mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca
odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca
orte_leave_session_attached 1
[compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0
not a child of mine
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node
compute-2-0 to launch list
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating
launch event
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch
of daemon [[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing:
(//usr/bin/ssh) [/usr/bin/ssh compute-2-0
PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ;
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ;
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ;
export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted
-mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1
-mca orte_ess_num_procs 2 -mca orte_hnp_uri
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314"
-mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca
odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca
orte_leave_session_attached 1]
Warning: untrusted X11 forwarding setup failed: xauth key data
not generated
Warning: No xauth data; using fake authentication data for X11
forwarding.
[compute-2-0.local:24659] mca:base:select:( plm) Querying
component [rsh]
[compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent
ssh : rsh path NULL
[compute-2-0.local:24659] mca:base:select:( plm) Query of
component [rsh] set priority to 10
[compute-2-0.local:24659] mca:base:select:( plm) Selected
component [rsh]
[compute-2-0.local:24659] mca:base:select:( odls) Querying
component [default]
[compute-2-0.local:24659] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-0.local:24659] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent
ssh : rsh path NULL
[compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
On 12/17/2012 10:37 AM, Ralph Castain wrote:
?? That was all the output? If so, then something is indeed
quite wrong as it didn't even attempt to launch the job.
Try adding -mca plm_base_verbose 5 to the cmd line.
I was assuming you were using ssh as the launcher, but I wonder
if you are in some managed environment? If so, then it could be
that launch from a backend node isn't allowed (e.g., on
gridengine).
On Dec 17, 2012, at 8:28 AM, Daniel Davidson
<dani...@igb.uiuc.edu> wrote:
This looks to be having issues as well, and I cannot get any
number of processes to give me a different result with the new
version.
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun
-host compute-2-0,compute-2-1 -v -np 50
--leave-session-attached -mca odls_base_verbose 5 hostname
[compute-2-1.local:69417] mca:base:select:( odls) Querying
component [default]
[compute-2-1.local:69417] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-1.local:69417] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:24486] mca:base:select:( odls) Querying
component [default]
[compute-2-0.local:24486] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-0.local:24486] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
working on WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
working on WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc
working on WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc
working on WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc
working on WILDCARD
However, from the head node:
[root@biocluster openmpi-1.7rc5]#
/home/apps/openmpi-1.7rc5/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 50 hostname
Displays 25 hostnames from each system.
Thank you again for the help so far,
Dan
On 12/17/2012 08:31 AM, Daniel Davidson wrote:
I will give this a try, but wouldn't that be an issue as well
if the process was run on the head node or another node? So
long as the MPI job is not started on either of these two
nodes, it works fine.
Dan
On 12/14/2012 11:46 PM, Ralph Castain wrote:
It must be making contact or ORTE wouldn't be attempting to
launch your application's procs. Looks more like it never
received the launch command. Looking at the code, I suspect
you're getting caught in a race condition that causes the
message to get "stuck".
Just to see if that's the case, you might try running this
with the 1.7 release candidate, or even the developer's
nightly build. Both use a different timing mechanism intended
to resolve such situations.
On Dec 14, 2012, at 2:49 PM, Daniel Davidson
<dani...@igb.uiuc.edu> wrote:
Thank you for the help so far. Here is the information that
the debugging gives me. It looks like the daemon on the
non-local node never makes contact. If I step -np back by two,
though, it does.
Dan
[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun
-host compute-2-0,compute-2-1 -v -np 34
--leave-session-attached -mca odls_base_verbose 5 hostname
[compute-2-1.local:44855] mca:base:select:( odls) Querying
component [default]
[compute-2-1.local:44855] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-1.local:44855] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Querying
component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-0.local:29282] mca:base:select:( odls) Selected
component [default]
[compute-2-1.local:44855] [[49524,0],0]
odls:update:daemon:info updating nidmap
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list
[compute-2-1.local:44855] [[49524,0],0]
odls:construct_child_list unpacking data to launch job
[49524,1]
[compute-2-1.local:44855] [[49524,0],0]
odls:construct_child_list adding new jobdat for job [49524,1]
[compute-2-1.local:44855] [[49524,0],0]
odls:construct_child_list unpacking 1 app_contexts
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],0] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],1] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],1] for me!
[compute-2-1.local:44855] adding proc [[49524,1],1] (1) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],2] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],3] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],3] for me!
[compute-2-1.local:44855] adding proc [[49524,1],3] (3) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],4] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],5] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],5] for me!
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],6] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],7] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],7] for me!
[compute-2-1.local:44855] adding proc [[49524,1],7] (7) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],8] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],9] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],9] for me!
[compute-2-1.local:44855] adding proc [[49524,1],9] (9) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],10] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],11] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],11] for me!
[compute-2-1.local:44855] adding proc [[49524,1],11] (11) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],12] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],13] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],13] for me!
[compute-2-1.local:44855] adding proc [[49524,1],13] (13) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],14] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],15] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],15] for me!
[compute-2-1.local:44855] adding proc [[49524,1],15] (15) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],16] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],17] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],17] for me!
[compute-2-1.local:44855] adding proc [[49524,1],17] (17) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],18] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],19] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],19] for me!
[compute-2-1.local:44855] adding proc [[49524,1],19] (19) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],20] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],21] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],21] for me!
[compute-2-1.local:44855] adding proc [[49524,1],21] (21) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],22] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],23] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],23] for me!
[compute-2-1.local:44855] adding proc [[49524,1],23] (23) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],24] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],25] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],25] for me!
[compute-2-1.local:44855] adding proc [[49524,1],25] (25) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],26] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],27] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],27] for me!
[compute-2-1.local:44855] adding proc [[49524,1],27] (27) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],28] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],29] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],29] for me!
[compute-2-1.local:44855] adding proc [[49524,1],29] (29) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],30] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],31] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],31] for me!
[compute-2-1.local:44855] adding proc [[49524,1],31] (31) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],32] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - checking proc [[49524,1],33] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing
child list - found proc [[49524,1],33] for me!
[compute-2-1.local:44855] adding proc [[49524,1],33] (33) to
my local list
[compute-2-1.local:44855] [[49524,0],0] odls:launch found
384 processors for 17 children and locally set
oversubscribed to false
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working
child [[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0] odls:launch
reporting job [49524,1] launch status
[compute-2-1.local:44855] [[49524,0],0] odls:launch flagging
launch report to myself
[compute-2-1.local:44855] [[49524,0],0] odls:launch setting
waitpids
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44857 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44858 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44859 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44860 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44861 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44862 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44863 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44865 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44866 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44867 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44869 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44870 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44871 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44872 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44873 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44874 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc
child process 44875 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],33] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],31] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],29] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],27] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],25] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],23] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],21] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],19] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],17] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],15] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],13] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],11] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],9] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],7] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],5] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],3] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
checking abort file
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired
child process [[49524,1],1] terminated normally
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0]
odls:notify_iof_complete for child [[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:proc_complete
reporting all procs in [49524,1] terminated
^Cmpirun: killing job...
Killed by signal 2.
[compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc
working on WILDCARD
On 12/14/2012 04:11 PM, Ralph Castain wrote:
Sorry - I forgot that you built from a tarball, and so
debug isn't enabled by default. You need to configure
--enable-debug.
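For reference, the rebuild being suggested would look roughly like the
sketch below. This is illustrative, not the poster's exact build line;
the prefix mirrors the /home/apps install paths used elsewhere in this
thread.

```shell
# Sketch: the reconfigure step for a debug-enabled build of the
# 1.7rc5 tarball. PREFIX is an assumption taken from the install
# paths seen in this thread.
PREFIX=/home/apps/openmpi-1.7rc5
cmd="./configure --prefix=$PREFIX --enable-debug && make all install"
echo "$cmd"
```

With --enable-debug compiled in, the -mca ..._verbose parameters used
later in the thread actually produce output.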
On Dec 14, 2012, at 1:52 PM, Daniel Davidson
<dani...@igb.uiuc.edu> wrote:
Oddly enough, adding this debugging info lowered the number of
processes that can be used from 46 down to 42. When I run the
MPI job, it fails, giving only the information that follows:
[root@compute-2-1 ssh]#
/home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -v -np 44
--leave-session-attached -mca odls_base_verbose 5 hostname
[compute-2-1.local:44374] mca:base:select:( odls) Querying
component [default]
[compute-2-1.local:44374] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-1.local:44374] mca:base:select:( odls) Selected
component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Querying
component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Query of
component [default] set priority to 1
[compute-2-0.local:28950] mca:base:select:( odls) Selected
component [default]
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
On 12/14/2012 03:18 PM, Ralph Castain wrote:
It wouldn't be ssh - in both cases, only one ssh is being
done to each node (to start the local daemon). The only
difference is the number of fork/exec's being done on
each node, and the number of file descriptors being
opened to support those fork/exec's.
It certainly looks like your limits are high enough. When
you say it "fails", what do you mean - what error does it
report? Try adding:
--leave-session-attached -mca odls_base_verbose 5
to your cmd line - this will report all the local proc
launch debug and hopefully show you a more detailed error
report.
On Dec 14, 2012, at 12:29 PM, Daniel Davidson
<dani...@igb.uiuc.edu> wrote:
I have had to cobble together two machines in our Rocks
cluster without using the standard installation; they have an
EFI-only BIOS, which Rocks doesn't like, so this was the only
workaround.
Everything works great now, except for one thing. MPI jobs
(Open MPI or MPICH) fail when started from one of these nodes
(via qsub, or by logging in and running the command) if 24 or
more processes are needed on another system. However, if the
originator of the MPI job is the head node or any of the
preexisting compute nodes, it works fine. Right now I am
guessing ssh client or ulimit problems, but I cannot find any
difference. Any help would be greatly appreciated.
compute-2-1 and compute-2-0 are the new nodes
Examples:
This works, prints 23 hostnames from each machine:
[root@compute-2-1 ~]#
/home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 46 hostname
This does not work; prints 24 hostnames for compute-2-1:
[root@compute-2-1 ~]#
/home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 48 hostname
These both work, print 64 hostnames from each node
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun
-host compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]#
/home/apps/openmpi-1.6.3/bin/mpirun -host
compute-2-0,compute-2-1 -np 128 hostname
[root@compute-2-1 ~]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16410016
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
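Two of those limits look modest next to the rest: open files at 4096
and max user processes at 1024. A quick way to rule either in or out
would be to raise both soft limits in the launching shell and re-run
the failing command; 65536 below is an arbitrary "plenty" value, not a
recommendation from the thread, and whether the raise succeeds depends
on the hard limits.

```shell
# Sketch: bump the two soft limits that stand out in the ulimit -a
# output above, then re-run the failing launch from the same shell.
ulimit -n 65536 2>/dev/null || echo "open-files raise refused (hard limit?)"
ulimit -u 65536 2>/dev/null || echo "user-process raise refused (hard limit?)"
echo "open files now: $(ulimit -n)"
echo "user procs now: $(ulimit -u)"
# /home/apps/openmpi-1.6.3/bin/mpirun -host compute-2-0,compute-2-1 -np 48 hostname
```

If the failure threshold moves after raising the limits, that points at
resource exhaustion rather than the launch path itself.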
[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
CheckHostIP no
ForwardX11 yes
ForwardAgent yes
StrictHostKeyChecking no
UsePrivilegedPort no
Protocol 2,1
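One small thing in that config: ForwardX11 yes is what produces the
xauth warnings seen in the verbose output above. It is probably not the
root cause, but it can be ruled out by telling the rsh launcher to use
ssh without X11 forwarding. The sketch below only builds and prints the
command; plm_rsh_agent is the MCA parameter that selects the remote
startup agent in these Open MPI versions, and -x is the standard ssh
flag that disables X11 forwarding.

```shell
# Sketch: the failing launch from the thread, with an ssh agent
# string that disables X11 forwarding (-x).
MPIRUN=/home/apps/openmpi-1.6.3/bin/mpirun
AGENT="ssh -x"   # -x: no X11 forwarding when starting the remote orted
echo "$MPIRUN" -mca plm_rsh_agent "$AGENT" \
     -host compute-2-0,compute-2-1 -np 48 hostname
```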
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users