Hmmm...and that is ALL the output? If so, then it never succeeded in sending a message back, which leads one to suspect some kind of firewall in the way.
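The firewall hypothesis is cheap to test directly. A minimal sketch, not part of Open MPI: it assumes bash's `/dev/tcp` redirection and the coreutils `timeout` command are available on the nodes. The address and port below are taken from the `orte_hnp_uri` in the log further down; note the orted callback port changes on every `mpirun` invocation, so a real check would probe whatever port the current run reports (or simply confirm iptables is not filtering the private interface).

```shell
#!/usr/bin/env bash
# Hedged sketch of a point-to-point TCP probe -- NOT an Open MPI tool.
# Assumes bash's /dev/tcp redirection and the coreutils `timeout` command.
probe_tcp() {
  local host=$1 port=$2
  # Succeeds (exit 0) only if a TCP connection can be opened within 3s.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Host and port taken from the orte_hnp_uri in the log below
# (tcp://10.1.255.226:46314); the port is different on each run.
if probe_tcp 10.1.255.226 46314; then
  echo "reachable"
else
  echo "blocked, closed, or unreachable"
fi
```

Run from compute-2-0 while an mpirun is hung on compute-2-1, this distinguishes "daemon cannot call back" from "daemon called back but the launch message got lost".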
Looking at the ssh line, we are going to attempt to send a message from node compute-2-0 to node compute-2-1 on the 10.1.255.226 address. Is that going to work? Anything preventing it?

On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

> These nodes have not been locked down yet so that jobs cannot be launched
> from the backend, at least on purpose anyway. The added logging returns the
> information below:
>
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
> compute-2-0,compute-2-1 -v -np 10 --leave-session-attached -mca
> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
> [compute-2-1.local:69655] mca:base:select:( plm) Querying component [rsh]
> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh :
> rsh path NULL
> [compute-2-1.local:69655] mca:base:select:( plm) Query of component [rsh]
> set priority to 10
> [compute-2-1.local:69655] mca:base:select:( plm) Querying component [slurm]
> [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [slurm].
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:( plm) Querying component [tm]
> [compute-2-1.local:69655] mca:base:select:( plm) Skipping component [tm].
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:( plm) Selected component [rsh]
> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename
> hash 3634869988
> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path
> NULL
> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
> [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69655] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation
> [compute-2-1.local:69655] [[32341,0],0] using dash_host
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new
> daemon [[32341,0],1] to node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell
> as local shell
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
> /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ;
> export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH
> ; export LD_LIBRARY_PATH ;
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export
> DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca
> orte_ess_jobid 2119499776 -mca orte_ess_vpid <template> -mca
> orte_ess_num_procs 2 -mca orte_hnp_uri
> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca
> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5
> -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child
> of mine
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to
> launch list
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh)
> [/usr/bin/ssh compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export
> PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ;
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export
> DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca
> orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca
> orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314"
> -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose
> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]
> Warning: untrusted X11 forwarding setup failed: xauth key data not generated
> Warning: No xauth data; using fake authentication data for X11 forwarding.
> [compute-2-0.local:24659] mca:base:select:( plm) Querying component [rsh]
> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : rsh
> path NULL
> [compute-2-0.local:24659] mca:base:select:( plm) Query of component [rsh]
> set priority to 10
> [compute-2-0.local:24659] mca:base:select:( plm) Selected component [rsh]
> [compute-2-0.local:24659] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:24659] mca:base:select:( odls) Query of component
> [default] set priority to 1
> [compute-2-0.local:24659] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : rsh path
> NULL
> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
>
> On 12/17/2012 10:37 AM, Ralph Castain wrote:
>> ?? That was all the output? If so, then something is indeed quite wrong as
>> it didn't even attempt to launch the job.
>>
>> Try adding -mca plm_base_verbose 5 to the cmd line.
>>
>> I was assuming you were using ssh as the launcher, but I wonder if you are
>> in some managed environment? If so, then it could be that launch from a
>> backend node isn't allowed (e.g., on gridengine).
>>
>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>
>>> This looks to be having issues as well, and I cannot get any number of
>>> processors to give me a different result with the new version.
>>>
>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host
>>> compute-2-0,compute-2-1 -v -np 50 --leave-session-attached -mca
>>> odls_base_verbose 5 hostname
>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying component
>>> [default]
>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of component
>>> [default] set priority to 1
>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected component
>>> [default]
>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying component
>>> [default]
>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of component
>>> [default] set priority to 1
>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected component
>>> [default]
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
>>> WILDCARD
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
>>> WILDCARD
>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on
>>> WILDCARD
>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
>>> WILDCARD
>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on
>>> WILDCARD
>>>
>>> However from the head node:
>>>
>>> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun
>>> -host compute-2-0,compute-2-1 -v -np 50 hostname
>>>
>>> Displays 25 hostnames from each system.
>>>
>>> Thank you again for the help so far,
>>>
>>> Dan
>>>
>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>>>> I will give this a try, but wouldn't that be an issue as well if the
>>>> process was run on the head node or another node? So long as the mpi job
>>>> is not started on either of these two nodes, it works fine.
>>>>
>>>> Dan
>>>>
>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>>>> It must be making contact or ORTE wouldn't be attempting to launch your
>>>>> application's procs. Looks more like it never received the launch
>>>>> command.
Looking at the code, I suspect you're getting caught in a race
>>>>> condition that causes the message to get "stuck".
>>>>>
>>>>> Just to see if that's the case, you might try running this with the 1.7
>>>>> release candidate, or even the developer's nightly build. Both use a
>>>>> different timing mechanism intended to resolve such situations.
>>>>>
>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>>>
>>>>>> Thank you for the help so far. Here is the information that the
>>>>>> debugging gives me. Looks like the daemon on the non-local node
>>>>>> never makes contact. If I step NP back two though, it does.
>>>>>>
>>>>>> Dan
>>>>>>
>>>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>> compute-2-0,compute-2-1 -v -np 34 --leave-session-attached -mca
>>>>>> odls_base_verbose 5 hostname
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component
>>>>>> [default]
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component
>>>>>> [default] set priority to 1
>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component
>>>>>> [default]
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component
>>>>>> [default]
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component
>>>>>> [default] set priority to 1
>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component
>>>>>> [default]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating
>>>>>> nidmap
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
>>>>>> unpacking data to launch job [49524,1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding
>>>>>> new jobdat for job [49524,1]
>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list
>>>>>> unpacking 1 app_contexts
>>>>>>
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],0] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],1] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],1] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],2] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],3] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],3] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],4] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],5] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],5] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],6] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],7] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],7] for me! 
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],8] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],9] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],9] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],10] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],11] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],11] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],12] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],13] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],13] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],14] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],15] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],15] for me! 
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],16] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],17] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],17] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],18] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],19] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],19] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],20] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],21] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],21] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],22] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],23] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],23] for me! 
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],24] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],25] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],25] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],26] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],27] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],27] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],28] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],29] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],29] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],30] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],31] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],31] for me! 
>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],32] on daemon 1 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> checking proc [[49524,1],33] on daemon 0 >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - >>>>>> found proc [[49524,1],33] for me! >>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local >>>>>> list >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors >>>>>> for 17 children and locally set oversubscribed to false >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],1] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],3] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],5] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],7] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],9] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],11] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],13] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],15] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],17] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],19] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],21] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],23] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],25] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>> [[49524,1],27] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],29] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],31] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child >>>>>> [[49524,1],33] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job >>>>>> [49524,1] launch status >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch >>>>>> report to myself >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44857 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44858 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44859 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44860 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44861 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44862 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44863 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44865 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44866 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44867 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44869 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44870 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44871 terminated >>>>>> [compute-2-1.local:44855] 
[[49524,0],0] odls:wait_local_proc child >>>>>> process 44872 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44873 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44874 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child >>>>>> process 44875 terminated >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],33] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],31] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],29] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],27] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],25] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],23] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],21] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],19] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],17] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],15] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],13] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],11] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],9] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],7] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],5] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],3] terminated normally >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking >>>>>> abort file >>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process >>>>>> [[49524,1],1] terminated normally >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> compute-2-1.local >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],25] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],15] >>>>>> 
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],11] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],13] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],19] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],9] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],17] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],31] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],7] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],21] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],5] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],33] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],23] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],3] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],29] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],27] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for >>>>>> child [[49524,1],1] >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all >>>>>> procs in [49524,1] terminated >>>>>> ^Cmpirun: killing job... >>>>>> >>>>>> Killed by signal 2. >>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on >>>>>> WILDCARD >>>>>> >>>>>> >>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote: >>>>>>> Sorry - I forgot that you built from a tarball, and so debug isn't >>>>>>> enabled by default. 
You need to configure --enable-debug.
>>>>>>>
>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Oddly enough, adding this debugging info lowered the number of
>>>>>>>> processes that can be used down to 42 from 46. When I run the MPI, it
>>>>>>>> fails, giving only the information that follows:
>>>>>>>>
>>>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>> compute-2-0,compute-2-1 -v -np 44 --leave-session-attached -mca
>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying component
>>>>>>>> [default]
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of component
>>>>>>>> [default] set priority to 1
>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected component
>>>>>>>> [default]
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying component
>>>>>>>> [default]
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of component
>>>>>>>> [default] set priority to 1
>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected component
>>>>>>>> [default]
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>> compute-2-1.local
>>>>>>>>
>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to
>>>>>>>>> each node (to start the local daemon).
The only difference is the
>>>>>>>>> number of fork/exec's being done on each node, and the number of file
>>>>>>>>> descriptors being opened to support those fork/exec's.
>>>>>>>>>
>>>>>>>>> It certainly looks like your limits are high enough. When you say it
>>>>>>>>> "fails", what do you mean - what error does it report? Try adding:
>>>>>>>>>
>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>>>
>>>>>>>>> to your cmd line - this will report all the local proc launch debug
>>>>>>>>> and hopefully show you a more detailed error report.
>>>>>>>>>
>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have had to cobble together two machines in our Rocks cluster
>>>>>>>>>> without using the standard installation; they have EFI-only BIOS on
>>>>>>>>>> them and Rocks doesn't like that, so it is the only workaround.
>>>>>>>>>>
>>>>>>>>>> Everything works great now, except for one thing. MPI jobs (openmpi
>>>>>>>>>> or mpich) fail when started from one of these nodes (via qsub or by
>>>>>>>>>> logging in and running the command) if 24 or more processors are
>>>>>>>>>> needed on another system. However, if the originator of the MPI job
>>>>>>>>>> is the headnode or any of the preexisting compute nodes, it works
>>>>>>>>>> fine. Right now I am guessing ssh client or ulimit problems, but I
>>>>>>>>>> cannot find any difference. Any help would be greatly appreciated.
>>>>>>>>>>
>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>>>
>>>>>>>>>> Examples:
>>>>>>>>>>
>>>>>>>>>> This works, prints 23 hostnames from each machine:
>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>>
>>>>>>>>>> This does not work, prints 24 hostnames for compute-2-1:
>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>>
>>>>>>>>>> These both work, print 64 hostnames from each node:
>>>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host
>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>
>>>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>>>> scheduling priority             (-e) 0
>>>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>>>> open files                      (-n) 4096
>>>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>>>> real-time priority              (-r) 0
>>>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>>>> max user processes              (-u) 1024
>>>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>>>> file locks                      (-x) unlimited
>>>>>>>>>>
>>>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>> Host *
>>>>>>>>>>         CheckHostIP             no
>>>>>>>>>>         ForwardX11              yes
>>>>>>>>>>         ForwardAgent            yes
>>>>>>>>>>         StrictHostKeyChecking   no
>>>>>>>>>>         UsePrivilegedPort       no
>>>>>>>>>>         Protocol                2,1
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> users mailing list
>>>>>>>>>> us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
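The ulimit output quoted in this thread is one concrete thing worth comparing across nodes: each locally forked rank costs the orted daemon several file descriptors plus a process slot, so "open files 4096" and "max user processes 1024" are the values most likely to bite first as -np grows. A hedged sketch of that comparison follows; the ssh loop in the comment is illustrative only, and whether these limits actually cause the >=24-process failures is unverified.

```shell
#!/usr/bin/env bash
# Hedged sketch: print the limits a shell on this node (and hence an
# orted daemon launched here through ssh) would inherit. The suspect
# values from the thread are nofile=4096 and nproc=1024 on compute-2-1.
show_limits() {
  echo "open files:    $(ulimit -n)"
  echo "max processes: $(ulimit -u)"
}

show_limits   # current node

# On the cluster, the same function could be shipped to each node, e.g.:
#   for node in compute-2-0 compute-2-1; do
#     echo "== $node =="
#     ssh "$node" "$(declare -f show_limits); show_limits"
#   done
```

If the new nodes report lower values than the preexisting compute nodes, raising them (e.g. via /etc/security/limits.conf on a RHEL-style system) would be the natural next experiment.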