Yes, it does.

Dan

[root@compute-2-1 ~]# ssh compute-2-0
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
[root@compute-2-0 ~]# ssh compute-2-1
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
[root@compute-2-1 ~]#



On 12/17/2012 03:39 PM, Doug Reeder wrote:
Daniel,

Does passwordless ssh work? You need to make sure that it does.
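A quick non-interactive check in each direction should succeed with no prompt at all; BatchMode is a stock OpenSSH client option that makes ssh fail rather than ask for a password:

    ssh -o BatchMode=yes compute-2-0 hostname
    ssh -o BatchMode=yes compute-2-1 hostname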

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:

I would also add that scp seems to be creating the file in the /tmp directory 
of compute-2-0, and that /var/log/secure is showing ssh connections being 
accepted.  Is there anything in ssh that can limit connections that I need to 
look out for?  My guess is that it is part of the client prefs and not the 
server prefs, since I can initiate the mpi command from another machine and it 
works fine, even when it uses compute-2-0 and 2-1.
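For what it's worth, the main server-side connection throttles I know of are MaxStartups and MaxSessions in sshd_config, both standard OpenSSH options; a quick look (path assumes the stock location) would be:

    grep -Ei '^(MaxStartups|MaxSessions)' /etc/ssh/sshd_config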

Dan


[root@compute-2-1 /]# date
Mon Dec 17 15:11:50 CST 2012
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL

[root@compute-2-0 tmp]# ls -ltr
total 24
-rw-------.  1 root    root       0 Nov 28 08:42 yum.log
-rw-------.  1 root    root    5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx
drwx------.  3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0
drwx------.  3 root    root    4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0
drwx------  18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0
drwx------  44 root    root    4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0

[root@compute-2-0 tmp]# tail -10 /var/log/secure
Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
10.1.255.226 port 49483 ssh2
Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session opened 
for user root by (uid=0)
Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 10.1.255.226: 
11: disconnected by user
Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session closed 
for user root
Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
10.1.255.226 port 49484 ssh2
Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session opened 
for user root by (uid=0)
Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 10.1.255.226: 
11: disconnected by user
Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session closed 
for user root
Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
10.1.255.226 port 49485 ssh2
Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session opened 
for user root by (uid=0)






On 12/17/2012 11:16 AM, Daniel Davidson wrote:
After a very long time (15 minutes or so), I finally received the following in 
addition to what I just sent earlier:

[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit 
commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD

Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a 
message back, which leads one to suspect some kind of firewall in the way.

Looking at the ssh line, we are going to attempt to send a message from node 
2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything 
preventing it?
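If you want to sanity-check raw TCP reachability on that address, something like netcat works; the port below is arbitrary and only for illustration, and the listen syntax varies slightly between netcat flavors (-l 46314 vs. -l -p 46314):

    # on compute-2-1
    nc -l 46314
    # on compute-2-0
    nc -zv 10.1.255.226 46314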


On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

These nodes have not been locked down to prevent jobs from being launched from 
the back end, at least not on purpose.  The added logging returns the 
information below:

[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[compute-2-1.local:69655] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [slurm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [slurm]. 
Query failed to return a module
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
[compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename 
hash 3634869988
[compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
[compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path 
NULL
[compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
[compute-2-1.local:69655] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:69655] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:69655] mca:base:select:( odls) Selected component [default]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
[compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation
[compute-2-1.local:69655] [[32341,0],0] using dash_host
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
[[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new daemon 
[[32341,0],1] to node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell as 
local shell
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
        /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; 
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; 
/home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 
<template> -mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca orte_use_common_port 
0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca plm_base_verbose 5 -mca plm rsh -mca 
orte_leave_session_attached 1
[compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child of 
mine
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to 
launch list
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon 
[[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh) [/usr/bin/ssh 
compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; 
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export 
DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 
2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca 
plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
[compute-2-0.local:24659] mca:base:select:(  plm) Querying component [rsh]
[compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : rsh path 
NULL
[compute-2-0.local:24659] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[compute-2-0.local:24659] mca:base:select:(  plm) Selected component [rsh]
[compute-2-0.local:24659] mca:base:select:( odls) Querying component [default]
[compute-2-0.local:24659] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-0.local:24659] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : rsh path 
NULL
[compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm




On 12/17/2012 10:37 AM, Ralph Castain wrote:
?? That was all the output? If so, then something is indeed quite wrong as it 
didn't even attempt to launch the job.

Try adding -mca plm_base_verbose 5 to the cmd line.

I was assuming you were using ssh as the launcher, but I wonder if you are in 
some managed environment? If so, it could be that launching from a back-end 
node isn't allowed (e.g., on Grid Engine).
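A quick way to tell whether the shell on the back-end node is running inside a resource manager's allocation is to look for the environment variables the usual managers set (SGE_*, PBS_*, SLURM_*):

    env | grep -Ei '^(sge|pbs|slurm)'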

On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

This looks to be having issues as well; with the new version, I cannot get any 
number of processors to give me a different result.

[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:69417] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:69417] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:69417] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:24486] mca:base:select:( odls) Querying component [default]
[compute-2-0.local:24486] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-0.local:24486] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on WILDCARD

However from the head node:

[root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 50  hostname

Displays 25 hostnames from each system.

Thank you again for the help so far,

Dan






On 12/17/2012 08:31 AM, Daniel Davidson wrote:
I will give this a try, but wouldn't that be an issue as well if the process 
were run on the head node or another node?  As long as the MPI job is not 
started on either of these two nodes, it works fine.

Dan

On 12/14/2012 11:46 PM, Ralph Castain wrote:
It must be making contact or ORTE wouldn't be attempting to launch your application's 
procs. Looks more like it never received the launch command. Looking at the code, I 
suspect you're getting caught in a race condition that causes the message to get 
"stuck".

Just to see if that's the case, you might try running this with the 1.7 release 
candidate, or even the developer's nightly build. Both use a different timing 
mechanism intended to resolve such situations.


On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

Thank you for the help so far.  Here is the information that the debugging 
gives me.  It looks like the daemon on the non-local node never makes contact.  
If I step -np back by two, though, it does.

Dan

[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:44855] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:44855] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Querying component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-0.local:29282] mca:base:select:( odls) Selected component [default]
[compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 
data to launch job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new 
jobdat for job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 
app_contexts
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],0] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],1] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],1] for me!
[compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],2] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],3] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],3] for me!
[compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],4] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],5] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],5] for me!
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],6] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],7] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],7] for me!
[compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],8] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],9] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],9] for me!
[compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],10] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],11] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],11] for me!
[compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],12] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],13] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],13] for me!
[compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],14] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],15] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],15] for me!
[compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],16] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],17] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],17] for me!
[compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],18] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],19] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],19] for me!
[compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],20] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],21] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],21] for me!
[compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],22] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],23] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],23] for me!
[compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],24] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],25] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],25] for me!
[compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],26] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],27] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],27] for me!
[compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],28] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],29] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],29] for me!
[compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],30] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],31] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],31] for me!
[compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],32] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],33] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],33] for me!
[compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 processors for 17 
children and locally set oversubscribed to false
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0] odls:launch working child [[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job [49524,1] 
launch status
[compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch report to 
myself
[compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44857 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44858 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44859 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44860 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44861 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44862 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44863 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44865 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44866 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44867 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44869 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44870 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44871 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44872 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44873 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44874 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child process 
44875 terminated
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],33] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],31] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],29] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],27] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],25] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],23] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],21] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],19] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],17] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],15] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],13] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],11] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],9] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],7] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],5] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],3] terminated normally
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking abort file 
/tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
[compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child process 
[[49524,1],1] terminated normally
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],25]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],15]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],11]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],13]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],19]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],9]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],17]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],31]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],7]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],21]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],5]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],33]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],23]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],3]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],29]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],27]
[compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for child 
[[49524,1],1]
[compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting all procs 
in [49524,1] terminated
^Cmpirun: killing job...

Killed by signal 2.
[compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working on WILDCARD


On 12/14/2012 04:11 PM, Ralph Castain wrote:
Sorry - I forgot that you built from a tarball, so debug isn't enabled by 
default. You need to configure with --enable-debug.
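For a tarball build, that looks roughly like the following; the prefix just mirrors the install path used earlier in this thread, and the -j value is arbitrary:

    ./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
    make -j8 all install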

On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

Oddly enough, adding this debugging info lowered the number of processes that 
can be used from 46 down to 42.  When I run the MPI job, it fails, giving only 
the information that follows:

[root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 44 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:44374] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:44374] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:44374] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Querying component [default]
[compute-2-0.local:28950] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-0.local:28950] mca:base:select:( odls) Selected component [default]
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local
compute-2-1.local


On 12/14/2012 03:18 PM, Ralph Castain wrote:
It wouldn't be ssh - in both cases, only one ssh is being done to each node (to 
start the local daemon). The only difference is the number of fork/exec's being 
done on each node, and the number of file descriptors being opened to support 
those fork/exec's.
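If you want to see that difference directly, you can watch the daemon's open-descriptor count while a launch is in progress; the pid is whatever mpirun/orted reports in the verbose output (44855 is just an example taken from the logs earlier in this thread):

    ls /proc/44855/fd | wc -l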

It certainly looks like your limits are high enough. When you say it "fails", 
what do you mean - what error does it report? Try adding:

--leave-session-attached -mca odls_base_verbose 5

to your cmd line - this will report all the local proc launch debug and 
hopefully show you a more detailed error report.


On Dec 14, 2012, at 12:29 PM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:

I have had to cobble together two machines in our Rocks cluster without using 
the standard installation; they have EFI-only BIOS, which Rocks doesn't like, 
so this was the only workaround.

Everything works great now, except for one thing.  MPI jobs (Open MPI or MPICH) 
fail when started from one of these nodes (via qsub or by logging in and 
running the command) if 24 or more processors are needed on another system.  
However, if the originator of the MPI job is the head node or any of the 
preexisting compute nodes, it works fine.  Right now I am guessing ssh client 
or ulimit problems, but I cannot find any difference.  Any help would be 
greatly appreciated.

compute-2-1 and compute-2-0 are the new nodes

Examples:

This works; it prints 23 hostnames from each machine:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -np 46 hostname

This does not work; it prints 24 hostnames for compute-2-1:
[root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -np 48 hostname

These both work; they print 64 hostnames from each node:
[root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -np 128 hostname
[root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -np 128 hostname

[root@compute-2-1 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 16410016
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
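If the open-files (4096) or max-user-processes (1024) limits ever do turn out to be the bottleneck, the usual fix is to raise them in /etc/security/limits.conf; the values below are purely illustrative, and on RHEL-family systems a drop-in under /etc/security/limits.d/ may also cap nproc:

    *   soft   nofile   8192
    *   hard   nofile   8192
    *   soft   nproc    4096
    *   hard   nproc    4096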

[root@compute-2-1 ~]# more /etc/ssh/ssh_config
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        Protocol                2,1
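As an aside, ForwardX11 yes is what produces the "untrusted X11 forwarding / xauth" warnings seen earlier in the thread; for non-interactive use they can be silenced by setting ForwardX11 no here, or per invocation with ssh -x (both standard OpenSSH client behavior):

    ssh -x compute-2-0 hostname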

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users