Daniel,

Does passwordless ssh work? You need to make sure that it does.
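
A quick way to check, from compute-2-1 as the same user that runs mpirun (node
names taken from the commands below; just a sketch):

ssh compute-2-0 hostname          # should run with no password prompt
ssh compute-2-0.local hostname    # also try the .local form the daemons report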

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:

> I would also add that scp seems to be creating the file in the /tmp directory 
> of compute-2-0, and that /var/log/secure is showing ssh connections being 
> accepted.  Is there anything in ssh that can limit connections that I need to 
> look out for?  My guess is that it is part of the client settings and not the 
> server settings, since I can initiate the MPI command from another machine and 
> it works fine, even when it uses compute-2-0 and compute-2-1.
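> 
> The only server-side settings I know of that throttle concurrent ssh 
> connections are MaxStartups and MaxSessions in sshd_config, so I am checking 
> those on both nodes (just a guess on my part):
> 
> grep -Ei '^(maxstartups|maxsessions)' /etc/ssh/sshd_config
> # OpenSSH defaults are roughly MaxStartups 10 (pending unauthenticated
> # connections) and MaxSessions 10 (sessions multiplexed over one connection)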
> 
> Dan
> 
> 
> [root@compute-2-1 /]# date
> Mon Dec 17 15:11:50 CST 2012
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
> [compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
> [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
> rsh path NULL
> 
> [root@compute-2-0 tmp]# ls -ltr
> total 24
> -rw-------.  1 root    root       0 Nov 28 08:42 yum.log
> -rw-------.  1 root    root    5962 Nov 29 10:50 
> yum_save_tx-2012-11-29-10-50SRba9s.yumtx
> drwx------.  3 danield danield 4096 Dec 12 14:56 
> openmpi-sessions-danield@compute-2-0_0
> drwx------.  3 root    root    4096 Dec 13 15:38 
> openmpi-sessions-root@compute-2-0_0
> drwx------  18 danield danield 4096 Dec 14 09:48 
> openmpi-sessions-danield@compute-2-0.local_0
> drwx------  44 root    root    4096 Dec 17 15:14 
> openmpi-sessions-root@compute-2-0.local_0
> 
> [root@compute-2-0 tmp]# tail -10 /var/log/secure
> Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
> 10.1.255.226 port 49483 ssh2
> Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
> 10.1.255.226: 11: disconnected by user
> Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
> closed for user root
> Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
> 10.1.255.226 port 49484 ssh2
> Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
> 10.1.255.226: 11: disconnected by user
> Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
> closed for user root
> Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
> 10.1.255.226 port 49485 ssh2
> Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> 
> 
> 
> 
> 
> 
> On 12/17/2012 11:16 AM, Daniel Davidson wrote:
>> After a very long time (15 minutes or so), I finally received the following 
>> in addition to what I just sent earlier:
>> 
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
>> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
>> orted_exit commands
>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>> WILDCARD
>> 
>> Firewalls are down:
>> 
>> [root@compute-2-1 /]# iptables -L
>> Chain INPUT (policy ACCEPT)
>> target     prot opt source               destination
>> 
>> Chain FORWARD (policy ACCEPT)
>> target     prot opt source               destination
>> 
>> Chain OUTPUT (policy ACCEPT)
>> target     prot opt source               destination
>> [root@compute-2-0 ~]# iptables -L
>> Chain INPUT (policy ACCEPT)
>> target     prot opt source               destination
>> 
>> Chain FORWARD (policy ACCEPT)
>> target     prot opt source               destination
>> 
>> Chain OUTPUT (policy ACCEPT)
>> target     prot opt source               destination
>> 
>> On 12/17/2012 11:09 AM, Ralph Castain wrote:
>>> Hmmm...and that is ALL the output? If so, then it never succeeded in 
>>> sending a message back, which leads one to suspect some kind of firewall in 
>>> the way.
>>> 
>>> Looking at the ssh line, we are going to attempt to send a message from 
>>> node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? 
>>> Anything preventing it?
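>>> 
>>> A rough way to check from compute-2-0 (the 46314 port below is just the one 
>>> from this run's orte_hnp_uri; it changes with every launch):
>>> 
>>> ping -c 2 10.1.255.226
>>> # if nc/netcat is available, try a TCP connect back to the HNP while
>>> # mpirun is still running:
>>> nc -vz 10.1.255.226 46314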
>>> 
>>> 
>>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>> 
>>>> These nodes have not been locked down yet, so jobs should not be prevented 
>>>> from launching from the backend, at least not on purpose anyway.  The added 
>>>> logging returns the information below:
>>>> 
>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>>> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
>>>> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
>>>> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh 
>>>> : rsh path NULL
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Query of component [rsh] 
>>>> set priority to 10
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
>>>> [slurm]
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component 
>>>> [slurm]. Query failed to return a module
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [tm]. 
>>>> Query failed to return a module
>>>> [compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 
>>>> nodename hash 3634869988
>>>> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh 
>>>> path NULL
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
>>>> [compute-2-1.local:69655] mca:base:select:( odls) Querying component 
>>>> [default]
>>>> [compute-2-1.local:69655] mca:base:select:( odls) Query of component 
>>>> [default] set priority to 1
>>>> [compute-2-1.local:69655] mca:base:select:( odls) Selected component 
>>>> [default]
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
>>>> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged 
>>>> allocation
>>>> [compute-2-1.local:69655] [[32341,0],0] using dash_host
>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
>>>> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
>>>> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
>>>> [[32341,0],1]
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new 
>>>> daemon [[32341,0],1] to node compute-2-0
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote 
>>>> shell as local shell
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
>>>>        /usr/bin/ssh <template> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; 
>>>> export PATH ; 
>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
>>>> LD_LIBRARY_PATH ; 
>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess 
>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid <template> -mca 
>>>> orte_ess_num_procs 2 -mca orte_hnp_uri 
>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 
>>>> -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a 
>>>> child of mine
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 
>>>> to launch list
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of 
>>>> daemon [[32341,0],1]
>>>> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: 
>>>> (//usr/bin/ssh) [/usr/bin/ssh compute-2-0 
>>>> PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
>>>> LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
>>>> LD_LIBRARY_PATH ; 
>>>> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
>>>> export DYLD_LIBRARY_PATH ; /home/apps/openmpi-1.7rc5/bin/orted -mca ess 
>>>> env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca 
>>>> orte_ess_num_procs 2 -mca orte_hnp_uri 
>>>> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
>>>> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 
>>>> -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]
>>>> Warning: untrusted X11 forwarding setup failed: xauth key data not 
>>>> generated
>>>> Warning: No xauth data; using fake authentication data for X11 forwarding.
>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Querying component [rsh]
>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : rsh 
>>>> path NULL
>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Query of component [rsh] 
>>>> set priority to 10
>>>> [compute-2-0.local:24659] mca:base:select:(  plm) Selected component [rsh]
>>>> [compute-2-0.local:24659] mca:base:select:( odls) Querying component 
>>>> [default]
>>>> [compute-2-0.local:24659] mca:base:select:( odls) Query of component 
>>>> [default] set priority to 1
>>>> [compute-2-0.local:24659] mca:base:select:( odls) Selected component 
>>>> [default]
>>>> [compute-2-0.local:24659] [[32341,0],1] plm:rsh_setup on agent ssh : rsh 
>>>> path NULL
>>>> [compute-2-0.local:24659] [[32341,0],1] plm:base:receive start comm
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 12/17/2012 10:37 AM, Ralph Castain wrote:
>>>>> ?? That was all the output? If so, then something is indeed quite wrong 
>>>>> as it didn't even attempt to launch the job.
>>>>> 
>>>>> Try adding -mca plm_base_verbose 5 to the cmd line.
>>>>> 
>>>>> I was assuming you were using ssh as the launcher, but I wonder if you 
>>>>> are in some managed environment? If so, then it could be that launch from 
>>>>> a backend node isn't allowed (e.g., on gridengine).
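>>>>> 
>>>>> A quick sanity check on the backend node (just a sketch; it only shows 
>>>>> whether any resource-manager environment is being inherited by your shell):
>>>>> 
>>>>> env | grep -Ei 'sge_|pbs_|slurm_' || echo "no RM environment detected"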
>>>>> 
>>>>> On Dec 17, 2012, at 8:28 AM, Daniel Davidson <dani...@igb.uiuc.edu> wrote:
>>>>> 
>>>>>> This looks to be having issues as well, and I cannot get any number of 
>>>>>> processors to give me a different result with the new version.
>>>>>> 
>>>>>> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
>>>>>> compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
>>>>>> odls_base_verbose 5 hostname
>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-1.local:69417] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Querying component 
>>>>>> [default]
>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Query of component 
>>>>>> [default] set priority to 1
>>>>>> [compute-2-0.local:24486] mca:base:select:( odls) Selected component 
>>>>>> [default]
>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
>>>>>> WILDCARD
>>>>>> 
>>>>>> However, from the head node:
>>>>>> 
>>>>>> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun 
>>>>>> -host compute-2-0,compute-2-1 -v  -np 50  hostname
>>>>>> 
>>>>>> Displays 25 hostnames from each system.
>>>>>> 
>>>>>> Thank you again for the help so far,
>>>>>> 
>>>>>> Dan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>>>>>>> I will give this a try, but wouldn't that be an issue as well if the 
>>>>>>> process were run on the head node or another node?  As long as the MPI 
>>>>>>> job is not started on either of these two nodes, it works fine.
>>>>>>> 
>>>>>>> Dan
>>>>>>> 
>>>>>>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>>>>>>> It must be making contact or ORTE wouldn't be attempting to launch 
>>>>>>>> your application's procs. Looks more like it never received the launch 
>>>>>>>> command. Looking at the code, I suspect you're getting caught in a 
>>>>>>>> race condition that causes the message to get "stuck".
>>>>>>>> 
>>>>>>>> Just to see if that's the case, you might try running this with the 
>>>>>>>> 1.7 release candidate, or even the developer's nightly build. Both use 
>>>>>>>> a different timing mechanism intended to resolve such situations.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Thank you for the help so far.  Here is the information that the 
>>>>>>>>> debugging gives me.  It looks like the daemon on the non-local node 
>>>>>>>>> never makes contact.  If I step -np back by two, though, it does.
>>>>>>>>> 
>>>>>>>>> Dan
>>>>>>>>> 
>>>>>>>>> [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>> compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
>>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Querying component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
>>>>>>>>> [default] set priority to 1
>>>>>>>>> [compute-2-1.local:44855] mca:base:select:( odls) Selected component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Querying component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
>>>>>>>>> [default] set priority to 1
>>>>>>>>> [compute-2-0.local:29282] mca:base:select:( odls) Selected component 
>>>>>>>>> [default]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info 
>>>>>>>>> updating nidmap
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>> unpacking data to launch job [49524,1]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>> adding new jobdat for job [49524,1]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
>>>>>>>>> unpacking 1 app_contexts
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],0] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],1] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],1] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],2] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],3] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],3] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],4] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],5] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],5] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],6] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],7] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],7] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],8] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],9] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],9] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],10] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],11] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],11] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],12] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],13] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],13] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],14] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],15] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],15] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],15] (15) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],16] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],17] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],17] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],17] (17) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],18] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],19] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],19] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],19] (19) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],20] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],21] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],21] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],21] (21) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],22] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],23] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],23] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],23] (23) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],24] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],25] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],25] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],25] (25) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],26] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],27] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],27] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],27] (27) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],28] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],29] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],29] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],29] (29) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],30] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],31] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],31] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],31] (31) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],32] on daemon 1
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - checking proc [[49524,1],33] on daemon 0
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
>>>>>>>>> - found proc [[49524,1],33] for me!
>>>>>>>>> [compute-2-1.local:44855] adding proc [[49524,1],33] (33) to my local 
>>>>>>>>> list
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch found 384 
>>>>>>>>> processors for 17 children and locally set oversubscribed to false
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],1]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],3]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],5]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],7]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],9]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],11]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],13]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],15]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],17]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],19]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],21]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],23]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],25]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],27]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],29]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],31]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch working child 
>>>>>>>>> [[49524,1],33]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch reporting job 
>>>>>>>>> [49524,1] launch status
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch flagging launch 
>>>>>>>>> report to myself
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:launch setting waitpids
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44857 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44858 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44859 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44860 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44861 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44862 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44863 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44865 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44866 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44867 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44869 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44870 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44871 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44872 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44873 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44874 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:wait_local_proc child 
>>>>>>>>> process 44875 terminated
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/33/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],33] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/31/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],31] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/29/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],29] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/27/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],27] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/25/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],25] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/23/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],23] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/21/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],21] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/19/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],19] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/17/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],17] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/15/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],15] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/13/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],13] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/11/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],11] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/9/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],9] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/7/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],7] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/5/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],5] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/3/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],3] terminated normally
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired checking 
>>>>>>>>> abort file 
>>>>>>>>> /tmp/openmpi-sessions-root@compute-2-1.local_0/3245604865/1/abort
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:waitpid_fired child 
>>>>>>>>> process [[49524,1],1] terminated normally
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> compute-2-1.local
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],25]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],15]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],11]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],13]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],19]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],9]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],17]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],31]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],7]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],21]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],5]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],33]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],23]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],3]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],29]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],27]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:notify_iof_complete for 
>>>>>>>>> child [[49524,1],1]
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:proc_complete reporting 
>>>>>>>>> all procs in [49524,1] terminated
>>>>>>>>> ^Cmpirun: killing job...
>>>>>>>>> 
>>>>>>>>> Killed by signal 2.
>>>>>>>>> [compute-2-1.local:44855] [[49524,0],0] odls:kill_local_proc working 
>>>>>>>>> on WILDCARD
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 12/14/2012 04:11 PM, Ralph Castain wrote:
>>>>>>>>>> Sorry - I forgot that you built from a tarball, and so debug isn't 
>>>>>>>>>> enabled by default. You need to configure --enable-debug.
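>>>>>>>>>> 
>>>>>>>>>> For a tarball build that would look something like this (the prefix and 
>>>>>>>>>> -j value are just examples based on the paths you are already using):
>>>>>>>>>> 
>>>>>>>>>> ./configure --prefix=/home/apps/openmpi-1.6.3 --enable-debug
>>>>>>>>>> make -j 8 && make install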
>>>>>>>>>> 
>>>>>>>>>> On Dec 14, 2012, at 1:52 PM, Daniel Davidson <dani...@igb.uiuc.edu> 
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Oddly enough, adding this debugging info lowered the number of 
>>>>>>>>>>> processes that can be used from 46 down to 42.  When I run the MPI 
>>>>>>>>>>> job, it fails, giving only the information that follows:
>>>>>>>>>>> 
>>>>>>>>>>> [root@compute-2-1 ssh]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>> compute-2-0,compute-2-1 -v  -np 44 --leave-session-attached -mca 
>>>>>>>>>>> odls_base_verbose 5 hostname
>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Querying 
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Query of 
>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>> [compute-2-1.local:44374] mca:base:select:( odls) Selected 
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Querying 
>>>>>>>>>>> component [default]
>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Query of 
>>>>>>>>>>> component [default] set priority to 1
>>>>>>>>>>> [compute-2-0.local:28950] mca:base:select:( odls) Selected 
>>>>>>>>>>> component [default]
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> compute-2-1.local
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On 12/14/2012 03:18 PM, Ralph Castain wrote:
>>>>>>>>>>>> It wouldn't be ssh - in both cases, only one ssh is being done to 
>>>>>>>>>>>> each node (to start the local daemon). The only difference is the 
>>>>>>>>>>>> number of fork/exec's being done on each node, and the number of 
>>>>>>>>>>>> file descriptors being opened to support those fork/exec's.
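>>>>>>>>>>>> 
>>>>>>>>>>>> If you want to see that difference, one crude check (sketch only) while 
>>>>>>>>>>>> a job is running is to count the descriptors held by the local daemon:
>>>>>>>>>>>> 
>>>>>>>>>>>> ls /proc/$(pgrep -f orted | head -1)/fd | wc -l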
>>>>>>>>>>>> 
>>>>>>>>>>>> It certainly looks like your limits are high enough. When you say 
>>>>>>>>>>>> it "fails", what do you mean - what error does it report? Try 
>>>>>>>>>>>> adding:
>>>>>>>>>>>> 
>>>>>>>>>>>> --leave-session-attached -mca odls_base_verbose 5
>>>>>>>>>>>> 
>>>>>>>>>>>> to your cmd line - this will report all the local proc launch 
>>>>>>>>>>>> debug and hopefully show you a more detailed error report.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Dec 14, 2012, at 12:29 PM, Daniel Davidson 
>>>>>>>>>>>> <dani...@igb.uiuc.edu> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> I have had to cobble together two machines in our Rocks cluster 
>>>>>>>>>>>>> without using the standard installation; they have EFI-only BIOS 
>>>>>>>>>>>>> on them and Rocks doesn't like that, so this is the only workaround.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Everything works great now, except for one thing.  MPI jobs 
>>>>>>>>>>>>> (Open MPI or MPICH) fail when started from one of these nodes (via 
>>>>>>>>>>>>> qsub or by logging in and running the command) if 24 or more 
>>>>>>>>>>>>> processors are needed on another system.  However, if the 
>>>>>>>>>>>>> originator of the MPI job is the head node or any of the 
>>>>>>>>>>>>> preexisting compute nodes, it works fine.  Right now I am 
>>>>>>>>>>>>> guessing ssh client or ulimit problems, but I cannot find any 
>>>>>>>>>>>>> difference.  Any help would be greatly appreciated.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> compute-2-1 and compute-2-0 are the new nodes
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Examples:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This works; it prints 23 hostnames from each machine:
>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 46 hostname
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This does not work; it prints only 24 hostnames, all for compute-2-1:
>>>>>>>>>>>>> [root@compute-2-1 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 48 hostname
>>>>>>>>>>>>> 
>>>>>>>>>>>>> These both work; they print 64 hostnames from each node:
>>>>>>>>>>>>> [root@biocluster ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>> [root@compute-0-2 ~]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
>>>>>>>>>>>>> compute-2-0,compute-2-1 -np 128 hostname
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [root@compute-2-1 ~]# ulimit -a
>>>>>>>>>>>>> core file size          (blocks, -c) 0
>>>>>>>>>>>>> data seg size           (kbytes, -d) unlimited
>>>>>>>>>>>>> scheduling priority             (-e) 0
>>>>>>>>>>>>> file size               (blocks, -f) unlimited
>>>>>>>>>>>>> pending signals                 (-i) 16410016
>>>>>>>>>>>>> max locked memory       (kbytes, -l) unlimited
>>>>>>>>>>>>> max memory size         (kbytes, -m) unlimited
>>>>>>>>>>>>> open files                      (-n) 4096
>>>>>>>>>>>>> pipe size            (512 bytes, -p) 8
>>>>>>>>>>>>> POSIX message queues     (bytes, -q) 819200
>>>>>>>>>>>>> real-time priority              (-r) 0
>>>>>>>>>>>>> stack size              (kbytes, -s) unlimited
>>>>>>>>>>>>> cpu time               (seconds, -t) unlimited
>>>>>>>>>>>>> max user processes              (-u) 1024
>>>>>>>>>>>>> virtual memory          (kbytes, -v) unlimited
>>>>>>>>>>>>> file locks                      (-x) unlimited
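>>>>>>>>>>>>> 
>>>>>>>>>>>>> (I compared these against the head node and an existing compute node 
>>>>>>>>>>>>> with a quick loop along these lines and did not spot any difference:)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> for h in biocluster compute-0-2 compute-2-0 compute-2-1; do
>>>>>>>>>>>>>     echo "== $h =="; ssh $h 'ulimit -n; ulimit -u'
>>>>>>>>>>>>> done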
>>>>>>>>>>>>> 
>>>>>>>>>>>>> [root@compute-2-1 ~]# more /etc/ssh/ssh_config
>>>>>>>>>>>>> Host *
>>>>>>>>>>>>>        CheckHostIP             no
>>>>>>>>>>>>>        ForwardX11              yes
>>>>>>>>>>>>>        ForwardAgent            yes
>>>>>>>>>>>>>        StrictHostKeyChecking   no
>>>>>>>>>>>>>        UsePrivilegedPort       no
>>>>>>>>>>>>>        Protocol                2,1
>>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 