Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I will give this a try, but wouldn't that be an issue as well if the 
process was run on the head node or another node?  So long as the mpi 
job is not started on either of these two nodes, it works fine.


Dan

On 12/14/2012 11:46 PM, Ralph Castain wrote:

It must be making contact or ORTE wouldn't be attempting to launch your application's 
procs. Looks more like it never received the launch command. Looking at the code, I 
suspect you're getting caught in a race condition that causes the message to get 
"stuck".

Just to see if that's the case, you might try running this with the 1.7 release 
candidate, or even the developer's nightly build. Both use a different timing 
mechanism intended to resolve such situations.


On Dec 14, 2012, at 2:49 PM, Daniel Davidson  wrote:


Thank you for the help so far.  Here is the information that the debugging 
gives me.  It looks like the daemon on the non-local node never makes contact.  
If I step -np back by two, though, it does.

Dan

[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:44855] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:44855] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:44855] mca:base:select:( odls) Selected component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Querying component [default]
[compute-2-0.local:29282] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-0.local:29282] mca:base:select:( odls) Selected component [default]
[compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating nidmap
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 
data to launch job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding new 
jobdat for job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list unpacking 1 
app_contexts
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],0] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],1] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],1] for me!
[compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],2] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],3] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],3] for me!
[compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],4] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],5] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],5] for me!
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],6] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],7] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],7] for me!
[compute-2-1.local:44855] adding proc [[49524,1],7] (7) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],8] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],9] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],9] for me!
[compute-2-1.local:44855] adding proc [[49524,1],9] (9) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],10] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],11] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],11] for me!
[compute-2-1.local:44855] adding proc [[49524,1],11] (11) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],12] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - checking 
proc [[49524,1],13] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - found 
proc [[49524,1],13] for me!
[compute-2-1.local:44855] adding proc [[49524,1],13] (13) to my local list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child

Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs

2012-12-17 Thread Blosch, Edwin L
Ralph,

Unfortunately I didn't see the ssh output.  The output I got was pretty much as 
before.

You know, the fact that the error message is not prefixed with a host name 
makes me think it could be happening on the host where the job is placed by 
PBS. If there is something wrong in the user environment prior to mpirun, that 
is not an OpenMPI problem. And yet, in one of the jobs that failed, I have also 
printed out the results of 'ldd' on the mpirun executable just prior to 
executing the command, and all the shared libraries were resolved:

ldd /release/cfd/openmpi-intel/bin/mpirun
linux-vdso.so.1 =>  (0x7fffbbb39000)
libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 
(0x2abdf75d2000)
libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 
(0x2abdf7887000)
libdl.so.2 => /lib64/libdl.so.2 (0x2abdf7b39000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x2abdf7d3d000)
libutil.so.1 => /lib64/libutil.so.1 (0x2abdf7f56000)
libm.so.6 => /lib64/libm.so.6 (0x2abdf8159000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2abdf83af000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x2abdf85c7000)
libc.so.6 => /lib64/libc.so.6 (0x2abdf87e4000)
libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so 
(0x2abdf8b42000)
libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
(0x2abdf8ed7000)
libintlc.so.5 => 
/appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 (0x2abdf90ed000)
/lib64/ld-linux-x86-64.so.2 (0x2abdf73b1000)

Hence my initial assumption that the shared-library problem was happening with 
one of the child processes on a remote node.

So at this point I have more questions than answers.  I still don't know if 
this message comes from the main mpirun process or one of the child processes, 
although it seems that it should not be the main process because of the output 
of ldd above.

Any more suggestions are welcomed of course.

Thanks


/release/cfd/openmpi-intel/bin/mpirun --machinefile 
/var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x 
MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached 
/tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 1 -ri restart.5000 -ro 
/tmp/fv420804.maruhpc4-mgt/restart.5000

[c6n38:16219] mca:base:select:(  plm) Querying component [rsh]
[c6n38:16219] mca:base:select:(  plm) Query of component [rsh] set priority to 
10
[c6n38:16219] mca:base:select:(  plm) Selected component [rsh]
Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M
Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M
/release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
libimf.so: cannot open shared object file: No such file or directory
--
A daemon (pid 16227) died unexpectedly with status 127 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
c6n39 - daemon did not report back when launched
c6n40 - daemon did not report back when launched
c6n41 - daemon did not report back when launched
c6n42 - daemon did not report back when launched

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Friday, December 14, 2012 2:25 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] Problems with shared libraries whil

Re: [OMPI users] EXTERNAL: Re: Problems with shared libraries while launching jobs

2012-12-17 Thread Ralph Castain

On Dec 17, 2012, at 7:42 AM, "Blosch, Edwin L"  wrote:

> Ralph,
>  
> Unfortunately I didn’t see the ssh output.  The output I got was pretty much 
> as before.

Sorry - I forgot that you built from a tarball, and so the debug is "off" by 
default. You need to reconfigure with --enable-debug to get debug output.
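
For example (a sketch only - the prefix below is the one from your earlier output, and any other configure options you originally used would need to be repeated):

   ./configure --prefix=/release/cfd/openmpi-intel --enable-debug ...
   make all install

Then rerun the same mpirun command with --mca plm_base_verbose 5 and the verbose output should include the actual ssh command line.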

>  
> You know, the fact that the error message is not prefixed with a host name 
> makes me think it could be happening on the host where the job is placed by 
> PBS. If there is something wrong in the user environment prior to mpirun, 
> that is not an OpenMPI problem. And yet, in one of the jobs that failed, I 
> have also printed out the results of ‘ldd’ on the mpirun executable just 
> prior to executing the command, and all the shared libraries were resolved:

No, it isn't mpirun - as the error message specifically says, it is "orted" that 
is failing to resolve the library. Try running with debug enabled and the same 
command line, and let's see what the actual ssh command looks like - my guess is 
that the path to that Intel library is missing.
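
As a quick sanity check (hypothetical commands, using one node name and the paths from your output), you can run ldd on orted itself through a non-interactive ssh, which is essentially what mpirun does at launch time:

   ssh c6n39 ldd /release/cfd/openmpi-intel/bin/orted
   ssh c6n39 'LD_LIBRARY_PATH=/appserv/intel/Compiler/11.1/072/lib/intel64 ldd /release/cfd/openmpi-intel/bin/orted'

If the first shows libimf.so => not found while the second resolves it, the remote non-interactive shell just isn't picking up the Intel library path when orted is launched.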

>  
> ldd /release/cfd/openmpi-intel/bin/mpirun
> linux-vdso.so.1 =>  (0x7fffbbb39000)
> libopen-rte.so.0 => /release/cfd/openmpi-intel/lib/libopen-rte.so.0 
> (0x2abdf75d2000)
> libopen-pal.so.0 => /release/cfd/openmpi-intel/lib/libopen-pal.so.0 
> (0x2abdf7887000)
> libdl.so.2 => /lib64/libdl.so.2 (0x2abdf7b39000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x2abdf7d3d000)
> libutil.so.1 => /lib64/libutil.so.1 (0x2abdf7f56000)
> libm.so.6 => /lib64/libm.so.6 (0x2abdf8159000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x2abdf83af000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x2abdf85c7000)
> libc.so.6 => /lib64/libc.so.6 (0x2abdf87e4000)
> libimf.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libimf.so 
> (0x2abdf8b42000)
> libsvml.so => /appserv/intel/Compiler/11.1/072/lib/intel64/libsvml.so 
> (0x2abdf8ed7000)
> libintlc.so.5 => 
> /appserv/intel/Compiler/11.1/072/lib/intel64/libintlc.so.5 
> (0x2abdf90ed000)
> /lib64/ld-linux-x86-64.so.2 (0x2abdf73b1000)
>  
> Hence my initial assumption that the shared-library problem was happening 
> with one of the child processes on a remote node.
>  
> So at this point I have more questions than answers.  I still don’t know if 
> this message comes from the main mpirun process or one of the child 
> processes, although it seems that it should not be the main process because 
> of the output of ldd above.
>  
> Any more suggestions are welcomed of course.
>  
> Thanks
>  
>  
> /release/cfd/openmpi-intel/bin/mpirun --machinefile 
> /var/spool/PBS/aux/20804.maruhpc4-mgt -np 160 -x LD_LIBRARY_PATH -x 
> MPI_ENVIRONMENT=1 --mca plm_base_verbose 5 --leave-session-attached 
> /tmp/fv420804.maruhpc4-mgt/test_jsgl -v -cycles 1 -ri restart.5000 -ro 
> /tmp/fv420804.maruhpc4-mgt/restart.5000
>  
> [c6n38:16219] mca:base:select:(  plm) Querying component [rsh]
> [c6n38:16219] mca:base:select:(  plm) Query of component [rsh] set priority 
> to 10
> [c6n38:16219] mca:base:select:(  plm) Selected component [rsh]
> Warning: Permanently added 'c6n39' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n40' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n41' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c6n42' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c5n26' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c3n20' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n10' (RSA) to the list of known hosts.^M
> Warning: Permanently added 'c4n40' (RSA) to the list of known hosts.^M
> /release/cfd/openmpi-intel/bin/orted: error while loading shared libraries: 
> libimf.so: cannot open shared object file: No such file or directory
> --
> A daemon (pid 16227) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>  
> There may be more information reported by the environment (see above).
>  
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> Warning: Permanently added 'c3n27' (RSA) to the list of known hosts.^M
> --
> mpirun was unable 

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
This looks to be having issues as well, and I cannot get any number of 
processors to give me a different result with the new version.


[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:69417] mca:base:select:( odls) Querying component 
[default]
[compute-2-1.local:69417] mca:base:select:( odls) Query of component 
[default] set priority to 1
[compute-2-1.local:69417] mca:base:select:( odls) Selected component 
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Querying component 
[default]
[compute-2-0.local:24486] mca:base:select:( odls) Query of component 
[default] set priority to 1
[compute-2-0.local:24486] mca:base:select:( odls) Selected component 
[default]
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
WILDCARD
[compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
WILDCARD
[compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
WILDCARD


However from the head node:

[root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun 
-host compute-2-0,compute-2-1 -v  -np 50  hostname


Displays 25 hostnames from each system.

Thank you again for the help so far,

Dan






On 12/17/2012 08:31 AM, Daniel Davidson wrote:
I will give this a try, but wouldn't that be an issue as well if the 
process was run on the head node or another node?  So long as the mpi 
job is not started on either of these two nodes, it works fine.


Dan

On 12/14/2012 11:46 PM, Ralph Castain wrote:
It must be making contact or ORTE wouldn't be attempting to launch 
your application's procs. Looks more like it never received the 
launch command. Looking at the code, I suspect you're getting caught 
in a race condition that causes the message to get "stuck".


Just to see if that's the case, you might try running this with the 
1.7 release candidate, or even the developer's nightly build. Both 
use a different timing mechanism intended to resolve such situations.



On Dec 14, 2012, at 2:49 PM, Daniel Davidson  
wrote:


Thank you for the help so far.  Here is the information that the 
debugging gives me.  Looks like the daemon on the non-local node 
never makes contact.  If I step NP back two though, it does.


Dan

[root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
odls_base_verbose 5 hostname
[compute-2-1.local:44855] mca:base:select:( odls) Querying component 
[default]
[compute-2-1.local:44855] mca:base:select:( odls) Query of component 
[default] set priority to 1
[compute-2-1.local:44855] mca:base:select:( odls) Selected component 
[default]
[compute-2-0.local:29282] mca:base:select:( odls) Querying component 
[default]
[compute-2-0.local:29282] mca:base:select:( odls) Query of component 
[default] set priority to 1
[compute-2-0.local:29282] mca:base:select:( odls) Selected component 
[default]
[compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info 
updating nidmap

[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
unpacking data to launch job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
adding new jobdat for job [49524,1]
[compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
unpacking 1 app_contexts
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],0] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],1] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- found proc [[49524,1],1] for me!
[compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local 
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],2] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],3] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- found proc [[49524,1],3] for me!
[compute-2-1.local:44855] adding proc [[49524,1],3] (3) to my local 
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],4] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],5] on daemon 0
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- found proc [[49524,1],5] for me!
[compute-2-1.local:44855] adding proc [[49524,1],5] (5) to my local 
list
[compute-2-1.local:44855] [[49524,0],0] odls:constructing child list 
- checking proc [[49524,1],6] on daemon 1
[compute-2-1.local:44855] [[49524,0],0] o

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
?? That was all the output? If so, then something is indeed quite wrong as it 
didn't even attempt to launch the job.

Try adding -mca plm_base_verbose 5 to the cmd line.

I was assuming you were using ssh as the launcher, but I wonder if you are in 
some managed environment? If so, then it could be that launch from a backend 
node isn't allowed (e.g., on gridengine).

On Dec 17, 2012, at 8:28 AM, Daniel Davidson  wrote:

> This looks to be having issues as well, and I cannot get any number of 
> processors to give me a different result with the new version.
> 
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 50 --leave-session-attached -mca 
> odls_base_verbose 5 hostname
> [compute-2-1.local:69417] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69417] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:69417] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Querying component [default]
> [compute-2-0.local:24486] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-0.local:24486] mca:base:select:( odls) Selected component [default]
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-0.local:24486] [[24939,0],1] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
> WILDCARD
> [compute-2-1.local:69417] [[24939,0],0] odls:kill_local_proc working on 
> WILDCARD
> 
> However from the head node:
> 
> [root@biocluster openmpi-1.7rc5]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 50  hostname
> 
> Displays 25 hostnames from each system.
> 
> Thank you again for the help so far,
> 
> Dan
> 
> 
> 
> 
> 
> 
> On 12/17/2012 08:31 AM, Daniel Davidson wrote:
>> I will give this a try, but wouldn't that be an issue as well if the process 
>> was run on the head node or another node?  So long as the mpi job is not 
>> started on either of these two nodes, it works fine.
>> 
>> Dan
>> 
>> On 12/14/2012 11:46 PM, Ralph Castain wrote:
>>> It must be making contact or ORTE wouldn't be attempting to launch your 
>>> application's procs. Looks more like it never received the launch command. 
>>> Looking at the code, I suspect you're getting caught in a race condition 
>>> that causes the message to get "stuck".
>>> 
>>> Just to see if that's the case, you might try running this with the 1.7 
>>> release candidate, or even the developer's nightly build. Both use a 
>>> different timing mechanism intended to resolve such situations.
>>> 
>>> 
>>> On Dec 14, 2012, at 2:49 PM, Daniel Davidson  wrote:
>>> 
 Thank you for the help so far.  Here is the information that the debugging 
 gives me.  Looks like the daemon on the non-local node never makes 
 contact.  If I step NP back two though, it does.
 
 Dan
 
 [root@compute-2-1 etc]# /home/apps/openmpi-1.6.3/bin/mpirun -host 
 compute-2-0,compute-2-1 -v  -np 34 --leave-session-attached -mca 
 odls_base_verbose 5 hostname
 [compute-2-1.local:44855] mca:base:select:( odls) Querying component 
 [default]
 [compute-2-1.local:44855] mca:base:select:( odls) Query of component 
 [default] set priority to 1
 [compute-2-1.local:44855] mca:base:select:( odls) Selected component 
 [default]
 [compute-2-0.local:29282] mca:base:select:( odls) Querying component 
 [default]
 [compute-2-0.local:29282] mca:base:select:( odls) Query of component 
 [default] set priority to 1
 [compute-2-0.local:29282] mca:base:select:( odls) Selected component 
 [default]
 [compute-2-1.local:44855] [[49524,0],0] odls:update:daemon:info updating 
 nidmap
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list
 [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
 unpacking data to launch job [49524,1]
 [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list adding 
 new jobdat for job [49524,1]
 [compute-2-1.local:44855] [[49524,0],0] odls:construct_child_list 
 unpacking 1 app_contexts
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
 checking proc [[49524,1],0] on daemon 1
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
 checking proc [[49524,1],1] on daemon 0
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
 found proc [[49524,1],1] for me!
 [compute-2-1.local:44855] adding proc [[49524,1],1] (1) to my local list
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
 checking proc [[49524,1],2] on daemon 1
 [compute-2-1.local:44855] [[49524,0],0] odls:constructing child list - 
 checki

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
These nodes have not been locked down to prevent jobs from being launched 
from the backend, at least not on purpose.  The added logging returns the 
information below:


[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname

[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent 
ssh : rsh path NULL
[compute-2-1.local:69655] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10

[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [slurm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module

[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component 
[tm]. Query failed to return a module

[compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
[compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 
nodename hash 3634869988

[compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
[compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh 
path NULL

[compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
[compute-2-1.local:69655] mca:base:select:( odls) Querying component 
[default]
[compute-2-1.local:69655] mca:base:select:( odls) Query of component 
[default] set priority to 1
[compute-2-1.local:69655] mca:base:select:( odls) Selected component 
[default]

[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
[compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged 
allocation

[compute-2-1.local:69655] [[32341,0],0] using dash_host
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
[[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new 
daemon [[32341,0],1] to node compute-2-0

[compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote 
shell as local shell

[compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
/usr/bin/ssh  
PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
LD_LIBRARY_PATH ; 
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
export DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca 
ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid  
-mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1
[compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a 
child of mine
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 
to launch list

[compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of 
daemon [[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: 
(//usr/bin/ssh) [/usr/bin/ssh compute-2-0 
PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export 
LD_LIBRARY_PATH ; 
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; 
export DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca 
ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca 
orte_ess_num_procs 2 -mca orte_hnp_uri 
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]

Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
[compute-2-0.local:24659] mca:base:select:(  plm) Querying component [rsh]
[compute-2-0.local:24659] [[32341,0],1] plm:rsh_lookup on agent ssh : 
rsh path NULL
[compute-2-0.local:24659] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10

[compute-2-0.local:24659] mca:base:select:(  plm) Selected component [rsh]
[compute-2-0.local:24659] mca:base:select:( odls) Querying component 
[def

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Ralph Castain
Hmmm...and that is ALL the output? If so, then it never succeeded in sending a 
message back, which leads one to suspect some kind of firewall in the way.

Looking at the ssh line, we are going to attempt to send a message from node 
2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything 
preventing it?
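
(As a quick hypothetical check, from compute-2-0 while the job is hanging:

   ping 10.1.255.226
   telnet 10.1.255.226 46314

where 46314 is the port shown in the orte_hnp_uri of your output. If that connection can't be established, the remote daemon has no way to report back.)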


On Dec 17, 2012, at 8:56 AM, Daniel Davidson  wrote:

> These nodes have not been locked down yet so that jobs cannot be launched 
> from the backend, at least on purpose anyway.  The added logging returns the 
> information below:
> 
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
> [compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
> rsh path NULL
> [compute-2-1.local:69655] mca:base:select:(  plm) Query of component [rsh] 
> set priority to 10
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [slurm]
> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [slurm]. 
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
> [compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [tm]. 
> Query failed to return a module
> [compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
> [compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename 
> hash 3634869988
> [compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path 
> NULL
> [compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
> [compute-2-1.local:69655] mca:base:select:( odls) Querying component [default]
> [compute-2-1.local:69655] mca:base:select:( odls) Query of component 
> [default] set priority to 1
> [compute-2-1.local:69655] mca:base:select:( odls) Selected component [default]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
> [compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation
> [compute-2-1.local:69655] [[32341,0],0] using dash_host
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
> [compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new 
> daemon [[32341,0],1] to node compute-2-0
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell 
> as local shell
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
>/usr/bin/ssh  PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; 
> export PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH 
> ; export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export 
> DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca 
> orte_ess_jobid 2119499776 -mca orte_ess_vpid  -mca 
> orte_ess_num_procs 2 -mca orte_hnp_uri 
> "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca 
> orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 
> -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh:launch daemon 0 not a child 
> of mine
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: adding node compute-2-0 to 
> launch list
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: activating launch event
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: recording launch of daemon 
> [[32341,0],1]
> [compute-2-1.local:69655] [[32341,0],0] plm:rsh: executing: (//usr/bin/ssh) 
> [/usr/bin/ssh compute-2-0 PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export 
> PATH ; LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; 
> export LD_LIBRARY_PATH ; 
> DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export 
> DYLD_LIBRARY_PATH ;   /home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca 
> orte_ess_jobid 2119499776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 -mca 
> orte_hnp_uri "2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" 
> -mca orte_use_common_port 0 --tree-spawn -mca oob tcp -mca odls_base_verbose 
> 5 -mca plm_base_verbose 5 -mca plm rsh -mca orte_leave_session_attached 1]
> Warning: untrusted X11 forwarding setup fai

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
After a very long time (15 minutes or so), I finally received the following in 
addition to what I just sent earlier:


[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
WILDCARD

[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
orted_exit commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
WILDCARD


Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:

Hmmm...and that is ALL the output? If so, then it never succeeded in sending a 
message back, which leads one to suspect some kind of firewall in the way.

Looking at the ssh line, we are going to attempt to send a message from node 
2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything 
preventing it?


On Dec 17, 2012, at 8:56 AM, Daniel Davidson  wrote:


These nodes have not been locked down yet so that jobs cannot be launched from 
the backend, at least on purpose anyway.  The added logging returns the 
information below:

[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL
[compute-2-1.local:69655] mca:base:select:(  plm) Query of component [rsh] set 
priority to 10
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [slurm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [slurm]. 
Query failed to return a module
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component [tm]
[compute-2-1.local:69655] mca:base:select:(  plm) Skipping component [tm]. 
Query failed to return a module
[compute-2-1.local:69655] mca:base:select:(  plm) Selected component [rsh]
[compute-2-1.local:69655] plm:base:set_hnp_name: initial bias 69655 nodename 
hash 3634869988
[compute-2-1.local:69655] plm:base:set_hnp_name: final jobfam 32341
[compute-2-1.local:69655] [[32341,0],0] plm:rsh_setup on agent ssh : rsh path 
NULL
[compute-2-1.local:69655] [[32341,0],0] plm:base:receive start comm
[compute-2-1.local:69655] mca:base:select:( odls) Querying component [default]
[compute-2-1.local:69655] mca:base:select:( odls) Query of component [default] 
set priority to 1
[compute-2-1.local:69655] mca:base:select:( odls) Selected component [default]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_job
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm creating map
[compute-2-1.local:69655] [[32341,0],0] setup:vm: working unmanaged allocation
[compute-2-1.local:69655] [[32341,0],0] using dash_host
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] adding compute-2-0 to list
[compute-2-1.local:69655] [[32341,0],0] checking node compute-2-1.local
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm add new daemon 
[[32341,0],1]
[compute-2-1.local:69655] [[32341,0],0] plm:base:setup_vm assigning new daemon 
[[32341,0],1] to node compute-2-0
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: launching vm
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: local shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: assuming same remote shell as 
local shell
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: remote shell: 0 (bash)
[compute-2-1.local:69655] [[32341,0],0] plm:rsh: final template argv:
/usr/bin/ssh  PATH=/home/apps/openmpi-1.7rc5/bin:$PATH ; export PATH ; 
LD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; 
DYLD_LIBRARY_PATH=/home/apps/openmpi-1.7rc5/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ;   
/home/apps/openmpi-1.7rc5/bin/orted -mca ess env -mca orte_ess_jobid 2119499776 -mca orte_ess_vpid 
 -mca orte_ess_num_procs 2 -mca orte_hnp_uri 
"2119499776.0;tcp://10.1.255.226:46314;tcp://172.16.28.94:46314" -mca orte_use_common_port 
0 --tree-spawn -mca oob tcp -mca odls_base_verbose 5 -mca p

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson
I would also add that scp seems to be creating the file in the /tmp 
directory of compute-2-0, and that /var/log/secure is showing ssh 
connections being accepted.  Is there anything in ssh that can limit 
connections that I need to look out for?  My guess is that it is part of 
the client prefs and not the server prefs, since I can initiate the mpi 
command from another machine and it works fine, even when it uses 
compute-2-0 and compute-2-1.


Dan


[root@compute-2-1 /]# date
Mon Dec 17 15:11:50 CST 2012
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname

[compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent 
ssh : rsh path NULL


[root@compute-2-0 tmp]# ls -ltr
total 24
-rw-------.  1 root    root       0 Nov 28 08:42 yum.log
-rw-------.  1 root    root    5962 Nov 29 10:50 yum_save_tx-2012-11-29-10-50SRba9s.yumtx
drwx------.  3 danield danield 4096 Dec 12 14:56 openmpi-sessions-danield@compute-2-0_0
drwx------.  3 root    root    4096 Dec 13 15:38 openmpi-sessions-root@compute-2-0_0
drwx------  18 danield danield 4096 Dec 14 09:48 openmpi-sessions-danield@compute-2-0.local_0
drwx------  44 root    root    4096 Dec 17 15:14 openmpi-sessions-root@compute-2-0.local_0


[root@compute-2-0 tmp]# tail -10 /var/log/secure
Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root 
from 10.1.255.226 port 49483 ssh2
Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
opened for user root by (uid=0)
Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
closed for user root
Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root 
from 10.1.255.226 port 49484 ssh2
Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
opened for user root by (uid=0)
Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
10.1.255.226: 11: disconnected by user
Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
closed for user root
Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root 
from 10.1.255.226 port 49485 ssh2
Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session 
opened for user root by (uid=0)







On 12/17/2012 11:16 AM, Daniel Davidson wrote:
After a very long time (15 minutes or so), I finally received the following 
in addition to what I just sent earlier:


[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working 
on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working 
on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working 
on WILDCARD

[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
orted_exit commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working 
on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working 
on WILDCARD


Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:
Hmmm...and that is ALL the output? If so, then it never succeeded in 
sending a message back, which leads one to suspect some kind of 
firewall in the way.


Looking at the ssh line, we are going to attempt to send a message 
from node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going 
to work? Anything preventing it?



On Dec 17, 2012, at 8:56 AM, Daniel Davidson  
wrote:


These nodes have not been locked down yet so that jobs cannot be 
launched from the backend, at least on purpose anyway.  The added 
logging returns the information below:


[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
[rsh]
[compute-2-1.local:69655] [[INVALID],INVALID] plm:rsh_lookup on 
agent ssh : rsh path NULL
[compute-2-1.local:69655] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10
[compute-2-1.local:69655] mca:base:select:(  plm) Querying component 
[slurm]
[compute-2-1.local:69655] mca:base:select:(  plm) Sk

Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Doug Reeder
Daniel,

Does passwordless ssh work? You need to make sure that it does.

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:

> I would also add that scp seems to be creating the file in the /tmp directory 
> of compute-2-0, and that /var/log/secure is showing ssh connections being 
> accepted.  Is there anything in ssh that can limit connections that I need to 
> look out for?  My guess is that it is part of the client prefs and not the 
> server prefs since I can initiate the mpi command from another machine and it 
> works fine, even when it uses compute-2-0 and 1.
> 
> Dan
> 
> 
> [root@compute-2-1 /]# date
> Mon Dec 17 15:11:50 CST 2012
> [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
> compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
> odls_base_verbose 5 -mca plm_base_verbose 5 hostname
> [compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
> [compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : 
> rsh path NULL
> 
> [root@compute-2-0 tmp]# ls -ltr
> total 24
> -rw---.  1 rootroot   0 Nov 28 08:42 yum.log
> -rw---.  1 rootroot5962 Nov 29 10:50 
> yum_save_tx-2012-11-29-10-50SRba9s.yumtx
> drwx--.  3 danield danield 4096 Dec 12 14:56 
> openmpi-sessions-danield@compute-2-0_0
> drwx--.  3 rootroot4096 Dec 13 15:38 
> openmpi-sessions-root@compute-2-0_0
> drwx--  18 danield danield 4096 Dec 14 09:48 
> openmpi-sessions-danield@compute-2-0.local_0
> drwx--  44 rootroot4096 Dec 17 15:14 
> openmpi-sessions-root@compute-2-0.local_0
> 
> [root@compute-2-0 tmp]# tail -10 /var/log/secure
> Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
> 10.1.255.226 port 49483 ssh2
> Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 
> 10.1.255.226: 11: disconnected by user
> Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session 
> closed for user root
> Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
> 10.1.255.226 port 49484 ssh2
> Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 
> 10.1.255.226: 11: disconnected by user
> Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session 
> closed for user root
> Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
> 10.1.255.226 port 49485 ssh2
> Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session 
> opened for user root by (uid=0)
> 
> 
> 
> 
> 
> 
> On 12/17/2012 11:16 AM, Daniel Davidson wrote:
>> After a very long time (15 minutes or so), I finally received the following in 
>> addition to what I just sent earlier:
>> 
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
>> [compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending 
>> orted_exit commands
>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>> WILDCARD
>> [compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on 
>> WILDCARD
>> 
>> Firewalls are down:
>> 
>> [root@compute-2-1 /]# iptables -L
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>> 
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>> 
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>> [root@compute-2-0 ~]# iptables -L
>> Chain INPUT (policy ACCEPT)
>> target prot opt source   destination
>> 
>> Chain FORWARD (policy ACCEPT)
>> target prot opt source   destination
>> 
>> Chain OUTPUT (policy ACCEPT)
>> target prot opt source   destination
>> 
>> On 12/17/2012 11:09 AM, Ralph Castain wrote:
>>> Hmmm...and that is ALL the output? If so, then it never succeeded in 
>>> sending a message back, which leads one to suspect some kind of firewall in 
>>> the way.
>>> 
>>> Looking at the ssh line, we are going to attempt to send a message from 
>>> node 2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? 
>>> Anything preventing it?
>>> 
>>> 
>>> On Dec 17, 2012, at 8:56 AM, Daniel Davidson  wrote:
>>> 
 These nodes have not been locked down yet so that jobs cannot be launched 
 from the backend, at least on purpose anyway.  The added logging returns 
 the information below:
 
 [root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
 compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
 odls_base_v

Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-17 Thread Jeff Squyres
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote:

>>  1. The datatypes passed to Scatter are not valid MPI datatypes
>> (MPI.OBJECT).  You need to construct a datatype that is specific to the
>> MyData class, just like you would in C/C++.  I think that this is the
>> first error that you are seeing (i.e., that OMPI is trying to treat
>> MPI.OBJECT as an MPI Datatype object, and failing (and therefore throwing
>> a ClassCastException exception).
> 
> Perhaps you are right and my small example program is not a valid MPI
> program. The problem is that I couldn't find any good documentation or
> example programs how to write a program which uses a structured data
> type.

In Java, that's probably true.  Remember: there are no official MPI Java 
bindings. What is included in Open MPI is a research project from several years 
ago.  We picked what appeared to be the best one, freshened it up a little, 
updated its build system to incorporate into ours, verified its basic 
functionality, and went with that.

In C, there should be plenty of google-able examples about how to use Scatter 
(and friends).  You might want to have a look at a few of those to get an idea 
how to use MPI_Scatter in general, and then apply that knowledge to a Java 
program.
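
For reference, here is a bare-bones sketch of the usual C pattern (a generic illustration, not your program): the root supplies N elements per rank, and every rank receives its own N-element block.

-
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 2   /* elements delivered to each rank */

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sendbuf = NULL;
    int recvbuf[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Only the root needs the send buffer; it must hold size * N elements. */
    if (rank == 0) {
        sendbuf = malloc(size * N * sizeof(int));
        for (i = 0; i < size * N; ++i)
            sendbuf[i] = i;
    }

    /* Every rank, root included, receives exactly N elements. */
    MPI_Scatter(sendbuf, N, MPI_INT, recvbuf, N, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d and %d\n", rank, recvbuf[0], recvbuf[1]);

    free(sendbuf);
    MPI_Finalize();
    return 0;
}
-

The same bookkeeping applies in the Java binding: the root's send buffer has to hold (number of ranks) x (count) elements, one block per rank.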

Make sense?

> Therefore I stuck to the mpiJava specification, which states
> for derived datatypes in chapter 3.12 that the effect for MPI_Type_struct
> can be achieved by using MPI.OBJECT as the buffer type and relying on
> Java object serialization. "dataItem" is a serializable Java object and
> I used MPI.OBJECT as buffer type. How can I create a valid MPI datatype
> MPI.OBJECT so that I get a working example program?

/me reads some Java implementation code...

It looks like they allow passing MPI.OBJECT as the datatype argument; sorry, I 
guess I was wrong about that.

>MPI.COMM_WORLD.Scatter (dataItem, 0, 1, MPI.OBJECT,
>objBuffer, 0, 1, MPI.OBJECT, 0);

What I think you're running into here is that you're still using Scatter wrong, 
per my other point, below:

>>  1. It looks like you're trying to Scatter a single object to N peers.
>> That's invalid MPI -- you need to scatter (N*M) objects to N peers, where
>> M is a positive integer value (e.g., 1 or 2).  Are you trying to
>> broadcast?
> 
> It is the very first version of the program where I scatter one object
> to the process itself (at this point it is not the normal application
> area for scatter, but should nevertheless work). I didn't continue due
> to the error. I get the same error when I broadcast my data item.
> 
> tyr java 116 mpiexec -np 1 java -cp $DIRPREFIX_LOCAL/mpi_classfiles \
>  ObjectScatterMain
> Exception in thread "main" java.lang.ClassCastException: MyData cannot
>  be cast to [Ljava.lang.Object;
>at mpi.Intracomm.copyBuffer(Intracomm.java:119)
>at mpi.Intracomm.Scatter(Intracomm.java:389)
>at ObjectScatterMain.main(ObjectScatterMain.java:45)

I don't know Java, but it looks like it's complaining about the type of 
dataItem, not the type of MPI.OBJECT.  It says it can't cast dataItem to 
[Ljava.lang.Object; (i.e., an Object array) -- which appears to be the expected 
type of the first argument to Scatter.

Do you need to have MyData inherit from the Java base Object type, or some such?

> "Broadcast" works if I have only a root process and it fails when I have
> one more process.

If I change MPI.COMM_WORLD.Scatter(...) to 

MPI.COMM_WORLD.Bcast(dataItem, 0, 1, MPI.OBJECT, 0);

I get the same casting error.

I'm sorry; I really don't know Java, and don't know how to fix this offhand.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] mpi problems/many cpus per node

2012-12-17 Thread Daniel Davidson

Yes, it does.

Dan

[root@compute-2-1 ~]# ssh compute-2-0
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Dec 17 16:13:00 2012 from compute-2-1.local
[root@compute-2-0 ~]# ssh compute-2-1
Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
Last login: Mon Dec 17 16:12:32 2012 from biocluster.local
[root@compute-2-1 ~]#



On 12/17/2012 03:39 PM, Doug Reeder wrote:

Daniel,

Does passwordless ssh work? You need to make sure that it does.

Doug
On Dec 17, 2012, at 2:24 PM, Daniel Davidson wrote:


I would also add that scp seems to be creating the file in the /tmp directory 
of compute-2-0, and that /var/log/secure is showing ssh connections being 
accepted.  Is there anything in ssh that can limit connections that I need to 
look out for?  My guess is that it is part of the client prefs and not the 
server prefs since I can initiate the mpi command from another machine and it 
works fine, even when it uses compute-2-0 and 1.

Dan


[root@compute-2-1 /]# date
Mon Dec 17 15:11:50 CST 2012
[root@compute-2-1 /]# /home/apps/openmpi-1.7rc5/bin/mpirun -host 
compute-2-0,compute-2-1 -v  -np 10 --leave-session-attached -mca 
odls_base_verbose 5 -mca plm_base_verbose 5 hostname
[compute-2-1.local:70237] mca:base:select:(  plm) Querying component [rsh]
[compute-2-1.local:70237] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh 
path NULL

[root@compute-2-0 tmp]# ls -ltr
total 24
-rw---.  1 rootroot   0 Nov 28 08:42 yum.log
-rw---.  1 rootroot5962 Nov 29 10:50 
yum_save_tx-2012-11-29-10-50SRba9s.yumtx
drwx--.  3 danield danield 4096 Dec 12 14:56 
openmpi-sessions-danield@compute-2-0_0
drwx--.  3 rootroot4096 Dec 13 15:38 
openmpi-sessions-root@compute-2-0_0
drwx--  18 danield danield 4096 Dec 14 09:48 
openmpi-sessions-danield@compute-2-0.local_0
drwx--  44 rootroot4096 Dec 17 15:14 
openmpi-sessions-root@compute-2-0.local_0

[root@compute-2-0 tmp]# tail -10 /var/log/secure
Dec 17 15:13:40 compute-2-0 sshd[24834]: Accepted publickey for root from 
10.1.255.226 port 49483 ssh2
Dec 17 15:13:40 compute-2-0 sshd[24834]: pam_unix(sshd:session): session opened 
for user root by (uid=0)
Dec 17 15:13:42 compute-2-0 sshd[24834]: Received disconnect from 10.1.255.226: 
11: disconnected by user
Dec 17 15:13:42 compute-2-0 sshd[24834]: pam_unix(sshd:session): session closed 
for user root
Dec 17 15:13:50 compute-2-0 sshd[24851]: Accepted publickey for root from 
10.1.255.226 port 49484 ssh2
Dec 17 15:13:50 compute-2-0 sshd[24851]: pam_unix(sshd:session): session opened 
for user root by (uid=0)
Dec 17 15:13:55 compute-2-0 sshd[24851]: Received disconnect from 10.1.255.226: 
11: disconnected by user
Dec 17 15:13:55 compute-2-0 sshd[24851]: pam_unix(sshd:session): session closed 
for user root
Dec 17 15:14:01 compute-2-0 sshd[24868]: Accepted publickey for root from 
10.1.255.226 port 49485 ssh2
Dec 17 15:14:01 compute-2-0 sshd[24868]: pam_unix(sshd:session): session opened 
for user root by (uid=0)






On 12/17/2012 11:16 AM, Daniel Davidson wrote:

After a very long time (15 minutes or so), I finally received the following in 
addition to what I just sent earlier:

[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-0.local:24659] [[32341,0],1] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] daemon 1 failed with status 1
[compute-2-1.local:69655] [[32341,0],0] plm:base:orted_cmd sending orted_exit 
commands
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD
[compute-2-1.local:69655] [[32341,0],0] odls:kill_local_proc working on WILDCARD

Firewalls are down:

[root@compute-2-1 /]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination
[root@compute-2-0 ~]# iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source   destination

Chain FORWARD (policy ACCEPT)
target prot opt source   destination

Chain OUTPUT (policy ACCEPT)
target prot opt source   destination

On 12/17/2012 11:09 AM, Ralph Castain wrote:

Hmmm...and that is ALL the output? If so, then it never succeeded in sending a 
message back, which leads one to suspect some kind of firewall in the way.

Looking at the ssh line, we are going to attempt to send a message from node 
2-0 to node 2-1 on the 10.1.255.226 address. Is that going to work? Anything 
preventing it?


On Dec 17, 2012, at 8:56 AM, Daniel Davidson  wrote:


These nodes have not been locked down yet so th

Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-17 Thread Jeff Squyres
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote:

> "Broadcast" works if I have only a root process and it fails when I have
> one more process.


I'm sorry; I didn't clarify this error.

In a broadcast of only 1 process, it's effectively a no-op.  So it doesn't need 
to do anything to the buffer.  Hence, it succeeds.

But when you include >1 process, then it needs to do something to the buffer, 
so the same casting issue arises.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] [Open MPI] #3351: JAVA scatter error

2012-12-17 Thread Jeff Squyres
On Dec 15, 2012, at 10:46 AM, Siegmar Gross wrote:

> If I misunderstood the mpiJava specification and I must create a special
> MPI object from my Java object: How do I create it? Thank you very much
> for any help in advance.


You sent me a source code listing off-list, but I want to reply on-list for a 
few reasons:

1. We actively do want feedback on these Java bindings
2. There is some documentation about this, but these bindings *are* different 
than the C bindings (and Java just behaves differently from C), so it's 
worth documenting here in a Google-able location

I attached the source code listing you sent.

It is closer to correct, but I still don't think it's quite right.  There are two 
issues here:

1. I have no idea how Java stores 2D arrays of doubles.  I.e., you're using 
"double matrix[][]".  I don't know if all P*Q values are stored contiguously in 
memory (or, more specifically, if the Java language *guarantees* that that will 
always be so).

2. Your MPI vector is almost right, but there's a subtle issue about MPI 
vectors that you're missing.



Because of #1, I changed your program to use matrix[], and have it malloc a 
single P*Q array.  Then I always accessed the {i,j} element via matrix[i * Q + 
j].  In this way, Java seems to keep all the values contiguously in memory.

That leads to this conversion of your program:

-
import mpi.*;
public class ColumnScatterMain {
    static final int P = 4;
    static final int Q = 6;
    static final int NUM_ELEM_PER_LINE = 6;

    public static void main (String args[]) throws MPIException,
                                                   InterruptedException
    {
        int      ntasks, mytid, i, j, tmp;
        double   matrix[], column[];
        Datatype column_t;

        MPI.Init (args);
        matrix = new double[P * Q];
        column = new double[P];
        mytid  = MPI.COMM_WORLD.Rank ();
        ntasks = MPI.COMM_WORLD.Size ();
        if (mytid == 0) {
            if (ntasks != Q) {
                System.err.println ("\n\nI need exactly " + Q +
                                    " processes.\n\n" +
                                    "Usage:\n" +
                                    "  mpiexec -np " + Q +
                                    " java \n");
            }
        }
        if (ntasks != Q) {
            MPI.Finalize ();
            System.exit (0);
        }
        column_t = Datatype.Vector (P, 1, Q, MPI.DOUBLE);
        column_t.Commit ();
        if (mytid == 0) {
            tmp = 1;
            System.out.println ("\nmatrix:\n");
            for (i = 0; i < P; ++i) {
                for (j = 0; j < Q; ++j) {
                    matrix[i * Q + j] = tmp++;
                    System.out.printf ("%10.2f", matrix[i * Q + j]);
                }
                System.out.println ();
            }
            System.out.println ();
        }
        MPI.COMM_WORLD.Scatter (matrix, 0, 1, column_t,
                                column, 0, P, MPI.DOUBLE, 0);
        Thread.sleep(1000 * mytid); // Sleep to get ordered output
        System.out.println ("\nColumn of process " + mytid + "\n");
        for (i = 0; i < P; ++i) {
            if (((i + 1) % NUM_ELEM_PER_LINE) == 0) {
                System.out.printf ("%10.2f\n", column[i]);
            } else {
                System.out.printf ("%10.2f", column[i]);
            }
        }
        System.out.println ();
        column_t.finalize ();
        MPI.Finalize();
    }
}
-

Notice that the output for process 0 after the scatter is correct -- it shows 
that it received values 1, 7, 13, 19 for its column.  But all other processes 
are wrong.

Why?

Because of #2.  Notice that process 1 got values 20, 0, 0, 0 (or, more 
specifically, 20, junk, junk, junk).

That's because the vector datatype you created ended right at element 19.  So 
it started the next vector (i.e., to send to process 1) at the next element -- 
element 20.  And then went on in the same memory pattern from there, but that 
was already beyond the end of the array.  

Go google a tutorial on MPI_Type_vector and you'll see what I mean.

In C or Fortran, the solution would be to use an MPI_TYPE_UB at the end of the 
vector to artificially make the "next" vector be at element 1 (vs. element 20). 
 By the description in 3.12, it looks like they explicitly disallowed this (or, 
I guess, they didn't implement LB/UB properly -- but MPI_LB and MPI_UB are 
deprecated in MPI-3.0, anyway).  But I think it could be done with 
MPI_TYPE_CREATE_RESIZED, which, unfortunately, doesn't look like it is 
implemented in these java bindings yet.
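
For what it's worth, here is roughly what that would look like in C (a sketch, assuming the matrix is stored row-major in a single contiguous block, as in the Java version above; the current Java bindings can't express this yet):

-
#include <mpi.h>
#include <stdio.h>

#define P 4   /* rows */
#define Q 6   /* columns; run with exactly Q processes */

int main(int argc, char **argv)
{
    int rank, i;
    double matrix[P * Q];   /* row-major P x Q, only filled on the root */
    double column[P];
    MPI_Datatype vec, column_t;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        for (i = 0; i < P * Q; ++i)
            matrix[i] = i + 1;

    /* P blocks of 1 double with stride Q: one column of the matrix. */
    MPI_Type_vector(P, 1, Q, MPI_DOUBLE, &vec);
    /* Resize the extent down to one double so the "next" column starts
       one element over instead of past the end of the first column. */
    MPI_Type_create_resized(vec, 0, sizeof(double), &column_t);
    MPI_Type_commit(&column_t);

    MPI_Scatter(matrix, 1, column_t, column, P, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d column: %g %g %g %g\n",
           rank, column[0], column[1], column[2], column[3]);

    MPI_Type_free(&column_t);
    MPI_Type_free(&vec);
    MPI_Finalize();
    return 0;
}
-

With the resized type, rank 1 gets 2, 8, 14, 20 instead of 20, junk, junk, junk.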

Make sense?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


ColumnScatterMain.java
Description: Binary data