[O-MPI users] hpl and tcp

2005-11-15 Thread Allan Menezes

Hi Jeff,

Here are last night's results of the following command on my 15-node
cluster (one node is down from 16):
mpirun --mca pml teg --mca btl_tcp_if_include eth1,eth0 --hostfile aa
-np 15 ./xhpl
No errors were printed to stdout, unlike my previous post where I used
btl tcp with btl_tcp_if_include eth1.
Both tests (pml teg and btl tcp) ran to completion with no errors, but
the pml teg switch gives slightly (marginally) better performance.

Here is the HPL1.out output for the above mpirun command:



HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004

Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK


An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25920
NB     :     120
PMAP   : Row-major process mapping
P      :       3
Q      :       5
PFACT  :    Left    Crout    Right
NBMIN  :      24
NDIV   :       2
RFACT  :    Left    Crout    Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words



- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
  1) ||Ax-b||_oo / ( eps * ||A||_1  * N)
  2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
  3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be  1.110223e-16
- Computational tests pass if scaled residuals are less than   16.0


T/V                N    NB     P     Q               Time             Gflops

WR00L2L2       25920   120     3     5             529.79          2.192e+01

||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0128599 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0185612 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0037747 .. PASSED

T/V                N    NB     P     Q               Time             Gflops

WR00L2L4       25920   120     3     5             532.56          2.180e+01

||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0117992 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0170302 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0034634 .. PASSED

T/V                N    NB     P     Q               Time             Gflops

WR00L2C2       25920   120     3     5             534.56          2.172e+01

||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.9834234 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 1.4194125 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.2886629 .. PASSED

T/V                N    NB     P     Q               Time             Gflops

WR00L2C4       25920   120     3     5             540.77          2.147e+01

||Ax-b||_oo / ( eps * ||A||_1  * N        ) = 0.0121362 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 0.0175166 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0035623 .. PASSED

T/V                N    NB     P     Q               Time             Gflops

WR00L2R2       25920   120     3     5             539.15          2.154e+01

||Ax-b||_oo / ( eps * ||A||_1  * N

Re: [O-MPI users] OMPI 1.0 rc6 --with-bproc errors

2005-11-15 Thread Jeff Squyres

Daryl --

I don't think that anyone directly replied to you, but I saw some 
commits fixing this yesterday (actually, they were already on the 
trunk; we forgot to bring them over to the v1.0 branch).  Could you 
give this morning's nightly snapshot tarball a whirl?
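
If it helps, the rebuild should look something like the following (a
sketch only -- the snapshot URL and filename below are placeholders, so
substitute the actual tarball name from the download page, and re-use
the rest of your original configure arguments):

  wget http://www.open-mpi.org/nightly/v1.0/openmpi-1.0rcXXXX.tar.gz   # placeholder filename
  tar zxf openmpi-1.0rcXXXX.tar.gz
  cd openmpi-1.0rcXXXX
  ./configure --prefix=/opt/OpenMPI/openmpi-1.0rc6/ib \
      --with-bproc --with-mvapi=/opt/IB/ibgd-1.8.0/driver/infinihost \
      [...your other configure options...]
  make all install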



On Nov 14, 2005, at 10:30 AM, Daryl W. Grunau wrote:

Hi, I'm trying to compile rc6 on my BProc cluster and it failed to
build as follows:

make[4]: Entering directory 
`/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls/bproc'

depbase=`echo pls_bproc.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
if /bin/sh ../../../../libtool --tag=CC --mode=compile gcc 
-DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include  
-I/usr/src/redhat/BUILD/openmpi-1.0rc6/src/include  
-DORTE_BINDIR="\"/opt/OpenMPI/openmpi-1.0rc6/ib/bin\"" 
-I../../../../include -I../../../.. -I../../../.. 
-I../../../../include -I../../../../opal -I../../../../orte 
-I../../../../ompi -DNDEBUG -O2 -g -pipe -m64 -fno-strict-aliasing 
-pthread -MT pls_bproc.lo -MD -MP -MF "$depbase.Tpo" -c -o 
pls_bproc.lo pls_bproc.c; \
then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; 
exit 1; fi
 gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include 
-I../../../../include 
-I/usr/src/redhat/BUILD/openmpi-1.0rc6/src/include 
-DORTE_BINDIR=\"/opt/OpenMPI/openmpi-1.0rc6/ib/bin\" 
-I../../../../include -I../../../.. -I../../../.. 
-I../../../../include -I../../../../opal -I../../../../orte 
-I../../../../ompi -DNDEBUG -O2 -g -pipe -m64 -fno-strict-aliasing 
-pthread -MT pls_bproc.lo -MD -MP -MF .deps/pls_bproc.Tpo -c 
pls_bproc.c -o pls_bproc.o

pls_bproc.c: In function `orte_pls_bproc_node_array':
pls_bproc.c:122: error: structure has no member named `node_name'
pls_bproc.c:123: error: structure has no member named `node_name'
pls_bproc.c:140: error: structure has no member named `node_name'
make[4]: *** [pls_bproc.lo] Error 1
make[4]: Leaving directory 
`/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls/bproc'

make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory 
`/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls'

make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory 
`/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte'
make: *** [all-recursive] Error 1


Here's the full configure line (from config.log):

  $ ./configure --build=x86_64-redhat-linux-gnu 
--host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu 
--program-prefix= --prefix=/opt/OpenMPI/openmpi-1.0rc6/ib 
--exec-prefix=/opt/OpenMPI/openmpi-1.0rc6/ib 
--bindir=/opt/OpenMPI/openmpi-1.0rc6/ib/bin 
--sbindir=/opt/OpenMPI/openmpi-1.0rc6/ib/sbin --sysconfdir=/etc 
--datadir=/opt/OpenMPI/openmpi-1.0rc6/ib/share 
--includedir=/opt/OpenMPI/openmpi-1.0rc6/ib/include 
--libdir=/opt/OpenMPI/openmpi-1.0rc6/ib/lib64 
--libexecdir=/opt/OpenMPI/openmpi-1.0rc6/ib/libexec 
--localstatedir=/var 
--sharedstatedir=/opt/OpenMPI/openmpi-1.0rc6/ib/com 
--mandir=/usr/share/man --infodir=/usr/share/info 
--prefix=/opt/OpenMPI/openmpi-1.0rc6/ib 
--sysconfdir=/opt/OpenMPI/openmpi-1.0rc6/ib/etc --disable-shared 
--enable-static --with-bproc 
--with-mvapi=/opt/IB/ibgd-1.8.0/driver/infinihost



Any advice appreciated!

Daryl



--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI users] HPL and TCP

2005-11-15 Thread Jeff Squyres

On Nov 14, 2005, at 8:21 PM, Allan Menezes wrote:


  I think the confusion was my fault, because --mca pml teg did not
produce errors and gave almost the same performance as MPICH2 v1.02p1.
The reason I cannot do what you suggest below is that the
.openmpi/mca-params.conf file, if I am not mistaken, would reside in my
NFS-shared home directory.


If I understand your setup, this should not be a problem.  See the Open 
MPI FAQ about setting MCA parameters -- this is just a different 
mechanism to do it instead of passing them in on the mpirun command 
line.  Having this file available on all nodes means that all nodes 
will get the same MCA parameters (which is typically what you want).
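
As a concrete (if hypothetical) sketch, a $HOME/.openmpi/mca-params.conf
using the parameters from your mpirun command would contain one
"name = value" pair per line:

  # ~/.openmpi/mca-params.conf -- read by every Open MPI process
  pml = teg
  btl_tcp_if_include = eth1,eth0

With that file visible on every node (e.g., via your NFS home
directory), you can drop the corresponding --mca arguments from the
mpirun command line.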



 I have installed a new 5.01 beta version of
OSCAR, and /home/allan is a shared directory of my head node where the
Open MPI installation resides [/home/allan/openmpi, with paths in the
.bash_profile and .bashrc files]. I would have to do 16 individual
installations of Open MPI, one on each node in /opt/openmpi, with the
mca-params file residing in there.


You can install Open MPI on NFS or on each local disk.  We haven't 
migrated this information to the Open MPI FAQ yet, but the issues are 
the same as those discussed in the LAM/MPI FAQ -- 
http://www.lam-mpi.org/faq/.  See the "Typical setup of LAM" section 
and the question "Do I need a common filesystem on each node?".
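
For what it's worth, an NFS-hosted install usually just needs the usual
environment setup on every node -- something like the following in each
user's shell startup file (a sketch, assuming the /home/allan/openmpi
prefix you mentioned):

  # in ~/.bashrc / ~/.bash_profile on every node
  export PATH=/home/allan/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/home/allan/openmpi/lib:$LD_LIBRARY_PATH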



 Tell me if I am wrong. I might have
to do this, as this is a heterogeneous cluster with different brands of
ethernet cards and CPUs.


You might actually be ok -- Open MPI won't care which TCP cards you have 
because we just use the OS TCP stack.  Different CPUs *might* be a 
problem, but it depends on how you compile Open MPI.  Having different 
Linux distros/versions can definitely be a problem because you may have 
different versions of glibc across your nodes, etc.  I'm guessing 
that's homogeneous, though, since you're using OSCAR.  So having an 
NFS-installed Open MPI might be ok.


Check out the LAM/MPI FAQ questions on heterogeneity, too -- the issues 
are pretty much the same for Open MPI.


But it's a good test bed, and I have no problems installing OSCAR 4.2
on it.

See my later post, "Hpl and TCP", from today, where I tried ob1 (i.e.,
without --mca pml teg and so on) and got good performance with 15 nodes
and Open MPI rc6.

Thank you very much,
Regards,
Allan


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/



Re: [O-MPI users] hpl and tcp

2005-11-15 Thread Jeff Squyres

On Nov 15, 2005, at 4:10 AM, Allan Menezes wrote:


 Here are last night's results of the following command on my 15-node
cluster (one node is down from 16):
mpirun --mca pml teg --mca btl_tcp_if_include eth1,eth0 --hostfile aa
-np 15 ./xhpl


TEG does not use the BTLs; that's why you got no errors.

TEG is the prior-generation point-to-point PML component; it uses the 
PTL components.  OB1 is the next-generation point-to-point PML 
component; it uses the BTL components.
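
So to exercise the TCP BTL (and make btl_tcp_if_include meaningful),
select OB1 instead -- something along these lines, re-using your
hostfile and binary (a sketch, not a tested command line):

  mpirun --mca pml ob1 --mca btl tcp,self \
      --mca btl_tcp_if_include eth1,eth0 --hostfile aa -np 15 ./xhpl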


--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/