[O-MPI users] hpl and tcp
Hi Jeff,

Here are last night's results of the following command on my 15-node cluster (one node is down from 16):

   mpirun --mca pml teg --mca btl_tcp_if_include eth1,eth0 --hostfile aa -np 15 ./xhpl

No errors were spewed out to stdout this time, unlike in my previous post where I used btl tcp with btl_tcp_if_include eth1. However, both tests -- pml teg and btl tcp -- ran to completion with no errors, but the pml teg switch gives slightly (marginally) better performance.

Here is HPL1.out for the mpirun command listed above:

HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25920
NB     :     120
PMAP   : Row-major process mapping
P      :       3
Q      :       5
PFACT  :    Left    Crout    Right
NBMIN  :       2        4
NDIV   :       2
RFACT  :    Left    Crout    Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

T/V               N    NB     P     Q              Time            Gflops
--------------------------------------------------------------------------
WR00L2L2      25920   120     3     5            529.79         2.192e+01
--------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =  0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =  0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =  0.0037747 ...... PASSED

T/V               N    NB     P     Q              Time            Gflops
--------------------------------------------------------------------------
WR00L2L4      25920   120     3     5            532.56         2.180e+01
--------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =  0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =  0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =  0.0034634 ...... PASSED

T/V               N    NB     P     Q              Time            Gflops
--------------------------------------------------------------------------
WR00L2C2      25920   120     3     5            534.56         2.172e+01
--------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =  0.9834234 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =  1.4194125 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =  0.2886629 ...... PASSED

T/V               N    NB     P     Q              Time            Gflops
--------------------------------------------------------------------------
WR00L2C4      25920   120     3     5            540.77         2.147e+01
--------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =  0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =  0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =  0.0035623 ...... PASSED

T/V               N    NB     P     Q              Time            Gflops
--------------------------------------------------------------------------
WR00L2R2      25920   120     3     5            539.15         2.154e+01
--------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N
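For reference, the parameter values reported above correspond to an HPL.dat input file along the following lines. This is a sketch reconstructed from the reported values, not the actual file used for the run; the output file name (HPL1.out) and the NBMIN values 2 and 4 (inferred from the WR00L2L2/WR00L2L4 variant codes) are assumptions.

   HPLinpack benchmark input file
   Innovative Computing Laboratory, University of Tennessee
   HPL1.out         output file name (if any)
   1                device out (6=stdout,7=stderr,file)
   1                # of problems sizes (N)
   25920            Ns
   1                # of NBs
   120              NBs
   0                PMAP process mapping (0=Row-,1=Column-major)
   1                # of process grids (P x Q)
   3                Ps
   5                Qs
   16.0             threshold
   3                # of panel fact
   0 1 2            PFACTs (0=left, 1=Crout, 2=Right)
   2                # of recursive stopping criterium
   2 4              NBMINs (>= 1)
   1                # of panels in recursion
   2                NDIVs
   3                # of recursive panel fact.
   0 1 2            RFACTs (0=left, 1=Crout, 2=Right)
   1                # of broadcast
   0                BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
   1                # of lookahead depth
   0                DEPTHs (>=0)
   2                SWAP (0=bin-exch,1=long,2=mix)
   64               swapping threshold
   0                L1 in (0=transposed,1=no-transposed) form
   0                U  in (0=transposed,1=no-transposed) form
   1                Equilibration (0=no,1=yes)
   8                memory alignment in double (> 0)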
Re: [O-MPI users] OMPI 1.0 rc6 --with-bproc errors
Daryl --

I don't think that anyone directly replied to you, but I saw some commits fixing this yesterday (actually, they were already on the trunk; we forgot to bring them over to the v1.0 branch). Could you give this morning's nightly snapshot tarball a whirl?

On Nov 14, 2005, at 10:30 AM, Daryl W. Grunau wrote:

> Hi, I'm trying to compile rc6 on my BProc cluster and it failed to build as follows:
>
> make[4]: Entering directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls/bproc'
> depbase=`echo pls_bproc.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
> if /bin/sh ../../../../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I/usr/src/redhat/BUILD/openmpi-1.0rc6/src/include -DORTE_BINDIR="\"/opt/OpenMPI/openmpi-1.0rc6/ib/bin\"" -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -DNDEBUG -O2 -g -pipe -m64 -fno-strict-aliasing -pthread -MT pls_bproc.lo -MD -MP -MF "$depbase.Tpo" -c -o pls_bproc.lo pls_bproc.c; \
> then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi
> gcc -DHAVE_CONFIG_H -I. -I. -I../../../../include -I../../../../include -I/usr/src/redhat/BUILD/openmpi-1.0rc6/src/include -DORTE_BINDIR=\"/opt/OpenMPI/openmpi-1.0rc6/ib/bin\" -I../../../../include -I../../../.. -I../../../.. -I../../../../include -I../../../../opal -I../../../../orte -I../../../../ompi -DNDEBUG -O2 -g -pipe -m64 -fno-strict-aliasing -pthread -MT pls_bproc.lo -MD -MP -MF .deps/pls_bproc.Tpo -c pls_bproc.c -o pls_bproc.o
> pls_bproc.c: In function `orte_pls_bproc_node_array':
> pls_bproc.c:122: error: structure has no member named `node_name'
> pls_bproc.c:123: error: structure has no member named `node_name'
> pls_bproc.c:140: error: structure has no member named `node_name'
> make[4]: *** [pls_bproc.lo] Error 1
> make[4]: Leaving directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls/bproc'
> make[3]: *** [all-recursive] Error 1
> make[3]: Leaving directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca/pls'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte/mca'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/usr/src/redhat/BUILD/openmpi-1.0rc6/orte'
> make: *** [all-recursive] Error 1
>
> Here's the full configure line (from config.log):
>
> $ ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --target=x86_64-redhat-linux-gnu --program-prefix= --prefix=/opt/OpenMPI/openmpi-1.0rc6/ib --exec-prefix=/opt/OpenMPI/openmpi-1.0rc6/ib --bindir=/opt/OpenMPI/openmpi-1.0rc6/ib/bin --sbindir=/opt/OpenMPI/openmpi-1.0rc6/ib/sbin --sysconfdir=/etc --datadir=/opt/OpenMPI/openmpi-1.0rc6/ib/share --includedir=/opt/OpenMPI/openmpi-1.0rc6/ib/include --libdir=/opt/OpenMPI/openmpi-1.0rc6/ib/lib64 --libexecdir=/opt/OpenMPI/openmpi-1.0rc6/ib/libexec --localstatedir=/var --sharedstatedir=/opt/OpenMPI/openmpi-1.0rc6/ib/com --mandir=/usr/share/man --infodir=/usr/share/info --prefix=/opt/OpenMPI/openmpi-1.0rc6/ib --sysconfdir=/opt/OpenMPI/openmpi-1.0rc6/ib/etc --disable-shared --enable-static --with-bproc --with-mvapi=/opt/IB/ibgd-1.8.0/driver/infinihost
>
> Any advice appreciated!
>
> Daryl

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
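Trying the nightly snapshot amounts to repeating the same configure/make cycle with a fresh tarball. A minimal sketch, assuming the snapshot is fetched from the Open MPI nightly download area and reusing the key flags from Daryl's configure line; the tarball name and install prefix below are placeholders, not actual file names:

   # pick the current v1.0 snapshot from http://www.open-mpi.org/nightly/
   # (the r-number below is a placeholder)
   wget http://www.open-mpi.org/nightly/v1.0/openmpi-1.0rcX-rNNNN.tar.gz
   tar xzf openmpi-1.0rcX-rNNNN.tar.gz
   cd openmpi-1.0rcX-rNNNN
   ./configure --prefix=/opt/OpenMPI/openmpi-1.0-nightly/ib \
               --disable-shared --enable-static \
               --with-bproc \
               --with-mvapi=/opt/IB/ibgd-1.8.0/driver/infinihost
   make all install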
Re: [O-MPI users] HPL and TCP
On Nov 14, 2005, at 8:21 PM, Allan Menezes wrote:

> I think the confusion was my fault, because --mca pml teg did not produce errors and gave almost the same performance as MPICH2 v1.02p1. The reason why I cannot do what you suggest below is that the .openmpi/mca-params.conf file, if I am not mistaken, would reside in my home NFS share directory.

If I understand your setup, this should not be a problem. See the Open MPI FAQ about setting MCA parameters -- this is just a different mechanism for doing it instead of passing them in on the mpirun command line. Having this file available on all nodes means that all nodes will get the same MCA parameters (which is typically what you want).

> I have installed a new 5.01 beta version of OSCAR, and /home/allan is a shared directory of my head node where the Open MPI installation resides [/home/allan/openmpi, with paths in the .bash_profile and .bashrc files]. I would have to do 16 individual installations of Open MPI, one on each node, under /opt/openmpi, with the mca-params file residing in there.

You can install Open MPI on NFS or on each local disk. We haven't migrated this information to the Open MPI FAQ yet, but the issues are the same as are discussed on the LAM/MPI FAQ -- http://www.lam-mpi.org/faq/. See the "Typical setup of LAM" section, and the question "Do I need a common filesystem on each node?".

> Tell me if I am wrong. I might have to do this, as this is a heterogeneous cluster with different brands of Ethernet cards and CPUs.

You might actually be OK -- Open MPI won't care what TCP cards you have because we just use the OS TCP stack. Different CPUs *might* be a problem, but it depends on how you compile Open MPI. Having different Linux distros/versions can definitely be a problem because you may have different versions of glibc across your nodes, etc. I'm guessing that's homogeneous, though, since you're using OSCAR. So having an NFS-installed Open MPI might be OK. Check out the LAM/MPI FAQ questions on heterogeneity, too -- the issues are pretty much the same for Open MPI.

> But it's a good test bed, and I have no problems installing OSCAR 4.2 on it. See my later post "Hpl and TCP" today, where I tried ob1 (i.e., without --mca pml teg) and so on and got good performance with 15 nodes and Open MPI rc6. Thank you very much.
>
> Regards,
> Allan

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
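As a concrete illustration of the file-based mechanism described above: MCA parameters go one per line, in "name = value" form, in $HOME/.openmpi/mca-params.conf (or in the system-wide openmpi-mca-params.conf under the installation's etc/ directory). A minimal sketch using the interface selection from this thread; the exact values are only an example, not a recommendation:

   # $HOME/.openmpi/mca-params.conf
   # Equivalent to passing "--mca btl_tcp_if_include eth1,eth0" on the
   # mpirun command line; because the home directory is NFS-shared,
   # every node reads the same settings.
   btl_tcp_if_include = eth1,eth0

With the parameter in this file, the plain "mpirun --hostfile aa -np 15 ./xhpl" picks it up automatically.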
Re: [O-MPI users] hpl and tcp
On Nov 15, 2005, at 4:10 AM, Allan Menezes wrote:

> Here are last night's results of the following command on my 15-node cluster. One node is down from 16.
>
> mpirun --mca pml teg --mca btl_tcp_if_include eth1,eth0 --hostfile aa -np 15 ./xhpl

TEG does not use the BTLs; that's why you got no errors. TEG is the prior-generation point-to-point PML component; it uses the PTL components. OB1 is the next-generation point-to-point PML component; it uses the BTL components.

-- 
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
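In other words, a BTL parameter such as btl_tcp_if_include only takes effect when the OB1 PML (the default) is in use; under TEG it is silently ignored. A sketch of the two invocations being contrasted in this thread (the explicit "--mca btl tcp,self" selection is optional and shown here only for clarity):

   # OB1 (default PML) + TCP BTL: btl_tcp_if_include is honored
   mpirun --mca btl tcp,self --mca btl_tcp_if_include eth1,eth0 \
          --hostfile aa -np 15 ./xhpl

   # TEG PML + TCP PTL: BTL parameters such as btl_tcp_if_include are ignored
   mpirun --mca pml teg --hostfile aa -np 15 ./xhpl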