Dear Jeff,

I reorganized my cluster and ran the following test with 15 nodes:

[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[allan@a1 bench]$

It printed the above errors, but I continued the test for two and a half hours, monitoring HPL.out. It gives a maximum of 21.77 GFlops for 15 nodes, which is not bad. I think the reason it printed those errors is that on the four x86-64 machines a13-16 the NICs connected to the gigabit LAN are eth0 and not eth1 like the rest. The head node is eth0. I removed one NIC from the head node to make things simpler to troubleshoot.

Here is HPL.out:

============================================================================
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --  January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25920
NB     :     120
PMAP   : Row-major process mapping
P      :       3
Q      :       5
PFACT  :    Left    Crout    Right
NBMIN  :       2        4
NDIV   :       2
RFACT  :    Left    Crout    Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L2       25920   120     3     5             534.25          2.173e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L4       25920   120     3     5             536.98          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C2       25920   120     3     5             540.73          2.147e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C4       25920   120     3     5             533.76          2.175e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2R2       25920   120     3     5             537.28          2.161e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2R4       25920   120     3     5             533.38          2.177e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2L2       25920   120     3     5             540.45          2.148e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2L4       25920   120     3     5             536.87          2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2C2       25920   120     3     5             533.98          2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2C4       25920   120     3     5             535.31          2.169e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2R2       25920   120     3     5             536.65          2.164e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00C2R4       25920   120     3     5             536.97          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2L2       25920   120     3     5             534.09          2.174e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2L4       25920   120     3     5             534.96          2.170e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117992 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0170302 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034634 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2C2       25920   120     3     5             536.73          2.163e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0128599 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0185612 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0037747 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2C4       25920   120     3     5             536.91          2.162e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0121362 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0175166 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0035623 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R2       25920   120     3     5             535.96          2.166e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0117731 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0169925 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0034557 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00R2R4       25920   120     3     5             536.16          2.165e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0109683 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0158310 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0032195 ...... PASSED
============================================================================

Finished     18 tests with the following results:
             18 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================

Here is the result of the test carried out with --mca btl tcp --mca btl_tcp_if_include eth1,eth0, which hangs:

[allan@a1 bench]$ mpirun -mca btl tcp --mca btl_tcp_if_include eth1,eth0 --prefix /home/allan/openmpi -hostfile aa -np 15 ./xhpl
[0,1,1][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,6][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,7][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,12][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,11][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,2][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,3][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,8][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,4][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,5][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,10][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"
[0,1,13][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,14][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth1"
[0,1,9][btl_tcp_component.c:342:mca_btl_tcp_component_create_instances] invalid interface "eth0"

Please tell me whether, if I connect all the 10/100 Mbps cards to a 10/100 Mbps switch alongside the gigabit network and specify --mca btl_tcp_if_include eth1,eth0 as before, the problem will go away and I might get increased bandwidth. I will be trying the same with the switches pml teg to see if there is a difference!

Thank you,
Allan

Message: 2
Date: Sun, 13 Nov 2005 15:51:30 -0500
From: Jeff Squyres <jsquy...@open-mpi.org>
Subject: Re: [O-MPI users] HPL anf TCP
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <f143e44670c59a2f345708e6e0fad...@open-mpi.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed

On Nov 3, 2005, at 8:35 PM, Allan Menezes wrote:

1. No, I have 4 NICs on the head node and two on each of the 15 other compute nodes. I use Realtek 8169 gigabit Ethernet cards on the compute nodes as eth1 or eth0 (one only), connected to a gigabit Ethernet switch with a bisection bandwidth of 32 Gbps, and a built-in 3Com gigabit Ethernet NIC (sk98lin driver) on the head node (eth3). The other 10/100M Ethernet cards on the head node handle a network laser printer (eth0) and Internet access (eth2). Eth1 is a spare 10/100M card which I can remove. The compute nodes each have two Ethernet cards: one 10/100 Mbps card built into the motherboard and not connected to anything, and a PCI Realtek 8169 gigabit Ethernet card connected to the gigabit TCP LAN. When I tried it without the switches -mca pml teg, the maximum performance I would get was 9 GFlops for P=4, Q=4, N = approx. 12-16 thousand, with NB ridiculously low at a block size of 10. If I tried bigger block sizes it would run for a long time for large N ~ 16,000 unless I killed xhpl. I use ATLAS BLAS 3.7.11 libs compiled for each node and linked to HPL when creating xhpl. I also use the Open MPI mpicc in the HPL makefile for both compile and link. Maybe I should, according to the new FAQ, use the TCP switch to use eth3 on the head node?

So if I'm reading that right, there's only one network that connects the head node and the compute nodes, right?


2. I have 512 MB of memory per node, which is 8 GB total, so I can safely go up to N = 22,000-24,000. I used sizes of 22000 for TCP teg and did not run into problems. But if I do not specify the switches suggested by Tim I get bad performance for N = 12000.
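As a rough check on the problem sizes mentioned above, the sketch below estimates what share of the cluster's memory an N x N double-precision matrix occupies, and picks the largest N (rounded down to a multiple of NB) that fits in a chosen share. The 16 x 512 MB total and NB=120 come from this thread; the function names and the 50% target are illustrative assumptions, not anything taken from HPL itself.

```python
import math

BYTES_PER_DOUBLE = 8


def matrix_fraction(n, total_bytes):
    """Share of total memory used by an n x n matrix of doubles."""
    return n * n * BYTES_PER_DOUBLE / total_bytes


def max_problem_size(total_bytes, fraction, nb):
    """Largest n (a multiple of nb) whose matrix fits in fraction * total_bytes."""
    n = math.isqrt(int(fraction * total_bytes / BYTES_PER_DOUBLE))
    return n - n % nb


# 16 nodes x 512 MiB each, as described in the thread.
total = 16 * 512 * 2**20

print(f"N=24000 uses {matrix_fraction(24000, total):.1%} of total memory")
print("largest N at 50% with NB=120:", max_problem_size(total, 0.5, 120))
```

The matrix is only part of what each node holds (the OS and MPI buffers need room too), so the target share is a judgment call rather than a fixed rule.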

I must admit that I'm still befuddled by this -- we are absolutely unable to duplicate this behavior. It *sounds* like there is some network mismatching going on in here -- that the tcp btl is somehow routing information differently than the tcp ptl (and therefore taking longer -- timing out and the like).

We did make some improvements to the tcp subnet mask matching code for rc5; I hate to ask again, but could you try with the latest nightly snapshot tarball?

        http://www.open-mpi.org/nightly/v1.0/
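For readers unfamiliar with the term, "subnet mask matching" means deciding whether two addresses belong to the same network under a given netmask, which is one way a TCP transport on a multi-NIC node could decide which local interface can reach a peer. The sketch below shows only that general idea using Python's standard ipaddress module; it is not Open MPI's code, and the addresses and the function name are made up for illustration.

```python
import ipaddress


def same_subnet(addr_a, addr_b, netmask):
    """Return True if addr_b falls inside addr_a's network under netmask."""
    # strict=False lets us pass a host address rather than a network address.
    network = ipaddress.ip_network(f"{addr_a}/{netmask}", strict=False)
    return ipaddress.ip_address(addr_b) in network


# Hypothetical addresses: two hosts on one LAN, and one on another.
print(same_subnet("192.168.1.5", "192.168.1.20", "255.255.255.0"))
print(same_subnet("192.168.1.5", "192.168.2.20", "255.255.255.0"))
```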

4. My cluster is an experimental basement cluster [BSquared = Brampton Beowulf] built out of x86 Durons (6), 2 Athlons, 2 Semprons, two P4s, 2 64-bit x86_64 AMD64 Athlons and two AMD x86_64 Semprons (754 pin), for a total of 16 machines running FC3 and OSCAR beta cluster software. I have not tried it with the latest Open MPI snapshot yet, but I will tonight. I think I should reinstall FC3 on the head node (P4 2.8 GHz), reinstall all the compute nodes with the OSCAR beta of Nov 3, 2005 and today's (Nov 3, 2005) Open MPI 1.0 snapshot, and try again. I could have made an error somewhere before. It should not take me long. But I doubt it, as MPICH2 and Open MPI with the switches pml teg give good, comparable performance. I was not using jumbo MTU frames either, just 1500 bytes. The cluster (BSquared) is not homogeneous, but it is a good test setup.
If you have any advice, please tell me and I will try it out.
Thank you and good luck!
Allan





On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:


> On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:
>
>

>> We've done linpack runs recently w/ Infiniband, which result in
>> performance comparable to mvapich, but not w/ the tcp port. Can you
>> try running w/ an earlier version, specify on the command line:
>>
>> -mca pml teg
>>
>> Hi Tim,
>> I tried the same cluster (16 node x86) with the switches -mca pml teg
>> and I get good performance of 24.52 Gflops at N=22500 and block size
>> NB=120.
>> My command line now looks like:
>> a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml teg -hostfile aa -np 16 ./xhpl
>> hostfile = aa, containing the addresses of the 16 machines.
>> I am using a GS116 16 port netgear Gigabit ethernet switch with Gnet
>> realtek gig ethernet cards
>> Why, PLEASE, do these switches pml teg make such a difference? It's
>> 2.6 times more performance in GFlops than what I was getting without
>> them.
>> I tried version rc3 and not an earlier version.
>> Thank you very much for your assistance!
>>

>
> Sorry for the delay in replying to this...
>
> The "pml teg" switch tells Open MPI to use the 2nd generation TCP
> implementation rather than the 3rd generation TCP. More specifically,
> the "PML" is the point-to-point management layer.  There are 2
> different components for this -- teg (2nd generation) and ob1 (3rd
> generation).  "ob1" is the default; specifying "--mca pml teg" tells
> Open MPI to use the "teg" component instead of ob1.
>
> Note, however, that teg and ob1 know nothing about TCP -- it's the 2nd
> order implications that make the difference here.  teg and ob1 use
> different back-end components to talk across networks:
>
> - teg uses PTL components (point-to-point transport layer -- 2nd gen)
> - ob1 uses BTL components (byte transfer layer -- 3rd gen)
>
> We obviously have TCP implementations for both the PTL and BTL.
> Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
> Unfortunately, as yet, little time has been spent optimizing the TCP
> BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
>
> We have spent the majority of our time, so far, optimizing the Myrinet
> and Infiniband BTLs (therefore showing that excellent performance is
> achievable in the BTLs).  However, I'm quite disappointed by the TCP
> BTL performance -- it sounds like we have a protocol mismatch that is
> arbitrarily slowing everything down, and something that needs to be
> fixed before 1.0 (it's not a problem with the BTL design, since IB and
> Myrinet performance is quite good -- just a problem/bug in the TCP
> implementation of the BTL).  That much performance degradation is
> clearly unacceptable.
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
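The pairing described in the quoted reply (teg with PTL components, ob1 with BTL components) can be written down as a small lookup, together with a helper that assembles an mpirun command line from switches that already appear in this thread. This is only a sketch: the helper name and its structure are hypothetical, and it does nothing beyond building an argument list.

```python
# PML component -> back-end transport layer, as described in the reply above.
PML_TRANSPORT = {
    "teg": "PTL",  # 2nd generation
    "ob1": "BTL",  # 3rd generation, the default
}


def mpirun_args(pml, hostfile, nprocs, program):
    """Build an mpirun argument list that selects the given PML component."""
    if pml not in PML_TRANSPORT:
        raise ValueError(f"unknown PML component: {pml!r}")
    return ["mpirun", "--mca", "pml", pml,
            "-hostfile", hostfile, "-np", str(nprocs), program]


for pml, transport in PML_TRANSPORT.items():
    print(f"{pml} uses {transport} components:",
          " ".join(mpirun_args(pml, "aa", 16, "./xhpl")))
```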





--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
