Date: Wed, 2 Nov 2005 17:28:33 -0500
From: Jeff Squyres <jsquy...@open-mpi.org>
Subject: Re: [O-MPI users] HPL and OpenMpi RC3
To: Open MPI Users <us...@open-mpi.org>

Allan --

We have been unable to reproduce this bad TCP performance behavior. Indeed, in our runs, TEG TCP is performing slower than OB1 TCP.

Sidenote: is there any reason you're supplying the pls_rsh_orted MCA parameter on the command line? It shouldn't really be necessary if OMPI is in your PATH (although you may need to add it to your PATH in your shell startup files, or use the --prefix option -- see http://www.open-mpi.org/faq/?category=running#adding-ompi-to-path and http://www.open-mpi.org/faq/?category=running#mpirun-prefix).
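
For example, something like this (just a sketch, using the install path and hostfile from your earlier mail) should let mpirun find orted on the remote nodes without the pls_rsh_orted parameter:

   mpirun --prefix /home/allan/openmpi -hostfile aa -np 16 ./xhpl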

Some followup questions:

1. Do you only have one TCP NIC on each node?
2. Are you running HPL at a size that is not going to thrash your memory? (I'm guessing not, since the teg runs were OK, but just to be sure...)
3. Is anyone else running on these nodes at the same time? (Again, I'm assuming no, but just want to be sure.)
4. Can you try this again with the latest v1.0 snapshot? (http://www.open-mpi.org/nightly/v1.0/)

Thanks!

Hi Jeff,
Answers to the above questions:
1. No, I have four NICs on the head node and two on each of the 15 compute nodes. The compute nodes use Realtek 8169 gigabit Ethernet cards (eth1, or eth0 on one node) connected to a gigabit Ethernet switch with a bisection bandwidth of 32 Gbps, and the head node uses a built-in 3Com gigabit Ethernet NIC (sk98lin driver) as eth3. The other 10/100M Ethernet cards on the head node handle a network laser printer (eth0) and internet access (eth2); eth1 is a spare 10/100M card which I can remove. Each compute node has two Ethernet cards: a built-in 10/100 Mbps port not connected to anything, and a PCI Realtek 8169 gigabit card connected to the gigabit TCP LAN.
When I ran without the "-mca pml teg" switches, the maximum performance I could get was 9 GFlops for P=4, Q=4, N of roughly 12,000-16,000, and a ridiculously low block size of NB=10. If I tried bigger block sizes it would run for a long time at large N (~16,000) unless I killed xhpl. I use ATLAS BLAS 3.7.11 libraries compiled for each node and linked to HPL when building xhpl, and I use Open MPI's mpicc in the HPL makefile for both compiling and linking.
Maybe, according to the new FAQ, I should use the TCP interface switch to make the head node use eth3 (see the sketch after these answers)?
2. I have 512 MB of memory per node, which is 8 GB total, so I can safely go up to N=22,000-24,000 (an N=24,000 double-precision matrix needs about 8 x 24,000^2 bytes, roughly 4.6 GB, well under the 8 GB total). I used N=22,000 for teg over TCP and did not run into problems, but if I do not specify the switches suggested by Tim I get bad performance at N=12,000.
3. No, just me.
4. My cluster is an experimental basement cluster (BSquared = Brampton Beowulf) built out of x86 machines: six Durons, two Athlons, two Semprons, two P4s, two 64-bit x86_64 AMD Athlon 64s, and two AMD x86_64 Semprons (socket 754), for a total of 16 machines running FC3 and the OSCAR beta cluster software. I have not tried it with the latest Open MPI snapshot yet, but I will tonight. I think I should reinstall FC3 on the head node (a P4 2.8 GHz), reinstall all the compute nodes with the Nov 3, 2005 OSCAR beta and today's (Nov 3, 2005) Open MPI 1.0 snapshot, and try again; I could have made an error somewhere before, and it should not take me long. But I doubt it, since MPICH2 and Open MPI with the "pml teg" switches give good, comparable performance. I was not using jumbo MTU frames either, just 1500 bytes. BSquared is not homogeneous, but it is a good test setup. If you have any advice, please tell me and I will try it out.
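
As for the sketch I mentioned in answer 1, this is the sort of command I had in mind from the FAQ (assuming btl_tcp_if_include is the right parameter, that it only affects the ob1/BTL path, and that the interface list would need adjusting for the compute nodes whose gigabit card shows up as eth0):

   mpirun --mca btl_tcp_if_include eth1,eth3 -hostfile aa -np 16 ./xhpl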
Thank you and good luck!
Allan




On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:


On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:


We've done Linpack runs recently w/ Infiniband, which result in performance comparable to MVAPICH, but not w/ the TCP port. Can you try running w/ an earlier version, specifying on the command line:

-mca pml teg
Hi Tim,
  I tried the same cluster (16-node x86) with the "-mca pml teg" switches and I get good performance of 24.52 GFlops at N=22,500 and block size NB=120.
My command line now looks like:
a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml teg -hostfile aa -np 16 ./xhpl
hostfile = aa, containing the addresses of the 16 machines.
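(The hostfile is just one machine name per line; with hypothetical names it looks something like:

   a1
   a2
   ...
   a16

and I believe "slots=N" can be appended for nodes that should run more than one process.)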
I am using a GS116 16-port Netgear gigabit Ethernet switch with Gnet Realtek gigabit Ethernet cards.
Why, PLEASE, do these "pml teg" switches make such a difference? It's 2.6 times more performance in GFlops than what I was getting without them.
I tried version rc3 and not an earlier version.
Thank you very much for your assistance!


Sorry for the delay in replying to this...

The "pml teg" switch tells Open MPI to use the 2nd generation TCP
implementation rather than the 3rd generation TCP.  More specifically,
the "PML" is the point-to-point management layer.  There are 2
different components for this -- teg (2nd generation) and ob1 (3rd
generation).  "ob1" is the default; specifying "--mca pml teg" tells
Open MPI to use the "teg" component instead of ob1.

Note, however, that teg and ob1 know nothing about TCP -- it's the 2nd
order implications that make the difference here.  teg and ob1 use
different back-end components to talk across networks:

- teg uses PTL components (point-to-point transport layer -- 2nd gen)
- ob1 uses BTL components (byte transfer layer -- 3rd gen)

We obviously have TCP implementations for both the PTL and BTL.
Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
Unfortunately, as yet, little time has been spent optimizing the TCP
BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
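
To make that concrete, here's a sketch using the same command-line form as above (hostfile and process count taken from your mail; leaving out "-mca pml" entirely gives you ob1, the default):

   # 3rd generation: ob1 PML over the TCP BTL (the default)
   mpirun -mca pml ob1 -hostfile aa -np 16 ./xhpl

   # 2nd generation: teg PML over the TCP PTL
   mpirun -mca pml teg -hostfile aa -np 16 ./xhpl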

We have spent the majority of our time, so far, optimizing the Myrinet
and Infiniband BTLs (therefore showing that excellent performance is
achievable in the BTLs).  However, I'm quite disappointed by the TCP
BTL performance -- it sounds like we have a protocol mismatch that is
arbitrarily slowing everything down, and something that needs to be
fixed before 1.0 (it's not a problem with the BTL design, since IB and
Myrinet performance is quite good -- just a problem/bug in the TCP
implementation of the BTL).  That much performance degradation is
clearly unacceptable.

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/




