Re: [O-MPI users] HPL and TCP

George Bosilca Mon, 14 Nov 2005 16:10:46 -0500

Allan,

If there are 2 Ethernet cards, it's better to point Open MPI at the one you want to use. To do that, you can modify the .openmpi/mca-params.conf file in your home directory. All of the options can go in this file, so you will not have to specify them on the mpirun command line every time.

Here is a small example that contains the hostfile (from which Open MPI will pick the nodes) as well as the BTL configuration.


btl_base_include=tcp,sm,self
btl_tcp_if_include=eth0
rds_hostfile_path = /home/bosilca/.openmpi/machinefile

The first line specifies that Open MPI is allowed to use the TCP, shared memory, and self devices. Self should always be specified; otherwise any communication to the same process will fail (it's our loopback device).

The second line specifies that the TCP BTL is allowed to use only the eth0 interface. This line has to reflect your own configuration.


Finally, the third line gives the full path to the hostfile.
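
For a one-off run, the same MCA parameters can also be given on the mpirun command line with --mca instead of in the file. The following is only a sketch: build_mpirun_cmd is a hypothetical helper (not part of Open MPI), and the interface name, hostfile path, and process count are placeholders to replace with your own values. It just prints the command so it can be checked before running.

```shell
#!/bin/sh
# Sketch only: print an mpirun command line that mirrors the settings from
# mca-params.conf above. build_mpirun_cmd is a hypothetical helper, not part
# of Open MPI; all arguments are placeholders for your own configuration.
build_mpirun_cmd() {
    iface="$1"      # interface the TCP BTL may use, e.g. eth0
    hostfile="$2"   # path to the hostfile
    nprocs="$3"     # number of processes to start
    prog="$4"       # program to launch
    printf 'mpirun --mca btl_base_include tcp,sm,self --mca btl_tcp_if_include %s --hostfile %s -np %s %s\n' \
        "$iface" "$hostfile" "$nprocs" "$prog"
}

# Print the command for inspection; run it only after checking the values.
build_mpirun_cmd eth0 "$HOME/.openmpi/machinefile" 16 ./xhpl
```

Keeping the parameters in mca-params.conf is still the more convenient choice when every run on the cluster uses the same interface.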

  Thanks,
    george.



On Mon, 14 Nov 2005, Allan Menezes wrote:

Dear Jeff,

Sorry I could not test the cluster earlier, but I am having problems with one compute node (I will have to replace it!), so I will have to repeat this test with 15 nodes. Yes, I had 4 NIC cards on the head node, and it was only eth3 that was the gigabit NIC communicating with the other eth1 gigabit NICs on the compute nodes through a gigabit switch. So although I did not specify the ethernet interface with the switch --mca pml teg, I was getting good performance, but with --mca btl tcp, not specifying the interface seems to create problems. I wiped out the Linux FC3 installation and tried again with Oscar 4.2 but am having problems with the --mca btl tcp switch:

mpirun --mca btl tcp --prefix /home/allan/openmpi --hostfile aa -np 16 ./xhpl

The hostfile aa contains the 16 hosts a1.lightning.net to a16.lightning.net. So to recap, the cluster is only connected to itself through the gigabit 16-port switch through gigabit ethernet cards to form a LAN with an IP for each. There is an extra ethernet card built into the compute motherboards that is 10/100Mbps and is not connected to anything yet.

Can you please tell me the right mpirun command line for btl tcp for my setup? Is the hostfile right for the mpirun command above? Should it include a1.lightning.net, which is the head node from where I am invoking mpirun? Or should it not have the head node?

Thank you,
Allan

Message: 2
Date: Sun, 13 Nov 2005 15:51:30 -0500
From: Jeff Squyres <jsquy...@open-mpi.org>
Subject: Re: [O-MPI users] HPL anf TCP
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <f143e44670c59a2f345708e6e0fad...@open-mpi.org>
Content-Type: text/plain; charset=US-ASCII; format=flowed

On Nov 3, 2005, at 8:35 PM, Allan Menezes wrote:
1. No, I have 4 NICs on the head node and two on each of the 15 other compute nodes. I use the Realtek 8169 gigabit ethernet cards on the compute nodes as eth1 or eth0 (one only) connected to a gigabit ethernet switch with bisection bandwidth of 32Gbps and a sk98lin driver 3Com built-in gigabit ethernet NIC card on the head node (eth3). The other ethernet cards 10/100M on the head node handle a network laser printer (eth0) and eth2 (10/100M) internet access. Eth1 is a spare 10/100M which I can remove. The compute nodes each have two ethernet cards: one 10/100Mbps ethernet not connected to anything (built in to M/B) and a PCI Realtek 8169 gigabit ethernet connected to the TCP network LAN (Gigabit). When I tried it without the switches -mca pml teg the maximum performance I would get with it was 9GFlops for P=4 Q=4 N=approx 12-16 thousand and NB ridiculously low at 10 block size. If I tried bigger block sizes it would run for a long time for large N ~ 16,000 unless I killed xhpl. I use atlas BLAS 3.7.11 libs compiled for each node and linked to HPL when creating xhpl. I also use open mpi mpicc in the hpl make file for both compile and link. Maybe I should, according to the new faq, use the TCP switch to use eth3 on the head node?
So if I'm reading that right, there's only one network that connects the head node and the compute nodes, right?
That's right!
Allan
2. I have 512MB of memory per node which is 8 GB total, so I can safely go up to N=22,000 to 24,000. I used sizes of 22000 for TCP teg and did not run into problems. But if I do not specify the switches suggested by Tim I get bad performance for N = 12000.
I must admit that I'm still befuddled by this -- we are absolutely unable to duplicate this behavior. It *sounds* like there is some network mismatching going on in here -- that the tcp btl is somehow routing information differently than the tcp ptl (and therefore taking longer -- timing out and the like).
We did make some improvements to the tcp subnet mask matching code for rc5; I had to ask again, but could you try with the latest nightly snapshot tarball?
        http://www.open-mpi.org/nightly/v1.0/
I will try it in the near future if time permits with the latest 1.0 snapshot and report back.
I had to "re-image" my cluster so I have some more work to do.
Allan
4. My cluster is an experimental Basement Cluster [BSquared = Brampton Beowulf] built out of x86 Durons (6), 2 Athlons, 2 Semprons, two P4s, 2 64-bit x86_64 AMD64 Athlons and two AMD x86_64 Semprons (754 pin) for a total of 16 machines running FC3 and Oscar beta cluster software. I have not tried it with the latest open mpi snapshot yet but I will tonight. I think I should reinstall FC3 on the head node P4 2.8GHz and reinstall all the compute nodes with Oscar beta Nov 3, 2005 and open mpi of today's Nov 3, 2005 1.0 snapshot and try again. I could have made an error somewhere before. It should not take me long. But I doubt it as MPICH2 and open mpi with the switches pml teg give good comparable performance. I was not using jumbo MTU frames either, just 1500 bytes. It is not homogeneous (BSquared) but a good test set up.
If you have any advice, please tell me and I could try it out.
Thank you and good luck!
Allan





On Oct 27, 2005, at 10:19 AM, Jeff Squyres wrote:
> On Oct 19, 2005, at 12:04 AM, Allan Menezes wrote:
>
>
>> We've done linpack runs recently w/ Infiniband, which result in
>> performance comparable to mvapich, but not w/ the tcp port. Can you
>> try running w/ an earlier version, specify on the command line:
>>
>> -mca pml teg
>>
>> Hi Tim,
>>   I tried the same cluster (16 node x86) with the switches -mca pml
>> teg and I get good performance of 24.52Gflops at N=22500
>> and Block size NB=120.
>> My command line now looks like :
>> a1> mpirun -mca pls_rsh_orted /home/allan/openmpi/bin/orted -mca pml teg -hostfile aa -np 16 ./xhpl
>> hostfile = aa, containing the addresses of the 16 machines.
>> I am using a GS116 16 port netgear Gigabit ethernet switch with Gnet
>> realtek gig ethernet cards
>> Why, PLEASE, do these switches pml teg make such a difference? It's
>> 2.6 times more performance in GFlops than what I was getting without
>> them.
>> I tried version rc3 and not an earlier version.
>> Thank you very much for your assistance!
>>
>
> Sorry for the delay in replying to this...
>
> The "pml teg" switch tells Open MPI to use the 2nd generation TCP
> implementation rather than the 3rd generation TCP.  More specifically,
> the "PML" is the point-to-point management layer.  There are 2
> different components for this -- teg (2nd generation) and ob1 (3rd
> generation).  "ob1" is the default; specifying "--mca pml teg" tells
> Open MPI to use the "teg" component instead of ob1.
>
> Note, however, that teg and ob1 know nothing about TCP -- it's the 2nd
> order implications that make the difference here.  teg and ob1 use
> different back-end components to talk across networks:
>
> - teg uses PTL components (point-to-point transport layer -- 2nd gen)
> - ob1 uses BTL components (byte transfer layer -- 3rd gen)
>
> We obviously have TCP implementations for both the PTL and BTL.
> Considerable time was spent optimizing the TCP PTL (i.e., 2nd gen).
> Unfortunately, as yet, little time has been spent optimizing the TCP
> BTL (i.e., 3rd gen) -- it was a simple port, nothing more.
>
> We have spent the majority of our time, so far, optimizing the Myrinet
> and Infiniband BTLs (therefore showing that excellent performance is
> achievable in the BTLs).  However, I'm quite disappointed by the TCP
> BTL performance -- it sounds like we have a protocol mismatch that is
> arbitrarily slowing everything down, and something that needs to be
> fixed before 1.0 (it's not a problem with the BTL design, since IB and
> Myrinet performance is quite good -- just a problem/bug in the TCP
> implementation of the BTL).  That much performance degradation is
> clearly unacceptable.
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


"We must accept finite disappointment, but we must never lose infinite
hope."
                                  Martin Luther King
