On Wed, 15 Mar 2006, Allan Menezes wrote:
Dear Brian, I have the same setup as Mr. Chakrbarty: 16 nodes, Oscar 4.2.1
beta 4, and two Gigabit Ethernet cards per node, with two switches (16 and
24 port, one smart and the other managed). I use DHCP to assign the IP
addresses for one eth card (these range from 192.168.1.1 ... 16) and static
IP addresses of 192.168.5.1 ... 16 for the other NIC. On the first network
the MTU is 9000 for both the NICs and the switch. On the second the MTU is
1500 for both the switch and the NICs, as that switch cannot go beyond an
MTU of 1500. Using -mca btl tcp with the 192.168.1.1 ... 16 NICs included
and the 192.168.5.1 ... 16 NICs excluded, via -mca btl_tcp_if_include eth1
(MTU=9000) and -mca btl_tcp_if_exclude eth0 (MTU=1500), I get an HPL
performance of approximately 28.3 GFlops with both Open MPI and MPICH2.
But since, as you say above, including both gigabit cards with the option
-mca btl_tcp_if_include eth0,eth1 under Open MPI 1.1 (beta) or 1.0.1 should
increase performance for the same N and NB in HPL, I expected a gain;
instead I get a slight decrease of about 0.5 to 1 GFlops. The hostfile is
simply a1, a2 ... a16, using Oscar's DNS to resolve the names. Why is there
a performance decrease?
As both of the network devices are handled by the same BTL (our internal
name for the driver component), they both have the same priority. Let me
explain how exactly the fragmenting works. For small messages only one of
the devices is used. For messages above a certain size (usually first
fragment + max_frag_size), the rest of the data is split between the two
devices according to the device capabilities. Hint: what are the device
capabilities? Our algorithm is based on latency and bandwidth. As it is
difficult for Open MPI to compute them directly, the user should provide
the correct values if the two NICs don't have similar performance.
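To illustrate with the weights used later in this mail: if the two NICs are
weighted 30 and 70, a large message has its first fragment sent over one
device and the remaining bytes are striped roughly 30%/70% across the two
devices.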
It is clear that for best latency the faster of the two NICs should be
used, so you should give Open MPI a hint about which one that is. There is
a parameter for that called btl_tcp_latency_%device, where %device is the
name of your device. In a similar way you should indicate the bandwidth of
each NIC, so that Open MPI can correctly split the messages across all the
NICs (the parameter name is btl_tcp_bandwidth_%device).
Now let's take an example. You have two devices, eth0 and eth1. First of
all, you have to measure the latency and bandwidth of each of them (using
NetPipe). Once you have these 4 values, you add them to your
$(HOME)/.openmpi/mca-params.conf file:
btl_tcp_latency_eth0=30
btl_tcp_latency_eth1=40
and
btl_tcp_bandwidth_eth0=30
btl_tcp_bandwidth_eth1=70
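For reference, these per-NIC numbers typically come from a point-to-point
NetPipe TCP run between two nodes. A minimal sketch, assuming NetPipe's
NPtcp binary (host addresses are placeholders; the NIC being measured is
selected by using the address bound to it):
on node A (receiver):     NPtcp
on node B (transmitter):  NPtcp -h <address of node A on the NIC under test>
The latency is the reported half round-trip time for the smallest messages,
and the bandwidth is the peak reported throughput.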
Now there is one trick. While the latency is an absolute value, the
bandwidth is relative (to the total bandwidth), so you have to compute each
network's percentage of the total bandwidth. If, say, eth0 has a bandwidth
of 280 Mbps and eth1 a bandwidth of 580 Mbps, the correct values are:
btl_tcp_bandwidth_eth0 = (280*100)/(280+580), i.e. about 32
btl_tcp_bandwidth_eth1 = (580*100)/(280+580), i.e. about 100 - 32 = 68
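Putting these together, the example $(HOME)/.openmpi/mca-params.conf for
such a pair of NICs would look something like this (latency values from the
earlier illustration, bandwidth values from the computation above; all
numbers are illustrative, not measured):
btl_tcp_latency_eth0=30
btl_tcp_latency_eth1=40
btl_tcp_bandwidth_eth0=32
btl_tcp_bandwidth_eth1=68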
Now, once your two devices are correctly configured, run NetPipe again and
you will notice that the bandwidth increases. Of course, you have to
specify that you want to use both of them via "--mca btl_tcp_if_include
eth0,eth1".
george.
"We must accept finite disappointment, but we must never lose infinite
hope."
Martin Luther King
Dear George,
I did the above for a set of 4 nodes with the setup as per my
previous post and I got no increase in performance.
I have two gigabit NICs per node: one Realtek, with a latency of 27 us and
a bandwidth of approximately 460 Mbps measured with NetPipe, and one
D-Link, with a latency of 32 us and a bandwidth of 760 Mbps. Benchmarking
with HPL, with one GB of DDR RAM per node, I get 8.209 GFlops on 4 nodes
using eth1 (the D-Link cards), with jumbo MTU = 9000 everywhere and the
managed NETGEAR switch with jumbo frames enabled. The command I used was:
#> mpirun --prefix /opt/openmpi102 -hostfile a1234 -mca btl tcp -mca
btl_tcp_if_exclude eth0,eth2,eth3,lo -np 4 ./xhpl
The hostfile a1234 contains a1, a2, a3, a4, which resolve to 192.168.1.1
... 4.
The network topology: two gigabit NICs per node, a Realtek (MTU = 1500) and
a D-Link (MTU = 9000), on two subnets, the Realtek subnet 192.168.1.1 ... 4
and the D-Link subnet 192.168.5.1 ... 4, both with NETMASK = 255.255.255.0.
The D-Link subnet is connected to the managed, jumbo-frames-enabled gigabit
NETGEAR switch (model GS724T), while the Realtek subnet is connected to the
smart gigabit NETGEAR switch with a maximum MTU of 1500 (model GS116). The
switches are isolated and not interconnected, yet I can connect from
192.168.5.1 to 192.168.1.3 and to 192.168.5.4, so I have total
connectivity; I guess this happens because of the loopback on each machine.
I am running FC4 and Oscar 4.2.1 beta 4.
My mca-params.conf file in the $HOME/.openmpi/ directory looks something like:
btl_tcp_latency_eth0 = 27
btl_tcp_latency_eth1 = 32
btl_tcp_bandwidth_eth1 = 34
btl_tcp_bandwidth_eth0 = 66
Case 2) Measuring performance with HPL and Open MPI 1.0.2, with both eth
cards, using the following command:
#> mpirun --prefix /opt/openmpi102 -hostfile a1234 -mca btl tcp -mca
btl_tcp_if_exclude eth2,eth3,lo -np 4 ./xhpl
The head node on which mpirun was executed has 4 NICs: two gigabit cards
(Realtek and Sk98lin) and two Fast Ethernet 100 Mbps cards for internet
access and a network printer.
The mca-params.conf file looked like:
btl_tcp_latency_eth0 = 27
btl_tcp_latency_eth1 = 32
btl_tcp_bandwidth_eth1 = 0
btl_tcp_bandwidth_eth0 = 100
The performance I measured was 8.132 GFlops.
So no performance increase was found using two gigabit NICs instead of one
with Open MPI 1.0.2 beta.
The gigabit NICs are cheap cards that rely on the CPU for most of the
protocol processing, and they are 33 MHz, 32-bit PCI cards sitting in cheap
ASRock micro-ATX all-in-one motherboards. A 32-bit/33 MHz PCI bus tops out
at roughly 133 MB/s, which is about what a single gigabit card needs when
bursting, so one card can already saturate the shared PCI bus. These
motherboards have only two PCI slots (32-bit, 33 MHz), and the nodes are a
mix of Athlon, Duron, and P4: the head node is a P4, the rest are Athlons
plus 2 Durons.
So, is my analysis right? Have I made any mistakes in the network topology
or elsewhere? Why is there no performance increase with two gigabit cards?
Should the cards in a node be matched, e.g. both D-Link?
Thank you very much,
Best regards,
Allan Menezes