On Wed, 15 Mar 2006, Allan Menezes wrote:
Dear Brian, I have the same setup as Mr. Chakrbarty: 16 nodes, Oscar 4.2.1
beta 4, and two Gigabit Ethernet cards per node, with two switches (16 and
24 port, one smart and the other managed). I use DHCP to assign the IP
addresses for one eth card (these range from 192.168.1.1 ... 16) and static
IP addresses of 192.168.5.1 ... 16 for the other NIC. On the first network
the MTU is 9000 for both the NICs and the switch. On the second the MTU is
1500 for both the switch and the NICs, as that switch cannot go beyond an
MTU of 1500. Using -mca btl tcp with the 192.168.1.1 ... 16 NICs included
and the 192.168.5.1 ... 16 NICs excluded, via -mca btl_tcp_if_include eth1
(MTU=9000) and -mca btl_tcp_if_exclude eth0 (MTU=1500), I get an HPL
performance of approximately 28.3 GFlops with both Open MPI and MPICH2.
But since, as you say above, including both gigabit cards with the option
-mca btl_tcp_if_include eth0,eth1 under Open MPI 1.1 (beta) or 1.0.1 should
increase performance for the same N and NB in HPL, I expected a gain;
instead I get a slight decrease of about 0.5 to 1 GFlops. The hostfile is
simply a1, a2 ... a16, using Oscar's DNS to resolve the names. Why is there
a performance decrease?
As both of the network devices are handled by the same BTL (our internal
name for the driver component), they both have the same priority. Let me
explain how exactly the fragmenting works. For small messages only one of
the devices is used. For messages above a certain size (usually first
fragment + max_frag_size), the rest of the data is split between the two
devices according to the device capabilities. Hint: what are the device
capabilities? Our algorithm is based on latency and bandwidth. As it is
difficult for Open MPI to compute them directly, the user should provide
the correct values if the two NICs don't have similar performance.
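To illustrate with the weights used later in this mail: if the two NICs are
weighted 30 and 70, a large message has its first fragment sent over one
device and the remaining bytes are striped roughly 30%/70% across the two
devices.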
It is clear that for best latency the faster of the two NICs should be
used, so you should give Open MPI a hint about which one that is. There is
a parameter for that called btl_tcp_latency_%device, where %device is the
name of your device. In a similar way you should indicate the bandwidth of
each NIC, so that Open MPI can correctly split the messages across all the
NICs (the parameter name is btl_tcp_bandwidth_%device).
Now let's take an example. You have two devices, eth0 and eth1. First of
all, you have to measure the latency and bandwidth of each of them (using
NetPipe). Once you have these 4 values, you add them to your
$(HOME)/.openmpi/mca-params.conf file:
btl_tcp_latency_eth0=30
btl_tcp_latency_eth1=40
and
btl_tcp_bandwidth_eth0=30
btl_tcp_bandwidth_eth1=70
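For reference, these per-NIC numbers typically come from a point-to-point
NetPipe TCP run between two nodes. A minimal sketch, assuming NetPipe's
NPtcp binary (host addresses are placeholders; the NIC being measured is
selected by using the address bound to it):
on node A (receiver):     NPtcp
on node B (transmitter):  NPtcp -h <address of node A on the NIC under test>
The latency is the reported half round-trip time for the smallest messages,
and the bandwidth is the peak reported throughput.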
Now there is one trick. While the latency is an absolute value, the
bandwidth is relative (to the total bandwidth), so you have to compute each
network's percentage of the total bandwidth. If, say, eth0 has a bandwidth
of 280 Mbps and eth1 a bandwidth of 580 Mbps, the correct values are:
btl_tcp_bandwidth_eth0 = (280*100)/(280+580), i.e. about 32
btl_tcp_bandwidth_eth1 = (580*100)/(280+580), i.e. about 100 - 32 = 68
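Putting these together, the example $(HOME)/.openmpi/mca-params.conf for
such a pair of NICs would look something like this (latency values from the
earlier illustration, bandwidth values from the computation above; all
numbers are illustrative, not measured):
btl_tcp_latency_eth0=30
btl_tcp_latency_eth1=40
btl_tcp_bandwidth_eth0=32
btl_tcp_bandwidth_eth1=68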
Now, once your two devices are correctly configured, run NetPipe again and
you will notice that the bandwidth increases. Of course, you have to
specify that you want to use both of them via "--mca btl_tcp_if_include
eth0,eth1".
george.
"We must accept finite disappointment, but we must never lose infinite
hope."
Martin Luther King
Dear George,
I did the above for a set of 4 nodes with the setup as per my
previous post and I got no increase in performance.
I have two gigabit NICs per node: one Realtek, with a latency of 27 us and
a bandwidth of approximately 460 Mbps measured with NetPipe, and one
D-Link, with a latency of 32 us and a bandwidth of 760 Mbps. Benchmarking
with HPL, with one GB of DDR RAM per node, I get 8.209 GFlops on 4 nodes
using eth1 (the D-Link cards), with jumbo MTU = 9000 everywhere and the
managed NETGEAR switch with jumbo frames enabled. The command I used was:
#> mpirun --prefix /opt/openmpi102 -hostfile a1234 -mca btl tcp -mca
btl_tcp_if_exclude eth0,eth2,eth3,lo -np 4 ./xhpl
The hostfile a1234 contains a1, a2, a3, a4, which resolve to 192.168.1.1
... 4.
The network topology: two gigabit NICs per node, a Realtek (MTU = 1500) and
a D-Link (MTU = 9000), on two subnets, the Realtek subnet 192.168.1.1 ... 4
and the D-Link subnet 192.168.5.1 ... 4, both with NETMASK = 255.255.255.0.
The D-Link subnet is connected to the managed, jumbo-frames-enabled gigabit
NETGEAR switch (model GS724T), while the Realtek subnet is connected to the
smart gigabit NETGEAR switch with a maximum MTU of 1500 (model GS116). The
switches are isolated and not interconnected, yet I can connect from
192.168.5.1 to 192.168.1.3 and to 192.168.5.4, so I have total
connectivity; I guess this happens because of the loopback on each machine.
I am running FC4 and Oscar 4.2.1 beta 4.
My mca-params.conf file in the $HOME/.openmpi/ directory looks something like:
btl_tcp_latency_eth0 = 27
btl_tcp_latency_eth1 = 32
btl_tcp_bandwidth_eth1 = 34
btl_tcp_bandwidth_eth0 = 66
Case 2) Measuring performance with HPL and Open MPI 1.0.2, with both eth
cards, using the following command:
#> mpirun --prefix /opt/openmpi102 -hostfile a1234 -mca btl tcp -mca
btl_tcp_if_exclude eth2,eth3,lo -np 4 ./xhpl
The head node on which mpirun was executed has 4 NICs: two gigabit cards
(Realtek and Sk98lin) and two Fast Ethernet 100 Mbps cards for internet
access and a network printer.
The mca-params.conf file looked like:
btl_tcp_latency_eth0 = 27
btl_tcp_latency_eth1 = 32
btl_tcp_bandwidth_eth1 = 0
btl_tcp_bandwidth_eth0 = 100
The performance I measured was 8.132 GFlops.
So no performance increase was found using two gigabit NICs instead of one
with Open MPI 1.0.2 beta.
The gigabit NICs are cheap cards that rely on the CPU for most of the
protocol processing, and they are 33 MHz, 32-bit PCI cards sitting in cheap
ASRock micro-ATX all-in-one motherboards. A 32-bit/33 MHz PCI bus tops out
at roughly 133 MB/s, which is about what a single gigabit card needs when
bursting, so one card can already saturate the shared PCI bus. These
motherboards have only two PCI slots (32-bit, 33 MHz), and the nodes are a
mix of Athlon, Duron, and P4: the head node is a P4, the rest are Athlons
plus 2 Durons.
So, is my analysis right? Have I made any mistakes in the network topology
or elsewhere? Why is there no performance increase with two gigabit cards?
Should the cards in a node be matched, e.g. both D-Link?
Thank you very much,
Best regards,
Allan Menezes