Hello, I have been dealing with a problem for quite a while now, ever since upgrading my Lenny servers that use bonding to Squeeze. It is a bit difficult to explain, so I will do my best.
Let me start by giving my bond config:

    iface bond0 inet static
        address x.x.x.x
        netmask 255.255.255.0
        network x.x.x.0
        broadcast x.x.x.255
        gateway x.x.x.1
        bond-slaves eth0 eth1
        bond-mode balance-alb
        bond-miimon 100
        bond-downdelay 200
        bond-updelay 200

From /proc/net/bonding/bond0:

    Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

    Bonding Mode: adaptive load balancing
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 200
    Down Delay (ms): 200

    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:XX:XX:66:98:34

    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:XX:XX:66:98:36

And the ifconfig output (abridged):

    bond0  Link encap:Ethernet  HWaddr 00:XX:XX:66:98:34
           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
    eth0   Link encap:Ethernet  HWaddr 00:XX:XX:66:98:34
           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
    eth1   Link encap:Ethernet  HWaddr 00:XX:XX:66:98:36
           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

The problem started immediately after rebooting from the upgrade (though in my troubleshooting I have been able to reproduce it with a fresh install of Squeeze): I was unable to reach my network. Testing with pings showed that roughly 40-50% of packets got through and the rest were lost. This was true for both routed traffic and traffic within the broadcast domain.

While troubleshooting I concluded that the issue could be MAC related, so I started watching tcpdump logs of ARP traffic to and from my server, testing against another server in the same subnet:

- When pings succeed, the remote server's ARP cache shows the MAC of eth0 (which is also the MAC owned by the bond0 device).
- When pings fail, the ARP cache shows the MAC of eth1 (not the MAC owned by the bond device).

I don't see the changing MAC as a problem in itself: since I am using balance-alb, I expect my server's MAC to flap between one slave and the other. But why are the pings failing?
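In case anyone wants to reproduce the observation, this is roughly how I watched the traffic (a sketch; adjust the interface names to your setup, and run as root):

```shell
# Watch ARP on each slave, printing the ethernet header (-e) so the
# source MAC used in replies is visible; -n skips name resolution.
tcpdump -e -n -i eth0 arp
tcpdump -e -n -i eth1 arp

# Watch the ICMP echo traffic on eth1 WITHOUT promiscuous mode (-p).
# With -p the pings keep failing; without -p (promiscuous on, the
# default) they start succeeding, which is the behavior described below.
tcpdump -n -p -i eth1 icmp
```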
I also noticed that eth1 seems to send ARP replies much more often than eth0 (almost exclusively), so I started tcpdumping the ICMP traffic on both ends. Packets destined for the bonded server leave the remote host with a valid MAC for eth1 in the destination, but still fail.

Then I noticed that while tcpdump is running on eth1, the pings go through! As soon as I stop it, they fail again. If I run tcpdump without putting eth1 into promiscuous mode, the pings continue to fail; if I enable promiscuous mode, they go through.

So my current conclusion is that, despite eth1 being in an alb configuration, it is for some reason dropping packets as though they were not destined for that interface, even though it does send ARP replies with its own MAC. Forcing the interface into promiscuous mode with ifconfig seems to temporarily resolve the issue. So does tearing the bond device down and reassembling it (ifdown and ifup for each slave). As soon as the server reboots, the problem starts over (unless I somewhere specify that eth1 should come up in promiscuous mode).

I hope this little narrative is enough for one of you to provide assistance; if not, please ask for any info you are missing. Thanks in advance, Pat
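For what it's worth, until the root cause is found, the promiscuous-mode workaround I described can be made to survive reboots from /etc/network/interfaces with a post-up hook (a sketch, assuming ifupdown's post-up option and the ip tool from iproute; this papers over the symptom, it is not a fix):

```shell
# /etc/network/interfaces (excerpt) -- workaround sketch only:
# force eth1 into promiscuous mode every time bond0 comes up.
iface bond0 inet static
    address x.x.x.x
    netmask 255.255.255.0
    gateway x.x.x.1
    bond-slaves eth0 eth1
    bond-mode balance-alb
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    post-up ip link set dev eth1 promisc on
```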