Hello, I have been dealing with a problem for quite a while now, ever since upgrading my Lenny servers that use bonding to Squeeze. It is a bit difficult to explain, so I will do my best.
Let me start by giving my bond config:

    iface bond0 inet static
        address x.x.x.x
        netmask 255.255.255.0
        network x.x.x.0
        broadcast x.x.x.255
        gateway x.x.x.1
        bond-slaves eth0 eth1
        bond-mode balance-alb
        bond-miimon 100
        bond-downdelay 200
        bond-updelay 200

From /proc/net/bonding/bond0:

    Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

    Bonding Mode: adaptive load balancing
    Primary Slave: None
    Currently Active Slave: eth0
    MII Status: up
    MII Polling Interval (ms): 100
    Up Delay (ms): 200
    Down Delay (ms): 200

    Slave Interface: eth0
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:XX:XX:66:98:34

    Slave Interface: eth1
    MII Status: up
    Link Failure Count: 0
    Permanent HW addr: 00:XX:XX:66:98:36

And the ifconfig output (abridged):

    bond0  Link encap:Ethernet  HWaddr 00:XX:XX:66:98:34
           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
    eth0   Link encap:Ethernet  HWaddr 00:XX:XX:66:98:34
           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
    eth1   Link encap:Ethernet  HWaddr 00:XX:XX:66:98:36
           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1

The problem started immediately after rebooting from the upgrade (though in my troubleshooting I have been able to reproduce it with a fresh install of Squeeze): I was unable to reach my network. Testing with pings showed that roughly 40-50% of packets got through and the rest were lost. This was true for both routed traffic and traffic within the broadcast domain.

While troubleshooting I concluded that the issue could be MAC related, so I started watching tcpdump logs of ARP traffic to and from my server, testing against another server in the same subnet:

- When pings succeed, the remote server's ARP cache shows the MAC of eth0 (which is also the MAC owned by the bond0 device).
- When pings fail, the ARP cache shows the MAC of eth1 (not the MAC owned by the bond device).

I don't see the changing MAC as a problem in itself: since I am using balance-alb, I expect my server's MAC to flap between one slave and the other. But why are the pings failing?
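In case anyone wants to reproduce the observation, this is roughly how I watched the traffic (a sketch; adjust the interface names to your setup, and run as root):

```shell
# Watch ARP on each slave, printing the ethernet header (-e) so the
# source MAC used in replies is visible; -n skips name resolution.
tcpdump -e -n -i eth0 arp
tcpdump -e -n -i eth1 arp

# Watch the ICMP echo traffic on eth1 WITHOUT promiscuous mode (-p).
# With -p the pings keep failing; without -p (promiscuous on, the
# default) they start succeeding, which is the behavior described below.
tcpdump -n -p -i eth1 icmp
```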
I also noticed that eth1 seems to send ARP replies much more often than eth0 (almost exclusively), so I started tcpdumping the ICMP traffic on both ends. Packets destined for the bonded server leave the remote host with a valid MAC for eth1 in the destination, but still fail.

Then I noticed that while tcpdump is running on eth1, the pings go through! As soon as I stop it, they fail again. If I run tcpdump without putting eth1 into promiscuous mode, the pings continue to fail; if I enable promiscuous mode, they go through.

So my current conclusion is that, despite eth1 being in an alb configuration, it is for some reason dropping packets as though they were not destined for that interface, even though it does send ARP replies with its own MAC. Forcing the interface into promiscuous mode with ifconfig seems to temporarily resolve the issue. So does tearing the bond device down and reassembling it (ifdown and ifup for each slave). As soon as the server reboots, the problem starts over (unless I somewhere specify that eth1 should come up in promiscuous mode).

I hope this little narrative is enough for one of you to provide assistance; if not, please ask for any info you are missing. Thanks in advance, Pat
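For what it's worth, until the root cause is found, the promiscuous-mode workaround I described can be made to survive reboots from /etc/network/interfaces with a post-up hook (a sketch, assuming ifupdown's post-up option and the ip tool from iproute; this papers over the symptom, it is not a fix):

```shell
# /etc/network/interfaces (excerpt) -- workaround sketch only:
# force eth1 into promiscuous mode every time bond0 comes up.
iface bond0 inet static
    address x.x.x.x
    netmask 255.255.255.0
    gateway x.x.x.1
    bond-slaves eth0 eth1
    bond-mode balance-alb
    bond-miimon 100
    bond-downdelay 200
    bond-updelay 200
    post-up ip link set dev eth1 promisc on
```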