Ben Hutchings <b...@decadent.org.uk> writes:
> On Mon, 2011-11-21 at 20:13 +0100, Bjørn Mork wrote:
>> Looks like my wife did some external scans of our home network :-)
>>
>> Have to investigate further how she managed to kill the interface, but
>> this is definitely not related to the driver upgrade.  Sorry for my
>> misleading initial report.
>
> So far as I'm aware, if the TX watchdog fires it indicates one of:
>
> 1. A bug in the driver, firmware or hardware caused the hardware
>    transmit queue to stop.
> 2. A bug in the driver, firmware or hardware meant that the kernel was
>    not notified of link-down or another interruption that is expected to
>    stop the hardware transmit queue.
> 3. Transmission is being continually blocked by (full-duplex link) pause
>    frames or (half-duplex link) collisions.  This may occur due to a switch
>    misconfiguration or inconsistent configuration between switch and host.
>
> High levels of traffic or specific traffic patterns that overload the
> CPU should never cause this to happen.  As the primary maintainer of
> another Linux network driver, I have to treat every 'TX watchdog' report
> as a bug unless it falls into case 3.

This may very well be an example of case 3.  The failing interface is
connected to a gig port on a Cisco Catalyst C2950G.  Both the switch port
and the host port are configured for both input and output flow-control.

canardo:/tmp# ethtool -a eth1
Pause parameters for eth1:
Autonegotiate:  on
RX:             on
TX:             on

canardo:/tmp# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes

c2950a#show interfaces gigabitEthernet 0/1
GigabitEthernet0/1 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 000d.bc45.b3d9 (bia 000d.bc45.b3d9)
  Description: canardo
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec,
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is T
  input flow-control is on, output flow-control is on
  ARP type: ARPA, ARP Timeout 04:00:00
  1000BaseT module in GBIC slot.
  Last input 00:00:03, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 544000 bits/sec, 159 packets/sec
  5 minute output rate 117000 bits/sec, 103 packets/sec
     85269919 packets input, 1110719891 bytes, 756 no buffer
     Received 1673801 broadcasts (1543541 multicast)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 756 ignored
     0 watchdog, 1543541 multicast, 11987 pause input
     0 input packets with dribble condition detected
     61473019 packets output, 2505206278 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out

NOTE: switch counters have unfortunately been reset since the event.

The host network configuration is rather unusual, and may seem
unnecessarily complex (but I have my reasons for most of this - I've
just forgotten them :-)

The eth1 interface is bridged with a tap interface connected to a VDE
switch running on the host.  Both the physical and virtual switch ports
are configured as trunks, and a number of VLAN interfaces are put on top
of the bridge interface:

bjorn@canardo:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0015171e5e35       no              eth1
                                                        tap0

canardo:/tmp# cat /proc/net/vlan/config
VLAN Dev name    | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
br0.1          | 1  | br0
br0.7          | 7  | br0
br0.90         | 90  | br0
br0.93         | 93  | br0
br0.666        | 666  | br0

This way, I can easily connect any combination I want of physical switch
port, virtual switch port and host interface, using only a single cable.
To make this even better, one of the switch ports is connected to an ADSL
modem and I'm running two PPPoE sessions from the same host over br0.90
(the modem is of course an untagged port in VLAN 90).

The packets causing the problem were dummy (probably completely empty)
IPv6 packets addressed to a large number of non-existent hosts within a
single /64.  The packets came in over ppp1 running over the ADSL line,
and would be routed out on br0.1.  This probably made the host send a lot
of neighbour discovery icmp packets out br0.1, which would be split out
to a number of untagged ports in VLAN 1 on both the virtual and the
physical switch.  At least one of these ports is configured for
half-duplex 10Mb/s.  I assume that will limit the possible multicast
traffic in this VLAN to 10Mb/s as well.

The ADSL line is 12Mb/s, and the incoming packets could actually be
smaller than the neighbour discovery packets, so I believe it's feasible
that the triggered neighbour discovery traffic exceeded what the switch
was capable of forwarding in this case.

The real unanswered question is: Will the switch send pause frames in
this case?  It is of course capable of handling a whole lot more traffic,
just not more multicast traffic for this single VLAN.

> So I don't want to just forget this either.  But if you can't reproduce
> it, it may be difficult to track down.

I will try and see if it is reproducible.  Cannot promise when...

It would of course be very interesting to watch the switch counters while
doing this, and also to see if this is reproducible on a VLAN with only
gig ports, or even with only 100Mb/s ports.

Thanks for your feedback.  I realized that I should research the problem
better, but I just didn't have the time to do that properly when I
discovered that it was more than the simple driver regression I initially
thought it was.  Your list of possible reasons made it much easier to
guess what could be going on.


Bjørn
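
A rough sanity check of the bandwidth argument above, as a sketch only.
It assumes that every incoming packet is a bare 40-byte IPv6 header, that
the full 12Mb/s of the ADSL line is available to those packets (PPPoE and
ATM overhead ignored), and that each one triggers exactly one neighbour
solicitation carrying a source link-layer address option.  None of these
figures were measured, the function names are made up for illustration,
and the kernel's neighbour table limits may well cap the real rate far
below this estimate.

```python
# Back-of-the-envelope estimate; every size below is an assumption,
# not something measured on the affected host or switch.

IPV6_HEADER = 40            # bytes, fixed IPv6 header
NS_ICMPV6 = 24              # bytes, NS message: 4 header + 4 reserved + 16 target
NS_SLLA_OPTION = 8          # bytes, source link-layer address option
ETH_HEADER_FCS = 14 + 4     # bytes, untagged Ethernet header plus FCS
ETH_PREAMBLE_IFG = 8 + 12   # bytes, preamble/SFD plus inter-frame gap


def ns_wire_bytes():
    """Bytes one neighbour solicitation occupies on an untagged Ethernet link."""
    return (IPV6_HEADER + NS_ICMPV6 + NS_SLLA_OPTION
            + ETH_HEADER_FCS + ETH_PREAMBLE_IFG)


def triggered_ns_rate(line_rate_bps, in_packet_bytes, ns_per_packet=1.0):
    """Bits/s of neighbour solicitations triggered by a flood of incoming packets."""
    packets_per_second = line_rate_bps / (in_packet_bytes * 8)
    return packets_per_second * ns_per_packet * ns_wire_bytes() * 8


if __name__ == "__main__":
    adsl_bps = 12_000_000       # ADSL downstream rate from the mail
    slow_port_bps = 10_000_000  # the half-duplex 10Mb/s port in VLAN 1
    rate = triggered_ns_rate(adsl_bps, IPV6_HEADER)
    print(f"NS frame on the wire: {ns_wire_bytes()} bytes")
    print(f"triggered NS traffic: {rate / 1e6:.1f} Mb/s")
    print(f"exceeds the 10Mb/s port: {rate > slow_port_bps}")
```

Under these assumptions the solicitations are larger on the wire than the
packets that trigger them, so the multicast load offered to VLAN 1 could
exceed 10Mb/s.  Passing a larger in_packet_bytes models per-packet
overhead on the ADSL side and lowers the estimate accordingly.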