Ben Hutchings <b...@decadent.org.uk> writes:

> On Mon, 2011-11-21 at 20:13 +0100, Bjørn Mork wrote:
>> Looks like my wife did some external scans of our home network :-)
>> 
>> Have to investigate further how she managed to kill the interface, but
>> this is definitely not related to the driver upgrade.  Sorry for my
>> misleading initial report.
>
> So far as I'm aware, if the TX watchdog fires it indicates one of:
>
> 1. A bug in the driver, firmware or hardware caused the hardware
> transmit queue to stop.
> 2. A bug in the driver, firmware or hardware meant that the kernel was
> not notified of link-down or another interruption that is expected to
> stop the hardware transmit queue.
> 3. Transmission is being continually blocked by (full-duplex link) pause
> frames or (half-duplex link) collisions.  This may occur due to a switch
> misconfiguration or inconsistent configuration between switch and host.
>
> High levels of traffic or specific traffic patterns that overload the
> CPU should never cause this to happen.  As the primary maintainer of
> another Linux network driver, I have to treat every 'TX watchdog' report
> as a bug unless it falls into case 3.

This may very well be an example of case 3. The failing interface is
connected to a gig port on a Cisco Catalyst C2950G.  Both the switch
port and the host port are configured for both input and output
flow-control.

canardo:/tmp# ethtool -a eth1
Pause parameters for eth1:
Autonegotiate:  on
RX:             on
TX:             on

canardo:/tmp# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: No
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: on
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000001 (1)
        Link detected: yes

c2950a#show interfaces gigabitEthernet 0/1
GigabitEthernet0/1 is up, line protocol is up (connected)
  Hardware is Gigabit Ethernet, address is 000d.bc45.b3d9 (bia 000d.bc45.b3d9)
  Description: canardo
  MTU 1500 bytes, BW 1000000 Kbit, DLY 10 usec, 
     reliability 255/255, txload 1/255, rxload 1/255
  Encapsulation ARPA, loopback not set
  Keepalive set (10 sec)
  Full-duplex, 1000Mb/s, media type is T
  input flow-control is on, output flow-control is on 
  ARP type: ARPA, ARP Timeout 04:00:00
  1000BaseT module in GBIC slot.
  Last input 00:00:03, output 00:00:01, output hang never
  Last clearing of "show interface" counters never
  Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
  Queueing strategy: fifo
  Output queue: 0/40 (size/max)
  5 minute input rate 544000 bits/sec, 159 packets/sec
  5 minute output rate 117000 bits/sec, 103 packets/sec
     85269919 packets input, 1110719891 bytes, 756 no buffer
     Received 1673801 broadcasts (1543541 multicast)
     0 runts, 0 giants, 0 throttles
     0 input errors, 0 CRC, 0 frame, 0 overrun, 756 ignored
     0 watchdog, 1543541 multicast, 11987 pause input
     0 input packets with dribble condition detected
     61473019 packets output, 2505206278 bytes, 0 underruns
     0 output errors, 0 collisions, 2 interface resets
     0 babbles, 0 late collision, 0 deferred
     0 lost carrier, 0 no carrier, 0 PAUSE output
     0 output buffer failures, 0 output buffers swapped out


NOTE: switch counters have unfortunately been reset since the event.



The host network configuration is rather unusual, and may seem
unnecessarily complex (but I have my reasons for most of this - I've
just forgotten them :-)


The eth1 interface is bridged with a tap interface connected to a VDE
switch running on the host.  Both the physical and virtual switch ports
are configured as trunks and a number of VLAN interfaces are put on top
of the bridge interface:

bjorn@canardo:~$ brctl show
bridge name     bridge id               STP enabled     interfaces
br0             8000.0015171e5e35       no              eth1
                                                        tap0
canardo:/tmp# cat /proc/net/vlan/config 
VLAN Dev name    | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
br0.1          | 1  | br0
br0.7          | 7  | br0
br0.90         | 90  | br0
br0.93         | 93  | br0
br0.666        | 666  | br0


This way, I can easily connect any combination I want of physical switch
port, virtual switch port and host interface, using only a single cable.

To make this even better, one of the switch ports is connected to an ADSL
modem and I'm running two PPPoE sessions from the same host over br0.90
(the modem is of course an untagged port in VLAN 90).

The packets causing the problem were dummy (probably completely empty)
IPv6 packets addressed to a large number of non-existent hosts within a
single /64.  The packets came in over ppp1 running over the ADSL line,
and would be routed out on br0.1.

This probably made the host send a lot of neighbour discovery ICMPv6
packets out br0.1, which would be split out to a number of untagged
ports in VLAN 1 on both the virtual and the physical switch. At least
one of these ports is configured for half-duplex 10Mb/s. I assume that
will limit the possible multicast traffic in this VLAN to 10Mb/s as
well.
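
To illustrate why this ends up as multicast traffic: each neighbour
solicitation for an unresolved address is sent to that address's
solicited-node multicast group (RFC 4291), and a switch that does not do
MLD snooping will typically flood it to every port in the VLAN.  A
minimal Python sketch of the mapping, using the documentation prefix
2001:db8::/64 as a stand-in for the real /64 (the function names are
made up for illustration):

```python
import ipaddress


def solicited_node_multicast(target: str) -> ipaddress.IPv6Address:
    """Solicited-node multicast group for a target: ff02::1:ff00:0/104
    combined with the low 24 bits of the target address."""
    low24 = int(ipaddress.IPv6Address(target)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)


def multicast_mac(group: ipaddress.IPv6Address) -> str:
    """Ethernet destination for an IPv6 multicast group:
    33:33 followed by the low 32 bits of the group address."""
    low32 = int(group) & 0xFFFFFFFF
    octets = [0x33, 0x33] + list(low32.to_bytes(4, "big"))
    return ":".join(f"{o:02x}" for o in octets)


# Example target inside the placeholder /64 (not an address from the report).
group = solicited_node_multicast("2001:db8::1234:5678")
print(group, multicast_mac(group))
```

Scanning many different targets in one /64 therefore spreads the
solicitations over many different multicast groups, all of them in the
same VLAN.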

The ADSL line is 12Mb/s, and the incoming packets could actually be
smaller than the neighbour discovery packets, so I believe it's feasible
that the triggered neighbour discovery traffic exceeded what the switch
was capable of forwarding in this case.
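
A back-of-the-envelope check of that claim, as a Python sketch.  The
numbers are assumptions, not measurements: every incoming packet is
taken to trigger exactly one neighbour solicitation (the kernel's
neighbour table limits and retransmissions would change this), the
incoming packet is guessed at 60 bytes on the ADSL line, and the
solicitation is counted as 110 bytes on the untagged wire (72-byte
IPv6 + ICMPv6 packet with a source link-layer address option, 14-byte
Ethernet header, 4-byte FCS, 20 bytes of preamble and inter-frame gap).

```python
def triggered_ns_rate_bps(in_link_bps, in_pkt_bytes, ns_wire_bytes, ns_per_pkt=1.0):
    """Outgoing bit rate if each incoming packet triggers ns_per_pkt solicitations."""
    in_pps = in_link_bps / (in_pkt_bytes * 8)      # incoming packets per second
    return in_pps * ns_per_pkt * ns_wire_bytes * 8  # outgoing bits per second


ADSL_BPS = 12_000_000   # from the text: 12Mb/s ADSL line
SLOW_PORT_BPS = 10_000_000  # from the text: half-duplex 10Mb/s port
IN_PKT_BYTES = 60       # assumption: 40-byte IPv6 header plus guessed overhead
NS_WIRE_BYTES = 110     # assumption: see the breakdown above

out_bps = triggered_ns_rate_bps(ADSL_BPS, IN_PKT_BYTES, NS_WIRE_BYTES)
print(f"{out_bps / 1e6:.1f} Mb/s triggered vs {SLOW_PORT_BPS / 1e6:.1f} Mb/s port")
```

Under these assumptions the triggered traffic is roughly twice what the
10Mb/s port can carry; with larger incoming packets the ratio drops
accordingly.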

The real unanswered question is:  Will the switch send pause frames in
this case?  It is of course capable of handling a whole lot more
traffic, just not more multicast traffic for this single VLAN.

> So I don't want to just forget this either.  But if you can't reproduce
> it, it may be difficult to track down.

I will try and see if it is reproducible.  Cannot promise when... 

It would of course be very interesting to watch the switch counters
while doing this, and also to see whether this is reproducible on a VLAN
with only gig ports, or even with only 100Mb/s ports.
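
For watching the counters, a small Python sketch that pulls the pause
counters out of "show interfaces" text in the format pasted earlier and
compares two snapshots.  How the snapshots are collected (serial
console, SSH, copy and paste) is left open, and the function names are
hypothetical:

```python
import re

# Matches e.g. "11987 pause input" and "0 PAUSE output".
PAUSE_RE = re.compile(r"(\d+)\s+pause\s+(input|output)", re.IGNORECASE)


def pause_counters(show_interface_text):
    """Return {'input': n, 'output': m} from 'show interfaces' output."""
    return {direction.lower(): int(count)
            for count, direction in PAUSE_RE.findall(show_interface_text)}


def pause_delta(before, after):
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


# The two relevant lines from the listing earlier in this mail.
SAMPLE = """\
     0 watchdog, 1543541 multicast, 11987 pause input
     0 lost carrier, 0 no carrier, 0 PAUSE output
"""
print(pause_counters(SAMPLE))
```

A "PAUSE output" counter that rises during the test would presumably
mean the switch is sending pause frames towards the host.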


Thanks for your feedback.  I realize that I should have researched the
problem better, but I just didn't have the time to do that properly when
I discovered
that it was more than the simple driver regression I initially thought
it was.  Your list of possible reasons made it much easier to guess what
could be going on.



Bjørn



