Carl-Daniel Hailfinger wrote:
> Stephen Hemminger wrote:
>
>>On Mon, 23 Jan 2006 20:57:10 +0100
>>Carl-Daniel Hailfinger <[EMAIL PROTECTED]> wrote:
>>
>>
>>>Stephen Hemminger wrote:
>>>
>>>>You might try adjusting the interrupt coalescing parameters with
>>>>  ethtool -C eth0 ...
>>>>But I can't give you hard guidelines as to what would make it better.
>>>>
>>>>I have a debug patch, but it needs work still.
>>>
>>>I don't care whether that debug patch will freeze the box or perform
>>>other random funnies. All the debugging printks I added to the driver
>>>did not trigger and I'd try anything. So yes, I'm desperate.
>>>
>>>Does the sk98lin driver have any code for such problems?
>>
>>
>>There are several differences that the sk98lin driver has.
>>* It programs some parts of the chip differently. But most
>>  of those are wrong. I started copying it, but where it was wrong
>>  I didn't copy the mistakes.
>>* Sk98lin does NAPI wrong. It has interrupts disabled and runs
>>  packets through soft irq twice.
>>* Sk98lin does its own buggy rx checksum validation.
>>* Sk98lin does not do VLAN
>>* Sk98lin programs PCI-Ex for 2K transfers, but that causes data
>>  corruption
>>
>>The one that probably is saving you with sk98lin, is it has a watchdog
>>routine that tries to work around all the possible driver hangs.
>>I prefer to find and fix these hangs, because a watchdog routine like that
>>just masks the problem and introduces a bunch of SMP race conditions which
>>the sk98lin author either didn't see or ignored.
>
>
> Oh. Now that is news to me. Glad I didn't have an SMP machine with the old
> driver.
>
> There is a bug in ethtool support in sky2. Namely, rx-frames{,-irq}=64 is
> wrapped to zero. And rx-usecs-irq is 20 no matter what I set it to.

The following whitespace-damaged patch should help with the latter problem.

--- a/drivers/net/sky2.c 2006-01-23 23:41:35.000000000 +0100
+++ b/drivers/net/sky2.c 2006-01-24 03:41:21.000000000 +0100
@@ -2843,7 +2843,7 @@
  if (ecmd->rx_coalesce_usecs_irq == 0)
   sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
  else {
-  sky2_write32(hw, STAT_TX_TIMER_INI,
+  sky2_write32(hw, STAT_ISR_TIMER_INI,
         sky2_us2clk(hw, ecmd->rx_coalesce_usecs_irq));
   sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
  }

Despite all the problems I'm having with sky2, I want to thank you for
writing it. The driver is easily readable and I can at least try to
get it running. With sk98lin I'm just stuck due to coding style and
general obfuscation.


Yeeeeeaaaaaaaaaaaaahhhhhhhhhhhhh! I got the nic to reproducibly auto-recover.
With the following ethtool settings it would hang after a few minutes and
not recover until a rmmod/modprobe cycle. Now it comes back reliably.

# ethtool -C bridgeext0 rx-frames 63 rx-frames-irq 63 tx-frames 63 \
  rx-usecs 250 rx-usecs-irq 250 tx-usecs 250

Patch follows:

--- a/drivers/net/sky2.c 2006-01-23 23:41:35.000000000 +0100
+++ b/drivers/net/sky2.c 2006-01-24 04:59:38.000000000 +0100
@@ -1623,6 +1623,12 @@
  unsigned txq = txqaddr[sky2->port];
  u16 ridx;
+ //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_STOP);
+ sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_STOP);
+ //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_STOP);
+ //sky2_write8(hw, STAT_TX_TIMER_CTRL, TIM_START);
+ sky2_write8(hw, STAT_LEV_TIMER_CTRL, TIM_START);
+ //sky2_write8(hw, STAT_ISR_TIMER_CTRL, TIM_START);
  /* Maybe we just missed an status interrupt */
  spin_lock(&sky2->tx_lock);
  ridx = sky2_read16(hw,
@@ -1639,6 +1645,7 @@
  if (netif_msg_timer(sky2))
   printk(KERN_ERR PFX "%s: tx timeout\n", dev->name);
+#if 0
  sky2_write32(hw, Q_ADDR(txq, Q_CSR), BMU_STOP);
  sky2_write32(hw, Y2_QADDR(txq, PREF_UNIT_CTRL), PREF_UNIT_RST_SET);
@@ -1646,6 +1653,7 @@
  sky2_qset(hw, txq);
  sky2_prefetch_init(hw, txq, sky2->tx_le_map, TX_RING_SIZE - 1);
+#endif
 }

Properties of the patch above:
The device will fail after some time, enter the tx_timeout handler,
recover and continue.
Now if I could avoid entering the tx_timeout handler, I would be happy
because it triggers only after hanging for approx. 10 seconds.

Error log with my patch so far:
Jan 24 05:09:27 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:09:27 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:09:41 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:09:41 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 1312
Jan 24 05:11:12 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:12 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 592
Jan 24 05:11:42 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:11:42 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:11:42 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 80
Jan 24 05:13:31 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:13:31 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:13:31 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 720
Jan 24 05:14:12 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:14:12 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:14:12 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 512
Jan 24 05:15:21 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:15:21 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:15:21 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 128
Jan 24 05:17:52 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:17:52 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:17:52 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 840
Jan 24 05:18:51 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:18:51 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:18:51 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 272
Jan 24 05:23:07 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:23:07 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:23:07 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 208
Jan 24 05:23:37 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:23:37 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:23:37 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 992
Jan 24 05:26:22 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:26:22 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:26:22 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 744
Jan 24 05:28:47 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:28:47 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:29:11 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:29:11 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:29:11 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 352
Jan 24 05:30:02 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:30:02 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:30:02 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 96
Jan 24 05:30:27 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:30:27 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:30:27 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 800
Jan 24 05:30:51 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:30:51 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:30:51 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 352
Jan 24 05:31:32 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:31:32 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:31:32 switch kernel: sky2 bridgeint0: rx error, status 0x7ffc0001 length 1344
Jan 24 05:34:17 switch kernel: NETDEV WATCHDOG: bridgeint0: transmit timed out
Jan 24 05:34:17 switch kernel: sky2 bridgeint0: tx timeout
Jan 24 05:35:36 switch kernel: NETDEV WATCHDOG: bridgeext0: transmit timed out
Jan 24 05:35:36 switch kernel: sky2 bridgeext0: tx timeout
Jan 24 05:35:36 switch kernel: sky2 bridgeext0: rx error, status 0x7ffc0001 length 128

Strange. Not every tx timeout corresponds with a rx error. However, that
could be due to net_ratelimit firing.

I'm now trying to find out which timer is the problematic one.
Kicking STAT_TX_TIMER_CTRL alone has no effect.
Kicking STAT_LEV_TIMER_CTRL alone does help so far.
STAT_ISR_TIMER_CTRL was not tested yet.

...test...

Survived 22 hangs with the hand-edited patch above. Stephen, do you know
of any errata which could help explain this?

Regards,
Carl-Daniel
-- 
http://www.hailfinger.org/
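
To check the observation that not every tx timeout is paired with an rx error, the log can be tallied per interface instead of eyeballed. The following is only a minimal sketch, assuming syslog lines in exactly the format shown in the error log; `struct tally`, `tally_get` and `tally_line` are hypothetical helper names made up for this illustration and are not part of sky2 or any kernel API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_IFACES 8

/* Per-interface counters for the two sky2 messages seen in the log. */
struct tally {
	char ifname[32];
	unsigned tx_timeouts;
	unsigned rx_errors;
};

/* Return the slot for ifname, creating it if needed; NULL if the table is full. */
static struct tally *tally_get(struct tally *t, size_t *n, const char *ifname)
{
	struct tally *e;
	size_t i;

	for (i = 0; i < *n; i++)
		if (strcmp(t[i].ifname, ifname) == 0)
			return &t[i];
	if (*n == MAX_IFACES)
		return NULL;
	e = &t[(*n)++];
	memset(e, 0, sizeof(*e));
	strncpy(e->ifname, ifname, sizeof(e->ifname) - 1);
	return e;
}

/*
 * Classify one syslog line. Only "sky2 <if>: tx timeout" and
 * "sky2 <if>: rx error" are counted; the NETDEV WATCHDOG lines are
 * skipped so that each timeout is counted once.
 */
static void tally_line(struct tally *t, size_t *n, const char *line)
{
	static const char prefix[] = "kernel: sky2 ";
	static const char tx_msg[] = ": tx timeout";
	static const char rx_msg[] = ": rx error";
	char ifname[32];
	const char *p;
	struct tally *e;
	int is_tx, is_rx;

	assert(t != NULL && n != NULL && line != NULL);

	p = strstr(line, prefix);
	if (p == NULL)
		return;
	p += strlen(prefix);
	if (sscanf(p, "%31[^:]", ifname) != 1)
		return;
	p = strchr(p, ':');
	if (p == NULL)
		return;

	is_tx = strncmp(p, tx_msg, strlen(tx_msg)) == 0;
	is_rx = strncmp(p, rx_msg, strlen(rx_msg)) == 0;
	if (!is_tx && !is_rx)
		return;

	e = tally_get(t, n, ifname);
	if (e == NULL)
		return;
	if (is_tx)
		e->tx_timeouts++;
	else
		e->rx_errors++;
}
```

Feeding each log line to `tally_line()` (for example from an `fgets()` loop over stdin) and then printing the table gives the tx timeout and rx error counts per interface. A gap between the two counts only shows that some messages are missing; it cannot tell whether net_ratelimit dropped them or whether no rx error occurred.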