On Sun, 23 Jul 2006, Krzysztof Oledzki wrote:



On Fri, 26 May 2006, Stephen Hemminger wrote:

Please give this a try, it rearranges the transmit buffer management,
and may avoid issues with partial completions causing SKB reuse.

<CUT>

Please excuse me, I overlooked this patch. Anyway, it seems that this fix went into the 2.6.16 kernel, which is already running on the server that had the problems (http://bugzilla.kernel.org/show_bug.cgi?id=6142). I'll disable my workaround (/usr/sbin/ethtool -K eth1 tx off) and let you know the results.

Strange: I re-enabled TX checksumming and there were no problems for about one week. Yesterday I upgraded my kernel to 2.6.17.7, and after one day, about 3 hours ago, my system crashed with the following log:

 <782b6fe4> skge_xmit_frame+0x121/0x2ea  <781249b6> raise_softirq_irqoff+0xe/0x59
 <7833b9b7> qdisc_restart+0xc4/0x16b  <78332352> net_tx_action+0x97/0xbd
 <7812484d> __do_softirq+0x59/0xc0  <781248e4> do_softirq+0x30/0x35
 <78124947> local_bh_enable+0x5e/0x7e  <78332194> dev_queue_xmit+0x1b6/0x1bd
 <7834ab2c> ip_output+0x1b5/0x1eb  <7834af00> ip_queue_xmit+0x39e/0x3e6
 <78191f3e> __ext3_get_inode_loc+0x53/0x201  <7819df94> journal_dirty_metadata+0x1d1/0x1eb
 <7811bafb> __wake_up+0x27/0x3b  <7819e3dc> journal_stop+0x1bd/0x1c9
 <781963d0> __ext3_journal_stop+0x19/0x37  <78192b58> ext3_dirty_inode+0x5d/0x63
 <78359652> tcp_transmit_skb+0x38e/0x3af  <7816d122> touch_atime+0x97/0x9d
 <7835a89c> tcp_write_xmit+0x1ad/0x212  <7835a924> __tcp_push_pending_frames+0x23/0x80
 <78352732> do_tcp_setsockopt+0x12e/0x2f3  <7832cd3c> sock_common_setsockopt+0x1e/0x22
 <7832ac7b> sys_setsockopt+0x61/0x81  <7832b242> sys_socketcall+0x164/0x1a4
 <7815765d> sys_sendfile+0x5d/0x84  <78102c93> sysenter_past_esp+0x54/0x75
Bad page state in process 'swapper'
page:7985eb20 flags:0x80010008 mapping:e25867a0 mapcount:0 count:0
Trying to fix it up, but a reboot is needed
Backtrace:
 <78140e43> bad_page+0x43/0x6c  <781415e5> free_hot_cold_page+0x5b/0x123
 <7832d700> skb_release_data+0x50/0x86  <7832d741> kfree_skbmem+0xb/0x70
 <78355b41> tcp_clean_rtx_queue+0x225/0x3e6  <783560b1> tcp_ack+0x151/0x27b
 <78358116> tcp_rcv_established+0x544/0x5ed  <7835e972> tcp_v4_do_rcv+0x1f/0xb4
 <7835ee8e> tcp_v4_rcv+0x487/0x6de  <7833f4ef> nf_hook_slow+0xb3/0xce
 <78347aac> ip_local_deliver+0x11b/0x1ab  <78348086> ip_rcv+0x40c/0x446
 <783324e7> netif_receive_skb+0x16f/0x1a7  <782b79a0> skge_poll+0x307/0x3e8
 <78332661> net_rx_action+0x5c/0xd3  <7812484d> __do_softirq+0x59/0xc0
 <781248e4> do_softirq+0x30/0x35  <7812499d> irq_exit+0x36/0x41
 <78104edc> do_IRQ+0x20/0x28  <7810101c> default_idle+0x0/0x55
 <7810373e> common_interrupt+0x1a/0x20  <7810101c> default_idle+0x0/0x55
 <78101048> default_idle+0x2c/0x55  <78101132> cpu_idle+0xad/0xda

I know it is incomplete (this is all I was able to find in my logs), but it looks _very_ similar to the one from:
http://bugzilla.kernel.org/show_bug.cgi?id=6142

BTW: during normal operation the skge driver still logs (about 10 times per hour) messages about a hardware error. However, the message has changed slightly. In 2.6.16 it was:
 skge hardware error detected (status 0x400)
but in 2.6.17 it is:
 skge 0000:00:0b.0: PCI error cmd=0x7 status=0x82b0
 skge 0000:00:0b.0: PCI error cmd=0x147 status=0xc2b0
 skge 0000:00:0b.0: PCI error cmd=0x147 status=0xc2b0
 skge 0000:00:0b.0: PCI error cmd=0x147 status=0xc2b0
 skge 0000:00:0b.0: PCI error cmd=0x147 status=0xc2b0
 skge 0000:00:0b.0: PCI error cmd=0x147 status=0xc2b0
(...)
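
To make the messages above easier to read, here is a minimal sketch that decodes the two cmd/status pairs. It assumes that cmd and status are the raw 16-bit PCI configuration-space Command and Status registers (the log does not say so explicitly), and it uses the standard PCI bit layout with names modelled on the kernel's pci_regs.h. The helper names decode() and devsel() are only illustrative.

```python
# Assumption: "cmd" and "status" in the skge log lines are the raw PCI
# Command (offset 0x04) and Status (offset 0x06) configuration registers.

# Flag bits of the PCI Command register.
PCI_COMMAND_BITS = {
    0x001: "IO",            # I/O space enable
    0x002: "MEMORY",        # memory space enable
    0x004: "MASTER",        # bus master enable
    0x008: "SPECIAL",       # special cycles
    0x010: "INVALIDATE",    # memory write and invalidate
    0x020: "VGA_PALETTE",   # VGA palette snoop
    0x040: "PARITY",        # parity error response
    0x080: "WAIT",          # address/data stepping
    0x100: "SERR",          # SERR# enable
    0x200: "FAST_BACK",     # fast back-to-back enable
    0x400: "INTX_DISABLE",  # INTx emulation disable
}

# Flag bits of the PCI Status register. The two-bit DEVSEL timing field
# (bits 9-10) is not a flag and is decoded separately below.
PCI_STATUS_BITS = {
    0x0008: "INTERRUPT",
    0x0010: "CAP_LIST",
    0x0020: "66MHZ",
    0x0040: "UDF",
    0x0080: "FAST_BACK",
    0x0100: "PARITY",            # master data parity error
    0x0800: "SIG_TARGET_ABORT",
    0x1000: "REC_TARGET_ABORT",
    0x2000: "REC_MASTER_ABORT",
    0x4000: "SIG_SYSTEM_ERROR",
    0x8000: "DETECTED_PARITY",
}

DEVSEL_TIMING = {0: "fast", 1: "medium", 2: "slow", 3: "reserved"}


def decode(value, table):
    """Return the names of all flag bits set in value, lowest bit first."""
    return [name for mask, name in sorted(table.items()) if value & mask]


def devsel(status):
    """Return the DEVSEL timing encoded in bits 9-10 of the Status register."""
    return DEVSEL_TIMING[(status >> 9) & 0x3]


if __name__ == "__main__":
    # The two distinct cmd/status pairs quoted in the log above.
    for cmd, status in [(0x7, 0x82B0), (0x147, 0xC2B0)]:
        print("cmd=%#x: %s" % (cmd, ", ".join(decode(cmd, PCI_COMMAND_BITS))))
        print("status=%#x: %s, DEVSEL %s" % (
            status, ", ".join(decode(status, PCI_STATUS_BITS)), devsel(status)))
```

If that assumption holds, both status values have the Detected Parity Error bit (0x8000) set, and 0xc2b0 additionally has Signaled System Error (0x4000). The cmd value 0x147 differs from 0x7 by having parity error response (0x40) and SERR# (0x100) enabled.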

Anyway, everything otherwise works fine. I don't know whether these messages are related to the crashes mentioned above.

Best regards,

                                Krzysztof Oledzki