Hi,

We appear to be hitting the same problem on one of our Pacemaker clusters. It is causing us serious issues, because the nodes get rebooted whenever the problem appears.
Has any progress been made in identifying the cause of this, or in fixing it?

From dmesg on tyne:

> Dec 28 23:16:32 tyne kernel: [418756.268195] WARNING: at /build/linux-rrsxby/linux-3.2.51/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> Dec 28 23:16:32 tyne kernel: [418756.382761] Hardware name: X9DRD-iF
> Dec 28 23:16:32 tyne kernel: [418756.496392] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
> Dec 28 23:16:33 tyne kernel: [418756.607364] Modules linked in: hmac dlm sctp libcrc32c configfs ip6table_filter ebtable_nat ebtables act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_
> Dec 28 23:16:34 tyne kernel: mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding sha1_ssse3 sha1_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache bridge stp loop kvm_intel kvm snd_pcm snd_timer coretemp snd soundcore acpi_cpufreq crc32c_intel ghash_clmulni_intel mperf aesni_intel psmouse snd_page_alloc cryptd iTCO_wdt sb_edac processor i2c_i801 serio_raw aes_x86_64 ioatdma pcspkr iTCO_vendor_support aes_generic thermal_sys i2c_core joydev edac_core evdev container button acpi_pad ext4 crc16 jbd2 mbcache dm_mod raid1 md_mod microcode usbhid hid sg sd_mod crc_t10dif ahci lib
> Dec 28 23:16:34 tyne kernel: ahci isci libsas libata ehci_hcd scsi_transport_sas usbcore igb scsi_mod usb_common dca [last unloaded: scsi_wait_scan]
> Dec 28 23:16:34 tyne kernel: [418758.541550] Pid: 0, comm: swapper/0 Not tainted 3.2.0-4-amd64 #1 Debian 3.2.51-1
> Dec 28 23:16:34 tyne kernel: [418758.652098] Call Trace:
> Dec 28 23:16:35 tyne kernel: [418758.761884] <IRQ> [<ffffffff81046cbd>] ? warn_slowpath_common+0x78/0x8c
> Dec 28 23:16:35 tyne kernel: [418758.869948] [<ffffffff81046d69>] ? warn_slowpath_fmt+0x45/0x4a
> Dec 28 23:16:35 tyne kernel: [418758.977593] [<ffffffff812a6f11>] ? netif_tx_lock+0x40/0x75
> Dec 28 23:16:35 tyne kernel: [418759.082681] [<ffffffff812a7081>] ? dev_watchdog+0xf2/0x151
> Dec 28 23:16:35 tyne kernel: [418759.186240] [<ffffffff81052480>] ? run_timer_softirq+0x19a/0x261
> Dec 28 23:16:35 tyne kernel: [418759.287841] [<ffffffff812a6f8f>] ? netif_tx_unlock+0x49/0x49
> Dec 28 23:16:35 tyne kernel: [418759.387569] [<ffffffff8104c2f8>] ? __do_softirq+0xb9/0x177
> Dec 28 23:16:35 tyne kernel: [418759.486351] [<ffffffff81096529>] ? rcu_needs_cpu+0x50/0x1bb
> Dec 28 23:16:35 tyne kernel: [418759.583008] [<ffffffff8135646c>] ? call_softirq+0x1c/0x30
> Dec 28 23:16:35 tyne kernel: [418759.677333] [<ffffffff8100f8cd>] ? do_softirq+0x3c/0x7b
> Dec 28 23:16:36 tyne kernel: [418759.770142] [<ffffffff8104c560>] ? irq_exit+0x3c/0x99
> Dec 28 23:16:36 tyne kernel: [418759.860906] [<ffffffff8100f5fd>] ? do_IRQ+0x82/0x98
> Dec 28 23:16:36 tyne kernel: [418759.954639] [<ffffffff8134f4ee>] ? common_interrupt+0x6e/0x6e
> Dec 28 23:16:36 tyne kernel: [418760.048124] <EOI> [<ffffffff811ee07d>] ? intel_idle+0xea/0x119
> Dec 28 23:16:36 tyne kernel: [418760.137012] [<ffffffff811ee05c>] ? intel_idle+0xc9/0x119
> Dec 28 23:16:36 tyne kernel: [418760.222705] [<ffffffff8126febd>] ? cpuidle_idle_call+0xec/0x179
> Dec 28 23:16:36 tyne kernel: [418760.306317] [<ffffffff8100d243>] ? cpu_idle+0xa5/0xf2
> Dec 28 23:16:36 tyne kernel: [418760.388391] [<ffffffff816abb36>] ? start_kernel+0x3b8/0x3c3
> Dec 28 23:16:36 tyne kernel: [418760.470137] [<ffffffff816ab140>] ? early_idt_handlers+0x140/0x140
> Dec 28 23:16:36 tyne kernel: [418760.548953] [<ffffffff816ab3c4>] ? x86_64_start_kernel+0x104/0x111
> Dec 28 23:16:36 tyne kernel: [418760.626209] ---[ end trace 25448d4e9ff0e259 ]---
> Dec 28 23:16:37 tyne kernel: [418760.710249] igb 0000:06:00.1: eth1: Reset adapter
> Dec 28 23:16:37 tyne kernel: [418760.814181] igb 0000:06:00.0: eth0: Reset adapter

and on tees:

> Dec 28 23:16:32 tees kernel: [419013.476706] WARNING: at /build/linux-rrsxby/linux-3.2.51/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
> Dec 28 23:16:33 tees kernel: [419013.591003] Hardware name: X9DRD-iF
> Dec 28 23:16:33 tees kernel: [419013.705052] NETDEV WATCHDOG: eth1 (igb): transmit queue 3 timed out
> Dec 28 23:16:34 tees kernel: [419013.817376] Modules linked in: hmac dlm sctp libcrc32c configfs ip6table_filter ebtable_nat ebtables act_police cls_basic cls_flow cls_fw cls_u32 sch_tbf sch_prio sch_htb sch_hfsc sch_ingress sch_sfq xt_statistic xt_CT xt_time xt_connlimit xt_realm xt_addrtype iptable_raw xt_comment xt_recent xt_policy ipt_ULOG ipt_REJECT ipt_REDIRECT ipt_NETMAP ipt_MASQUERADE ipt_ECN ipt_ecn ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp xt_TPROXY nf_tproxy_core ip6_tables nf_defrag_ipv6 xt_tcpmss xt_pkttype xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport xt_mark xt_
> Dec 28 23:16:34 tees kernel: mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit xt_DSCP xt_dscp xt_dccp xt_conntrack xt_connmark xt_CLASSIFY xt_AUDIT ipt_LOG xt_tcpudp xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi iptable_filter ip_tables x_tables nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc bonding sha1_ssse3 sha1_generic ipmi_poweroff ipmi_devintf ipmi_si ipmi_msghandler vhost_net macvtap macvlan tun drbd lru_cache bridge stp loop kvm_intel kvm snd_pcm snd_timer snd i2c_i801 coretemp crc32c_intel iTCO_wdt soundcore ghash_clmulni_intel acpi_cpufreq

(This is as far as tees got before it was STONITHed.)

Both servers have Supermicro X9DRD-iF motherboards and are running linux-image-3.2.0-4-amd64 3.2.51-1.
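When comparing the two nodes, it may help to pull just the NETDEV WATCHDOG reports (host, time, interface, driver and transmit queue) out of each machine's kernel log. Below is a minimal Python sketch, assuming the syslog line format quoted above (`Mon DD HH:MM:SS host kernel: [uptime] NETDEV WATCHDOG: ifname (driver): transmit queue N timed out`); the names `WATCHDOG_RE`, `WatchdogEvent`, `parse_watchdog` and `summarize` are hypothetical helpers, not part of any existing tool.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Hypothetical parser (an illustration, not an existing tool) for lines such as:
#   Dec 28 23:16:32 tyne kernel: [418756.496392] NETDEV WATCHDOG: eth1 (igb): transmit queue 1 timed out
# Assumes the traditional syslog prefix (month, day, time, host) seen in the logs above.
WATCHDOG_RE = re.compile(
    r"(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?:\[\s*[\d.]+\]\s+)?"  # optional printk uptime stamp, e.g. [418756.496392]
    r"NETDEV WATCHDOG:\s+(?P<iface>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit queue\s+(?P<queue>\d+)\s+timed out"
)


class WatchdogEvent(NamedTuple):
    stamp: str   # syslog timestamp, e.g. "Dec 28 23:16:32"
    host: str    # reporting node, e.g. "tyne"
    iface: str   # network interface, e.g. "eth1"
    driver: str  # driver name shown in parentheses, e.g. "igb"
    queue: int   # index of the transmit queue that timed out


def parse_watchdog(line: str) -> Optional[WatchdogEvent]:
    """Return a WatchdogEvent for a NETDEV WATCHDOG line, or None for any other line."""
    m = WATCHDOG_RE.match(line)  # match() anchors at the start of the line
    if m is None:
        return None
    return WatchdogEvent(m["stamp"], m["host"], m["iface"], m["driver"], int(m["queue"]))


def summarize(lines: Iterable[str]) -> Iterator[str]:
    """Yield a one-line summary for every watchdog timeout found in lines."""
    for raw in lines:
        event = parse_watchdog(raw)
        if event is not None:
            yield f"{event.host} {event.stamp} {event.iface} ({event.driver}) tx queue {event.queue}"
```

For example, iterating over `summarize(open("/var/log/kern.log"))` on each node (assuming rsyslog writes kernel messages to that file, as Debian does by default) would list one line per timeout, making it easier to line up the reports from tyne (queue 1) and tees (queue 3).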
The lspci -vvv output for one of the affected ports (eth1 on tyne) is:

> 06:00.1 Ethernet controller: Intel Corporation I350 Gigabit Network Connection (rev 01)
>         Subsystem: Super Micro Computer Inc Device 1521
>         Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
>         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>         Latency: 0, Cache Line Size: 64 bytes
>         Interrupt: pin B routed to IRQ 17
>         Region 0: Memory at fbd00000 (32-bit, non-prefetchable) [size=128K]
>         Region 2: I/O ports at d000 [size=32]
>         Region 3: Memory at fbdc0000 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Power Management version 3
>                 Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>         Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>                 Address: 0000000000000000 Data: 0000
>                 Masking: 00000000 Pending: 00000000
>         Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
>                 Vector table: BAR=3 offset=00000000
>                 PBA: BAR=3 offset=00002000
>         Capabilities: [a0] Express (v2) Endpoint, MSI 00
>                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>                         ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
>                 DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported+
>                         RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
>                         MaxPayload 128 bytes, MaxReadReq 512 bytes
>                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
>                 LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Latency L0 <4us, L1 <32us
>                         ClockPM- Surprise- LLActRep- BwNot-
>                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
>                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>                 LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
>                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
>                 LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
>                          Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
>                          Compliance De-emphasis: -6dB
>                 LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>                          EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>         Capabilities: [100 v2] Advanced Error Reporting
>                 UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>                 UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>                 CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
>                 AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
>         Capabilities: [140 v1] Device Serial Number 00-25-90-ff-ff-4e-ae-18
>         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>                 ARICap: MFVC- ACS-, Next Function: 0
>                 ARICtl: MFVC- ACS-, Function Group: 0
>         Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>                 IOVCap: Migration-, Interrupt Message Number: 000
>                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>                 IOVSta: Migration-
>                 Initial VFs: 8, Total VFs: 8, Number of VFs: 8, Function Dependency Link: 01
>                 VF offset: 384, stride: 4, Device ID: 1520
>                 Supported Page Size: 00000553, System Page Size: 00000001
>                 Region 0: Memory at fbd60000 (32-bit, non-prefetchable)
>                 Region 3: Memory at fbd40000 (32-bit, non-prefetchable)
>                 VF Migration: offset: 00000000, BIR: 0
>         Capabilities: [1a0 v1] Transaction Processing Hints
>                 Device specific mode supported
>                 Steering table in TPH capability structure
>         Capabilities: [1d0 v1] Access Control Services
>                 ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>                 ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>         Kernel driver in use: igb

Please let me know if I can provide any further information.

Best regards,
Chris

-- 
Chris Boot
Tiger Computing Ltd    "Linux for Business"
Tel: 01600 483 484     Web: http://www.tiger-computing.co.uk
Follow us on Facebook: http://www.facebook.com/TigerComputing

Registered in England. Company number: 3389961
Registered address: Wyastone Business Park,
 Wyastone Leys, Monmouth, NP25 3SR