After 2 to 4 days of uptime, our routers are hitting a general protection fault in dev_hard_start_xmit. We are going to try the latest 4.19.24 release to see whether the bug has been resolved, but I didn't see any obvious fixes in the commit log. We are also going to test a much older 4.9.159 kernel as a starting point for finding when this problem was introduced.

Please let me know if there is any additional information I can provide or any test patches you would like me to try.

Thanks,
Jesse Hathaway
Decoded stack trace: decode_stacktrace.sh was unable to decode the RIP line, but gdb was able to. If someone knows why that failed, I would love to know.

(gdb) l *dev_hard_start_xmit+0x38
0xffffffff815e7488 is in dev_hard_start_xmit (net/core/dev.c:3256).
3251    {
3252            struct sk_buff *skb = first;
3253            int rc = NETDEV_TX_OK;
3254
3255            while (skb) {
3256                    struct sk_buff *next = skb->next;
3257
3258                    skb->next = NULL;
3259                    rc = xmit_one(skb, dev, txq, next != NULL);
3260                    if (unlikely(!dev_xmit_complete(rc))) {
(gdb)

[423866.182835] general protection fault: 0000 [#1] SMP PTI
[423866.188774] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G E 4.19.18-bt7u1-amd64 #1
[423866.198308] Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 2.8.0 005/17/2018
[423866.206874] RIP: 0010:dev_hard_start_xmit (??:?)
[423866.212522] Code: 53 48 83 ec 28 48 85 ff 48 89 54 24 08 48 89 4c 24 18 0f 84 b9 01 00 00 48 8d 86 90 00 00 00 48 89 f5 48 89 fb 48 89 44 24 10 <4c> 8b 33 48 c7 03 00 00 00 00 48 8b 05 77 46 b4 00 4d 85 f6 0f 95
All code
========
   0:   53                      push   %rbx
   1:   48 83 ec 28             sub    $0x28,%rsp
   5:   48 85 ff                test   %rdi,%rdi
   8:   48 89 54 24 08          mov    %rdx,0x8(%rsp)
   d:   48 89 4c 24 18          mov    %rcx,0x18(%rsp)
  12:   0f 84 b9 01 00 00       je     0x1d1
  18:   48 8d 86 90 00 00 00    lea    0x90(%rsi),%rax
  1f:   48 89 f5                mov    %rsi,%rbp
  22:   48 89 fb                mov    %rdi,%rbx
  25:   48 89 44 24 10          mov    %rax,0x10(%rsp)
  2a:*  4c 8b 33                mov    (%rbx),%r14              <-- trapping instruction
  2d:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
  34:   48 8b 05 77 46 b4 00    mov    0xb44677(%rip),%rax        # 0xb446b2
  3b:   4d 85 f6                test   %r14,%r14
  3e:   0f                      .byte 0xf
  3f:   95                      xchg   %eax,%ebp

Code starting with the faulting instruction
===========================================
   0:   4c 8b 33                mov    (%rbx),%r14
   3:   48 c7 03 00 00 00 00    movq   $0x0,(%rbx)
   a:   48 8b 05 77 46 b4 00    mov    0xb44677(%rip),%rax        # 0xb44688
  11:   4d 85 f6                test   %r14,%r14
  14:   0f                      .byte 0xf
  15:   95                      xchg   %eax,%ebp
[423866.233612] RSP: 0018:ffff96f4af483b18 EFLAGS: 00010202
[423866.239550] RAX: ffff96f3f72b6600 RBX: 2e5903fe657c2d03 RCX: 0000000000000003
[423866.247627] RDX: ffffcc02bf687600 RSI: 00000000fffffe01 RDI: ffffffffb69e864d
[423866.255703] RBP: ffff96f4a802a000 R08: 0000000000000001 R09: 00000000000003e8
[423866.263779] R10: 00000000000002f5 R11: ffff96f4a86ff940 R12: ffff96f4a802a000
[423866.271854] R13: 0000000000000032 R14: 2e5903fe657c2d03 R15: 0000000000000000
[423866.279931] FS: 0000000000000000(0000) GS:ffff96f4af480000(0000) knlGS:0000000000000000
[423866.289075] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[423866.295596] CR2: 00007fb3c351f000 CR3: 000000074080a001 CR4: 00000000001606e0
[423866.303671] Call Trace:
[423866.306501]  <IRQ>
[423866.308847] __dev_queue_xmit (/source/linux-4.19.18/net/core/dev.c:3830)
[423866.313427] ip_finish_output2 (/source/linux-4.19.18/./include/net/neighbour.h:501 /source/linux-4.19.18/net/ipv4/ip_output.c:229)
[423866.318105] ip_output (/source/linux-4.19.18/net/ipv4/ip_output.c:409)
[423866.321810] ? ip_fragment.constprop.49 (/source/linux-4.19.18/net/ipv4/ip_output.c:293)
[423866.327166] ip_forward (/source/linux-4.19.18/net/ipv4/ip_forward.c:150)
[423866.331161] ? ip_check_defrag (/source/linux-4.19.18/net/ipv4/ip_forward.c:66)
[423866.335837] ip_rcv (/source/linux-4.19.18/net/ipv4/ip_input.c:527)
[423866.339250] ? ip_rcv_core.isra.15 (/source/linux-4.19.18/net/ipv4/ip_input.c:403)
[423866.344314] __netif_receive_skb_one_core (/source/linux-4.19.18/net/core/dev.c:4920)
[423866.349866] netif_receive_skb_internal (/source/linux-4.19.18/net/core/dev.c:5134)
[423866.355222] napi_gro_receive (/source/linux-4.19.18/net/core/dev.c:5591 /source/linux-4.19.18/net/core/dev.c:5622)
[423866.359615] ixgbe_poll (/source/linux-4.19.18/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:2404 /source/linux-4.19.18/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:3186) ixgbe
[423866.364488] ? load_balance (/source/linux-4.19.18/kernel/sched/fair.c:8578)
[423866.368875] net_rx_action (/source/linux-4.19.18/net/core/dev.c:6262 /source/linux-4.19.18/net/core/dev.c:6328)
[423866.373164] __do_softirq (/source/linux-4.19.18/kernel/softirq.c:292 /source/linux-4.19.18/./include/linux/jump_label.h:142 /source/linux-4.19.18/./include/trace/events/irq.h:142 /source/linux-4.19.18/kernel/softirq.c:293)
[423866.377260] irq_exit (/source/linux-4.19.18/kernel/softirq.c:372 /source/linux-4.19.18/kernel/softirq.c:412)
[423866.380867] do_IRQ (/source/linux-4.19.18/./arch/x86/include/asm/irq_regs.h:19 /source/linux-4.19.18/./arch/x86/include/asm/irq_regs.h:26 /source/linux-4.19.18/arch/x86/kernel/irq.c:260)
[423866.384378] common_interrupt (/source/linux-4.19.18/arch/x86/entry/entry_64.S:646)
[423866.388567]  </IRQ>
[423866.391587] RIP: 0010:mwait_idle (??:?)
[423866.396925] Code: 01 00 0f ae 38 0f ae f0 31 d2 65 48 8b 04 25 40 5c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 30 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d a3 13 53 49 0f 1f 44 00 00 eb 07 fb 66 0f 1f 44 00 00 65
All code
========
   0:   01 00                   add    %eax,(%rax)
   2:   0f ae 38                clflush (%rax)
   5:   0f ae f0                mfence
   8:   31 d2                   xor    %edx,%edx
   a:   65 48 8b 04 25 40 5c    mov    %gs:0x15c40,%rax
  11:   01 00
  13:   48 89 d1                mov    %rdx,%rcx
  16:   0f 01 c8                monitor %rax,%rcx,%rdx
  19:   48 8b 00                mov    (%rax),%rax
  1c:   a8 08                   test   $0x8,%al
  1e:   0f 85 30 01 00 00       jne    0x154
  24:   31 c0                   xor    %eax,%eax
  26:   fb                      sti
  27:   0f 01 c9                mwait  %rax,%rcx
  2a:   65 8b 2d a3 13 53 49    mov    %gs:*0x495313a3(%rip),%ebp        # 0x495313d4     <-- trapping instruction
  31:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  36:   eb 07                   jmp    0x3f
  38:   fb                      sti
  39:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  3f:   65                      gs

Code starting with the faulting instruction
===========================================
   0:   65 8b 2d a3 13 53 49    mov    %gs:0x495313a3(%rip),%ebp        # 0x495313aa
   7:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
   c:   eb 07                   jmp    0x15
   e:   fb                      sti
   f:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
  15:   65                      gs
[423866.419176] RSP: 0018:ffffac06c01ebe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdb
[423866.428304] RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
[423866.436944] RDX: 0000000000000000 RSI: ffff96f4af49a760 RDI: 0000000000000004
[423866.445569] RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
[423866.454198] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[423866.462825] R13: ffff96f4ad1aac40 R14: ffff96f4ad1aac40 R15: ffff96f4ad1aac40
[423866.471446] do_idle (/source/linux-4.19.18/kernel/sched/idle.c:153 /source/linux-4.19.18/kernel/sched/idle.c:262)
[423866.475681] cpu_startup_entry (/source/linux-4.19.18/kernel/sched/idle.c:368 (discriminator 1))
[423866.480690] start_secondary (/source/linux-4.19.18/arch/x86/kernel/smpboot.c:272)
[423866.485693] secondary_startup_64 (/source/linux-4.19.18/arch/x86/kernel/head_64.S:243)
[423866.490976] Modules linked in: drbg(E) ansi_cprng(E) echainiv(E) esp4(E) xfrm4_mode_transport(E) tcp_diag(E) inet_diag(E) nf_conntrack_netlink(E) xt_nat(E) xt_policy(E) nfnetlink_log(E) xt_NFLOG(E) xt_limit(E) ipt_REJECT(E) nf_)
[423866.575127]  serpent_avx2(E) serpent_avx_x86_64(E) serpent_sse2_x86_64(E) serpent_generic(E) glue_helper(E) blowfish_generic(E) blowfish_x86_64(E) blowfish_common(E) cast5_avx_x86_64(E) cast5_generic(E) cast_common(E) crypto_si)
[423866.658693]  crc32c_generic(E) crc32c_intel(E) ext4(E) crc16(E) mbcache(E) jbd2(E) fscrypto(E) sg(E) sd_mod(E) ehci_pci(E) ahci(E) ehci_hcd(E) libahci(E) ixgbe(E) libata(E) megaraid_sas(E) dca(E) usbcore(E) mdio(E) i40e(E) scsi)
[423866.683437] ---[ end trace e0abd70b6f85b1fd ]---

# awk -f scripts/ver_linux
Linux rtr1 4.19.18-bt7u1-amd64 #1 SMP Mon Feb 11 20:09:59 UTC 2019 x86_64 GNU/Linux

GNU C                   4.7
GNU Make                3.81
Binutils                2.22
Util-linux              2.20.1
Mount                   2.20.1
E2fsprogs               1.42.5
Linux C Library         2.13
Dynamic linker (ldd)    2.13
Linux C++ Library       6.0.17
Procps                  3.3.3
Net-tools               1.60
Sh-utils                8.13
Udev                    175
Modules Loaded          8021q acpi_power_meter aesni_intel aes_x86_64 af_key ahci ansi_cprng authenc blowfish_common blowfish_generic blowfish_x86_64 bonding button camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_generic camellia_x86_64 cast5_avx_x86_64 cast5_generic cast_common cbc cmac crc16 crc32c_generic crc32c_intel cryptd crypto_simd ctr dca dcdbas des_generic drbg drm drm_kms_helper dummy echainiv ehci_hcd ehci_pci esp4 evdev ext4 fscrypto garp glue_helper gre i2c_algo_bit i2c_dev i2c_i801 i40e inet_diag ioatdma ip_gre ipmi_devintf ipmi_msghandler ipmi_si ip_set ip_set_hash_net ip_set_hash_netiface ip_set_hash_netport iptable_filter iptable_mangle iptable_nat iptable_raw ip_tables ipt_REJECT ip_tunnel iTCO_vendor_support iTCO_wdt ixgbe jbd2 libahci libata libcrc32c llc loop lpc_ich mbcache mdio megaraid_sas mei mei_me mgag200 mrp mxm_wmi nf_conntrack nf_conntrack_netlink nf_conntrack_proto_gre nf_defrag_ipv4 nf_nat nf_nat_ipv4 nfnetlink nfnetlink_log nf_reject_ipv4 pcbc pcrypt pcspkr rmd160 scsi_mod sd_mod serpent_avx2 serpent_avx_x86_64 serpent_generic serpent_sse2_x86_64 sg sha512_generic sha512_ssse3 snd snd_pcm snd_timer soundcore stp tcp_diag ttm twofish_avx_x86_64 twofish_common twofish_generic twofish_x86_64 twofish_x86_64_3way usbcore wmi xcbc xfrm4_mode_transport xfrm_algo x_tables xt_addrtype xt_connmark xt_conntrack xt_CT xt_limit xt_mark xt_nat xt_NFLOG xt_policy xt_set xt_tcpudp