On Mon, Nov 20, 2017 at 2:38 PM, Sarah Newman <sarah.new...@computer.org> wrote:
> On 11/20/2017 08:36 AM, Alexander Duyck wrote:
>> Hi Sarah,
>>
>> I am adding the netdev mailing list as I am not certain this is an
>> i350-specific issue. The traces themselves aren't anything I recognize
>> as an existing issue. From what I can tell it looks like you are
>> running Xen, so would I be correct in assuming you are bridging
>> between VMs? If so, are you using any sort of tunnels on your network,
>> and if so, what type? This information would be useful as we may be
>> looking at a bug in a tunnel offload for GRO.
>
> Yes, there's bridging. The traffic on the physical device is tagged with
> VLANs and the bridges use untagged traffic. There are no tunnels. I do not
> own the VMs' traffic.
>
> Because I have only seen this on a single server with unique hardware, I
> think it's most likely related to the hardware or to a particular VM on that
> server.

So I would suspect traffic coming from the VM, if anything. The i350 is a
pretty common device. If we were seeing issues specific to it, I would
expect more reports than just the one so far.

>>
>> On Fri, Nov 17, 2017 at 3:28 PM, Sarah Newman <sarah.new...@computer.org>
>> wrote:
>>> Hi,
>>>
>>> I have an X10 supermicro with two I350's that has crashed twice now under
>>> v4.9.39 within the last 3 weeks, with no crashes before v4.9.39:
>>
>> What was the last kernel you tested before v4.9.39? Just wondering as
>> it will help to rule out certain patches as possibly being the issue.
>
> 4.9.31.
>
> If the problem is related to a particular VM, then I don't think the last
> known good kernel is necessarily pertinent, as the problematic traffic could
> have started at any time.
>
>>> I see in the release notes
>>> https://downloadmirror.intel.com/22919/eng/README.txt "Do Not Use LRO When
>>> Routing Packets."
>>>
>>> We are bridging traffic, not routing, and the crashes are in the GRO code.
>>>
>>> Is it possible there are problems with GRO for bridging in the igb driver
>>> now? If I disable GRO can I have some confidence it will fix the issue?
>>
>> As far as LRO not being used when routing, just so you know LRO and
>> GRO are two very different things. One of the issues with LRO is that
>> it wasn't reversible in some cases and so could lead to the packet
>> being changed if they were rerouted. With GRO that shouldn't be the
>> case as we should be able to get back out the original packets that
>> were put into a frame. So there shouldn't be any issues using GRO with
>> bridging or routing.
>
> In some very old release notes for the ixgbe
> https://downloadmirror.intel.com/22919/eng/README.txt it said to disable GRO
> for bridging/routing, and it wasn't clear it was not specific to the driver.
> I didn't originally notice how old the release notes were and that the
> notice was removed in newer versions; I apologize.
>
>>> First crash:
>>>
>>> [4083386.299221] ------------[ cut here ]------------
>>> [4083386.299358] WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:1473
>>> inet_gro_complete+0xbb/0xd0
>>> [4083386.299520] Modules linked in: sb_edac edac_core 8021q mrp garp
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc
>>> xenfs xen_privcmd xen_evtchn xen_blkback tun sch_htb fuse ext2
>>> ebt_mark ebt_ip ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw
>>> br_netfilter bridge stp llc iTCO_wdt iTCO_vendor_support pcspkr raid456
>>> async_raid6_recov async_pq
>>> async_xor xor async_memcpy async_tx raid10 raid6_pq libcrc32c joydev
>>> shpchp i2c_i801 i2c_smbus mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler
>>> acpi_power_meter ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core
>>> mlx4_core mpt3sas
>>> scsi_transport_sas raid_class wmi ast ttm
>>> [4083386.300888] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.39 #1
>>> [4083386.301002] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>>> 2.0a 09/16/2016
>>> [4083386.301109] ffff880306603d90 ffffffff813f5935 0000000000000000
>>> 0000000000000000
>>> [4083386.301221] ffff880306603dd0 ffffffff810a7e01 000005c18174578a
>>> ffff8802f94a9a00
>>> [4083386.301333] ffff8802f0824450 0000000000000000 0000000000000040
>>> 0000000000000040
>>> [4083386.301445] Call Trace:
>>> [4083386.301483] <IRQ> [4083386.301519] dump_stack+0x63/0x8e
>>> [4083386.301596] __warn+0xd1/0xf0
>>> [4083386.301665] warn_slowpath_null+0x1d/0x20
>>> [4083386.301747] inet_gro_complete+0xbb/0xd0
>>> [4083386.301830] napi_gro_complete+0x73/0xa0
>>> [4083386.301911] napi_gro_flush+0x5f/0x80
>>> [4083386.301988] napi_complete_done+0x6a/0xb0
>>> [4083386.302075] igb_poll+0x38d/0x720 [igb]
>>> [4083386.302156] ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.302255] ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.302349] net_rx_action+0x158/0x360
>>> [4083386.302430] __do_softirq+0xd1/0x283
>>> [4083386.302507] irq_exit+0xe9/0x100
>>> [4083386.302580] xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.302665] xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.302754] <EOI> [4083386.302787] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302876] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.302965] ? xen_safe_halt+0x10/0x20
>>> [4083386.303043] ? default_idle+0x1e/0xd0
>>> [4083386.303122] ? arch_cpu_idle+0xf/0x20
>>> [4083386.303200] ? default_idle_call+0x2c/0x40
>>> [4083386.303284] ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.303370] ? rest_init+0x77/0x80
>>> [4083386.303462] ? start_kernel+0x4a7/0x4b4
>>> [4083386.303568] ? set_init_arg+0x55/0x55
>>> [4083386.303670] ? x86_64_start_reservations+0x24/0x26
>>> [4083386.303776] ? xen_start_kernel+0x555/0x561
>>> [4083386.303873] ---[ end trace 8294f59ced689507 ]---

I think this first trace is more important than the one below.
Specifically, it points to a problem in GRO assembly: either there are
no GRO ops, or there is no gro_complete function, for whatever protocol
was found in the packet.

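For anyone following along, the condition behind that warning can be modelled roughly as below. This is a simplified Python sketch and not the kernel source: `Offload`, `INET_OFFLOADS`, `inet_gro_complete_model`, and the table contents are hypothetical stand-ins for the per-protocol offload table that inet_gro_complete() looks up by the IP protocol number, and protocol 253 is used only as an illustrative entry without a completion callback.

```python
import errno
from dataclasses import dataclass
from typing import Callable, Dict, Optional

IPPROTO_TCP = 6
EXAMPLE_PROTO_NO_CB = 253  # illustrative entry with no gro_complete callback


@dataclass
class Offload:
    """Hypothetical stand-in for one per-protocol offload entry."""
    gro_complete: Optional[Callable[[bytes], int]] = None


# Illustrative table keyed by IP protocol number (assumed contents).
INET_OFFLOADS: Dict[int, Offload] = {
    IPPROTO_TCP: Offload(gro_complete=lambda pkt: 0),
    EXAMPLE_PROTO_NO_CB: Offload(gro_complete=None),
}


def inet_gro_complete_model(proto: int, pkt: bytes = b"") -> int:
    """Return the callback result, or -ENOSYS in the two warning cases."""
    ops = INET_OFFLOADS.get(proto)
    # Case 1: no ops registered for this protocol.
    # Case 2: ops exist but carry no gro_complete callback.
    if ops is None or ops.gro_complete is None:
        return -errno.ENOSYS
    return ops.gro_complete(pkt)
```

In this model, a protocol number that was valid when the packet was queued but reads as something else at completion time would fall into the first case.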
>>> [4083386.303958] general protection fault: 0000 [#1] SMP
>>> [4083386.304041] Modules linked in: sb_edac edac_core 8021q mrp garp
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> ip6table_filter
>>> ip6_tables xen_pciback blktap xen_netback xen_gntdev xen_gntalloc xenfs
>>> xen_privcmd xen_evtchn
>>> xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip ebt_arp
>>> ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge stp llc
>>> iTCO_wdt
>>> iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq async_xor xor
>>> async_memcpy
>>> async_tx raid10 raid6_pq libcrc32c joydev shpchp i2c_i801 i2c_smbus
>>> mei_me mei lpc_ich fjes ipmi_si ipmi_msghandler acpi_power_meter ioatdma
>>> igb dca
>>> raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core mpt3sas
>>> scsi_transport_sas raid_class
>>> wmi ast ttm
>>> [4083386.305179] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W
>>> 4.9.39 #1
>>> [4083386.305307] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>>> 2.0a 09/16/2016
>>> [4083386.305414] task: ffffffff81e0e540 task.stack: ffffffff81e00000
>>> [4083386.305498] RIP: e030: skb_release_data+0x73/0xf0
>>> [4083386.305617] RSP: e02b:ffff880306603d90 EFLAGS: 00010206
>>> [4083386.305692] RAX: 0000000000000030 RBX: f5b36db76bd162c7 RCX:
>>> ffffffff81e60048
>>> [4083386.305790] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
>>> ffff8802f94a9a00
>>> [4083386.305887] RBP: ffff880306603db0 R08: 0000000000004277 R09:
>>> 0000000000000000
>>> [4083386.305985] R10: 0000000000000005 R11: 0000000000000002 R12:
>>> 0000000000000000
>>> [4083386.306083] R13: ffff8802f94a9a00 R14: ffff88032f527740 R15:
>>> 0000000000000040
>>> [4083386.306186] FS: 0000000000000000(0000) GS:ffff880306600000(0000)
>>> knlGS:0000000000000000
>>> [4083386.306296] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [4083386.306407] CR2: 0000000001692ed8 CR3: 000000022b3c9000 CR4:
>>> 0000000000042660
>>> [4083386.306505] Stack:
>>> [4083386.306537] ffff8802f94a9a00 ffff8802f94a9a00 ffffffff8175ac3e
>>> 0000000000000040
>>> [4083386.306649] ffff880306603dc8 ffffffff81745764 ffff8802f94a9a00
>>> ffff880306603df0
>>> [4083386.306762] ffffffff817457c2 ffff8802f94a9a00 ffff8802f0824450
>>> 0000000000000000
>>> [4083386.306874] Call Trace:
>>> [4083386.306911] <IRQ> [4083386.306944] ? napi_gro_complete+0x5e/0xa0
>>> [4083386.307038] skb_release_all+0x24/0x30
>>> [4083386.307133] kfree_skb+0x32/0x90
>>> [4083386.307206] napi_gro_complete+0x5e/0xa0
>>> [4083386.307287] napi_gro_flush+0x5f/0x80
>>> [4083386.307365] napi_complete_done+0x6a/0xb0
>>> [4083386.307449] igb_poll+0x38d/0x720 [igb]
>>> [4083386.307530] ? igb_msix_ring+0x2e/0x40 [igb]
>>> [4083386.307617] ? __handle_irq_event_percpu+0x4b/0x1a0
>>> [4083386.307720] net_rx_action+0x158/0x360
>>> [4083386.307800] __do_softirq+0xd1/0x283
>>> [4083386.307877] irq_exit+0xe9/0x100
>>> [4083386.307949] xen_evtchn_do_upcall+0x35/0x50
>>> [4083386.308034] xen_do_hypervisor_callback+0x1e/0x40
>>> [4083386.308124] <EOI> [4083386.308156] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308246] ? xen_hypercall_sched_op+0xa/0x20
>>> [4083386.308334] ? xen_safe_halt+0x10/0x20
>>> [4083386.308413] ? default_idle+0x1e/0xd0
>>> [4083386.308491] ? arch_cpu_idle+0xf/0x20
>>> [4083386.308568] ? default_idle_call+0x2c/0x40
>>> [4083386.308651] ? cpu_startup_entry+0x1ac/0x240
>>> [4083386.308737] ? rest_init+0x77/0x80
>>> [4083386.308811] ? start_kernel+0x4a7/0x4b4
>>> [4083386.308890] ? set_init_arg+0x55/0x55
>>> [4083386.308968] ? x86_64_start_reservations+0x24/0x26
>>> [4083386.309060] ? xen_start_kernel+0x555/0x561
>>> [4083386.309144] Code: f0 41 0f c1 46 20 39 c2 74 09 5b 41 5c 41 5d 41 5e
>>> 5d c3 45 31 e4 41 80 3e 00 74 39 49 63 c4 48 83 c0 03 48 c1 e0 04 49 8b 1c
>>> 06 <48> 8b 43 20 a8 01 75 6f f0 ff 4b 1c 74 55 48 8b 03 48 c1 e8 33
>>> [4083386.309571] RIP skb_release_data+0x73/0xf0
>>> [4083386.309658] RSP <ffff880306603d90>
>>> [4083386.313000] ---[ end trace 8294f59ced689508 ]---
>>> [4083386.389667] Kernel panic - not syncing: Fatal exception in interrupt
>>> [4083386.389791] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> Output of addr2line for address of skb_release_data+0x73 is
>
> __read_once_size
> include/linux/compiler.h:243 (discriminator 2)
> compound_head
> include/linux/page-flags.h:143 (discriminator 2)
> put_page
> include/linux/mm.h:777 (discriminator 2)
> __skb_frag_unref
> include/linux/skbuff.h:2592 (discriminator 2)
> skb_release_data
> net/core/skbuff.c:594 (discriminator 2)
>
> skbuff.c:594 is:
>
> __skb_frag_unref(&shinfo->frags[i]);
>
> Actual assembly is:
> <+91>: xor %r12d,%r12d
> <+94>: cmpb $0x0,(%r14)
> <+98>: je <skb_release_data+157>
> <+100>: movslq %r12d,%rax
> <+103>: add $0x3,%rax
> <+107>: shl $0x4,%rax
> <+111>: mov (%r14,%rax,1),%rbx
> <+115>: mov 0x20(%rbx),%rax <------ this is skb_release_data+0x73
> <+119>: test $0x1,%al
> <+121>: jne <skb_release_data+234>
>
> rbx is f5b36db76bd162c7, which seems like garbage. I don't know if this looks
> like any particular garbage.
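
One thing that can be checked mechanically about that value: on x86-64 with 48-bit virtual addresses, a pointer is canonical only if bits 63..47 are all equal, and a load through a non-canonical pointer raises a general protection fault rather than a page fault. The sketch below is a minimal Python check under that 48-bit assumption; `is_canonical` and `frag_slot_offset` are hypothetical helper names, and the second one simply repeats the `(i + 3) << 4` arithmetic from the listing above.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits - 1) of a 64-bit address are all equal."""
    top = addr >> (va_bits - 1)               # bits 63..47 when va_bits == 48
    all_ones = (1 << (64 - va_bits + 1)) - 1  # 17 one-bits when va_bits == 48
    return top in (0, all_ones)


def frag_slot_offset(i: int) -> int:
    """Byte offset from %r14 computed by the add $0x3 / shl $0x4 pair."""
    return (i + 3) << 4


if __name__ == "__main__":
    samples = {
        "RBX, first crash": 0xF5B36DB76BD162C7,
        "RSI, second crash": 0x62A16DDEDC6DBCB3,
        "RDI, first crash": 0xFFFF8802F94A9A00,
    }
    for name, value in samples.items():
        print(f"{name}: {value:#018x} canonical={is_canonical(value)}")
    print(f"frag slot offset for i=0: {frag_slot_offset(0):#x}")
```

With i = 0 the offset comes out as 0x30, which matches RAX in the first register dump, so the bad value was loaded from the first frags[] slot. Both RBX in the first crash and RSI in the second are non-canonical, which is consistent with both faults being reported as general protection faults.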
>
>>> Second crash:
>>>
>>> [1838269.012349] general protection fault: 0000 [#1] SMP
>>> [1838269.012452] Modules linked in: ebtable_nat sb_edac edac_core 8021q mrp
>>> garp nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> ip6table_filter ip6_tables xen_pciback blktap xen_netback xen_gntdev
>>> xen_gntalloc xenfs xen_privcmd
>>> xen_evtchn xen_blkback tun sch_htb fuse ext2 ebt_mark ebt_ip
>>> ebt_arp ebtable_filter ebtables drbd lru_cache cls_fw br_netfilter bridge
>>> stp
>>> llc iTCO_wdt iTCO_vendor_support pcspkr raid456 async_raid6_recov async_pq
>>> async_xor xor
>>> async_memcpy async_tx raid10 raid6_pq libcrc32c joydev i2c_i801 i2c_smbus
>>> lpc_ich shpchp mei_me mei fjes ipmi_si ipmi_msghandler acpi_power_meter
>>> ioatdma igb dca raid1 mlx4_en mlx4_ib ib_core ptp pps_core mlx4_core
>>> mpt3sas scsi_transport_sas
>>> raid_class wmi ast ttm
>>> [1838269.013521] CPU: 1 PID: 18 Comm: ksoftirqd/1 Not tainted 4.9.39 #1
>>> [1838269.013637] Hardware name: Supermicro Super Server/X10DRi-LN4+, BIOS
>>> 2.0a 09/16/2016
>>> [1838269.013743] task: ffff88030008c4c0 task.stack: ffffc90041978000
>>> [1838269.013826] RIP: e030: memcpy_erms+0x6/0x10
>>> [1838269.013952] RSP: e02b:ffffc9004197bac0 EFLAGS: 00010202
>>> [1838269.014026] RAX: ffff88032fcafe16 RBX: 0000000000000004 RCX:
>>> 0000000000000004
>>> [1838269.014124] RDX: 0000000000000004 RSI: 62a16ddedc6dbcb3 RDI:
>>> ffff88032fcafe16
>>> [1838269.014222] RBP: ffffc9004197bb20 R08: 0000000000000004 R09:
>>> 0000000000000004
>>> [1838269.014320] R10: ffff88026ae89500 R11: 0000000044639632 R12:
>>> 0000000000000048
>>> [1838269.014417] R13: 0000000000000000 R14: 0000000044639632 R15:
>>> 0000000000000048
>>> [1838269.014519] FS: 0000000000000000(0000) GS:ffff880306640000(0000)
>>> knlGS:ffff880306640000
>>> [1838269.014629] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [1838269.014709] CR2: ffffffffff600400 CR3: 0000000051939000 CR4:
>>> 0000000000042660
>>> [1838269.014808] Stack:
>>> [1838269.014840] ffffffff81744c17 ffff88026ae89500 0000000044639632
>>> ffff88030008c4c0
>>> [1838269.014952] ffffffff00000004 0000000000000004 ffff88032fcafe16
>>> ffff88026ae89500
>>> [1838269.015064] 0000000000000004 0000000000000004 000000000000004c
>>> 0000000000000028
>>> [1838269.015176] Call Trace:
>>> [1838269.015217] ? skb_copy_bits+0x137/0x2c0
>>> [1838269.015299] __pskb_pull_tail+0x7f/0x3b0
>>> [1838269.015382] tcp_gro_receive+0x2c5/0x300
>>> [1838269.015465] tcp6_gro_receive+0x13a/0x1a0
>>> [1838269.015547] ipv6_gro_receive+0x1c6/0x380
>>> [1838269.015630] dev_gro_receive+0x269/0x3b0
>>> [1838269.015712] napi_gro_receive+0x38/0xf0
>>> [1838269.015796] igb_clean_rx_irq+0x38e/0x690 [igb]
>>> [1838269.015886] igb_poll+0x362/0x720 [igb]
>>> [1838269.015968] ? dequeue_entity+0x26e/0xa90
>>> [1838269.016051] ? xen_mc_flush+0x17b/0x1b0
>>> [1838269.016131] net_rx_action+0x158/0x360
>>> [1838269.016212] __do_softirq+0xd1/0x283
>>> [1838269.016290] ? sort_range+0x30/0x30
>>> [1838269.016366] run_ksoftirqd+0x29/0x50
>>> [1838269.016443] smpboot_thread_fn+0x110/0x160
>>> [1838269.016525] kthread+0xd7/0xf0
>>> [1838269.016595] ? kthread_park+0x60/0x60
>>> [1838269.016673] ret_from_fork+0x25/0x30
>>> [1838269.016758] Code: ff 90 90 90 90 eb 1e 0f 1f 00 48 89 f8 48 89 d1 48
>>> c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 c3 66 0f 1f 44 00 00 48 89 f8 48 89
>>> d1 <f3> a4 c3 0f 1f 80 00 00 00 00 48 89 f8 48 83 fa 20 72 7e 40 38
>>> [1838269.017183] RIP memcpy_erms+0x6/0x10
>>> [1838269.017264] RSP <ffffc9004197bac0>
>>> [1838269.020618] ---[ end trace 3506ce1d7200529a ]---
>>> [1838269.079891] Kernel panic - not syncing: Fatal exception in interrupt
>>> [1838269.080014] Kernel Offset: disabled
>>> (XEN) Hardware Dom0 crashed: rebooting machine in 5 seconds.
>
> --Sarah