Hello intel-wired-lan,

We experienced kernel warnings / crashes on multiple machines when using LACP bonding on our E810-CQDA2 NICs, running Ubuntu 22.04 LTS (Jammy) with the HWE kernel 6.2.0-36-generic.

The two 100G ports of each NIC are bonded using LACP (and, likely unrelated, connected via MLAG to two Arista DCS-7050CX3M-32S switches).
This is how the issue looks in the kernel log:

--- cut ---
2023-11-07T10:29:21.920285+00:00 server123 kernel: [413987.459120] ------------[ cut here ]------------
2023-11-07T10:29:21.920298+00:00 fra-az1-comp-24 kernel: [413987.459123] NETDEV WATCHDOG: eth3 (ice): transmit queue 17 timed out
2023-11-07T10:29:21.920298+00:00 server123 kernel: [413987.459134] WARNING: CPU: 76 PID: 472260 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920299+00:00 server123 kernel: [413987.459142] Modules linked in: xt_multiport xt_REDIRECT xt_nat xt_connmark xt_mark veth ebt_arp nft_meta_bridge ip6_tables xt_CT xt_mac xt_set xt_state ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel xt_comment xt_physdev vhost_net vhost vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink xfrm_user xfrm_algo nvme_fabrics 8021q garp mrp br_netfilter bridge stp llc bonding binfmt_misc tls nls_ascii intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd ipmi_ssif kvm irqbypass rapl wmi_bmof irdma ib_uverbs ib_core input_leds joydev ccp k10temp ptdma switchtec acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel efi_pstore ip_tables x_tables autofs4 dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear hid_generic cdc_ether usbnet usbhid hid mii raid1
2023-11-07T10:29:21.920300+00:00 server123 kernel: [413987.459251]  ast i2c_algo_bit drm_shmem_helper crct10dif_pclmul crc32_pclmul drm_kms_helper polyval_clmulni polyval_generic syscopyarea ghash_clmulni_intel sysfillrect sha512_ssse3 sysimgblt aesni_intel crypto_simd cryptd ice ahci nvme drm i40e libahci xhci_pci i2c_piix4 xhci_pci_renesas nvme_core nvme_common wmi
2023-11-07T10:29:21.920300+00:00 server123 kernel: [413987.459284] CPU: 76 PID: 472260 Comm: tp_osd_tp Not tainted 6.2.0-36-generic #37~22.04.1-Ubuntu
2023-11-07T10:29:21.920301+00:00 server123 kernel: [413987.459287] Hardware name: ASUSTeK COMPUTER INC. RS720A-E11-RS24U/KMPP-D32 Series, BIOS 1501 08/23/2023
2023-11-07T10:29:21.920305+00:00 server123 kernel: [413987.459289] RIP: 0010:dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920306+00:00 server123 kernel: [413987.459292] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 d9 5f 78 01 01 e8 e6 ff f7 ff 44 89 f1 4c 89 e6 48 c7 c7 b8 8c a4 91 48 89 c2 e8 81 c3 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
2023-11-07T10:29:21.920306+00:00 server123 kernel: [413987.459294] RSP: 0018:ffffb445da1f0e70 EFLAGS: 00010246
2023-11-07T10:29:21.920307+00:00 server123 kernel: [413987.459297] RAX: 0000000000000000 RBX: ffff8fce1844f4c8 RCX: 0000000000000000
2023-11-07T10:29:21.920307+00:00 server123 kernel: [413987.459298] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2023-11-07T10:29:21.920308+00:00 server123 kernel: [413987.459299] RBP: ffffb445da1f0e98 R08: 0000000000000000 R09: 0000000000000000
2023-11-07T10:29:21.920308+00:00 server123 kernel: [413987.459301] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8fce1844f000
2023-11-07T10:29:21.920311+00:00 server123 kernel: [413987.459302] R13: ffff8fce1844f41c R14: 0000000000000011 R15: 0000000000000000
2023-11-07T10:29:21.920339+00:00 server123 kernel: [413987.459304] FS:  00007f80f73f4640(0000) GS:ffff8fcddeb00000(0000) knlGS:0000000000000000
2023-11-07T10:29:21.920340+00:00 server123 kernel: [413987.459306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-11-07T10:29:21.920341+00:00 server123 kernel: [413987.459308] CR2: 000056527fef8af0 CR3: 000000013121e003 CR4: 0000000000770ee0
2023-11-07T10:29:21.920341+00:00 server123 kernel: [413987.459309] PKRU: 55555554
2023-11-07T10:29:21.920342+00:00 server123 kernel: [413987.459311] Call Trace:
2023-11-07T10:29:21.920343+00:00 server123 kernel: [413987.459312]  <IRQ>
2023-11-07T10:29:21.920344+00:00 server123 kernel: [413987.459317]  ? show_regs+0x72/0x90
2023-11-07T10:29:21.920344+00:00 server123 kernel: [413987.459321]  ? dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459323]  ? __warn+0x8d/0x160
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459328]  ? dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459331]  ? report_bug+0x1bb/0x1d0
2023-11-07T10:29:21.920346+00:00 server123 kernel: [413987.459336]  ? handle_bug+0x46/0x90
2023-11-07T10:29:21.920346+00:00 server123 kernel: [413987.459339]  ? exc_invalid_op+0x19/0x80
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459342]  ? asm_exc_invalid_op+0x1b/0x20
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459350]  ? dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459353]  ? __pfx_dev_watchdog+0x10/0x10
2023-11-07T10:29:21.920348+00:00 server123 kernel: [413987.459356]  call_timer_fn+0x2c/0x160
2023-11-07T10:29:21.920348+00:00 server123 kernel: [413987.459360]  ? __pfx_dev_watchdog+0x10/0x10
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459363]  __run_timers.part.0+0x1fb/0x2b0
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459368]  run_timer_softirq+0x2a/0x60
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459370]  __do_softirq+0xdd/0x330
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459374]  ? hrtimer_interrupt+0x12b/0x250
2023-11-07T10:29:21.920359+00:00 server123 kernel: [413987.459379]  __irq_exit_rcu+0xa2/0xd0
2023-11-07T10:29:21.920359+00:00 server123 kernel: [413987.459382]  irq_exit_rcu+0xe/0x20
2023-11-07T10:29:21.920361+00:00 server123 kernel: [413987.459384]  sysvec_apic_timer_interrupt+0x96/0xb0
2023-11-07T10:29:21.920361+00:00 server123 kernel: [413987.459387]  </IRQ>
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459388]  <TASK>
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459389]  asm_sysvec_apic_timer_interrupt+0x1b/0x20
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459392] RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x60
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459395] Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 49 89 f0 48 89 e5 c6 07 00 0f 1f 00 41 f7 c0 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d d0 3b f8 6e 74 13 5d 31 c0 31 d2 31 c9 31 f6 31 ff 45 31
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459397] RSP: 0018:ffffb4464b557608 EFLAGS: 00000206
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459399] RAX: 0000000000000080 RBX: 0000000000100000 RCX: 0000000000000000
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459400] RDX: 0000000000000000 RSI: 0000000000000293 RDI: ffff8fcdee501008
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459402] RBP: ffffb4464b557608 R08: 0000000000000293 R09: 0000000000000000
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459403] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffffffffff80
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459404] R13: 0000000000000007 R14: 0000000000000001 R15: ffff8fcdee501008
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459409]  ? _raw_spin_lock_irqsave+0xe/0x20
2023-11-07T10:29:21.920365+00:00 server123 kernel: [413987.459412]  __alloc_and_insert_iova_range+0x9f/0x260
2023-11-07T10:29:21.920365+00:00 server123 kernel: [413987.459416]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920367+00:00 server123 kernel: [413987.459419]  ? kmem_cache_alloc+0x180/0x340
2023-11-07T10:29:21.920367+00:00 server123 kernel: [413987.459423]  alloc_iova+0x4d/0xb0
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459427]  alloc_iova_fast+0x10d/0x310
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459431]  iommu_dma_alloc_iova+0x10f/0x160
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459434]  iommu_dma_map_sg+0x428/0x4d0
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459439]  __dma_map_sg_attrs+0xab/0xb0
2023-11-07T10:29:21.920371+00:00 server123 kernel: [413987.459442]  dma_map_sgtable+0x21/0x50
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459446]  nvme_map_data+0xd8/0x3b0 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459453]  nvme_prep_rq.part.0+0x37/0x140 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459458]  nvme_queue_rqs+0xbf/0x290 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459462]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920373+00:00 server123 kernel: [413987.459466]  blk_mq_flush_plug_list.part.0+0x2cb/0x2f0
2023-11-07T10:29:21.920373+00:00 server123 kernel: [413987.459471]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920375+00:00 server123 kernel: [413987.459474]  ? blk_mq_get_new_requests+0xf6/0x1a0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459477]  blk_add_rq_to_plug+0x12f/0x1b0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459481]  blk_mq_submit_bio+0x281/0x4b0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459484]  __submit_bio+0x109/0x1a0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459488]  __submit_bio_noacct+0x81/0x1f0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459492]  submit_bio_noacct_nocheck+0x102/0x1e0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459494]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459497]  ? __bio_iov_iter_get_pages+0x272/0x3b0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459501]  submit_bio_noacct+0x1d0/0x700
2023-11-07T10:29:21.920378+00:00 server123 kernel: [413987.459504]  submit_bio+0x28/0x90
2023-11-07T10:29:21.920381+00:00 server123 kernel: [413987.459507]  __blkdev_direct_IO_async+0x124/0x220
2023-11-07T10:29:21.920384+00:00 server123 kernel: [413987.459510]  blkdev_direct_IO+0x49/0xa0
2023-11-07T10:29:21.920384+00:00 server123 kernel: [413987.459513]  generic_file_direct_write+0xd7/0x1f0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459518]  __generic_file_write_iter+0xaf/0x1e0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459523]  blkdev_write_iter+0x117/0x1c0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459526]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459530]  aio_write+0x116/0x250
2023-11-07T10:29:21.920387+00:00 server123 kernel: [413987.459534]  ? should_numa_migrate_memory+0x233/0x530
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459539]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459542]  ? task_numa_fault+0x218/0x3d0
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459547]  __io_submit_one.constprop.0+0xac/0x200
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459550]  ? __io_submit_one.constprop.0+0xac/0x200
2023-11-07T10:29:21.920389+00:00 server123 kernel: [413987.459554]  io_submit_one+0xe8/0x3d0
2023-11-07T10:29:21.920389+00:00 server123 kernel: [413987.459559]  __x64_sys_io_submit+0x84/0x180
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459564]  do_syscall_64+0x5c/0x90
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459567]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459569]  ? do_user_addr_fault+0x1d0/0x640
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459572]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459575]  ? exit_to_user_mode_prepare+0x3b/0xd0
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459578]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920394+00:00 server123 kernel: [413987.459580]  ? irqentry_exit_to_user_mode+0x17/0x20
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459583]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459586]  ? irqentry_exit+0x43/0x50
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459589]  ? srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459592]  ? exc_page_fault+0x92/0x1b0
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459595]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459598] RIP: 0033:0x7f8115f1ea7d
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459601] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 83 a3 0f 00 f7 d8 64 89 01 48
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459602] RSP: 002b:00007f80f73efd58 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459604] RAX: ffffffffffffffda RBX: 00007f80f73f14f0 RCX: 00007f8115f1ea7d
2023-11-07T10:29:21.920399+00:00 server123 kernel: [413987.459606] RDX: 00007f80f73efd90 RSI: 0000000000000040 RDI: 00007f8116192000
2023-11-07T10:29:21.920399+00:00 server123 kernel: [413987.459607] RBP: 00007f8116192000 R08: 00007f80f73eff90 R09: 0000000000000000
2023-11-07T10:29:21.920408+00:00 server123 kernel: [413987.459609] R10: 000056521ba62df8 R11: 0000000000000246 R12: 0000000000000040
2023-11-07T10:29:21.920411+00:00 server123 kernel: [413987.459610] R13: 0000000000000000 R14: 00007f80f73efd90 R15: 00005652020b65a0
2023-11-07T10:29:21.920411+00:00 server123 kernel: [413987.459614]  </TASK>
2023-11-07T10:29:21.920414+00:00 server123 kernel: [413987.459616] ---[ end trace 0000000000000000 ]---
2023-11-07T10:29:21.920414+00:00 server123 kernel: [413987.459663] ice 0000:a1:00.1 eth3: tx_timeout: VSI_num: 8, Q 17, NTC: 0x85, HW_HEAD: 0x73, NTU: 0x74, INT: 0x4000000
2023-11-07T10:29:21.920415+00:00 server123 kernel: [413987.459667] ice 0000:a1:00.1 eth3: tx_timeout recovery level 1, txqueue 17
2023-11-07T10:29:23.107363+00:00 server123 kernel: [413988.648050] ice 0000:a1:00.1: PTP reset successful
2023-11-07T10:29:23.135353+00:00 server123 kernel: [413988.672562] bond0: (slave eth3): link status definitely down, disabling slave
2023-11-07T10:29:44.767361+00:00 server123 kernel: [414010.308340] ice 0000:a1:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2023-11-07T10:29:44.771344+00:00 server123 kernel: [414010.312829] ice 0000:a1:00.1: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
2023-11-07T10:29:44.807364+00:00 server123 kernel: [414010.348503] bond0: (slave eth3): link status definitely up, 100000 Mbps full duplex
2023-11-07T10:29:44.807370+00:00 server123 kernel: [414010.348520] bond0: active interface up!
--- cut ---
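
For anyone who wants to check their own logs for the same event, here is a minimal Python sketch that extracts the fields from the ice `tx_timeout:` line shown above. The regular expression is derived only from that single log line, and the function name `parse_tx_timeout` is hypothetical; other driver versions may format the line differently.

```python
import re

# Pattern built from the one tx_timeout line in the log above (an assumption:
# other ice driver versions may print the fields differently).
TX_TIMEOUT_RE = re.compile(
    r"ice (?P<pci>[0-9a-f:.]+) (?P<ifname>\S+): tx_timeout: "
    r"VSI_num: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-f]+), HW_HEAD: (?P<hw_head>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeout(line):
    """Return the tx_timeout fields as a dict, or None if the line does not match."""
    match = TX_TIMEOUT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "pci": fields["pci"],
        "ifname": fields["ifname"],
        "vsi": int(fields["vsi"]),
        "queue": int(fields["queue"]),
        # The ring indices and interrupt register are printed in hex.
        "ntc": int(fields["ntc"], 16),
        "hw_head": int(fields["hw_head"], 16),
        "ntu": int(fields["ntu"], 16),
        "intr": int(fields["intr"], 16),
    }


if __name__ == "__main__":
    sample = (
        "2023-11-07T10:29:21.920414+00:00 server123 kernel: [413987.459663] "
        "ice 0000:a1:00.1 eth3: tx_timeout: VSI_num: 8, Q 17, NTC: 0x85, "
        "HW_HEAD: 0x73, NTU: 0x74, INT: 0x4000000"
    )
    print(parse_tx_timeout(sample))
```

Running this over a whole log file should make it easier to see whether the same queue or port is affected each time.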


After this happened, the interface rejoined the bond but no longer passed traffic properly, so it is not a full recovery. We observed a strange asymmetry in communication, presumably due to the LACP hashing: some servers could still talk to each other while others were unreachable.
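
To illustrate why a single misbehaving bond member could produce this kind of partial reachability, here is a minimal Python sketch. It assumes a transmit hash modelled on the formula the kernel bonding documentation gives for the `layer2` policy (last byte of source MAC XOR last byte of destination MAC XOR packet type, modulo the slave count). Which `xmit_hash_policy` is actually configured is not stated here, the switches hash independently in the other direction, and the MAC addresses and function names are made up for illustration.

```python
def layer2_hash(src_mac, dst_mac, ethertype, slave_count):
    """Pick a slave index the way the documented layer2 formula describes.

    Simplified model (an assumption): the real kernel code differs in details
    such as byte order, and other hash policies also include IP or port fields.
    """
    src = bytes.fromhex(src_mac.replace(":", ""))
    dst = bytes.fromhex(dst_mac.replace(":", ""))
    return (src[5] ^ dst[5] ^ ethertype) % slave_count


def peers_on_slave(local_mac, peer_macs, slave_index, slave_count=2, ethertype=0x0800):
    """Return the peers whose traffic is pinned to the given slave."""
    return [
        peer
        for peer in peer_macs
        if layer2_hash(local_mac, peer, ethertype, slave_count) == slave_index
    ]


if __name__ == "__main__":
    # Hypothetical addresses, not taken from the affected machines.
    local = "02:00:00:00:00:10"
    peers = ["02:00:00:00:00:%02x" % i for i in range(0x20, 0x24)]
    # If slave 1 silently drops traffic, only these peers become unreachable.
    print(peers_on_slave(local, peers, slave_index=1))
```

Because each flow is pinned to one member, a port that is up in the bond but not forwarding would only break the peers that hash onto it, which would match the pattern we saw.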

After switching to active-backup bonding mode (so effectively just two independent interfaces), we have not seen any issues so far.

Our problem matches the bug reported on Ubuntu Launchpad at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239. That report involves 25G NICs and two different kernel versions, so is this likely a driver issue?



If you require any more input or debugging please let me know,
Regards


Christian

_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan