Hello intel-wired-lan,
we experienced kernel warnings / crashes on multiple machines when
using LACP bonding on our E810-CQDA2 NICs, running Ubuntu 22.04 LTS
(Jammy) with the HWE kernel 6.2.0-36-generic.
The two 100G ports of the NICs were bonded using LACP (and, likely
unrelated, connected via MLAG to two Arista DCS-7050CX3M-32S switches).
This is how the issue looks in the kernel log:
--- cut ---
2023-11-07T10:29:21.920285+00:00 server123 kernel: [413987.459120]
------------[ cut here ]------------
2023-11-07T10:29:21.920298+00:00 fra-az1-comp-24 kernel: [413987.459123]
NETDEV WATCHDOG: eth3 (ice): transmit queue 17 timed out
2023-11-07T10:29:21.920298+00:00 server123 kernel: [413987.459134]
WARNING: CPU: 76 PID: 472260 at net/sched/sch_generic.c:525
dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920299+00:00 server123 kernel: [413987.459142]
Modules linked in: xt_multiport xt_REDIRECT xt_nat xt_connmark xt_mark
veth ebt_arp nft_meta_bridge ip6_tables xt_CT xt_mac xt_set xt_state
ip_set_hash_net ip_set vxlan ip6_udp_tunnel udp_tunnel xt_comment
xt_physdev vhost_net vhost vhost_iotlb tap xt_CHECKSUM xt_MASQUERADE
xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat
nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
nf_tables nfnetlink xfrm_user xfrm_algo nvme_fabrics 8021q garp mrp
br_netfilter bridge stp llc bonding binfmt_misc tls nls_ascii
intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd
ipmi_ssif kvm irqbypass rapl wmi_bmof irdma ib_uverbs ib_core input_leds
joydev ccp k10temp ptdma switchtec acpi_ipmi ipmi_si ipmi_devintf
ipmi_msghandler mac_hid sch_fq_codel efi_pstore ip_tables x_tables
autofs4 dm_crypt raid10 raid456 async_raid6_recov async_memcpy async_pq
async_xor async_tx xor raid6_pq libcrc32c raid0 multipath linear
hid_generic cdc_ether usbnet usbhid hid mii raid1
2023-11-07T10:29:21.920300+00:00 server123 kernel: [413987.459251] ast
i2c_algo_bit drm_shmem_helper crct10dif_pclmul crc32_pclmul
drm_kms_helper polyval_clmulni polyval_generic syscopyarea
ghash_clmulni_intel sysfillrect sha512_ssse3 sysimgblt aesni_intel
crypto_simd cryptd ice ahci nvme drm i40e libahci xhci_pci i2c_piix4
xhci_pci_renesas nvme_core nvme_common wmi
2023-11-07T10:29:21.920300+00:00 server123 kernel: [413987.459284] CPU:
76 PID: 472260 Comm: tp_osd_tp Not tainted 6.2.0-36-generic
#37~22.04.1-Ubuntu
2023-11-07T10:29:21.920301+00:00 server123 kernel: [413987.459287]
Hardware name: ASUSTeK COMPUTER INC. RS720A-E11-RS24U/KMPP-D32 Series,
BIOS 1501 08/23/2023
2023-11-07T10:29:21.920305+00:00 server123 kernel: [413987.459289] RIP:
0010:dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920306+00:00 server123 kernel: [413987.459292] Code:
00 e9 31 ff ff ff 4c 89 e7 c6 05 d9 5f 78 01 01 e8 e6 ff f7 ff 44 89 f1
4c 89 e6 48 c7 c7 b8 8c a4 91 48 89 c2 e8 81 c3 2b ff <0f> 0b e9 22 ff
ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
2023-11-07T10:29:21.920306+00:00 server123 kernel: [413987.459294] RSP:
0018:ffffb445da1f0e70 EFLAGS: 00010246
2023-11-07T10:29:21.920307+00:00 server123 kernel: [413987.459297] RAX:
0000000000000000 RBX: ffff8fce1844f4c8 RCX: 0000000000000000
2023-11-07T10:29:21.920307+00:00 server123 kernel: [413987.459298] RDX:
0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2023-11-07T10:29:21.920308+00:00 server123 kernel: [413987.459299] RBP:
ffffb445da1f0e98 R08: 0000000000000000 R09: 0000000000000000
2023-11-07T10:29:21.920308+00:00 server123 kernel: [413987.459301] R10:
0000000000000000 R11: 0000000000000000 R12: ffff8fce1844f000
2023-11-07T10:29:21.920311+00:00 server123 kernel: [413987.459302] R13:
ffff8fce1844f41c R14: 0000000000000011 R15: 0000000000000000
2023-11-07T10:29:21.920339+00:00 server123 kernel: [413987.459304] FS:
00007f80f73f4640(0000) GS:ffff8fcddeb00000(0000) knlGS:0000000000000000
2023-11-07T10:29:21.920340+00:00 server123 kernel: [413987.459306] CS:
0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-11-07T10:29:21.920341+00:00 server123 kernel: [413987.459308] CR2:
000056527fef8af0 CR3: 000000013121e003 CR4: 0000000000770ee0
2023-11-07T10:29:21.920341+00:00 server123 kernel: [413987.459309] PKRU:
55555554
2023-11-07T10:29:21.920342+00:00 server123 kernel: [413987.459311] Call
Trace:
2023-11-07T10:29:21.920343+00:00 server123 kernel: [413987.459312] <IRQ>
2023-11-07T10:29:21.920344+00:00 server123 kernel: [413987.459317] ?
show_regs+0x72/0x90
2023-11-07T10:29:21.920344+00:00 server123 kernel: [413987.459321] ?
dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459323] ?
__warn+0x8d/0x160
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459328] ?
dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920345+00:00 server123 kernel: [413987.459331] ?
report_bug+0x1bb/0x1d0
2023-11-07T10:29:21.920346+00:00 server123 kernel: [413987.459336] ?
handle_bug+0x46/0x90
2023-11-07T10:29:21.920346+00:00 server123 kernel: [413987.459339] ?
exc_invalid_op+0x19/0x80
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459342] ?
asm_exc_invalid_op+0x1b/0x20
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459350] ?
dev_watchdog+0x21f/0x230
2023-11-07T10:29:21.920347+00:00 server123 kernel: [413987.459353] ?
__pfx_dev_watchdog+0x10/0x10
2023-11-07T10:29:21.920348+00:00 server123 kernel: [413987.459356]
call_timer_fn+0x2c/0x160
2023-11-07T10:29:21.920348+00:00 server123 kernel: [413987.459360] ?
__pfx_dev_watchdog+0x10/0x10
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459363]
__run_timers.part.0+0x1fb/0x2b0
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459368]
run_timer_softirq+0x2a/0x60
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459370]
__do_softirq+0xdd/0x330
2023-11-07T10:29:21.920358+00:00 server123 kernel: [413987.459374] ?
hrtimer_interrupt+0x12b/0x250
2023-11-07T10:29:21.920359+00:00 server123 kernel: [413987.459379]
__irq_exit_rcu+0xa2/0xd0
2023-11-07T10:29:21.920359+00:00 server123 kernel: [413987.459382]
irq_exit_rcu+0xe/0x20
2023-11-07T10:29:21.920361+00:00 server123 kernel: [413987.459384]
sysvec_apic_timer_interrupt+0x96/0xb0
2023-11-07T10:29:21.920361+00:00 server123 kernel: [413987.459387] </IRQ>
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459388] <TASK>
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459389]
asm_sysvec_apic_timer_interrupt+0x1b/0x20
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459392] RIP:
0010:_raw_spin_unlock_irqrestore+0x21/0x60
2023-11-07T10:29:21.920362+00:00 server123 kernel: [413987.459395] Code:
90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 49 89 f0 48 89 e5 c6 07 00
0f 1f 00 41 f7 c0 00 02 00 00 74 06 fb 0f 1f 44 00 00 <65> ff 0d d0 3b
f8 6e 74 13 5d 31 c0 31 d2 31 c9 31 f6 31 ff 45 31
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459397] RSP:
0018:ffffb4464b557608 EFLAGS: 00000206
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459399] RAX:
0000000000000080 RBX: 0000000000100000 RCX: 0000000000000000
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459400] RDX:
0000000000000000 RSI: 0000000000000293 RDI: ffff8fcdee501008
2023-11-07T10:29:21.920363+00:00 server123 kernel: [413987.459402] RBP:
ffffb4464b557608 R08: 0000000000000293 R09: 0000000000000000
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459403] R10:
0000000000000000 R11: 0000000000000001 R12: ffffffffffffff80
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459404] R13:
0000000000000007 R14: 0000000000000001 R15: ffff8fcdee501008
2023-11-07T10:29:21.920364+00:00 server123 kernel: [413987.459409] ?
_raw_spin_lock_irqsave+0xe/0x20
2023-11-07T10:29:21.920365+00:00 server123 kernel: [413987.459412]
__alloc_and_insert_iova_range+0x9f/0x260
2023-11-07T10:29:21.920365+00:00 server123 kernel: [413987.459416] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920367+00:00 server123 kernel: [413987.459419] ?
kmem_cache_alloc+0x180/0x340
2023-11-07T10:29:21.920367+00:00 server123 kernel: [413987.459423]
alloc_iova+0x4d/0xb0
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459427]
alloc_iova_fast+0x10d/0x310
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459431]
iommu_dma_alloc_iova+0x10f/0x160
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459434]
iommu_dma_map_sg+0x428/0x4d0
2023-11-07T10:29:21.920368+00:00 server123 kernel: [413987.459439]
__dma_map_sg_attrs+0xab/0xb0
2023-11-07T10:29:21.920371+00:00 server123 kernel: [413987.459442]
dma_map_sgtable+0x21/0x50
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459446]
nvme_map_data+0xd8/0x3b0 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459453]
nvme_prep_rq.part.0+0x37/0x140 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459458]
nvme_queue_rqs+0xbf/0x290 [nvme]
2023-11-07T10:29:21.920372+00:00 server123 kernel: [413987.459462] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920373+00:00 server123 kernel: [413987.459466]
blk_mq_flush_plug_list.part.0+0x2cb/0x2f0
2023-11-07T10:29:21.920373+00:00 server123 kernel: [413987.459471] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920375+00:00 server123 kernel: [413987.459474] ?
blk_mq_get_new_requests+0xf6/0x1a0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459477]
blk_add_rq_to_plug+0x12f/0x1b0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459481]
blk_mq_submit_bio+0x281/0x4b0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459484]
__submit_bio+0x109/0x1a0
2023-11-07T10:29:21.920376+00:00 server123 kernel: [413987.459488]
__submit_bio_noacct+0x81/0x1f0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459492]
submit_bio_noacct_nocheck+0x102/0x1e0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459494] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459497] ?
__bio_iov_iter_get_pages+0x272/0x3b0
2023-11-07T10:29:21.920377+00:00 server123 kernel: [413987.459501]
submit_bio_noacct+0x1d0/0x700
2023-11-07T10:29:21.920378+00:00 server123 kernel: [413987.459504]
submit_bio+0x28/0x90
2023-11-07T10:29:21.920381+00:00 server123 kernel: [413987.459507]
__blkdev_direct_IO_async+0x124/0x220
2023-11-07T10:29:21.920384+00:00 server123 kernel: [413987.459510]
blkdev_direct_IO+0x49/0xa0
2023-11-07T10:29:21.920384+00:00 server123 kernel: [413987.459513]
generic_file_direct_write+0xd7/0x1f0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459518]
__generic_file_write_iter+0xaf/0x1e0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459523]
blkdev_write_iter+0x117/0x1c0
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459526] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920385+00:00 server123 kernel: [413987.459530]
aio_write+0x116/0x250
2023-11-07T10:29:21.920387+00:00 server123 kernel: [413987.459534] ?
should_numa_migrate_memory+0x233/0x530
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459539] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459542] ?
task_numa_fault+0x218/0x3d0
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459547]
__io_submit_one.constprop.0+0xac/0x200
2023-11-07T10:29:21.920388+00:00 server123 kernel: [413987.459550] ?
__io_submit_one.constprop.0+0xac/0x200
2023-11-07T10:29:21.920389+00:00 server123 kernel: [413987.459554]
io_submit_one+0xe8/0x3d0
2023-11-07T10:29:21.920389+00:00 server123 kernel: [413987.459559]
__x64_sys_io_submit+0x84/0x180
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459564]
do_syscall_64+0x5c/0x90
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459567] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920391+00:00 server123 kernel: [413987.459569] ?
do_user_addr_fault+0x1d0/0x640
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459572] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459575] ?
exit_to_user_mode_prepare+0x3b/0xd0
2023-11-07T10:29:21.920392+00:00 server123 kernel: [413987.459578] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920394+00:00 server123 kernel: [413987.459580] ?
irqentry_exit_to_user_mode+0x17/0x20
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459583] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459586] ?
irqentry_exit+0x43/0x50
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459589] ?
srso_alias_return_thunk+0x5/0x7f
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459592] ?
exc_page_fault+0x92/0x1b0
2023-11-07T10:29:21.920395+00:00 server123 kernel: [413987.459595]
entry_SYSCALL_64_after_hwframe+0x73/0xdd
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459598] RIP:
0033:0x7f8115f1ea7d
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459601] Code:
5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d 83 a3 0f 00 f7 d8 64 89 01 48
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459602] RSP:
002b:00007f80f73efd58 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
2023-11-07T10:29:21.920398+00:00 server123 kernel: [413987.459604] RAX:
ffffffffffffffda RBX: 00007f80f73f14f0 RCX: 00007f8115f1ea7d
2023-11-07T10:29:21.920399+00:00 server123 kernel: [413987.459606] RDX:
00007f80f73efd90 RSI: 0000000000000040 RDI: 00007f8116192000
2023-11-07T10:29:21.920399+00:00 server123 kernel: [413987.459607] RBP:
00007f8116192000 R08: 00007f80f73eff90 R09: 0000000000000000
2023-11-07T10:29:21.920408+00:00 server123 kernel: [413987.459609] R10:
000056521ba62df8 R11: 0000000000000246 R12: 0000000000000040
2023-11-07T10:29:21.920411+00:00 server123 kernel: [413987.459610] R13:
0000000000000000 R14: 00007f80f73efd90 R15: 00005652020b65a0
2023-11-07T10:29:21.920411+00:00 server123 kernel: [413987.459614] </TASK>
2023-11-07T10:29:21.920414+00:00 server123 kernel: [413987.459616] ---[
end trace 0000000000000000 ]---
2023-11-07T10:29:21.920414+00:00 server123 kernel: [413987.459663] ice
0000:a1:00.1 eth3: tx_timeout: VSI_num: 8, Q 17, NTC: 0x85, HW_HEAD:
0x73, NTU: 0x74, INT: 0x4000000
2023-11-07T10:29:21.920415+00:00 server123 kernel: [413987.459667] ice
0000:a1:00.1 eth3: tx_timeout recovery level 1, txqueue 17
2023-11-07T10:29:23.107363+00:00 server123 kernel: [413988.648050] ice
0000:a1:00.1: PTP reset successful
2023-11-07T10:29:23.135353+00:00 server123 kernel: [413988.672562]
bond0: (slave eth3): link status definitely down, disabling slave
2023-11-07T10:29:44.767361+00:00 server123 kernel: [414010.308340] ice
0000:a1:00.1: VSI rebuilt. VSI index 0, type ICE_VSI_PF
2023-11-07T10:29:44.771344+00:00 server123 kernel: [414010.312829] ice
0000:a1:00.1: VSI rebuilt. VSI index 383, type ICE_VSI_CTRL
2023-11-07T10:29:44.807364+00:00 server123 kernel: [414010.348503]
bond0: (slave eth3): link status definitely up, 100000 Mbps full duplex
2023-11-07T10:29:44.807370+00:00 server123 kernel: [414010.348520]
bond0: active interface up!
--- cut ---
After this happened, the interface rejoined the bond, but it no longer
transported traffic properly, so this is not a full recovery.
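
To compare such events across several machines, the driver's tx_timeout
summary line can be pulled out of the (mail-wrapped) log. The following
is only a minimal sketch: the regular expression is derived from the one
line shown in the log above, the function name is made up for
illustration, and other driver versions may format the line differently.

```python
import re

# Pattern derived from the single tx_timeout line in the log above
# (an assumption: other ice driver versions may print it differently).
TX_TIMEOUT_RE = re.compile(
    r"ice (?P<pci>[0-9a-f:.]+) (?P<dev>\S+): tx_timeout: "
    r"VSI_num: (?P<vsi>\d+), Q (?P<queue>\d+), "
    r"NTC: (?P<ntc>0x[0-9a-f]+), HW_HEAD: (?P<hw_head>0x[0-9a-f]+), "
    r"NTU: (?P<ntu>0x[0-9a-f]+), INT: (?P<intr>0x[0-9a-f]+)"
)


def parse_tx_timeouts(text):
    """Return one dict per tx_timeout summary line found in text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    events = []
    for m in TX_TIMEOUT_RE.finditer(flat):
        events.append({
            "pci": m["pci"],
            "dev": m["dev"],
            "vsi": int(m["vsi"]),
            "queue": int(m["queue"]),
            "ntc": int(m["ntc"], 16),
            "hw_head": int(m["hw_head"], 16),
            "ntu": int(m["ntu"], 16),
            "intr": int(m["intr"], 16),
        })
    return events
```

Feeding it the log between the cut markers yields a single event for
eth3, queue 17; the "tx_timeout recovery level" line is deliberately not
matched.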
We observed a weird asymmetry in communication, presumably due to the
LACP hashing: some servers could still talk to each other, while others
were unreachable.
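
Why a half-working bond member would cut off only some peers can be
illustrated with a small sketch. It assumes the simplified "layer2"
transmit hash as described in the Linux bonding documentation (last byte
of source MAC XOR last byte of destination MAC XOR packet type, modulo
the number of slaves). The xmit_hash_policy actually configured on these
bonds is not stated in this report, the switches hash independently in
the other direction, and the MAC addresses below are hypothetical.

```python
def layer2_slave(src_mac, dst_mac, slave_count, ethertype=0x0800):
    """Pick a slave index using the simplified layer2 hash formula."""
    src_last = int(src_mac.split(":")[-1], 16)
    dst_last = int(dst_mac.split(":")[-1], 16)
    return (src_last ^ dst_last ^ ethertype) % slave_count


def unreachable_peers(src_mac, peers, slave_count, broken_slave):
    """Peers whose traffic would hash onto the slave that passes no traffic."""
    return [
        peer for peer in peers
        if layer2_slave(src_mac, peer, slave_count) == broken_slave
    ]


# Hypothetical addresses, for illustration only.
local = "02:00:00:00:00:10"
peers = [
    "02:00:00:00:00:20",
    "02:00:00:00:00:21",
    "02:00:00:00:00:22",
    "02:00:00:00:00:23",
]
# With two slaves and slave 1 not passing traffic, only the peers that
# hash onto slave 1 are affected; the rest keep working.
affected = unreachable_peers(local, peers, slave_count=2, broken_slave=1)
```

Under these assumptions half of the peers stay reachable, which would
look like the asymmetry described above.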
Since switching to active-backup bonding mode (so just two independent
interfaces), we have not seen any further issues.
Our problem matches the bug reported on Ubuntu Launchpad at
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239
That report involves 25G NICs, but it also covers two different kernel
versions. So it's likely a driver issue?
If you require any more input or debugging, please let me know.
Regards
Christian