Public bug reported:

intro
-----
Our internal test triggers the kernel crash shown below:

[ 888.690348] Sun Mar 24 23:51:59 2024: DriVerTest - Start Test
[ 888.691834] ----------------------------------------------------------------------------------------------------
[ 888.983912] mlx5_core 0000:08:00.1 eth3: Link up
[ 888.987644] IPv6: ADDRCONF(NETDEV_CHANGE): eth3: link becomes ready
[ 889.336577] mlx5_core 0000:08:00.0 eth2: Link up
[ 894.635836] Sun Mar 24 11:52:04 PM IST 2024 - DriVerTest Debug Heartbeat
[ 940.431644] general protection fault, probably for non-canonical address 0x8002001400000000: 0000 [#1] SMP NOPTI
[ 940.432866] CPU: 7 PID: 94305 Comm: ethtool Tainted: G OE 5.15.0-1039.17.g0d63875-bluefield #1
[ 940.433970] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 940.435220] RIP: 0010:netlink_policy_dump_add_policy+0x95/0x160
[ 940.435893] Code: 48 c1 e0 04 4c 8b 34 01 4d 85 f6 74 5b 31 db eb 10 4c 89 e8 83 c3 01 48 c1 e0 04 39 5c 01 08 72 3f 89 d8 48 c1 e0 04 4c 01 f0 <0f> b6 10 83 ea 08 83 fa 01 77 dc 0f b7 50 02 48 8b 70 08 48 8d 7c
[ 940.437921] RSP: 0018:ffa0000002d37a08 EFLAGS: 00010286
[ 940.438551] RAX: 8002001400000000 RBX: 0000000000000000 RCX: ff1100027d000000
[ 940.439351] RDX: 00000000fffffff8 RSI: 0000000000000018 RDI: ffa0000002d37a10
[ 940.440131] RBP: 0000000000000003 R08: 0000000000400000 R09: ff1100027d2d0f10
[ 940.440900] R10: 0000000000000318 R11: 0000000000000000 R12: ff1100011fa59bc0
[ 940.441683] R13: 0000000000000004 R14: 8002001400000000 R15: ffffffff83fa6540
[ 940.442459] FS: 00007f4a17993740(0000) GS:ff1100085f9c0000(0000) knlGS:0000000000000000
[ 940.443394] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 940.444044] CR2: 0000000000429f50 CR3: 000000012fc2e002 CR4: 0000000000771ee0
[ 940.444847] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 940.445639] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 940.446431] PKRU: 55555554
[ 940.446795] Call Trace:
[ 940.447144] <TASK>
[ 940.447444] ? __die_body+0x1b/0x60
[ 940.447880] ? die_addr+0x39/0x60
[ 940.448315] ? exc_general_protection+0x1bc/0x3c0
[ 940.448867] ? asm_exc_general_protection+0x22/0x30
[ 940.449445] ? netlink_policy_dump_add_policy+0x95/0x160
[ 940.450058] ? netlink_policy_dump_add_policy+0xb2/0x160
[ 940.450714] ? ethtool_get_phc_vclocks+0x70/0x70
[ 940.451272] ctrl_dumppolicy_start+0xc4/0x2a0
[ 940.451788] ? ethnl_reply_init+0xd0/0xd0
[ 940.452284] ? __nla_parse+0x22/0x30
[ 940.452734] ? __cond_resched+0x15/0x30
[ 940.453211] ? kmem_cache_alloc_trace+0x44/0x390
[ 940.453750] genl_start+0xc3/0x150
[ 940.454179] __netlink_dump_start+0x175/0x250
[ 940.454706] genl_family_rcv_msg_dumpit.isra.0+0x9a/0x100
[ 940.455334] ? genl_family_rcv_msg_attrs_parse.isra.0+0xe0/0xe0
[ 940.455998] ? genl_unlock+0x20/0x20
[ 940.456453] ? genl_parallel_done+0x40/0x40
[ 940.456957] genl_rcv_msg+0x11f/0x2b0
[ 940.457421] ? genl_get_cmd+0x170/0x170
[ 940.457890] ? ctrl_dumppolicy_put_op.isra.0+0x1e0/0x1e0
[ 940.458515] ? genl_lock_done+0x60/0x60
[ 940.458987] ? genl_family_rcv_msg_doit.isra.0+0x110/0x110
[ 940.459634] netlink_rcv_skb+0x54/0x100
[ 940.460107] genl_rcv+0x24/0x40
[ 940.460504] netlink_unicast+0x18d/0x230
[ 940.460983] netlink_sendmsg+0x240/0x4a0
[ 940.461472] __sock_sendmsg+0x2f/0x40
[ 940.461922] __sys_sendto+0xee/0x160
[ 940.462384] ? __sys_recvmsg+0x56/0xa0
[ 940.462854] ? exit_to_user_mode_prepare+0x35/0x170
[ 940.463439] __x64_sys_sendto+0x25/0x30
[ 940.463906] do_syscall_64+0x35/0x80
[ 940.464368] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 940.464955] RIP: 0033:0x7f4a17aa940a
[ 940.465415] Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 7e c3 0f 1f 44 00 00 41 54 48 83 ec 30 44 89
[ 940.467418] RSP: 002b:00007ffc3612cac8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 940.468284] RAX: ffffffffffffffda RBX: 0000000000c3b3b0 RCX: 00007f4a17aa940a
[ 940.469057] RDX: 0000000000000024 RSI: 0000000000c3b3b0 RDI: 0000000000000003
[ 940.469852] RBP: 0000000000c3b2a0 R08: 00007f4a17ba4200 R09: 000000000000000c
[ 940.470674] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000c3b340
[ 940.471470] R13: 0000000000c3b350 R14: 00007ffc3612caec R15: 0000000000c3b3b0
[ 940.472257] </TASK>
[ 940.472570] Modules linked in: udp_diag(E) tcp_diag(E) inet_diag(E) iptable_raw(E) openvswitch(E) nsh(E) nf_conncount(E) rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlxfw(OE) auxiliary(OE) mlxdevm(OE) ib_uverbs(OE) ib_core(OE) mlx_compat(OE) memtrack(OE) psample(E) ptp(E) pps_core(E) nfsv3(E) nfs_acl(E) rpcsec_gss_krb5(E) xt_conntrack(E) auth_rpcgss(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xt_addrtype(E) iptable_filter(E) iptable_nat(E) nf_nat(E) br_netfilter(E) bridge(E) stp(E) llc(E) nfsv4(E) dns_resolver(E) nfs(E) lockd(E) grace(E) fscache(E) netfs(E) overlay(E) rfkill(E) sunrpc(E) kvm_intel(E) iTCO_wdt(E) iTCO_vendor_support(E) kvm(E) irqbypass(E) virtio_net(E) i2c_i801(E) pcspkr(E) i2c_smbus(E) lpc_ich(E) net_failover(E) mfd_core(E) failover(E) sch_fq_codel(E) drm(E) i2c_core(E) ip_tables(E) crc32_pclmul(E) crc32c_intel(E) ghash_clmulni_intel(E) sha256_ssse3(E) sha1_ssse3(E) serio_raw(E) fuse(E)
[ 940.472612] [last unloaded: ib_core]
[ 940.481959] ---[ end trace 09663efb82dc1774 ]---
[ 940.482523] RIP: 0010:netlink_policy_dump_add_policy+0x95/0x160

fix
---
Need to cherry-pick the following patch:

commit c1b05105573b2cd5845921eb0d2caa26e2144a34
Author: Jakub Kicinski <k...@kernel.org>
Date:   Wed Nov 9 10:32:54 2022 -0800

    genetlink: fix single op policy dump when do is present

    Jonathan reports crashes when running net-next in Meta's fleet.
    Stats collection uses ethtool -I which does a per-op policy dump
    to check if stats are supported. We don't initialize the dumpit
    information if doit succeeds due to evaluation short-circuiting.

    The crash may look like this:

       BUG: kernel NULL pointer dereference, address: 0000000000000cc0
       RIP: 0010:netlink_policy_dump_add_policy+0x174/0x2a0
         ctrl_dumppolicy_start+0x19f/0x2f0
         genl_start+0xe7/0x140

    Or we may trigger a warning:

       WARNING: CPU: 1 PID: 785 at net/netlink/policy.c:87 netlink_policy_dump_get_policy_idx+0x79/0x80
       RIP: 0010:netlink_policy_dump_get_policy_idx+0x79/0x80
         ctrl_dumppolicy_put_op+0x214/0x360

    depending on what garbage we pick up from the stack.

    Reported-by: Jonathan Lemon <b...@meta.com>
    Fixes: 26588edbef60 ("genetlink: support split policies in ctrl_dumppolicy_put_op()")
    Reviewed-by: Jacob Keller <jacob.e.kel...@intel.com>
    Tested-by: Leon Romanovsky <leo...@nvidia.com>
    Link: https://lore.kernel.org/r/20221109183254.554051-1-k...@kernel.org
    Signed-off-by: Jakub Kicinski <k...@kernel.org>

** Affects: linux-bluefield (Ubuntu)
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-bluefield in Ubuntu.
https://bugs.launchpad.net/bugs/2059961

Title:
  genetlink: fix single op policy dump when do is present

Status in linux-bluefield package in Ubuntu:
  New
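
Note on the root cause: the commit message attributes the crash to evaluation short-circuiting that leaves the dumpit information uninitialized when the doit lookup succeeds. The user-space sketch below illustrates that pattern only; it is not the kernel code. `struct op_info`, `lookup_op()`, `get_both_buggy()`, `get_both_fixed()` and `dump_left_garbage()` are hypothetical names invented for this illustration, and the 0xAA fill merely simulates stale stack contents.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; these are not the kernel's genetlink types. */
enum cap { CAP_DO, CAP_DUMP };

struct op_info {
    const void *policy;    /* NULL when the op has no policy */
    unsigned int maxattr;
};

/* Fills *out and returns 0 on success. In this sketch every lookup
 * succeeds, which is exactly the case that exposes the bug. */
static int lookup_op(enum cap cap, struct op_info *out)
{
    (void)cap;
    out->policy = NULL;
    out->maxattr = 0;
    return 0;
}

/* Buggy pattern: "fail only if both lookups fail". When the first
 * lookup returns 0, && short-circuits and *dump is never written. */
static int get_both_buggy(struct op_info *doit, struct op_info *dump)
{
    if (lookup_op(CAP_DO, doit) && lookup_op(CAP_DUMP, dump))
        return -1;
    return 0;
}

/* Fixed pattern: run both lookups unconditionally, then combine. */
static int get_both_fixed(struct op_info *doit, struct op_info *dump)
{
    int err_do = lookup_op(CAP_DO, doit);
    int err_dump = lookup_op(CAP_DUMP, dump);

    return (err_do && err_dump) ? -1 : 0;
}

typedef int (*get_both_fn)(struct op_info *doit, struct op_info *dump);

/* Pre-fills both structs with 0xAA to simulate stack garbage, calls
 * get_both, and reports whether dump still holds only that garbage. */
static bool dump_left_garbage(get_both_fn get_both)
{
    struct op_info doit, dump, pattern;

    memset(&doit, 0xAA, sizeof(doit));
    memset(&dump, 0xAA, sizeof(dump));
    memset(&pattern, 0xAA, sizeof(pattern));

    get_both(&doit, &dump);
    return memcmp(&dump, &pattern, sizeof(dump)) == 0;
}
```

Calling `dump_left_garbage(get_both_buggy)` from a `main()` returns true, because the second lookup is skipped, while `dump_left_garbage(get_both_fixed)` returns false. A caller that later dereferences the untouched `policy` field would be following whatever bytes happened to be on the stack, which is consistent with the commit message's remark about "what garbage we pick up from the stack".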