** Summary changed:

- [Regression] kernel 5.15.0-144-generic - discard broken with RAID10
+ raid10: block discard causes a NULL pointer dereference after 5.15.0-144-generic
** Description changed:

- After upgrading to jammy kernel 5.15.0-144-generic we encountered a
- serious regression when the weekly fstrim timer ran.
+ BugLink: https://bugs.launchpad.net/bugs/2117395

- This bug was introduced by commit "md/raid10: fix missing discard IO accounting"
- https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4a05f7ae33716d996c5ce56478a36a3ede1d76f2
- which was backported to all stable kernels and became part of 5.15.181
+ [Impact]

- The issue was discovered earlier upstream[1] and also in Debian[2],
- which resulted in a fix being added to the Debian kernel and
- subsequently into 6.1. However the missing patch[3] did not make it into
- the 5.15-stable kernel triggering the regression also in Ubuntu jammy.
+ The below commit was backported to 5.15.181 -stable, and introduced a NULL
+ pointer dereference in the raid10 subsystem, due to io_acct_set only being used
+ in raid 0 and 456, and not 1 or 10.

+ commit d05af90d6218e9c8f1c2026990c3f53c1b41bfb0
+ Author: Yu Kuai <[email protected]>
+ Date: Tue Mar 25 09:57:46 2025 +0800
+ Subject: md/raid10: fix missing discard IO accounting
+ Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d05af90d6218e9c8f1c2026990c3f53c1b41bfb0
- [1] https://lists.linaro.org/archives/list/[email protected]/thread/TM2PPS3XKE6M5H2FW63MLZV2T7HTM3QJ/
- [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460
- [3] https://lore.kernel.org/all/[email protected]/
-
-
- dmesg:
+ Kernel oops:

  kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
  kernel: #PF: supervisor instruction fetch in kernel mode
  kernel: #PF: error_code(0x0010) - not-present page
- kernel: PGD 0 P4D 0
+ kernel: PGD 0 P4D 0
  kernel: Oops: 0010 [#1] SMP PTI
  kernel: CPU: 5 PID: 784107 Comm: fstrim Not tainted 5.15.0-144-generic #157-Ubuntu
- kernel: Hardware name: FUJITSU /D3417-B2, BIOS V5.0.0.12 R1.27.0.SR.1 for D3417-B2x 06/10/2020
  kernel: RIP: 0010:0x0
  kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
  kernel: RSP: 0018:ffffb576409c7858 EFLAGS: 00010206
  kernel: RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000001
  kernel: RDX: ffff8e7e012426f0 RSI: 0000000000000000 RDI: 0000000000092800
  kernel: RBP: ffffb576409c78c8 R08: ffff8e884ec966c0 R09: ffff8e7e07c6b050
  kernel: R10: 0000000000002ecb R11: 00000000000030c8 R12: 0000000000092c00
  kernel: R13: 0000000000000400 R14: ffff8e7e01242708 R15: ffff8e7e10743400
- kernel: FS: 00007f6fff9f0800(0000) GS:ffff8e8cee540000(0000) knlGS:0000000000000000
- kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+ kernel: FS: 00007f6fff9f0800(0000) GS:ffff8e8cee540000(0000) knlGS:0000000000000000
+ kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  kernel: CR2: ffffffffffffffd6 CR3: 00000001090f6005 CR4: 00000000003706e0
  kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  kernel: Call Trace:
- kernel: <TASK>
- kernel: mempool_alloc+0x61/0x1b0
- kernel: ? __kmalloc+0x179/0x330
- kernel: bio_alloc_bioset+0x9d/0x370
- kernel: ? r10bio_pool_alloc+0x26/0x30 [raid10]
- kernel: bio_clone_fast+0x1f/0x90
- kernel: md_account_bio+0x42/0x80
- kernel: raid10_handle_discard+0x56f/0x6b0 [raid10]
- kernel: raid10_make_request+0x147/0x180 [raid10]
- kernel: md_handle_request+0x12a/0x1b0
- kernel: ? submit_bio_checks+0x1a5/0x580
- kernel: md_submit_bio+0x76/0xc0
- kernel: __submit_bio+0x1a2/0x220
- kernel: ? mempool_alloc_slab+0x17/0x20
- kernel: ? mempool_alloc+0x61/0x1b0
- kernel: ? schedule_timeout+0x91/0x140
- kernel: __submit_bio_noacct+0x85/0x200
- kernel: submit_bio_noacct+0x4e/0x120
- kernel: ? __cond_resched+0x1a/0x60
- kernel: submit_bio+0x4a/0x130
- kernel: submit_bio_wait+0x5a/0xc0
- kernel: blkdev_issue_discard+0x7e/0xd0
- kernel: ext4_try_to_trim_range+0x2db/0x520
- kernel: ? ext4_mb_load_buddy_gfp+0x91/0x3e0
- kernel: ext4_trim_fs+0x313/0x510
- kernel: __ext4_ioctl+0x82c/0xef0
- kernel: ext4_ioctl+0xe/0x20
- kernel: __x64_sys_ioctl+0x92/0xd0
- kernel: x64_sys_call+0x1e5f/0x1fa0
- kernel: do_syscall_64+0x56/0xb0
- kernel: entry_SYSCALL_64_after_hwframe+0x6c/0xd6
- kernel: RIP: 0033:0x7f6fffc0994f
- kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 >
- kernel: RSP: 002b:00007ffdce979c30 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
- kernel: RAX: ffffffffffffffda RBX: 00007ffdce979d80 RCX: 00007f6fffc0994f
- kernel: RDX: 00007ffdce979ca0 RSI: 00000000c0185879 RDI: 0000000000000003
- kernel: RBP: 0000558436acccb0 R08: 0000558436acccb0 R09: 0000000000000000
- kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000003
- kernel: R13: 0000558436accfa0 R14: 0000558436acce80 R15: 0000558436acce80
- kernel: </TASK>
- kernel: Modules linked in: tls tcp_diag udp_diag inet_diag bridge stp llc nft_counter nft_chain_nat nf_nat >
- kernel: xhci_pci_renesas wmi video
- kernel: CR2: 0000000000000000
- kernel: ---[ end trace db9334d27f904581 ]---
- kernel: RIP: 0010:0x0
- kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
- kernel: RSP: 0018:ffffb576409c7858 EFLAGS: 00010206
- kernel: RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000001
- kernel: RDX: ffff8e7e012426f0 RSI: 0000000000000000 RDI: 0000000000092800
- kernel: RBP: ffffb576409c78c8 R08: ffff8e884ec966c0 R09: ffff8e7e07c6b050
- kernel: R10: 0000000000002ecb R11: 00000000000030c8 R12: 0000000000092c00
- kernel: R13: 0000000000000400 R14: ffff8e7e01242708 R15: ffff8e7e10743400
- kernel: FS: 00007f6fff9f0800(0000) GS:ffff8e8cee540000(0000) knlGS:0000000000000000
- kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
- kernel: CR2: ffffffffffffffd6 CR3: 00000001090f6005 CR4: 00000000003706e0
- kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
- kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
- kernel: BUG: unable to handle page fault for address: ffffb57600000010
+ kernel: <TASK>
+ kernel: mempool_alloc+0x61/0x1b0
+ kernel: ? __kmalloc+0x179/0x330
+ kernel: bio_alloc_bioset+0x9d/0x370
+ kernel: ? r10bio_pool_alloc+0x26/0x30 [raid10]
+ kernel: bio_clone_fast+0x1f/0x90
+ kernel: md_account_bio+0x42/0x80
+ kernel: raid10_handle_discard+0x56f/0x6b0 [raid10]
+ kernel: raid10_make_request+0x147/0x180 [raid10]
+ kernel: md_handle_request+0x12a/0x1b0
+ kernel: ? submit_bio_checks+0x1a5/0x580
+ kernel: md_submit_bio+0x76/0xc0
+ kernel: __submit_bio+0x1a2/0x220
+ kernel: ? mempool_alloc_slab+0x17/0x20
+ kernel: ? mempool_alloc+0x61/0x1b0
+ kernel: ? schedule_timeout+0x91/0x140
+ kernel: __submit_bio_noacct+0x85/0x200
+ kernel: submit_bio_noacct+0x4e/0x120
+ kernel: ? __cond_resched+0x1a/0x60
+ kernel: submit_bio+0x4a/0x130
+ kernel: submit_bio_wait+0x5a/0xc0
+ kernel: blkdev_issue_discard+0x7e/0xd0
+ kernel: ext4_try_to_trim_range+0x2db/0x520
+ kernel: ? ext4_mb_load_buddy_gfp+0x91/0x3e0
+ kernel: ext4_trim_fs+0x313/0x510
+ kernel: __ext4_ioctl+0x82c/0xef0
+ kernel: ext4_ioctl+0xe/0x20
+ kernel: __x64_sys_ioctl+0x92/0xd0
+ kernel: x64_sys_call+0x1e5f/0x1fa0
+ kernel: do_syscall_64+0x56/0xb0
+ kernel: entry_SYSCALL_64_after_hwframe+0x6c/0xd6
+
+ A workaround is to disable the systemd weekly fstrim timer and to not fstrim /
+ discard blocks while the problem exists.
+
+ [Fix]
+
+ The below necessary commit was mainlined in 6.6-rc1 and needs to be backported
+ to jammy.
+
+ commit c567c86b90d4715081adfe5eb812141a5b6b4883
+ Author: Yu Kuai <[email protected]>
+ Date: Thu Jun 22 00:51:03 2023 +0800
+ Subject: md: move initialization and destruction of 'io_acct_set' to md.c
+ Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c567c86b90d4715081adfe5eb812141a5b6b4883
+
+ This needs a minor backport, adjusting __md_stop() to md_stop().
+
+ [Testcase]
+
+ You will need a machine with at least 4x NVMe drives which support block
+ discard. I use a i3.8xlarge instance on AWS, since it has all of these things.
+
+ $ lsblk
+ xvda    202:0    0    8G  0 disk
+ └─xvda1 202:1    0    8G  0 part /
+ nvme0n1 259:2    0  1.7T  0 disk
+ nvme1n1 259:0    0  1.7T  0 disk
+ nvme2n1 259:1    0  1.7T  0 disk
+ nvme3n1 259:3    0  1.7T  0 disk
+
+ Create a Raid10 array:
+
+ $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4
+ /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
+
+ Format the array with XFS:
+
+ $ sudo mkfs.xfs /dev/md0
+
+ $ sudo mkdir /mnt/disk
+ $ sudo mount /dev/md0 /mnt/disk
+
+ Do a fstrim:
+
+ $ sudo fstrim /mnt/disk
+
+ There are test packages available in the following ppa:
+
+ https://launchpad.net/~mruffell/+archive/ubuntu/sf414897-test
+
+ If you install the test kernel, the kernel will no longer panic on
+ fstrim.
+
+ [Where problems can occur]
+
+ This changes io_acct_set from being sometimes initialised, mostly under raid 0,
+ 456 to being always initialised under all raid types.
+
+ If a regression were to occur, it would likely impact block discard on any raid
+ type, not just raid 10, but raid 10 would carry more risk as we may be missing
+ more patches due to discard on raid10 being very new, as in the last 5 or so
+ years, versus 0, 456 which have had full discard for a decade or more.
+
+ The workarounds would be the same, to disable the systemd block discard timer
+ or disable fstrim.
+
+ [Other info]
+
+ Upstream bug:
+ https://lists.linaro.org/archives/list/[email protected]/thread/TM2PPS3XKE6M5H2FW63MLZV2T7HTM3QJ/
+
+ Debian bug:
+ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460

--
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2117395

Title:
  raid10: block discard causes a NULL pointer dereference after
  5.15.0-144-generic

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Jammy:
  In Progress

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/2117395

  [Impact]

  The below commit was backported to 5.15.181 -stable, and introduced a NULL
  pointer dereference in the raid10 subsystem, due to io_acct_set only being used
  in raid 0 and 456, and not 1 or 10.

  commit d05af90d6218e9c8f1c2026990c3f53c1b41bfb0
  Author: Yu Kuai <[email protected]>
  Date: Tue Mar 25 09:57:46 2025 +0800
  Subject: md/raid10: fix missing discard IO accounting
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d05af90d6218e9c8f1c2026990c3f53c1b41bfb0

  Kernel oops:

  kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
  kernel: #PF: supervisor instruction fetch in kernel mode
  kernel: #PF: error_code(0x0010) - not-present page
  kernel: PGD 0 P4D 0
  kernel: Oops: 0010 [#1] SMP PTI
  kernel: CPU: 5 PID: 784107 Comm: fstrim Not tainted 5.15.0-144-generic #157-Ubuntu
  kernel: RIP: 0010:0x0
  kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
  kernel: RSP: 0018:ffffb576409c7858 EFLAGS: 00010206
  kernel: RAX: 0000000000000000 RBX: 0000000000092800 RCX: 0000000000000001
  kernel: RDX: ffff8e7e012426f0 RSI: 0000000000000000 RDI: 0000000000092800
  kernel: RBP: ffffb576409c78c8 R08: ffff8e884ec966c0 R09: ffff8e7e07c6b050
  kernel: R10: 0000000000002ecb R11: 00000000000030c8 R12: 0000000000092c00
  kernel: R13: 0000000000000400 R14: ffff8e7e01242708 R15: ffff8e7e10743400
  kernel: FS: 00007f6fff9f0800(0000) GS:ffff8e8cee540000(0000) knlGS:0000000000000000
  kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  kernel: CR2: ffffffffffffffd6 CR3: 00000001090f6005 CR4: 00000000003706e0
  kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  kernel: Call Trace:
  kernel: <TASK>
  kernel: mempool_alloc+0x61/0x1b0
  kernel: ? __kmalloc+0x179/0x330
  kernel: bio_alloc_bioset+0x9d/0x370
  kernel: ? r10bio_pool_alloc+0x26/0x30 [raid10]
  kernel: bio_clone_fast+0x1f/0x90
  kernel: md_account_bio+0x42/0x80
  kernel: raid10_handle_discard+0x56f/0x6b0 [raid10]
  kernel: raid10_make_request+0x147/0x180 [raid10]
  kernel: md_handle_request+0x12a/0x1b0
  kernel: ? submit_bio_checks+0x1a5/0x580
  kernel: md_submit_bio+0x76/0xc0
  kernel: __submit_bio+0x1a2/0x220
  kernel: ? mempool_alloc_slab+0x17/0x20
  kernel: ? mempool_alloc+0x61/0x1b0
  kernel: ? schedule_timeout+0x91/0x140
  kernel: __submit_bio_noacct+0x85/0x200
  kernel: submit_bio_noacct+0x4e/0x120
  kernel: ? __cond_resched+0x1a/0x60
  kernel: submit_bio+0x4a/0x130
  kernel: submit_bio_wait+0x5a/0xc0
  kernel: blkdev_issue_discard+0x7e/0xd0
  kernel: ext4_try_to_trim_range+0x2db/0x520
  kernel: ? ext4_mb_load_buddy_gfp+0x91/0x3e0
  kernel: ext4_trim_fs+0x313/0x510
  kernel: __ext4_ioctl+0x82c/0xef0
  kernel: ext4_ioctl+0xe/0x20
  kernel: __x64_sys_ioctl+0x92/0xd0
  kernel: x64_sys_call+0x1e5f/0x1fa0
  kernel: do_syscall_64+0x56/0xb0
  kernel: entry_SYSCALL_64_after_hwframe+0x6c/0xd6

  A workaround is to disable the systemd weekly fstrim timer and to not fstrim /
  discard blocks while the problem exists.

  [Fix]

  The below necessary commit was mainlined in 6.6-rc1 and needs to be backported
  to jammy.

  commit c567c86b90d4715081adfe5eb812141a5b6b4883
  Author: Yu Kuai <[email protected]>
  Date: Thu Jun 22 00:51:03 2023 +0800
  Subject: md: move initialization and destruction of 'io_acct_set' to md.c
  Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c567c86b90d4715081adfe5eb812141a5b6b4883

  This needs a minor backport, adjusting __md_stop() to md_stop().

  [Testcase]

  You will need a machine with at least 4x NVMe drives which support block
  discard. I use an i3.8xlarge instance on AWS, since it has all of these things.

  $ lsblk
  xvda    202:0    0    8G  0 disk
  └─xvda1 202:1    0    8G  0 part /
  nvme0n1 259:2    0  1.7T  0 disk
  nvme1n1 259:0    0  1.7T  0 disk
  nvme2n1 259:1    0  1.7T  0 disk
  nvme3n1 259:3    0  1.7T  0 disk

  Create a Raid10 array:

  $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

  Format the array with XFS:

  $ sudo mkfs.xfs /dev/md0

  $ sudo mkdir /mnt/disk
  $ sudo mount /dev/md0 /mnt/disk

  Do a fstrim:

  $ sudo fstrim /mnt/disk

  There are test packages available in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf414897-test

  If you install the test kernel, the kernel will no longer panic on
  fstrim.

  [Where problems can occur]

  This changes io_acct_set from being sometimes initialised, mostly under raid 0,
  456 to being always initialised under all raid types.

  If a regression were to occur, it would likely impact block discard on any raid
  type, not just raid 10, but raid 10 would carry more risk as we may be missing
  more patches due to discard on raid10 being very new, as in the last 5 or so
  years, versus 0, 456 which have had full discard for a decade or more.

  The workarounds would be the same, to disable the systemd block discard timer
  or disable fstrim.

  [Other info]

  Upstream bug:
  https://lists.linaro.org/archives/list/[email protected]/thread/TM2PPS3XKE6M5H2FW63MLZV2T7HTM3QJ/

  Debian bug:
  https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1104460
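
The workaround described in this report (disabling the weekly fstrim timer
while raid10 arrays are in use) could be scripted roughly as follows. This is
only a sketch and not part of the original report: the function names
raid10_arrays and fstrim_workaround_hint and the optional file argument are
hypothetical, it assumes the usual /proc/mdstat layout, and it does not check
whether the running kernel is actually affected. It only prints the suggested
systemctl command and changes nothing on the system.

```shell
#!/bin/sh
# Sketch: report md raid10 arrays and print the workaround from this report.
# Assumption: the input follows the usual /proc/mdstat layout, e.g.
#   md0 : active raid10 nvme3n1[3] nvme2n1[2] nvme1n1[1] nvme0n1[0]

# Print the name of every md array whose level is raid10.
# $1: mdstat-format file to read (hypothetical parameter, default /proc/mdstat)
raid10_arrays() {
    awk '$1 ~ /^md/ && $2 == ":" {
             for (i = 3; i <= NF; i++) {
                 if ($i == "raid10") { print $1; break }
             }
         }' "${1:-/proc/mdstat}"
}

# Print a hint to disable the weekly fstrim timer if raid10 arrays exist.
# This only prints the command; it does not change any system state.
fstrim_workaround_hint() {
    arrays=$(raid10_arrays "$@")
    if [ -n "$arrays" ]; then
        # $arrays is left unquoted on purpose so the names join on one line.
        echo "raid10 arrays found:" $arrays
        echo "workaround: sudo systemctl disable --now fstrim.timer"
    else
        echo "no raid10 arrays found"
    fi
}
```

Calling fstrim_workaround_hint with no argument reads /proc/mdstat. Once a
fixed kernel is installed, the timer can be turned back on with
"sudo systemctl enable --now fstrim.timer".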

