** Description changed:

  [Impact]
  During repeated NS map/unmap operations in ONTAP (which triggers NS
  attr changed AENs) where new NSs get mapped reusing the old NSID, one
  occasionally sees the Ubuntu 24.04 NVMe/TCP host ending up with device
  inconsistencies where the respective NVMe block device (i.e.
  /dev/nvmeXnY) is available, but not the corresponding NVMe generic
  char device (i.e. /dev/ngXnY). This issue is not seen if the same NS
  is remapped on the same NSID, but only hit when a new NS is mapped
  reusing the same NSID which was previously used by some other NS.

  The following error entries are seen in the messages file during this
  device inconsistency scenario:
  ...
  kernel: [267011.744167][ T2016] nvme nvme6: rescanning namespaces.
  kernel: [267011.744347][T46805] nvme nvme2: rescanning namespaces.
  kernel: [267011.750418][ T7876] nvme nvme1: rescanning namespaces.
  kernel: [267011.784466][ T2016] nvme nvme6: IDs don't match for shared namespace 1
  kernel: [267011.784791][T46805] nvme nvme2: IDs don't match for shared namespace 1
  kernel: [267011.790843][ T7876] nvme nvme1: IDs don't match for shared namespace 1
  kernel: [267011.804852][ T2016] nvme nvme6: IDs don't match for shared namespace 2
  kernel: [267011.804867][T46805] nvme nvme2: IDs don't match for shared namespace 2
  kernel: [267011.810788][ T7876] nvme nvme1: IDs don't match for shared namespace 2
  kernel: [267011.824600][ T2016] nvme nvme6: IDs don't match for shared namespace 3
  kernel: [267011.825114][T46805] nvme nvme2: IDs don't match for shared namespace 3
  kernel: [267011.830982][ T7876] nvme nvme1: IDs don't match for shared namespace 3
  kernel: [267011.844712][ T2016] nvme nvme6: duplicate IDs in subsystem for nsid 4
  kernel: [267011.845161][T46805] nvme nvme2: duplicate IDs in subsystem for nsid 4
  kernel: [267011.851060][ T7876] nvme nvme1: duplicate IDs in subsystem for nsid 4

  [Fix]
  The following upstream commits are required:
- 62baf70c3274 nvme: re-read ANA log page after ns scan completes
- 9546ad1a9bda nvme: requeue namespace scan on missed AENs
- 1f021341eef4 nvme-multipath: defer partition scanning
- 3b97f5a05cfc nvme-multipath: avoid hang on inaccessible namespaces
- 63bcf9014e95 nvme-multipath: system fails to create generic nvme device
+ 62baf70c3274 nvme: re-read ANA log page after ns scan completes
+ 9546ad1a9bda nvme: requeue namespace scan on missed AENs
+ 1f021341eef4 nvme-multipath: defer partition scanning
+ 3b97f5a05cfc nvme-multipath: avoid hang on inaccessible namespaces
+ 63bcf9014e95 nvme-multipath: system fails to create generic nvme device

  $ git describe --contains 3b97f5a05cfc 63bcf9014e95 1f021341eef4 62baf70c3274 9546ad1a9bda
  v6.12-rc1~47^2^2~2
  v6.12-rc1~47^2^2~3
  v6.12-rc4~20^2~1^2~4
  v6.15-rc2~11^2~1^2~10
  v6.15-rc2~11^2~1^2~11

  The first three patches are already present in the 6.8-based kernels,
  and the two follow-up commits have been sent to stable trees already.
  Given 6.8 is not an upstream stable tree, we should pick up those
  patches for the Ubuntu kernels as well.

  [Test Case]
+ The attached ns-stress.sh script should be able to reproduce this. It
+ repeatedly creates and deletes NVMe namespaces mapped to the same ID.
+ An example run from an affected system will look like the one below:
+ # ./ns-stress.sh /dev/nvme2
+ Starting test with parameters:
+ Controller: /dev/nvme2
+ NSID: 1
+ Iterations: 100
+ Size1: 0x200000
+ Size2: 0x400000
+ Iteration 1/100
+ create-ns: Success, created nsid:1
+ attach-ns: Success, nsid:1
+ ❌ Char device missing after first attach

  [Where Problems Could Occur]
+ The fix requeues controller scans if there are any pending/missed AEN
+ events. This can introduce delays when managing NVMe namespaces, so we
+ should look out for any delays or hangs with such operations.

** Attachment added: "ns-stress.sh"
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115209/+attachment/5885590/+files/ns-stress.sh

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/2115209

Title:
  NVMe namespace ID mismatch on repeated map/unmap

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Noble:
  In Progress

Bug description:
  [Impact]
  During repeated NS map/unmap operations in ONTAP (which triggers NS
  attr changed AENs) where new NSs get mapped reusing the old NSID, one
  occasionally sees the Ubuntu 24.04 NVMe/TCP host ending up with device
  inconsistencies where the respective NVMe block device (i.e.
  /dev/nvmeXnY) is available, but not the corresponding NVMe generic
  char device (i.e. /dev/ngXnY). This issue is not seen if the same NS
  is remapped on the same NSID, but only hit when a new NS is mapped
  reusing the same NSID which was previously used by some other NS.

  The following error entries are seen in the messages file during this
  device inconsistency scenario:
  ...
  kernel: [267011.744167][ T2016] nvme nvme6: rescanning namespaces.
  kernel: [267011.744347][T46805] nvme nvme2: rescanning namespaces.
  kernel: [267011.750418][ T7876] nvme nvme1: rescanning namespaces.
  kernel: [267011.784466][ T2016] nvme nvme6: IDs don't match for shared namespace 1
  kernel: [267011.784791][T46805] nvme nvme2: IDs don't match for shared namespace 1
  kernel: [267011.790843][ T7876] nvme nvme1: IDs don't match for shared namespace 1
  kernel: [267011.804852][ T2016] nvme nvme6: IDs don't match for shared namespace 2
  kernel: [267011.804867][T46805] nvme nvme2: IDs don't match for shared namespace 2
  kernel: [267011.810788][ T7876] nvme nvme1: IDs don't match for shared namespace 2
  kernel: [267011.824600][ T2016] nvme nvme6: IDs don't match for shared namespace 3
  kernel: [267011.825114][T46805] nvme nvme2: IDs don't match for shared namespace 3
  kernel: [267011.830982][ T7876] nvme nvme1: IDs don't match for shared namespace 3
  kernel: [267011.844712][ T2016] nvme nvme6: duplicate IDs in subsystem for nsid 4
  kernel: [267011.845161][T46805] nvme nvme2: duplicate IDs in subsystem for nsid 4
  kernel: [267011.851060][ T7876] nvme nvme1: duplicate IDs in subsystem for nsid 4

  [Fix]
  The following upstream commits are required:
  62baf70c3274 nvme: re-read ANA log page after ns scan completes
  9546ad1a9bda nvme: requeue namespace scan on missed AENs
  1f021341eef4 nvme-multipath: defer partition scanning
  3b97f5a05cfc nvme-multipath: avoid hang on inaccessible namespaces
  63bcf9014e95 nvme-multipath: system fails to create generic nvme device

  $ git describe --contains 3b97f5a05cfc 63bcf9014e95 1f021341eef4 62baf70c3274 9546ad1a9bda
  v6.12-rc1~47^2^2~2
  v6.12-rc1~47^2^2~3
  v6.12-rc4~20^2~1^2~4
  v6.15-rc2~11^2~1^2~10
  v6.15-rc2~11^2~1^2~11

  The first three patches are already present in the 6.8-based kernels,
  and the two follow-up commits have been sent to stable trees already.
  Given 6.8 is not an upstream stable tree, we should pick up those
  patches for the Ubuntu kernels as well.

  [Test Case]
  The attached ns-stress.sh script should be able to reproduce this. It
  repeatedly creates and deletes NVMe namespaces mapped to the same ID.
  An example run from an affected system will look like the one below:

  # ./ns-stress.sh /dev/nvme2
  Starting test with parameters:
  Controller: /dev/nvme2
  NSID: 1
  Iterations: 100
  Size1: 0x200000
  Size2: 0x400000
  Iteration 1/100
  create-ns: Success, created nsid:1
  attach-ns: Success, nsid:1
  ❌ Char device missing after first attach

  [Where Problems Could Occur]
  The fix requeues controller scans if there are any pending/missed AEN
  events. This can introduce delays when managing NVMe namespaces, so we
  should look out for any delays or hangs with such operations.
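
  The inconsistency described under [Impact] (a /dev/nvmeXnY block
  device without its /dev/ngXnY generic char device) could also be
  spotted outside the stress script with a small check like the one
  below. This is only a sketch: the helper name
  missing_generic_devices is hypothetical, it is not taken from
  ns-stress.sh, and it assumes the nvmeXnY/ngXnY naming from this
  report, skipping controller nodes (nvmeX) and partitions (nvmeXnYpZ).

```python
import os
import re

# Namespace block nodes look like "nvme2n1". The anchored pattern excludes
# controller nodes ("nvme2") and partitions ("nvme2n1p1").
_BLOCK_RE = re.compile(r"^nvme(\d+)n(\d+)$")


def missing_generic_devices(dev_names):
    """Return (block, generic) name pairs where the generic node is absent.

    dev_names is any iterable of device node names, e.g. os.listdir("/dev").
    Hypothetical helper for illustration; not part of ns-stress.sh.
    """
    names = set(dev_names)
    missing = []
    for name in sorted(names):
        match = _BLOCK_RE.match(name)
        if match is None:
            continue
        # Expected generic char device for nvmeXnY is ngXnY.
        generic = f"ng{match.group(1)}n{match.group(2)}"
        if generic not in names:
            missing.append((name, generic))
    return missing


if __name__ == "__main__":
    for block, generic in missing_generic_devices(os.listdir("/dev")):
        print(f"/dev/{block} present but /dev/{generic} missing")
```

  Taking the list of names as an argument, rather than reading /dev
  inside the helper, keeps the check usable against a saved directory
  listing from an affected host.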

