On 8/20/2025 7:30 PM, Zhijian Li (Fujitsu) wrote:


On 21/08/2025 07:14, Alison Schofield wrote:
On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
Hi Dan and Smita,


On 24/07/2025 00:13, dan.j.willi...@intel.com wrote:
dan.j.williams@ wrote:
[..]
If the goal is: "I want to give device-dax a point at which it can make
a go / no-go decision about whether the CXL subsystem has properly
assembled all CXL regions implied by Soft Reserved intersecting with
CXL Windows." Then that is something like the below, only lightly tested
and likely regresses the non-CXL case.

-- 8< --
   From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.willi...@intel.com>
Date: Tue, 22 Jul 2025 16:11:08 -0700
Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration

Likely needs this incremental change to prevent DEV_DAX_HMEM from being
built-in when CXL is not. This still leaves the awkward scenario of CXL
enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
safely fails in devdax only / fallback mode, but something to
investigate when respinning on top of this.


Thank you for your RFC; I find your proposal remarkably compelling, as it 
adeptly addresses the issues I am currently facing.


That said, I still ran into several issues with your patch (for a patch at the
RFC stage, I think it is already in quite good shape):

Hi Zhijian,

Like you, I tried this RFC out. It resolved the issue of soft reserved
resources preventing teardown and replacement of a region in place.

I looked at the issues you found, and have some questions and comments
included below.


1. Some resources described by SRAT are wrongly identified as System RAM
(kmem), such as the following: 200000000-5bfffffff.
```
      200000000-5bfffffff : dax6.0
        200000000-5bfffffff : System RAM (kmem)
      5c0001128-5c00011b7 : port1
      5d0000000-64fffffff : CXL Window 0
        5d0000000-64fffffff : region0
          5d0000000-64fffffff : dax0.0
            5d0000000-64fffffff : System RAM (kmem)
      680000000-e7ffffff : PCI Bus 0000:00

      [root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
      [    0.000000] Command line: 
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ 
root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 
no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 
softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled 
panic_on_warn ignore_loglevel kasan.fault=panic
      [    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
      [    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
      [    0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
      ```

Is that range also labelled as soft reserved?
I ask, because I'm trying to draw a parallel between our test platforms.

No, it's not a soft reserved range. This can be simulated simply with QEMU
using the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bfffffff` is something like [DRAM_END + 1,
DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE], where DRAM_END is the end of
the installed DRAM in Node 3.

This range is reserved for DRAM hot-add. In my case, it is registered as one
of the 'HMEM devices' by the call to hmem_register_resource() in the HMAT code
(drivers/acpi/numa/hmat.c):

static void hmat_register_target_devices(struct memory_target *target)
{
        struct resource *res;

        /*
         * Do not bother creating devices if no driver is available to
         * consume them.
         */
        if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
                return;

        for (res = target->memregions.child; res; res = res->sibling) {
                int target_nid = pxm_to_node(target->memory_pxm);

                hmem_register_resource(target_nid, res);
        }
}


$ dmesg | grep -i -e soft -e hotplug -e Node
[    0.000000] Command line: 
BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty
 root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 
no_timer_check net.ifnames=0 console=tty1 conc
[    0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft 
reserved
[    0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft 
reserved
[    0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[    0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[    0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[    0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[    0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[    0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[    0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 
0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[    0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[    0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[    0.086077] Movable zone start for each node
[    0.087054] Early memory node ranges
[    0.087890]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.089264]   node   0: [mem 0x0000000000100000-0x000000007ffdefff]
[    0.090631]   node   1: [mem 0x0000000100000000-0x000000017fffffff]
[    0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[    0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[    0.095164] Initmem setup node 2 as memoryless
[    0.096281] Initmem setup node 3 as memoryless
[    0.097397] Initmem setup node 4 as memoryless
[    0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[    0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[    0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[    0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs

=================================

Please note that this is a modified QEMU.

/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
-name guest-rdma-server -nographic -boot c \
-m size=6G,slots=2,maxmem=19922944k \
-hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
-object memory-backend-memfd,share=on,size=2G,id=m0 \
-object memory-backend-memfd,share=on,size=2G,id=m1 \
-numa node,nodeid=0,cpus=0-1,memdev=m0 \
-numa node,nodeid=1,cpus=2-3,memdev=m1 \
-smp 4,sockets=2,cores=2 \
-device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
-device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
-device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
-device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
-object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
-M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
-nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
-bios /home/lizhijian/seabios/out/bios.bin \
-object memory-backend-memfd,share=on,size=1G,id=m2 \
-object memory-backend-memfd,share=on,size=1G,id=m3 \
-numa node,memdev=m2,nodeid=2 \
-numa node,memdev=m3,nodeid=3 \
-numa dist,src=0,dst=0,val=10 \
-numa dist,src=0,dst=1,val=21 \
-numa dist,src=0,dst=2,val=21 \
-numa dist,src=0,dst=3,val=21 \
-numa dist,src=1,dst=0,val=21 \
-numa dist,src=1,dst=1,val=10 \
-numa dist,src=1,dst=2,val=21 \
-numa dist,src=1,dst=3,val=21 \
-numa dist,src=2,dst=0,val=21 \
-numa dist,src=2,dst=1,val=21 \
-numa dist,src=2,dst=2,val=10 \
-numa dist,src=2,dst=3,val=21 \
-numa dist,src=3,dst=0,val=21 \
-numa dist,src=3,dst=1,val=21 \
-numa dist,src=3,dst=2,val=21 \
-numa dist,src=3,dst=3,val=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
-numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
-numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
-numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
-numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M



I see -

[] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug

/proc/iomem - as expected
24080000000-5f77fffffff : CXL Window 0
    24080000000-4407fffffff : region0
      24080000000-4407fffffff : dax0.0
        24080000000-4407fffffff : System RAM (kmem)


I'm also seeing this message:
[] resource: Unaddressable device  [mem 0x24080000000-0x4407fffffff] conflicts 
with [mem 0x24080000000-0x4407fffffff]


2. Triggers dev_warn and dev_err:
```
      [root@rdma-server ~]# journalctl -p err -p warning --dmesg
      ...snip...
      Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache 
calculation failed rc:-2
      Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem 
failed with error -12
      Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem 
failed with error -12
      Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17fffffff could not reserve region
      Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem 
failed with error -16

I see the kmem dax messages also. It seems the kmem probe is going after
every range (except hotplug) in the SRAT, and failing.

Yes, that's true, because the current RFC removed the code that filters out the
non-soft-reserved resources. As a result, it tries to register dax/kmem for
all of them, while some of them have already been marked as busy in iomem_resource.

-   rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
-                          IORES_DESC_SOFT_RESERVED);
-   if (rc != REGION_INTERSECTS)
-       return 0;


This is another example on my real *CXL HOST*:
Aug 19 17:59:05  kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is 
disabled. Duplicate IMA measuremen>
Aug 19 17:59:09  kernel: power_meter ACPI000D:00: Ignoring unsafe software 
power cap!
Aug 19 17:59:09  kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not 
reserve region
Aug 19 17:59:09  kernel: kmem dax2.0: probe with driver kmem failed with error 
-16
Aug 19 17:59:09  kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could 
not reserve region
Aug 19 17:59:09  kernel: kmem dax3.0: probe with driver kmem failed with error 
-16
Aug 19 17:59:09  kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could 
not reserve region
Aug 19 17:59:09  kernel: kmem dax4.0: probe with driver kmem failed with error 
-16
Aug 19 17:59:19  kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27  kernel: block nvme1n1: No UUID available providing old NGUID
lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 
/proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved


I guess this issue can be resolved by re-introducing
soft_reserve_res_intersects(...).




      ```

3. When CXL_REGION is disabled, there is a failure to fall back to dax_hmem, in
which case only CXL Window X is visible.

Haven't tested !CXL_REGION yet.

When CXL_REGION is disabled, DEV_DAX_CXL will also be disabled, so dax_hmem
should handle it. I was able to fall back to dax_hmem, but let me know if I'm
missing something.

config DEV_DAX_CXL
        tristate "CXL DAX: direct access to CXL RAM regions"
        depends on CXL_BUS && CXL_REGION && DEV_DAX
..


      On failure:
```
      100000000-27ffffff : System RAM
      5c0001128-5c00011b7 : port1
      5c0011128-5c00111b7 : port2
      5d0000000-6cffffff : CXL Window 0
      6d0000000-7cffffff : CXL Window 1
      7000000000-700000ffff : PCI Bus 0000:0c
        7000000000-700000ffff : 0000:0c:00.0
          7000001080-70000010d7 : mem1
      ```

      On success:
```
      5d0000000-7cffffff : dax0.0
        5d0000000-7cffffff : System RAM (kmem)
          5d0000000-6cffffff : CXL Window 0
          6d0000000-7cffffff : CXL Window 1
      ```
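
For readers following the probe failures quoted under issue 2: the negative numbers in those messages are negated errno values, so -2 is ENOENT, -12 is ENOMEM and -16 is EBUSY. Below is a minimal userspace sketch for decoding them. `rc_name()` and `print_rc()` are hypothetical helpers written only for this illustration (they are not kernel functions), and only the three codes seen in this thread are covered.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map a kernel-style negative return code to its
 * symbolic errno name. Only the codes quoted in this thread are handled.
 */
static const char *rc_name(int rc)
{
	switch (-rc) {
	case ENOENT:
		return "ENOENT";	/* "Extended linear cache calculation failed rc:-2" */
	case ENOMEM:
		return "ENOMEM";	/* "probe with driver hmem failed with error -12" */
	case EBUSY:
		return "EBUSY";		/* "probe with driver kmem failed with error -16" */
	default:
		return "unknown";
	}
}

/* Hypothetical helper: print the code, its name and the libc description. */
void print_rc(int rc)
{
	printf("rc=%d (%s: %s)\n", rc, rc_name(rc), strerror(-rc));
}
```

Calling `print_rc(-16)` prints the EBUSY name together with the C library's description, which fits the "could not reserve region" messages: the range was already claimed in the iomem tree.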

Regarding issues 1 and 2: they arise because hmem_register_device() attempts to
register the resources of all "HMEM devices", whereas we only need to register
the IORES_DESC_SOFT_RESERVED resources. I believe resolving the current TODO
will address this.

```
-   rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
-                          IORES_DESC_SOFT_RESERVED);
-   if (rc != REGION_INTERSECTS)
-       return 0;
+   /* TODO: insert "Soft Reserved" into iomem here */
```
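
To illustrate what the removed check was doing, here is a minimal userspace sketch of the filter, assuming a simplified model in which a candidate range is registered only if it overlaps a Soft Reserved range. This is not the kernel implementation: `struct range_incl`, `ranges_overlap()`, `intersects_soft_reserved()` and `model_register()` are hypothetical names, and the real region_intersects() additionally distinguishes a mixed result. The table holds the two Soft Reserved ranges from the BIOS-e820 lines quoted earlier in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive address range, like the start/end pair of a struct resource. */
struct range_incl {
	uint64_t start;
	uint64_t end;
};

/* Soft Reserved ranges taken from the BIOS-e820 lines in this thread. */
static const struct range_incl soft_reserved[] = {
	{ 0x180000000ULL, 0x1ffffffffULL },
	{ 0x5d0000000ULL, 0x64fffffffULL },
};

/* Two inclusive ranges overlap if neither lies entirely before the other. */
static bool ranges_overlap(struct range_incl a, struct range_incl b)
{
	return a.start <= b.end && b.start <= a.end;
}

/* Simplified stand-in for the kernel check: does cand touch any entry? */
static bool intersects_soft_reserved(struct range_incl cand)
{
	size_t n = sizeof(soft_reserved) / sizeof(soft_reserved[0]);

	for (size_t i = 0; i < n; i++)
		if (ranges_overlap(cand, soft_reserved[i]))
			return true;
	return false;
}

/*
 * Models the early return in hmem_register_device():
 * returns 0 when the range is skipped, 1 when it would be registered.
 */
static int model_register(struct range_incl cand)
{
	if (!intersects_soft_reserved(cand))
		return 0;	/* not Soft Reserved: leave it alone */
	return 1;		/* Soft Reserved: continue with dax registration */
}
```

Under this model the SRAT hotplug range 0x200000000-0x5bfffffff and the Node 1 range 0x100000000-0x17fffffff are skipped, while 0x5d0000000-0x64fffffff is registered, which is the behaviour issues 1 and 2 ask for.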

Above makes sense.

I think the subroutine add_soft_reserved() in your previous patchset[1] is
able to cover this TODO.


I'll probably wait for an update from Smita to test again, but if you
or Smita have anything you want me to try out on my hardware in the
meantime, let me know.


Here is my local fixup based on Dan's RFC; it can resolve issues 1 and 2.

I almost have the same approach :) Sorry, I missed adding your
"Signed-off-by".. Will include for next revision..



-- 8< --
   commit e7ccd7a01e168e185971da66f4aa13eb451caeaf
Author: Li Zhijian <lizhij...@fujitsu.com>
Date:   Fri Aug 20 11:07:15 2025 +0800

      Fix probe-order TODO
Signed-off-by: Li Zhijian <lizhij...@fujitsu.com>

diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 754115da86cc..965ffc622136 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -93,6 +93,26 @@ static void process_defer_work(struct work_struct *_work)
        walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
   }
+static int add_soft_reserved(resource_size_t start, resource_size_t len,
+                            unsigned long flags)
+{
+       struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
+       int rc;
+
+       if (!res)
+               return -ENOMEM;
+
+       *res = DEFINE_RES_NAMED_DESC(start, len, "Soft Reserved",
+                                    flags | IORESOURCE_MEM,
+                                    IORES_DESC_SOFT_RESERVED);
+
+       rc = insert_resource(&iomem_resource, res);
+       if (rc)
+               kfree(res);
+
+       return rc;
+}
+
   static int hmem_register_device(struct device *host, int target_nid,
                                const struct resource *res)
   {
@@ -102,6 +122,10 @@ static int hmem_register_device(struct device *host, int target_nid,
        long id;
        int rc;

+       if (soft_reserve_res_intersects(res->start, resource_size(res),
+                     IORESOURCE_MEM, IORES_DESC_NONE) == REGION_DISJOINT)
+               return 0;
+

Should also handle CONFIG_EFI_SOFT_RESERVE not enabled case..


Thanks
Smita

        if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
            region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
                              IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
                }
        }
-       /* TODO: insert "Soft Reserved" into iomem here */
+       /*
+        * This is a verified Soft Reserved region that CXL is not claiming (or
+        * is being overridden). Add it to the main iomem tree so it can be
+        * properly reserved by the DAX driver.
+        */
+       rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+       if (rc) {
+               dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+                        res, rc);
+               return rc;
+       }
        id = memregion_alloc(GFP_KERNEL);
        if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
   int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
                      unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+                     unsigned long desc);
   /* Support for virtually mapped pages */
   struct page *vmalloc_to_page(const void *addr);
   unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
                             arg, func);
   }
   EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+                              size_t size, unsigned long flags,
+                              unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+                     unsigned long desc)
+{
+       int ret;
+
+       read_lock(&resource_lock);
+       ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+       read_unlock(&resource_lock);
+
+       return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
   #endif
/*



[1] https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofi...@intel.com/


-- Alison



Regarding issue 3 (which already exists in the current situation), this could
be because nothing ensures that dax_hmem_probe() runs before cxl_acpi_probe()
when CXL_REGION is disabled.

I am pleased that you have pushed the patch to the cxl/for-6.18/cxl-probe-order
branch, and I'm looking forward to its integration into upstream during the
v6.18 merge window.
Besides the current TODO, you also mentioned that this RFC PATCH must be
further split into several patches, so significant work remains.
If my understanding is correct, you will personally continue to push this patch
forward, right?


Smita,

Do you have any additional thoughts on this proposal from your side?


Thanks
Zhijian

snip


