On 21/08/2025 07:14, Alison Schofield wrote:
On Tue, Aug 05, 2025 at 03:58:41AM +0000, Zhijian Li (Fujitsu) wrote:
Hi Dan and Smita,
On 24/07/2025 00:13, dan.j.willi...@intel.com wrote:
dan.j.williams@ wrote:
[..]
If the goal is: "I want to give device-dax a point at which it can make
a go / no-go decision about whether the CXL subsystem has properly
assembled all CXL regions implied by Soft Reserved intersecting with
CXL Windows." Then that is something like the below, only lightly tested
and likely regresses the non-CXL case.
-- 8< --
From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.willi...@intel.com>
Date: Tue, 22 Jul 2025 16:11:08 -0700
Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration
Likely needs this incremental change to prevent DEV_DAX_HMEM from being
built-in when CXL is not. This still leaves the awkward scenario of CXL
enabled, DEV_DAX_CXL disabled, and DEV_DAX_HMEM built-in. I believe that
safely fails in devdax only / fallback mode, but something to
investigate when respinning on top of this.
Thank you for your RFC; I find the proposal compelling, as it directly
addresses the issues I am currently facing.
That said, I still ran into several issues with your patch (which is
understandable at the RFC stage):
Hi Zhijian,
Like you, I tried this RFC out. It resolved the issue of soft reserved
resources preventing teardown and replacement of a region in place.
I looked at the issues you found, and have some questions and comments
included below.
1. Some resources described by SRAT are wrongly identified as System RAM
(kmem), such as the following: 200000000-5bfffffff.
```
200000000-5bfffffff : dax6.0
  200000000-5bfffffff : System RAM (kmem)
5c0001128-5c00011b7 : port1
5d0000000-64fffffff : CXL Window 0
  5d0000000-64fffffff : region0
    5d0000000-64fffffff : dax0.0
      5d0000000-64fffffff : System RAM (kmem)
680000000-e7ffffff : PCI Bus 0000:00

[root@rdma-server ~]# dmesg | grep -i -e soft -e hotplug
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan+ root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 console=ttyS0,115200n8 softlockup_panic=1 printk.devkmsg=on oops=panic sysrq_always_enabled panic_on_warn ignore_loglevel kasan.fault=panic
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[ 0.072114] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
```
Is that range also labelled as soft reserved?
I ask, because I'm trying to draw a parallel between our test platforms.
No, it's not a soft reserved range. This can simply be simulated with QEMU
using the `maxmem=192G` option (see the full QEMU command line below).
In my environment, `0x200000000-0x5bfffffff` is something like [DRAM_END + 1,
DRAM_END + maxmem - TOTAL_INSTALLED_DRAM_SIZE], where DRAM_END is the end of
the installed DRAM in Node 3.
This range is reserved for DRAM hot-add. In my case, it gets registered as an
'HMEM device' via hmem_register_resource() in the HMAT code
(drivers/acpi/numa/hmat.c):
```
static void hmat_register_target_devices(struct memory_target *target)
{
	struct resource *res;

	/*
	 * Do not bother creating devices if no driver is available to
	 * consume them.
	 */
	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
		return;

	for (res = target->memregions.child; res; res = res->sibling) {
		int target_nid = pxm_to_node(target->memory_pxm);

		hmem_register_resource(target_nid, res);
	}
}
```
```
$ dmesg | grep -i -e soft -e hotplug -e Node
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos1)/boot/vmlinuz-6.16.0-rc4-lizhijian-Dan-00026-g1473b9914846-dirty root=UUID=386769a3-cfa5-47c8-8797-d5ec58c9cb6c ro earlyprintk=ttyS0 no_timer_check net.ifnames=0 console=tty1 conc
[ 0.000000] BIOS-e820: [mem 0x0000000180000000-0x00000001ffffffff] soft reserved
[ 0.000000] BIOS-e820: [mem 0x00000005d0000000-0x000000064fffffff] soft reserved
[ 0.066332] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.067665] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[ 0.068995] ACPI: SRAT: Node 1 PXM 1 [mem 0x100000000-0x17fffffff]
[ 0.070359] ACPI: SRAT: Node 2 PXM 2 [mem 0x180000000-0x1bfffffff]
[ 0.071723] ACPI: SRAT: Node 3 PXM 3 [mem 0x1c0000000-0x1ffffffff]
[ 0.073085] ACPI: SRAT: Node 3 PXM 3 [mem 0x200000000-0x5bfffffff] hotplug
[ 0.075689] NUMA: Node 0 [mem 0x00001000-0x0009ffff] + [mem 0x00100000-0x7fffffff] -> [mem 0x00001000-0x7fffffff]
[ 0.077849] NODE_DATA(0) allocated [mem 0x7ffb3e00-0x7ffdefff]
[ 0.079149] NODE_DATA(1) allocated [mem 0x17ffd1e00-0x17fffcfff]
[ 0.086077] Movable zone start for each node
[ 0.087054] Early memory node ranges
[ 0.087890] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.089264] node 0: [mem 0x0000000000100000-0x000000007ffdefff]
[ 0.090631] node 1: [mem 0x0000000100000000-0x000000017fffffff]
[ 0.092003] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffdefff]
[ 0.093532] Initmem setup node 1 [mem 0x0000000100000000-0x000000017fffffff]
[ 0.095164] Initmem setup node 2 as memoryless
[ 0.096281] Initmem setup node 3 as memoryless
[ 0.097397] Initmem setup node 4 as memoryless
[ 0.098444] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.099866] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.104342] On node 1, zone Normal: 33 pages in unavailable ranges
[ 0.126883] CPU topo: Allowing 4 present CPUs plus 0 hotplug CPUs
```
=================================
Please note that this is a modified QEMU.
```
/home/lizhijian/qemu/build-hmem/qemu-system-x86_64 -machine q35,accel=kvm,cxl=on,hmat=on \
  -name guest-rdma-server -nographic -boot c \
  -m size=6G,slots=2,maxmem=19922944k \
  -hda /home/lizhijian/images/Fedora-rdma-server.qcow2 \
  -object memory-backend-memfd,share=on,size=2G,id=m0 \
  -object memory-backend-memfd,share=on,size=2G,id=m1 \
  -numa node,nodeid=0,cpus=0-1,memdev=m0 \
  -numa node,nodeid=1,cpus=2-3,memdev=m1 \
  -smp 4,sockets=2,cores=2 \
  -device pcie-root-port,id=pci-root,slot=8,bus=pcie.0,chassis=0 \
  -device pxb-cxl,id=pxb-cxl-host-bridge,bus=pcie.0,bus_nr=0x35,hdm_for_passthrough=true \
  -device cxl-rp,id=cxl-rp-hb-rp0,bus=pxb-cxl-host-bridge,chassis=0,slot=0,port=0 \
  -device cxl-type3,bus=cxl-rp-hb-rp0,volatile-memdev=cxl-vmem0,id=cxl-vmem0,program-hdm-decoder=true \
  -object memory-backend-file,id=cxl-vmem0,share=on,mem-path=/home/lizhijian/images/cxltest0.raw,size=2048M \
  -M cxl-fmw.0.targets.0=pxb-cxl-host-bridge,cxl-fmw.0.size=2G,cxl-fmw.0.interleave-granularity=8k \
  -nic bridge,br=virbr0,model=e1000,mac=52:54:00:c9:76:74 \
  -bios /home/lizhijian/seabios/out/bios.bin \
  -object memory-backend-memfd,share=on,size=1G,id=m2 \
  -object memory-backend-memfd,share=on,size=1G,id=m3 \
  -numa node,memdev=m2,nodeid=2 \
  -numa node,memdev=m3,nodeid=3 \
  -numa dist,src=0,dst=0,val=10 \
  -numa dist,src=0,dst=1,val=21 \
  -numa dist,src=0,dst=2,val=21 \
  -numa dist,src=0,dst=3,val=21 \
  -numa dist,src=1,dst=0,val=21 \
  -numa dist,src=1,dst=1,val=10 \
  -numa dist,src=1,dst=2,val=21 \
  -numa dist,src=1,dst=3,val=21 \
  -numa dist,src=2,dst=0,val=21 \
  -numa dist,src=2,dst=1,val=21 \
  -numa dist,src=2,dst=2,val=10 \
  -numa dist,src=2,dst=3,val=21 \
  -numa dist,src=3,dst=0,val=21 \
  -numa dist,src=3,dst=1,val=21 \
  -numa dist,src=3,dst=2,val=21 \
  -numa dist,src=3,dst=3,val=10 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=240 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
  -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
  -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
  -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
  -numa hmat-lb,initiator=0,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M \
  -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=240 \
  -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=40000M \
  -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=110 \
  -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=20000M \
  -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=340 \
  -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=60000M \
  -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-latency,latency=440 \
  -numa hmat-lb,initiator=1,target=3,hierarchy=memory,data-type=access-bandwidth,bandwidth=80000M
```
I see -
[] BIOS-e820: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] reserve setup_data: [mem 0x0000024080000000-0x000004407fffffff] soft reserved
.
.
[] ACPI: SRAT: Node 6 PXM 14 [mem 0x24080000000-0x4407fffffff] hotplug
/proc/iomem - as expected
24080000000-5f77fffffff : CXL Window 0
24080000000-4407fffffff : region0
24080000000-4407fffffff : dax0.0
24080000000-4407fffffff : System RAM (kmem)
I'm also seeing this message:
[] resource: Unaddressable device [mem 0x24080000000-0x4407fffffff] conflicts with [mem 0x24080000000-0x4407fffffff]
2. Triggers dev_warn and dev_err:
```
[root@rdma-server ~]# journalctl -p err -p warning --dmesg
...snip...
Jul 29 13:17:36 rdma-server kernel: cxl root0: Extended linear cache calculation failed rc:-2
Jul 29 13:17:36 rdma-server kernel: hmem hmem.1: probe with driver hmem failed with error -12
Jul 29 13:17:36 rdma-server kernel: hmem hmem.2: probe with driver hmem failed with error -12
Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: mapping0: 0x100000000-0x17fffffff could not reserve region
Jul 29 13:17:36 rdma-server kernel: kmem dax3.0: probe with driver kmem failed with error -16
```
I see the kmem dax messages also. It seems the kmem probe is going after
every range (except hotplug) in the SRAT, and failing.
Yes, that's true, because the current RFC removed the code that filters out
non-soft-reserved resources. As a result, it tries to register dax/kmem for
all of them, even though some are already marked busy in iomem_resource:
```
-	rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
-			       IORES_DESC_SOFT_RESERVED);
-	if (rc != REGION_INTERSECTS)
-		return 0;
```
This is another example on my real *CXL HOST*:
```
Aug 19 17:59:05 kernel: device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measuremen>
Aug 19 17:59:09 kernel: power_meter ACPI000D:00: Ignoring unsafe software power cap!
Aug 19 17:59:09 kernel: kmem dax2.0: mapping0: 0x0-0x8fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax2.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax3.0: mapping0: 0x100000000-0x86fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax3.0: probe with driver kmem failed with error -16
Aug 19 17:59:09 kernel: kmem dax4.0: mapping0: 0x870000000-0x106fffffff could not reserve region
Aug 19 17:59:09 kernel: kmem dax4.0: probe with driver kmem failed with error -16
Aug 19 17:59:19 kernel: nvme nvme0: using unchecked data buffer
Aug 19 18:36:27 kernel: block nvme1n1: No UUID available providing old NGUID

lizhijian@:~$ sudo grep -w -e 106fffffff -e 870000000 -e 8fffffff -e 100000000 /proc/iomem
6fffb000-8fffffff : Reserved
100000000-10000ffff : Reserved
106ccc0000-106fffffff : Reserved
```
This issue could be resolved by re-introducing
soft_reserved_region_intersects(...), I guess.
3. When CXL_REGION is disabled, the fallback to dax_hmem fails, in which
case only CXL Window X is visible.
Haven't tested !CXL_REGION yet.
	if (IS_ENABLED(CONFIG_DEV_DAX_CXL) &&
	    region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
			      IORES_DESC_CXL) != REGION_DISJOINT) {
@@ -119,7 +143,17 @@ static int hmem_register_device(struct device *host, int target_nid,
		}
	}
-	/* TODO: insert "Soft Reserved" into iomem here */
+	/*
+	 * This is a verified Soft Reserved region that CXL is not claiming (or
+	 * is being overridden). Add it to the main iomem tree so it can be
+	 * properly reserved by the DAX driver.
+	 */
+	rc = add_soft_reserved(res->start, res->end - res->start + 1, 0);
+	if (rc) {
+		dev_warn(host, "failed to insert soft-reserved resource %pr into iomem: %d\n",
+			 res, rc);
+		return rc;
+	}
	id = memregion_alloc(GFP_KERNEL);
	if (id < 0) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 349f0d9aad22..eca5956c444b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1069,6 +1069,8 @@ enum {
 int region_intersects(resource_size_t offset, size_t size, unsigned long flags,
		      unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t offset, size_t size, unsigned long flags,
+				unsigned long desc);
 /* Support for virtually mapped pages */
 struct page *vmalloc_to_page(const void *addr);
 unsigned long vmalloc_to_pfn(const void *addr);
diff --git a/kernel/resource.c b/kernel/resource.c
index b8eac6af2fad..a34b76cf690a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -461,6 +461,22 @@ int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
			arg, func);
 }
 EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+
+static int __region_intersects(struct resource *parent, resource_size_t start,
+			       size_t size, unsigned long flags,
+			       unsigned long desc);
+int soft_reserve_res_intersects(resource_size_t start, size_t size, unsigned long flags,
+				unsigned long desc)
+{
+	int ret;
+
+	read_lock(&resource_lock);
+	ret = __region_intersects(&soft_reserve_resource, start, size, flags, desc);
+	read_unlock(&resource_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(soft_reserve_res_intersects);
#endif
/*
[1]
https://lore.kernel.org/linux-cxl/29312c0765224ae76862d59a17748c8188fb95f1.1692638817.git.alison.schofi...@intel.com/
-- Alison
Regarding issue 3 (which exists today as well), this could be because there
is no guarantee that dax_hmem_probe() executes before cxl_acpi_probe() when
CXL_REGION is disabled.
I am glad that you have pushed the patch to the cxl/for-6.18/cxl-probe-order
branch, and I look forward to it landing upstream in the v6.18 merge window.
Besides the current TODO, you also mentioned that this RFC patch needs to be
split into several patches, so there is still significant work to be done.
If my understanding is correct, you will personally continue pushing this
patch forward, right?
Smita,
Do you have any additional thoughts on this proposal from your side?
Thanks
Zhijian