Hi Dan,

On 7/23/2025 12:31 AM, dan.j.willi...@intel.com wrote:
Smita Koralahalli wrote:
Introduce a background worker in cxl_acpi to delay SOFT RESERVE handling
until the cxl_mem driver has probed at least one device. This coordination
ensures that DAX registration or fallback handling for soft-reserved
regions is not triggered prematurely.

The worker waits on cxl_wait_queue, which is signaled via
cxl_mem_active_inc() during cxl_mem_probe(). Once at least one memory
device probe is confirmed, the worker invokes wait_for_device_probe()
to allow the rest of the CXL device hierarchy to complete initialization.

Additionally, this handles initialization order issues where
cxl_acpi_probe() may complete before other drivers such as cxl_port or
cxl_mem have loaded, especially when cxl_acpi and cxl_port are built-in
and cxl_mem is a loadable module. In such cases, using only
wait_for_device_probe() is insufficient, as it may return before all
relevant probes are registered.

Right, but that problem is not solved by this, which still leaves open
the decision of when to give up on this mechanism, and this mechanism
does not tell you when follow-on probe work is complete.

While region creation happens in cxl_port_probe(), waiting on
cxl_mem_active() would be sufficient as cxl_mem_probe() can only succeed
after the port hierarchy is in place. Furthermore, since cxl_mem depends
on cxl_pci, this also guarantees that cxl_pci has loaded by the time the
wait completes.

As cxl_mem_active() infrastructure already exists for tracking probe
activity, cxl_acpi can use it without introducing new coordination
mechanisms.

I appreciate the instinct not to add anything new, but the module
loading problem is solvable.

If the goal is: "I want to give device-dax a point at which it can make
a go / no-go decision about whether the CXL subsystem has properly
assembled all CXL regions implied by Soft Reserved intersecting with
CXL Windows." Then that is something like the below, only lightly tested
and likely regresses the non-CXL case.

-- 8< --
 From 48b25461eca050504cf5678afd7837307b2dd14f Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.willi...@intel.com>
Date: Tue, 22 Jul 2025 16:11:08 -0700
Subject: [RFC PATCH] dax/cxl: Defer Soft Reserved registration

CXL and dax_hmem fight over how "Soft Reserved" (EFI Specific Purpose Memory)
resources are published in the iomem resource tree. The entry blocks some
CXL hotplug flows, and CXL blocks dax_hmem from publishing the memory in
the event that CXL fails to parse the platform configuration.

Towards resolving this conflict (the non-RFC version of this patch
should split these into separate patches):

1/ Defer publishing "Soft Reserved" entries in the iomem resource tree
    until the consumer, dax_hmem, is ready to use them.

2/ Fix detection of "Soft Reserved" vs "CXL Window" resource overlaps by
    switching from MODULE_SOFTDEP() to request_module() for making sure that
    cxl_acpi has had a chance to publish "CXL Window" resources.

3/ Add cxl_pci to the list of modules that need to have had a chance to
    scan boot devices such that wait_for_device_probe() flushes initial CXL
    topology discovery.

4/ Add a workqueue that delays consideration of "Soft Reserved" that
    overlaps CXL so that the CXL subsystem can complete all of its region
    assembly.

For RFC purposes this only solves the reliability of the DAX_CXL_MODE_DROP
case. DAX_CXL_MODE_REGISTER support can follow to shutdown CXL in favor of
vanilla DAX devices as an emergency fallback for platform configuration
quirks and bugs.

Signed-off-by: Dan Williams <dan.j.willi...@intel.com>
---
  arch/x86/kernel/e820.c    |  2 +-
  drivers/dax/hmem/device.c |  4 +-
  drivers/dax/hmem/hmem.c   | 94 +++++++++++++++++++++++++++++++++------
  include/linux/ioport.h    | 25 +++++++++++
  kernel/resource.c         | 58 +++++++++++++++++++-----
  5 files changed, 156 insertions(+), 27 deletions(-)

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index c3acbd26408b..aef1ff2cabda 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1153,7 +1153,7 @@ void __init e820__reserve_resources_late(void)
        res = e820_res;
        for (i = 0; i < e820_table->nr_entries; i++) {
                if (!res->parent && res->end)
-                       insert_resource_expand_to_fit(&iomem_resource, res);
+                       insert_resource_late(res);
                res++;
        }
diff --git a/drivers/dax/hmem/device.c b/drivers/dax/hmem/device.c
index f9e1a76a04a9..22732b729017 100644
--- a/drivers/dax/hmem/device.c
+++ b/drivers/dax/hmem/device.c
@@ -83,8 +83,8 @@ static __init int hmem_register_one(struct resource *res, void *data)
static __init int hmem_init(void)
  {
-       walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
-                       IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+       walk_soft_reserve_res_desc(IORES_DESC_SOFT_RESERVED, IORESOURCE_MEM, 0,
+                                  -1, NULL, hmem_register_one);
        return 0;
  }
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 5e7c53f18491..0916478e3817 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -59,9 +59,45 @@ static void release_hmem(void *pdev)
        platform_device_unregister(pdev);
  }
+static enum dax_cxl_mode {
+       DAX_CXL_MODE_DEFER,
+       DAX_CXL_MODE_REGISTER,
+       DAX_CXL_MODE_DROP,
+} dax_cxl_mode;
+
+static int handle_deferred_cxl(struct device *host, int target_nid,
+                               const struct resource *res)
+{
+       if (region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
+                             IORES_DESC_CXL) != REGION_DISJOINT) {
+               if (dax_cxl_mode == DAX_CXL_MODE_DROP)
+                       dev_dbg(host, "dropping CXL range: %pr\n", res);
+       }
+       return 0;
+}
+
+struct dax_defer_work {
+       struct platform_device *pdev;
+       struct work_struct work;
+};
+
+static void process_defer_work(struct work_struct *_work)
+{
+       struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+       struct platform_device *pdev = work->pdev;
+
+       /* relies on cxl_acpi and cxl_pci having had a chance to load */
+       wait_for_device_probe();
+
+       dax_cxl_mode = DAX_CXL_MODE_DROP;
+
+       walk_hmem_resources(&pdev->dev, handle_deferred_cxl);
+}
+
  static int hmem_register_device(struct device *host, int target_nid,
                                const struct resource *res)
  {
+       struct dax_defer_work *work = dev_get_drvdata(host);
        struct platform_device *pdev;
        struct memregion_info info;
        long id;
@@ -70,14 +106,21 @@ static int hmem_register_device(struct device *host, int target_nid,
        if (IS_ENABLED(CONFIG_CXL_REGION) &&
            region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
                              IORES_DESC_CXL) != REGION_DISJOINT) {

I may be wrong here, but could this check fail? While request_module()
ensures that cxl_acpi and cxl_pci are requested for loading, it does not
guarantee that either has completed initialization or that region
enumeration (i.e., add_cxl_resources()) has finished by the time we
reach this check.

We also haven't called wait_for_device_probe() at this point, which is typically used to block until all pending device probes are complete.

Thanks
Smita
-               dev_dbg(host, "deferring range to CXL: %pr\n", res);
-               return 0;
+               switch (dax_cxl_mode) {
+               case DAX_CXL_MODE_DEFER:
+                       dev_dbg(host, "deferring range to CXL: %pr\n", res);
+                       schedule_work(&work->work);
+                       return 0;
+               case DAX_CXL_MODE_REGISTER:
+                       dev_dbg(host, "registering CXL range: %pr\n", res);
+                       break;
+               case DAX_CXL_MODE_DROP:
+                       dev_dbg(host, "dropping CXL range: %pr\n", res);
+                       return 0;
+               }
        }
-       rc = region_intersects(res->start, resource_size(res), IORESOURCE_MEM,
-                              IORES_DESC_SOFT_RESERVED);
-       if (rc != REGION_INTERSECTS)
-               return 0;
+       /* TODO: insert "Soft Reserved" into iomem here */
 
        id = memregion_alloc(GFP_KERNEL);
        if (id < 0) {
@@ -123,8 +166,30 @@ static int hmem_register_device(struct device *host, int target_nid,
        return rc;
  }
+static void kill_defer_work(void *_work)
+{
+       struct dax_defer_work *work = container_of(_work, typeof(*work), work);
+
+       cancel_work_sync(&work->work);
+       kfree(work);
+}
+
  static int dax_hmem_platform_probe(struct platform_device *pdev)
  {
+       struct dax_defer_work *work = kzalloc(sizeof(*work), GFP_KERNEL);
+       int rc;
+
+       if (!work)
+               return -ENOMEM;
+
+       work->pdev = pdev;
+       INIT_WORK(&work->work, process_defer_work);
+
+       rc = devm_add_action_or_reset(&pdev->dev, kill_defer_work, work);
+       if (rc)
+               return rc;
+
+       platform_set_drvdata(pdev, work);
        return walk_hmem_resources(&pdev->dev, hmem_register_device);
  }
@@ -139,6 +204,16 @@ static __init int dax_hmem_init(void)
  {
        int rc;
+       /*
+        * Ensure that cxl_acpi and cxl_pci have a chance to kick off
+        * CXL topology discovery at least once before scanning the
+        * iomem resource tree for IORES_DESC_CXL resources.
+        */
+       if (IS_ENABLED(CONFIG_CXL_REGION)) {
+               request_module("cxl_acpi");
+               request_module("cxl_pci");
+       }
+
        rc = platform_driver_register(&dax_hmem_platform_driver);
        if (rc)
                return rc;
@@ -159,13 +234,6 @@ static __exit void dax_hmem_exit(void)
  module_init(dax_hmem_init);
  module_exit(dax_hmem_exit);
-/* Allow for CXL to define its own dax regions */
-#if IS_ENABLED(CONFIG_CXL_REGION)
-#if IS_MODULE(CONFIG_CXL_ACPI)
-MODULE_SOFTDEP("pre: cxl_acpi");
-#endif
-#endif
-
  MODULE_ALIAS("platform:hmem*");
  MODULE_ALIAS("platform:hmem_platform*");
  MODULE_DESCRIPTION("HMEM DAX: direct access to 'specific purpose' memory");
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index e8b2d6aa4013..4fc6ab518c24 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -232,6 +232,9 @@ struct resource_constraint {
  /* PC/ISA/whatever - the normal PC address spaces: IO and memory */
  extern struct resource ioport_resource;
  extern struct resource iomem_resource;
+#ifdef CONFIG_EFI_SOFT_RESERVE
+extern struct resource soft_reserve_resource;
+#endif
extern struct resource *request_resource_conflict(struct resource *root, struct resource *new);
  extern int request_resource(struct resource *root, struct resource *new);
@@ -255,6 +258,22 @@ int adjust_resource(struct resource *res, resource_size_t start,
                    resource_size_t size);
  resource_size_t resource_alignment(struct resource *res);
+
+#ifdef CONFIG_EFI_SOFT_RESERVE
+static inline void insert_resource_late(struct resource *new)
+{
+       if (new->desc == IORES_DESC_SOFT_RESERVED)
+               insert_resource_expand_to_fit(&soft_reserve_resource, new);
+       else
+               insert_resource_expand_to_fit(&iomem_resource, new);
+}
+#else
+static inline void insert_resource_late(struct resource *new)
+{
+       insert_resource_expand_to_fit(&iomem_resource, new);
+}
+#endif
+
  /**
   * resource_set_size - Calculate resource end address from size and start
   * @res: Resource descriptor
@@ -409,6 +428,12 @@ walk_system_ram_res_rev(u64 start, u64 end, void *arg,
  extern int
  walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start, u64 end,
                    void *arg, int (*func)(struct resource *, void *));
+int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+                              u64 start, u64 end, void *arg,
+                              int (*func)(struct resource *, void *));
int region_intersects_soft_reserve(struct resource *root, resource_size_t start,
+                                  size_t size, unsigned long flags,
+                                  unsigned long desc);
struct resource *devm_request_free_mem_region(struct device *dev,
                struct resource *base, unsigned long size);
diff --git a/kernel/resource.c b/kernel/resource.c
index 8d3e6ed0bdc1..fd90990c31c6 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -321,8 +321,8 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
  }
/**
- * find_next_iomem_res - Finds the lowest iomem resource that covers part of
- *                      [@start..@end].
+ * find_next_res - Finds the lowest resource that covers part of
+ *                [@start..@end].
   *
   * If a resource is found, returns 0 and @*res is overwritten with the part
   * of the resource that's within [@start..@end]; if none is found, returns
@@ -337,9 +337,9 @@ static bool is_type_match(struct resource *p, unsigned long flags, unsigned long
   * The caller must specify @start, @end, @flags, and @desc
   * (which may be IORES_DESC_NONE).
   */
-static int find_next_iomem_res(resource_size_t start, resource_size_t end,
-                              unsigned long flags, unsigned long desc,
-                              struct resource *res)
+static int find_next_res(struct resource *parent, resource_size_t start,
+                        resource_size_t end, unsigned long flags,
+                        unsigned long desc, struct resource *res)
  {
        struct resource *p;
@@ -351,7 +351,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
 
        read_lock(&resource_lock);
 
-       for_each_resource(&iomem_resource, p, false) {
+       for_each_resource(parent, p, false) {
                /* If we passed the resource we are looking for, stop */
                if (p->start > end) {
                        p = NULL;
@@ -382,16 +382,23 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
        return p ? 0 : -ENODEV;
  }
-static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
-                                unsigned long flags, unsigned long desc,
-                                void *arg,
-                                int (*func)(struct resource *, void *))
+static int find_next_iomem_res(resource_size_t start, resource_size_t end,
+                              unsigned long flags, unsigned long desc,
+                              struct resource *res)
+{
+       return find_next_res(&iomem_resource, start, end, flags, desc, res);
+}
+
+static int walk_res_desc(struct resource *parent, resource_size_t start,
+                        resource_size_t end, unsigned long flags,
+                        unsigned long desc, void *arg,
+                        int (*func)(struct resource *, void *))
  {
        struct resource res;
        int ret = -EINVAL;
while (start < end &&
-              !find_next_iomem_res(start, end, flags, desc, &res)) {
+              !find_next_res(parent, start, end, flags, desc, &res)) {
                ret = (*func)(&res, arg);
                if (ret)
                        break;
@@ -402,6 +409,15 @@ static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
        return ret;
  }
+static int __walk_iomem_res_desc(resource_size_t start, resource_size_t end,
+                                unsigned long flags, unsigned long desc,
+                                void *arg,
+                                int (*func)(struct resource *, void *))
+{
+       return walk_res_desc(&iomem_resource, start, end, flags, desc, arg, func);
+}
+
+
  /**
   * walk_iomem_res_desc - Walks through iomem resources and calls func()
   *                     with matching resource ranges.
@@ -426,6 +442,26 @@ int walk_iomem_res_desc(unsigned long desc, unsigned long flags, u64 start,
  }
  EXPORT_SYMBOL_GPL(walk_iomem_res_desc);
+#ifdef CONFIG_EFI_SOFT_RESERVE
+struct resource soft_reserve_resource = {
+       .name   = "Soft Reserved",
+       .start  = 0,
+       .end    = -1,
+       .desc   = IORES_DESC_SOFT_RESERVED,
+       .flags  = IORESOURCE_MEM,
+};
+EXPORT_SYMBOL_GPL(soft_reserve_resource);
+
+int walk_soft_reserve_res_desc(unsigned long desc, unsigned long flags,
+                              u64 start, u64 end, void *arg,
+                              int (*func)(struct resource *, void *))
+{
+       return walk_res_desc(&soft_reserve_resource, start, end, flags, desc,
+                            arg, func);
+}
+EXPORT_SYMBOL_GPL(walk_soft_reserve_res_desc);
+#endif
+
  /*
   * This function calls the @func callback against all memory ranges of type
  * System RAM which are marked as IORESOURCE_SYSTEM_RAM and IORESOURCE_BUSY.

