On 2/13/26 2:47 AM, Chen Ridong wrote:
Hi Longman:

On 2026/2/13 0:46, Waiman Long wrote:
The current cpuset partition code is able to dynamically update
the sched domains of a running system and the corresponding
HK_TYPE_DOMAIN housekeeping cpumask to perform what is essentially the
"isolcpus=domain,..." boot command line feature at run time.

The housekeeping cpumask update requires flushing a number of different
workqueues which may not be safe with cpus_read_lock() held as the
workqueue flushing code may acquire cpus_read_lock() or acquire locks
which have a locking dependency on cpus_read_lock() down the chain. Below
is an example of such a circular locking problem.

   ======================================================
   WARNING: possible circular locking dependency detected
   6.18.0-test+ #2 Tainted: G S
   ------------------------------------------------------
   test_cpuset_prs/10971 is trying to acquire lock:
   ffff888112ba4958 ((wq_completion)sync_wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x7a/0x180

   but task is already holding lock:
   ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:
   -> #4 (cpuset_mutex){+.+.}-{4:4}:
   -> #3 (cpu_hotplug_lock){++++}-{0:0}:
   -> #2 (rtnl_mutex){+.+.}-{4:4}:
   -> #1 ((work_completion)(&arg.work)){+.+.}-{0:0}:
   -> #0 ((wq_completion)sync_wq){+.+.}-{0:0}:

   Chain exists of:
     (wq_completion)sync_wq --> cpu_hotplug_lock --> cpuset_mutex

   5 locks held by test_cpuset_prs/10971:
    #0: ffff88816810e440 (sb_writers#7){.+.+}-{0:0}, at: ksys_write+0xf9/0x1d0
     #1: ffff8891ab620890 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x260/0x5f0
     #2: ffff8890a78b83e8 (kn->active#187){.+.+}-{0:0}, at: kernfs_fop_write_iter+0x2b6/0x5f0
     #3: ffffffffadf32900 (cpu_hotplug_lock){++++}-{0:0}, at: cpuset_partition_write+0x77/0x130
     #4: ffffffffae47f450 (cpuset_mutex){+.+.}-{4:4}, at: cpuset_partition_write+0x85/0x130

   Call Trace:
    <TASK>
      :
    touch_wq_lockdep_map+0x93/0x180
    __flush_workqueue+0x111/0x10b0
    housekeeping_update+0x12d/0x2d0
    update_parent_effective_cpumask+0x595/0x2440
    update_prstate+0x89d/0xce0
    cpuset_partition_write+0xc5/0x130
    cgroup_file_write+0x1a5/0x680
    kernfs_fop_write_iter+0x3df/0x5f0
    vfs_write+0x525/0xfd0
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x95/0x520
    entry_SYSCALL_64_after_hwframe+0x76/0x7e

To avoid such a circular locking dependency problem, we have to call
housekeeping_update() without holding cpus_read_lock() and cpuset_mutex.
The current set of workqueues flushed by housekeeping_update() may not
have work functions that call cpus_read_lock() directly, but the list of
flushed workqueues is likely to be extended in the future. Moreover, the
current set of work functions may hold locks that have cpu_hotplug_lock
down their dependency chains.

One way to do that is to defer the housekeeping_update() call until
after the current cpuset critical section has finished and
cpus_read_lock() is no longer held. For cpuset control file writes,
this can be done by deferring it via a task_work that runs right before
returning to userspace.
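
For illustration, the task_work deferral pattern looks roughly like the
sketch below. The names here are placeholders rather than the code added
by this patch; the actual hook is in the diff further down.

#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/task_work.h>

static void deferred_fn(struct callback_head *cb)
{
        /*
         * Runs in the context of the writing task just before it
         * returns to userspace, i.e. after cpuset_mutex and
         * cpus_read_lock() have been dropped.
         */
        kfree(cb);
}

static int queue_deferred_fn(void)
{
        struct callback_head *cb = kzalloc(sizeof(*cb), GFP_KERNEL);

        if (!cb)
                return -ENOMEM;

        init_task_work(cb, deferred_fn);
        /* TWA_RESUME: run on the next return to userspace */
        if (task_work_add(current, cb, TWA_RESUME)) {
                kfree(cb);      /* the task is exiting */
                return -ESRCH;
        }
        return 0;
}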

To enable mutual exclusion between the housekeeping_update() call and
other cpuset control file write actions, a new top-level cpuset_top_mutex
is introduced. This new mutex is acquired first so that variables used by
both code paths can be shared safely. However, a cpuset update from CPU
hotplug can still happen in parallel with the housekeeping_update() call,
though that should be rare in a production environment.

As cpus_read_lock() is no longer held when
tmigr_isolated_exclude_cpumask() is called, that function now needs to
acquire it directly.

lockdep_is_cpuset_held() is also updated to return true if either
cpuset_top_mutex or cpuset_mutex is held.

Signed-off-by: Waiman Long <[email protected]>
---
  kernel/cgroup/cpuset.c        | 99 ++++++++++++++++++++++++++++++++---
  kernel/sched/isolation.c      |  4 +-
  kernel/time/timer_migration.c |  4 +-
  3 files changed, 93 insertions(+), 14 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 48b7f275085b..c6a97956a991 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -65,14 +65,28 @@ static const char * const perr_strings[] = {
   * CPUSET Locking Convention
   * -------------------------
   *
- * Below are the three global locks guarding cpuset structures in lock
+ * Below are the four global/local locks guarding cpuset structures in lock
   * acquisition order:
+ *  - cpuset_top_mutex
   *  - cpu_hotplug_lock (cpus_read_lock/cpus_write_lock)
   *  - cpuset_mutex
   *  - callback_lock (raw spinlock)
   *
- * A task must hold all the three locks to modify externally visible or
- * used fields of cpusets, though some of the internally used cpuset fields
+ * As cpuset will now indirectly flush a number of different workqueues in
+ * housekeeping_update() to update housekeeping cpumasks when the set of
+ * isolated CPUs is going to be changed, it may be vulnerable to deadlock
+ * if we hold cpus_read_lock while calling into housekeeping_update().
+ *
+ * The first lock, cpuset_top_mutex, will be held except when calling into
+ * cpuset_handle_hotplug() from the CPU hotplug code where cpus_write_lock
+ * and cpuset_mutex will be held instead. The main purpose of this mutex
+ * is to prevent regular cpuset control file write actions from interfering
+ * with the call to housekeeping_update(), though CPU hotplug operation can
+ * still happen in parallel. This mutex also provides protection for some
+ * internal variables.
+ *
+ * A task must hold all the remaining three locks to modify externally visible
+ * or used fields of cpusets, though some of the internally used cpuset fields
   * and internal variables can be modified without holding callback_lock. If only
   * reliable read access of the externally used fields are needed, a task can
   * hold either cpuset_mutex or callback_lock which are exposed to other
@@ -100,6 +114,7 @@ static const char * const perr_strings[] = {
   * cpumasks and nodemasks.
   */
+static DEFINE_MUTEX(cpuset_top_mutex);
  static DEFINE_MUTEX(cpuset_mutex);
/*
@@ -111,6 +126,8 @@ static DEFINE_MUTEX(cpuset_mutex);
   *
   * CSCB: Readable by holding either cpuset_mutex or callback_lock. Writable
   *     by holding both cpuset_mutex and callback_lock.
+ *
+ * T:   Read/write-able by holding the cpuset_top_mutex.
   */
/*
@@ -135,6 +152,18 @@ static cpumask_var_t       isolated_cpus;          /* CSCB */
   */
  static bool           isolated_cpus_updating; /* RWCS */
+/*
+ * Copy of isolated_cpus to be passed to housekeeping_update()
+ */
+static cpumask_var_t   isolated_hk_cpus;       /* T */
+
+/*
+ * Flag to prevent queuing more than one task_work to the same cpuset_top_mutex
+ * critical section.
+ */
+static bool            isolcpus_twork_queued;  /* T */
+
+
  /*
   * A flag to force sched domain rebuild at the end of an operation.
   * It can be set in
@@ -301,20 +330,24 @@ void lockdep_assert_cpuset_lock_held(void)
   */
  void cpuset_full_lock(void)
  {
+       mutex_lock(&cpuset_top_mutex);
        cpus_read_lock();
        mutex_lock(&cpuset_mutex);
  }
void cpuset_full_unlock(void)
  {
+       isolcpus_twork_queued = false;

This is odd.

        mutex_unlock(&cpuset_mutex);
        cpus_read_unlock();
+       mutex_unlock(&cpuset_top_mutex);
  }
#ifdef CONFIG_LOCKDEP
  bool lockdep_is_cpuset_held(void)
  {
-       return lockdep_is_held(&cpuset_mutex);
+       return lockdep_is_held(&cpuset_mutex) ||
+              lockdep_is_held(&cpuset_top_mutex);
  }
  #endif
@@ -1338,6 +1371,28 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
        return false;
  }
+/*
+ * housekeeping_update() will only be called if isolated_cpus differs
+ * from isolated_hk_cpus. To be safe, rebuild_sched_domains() will always
+ * be called just in case there are still pending sched domains changes.
+ */
+static void isolcpus_tworkfn(struct callback_head *cb)
+{
+       bool update_hk = true;
+
+       guard(mutex)(&cpuset_top_mutex);
+       scoped_guard(spinlock_irq, &callback_lock) {
+               if (cpumask_equal(isolated_hk_cpus, isolated_cpus))
+                       update_hk = false;
+               else
+                       cpumask_copy(isolated_hk_cpus, isolated_cpus);
+       }
+       if (update_hk)
+               WARN_ON_ONCE(housekeeping_update(isolated_hk_cpus) < 0);
+       rebuild_sched_domains();
+       kfree(cb);
+}
+
  /*
   * update_isolation_cpumasks - Update external isolation related CPU masks
   *
@@ -1346,15 +1401,42 @@ static bool prstate_housekeeping_conflict(int prstate, struct cpumask *new_cpus)
   */
  static void update_isolation_cpumasks(void)
  {
-       int ret;
+       struct callback_head *twork_cb;
if (!isolated_cpus_updating)
                return;
+       else
+               isolated_cpus_updating = false;
+
+       /*
+        * CPU hotplug shouldn't set isolated_cpus_updating.
+        *
+        * To have better flexibility and prevent the possibility of deadlock,
+        * we defer the housekeeping_update() call to after the current cpuset
+        * critical section has finished. This is done via the synchronous
+        * task_work which will be executed right before returning to userspace.
+        *
+        * update_isolation_cpumasks() may be called more than once in the
+        * same cpuset_mutex critical section.
+        */
+       lockdep_assert_held(&cpuset_top_mutex);
+       if (isolcpus_twork_queued)
+               return;
- ret = housekeeping_update(isolated_cpus);
-       WARN_ON_ONCE(ret < 0);
+       twork_cb = kzalloc(sizeof(struct callback_head), GFP_KERNEL);
+       if (!twork_cb)
+               return;
- isolated_cpus_updating = false;
+       /*
+        * isolcpus_tworkfn() will be invoked before returning to userspace
+        */
+       init_task_work(twork_cb, isolcpus_tworkfn);
+       if (task_work_add(current, twork_cb, TWA_RESUME)) {
+               kfree(twork_cb);
+               WARN_ON_ONCE(1);        /* Current task shouldn't be exiting */
+       } else {
+               isolcpus_twork_queued = true;
+       }
  }

Actually, I find this function quite complex, with numerous global
variables to maintain.

I'm considering whether we can simplify it. Could we just call
update_isolation_cpumasks() at the end of update_prstate(),
update_cpumask(), and update_exclusive_cpumask()?

i.e.

static void update_isolation_cpumasks(void)
{
        struct callback_head twork_cb;

        if (!isolated_cpus_updating)
                return;
        task_work_add(...)
        isolated_cpus_updating = false;
}

static int update_prstate(struct cpuset *cs, int new_prs)
{
        ...
        free_tmpmasks(&tmpmask);
        update_isolation_cpumasks();
        return 0;
}

For rebuilding scheduling domains, we could rebuild them during the
set operation only when force_sd_rebuild = true and
!isolated_cpus_updating. Otherwise, the rebuild would be deferred
and performed only once in isolcpus_tworkfn().
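
Something along these lines, purely as a sketch (it glosses over how the
callback takes a stable snapshot of isolated_cpus under callback_lock and
how duplicate queuing within a single write is avoided):

static void isolcpus_tworkfn(struct callback_head *cb)
{
        /*
         * Runs after cpuset_mutex and cpus_read_lock have been released.
         * A stable copy of isolated_cpus would have to be taken under
         * callback_lock before it is used here.
         */
        WARN_ON_ONCE(housekeeping_update(isolated_cpus) < 0);
        rebuild_sched_domains();
        kfree(cb);
}

/* Called once at the end of update_prstate()/update_cpumask()/... */
static void update_isolation_cpumasks(void)
{
        struct callback_head *cb;

        if (!isolated_cpus_updating)
                return;

        /* Heap allocated because it outlives this function */
        cb = kzalloc(sizeof(*cb), GFP_KERNEL);
        if (!cb)
                return;

        init_task_work(cb, isolcpus_tworkfn);
        if (task_work_add(current, cb, TWA_RESUME))
                kfree(cb);      /* current task is exiting */

        isolated_cpus_updating = false;
}

The immediate rebuild in the write path could then be guarded by
something like "if (force_sd_rebuild && !isolated_cpus_updating)"
before calling rebuild_sched_domains_locked().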

Yes, the v5 series makes the code more complex, so I am now reverting back to a simpler approach.

Cheers,
Longman

