On Thu, Jun 18 2026 at 11:11, Jing Wu wrote:
> +
> +/*
> + * Managed IRQ housekeeping callback: iterate all managed IRQs and ask
s/IRQ/interrupt/
> + * the chip to move them off CPUs newly removed from HK_TYPE_MANAGED_IRQ.
Also this doesn't ask the chip to move it.
> + */
> +static void irq_hk_apply(enum hk_type type)
> +{
> + cpumask_var_t hk_mask;
> + struct irq_desc *desc;
> + unsigned int irq;
> +
> + if (!alloc_cpumask_var(&hk_mask, GFP_KERNEL))
> + return;
> +
> + /*
> + * Snapshot the new HK_TYPE_MANAGED_IRQ mask under an RCU read lock
> + * before iterating IRQ descriptors. The lockdep annotation in
> + * housekeeping_cpumask() requires an RCU read-side critical section
> + * for runtime-mutable types.
> + */
> + rcu_read_lock();
> + cpumask_copy(hk_mask, housekeeping_cpumask_rcu(HK_TYPE_MANAGED_IRQ));
> + rcu_read_unlock();
Same comments as in the nohz patch.
> +
> + irq_lock_sparse();
> +
> + for_each_active_irq(irq) {
> + desc = irq_to_desc(irq);
> + if (!desc || !desc->action)
> + continue;
> +
	for (unsigned int irq = 0; irq < total_nr_irqs; irq++) {
		struct irq_desc *desc;

		scoped_guard(rcu)
			desc = irq_find_desc_at_or_after(irq);
		....
> + /*
> + * Only managed interrupts are selected: they have
> + * IRQF_AFFINITY_MANAGED set, meaning the kernel owns their
> + * affinity. User-controlled IRQs are intentionally skipped.
> + *
> + * When the intersection of the current affinity mask and the
> + * new housekeeping mask is non-empty, re-apply the restricted
> + * affinity to migrate the IRQ away from newly isolated CPUs.
> + * If the intersection is empty (all serving CPUs are now
> + * isolated), the IRQ is left on its current CPU temporarily;
> + * handling that case (IRQ shutdown / re-startup) is left for
> + * a follow-up.
Oh well...
> + */
> + if (irqd_affinity_is_managed(&desc->irq_data)) {
So you set the affinity even on an interrupt which is shutdown?
> + const struct cpumask *mask;
> + struct cpumask *tmp = this_cpu_ptr(&__tmp_mask);
> +
> + raw_spin_lock_irq(&desc->lock);
guard()
> + mask = irq_data_get_affinity_mask(&desc->irq_data);
> + cpumask_and(tmp, mask, hk_mask);
> + if (cpumask_intersects(tmp, cpu_online_mask))
> + irq_do_set_affinity(&desc->irq_data, tmp, false);
That's completely broken. You _cannot_ change the affinity mask of a
managed interrupt. The mask itself is immutable.
The effective affinity can be changed by invoking the affinity setter
with the original unmodified mask. irq_do_set_affinity() already deals
with the housekeeping mask.
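For illustration only, a userspace sketch of that split between the immutable affinity mask and the effective affinity. This is not kernel code: cpumask_model_t and model_effective_mask() are hypothetical names, masks are plain 64-bit words, and the fallback rule is an assumption about how the housekeeping restriction behaves when no online housekeeping CPU is left in the mask.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model only: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

/*
 * Compute the mask that would be programmed for a managed interrupt.
 * 'affinity' is the mask assigned at allocation time. It is only read
 * here, never modified; just the returned effective mask changes when
 * the housekeeping or online mask changes.
 */
static cpumask_model_t model_effective_mask(cpumask_model_t affinity,
					    cpumask_model_t hk_mask,
					    cpumask_model_t online)
{
	cpumask_model_t restricted;

	/* A managed interrupt always has a non-empty affinity mask. */
	assert(affinity != 0);

	restricted = affinity & hk_mask;

	/* Assumed fallback: no online housekeeping CPU, use the full mask. */
	if (!(restricted & online))
		restricted = affinity;

	return restricted & online;
}
```

With affinity 0x0f, housekeeping 0x03 and all CPUs online the result is 0x03. Once the housekeeping set no longer overlaps the affinity mask, the result falls back to the online part of the unmodified mask.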
Also invoking irq_do_set_affinity() directly here is just wrong. It
breaks interrupts which cannot be moved in process context.
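A rough userspace model of why the direct call is a problem, again not kernel code. struct irq_model, model_retarget() and model_handle_irq() are hypothetical names. The assumption is that an interrupt which cannot be moved in process context must have its new target stored and applied later from interrupt context, which a direct low-level call skips.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of an interrupt descriptor. */
struct irq_model {
	uint64_t effective;	/* CPUs the "hardware" currently targets */
	uint64_t pending;	/* target stored for a deferred move */
	bool can_move_pcntxt;	/* safe to retarget from process context? */
	bool move_pending;	/* a deferred move is waiting */
};

/* Request path: retarget directly only if process context is safe. */
static void model_retarget(struct irq_model *irq, uint64_t target)
{
	assert(target != 0);

	if (irq->can_move_pcntxt && !irq->move_pending) {
		irq->effective = target;
	} else {
		/* Defer: remember the target, apply it in the handler. */
		irq->pending = target;
		irq->move_pending = true;
	}
}

/* Interrupt path: a stored move is applied when the interrupt fires. */
static void model_handle_irq(struct irq_model *irq)
{
	if (irq->move_pending) {
		irq->effective = irq->pending;
		irq->move_pending = false;
	}
}
```

In this model an interrupt with can_move_pcntxt == false keeps its old target until model_handle_irq() runs. Writing the target directly from the request path would bypass that deferral.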
But even if that is fixed, then there is zero coordination with the
affected drivers/subsystems. Managed interrupts are related to device
and block queues and you cannot change one without the other. Neither
can you stop managed interrupts without quiescing the related device
queue. Starting them up also requires re-enabling the device queue.
This problem needs to be fixed no matter what. See below.
> +static int irq_hk_validate(enum hk_type type,
> + const struct cpumask *cur_mask,
> + const struct cpumask *new_mask)
> +{
> + if (!IS_ENABLED(CONFIG_SMP))
> + return -EOPNOTSUPP;
> + return 0;
Seriously? Why is this stuff even built when CONFIG_SMP=n?
So these validate callbacks seem to be just another voodoo container for
no value.
While this series might work for you by some definition of "works", it's
broken beyond repair and it's really annoying that I explained all of it
to the other people who try to solve that very same problem. Of course
you did not read any of that, otherwise you would have CC'ed them.
https://lore.kernel.org/lkml/87o6jcb84w.ffs@tglx
Trying to do that without taking the CPUs mostly offline and bringing
them online again is not going to work and there is zero benefit trying
to avoid that. First of all changing the isolation is not a hotpath
operation. Doing it one by one without bringing the CPU completely down
as I outlined in the above linked mail is not much more disruptive than
trying to do all of this on the fly. If you isolate a CPU then the tasks
on that CPU which do not belong to the isolation set need to get off the
CPU anyway. If you unisolate a CPU then it's really not a problem
whether the non-isolated tasks can move on it 10 milliseconds earlier or
later.
If you want to solve all the problems related to NOHZ, managed
interrupts, RCU etc. without the hotplug machinery then you end up
replicating half of it. Don't even try to think about it, that's a
complete waste of time and won't go anywhere.
Fix the few issues which are related to hotplug that I described in the
above linked mail and use the fully correct and tested common code for
your isolation muck. Please coordinate with Waiman or whoever is working
on it at RH right now.
Thanks,
tglx