From: Yury Norov <yury.no...@gmail.com> Sent: Wednesday, May 14, 2025 9:55 AM
> 
> On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > +              bool skip_first_cpu)
> > >  {
> > >   const struct cpumask *next, *prev = cpu_none_mask;
> > >   cpumask_var_t cpus __free(free_cpumask_var);
> > > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs, unsigned 
> > > int len, int node)
> > >           while (weight > 0) {
> > >                   cpumask_andnot(cpus, next, prev);
> > >                   for_each_cpu(cpu, cpus) {
> > > +                         /*
> > > +                          * if the CPU sibling set is to be skipped we
> > > +                          * just move on to the next CPUs without len--
> > > +                          */
> > > +                         if (unlikely(skip_first_cpu)) {
> > > +                                 skip_first_cpu = false;
> > > +                                 goto next_cpumask;
> > > +                         }
> > > +
> > >                           if (len-- == 0)
> > >                                   goto done;
> > > +
> > >                           irq_set_affinity_and_hint(*irqs++, 
> > > topology_sibling_cpumask(cpu));
> > > +next_cpumask:
> > >                           cpumask_andnot(cpus, cpus, 
> > > topology_sibling_cpumask(cpu));
> > >                           --weight;
> > >                   }
> >
> > With a little bit of reordering of the code, you could avoid the need for 
> > the "next_cpumask"
> > label and goto statement.  "continue" is usually cleaner than a "goto". 
> > Here's what I'm thinking:
> >
> >             for_each_cpu(cpu, cpus) {
> >                     cpumask_andnot(cpus, cpus, 
> > topology_sibling_cpumask(cpu));
> >                     --weight;
> 
> cpumask_andnot() is O(N), and before it was conditional on 'len == 0',
> so we didn't do that on the very last step. Your version has to do that.
> Don't know how important that is for real workloads. Shradha maybe can
> measure it...

Yes, there's one extra invocation of cpumask_andnot(). But if the
VM has a small number of CPUs, that extra invocation is negligible.
If the VM has a large number of CPUs, we're already executing
cpumask_andnot() many times, so one extra time is also negligible.
And this whole thing is done only at device initialization time, so
it's not a hot path.

> 
> >
> >                     If (unlikely(skip_first_cpu)) {
> >                             skip_first_cpu = false;
> >                             continue;
> >                     }
> >
> >                     If (len-- == 0)
> >                             goto done;
> >
> >                     irq_set_affinity_and_hint(*irqs++, 
> > topology_sibling_cpumask(cpu));
> >             }
> >
> > I wish there were some comments in irq_setup() explaining the overall 
> > intention of
> > the algorithm. I can see how the goal is to first assign CPUs that are 
> > local to the current
> > NUMA node, and then expand outward to CPUs that are further away. And you 
> > want
> > to *not* assign both siblings in a hyper-threaded core.
> 
> I wrote this function, so let me step in.
> 
> The intention is described in the corresponding commit message:
> 
>   Souradeep investigated that the driver performs faster if IRQs are
>   spread on CPUs with the following heuristics:
> 
>   1. No more than one IRQ per CPU, if possible;
>   2. NUMA locality is the second priority;
>   3. Sibling dislocality is the last priority.
> 
>   Let's consider this topology:
> 
>   Node            0               1
>   Core        0       1       2       3
>   CPU       0   1   2   3   4   5   6   7
> 
>   The most performant IRQ distribution based on the above topology
>   and heuristics may look like this:
> 
>   IRQ     Nodes   Cores   CPUs
>   0       1       0       0-1
>   1       1       1       2-3
>   2       1       0       0-1
>   3       1       1       2-3
>   4       2       2       4-5
>   5       2       3       6-7
>   6       2       2       4-5
>   7       2       3       6-7
> 
> > But I can't figure out what
> > "weight" is trying to accomplish. Maybe this was discussed when the code 
> > first
> > went in, but I can't remember now. :-(
> 
> The weight here is to implement the heuristic discovered by Souradeep:
> NUMA locality is preferred over sibling dislocality.
> 
> The outer for_each() loop resets the weight to the actual number of
> CPUs in the hop. Then inner for_each() loop decrements it by the
> number of sibling groups (cores) while assigning first IRQ to each
> group.
> 
> Now, because NUMA locality is more important, we should walk the
> same set of siblings and assign 2nd IRQ, and it's implemented by the
> medium while() loop. So, we do like this unless the number of IRQs
> assigned on this hop will not become equal to number of CPUs in the
> hop (weight == 0). Then we switch to the next hop and do the same
> thing.
> 
> Hope that helps.

Yes, that helps! So the key to understanding "weight" is that
NUMA locality is preferred over sibling dislocality.

This is a great summary!  All or most of it should go as a
comment describing the function and what it is trying to do.

Michael

Reply via email to