On Wed, May 14, 2025 at 05:26:45PM +0000, Michael Kelley wrote:
> From: Yury Norov <yury.no...@gmail.com> Sent: Wednesday, May 14, 2025 9:55 AM
> >
> > On Wed, May 14, 2025 at 04:53:34AM +0000, Michael Kelley wrote:
> > > > -static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > > +static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > +		     bool skip_first_cpu)
> > > >  {
> > > >  	const struct cpumask *next, *prev = cpu_none_mask;
> > > >  	cpumask_var_t cpus __free(free_cpumask_var);
> > > > @@ -1303,9 +1304,20 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node)
> > > >  	while (weight > 0) {
> > > >  		cpumask_andnot(cpus, next, prev);
> > > >  		for_each_cpu(cpu, cpus) {
> > > > +			/*
> > > > +			 * if the CPU sibling set is to be skipped we
> > > > +			 * just move on to the next CPUs without len--
> > > > +			 */
> > > > +			if (unlikely(skip_first_cpu)) {
> > > > +				skip_first_cpu = false;
> > > > +				goto next_cpumask;
> > > > +			}
> > > > +
> > > >  			if (len-- == 0)
> > > >  				goto done;
> > > > +
> > > >  			irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > > +next_cpumask:
> > > >  			cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > > >  			--weight;
> > > >  		}
> > >
> > > With a little bit of reordering of the code, you could avoid the need
> > > for the "next_cpumask" label and goto statement. "continue" is usually
> > > cleaner than a "goto". Here's what I'm thinking:
> > >
> > > 	for_each_cpu(cpu, cpus) {
> > > 		cpumask_andnot(cpus, cpus, topology_sibling_cpumask(cpu));
> > > 		--weight;
> >
> > cpumask_andnot() is O(N), and before it was conditional on 'len == 0',
> > so we didn't do that on the very last step. Your version has to do that.
> > Don't know how important that is for real workloads. Shradha maybe can
> > measure it...
>
> Yes, there's one extra invocation of cpumask_andnot(). But if the
> VM has a small number of CPUs, that extra invocation is negligible.
> If the VM has a large number of CPUs, we're already executing
> cpumask_andnot() many times, so one extra time is also negligible.
> And this whole thing is done only at device initialization time, so
> it's not a hot path.
>
Hi Michael, Yury,

That's right, the overhead is negligible. Tested with some common
workloads. I will change this in the next version.

Shradha.

> > >
> > > 		if (unlikely(skip_first_cpu)) {
> > > 			skip_first_cpu = false;
> > > 			continue;
> > > 		}
> > >
> > > 		if (len-- == 0)
> > > 			goto done;
> > >
> > > 		irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
> > > 	}
> > >
> > > I wish there were some comments in irq_setup() explaining the overall
> > > intention of the algorithm. I can see how the goal is to first assign
> > > CPUs that are local to the current NUMA node, and then expand outward
> > > to CPUs that are further away. And you want to *not* assign both
> > > siblings in a hyper-threaded core.
> >
> > I wrote this function, so let me step in.
> >
> > The intention is described in the corresponding commit message:
> >
> >	Souradeep investigated that the driver performs faster if IRQs are
> >	spread on CPUs with the following heuristics:
> >
> >	1. No more than one IRQ per CPU, if possible;
> >	2. NUMA locality is the second priority;
> >	3. Sibling dislocality is the last priority.
> >
> >	Let's consider this topology:
> >
> >	Node            0               1
> >	Core        0       1       2       3
> >	CPU       0   1   2   3   4   5   6   7
> >
> >	The most performant IRQ distribution based on the above topology
> >	and heuristics may look like this:
> >
> >	IRQ     Node    Core    CPUs
> >	0       0       0       0-1
> >	1       0       1       2-3
> >	2       0       0       0-1
> >	3       0       1       2-3
> >	4       1       2       4-5
> >	5       1       3       6-7
> >	6       1       2       4-5
> >	7       1       3       6-7
> >
> > > But I can't figure out what "weight" is trying to accomplish. Maybe
> > > this was discussed when the code first went in, but I can't remember
> > > now. :-(
> >
> > The weight here implements the heuristic discovered by Souradeep:
> > NUMA locality is preferred over sibling dislocality.
> >
> > The outer for_each() loop resets the weight to the actual number of
> > CPUs in the hop. Then the inner for_each() loop decrements it by the
> > number of sibling groups (cores) while assigning the first IRQ to
> > each group.
> >
> > Now, because NUMA locality is more important, we walk the same set of
> > siblings again and assign the 2nd IRQ to each core; that's what the
> > middle while() loop implements. We keep doing this until the number of
> > IRQs assigned on this hop equals the number of CPUs in the hop
> > (weight == 0). Then we switch to the next hop and do the same thing.
> >
> > Hope that helps.
>
> Yes, that helps! So the key to understanding "weight" is that
> NUMA locality is preferred over sibling dislocality.
>
> This is a great summary! All or most of it should go as a
> comment describing the function and what it is trying to do.
>
> Michael
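
For reference, below is a sketch of how irq_setup() might read with
Michael's restructuring applied and Yury's summary folded in as the
requested comment. The hop-walking scaffolding here (the
for_each_numa_hop_mask() iteration, the weight initialization via
cpumask_weight_andnot(), and the RCU locking) is not visible in the
quoted hunks and is assumed from context, so treat this as an
illustration of the heuristic rather than the actual patch:

/*
 * Spread IRQs over the CPUs reachable from 'node', nearest hops first,
 * with the following priorities (from the original commit message):
 *
 *  1. No more than one IRQ per CPU, if possible;
 *  2. NUMA locality is the second priority;
 *  3. Sibling dislocality is the last priority.
 *
 * 'weight' is the number of CPUs the current hop adds. Each pass of
 * the inner loop assigns one IRQ per sibling group (core) and
 * decrements 'weight' once per group; the middle loop repeats passes
 * over the same hop until as many IRQs are assigned there as the hop
 * has CPUs, so NUMA locality wins over sibling dislocality.
 */
static int irq_setup(unsigned int *irqs, unsigned int len, int node,
		     bool skip_first_cpu)
{
	const struct cpumask *next, *prev = cpu_none_mask;
	cpumask_var_t cpus __free(free_cpumask_var);
	int cpu, weight;

	if (!alloc_cpumask_var(&cpus, GFP_KERNEL))
		return -ENOMEM;

	rcu_read_lock();
	for_each_numa_hop_mask(next, node) {
		weight = cpumask_weight_andnot(next, prev);
		while (weight > 0) {
			cpumask_andnot(cpus, next, prev);
			for_each_cpu(cpu, cpus) {
				cpumask_andnot(cpus, cpus,
					       topology_sibling_cpumask(cpu));
				--weight;

				/*
				 * Consume the first sibling group without
				 * assigning it an IRQ (the skip_first_cpu
				 * case added by the patch under review).
				 */
				if (unlikely(skip_first_cpu)) {
					skip_first_cpu = false;
					continue;
				}

				if (len-- == 0)
					goto done;

				irq_set_affinity_and_hint(*irqs++,
						topology_sibling_cpumask(cpu));
			}
		}
		prev = next;
	}
done:
	rcu_read_unlock();
	return 0;
}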