> -----Original Message-----
> From: Michael Kelley <mhkli...@outlook.com>
> Sent: Tuesday, January 9, 2024 2:23 PM
> To: Souradeep Chakrabarti <schakraba...@linux.microsoft.com>; KY Srinivasan
> <k...@microsoft.com>; Haiyang Zhang <haiya...@microsoft.com>;
> wei....@kernel.org; Dexuan Cui <de...@microsoft.com>;
> da...@davemloft.net; eduma...@google.com; k...@kernel.org;
> pab...@redhat.com; Long Li <lon...@microsoft.com>; yury.no...@gmail.com;
> l...@kernel.org; cai.huoq...@linux.dev; ssen...@linux.microsoft.com;
> vkuzn...@redhat.com; t...@linutronix.de; linux-hyperv@vger.kernel.org;
> net...@vger.kernel.org; linux-ker...@vger.kernel.org;
> linux-r...@vger.kernel.org
> Cc: Souradeep Chakrabarti <schakraba...@microsoft.com>; Paul Rosswurm
> <paul...@microsoft.com>
> Subject: RE: [PATCH 3/4 net-next] net: mana: add a function to spread IRQs per
> CPUs
> 
> From: Souradeep Chakrabarti <schakraba...@linux.microsoft.com>
> Sent: Tuesday, January 9, 2024 2:51 AM
> >
> > From: Yury Norov <yury.no...@gmail.com>
> >
> > Souradeep found that the driver performs better if IRQs are
> > spread across CPUs using the following heuristics:
> >
> > 1. No more than one IRQ per CPU, if possible;
> > 2. NUMA locality is the second priority;
> > 3. Sibling dislocality is the last priority.
> >
> > Let's consider this topology:
> >
> > Node            0               1
> > Core        0       1       2       3
> > CPU       0   1   2   3   4   5   6   7
> >
> > The most performant IRQ distribution based on the above topology
> > and heuristics may look like this:
> >
> > IRQ     Nodes   Cores   CPUs
> > 0       0       0       0-1
> > 1       0       1       2-3
> > 2       0       0       0-1
> > 3       0       1       2-3
> > 4       1       2       4-5
> > 5       1       3       6-7
> > 6       1       2       4-5
> > 7       1       3       6-7
> 
> I didn't pay attention to the detailed discussion of this issue
> over the past 2 to 3 weeks during the holidays in the U.S., but
> the above doesn't align with the original problem as I understood
> it.  I thought the original problem was to avoid putting IRQs on
> both hyper-threads in the same core, and that the perf
> improvements are based on that configuration.  At least that's
> what the commit message for Patch 4/4 in this series says.
> 
> The above chart results in 8 IRQs being assigned to the 8 CPUs,
> probably with 1 IRQ per CPU. At least on x86, if the affinity
> mask for an IRQ contains multiple CPUs, matrix_find_best_cpu()
> should balance the IRQ assignments between the CPUs in the mask.
> So the original problem is still present because both hyper-threads
> in a core are likely to have an IRQ assigned.
> 
> Of course, this example has 8 IRQs and 8 CPUs, so assigning an
> IRQ to every hyper-thread may be the only choice.  If that's the
> case, maybe this just isn't a good example to illustrate the
> original problem and solution.  But even with a better example
> where the # of IRQs is <= half the # of CPUs in a NUMA node,
> I don't think the code below accomplishes the original intent.
> 
> Maybe I've missed something along the way in getting to this
> version of the patch.  Please feel free to set me straight. :-)
> 
> Michael

I have the same question as Michael. I'm also asking Souradeep
in another channel: the algorithm still uses up all the CPUs in the
current NUMA node before moving on to the next NUMA node, right?

The difference is that each IRQ is affinitized to 2 CPUs.
For example, on a system with 2 IRQs:
IRQ     Nodes   Cores  CPUs
0       0       0      0-1
1       0       1      2-3
 
Does this perform better than the algorithm in earlier patches, shown below?
IRQ     Nodes   Cores  CPUs
0       0       0      0
1       0       1      2

Thanks,
- Haiyang

