On 2024-09-12 15:30, Bruce Richardson wrote:
On Thu, Sep 12, 2024 at 01:59:34PM +0200, Mattias Rönnblom wrote:
On 2024-09-12 13:17, Varghese, Vipin wrote:

<snipped>
Thank you Mattias for the information. As shared in the reply to
Anatoly, we want to expose a new API `rte_get_next_lcore_ex` which
takes an extra argument `u32 flags`.
The flags can be RTE_GET_LCORE_L1 (SMT), RTE_GET_LCORE_L2,
RTE_GET_LCORE_L3, RTE_GET_LCORE_BOOST_ENABLED,
RTE_GET_LCORE_BOOST_DISABLED.
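
For clarity, a rough sketch of what the proposed prototype could look
like. Only the API and flag names above come from the RFC; the flag
values, return convention and exact signature below are assumptions on
my part:

    #include <stdint.h>

    /* Sketch only -- not part of DPDK today. Flag names are from the RFC
     * in this thread; the values and prototype are assumed here. */
    #define RTE_GET_LCORE_L1             (1u << 0) /* same physical core (SMT) */
    #define RTE_GET_LCORE_L2             (1u << 1) /* same L2 cache */
    #define RTE_GET_LCORE_L3             (1u << 2) /* same L3 cache/tile */
    #define RTE_GET_LCORE_BOOST_ENABLED  (1u << 3) /* boost/turbo capable */
    #define RTE_GET_LCORE_BOOST_DISABLED (1u << 4) /* boost disabled */

    /* Return the next enabled lcore after 'i' that satisfies 'flags',
     * or RTE_MAX_LCORE when the iteration is exhausted. */
    unsigned int rte_get_next_lcore_ex(unsigned int i, uint32_t flags);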

Wouldn't that API be pretty awkward to use?
The current API available in DPDK is `rte_get_next_lcore`, which is used
within DPDK examples and in customer solutions.
Based on the comments from others, we responded to the idea of changing
the new API name from `rte_get_next_lcore_llc` to `rte_get_next_lcore_exntd`.

Can you please help us understand what is `awkward`?
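
For reference, the existing iterator is typically used roughly like
this (a minimal sketch based on the documented rte_get_next_lcore() and
RTE_LCORE_FOREACH_WORKER interface):

    #include <rte_lcore.h>

    unsigned int lcore_id;

    /* Walk all enabled worker lcores, skipping the main lcore. */
    RTE_LCORE_FOREACH_WORKER(lcore_id) {
            /* launch or assign per-lcore work here */
    }

    /* Equivalent open-coded loop using the underlying call:
     * rte_get_next_lcore(i, skip_main, wrap) */
    for (lcore_id = rte_get_next_lcore(-1, 1, 0);
         lcore_id < RTE_MAX_LCORE;
         lcore_id = rte_get_next_lcore(lcore_id, 1, 0)) {
            /* ... */
    }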


The awkwardness starts when you are trying to provide hwloc-type
information over an API that was designed for iterating over lcores.
I disagree with this point. The current implementation of the lcore
library is only focused on iterating through the list of enabled cores,
the core mask, and the lcore map.
With ever-increasing core counts, memory, I/O and accelerators on SoCs,
sub-NUMA partitioning is common in various vendor SoCs. Enhancing or
augmenting the lcore API to extract or provision NUMA and cache topology
is not awkward.

DPDK providing an API for this information makes sense to me, as I've
mentioned before. What I questioned was the way it was done (i.e., the API
design) in your RFC, and the limited scope (which in part you have
addressed).


Actually, I'd like to touch on this first item a little bit. What is the
main benefit of providing this information in EAL? To me, it seems like
something that is for apps to try and be super-smart and select particular
cores out of a set of cores to run on. However, is that not taking work
that should really be the job of the person deploying the app? The deployer
- if I can use that term - has already selected a set of cores and NICs for
a DPDK application to use. Should they not also be the one selecting - via
app argument, via --lcores flag to map one core id to another, or otherwise
- which part of an application should run on what particular piece of
hardware?
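For example, going by the documented EAL --lcores syntax, a deployer can already remap lcore ids onto specific physical CPUs at start-up with something like --lcores='0@8,1@9,2@10,3@11', i.e. run lcores 0-3 pinned to physical CPUs 8-11, without touching the application itself (the CPU numbers here are just placeholders).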


Scheduling in one form or another will happen on a number of levels. One level is what you call the "deployer". Whether man or machine, it will allocate a bunch of lcores to the application - either statically by using -l <cores>, or dynamically, by giving a very large core mask combined with having an agent in the app responsible for scaling up or down the number of cores actually used (allowing coexistence with other non-DPDK, Linux process scheduler-scheduled processes on the same set of cores, although not at the same time).

I think the "deployer" level should generally not be aware of the DPDK app internals, including how to assign different tasks to different cores. That is consistent with how things work in a general-purpose operating system, where you allocate cores, memory and I/O devices to an instance (e.g., a VM), but then OS' scheduler figures out how to best use them.

The app internals may be complicated, may change across software versions and traffic mixes/patterns, and, most of all, may not lend themselves to static at-start configuration at all.

In summary, what is the final real-world intended use case for this work?

One real-world example is an Eventdev app with some atomic processing stage, using DSW, and SMT. Hardware threading on Intel x86 generally improves performance by ~25%, which seems to hold true for data plane apps as well, in my experience. So that's a (not-so-)freebie you don't want to miss out on. To max out single-flow performance, the work scheduler may not only need to give 100% of an lcore to the bottleneck stage's atomic processing for that elephant flow, but a *full* physical core (i.e., assure that the SMT sibling is idle). But DSW doesn't understand the CPU topology, so you have to choose between max multi-flow throughput and max single-flow throughput at the time of deployment. An RTE hwtopo API would certainly help in the implementation of SMT-aware scheduling.
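
Today, without such an API, an application that wants to make this kind of decision has to go to the OS itself. A minimal sketch, assuming Linux and its sysfs CPU topology layout (thread_siblings_list is the traditional file name; error handling kept minimal):

    #include <stdio.h>

    /* Sketch only: parse Linux sysfs to learn which hardware threads
     * share a physical core with 'cpu'. Linux-specific. */
    static int
    get_smt_siblings(unsigned int cpu, char *buf, size_t len)
    {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path),
                "/sys/devices/system/cpu/cpu%u/topology/thread_siblings_list",
                cpu);
            f = fopen(path, "r");
            if (f == NULL)
                    return -1;
            if (fgets(buf, len, f) == NULL) {
                    fclose(f);
                    return -1;
            }
            fclose(f);
            return 0; /* buf now holds e.g. "2,66\n" */
    }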

Another example could be the use of bigger or turbo-capable cores to run CPU-hungry, singleton services (e.g., an Eventdev RX timer adapter core), or the use of a hardware thread to run the SW scheduler service (which needs to react quickly to incoming scheduling events, but maybe does not need all the cycles of a full physical core).

Yet another example would be an event device which understands how to spread a particular flow across multiple cores, but uses only cores sharing the same L2. Or, keeping processing of a certain kind (e.g., a certain Eventdev Queue) only on cores with the same L2, to improve L2 hit rates for instructions and data related to that processing stage.
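
With the flags-based iterator proposed earlier in the thread, such an L2-aware grouping could be expressed roughly as below. This is purely hypothetical: the iteration semantics of rte_get_next_lcore_ex() are not defined in the RFC, and add_lcore_to_stage()/ATOMIC_STAGE are made-up app-level helpers used only for illustration:

    /* Hypothetical sketch: collect the enabled lcores sharing the L2
     * cache of 'anchor' and dedicate them to one processing stage.
     * Assumed semantics: the call returns the next enabled lcore sharing
     * L2 with the given one, or RTE_MAX_LCORE (or wraps back to the
     * starting lcore) when exhausted. */
    static void
    assign_stage_to_l2_domain(unsigned int anchor)
    {
            unsigned int lcore = anchor;

            do {
                    add_lcore_to_stage(lcore, ATOMIC_STAGE); /* app-specific */
                    lcore = rte_get_next_lcore_ex(lcore, RTE_GET_LCORE_L2);
            } while (lcore < RTE_MAX_LCORE && lcore != anchor);
    }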

DPDK already tries to be smart about cores and NUMA, and in some cases we
have hit issues where users have - for their own valid reasons - wanted to
run DPDK in a sub-optimal way, and they end up having to fight DPDK's
smarts in order to do so! Ref: [1]

/Bruce

[1] 
https://git.dpdk.org/dpdk/commit/?id=ed34d87d9cfbae8b908159f60df2008e45e4c39f
