On 2022-10-10 17:22, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:mattias.ronnb...@ericsson.com]
Sent: Thursday, 6 October 2022 17.27

On 2022-10-06 15:25, Morten Brørup wrote:
From: Kevin Laatz [mailto:kevin.la...@intel.com]
Sent: Wednesday, 5 October 2022 15.45

On 14/09/2022 10:29, Kevin Laatz wrote:
Currently, there is no way to measure lcore polling busyness in a passive
way, without any modifications to the application. This patchset adds a new
EAL API that will be able to passively track core polling busyness. As part
of the set, new telemetry endpoints are added to read the generated metrics.

---

Based on the feedback in the discussions on this patchset, we have
decided to revoke the submission of this patchset for the 22.11 release.

We will re-evaluate the design with the aim to provide a more acceptable
solution in a future release.

Good call. Thank you!

I suggest having an open discussion about requirements/expectations
for such a solution, before you implement any code.

We haven't found the golden solution for our application, but we have
discussed it quite a lot internally. Here are some of our thoughts:

The application must feed the library with information about how much
work it is doing.

E.g. a pipeline stage that polls the NIC for N ingress packets could
feed the busyness library with values such as:
   - "no work": zero packets received,
   - "25 % utilization": less than N packets received (in this example:
     8 of max 32 packets = 25 %), or
   - "100% utilization, possibly more work to do": all N packets received
     (more packets could be ready in the queue, but we don't know).
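To make the idea concrete, a minimal sketch of what such reporting could look like in an RX stage is shown below; busyness_report() and its parameters are hypothetical, invented purely for illustration, and not an existing DPDK API:

#include <stdbool.h>
#include <stdint.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Hypothetical reporting hook, not an existing DPDK API: the application
 * tells the busyness library how "full" the last poll was, plus a hint on
 * whether more work may be pending. */
extern void busyness_report(uint16_t nb_done, uint16_t nb_max, bool maybe_more);

static void
rx_stage_poll(uint16_t port_id, uint16_t queue_id)
{
	struct rte_mbuf *pkts[BURST_SIZE];
	uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id, pkts, BURST_SIZE);

	/* 0/32 -> "no work", 8/32 -> "25 % utilization",
	 * 32/32 -> "100%, possibly more work to do". */
	busyness_report(nb_rx, BURST_SIZE, nb_rx == BURST_SIZE);

	/* ... hand the packets over to the next pipeline stage ... */
}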


If some lcore's NIC RX queue always, for every poll operation, produces
8 packets out of a max burst of 32, I would argue that lcore is 100%
busy. With always something to do, it doesn't have a single cycle to
spare.

I would argue that if I have four cores each only processing 25 % of the 
packets, one core would suffice instead. Or, the application could schedule the 
function at 1/4 of the frequency it does now (e.g. call the function once every 
40 microseconds instead of once every 10 microseconds).


Do you mean "only processing packets 25% of the time"? If yes, being able to replace four core @ 25% utilization with one core @ 100% might be a reasonable first guess. I'm not sure how it relates to what I wrote, though.

However, the busyness does not scale linearly with the number of packets
processed - which is an intended benefit of bursting.


Sure, there's usually a non-linear relationship between the system capacity used and the resulting CPU utilization. It can be in the manner you describe below, with the per-packet processing latency reduced at higher rates, or the other way around. For example, NIC RX LLC stashing may cause a lot of LLC evictions, and generally the application might have a larger working set size during high load, so there may be forces working in the other direction as well.

It seems to me "busyness" telemetry value should just be lcore thread CPU utilization (in total, or with some per-module breakdown). If you want to know how much of the system's capacity is used, you need help from an application-specific agent, equipped with a model of how CPU utilization and capacity relates. Such a heuristic could take other factors into account as well, e.g. the average queue sizes, packet rates, packet sizes etc.

In my experience, for high-touch applications (i.e., those that spend thousands of cycles per packet), CPU utilization is a pretty decent approximation of how much of the system's capacity is used.

Here are some real life numbers from our in-house profiler library in a 
production environment, which says that polling the NIC for packets takes on 
average:

104 cycles when the NIC returns 0 packets,
529 cycles when the NIC returns 1 packet,
679 cycles when the NIC returns 8 packets, and
1275 cycles when the NIC returns a full burst of 32 packets.

(This includes some overhead from our application, so you will see other 
numbers in your application.)


It seems to me that you basically have two options, if you do
application-level "busyness" reporting.

Either the application
a) reports when a section of useful work begins, and when it ends, as
two separate function calls, or
b) takes a time stamp, completes a section of code which turned out to
be something useful, and then reports back to the busyness module with
a single function call containing the busy cycles spent.

In a), the two calls could be to the same function, with a boolean
argument informing the busyness module whether this is the beginning of
a busy or an idle period. In that case, just pass "num_pkts_dequeued > 0"
to the call.
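A rough sketch of both variants, with made-up names (neither busyness_mark() nor busyness_add_busy_cycles() exists anywhere; poll_some_source() stands in for the application's own work source):

#include <stdbool.h>
#include <stdint.h>
#include <rte_cycles.h>

/* Stand-in for the application's own work source (RX queue, timer wheel, ...). */
extern uint16_t poll_some_source(void);

/* Hypothetical option a): a single function marks both transitions; the
 * argument says whether a busy or an idle period begins now. */
extern void busyness_mark(bool busy_period_begins);

/* Hypothetical option b): the application keeps the timestamp itself and
 * reports completed busy sections after the fact. */
extern void busyness_add_busy_cycles(uint64_t cycles);

static void
poll_loop_iteration(void)
{
	/* Option a): */
	uint16_t nb = poll_some_source();
	busyness_mark(nb > 0);

	/* Option b): */
	uint64_t start = rte_rdtsc();
	nb = poll_some_source();
	if (nb > 0)
		busyness_add_busy_cycles(rte_rdtsc() - start);
}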

Our profiler library has a start() and an end() function, and an end_and_start()
function for when a section directly follows the preceding section (to only
take one timestamp instead of two).
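For illustration, the combined call could look roughly like the sketch below; the names mirror the description above, and the per-lcore state handling is simplified (this is only an approximation of the idea, not the actual in-house library):

#include <stdint.h>
#include <rte_cycles.h>

/* Hypothetical per-section accumulator. */
struct prof_section {
	uint64_t cycles;
};

static uint64_t prof_last_tsc;

static inline void
prof_start(void)
{
	prof_last_tsc = rte_rdtsc();
}

static inline void
prof_end(struct prof_section *s)
{
	s->cycles += rte_rdtsc() - prof_last_tsc;
}

/* Ends the previous section and starts the next one using a single
 * timestamp, rather than one rte_rdtsc() for the end and another for
 * the following start. */
static inline void
prof_end_and_start(struct prof_section *prev)
{
	uint64_t now = rte_rdtsc();

	prev->cycles += now - prof_last_tsc;
	prof_last_tsc = now;
}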


I like the idea of an end_and_start() (except for the name, maybe).


What you would like is a solution which avoids ping-pong between idle
and busy states (with the resulting time stamping and computations) in
scenarios where an lcore thread mixes sources of work which often have
items available, with sources that do not (e.g., packets in an RX queue
versus reassembly timeouts in a core-local timer wheel). It would be
better, in that situation, to attribute the timer wheel poll cycles as
busy cycles.

Another crucial aspect is that you want the API to be simple, and code
changes to be minimal.

It's unclear to me if you need to account for both idle and busy cycles,
or only busy cycles, and assume all other cycles are idle. The latter will
work for a traditional 1:1 EAL thread <-> CPU core mapping, but not if the
"--lcores" parameter is used to create floating EAL threads, or EAL
threads which share the same core, and thus may not be able to use 100%
of the TSC cycles.
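To make the distinction concrete, here is a small sketch (no existing API implied; the counters are assumed to come from whatever the busyness library records):

#include <stdbool.h>
#include <stdint.h>

/* busy_cycles and idle_cycles are whatever the busyness library has
 * accumulated for the lcore; tsc_interval is the TSC delta over the
 * measurement period. */
static double
lcore_utilization(uint64_t busy_cycles, uint64_t idle_cycles,
		  uint64_t tsc_interval, bool pinned_1_to_1)
{
	if (pinned_1_to_1) {
		/* With a dedicated core, only busy cycles need to be
		 * recorded; everything else in the interval was idle. */
		return (double)busy_cycles / tsc_interval;
	}

	/* Floating lcores (--lcores) or EAL threads sharing a core: busy
	 * and idle must both be measured explicitly, since part of the
	 * TSC interval may have been consumed by other threads. */
	return (double)busy_cycles / (busy_cycles + idle_cycles);
}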

A pipeline stage that services a QoS scheduler could additionally
feed the library with values such as:
   - "100% utilization, definitely more work to do": stopped processing
     due to some "max work per call" limitation.
   - "waiting, no work until [DELAY] ns": current timeslot has been
     filled, waiting for the next timeslot to start.
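Such hints could be captured by something like the following hypothetical report structure (none of these names exist in DPDK; it only illustrates that the report is richer than a single percentage):

#include <stdint.h>

/* Hypothetical report passed from a pipeline stage to the busyness
 * library; none of this exists in DPDK today. */
enum stage_work_hint {
	STAGE_NO_WORK,          /* polled, nothing to do */
	STAGE_PARTIAL,          /* did some work, source drained */
	STAGE_FULL_MAYBE_MORE,  /* full burst, source state unknown */
	STAGE_FULL_MORE_WORK,   /* stopped due to a "max work per call" cap */
	STAGE_WAITING,          /* no work until wait_ns has elapsed */
};

struct stage_report {
	enum stage_work_hint hint;
	uint16_t nb_done;  /* objects handled in this call */
	uint16_t nb_max;   /* the stage's own maximum, i.e. its "N" */
	uint64_t wait_ns;  /* only meaningful for STAGE_WAITING */
};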

It is important to note that any pipeline stage processing packets
(or some other objects!) might process a different maximum number of
objects than the ingress pipeline stage. What I mean is: The number N
might not be the same for all pipeline stages.


The information should be collected per lcore or thread, also to
prevent cache trashing.

Additionally, it could be collected per pipeline stage, making
the collection two-dimensional. This would essentially make it a
profiling library, where you - in addition to seeing how much time is
spent working - can also see which work the time is spent on.
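A minimal sketch of such a two-dimensional layout, using cache-line-aligned per-lcore slots so that lcores never update a shared cache line (MAX_STAGES, the struct layout and record_busy() are all made up for illustration):

#include <stdint.h>
#include <rte_common.h>
#include <rte_lcore.h>

#define MAX_STAGES 8   /* hypothetical number of pipeline stages */

struct stage_stats {
	uint64_t busy_cycles;
	uint64_t calls;
};

/* One cache-line-aligned slot per lcore, each holding per-stage counters. */
struct lcore_stats {
	struct stage_stats stage[MAX_STAGES];
} __rte_cache_aligned;

static struct lcore_stats stats[RTE_MAX_LCORE];

static inline void
record_busy(unsigned int stage_id, uint64_t cycles)
{
	struct stage_stats *s = &stats[rte_lcore_id()].stage[stage_id];

	s->busy_cycles += cycles;
	s->calls++;
}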


If you introduce subcategories of "busy", like "busy-with-X", and
"busy-with-Y", the book keeping will be more expensive, since you will
transit between states even for 100% busy lcores (which in principle
you
never, or at least very rarely, need to do if you have only busy and
idle as states).

If your application is organized as DPDK services, you will get this
already today, on a DPDK service level.

If you have your application organized as a pipeline, and you use an
event device as a scheduler between the stages, that event device has a
good opportunity to do this kind of bookkeeping. DSW, for example, keeps
track of the average processing latency for events, and how many events
of various types have been processed.


Lots of good input, Mattias. Let's see what others suggest. :-)

As mentioned during the previous discussions, APIs should be provided
to make the collected information machine readable, so the application
can use it for power management and other purposes.

One of the simple things I would like to be able to extract from such
a library is CPU Utilization (percentage) per lcore.

And since I want the CPU Utilization to be shown for multiple time
intervals (usually 1, 5 or 15 minutes; but perhaps also 1 second
or 1 millisecond) the output data should be exposed as a counter type,
so my "loadavg application" can calculate the rate by subtracting the
previously obtained value from the current value and dividing the
difference by the time interval.
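For example, a monitoring loop could derive the utilization over an interval from two readings of such a monotonic counter; lcore_busy_cycles() below is a placeholder for whatever getter such a library might expose:

#include <stdint.h>
#include <rte_cycles.h>

/* Placeholder for the library's per-lcore busy-cycle counter getter;
 * the name is made up. */
extern uint64_t lcore_busy_cycles(unsigned int lcore_id);

/* Returns the lcore's CPU utilization (0.0 - 1.0) over the period since
 * the previous sample, loadavg-style: rate = delta(busy) / delta(time). */
static double
sample_utilization(unsigned int lcore_id, uint64_t *prev_busy, uint64_t *prev_tsc)
{
	uint64_t busy = lcore_busy_cycles(lcore_id);
	uint64_t now = rte_rdtsc();
	double util = (double)(busy - *prev_busy) / (double)(now - *prev_tsc);

	*prev_busy = busy;
	*prev_tsc = now;

	return util;
}

Dividing the same cycle deltas by rte_get_tsc_hz() would give busy time in seconds, if wall-clock figures are preferred.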


I agree. In addition, you also want the "raw data" (lcore busy cycles),
so you can do your own sampling, at your own favorite-length intervals.

-Morten

