On 26/09/2022 10:37, Konstantin Ananyev wrote:
Hi Kevin,
Currently, there is no way to measure lcore poll busyness passively, without
any modifications to the application. This patch adds a new EAL API that can
passively track core polling busyness.
The poll busyness is calculated by relying on the fact that most DPDK APIs
will poll for work (packets, completions, eventdev events, etc.). Empty
polls can be counted as "idle", while non-empty polls can be counted as
"busy". To measure lcore poll busyness, we simply call the telemetry
timestamping function with the number of polls a particular code section
has processed, and count the number of cycles spent processing empty
bursts. The more empty bursts we encounter, the fewer cycles we spend in
the "busy" state, and the lower the reported core poll busyness.
In order for all of the above to work without modifications to the
application, the library code needs to be instrumented with calls to the
lcore telemetry busyness timestamping function. The following parts of DPDK
are instrumented with lcore poll busyness timestamping calls:
- All major driver APIs:
- ethdev
- cryptodev
- compressdev
- regexdev
- bbdev
- rawdev
- eventdev
- dmadev
- Some additional libraries:
- ring
- distributor
To avoid performance impact from having lcore telemetry support, a global
variable is exported by EAL, and the call to the timestamping function is
wrapped in a macro, so that whenever telemetry is disabled, it only costs one
additional branch and no function calls are performed. It is disabled at
compile time by default.
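The wrapping pattern looks roughly like this (a sketch:
RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n) is the macro added by this patchset,
while the global variable name and build flag shown here are illustrative
assumptions):

/* global flag exported by EAL; name is illustrative */
extern int __rte_lcore_telemetry_enabled;

#ifdef RTE_LCORE_POLL_BUSYNESS /* illustrative meson-controlled flag */
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n) do {        \
	if (__rte_lcore_telemetry_enabled)                \
		lcore_poll_busyness_timestamp(n);         \
} while (0)
#else
/* compiled out entirely: no branch, no function call */
#define RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n) do { } while (0)
#endif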
This patch also adds a telemetry endpoint to report lcore poll busyness, as
well as telemetry endpoints to enable/disable lcore telemetry. A
documentation entry has been added to the howto guides to explain the usage
of the new telemetry endpoints and API.
As was already mentioned by other reviewers, it would be much better
to let the application itself decide when it is idle and when it is busy.
With the current approach, even for a constantly polling run-to-completion
model, there are plenty of opportunities to get things wrong and produce
misleading statistics.
My special concern is inserting it into the ring dequeue code.
Rings are used for various different things, not only to pass packets between
threads (mempool, etc.).
Blindly assuming that an empty ring dequeue means idle cycles seems wrong
to me, which makes me wonder: should we really hard-code these calls into
DPDK core functions?
If you would still like to introduce such stats, it might be better to
implement them via the callback mechanism.
As I remember, nearly all our drivers (net, crypto, etc.) do support it.
That way our generic code will remain unaffected, plus the user will have the
ability to enable/disable it on a per-device basis.
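For instance, for ethdev such collection could hang off the existing RX
callback API. A rough sketch (rte_eth_add_rx_callback() is the existing
ethdev API; the other names are illustrative and reuse the accounting
function sketched in the cover letter above):

#include <rte_ethdev.h>

static uint16_t
busyness_rx_cb(uint16_t port __rte_unused, uint16_t queue __rte_unused,
		struct rte_mbuf **pkts __rte_unused, uint16_t nb_pkts,
		uint16_t max_pkts __rte_unused, void *user_param __rte_unused)
{
	lcore_poll_busyness_timestamp(nb_pkts);
	return nb_pkts; /* RX callbacks must return the packet count */
}

static void
enable_busyness_stats(uint16_t port_id, uint16_t queue_id)
{
	/* registered per device/queue, so it can be enabled selectively */
	rte_eth_add_rx_callback(port_id, queue_id, busyness_rx_cb, NULL);
}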
Thanks for your feedback, Konstantin.
You are right in saying that this approach won't be 100% suitable for
all use cases, but it should be suitable for the majority of applications.
First of all - could you explain how you measured what the 'majority' of
DPDK applications is?
And how did you conclude that it definitely works for all the apps in that
'majority'?
Second, what bothers me with that approach - I don't see a clear and
deterministic way
for the user to understand whether these stats will work properly for his app
or not (except manually analyzing his app code).
All of the DPDK example applications we've tested with (l2fwd, l3fwd + friends,
testpmd, distributor, dmafwd) report lcore poll busyness and respond to
changing traffic rates etc. We've also compared the reported busyness to
similar metrics reported by other projects such as VPP and OvS, and found the
reported busyness matches with a difference of +/- 1%. In addition to the DPDK
example applications, we have shared our plans with end customers and they
have confirmed that the design should work with their applications.
It's worth keeping in mind that this feature is compile-time disabled by
default, so there is no impact to any application/user that does not
wish to use it, for example applications where this type of busyness
is not useful, or applications that already use other mechanisms to
report similar telemetry.
I am not sure that adding a new compile-time option, disabled by default, is a
good thing...
For me it would be much more preferable if we went through a more 'standard'
way here:
a) define a clear API to enable/disable/collect/report such type of stats.
b) use some of our sample apps to demonstrate how to use it properly with
user-specific code.
c) if needed, implement some 'silent' stats collection for a limited scope of
apps via callbacks -
let's say for run-to-completion apps that use ether and crypto devs only.
With the compile-time option, it's just one build flag for lots of applications
to silently benefit from this.
However, the upside for applications that do
wish to use this is that there are no code changes required (for the
most part); the feature simply needs to be enabled at compile time via
the meson option.
In scenarios where contextual awareness of the application is needed in
order to report more accurate "busyness", the
"RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(n)" macro can be used to mark
sections of code as "busy" or "idle". This way, the application can
assume control of determining the poll busyness of its lcores while
leveraging the telemetry hooks added in this patchset.
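For example, a run-to-completion loop could mark its polls like this
(illustrative sketch: quit and process_packets() are application-specific
placeholders):

static void
lcore_main_loop(uint16_t port, uint16_t queue)
{
	struct rte_mbuf *bufs[32];

	while (!quit) {
		uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, 32);

		/* 0 marks the cycles since the last poll as idle, >0 as busy */
		RTE_LCORE_POLL_BUSYNESS_TIMESTAMP(nb_rx);

		process_packets(bufs, nb_rx); /* app-specific work */
	}
}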
We did initially consider implementing this via callbacks; however, we
found this approach to have two main drawbacks:
1. Application changes are required for all applications wanting to
report this telemetry - rather than the majority getting it for free.
I didn't get it - why would the callbacks approach require user-app changes?
In other situations - rte_power callbacks, pdump, etc. - it works transparently
to user-level code.
Why can't it be done here in a similar way?
From my understanding, the callbacks would need to be registered by the
application at the very least (and the callback would have to be registered per
device/pmd/lib).
2. Ring does not have callback support, meaning pipelined applications
could not report lcore poll busyness telemetry with this approach.
That's another big concern that I have:
why do you consider that all rings will be used for pipelines between threads
and should
always be accounted for by your stats?
They could be used for dozens of different purposes.
What if that ring is used for a mempool, and ring_dequeue() just means we are
trying to allocate
an object from the pool? In such a case, why should failing to allocate an
object mean
the start of a new 'idle' cycle?
Another approach could be taken here if the mempool interactions are of concern.
From our understanding, mempool operations use the "_bulk" APIs, whereas
polling operations use the "_burst" APIs. Would timestamping only on the
"_burst" APIs be better here? That way the mempool interactions won't be
counted towards the busyness.
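For reference, the two ring dequeue flavours as they exist today (r is
assumed to be an existing struct rte_ring *; the burst size is illustrative):

void *objs[32];
unsigned int avail, got;

/* all-or-nothing: returns 32 objects or none (mempool-style usage) */
got = rte_ring_dequeue_bulk(r, objs, 32, &avail);

/* best-effort: returns 0..32 objects (typical pipeline polling) */
got = rte_ring_dequeue_burst(r, objs, 32, &avail);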
Including support for pipelined applications using rings is key for a number of
use cases; this was highlighted as part of the customer feedback when we shared
the design.
Eventdev is another driver which would be completely missed with this
approach.
Ok, I see two ways here:
- implement CB support for eventdev (see the sketch below);
- meanwhile, clearly document that these stats are not supported for eventdev
scenarios (yet).
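For the first option, something along these lines would be needed (purely
hypothetical - no such eventdev API exists today; both names below are made
up for illustration):

/* hypothetical dequeue callback, mirroring the ethdev RX callback shape */
typedef uint16_t (*rte_event_dequeue_cb_fn)(uint8_t dev_id, uint8_t port_id,
		struct rte_event ev[], uint16_t nb_events, void *user_param);

/* hypothetical registration, per device/port */
int rte_event_add_dequeue_callback(uint8_t dev_id, uint8_t port_id,
		rte_event_dequeue_cb_fn fn, void *user_param);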