On 2022-11-03 09:59, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Wednesday, 2 November 2022 18.53
On 2022-11-02 10:09, Morten Brørup wrote:
From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
Sent: Wednesday, 2 November 2022 08.53
On 2022-10-31 12:26, Morten Brørup wrote:
Offset the stats array index by one, and count non-DPDK threads at index zero.
This patch provides two benefits:
* Non-DPDK threads are also included in the statistics.
* A conditional in the fast path is removed. Static branch prediction was correct, so the performance improvement is negligible.
v2:
* New. No v1 of this patch in the series.
Suggested-by: Stephen Hemminger <step...@networkplumber.org>
Signed-off-by: Morten Brørup <m...@smartsharesystems.com>
---
lib/mempool/rte_mempool.c | 2 +-
lib/mempool/rte_mempool.h | 12 ++++++------
2 files changed, 7 insertions(+), 7 deletions(-)
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 62d1ce764e..e6208125e0 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -1272,7 +1272,7 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
#ifdef RTE_LIBRTE_MEMPOOL_STATS
rte_mempool_ops_get_info(mp, &info);
memset(&sum, 0, sizeof(sum));
- for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE + 1; lcore_id++) {
sum.put_bulk += mp->stats[lcore_id].put_bulk;
sum.put_objs += mp->stats[lcore_id].put_objs;
sum.put_common_pool_bulk += mp->stats[lcore_id].put_common_pool_bulk;
diff --git a/lib/mempool/rte_mempool.h b/lib/mempool/rte_mempool.h
index 9c4bf5549f..16e7e62e3c 100644
--- a/lib/mempool/rte_mempool.h
+++ b/lib/mempool/rte_mempool.h
@@ -238,8 +238,11 @@ struct rte_mempool {
struct rte_mempool_memhdr_list mem_list; /**< List of memory chunks */
#ifdef RTE_LIBRTE_MEMPOOL_STATS
- /** Per-lcore statistics. */
- struct rte_mempool_debug_stats stats[RTE_MAX_LCORE];
+ /** Per-lcore statistics.
+ *
+ * Offset by one, to include non-DPDK threads.
+ */
+ struct rte_mempool_debug_stats stats[RTE_MAX_LCORE + 1];
#endif
} __rte_cache_aligned;
@@ -304,10 +307,7 @@ struct rte_mempool {
*/
#ifdef RTE_LIBRTE_MEMPOOL_STATS
 #define RTE_MEMPOOL_STAT_ADD(mp, name, n) do {		\
-	unsigned __lcore_id = rte_lcore_id();		\
-	if (__lcore_id < RTE_MAX_LCORE) {		\
-		mp->stats[__lcore_id].name += n;	\
-	}						\
+	(mp)->stats[rte_lcore_id() + 1].name += n;	\
This relies on LCORE_ID_ANY being UINT32_MAX, and a wrap to 0, for an unregistered non-EAL thread? Might be worth a comment, or better a rewrite with an explicit LCORE_ID_ANY comparison.
The purpose of this patch is to avoid the comparison here.
Yes, it relies on the wrap to zero, and these conditions:
1. LCORE_ID_ANY being UINT32_MAX, and
2. the return type of rte_lcore_id() being unsigned int, and
3. unsigned int being uint32_t.
When I wrote this, I considered it safe to assume that LCORE_ID_ANY
will remain the unsigned equivalent of -1 using the return type of
rte_lcore_id(). In other words: If the return type of rte_lcore_id()
should change from 32 to 64 bit, LCORE_ID_ANY would be updated
accordingly to UINT64_MAX.
Because of this assumption, I didn't use [(rte_lcore_id() + 1) &
UINT32_MAX], but just [rte_lcore_id() + 1].
I struggled writing an appropriate comment without making it
unacceptably long, but eventually gave up, and settled for the one-line
comment in the structure only.
You anyway need a conditional. An atomic add must be used in the unregistered non-EAL thread case.
Good point: The "+= n" must be atomic for non-isolated threads.
If the various unregistered non-EAL threads are run on isolated cores
or not does not matter.
Agree: They all share the same index, and thus may race, regardless of which cores they are using.
Rephrasing: The "+= n" must be atomic for the unregistered non-EAL threads.
I just took a look at how software maintained stats are handled
elsewhere, and the first thing I found, is the IOAT DMA driver, which
also seems to be using non-atomic increment [1] regardless if used by a
DPDK thread or not.
[1]: https://elixir.bootlin.com/dpdk/v22.11-rc2/source/drivers/dma/ioat/ioat_dmadev.c#L228
However, doing it wrong elsewhere doesn't make it correct doing it
wrong here too. :-)
Atomic increments are costly, so I would rather follow your
suggestion and reintroduce the comparison. How about:
#define RTE_MEMPOOL_STAT_ADD(mp, name, n) do {				\
	unsigned int __lcore_id = rte_lcore_id();			\
	if (likely(__lcore_id < RTE_MAX_LCORE)) {			\
		(mp)->stats[__lcore_id].name += n;			\
	} else {							\
		rte_atomic64_add(					\
			(rte_atomic64_t *)&((mp)->stats[RTE_MAX_LCORE].name), n); \
	}								\
} while (0)
You are supposed to use GCC C11 intrinsics (e.g.,
__atomic_fetch_add()).
Oops. I forgot!
This should be emphasized everywhere in the rte_atomic library, to prevent such
mistakes.
In addition: technically, you must use an atomic store for the EAL
thread case (and an atomic load on the reader side), although there are
tons of examples in DPDK where tearing is ignored. (The generated code
will likely look the same.)
The counters are 64 bit aligned, but tearing could happen on 32 bit
architectures.
The compiler is free to do it on any architecture, but I'm not sure if
it happens much in practice.
My initial reaction to your comment was to do it correctly on the EAL threads
too, to avoid tearing there too. However, there might be a performance cost for
32 bit architectures, so I will consider that these are only debug counters,
and accept the risk of tearing.
What would that cost consist of?
In theory C11 atomics could be implemented "in software" (i.e., without
the proper ISA-level guarantees, with locks), but does any of the
DPDK-supported compiler/32-bit ISAs actually do so?
It's also not obvious that, even if there is a performance cost to pay for correctness on a certain compiler/ISA combination, correctness would have to give way.
DPDK coding conventions require there be no braces for a single
statement.
Will fix.
Other than that, it looks good.
And the structure comment could be:
* Plus one, for non-DPDK threads.
"Unregistered non-EAL threads". This is the term the EAL documentation
uses.
Thank you for clarifying. I didn't follow that discussion, so the new
terminology for threads hasn't stuck with me yet. :-)