On 5/7/2024 8:23 AM, Mattias Rönnblom wrote:
> On 2024-04-28 17:11, Mattias Rönnblom wrote:
>> On 2024-04-26 16:38, Ferruh Yigit wrote:
>>> For stats reset, use an offset instead of zeroing out actual stats
>>> values,
>>> get_stats() displays diff between stats and offset.
>>> This way stats only updated in datapath and offset only updated in stats
>>> reset function. This makes stats reset function more reliable.
>>>
>>> As stats only written by single thread, we can remove 'volatile'
>>> qualifier
>>> which should improve the performance in datapath.
>>>
>>
>> volatile wouldn't help you if you had multiple writers, so that can't
>> be the reason for its removal. It would be more accurate to say it
>> should be replaced with atomic updates. If you don't use volatile and
>> don't use atomics, you have to consider if the compiler can reach the
>> conclusion that it does not need to store the counter value for future
>> use *for that thread*. Since otherwise, I don't think the store
>> actually needs to occur. Since DPDK statistics tend to work, it's
>> pretty obvious that current compilers tend not to reach this conclusion.
>>
>> If this should be done 100% properly, the update operation should be a
>> non-atomic load, non-atomic add, and an atomic store. Similarly, for
>> the reset, the offset store should be atomic.
>>
>> Considered the state of the rest of the DPDK code base, I think a
>> non-atomic, non-volatile solution is also fine.
>>
>> (That said, I think we're better off just deprecating stats reset
>> altogether, and returning -ENOTSUP here meanwhile.)
>>
>>> Signed-off-by: Ferruh Yigit <ferruh.yi...@amd.com>
>>> ---
>>> Cc: Mattias Rönnblom <mattias.ronnb...@ericsson.com>
>>> Cc: Stephen Hemminger <step...@networkplumber.org>
>>> Cc: Morten Brørup <m...@smartsharesystems.com>
>>>
>>> This update triggered by mail list discussion [1].
>>>
>>> [1]
>>> https://inbox.dpdk.org/dev/3b2cf48e-2293-4226-b6cd-5f4dd3969...@lysator.liu.se/
>>>
>>> v2:
>>> * Remove wrapping check for stats
>>> ---
>>>  drivers/net/af_packet/rte_eth_af_packet.c | 66 ++++++++++++++---------
>>>  1 file changed, 41 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/net/af_packet/rte_eth_af_packet.c
>>> b/drivers/net/af_packet/rte_eth_af_packet.c
>>> index 397a32db5886..10c8e1e50139 100644
>>> --- a/drivers/net/af_packet/rte_eth_af_packet.c
>>> +++ b/drivers/net/af_packet/rte_eth_af_packet.c
>>> @@ -51,8 +51,10 @@ struct pkt_rx_queue {
>>> 	uint16_t in_port;
>>> 	uint8_t vlan_strip;
>>> -	volatile unsigned long rx_pkts;
>>> -	volatile unsigned long rx_bytes;
>>> +	uint64_t rx_pkts;
>>> +	uint64_t rx_bytes;
>>> +	uint64_t rx_pkts_offset;
>>> +	uint64_t rx_bytes_offset;
>>
>> I suggest you introduce a separate struct for reset-able counters.
>> It'll make things cleaner, and you can sneak in atomics without too
>> much atomics-related bloat.
>>
>> struct counter
>> {
>> 	uint64_t count;
>> 	uint64_t offset;
>> };
>>
>> /../
>> 	struct counter rx_pkts;
>> 	struct counter rx_bytes;
>> /../
>>
>> static uint64_t
>> counter_value(struct counter *counter)
>> {
>> 	uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>> 	uint64_t offset = __atomic_load_n(&counter->offset,
>> 		__ATOMIC_RELAXED);
>>
>
> Since the count and the offset are written to independently, without any
> ordering restrictions, an update and a reset in quick succession may
> cause the offset store to be globally visible before the new count. In
> such a scenario, a reader could see an offset > count.
>
> Thus, unless I'm missing something, one should add a
>
> if (unlikely(offset > count))
>     return 0;
>
> here. With the appropriate comment explaining why this might be.
>
> Another approach would be to think about what memory barriers may be
> required to make sure one sees the count update before the offset
> update, but, intuitively, that seems like both more complex and more
> costly (performance-wise).
>
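Just to confirm I read the suggestion right, the check would sit at the end
of counter_value(), roughly like below. This is only an untested sketch
reusing the 'struct counter' from above, and it returns count - offset for
the value, matching what stats_get_diff() does in the patch:

static uint64_t
counter_value(struct counter *counter)
{
	uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
	uint64_t offset = __atomic_load_n(&counter->offset,
		__ATOMIC_RELAXED);

	/*
	 * A reset racing with an update may make the offset store visible
	 * before the new count, so a reader could briefly see offset > count.
	 * Report zero instead of a huge bogus value in that case.
	 */
	if (unlikely(offset > count))
		return 0;

	return count - offset;
}
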
We are going with the lazy alternative and requesting that forwarding be
stopped before stats reset; this should prevent 'count' and 'offset' from
being updated simultaneously, so the suggested check should not be needed.

>> 	return count + offset;
>> }
>>
>> static void
>> counter_reset(struct counter *counter)
>> {
>> 	uint64_t count = __atomic_load_n(&counter->count, __ATOMIC_RELAXED);
>>
>> 	__atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
>> }
>>
>> static void
>> counter_add(struct counter *counter, uint64_t operand)
>> {
>> 	__atomic_store_n(&counter->count, counter->count + operand,
>> 		__ATOMIC_RELAXED);
>> }
>>
>> You'd have to port this to <rte_stdatomic.h> calls, which prevents
>> non-atomic loads from RTE_ATOMIC()s. The non-atomic reads above must
>> be replaced with explicit relaxed non-atomic load. Otherwise, if you
>> just use "counter->count", that would be an atomic load with
>> sequential consistency memory order on C11 atomics-based builds, which
>> would result in a barrier, at least on weakly ordered machines (e.g.,
>> ARM).
>>
>> I would still use a struct and some helper-functions even for the less
>> ambitious, non-atomic variant.
>>
>> The only drawback of using GCC built-ins type atomics here, versus an
>> atomic- and volatile-free approach, is that current compilers seems to
>> refuse merging atomic stores. It's beyond me why this is the case. If
>> you store to a variable twice in quick succession, it'll be two store
>> machine instructions, even in cases where the compiler *knows* the
>> value is identical. So volatile, even though you didn't ask for it.
>> Weird.
>>
>> So if you have a loop, you may want to make an "counter_add()" in the
>> end from a temporary, to get the final 0.001% of performance.
>>
>> If the tech board thinks MT-safe reset-able software-manage statistics
>> is the future (as opposed to dropping reset support, for example), I
>> think this stuff should go into a separate header file, so other PMDs
>> can reuse it. Maybe out of scope for this patch.
>>
>>> };
>>> struct pkt_tx_queue {
>>> @@ -64,9 +66,12 @@ struct pkt_tx_queue {
>>> 	unsigned int framecount;
>>> 	unsigned int framenum;
>>> -	volatile unsigned long tx_pkts;
>>> -	volatile unsigned long err_pkts;
>>> -	volatile unsigned long tx_bytes;
>>> +	uint64_t tx_pkts;
>>> +	uint64_t err_pkts;
>>> +	uint64_t tx_bytes;
>>> +	uint64_t tx_pkts_offset;
>>> +	uint64_t err_pkts_offset;
>>> +	uint64_t tx_bytes_offset;
>>> };
>>> struct pmd_internals {
>>> @@ -385,8 +390,15 @@ eth_dev_info(struct rte_eth_dev *dev, struct
>>> rte_eth_dev_info *dev_info)
>>> 	return 0;
>>> }
>>> +
>>> +static uint64_t
>>> +stats_get_diff(uint64_t stats, uint64_t offset)
>>> +{
>>> +	return stats - offset;
>>> +}
>>> +
>>> static int
>>> -eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *igb_stats)
>>> +eth_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>>> {
>>> 	unsigned i, imax;
>>> 	unsigned long rx_total = 0, tx_total = 0, tx_err_total = 0;
>>> @@ -396,27 +408,29 @@ eth_stats_get(struct rte_eth_dev *dev, struct
>>> rte_eth_stats *igb_stats)
>>> 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>> 		internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>> 	for (i = 0; i < imax; i++) {
>>> -		igb_stats->q_ipackets[i] = internal->rx_queue[i].rx_pkts;
>>> -		igb_stats->q_ibytes[i] = internal->rx_queue[i].rx_bytes;
>>> -		rx_total += igb_stats->q_ipackets[i];
>>> -		rx_bytes_total += igb_stats->q_ibytes[i];
>>> +		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>>> +		stats->q_ipackets[i] = stats_get_diff(rxq->rx_pkts,
>>> rxq->rx_pkts_offset);
>>> +		stats->q_ibytes[i] = stats_get_diff(rxq->rx_bytes,
>>> rxq->rx_bytes_offset);
>>> +		rx_total += stats->q_ipackets[i];
>>> +		rx_bytes_total += stats->q_ibytes[i];
>>> 	}
>>> 	imax = (internal->nb_queues < RTE_ETHDEV_QUEUE_STAT_CNTRS ?
>>> 		internal->nb_queues : RTE_ETHDEV_QUEUE_STAT_CNTRS);
>>> 	for (i = 0; i < imax; i++) {
>>> -		igb_stats->q_opackets[i] = internal->tx_queue[i].tx_pkts;
>>> -		igb_stats->q_obytes[i] = internal->tx_queue[i].tx_bytes;
>>> -		tx_total += igb_stats->q_opackets[i];
>>> -		tx_err_total += internal->tx_queue[i].err_pkts;
>>> -		tx_bytes_total += igb_stats->q_obytes[i];
>>> +		struct pkt_tx_queue *txq = &internal->tx_queue[i];
>>> +		stats->q_opackets[i] = stats_get_diff(txq->tx_pkts,
>>> txq->tx_pkts_offset);
>>> +		stats->q_obytes[i] = stats_get_diff(txq->tx_bytes,
>>> txq->tx_bytes_offset);
>>> +		tx_total += stats->q_opackets[i];
>>> +		tx_err_total += stats_get_diff(txq->err_pkts,
>>> txq->err_pkts_offset);
>>> +		tx_bytes_total += stats->q_obytes[i];
>>> 	}
>>> -	igb_stats->ipackets = rx_total;
>>> -	igb_stats->ibytes = rx_bytes_total;
>>> -	igb_stats->opackets = tx_total;
>>> -	igb_stats->oerrors = tx_err_total;
>>> -	igb_stats->obytes = tx_bytes_total;
>>> +	stats->ipackets = rx_total;
>>> +	stats->ibytes = rx_bytes_total;
>>> +	stats->opackets = tx_total;
>>> +	stats->oerrors = tx_err_total;
>>> +	stats->obytes = tx_bytes_total;
>>> 	return 0;
>>> }
>>> @@ -427,14 +441,16 @@ eth_stats_reset(struct rte_eth_dev *dev)
>>> 	struct pmd_internals *internal = dev->data->dev_private;
>>> 	for (i = 0; i < internal->nb_queues; i++) {
>>> -		internal->rx_queue[i].rx_pkts = 0;
>>> -		internal->rx_queue[i].rx_bytes = 0;
>>> +		struct pkt_rx_queue *rxq = &internal->rx_queue[i];
>>> +		rxq->rx_pkts_offset = rxq->rx_pkts;
>>> +		rxq->rx_bytes_offset = rxq->rx_bytes;
>>> 	}
>>> 	for (i = 0; i < internal->nb_queues; i++) {
>>> -		internal->tx_queue[i].tx_pkts = 0;
>>> -		internal->tx_queue[i].err_pkts = 0;
>>> -		internal->tx_queue[i].tx_bytes = 0;
>>> +		struct pkt_tx_queue *txq = &internal->tx_queue[i];
>>> +		txq->tx_pkts_offset = txq->tx_pkts;
>>> +		txq->err_pkts_offset = txq->err_pkts;
>>> +		txq->tx_bytes_offset = txq->tx_bytes;
>>> 	}
>>> 	return 0;
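
Related to the <rte_stdatomic.h> comment above, here is a rough,
compile-untested sketch of how the counter helpers could look with those
calls. It assumes the value is reported as count minus offset (as
stats_get_diff() does in the patch) and that reset only runs while
forwarding is stopped, so only relaxed ordering is used:

#include <stdint.h>

#include <rte_stdatomic.h>

struct counter {
	RTE_ATOMIC(uint64_t) count;
	RTE_ATOMIC(uint64_t) offset;
};

static uint64_t
counter_value(struct counter *counter)
{
	/*
	 * Explicit relaxed loads avoid the seq-cst barrier that a plain read
	 * of an RTE_ATOMIC() field would imply on C11 atomics-based builds.
	 */
	uint64_t count = rte_atomic_load_explicit(&counter->count,
			rte_memory_order_relaxed);
	uint64_t offset = rte_atomic_load_explicit(&counter->offset,
			rte_memory_order_relaxed);

	return count - offset;
}

static void
counter_reset(struct counter *counter)
{
	uint64_t count = rte_atomic_load_explicit(&counter->count,
			rte_memory_order_relaxed);

	rte_atomic_store_explicit(&counter->offset, count,
			rte_memory_order_relaxed);
}

static void
counter_add(struct counter *counter, uint64_t operand)
{
	/* Single writer: a relaxed load plus store, no atomic add needed. */
	uint64_t count = rte_atomic_load_explicit(&counter->count,
			rte_memory_order_relaxed);

	rte_atomic_store_explicit(&counter->count, count + operand,
			rte_memory_order_relaxed);
}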