> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Wednesday, 8 May 2024 08.35
>
> On 2024-05-07 21:19, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> >> Sent: Tuesday, 7 May 2024 09.24
> >>
> >> On 2024-04-28 17:11, Mattias Rönnblom wrote:
> >>> On 2024-04-26 16:38, Ferruh Yigit wrote:
> >
> > [...]
> >
> >>> static uint64_t
> >>> counter_value(struct counter *counter)
> >>> {
> >>>         uint64_t count = __atomic_load_n(&counter->count,
> >> __ATOMIC_RELAXED);
> >>>         uint64_t offset = __atomic_load_n(&counter->offset,
> >> __ATOMIC_RELAXED);
> >>>
> >>
> >> Since the count and the offset are written to independently, without any
> >> ordering restrictions, an update and a reset in quick succession may
> >> cause the offset store to be globally visible before the new count.
> >
> > Good catch.
> > This may happen when a thread calls stats_add() and then the same thread
> calls stats_reset().
> >
> >> In such a scenario, a reader could see an offset > count.
> >>
> >> Thus, unless I'm missing something, one should add a
> >>
> >> if (unlikely(offset > count))
> >>         return 0;
> >>
> >> here. With the appropriate comment explaining why this might be.
> >>
> >> Another approach would be to think about what memory barriers may be
> >> required to make sure one sees the count update before the offset
> >> update, but, intuitively, that seems like both more complex and more
> >> costly (performance-wise).
> >
> > I think it can be done without affecting stats_add(), by using "offset" with
> Release-Consume ordering:
> > - stats_reset() must write "offset" with memory_order_release, so
> "counter" cannot be visible after it, and
> > - stats_get() must read "offset" with memory_order_consume, so no reads or
> writes in the current thread dependent on "offset" can be reordered before
> this load, and writes to "counter" (a data-dependent variable) in other
> threads that release "offset" are visible in the current thread.
> >
>
> That was the kind of complexity I was thinking about. Those barriers
> come with a non-zero cost, both with different instructions being used
> and compiler optimizations being prevented.
Yep, you mentioned that there might be a more complex alternative, so I decided to explore it. :-)

This approach doesn't impose any new requirements on stats_add(), so the data plane performance is not affected.

For per-thread counters, stats_add() can store "counter" using memory_order_relaxed. Or, if the architecture prevents tearing of 64-bit variables, using volatile.

Counters shared by multiple threads must be atomically incremented using rte_atomic_fetch_add_explicit() with memory_order_relaxed.

>
> >>
> >>> return count - offset;
> >>> }
> >>>
> >>> static void
> >>> counter_reset(struct counter *counter)
> >>> {
> >>>         uint64_t count = __atomic_load_n(&counter->count,
> >> __ATOMIC_RELAXED);
> >>>
> >>>         __atomic_store_n(&counter->offset, count, __ATOMIC_RELAXED);
> >>> }
> >>>
> >>> static void
> >>> counter_add(struct counter *counter, uint64_t operand)
> >>> {
> >>>         __atomic_store_n(&counter->count, counter->count + operand,
> >>> __ATOMIC_RELAXED);
> >>> }
> >
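
To make the above concrete, here is a rough, untested sketch of how the counter could look with this scheme. It is written against plain C11 <stdatomic.h> for illustration rather than DPDK's rte_atomic_*_explicit() wrappers, and the counter_add_mt() name is just something I made up for the shared-counter variant. The reader loads "offset" first, and I have used memory_order_acquire for that load, since current compilers promote memory_order_consume to acquire anyway.

#include <stdatomic.h>
#include <stdint.h>

struct counter {
        _Atomic uint64_t count;
        _Atomic uint64_t offset;
};

/* Data plane, single writer thread: plain load + relaxed store, no atomic RMW. */
static void
counter_add(struct counter *counter, uint64_t operand)
{
        uint64_t count = atomic_load_explicit(&counter->count, memory_order_relaxed);

        atomic_store_explicit(&counter->count, count + operand, memory_order_relaxed);
}

/* Counter shared by multiple writer threads: atomic increment, still relaxed. */
static void
counter_add_mt(struct counter *counter, uint64_t operand)
{
        atomic_fetch_add_explicit(&counter->count, operand, memory_order_relaxed);
}

static void
counter_reset(struct counter *counter)
{
        uint64_t count = atomic_load_explicit(&counter->count, memory_order_relaxed);

        /*
         * Release: the count value read above, and any count update that
         * happened before this store, must be visible to a reader that
         * observes the new offset.
         */
        atomic_store_explicit(&counter->offset, count, memory_order_release);
}

static uint64_t
counter_value(struct counter *counter)
{
        /*
         * Load offset first; acquire pairs with the release store in
         * counter_reset() and keeps the count load below from being
         * reordered before it. A reader that sees the new offset thus
         * also sees a count at least as new, so offset <= count and the
         * subtraction cannot wrap.
         */
        uint64_t offset = atomic_load_explicit(&counter->offset, memory_order_acquire);
        uint64_t count = atomic_load_explicit(&counter->count, memory_order_relaxed);

        return count - offset;
}

With this, counter_add() remains a relaxed load/store pair on the fast path; only the (rare) reset and read paths pay for the ordering.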