On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:

> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice.
> > + *
> > + * On native 64 bit platform, counter is implemented as basic
> > + * 64 bit unsigned integer that only increases.
> > + */
> > +typedef struct {
> > +   uint64_t current;
> > +   uint64_t offset;
> > +} rte_counter64_t;  
> 
> As discussed in the other thread [1], I strongly prefer having "current" and 
> "offset" separate, for performance reasons.
> Keeping each offset close together with its counter will require more cache 
> lines than necessary, because the offsets take up space in the hot part of a 
> fast path data structure. E.g. the size_bins[] counters could fit into one 
> cache line instead of two.
> 
> [1]: 
> https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

There are no size_bins in the current version of the patch.
And the number of counters in ethdev part are small so it is less of a concern.
The code is easier to maintain if the counter object is self contained.

> > + * @internal
> > + * Macro to implement write once (compiler barrier) using stdatomic.
> > + * This is compiler barrier only.
> > + */
> > +#define __rte_write_once(var, val)                                     \
> > +   rte_atomic_store_explicit((__rte_atomic typeof(&(var)))&(var), val, \
> > +           rte_memory_order_release)  
> 
> These macros certainly make the code using them shorter.
> But IMHO, they don't improve the readability of the code using them; quite 
> the opposite. Reviewing code using atomics is hard, so I prefer having the 
> memory order directly shown in the code, not hidden behind a macro.

Agree, was going to drop them in next version.

> > +__rte_experimental
> > +static inline uint64_t
> > +rte_counter64_read(const rte_counter64_t *counter)
> > +{
> > +   return __rte_read_once(counter->current) - __rte_read_once(counter-  
> > >offset);  
> 
> I'm not sure that "current" needs to be read using rte_memory_order_consume 
> here; I think rte_memory_order_consume for "offset" and 
> rte_memory_order_relaxed (or perhaps just volatile) for "counter" suffices. 
> But let's settle on the high level design before we start micro-optimizing. 
> :-)

memory order consume is compiler barrier only. Was trying to choose what was 
best here.

> > +
> > +/* On 32 bit platform, need to use atomic to avoid load/store tearing */
> > +typedef RTE_ATOMIC(uint64_t) rte_counter64_t;  
> 
> As shown by Godbolt experiments discussed in a previous thread [2], 
> non-tearing 64 bit counters can be implemented without using atomic 
> instructions on all 32 bit architectures supported by DPDK. So we should use 
> the counter/offset design pattern for RTE_ARCH_32 too.
> 
> [2]: 
> https://inbox.dpdk.org/dev/98cbd80474fa8b44bf855df32c47dc35e9f...@smartserver.smartshare.dk/

Bruce found some 32 bit versions of x86 have the problem.
Godbolt doesn't seem to list 32 bit x86 compiler for Gcc.
If you try MSVC in 32 bit mode it will split the loads.
I see no problem on ARM which is the only other 32 bit we care about.


Reply via email to