> From: Stephen Hemminger [mailto:step...@networkplumber.org] > Sent: Friday, 10 May 2024 06.56 > > On Thu, 9 May 2024 16:19:08 +0200 > Morten Brørup <m...@smartsharesystems.com> wrote: > > > > From: Morten Brørup [mailto:m...@smartsharesystems.com] > > > Sent: Thursday, 9 May 2024 13.37 > > > > > > > From: Bruce Richardson [mailto:bruce.richard...@intel.com] > > > > Sent: Thursday, 9 May 2024 11.30 > > > > > > > > On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote: > > > > > > From: Stephen Hemminger [mailto:step...@networkplumber.org] > > > > > > Sent: Wednesday, 8 May 2024 22.54 > > > > > > > > > > > > On Wed, 8 May 2024 20:48:06 +0100 > > > > > > Ferruh Yigit <ferruh.yi...@amd.com> wrote: > > > > > > > > > > > > > > > > > > > > > > The idea of load tearing is crazy talk of integral types. > It > > > > would > > > > > > break so many things. > > > > > > > > It is the kind of stupid compiler thing that would send > Linus > > > on > > > > a > > > > > > rant and get > > > > > > > > the GCC compiler writers in trouble. > > > > > > > > > > > > > > > > The DPDK has always favored performance over strict safety > > > guard > > > > > > rails everywhere. > > > > > > > > Switching to making every statistic an atomic operation is > not > > > > in > > > > > > the spirit of > > > > > > > > what is required. There is no strict guarantee necessary > here. > > > > > > > > > > > > > > > > > > > > > > I kind of agree with Stephen. > > > > > > > > > > > > > > Thanks Mattias, Morten & Stephen, it was informative > discussion. > > > > But > > > > > > for > > > > > > > *SW drivers* stats update and reset is not core > functionality > > > and > > > > I > > > > > > > think we can be OK to get hit on corner cases, instead of > > > > > > > over-engineering or making code more complex. > > > > > > > > > > > > > > > > > > I forgot the case of 64 bit values on 32 bit platforms! > > > > > > Mostly because haven't cared about 32 bit for years... > > > > > > > > > > > > The Linux kernel uses some wrappers to handle this. > > > > > > On 64 bit platforms they become noop. > > > > > > On 32 bit platform, they are protected by a seqlock and > updates > > > are > > > > > > wrapped by the sequence count. > > > > > > > > > > > > If we go this way, then doing similar Noop on 64 bit and > atomic or > > > > > > seqlock > > > > > > on 32 bit should be done, but in common helper. > > > > > > > > > > > > Looking inside FreeBSD, it looks like that has changed over > the > > > > years as > > > > > > well. > > > > > > > > > > > > if_inc_counter > > > > > > counter_u64_add > > > > > > atomic_add_64 > > > > > > But the counters are always per-cpu in this case. So although > it > > > > does > > > > > > use > > > > > > locked operation, will always be uncontended. > > > > > > > > > > > > > > > > > > PS: Does DPDK still actually support 32 bit on x86? Can it be > > > > dropped > > > > > > this cycle? > > > > > > > > > > We cannot drop 32 bit architecture support altogether. > > > > > > > > > > But, unlike the Linux kernel, DPDK doesn't need to support > ancient > > > 32 > > > > bit architectures. > > > > > If the few 32 bit architectures supported by DPDK provide non- > > > tearing > > > > 64 bit loads/stores, we don't need locks (in the fast path) for 64 > bit > > > > counters. > > > > > > > > > > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit > > > > architecture) and 32 bit ARMv8. > > > > > I don't think DPDK support any other 32 bit architectures. > > > > > > > > > > > > > > > As Mattias mentioned, 32 bit x86 can use xmm registers to > provide 64 > > > > bit non-tearing load/store. > > > > > > > > > > > > > Testing this a little in godbolt, I see gcc using xmm registers on > 32- > > > > bit > > > > when updating 64-bit counters, but clang doesn't seem to do so, > but > > > > instead > > > > does 2 stores when writing back the 64 value. (I tried with both > > > > volatile > > > > and non-volatile 64-bit values, just to see if volatile would > > > encourage > > > > clang to do a single store). > > > > > > > > GCC: https://godbolt.org/z/9eqKfT3hz > > > > Clang: https://godbolt.org/z/PT5EqKn4c > > > > > > Interesting. > > > I guess this can be fixed by manually implementing what GCC does. > > > > > > I'm more concerned about finding a high-performance (in the fast > path) > > > 64 bit counter solution for 32 bit ARM. > > > > Reading up on the topic, and continuing Bruce's experiment on Godbolt, > it is possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store > Register Dual) instructions, which load/store 64 bit from memory into > two registers at once. > > > > Clang is emits more efficient code without volatile. > > GCC requires volatile to use STRD. > > > > Clang: https://godbolt.org/z/WjdTq6EKh > > GCC: https://godbolt.org/z/qq9j7d4Ea > > > > Summing it up, it is possible to implement non-tearing 64 bit high- > performance (lockless, barrier-free) counters on the 32 bit > architectures supported by DPDK. > > > > But the implementation is both architecture and compiler specific. > > So it seems a "64 bit counters" library would be handy. (Or a "non- > tearing 64 bit integers" library, for support of the signed variant too; > but I don't think we need that.) > > We can use uint64_t as the underlying type and type cast in the > library (where needed by the specific compiler/architecture), or > introduce a new rte_ctr64_t type to ensure that accessor functions are > always used and the developer doesn't have to worry about tearing on 32 > bit architectures. > > > > The most simple variant of such a library only provides load and store > functions. The API would look like this: > > > > uint64_t inline > > rte_ctr64_get(const rte_ctr64_t *const ctr); > > > > void inline > > rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value); > > > > And if some CPU offers special instructions for increment or addition, > faster (regarding performance) and/or more compact (regarding > instruction memory) than a sequence of load-add-store instructions: > > > > void inline > > rte_ctr64_inc(rte_ctr64_t *const ctr); > > > > void inline > > rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
Note: 32 bit architectures might achieve higher performance if the "value" parameter to rte_ctr64_add() is unsigned long (or unsigned int) instead of uint64_t. > > > > <feature creep> > > And perhaps atomic variants of all these functions, with explicit > and/or relaxed memory ordering, for counters shared by multiple threads. > > </feature creep> > > > > > This kind of what I am experimenting with but... Excellent! > Intend to keep the details of the counters read and update in one file > and not as inlines. Please note that traffic management applications maintain many counters (e.g. per-flow, per-subscriber and per-link packet and byte counters, some also per QoS class), so rte_ctr64_add() must have the highest performance technically possible. For reference, the packet scheduling code in our application updates 28 statistics counters per burst of packets. (In addition to internal state variables for e.g. queue lenghts.) Furthermore, our application processes and displays live statistics with one second granularity, using a separate thread. Although statistics processing is not part of the fast path, the sheer number of counters processed requires high performance read access to those counters. <more feature creep> Some groups of counters are maintained locally in the inner loop, and then added in bulk to the "public" statistics afterwards. Conceptually: struct stats_per_prio { uint64_t forwarded_bytes; uint64_t forwarded_packets; uint64_t marked_bytes; uint64_t marked_packets; uint64_t dropped_bytes; uint64_t dropped_packets; }; If this is a common design pattern in DPDK (drivers, libraries and/or applications), perhaps also provide a performance optimized function for architectures offering vector instructions: void inline rte_ctr64_add_bulk( rte_ctr64_t *const ctrs, const unsigned long *const values, const unsigned int n /* compile time constant */); This slightly resembles the Linux kernel's design pattern, where counters are updated in bulk, protected by a common lock for the bulk update. (However, DPDK has no lock, so the motivation for optimizing for this design pattern is only "nice to have".) PS: Many DPDK applications are 64 bit only, including the SmartShare appliances, and can easily manage 64 bit counters without all this. However, if the DPDK project still considers 32 bit architectures first class citizens, 64 bit counters should have high performance on 32 bit architectures too.