On Thu, 9 May 2024 16:19:08 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:

> > From: Morten Brørup [mailto:m...@smartsharesystems.com]
> > Sent: Thursday, 9 May 2024 13.37
> >   
> > > From: Bruce Richardson [mailto:bruce.richard...@intel.com]
> > > Sent: Thursday, 9 May 2024 11.30
> > >
> > > On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:  
> > > > > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > > > > Sent: Wednesday, 8 May 2024 22.54
> > > > >
> > > > > On Wed, 8 May 2024 20:48:06 +0100
> > > > > Ferruh Yigit <ferruh.yi...@amd.com> wrote:
> > > > >  
> > > > > > >
> > > > > > > The idea of load tearing is crazy talk of integral types. It  
> > > would  
> > > > > break so many things.  
> > > > > > > It is the kind of stupid compiler thing that would send Linus  
> > on  
> > > a  
> > > > > rant and get  
> > > > > > > the GCC compiler writers in trouble.
> > > > > > >
> > > > > > > The DPDK has always favored performance over strict safety  
> > guard  
> > > > > rails everywhere.  
> > > > > > > Switching to making every statistic an atomic operation is not  
> > > in  
> > > > > the spirit of  
> > > > > > > what is required. There is no strict guarantee necessary here.
> > > > > > >  
> > > > > >
> > > > > > I kind of agree with Stephen.
> > > > > >
> > > > > > Thanks Mattias, Morten & Stephen, it was informative discussion.  
> > > But  
> > > > > for  
> > > > > > *SW drivers* stats update and reset is not core functionality  
> > and  
> > > I  
> > > > > > think we can be OK to get hit on corner cases, instead of
> > > > > > over-engineering or making code more complex.  
> > > > >
> > > > >
> > > > > I forgot the case of 64 bit values on 32 bit platforms!
> > > > > Mostly because haven't cared about 32 bit for years...
> > > > >
> > > > > The Linux kernel uses some wrappers to handle this.
> > > > > On 64 bit platforms they become noop.
> > > > > On 32 bit platform, they are protected by a seqlock and updates  
> > are  
> > > > > wrapped by the sequence count.
> > > > >
> > > > > If we go this way, then doing similar Noop on 64 bit and atomic or
> > > > > seqlock
> > > > > on 32 bit should be done, but in common helper.
> > > > >
> > > > > Looking inside FreeBSD, it looks like that has changed over the  
> > > years as  
> > > > > well.
> > > > >
> > > > >       if_inc_counter
> > > > >               counter_u64_add
> > > > >                       atomic_add_64
> > > > > But the counters are always per-cpu in this case. So although it  
> > > does  
> > > > > use
> > > > > locked operation, will always be uncontended.
> > > > >
> > > > >
> > > > > PS: Does DPDK still actually support 32 bit on x86? Can it be  
> > > dropped  
> > > > > this cycle?  
> > > >
> > > > We cannot drop 32 bit architecture support altogether.
> > > >
> > > > But, unlike the Linux kernel, DPDK doesn't need to support ancient  
> > 32  
> > > bit architectures.  
> > > > If the few 32 bit architectures supported by DPDK provide non-  
> > tearing  
> > > 64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
> > > counters.  
> > > >
> > > > In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit  
> > > architecture) and 32 bit ARMv8.  
> > > > I don't think DPDK support any other 32 bit architectures.
> > > >
> > > >
> > > > As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64  
> > > bit non-tearing load/store.  
> > > >  
> > >
> > > Testing this a little in godbolt, I see gcc using xmm registers on 32-
> > > bit
> > > when updating 64-bit counters, but clang doesn't seem to do so, but
> > > instead
> > > does 2 stores when writing back the 64 value. (I tried with both
> > > volatile
> > > and non-volatile 64-bit values, just to see if volatile would  
> > encourage  
> > > clang to do a single store).
> > >
> > > GCC: https://godbolt.org/z/9eqKfT3hz
> > > Clang: https://godbolt.org/z/PT5EqKn4c  
> > 
> > Interesting.
> > I guess this can be fixed by manually implementing what GCC does.
> > 
> > I'm more concerned about finding a high-performance (in the fast path)
> > 64 bit counter solution for 32 bit ARM.  
> 
> Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is 
> possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) 
> instructions, which load/store 64 bit from memory into two registers at once.
> 
> Clang is emits more efficient code without volatile.
> GCC requires volatile to use STRD.
> 
> Clang: https://godbolt.org/z/WjdTq6EKh
> GCC: https://godbolt.org/z/qq9j7d4Ea
> 
> Summing it up, it is possible to implement non-tearing 64 bit 
> high-performance (lockless, barrier-free) counters on the 32 bit 
> architectures supported by DPDK.
> 
> But the implementation is both architecture and compiler specific.
> So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 
> bit integers" library, for support of the signed variant too; but I don't 
> think we need that.)
> We can use uint64_t as the underlying type and type cast in the library 
> (where needed by the specific compiler/architecture), or introduce a new 
> rte_ctr64_t type to ensure that accessor functions are always used and the 
> developer doesn't have to worry about tearing on 32 bit architectures.
> 
> The most simple variant of such a library only provides load and store 
> functions. The API would look like this:
> 
> uint64_t inline
> rte_ctr64_get(const rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);
> 
> And if some CPU offers special instructions for increment or addition, faster 
> (regarding performance) and/or more compact (regarding instruction memory) 
> than a sequence of load-add-store instructions:
> 
> void inline
> rte_ctr64_inc(rte_ctr64_t *const ctr);
> 
> void inline
> rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);
> 
> <feature creep>
> And perhaps atomic variants of all these functions, with explicit and/or 
> relaxed memory ordering, for counters shared by multiple threads.
> </feature creep>
> 


This kind of what I am experimenting with but...
Intend to keep the details of the counters read and update in one file and not 
as inlines.

Reply via email to