Re: [RFC v3] net/af_packet: make stats reset reliable

Mattias Rönnblom Sun, 26 May 2024 00:07:08 -0700

On 2024-05-09 16:19, Morten Brørup wrote:

From: Morten Brørup [mailto:[email protected]]
Sent: Thursday, 9 May 2024 13.37

From: Bruce Richardson [mailto:[email protected]]
Sent: Thursday, 9 May 2024 11.30

On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:

From: Stephen Hemminger [mailto:[email protected]]
Sent: Wednesday, 8 May 2024 22.54

On Wed, 8 May 2024 20:48:06 +0100
Ferruh Yigit <[email protected]> wrote:


The idea of load tearing is crazy talk of integral types. It

would

break so many things.

It is the kind of stupid compiler thing that would send Linus

on

rant and get

the GCC compiler writers in trouble.

The DPDK has always favored performance over strict safety

guard

rails everywhere.

Switching to making every statistic an atomic operation is not

in

the spirit of

what is required. There is no strict guarantee necessary here.


I kind of agree with Stephen.

Thanks Mattias, Morten & Stephen, it was informative discussion.

But

for

*SW drivers* stats update and reset is not core functionality

and

think we can be OK to get hit on corner cases, instead of
over-engineering or making code more complex.



I forgot the case of 64 bit values on 32 bit platforms!
Mostly because haven't cared about 32 bit for years...

The Linux kernel uses some wrappers to handle this.
On 64 bit platforms they become noop.
On 32 bit platform, they are protected by a seqlock and updates

are

wrapped by the sequence count.

If we go this way, then doing similar Noop on 64 bit and atomic or
seqlock
on 32 bit should be done, but in common helper.

Looking inside FreeBSD, it looks like that has changed over the

years as

well.

        if_inc_counter
                counter_u64_add
                        atomic_add_64
But the counters are always per-cpu in this case. So although it

does

use
locked operation, will always be uncontended.


PS: Does DPDK still actually support 32 bit on x86? Can it be

dropped

this cycle?


We cannot drop 32 bit architecture support altogether.

But, unlike the Linux kernel, DPDK doesn't need to support ancient

bit architectures.

If the few 32 bit architectures supported by DPDK provide non-

tearing

64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
counters.


In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit

architecture) and 32 bit ARMv8.

I don't think DPDK support any other 32 bit architectures.


As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64

bit non-tearing load/store.


Testing this a little in godbolt, I see gcc using xmm registers on 32-
bit
when updating 64-bit counters, but clang doesn't seem to do so, but
instead
does 2 stores when writing back the 64 value. (I tried with both
volatile
and non-volatile 64-bit values, just to see if volatile would

encourage

clang to do a single store).

GCC: https://godbolt.org/z/9eqKfT3hz
Clang: https://godbolt.org/z/PT5EqKn4c


Interesting.
I guess this can be fixed by manually implementing what GCC does.

I'm more concerned about finding a high-performance (in the fast path)
64 bit counter solution for 32 bit ARM.


Reading up on the topic, and continuing Bruce's experiment on Godbolt, it is 
possible on 32 bit ARMv7-A too, using LDRD/STRD (Load/Store Register Dual) 
instructions, which load/store 64 bit from memory into two registers at once.

Clang is emits more efficient code without volatile.
GCC requires volatile to use STRD.

Clang: https://godbolt.org/z/WjdTq6EKh
GCC: https://godbolt.org/z/qq9j7d4Ea

Summing it up, it is possible to implement non-tearing 64 bit high-performance 
(lockless, barrier-free) counters on the 32 bit architectures supported by DPDK.

But the implementation is both architecture and compiler specific.
So it seems a "64 bit counters" library would be handy. (Or a "non-tearing 64 bit 
integers" library, for support of the signed variant too; but I don't think we need that.)
We can use uint64_t as the underlying type and type cast in the library (where 
needed by the specific compiler/architecture), or introduce a new rte_ctr64_t 
type to ensure that accessor functions are always used and the developer 
doesn't have to worry about tearing on 32 bit architectures.

The most simple variant of such a library only provides load and store 
functions. The API would look like this:

uint64_t inline
rte_ctr64_get(const rte_ctr64_t *const ctr);

void inline
rte_ctr64_set(rte_ctr64_t *const ctr, const uint64_t value);

And if some CPU offers special instructions for increment or addition, faster 
(regarding performance) and/or more compact (regarding instruction memory) than 
a sequence of load-add-store instructions:

void inline
rte_ctr64_inc(rte_ctr64_t *const ctr);

void inline
rte_ctr64_add(rte_ctr64_t *const ctr, const uint64_t value);

<feature creep>
And perhaps atomic variants of all these functions, with explicit and/or 
relaxed memory ordering, for counters shared by multiple threads.


Isn't rte_ctr64_get() already an atomic load with relaxed memory ordering?

</feature creep>

Re: [RFC v3] net/af_packet: make stats reset reliable

Reply via email to