On 2024-05-09 13:37, Morten Brørup wrote:
From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Thursday, 9 May 2024 11.30
On Thu, May 09, 2024 at 09:43:16AM +0200, Morten Brørup wrote:
From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 8 May 2024 22.54
On Wed, 8 May 2024 20:48:06 +0100
Ferruh Yigit <ferruh.yi...@amd.com> wrote:
The idea of load tearing is crazy talk of integral types. It
would
break so many things.
It is the kind of stupid compiler thing that would send Linus on
a
rant and get
the GCC compiler writers in trouble.
The DPDK has always favored performance over strict safety guard
rails everywhere.
Switching to making every statistic an atomic operation is not
in
the spirit of
what is required. There is no strict guarantee necessary here.
I kind of agree with Stephen.
Thanks Mattias, Morten & Stephen, it was informative discussion.
But
for
*SW drivers* stats update and reset is not core functionality and
I
think we can be OK to get hit on corner cases, instead of
over-engineering or making code more complex.
I forgot the case of 64 bit values on 32 bit platforms!
Mostly because haven't cared about 32 bit for years...
The Linux kernel uses some wrappers to handle this.
On 64 bit platforms they become noop.
On 32 bit platform, they are protected by a seqlock and updates are
wrapped by the sequence count.
If we go this way, then doing similar Noop on 64 bit and atomic or
seqlock
on 32 bit should be done, but in common helper.
Looking inside FreeBSD, it looks like that has changed over the
years as
well.
if_inc_counter
counter_u64_add
atomic_add_64
But the counters are always per-cpu in this case. So although it
does
use
locked operation, will always be uncontended.
PS: Does DPDK still actually support 32 bit on x86? Can it be
dropped
this cycle?
We cannot drop 32 bit architecture support altogether.
But, unlike the Linux kernel, DPDK doesn't need to support ancient 32
bit architectures.
If the few 32 bit architectures supported by DPDK provide non-tearing
64 bit loads/stores, we don't need locks (in the fast path) for 64 bit
counters.
In addition to 32 bit x86, DPDK supports ARMv7-A (a 32 bit
architecture) and 32 bit ARMv8.
I don't think DPDK support any other 32 bit architectures.
As Mattias mentioned, 32 bit x86 can use xmm registers to provide 64
bit non-tearing load/store.
Testing this a little in godbolt, I see gcc using xmm registers on 32-
bit
when updating 64-bit counters, but clang doesn't seem to do so, but
instead
does 2 stores when writing back the 64 value. (I tried with both
volatile
and non-volatile 64-bit values, just to see if volatile would encourage
clang to do a single store).
GCC: https://godbolt.org/z/9eqKfT3hz
Clang: https://godbolt.org/z/PT5EqKn4c
Interesting.
I guess this can be fixed by manually implementing what GCC does.
If you want an atomic store, use __atomic_store_n() or the equivalent.
On GCC, the resulting code will be compact and use an XMM register for
the store.
On clang, you get much more instructions including a conditional jump to
a compiler library function. You also get a compiler warning about
misaligned atomic may incur significant cost (no surprise). All this on
32-bit x86. Interesting, you get all those instructions even when you
explicitly tell the compiler the address should be assumed to be
aligned. Seems like a bug, or at least a short-coming.
So GCC assumes atomic stores are aligned, which seems perfectly
reasonable to me.
I'm more concerned about finding a high-performance (in the fast path) 64 bit
counter solution for 32 bit ARM.