On 2024-05-22 21:51, Stephen Hemminger wrote:
On Wed, 22 May 2024 12:01:12 -0700
Tyler Retzlaff <roret...@linux.microsoft.com> wrote:

On Wed, May 22, 2024 at 07:57:01PM +0200, Morten Brørup wrote:
From: Stephen Hemminger [mailto:step...@networkplumber.org]
Sent: Wednesday, 22 May 2024 17.38

On Wed, 22 May 2024 10:31:39 +0200
Morten Brørup <m...@smartsharesystems.com> wrote:
+/* On 32 bit platform, need to use atomic to avoid load/store tearing */
+typedef RTE_ATOMIC(uint64_t) rte_counter64_t;

As shown by Godbolt experiments discussed in a previous thread [2],
non-tearing 64 bit counters can be implemented without using atomic
instructions on all 32 bit architectures supported by DPDK. So we should
use the counter/offset design pattern for RTE_ARCH_32 too.

[2]: https://inbox.dpdk.org/dev/98CBD80474FA8B44BF855DF32C47DC35E9F433@smartserver.smartshare.dk/
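
For reference, a minimal sketch of that counter/offset pattern (my illustration; the struct and function names are hypothetical, not from the patch). Each field has a single writer: the datapath thread only ever adds to count, and a reset is performed by the control thread recording the current value in offset, so no atomic read-modify-write is needed:

#include <stdint.h>

struct counter64 {
        uint64_t count;   /* written only by the datapath thread */
        uint64_t offset;  /* written only by the control thread */
};

static inline void
counter64_add(struct counter64 *c, uint32_t val)
{
        c->count += val;
}

static inline uint64_t
counter64_read(const struct counter64 *c)
{
        return c->count - c->offset;
}

static inline void
counter64_reset(struct counter64 *c)
{
        c->offset = c->count;
}

On 32-bit targets the individual 64-bit accesses must additionally be non-tearing, which is exactly the point under discussion below.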


This code, built with -O3 and -m32 on Godbolt, shows the split problem.

#include <stdint.h>

typedef uint64_t rte_counter64_t;

void
rte_counter64_add(rte_counter64_t *counter, uint32_t val)
{
        *counter += val;
}

uint64_t
rte_counter64_read(rte_counter64_t *counter)
{
        return *counter;
}

void
rte_counter64_set(rte_counter64_t *counter, uint64_t val)
{
        *counter = val;
}

rte_counter64_add:
         push    ebx
         mov     eax, DWORD PTR [esp+8]
         xor     ebx, ebx
         mov     ecx, DWORD PTR [esp+12]
         add     DWORD PTR [eax], ecx    ; low half written first...
         adc     DWORD PTR [eax+4], ebx  ; ...high half separately: the store can tear
         pop     ebx
         ret

rte_counter64_read:
         mov     eax, DWORD PTR [esp+4]
         mov     edx, DWORD PTR [eax+4]  ; high half loaded...
         mov     eax, DWORD PTR [eax]    ; ...low half separately: the load can tear
         ret
rte_counter64_set:
         movq    xmm0, QWORD PTR [esp+8]
         mov     eax, DWORD PTR [esp+4]
         movq    QWORD PTR [eax], xmm0   ; single 64-bit SSE store: no tearing here
         ret

Sure, atomic might be required on some 32 bit architectures and/or with some 
compilers.

In theory, I think you should be able to use generic atomics and, depending on the target, get codegen that works. It might be something more expensive on 32-bit and nothing extra on 64-bit, etc.

What's the damage if we just use generic atomics with relaxed ordering? Is the codegen not optimal?

If we use atomics with relaxed memory order, the compiler for x86 still generates
a locked increment in the fast path. This costs about 100 extra cycles due
to cache and prefetch stalls. This whole endeavor is an attempt to avoid that.
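
For illustration (my addition, not from the original mail): the locked increment comes from expressing the update as an atomic read-modify-write; on x86, any atomic RMW needs a lock prefix, regardless of the memory order requested.

#include <stdint.h>

/* Even with relaxed ordering, an atomic RMW compiles to a locked
 * instruction on x86; gcc -O3 emits: lock add QWORD PTR [rdi], rsi */
void add_rmw(uint64_t *addr, uint64_t operand)
{
        __atomic_fetch_add(addr, operand, __ATOMIC_RELAXED);
}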


That's because the code is overly restrictive in that case (e.g., needlessly forcing the whole read-modify-write to be atomic); it's no fault of the compiler.

void add(uint64_t *addr, uint64_t operand)
{
    /* relaxed atomic load: atomicity (no tearing) without ordering cost */
    uint64_t value = __atomic_load_n(addr, __ATOMIC_RELAXED);
    value += operand;
    /* relaxed atomic store: again, only atomicity is requested */
    __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}

->

x86_64

add:
        mov     rax, QWORD PTR [rdi]
        add     rax, rsi
        mov     QWORD PTR [rdi], rax
        ret


x86

add:
        sub     esp, 12
        mov     ecx, DWORD PTR [esp+16]
        movq    xmm0, QWORD PTR [ecx]    ; single 64-bit SSE load: no tearing
        movq    QWORD PTR [esp], xmm0
        mov     eax, DWORD PTR [esp]
        mov     edx, DWORD PTR [esp+4]
        add     eax, DWORD PTR [esp+20]
        adc     edx, DWORD PTR [esp+24]
        mov     DWORD PTR [esp], eax
        mov     DWORD PTR [esp+4], edx
        movq    xmm1, QWORD PTR [esp]
        movq    QWORD PTR [ecx], xmm1    ; single 64-bit SSE store: no tearing
        add     esp, 12
        ret

No locked instructions.
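
A companion sketch (my addition, assuming SSE2 is enabled for -m32 as in the listing above): relaxed atomic load and store of a 64-bit value likewise compile to single movq instructions, with no lock prefix.

#include <stdint.h>

uint64_t read64(const uint64_t *addr)
{
    /* relaxed atomic load: one 64-bit access, no lock prefix */
    return __atomic_load_n(addr, __ATOMIC_RELAXED);
}

void set64(uint64_t *addr, uint64_t value)
{
    /* relaxed atomic store: one 64-bit access, no lock prefix */
    __atomic_store_n(addr, value, __ATOMIC_RELAXED);
}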

PS: the locked increment code for 32 bit involves a locked compare-exchange with a
potential retry loop. We probably don't care about performance on that platform
anymore.
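
A hand-written equivalent of that compare-exchange retry loop, as a sketch (my illustration, not the actual compiler output):

#include <stdbool.h>
#include <stdint.h>

/* On 32-bit x86, a 64-bit atomic add becomes a lock cmpxchg8b loop that
 * retries whenever another writer got in between. */
void add_rmw_32(uint64_t *addr, uint64_t operand)
{
    uint64_t old = __atomic_load_n(addr, __ATOMIC_RELAXED);

    /* on failure, 'old' is refreshed with the current value; retry */
    while (!__atomic_compare_exchange_n(addr, &old, old + operand,
            true /* weak */, __ATOMIC_RELAXED, __ATOMIC_RELAXED))
        ;
}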

