On Mon, 1 Jun 2026 19:15:09 +0100
Konstantin Ananyev <[email protected]> wrote:
> C11 __rte_ring_headtail_move_head_mt() uses the output
> parameter 'uint32_t *old_head' directly within the CAS operation.
> On x86_64 that causes gcc to generate extra instructions to
> store the return value of the CAS (eax) to the 'old_head' memory location,
> even when the CAS was not successful and another attempt should be
> performed. In some cases, even an extra branch can be observed.
> To be more specific, code like this is generated:
> // start of 'do { } while();' loop
> .L2:
>     ...
>     lock cmpxchgl %r8d, (%rdi)
>     jne .L17
> .L1:   // <---- successful completion of CAS, finish
>     movl %edx, %eax
>     ret
> .L17:  // <---- unsuccessful completion of CAS, repeat
>     movl %eax, (%r9)
>     jmp .L2
>
> In contrast, the x86-specific version that uses
> __sync_bool_compare_and_swap() doesn't exhibit this problem,
> as __sync_bool_compare_and_swap() doesn't update 'old_head'
> with the new value, and we have to re-read it explicitly on each iteration.
>
> Overcome that problem by using a local variable 'head' inside the loop,
> and updating the '*old_head' value only at exit.
> With this change gcc manages to avoid the extra store (and branch).
>
> Depends-on: series-38225 ("deprecate rte_atomicNN family")
>
> Signed-off-by: Konstantin Ananyev <[email protected]>
> ---
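
For anyone following along, the two loop shapes being compared can be sketched roughly as below. This is a simplified illustration, not the DPDK code: the function names move_head_direct() and move_head_local() are hypothetical, the free-space check against the opposite tail is omitted, and the memory orderings are assumptions.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Variant A (hypothetical name): the caller's out-parameter is used
 * directly as the CAS 'expected' operand, so a failed CAS has to write
 * the observed value back through the pointer before retrying.
 */
static inline uint32_t
move_head_direct(_Atomic uint32_t *head, uint32_t n, uint32_t *old_head)
{
	uint32_t new_head;

	*old_head = atomic_load_explicit(head, memory_order_relaxed);
	do {
		new_head = *old_head + n;
	} while (!atomic_compare_exchange_strong_explicit(head, old_head,
			new_head, memory_order_acquire, memory_order_relaxed));

	return new_head;
}

/*
 * Variant B (hypothetical name): a local variable is the CAS 'expected'
 * operand and '*old_head' is written once, after the loop has finished.
 */
static inline uint32_t
move_head_local(_Atomic uint32_t *head, uint32_t n, uint32_t *old_head)
{
	uint32_t cur, new_head;

	cur = atomic_load_explicit(head, memory_order_relaxed);
	do {
		new_head = cur + n;
	} while (!atomic_compare_exchange_strong_explicit(head, &cur,
			new_head, memory_order_acquire, memory_order_relaxed));

	*old_head = cur;
	return new_head;
}
```

Both variants return the same values. The only intended difference is where the compiler is allowed to keep the expected value between retries, which is what the patch tries to influence.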
I ran the standard ring perf tests 10 times via:

#!/bin/bash
if [ -z "$1" ]; then
    echo "Usage $0 version"
    exit 1
fi
VERSION=$1
for i in $(seq 1 10); do
    sudo DPDK_TEST=ring_perf_autotest \
        ./build/app/dpdk-test -l 2-5 -n 4 --no-pci --file-prefix=run$i \
        > ~/DPDK/ring_perf_results/${VERSION}_run${i}.log 2>&1
    echo "${VERSION} run $i done"
done
Then I had Claude compare the results:
Key metric (two physical cores, legacy MP/MC bulk, n=128):
    main:        5.380 cycles/elem
    sync-bool:   5.377 cycles/elem (-0.07%)
    avoid-store: 5.892 cycles/elem (+9.52%) ← regresses
Looking at the disassembly of ring_enqueue_bulk:
The inner loop of the main and sync-bool versions is:
mov 0x80(%rdi),%r11d ; load d->head via displacement
mov 0x104(%rdi),%ebx ; load s->tail
add %ecx,%ebx
sub %r11d,%ebx
cmp %ebx,%r12d
jae [exit]
lea (%r8,%r11,1),%r13d ; new_head = old_head + n
mov %r11d,%eax ; expected → eax
lock cmpxchg %r13d,0x80(%rdi) ; ← displacement addressing
jne [retry] ; ← direct jne, eax preserved
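
As a rough C-level picture of the loop behind the listing above, assuming the __sync_bool_compare_and_swap() form described in the commit message, something like the following applies. The function name move_head_sync_bool() is hypothetical and the free-space check is left out, so treat it as a sketch rather than the actual DPDK source.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the CAS builtin only reports success or failure,
 * so the head is re-read explicitly at the top of every iteration.
 */
static inline uint32_t
move_head_sync_bool(volatile uint32_t *head, uint32_t n, uint32_t *old_head)
{
	uint32_t new_head;

	do {
		*old_head = *head;		/* explicit re-read each attempt */
		new_head = *old_head + n;
	} while (!__sync_bool_compare_and_swap(head, *old_head, new_head));

	return new_head;
}
```

Because the builtin takes the expected value by value, nothing needs to be stored back after a failed attempt.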
Using atomic_compare_exchange and your patch:
mov 0x38(%rdi),%r10d
mov 0x80(%rdi),%eax ; load d->head directly into %eax
lea 0x80(%rdi),%rcx ; ← MATERIALIZE &d->head into %rcx
lea -0x1(%r8),%r12d
mov 0x104(%rdi),%r11d
add %r10d,%r11d
sub %eax,%r11d
cmp %r11d,%r12d
jae [exit]
lea (%r8,%rax,1),%r13d ; new_head
lock cmpxchg %r13d,(%rcx) ; ← INDIRECT addressing via %rcx
mov %eax,%ebx ; ← EXTRA: save post-CAS %eax to %ebx
jne [retry]
Bottom line: good idea, but we are still fighting the GCC optimizer here.