> > Then had Claude compare results:
> >
> > Key metric (two physical cores legacy MP/MC bulk n=128):
> >   main:        5.380 cycles/elem
> >   sync-bool:   5.377 cycles/elem  (-0.07%)
> >   avoid-store: 5.892 cycles/elem  (+9.52%)  ← regresses
> >
> >
> > Looking at the disassembly of ring_enqueue_bulk:
> >
> > The inner loop of main and sync-bool versions is:
> > mov    0x80(%rdi),%r11d            ; load d->head via displacement
> > mov    0x104(%rdi),%ebx             ; load s->tail
> > add    %ecx,%ebx
> > sub    %r11d,%ebx
> > cmp    %ebx,%r12d
> > jae    [exit]
> > lea    (%r8,%r11,1),%r13d           ; new_head = old_head + n
> > mov    %r11d,%eax                   ; expected → eax
> > lock cmpxchg %r13d,0x80(%rdi)       ; ← displacement addressing
> > jne    [retry]                      ; ← direct jne, eax preserved
> >
> > Using atomic_compare_exchange and your patch:
> > mov    0x38(%rdi),%r10d
> > mov    0x80(%rdi),%eax              ; load d->head directly into %eax
> > lea    0x80(%rdi),%rcx               ; ← MATERIALIZE &d->head into %rcx
> > lea    -0x1(%r8),%r12d
> > mov    0x104(%rdi),%r11d
> > add    %r10d,%r11d
> > sub    %eax,%r11d
> > cmp    %r11d,%r12d
> > jae    [exit]
> > lea    (%r8,%rax,1),%r13d           ; new_head
> > lock cmpxchg %r13d,(%rcx)           ; ← INDIRECT addressing via %rcx
> > mov    %eax,%ebx                    ; ← EXTRA: save post-CAS %eax to %ebx
> > jne    [retry]
> >
> > Bottom line: good idea, but still fighting with the GCC optimizer here.
> 
> Thanks for trying.
> On my box (AMD EPYC 9534) with the same test, there is not much difference
> between any of them:
> use-sync-bool:            2.2273
> use-c11-current-version:  2.2422
> use-c11-patched:          2.2431
> Anyway, -10% on some boxes is probably a good enough reason to keep a
> specific version for __rte_ring_headtail_move_head_mt().
> My ask would be to have some special macro for it, so users can
> enable/disable it via 'meson setup' at will.

This seems very exotic as a meson command line option.
Either put it in rte_config.h, or make it CPU specific.
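
For illustration only, a minimal sketch of what a compile-time switch in a
config header could look like. Everything named example_* or EXAMPLE_* is
hypothetical and not an existing DPDK symbol or option; the GCC builtin
__atomic_compare_exchange_n stands in for the C11 path discussed above, and
the memory ordering is simplified compared to the real ring code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical knob, e.g. set from a config header: 1 selects the legacy
 * __sync builtin, 0 selects the C11-style compare-exchange. */
#ifndef EXAMPLE_RING_USE_SYNC_BOOL
#define EXAMPLE_RING_USE_SYNC_BOOL 1
#endif

/* Hypothetical, simplified stand-in for a ring head/tail pair. */
struct example_headtail {
	uint32_t head;
	uint32_t tail;
};

/*
 * Try to reserve n slots by advancing d->head; s is the opposing side.
 * On success, store the previous head in *old_head and return true.
 * Return false if fewer than n slots are free.
 */
static inline bool
example_move_head(struct example_headtail *d, struct example_headtail *s,
		uint32_t capacity, uint32_t n, uint32_t *old_head)
{
	uint32_t head, free_entries;

	assert(n > 0 && n <= capacity);
#if EXAMPLE_RING_USE_SYNC_BOOL
	do {
		/* The builtin only reports success, so reload head each try. */
		head = __atomic_load_n(&d->head, __ATOMIC_RELAXED);
		free_entries = capacity +
			__atomic_load_n(&s->tail, __ATOMIC_ACQUIRE) - head;
		if (n > free_entries)
			return false;
	} while (!__sync_bool_compare_and_swap(&d->head, head, head + n));
#else
	head = __atomic_load_n(&d->head, __ATOMIC_RELAXED);
	do {
		/* On failure the builtin writes the current value into head. */
		free_entries = capacity +
			__atomic_load_n(&s->tail, __ATOMIC_ACQUIRE) - head;
		if (n > free_entries)
			return false;
	} while (!__atomic_compare_exchange_n(&d->head, &head, head + n,
			false, __ATOMIC_RELAXED, __ATOMIC_RELAXED));
#endif
	*old_head = head;
	return true;
}
```

With a switch like this, the default could also be derived from the target
CPU rather than being set by hand, so the faster variant is picked per
platform without exposing a build option.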
