> > Then had Claude compare results:
> >
> > Key metric (two physical cores legacy MP/MC bulk n=128):
> > main:        5.380 cycles/elem
> > sync-bool:   5.377 cycles/elem (-0.07%)
> > avoid-store: 5.892 cycles/elem (+9.52%) ← regresses
> >
> >
> > Looking at the dissassembly of ring_enqueue_bulk:
> >
> > The inner loop of main and sync-bool versions is:
> >   mov 0x80(%rdi),%r11d           ; load d->head via displacement
> >   mov 0x104(%rdi),%ebx           ; load s->tail
> >   add %ecx,%ebx
> >   sub %r11d,%ebx
> >   cmp %ebx,%r12d
> >   jae [exit]
> >   lea (%r8,%r11,1),%r13d         ; new_head = old_head + n
> >   mov %r11d,%eax                 ; expected → eax
> >   lock cmpxchg %r13d,0x80(%rdi)  ; ← displacement addressing
> >   jne [retry]                    ; ← direct jne, eax preserved
> >
> > Using atomic_compare_exchange and your patch:
> >   mov 0x38(%rdi),%r10d
> >   mov 0x80(%rdi),%eax            ; load d->head directly into %eax
> >   lea 0x80(%rdi),%rcx            ; ← MATERIALIZE &d->head into %rcx
> >   lea -0x1(%r8),%r12d
> >   mov 0x104(%rdi),%r11d
> >   add %r10d,%r11d
> >   sub %eax,%r11d
> >   cmp %r11d,%r12d
> >   jae [exit]
> >   lea (%r8,%rax,1),%r13d         ; new_head
> >   lock cmpxchg %r13d,(%rcx)      ; ← INDIRECT addressing via %rcx
> >   mov %eax,%ebx                  ; ← EXTRA: save post-CAS %eax to %ebx
> >   jne [retry]
> >
> > Bottom line: good idea but still fighting with Gcc optimizer here.
>
> Thanks for trying.
> On my box (AMD EPYC 9534) with same test, there is no much difference
> between all of them:
> use-sync-bool: 2.2273
> use-c11-current-version: 2.2422
> use-c11-patched: 2.2431
> Anyway, -10% on some boxes - that's probably good enough reason to keep
> specific version for __rte_ring_headtail_move_head_mt().
> My ask would be to have some special macro for it, so users can
> enable/disable it via 'meson setup' at will.

This seems very exotic as a meson command line option. Either put it in rte_config.h, or make it CPU specific.
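
To make the suggestion concrete, something along these lines is roughly what I have in mind. This is only a sketch, not the actual ring code: the macro name RING_MOVE_HEAD_USE_SYNC_BOOL and the helper ring_head_cas() are made up for illustration, the memory orders are placeholders, and it uses the plain GCC/Clang builtins rather than the rte_atomic wrappers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical compile-time switch (name invented for this sketch).
 * It could be defined in rte_config.h, or derived from the target CPU,
 * instead of being exposed as a meson option.
 */
#ifndef RING_MOVE_HEAD_USE_SYNC_BOOL
#define RING_MOVE_HEAD_USE_SYNC_BOOL 1
#endif

/*
 * Try to move *head from *old_head to new_head.
 * Returns true on success. On failure, *old_head is refreshed with the
 * value currently stored in *head so the caller can retry.
 */
static inline bool
ring_head_cas(uint32_t *head, uint32_t *old_head, uint32_t new_head)
{
#if RING_MOVE_HEAD_USE_SYNC_BOOL
	/* Legacy builtin: full barrier, expected value passed by value. */
	if (__sync_bool_compare_and_swap(head, *old_head, new_head))
		return true;
	/* Reload explicitly, since the builtin does not report the old value. */
	*old_head = __atomic_load_n(head, __ATOMIC_RELAXED);
	return false;
#else
	/* C11-style builtin: updates *old_head itself when the CAS fails. */
	return __atomic_compare_exchange_n(head, old_head, new_head,
			false, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
#endif
}
```

Both branches give the caller the same contract, so the retry loop in the move-head function would not need to know which one was compiled in. Whether the default should differ per architecture is a separate question that the numbers above would have to answer.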

