Hi,

This is more a note for the future than something I'm planning to pursue right now: it turns out our x86 full memory barrier could be sped up noticeably on larger systems with a trivial change.
Just changing

    __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")

into something like

    __asm__ __volatile__ ("lock; addl $0,-8(%%rsp)" : : : "memory", "cc")

makes the barrier faster, because the locked read-modify-write no longer targets the memory at the top of the stack, and therefore no longer creates a dependency slowing down other uses of that location. Which, %rsp being the stack pointer, are not rare. Obviously this requires having a few addressable bytes below the stack pointer, but I can't see that ever being a problem on x86; the x86-64 SysV ABI even guarantees a 128-byte red zone below %rsp.

More details, among other things, are available at:
https://shipilev.net/blog/2014/on-the-fence-with-dependencies/

In the past we've run into barriers being relevant for performance both around the latch code and in shm_mq, so it's probably worth trying the above in a benchmark exercising either heavily. A sketch of the change and a rough microbenchmark to start from are appended below.

Greetings,

Andres Freund
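
To make the proposal concrete, here's roughly what the change would look like as a macro definition. The pg_memory_barrier_impl name is what I'd expect in our x86 atomics header; take this as an untested sketch, not a patch:

    /*
     * Full memory barrier via a locked read-modify-write below the
     * stack pointer, avoiding a dependency on the live top-of-stack
     * slot.  Sketch only, untested.
     */
    #define pg_memory_barrier_impl() \
        __asm__ __volatile__ ("lock; addl $0,-8(%%rsp)" : : : "memory", "cc")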
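
And here's a rough single-threaded microbenchmark sketch to start from (entirely mine and untested, and no substitute for exercising the latch code or shm_mq on a big machine). It interleaves each barrier variant with a non-inlined call, since call/ret traffic goes through the very top-of-stack slot the current barrier read-modify-writes:

    /*
     * barrier-bench.c - rough sketch; build with something like:
     *     cc -O2 -o barrier-bench barrier-bench.c
     */
    #include <stdio.h>
    #include <time.h>

    /* current barrier: locked RMW on the top-of-stack slot */
    #define BARRIER_TOS() \
        __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" : : : "memory", "cc")
    /* proposed barrier: locked RMW on a red-zone slot below %rsp */
    #define BARRIER_BELOW() \
        __asm__ __volatile__ ("lock; addl $0,-8(%%rsp)" : : : "memory", "cc")

    /*
     * A non-inlined call forces real traffic through the top of the
     * stack: call pushes the return address, ret loads it back.
     */
    __attribute__((noinline)) static void
    touch_stack(void)
    {
        __asm__ __volatile__ ("" ::: "memory");
    }

    static void iter_tos(void)   { BARRIER_TOS();   touch_stack(); }
    static void iter_below(void) { BARRIER_BELOW(); touch_stack(); }

    /* returns average ns per iteration */
    static double
    run(void (*iter)(void), long n)
    {
        struct timespec s, e;

        clock_gettime(CLOCK_MONOTONIC, &s);
        for (long i = 0; i < n; i++)
            iter();
        clock_gettime(CLOCK_MONOTONIC, &e);
        return ((e.tv_sec - s.tv_sec) * 1e9 + (e.tv_nsec - s.tv_nsec)) / n;
    }

    int
    main(void)
    {
        long n = 100 * 1000 * 1000;

        printf("0(%%rsp):  %.2f ns/iter\n", run(iter_tos, n));
        printf("-8(%%rsp): %.2f ns/iter\n", run(iter_below, n));
        return 0;
    }

If the dependency theory holds, the -8(%rsp) variant should come out ahead, with the gap presumably growing on larger systems.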