On Thu, Oct 18, 2018 at 01:10:15AM +0200, Daniel Borkmann wrote:
> Wouldn't this then also allow the kernel side to use smp_store_release()
> when it updates the head? We'd be pretty much at the model as described
> in Documentation/core-api/circular-buffers.rst.
>
> Meaning, rough pseudo-code diff would look as:
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index 5d3cf40..3d96275 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -84,8 +84,9 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  	 *
>  	 * See perf_output_begin().
>  	 */
> -	smp_wmb(); /* B, matches C */
> -	rb->user_page->data_head = head;
> +
> +	/* B, matches C */
> +	smp_store_release(&rb->user_page->data_head, head);

Yes, this would be correct.

The reason we didn't do this is that smp_store_release() ends up being
smp_mb() + WRITE_ONCE() on a fair number of platforms, even if they have
a cheaper smp_wmb(). Most notably ARM.

(ARM64 OTOH would like to have smp_store_release() there, I imagine,
while x86 doesn't care either way.)

A similar concern exists for the smp_load_acquire() I proposed for the
userspace side; ARM would have to resort to smp_mb() in that situation,
instead of the cheaper smp_rmb().

The smp_store_release() on the userspace side will actually be of equal
cost or cheaper, since that path already has an smp_mb(). Most notably,
x86 can avoid the barrier entirely, because TSO doesn't allow the
LOAD-STORE reorder (it only allows the STORE-LOAD reorder). And PowerPC
can use LWSYNC instead of SYNC.
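
For illustration only, here is a rough userspace sketch of the
acquire/release pairing discussed above, written with C11 atomics rather
than the kernel primitives. Everything in it (struct ring, ring_put(),
ring_get(), RING_SIZE) is a hypothetical stand-in, not the perf mmap ABI,
and the producer reads the tail with acquire purely for simplicity; the
kernel side orders that differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size, must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be 2^n");

/* Hypothetical stand-in for the data_head/data_tail pair. */
struct ring {
	_Atomic uint64_t head;		/* written by the producer only */
	_Atomic uint64_t tail;		/* written by the consumer only */
	uint64_t data[RING_SIZE];
};

/* Producer side. Returns 1 if v was queued, 0 if the ring is full. */
static int ring_put(struct ring *r, uint64_t v)
{
	uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	/* Pairs with the consumer's release store of tail. */
	uint64_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return 0;

	r->data[head & (RING_SIZE - 1)] = v;

	/* Release: the data store above is ordered before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 1;
}

/* Consumer side. Returns 1 and fills *out, or 0 if the ring is empty. */
static int ring_get(struct ring *r, uint64_t *out)
{
	uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	/* Acquire: pairs with the producer's release store of head. */
	uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return 0;

	*out = r->data[tail & (RING_SIZE - 1)];

	/* Release: the data load above is ordered before the slot is freed. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return 1;
}
```

With a typical compiler the acquire load and release store become plain
loads and stores on x86, LDAR/STLR on ARM64, and a full DMB on 32-bit ARM,
which is the cost difference described above.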