On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history we can use it to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra <pet...@infradead.org>
Date:   Mon Nov 25 11:49:10 2013 +0100

    perf: Optimize ring-buffer write by depending on control dependencies
    
    Remove a full barrier from the ring-buffer write path by relying on
    a control dependency to order a LOAD -> STORE scenario.
    
    Cc: "Paul E. McKenney" <paul...@us.ibm.com>
    Signed-off-by: Peter Zijlstra <pet...@infradead.org>
    Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
    Signed-off-by: Ingo Molnar <mi...@kernel.org>

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct 
perf_output_handle *handle)
         *
         *   kernel                             user
         *
-        *   READ ->data_tail                   READ ->data_head
-        *   smp_mb()   (A)                     smp_rmb()       (C)
-        *   WRITE $data                        READ $data
-        *   smp_wmb()  (B)                     smp_mb()        (D)
-        *   STORE ->data_head                  WRITE ->data_tail
+        *   if (LOAD ->data_tail) {            LOAD ->data_head
+        *                      (A)             smp_rmb()       (C)
+        *      STORE $data                     LOAD $data
+        *      smp_wmb()       (B)             smp_mb()        (D)
+        *      STORE ->data_head               STORE ->data_tail
+        *   }
         *
         * Where A pairs with D, and B pairs with C.
         *
-        * I don't think A needs to be a full barrier because we won't in fact
-        * write data until we see the store from userspace. So we simply don't
-        * issue the data WRITE until we observe it. Be conservative for now.
+        * In our case (A) is a control dependency that separates the load of
+        * the ->data_tail and the stores of $data. In case ->data_tail
+        * indicates there is no room in the buffer to store $data we do not.
         *
-        * OTOH, D needs to be a full barrier since it separates the data READ
+        * D needs to be a full barrier since it separates the data READ
         * from the tail WRITE.
         *
         * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle 
*handle)
         *
         * See perf_output_begin().
         */
-       smp_wmb();
+       smp_wmb(); /* B, matches C */
        rb->user_page->data_head = head;
 
        /*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
                if (!rb->overwrite &&
                    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
                        goto fail;
+
+               /*
+                * The above forms a control dependency barrier separating the
+                * @tail load above from the data stores below. Since the @tail
+                * load is required to compute the branch to fail below.
+                *
+                * A, matches D; the full memory barrier userspace SHOULD issue
+                * after reading the data and before storing the new tail
+                * position.
+                *
+                * See perf_output_put_handle().
+                */
+
                head += size;
        } while (local_cmpxchg(&rb->head, offset, head) != offset);
 
        /*
-        * Separate the userpage->tail read from the data stores below.
-        * Matches the MB userspace SHOULD issue after reading the data
-        * and before storing the new tail position.
-        *
-        * See perf_output_put_handle().
+        * We rely on the implied barrier() by local_cmpxchg() to ensure
+        * none of the data stores below can be lifted up by the compiler.
         */
-       smp_mb();
 
        if (unlikely(head - local_read(&rb->wakeup) > rb->watermark))
                local_add(rb->watermark, &rb->wakeup);

Reply via email to