On 4 Sep 2020, at 14:36, Jessica Clarke <jrt...@freebsd.org> wrote:
> On 4 Sep 2020, at 14:09, Konstantin Belousov <kostik...@gmail.com> wrote:
>> On Fri, Sep 04, 2020 at 11:22:18AM +0000, Marcin Wojtas wrote:
>>> Author: mw
>>> Date: Fri Sep  4 11:22:18 2020
>>> New Revision: 365326
>>> URL: https://svnweb.freebsd.org/changeset/base/365326
>>> 
>>> Log:
>>> MFC: r346593
>>> 
>>> Add barrier in buf ring peek function to prevent race in ARM and ARM64.
>>> 
>>> Obtained from: Semihalf
>>> Sponsored by: Amazon, Inc.
>>> 
>>> Modified:
>>> stable/12/sys/sys/buf_ring.h
>>> Directory Properties:
>>> stable/12/   (props changed)
>>> 
>>> Modified: stable/12/sys/sys/buf_ring.h
>>> ==============================================================================
>>> --- stable/12/sys/sys/buf_ring.h    Fri Sep  4 04:31:56 2020        
>>> (r365325)
>>> +++ stable/12/sys/sys/buf_ring.h    Fri Sep  4 11:22:18 2020        
>>> (r365326)
>>> @@ -310,14 +310,23 @@ buf_ring_peek_clear_sc(struct buf_ring *br)
>>>     if (!mtx_owned(br->br_lock))
>>>             panic("lock not held on single consumer dequeue");
>>> #endif      
>>> -   /*
>>> -    * I believe it is safe to not have a memory barrier
>>> -    * here because we control cons and tail is worst case
>>> -    * a lagging indicator so we worst case we might
>>> -    * return NULL immediately after a buffer has been enqueued
>>> -    */
>>> +
>>>     if (br->br_cons_head == br->br_prod_tail)
>>>             return (NULL);
>>> +
>>> +#if defined(__arm__) || defined(__aarch64__)
>>> +   /*
>>> +    * The barrier is required there on ARM and ARM64 to ensure, that
>>> +    * br->br_ring[br->br_cons_head] will not be fetched before the above
>>> +    * condition is checked.
>>> +    * Without the barrier, it is possible, that buffer will be fetched
>>> +    * before the enqueue will put mbuf into br, then, in the meantime, the
>>> +    * enqueue will update the array and the br_prod_tail, and the
>>> +    * conditional check will be true, so we will return previously fetched
>>> +    * (and invalid) buffer.
>>> +    */
>>> +   atomic_thread_fence_acq();
>>> +#endif
>> 
>> Putting the semantic of the change aside, why did you added the fence (it is
>> a fence, not barrier as stated in the comment) only to arm* ?   If it is
>> needed, it is needed for all arches.
> 
> Agreed. The code looks fine, though I would have made it an acquire
> load of br_prod_tail myself to be able to take advantage load-acquire
> instructions when present, and better document what the exact issue is.
> I also don't think the comment needs to be quite so extensive
> (especially since atomic_load_acq_32 is somewhat self-documenting in
> terms of one half of the race); if we had a comment like this for every
> fence in the kernel we'd never get anything done.
> 
> There's also an ARM-specific fence in buf_ring_dequeue_sc:
> 
>>      /*
>>       * This is a workaround to allow using buf_ring on ARM and ARM64.
>>       * ARM64TODO: Fix buf_ring in a generic way.
>>       * REMARKS: It is suspected that br_cons_head does not require
>>       *   load_acq operation, but this change was extensively tested
>>       *   and confirmed it's working. To be reviewed once again in
>>       *   FreeBSD-12.
>>       *
>>       * Preventing following situation:
>> 
>>       * Core(0) - buf_ring_enqueue()                                       
>> Core(1) - buf_ring_dequeue_sc()
>>       * -----------------------------------------                            
>>            ----------------------------------------------
>>       *
>>       *                                                                      
>>           cons_head = br->br_cons_head;
>>       * atomic_cmpset_acq_32(&br->br_prod_head, ...));
>>       *                                                                      
>>           buf = br->br_ring[cons_head];     <see <1>>
>>       * br->br_ring[prod_head] = buf;
>>       * atomic_store_rel_32(&br->br_prod_tail, ...);
>>       *                                                                      
>>           prod_tail = br->br_prod_tail;
>>       *                                                                      
>>           if (cons_head == prod_tail) 
>>       *                                                                      
>>                   return (NULL);
>>       *                                                                      
>>           <condition is false and code uses invalid(old) buf>`  
>>       *
>>       * <1> Load (on core 1) from br->br_ring[cons_head] can be reordered 
>> (speculative readed) by CPU.
>>       */     
>> #if defined(__arm__) || defined(__aarch64__)
>>      cons_head = atomic_load_acq_32(&br->br_cons_head);
>> #else
>>      cons_head = br->br_cons_head;
>> #endif
>>      prod_tail = atomic_load_acq_32(&br->br_prod_tail);
> 
> 
> The comment is completely correct that the ARM-specific fence is a
> waste of time. It's the single-consumer path, so such fences are just
> synchronising with the current thread and thus pointless. The important
> one is the load-acquire of br_prod_tail, as has been discovered (sort
> of) in the peek case leading to this comment, which already stops the
> reordering in question.

Fixes filed as https://reviews.freebsd.org/D26336 for those interested.

Jess

_______________________________________________
svn-src-all@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Reply via email to