On 16-11-15 05:32 AM, Jesper Dangaard Brouer wrote:
> 
> (Looks like my message didn't reach the netdev list, due to me sending
> it from the wrong email address, so I am forwarding it again):
> 
> On Thu, 10 Nov 2016 20:44:08 -0800 John Fastabend <john.fastab...@gmail.com> 
> wrote:
> 
>> ---
>>  include/linux/ptr_ring_ll.h |  136 +++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/skb_array.h   |   25 ++++++++
>>  2 files changed, 161 insertions(+)
>>  create mode 100644 include/linux/ptr_ring_ll.h
>>
>> diff --git a/include/linux/ptr_ring_ll.h b/include/linux/ptr_ring_ll.h
>> new file mode 100644
>> index 0000000..bcb11f3
>> --- /dev/null
>> +++ b/include/linux/ptr_ring_ll.h
>> @@ -0,0 +1,136 @@
>> +/*
>> + *  Definitions for the 'struct ptr_ring_ll' datastructure.
>> + *
>> + *  Author:
>> + *          John Fastabend <john.r.fastab...@intel.com>  
> [...]
>> + *
>> + *  This is a limited-size FIFO maintaining pointers in FIFO order, with
>> + *  one CPU producing entries and another consuming entries from a FIFO.
>> + *  Extended from ptr_ring to use cmpxchg instead of a spin lock.
> 
> It sounds like this is a Single Producer Single Consumer (SPSC)
> implementation, but your implementation is actually Multi Producer
> Multi Consumer (MPMC) capable.

Correct, the qdisc requires an MPMC queue to handle all the
out-of-order (OOO) cases.

> 
> The implementation looks a lot like my alf_queue[1] implementation:
>  [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h
> 

Sure, I was using that implementation originally.

> If the primary use-case is one CPU producing and another consuming,
> then the normal ptr_ring (skb_array) will actually be faster!
> 
> The reason is ptr_ring avoids bouncing a cache-line between the CPUs on
> every ring access.  This is achieved by having the checks for full
> (__ptr_ring_full) and empty (__ptr_ring_empty) use the contents of the
> array (NULL value).
> 
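For reference, my rough mental model of that trick (a sketch of the
idea only, not the actual ptr_ring code) is that each side tests the
slot it is about to touch, never the peer's index:

	/* producer side: the slot stays non-NULL until the consumer
	 * empties it, so "full" never reads the consumer's index */
	static inline bool ring_full(void **queue, u32 prod_idx)
	{
		return queue[prod_idx] != NULL;
	}

	/* consumer side: a NULL slot means nothing was produced here
	 * yet, so "empty" never reads the producer's index */
	static inline bool ring_empty(void **queue, u32 cons_idx)
	{
		return queue[cons_idx] == NULL;
	}
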
> I actually implemented two micro-benchmarks to measure the difference
> between skb_array[2] and alf_queue[3]:
>  [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
>  [3] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/alf_queue_parallel01.c
> 

But :) this doesn't jibe with my experiments, where this implementation
was actually giving better numbers with pktgen over pfifo_fast even in
the SPSC case. I'll rerun the metrics later this week; it's possible
some other issue was causing the difference.

As I noted in my reply to Michael, though, I really need to fix a bug
in my qdisc code and submit that before I worry too much about this
optimization.

> 
>> + */
>> +
>> +#ifndef _LINUX_PTR_RING_LL_H
>> +#define _LINUX_PTR_RING_LL_H 1
>> +  
> [...]
>> +
>> +struct ptr_ring_ll {
>> +    u32 prod_size;
>> +    u32 prod_mask;
>> +    u32 prod_head;
>> +    u32 prod_tail;
>> +    u32 cons_size;
>> +    u32 cons_mask;
>> +    u32 cons_head;
>> +    u32 cons_tail;
>> +
>> +    void **queue;
>> +};  
> 
> Your implementation doesn't even split the consumer and producer into
> different cachelines (which in practice doesn't help much due to how
> the empty/full checks are performed).

It was just an initial implementation to get the qdisc patches off the
ground. I expected to follow up with patches to optimize it.
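
Concretely, what I have in mind for that follow-up is the usual
____cacheline_aligned_in_smp split (sketch only, the exact field
layout is a guess at this point):

	struct ptr_ring_ll {
		/* producer-side state, written on the enqueue path */
		u32 prod_size;
		u32 prod_mask;
		u32 prod_head;
		u32 prod_tail;

		/* consumer-side state, pushed onto its own cacheline */
		u32 cons_size ____cacheline_aligned_in_smp;
		u32 cons_mask;
		u32 cons_head;
		u32 cons_tail;

		void **queue;
	};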


[...]

>> +static inline int ptr_ring_ll_init(struct ptr_ring_ll *r, int size, gfp_t gfp)
>> +{
>> +    r->queue = __ptr_ring_init_queue_alloc(size, gfp);
>> +    if (!r->queue)
>> +            return -ENOMEM;
>> +
>> +    r->prod_size = r->cons_size = size;
>> +    r->prod_mask = r->cons_mask = size - 1;  
> 
> Shouldn't we have some check like is_power_of_2(size), as this code
> looks like it depends on it?
> 

Sure, it is required. I was just expecting callers to get it right.
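Adding the guard is cheap enough though; something like this at the
top of ptr_ring_ll_init() should do it (is_power_of_2() comes from
<linux/log2.h>):

	/* the prod_mask/cons_mask arithmetic only works for 2^n sizes */
	if (size <= 0 || !is_power_of_2(size))
		return -EINVAL;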

>> +    r->prod_tail = r->prod_head = 0;
>> +    r->cons_tail = r->cons_head = 0;
>> +
>> +    return 0;
>> +}
>> +  

[...]

>>  
>> +static inline struct sk_buff *skb_array_ll_consume(struct skb_array_ll *a)
>> +{
>> +    return __ptr_ring_ll_consume(&a->ring);
>> +}
>> +  
> 
> Note that in the Multi Producer Multi Consumer (MPMC) use-case this
> type of queue can be faster than the normal ptr_ring.  And in patch 2
> you implement bulking, which is where the real benefit of this kind
> of queue shows (in the MPMC case).
> 
> What I would really like to see is a lock-free (locked cmpxchg) queue
> implementation that, like ptr_ring, uses the array as the empty/full
> check, and still (somehow) supports bulking.
> 

OK, perhaps worth experimenting with once I can _finally_ get the
qdisc series in.
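
If/when I do, I picture the single-slot consume looking roughly like
this (hypothetical sketch only: the names are made up, it keeps the
slot-based empty check you describe, ignores ABA on u32 index wrap,
and bulking still needs solving):

	static inline void *__ptr_ring_ll_consume_one(struct ptr_ring_ll *r)
	{
		u32 head;
		void *ptr;

		do {
			head = READ_ONCE(r->cons_head);
			/* empty check via the slot contents, like ptr_ring */
			ptr = READ_ONCE(r->queue[head & r->cons_mask]);
			if (!ptr)
				return NULL;
			/* claim the slot; retry if another consumer won */
		} while (cmpxchg(&r->cons_head, head, head + 1) != head);

		/* we own the slot now: clear it so the producer sees it free */
		WRITE_ONCE(r->queue[head & r->cons_mask], NULL);
		return ptr;
	}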

.John
