On Mon, Oct 23, 2017 at 09:05:24AM +0000, Kuusisaari, Juhamatti wrote:
> > Hi,
> >
> > On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > > -----Original Message-----
> > >>
> > [...]
> > >> dependent on each other. Thus a memory barrier is necessary.
> > > Yes. The barrier is necessary. In fact, upstream FreeBSD fixed
> > > this issue for arm64. The DPDK ring implementation is derived from
> > > FreeBSD's buf_ring.h.
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > >
> > > I think the only outstanding issue is how to reduce the
> > > performance impact for arm64. I believe using acquire/release
> > > semantics instead of rte_smp_rmb() will reduce the performance
> > > overhead, like the similar ring implementations below:
> > > freebsd:
> > > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > > odp:
> > > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> > >
> > > Jia,
> > > 1) Can you verify that the use of acquire/release semantics fixes
> > > the problem on your platform? e.g. the use of atomic_load_acq* in the
> > > reference code.
> > > 2) If so, what is the overhead between acquire/release and plain
> > > rte_smp_rmb() barriers? Based on that we need to decide which path to
> > > take.
> > I've tested 3 cases. The new 3rd case uses the load_acquire
> > barrier (half barrier) you mentioned at the link above. The patch looks
> > like this:
> >
> > @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
> >                 /* Reset n to the initial burst count */
> >                 n = max;
> >
> > -               *old_head = r->prod.head;
> > -               const uint32_t cons_tail = r->cons.tail;
> > +               *old_head = atomic_load_acq_32(&r->prod.head);
> > +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
> >
> > @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_s
> >                 /* Restore n as it may change every loop */
> >                 n = max;
> >
> > -               *old_head = r->cons.head;
> > -               const uint32_t prod_tail = r->prod.tail;
> > +               *old_head = atomic_load_acq_32(&r->cons.head);
> > +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
> >                 /* The subtraction is done between two unsigned 32bits value
> >                  * (the result is always modulo 32 bits even if we have
> >                  * cons_head > prod_tail). So 'entries' is always between 0
> >                  * and size(ring)-1. */
> >
> > The half barrier patch passed the functional test.
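For readers comparing with the FreeBSD reference code, the atomic_load_acq_32()
calls in the quoted patch correspond roughly to acquire loads in the C11/GCC
model. Below is a minimal sketch of the same half-barrier idea using the
GCC/Clang __atomic builtins; the struct and function names are made up for
illustration and do not reflect the actual rte_ring layout or the final patch:

    #include <stdint.h>

    /* Simplified head/tail pair; the real rte_ring keeps producer and
     * consumer state in separate, cache-aligned structures. */
    struct ht {
            uint32_t head;
            uint32_t tail;
    };

    /* Read the producer head and consumer tail with acquire semantics.
     * An acquire load keeps later loads from being reordered before it,
     * which is the kind of weak-ordering effect discussed in this thread. */
    static inline void
    load_prod_head_cons_tail(const struct ht *prod, const struct ht *cons,
                             uint32_t *old_head, uint32_t *cons_tail)
    {
            *old_head  = __atomic_load_n(&prod->head, __ATOMIC_ACQUIRE);
            *cons_tail = __atomic_load_n(&cons->tail, __ATOMIC_ACQUIRE);
    }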
> >
> > As for the performance comparison on *arm64* (the debug patch is at
> > http://dpdk.org/ml/archives/dev/2017-October/079012.html), please
> > see the test results below:
> >
> > [case 1] old code, no barrier
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      689275.001200      task-clock (msec)     #    9.771 CPUs utilized
> >               6223      context-switches      #    0.009 K/sec
> >                 10      cpu-migrations        #    0.000 K/sec
> >                653      page-faults           #    0.001 K/sec
> >      1721190914583      cycles                #    2.497 GHz
> >      3363238266430      instructions          #    1.95  insn per cycle
> >    <not supported>      branches
> >           27804740      branch-misses         #    0.00% of all branches
> >
> >       70.540618825 seconds time elapsed
> >
> > [case 2] full barrier with rte_smp_rmb()
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      582557.895850      task-clock (msec)     #    9.752 CPUs utilized
> >               5242      context-switches      #    0.009 K/sec
> >                 10      cpu-migrations        #    0.000 K/sec
> >                665      page-faults           #    0.001 K/sec
> >      1454360730055      cycles                #    2.497 GHz
> >       587197839907      instructions          #    0.40  insn per cycle
> >    <not supported>      branches
> >           27799687      branch-misses         #    0.00% of all branches
> >
> >       59.735582356 seconds time elapsed
> >
> > [case 3] half barrier with load_acquire
> > ============================================
> >  Performance counter stats for './test --no-huge -l 1-10':
> >
> >      660758.877050      task-clock (msec)     #    9.764 CPUs utilized
> >               5982      context-switches      #    0.009 K/sec
> >                 11      cpu-migrations        #    0.000 K/sec
> >                657      page-faults           #    0.001 K/sec
> >      1649875318044      cycles                #    2.497 GHz
> >       591583257765      instructions          #    0.36  insn per cycle
> >    <not supported>      branches
> >           27994903      branch-misses         #    0.00% of all branches
> >
> >       67.672855107 seconds time elapsed
> >
> > Please note the context-switches in the perf results. The test results
> > sorted by running time are: full barrier < half barrier < no barrier.
> >
> > AFAICT, in this case the CPU reordering adds the possibility of
> > context switching and increases the running time.
> >
> > Any ideas?
>
> You could also try a case where you rearrange the rte_ring structure
> so that prod.head and cons.tail are part of a union with a 64-bit
> variable, and then read this 64-bit variable with one atomic read. I do
> not think that even a half barrier is needed here with this approach, as
> long as you can really read the 64-bit variable fully at once. This
> should speed things up.
>
> Cheers,
> --

Yes, that would work, but unfortunately it means mixing producer and
consumer data onto the same cacheline, which may have other negative
effects on performance (depending on arch/platform) due to cores
invalidating each other's cachelines.
/Bruce
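To illustrate the union suggestion above, here is a rough sketch only, with
hypothetical names, assuming a little-endian layout and the GCC/Clang __atomic
builtins; it is not the actual rte_ring definition, and it carries the
false-sharing caveat Bruce raises, since both indices end up on one cacheline:

    #include <stdint.h>

    /* Hypothetical layout: prod.head and cons.tail overlap a single
     * 64-bit word so both can be read consistently with one atomic load.
     * Note this puts producer and consumer data on the same cacheline. */
    union ring_indexes {
            uint64_t raw;
            struct {
                    uint32_t prod_head;   /* low 32 bits on little-endian */
                    uint32_t cons_tail;   /* high 32 bits on little-endian */
            };
    };

    static inline void
    snapshot_indexes(const union ring_indexes *idx,
                     uint32_t *prod_head, uint32_t *cons_tail)
    {
            /* One 64-bit load yields a consistent pair of indices, so no
             * separate barrier is needed between the two reads. */
            uint64_t snap = __atomic_load_n(&idx->raw, __ATOMIC_RELAXED);

            *prod_head = (uint32_t)(snap & 0xffffffffu);
            *cons_tail = (uint32_t)(snap >> 32);
    }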