Hi,

> On 10/20/2017 1:43 PM, Jerin Jacob Wrote:
> > -----Original Message-----
> >>
> [...]
> >> dependent on each other.
> >> Thus a memory barrier is necessary.
> > Yes. The barrier is necessary.
> > In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
> > implementation is derived from FreeBSD's buf_ring.h.
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> >
> > I think the only outstanding issue is how to reduce the performance
> > impact for arm64. I believe using acquire/release semantics instead
> > of rte_smp_rmb() will reduce the performance overhead, like the
> > similar ring implementations below:
> > freebsd:
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > odp:
> > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> >
> > Jia,
> > 1) Can you verify that the use of acquire/release semantics fixes the
> > problem on your platform? E.g. the use of atomic_load_acq* in the
> > reference code.
> > 2) If so, what is the overhead between acquire/release and plain
> > smp_rmb() barriers? Based on that we need to decide which path to take.
> I've tested three cases. The new third case uses the load_acquire
> barrier (half barrier) you mentioned at the link above.
> The patch looks like:
>
> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>                 /* Reset n to the initial burst count */
>                 n = max;
>
> -               *old_head = r->prod.head;
> -               const uint32_t cons_tail = r->cons.tail;
> +               *old_head = atomic_load_acq_32(&r->prod.head);
> +               const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
>
> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>                 /* Restore n as it may change every loop */
>                 n = max;
>
> -               *old_head = r->cons.head;
> -               const uint32_t prod_tail = r->prod.tail;
> +               *old_head = atomic_load_acq_32(&r->cons.head);
> +               const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
>                 /* The subtraction is done between two unsigned 32bits value
>                  * (the result is always modulo 32 bits even if we have
>                  * cons_head > prod_tail). So 'entries' is always between 0
>                  * and size(ring)-1. */
>
> The half barrier patch passed the functional test.
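>
> (For reference: atomic_load_acq_32() is the FreeBSD accessor; an
> equivalent sketch built on the GCC builtins would be, roughly,
>
> static inline uint32_t
> atomic_load_acq_32(const volatile uint32_t *p)
> {
>         return __atomic_load_n(p, __ATOMIC_ACQUIRE);
> }
>
> which compiles down to a load-acquire (ldar) instruction on arm64.)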
>
> As for the performance comparison on *arm64* (the debug patch is at
> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
> the test results below:
>
> [case 1] old code, no barrier
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>       689275.001200      task-clock (msec)    #    9.771 CPUs utilized
>                6223      context-switches     #    0.009 K/sec
>                  10      cpu-migrations       #    0.000 K/sec
>                 653      page-faults          #    0.001 K/sec
>       1721190914583      cycles               #    2.497 GHz
>       3363238266430      instructions         #    1.95  insn per cycle
>     <not supported>      branches
>            27804740      branch-misses        #    0.00% of all branches
>
>        70.540618825 seconds time elapsed
>
> [case 2] full barrier with rte_smp_rmb()
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>       582557.895850      task-clock (msec)    #    9.752 CPUs utilized
>                5242      context-switches     #    0.009 K/sec
>                  10      cpu-migrations       #    0.000 K/sec
>                 665      page-faults          #    0.001 K/sec
>       1454360730055      cycles               #    2.497 GHz
>        587197839907      instructions         #    0.40  insn per cycle
>     <not supported>      branches
>            27799687      branch-misses        #    0.00% of all branches
>
>        59.735582356 seconds time elapsed
>
> [case 3] half barrier with load_acquire
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>       660758.877050      task-clock (msec)    #    9.764 CPUs utilized
>                5982      context-switches     #    0.009 K/sec
>                  11      cpu-migrations       #    0.000 K/sec
>                 657      page-faults          #    0.001 K/sec
>       1649875318044      cycles               #    2.497 GHz
>        591583257765      instructions         #    0.36  insn per cycle
>     <not supported>      branches
>            27994903      branch-misses        #    0.00% of all branches
>
>        67.672855107 seconds time elapsed
>
> Please note the context-switches in the perf results.
> The test results sorted by running time are:
> full barrier < half barrier < no barrier
>
> AFAICT, in this case the cpu reordering adds the possibility of
> context switching and increases the running time.
>
> Any ideas?
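
One note on the half-barrier variant first: the acquire loads are only
half of the contract, since they pair with a releasing tail update on
the opposite side. With the same GCC builtins that side would look
roughly like

        /* make the ring-entry writes visible before publishing the tail */
        __atomic_store_n(&r->prod.tail, new_val, __ATOMIC_RELEASE);

in place of the current rte_smp_wmb() + plain tail store.
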
You could also try a case where you rearrange the rte_ring structure so
that prod.head and cons.tail are parts of a union with a single 64-bit
variable, and then read that 64-bit variable with one atomic read. I do
not think the half barrier is even needed with this approach, as long as
you can really read the 64-bit variable fully at once; see the sketch at
the end of this mail. This should speed things up.

Cheers,
--
 Juhamatti

> Cheers,
> Jia
> >
> > Note:
> > This issue won't come up on all arm64 implementations. It comes up on
> > arm64 implementations with OOO (out-of-order) execution.
> >
> >
> >> Cheers,
> >> Jia
> >>
> >>> Konstantin
> >>>
> >>>>> . In another
> >>>>> mail of this thread, we've made a simple test based on this and
> >>>>> captured some information, and I pasted it there. (I pasted the
> >>>>> patch there :-))
> >>>> Are you talking about that one:
> >>>> http://dpdk.org/dev/patchwork/patch/30405/
> >>>> ?
> >>>> It still uses test/test/test_mbuf.c...,
> >>>> but anyway I don't really understand how mbuf_autotest is supposed
> >>>> to work with these changes:
> >>>> @@ -730,7 +739,7 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
> >>>>                         rte_ring_enqueue(refcnt_mbuf_ring, m);
> >>>>                 }
> >>>>         }
> >>>> -       rte_pktmbuf_free(m);
> >>>> +       // rte_pktmbuf_free(m);
> >>>>  }
> >>>> @@ -741,6 +750,12 @@ test_refcnt_iter(unsigned int lcore, unsigned int iter,
> >>>>         while (!rte_ring_empty(refcnt_mbuf_ring))
> >>>>                 ;
> >>>>
> >>>> +       if (NULL != m) {
> >>>> +               if (1 != rte_mbuf_refcnt_read(m))
> >>>> +                       printf("m ref is %u\n", rte_mbuf_refcnt_read(m));
> >>>> +               rte_pktmbuf_free(m);
> >>>> +       }
> >>>> +
> >>>>         /* check that all mbufs are back into mempool by now */
> >>>>         for (wn = 0; wn != REFCNT_MAX_TIMEOUT; wn++) {
> >>>>                 if ((i = rte_mempool_avail_count(refcnt_pool)) == n) {
> >>>>
> >>>> That means all your mbufs (except the last one) will still be
> >>>> allocated. So the test would fail - as it should, I think.
> >>>>
> >>>>> And
> >>>>> it seems that Juhamatti & Jacob found some reverting action several
> >>>>> months ago.
> >>>> Didn't get that one either.
> >>>> Konstantin
>
> --
> Cheers,
> Jia
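
To make the union idea above concrete, here is a minimal sketch. The
type and field names are made up for illustration, and it assumes the
target supports a naturally aligned, single-copy-atomic 64-bit load
(arm64 does):

#include <stdint.h>

/* Hypothetical rearrangement: keep the producer head and the consumer
 * tail adjacent, so that both can be observed with one 64-bit load. */
union ring_heads {
        uint64_t raw;
        struct {
                uint32_t prod_head;
                uint32_t cons_tail;
        };
};

static inline union ring_heads
ring_heads_snapshot(const union ring_heads *h)
{
        union ring_heads snap;

        /* A single atomic 64-bit read returns a consistent pair, so no
         * ordering barrier is needed between the two 32-bit halves. */
        snap.raw = __atomic_load_n(&h->raw, __ATOMIC_RELAXED);
        return snap;
}

One trade-off worth measuring, though: rte_ring currently keeps prod
and cons on separate cache lines precisely to avoid false sharing
between enqueue and dequeue cores, and packing the two words together
gives that up.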