-----Original Message-----
> Date: Mon, 23 Oct 2017 16:49:01 +0800
> From: Jia He <hejia...@gmail.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Zhao, Bing"
>  <iloveth...@163.com>, Olivier MATZ <olivier.m...@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia...@hxt-semitech.com"
>  <jia...@hxt-semitech.com>, "jie2....@hxt-semitech.com"
>  <jie2....@hxt-semitech.com>, "bing.z...@hxt-semitech.com"
>  <bing.z...@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richard...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
>
> Hi Jerin
>
>
> On 10/20/2017 1:43 PM, Jerin Jacob wrote:
> > -----Original Message-----
> > >
> > > [...]
> > > dependent on each other.
> > > Thus a memory barrier is necessary.
> > Yes. The barrier is necessary.
> > In fact, upstream FreeBSD fixed this issue for arm64. The DPDK ring
> > implementation is derived from FreeBSD's buf_ring.h.
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> >
> > I think the only outstanding issue is how to reduce the performance
> > impact on arm64. I believe using acquire/release semantics instead of
> > rte_smp_rmb() will reduce the performance overhead, as in the similar
> > ring implementations below:
> > freebsd:
> > https://github.com/freebsd/freebsd/blob/master/sys/sys/buf_ring.h#L166
> > odp:
> > https://github.com/Linaro/odp/blob/master/platform/linux-generic/pktio/ring.c
> >
> > Jia,
> > 1) Can you verify that the use of acquire/release semantics fixes the
> > problem on your platform, e.g. the use of atomic_load_acq* as in the
> > reference code?
> > 2) If so, what is the overhead between acquire/release and plain
> > rte_smp_rmb() barriers? Based on that we can decide which path to take.
> I've tested 3 cases.
> The new 3rd case uses the load_acquire barrier (half barrier) you
> mentioned at the above link. The patch looks like:
>
> @@ -408,8 +466,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
>  	/* Reset n to the initial burst count */
>  	n = max;
>
> -	*old_head = r->prod.head;
> -	const uint32_t cons_tail = r->cons.tail;
> +	*old_head = atomic_load_acq_32(&r->prod.head);
> +	const uint32_t cons_tail = atomic_load_acq_32(&r->cons.tail);
>
> @@ -516,14 +576,15 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	/* Restore n as it may change every loop */
>  	n = max;
>
> -	*old_head = r->cons.head;
> -	const uint32_t prod_tail = r->prod.tail;
> +	*old_head = atomic_load_acq_32(&r->cons.head);
> +	const uint32_t prod_tail = atomic_load_acq_32(&r->prod.tail);
>  	/* The subtraction is done between two unsigned 32bit values
>  	 * (the result is always modulo 32 bits even if we have
>  	 * cons_head > prod_tail). So 'entries' is always between 0
>  	 * and size(ring)-1. */
>
> The half barrier patch passed the functional test.
>
> As for the performance comparison on *arm64* (the debug patch is at
> http://dpdk.org/ml/archives/dev/2017-October/079012.html), please see
> the test results below:
>
> [case 1] old code, no barrier
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      689275.001200   task-clock (msec)   #  9.771 CPUs utilized
>               6223   context-switches    #  0.009 K/sec
>                 10   cpu-migrations      #  0.000 K/sec
>                653   page-faults         #  0.001 K/sec
>      1721190914583   cycles              #  2.497 GHz
>      3363238266430   instructions        #  1.95  insn per cycle
>    <not supported>   branches
>           27804740   branch-misses       #  0.00% of all branches
>
>       70.540618825 seconds time elapsed
>
> [case 2] full barrier with rte_smp_rmb()
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      582557.895850   task-clock (msec)   #  9.752 CPUs utilized
>               5242   context-switches    #  0.009 K/sec
>                 10   cpu-migrations      #  0.000 K/sec
>                665   page-faults         #  0.001 K/sec
>      1454360730055   cycles              #  2.497 GHz
>       587197839907   instructions        #  0.40  insn per cycle
>    <not supported>   branches
>           27799687   branch-misses       #  0.00% of all branches
>
>       59.735582356 seconds time elapsed
>
> [case 3] half barrier with load_acquire
> ============================================
>  Performance counter stats for './test --no-huge -l 1-10':
>
>      660758.877050   task-clock (msec)   #  9.764 CPUs utilized
>               5982   context-switches    #  0.009 K/sec
>                 11   cpu-migrations      #  0.000 K/sec
>                657   page-faults         #  0.001 K/sec
>      1649875318044   cycles              #  2.497 GHz
>       591583257765   instructions        #  0.36  insn per cycle
>    <not supported>   branches
>           27994903   branch-misses       #  0.00% of all branches
>
>       67.672855107 seconds time elapsed
>
> Please note the context-switches in the perf results.
> The test results sorted by running time are:
> full barrier < half barrier < no barrier
>
> AFAICT, in this case, the CPU reordering adds the possibility of
> context switching and increases the running time.
> Any ideas?

Regarding the performance test, it is better to use the ring perf test
case on _isolated_ cores to measure the impact on the number of
enqueue/dequeue operations.

example:
./build/app/test -c 0xff -n 4
>>ring_perf_autotest

By default, arm64 + DPDK will be using the el0 counter to measure the
cycles. I think, in your SoC, it will be running at 50 MHz or 100 MHz.
So, you can follow the scheme below to get accurate cycle measurements:

See: http://dpdk.org/doc/guides/prog_guide/profile_app.html
check: 44.2.2. High-resolution cycle counter
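[Editor's note: a sketch of the suggested measurement flow. The core mask, isolcpus list, and build path are placeholders that depend on the target system and build layout; treat every value here as an assumption.]

```shell
# Reserve cores for the benchmark at boot so the scheduler leaves them
# alone (placeholder core list; add to the kernel command line):
#   isolcpus=1-7 nohz_full=1-7

# Run the DPDK test app pinned to those cores; -c 0xff is an example
# core mask and the binary path depends on how DPDK was built:
./build/app/test -c 0xff -n 4

# Then, at the interactive RTE>> prompt of the test app:
#   RTE>> ring_perf_autotest
```

Isolating the cores keeps context switches and migrations (visible in the perf counters above) from polluting the enqueue/dequeue cycle counts.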