-----Original Message-----
> Date: Tue, 31 Oct 2017 10:55:15 +0800
> From: Jia He <hejia...@gmail.com>
> To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> Cc: "Ananyev, Konstantin" <konstantin.anan...@intel.com>, "Zhao, Bing"
>  <iloveth...@163.com>, Olivier MATZ <olivier.m...@6wind.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "jia...@hxt-semitech.com"
>  <jia...@hxt-semitech.com>, "jie2....@hxt-semitech.com"
>  <jie2....@hxt-semitech.com>, "bing.z...@hxt-semitech.com"
>  <bing.z...@hxt-semitech.com>, "Richardson, Bruce"
>  <bruce.richard...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH] ring: guarantee ordering of cons/prod
>  loading when doing enqueue/dequeue
> User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101
>  Thunderbird/52.4.0
>
> Hi Jerin
Hi Jia,

>
> Do you think next step whether I need to implement the load_acquire half
> barrier as per freebsd

I did a quick prototype using C11 memory model (ACQUIRE/RELEASE) semantics
and tested it on two arm64 platforms at Cavium (Platform A: non-OOO arm64
machine, Platform B: OOO arm64 machine).

smp_rmb() performs better on Platform A.
acquire/release semantics perform better on Platform B.

Here is the patch:
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch

In terms of next steps:
- I am not sure of the cost associated with acquire/release semantics on
  x86 or ppc. IMO, we need to have both options under conditional
  compilation flags and let the target platform choose the best one.

Thoughts?

Here are the performance numbers:
- The two platforms run at different frequencies, so the absolute numbers
  do not matter; just compare the relative numbers.

Platform A: Performance numbers:
================================
no patch (non-OOO arm64 machine)
--------------------------------
SP/SC single enq/dequeue: 40
MP/MC single enq/dequeue: 282
SP/SC burst enq/dequeue (size: 8): 11
MP/MC burst enq/dequeue (size: 8): 42
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 8.01
MC empty dequeue: 11.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.30
MP/MC bulk enq/dequeue (size: 8): 42.85
SP/SC bulk enq/dequeue (size: 32): 8.25
MP/MC bulk enq/dequeue (size: 32): 16.46

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.62
MP/MC bulk enq/dequeue (size: 8): 56.30
SP/SC bulk enq/dequeue (size: 32): 10.94
MP/MC bulk enq/dequeue (size: 32): 18.66
Test OK

# smp_rmb() patch (non-OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-----------------------------------------
SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 291
SP/SC burst enq/dequeue (size: 8): 12
MP/MC burst enq/dequeue (size: 8): 44
SP/SC burst enq/dequeue (size: 32): 8
MP/MC burst enq/dequeue (size: 32): 16

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 15.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 11.60
MP/MC bulk enq/dequeue (size: 8): 44.32
SP/SC bulk enq/dequeue (size: 32): 8.60
MP/MC bulk enq/dequeue (size: 32): 16.50

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 20.95
MP/MC bulk enq/dequeue (size: 8): 56.90
SP/SC bulk enq/dequeue (size: 32): 10.90
MP/MC bulk enq/dequeue (size: 32): 18.78
Test OK
RTE>>

# c11 memory model patch (non-OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
-----------------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 197
MP/MC single enq/dequeue: 328
SP/SC burst enq/dequeue (size: 8): 31
MP/MC burst enq/dequeue (size: 8): 50
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 18

### Testing empty dequeue ###
SC empty dequeue: 13.01
MC empty dequeue: 18.02

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 30.95
MP/MC bulk enq/dequeue (size: 8): 50.30
SP/SC bulk enq/dequeue (size: 32): 13.27
MP/MC bulk enq/dequeue (size: 32): 18.11

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 43.38
MP/MC bulk enq/dequeue (size: 8): 64.42
SP/SC bulk enq/dequeue (size: 32): 16.71
MP/MC bulk enq/dequeue (size: 32): 22.21

Platform B: Performance numbers:
================================
# no patch (OOO arm64 machine)
------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 81
MP/MC single enq/dequeue: 207
SP/SC burst enq/dequeue (size: 8): 15
MP/MC burst enq/dequeue (size: 8): 31
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 11

### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 5.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 15.38
MP/MC bulk enq/dequeue (size: 8): 30.64
SP/SC bulk enq/dequeue (size: 32): 7.25
MP/MC bulk enq/dequeue (size: 32): 11.06

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.51
MP/MC bulk enq/dequeue (size: 8): 49.38
SP/SC bulk enq/dequeue (size: 32): 14.32
MP/MC bulk enq/dequeue (size: 32): 15.89

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 72.66
MP/MC bulk enq/dequeue (size: 8): 121.89
SP/SC bulk enq/dequeue (size: 32): 16.88
MP/MC bulk enq/dequeue (size: 32): 24.23
Test OK
RTE>>

# smp_rmb() patch (OOO arm64 machine)
http://dpdk.org/dev/patchwork/patch/30029/
-------------------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 152
MP/MC single enq/dequeue: 265
SP/SC burst enq/dequeue (size: 8): 24
MP/MC burst enq/dequeue (size: 8): 39
SP/SC burst enq/dequeue (size: 32): 9
MP/MC burst enq/dequeue (size: 32): 13

### Testing empty dequeue ###
SC empty dequeue: 31.01
MC empty dequeue: 32.01

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 24.26
MP/MC bulk enq/dequeue (size: 8): 39.52
SP/SC bulk enq/dequeue (size: 32): 9.47
MP/MC bulk enq/dequeue (size: 32): 13.31

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 40.29
MP/MC bulk enq/dequeue (size: 8): 59.57
SP/SC bulk enq/dequeue (size: 32): 17.34
MP/MC bulk enq/dequeue (size: 32): 21.58

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 79.05
MP/MC bulk enq/dequeue (size: 8): 153.46
SP/SC bulk enq/dequeue (size: 32): 26.41
MP/MC bulk enq/dequeue (size: 32): 38.37
Test OK
RTE>>

# c11 memory model patch (OOO arm64 machine)
https://github.com/jerinjacobk/mytests/blob/master/ring/0001-ring-using-c11-memory-model.patch
-------------------------------------------
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 98
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 9

### Testing empty dequeue ###
SC empty dequeue: 4.00
MC empty dequeue: 5.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.40
MP/MC bulk enq/dequeue (size: 8): 22.88
SP/SC bulk enq/dequeue (size: 32): 7.62
MP/MC bulk enq/dequeue (size: 32): 8.96

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 20.24
MP/MC bulk enq/dequeue (size: 8): 25.83
SP/SC bulk enq/dequeue (size: 32): 12.21
MP/MC bulk enq/dequeue (size: 32): 13.20

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 67.54
MP/MC bulk enq/dequeue (size: 8): 124.63
SP/SC bulk enq/dequeue (size: 32): 21.13
MP/MC bulk enq/dequeue (size: 32): 28.44
Test OK
RTE>>quit

> or find any other performance test case to compare the performance impact?

As far as I know, ring_perf_autotest is the best performance test for this.
If you have trouble using the "High-resolution cycle counter" on your
platform, you can still use ring_perf_autotest to compare the performance
(as the relative numbers are what matter).

Jerin

> Thanks for any suggestions.
>
> Cheers,
> Jia
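
P.S. To make the "both options under conditional compilation flags" idea concrete, here is a minimal sketch, not the code from the patch. It uses plain C11 <stdatomic.h>; the flag name RING_USE_C11_MEM_MODEL, the struct layout, and the helper names are all hypothetical, and the acquire/release fences in the #else branch only stand in for rte_smp_rmb()/rte_smp_wmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical build-time switch (not taken from the patch):
 * 1 = C11 acquire/release accesses, 0 = explicit barriers. */
#ifndef RING_USE_C11_MEM_MODEL
#define RING_USE_C11_MEM_MODEL 1
#endif

/* Simplified, hypothetical head/tail pair; not the real rte_ring layout. */
struct ring_headtail {
	_Atomic uint32_t head;
	_Atomic uint32_t tail;
};

struct ring {
	uint32_t capacity;
	struct ring_headtail prod;
	struct ring_headtail cons;
};

static void ring_init(struct ring *r, uint32_t capacity)
{
	assert(capacity != 0);
	r->capacity = capacity;
	atomic_init(&r->prod.head, 0);
	atomic_init(&r->prod.tail, 0);
	atomic_init(&r->cons.head, 0);
	atomic_init(&r->cons.tail, 0);
}

/* Producer side: read prod.head first, then cons.tail, in that order.
 * Returns the number of free entries and stores the observed head. */
static uint32_t ring_free_entries(struct ring *r, uint32_t *old_head)
{
	uint32_t cons_tail;

#if RING_USE_C11_MEM_MODEL
	/* The acquire load of prod.head keeps the later load of cons.tail
	 * from being hoisted above it; the acquire load of cons.tail pairs
	 * with the consumer's release store in ring_update_tail(). */
	*old_head = atomic_load_explicit(&r->prod.head, memory_order_acquire);
	cons_tail = atomic_load_explicit(&r->cons.tail, memory_order_acquire);
#else
	/* Barrier flavour: relaxed loads separated by a fence that stands in
	 * for a read barrier such as rte_smp_rmb(). */
	*old_head = atomic_load_explicit(&r->prod.head, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	cons_tail = atomic_load_explicit(&r->cons.tail, memory_order_relaxed);
#endif
	/* Unsigned arithmetic stays correct when the indexes wrap around. */
	return r->capacity + cons_tail - *old_head;
}

/* Publish a new tail once the ring slots have been read or written. */
static void ring_update_tail(struct ring_headtail *ht, uint32_t new_val)
{
#if RING_USE_C11_MEM_MODEL
	atomic_store_explicit(&ht->tail, new_val, memory_order_release);
#else
	/* Stands in for a write barrier such as rte_smp_wmb(). */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&ht->tail, new_val, memory_order_relaxed);
#endif
}
```

Building the same file with -DRING_USE_C11_MEM_MODEL=0 or =1 and running ring_perf_autotest-style measurements on each target would be one way to let the platform pick the cheaper variant, as suggested above.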