Hi Jerin,

Following the guide to use the PMU counters (kernel module inserted and DPDK recompiled), the numbers increased more than tenfold. Do bigger numbers here mean more precise measurements? Is this valid and expected?

Even so, no significant difference was seen between the runs with and without the patch.
gavin@net-arm-thunderx2:~/community/dpdk$ sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4 --socket-mem=1024 -- -i

RTE>>ring_perf_autotest (#1 run w/o the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 21
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8
### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 3.00
### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.48
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.39
MP/MC bulk enq/dequeue (size: 32): 8.52
### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.32
MP/MC bulk enq/dequeue (size: 8): 38.52
SP/SC bulk enq/dequeue (size: 32): 13.39
MP/MC bulk enq/dequeue (size: 32): 14.15
### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.00
MP/MC bulk enq/dequeue (size: 8): 141.97
SP/SC bulk enq/dequeue (size: 32): 23.85
MP/MC bulk enq/dequeue (size: 32): 36.13
Test OK

RTE>>ring_perf_autotest (#2 run w/o the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 130
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 21
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8
### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 3.00
### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.48
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.38
MP/MC bulk enq/dequeue (size: 32): 8.52
### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.31
MP/MC bulk enq/dequeue (size: 8): 38.52
SP/SC bulk enq/dequeue (size: 32): 13.33
MP/MC bulk enq/dequeue (size: 32): 14.16
### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.74
MP/MC bulk enq/dequeue (size: 8): 147.33
SP/SC bulk enq/dequeue (size: 32): 24.79
MP/MC bulk enq/dequeue (size: 32): 40.09
Test OK

RTE>>ring_perf_autotest (#1 run w/ the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 129
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8
### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 4.00
### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.89
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.50
MP/MC bulk enq/dequeue (size: 32): 8.52
### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.24
MP/MC bulk enq/dequeue (size: 8): 38.14
SP/SC bulk enq/dequeue (size: 32): 13.24
MP/MC bulk enq/dequeue (size: 32): 14.69
### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 74.63
MP/MC bulk enq/dequeue (size: 8): 137.61
SP/SC bulk enq/dequeue (size: 32): 24.82
MP/MC bulk enq/dequeue (size: 32): 36.64
Test OK

RTE>>ring_perf_autotest (#2 run w/ the patch)
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 103
MP/MC single enq/dequeue: 129
SP/SC burst enq/dequeue (size: 8): 18
MP/MC burst enq/dequeue (size: 8): 22
SP/SC burst enq/dequeue (size: 32): 7
MP/MC burst enq/dequeue (size: 32): 8
### Testing empty dequeue ###
SC empty dequeue: 3.00
MC empty dequeue: 4.00
### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 17.89
MP/MC bulk enq/dequeue (size: 8): 21.77
SP/SC bulk enq/dequeue (size: 32): 7.50
MP/MC bulk enq/dequeue (size: 32): 8.52
### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 31.53
MP/MC bulk enq/dequeue (size: 8): 38.59
SP/SC bulk enq/dequeue (size: 32): 13.24
MP/MC bulk enq/dequeue (size: 32): 14.69
### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.60
MP/MC bulk enq/dequeue (size: 8): 149.14
SP/SC bulk enq/dequeue (size: 32): 25.13
MP/MC bulk enq/dequeue (size: 32): 40.60
Test OK

> -----Original Message-----
> From: Jerin Jacob <jerin.ja...@caviumnetworks.com>
> Sent: Monday, October 8, 2018 6:50 PM
> To: Gavin Hu (Arm Technology China) <gavin...@arm.com>
> Cc: Ola Liljedahl <ola.liljed...@arm.com>; dev@dpdk.org; Honnappa
> Nagarahalli <honnappa.nagaraha...@arm.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>; Steve Capper <steve.cap...@arm.com>;
> nd <n...@arm.com>; sta...@dpdk.org
> Subject: Re: [PATCH v3 1/3] ring: read tail using atomic load
>
> -----Original Message-----
> > Date: Mon, 8 Oct 2018 10:33:43 +0000
> > From: "Gavin Hu (Arm Technology China)" <gavin...@arm.com>
> > To: Ola Liljedahl <ola.liljed...@arm.com>, Jerin Jacob
> > <jerin.ja...@caviumnetworks.com>
> > CC: "dev@dpdk.org" <dev@dpdk.org>, Honnappa Nagarahalli
> > <honnappa.nagaraha...@arm.com>, "Ananyev, Konstantin"
> > <konstantin.anan...@intel.com>, Steve Capper <steve.cap...@arm.com>,
> > nd <n...@arm.com>, "sta...@dpdk.org" <sta...@dpdk.org>
> > Subject: RE: [PATCH v3 1/3] ring: read tail using atomic load
> >
> >
> > I did benchmarking w/o and w/ the patch; it did not show any noticeable
> > differences in terms of latency.
> > Here is the full log (3 runs w/o the patch and 2 runs w/ the patch).
> >
> > sudo ./test/test/test -l 16-19,44-47,72-75,100-103 -n 4
> > --socket-mem=1024 -- -i
>
> These counters are running at 100MHz. Use PMU counters to get more
> accurate results.
>
> https://doc.dpdk.org/guides/prog_guide/profile_app.html
> See: 55.2. Profiling on ARM64
>