> -----Original Message-----
> From: Hunt, David <david.h...@intel.com>
> Sent: Wednesday, April 10, 2019 10:06 PM
> To: Phil Yang (Arm Technology China) <phil.y...@arm.com>; dev@dpdk.org;
> tho...@monjalon.net
> Cc: reshma.pat...@intel.com; Gavin Hu (Arm Technology China)
> <gavin...@arm.com>; Honnappa Nagarahalli
> <honnappa.nagaraha...@arm.com>; nd <n...@arm.com>
> Subject: Re: [PATCH v4 2/3] test/distributor: replace sync builtins with
> atomic builtins
>
> Hi Phil,
>
> On 8/4/2019 4:02 AM, Phil Yang wrote:
> > '__sync' built-in functions are deprecated, should use the '__atomic'
> > built-in instead. the sync built-in functions are full barriers, while
> > atomic built-in functions offer less restrictive one-way barriers,
> > which help performance.
> >
> > Here is the example test result on TX2:
> > sudo ./arm64-armv8a-linuxapp-gcc/app/test -l 112-139 \
> > -n 4 --socket-mem=1024,1024 -- -i
> > RTE>>distributor_perf_autotest
> >
> > *** distributor_perf_autotest without this patch ***
> > ==== Cache line switch test ===
> > Time for 33554432 iterations = 1519202730 ticks
> > Ticks per iteration = 45
> >
> > *** distributor_perf_autotest with this patch ***
> > ==== Cache line switch test ===
> > Time for 33554432 iterations = 1251715496 ticks
> > Ticks per iteration = 37
> >
> > Less ticks needed for the cache line switch test. It got 17% of
> > performance improvement.
>
Hi, Dave

Thanks for your input.

> I'm seeing about an 8% performance degradation on my platform for the
> cache line switch test with the patch, however the single mode and burst
> mode tests are showing no difference, which are the more important tests.

I tested this patch on our x86 server (E5-2640 v3 @ 2.60GHz) for several
rounds, but I did not find any performance degradation. Please check the
test result below.

$ sudo ./x86_64-native-linuxapp-gcc/app/test -l 8-15 -n 4 --socket-mem=1024,1024 -- -i
RTE>>distributor_perf_autotest

#### without this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12379399910 ticks
Ticks per iteration = 368

=== Performance test of distributor (single mode) ===
Time per burst: 5815
Time per packet: 90

=== Performance test of distributor (burst mode) ===
Time per burst: 3487
Time per packet: 54

#### with this patch ####
==== Cache line switch test ===
Time for 33554432 iterations = 12388791845 ticks
Ticks per iteration = 369

=== Performance test of distributor (single mode) ===
Time per burst: 5796
Time per packet: 90

=== Performance test of distributor (burst mode) ===
Time per burst: 3477
Time per packet: 54

In my test the cache line switch result is effectively unchanged on x86
(368 vs. 369 ticks per iteration), and the time per burst improved very
slightly (5815 to 5796 in single mode, 3487 to 3477 in burst mode). The
difference is small enough that you can also treat it as measurement
noise.

> What kind of differences are you seeing in the single/burst mode tests?

Actually, I found no difference in the single mode and burst mode tests
on aarch64 either. I think it means the atomic operations in this test
case are not the hotspot for the performance of those two modes.

Like the __sync_xxx builtins, the __atomic_xxx builtins are atomic
operations, but they take an explicit memory order and so can avoid the
full memory barrier that the __sync_xxx builtins always imply. So I think
the change should benefit all platforms.

Thanks,
Phil

>
> Rgds,
> Dave.
>
>
> ---snip---
>
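
To make the difference between the two builtin families concrete, here is a minimal sketch in C. It is not code from the patch: the `counter` variable and the `count_*` and `self_check` helpers are hypothetical names chosen for illustration. It assumes GCC or Clang, where `__sync_fetch_and_add` always acts as a full barrier and `__atomic_fetch_add` takes an explicit memory order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared counter (illustration only, not from the patch). */
static uint64_t counter;

/* Legacy form: atomic add that always implies a full memory barrier. */
static inline uint64_t
count_sync(uint64_t n)
{
	return __sync_fetch_and_add(&counter, n);
}

/* Same atomic add, but with no ordering imposed on surrounding accesses. */
static inline uint64_t
count_atomic_relaxed(uint64_t n)
{
	return __atomic_fetch_add(&counter, n, __ATOMIC_RELAXED);
}

/* Closest equivalent of the legacy form: sequentially consistent add. */
static inline uint64_t
count_atomic_seq_cst(uint64_t n)
{
	return __atomic_fetch_add(&counter, n, __ATOMIC_SEQ_CST);
}

/* Single-threaded check: every variant returns the value before the add. */
static void
self_check(void)
{
	counter = 0;
	assert(count_sync(1) == 0);
	assert(count_atomic_relaxed(2) == 1);
	assert(count_atomic_seq_cst(3) == 3);
	assert(__atomic_load_n(&counter, __ATOMIC_ACQUIRE) == 6);
}
```

On x86 an atomic read-modify-write is a locked instruction that already orders memory accesses whichever memory order is requested, which would be consistent with the near-identical x86 numbers above. On aarch64 a relaxed or one-way (acquire/release) order lets the compiler leave out the full barrier, which is where a gain like the one reported on TX2 could come from. Which memory order is actually safe depends on how the test shares data between lcores, so treat the relaxed variant as an example rather than a recommendation.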