Hi all,

After further investigation, we have found some benefits with the patchset. So the plan is to add a config parameter, CONFIG_RTE_ENABLE_RUNTIME_DISPATCH. By default its value is "n" and the current memcpy code is used. Only if users set it to "y" will the run-time dispatch code (without inline) be used.
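To make the idea concrete, here is a rough sketch of how such an option could gate the dispatch. It is only an illustration and not the patchset code: DEMO_RUNTIME_DISPATCH stands in for the config parameter, and demo_memcpy, copy_generic, copy_avx2 and copy_init are made-up names. Both copy routines simply call memcpy so that the sketch stays portable; a real version would use ISA-specific code. The CPU check assumes GCC or clang on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the build option: 1 = run-time dispatch. */
#ifndef DEMO_RUNTIME_DISPATCH
#define DEMO_RUNTIME_DISPATCH 1
#endif

typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Placeholder implementations; both just forward to memcpy here. */
static void *copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

#if DEMO_RUNTIME_DISPATCH
/* Selected once at start-up; every call then goes through the pointer,
 * so the copy can no longer be inlined at the call site. */
static copy_fn_t copy_impl = copy_generic;

static void copy_init(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_impl = copy_avx2;
#endif
	(void)copy_avx2; /* silence unused warning on non-x86 builds */
	assert(copy_impl != NULL);
}

static void *demo_memcpy(void *dst, const void *src, size_t n)
{
	return copy_impl(dst, src, n);
}
#else
/* Default path: implementation fixed at build time and inlinable. */
static void copy_init(void)
{
	(void)copy_avx2;
}

static inline void *demo_memcpy(void *dst, const void *src, size_t n)
{
	return copy_generic(dst, src, n);
}
#endif
```

An application would call copy_init() once at start-up and demo_memcpy() afterwards. With the option off, nothing changes for the caller except that the compiler is free to inline the copy again.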

Best Regards,
Xiaoyun Li

> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Li, Xiaoyun
> Sent: Tuesday, September 12, 2017 10:27
> To: Wang, Liang-min <liang-min.w...@intel.com>; Richardson, Bruce
> <bruce.richard...@intel.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>
> Cc: Zhang, Qi Z <qi.z.zh...@intel.com>; Lu, Wenzhuo
> <wenzhuo...@intel.com>; Zhang, Helin <helin.zh...@intel.com>;
> pie...@emutex.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v2 1/3] eal/x86: run-time dispatch over
> memcpy
>
> Hi ALL
>
> After investigating, most DPDK code already dispatches at run-time. Only
> rte_memcpy chooses the ISA at build-time.
>
> To modify memcpy, there are two ways. The first is function pointers and
> the other is function multi-versioning in GCC.
>
> But memcpy has been greatly optimized and benefits from being fully
> inlined. If it is changed to run-time dispatch via function pointers, the
> perf will drop a lot, especially when the copy size is small.
>
> And function multi-versioning in GCC only works for C++. Even though it is
> said that GCC6 can support C, in fact it did not support C in my trial.
>
> The attachment is the perf results of memcpy with and without my patch, and
> of the original DPDK code but without inline.
>
> It's just for comparison, so right now I have only tested on Broadwell,
> using AVX2.
>
> The results are from running test/test/test_memcpy_perf.c.
>
> (C = compile-time constant)
>
> /* Do aligned tests where size is a variable */
>
> /* Do aligned tests where size is a compile-time constant */
>
> /* Do unaligned tests where size is a variable */
>
> /* Do unaligned tests where size is a compile-time constant */
>
> 4-7 means dpdk costs time 4 and glibc costs time 7
>
> For sizes smaller than 128 bytes, this patch's perf is bad and even worse
> than glibc.
>
> When the size grows, the perf is better than glibc but worse than the
> original dpdk.
>
> And when it grows above about 1024 bytes, it performs similarly to the
> original dpdk.
>
> Furthermore, if inline is removed from the original dpdk, the perf is
> similar to the perf with the patch.
>
> Different situations (4 types, such as cache to cache) perform differently
> but the trend is the same (size grows, perf grows).
>
> So if dynamic dispatch is needed, it sacrifices some perf and the code
> needs to be compiled for the minimum target (e.g. compile for target avx,
> run on avx, avx2, avx512f).
>
> Thus, I think this feature shouldn't be delivered in this release.
>
> Best Regards,
> Xiaoyun Li