>
> Hi
> You mean just use rte_memcpy_internal in rte_memcpy_avx2, rte_memcpy_avx512?

Yes, exactly, and for rte_memcpy_sse() too.
Basically, for rte_memcpy_avx512() we force the compiler to use the AVX512F path
inside rte_memcpy_internal(), for rte_memcpy_avx2() we force it to use the AVX2
path inside rte_memcpy_internal(), etc.
To do that we set up:
CFLAGS_rte_memcpy_avx512f.o += -mavx512f
CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
inside the Makefile.

> But if RTE_MACHINE_CPUFLAG_AVX2 only means that the compiler supports
> AVX2, then internal would be compiled only
> with the AVX2 code and cannot choose another code path. What if the HW cannot
> support AVX2?

If the HW can't support AVX2 then rte_memcpy_init() just wouldn't select
rte_memcpy_avx2(), it would select rte_memcpy_sse() instead:
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {...}
- that is a run-time check that the underlying HW does support AVX2.
Konstantin

> If RTE_MACHINE_CPUFLAG_AVX2 means, as before, that both the compiler
> and the HW support AVX2, then the function
> is no different from what it is right now.
> The macro is determined at compile time, but the selection is meant to happen
> at run time.
> Did I get something wrong?
>
> Best Regards,
> Xiaoyun Li
>
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, October 3, 2017 19:16
> > To: Li, Xiaoyun <xiaoyun...@intel.com>; Richardson, Bruce
> > <bruce.richard...@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo...@intel.com>; Zhang, Helin
> > <helin.zh...@intel.com>; dev@dpdk.org
> > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> >
> > Hi,
> >
> > >
> > > Hi
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, October 3, 2017 00:39
> > > > To: Li, Xiaoyun <xiaoyun...@intel.com>; Richardson, Bruce
> > > > <bruce.richard...@intel.com>
> > > > Cc: Lu, Wenzhuo <wenzhuo...@intel.com>; Zhang, Helin
> > > > <helin.zh...@intel.com>; dev@dpdk.org
> > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > >
> > > > > -----Original Message-----
> > > > > From: Li, Xiaoyun
> > > > > Sent: Monday, October 2, 2017 5:13 PM
> > > > > To: Ananyev, Konstantin <konstantin.anan...@intel.com>;
> > > > > Richardson, Bruce <bruce.richard...@intel.com>
> > > > > Cc: Lu, Wenzhuo <wenzhuo...@intel.com>; Zhang, Helin
> > > > > <helin.zh...@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > > > <xiaoyun...@intel.com>
> > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > > >
> > > > > This patch dynamically selects memcpy functions at run time
> > > > > based on the CPU flags that the current machine supports. It uses
> > > > > function pointers which are bound to the corresponding functions at
> > > > > constructor time.
> > > > > In addition, the AVX512 instruction set is compiled only if
> > > > > users enable it in the config and the compiler supports it.
> > > > >
> > > > > Signed-off-by: Xiaoyun Li <xiaoyun...@intel.com>
> > > > > ---
> > > > > v2
> > > > > * Use gcc function multi-versioning to avoid compilation issues.
> > > > > * Add macros for AVX512 and AVX2. The AVX512 code is compiled only
> > > > > if users enable AVX512 and the compiler supports it. The AVX2 code
> > > > > is compiled only if the compiler supports AVX2.
> > > > >
> > > > > v3
> > > > > * Reduce function calls by keeping only rte_memcpy_xxx.
> > > > > * Add conditions so that small copy sizes use the inline code path.
> > > > > Otherwise, use the dynamic code path.
> > > > > * To support attribute target, clang version must be greater than 3.7.
> > > > > Otherwise, the SSE/AVX code path is chosen, the same as before.
> > > > > * Move two macro functions to the top of the code since they are
> > > > > used in both the inline SSE/AVX and the dynamic SSE/AVX code.
> > > > >
> > > > > v4
> > > > > * Split rte_memcpy.h into several .c files and modify the makefiles
> > > > > to compile the AVX2 and AVX512 files.
> > > >
> > > > Could you explain to me why, instead of reusing the existing rte_memcpy()
> > > > code to generate _sse/_avx2/_avx512f flavors, you keep pushing changes
> > > > with 3 separate implementations?
> > > > Obviously that is much more expensive in terms of maintenance and
> > > > doesn't look like a feasible solution to me.
> > > > Is the existing rte_memcpy() implementation not good enough in terms
> > > > of functionality and/or performance?
> > > > If so, can you outline these problems and try to fix them first.
> > > > Konstantin
> > > >
> > >
> > > I just changed many small functions into one function in each of those
> > > 3 separate functions.
> >
> > Yes, so with what you suggest we'll have 4 implementations of rte_memcpy
> > to support.
> > That's very expensive in terms of maintenance and I believe totally
> > unnecessary.
> >
> > > Because the existing code is totally inline, including rte_memcpy()
> > > itself, the compiler will turn all rte_memcpy() calls into the
> > > basic code like xmm0=xxx.
> > >
> > > The existing code is OK in this respect.
> >
> > Good.
> >
> > > But with run-time dispatch, it will bring lots of function calls and
> > > cause a perf drop.
> >
> > I believe it wouldn't if we do it properly.
> > All internal functions (mov16, mov32, etc.) will still be inlined by the
> > compiler for each flavor (sse/avx2/etc.) - have a look at the patch I sent.
> >
> > Konstantin
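
For readers following the thread, here is a minimal C sketch of the scheme
discussed above: one shared, force-inlined copy body, thin per-ISA wrappers,
and a constructor that binds a function pointer after a run-time CPU check.
It is an illustration under stated assumptions, not the code from either
patch. All names (copy_internal, copy_sse, copy_avx2, copy_ptr, copy_init)
are hypothetical. The GCC/clang target attribute and __builtin_cpu_supports()
stand in for the per-file CFLAGS and rte_cpu_get_flag_enabled() mentioned in
the thread, and the byte loop stands in for the real mov16/mov32 helpers.

```c
#include <stddef.h>

/* Hypothetical names throughout; this is not DPDK's rte_memcpy code. */
typedef void *(*memcpy_fn)(void *dst, const void *src, size_t n);

/*
 * Single shared implementation. Because it is always inlined, each wrapper
 * below gets its own copy of this body, compiled for that wrapper's ISA.
 */
static inline __attribute__((always_inline)) void *
copy_internal(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Baseline flavor, built with the default compiler flags. */
static void *
copy_sse(void *dst, const void *src, size_t n)
{
	return copy_internal(dst, src, n);
}

#if defined(__x86_64__) || defined(__i386__)
/* AVX2 flavor: same body, but the compiler may emit AVX2 instructions. */
static __attribute__((target("avx2"))) void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return copy_internal(dst, src, n);
}
#endif

/* Callers go through this pointer; it defaults to the baseline flavor. */
static memcpy_fn copy_ptr = copy_sse;

/* Runs before main(): pick a flavor the running CPU actually supports. */
static __attribute__((constructor)) void
copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_avx2;
#endif
}
```

The point of the sketch is that only the call through copy_ptr is indirect;
the shared body is inlined into each flavor, so there is still just one
implementation to maintain.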