Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, December 15, 2016 6:54 PM
> To: Yang, Zhiyong <zhiyong.y...@intel.com>; Thomas Monjalon
> <thomas.monja...@6wind.com>
> Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.gua...@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
>
> Hi Zhiyong,
>
> > -----Original Message-----
> > From: Yang, Zhiyong
> > Sent: Thursday, December 15, 2016 6:51 AM
> > To: Yang, Zhiyong <zhiyong.y...@intel.com>; Ananyev, Konstantin
> > <konstantin.anan...@intel.com>; Thomas Monjalon
> > <thomas.monja...@6wind.com>
> > Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> > <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.gua...@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Thomas, Konstantin:
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Yang, Zhiyong
> > > Sent: Sunday, December 11, 2016 8:33 PM
> > > To: Ananyev, Konstantin <konstantin.anan...@intel.com>; Thomas
> > > Monjalon <thomas.monja...@6wind.com>
> > > Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> > > <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> > > <pablo.de.lara.gua...@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > > Hi, Konstantin, Bruce:
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Thursday, December 8, 2016 6:31 PM
> > > > To: Yang, Zhiyong <zhiyong.y...@intel.com>; Thomas Monjalon
> > > > <thomas.monja...@6wind.com>
> > > > Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> > > > <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> > > > <pablo.de.lara.gua...@intel.com>
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > > >
> > > > > -----Original Message-----
> > > > > From: Yang, Zhiyong
> > > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > > To: Ananyev, Konstantin <konstantin.anan...@intel.com>; Thomas
> > > > > Monjalon <thomas.monja...@6wind.com>
> > > > > Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> > > > > <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> > > > > <pablo.de.lara.gua...@intel.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > > rte_memset on IA platform
> > > > >
> > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > >
> > > > static inline void *
> > > > rte_memset_huge(void *s, int c, size_t n)
> > > > {
> > > >         return __rte_memset_vector(s, c, n);
> > > > }
> > > >
> > > > static inline void *
> > > > rte_memset(void *s, int c, size_t n)
> > > > {
> > > >         if (n < XXX)
> > > >                 return rte_memset_scalar(s, c, n);
> > > >         else
> > > >                 return rte_memset_huge(s, c, n);
> > > > }
> > > >
> > > > XXX could be either a define, or could also be a variable, so it
> > > > can be set up at startup, depending on the architecture.
> > > >
> > > > Would that work?
> > > > Konstantin
> > >
> >
> > I have implemented the code for choosing the functions at run time.
> > rte_memcpy is used more frequently, so I tested it at run time.
> >
> > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> > extern rte_memcpy_vector_t rte_memcpy_vector;
> >
> > static inline void *
> > rte_memcpy(void *dst, const void *src, size_t n)
> > {
> >         return rte_memcpy_vector(dst, src, n);
> > }
> >
> > In order to reduce the overhead at run time, I assign the function
> > address to the variable rte_memcpy_vector before main() starts, so
> > that the variable is already initialized.
> >
> > static void __attribute__((constructor))
> > rte_memcpy_init(void)
> > {
> >         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> >                 rte_memcpy_vector = rte_memcpy_avx2;
> >         else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
> >                 rte_memcpy_vector = rte_memcpy_sse;
> >         else
> >                 rte_memcpy_vector = memcpy;
> > }
>
> I thought we discussed a slightly different approach, in which
> rte_memcpy_vector() (rte_memset_vector) would be called only
> after some cutoff point, i.e.:
>
> void *
> rte_memcpy(void *dst, const void *src, size_t len)
> {
>         if (len < N)
>                 return memcpy(dst, src, len);
>         else
>                 return rte_memcpy_vector(dst, src, len);
> }
>
> If you just always call rte_memcpy_vector() for every len, then the
> compiler will most likely always have to generate a proper call (no
> inlining happening).
> For small length(s) the price of the extra function call would probably
> outweigh any potential gain from the SSE/AVX2 implementation.
>
> Konstantin

Yes. In fact, from my tests, for small lengths rte_memset is far better than
glibc memset; for large lengths rte_memset is only a bit better than memset,
because memset uses AVX2/SSE too. Of course, it will use AVX512 on future
machines.

> For small length(s) the price of the extra function call would probably
> outweigh any potential gain.

This is the key point. I think rte_memset should include the scalar
optimization, not only the vector optimization.

The value of rte_memset is that it is always inlined, so for small lengths it
will be better, whereas in some cases we cannot be sure that memset is always
inlined by the compiler. It seems that choosing the function at run time will
lose those gains.

The following was tested on Haswell with the patch code.

** rte_memset() - memset perf tests (C = compile-time constant) **
======== ======= ======== ======= ========
   Size  memset in cache   memset in mem
(bytes)          (ticks)         (ticks)
-------  ---------------  --------------
============= 32B aligned ================
      3       3 -    8       19 -  128
      4       4 -    8       13 -  128
      8       2 -    7       19 -  128
      9       2 -    7       19 -  127
     12       2 -    7       19 -  127
     17       3 -    8       19 -  132
     64       3 -    8       28 -  168
    128       7 -   13       54 -  200
    255       8 -   20      100 -  223
    511      14 -   20      187 -  314
   1024      24 -   29      328 -  379
   8192     198 -  225     1829 - 2193

Thanks
Zhiyong
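
For reference, here is a minimal sketch of how the two ideas in this thread (an inlined scalar path below a cutoff, and a function pointer chosen once at startup above it) might be combined. It is only an illustration under stated assumptions, not the patch code: every `sketch_` name is hypothetical, the cutoff of 64 is an arbitrary placeholder for the value left open as XXX / N above, and the "vector" routine simply forwards to libc memset where a real build would select an AVX2 or SSE implementation from the CPU flags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cutoff; the thread leaves the real value open (XXX / N). */
#define SKETCH_MEMSET_CUTOFF 64

typedef void *(*sketch_memset_fn)(void *s, int c, size_t n);

/* Stand-in for an AVX2/SSE routine: it only forwards to libc memset. */
static void *
sketch_memset_vector_fallback(void *s, int c, size_t n)
{
        return memset(s, c, n);
}

/* Has a valid default, so calls made before the constructor still work. */
static sketch_memset_fn sketch_memset_vector = sketch_memset_vector_fallback;

/* Runs before main() (GCC/Clang attribute, as in the code quoted above). */
static void __attribute__((constructor))
sketch_memset_init(void)
{
        /* Assumption: a real build would test CPU flags here and pick a
         * vector implementation; this sketch has only the fallback. */
        sketch_memset_vector = sketch_memset_vector_fallback;
}

/* Small sizes: a plain byte loop that the compiler can inline. */
static inline void *
sketch_memset_scalar(void *s, int c, size_t n)
{
        unsigned char *p = s;
        size_t i;

        for (i = 0; i < n; i++)
                p[i] = (unsigned char)c;
        return s;
}

/* The indirect call is paid only at or above the cutoff. */
static inline void *
sketch_memset(void *s, int c, size_t n)
{
        if (n < SKETCH_MEMSET_CUTOFF)
                return sketch_memset_scalar(s, c, n);
        return sketch_memset_vector(s, c, n);
}
```

With this shape, a call such as sketch_memset(buf, 0, 8) stays on the inlined scalar path, while sketch_memset(buf, 0, 8192) goes through the pointer set at startup. Whether that keeps the small-size gains shown in the table would still have to be measured.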