Hi, Konstantin:

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Friday, December 16, 2016 7:48 PM
> To: Yang, Zhiyong <zhiyong.y...@intel.com>; Thomas Monjalon
> <thomas.monja...@6wind.com>
> Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.gua...@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
>
> Hi Zhiyong,
>
> > > > > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset_huge(void *s, int c, size_t n)
> > > > > > {
> > > > > >         return __rte_memset_vector(s, c, n);
> > > > > > }
> > > > > >
> > > > > > static inline void *
> > > > > > rte_memset(void *s, int c, size_t n)
> > > > > > {
> > > > > >         if (n < XXX)
> > > > > >                 return rte_memset_scalar(s, c, n);
> > > > > >         else
> > > > > >                 return rte_memset_huge(s, c, n);
> > > > > > }
> > > > > >
> > > > > > XXX could be either a define, or could also be a variable, so
> > > > > > it can be set up at startup, depending on the architecture.
> > > > > >
> > > > > > Would that work?
> > > > > > Konstantin
> > > > >
> > > > I have implemented the code for choosing the functions at run time.
> > > > rte_memcpy is used more frequently, so I tested it at run time.
> > > >
> > > > typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src,
> > > >         size_t n);
> > > > extern rte_memcpy_vector_t rte_memcpy_vector;
> > > >
> > > > static inline void *
> > > > rte_memcpy(void *dst, const void *src, size_t n)
> > > > {
> > > >         return rte_memcpy_vector(dst, src, n);
> > > > }
> > > >
> > > > In order to reduce the overhead at run time, I assign the function
> > > > address to the variable rte_memcpy_vector before main() starts, so
> > > > the variable is already initialized:
> > > >
> > > > static void __attribute__((constructor))
> > > > rte_memcpy_init(void)
> > > > {
> > > >         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
> > > >                 rte_memcpy_vector = rte_memcpy_avx2;
> > > >         } else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1)) {
> > > >                 rte_memcpy_vector = rte_memcpy_sse;
> > > >         } else {
> > > >                 rte_memcpy_vector = memcpy;
> > > >         }
> > > > }
> > >
> > > I thought we discussed a bit different approach,
> > > in which rte_memcpy_vector() (rte_memset_vector) would be called
> > > only after some cutoff point, i.e.:
> > >
> > > void
> > > rte_memcpy(void *dst, const void *src, size_t len)
> > > {
> > >         if (len < N)
> > >                 memcpy(dst, src, len);
> > >         else
> > >                 rte_memcpy_vector(dst, src, len);
> > > }
> > >
> > > If you just always call rte_memcpy_vector() for every len, then it
> > > means that the compiler most likely always has to generate a proper
> > > call (no inlining happening).
> > > For small lengths the price of an extra function call would probably
> > > outweigh any potential gain from the SSE/AVX2 implementation.
> > >
> > > Konstantin
> >
> > Yes, in fact, from my tests, for small lengths rte_memset is far
> > better than glibc memset; for large lengths, rte_memset is only a bit
> > better than memset, because memset uses AVX2/SSE, too. Of course, it
> > will use AVX512 on future machines.
>
> Ok, thanks for the clarification.
> From previous mails I got a wrong impression that on big lengths
> rte_memset_vector() is significantly faster than memset().
>
> > > For small lengths the price of an extra function call would
> > > probably outweigh any potential gain.
> >
> > This is the key point. I think it should include the scalar
> > optimization, not only the vector optimization.
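To make that concrete, here is a rough sketch of the shape I have in mind;
the names and the 32-byte cutoff are only illustrative, not the final patch:

#include <stddef.h>

/* Cutoff below which the scalar path is used; to be tuned per arch. */
#define RTE_MEMSET_THRESH 32

/* Pointer set up once at startup (constructor) to the best vector
 * implementation for the running CPU. */
extern void *(*rte_memset_vector)(void *s, int c, size_t n);

static inline void *
rte_memset_scalar(void *s, int c, size_t n)
{
        char *p = s;

        /* Simple byte loop; n is small here, so the compiler can
         * fully inline and unroll it. */
        while (n--)
                *p++ = (char)c;
        return s;
}

static inline void *
rte_memset(void *s, int c, size_t n)
{
        if (n < RTE_MEMSET_THRESH)
                return rte_memset_scalar(s, c, n);  /* no call overhead */
        return rte_memset_vector(s, c, n);          /* indirect call */
}

This way the small-length path stays inlined and scalar, and the indirect
call is only paid for larger buffers.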
> > The value of rte_memset is always inlined, and for small lengths it
> > will be better. In some cases we are not sure that memset is always
> > inlined by the compiler.
>
> Ok, so do you know in what cases memset() does not get inlined?
> Is it when the len parameter can't be precomputed by the compiler (is
> not a constant)?
>
> So to me it sounds like:
> - We don't need to have an optimized version of rte_memset() for big
>   sizes.
> - Which probably means we don't need arch-specific versions of
>   rte_memset_vector() at all -
>   for small sizes (<= 32B) the scalar version would be good enough.
> - For big sizes we can just rely on memset().
> Is that so?
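On the inlining question, a toy illustration of the constant vs. run-time
length point (this is only a sketch; the actual behaviour depends on the
compiler version and flags):

#include <string.h>

void
clear_fixed(char *buf)
{
        /* Length is a compile-time constant: the compiler can typically
         * expand this into a few stores, with no call to memset. */
        memset(buf, 0, 32);
}

void
clear_var(char *buf, size_t n)
{
        /* Length is only known at run time: the compiler usually has to
         * emit a real call to memset(). */
        memset(buf, 0, n);
}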
Using memset has actually run into trouble in some cases, see for example
http://dpdk.org/ml/archives/dev/2016-October/048628.html

> > It seems that choosing the function at run time will lose the gains.
> > The following was tested on Haswell with the patch code.
>
> Not sure what columns 2 and 3 in the table below mean?
> Konstantin

Column 1 shows the size (bytes).
Column 2 shows rte_memset vs. memset performance results in cache.
Column 3 shows rte_memset vs. memset performance results in memory.
The data is collected using rte_rdtsc(); the test can be run using
[PATCH 3/4] app/test: add performance autotest for rte_memset.
A rough sketch of the measurement loop is appended below the quoted
numbers.

Thanks
Zhiyong

> > ** rte_memset() - memset perf tests
> >    (C = compile-time constant) **
> > ========  =================  =================
> > Size      rte_memset vs      rte_memset vs
> > (bytes)   memset in cache    memset in mem
> >           (ticks)            (ticks)
> > --------  -----------------  -----------------
> > ============== 32B aligned =================
> > 3         3 - 8              19 - 128
> > 4         4 - 8              13 - 128
> > 8         2 - 7              19 - 128
> > 9         2 - 7              19 - 127
> > 12        2 - 7              19 - 127
> > 17        3 - 8              19 - 132
> > 64        3 - 8              28 - 168
> > 128       7 - 13             54 - 200
> > 255       8 - 20             100 - 223
> > 511       14 - 20            187 - 314
> > 1024      24 - 29            328 - 379
> > 8192      198 - 225          1829 - 2193
> >
> > Thanks
> > Zhiyong
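For reference, this is roughly how such numbers can be collected with
rte_rdtsc(); the loop count and helper name here are made up for
illustration only, the real test is the one in [PATCH 3/4]:

#include <stdint.h>
#include <string.h>
#include <rte_cycles.h>

#define MEMSET_ITERATIONS 1000000       /* illustrative repeat count */

/* Time one (value, size) case and return the average ticks per call. */
static uint64_t
measure_memset_ticks(char *buf, int c, size_t n)
{
        uint64_t start, end;
        unsigned int i;

        start = rte_rdtsc();
        for (i = 0; i < MEMSET_ITERATIONS; i++)
                memset(buf, c, n);      /* or rte_memset() for comparison */
        end = rte_rdtsc();

        return (end - start) / MEMSET_ITERATIONS;
}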