Hi Zhiyong,

> -----Original Message-----
> From: Yang, Zhiyong
> Sent: Thursday, December 15, 2016 6:51 AM
> To: Yang, Zhiyong <zhiyong.y...@intel.com>; Ananyev, Konstantin
> <konstantin.anan...@intel.com>; Thomas Monjalon <thomas.monja...@6wind.com>
> Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.gua...@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on IA
> platform
>
> Hi, Thomas, Konstantin:
>
> > -----Original Message-----
> > From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Yang, Zhiyong
> > Sent: Sunday, December 11, 2016 8:33 PM
> > Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > Hi, Konstantin, Bruce:
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, December 8, 2016 6:31 PM
> > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > > on IA platform
> > >
> > > > -----Original Message-----
> > > > From: Yang, Zhiyong
> > > > Sent: Thursday, December 8, 2016 9:53 AM
> > > > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce
> > > > rte_memset on IA platform
> > >
> > > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> > >
> > > static inline void *
> > > rte_memset_huge(void *s, int c, size_t n)
> > > {
> > >     return __rte_memset_vector(s, c, n);
> > > }
> > >
> > > static inline void *
> > > rte_memset(void *s, int c, size_t n)
> > > {
> > >     if (n < XXX)
> > >         return rte_memset_scalar(s, c, n);
> > >     else
> > >         return rte_memset_huge(s, c, n);
> > > }
> > >
> > > XXX could be either a define, or could also be a variable, so it can
> > > be set up at startup, depending on the architecture.
> > >
> > > Would that work?
> > > Konstantin
>
> I have implemented the code for choosing the functions at run time.
> rte_memcpy is used more frequently, so I tested it at run time.
>
> typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
> extern rte_memcpy_vector_t rte_memcpy_vector;
>
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
>     return rte_memcpy_vector(dst, src, n);
> }
>
> In order to reduce the overhead at run time, I assign the function
> address to the variable rte_memcpy_vector before main() starts, so the
> variable is initialized before first use:
>
> static void __attribute__((constructor))
> rte_memcpy_init(void)
> {
>     if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
>         rte_memcpy_vector = rte_memcpy_avx2;
>     else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
>         rte_memcpy_vector = rte_memcpy_sse;
>     else
>         rte_memcpy_vector = memcpy;
> }
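For reference, a self-contained sketch of the constructor-based dispatch
quoted above. Two assumptions: __builtin_cpu_supports() stands in for DPDK's
rte_cpu_get_flag_enabled(), and the AVX2/SSE variants here are plain memcpy
stand-ins rather than the real vectorized implementations:

#include <stdio.h>
#include <string.h>

typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);

/* Stand-ins for the real vectorized copies (hypothetical bodies). */
static void *
rte_memcpy_avx2(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n); /* real version would use 32-byte loads/stores */
}

static void *
rte_memcpy_sse(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n); /* real version would use 16-byte loads/stores */
}

/* Default to libc memcpy until the constructor runs. */
rte_memcpy_vector_t rte_memcpy_vector = memcpy;

/* Runs before main(), so the CPU-feature branch is resolved exactly once. */
static void __attribute__((constructor))
rte_memcpy_init(void)
{
    if (__builtin_cpu_supports("avx2"))
        rte_memcpy_vector = rte_memcpy_avx2;
    else if (__builtin_cpu_supports("sse4.1"))
        rte_memcpy_vector = rte_memcpy_sse;
}

int
main(void)
{
    char src[32] = "runtime-dispatched copy";
    char dst[32];

    rte_memcpy_vector(dst, src, sizeof(src));
    printf("%s\n", dst);
    return 0;
}

Note that every call still goes through the function pointer, which is
exactly the indirect-call overhead discussed below.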
I thought we discussed a bit different approach, in which rte_memcpy_vector()
(rte_memset_vector()) would be called only after some cutoff point, i.e.:

void
rte_memcpy(void *dst, const void *src, size_t len)
{
    if (len < N)
        memcpy(dst, src, len);
    else
        rte_memcpy_vector(dst, src, len);
}

If you just always call rte_memcpy_vector() for every len, it means the
compiler most likely always has to generate a proper call (no inlining
happening). For small lengths, the price of the extra function call would
probably outweigh any potential gain from the SSE/AVX2 implementation.

Konstantin

> I ran the same virtio/vhost loopback tests without a NIC.
> I can see a throughput drop when choosing functions at run time,
> compared to the original code, on the same platform (my machine is
> Haswell):
>
> Packet size    perf drop
> 64             -4%
> 256            -5.4%
> 1024           -5%
> 1500           -2.5%
>
> Another thing: when I run memcpy_perf_autotest with N <= 128, the
> rte_memcpy perf gains almost disappear when choosing functions at run
> time. For other values of N, the perf gains become narrower.
>
> Thanks
> Zhiyong
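Putting the cutoff together with the run-time dispatch, a minimal sketch
(the macro name RTE_MEMCPY_THRESH and the value 128 are assumptions, the
latter taken from the N <= 128 autotest observation above):

#include <string.h>

/* Dispatch pointer, set up once by a constructor as sketched earlier. */
extern void *(*rte_memcpy_vector)(void *dst, const void *src, size_t n);

/* Below this length the indirect call likely costs more than SSE/AVX2 saves. */
#define RTE_MEMCPY_THRESH 128

static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
    if (n < RTE_MEMCPY_THRESH)
        return memcpy(dst, src, n); /* compiler can expand this inline */
    return rte_memcpy_vector(dst, src, n); /* call cost amortized over a large copy */
}

This keeps the hot small-copy path free of the indirect call while still
picking the widest vector width available at startup for large copies.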