Hi, Thomas, Konstantin:

> -----Original Message-----
> From: dev [mailto:dev-boun...@dpdk.org] On Behalf Of Yang, Zhiyong
> Sent: Sunday, December 11, 2016 8:33 PM
> To: Ananyev, Konstantin <konstantin.anan...@intel.com>; Thomas
> Monjalon <thomas.monja...@6wind.com>
> Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.gua...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset on
> IA platform
>
> Hi, Konstantin, Bruce:
>
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, December 8, 2016 6:31 PM
> > To: Yang, Zhiyong <zhiyong.y...@intel.com>; Thomas Monjalon
> > <thomas.monja...@6wind.com>
> > Cc: dev@dpdk.org; yuanhan....@linux.intel.com; Richardson, Bruce
> > <bruce.richard...@intel.com>; De Lara Guarch, Pablo
> > <pablo.de.lara.gua...@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/4] eal/common: introduce rte_memset
> > on IA platform
> >
> > extern void *(*__rte_memset_vector)(void *s, int c, size_t n);
> >
> > static inline void *
> > rte_memset_huge(void *s, int c, size_t n)
> > {
> > 	return __rte_memset_vector(s, c, n);
> > }
> >
> > static inline void *
> > rte_memset(void *s, int c, size_t n)
> > {
> > 	if (n < XXX)
> > 		return rte_memset_scalar(s, c, n);
> > 	else
> > 		return rte_memset_huge(s, c, n);
> > }
> >
> > XXX could be either a define, or could also be a variable, so it can
> > be set up at startup, depending on the architecture.
> >
> > Would that work?
> > Konstantin

I have implemented the code for choosing the functions at run time.
rte_memcpy is used more frequently, so I tested it at run time.
typedef void *(*rte_memcpy_vector_t)(void *dst, const void *src, size_t n);
extern rte_memcpy_vector_t rte_memcpy_vector;

static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	return rte_memcpy_vector(dst, src, n);
}

In order to reduce the overhead at run time, I assign the function
address to the variable rte_memcpy_vector before main() starts, using a
constructor to initialize it:

static void __attribute__((constructor))
rte_memcpy_init(void)
{
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		rte_memcpy_vector = rte_memcpy_avx2;
	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
		rte_memcpy_vector = rte_memcpy_sse;
	else
		rte_memcpy_vector = memcpy;
}

I ran the same virtio/vhost loopback tests without a NIC. Compared to
the original code, I see the following throughput drop when choosing the
function at run time, on the same platform (my machine is Haswell):

Packet size    perf drop
64             -4%
256            -5.4%
1024           -5%
1500           -2.5%

Another thing: when I run memcpy_perf_autotest with run-time function
selection, the rte_memcpy perf gains almost disappear for N <= 128, and
for other values of N the gains become narrower.

Thanks
Zhiyong