> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon at 6wind.com]
> Sent: Saturday, August 27, 2016 1:58 AM
> To: De Lara Guarch, Pablo; Marohn, Byron
> Cc: dev at dpdk.org; Richardson, Bruce; Edupuganti, Saikrishna;
> jianbo.liu at linaro.org; chaozhu at linux.vnet.ibm.com;
> jerin.jacob at caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH 2/3] hash: add vectorized comparison
>
> 2016-08-26 22:34, Pablo de Lara:
> > From: Byron Marohn <byron.marohn at intel.com>
> >
> > In lookup bulk function, the signatures of all entries
> > are compared against the signature of the key that is being looked up.
> > Now that all the signatures are together, they can be compared
> > with vector instructions (SSE, AVX2), achieving higher lookup performance.
> >
> > Also, entries per bucket are increased to 8 when using processors
> > with AVX2, as 256 bits can be compared at once, which is the size of
> > 8x32-bit signatures.
>
> Please, would it be possible to use the generic SIMD intrinsics?
> We could define generic types compatible with Altivec and NEON:
> 	__attribute__ ((vector_size (n)))
> as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
>
I tried to convert these into generic code with gcc builtins, but I couldn't
find a way to translate the _mm_movemask intrinsic into a generic builtin
(and that one is necessary for performance reasons). Therefore, I think it is
not possible to do this without penalizing performance. Sure, we could try to
translate the other intrinsics, but we would still need #ifdefs and would end
up with a mix of x86 intrinsics and gcc builtins, so it is better to leave it
this way. There is a rough sketch of what I tried at the end of this mail.

> > +/* 8 entries per bucket */
> > +#if defined(__AVX2__)
>
> Please prefer
> 	#ifdef RTE_MACHINE_CPUFLAG_AVX2
> Ideally the vector support could be checked at runtime:
> 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> It would allow packaging one binary using the best optimization available.
>

Good idea. Will submit a v2 with this change. It took me a bit of time to
figure out a way to do this without paying a big performance penalty; there
is a sketch of the runtime selection at the end of this mail as well.

> > +		*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +				_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> > +				_mm256_set1_epi32(prim_hash)));
> > +		*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +				_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> > +				_mm256_set1_epi32(sec_hash)));
> > +/* 4 entries per bucket */
> > +#elif defined(__SSE2__)
> > +		*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +				_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> > +				_mm_set1_epi32(prim_hash)));
> > +		*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +				_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> > +				_mm_set1_epi32(sec_hash)));
>
> In order to allow such switch based on register size, we could have an
> abstraction in EAL supporting 128/256/512 width for x86/ARM/POWER.
> I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be enough.
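
For reference, this is roughly what the generic-vector attempt looked like
(a minimal sketch with made-up names, not code from the patch, and it assumes
the signature array is 32-byte aligned, like the AVX2 load does). The
lane-by-lane loop at the end is the step that _mm256_movemask_ps does in a
single instruction; I found no generic builtin for it:

#include <stdint.h>

/* 8 x 32-bit lanes, using the gcc generic vector extension */
typedef uint32_t vec8u32 __attribute__ ((vector_size (32)));
typedef int32_t vec8s32 __attribute__ ((vector_size (32)));

static inline uint32_t
compare_signatures_generic(const uint32_t *sig_current, uint32_t hash)
{
	const vec8u32 *sigs = (const vec8u32 *)sig_current;
	vec8u32 key = {hash, hash, hash, hash, hash, hash, hash, hash};
	/* Lane-wise compare: a matching lane becomes all ones (-1) */
	vec8s32 cmp = (vec8s32)(*sigs == key);
	uint32_t matches = 0;
	unsigned int i;

	/*
	 * There is no generic equivalent of _mm256_movemask_ps, so the
	 * mask vector has to be collapsed into a bitmask lane by lane,
	 * and this loop is where the performance is lost.
	 */
	for (i = 0; i < 8; i++)
		matches |= (uint32_t)(cmp[i] & 1) << i;

	return matches;
}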
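
And this is the general shape of the runtime selection I have in mind for the
v2 (again only a sketch; the enum, variable and function names are
illustrative, not necessarily what the patch will use). The EAL flag check is
done once, e.g. at table creation, and the result is cached, so the lookup
path only branches on a value that never changes after init:

#include <rte_cpuflags.h>

/* Illustrative selector, stored once instead of re-querying per lookup */
enum sig_compare_function {
	SIG_COMPARE_SCALAR = 0,
	SIG_COMPARE_SSE,
	SIG_COMPARE_AVX2,
};

static enum sig_compare_function sig_cmp_fn;

static void
select_sig_compare_function(void)
{
	sig_cmp_fn = SIG_COMPARE_SCALAR;
#if defined(RTE_ARCH_X86)
	/* Runtime check, as suggested, instead of a compile-time #ifdef */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		sig_cmp_fn = SIG_COMPARE_AVX2;
	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
		sig_cmp_fn = SIG_COMPARE_SSE;
#endif
}

The bulk lookup would then switch on sig_cmp_fn to pick the AVX2, SSE or
scalar comparison block, so the per-lookup cost is a well-predicted branch
rather than a CPU flag query.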