2016-08-26 22:34, Pablo de Lara:
> From: Byron Marohn <byron.marohn at intel.com>
>
> In lookup bulk function, the signatures of all entries
> are compared against the signature of the key that is being looked up.
> Now that all the signatures are together, they can be compared
> with vector instructions (SSE, AVX2), achieving higher lookup performance.
>
> Also, entries per bucket are increased to 8 when using processors
> with AVX2, as 256 bits can be compared at once, which is the size of
> 8x 32-bit signatures.
Please, would it be possible to use the generic SIMD intrinsics?
We could define generic types compatible with Altivec and NEON:
	__attribute__ ((vector_size (n)))
as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
(a rough sketch is at the end of this mail)

> +/* 8 entries per bucket */
> +#if defined(__AVX2__)

Please prefer #ifdef RTE_MACHINE_CPUFLAG_AVX2

Ideally the vector support could be checked at runtime:
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
It would allow packaging one binary using the best optimization available
(see the second sketch below).

> +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> +			_mm256_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> +			_mm256_set1_epi32(sec_hash)));
> +/* 4 entries per bucket */
> +#elif defined(__SSE2__)
> +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> +			_mm_set1_epi32(prim_hash)));
> +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> +			_mm_set1_epi32(sec_hash)));

By the way, shouldn't the SSE2 path use _mm_cmpeq_epi32? The signatures
are 32-bit values (they are splatted with _mm_set1_epi32), so comparing
with _mm_cmpeq_epi16 matches the two 16-bit halves independently, and
_mm_movemask_ps then only picks up the result of the upper half.

In order to allow such a switch based on register size, we could have an
abstraction in EAL supporting 128/256/512-bit widths for x86/ARM/POWER.
I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be enough.
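
To illustrate the first point, here is a rough, untested sketch of the
signature comparison written with the generic GCC vector extensions,
which would compile to SSE/AVX2 on x86 but also to NEON or Altivec.
The type and function names are invented for the example, and it
assumes the 8-entry AVX2 bucket layout with sig_current 32-byte
aligned:

	#include <stdint.h>

	/* 8 x 32-bit signatures in a 256-bit generic vector */
	typedef uint32_t sig_vec_t __attribute__ ((vector_size (32)));

	static inline uint32_t
	compare_signatures(const uint32_t *bkt_sigs, uint32_t hash)
	{
		/* assumes bkt_sigs (i.e. sig_current) is 32-byte aligned */
		const sig_vec_t sigs = *(const sig_vec_t *)bkt_sigs;
		const sig_vec_t key = {hash, hash, hash, hash,
				       hash, hash, hash, hash};
		/* each lane becomes all-ones on match, zero otherwise */
		const sig_vec_t cmp = (sig_vec_t)(sigs == key);
		uint32_t matches = 0;
		unsigned int i;

		/* portable equivalent of _mm256_movemask_ps(): collapse
		 * the per-lane results into a bitmask of matching entries */
		for (i = 0; i < sizeof(cmp) / sizeof(uint32_t); i++)
			matches |= (cmp[i] & 1u) << i;
		return matches;
	}

The compiler turns the == into vpcmpeqd (or the NEON/Altivec
equivalent); only the final mask extraction may need a per-arch helper.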
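
And for the run-time selection, something like this (also untested; the
compare_signatures_* helpers are hypothetical, e.g. the per-ISA variants
of the code above) could be resolved once at hash creation:

	#include <stdint.h>
	#include <rte_cpuflags.h>

	typedef uint32_t (*sig_cmp_fn_t)(const uint32_t *sigs,
			uint32_t hash);

	/* hypothetical per-ISA implementations */
	uint32_t compare_signatures_avx2(const uint32_t *sigs, uint32_t hash);
	uint32_t compare_signatures_sse2(const uint32_t *sigs, uint32_t hash);
	uint32_t compare_signatures_scalar(const uint32_t *sigs, uint32_t hash);

	static sig_cmp_fn_t sig_cmp_fn;

	static void
	select_sig_cmp_fn(void)
	{
	#if defined(RTE_ARCH_X86)
		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
			sig_cmp_fn = compare_signatures_avx2;
		else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
			sig_cmp_fn = compare_signatures_sse2;
		else
	#endif
			sig_cmp_fn = compare_signatures_scalar;
	}

The indirect call is amortized over the bulk lookup, and the same
binary then runs with the best implementation the CPU supports.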