Hi Mairtin, > -----Original Message----- > From: O'loingsigh, Mairtin <mairtin.oloings...@intel.com> > Sent: Thursday, September 10, 2020 1:01 PM > To: Singh, Jasvinder <jasvinder.si...@intel.com> > Cc: dev@dpdk.org; Ryan, Brendan <brendan.r...@intel.com>; Coyle, David > <david.co...@intel.com>; De Lara Guarch, Pablo > <pablo.de.lara.gua...@intel.com>; O'loingsigh, Mairtin > <mairtin.oloings...@intel.com> > Subject: [PATCH] net: add support for AVX512 when generating CRC > > This patch enables the generation of CRC using AVX512 instruction set when > available on the host platform. > > Signed-off-by: Mairtin o Loingsigh <mairtin.oloings...@intel.com> > --- > > v1: > * Initial version, with AVX512 support for CRC32 Ethernet only (requires > further > updates) > * AVX512 support for CRC16-CCITT and final implementation of > CRC32 Ethernet will be added in v2 > --- > doc/guides/rel_notes/release_20_11.rst | 4 + > lib/librte_net/net_crc_avx.h | 331 > ++++++++++++++++++++++++++++++++ > lib/librte_net/rte_net_crc.c | 23 ++- > lib/librte_net/rte_net_crc.h | 1 + > 4 files changed, 358 insertions(+), 1 deletions(-) create mode 100644 > lib/librte_net/net_crc_avx.h > > diff --git a/doc/guides/rel_notes/release_20_11.rst > b/doc/guides/rel_notes/release_20_11.rst > index df227a1..d6a84ca 100644 > --- a/doc/guides/rel_notes/release_20_11.rst > +++ b/doc/guides/rel_notes/release_20_11.rst > @@ -55,6 +55,10 @@ New Features > Also, make sure to start the actual text at the margin. > ======================================================= > > +* **Added support for AVX512 in rte_net CRC calculations.** > + > + Added new CRC32 calculation code using AVX512 instruction set Added > + new CRC16-CCITT calculation code using AVX512 instruction set > > Removed Items > ------------- > diff --git a/lib/librte_net/net_crc_avx.h b/lib/librte_net/net_crc_avx.h new > file > mode 100644 index 0000000..d9481d5 > --- /dev/null > +++ b/lib/librte_net/net_crc_avx.h
... > +static __rte_always_inline uint32_t > +crc32_eth_calc_pclmulqdq( > + const uint8_t *data, > + uint32_t data_len, > + uint32_t crc, > + const struct crc_pclmulqdq512_ctx *params) { > + __m256i b; > + __m512i temp, k; > + __m512i qw0 = _mm512_set1_epi64(0); > + __m512i fold0; > + uint32_t n; This is loading 64 bytes of data, but if seems like only 16 are available, right? Should we use _mm_loadu_si128? > + fold0 = _mm512_xor_si512(fold0, temp); > + goto reduction_128_64; > + } > + > + if (unlikely(data_len < 16)) { > + /* 0 to 15 bytes */ > + uint8_t buffer[16] __rte_aligned(16); > + > + memset(buffer, 0, sizeof(buffer)); > + memcpy(buffer, data, data_len); I would use _mm_maskz_loadu_epi8, passing a mask register with ((1 << data_len) - 1). > + > + fold0 = _mm512_load_si512((const __m128i *)buffer); > + fold0 = _mm512_xor_si512(fold0, temp); > + if (unlikely(data_len < 4)) { > + fold0 = xmm_shift_left(fold0, 8 - data_len); > + goto barret_reduction; > + } > + fold0 = xmm_shift_left(fold0, 16 - data_len); > + goto reduction_128_64; > + } > + /* 17 to 31 bytes */ > + fold0 = _mm512_loadu_si512((const __m512i *)data); Same here. Looks like you are loading too much data? > + fold0 = _mm512_xor_si512(fold0, temp); > + n = 16; > + k = params->rk1_rk2; > + goto partial_bytes; > + } ... > + > + fold0 = _mm512_xor_si512(fold0, temp); > + fold0 = _mm512_xor_si512(fold0, b); You could use _mm512_ternarylogic_epi64 with 0x96 as to do 2x XORs in one instruction. > + } > + > + /** Reduction 128 -> 32 Assumes: fold holds 128bit folded data */ > +reduction_128_64: > + k = params->rk5_rk6; > + > +barret_reduction: > + k = params->rk7_rk8; > + n = crcr32_reduce_64_to_32(fold0, k); > + > + return n; > +} > + > +