[PATCH v2 0/9] riscv: implement accelerated crc using zbc
The RISC-V Zbc extension adds instructions for carry-less multiplication we can use to implement CRC in hardware. This patch set contains two new implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce the buffer until it is small enough for a Barrett reduction to implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new build-time and run-time detection of the RISC-V extensions present in the compiler and on the target system.

I have carried out some performance comparisons between the generic table implementations and the new hardware implementations. Listed below is the number of cycles it takes to compute the CRC hash for buffers of various sizes (as reported by rte_get_timer_cycles()). These results were collected on a Kendryte K230 and averaged over 20 samples:

| Buffer    | CRC32-ETH (lib/net) | CRC32C (lib/hash)  |
| Size (MB) | Table   | Hardware  | Table  | Hardware  |
|-----------|---------|-----------|--------|-----------|
| 1         |  155168 |     11610 |  73026 |     18385 |
| 2         |  311203 |     22998 | 145586 |     35886 |
| 3         |  466744 |     34370 | 218536 |     53939 |
| 4         |  621843 |     45536 | 291574 |     71944 |
| 5         |  777908 |     56989 | 364152 |     89706 |
| 6         |  932736 |     68023 | 437016 |    107726 |
| 7         | 1088756 |     79236 | 510197 |    125426 |
| 8         | 1243794 |     90467 | 583231 |    143614 |

These results suggest a speed-up of lib/net by thirteen times, and of lib/hash by four times.
I have also run the hash_functions_autotest benchmark in dpdk_test, which measures the performance of the lib/hash implementation on small buffers, getting the following times:

| Key Length | Time (ticks/op)  |
| (bytes)    | Table | Hardware |
|------------|-------|----------|
| 1          | 0.47  | 0.85     |
| 2          | 0.57  | 0.87     |
| 4          | 0.99  | 0.88     |
| 8          | 1.35  | 0.88     |
| 9          | 1.20  | 1.09     |
| 13         | 1.76  | 1.35     |
| 16         | 1.87  | 1.02     |
| 32         | 2.96  | 0.98     |
| 37         | 3.35  | 1.45     |
| 40         | 3.49  | 1.12     |
| 48         | 4.02  | 1.25     |
| 64         | 5.08  | 1.54     |

v2:
- replace compile flag with build-time (riscv extension macros) and run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on small unaligned buffers
- refactor net/ implementation to move variable declarations to top of functions
- enable the optimisation in a couple of other places where optimised CRC is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv
  examples/l3fwd-power: use accelerated crc on riscv
  hash/cuckoo: use accelerated crc on riscv
  member: use accelerated crc on riscv

 MAINTAINERS                                   |   2 +
 app/test/test_crc.c                           |   9 +
 app/test/test_hash.c                          |   7 +
 config/riscv/meson.build                      |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc           |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |   5 +
 examples/l3fwd-power/main.c                   |   2 +-
 examples/l3fwd/l3fwd_em.c                     |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h          |   2 +
 lib/eal/riscv/rte_cpuflags.c                  | 112 +++---
 lib/hash/meson.build                          |   1 +
 lib/hash/rte_crc_riscv64.h                    |  89
 lib/hash/rte_cuckoo_hash.c                    |   3 +
 lib/hash/rte_hash_crc.c                       |  13 +-
 lib/hash/rte_hash_crc.h                       |   6 +-
 lib/ip_frag/ip_frag_internal.c                |   6 +-
 lib/member/rte_member.h                       |   2 +-
 lib/net/meson.build                           |   4 +
 lib/net/net_crc.h                             |  11 +
 lib/net/net_crc_zbc.c                         | 191 ++
 lib/net/rte_net_crc.c                         |  40
 lib/net/rte_net_crc.h                         |   2
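For reference, selecting the new lib/net implementation from an application goes through the existing rte_net_crc API; a minimal sketch (buf and buf_len are placeholders; rte_net_crc_set_alg() falls back to the best available implementation when Zbc is absent):

    #include <rte_net_crc.h>

    /* Request the Zbc-accelerated backend; on systems without the
     * extension this quietly keeps the best available algorithm. */
    rte_net_crc_set_alg(RTE_NET_CRC_ZBC);

    /* Compute an Ethernet CRC-32 over an arbitrary buffer. */
    uint32_t crc = rte_net_crc_calc(buf, buf_len, RTE_NET_CRC32_ETH);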
[PATCH v2 1/9] config/riscv: detect presence of Zbc extension
The RISC-V Zbc extension adds carry-less multiply instructions we can use to implement more efficient CRC hashing algorithms.

The RISC-V C API defines architecture extension test macros
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/riscv-c-api.md#architecture-extension-test-macros
These let us detect whether the Zbc extension is supported by the compiler and the -march we're building with. The C API also defines Zbc intrinsics we can use rather than inline assembly on newer versions of GCC (14.1.0+) and Clang (18.1.0+).

The Linux kernel exposes a RISC-V hardware probing syscall for getting information about the system at run-time, including which extensions are available. We detect whether this interface is present by looking for the asm/hwprobe.h header, as it's only present in newer kernels (v6.4+). Furthermore, support for detecting certain extensions, including Zbc, wasn't present until versions after this, so we need to check the constants this header exports.

The kernel exposes bitmasks for each extension supported by the probing interface, rather than the bit index that is set if that extension is present, so modify the existing cpu flag HWCAP table entries to line up with this. The values returned by the interface are 64 bits long, so grow the hwcap registers array to be able to hold them.

If the Zbc extension and intrinsics are both present and we can detect the Zbc extension at runtime, we define a flag, RTE_RISCV_FEATURE_ZBC.

Signed-off-by: Daniel Gregory
---
 config/riscv/meson.build             |  41 ++
 lib/eal/riscv/include/rte_cpuflags.h |   2 +
 lib/eal/riscv/rte_cpuflags.c         | 112 +++
 3 files changed, 123 insertions(+), 32 deletions(-)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..5d8411b254 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -119,6 +119,47 @@ foreach flag: arch_config['machine_args']
     endif
 endforeach
 
+# check if we can do buildtime detection of extensions supported by the target
+riscv_extension_macros = false
+if (cc.get_define('__riscv_arch_test', args: machine_args) == '1')
+    message('Detected architecture extension test macros')
+    riscv_extension_macros = true
+else
+    warning('RISC-V architecture extension test macros not available. Build-time detection of extensions not possible')
+endif
+
+# check if we can use hwprobe interface for runtime extension detection
+riscv_hwprobe = false
+if (cc.check_header('asm/hwprobe.h', args: machine_args))
+    message('Detected hwprobe interface, enabling runtime detection of supported extensions')
+    machine_args += ['-DRTE_RISCV_FEATURE_HWPROBE']
+    riscv_hwprobe = true
+else
+    warning('Hwprobe interface not available (present in Linux v6.4+), instruction extensions won\'t be enabled')
+endif
+
+# detect extensions
+# RISC-V Carry-less multiplication extension (Zbc) for hardware implementations
+# of CRC-32C (lib/hash/rte_crc_riscv64.h) and CRC-32/16 (lib/net/net_crc_zbc.c).
+# Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
+if (riscv_extension_macros and riscv_hwprobe and
+        (cc.get_define('__riscv_zbc', args: machine_args) != ''))
+    if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
+            or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
+        # determine whether we can detect Zbc extension (this wasn't possible
+        # until Linux kernel v6.8)
+        if (cc.compiles('''#include <asm/hwprobe.h>
+                int a = RISCV_HWPROBE_EXT_ZBC;''', args: machine_args))
+            message('Compiling with the Zbc extension')
+            machine_args += ['-DRTE_RISCV_FEATURE_ZBC']
+        else
+            warning('Detected Zbc extension but cannot use because runtime detection doesn\'t support it (support present in Linux kernel v6.8+)')
+        endif
+    else
+        warning('Detected Zbc extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
+    endif
+endif
+
 # apply flags
 foreach flag: dpdk_flags
     if flag.length() > 0
diff --git a/lib/eal/riscv/include/rte_cpuflags.h b/lib/eal/riscv/include/rte_cpuflags.h
index d742efc40f..4e26b584b3 100644
--- a/lib/eal/riscv/include/rte_cpuflags.h
+++ b/lib/eal/riscv/include/rte_cpuflags.h
@@ -42,6 +42,8 @@ enum rte_cpu_flag_t {
         RTE_CPUFLAG_RISCV_ISA_X, /* Non-standard extension present */
         RTE_CPUFLAG_RISCV_ISA_Y, /* Reserved */
         RTE_CPUFLAG_RISCV_ISA_Z, /* Reserved */
+
+        RTE_CPUFLAG_RISCV_EXT_ZBC, /* Carry-less multiplication */
 };
 
 #include "generic/rte_cpuflags.h"
diff --git a/lib/eal/riscv/rte_cpuflags.c b/lib/eal/riscv/rt
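For context, the runtime half of this detection boils down to a single hwprobe call; a minimal standalone sketch, not part of the patch (assumes kernel headers new enough to define RISCV_HWPROBE_EXT_ZBC, i.e. Linux v6.8+):

    #include <asm/hwprobe.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdbool.h>

    /* Ask the kernel whether the Zbc extension is present on all CPUs
     * (cpusetsize == 0 and cpus == NULL query every online CPU). */
    static bool
    riscv_has_zbc(void)
    {
            struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

            if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
                    return false;
            return (pair.value & RISCV_HWPROBE_EXT_ZBC) != 0;
    }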
[PATCH v2 2/9] hash: implement crc using riscv carryless multiply
Using carryless multiply instructions from RISC-V's Zbc extension, implement a Barrett reduction that calculates CRC-32C checksums. Based on the approach described by Intel's whitepaper on "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction", which is also described here (https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Add a case to the autotest_hash unit test.

Signed-off-by: Daniel Gregory
---
 MAINTAINERS                |  1 +
 app/test/test_hash.c       |  7 +++
 lib/hash/meson.build       |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++
 lib/hash/rte_hash_crc.c    | 13 +-
 lib/hash/rte_hash_crc.h    |  6 ++-
 6 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 533f707d5f..81f13ebcf2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -318,6 +318,7 @@ M: Stanislaw Kardach
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h
 
 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 24d3b547ad..c8c4197ad8 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -205,6 +205,13 @@ test_crc32_hash_alg_equiv(void)
                         printf("Failed checking CRC32_SW against CRC32_ARM64\n");
                         break;
                 }
+
+                /* Check against 8-byte-operand RISCV64 CRC32 if available */
+                rte_hash_crc_set_alg(CRC32_RISCV64);
+                if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+                        printf("Failed checking CRC32_SW against CRC32_RISCV64\n");
+                        break;
+                }
         }
 
         /* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
         'rte_crc_arm64.h',
         'rte_crc_generic.h',
+        'rte_crc_riscv64.h',
         'rte_crc_sw.h',
         'rte_crc_x86.h',
         'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41) and mu and our CRC32 algorithm.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+        assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+        /* Combine data with the initial value */
+        uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+        /*
+         * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+         * the lower 64 bits of the result (remember we're inverted)
+         */
+        crc = __riscv_clmul_64(crc, mu);
+        /* Multiply by P */
+        crc = __riscv_clmulh_64(crc, p);
+
+        /* Subtract from original (only needed for smaller sizes) */
+        if (bits == 16 || bits == 8)
+                crc ^= init_val >> bits;
+
+        return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software in case the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 8);
+
+        return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 16);
+
+        return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 32);
+
+        return crc32c_1word(data, init_val);
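As a usage sketch of the lib/hash side (key and key_len are placeholders; per the existing rte_hash_crc_set_alg() behaviour, requesting an unavailable algorithm should leave the best available one selected):

    #include <rte_hash_crc.h>

    /* Opt in to the Zbc-based CRC32C explicitly. */
    rte_hash_crc_set_alg(CRC32_RISCV64);

    /* Hash an arbitrary key, as elsewhere in DPDK. */
    uint32_t hash = rte_hash_crc(key, key_len, 0xFFFFFFFF);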
[PATCH v2 3/9] net: implement crc using riscv carryless multiply
Using carryless multiply instructions (clmul) from RISC-V's Zbc extension, implement CRC-32 and CRC-16 calculations on buffers. Based on the approach described in Intel's whitepaper on "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instructions", we perform repeated folds-by-1 whilst the buffer is still big enough, then perform Barrett's reductions on the rest.

Add a case to the crc_autotest suite that tests this implementation.

Signed-off-by: Daniel Gregory
---
 MAINTAINERS           |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h     |  11 +++
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c |  40 +
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 258 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 81f13ebcf2..58fbc51e64 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -319,6 +319,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c
 
 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
                 return ret;
         }
 
+        /* set CRC riscv mode */
+        rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+        ret = test_crc_calc();
+        if (ret < 0) {
+                printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+                return ret;
+        }
+
         return 0;
 }
 
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..404d8dd3ae 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
         cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
     sources += files('net_crc_neon.c')
     cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and
+        cc.get_define('RTE_RISCV_FEATURE_ZBC', args: machine_args) != '')
+    sources += files('net_crc_zbc.c')
+    cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);
 
+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..be416ba52f
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,191 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+#include
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+        uint64_t Pr;
+        uint64_t mu;
+        uint64_t k3;
+        uint64_t k4;
+        uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+        const uint64_t data,
+        uint32_t crc,
+        uint32_t bits,
+        const struct crc_clmul_ctx *params)
+{
+        assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+        /* Combine data with the initial value */
+        uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+        /*
+         * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+         * the lower 64 bits of the result (remember we're inverted)
+         */
+        temp = __riscv_clmul_64(temp, params->mu);
+        /* Multiply by P */
+        temp = __riscv_clmulh_64(temp, params->Pr);
+
+        /* Subtract from original (only needed for smaller sizes) */
+        if (bits == 16 || bits == 8)
+                temp ^= crc >> bits;
+
+        return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+        const uint8_t *data,
+        uint32_t data_len,
+        uint32_t crc,
+        const struct crc_clmul_ctx *params)
+{
+        while (data_len >= 8) {
+                crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+                data += 8;
+                data_len -= 8;
+        }
+        if (data_len >= 4) {
+                crc = crc32_barrett_zbc(*(const uint32_t *)data, crc, 32, params);
+
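The fold loop itself is cut off above; as a rough sketch of the fold-by-1 step the whitepaper describes (my paraphrase under stated assumptions, not the patch's literal code: the 128-bit CRC accumulator is held as two 64-bit halves and k3/k4 are the fold constants from struct crc_clmul_ctx):

    /* Fold one 16-byte block of the buffer into the accumulator.
     * Each clmul/clmulh pair forms one full 128-bit carry-less product. */
    static inline void
    fold_16_bytes(uint64_t *acc_lo, uint64_t *acc_hi, uint64_t next_lo,
            uint64_t next_hi, const struct crc_clmul_ctx *params)
    {
            uint64_t t_lo = __riscv_clmul_64(*acc_hi, params->k3) ^
                            __riscv_clmul_64(*acc_lo, params->k4);
            uint64_t t_hi = __riscv_clmulh_64(*acc_hi, params->k3) ^
                            __riscv_clmulh_64(*acc_lo, params->k4);

            /* XOR in the next 16 bytes of input */
            *acc_lo = t_lo ^ next_lo;
            *acc_hi = t_hi ^ next_hi;
    }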
[PATCH v2 4/9] config/riscv: add qemu crossbuild target
Add a new cross-compilation target that has extensions that DPDK uses and QEMU supports. Initially, this is just the Zbc extension, for hardware CRC support.

Signed-off-by: Daniel Gregory
---
 config/riscv/meson.build                      |  3 ++-
 config/riscv/riscv64_qemu_linux_gcc           | 17 +
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |  5 +
 3 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 5d8411b254..337b26bbac 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -43,7 +43,8 @@ vendor_generic = {
         ['RTE_MAX_NUMA_NODES', 2]
     ],
     'arch_config': {
-        'generic': {'machine_args': ['-march=rv64gc']}
+        'generic': {'machine_args': ['-march=rv64gc']},
+        'qemu': {'machine_args': ['-march=rv64gc_zbc']},
     }
 }
 
diff --git a/config/riscv/riscv64_qemu_linux_gcc b/config/riscv/riscv64_qemu_linux_gcc
new file mode 100644
index 00..007cc98885
--- /dev/null
+++ b/config/riscv/riscv64_qemu_linux_gcc
@@ -0,0 +1,17 @@
+[binaries]
+c = ['ccache', 'riscv64-linux-gnu-gcc']
+cpp = ['ccache', 'riscv64-linux-gnu-g++']
+ar = 'riscv64-linux-gnu-ar'
+strip = 'riscv64-linux-gnu-strip'
+pcap-config = ''
+
+[host_machine]
+system = 'linux'
+cpu_family = 'riscv64'
+cpu = 'rv64gc_zbc'
+endian = 'little'
+
+[properties]
+vendor_id = 'generic'
+arch_id = 'qemu'
+pkg_config_libdir = '/usr/lib/riscv64-linux-gnu/pkgconfig'
diff --git a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
index 7d7f7ac72b..c3b67671a0 100644
--- a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
+++ b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
@@ -110,6 +110,11 @@ Currently the following targets are supported:
 
 * SiFive U740 SoC: ``config/riscv/riscv64_sifive_u740_linux_gcc``
 
+* QEMU: ``config/riscv/riscv64_qemu_linux_gcc``
+
+  * A target with all the extensions that QEMU supports and DPDK has a use for
+    (currently ``rv64gc_zbc``). Requires QEMU version 7.0.0 or newer.
+
 To add a new target support, ``config/riscv/meson.build`` has to be modified by
 adding a new vendor/architecture id and a corresponding cross-file has to be
 added to ``config/riscv`` directory.
-- 
2.39.2
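For anyone wanting to exercise the new target, a typical flow might look like this (illustrative commands, assuming the riscv64-linux-gnu toolchain and QEMU >= 7.0 user-mode emulation are installed; the -L sysroot path depends on your distribution):

    meson setup build-riscv-qemu --cross-file config/riscv/riscv64_qemu_linux_gcc
    ninja -C build-riscv-qemu
    qemu-riscv64 -L /usr/riscv64-linux-gnu -cpu rv64,zbc=true \
        build-riscv-qemu/app/test/dpdk-test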
[PATCH v2 5/9] examples/l3fwd: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 examples/l3fwd/l3fwd_em.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index d98e66ea2c..78cec7f5cc 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -29,7 +29,7 @@
 #include "l3fwd_event.h"
 #include "em_route_parse.c"
 
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #define EM_HASH_CRC 1
 #endif
-- 
2.39.2
[PATCH v2 6/9] ipfrag: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/ip_frag/ip_frag_internal.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 7cbef647df..19a28c447b 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
         p = (const uint32_t *)&key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
         v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
         v = rte_hash_crc_4byte(p[1], v);
         v = rte_hash_crc_4byte(key->id, v);
 #else
         v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE);
-#endif /* RTE_ARCH_X86 */
+#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_FEATURE_ZBC */
 
         *v1 = v;
         *v2 = (v << 7) + (v >> 14);
@@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
         p = (const uint32_t *) &key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
         v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
         v = rte_hash_crc_4byte(p[1], v);
         v = rte_hash_crc_4byte(p[2], v);
-- 
2.39.2
[PATCH v2 7/9] examples/l3fwd-power: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 examples/l3fwd-power/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..c67a3c4011 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -270,7 +270,7 @@ static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
 
 #if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
 
-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_RISCV_FEATURE_ZBC)
 #include <rte_hash_crc.h>
 #define DEFAULT_HASH_FUNC       rte_hash_crc
 #else
-- 
2.39.2
[PATCH v2 8/9] hash/cuckoo: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/hash/rte_cuckoo_hash.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index d87aa52b5b..8bdb1ff69d 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -409,6 +409,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 #elif defined(RTE_ARCH_ARM64)
         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_CRC32))
                 default_hash_func = (rte_hash_function)rte_hash_crc;
+#elif defined(RTE_ARCH_RISCV)
+        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RISCV_EXT_ZBC))
+                default_hash_func = (rte_hash_function)rte_hash_crc;
 #endif
         /* Setup hash context */
         strlcpy(h->name, params->name, sizeof(h->name));
-- 
2.39.2
[PATCH v2 9/9] member: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/member/rte_member.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/member/rte_member.h b/lib/member/rte_member.h
index aec192eba5..152659628a 100644
--- a/lib/member/rte_member.h
+++ b/lib/member/rte_member.h
@@ -92,7 +92,7 @@ typedef uint16_t member_set_t;
 #define RTE_MEMBER_SKETCH_COUNT_BYTE   0x02
 
 /** @internal Hash function used by membership library. */
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #include <rte_hash_crc.h>
 #define MEMBER_HASH_FUNC       rte_hash_crc
 #else
-- 
2.39.2
[PATCH v3 0/9] riscv: implement accelerated crc using zbc
The RISC-V Zbc extension adds instructions for carry-less multiplication we can use to implement CRC in hardware. This patch set contains two new implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce the buffer until it is small enough for a Barrett reduction to implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new build-time and run-time detection of the RISC-V extensions present in the compiler and on the target system.

I have carried out some performance comparisons between the generic table implementations and the new hardware implementations. Listed below is the number of cycles it takes to compute the CRC hash for buffers of various sizes (as reported by rte_get_timer_cycles()). These results were collected on a Kendryte K230 and averaged over 20 samples:

| Buffer    | CRC32-ETH (lib/net) | CRC32C (lib/hash)  |
| Size (MB) | Table   | Hardware  | Table  | Hardware  |
|-----------|---------|-----------|--------|-----------|
| 1         |  155168 |     11610 |  73026 |     18385 |
| 2         |  311203 |     22998 | 145586 |     35886 |
| 3         |  466744 |     34370 | 218536 |     53939 |
| 4         |  621843 |     45536 | 291574 |     71944 |
| 5         |  777908 |     56989 | 364152 |     89706 |
| 6         |  932736 |     68023 | 437016 |    107726 |
| 7         | 1088756 |     79236 | 510197 |    125426 |
| 8         | 1243794 |     90467 | 583231 |    143614 |

These results suggest a speed-up of lib/net by thirteen times, and of lib/hash by four times.
I have also run the hash_functions_autotest benchmark in dpdk_test, which measures the performance of the lib/hash implementation on small buffers, getting the following times:

| Key Length | Time (ticks/op)  |
| (bytes)    | Table | Hardware |
|------------|-------|----------|
| 1          | 0.47  | 0.85     |
| 2          | 0.57  | 0.87     |
| 4          | 0.99  | 0.88     |
| 8          | 1.35  | 0.88     |
| 9          | 1.20  | 1.09     |
| 13         | 1.76  | 1.35     |
| 16         | 1.87  | 1.02     |
| 32         | 2.96  | 0.98     |
| 37         | 3.35  | 1.45     |
| 40         | 3.49  | 1.12     |
| 48         | 4.02  | 1.25     |
| 64         | 5.08  | 1.54     |

v3:
- rebase on 24.07
- replace crc with CRC in commits (check-git-log.sh)

v2:
- replace compile flag with build-time (riscv extension macros) and run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on small unaligned buffers
- refactor net/ implementation to move variable declarations to top of functions
- enable the optimisation in a couple of other places where optimised CRC is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement CRC using riscv carryless multiply
  net: implement CRC using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated CRC on riscv
  ipfrag: use accelerated CRC on riscv
  examples/l3fwd-power: use accelerated CRC on riscv
  hash/cuckoo: use accelerated CRC on riscv
  member: use accelerated CRC on riscv

 MAINTAINERS                                   |   2 +
 app/test/test_crc.c                           |   9 +
 app/test/test_hash.c                          |   7 +
 config/riscv/meson.build                      |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc           |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |   5 +
 examples/l3fwd-power/main.c                   |   2 +-
 examples/l3fwd/l3fwd_em.c                     |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h          |   2 +
 lib/eal/riscv/rte_cpuflags.c                  | 112 +++---
 lib/hash/meson.build                          |   1 +
 lib/hash/rte_crc_riscv64.h                    |  89
 lib/hash/rte_cuckoo_hash.c                    |   3 +
 lib/hash/rte_hash_crc.c                       |  13 +-
 lib/hash/rte_hash_crc.h                       |   6 +-
 lib/ip_frag/ip_frag_internal.c                |   6 +-
 lib/member/rte_member.h                       |   2 +-
 lib/net/meson.build                           |   4 +
 lib/net/net_crc.h                             |  11 +
 lib/net/net_crc_zbc.c                         | 191 ++
 lib/net/rte_net_crc.c
[PATCH v3 1/9] config/riscv: detect presence of Zbc extension
The RISC-V Zbc extension adds carry-less multiply instructions we can use to implement more efficient CRC hashing algorithms.

The RISC-V C API defines architecture extension test macros
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/riscv-c-api.md#architecture-extension-test-macros
These let us detect whether the Zbc extension is supported by the compiler and the -march we're building with. The C API also defines Zbc intrinsics we can use rather than inline assembly on newer versions of GCC (14.1.0+) and Clang (18.1.0+).

The Linux kernel exposes a RISC-V hardware probing syscall for getting information about the system at run-time, including which extensions are available. We detect whether this interface is present by looking for the asm/hwprobe.h header, as it's only present in newer kernels (v6.4+). Furthermore, support for detecting certain extensions, including Zbc, wasn't present until versions after this, so we need to check the constants this header exports.

The kernel exposes bitmasks for each extension supported by the probing interface, rather than the bit index that is set if that extension is present, so modify the existing cpu flag HWCAP table entries to line up with this. The values returned by the interface are 64 bits long, so grow the hwcap registers array to be able to hold them.

If the Zbc extension and intrinsics are both present and we can detect the Zbc extension at runtime, we define a flag, RTE_RISCV_FEATURE_ZBC.

Signed-off-by: Daniel Gregory
---
 config/riscv/meson.build             |  41 ++
 lib/eal/riscv/include/rte_cpuflags.h |   2 +
 lib/eal/riscv/rte_cpuflags.c         | 112 +++
 3 files changed, 123 insertions(+), 32 deletions(-)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..5d8411b254 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -119,6 +119,47 @@ foreach flag: arch_config['machine_args']
     endif
 endforeach
 
+# check if we can do buildtime detection of extensions supported by the target
+riscv_extension_macros = false
+if (cc.get_define('__riscv_arch_test', args: machine_args) == '1')
+    message('Detected architecture extension test macros')
+    riscv_extension_macros = true
+else
+    warning('RISC-V architecture extension test macros not available. Build-time detection of extensions not possible')
+endif
+
+# check if we can use hwprobe interface for runtime extension detection
+riscv_hwprobe = false
+if (cc.check_header('asm/hwprobe.h', args: machine_args))
+    message('Detected hwprobe interface, enabling runtime detection of supported extensions')
+    machine_args += ['-DRTE_RISCV_FEATURE_HWPROBE']
+    riscv_hwprobe = true
+else
+    warning('Hwprobe interface not available (present in Linux v6.4+), instruction extensions won\'t be enabled')
+endif
+
+# detect extensions
+# RISC-V Carry-less multiplication extension (Zbc) for hardware implementations
+# of CRC-32C (lib/hash/rte_crc_riscv64.h) and CRC-32/16 (lib/net/net_crc_zbc.c).
+# Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
+if (riscv_extension_macros and riscv_hwprobe and
+        (cc.get_define('__riscv_zbc', args: machine_args) != ''))
+    if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
+            or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
+        # determine whether we can detect Zbc extension (this wasn't possible
+        # until Linux kernel v6.8)
+        if (cc.compiles('''#include <asm/hwprobe.h>
+                int a = RISCV_HWPROBE_EXT_ZBC;''', args: machine_args))
+            message('Compiling with the Zbc extension')
+            machine_args += ['-DRTE_RISCV_FEATURE_ZBC']
+        else
+            warning('Detected Zbc extension but cannot use because runtime detection doesn\'t support it (support present in Linux kernel v6.8+)')
+        endif
+    else
+        warning('Detected Zbc extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
+    endif
+endif
+
 # apply flags
 foreach flag: dpdk_flags
     if flag.length() > 0
diff --git a/lib/eal/riscv/include/rte_cpuflags.h b/lib/eal/riscv/include/rte_cpuflags.h
index d742efc40f..4e26b584b3 100644
--- a/lib/eal/riscv/include/rte_cpuflags.h
+++ b/lib/eal/riscv/include/rte_cpuflags.h
@@ -42,6 +42,8 @@ enum rte_cpu_flag_t {
         RTE_CPUFLAG_RISCV_ISA_X, /* Non-standard extension present */
         RTE_CPUFLAG_RISCV_ISA_Y, /* Reserved */
         RTE_CPUFLAG_RISCV_ISA_Z, /* Reserved */
+
+        RTE_CPUFLAG_RISCV_EXT_ZBC, /* Carry-less multiplication */
 };
 
 #include "generic/rte_cpuflags.h"
diff --git a/lib/eal/riscv/rte_cpuflags.c b/lib/eal/riscv/rt
[PATCH v3 2/9] hash: implement CRC using riscv carryless multiply
Using carryless multiply instructions from RISC-V's Zbc extension, implement a Barrett reduction that calculates CRC-32C checksums. Based on the approach described by Intel's whitepaper on "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instruction", which is also described here (https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Add a case to the autotest_hash unit test.

Signed-off-by: Daniel Gregory
---
 MAINTAINERS                |  1 +
 app/test/test_hash.c       |  7 +++
 lib/hash/meson.build       |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++
 lib/hash/rte_hash_crc.c    | 13 +-
 lib/hash/rte_hash_crc.h    |  6 ++-
 6 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..fa081552c7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -322,6 +322,7 @@ M: Stanislaw Kardach
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h
 
 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 65b9cad93c..dd491ea4d9 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -231,6 +231,13 @@ test_crc32_hash_alg_equiv(void)
                         printf("Failed checking CRC32_SW against CRC32_ARM64\n");
                         break;
                 }
+
+                /* Check against 8-byte-operand RISCV64 CRC32 if available */
+                rte_hash_crc_set_alg(CRC32_RISCV64);
+                if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+                        printf("Failed checking CRC32_SW against CRC32_RISCV64\n");
+                        break;
+                }
         }
 
         /* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
         'rte_crc_arm64.h',
         'rte_crc_generic.h',
+        'rte_crc_riscv64.h',
         'rte_crc_sw.h',
         'rte_crc_x86.h',
         'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41) and mu and our CRC32 algorithm.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+        assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+        /* Combine data with the initial value */
+        uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+        /*
+         * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+         * the lower 64 bits of the result (remember we're inverted)
+         */
+        crc = __riscv_clmul_64(crc, mu);
+        /* Multiply by P */
+        crc = __riscv_clmulh_64(crc, p);
+
+        /* Subtract from original (only needed for smaller sizes) */
+        if (bits == 16 || bits == 8)
+                crc ^= init_val >> bits;
+
+        return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software in case the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 8);
+
+        return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 16);
+
+        return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+        if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+                return crc32c_riscv64(data, init_val, 32);
+
+        return crc32c_1word(data, init_val);
[PATCH v3 3/9] net: implement CRC using riscv carryless multiply
Using carryless multiply instructions (clmul) from RISC-V's Zbc extension, implement CRC-32 and CRC-16 calculations on buffers. Based on the approach described in Intel's whitepaper on "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ Instructions", we perform repeated folds-by-1 whilst the buffer is still big enough, then perform Barrett's reductions on the rest.

Add a case to the crc_autotest suite that tests this implementation.

Signed-off-by: Daniel Gregory
---
 MAINTAINERS           |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h     |  11 +++
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c |  40 +
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 258 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fa081552c7..eeaa2c645e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -323,6 +323,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c
 
 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
                 return ret;
         }
 
+        /* set CRC riscv mode */
+        rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+        ret = test_crc_calc();
+        if (ret < 0) {
+                printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+                return ret;
+        }
+
         return 0;
 }
 
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..404d8dd3ae 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
         cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
     sources += files('net_crc_neon.c')
     cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and
+        cc.get_define('RTE_RISCV_FEATURE_ZBC', args: machine_args) != '')
+    sources += files('net_crc_zbc.c')
+    cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);
 
+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..be416ba52f
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,191 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+#include
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+        uint64_t Pr;
+        uint64_t mu;
+        uint64_t k3;
+        uint64_t k4;
+        uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+        const uint64_t data,
+        uint32_t crc,
+        uint32_t bits,
+        const struct crc_clmul_ctx *params)
+{
+        assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+        /* Combine data with the initial value */
+        uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+        /*
+         * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+         * the lower 64 bits of the result (remember we're inverted)
+         */
+        temp = __riscv_clmul_64(temp, params->mu);
+        /* Multiply by P */
+        temp = __riscv_clmulh_64(temp, params->Pr);
+
+        /* Subtract from original (only needed for smaller sizes) */
+        if (bits == 16 || bits == 8)
+                temp ^= crc >> bits;
+
+        return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+        const uint8_t *data,
+        uint32_t data_len,
+        uint32_t crc,
+        const struct crc_clmul_ctx *params)
+{
+        while (data_len >= 8) {
+                crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+                data += 8;
+                data_len -= 8;
+        }
+        if (data_len >= 4) {
+                crc = crc32_barrett_zbc(*(const uint32_t *)data, crc, 32, params);
+
[PATCH v3 4/9] config/riscv: add qemu crossbuild target
Add a new cross-compilation target that has extensions that DPDK uses and QEMU supports. Initially, this is just the Zbc extension, for hardware CRC support.

Signed-off-by: Daniel Gregory
---
 config/riscv/meson.build                      |  3 ++-
 config/riscv/riscv64_qemu_linux_gcc           | 17 +
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |  5 +
 3 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 5d8411b254..337b26bbac 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -43,7 +43,8 @@ vendor_generic = {
         ['RTE_MAX_NUMA_NODES', 2]
     ],
     'arch_config': {
-        'generic': {'machine_args': ['-march=rv64gc']}
+        'generic': {'machine_args': ['-march=rv64gc']},
+        'qemu': {'machine_args': ['-march=rv64gc_zbc']},
     }
 }
 
diff --git a/config/riscv/riscv64_qemu_linux_gcc b/config/riscv/riscv64_qemu_linux_gcc
new file mode 100644
index 00..007cc98885
--- /dev/null
+++ b/config/riscv/riscv64_qemu_linux_gcc
@@ -0,0 +1,17 @@
+[binaries]
+c = ['ccache', 'riscv64-linux-gnu-gcc']
+cpp = ['ccache', 'riscv64-linux-gnu-g++']
+ar = 'riscv64-linux-gnu-ar'
+strip = 'riscv64-linux-gnu-strip'
+pcap-config = ''
+
+[host_machine]
+system = 'linux'
+cpu_family = 'riscv64'
+cpu = 'rv64gc_zbc'
+endian = 'little'
+
+[properties]
+vendor_id = 'generic'
+arch_id = 'qemu'
+pkg_config_libdir = '/usr/lib/riscv64-linux-gnu/pkgconfig'
diff --git a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
index 7d7f7ac72b..c3b67671a0 100644
--- a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
+++ b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
@@ -110,6 +110,11 @@ Currently the following targets are supported:
 
 * SiFive U740 SoC: ``config/riscv/riscv64_sifive_u740_linux_gcc``
 
+* QEMU: ``config/riscv/riscv64_qemu_linux_gcc``
+
+  * A target with all the extensions that QEMU supports and DPDK has a use for
+    (currently ``rv64gc_zbc``). Requires QEMU version 7.0.0 or newer.
+
 To add a new target support, ``config/riscv/meson.build`` has to be modified by
 adding a new vendor/architecture id and a corresponding cross-file has to be
 added to ``config/riscv`` directory.
-- 
2.39.2
[PATCH v3 5/9] examples/l3fwd: use accelerated CRC on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 examples/l3fwd/l3fwd_em.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index 31a7e05e39..36520401e5 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -29,7 +29,7 @@
 #include "l3fwd_event.h"
 #include "em_route_parse.c"
 
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #define EM_HASH_CRC 1
 #endif
-- 
2.39.2
[PATCH v3 6/9] ipfrag: use accelerated CRC on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/ip_frag/ip_frag_internal.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 7cbef647df..19a28c447b 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
         p = (const uint32_t *)&key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
         v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
         v = rte_hash_crc_4byte(p[1], v);
         v = rte_hash_crc_4byte(key->id, v);
 #else
         v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE);
-#endif /* RTE_ARCH_X86 */
+#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_FEATURE_ZBC */
 
         *v1 = v;
         *v2 = (v << 7) + (v >> 14);
@@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
         p = (const uint32_t *) &key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
         v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
         v = rte_hash_crc_4byte(p[1], v);
         v = rte_hash_crc_4byte(p[2], v);
-- 
2.39.2
[PATCH v3 8/9] hash/cuckoo: use accelerated CRC on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/hash/rte_cuckoo_hash.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index 577b5839d3..872f88fdce 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -427,6 +427,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 #elif defined(RTE_ARCH_ARM64)
         if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_CRC32))
                 default_hash_func = (rte_hash_function)rte_hash_crc;
+#elif defined(RTE_ARCH_RISCV)
+        if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RISCV_EXT_ZBC))
+                default_hash_func = (rte_hash_function)rte_hash_crc;
 #endif
         /* Setup hash context */
         strlcpy(h->name, params->name, sizeof(h->name));
-- 
2.39.2
[PATCH v3 7/9] examples/l3fwd-power: use accelerated CRC on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 examples/l3fwd-power/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..c631c14193 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -270,7 +270,7 @@ static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
 
 #if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
 
-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_RISCV_FEATURE_ZBC)
 #include <rte_hash_crc.h>
 #define DEFAULT_HASH_FUNC       rte_hash_crc
 #else
-- 
2.39.2
[PATCH v3 9/9] member: use accelerated CRC on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash.

Signed-off-by: Daniel Gregory
---
 lib/member/rte_member.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/member/rte_member.h b/lib/member/rte_member.h
index aec192eba5..152659628a 100644
--- a/lib/member/rte_member.h
+++ b/lib/member/rte_member.h
@@ -92,7 +92,7 @@ typedef uint16_t member_set_t;
 #define RTE_MEMBER_SKETCH_COUNT_BYTE   0x02
 
 /** @internal Hash function used by membership library. */
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #include <rte_hash_crc.h>
 #define MEMBER_HASH_FUNC       rte_hash_crc
 #else
-- 
2.39.2
[PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check memorder, which is not constant. This causes compile errors when it is enabled with RTE_ARM_USE_WFE, e.g.

../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/include/rte_common.h:530:56: error: expression in static assertion is not constant
  530 | #define RTE_BUILD_BUG_ON(condition) do { static_assert(!(condition), #condition); } while (0)
      |                                                        ^~~~
../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro ‘RTE_BUILD_BUG_ON’
  156 |         RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
      |         ^~~~

This has been the case since the switch to C11 assert (537caad2).

Fix the compile errors by replacing the check with an RTE_ASSERT.

Signed-off-by: Daniel Gregory
---
 lib/eal/arm/include/rte_pause_64.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..98e10e91c4 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -11,6 +11,7 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+#include <rte_debug.h>
 
 #ifdef RTE_ARM_USE_WFE
 #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
@@ -153,7 +154,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
 {
         uint16_t value;
 
-        RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
+        RTE_ASSERT(memorder != rte_memory_order_acquire &&
                 memorder != rte_memory_order_relaxed);
 
         __RTE_ARM_LOAD_EXC_16(addr, value, memorder)
@@ -172,7 +173,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
 {
         uint32_t value;
 
-        RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
+        RTE_ASSERT(memorder != rte_memory_order_acquire &&
                 memorder != rte_memory_order_relaxed);
 
         __RTE_ARM_LOAD_EXC_32(addr, value, memorder)
@@ -191,7 +192,7 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
 {
         uint64_t value;
 
-        RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
+        RTE_ASSERT(memorder != rte_memory_order_acquire &&
                 memorder != rte_memory_order_relaxed);
 
         __RTE_ARM_LOAD_EXC_64(addr, value, memorder)
-- 
2.39.2
[RFC PATCH] eal/riscv: add support for zawrs extension
The Zawrs extension adds a pair of instructions that stall a core until a memory location is written to. This patch uses one of them to implement RISC-V-specific versions of the rte_wait_until_equal_* functions. This is potentially more energy efficient than the default implementation that uses rte_pause/Zihintpause.

The technique works as follows:

* Create a reservation set containing the address we want to wait on using an atomic load (lr.dw)
* Call wrs.nto - this blocks until the reservation set is invalidated by someone else writing to that address
* Execution can also resume arbitrarily, so we still need to check whether a change occurred and loop if not

Due to RISC-V atomics only supporting naturally aligned word (32 bit) and double word (64 bit) loads, I've used pointer rounding and bit shifting to implement waiting on 16-bit values.

This new functionality is controlled by a Meson flag that is disabled by default.

Signed-off-by: Daniel Gregory
Suggested-by: Punit Agrawal
---
Posting as an RFC to get early feedback and enable testing by others with Zawrs-enabled hardware. Whilst I have been able to test it compiles & passes tests using QEMU, I am waiting on some Zawrs-enabled hardware to become available before I carry out performance tests. Nonetheless, I would be glad to hear any feedback on the general approach.

Thanks,
Daniel

 config/riscv/meson.build          |   5 ++
 lib/eal/riscv/include/rte_pause.h | 139 ++
 2 files changed, 144 insertions(+)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..4cfdc42ecb 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -26,6 +26,11 @@ flags_common = [
     # read from /proc/device-tree/cpus/timebase-frequency. This property is
     # guaranteed on Linux, as riscv time_init() requires it.
     ['RTE_RISCV_TIME_FREQ', 0],
+
+    # Enable use of RISC-V Wait-on-Reservation-Set extension (Zawrs)
+    # Mitigates looping when polling on memory locations
+    # Make sure to add '_zawrs' to your target's -march below
+    ['RTE_RISCV_ZAWRS', false]
 ]
 
 ## SoC-specific options.
diff --git a/lib/eal/riscv/include/rte_pause.h b/lib/eal/riscv/include/rte_pause.h
index cb8e9ca52d..e7b43dffa3 100644
--- a/lib/eal/riscv/include/rte_pause.h
+++ b/lib/eal/riscv/include/rte_pause.h
@@ -11,6 +11,12 @@
 extern "C" {
 #endif
 
+#ifdef RTE_RISCV_ZAWRS
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
+#include
+
 #include "rte_atomic.h"
 
 #include "generic/rte_pause.h"
@@ -24,6 +30,139 @@ static inline void rte_pause(void)
         asm volatile(".int 0x0100000F" : : : "memory");
 }
 
+#ifdef RTE_RISCV_ZAWRS
+
+/*
+ * Atomic load from an address, it returns either a sign-extended word or
+ * doubleword and creates a 'reservation set' containing the read memory
+ * location. When someone else writes to the reservation set, it is
+ * invalidated, causing any stalled WRS instructions to resume.
+ *
+ * Address needs to be naturally aligned.
+ */
+#define __RTE_RISCV_LR_32(src, dst, memorder) do {            \
+        if ((memorder) == rte_memory_order_relaxed) {         \
+                asm volatile("lr.w %0, (%1)"                  \
+                                : "=r" (dst)                  \
+                                : "r" (src)                   \
+                                : "memory");                  \
+        } else {                                              \
+                asm volatile("lr.w.aq %0, (%1)"               \
+                                : "=r" (dst)                  \
+                                : "r" (src)                   \
+                                : "memory");                  \
+        } } while (0)
+#define __RTE_RISCV_LR_64(src, dst, memorder) do {            \
+        if ((memorder) == rte_memory_order_relaxed) {         \
+                asm volatile("lr.d %0, (%1)"                  \
+                                : "=r" (dst)                  \
+                                : "r" (src)                   \
+                                : "memory");                  \
+        } else {                                              \
+                asm volatile("lr.d.aq %0, (%1)"               \
+                                : "=r" (dst)                  \
+                                : "r" (src)                   \
+                                : "memory");                  \
+        } } while (0)
+
+/*
+ * There's not a RISC-V atomic load primitive for halfwords, so cast up to a
+ * _naturally aligned_ word and extract the halfword
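To make the control flow concrete, the technique described above amounts to a loop like the following (a hand-written sketch, not the patch's exact macro expansion; assumes the assembler accepts wrs.nto, i.e. a _zawrs -march):

    #include <stdint.h>

    static inline void
    wait_until_equal_64(volatile uint64_t *addr, uint64_t expected)
    {
            uint64_t value;

            do {
                    /* lr.d loads the value and opens a reservation set
                     * covering addr */
                    asm volatile("lr.d %0, (%1)"
                                    : "=r" (value)
                                    : "r" (addr)
                                    : "memory");
                    if (value == expected)
                            break;
                    /* stall until the reservation set is invalidated by a
                     * write, an interrupt, or an implementation-defined
                     * timeout; re-check on every wakeup */
                    asm volatile("wrs.nto" : : : "memory");
            } while (1);
    }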
Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
On Thu, May 02, 2024 at 09:20:45AM -0700, Stephen Hemminger wrote:
> Why not:
> diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h
> index 5cb8b59056..81987de771 100644
> --- a/lib/eal/arm/include/rte_pause_64.h
> +++ b/lib/eal/arm/include/rte_pause_64.h
> @@ -172,6 +172,8 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
>  {
>  	uint32_t value;
>  
> +	static_assert(__builtin_constant_p(memorder), "memory order is not a constant");
> +
>  	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
>  		memorder != rte_memory_order_relaxed);
>  
> @@ -191,6 +193,8 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
>  {
>  	uint64_t value;
>  
> +	static_assert(__builtin_constant_p(memorder), "memory order is not a constant");
> +
>  	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
>  		memorder != rte_memory_order_relaxed);
> 

What toolchain are you using? With your change I still get errors about the expression not being constant:

In file included from ../lib/eal/arm/include/rte_pause.h:13,
                 from ../lib/eal/include/generic/rte_spinlock.h:25,
                 from ../lib/eal/arm/include/rte_spinlock.h:17,
                 from ../lib/telemetry/telemetry.c:20:
../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/arm/include/rte_pause_64.h:156:23: error: expression in static assertion is not constant
  156 |         static_assert(__builtin_constant_p(memorder), "memory order is not a constant");
      |                       ^~

I'm cross-compiling with GCC v12.2 using the config/arm/arm64_armv8_linux_gcc cross-file, and enabling RTE_ARM_USE_WFE by uncommenting it in config/arm/meson.build and setting its value to true.
Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
On Thu, May 02, 2024 at 02:48:26PM -0700, Stephen Hemminger wrote:
> There are already constant checks like this elsewhere in the file.

Yes, but they're in macros, rather than inlined functions, so my understanding was that at compile time, macro expansion has put the memorder constant in the _Static_assert call, as opposed to still being a function parameter in the inline definition. This is also the same approach used by the generic implementation (lib/eal/include/generic/rte_pause.h): the inline functions use assert and the macros use RTE_BUILD_BUG_ON.

To give a minimal example, the following inline function doesn't compile (Godbolt demo here: https://godbolt.org/z/aPqTf3v4o ):

    static inline __attribute__((always_inline)) void
    add(int *dst, int val)
    {
        _Static_assert(val != 0, "adding zero does nothing");
        *dst += val;
    }

But as a macro it does ( https://godbolt.org/z/x4a8fTf8h ):

    #define add(dst, val) do {                                    \
        _Static_assert(val != 0, "adding zero does nothing");     \
        *(dst) += (val);                                          \
    } while (0);

I don't believe this is a compiler bug, as both GCC and Clang produce the same error message.
Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
On Fri, May 03, 2024 at 03:32:20PM +0200, David Marchand wrote: > - RTE_BUILD_BUG_ON() should not be used indeed. > IIRC, this issue was introduced with 875f350924b8 ("eal: add a new > helper for wait until scheme"). > Please add a corresponding Fixes: tag in next revision. Will do. Should I CC sta...@dpdk.org too? > - This ARM specific implementation should take a rte_memory_order type > instead of a int type for the memorder input variable. > This was missed in 1ec6a845b5cb ("eal: use stdatomic API in public headers"). > > Could you send a fix for this small issue too? Yes, sure thing. Thanks, Daniel
[PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant
The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check
memorder, which is not constant. This causes compile errors when it is
enabled with RTE_ARM_USE_WFE. eg.

../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/include/rte_common.h:530:56: error: expression in static assertion is not constant
  530 | #define RTE_BUILD_BUG_ON(condition) do { static_assert(!(condition), #condition); } while (0)
      |                                                        ^~~~~~~~~~~~~
../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro ‘RTE_BUILD_BUG_ON’
  156 |         RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
      |         ^~~~~~~~~~~~~~~~

Fix the compile errors by replacing the check with an assert, like in
the generic implementation (lib/eal/include/generic/rte_pause.h).

Fixes: 875f350924b8 ("eal: add a new helper for wait until scheme")

Signed-off-by: Daniel Gregory
---
Cc: feifei.wa...@arm.com
---
 lib/eal/arm/include/rte_pause_64.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..852660091a 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -10,6 +10,8 @@
 extern "C" {
 #endif

+#include <assert.h>
+
 #include <rte_common.h>

 #ifdef RTE_ARM_USE_WFE
@@ -153,8 +155,8 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
 {
 	uint16_t value;

-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);

 	__RTE_ARM_LOAD_EXC_16(addr, value, memorder)
@@ -172,8 +174,8 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
 {
 	uint32_t value;

-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);

 	__RTE_ARM_LOAD_EXC_32(addr, value, memorder)
@@ -191,8 +193,8 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
 {
 	uint64_t value;

-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);

 	__RTE_ARM_LOAD_EXC_64(addr, value, memorder)
--
2.39.2
Re: [PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant
Apologies, mis-sent this before attaching a changelog: v2: * replaced RTE_ASSERT with assert * added Fixes: tag
[PATCH] eal/arm: use stdatomic api in rte_pause
Missed during commit 1ec6a845b5cb ("eal: use stdatomic API in public
headers")

Signed-off-by: Daniel Gregory
---
 lib/eal/arm/include/rte_pause_64.h | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..9e2dbf3531 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -11,6 +11,7 @@ extern "C" {
 #endif

 #include <rte_common.h>
+#include <rte_stdatomic.h>

 #ifdef RTE_ARM_USE_WFE
 #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
@@ -149,7 +150,7 @@ static inline void rte_pause(void)

 static __rte_always_inline void
 rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-		int memorder)
+		rte_memory_order memorder)
 {
 	uint16_t value;
@@ -168,7 +169,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,

 static __rte_always_inline void
 rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
-		int memorder)
+		rte_memory_order memorder)
 {
 	uint32_t value;
@@ -187,7 +188,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,

 static __rte_always_inline void
 rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
-		int memorder)
+		rte_memory_order memorder)
 {
 	uint64_t value;
--
2.39.2
Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
On Fri, May 03, 2024 at 05:56:24PM -0700, Stephen Hemminger wrote: > On Fri, 3 May 2024 10:46:05 +0100 > Daniel Gregory wrote: > > > On Thu, May 02, 2024 at 02:48:26PM -0700, Stephen Hemminger wrote: > > > There are already constant checks like this elsewhere in the file. > > > > Yes, but they're in macros, rather than inlined functions, so my > > understanding was that at compile time, macro expansion has put the > > memorder constant in the _Static_assert call as opposed to still being > > a function parameter in the inline definition. > > Gcc and clang are smart enough that it is possible to use the internal > __builtin_constant_p() in the function. Some examples in DPDK: > > static __rte_always_inline int > rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table, > unsigned int n, struct rte_mempool_cache *cache) > { > int ret; > unsigned int remaining; > uint32_t index, len; > void **cache_objs; > > /* No cache provided */ > if (unlikely(cache == NULL)) { > remaining = n; > goto driver_dequeue; > } > > /* The cache is a stack, so copy will be in reverse order. */ > cache_objs = &cache->objs[cache->len]; > > if (__extension__(__builtin_constant_p(n)) && n <= cache->len) { > > It should be possible to use RTE_BUILD_BUG_ON() or static_assert here. Yes, it's possible to use RTE_BUILD_BUG_ON(!__builtin_constant_p(n)) on Clang, I am simply not seeing it succeed. In fact, the opposite check, that the memorder is not constant, builds just fine with Clang 16. diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h index 5cb8b59056..d0646320e6 100644 --- a/lib/eal/arm/include/rte_pause_64.h +++ b/lib/eal/arm/include/rte_pause_64.h @@ -153,8 +153,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected, { uint16_t value; - RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire && - memorder != rte_memory_order_relaxed); + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); __RTE_ARM_LOAD_EXC_16(addr, value, memorder) if (value != expected) { @@ -172,8 +171,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, { uint32_t value; - RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire && - memorder != rte_memory_order_relaxed); + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); __RTE_ARM_LOAD_EXC_32(addr, value, memorder) if (value != expected) { @@ -191,8 +189,7 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected, { uint64_t value; - RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire && - memorder != rte_memory_order_relaxed); + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); __RTE_ARM_LOAD_EXC_64(addr, value, memorder) if (value != expected) { diff --git a/lib/eal/include/generic/rte_pause.h b/lib/eal/include/generic/rte_pause.h index f2a1eadcbd..3973488865 100644 --- a/lib/eal/include/generic/rte_pause.h +++ b/lib/eal/include/generic/rte_pause.h @@ -80,6 +80,7 @@ static __rte_always_inline void rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected, rte_memory_order memorder) { + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); assert(memorder == rte_memory_order_acquire || memorder == rte_memory_order_relaxed); while (rte_atomic_load_explicit((volatile __rte_atomic uint16_t *)addr, memorder) @@ -91,6 +92,7 @@ static __rte_always_inline void rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected, rte_memory_order memorder) { + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); assert(memorder == rte_memory_order_acquire || memorder == rte_memory_order_relaxed); while 
(rte_atomic_load_explicit((volatile __rte_atomic uint32_t *)addr, memorder) @@ -102,6 +104,7 @@ static __rte_always_inline void rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected, rte_memory_order memorder) { + RTE_BUILD_BUG_ON(__builtin_constant_p(memorder)); assert(memorder == rte_memory_order_acquire || memorder == rte_memory_order_relaxed); while (rte_atomic_load_explicit((volatile __rte_atomic uint64_t *)addr, memorder) This seemed odd, and it doesn't line up with what the GCC documentation says about __builtin_constant_p: > [__builtin_constant_p] does not return 1 when you pass a constant > numeric value to the inline function unless you specify the -O option. https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html So I did some more looking and the behaviour I've seen is that both C
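For what it's worth, the -O sensitivity the GCC documentation describes is easy to reproduce standalone (my example, not from the thread):

#include <stdio.h>

/* __builtin_constant_p on a parameter of an inline function only folds
 * to 1 after inlining and constant propagation, i.e. under optimisation.
 * It is never a constant in the front end, which is where static
 * assertions are evaluated -- hence the errors seen above. */
static inline int
is_const(int v)
{
	return __builtin_constant_p(v);
}

int main(void)
{
	printf("%d\n", is_const(42)); /* 0 at -O0, typically 1 at -O2 */
	return 0;
}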
Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant
On Fri, May 03, 2024 at 06:02:36PM -0700, Stephen Hemminger wrote: > On Thu, 2 May 2024 15:21:16 +0100 > Daniel Gregory wrote: > > > The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check > > memorder, which is not constant. This causes compile errors when it is > > enabled with RTE_ARM_USE_WFE. eg. > > > > ../lib/eal/arm/include/rte_pause_64.h: In function > > ‘rte_wait_until_equal_16’: > > ../lib/eal/include/rte_common.h:530:56: error: expression in static > > assertion is not constant > > 530 | #define RTE_BUILD_BUG_ON(condition) do { > > static_assert(!(condition), #condition); } while (0) > > |^~~~ > > ../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro > > ‘RTE_BUILD_BUG_ON’ > > 156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire && > > | ^~~~ > > > > This has been the case since the switch to C11 assert (537caad2). Fix > > the compile errors by replacing the check with an RTE_ASSERT. > > > > Signed-off-by: Daniel Gregory > > The only calls to rte_wait_until_equal_16 in upstream code > are in the test_bbdev_perf.c and test_timer.c. Looks like > these test never got fixed to use rte_memory_order instead of __ATOMIC_ > defines. Apologies, the commit message could make it clearer, but this is also an issue for rte_wait_until_equal_32 and rte_wait_until_equal_64. rte_wait_until_equal_32 is used in a dozen or so lock tests with the old __ATOMIC_ defines, as well as rte_ring_generic_pvt.h and rte_ring_c11_pvt.h, where it's used with the new rte_memorder_order values. Correct me if I'm wrong, but shouldn't the static assertions in rte_stdatomic.h ensure that mixed usage doesn't cause any issues, even if using the older __ATOMIC_ defines isn't ideal? > And there should be a CI test for ARM that enables the WFE code at least > to ensure it works! Yes, that could've caught this sooner.
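As far as I can tell, the guarantee being relied on here is this arrangement in rte_stdatomic.h (a paraphrased sketch, not the verbatim header):

/* Paraphrased from lib/eal/include/rte_stdatomic.h: without
 * RTE_ENABLE_STDATOMIC, the rte_memory_order_* names are literally the
 * legacy __ATOMIC_* macros... */
typedef int rte_memory_order;
#define rte_memory_order_relaxed __ATOMIC_RELAXED
#define rte_memory_order_acquire __ATOMIC_ACQUIRE

/* ...and with it enabled they map to the C11 memory_order values, which
 * static assertions pin to the same numbers, e.g.:
 *
 *   static_assert(rte_memory_order_acquire == __ATOMIC_ACQUIRE, "...");
 *
 * so passing an old __ATOMIC_* constant where an rte_memory_order is
 * expected stays well-defined in both build modes. */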
Re: [RFC PATCH] eal/riscv: add support for zawrs extension
On Sun, May 12, 2024 at 09:10:49AM +0200, Stanisław Kardach wrote:
> On Thu, May 2, 2024 at 4:44 PM Daniel Gregory wrote:
> > diff --git a/config/riscv/meson.build b/config/riscv/meson.build
> > index 07d7d9da23..4cfdc42ecb 100644
> > --- a/config/riscv/meson.build
> > +++ b/config/riscv/meson.build
> > @@ -26,6 +26,11 @@ flags_common = [
> >     # read from /proc/device-tree/cpus/timebase-frequency. This property is
> >     # guaranteed on Linux, as riscv time_init() requires it.
> >     ['RTE_RISCV_TIME_FREQ', 0],
> > +
> > +    # Enable use of RISC-V Wait-on-Reservation-Set extension (Zawrs)
> > +    # Mitigates looping when polling on memory locations
> > +    # Make sure to add '_zawrs' to your target's -march below
> > +    ['RTE_RISCV_ZAWRS', false]
> A bit orthogonal to this patch (or maybe not?)
> Should we perhaps add a Qemu target in meson.build which would have
> the modified -march for what qemu supports now?

Yes, I can see that being worth doing as part of this patch. In addition
to Zawrs for this patch, GCC 13+ should generate prefetch instructions
for __builtin_prefetch() (lib/eal/include/generic/rte_prefetch.h) if the
Zicbop extension is enabled. Any more in particular you think would
benefit, or would it be best to add every extension GCC 14 supports?

> Or perhaps add machine detection logic based either on the "riscv,isa"
> cpu@0 property in the DT or RHCT ACPI table?

I have had a look and, at least on QEMU 9, this seems non-trivial. The
RHCT ACPI table and /proc/cpuinfo don't list every extension present
(eg. they're missing Zawrs), and the DT, whilst complete, can't be fed
directly into GCC because QEMU reports several newer and non-ratified
extensions that GCC doesn't support yet.

> Or add perhaps some other way we could specify the extension list
> suffix for -march?

Setting -Dcpu_instruction_set to some arbitrary ISA could work with some
minor changes to the build script to not discard it in favour of rv64gc.
Then, we could add a map from ISA extensions to flags that are enabled
when that extension is present in cpu_instruction_set?

Thanks for your review,
Daniel
[PATCH 0/2] eal/riscv: implement prefetch using zicbop
Instructions from RISC-V's Zicbop extension can be used to implement the rte_prefetch* family of functions. On modern versions of GCC (13.1.0+) and Clang (17.0.1+), these are emitted by __builtin_prefetch() when the extension is present. In order to support older compiler versions, this patchset manually emits these instructions using inline assembly. To do this, I have added a new flag, RTE_PREFETCH_WRITE_ARCH_DEFINED, that (similarly to RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED) hides the generic implementation of rte_prefetch*_write. I am still in the process of acquiring hardware that supports this extension, so I haven't tested how this affects performance yet. Daniel Gregory (2): eal: add flag to hide generic prefetch_write eal/riscv: add support for zicbop extension config/riscv/meson.build | 6 +++ lib/eal/include/generic/rte_prefetch.h | 47 + lib/eal/riscv/include/rte_prefetch.h | 57 -- 3 files changed, 90 insertions(+), 20 deletions(-) -- 2.39.2
[PATCH 1/2] eal: add flag to hide generic prefetch_write
This allows for the definition of architecture-specific implementations of the rte_prefetch*_write collection of functions by defining RTE_PREFETCH_WRITE_ARCH_DEFINED. Signed-off-by: Daniel Gregory --- lib/eal/include/generic/rte_prefetch.h | 47 +- 1 file changed, 31 insertions(+), 16 deletions(-) diff --git a/lib/eal/include/generic/rte_prefetch.h b/lib/eal/include/generic/rte_prefetch.h index f9fab5e359..5558376cba 100644 --- a/lib/eal/include/generic/rte_prefetch.h +++ b/lib/eal/include/generic/rte_prefetch.h @@ -65,14 +65,7 @@ static inline void rte_prefetch_non_temporal(const volatile void *p); */ __rte_experimental static inline void -rte_prefetch0_write(const void *p) -{ - /* 1 indicates intention to write, 3 sets target cache level to L1. See -* GCC docs where these integer constants are described in more detail: -* https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html -*/ - __builtin_prefetch(p, 1, 3); -} +rte_prefetch0_write(const void *p); /** * @warning @@ -86,14 +79,7 @@ rte_prefetch0_write(const void *p) */ __rte_experimental static inline void -rte_prefetch1_write(const void *p) -{ - /* 1 indicates intention to write, 2 sets target cache level to L2. See -* GCC docs where these integer constants are described in more detail: -* https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html -*/ - __builtin_prefetch(p, 1, 2); -} +rte_prefetch1_write(const void *p); /** * @warning @@ -105,6 +91,33 @@ rte_prefetch1_write(const void *p) * * @param p Address to prefetch */ +__rte_experimental +static inline void +rte_prefetch2_write(const void *p); + +#ifndef RTE_PREFETCH_WRITE_ARCH_DEFINED +__rte_experimental +static inline void +rte_prefetch0_write(const void *p) +{ + /* 1 indicates intention to write, 3 sets target cache level to L1. See +* GCC docs where these integer constants are described in more detail: +* https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html +*/ + __builtin_prefetch(p, 1, 3); +} + +__rte_experimental +static inline void +rte_prefetch1_write(const void *p) +{ + /* 1 indicates intention to write, 2 sets target cache level to L2. See +* GCC docs where these integer constants are described in more detail: +* https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html +*/ + __builtin_prefetch(p, 1, 2); +} + __rte_experimental static inline void rte_prefetch2_write(const void *p) @@ -116,6 +129,8 @@ rte_prefetch2_write(const void *p) __builtin_prefetch(p, 1, 1); } +#endif /* RTE_PREFETCH_WRITE_ARCH_DEFINED */ + /** * @warning * @b EXPERIMENTAL: this API may change, or be removed, without prior notice -- 2.39.2
[PATCH 2/2] eal/riscv: add support for zicbop extension
The zicbop extension adds instructions for prefetching data into cache. Use them to implement RISCV-specific versions of the rte_prefetch* and rte_prefetch*_write functions. - prefetch.r indicates to hardware that the cache block will be accessed by a data read soon - prefetch.w indicates to hardware that the cache block will be accessed by a data write soon These instructions are emitted by __builtin_prefetch on modern versions of Clang (17.0.1+) and GCC (13.1.0+). For earlier versions, we may not have support for assembling Zicbop instructions, so emit the word that encodes a 'prefetch.[rw] 0(a0)' instruction. This new functionality is controlled by a Meson flag that is disabled by default. Whilst it's a hint, like rte_pause(), and so has no effect if the target doesn't support the extension, it requires the address prefetched to be loaded into a0, which may be costly. Signed-off-by: Daniel Gregory Suggested-by: Punit Agrawal --- config/riscv/meson.build | 6 +++ lib/eal/riscv/include/rte_prefetch.h | 57 ++-- 2 files changed, 59 insertions(+), 4 deletions(-) diff --git a/config/riscv/meson.build b/config/riscv/meson.build index 07d7d9da23..ecf9da1c39 100644 --- a/config/riscv/meson.build +++ b/config/riscv/meson.build @@ -26,6 +26,12 @@ flags_common = [ # read from /proc/device-tree/cpus/timebase-frequency. This property is # guaranteed on Linux, as riscv time_init() requires it. ['RTE_RISCV_TIME_FREQ', 0], + +# When true override the default implementation of the prefetching functions +# (rte_prefetch*) with a version that explicitly uses the Zicbop extension. +# Do not enable when using modern versions of GCC (13.1.0+) or Clang +# (17.0.1+). They will emit these instructions in the default implementation +['RTE_RISCV_ZICBOP', false], ] ## SoC-specific options. diff --git a/lib/eal/riscv/include/rte_prefetch.h b/lib/eal/riscv/include/rte_prefetch.h index 748cf1b626..82cad526b3 100644 --- a/lib/eal/riscv/include/rte_prefetch.h +++ b/lib/eal/riscv/include/rte_prefetch.h @@ -14,21 +14,42 @@ extern "C" { #include #include + +#ifdef RTE_RISCV_ZICBOP +#define RTE_PREFETCH_WRITE_ARCH_DEFINED +#endif + #include "generic/rte_prefetch.h" +/* + * Modern versions of GCC & Clang will emit prefetch instructions for + * __builtin_prefetch when the Zicbop extension is present. + * The RTE_RISCV_ZICBOP option controls whether we emit them manually for older + * compilers that may not have the support to assemble them. 
+ */ static inline void rte_prefetch0(const volatile void *p) { - RTE_SET_USED(p); +#ifndef RTE_RISCV_ZICBOP + /* by default __builtin_prefetch prepares for a read */ + __builtin_prefetch((const void *)p); +#else + /* prefetch.r 0(a0) */ + register const volatile void *a0 asm("a0") = p; + asm volatile (".int 0x00156013" : : "r" (a0)); +#endif } +/* + * The RISC-V Zicbop extension doesn't have instructions to prefetch to only a + * subset of cache levels, so fallback to rte_prefetch0 + */ static inline void rte_prefetch1(const volatile void *p) { - RTE_SET_USED(p); + rte_prefetch0(p); } - static inline void rte_prefetch2(const volatile void *p) { - RTE_SET_USED(p); + rte_prefetch0(p); } static inline void rte_prefetch_non_temporal(const volatile void *p) @@ -44,6 +65,34 @@ rte_cldemote(const volatile void *p) RTE_SET_USED(p); } +#ifdef RTE_RISCV_ZICBOP +__rte_experimental +static inline void +rte_prefetch0_write(const void *p) +{ + /* prefetch.w 0(a0) */ + register const void *a0 asm("a0") = p; + asm volatile (".int 0x00356013" : : "r" (a0)); +} + +/* + * The RISC-V Zicbop extension doesn't have instructions to prefetch to only a + * subset of cache levels, so fallback to rte_prefetch0_write + */ +__rte_experimental +static inline void +rte_prefetch1_write(const void *p) +{ + rte_prefetch0_write(p); +} +__rte_experimental +static inline void +rte_prefetch2_write(const void *p) +{ + rte_prefetch0_write(p); +} +#endif /* RTE_RISCV_ZICBOP */ + #ifdef __cplusplus } #endif -- 2.39.2
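For reviewers checking the magic constants, this is how the two hard-coded words decode (my breakdown against the Zicbop spec; worth double-checking):

/*
 * Zicbop prefetches reuse the OP-IMM/ORI encoding with rd = x0:
 *
 *   0x00156013: imm[11:0]=0x001, rs1=01010 (a0), funct3=110 (ORI),
 *               rd=00000 (x0), opcode=0010011  ->  prefetch.r 0(a0)
 *   0x00356013: imm[11:0]=0x003, otherwise identical -> prefetch.w 0(a0)
 *
 * imm[4:0] selects the operation (0 = prefetch.i, 1 = prefetch.r,
 * 3 = prefetch.w) and imm[11:5] can only hold a 32-byte-aligned offset,
 * so the full address has to live in a register. Hard-coding the word
 * also hard-codes rs1, which is why the patch pins the address to a0
 * with a register constraint.
 */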
Re: [PATCH 2/2] eal/riscv: add support for zicbop extension
On Thu, May 30, 2024 at 06:19:48PM +0100, Daniel Gregory wrote: > + * The RTE_RISCV_ZICBOP option controls whether we emit them manually for > older > + * compilers that may not have the support to assemble them. > + */ > static inline void rte_prefetch0(const volatile void *p) > { > - RTE_SET_USED(p); > +#ifndef RTE_RISCV_ZICBOP > + /* by default __builtin_prefetch prepares for a read */ > + __builtin_prefetch((const void *)p); This cast causes warnings (which are treated as errors by the 0-day Robot) due to it discarding the 'volatile' on p. Removing the volatile from the definition of rte_prefetch0 causes build failures in some drivers (txgbe_rxtx.c:1809, ixgbe_rxtx.c:2174, enic_rxtx.c:127, ...). rte_prefetch0_write takes its argument as 'const void *' and so can use __builtin_prefetch().
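One candidate fix (my suggestion, untested against the CI robot): launder the qualifiers through uintptr_t, so no pointer-to-pointer conversion ever discards the volatile:

#include <stdint.h>

/* Assumed workaround for the -Wdiscarded-qualifiers report above: a
 * pointer-to-integer-to-pointer round trip has no qualifiers to
 * discard, so __builtin_prefetch still receives the address without
 * triggering the warning, and the public prototype keeps its volatile. */
static inline void rte_prefetch0(const volatile void *p)
{
	__builtin_prefetch((const void *)(uintptr_t)p);
}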
[PATCH 0/5] riscv: implement accelerated crc using zbc
The RISC-V Zbc extension adds instructions for carry-less multiplication
we can use to implement CRC in hardware. This patchset contains two new
implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

These implementations are behind a new flag, RTE_RISCV_ZBC. Due to use
of bitmanip compiler intrinsics, a modern version of GCC (14+) or Clang
(18+) is required to compile with this flag enabled.

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

| Buffer    | CRC32-ETH (lib/net)  | CRC32C (lib/hash)   |
| Size (MB) | Table   | Hardware   | Table  | Hardware   |
|-----------|---------|------------|--------|------------|
| 1         | 155168  | 11610      | 73026  | 18385      |
| 2         | 311203  | 22998      | 145586 | 35886      |
| 3         | 466744  | 34370      | 218536 | 53939      |
| 4         | 621843  | 45536      | 291574 | 71944      |
| 5         | 777908  | 56989      | 364152 | 89706      |
| 6         | 932736  | 68023      | 437016 | 107726     |
| 7         | 1088756 | 79236      | 510197 | 125426     |
| 8         | 1243794 | 90467      | 583231 | 143614     |

These results suggest a speed-up of lib/net by thirteen times, and of
lib/hash by four times.

Daniel Gregory (5):
  config/riscv: add flag for using Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv

 MAINTAINERS                    |   2 +
 app/test/test_crc.c            |   9 ++
 app/test/test_hash.c           |   7 ++
 config/riscv/meson.build       |   7 ++
 examples/l3fwd/l3fwd_em.c      |   2 +-
 lib/hash/meson.build           |   1 +
 lib/hash/rte_crc_riscv64.h     |  89 +++++++++++++
 lib/hash/rte_hash_crc.c        |  12 +-
 lib/hash/rte_hash_crc.h        |   6 ++-
 lib/ip_frag/ip_frag_internal.c |   6 +-
 lib/net/meson.build            |   4 +
 lib/net/net_crc.h              |  11 ++
 lib/net/net_crc_zbc.c          | 202 +++++++++++++++++++++++++++++
 lib/net/rte_net_crc.c          |  35 ++++++
 lib/net/rte_net_crc.h          |   2 +
 15 files changed, 389 insertions(+), 6 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h
 create mode 100644 lib/net/net_crc_zbc.c
--
2.39.2
[PATCH 1/5] config/riscv: add flag for using Zbc extension
The RISC-V Zbc extension adds carry-less multiply instructions we can use to implement more efficient CRC hashing algorithms. Signed-off-by: Daniel Gregory --- config/riscv/meson.build | 7 +++ 1 file changed, 7 insertions(+) diff --git a/config/riscv/meson.build b/config/riscv/meson.build index 07d7d9da23..4bda4089bd 100644 --- a/config/riscv/meson.build +++ b/config/riscv/meson.build @@ -26,6 +26,13 @@ flags_common = [ # read from /proc/device-tree/cpus/timebase-frequency. This property is # guaranteed on Linux, as riscv time_init() requires it. ['RTE_RISCV_TIME_FREQ', 0], + +# Use RISC-V Carry-less multiplication extension (Zbc) for hardware +# implementations of CRC-32C (lib/hash/rte_crc_riscv64.h), CRC-32 and CRC-16 +# (lib/net/net_crc_zbc.c). Requires intrinsics available in GCC 14.1.0+ and +# Clang 18.1.0+ +# Make sure to add '_zbc' to your target's -march below +['RTE_RISCV_ZBC', false], ] ## SoC-specific options. -- 2.39.2
[PATCH 2/5] hash: implement crc using riscv carryless multiply
Using carryless multiply instructions from RISC-V's Zbc extension,
implement a Barrett reduction that calculates CRC-32C checksums.

Based on the approach described by Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", which
is also described here
(https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Signed-off-by: Daniel Gregory
---
 MAINTAINERS                |  1 +
 app/test/test_hash.c       |  7 +++
 lib/hash/meson.build       |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++++++++++++++++++++++++++++++++++++++
 lib/hash/rte_hash_crc.c    | 12 +++++-
 lib/hash/rte_hash_crc.h    |  6 ++-
 6 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 472713124c..48800f39c4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -318,6 +318,7 @@ M: Stanislaw Kardach
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h

 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 24d3b547ad..c8c4197ad8 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -205,6 +205,13 @@ test_crc32_hash_alg_equiv(void)
 			printf("Failed checking CRC32_SW against CRC32_ARM64\n");
 			break;
 		}
+
+		/* Check against 8-byte-operand RISCV64 CRC32 if available */
+		rte_hash_crc_set_alg(CRC32_RISCV64);
+		if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+			printf("Failed checking CRC32_SW against CRC32_RISCV64\n");
+			break;
+		}
 	}

 	/* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
         'rte_crc_arm64.h',
         'rte_crc_generic.h',
+        'rte_crc_riscv64.h',
         'rte_crc_sw.h',
         'rte_crc_x86.h',
         'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41) and mu and our CRC32 algorithm.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+	assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+	/* Combine data with the initial value */
+	uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+	/*
+	 * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+	 * the lower 64 bits of the result (remember we're inverted)
+	 */
+	crc = __riscv_clmul_64(crc, mu);
+	/* Multiply by P */
+	crc = __riscv_clmulh_64(crc, p);
+
+	/* Subtract from original (only needed for smaller sizes) */
+	if (bits == 16 || bits == 8)
+		crc ^= init_val >> bits;
+
+	return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software in case the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+	if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+		return crc32c_riscv64(data, init_val, 8);
+
+	return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+	if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+		return crc32c_riscv64(data, init_val, 16);
+
+	return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+	if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+		return crc32c_riscv64(data, init_val, 32);
+
+	return crc32c_1word(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_8b
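As a quick usage sketch (my example, mirroring what the test_hash.c hunk above exercises, and assuming the setter keeps its usual fall-back-to-best-available behaviour when Zbc is absent):

#include <stdint.h>
#include <rte_hash_crc.h>

/* Select the Zbc-backed algorithm once at startup; subsequent calls to
 * rte_hash_crc and the rte_hash_crc_*byte helpers then take the
 * carry-less-multiply path shown above. */
static void
flow_hash_init(void)
{
	rte_hash_crc_set_alg(CRC32_RISCV64);
}

static uint32_t
flow_hash(const void *key, uint32_t key_len)
{
	return rte_hash_crc(key, key_len, 0xFFFFFFFF);
}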
[PATCH 3/5] net: implement crc using riscv carryless multiply
Using carryless multiply instructions (clmul) from RISC-V's Zbc
extension, implement CRC-32 and CRC-16 calculations on buffers.

Based on the approach described in Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", we
perform repeated folds-by-1 whilst the buffer is still big enough, then
perform Barrett's reductions on the rest.

Add a case to the crc_autotest suite that tests this implementation.

This implementation is enabled by setting the RTE_RISCV_ZBC flag (see
config/riscv/meson.build).

Signed-off-by: Daniel Gregory
---
 MAINTAINERS           |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h     |  11 +++
 lib/net/net_crc_zbc.c | 202 ++++++++++++++++++++++++++++++++++++++++++
 lib/net/rte_net_crc.c |  35 ++++++++
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 264 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 48800f39c4..6562e62779 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -319,6 +319,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c

 Intel x86
 M: Bruce Richardson
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
 		return ret;
 	}

+	/* set CRC riscv mode */
+	rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+	ret = test_crc_calc();
+	if (ret < 0) {
+		printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+		return ret;
+	}
+
 	return 0;
 }
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..f2ae019bea 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
        cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
     sources += files('net_crc_neon.c')
     cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and dpdk_conf.has('RTE_RISCV_ZBC') and
+       dpdk_conf.get('RTE_RISCV_ZBC'))
+    sources += files('net_crc_zbc.c')
+    cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);

+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..5907d69471
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <rte_common.h>
+#include <riscv_bitmanip.h>
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+	uint64_t Pr;
+	uint64_t mu;
+	uint64_t k3;
+	uint64_t k4;
+	uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+	const uint64_t data,
+	uint32_t crc,
+	uint32_t bits,
+	const struct crc_clmul_ctx *params)
+{
+	assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+	/* Combine data with the initial value */
+	uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+	/*
+	 * Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+	 * the lower 64 bits of the result (remember we're inverted)
+	 */
+	temp = __riscv_clmul_64(temp, params->mu);
+	/* Multiply by P */
+	temp = __riscv_clmulh_64(temp, params->Pr);
+
+	/* Subtract from original (only needed for smaller sizes) */
+	if (bits == 16 || bits == 8)
+		temp ^= crc >> bits;
+
+	return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+	const uint8_t *data,
+	uint32_t data_len,
+	uint32_t crc,
+	const struct crc_clmul_ctx *params)
+{
+	while (data_len >= 8) {
+		crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+		data += 8;
+		data_len -= 8;
+	}
+	if (data_len &
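For readers without the whitepaper to hand, the fold-by-1 step that the truncated listing leads into looks schematically like this (my sketch, reusing the struct crc_clmul_ctx above; the exact constant/half pairing depends on the reflected-domain convention the real file's init code sets up):

/* Schematic fold-by-1 (not the posted code): treat the running 128-bit
 * remainder as the pair (f1:f0), carry-less-multiply each 64-bit half by
 * a precomputed constant (k3 and k4, standing for x^128 and x^192 mod P)
 * and XOR in the next 16 bytes of the buffer. clmul/clmulh yield the
 * low/high 64 bits of each 128-bit product. */
static inline void
crc32_fold_by_1(uint64_t *f0, uint64_t *f1, const uint64_t *data,
	const struct crc_clmul_ctx *params)
{
	uint64_t lo = __riscv_clmul_64(*f0, params->k3) ^
		__riscv_clmul_64(*f1, params->k4);
	uint64_t hi = __riscv_clmulh_64(*f0, params->k3) ^
		__riscv_clmulh_64(*f1, params->k4);

	*f0 = lo ^ data[0];
	*f1 = hi ^ data[1];
}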
[PATCH 4/5] examples/l3fwd: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash. Signed-off-by: Daniel Gregory --- examples/l3fwd/l3fwd_em.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c index d98e66ea2c..4cec2dc6a9 100644 --- a/examples/l3fwd/l3fwd_em.c +++ b/examples/l3fwd/l3fwd_em.c @@ -29,7 +29,7 @@ #include "l3fwd_event.h" #include "em_route_parse.c" -#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) +#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_ZBC) #define EM_HASH_CRC 1 #endif -- 2.39.2
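For context, the guard changed above feeds the hash-function selection a few lines further down in l3fwd_em.c (abridged from my reading of the upstream file; not part of this diff):

/* Abridged from examples/l3fwd/l3fwd_em.c: with EM_HASH_CRC defined, the
 * exact-match tables hash with rte_hash_crc instead of rte_jhash, which
 * is why extending the one-line guard is all the example needs. */
#ifdef EM_HASH_CRC
#include <rte_hash_crc.h>
#define DEFAULT_HASH_FUNC       rte_hash_crc
#else
#include <rte_jhash.h>
#define DEFAULT_HASH_FUNC       rte_jhash
#endif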
[PATCH 5/5] ipfrag: use accelerated crc on riscv
When the RISC-V Zbc (carryless multiplication) extension is present, an implementation of CRC hashing using hardware instructions is available. Use it rather than jhash. Signed-off-by: Daniel Gregory --- lib/ip_frag/ip_frag_internal.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c index 7cbef647df..7806264078 100644 --- a/lib/ip_frag/ip_frag_internal.c +++ b/lib/ip_frag/ip_frag_internal.c @@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2) p = (const uint32_t *)&key->src_dst; -#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) +#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_ZBC) v = rte_hash_crc_4byte(p[0], PRIME_VALUE); v = rte_hash_crc_4byte(p[1], v); v = rte_hash_crc_4byte(key->id, v); #else v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE); -#endif /* RTE_ARCH_X86 */ +#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_ZBC */ *v1 = v; *v2 = (v << 7) + (v >> 14); @@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2) p = (const uint32_t *) &key->src_dst; -#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) +#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_ZBC) v = rte_hash_crc_4byte(p[0], PRIME_VALUE); v = rte_hash_crc_4byte(p[1], v); v = rte_hash_crc_4byte(p[2], v); -- 2.39.2
RE: [PATCH 1/5] config/riscv: add flag for using Zbc extension
On Wed, Jun 19, 2024 at 09:08:14AM +0200, Morten Brørup wrote: > > From: Stephen Hemminger [mailto:step...@networkplumber.org] > 1/5] config/riscv: add flag for using Zbc extension > > > > On Tue, 18 Jun 2024 18:41:29 +0100 > > Daniel Gregory wrote: > > > > > diff --git a/config/riscv/meson.build b/config/riscv/meson.build > > > index 07d7d9da23..4bda4089bd 100644 > > > --- a/config/riscv/meson.build > > > +++ b/config/riscv/meson.build > > > @@ -26,6 +26,13 @@ flags_common = [ > > > # read from /proc/device-tree/cpus/timebase-frequency. This property > > > is > > > # guaranteed on Linux, as riscv time_init() requires it. > > > ['RTE_RISCV_TIME_FREQ', 0], > > > + > > > +# Use RISC-V Carry-less multiplication extension (Zbc) for hardware > > > +# implementations of CRC-32C (lib/hash/rte_crc_riscv64.h), CRC-32 and > > CRC-16 > > > +# (lib/net/net_crc_zbc.c). Requires intrinsics available in GCC > > > 14.1.0+ > > and > > > +# Clang 18.1.0+ > > > +# Make sure to add '_zbc' to your target's -march below > > > +['RTE_RISCV_ZBC', false], > > > ] > > > > Please do not add more config options via compile flags. > > It makes it impossible for distros to ship one version. > > > > Instead, detect at compile or runtime > > Build time detection is not possible for cross builds. > How about build time detection based on the target's configured instruction set (either specified by cross-file or passed in through -Dinstruction_set)? We could have a map from extensions present in the ISA string to compile flags that should be enabled. I suggested this whilst discussing a previous patch adding support for the Zawrs extension, but haven't heard back from Stanisław yet: https://lore.kernel.org/dpdk-dev/20240520094854.GA3672529@ste-uk-lab-gw/ As for runtime detection, newer kernel versions have a hardware probing interface for detecting the presence of extensions, support could be added to rte_cpuflags.c? https://docs.kernel.org/arch/riscv/hwprobe.html In combination, distros on newer kernels could ship a version that has these optimisations baked in that falls back to a generic implementation when the extension is detected to not be present, and systems without the latest GCC/Clang can still compile by specifying a target ISA that doesn't include "_zbc".
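A minimal sketch of the runtime side (my code, not part of the patchset): it assumes a kernel new enough to expose the hwprobe syscall and to define RISCV_HWPROBE_EXT_ZBC in <asm/hwprobe.h>, per the document linked above.

#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/hwprobe.h>

static bool
riscv_has_zbc(void)
{
#ifdef RISCV_HWPROBE_EXT_ZBC
	struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IM_EXT_0 };

	/* cpusetsize == 0 and cpus == NULL query all online harts; a bit
	 * in IM_EXT_0 is only set when every hart has the extension */
	if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
		return false;
	return (pair.value & RISCV_HWPROBE_EXT_ZBC) != 0;
#else
	/* headers predate the Zbc hwprobe bit: assume absent */
	return false;
#endif
}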
Re: [PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant
On Thu, Jun 27, 2024 at 05:08:51PM +0200, Thomas Monjalon wrote: > 04/05/2024 02:59, Stephen Hemminger: > > On Fri, 3 May 2024 19:27:30 +0100 > > Daniel Gregory wrote: > > > > > The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check > > > memorder, which is not constant. This causes compile errors when it is > > > enabled with RTE_ARM_USE_WFE. eg. > > > > > > ../lib/eal/arm/include/rte_pause_64.h: In function > > > ‘rte_wait_until_equal_16’: > > > ../lib/eal/include/rte_common.h:530:56: error: expression in static > > > assertion is not constant > > > 530 | #define RTE_BUILD_BUG_ON(condition) do { > > > static_assert(!(condition), #condition); } while (0) > > > | > > > ^~~~ > > > ../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro > > > ‘RTE_BUILD_BUG_ON’ > > > 156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire && > > > | ^~~~ > > > > > > Fix the compile errors by replacing the check with an assert, like in > > > the generic implementation (lib/eal/include/generic/rte_pause.h). > > > > No, don't hide the problem. > > > > What code is calling these. Looks like a real bug. Could be behind layers > > of wrappers. > > I support Stephen's opinion. > Please look for the real issue. In DPDK, I have found 26 calls of rte_wait_until_equal_16, largely split between app/test-bbdev/test_bbdev_perf.c and app/test/test_timer.c, with a couple calls in lib/eal/include/rte_pflock.h and lib/eal/include/rte_ticketlock.h as well. 16 calls of rte_wait_until_equal_32, spread amongst various test cases (test_func_reentrancy.c test_mcslock.c, test_mempool_perf.c, ...), two drivers (drivers/event/opdl/opdl_ring.c and drivers/net/thunderx/nicvf_rxrx.c), lib/eal/common/eal_common_mcfg.c, lib/eal/include/generic/rte_spinlock.h, lib/ring/rte_ring_c11_pvt.h, lib/ring/rte_ring_generic_pvt.h and lib/eal/include/rte_mcslock.h. There is a single call to rte_wait_until_equal_64 in app/test/test_pmd_perf.c They all correctly use the primitives from rte_stdatomic.h As I discussed on another chain https://lore.kernel.org/dpdk-dev/20240509110251.GA3795959@ste-uk-lab-gw/ from what I've seen, it seems that neither Clang nor GCC allow for static checks on the parameters of inline functions. For instance, the following does not compile: static inline __attribute__((always_inline)) int fn(int val) { _Static_assert(val == 0, "val nonzero"); return 0; } int main(void) { return fn(0); } ( https://godbolt.org/z/TrfWqYoGo ) With the same "expression in static assertion is not constant" error that I get when cross-compiling DPDK for ARM with WFE enabled on main: diff --git a/config/arm/meson.build b/config/arm/meson.build index a45aa9e466..661c735977 100644 --- a/config/arm/meson.build +++ b/config/arm/meson.build @@ -18,7 +18,7 @@ flags_common = [ #['RTE_ARM64_MEMCPY_STRICT_ALIGN', false], # Enable use of ARM wait for event instruction. -# ['RTE_ARM_USE_WFE', false], +['RTE_ARM_USE_WFE', true], ['RTE_ARCH_ARM64', true], ['RTE_CACHE_LINE_SIZE', 128]
Re: [PATCH v3 0/9] riscv: implement accelerated crc using zbc
Would it be possible to get a review on this patchset? I would be happy to hear any feedback on the approach to RISC-V extension detection or how I have implemented the hardware-optimised CRCs. Kind regards, Daniel