[PATCH v2 0/9] riscv: implement accelerated crc using zbc

2024-07-12 Thread Daniel Gregory
The RISC-V Zbc extension adds instructions for carry-less multiplication
we can use to implement CRC in hardware. This patch set contains two new
implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new
build-time and run-time detection of the RISC-V extensions present in
the compiler and on the target system.
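
For illustration, a minimal usage sketch (assuming Zbc was detected at
run time; buf and buf_len are placeholders):

    #include <rte_net_crc.h>

    /* Select the Zbc-backed handlers added by this series; set_alg is
     * expected to fall back to an existing implementation otherwise */
    rte_net_crc_set_alg(RTE_NET_CRC_ZBC);

    /* Same calculation API as before, now dispatching to the Zbc handler */
    uint32_t crc = rte_net_crc_calc(buf, buf_len, RTE_NET_CRC32_ETH);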

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

|Buffer| CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
|Size (MB) | Table| Hardware | Table| Hardware |
|--|--|--|--|--|
|1 |   155168 |11610 |73026 |18385 |
|2 |   311203 |22998 |   145586 |35886 |
|3 |   466744 |34370 |   218536 |53939 |
|4 |   621843 |45536 |   291574 |71944 |
|5 |   777908 |56989 |   364152 |89706 |
|6 |   932736 |68023 |   437016 |   107726 |
|7 |  1088756 |79236 |   510197 |   125426 |
|8 |  1243794 |90467 |   583231 |   143614 |

These results suggest a roughly thirteen-fold speed-up for lib/net and a
four-fold speed-up for lib/hash.

I have also run the hash_functions_autotest benchmark in dpdk_test,
which measures the performance of the lib/hash implementation on small
buffers, getting the following times:

| Key Length | Time (ticks/op) |
| (bytes)| Table| Hardware |
||--|--|
|  1 | 0.47 | 0.85 |
|  2 | 0.57 | 0.87 |
|  4 | 0.99 | 0.88 |
|  8 | 1.35 | 0.88 |
|  9 | 1.20 | 1.09 |
| 13 | 1.76 | 1.35 |
| 16 | 1.87 | 1.02 |
| 32 | 2.96 | 0.98 |
| 37 | 3.35 | 1.45 |
| 40 | 3.49 | 1.12 |
| 48 | 4.02 | 1.25 |
| 64 | 5.08 | 1.54 |

v2:
- replace compile flag with build-time (riscv extension macros) and
  run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on
  small unaligned buffers
- refactor net/ implementation to move variable declarations to top
  of functions
- enable the optimisation in a couple of other places where optimised
  CRC is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv
  examples/l3fwd-power: use accelerated crc on riscv
  hash/cuckoo: use accelerated crc on riscv
  member: use accelerated crc on riscv

 MAINTAINERS   |   2 +
 app/test/test_crc.c   |   9 +
 app/test/test_hash.c  |   7 +
 config/riscv/meson.build  |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc   |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |   5 +
 examples/l3fwd-power/main.c   |   2 +-
 examples/l3fwd/l3fwd_em.c |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h  |   2 +
 lib/eal/riscv/rte_cpuflags.c  | 112 +++---
 lib/hash/meson.build  |   1 +
 lib/hash/rte_crc_riscv64.h|  89 
 lib/hash/rte_cuckoo_hash.c|   3 +
 lib/hash/rte_hash_crc.c   |  13 +-
 lib/hash/rte_hash_crc.h   |   6 +-
 lib/ip_frag/ip_frag_internal.c|   6 +-
 lib/member/rte_member.h   |   2 +-
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h |  11 +
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c |  40 
 lib/net/rte_net_crc.h |   2

[PATCH v2 1/9] config/riscv: detect presence of Zbc extension

2024-07-12 Thread Daniel Gregory
The RISC-V Zbc extension adds carry-less multiply instructions we can
use to implement more efficient CRC hashing algorithms.

The RISC-V C API defines architecture extension test macros
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/riscv-c-api.md#architecture-extension-test-macros
These let us detect whether the Zbc extension is supported on the
compiler and -march we're building with. The C API also defines Zbc
intrinsics we can use rather than inline assembly on newer versions of
GCC (14.1.0+) and Clang (18.1.0+).

The Linux kernel exposes a RISC-V hardware probing syscall for getting
information about the system at run-time including which extensions are
available. We detect whether this interface is present by looking for
the <asm/hwprobe.h> header, as it's only present in newer kernels
(v6.4+). Furthermore, support for detecting certain extensions,
including Zbc, wasn't present until versions after this, so we need to
check the constants this header exports.
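
For illustration, a minimal sketch of the run-time query this enables
(assuming a kernel that exposes the interface; error handling elided):

    #include <asm/hwprobe.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Ask the kernel whether all online harts implement Zbc */
    static int
    riscv_has_zbc(void)
    {
        struct riscv_hwprobe pair = { .key = RISCV_HWPROBE_KEY_IMA_EXT_0 };

        if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) != 0)
            return 0;
        return (pair.value & RISCV_HWPROBE_EXT_ZBC) != 0;
    }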

The kernel exposes bitmasks for each extension supported by the probing
interface, rather than the bit index that is set if that extension is
present, so modify the existing cpu flag HWCAP table entries to line up
with this. The values returned by the interface are 64 bits long, so
grow the hwcap registers array to be able to hold them.

If the Zbc extension and intrinsics are both present and we can detect
the Zbc extension at runtime, we define a flag, RTE_RISCV_FEATURE_ZBC.

Signed-off-by: Daniel Gregory 
---
 config/riscv/meson.build |  41 ++
 lib/eal/riscv/include/rte_cpuflags.h |   2 +
 lib/eal/riscv/rte_cpuflags.c | 112 +++
 3 files changed, 123 insertions(+), 32 deletions(-)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..5d8411b254 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -119,6 +119,47 @@ foreach flag: arch_config['machine_args']
 endif
 endforeach
 
+# check if we can do buildtime detection of extensions supported by the target
+riscv_extension_macros = false
+if (cc.get_define('__riscv_arch_test', args: machine_args) == '1')
+  message('Detected architecture extension test macros')
+  riscv_extension_macros = true
+else
+  warning('RISC-V architecture extension test macros not available. Build-time detection of extensions not possible')
+endif
+
+# check if we can use hwprobe interface for runtime extension detection
+riscv_hwprobe = false
+if (cc.check_header('asm/hwprobe.h', args: machine_args))
+  message('Detected hwprobe interface, enabling runtime detection of supported extensions')
+  machine_args += ['-DRTE_RISCV_FEATURE_HWPROBE']
+  riscv_hwprobe = true
+else
+  warning('Hwprobe interface not available (present in Linux v6.4+), instruction extensions won\'t be enabled')
+endif
+
+# detect extensions
+# RISC-V Carry-less multiplication extension (Zbc) for hardware implementations
+# of CRC-32C (lib/hash/rte_crc_riscv64.h) and CRC-32/16 (lib/net/net_crc_zbc.c).
+# Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
+if (riscv_extension_macros and riscv_hwprobe and
+(cc.get_define('__riscv_zbc', args: machine_args) != ''))
+  if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
+  or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
+# determine whether we can detect Zbc extension (this wasn't possible until
+# Linux kernel v6.8)
+if (cc.compiles('''#include <asm/hwprobe.h>
+   int a = RISCV_HWPROBE_EXT_ZBC;''', args: machine_args))
+  message('Compiling with the Zbc extension')
+  machine_args += ['-DRTE_RISCV_FEATURE_ZBC']
+else
+  warning('Detected Zbc extension but cannot use because runtime detection doesn\'t support it (support present in Linux kernel v6.8+)')
+endif
+  else
+warning('Detected Zbc extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
+  endif
+endif
+
 # apply flags
 foreach flag: dpdk_flags
 if flag.length() > 0
diff --git a/lib/eal/riscv/include/rte_cpuflags.h b/lib/eal/riscv/include/rte_cpuflags.h
index d742efc40f..4e26b584b3 100644
--- a/lib/eal/riscv/include/rte_cpuflags.h
+++ b/lib/eal/riscv/include/rte_cpuflags.h
@@ -42,6 +42,8 @@ enum rte_cpu_flag_t {
RTE_CPUFLAG_RISCV_ISA_X, /* Non-standard extension present */
RTE_CPUFLAG_RISCV_ISA_Y, /* Reserved */
RTE_CPUFLAG_RISCV_ISA_Z, /* Reserved */
+
+   RTE_CPUFLAG_RISCV_EXT_ZBC, /* Carry-less multiplication */
 };
 
 #include "generic/rte_cpuflags.h"
diff --git a/lib/eal/riscv/rte_cpuflags.c b/lib/eal/riscv/rte_cpuflags.c
index eb4105c18b..dedf0395ab 100644
--- a/lib/eal/riscv/rte_cpuflags.c

[PATCH v2 2/9] hash: implement crc using riscv carryless multiply

2024-07-12 Thread Daniel Gregory
Using carryless multiply instructions from RISC-V's Zbc extension,
implement a Barrett reduction that calculates CRC-32C checksums.

Based on the approach described by Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", which
is also described here
(https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Add a case to the autotest_hash unit test.
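
A minimal usage sketch (assuming run-time Zbc support, so the call does
not fall back to the software table; key and key_len are placeholders):

    #include <rte_hash_crc.h>

    /* Request the Zbc Barrett-reduction implementation added here */
    rte_hash_crc_set_alg(CRC32_RISCV64);

    /* Unchanged API; internally handled 8/4/2/1 bytes at a time */
    uint32_t hash = rte_hash_crc(key, key_len, 0xFFFFFFFF);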

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS|  1 +
 app/test/test_hash.c   |  7 +++
 lib/hash/meson.build   |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++
 lib/hash/rte_hash_crc.c| 13 +-
 lib/hash/rte_hash_crc.h|  6 ++-
 6 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 533f707d5f..81f13ebcf2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -318,6 +318,7 @@ M: Stanislaw Kardach 
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 24d3b547ad..c8c4197ad8 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -205,6 +205,13 @@ test_crc32_hash_alg_equiv(void)
printf("Failed checking CRC32_SW against 
CRC32_ARM64\n");
break;
}
+
+   /* Check against 8-byte-operand RISCV64 CRC32 if available */
+   rte_hash_crc_set_alg(CRC32_RISCV64);
+   if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+   printf("Failed checking CRC32_SW against CRC32_RISCV64\n");
+   break;
+   }
}
 
/* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
 'rte_crc_arm64.h',
 'rte_crc_generic.h',
+'rte_crc_riscv64.h',
 'rte_crc_sw.h',
 'rte_crc_x86.h',
 'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41), mu, and our CRC32 algorithm.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   crc = __riscv_clmul_64(crc, mu);
+   /* Multiply by P */
+   crc = __riscv_clmulh_64(crc, p);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   crc ^= init_val >> bits;
+
+   return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software in case the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 8);
+
+   return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 16);
+
+   return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 32);
+
+   return crc32c_1word(data, init_val);

[PATCH v2 3/9] net: implement crc using riscv carryless multiply

2024-07-12 Thread Daniel Gregory
Using carryless multiply instructions (clmul) from RISC-V's Zbc
extension, implement CRC-32 and CRC-16 calculations on buffers.

Based on the approach described in Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", we
perform repeated folds-by-1 whilst the buffer is still big enough, then
perform Barrett's reductions on the rest.
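
For illustration, one fold-by-1 step as it might look with the Zbc
intrinsics; kL and kH are hypothetical names for the precomputed
x^a mod P fold constants (the patch keeps such constants in
struct crc_clmul_ctx):

    #include <stdint.h>
    #include <riscv_bitmanip.h>

    /*
     * Fold the 128-bit running remainder s (two 64-bit halves) over the
     * next 16 bytes of input. Each 64x64 carry-less product needs both
     * clmul (low half) and clmulh (high half).
     */
    static inline void
    crc_fold_by_1(uint64_t s[2], const uint64_t next[2],
            uint64_t kL, uint64_t kH)
    {
        uint64_t lo = __riscv_clmul_64(s[0], kL) ^ __riscv_clmul_64(s[1], kH);
        uint64_t hi = __riscv_clmulh_64(s[0], kL) ^ __riscv_clmulh_64(s[1], kH);

        s[0] = lo ^ next[0];
        s[1] = hi ^ next[1];
    }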

Add a case to the crc_autotest suite that tests this implementation.

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS   |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h |  11 +++
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c |  40 +
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 258 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 81f13ebcf2..58fbc51e64 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -319,6 +319,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
return ret;
}
 
+   /* set CRC riscv mode */
+   rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+   ret = test_crc_calc();
+   if (ret < 0) {
+   printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+   return ret;
+   }
+
return 0;
 }
 
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..404d8dd3ae 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
 cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
 sources += files('net_crc_neon.c')
 cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and
+cc.get_define('RTE_RISCV_FEATURE_ZBC', args: machine_args) != '')
+sources += files('net_crc_zbc.c')
+cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);
 
+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..be416ba52f
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,191 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+   uint64_t Pr;
+   uint64_t mu;
+   uint64_t k3;
+   uint64_t k4;
+   uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+   const uint64_t data,
+   uint32_t crc,
+   uint32_t bits,
+   const struct crc_clmul_ctx *params)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   temp = __riscv_clmul_64(temp, params->mu);
+   /* Multiply by P */
+   temp = __riscv_clmulh_64(temp, params->Pr);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   temp ^= crc >> bits;
+
+   return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+   const uint8_t *data,
+   uint32_t data_len,
+   uint32_t crc,
+   const struct crc_clmul_ctx *params)
+{
+   while (data_len >= 8) {
+   crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+   data += 8;
+   data_len -= 8;
+   }
+   if (data_len >= 4) {
+   crc = crc32_barrett_zbc(*(const uint32_t *)data, crc, 32, params);
+ 

[PATCH v2 4/9] config/riscv: add qemu crossbuild target

2024-07-12 Thread Daniel Gregory
A new cross-compilation target that has extensions that DPDK uses and
QEMU supports. Initially, this is just the Zbc extension for hardware
crc support.

Signed-off-by: Daniel Gregory 
---
 config/riscv/meson.build|  3 ++-
 config/riscv/riscv64_qemu_linux_gcc | 17 +
 .../linux_gsg/cross_build_dpdk_for_riscv.rst|  5 +
 3 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 5d8411b254..337b26bbac 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -43,7 +43,8 @@ vendor_generic = {
 ['RTE_MAX_NUMA_NODES', 2]
 ],
 'arch_config': {
-'generic': {'machine_args': ['-march=rv64gc']}
+'generic': {'machine_args': ['-march=rv64gc']},
+'qemu': {'machine_args': ['-march=rv64gc_zbc']},
 }
 }
 
diff --git a/config/riscv/riscv64_qemu_linux_gcc b/config/riscv/riscv64_qemu_linux_gcc
new file mode 100644
index 00..007cc98885
--- /dev/null
+++ b/config/riscv/riscv64_qemu_linux_gcc
@@ -0,0 +1,17 @@
+[binaries]
+c = ['ccache', 'riscv64-linux-gnu-gcc']
+cpp = ['ccache', 'riscv64-linux-gnu-g++']
+ar = 'riscv64-linux-gnu-ar'
+strip = 'riscv64-linux-gnu-strip'
+pcap-config = ''
+
+[host_machine]
+system = 'linux'
+cpu_family = 'riscv64'
+cpu = 'rv64gc_zbc'
+endian = 'little'
+
+[properties]
+vendor_id = 'generic'
+arch_id = 'qemu'
+pkg_config_libdir = '/usr/lib/riscv64-linux-gnu/pkgconfig'
diff --git a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
index 7d7f7ac72b..c3b67671a0 100644
--- a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
+++ b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
@@ -110,6 +110,11 @@ Currently the following targets are supported:
 
 * SiFive U740 SoC: ``config/riscv/riscv64_sifive_u740_linux_gcc``
 
+* QEMU: ``config/riscv/riscv64_qemu_linux_gcc``
+
+  * A target with all the extensions that QEMU supports that DPDK has a use for
+(currently ``rv64gc_zbc``). Requires QEMU version 7.0.0 or newer.
+
 To add a new target support, ``config/riscv/meson.build`` has to be modified by
 adding a new vendor/architecture id and a corresponding cross-file has to be
 added to ``config/riscv`` directory.
-- 
2.39.2



[PATCH v2 5/9] examples/l3fwd: use accelerated crc on riscv

2024-07-12 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 examples/l3fwd/l3fwd_em.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index d98e66ea2c..78cec7f5cc 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -29,7 +29,7 @@
 #include "l3fwd_event.h"
 #include "em_route_parse.c"
 
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #define EM_HASH_CRC 1
 #endif
 
-- 
2.39.2



[PATCH v2 6/9] ipfrag: use accelerated crc on riscv

2024-07-12 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/ip_frag/ip_frag_internal.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 7cbef647df..19a28c447b 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
p = (const uint32_t *)&key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(key->id, v);
 #else
 
v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE);
-#endif /* RTE_ARCH_X86 */
+#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_FEATURE_ZBC */
 
*v1 =  v;
*v2 = (v << 7) + (v >> 14);
@@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
p = (const uint32_t *) &key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(p[2], v);
-- 
2.39.2



[PATCH v2 7/9] examples/l3fwd-power: use accelerated crc on riscv

2024-07-12 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 examples/l3fwd-power/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index fba11da7ca..c67a3c4011 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -270,7 +270,7 @@ static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
 
 #if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
 
-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_RISCV_FEATURE_ZBC)
 #include 
 #define DEFAULT_HASH_FUNC   rte_hash_crc
 #else
-- 
2.39.2



[PATCH v2 8/9] hash/cuckoo: use accelerated crc on riscv

2024-07-12 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/hash/rte_cuckoo_hash.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index d87aa52b5b..8bdb1ff69d 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -409,6 +409,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 #elif defined(RTE_ARCH_ARM64)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_CRC32))
default_hash_func = (rte_hash_function)rte_hash_crc;
+#elif defined(RTE_ARCH_RISCV)
+   if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RISCV_EXT_ZBC))
+   default_hash_func = (rte_hash_function)rte_hash_crc;
 #endif
/* Setup hash context */
strlcpy(h->name, params->name, sizeof(h->name));
-- 
2.39.2



[PATCH v2 9/9] member: use accelerated crc on riscv

2024-07-12 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/member/rte_member.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/member/rte_member.h b/lib/member/rte_member.h
index aec192eba5..152659628a 100644
--- a/lib/member/rte_member.h
+++ b/lib/member/rte_member.h
@@ -92,7 +92,7 @@ typedef uint16_t member_set_t;
 #define RTE_MEMBER_SKETCH_COUNT_BYTE 0x02
 
 /** @internal Hash function used by membership library. */
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #include 
 #define MEMBER_HASH_FUNC   rte_hash_crc
 #else
-- 
2.39.2



[PATCH v3 0/9] riscv: implement accelerated crc using zbc

2024-08-27 Thread Daniel Gregory
The RISC-V Zbc extension adds instructions for carry-less multiplication
we can use to implement CRC in hardware. This patch set contains two new
implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

Whether these new implementations are enabled is controlled by new
build-time and run-time detection of the RISC-V extensions present in
the compiler and on the target system.

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

|Buffer| CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
|Size (MB) | Table| Hardware | Table| Hardware |
|--|--|--|--|--|
|1 |   155168 |11610 |73026 |18385 |
|2 |   311203 |22998 |   145586 |35886 |
|3 |   466744 |34370 |   218536 |53939 |
|4 |   621843 |45536 |   291574 |71944 |
|5 |   777908 |56989 |   364152 |89706 |
|6 |   932736 |68023 |   437016 |   107726 |
|7 |  1088756 |79236 |   510197 |   125426 |
|8 |  1243794 |90467 |   583231 |   143614 |

These results suggest a roughly thirteen-fold speed-up for lib/net and a
four-fold speed-up for lib/hash.

I have also run the hash_functions_autotest benchmark in dpdk_test,
which measures the performance of the lib/hash implementation on small
buffers, getting the following times:

| Key Length | Time (ticks/op) |
| (bytes)| Table| Hardware |
||--|--|
|  1 | 0.47 | 0.85 |
|  2 | 0.57 | 0.87 |
|  4 | 0.99 | 0.88 |
|  8 | 1.35 | 0.88 |
|  9 | 1.20 | 1.09 |
| 13 | 1.76 | 1.35 |
| 16 | 1.87 | 1.02 |
| 32 | 2.96 | 0.98 |
| 37 | 3.35 | 1.45 |
| 40 | 3.49 | 1.12 |
| 48 | 4.02 | 1.25 |
| 64 | 5.08 | 1.54 |

v3:
- rebase on 24.07
- replace crc with CRC in commits (check-git-log.sh)
v2:
- replace compile flag with build-time (riscv extension macros) and
  run-time detection (linux hwprobe syscall) (Stephen Hemminger)
- add qemu target that supports zbc (Stanislaw Kardach)
- fix spelling error in commit message
- fix a bug in the net/ implementation that would cause segfaults on
  small unaligned buffers
- refactor net/ implementation to move variable declarations to top of
  functions
- enable the optimisation in a couple of other places where optimised
  CRC is preferred to jhash
  - l3fwd-power
  - cuckoo-hash

Daniel Gregory (9):
  config/riscv: detect presence of Zbc extension
  hash: implement CRC using riscv carryless multiply
  net: implement CRC using riscv carryless multiply
  config/riscv: add qemu crossbuild target
  examples/l3fwd: use accelerated CRC on riscv
  ipfrag: use accelerated CRC on riscv
  examples/l3fwd-power: use accelerated CRC on riscv
  hash/cuckoo: use accelerated CRC on riscv
  member: use accelerated CRC on riscv

 MAINTAINERS   |   2 +
 app/test/test_crc.c   |   9 +
 app/test/test_hash.c  |   7 +
 config/riscv/meson.build  |  44 +++-
 config/riscv/riscv64_qemu_linux_gcc   |  17 ++
 .../linux_gsg/cross_build_dpdk_for_riscv.rst  |   5 +
 examples/l3fwd-power/main.c   |   2 +-
 examples/l3fwd/l3fwd_em.c |   2 +-
 lib/eal/riscv/include/rte_cpuflags.h  |   2 +
 lib/eal/riscv/rte_cpuflags.c  | 112 +++---
 lib/hash/meson.build  |   1 +
 lib/hash/rte_crc_riscv64.h|  89 
 lib/hash/rte_cuckoo_hash.c|   3 +
 lib/hash/rte_hash_crc.c   |  13 +-
 lib/hash/rte_hash_crc.h   |   6 +-
 lib/ip_frag/ip_frag_internal.c|   6 +-
 lib/member/rte_member.h   |   2 +-
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h |  11 +
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c   

[PATCH v3 1/9] config/riscv: detect presence of Zbc extension

2024-08-27 Thread Daniel Gregory
The RISC-V Zbc extension adds carry-less multiply instructions we can
use to implement more efficient CRC hashing algorithms.

The RISC-V C API defines architecture extension test macros
https://github.com/riscv-non-isa/riscv-c-api-doc/blob/main/riscv-c-api.md#architecture-extension-test-macros
These let us detect whether the Zbc extension is supported on the
compiler and -march we're building with. The C API also defines Zbc
intrinsics we can use rather than inline assembly on newer versions of
GCC (14.1.0+) and Clang (18.1.0+).

The Linux kernel exposes a RISC-V hardware probing syscall for getting
information about the system at run-time including which extensions are
available. We detect whether this interface is present by looking for
the <asm/hwprobe.h> header, as it's only present in newer kernels
(v6.4+). Furthermore, support for detecting certain extensions,
including Zbc, wasn't present until versions after this, so we need to
check the constants this header exports.

The kernel exposes bitmasks for each extension supported by the probing
interface, rather than the bit index that is set if that extension is
present, so modify the existing cpu flag HWCAP table entries to line up
with this. The values returned by the interface are 64 bits long, so
grow the hwcap registers array to be able to hold them.

If the Zbc extension and intrinsics are both present and we can detect
the Zbc extension at runtime, we define a flag, RTE_RISCV_FEATURE_ZBC.

Signed-off-by: Daniel Gregory 
---
 config/riscv/meson.build |  41 ++
 lib/eal/riscv/include/rte_cpuflags.h |   2 +
 lib/eal/riscv/rte_cpuflags.c | 112 +++
 3 files changed, 123 insertions(+), 32 deletions(-)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..5d8411b254 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -119,6 +119,47 @@ foreach flag: arch_config['machine_args']
 endif
 endforeach
 
+# check if we can do buildtime detection of extensions supported by the target
+riscv_extension_macros = false
+if (cc.get_define('__riscv_arch_test', args: machine_args) == '1')
+  message('Detected architecture extension test macros')
+  riscv_extension_macros = true
+else
+  warning('RISC-V architecture extension test macros not available. Build-time detection of extensions not possible')
+endif
+
+# check if we can use hwprobe interface for runtime extension detection
+riscv_hwprobe = false
+if (cc.check_header('asm/hwprobe.h', args: machine_args))
+  message('Detected hwprobe interface, enabling runtime detection of supported extensions')
+  machine_args += ['-DRTE_RISCV_FEATURE_HWPROBE']
+  riscv_hwprobe = true
+else
+  warning('Hwprobe interface not available (present in Linux v6.4+), instruction extensions won\'t be enabled')
+endif
+
+# detect extensions
+# RISC-V Carry-less multiplication extension (Zbc) for hardware implementations
+# of CRC-32C (lib/hash/rte_crc_riscv64.h) and CRC-32/16 (lib/net/net_crc_zbc.c).
+# Requires intrinsics available in GCC 14.1.0+ and Clang 18.1.0+
+if (riscv_extension_macros and riscv_hwprobe and
+(cc.get_define('__riscv_zbc', args: machine_args) != ''))
+  if ((cc.get_id() == 'gcc' and cc.version().version_compare('>=14.1.0'))
+  or (cc.get_id() == 'clang' and cc.version().version_compare('>=18.1.0')))
+# determine whether we can detect Zbc extension (this wasn't possible until
+# Linux kernel v6.8)
+if (cc.compiles('''#include <asm/hwprobe.h>
+   int a = RISCV_HWPROBE_EXT_ZBC;''', args: machine_args))
+  message('Compiling with the Zbc extension')
+  machine_args += ['-DRTE_RISCV_FEATURE_ZBC']
+else
+  warning('Detected Zbc extension but cannot use because runtime detection doesn\'t support it (support present in Linux kernel v6.8+)')
+endif
+  else
+warning('Detected Zbc extension but cannot use because intrinsics are not available (present in GCC 14.1.0+ and Clang 18.1.0+)')
+  endif
+endif
+
 # apply flags
 foreach flag: dpdk_flags
 if flag.length() > 0
diff --git a/lib/eal/riscv/include/rte_cpuflags.h b/lib/eal/riscv/include/rte_cpuflags.h
index d742efc40f..4e26b584b3 100644
--- a/lib/eal/riscv/include/rte_cpuflags.h
+++ b/lib/eal/riscv/include/rte_cpuflags.h
@@ -42,6 +42,8 @@ enum rte_cpu_flag_t {
RTE_CPUFLAG_RISCV_ISA_X, /* Non-standard extension present */
RTE_CPUFLAG_RISCV_ISA_Y, /* Reserved */
RTE_CPUFLAG_RISCV_ISA_Z, /* Reserved */
+
+   RTE_CPUFLAG_RISCV_EXT_ZBC, /* Carry-less multiplication */
 };
 
 #include "generic/rte_cpuflags.h"
diff --git a/lib/eal/riscv/rte_cpuflags.c b/lib/eal/riscv/rte_cpuflags.c
index eb4105c18b..dedf0395ab 100644
--- a/lib/eal/riscv/rte_cpuflags.c

[PATCH v3 2/9] hash: implement CRC using riscv carryless multiply

2024-08-27 Thread Daniel Gregory
Using carryless multiply instructions from RISC-V's Zbc extension,
implement a Barrett reduction that calculates CRC-32C checksums.

Based on the approach described by Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", which
is also described here
(https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Add a case to the autotest_hash unit test.

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS|  1 +
 app/test/test_hash.c   |  7 +++
 lib/hash/meson.build   |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++
 lib/hash/rte_hash_crc.c| 13 +-
 lib/hash/rte_hash_crc.h|  6 ++-
 6 files changed, 115 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c5a703b5c0..fa081552c7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -322,6 +322,7 @@ M: Stanislaw Kardach 
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 65b9cad93c..dd491ea4d9 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -231,6 +231,13 @@ test_crc32_hash_alg_equiv(void)
printf("Failed checking CRC32_SW against 
CRC32_ARM64\n");
break;
}
+
+   /* Check against 8-byte-operand RISCV64 CRC32 if available */
+   rte_hash_crc_set_alg(CRC32_RISCV64);
+   if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+   printf("Failed checking CRC32_SW against CRC32_RISCV64\n");
+   break;
+   }
}
 
/* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
 'rte_crc_arm64.h',
 'rte_crc_generic.h',
+'rte_crc_riscv64.h',
 'rte_crc_sw.h',
 'rte_crc_x86.h',
 'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41), mu, and our CRC32 algorithm.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   crc = __riscv_clmul_64(crc, mu);
+   /* Multiply by P */
+   crc = __riscv_clmulh_64(crc, p);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   crc ^= init_val >> bits;
+
+   return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software in case the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 8);
+
+   return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 16);
+
+   return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 32);
+
+   return crc32c_1word(data, init_val);

[PATCH v3 3/9] net: implement CRC using riscv carryless multiply

2024-08-27 Thread Daniel Gregory
Using carryless multiply instructions (clmul) from RISC-V's Zbc
extension, implement CRC-32 and CRC-16 calculations on buffers.

Based on the approach described in Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", we
perform repeated folds-by-1 whilst the buffer is still big enough, then
perform Barrett's reductions on the rest.

Add a case to the crc_autotest suite that tests this implementation.

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS   |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h |  11 +++
 lib/net/net_crc_zbc.c | 191 ++
 lib/net/rte_net_crc.c |  40 +
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 258 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index fa081552c7..eeaa2c645e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -323,6 +323,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
return ret;
}
 
+   /* set CRC riscv mode */
+   rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+   ret = test_crc_calc();
+   if (ret < 0) {
+   printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+   return ret;
+   }
+
return 0;
 }
 
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..404d8dd3ae 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
 cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
 sources += files('net_crc_neon.c')
 cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and
+cc.get_define('RTE_RISCV_FEATURE_ZBC', args: machine_args) != '')
+sources += files('net_crc_zbc.c')
+cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);
 
+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..be416ba52f
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,191 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+   uint64_t Pr;
+   uint64_t mu;
+   uint64_t k3;
+   uint64_t k4;
+   uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+   const uint64_t data,
+   uint32_t crc,
+   uint32_t bits,
+   const struct crc_clmul_ctx *params)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   temp = __riscv_clmul_64(temp, params->mu);
+   /* Multiply by P */
+   temp = __riscv_clmulh_64(temp, params->Pr);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   temp ^= crc >> bits;
+
+   return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+   const uint8_t *data,
+   uint32_t data_len,
+   uint32_t crc,
+   const struct crc_clmul_ctx *params)
+{
+   while (data_len >= 8) {
+   crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+   data += 8;
+   data_len -= 8;
+   }
+   if (data_len >= 4) {
+   crc = crc32_barrett_zbc(*(const uint32_t *)data, crc, 32, params);
+ 

[PATCH v3 4/9] config/riscv: add qemu crossbuild target

2024-08-27 Thread Daniel Gregory
A new cross-compilation target that has extensions that DPDK uses and
QEMU supports. Initially, this is just the Zbc extension for hardware
CRC support.

Signed-off-by: Daniel Gregory 
---
 config/riscv/meson.build|  3 ++-
 config/riscv/riscv64_qemu_linux_gcc | 17 +
 .../linux_gsg/cross_build_dpdk_for_riscv.rst|  5 +
 3 files changed, 24 insertions(+), 1 deletion(-)
 create mode 100644 config/riscv/riscv64_qemu_linux_gcc

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 5d8411b254..337b26bbac 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -43,7 +43,8 @@ vendor_generic = {
 ['RTE_MAX_NUMA_NODES', 2]
 ],
 'arch_config': {
-'generic': {'machine_args': ['-march=rv64gc']}
+'generic': {'machine_args': ['-march=rv64gc']},
+'qemu': {'machine_args': ['-march=rv64gc_zbc']},
 }
 }
 
diff --git a/config/riscv/riscv64_qemu_linux_gcc b/config/riscv/riscv64_qemu_linux_gcc
new file mode 100644
index 00..007cc98885
--- /dev/null
+++ b/config/riscv/riscv64_qemu_linux_gcc
@@ -0,0 +1,17 @@
+[binaries]
+c = ['ccache', 'riscv64-linux-gnu-gcc']
+cpp = ['ccache', 'riscv64-linux-gnu-g++']
+ar = 'riscv64-linux-gnu-ar'
+strip = 'riscv64-linux-gnu-strip'
+pcap-config = ''
+
+[host_machine]
+system = 'linux'
+cpu_family = 'riscv64'
+cpu = 'rv64gc_zbc'
+endian = 'little'
+
+[properties]
+vendor_id = 'generic'
+arch_id = 'qemu'
+pkg_config_libdir = '/usr/lib/riscv64-linux-gnu/pkgconfig'
diff --git a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
index 7d7f7ac72b..c3b67671a0 100644
--- a/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
+++ b/doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
@@ -110,6 +110,11 @@ Currently the following targets are supported:
 
 * SiFive U740 SoC: ``config/riscv/riscv64_sifive_u740_linux_gcc``
 
+* QEMU: ``config/riscv/riscv64_qemu_linux_gcc``
+
+  * A target with all the extensions that QEMU supports that DPDK has a use for
+(currently ``rv64gc_zbc``). Requires QEMU version 7.0.0 or newer.
+
 To add a new target support, ``config/riscv/meson.build`` has to be modified by
 adding a new vendor/architecture id and a corresponding cross-file has to be
 added to ``config/riscv`` directory.
-- 
2.39.2



[PATCH v3 5/9] examples/l3fwd: use accelerated CRC on riscv

2024-08-27 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 examples/l3fwd/l3fwd_em.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index 31a7e05e39..36520401e5 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -29,7 +29,7 @@
 #include "l3fwd_event.h"
 #include "em_route_parse.c"
 
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #define EM_HASH_CRC 1
 #endif
 
-- 
2.39.2



[PATCH v3 6/9] ipfrag: use accelerated CRC on riscv

2024-08-27 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/ip_frag/ip_frag_internal.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 7cbef647df..19a28c447b 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
p = (const uint32_t *)&key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(key->id, v);
 #else
 
v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE);
-#endif /* RTE_ARCH_X86 */
+#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_FEATURE_ZBC */
 
*v1 =  v;
*v2 = (v << 7) + (v >> 14);
@@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, uint32_t *v2)
 
p = (const uint32_t *) &key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_FEATURE_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(p[2], v);
-- 
2.39.2



[PATCH v3 8/9] hash/cuckoo: use accelerated CRC on riscv

2024-08-27 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/hash/rte_cuckoo_hash.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index 577b5839d3..872f88fdce 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -427,6 +427,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 #elif defined(RTE_ARCH_ARM64)
if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_CRC32))
default_hash_func = (rte_hash_function)rte_hash_crc;
+#elif defined(RTE_ARCH_RISCV)
+   if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_RISCV_EXT_ZBC))
+   default_hash_func = (rte_hash_function)rte_hash_crc;
 #endif
/* Setup hash context */
strlcpy(h->name, params->name, sizeof(h->name));
-- 
2.39.2



[PATCH v3 7/9] examples/l3fwd-power: use accelerated CRC on riscv

2024-08-27 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 examples/l3fwd-power/main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-power/main.c
index 2bb6b092c3..c631c14193 100644
--- a/examples/l3fwd-power/main.c
+++ b/examples/l3fwd-power/main.c
@@ -270,7 +270,7 @@ static struct rte_mempool * pktmbuf_pool[NB_SOCKETS];
 
 #if (APP_LOOKUP_METHOD == APP_LOOKUP_EXACT_MATCH)
 
-#ifdef RTE_ARCH_X86
+#if defined(RTE_ARCH_X86) || defined(RTE_RISCV_FEATURE_ZBC)
 #include 
 #define DEFAULT_HASH_FUNC   rte_hash_crc
 #else
-- 
2.39.2



[PATCH v3 9/9] member: use accelerated CRC on riscv

2024-08-27 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/member/rte_member.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/member/rte_member.h b/lib/member/rte_member.h
index aec192eba5..152659628a 100644
--- a/lib/member/rte_member.h
+++ b/lib/member/rte_member.h
@@ -92,7 +92,7 @@ typedef uint16_t member_set_t;
 #define RTE_MEMBER_SKETCH_COUNT_BYTE 0x02
 
 /** @internal Hash function used by membership library. */
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_FEATURE_ZBC)
 #include 
 #define MEMBER_HASH_FUNC   rte_hash_crc
 #else
-- 
2.39.2



[PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-02 Thread Daniel Gregory
The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check
memorder, which is not constant. This causes compile errors when it is
enabled with RTE_ARM_USE_WFE. eg.

../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/include/rte_common.h:530:56: error: expression in static assertion is not constant
  530 | #define RTE_BUILD_BUG_ON(condition) do { static_assert(!(condition), #condition); } while (0)
      |                                                        ^~~~
../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro ‘RTE_BUILD_BUG_ON’
  156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
  | ^~~~

This has been the case since the switch to C11 assert (537caad2). Fix
the compile errors by replacing the check with an RTE_ASSERT.

Signed-off-by: Daniel Gregory 
---
 lib/eal/arm/include/rte_pause_64.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..98e10e91c4 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -11,6 +11,7 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+#include <rte_debug.h>
 
 #ifdef RTE_ARM_USE_WFE
 #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
@@ -153,7 +154,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
 {
uint16_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_ASSERT(memorder == rte_memory_order_acquire ||
+   memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_16(addr, value, memorder)
@@ -172,7 +173,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
 {
uint32_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_ASSERT(memorder == rte_memory_order_acquire ||
+   memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_32(addr, value, memorder)
@@ -191,7 +192,7 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
 {
uint64_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_ASSERT(memorder == rte_memory_order_acquire ||
+   memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_64(addr, value, memorder)
-- 
2.39.2



[RFC PATCH] eal/riscv: add support for zawrs extension

2024-05-02 Thread Daniel Gregory
The zawrs extension adds a pair of instructions that stall a core until
a memory location is written to. This patch uses one of them to
implement RISCV-specific versions of the rte_wait_until_equal_*
functions. This is potentially more energy efficient than the default
implementation that uses rte_pause/Zihintpause.

The technique works as follows:

* Create a reservation set containing the address we want to wait on
  using an atomic load (lr.dw)
* Call wrs.nto - this blocks until the reservation set is invalidated by
  someone else writing to that address
* Execution can also resume arbitrarily, so we still need to check
  whether a change occurred and loop if not (see the sketch below)
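
Put together, the wait loop looks roughly like this sketch (using the
__RTE_RISCV_LR_32 macro from the patch; the re-check is needed because
wrs.nto may also resume on interrupts or an implementation-defined
timeout):

    static inline void
    rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
            rte_memory_order memorder)
    {
        uint32_t value;

        /* lr.w creates the reservation set covering addr */
        __RTE_RISCV_LR_32(addr, value, memorder);
        while (value != expected) {
            /* stall until the reservation set is invalidated */
            asm volatile("wrs.nto" : : : "memory");
            __RTE_RISCV_LR_32(addr, value, memorder);
        }
    }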

Due to RISC-V atomics only supporting naturally aligned word (32 bit)
and double word (64 bit) loads, I've used pointer rounding and bit
shifting to implement waiting on 16-bit values.
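
A sketch of that halfword trick (assuming little-endian, as on RISC-V
Linux; the shift selects the half within the containing aligned word):

    static inline uint16_t
    load_16_via_32(volatile uint16_t *addr, int memorder)
    {
        /* round down to the containing naturally aligned 32-bit word */
        volatile uint32_t *word =
            (volatile uint32_t *)((uintptr_t)addr & ~(uintptr_t)3);
        unsigned int shift = ((uintptr_t)addr & 2) * 8;
        uint32_t value;

        __RTE_RISCV_LR_32(word, value, memorder);
        return (uint16_t)(value >> shift);
    }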

This new functionality is controlled by a Meson flag that is disabled by
default.

Signed-off-by: Daniel Gregory 
Suggested-by: Punit Agrawal 
---

Posting as an RFC to get early feedback and enable testing by others
with Zawrs-enabled hardware. Whilst I have been able to test it compiles
& passes tests using QEMU, I am waiting on some Zawrs-enabled hardware
to become available before I carry out performance tests.

Nonetheless, I would be glad to hear any feedback on the general
approach. Thanks, Daniel

 config/riscv/meson.build  |   5 ++
 lib/eal/riscv/include/rte_pause.h | 139 ++
 2 files changed, 144 insertions(+)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..4cfdc42ecb 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -26,6 +26,11 @@ flags_common = [
 # read from /proc/device-tree/cpus/timebase-frequency. This property is
 # guaranteed on Linux, as riscv time_init() requires it.
 ['RTE_RISCV_TIME_FREQ', 0],
+
+# Enable use of RISC-V Wait-on-Reservation-Set extension (Zawrs)
+# Mitigates looping when polling on memory locations
+# Make sure to add '_zawrs' to your target's -march below
+['RTE_RISCV_ZAWRS', false]
 ]
 
 ## SoC-specific options.
diff --git a/lib/eal/riscv/include/rte_pause.h b/lib/eal/riscv/include/rte_pause.h
index cb8e9ca52d..e7b43dffa3 100644
--- a/lib/eal/riscv/include/rte_pause.h
+++ b/lib/eal/riscv/include/rte_pause.h
@@ -11,6 +11,12 @@
 extern "C" {
 #endif
 
+#ifdef RTE_RISCV_ZAWRS
+#define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
+#endif
+
+#include 
+
 #include "rte_atomic.h"
 
 #include "generic/rte_pause.h"
@@ -24,6 +30,139 @@ static inline void rte_pause(void)
asm volatile(".int 0x010F" : : : "memory");
 }
 
+#ifdef RTE_RISCV_ZAWRS
+
+/*
+ * Atomic load from an address, it returns either a sign-extended word or
+ * doubleword and creates a 'reservation set' containing the read memory
+ * location. When someone else writes to the reservation set, it is
+ * invalidated, causing any stalled WRS instructions to resume.
+ *
+ * Address needs to be naturally aligned.
+ */
+#define __RTE_RISCV_LR_32(src, dst, memorder) do {\
+   if ((memorder) == rte_memory_order_relaxed) { \
+   asm volatile("lr.w %0, (%1)"  \
+   : "=r" (dst)  \
+   : "r" (src)   \
+   : "memory");  \
+   } else {  \
+   asm volatile("lr.w.aq %0, (%1)"   \
+   : "=r" (dst)  \
+   : "r" (src)   \
+   : "memory");  \
+   } } while (0)
+#define __RTE_RISCV_LR_64(src, dst, memorder) do {\
+   if ((memorder) == rte_memory_order_relaxed) { \
+   asm volatile("lr.d %0, (%1)"  \
+   : "=r" (dst)  \
+   : "r" (src)   \
+   : "memory");  \
+   } else {  \
+   asm volatile("lr.d.aq %0, (%1)"   \
+   : "=r" (dst)  \
+   : "r" (src)   \
+   : "memory");  \
+   } } while (0)
+
+/*
+ * There's no RISC-V atomic load primitive for halfwords, so cast up to a
+ * _naturally aligned_ word and extract the halfword.
+ */

Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-02 Thread Daniel Gregory
On Thu, May 02, 2024 at 09:20:45AM -0700, Stephen Hemminger wrote:
> Why not:
> diff --git a/lib/eal/arm/include/rte_pause_64.h 
> b/lib/eal/arm/include/rte_pause_64.h
> index 5cb8b59056..81987de771 100644
> --- a/lib/eal/arm/include/rte_pause_64.h
> +++ b/lib/eal/arm/include/rte_pause_64.h
> @@ -172,6 +172,8 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t 
> expected,
>  {
> uint32_t value;
>  
> +   static_assert(__builtin_constant_p(memorder), "memory order is not a 
> constant");
> +
> RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
> memorder != rte_memory_order_relaxed);
>  
> @@ -191,6 +193,8 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t 
> expected,
>  {
> uint64_t value;
>  
> +   static_assert(__builtin_constant_p(memorder), "memory order is not a 
> constant");
> +
> RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
> memorder != rte_memory_order_relaxed);
>  

What toolchain are you using? With your change I still get errors about
the expression not being constant:

In file included from ../lib/eal/arm/include/rte_pause.h:13,
 from ../lib/eal/include/generic/rte_spinlock.h:25,
 from ../lib/eal/arm/include/rte_spinlock.h:17,
 from ../lib/telemetry/telemetry.c:20:
../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/arm/include/rte_pause_64.h:156:23: error: expression in static 
assertion is not constant
156 |   static_assert(__builtin_constant_p(memorder), "memory order is 
not a constant");
| ^~

I'm cross-compiling with GCC v12.2 using the
config/arm/arm64_armv8_linux_gcc cross-file, and enabling
RTE_ARM_USE_WFE by uncommenting it in config/arm/meson.build and setting
its value to true.


Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-03 Thread Daniel Gregory
On Thu, May 02, 2024 at 02:48:26PM -0700, Stephen Hemminger wrote:
> There are already constant checks like this elsewhere in the file.

Yes, but they're in macros rather than inline functions, so my
understanding is that macro expansion substitutes the constant memorder
directly into the _Static_assert call at compile time, whereas in an
inline definition it remains a function parameter.

This is also the same approach used by the generic implementation
(lib/eal/include/generic/rte_pause.h), the inline functions use assert
and the macros use RTE_BUILD_BUG_ON.

To give a minimal example, the following inline function doesn't
compile (Godbolt demo here https://godbolt.org/z/aPqTf3v4o ):

static inline __attribute__((always_inline)) void
add(int *dst, int val)
{
_Static_assert(val != 0, "adding zero does nothing");
*dst += val;
}

But as a macro it does ( https://godbolt.org/z/x4a8fTf8h ):

#define add(dst, val) do {\
_Static_assert(val != 0, "adding zero does nothing"); \
*(dst) += (val);  \
} while(0);

I don't believe this is a compiler bug as both GCC and Clang produce the
same error message.


Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-03 Thread Daniel Gregory
On Fri, May 03, 2024 at 03:32:20PM +0200, David Marchand wrote:
> - RTE_BUILD_BUG_ON() should not be used indeed.
> IIRC, this issue was introduced with 875f350924b8 ("eal: add a new
> helper for wait until scheme").
> Please add a corresponding Fixes: tag in next revision.

Will do. Should I CC sta...@dpdk.org too?
 
> - This ARM specific implementation should take a rte_memory_order type
> instead of a int type for the memorder input variable.
> This was missed in 1ec6a845b5cb ("eal: use stdatomic API in public headers").
> 
> Could you send a fix for this small issue too?

Yes, sure thing.

Thanks, Daniel


[PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-03 Thread Daniel Gregory
The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check
memorder, which is not constant. This causes compile errors when it is
enabled with RTE_ARM_USE_WFE. eg.

../lib/eal/arm/include/rte_pause_64.h: In function ‘rte_wait_until_equal_16’:
../lib/eal/include/rte_common.h:530:56: error: expression in static assertion 
is not constant
  530 | #define RTE_BUILD_BUG_ON(condition) do { static_assert(!(condition), 
#condition); } while (0)
  |^~~~
../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro 
‘RTE_BUILD_BUG_ON’
  156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
  | ^~~~

Fix the compile errors by replacing the check with an assert, like in
the generic implementation (lib/eal/include/generic/rte_pause.h).

Fixes: 875f350924b8 ("eal: add a new helper for wait until scheme")

Signed-off-by: Daniel Gregory 
---
Cc: feifei.wa...@arm.com
---
 lib/eal/arm/include/rte_pause_64.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..852660091a 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -10,6 +10,8 @@
 extern "C" {
 #endif
 
+#include <assert.h>
+
 #include <rte_common.h>
 
 #ifdef RTE_ARM_USE_WFE
@@ -153,7 +155,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
 {
uint16_t value;
 
-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_16(addr, value, memorder)
@@ -172,7 +174,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t 
expected,
 {
uint32_t value;
 
-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_32(addr, value, memorder)
@@ -191,7 +193,7 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t 
expected,
 {
uint64_t value;
 
-	RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-		memorder != rte_memory_order_relaxed);
+	assert(memorder == rte_memory_order_acquire ||
+		memorder == rte_memory_order_relaxed);
 
__RTE_ARM_LOAD_EXC_64(addr, value, memorder)
-- 
2.39.2



Re: [PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-03 Thread Daniel Gregory
Apologies, mis-sent this before attaching a changelog:

v2:
* replaced RTE_ASSERT with assert
* added Fixes: tag


[PATCH] eal/arm: use stdatomic api in rte_pause

2024-05-03 Thread Daniel Gregory
Missed during commit 1ec6a845b5cb
("eal: use stdatomic API in public headers")

Signed-off-by: Daniel Gregory 
---
 lib/eal/arm/include/rte_pause_64.h | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..9e2dbf3531 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -11,6 +11,7 @@ extern "C" {
 #endif
 
 #include <rte_common.h>
+#include <rte_stdatomic.h>
 
 #ifdef RTE_ARM_USE_WFE
 #define RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
@@ -149,7 +150,7 @@ static inline void rte_pause(void)
 
 static __rte_always_inline void
 rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-   int memorder)
+   rte_memory_order memorder)
 {
uint16_t value;
 
@@ -168,7 +169,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
 
 static __rte_always_inline void
 rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
-   int memorder)
+   rte_memory_order memorder)
 {
uint32_t value;
 
@@ -187,7 +188,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t 
expected,
 
 static __rte_always_inline void
 rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
-   int memorder)
+   rte_memory_order memorder)
 {
uint64_t value;
 
-- 
2.39.2



Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-09 Thread Daniel Gregory
On Fri, May 03, 2024 at 05:56:24PM -0700, Stephen Hemminger wrote:
> On Fri, 3 May 2024 10:46:05 +0100
> Daniel Gregory  wrote:
> 
> > On Thu, May 02, 2024 at 02:48:26PM -0700, Stephen Hemminger wrote:
> > > There are already constant checks like this elsewhere in the file.  
> > 
> > Yes, but they're in macros, rather than inlined functions, so my
> > understanding was that at compile time, macro expansion has put the
> > memorder constant in the _Static_assert call as opposed to still being
> > a function parameter in the inline definition.
> 
> Gcc and clang are smart enough that it is possible to use the internal
> __builtin_constant_p() in the function. Some examples in DPDK:
> 
> static __rte_always_inline int
> rte_mempool_do_generic_get(struct rte_mempool *mp, void **obj_table,
>  unsigned int n, struct rte_mempool_cache *cache)
> {
>   int ret;
>   unsigned int remaining;
>   uint32_t index, len;
>   void **cache_objs;
> 
>   /* No cache provided */
>   if (unlikely(cache == NULL)) {
>   remaining = n;
>   goto driver_dequeue;
>   }
> 
>   /* The cache is a stack, so copy will be in reverse order. */
>   cache_objs = &cache->objs[cache->len];
> 
>   if (__extension__(__builtin_constant_p(n)) && n <= cache->len) {
> 
> It should be possible to use RTE_BUILD_BUG_ON() or static_assert here.

Yes, it's possible to use RTE_BUILD_BUG_ON(!__builtin_constant_p(n)) on
Clang, I am simply not seeing it succeed. In fact, the opposite check,
that the memorder is not constant, builds just fine with Clang 16.


diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index 5cb8b59056..d0646320e6 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -153,8 +153,7 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
 {
uint16_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
 
__RTE_ARM_LOAD_EXC_16(addr, value, memorder)
if (value != expected) {
@@ -172,8 +171,7 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t 
expected,
 {
uint32_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
 
__RTE_ARM_LOAD_EXC_32(addr, value, memorder)
if (value != expected) {
@@ -191,8 +189,7 @@ rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t 
expected,
 {
uint64_t value;
 
-   RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
-   memorder != rte_memory_order_relaxed);
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
 
__RTE_ARM_LOAD_EXC_64(addr, value, memorder)
if (value != expected) {
diff --git a/lib/eal/include/generic/rte_pause.h 
b/lib/eal/include/generic/rte_pause.h
index f2a1eadcbd..3973488865 100644
--- a/lib/eal/include/generic/rte_pause.h
+++ b/lib/eal/include/generic/rte_pause.h
@@ -80,6 +80,7 @@ static __rte_always_inline void
 rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
rte_memory_order memorder)
 {
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
assert(memorder == rte_memory_order_acquire || memorder == 
rte_memory_order_relaxed);
 
while (rte_atomic_load_explicit((volatile __rte_atomic uint16_t *)addr, 
memorder)
@@ -91,6 +92,7 @@ static __rte_always_inline void
 rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t expected,
rte_memory_order memorder)
 {
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
assert(memorder == rte_memory_order_acquire || memorder == 
rte_memory_order_relaxed);
 
while (rte_atomic_load_explicit((volatile __rte_atomic uint32_t *)addr, 
memorder)
@@ -102,6 +104,7 @@ static __rte_always_inline void
 rte_wait_until_equal_64(volatile uint64_t *addr, uint64_t expected,
rte_memory_order memorder)
 {
+   RTE_BUILD_BUG_ON(__builtin_constant_p(memorder));
assert(memorder == rte_memory_order_acquire || memorder == 
rte_memory_order_relaxed);
 
while (rte_atomic_load_explicit((volatile __rte_atomic uint64_t *)addr, 
memorder)


This seemed odd, and it doesn't line up with what the GCC documentation
says about __builtin_constant_p:

> [__builtin_constant_p] does not return 1 when you pass a constant
> numeric value to the inline function unless you specify the -O option.
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html

So I did some more looking, and the behaviour I've seen is that both
Clang and GCC only treat an inline function's parameter as constant for
__builtin_constant_p once optimisation is enabled, matching the
documentation quoted above.
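
A small standalone program (my own illustration, not from the thread)
reproduces this:

#include <stdio.h>

static inline int
is_const(int x)
{
	/* Whether the compiler can prove 'x' is a constant here */
	return __builtin_constant_p(x);
}

int main(void)
{
	/* Typically prints 0 at -O0 and 1 at -O2 with GCC, matching the
	 * documentation quoted above
	 */
	printf("%d\n", is_const(42));
	return 0;
}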

Re: [PATCH] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-05-09 Thread Daniel Gregory
On Fri, May 03, 2024 at 06:02:36PM -0700, Stephen Hemminger wrote:
> On Thu,  2 May 2024 15:21:16 +0100
> Daniel Gregory  wrote:
> 
> > The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check
> > memorder, which is not constant. This causes compile errors when it is
> > enabled with RTE_ARM_USE_WFE. eg.
> > 
> > ../lib/eal/arm/include/rte_pause_64.h: In function 
> > ‘rte_wait_until_equal_16’:
> > ../lib/eal/include/rte_common.h:530:56: error: expression in static 
> > assertion is not constant
> >   530 | #define RTE_BUILD_BUG_ON(condition) do { 
> > static_assert(!(condition), #condition); } while (0)
> >   |^~~~
> > ../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro 
> > ‘RTE_BUILD_BUG_ON’
> >   156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
> >   | ^~~~
> > 
> > This has been the case since the switch to C11 assert (537caad2). Fix
> > the compile errors by replacing the check with an RTE_ASSERT.
> > 
> > Signed-off-by: Daniel Gregory 
> 
> The only calls to rte_wait_until_equal_16 in upstream code
> are in the test_bbdev_perf.c and test_timer.c.  Looks like
> these test never got fixed to use rte_memory_order instead of __ATOMIC_ 
> defines.

Apologies, the commit message could make it clearer, but this is also an
issue for rte_wait_until_equal_32 and rte_wait_until_equal_64.

rte_wait_until_equal_32 is used in a dozen or so lock tests with the old
__ATOMIC_ defines, as well as rte_ring_generic_pvt.h and
rte_ring_c11_pvt.h, where it's used with the new rte_memory_order
values. Correct me if I'm wrong, but shouldn't the static assertions in
rte_stdatomic.h ensure that mixed usage doesn't cause any issues, even
if using the older __ATOMIC_ defines isn't ideal?
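
(To illustrate what I mean, here is a hypothetical excerpt capturing the
mechanism in rte_stdatomic.h, not the exact header text:)

typedef int rte_memory_order;

#define rte_memory_order_relaxed __ATOMIC_RELAXED
#define rte_memory_order_acquire __ATOMIC_ACQUIRE

/* The real header statically checks this equivalence, which is what
 * keeps passing a legacy __ATOMIC_* constant well-defined
 */
_Static_assert(rte_memory_order_acquire == __ATOMIC_ACQUIRE,
	"rte_memory_order values must match __ATOMIC_*");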
 
> And there should be a CI test for ARM that enables the WFE code at least
> to ensure it works!

Yes, that could've caught this sooner.


Re: [RFC PATCH] eal/riscv: add support for zawrs extension

2024-05-20 Thread Daniel Gregory
On Sun, May 12, 2024 at 09:10:49AM +0200, Stanisław Kardach wrote:
> On Thu, May 2, 2024 at 4:44 PM Daniel Gregory
>  wrote:
> > diff --git a/config/riscv/meson.build b/config/riscv/meson.build
> > index 07d7d9da23..4cfdc42ecb 100644
> > --- a/config/riscv/meson.build
> > +++ b/config/riscv/meson.build
> > @@ -26,6 +26,11 @@ flags_common = [
> >  # read from /proc/device-tree/cpus/timebase-frequency. This property is
> >  # guaranteed on Linux, as riscv time_init() requires it.
> >  ['RTE_RISCV_TIME_FREQ', 0],
> > +
> > +# Enable use of RISC-V Wait-on-Reservation-Set extension (Zawrs)
> > +# Mitigates looping when polling on memory locations
> > +# Make sure to add '_zawrs' to your target's -march below
> > +['RTE_RISCV_ZAWRS', false]
> A bit orthogonal to this patch (or maybe not?)
> Should we perhaps add a Qemu target in meson.build which would have
> the modified -march for what qemu supports now?

Yes, I can see that being worth doing as part of this patch. In addition
to Zawrs, GCC 13+ should generate prefetch instructions for
__builtin_prefetch() (lib/eal/include/generic/rte_prefetch.h) if the
Zicbop extension is enabled. Are there any others you think would
particularly benefit, or would it be best to add every extension GCC 14
supports?

> Or perhaps add machine detection logic based either on the "riscv,isa"
> cpu@0 property in the DT or RHCT ACPI table?

I have had a look and, at least on QEMU 9, this seems non-trivial. The
RHCT ACPI table (as reflected in /proc/cpuinfo) doesn't list every
extension present (eg. it's missing Zawrs), and the DT, whilst complete,
can't be fed directly into GCC because QEMU reports several newer and
non-ratified extensions that GCC doesn't support yet.

> Or add perhaps some other way we could specify the extension list
> suffix for -march?

Setting -Dcpu_instruction_set to some arbitrary ISA could work with some
minor changes to the build script so that it isn't discarded in favour
of rv64gc. Then we could add a map from ISA extensions to flags that are
enabled when that extension is present in cpu_instruction_set?

Thanks for your review,
Daniel


[PATCH 0/2] eal/riscv: implement prefetch using zicbop

2024-05-30 Thread Daniel Gregory
Instructions from RISC-V's Zicbop extension can be used to implement the
rte_prefetch* family of functions. On modern versions of GCC (13.1.0+)
and Clang (17.0.1+), these are emitted by __builtin_prefetch() when the
extension is present.

In order to support older compiler versions, this patchset manually
emits these instructions using inline assembly. To do this, I have added
a new flag, RTE_PREFETCH_WRITE_ARCH_DEFINED, that
(similarly to RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED) hides the generic
implementation of rte_prefetch*_write.

I am still in the process of acquiring hardware that supports this
extension, so I haven't tested how this affects performance yet.

Daniel Gregory (2):
  eal: add flag to hide generic prefetch_write
  eal/riscv: add support for zicbop extension

 config/riscv/meson.build   |  6 +++
 lib/eal/include/generic/rte_prefetch.h | 47 +
 lib/eal/riscv/include/rte_prefetch.h   | 57 --
 3 files changed, 90 insertions(+), 20 deletions(-)

-- 
2.39.2



[PATCH 1/2] eal: add flag to hide generic prefetch_write

2024-05-30 Thread Daniel Gregory
This allows for the definition of architecture-specific implementations
of the rte_prefetch*_write collection of functions by defining
RTE_PREFETCH_WRITE_ARCH_DEFINED.

Signed-off-by: Daniel Gregory 
---
 lib/eal/include/generic/rte_prefetch.h | 47 +-
 1 file changed, 31 insertions(+), 16 deletions(-)

diff --git a/lib/eal/include/generic/rte_prefetch.h 
b/lib/eal/include/generic/rte_prefetch.h
index f9fab5e359..5558376cba 100644
--- a/lib/eal/include/generic/rte_prefetch.h
+++ b/lib/eal/include/generic/rte_prefetch.h
@@ -65,14 +65,7 @@ static inline void rte_prefetch_non_temporal(const volatile 
void *p);
  */
 __rte_experimental
 static inline void
-rte_prefetch0_write(const void *p)
-{
-   /* 1 indicates intention to write, 3 sets target cache level to L1. See
-* GCC docs where these integer constants are described in more detail:
-*  https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
-*/
-   __builtin_prefetch(p, 1, 3);
-}
+rte_prefetch0_write(const void *p);
 
 /**
  * @warning
@@ -86,14 +79,7 @@ rte_prefetch0_write(const void *p)
  */
 __rte_experimental
 static inline void
-rte_prefetch1_write(const void *p)
-{
-   /* 1 indicates intention to write, 2 sets target cache level to L2. See
-* GCC docs where these integer constants are described in more detail:
-*  https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
-*/
-   __builtin_prefetch(p, 1, 2);
-}
+rte_prefetch1_write(const void *p);
 
 /**
  * @warning
@@ -105,6 +91,33 @@ rte_prefetch1_write(const void *p)
  *
  * @param p Address to prefetch
  */
+__rte_experimental
+static inline void
+rte_prefetch2_write(const void *p);
+
+#ifndef RTE_PREFETCH_WRITE_ARCH_DEFINED
+__rte_experimental
+static inline void
+rte_prefetch0_write(const void *p)
+{
+   /* 1 indicates intention to write, 3 sets target cache level to L1. See
+* GCC docs where these integer constants are described in more detail:
+*  https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
+*/
+   __builtin_prefetch(p, 1, 3);
+}
+
+__rte_experimental
+static inline void
+rte_prefetch1_write(const void *p)
+{
+   /* 1 indicates intention to write, 2 sets target cache level to L2. See
+* GCC docs where these integer constants are described in more detail:
+*  https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html
+*/
+   __builtin_prefetch(p, 1, 2);
+}
+
 __rte_experimental
 static inline void
 rte_prefetch2_write(const void *p)
@@ -116,6 +129,8 @@ rte_prefetch2_write(const void *p)
__builtin_prefetch(p, 1, 1);
 }
 
+#endif /* RTE_PREFETCH_WRITE_ARCH_DEFINED */
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change, or be removed, without prior notice
-- 
2.39.2



[PATCH 2/2] eal/riscv: add support for zicbop extension

2024-05-30 Thread Daniel Gregory
The zicbop extension adds instructions for prefetching data into cache.
Use them to implement RISCV-specific versions of the rte_prefetch* and
rte_prefetch*_write functions.

- prefetch.r indicates to hardware that the cache block will be accessed
  by a data read soon
- prefetch.w indicates to hardware that the cache block will be accessed
  by a data write soon

These instructions are emitted by __builtin_prefetch on modern versions
of Clang (17.0.1+) and GCC (13.1.0+). Older toolchains may not be able
to assemble Zicbop instructions, so for them we emit the raw words that
encode 'prefetch.r 0(a0)' and 'prefetch.w 0(a0)'.

This new functionality is controlled by a Meson flag that is disabled by
default. Like rte_pause(), these prefetches are only hints and have no
effect if the target doesn't support the extension, but the manual
encoding requires the prefetched address to be loaded into a0, which may
be costly.
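
For reference, the two magic words decode as I-type OP-IMM instructions.
A sketch of the encoding (the helper macro name is mine, not part of the
patch):

#include <stdint.h>

/*
 * imm[4:0] selects the operation (0 = prefetch.i, 1 = prefetch.r,
 * 3 = prefetch.w), funct3 is 0b110 (the ORI encoding space), rd is x0.
 */
#define RISCV_ZICBOP_ENCODE(type, rs1) \
	(((uint32_t)(type) << 20) | ((uint32_t)(rs1) << 15) | \
	 (0x6u << 12) | 0x13u)

/* RISCV_ZICBOP_ENCODE(1, 10) == 0x00156013 -> prefetch.r 0(a0) */
/* RISCV_ZICBOP_ENCODE(3, 10) == 0x00356013 -> prefetch.w 0(a0) */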

Signed-off-by: Daniel Gregory 
Suggested-by: Punit Agrawal 
---
 config/riscv/meson.build |  6 +++
 lib/eal/riscv/include/rte_prefetch.h | 57 ++--
 2 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..ecf9da1c39 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -26,6 +26,12 @@ flags_common = [
 # read from /proc/device-tree/cpus/timebase-frequency. This property is
 # guaranteed on Linux, as riscv time_init() requires it.
 ['RTE_RISCV_TIME_FREQ', 0],
+
+# When true override the default implementation of the prefetching functions
+# (rte_prefetch*) with a version that explicitly uses the Zicbop extension.
+# Do not enable when using modern versions of GCC (13.1.0+) or Clang
+# (17.0.1+); they will emit these instructions in the default implementation.
+['RTE_RISCV_ZICBOP', false],
 ]
 
 ## SoC-specific options.
diff --git a/lib/eal/riscv/include/rte_prefetch.h 
b/lib/eal/riscv/include/rte_prefetch.h
index 748cf1b626..82cad526b3 100644
--- a/lib/eal/riscv/include/rte_prefetch.h
+++ b/lib/eal/riscv/include/rte_prefetch.h
@@ -14,21 +14,42 @@ extern "C" {
 
 #include <rte_compat.h>
 #include <rte_common.h>
+
+#ifdef RTE_RISCV_ZICBOP
+#define RTE_PREFETCH_WRITE_ARCH_DEFINED
+#endif
+
 #include "generic/rte_prefetch.h"
 
+/*
+ * Modern versions of GCC & Clang will emit prefetch instructions for
+ * __builtin_prefetch when the Zicbop extension is present.
+ * The RTE_RISCV_ZICBOP option controls whether we emit them manually for older
+ * compilers that may not have the support to assemble them.
+ */
 static inline void rte_prefetch0(const volatile void *p)
 {
-   RTE_SET_USED(p);
+#ifndef RTE_RISCV_ZICBOP
+   /* by default __builtin_prefetch prepares for a read */
+   __builtin_prefetch((const void *)p);
+#else
+   /* prefetch.r 0(a0) */
+   register const volatile void *a0 asm("a0") = p;
+   asm volatile (".int 0x00156013" : : "r" (a0));
+#endif
 }
 
+/*
+ * The RISC-V Zicbop extension doesn't have instructions to prefetch to only a
+ * subset of cache levels, so fall back to rte_prefetch0
+ */
 static inline void rte_prefetch1(const volatile void *p)
 {
-   RTE_SET_USED(p);
+   rte_prefetch0(p);
 }
-
 static inline void rte_prefetch2(const volatile void *p)
 {
-   RTE_SET_USED(p);
+   rte_prefetch0(p);
 }
 
 static inline void rte_prefetch_non_temporal(const volatile void *p)
@@ -44,6 +65,34 @@ rte_cldemote(const volatile void *p)
RTE_SET_USED(p);
 }
 
+#ifdef RTE_RISCV_ZICBOP
+__rte_experimental
+static inline void
+rte_prefetch0_write(const void *p)
+{
+   /* prefetch.w 0(a0) */
+   register const void *a0 asm("a0") = p;
+   asm volatile (".int 0x00356013" : : "r" (a0));
+}
+
+/*
+ * The RISC-V Zicbop extension doesn't have instructions to prefetch to only a
+ * subset of cache levels, so fall back to rte_prefetch0_write
+ */
+__rte_experimental
+static inline void
+rte_prefetch1_write(const void *p)
+{
+   rte_prefetch0_write(p);
+}
+__rte_experimental
+static inline void
+rte_prefetch2_write(const void *p)
+{
+   rte_prefetch0_write(p);
+}
+#endif /* RTE_RISCV_ZICBOP */
+
 #ifdef __cplusplus
 }
 #endif
-- 
2.39.2



Re: [PATCH 2/2] eal/riscv: add support for zicbop extension

2024-05-31 Thread Daniel Gregory
On Thu, May 30, 2024 at 06:19:48PM +0100, Daniel Gregory wrote:
> + * The RTE_RISCV_ZICBOP option controls whether we emit them manually for 
> older
> + * compilers that may not have the support to assemble them.
> + */
>  static inline void rte_prefetch0(const volatile void *p)
>  {
> - RTE_SET_USED(p);
> +#ifndef RTE_RISCV_ZICBOP
> + /* by default __builtin_prefetch prepares for a read */
> + __builtin_prefetch((const void *)p);

This cast causes warnings (which are treated as errors by the 0-day
robot) because it discards the 'volatile' qualifier on p.

Removing the volatile from the definition of rte_prefetch0 causes build
failures in some drivers (txgbe_rxtx.c:1809, ixgbe_rxtx.c:2174,
enic_rxtx.c:127, ...).

rte_prefetch0_write takes its argument as 'const void *' and so can use
__builtin_prefetch().
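
A minimal reproduction of the diagnostic (a sketch with a stand-in
function name, assuming -Wcast-qual, which the DPDK build enables):

/* Mirrors the shape of the patch's rte_prefetch0; compiling this with
 * -Wcast-qual produces:
 * "warning: cast discards 'volatile' qualifier from pointer target type"
 */
static inline void
prefetch_read(const volatile void *p)
{
	__builtin_prefetch((const void *)p);
}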


[PATCH 0/5] riscv: implement accelerated crc using zbc

2024-06-18 Thread Daniel Gregory
The RISC-V Zbc extension adds instructions for carry-less multiplication
we can use to implement CRC in hardware. This patchset contains two new
implementations:

- one in lib/hash/rte_crc_riscv64.h that uses a Barrett reduction to
  implement the four rte_hash_crc_* functions
- one in lib/net/net_crc_zbc.c that uses repeated single-folds to reduce
  the buffer until it is small enough for a Barrett reduction to
  implement rte_crc16_ccitt_zbc_handler and rte_crc32_eth_zbc_handler

My approach is largely based on the Intel's "Fast CRC Computation Using
PCLMULQDQ Instruction" white paper
https://www.researchgate.net/publication/263424619_Fast_CRC_computation
and a post about "Optimizing CRC32 for small payload sizes on x86"
https://mary.rs/lab/crc32/

These implementations are behind a new flag, RTE_RISCV_ZBC. Due to use
of bitmanip compiler intrinsics, a modern version of GCC (14+) or Clang
(18+) is required to compile with this flag enabled.

I have carried out some performance comparisons between the generic
table implementations and the new hardware implementations. Listed below
is the number of cycles it takes to compute the CRC hash for buffers of
various sizes (as reported by rte_get_timer_cycles()). These results
were collected on a Kendryte K230 and averaged over 20 samples:

|Buffer| CRC32-ETH (lib/net) | CRC32C (lib/hash)   |
|Size (MB) | Table| Hardware | Table| Hardware |
|--|--|--|--|--|
|1 |   155168 |11610 |73026 |18385 |
|2 |   311203 |22998 |   145586 |35886 |
|3 |   466744 |34370 |   218536 |53939 |
|4 |   621843 |45536 |   291574 |71944 |
|5 |   777908 |56989 |   364152 |89706 |
|6 |   932736 |68023 |   437016 |   107726 |
|7 |  1088756 |79236 |   510197 |   125426 |
|8 |  1243794 |90467 |   583231 |   143614 |

These results suggest a speed-up of lib/net by thirteen times, and of
lib/hash by four times.

Daniel Gregory (5):
  config/riscv: add flag for using Zbc extension
  hash: implement crc using riscv carryless multiply
  net: implement crc using riscv carryless multiply
  examples/l3fwd: use accelerated crc on riscv
  ipfrag: use accelerated crc on riscv

 MAINTAINERS|   2 +
 app/test/test_crc.c|   9 ++
 app/test/test_hash.c   |   7 ++
 config/riscv/meson.build   |   7 ++
 examples/l3fwd/l3fwd_em.c  |   2 +-
 lib/hash/meson.build   |   1 +
 lib/hash/rte_crc_riscv64.h |  89 +++
 lib/hash/rte_hash_crc.c|  12 +-
 lib/hash/rte_hash_crc.h|   6 +-
 lib/ip_frag/ip_frag_internal.c |   6 +-
 lib/net/meson.build|   4 +
 lib/net/net_crc.h  |  11 ++
 lib/net/net_crc_zbc.c  | 202 +
 lib/net/rte_net_crc.c  |  35 ++
 lib/net/rte_net_crc.h  |   2 +
 15 files changed, 389 insertions(+), 6 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h
 create mode 100644 lib/net/net_crc_zbc.c

-- 
2.39.2



[PATCH 1/5] config/riscv: add flag for using Zbc extension

2024-06-18 Thread Daniel Gregory
The RISC-V Zbc extension adds carry-less multiply instructions we can
use to implement more efficient CRC hashing algorithms.

Signed-off-by: Daniel Gregory 
---
 config/riscv/meson.build | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/config/riscv/meson.build b/config/riscv/meson.build
index 07d7d9da23..4bda4089bd 100644
--- a/config/riscv/meson.build
+++ b/config/riscv/meson.build
@@ -26,6 +26,13 @@ flags_common = [
 # read from /proc/device-tree/cpus/timebase-frequency. This property is
 # guaranteed on Linux, as riscv time_init() requires it.
 ['RTE_RISCV_TIME_FREQ', 0],
+
+# Use RISC-V Carry-less multiplication extension (Zbc) for hardware
+# implementations of CRC-32C (lib/hash/rte_crc_riscv64.h), CRC-32 and CRC-16
+# (lib/net/net_crc_zbc.c). Requires intrinsics available in GCC 14.1.0+ and
+# Clang 18.1.0+
+# Make sure to add '_zbc' to your target's -march below
+['RTE_RISCV_ZBC', false],
 ]
 
 ## SoC-specific options.
-- 
2.39.2



[PATCH 2/5] hash: implement crc using riscv carryless multiply

2024-06-18 Thread Daniel Gregory
Using carryless multiply instructions from RISC-V's Zbc extension,
implement a Barrett reduction that calculates CRC-32C checksums.

Based on the approach described by Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instruction", which
is also described here
(https://web.archive.org/web/20240111232520/https://mary.rs/lab/crc32/)

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS|  1 +
 app/test/test_hash.c   |  7 +++
 lib/hash/meson.build   |  1 +
 lib/hash/rte_crc_riscv64.h | 89 ++
 lib/hash/rte_hash_crc.c| 12 -
 lib/hash/rte_hash_crc.h|  6 ++-
 6 files changed, 114 insertions(+), 2 deletions(-)
 create mode 100644 lib/hash/rte_crc_riscv64.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 472713124c..48800f39c4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -318,6 +318,7 @@ M: Stanislaw Kardach 
 F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
+F: lib/hash/rte_crc_riscv64.h
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_hash.c b/app/test/test_hash.c
index 24d3b547ad..c8c4197ad8 100644
--- a/app/test/test_hash.c
+++ b/app/test/test_hash.c
@@ -205,6 +205,13 @@ test_crc32_hash_alg_equiv(void)
printf("Failed checking CRC32_SW against 
CRC32_ARM64\n");
break;
}
+
+   /* Check against 8-byte-operand RISCV64 CRC32 if available */
+   rte_hash_crc_set_alg(CRC32_RISCV64);
+   if (hash_val != rte_hash_crc(data64, data_len, init_val)) {
+   printf("Failed checking CRC32_SW against 
CRC32_RISC64\n");
+   break;
+   }
}
 
/* Resetting to best available algorithm */
diff --git a/lib/hash/meson.build b/lib/hash/meson.build
index 277eb9fa93..8355869a80 100644
--- a/lib/hash/meson.build
+++ b/lib/hash/meson.build
@@ -12,6 +12,7 @@ headers = files(
 indirect_headers += files(
 'rte_crc_arm64.h',
 'rte_crc_generic.h',
+'rte_crc_riscv64.h',
 'rte_crc_sw.h',
 'rte_crc_x86.h',
 'rte_thash_x86_gfni.h',
diff --git a/lib/hash/rte_crc_riscv64.h b/lib/hash/rte_crc_riscv64.h
new file mode 100644
index 00..94f6857c69
--- /dev/null
+++ b/lib/hash/rte_crc_riscv64.h
@@ -0,0 +1,89 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+
+#ifndef _RTE_CRC_RISCV64_H_
+#define _RTE_CRC_RISCV64_H_
+
+/*
+ * CRC-32C takes a reflected input (bit 7 is the lsb) and produces a reflected
+ * output. As reflecting the value we're checksumming is expensive, we instead
+ * reflect the polynomial P (0x11EDC6F41), mu, and the CRC32 algorithm itself.
+ *
+ * The mu constant is used for a Barrett reduction. It's 2^96 / P (0x11F91CAF6)
+ * reflected. Picking 2^96 rather than 2^64 means we can calculate a 64-bit crc
+ * using only two multiplications (https://mary.rs/lab/crc32/)
+ */
+static const uint64_t p = 0x105EC76F1;
+static const uint64_t mu = 0x4869EC38DEA713F1UL;
+
+/* Calculate the CRC32C checksum using a Barrett reduction */
+static inline uint32_t
+crc32c_riscv64(uint64_t data, uint32_t init_val, uint32_t bits)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t crc = (uint64_t)(data ^ init_val) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   crc = __riscv_clmul_64(crc, mu);
+   /* Multiply by P */
+   crc = __riscv_clmulh_64(crc, p);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   crc ^= init_val >> bits;
+
+   return crc;
+}
+
+/*
+ * Use carryless multiply to perform hash on a value, falling back on the
+ * software implementation when the Zbc extension is not supported
+ */
+static inline uint32_t
+rte_hash_crc_1byte(uint8_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 8);
+
+   return crc32c_1byte(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_2byte(uint16_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 16);
+
+   return crc32c_2bytes(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_4byte(uint32_t data, uint32_t init_val)
+{
+   if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+   return crc32c_riscv64(data, init_val, 32);
+
+   return crc32c_1word(data, init_val);
+}
+
+static inline uint32_t
+rte_hash_crc_8byte(uint64_t data, uint32_t init_val)
+{
+	if (likely(rte_hash_crc32_alg & CRC32_RISCV64))
+		return crc32c_riscv64(data, init_val, 64);
+
+	return crc32c_2words(data, init_val);
+}

[PATCH 3/5] net: implement crc using riscv carryless multiply

2024-06-18 Thread Daniel Gregory
Using carryless multiply instructions (clmul) from RISC-V's Zbc
extension, implement CRC-32 and CRC-16 calculations on buffers.

Based on the approach described in Intel's whitepaper on "Fast CRC
Computation for Generic Polynomials Using PCLMULQDQ Instructions", we
perform repeated folds-by-1 whilst the buffer is still big enough, then
perform Barrett reductions on the rest.

Add a case to the crc_autotest suite that tests this implementation.

This implementation is enabled by setting the RTE_RISCV_ZBC flag
(see config/riscv/meson.build).
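
Roughly, a single fold step has the following shape (a sketch only: the
constant names follow the crc_clmul_ctx fields in the patch, and the
exact pairing of k3/k4 with each half depends on the reflection
convention):

#include <stdint.h>
#include <riscv_bitmanip.h>

/*
 * One fold step: multiply the 128-bit running state (hi:lo) down by
 * x^128 and xor in the next 16 bytes of input. k3 and k4 are
 * precomputed powers of x modulo P, per the whitepaper.
 */
static inline void
crc_fold_1(uint64_t *lo, uint64_t *hi, uint64_t data_lo, uint64_t data_hi,
	const struct crc_clmul_ctx *params)
{
	uint64_t l = *lo, h = *hi;

	/* Each 64x64 carry-less product is 128 bits: clmul gives the low
	 * half, clmulh the high half
	 */
	*lo = __riscv_clmul_64(l, params->k3) ^
		__riscv_clmul_64(h, params->k4) ^ data_lo;
	*hi = __riscv_clmulh_64(l, params->k3) ^
		__riscv_clmulh_64(h, params->k4) ^ data_hi;
}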

Signed-off-by: Daniel Gregory 
---
 MAINTAINERS   |   1 +
 app/test/test_crc.c   |   9 ++
 lib/net/meson.build   |   4 +
 lib/net/net_crc.h |  11 +++
 lib/net/net_crc_zbc.c | 202 ++
 lib/net/rte_net_crc.c |  35 
 lib/net/rte_net_crc.h |   2 +
 7 files changed, 264 insertions(+)
 create mode 100644 lib/net/net_crc_zbc.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 48800f39c4..6562e62779 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -319,6 +319,7 @@ F: config/riscv/
 F: doc/guides/linux_gsg/cross_build_dpdk_for_riscv.rst
 F: lib/eal/riscv/
 F: lib/hash/rte_crc_riscv64.h
+F: lib/net/net_crc_zbc.c
 
 Intel x86
 M: Bruce Richardson 
diff --git a/app/test/test_crc.c b/app/test/test_crc.c
index b85fca35fe..fa91557cf5 100644
--- a/app/test/test_crc.c
+++ b/app/test/test_crc.c
@@ -168,6 +168,15 @@ test_crc(void)
return ret;
}
 
+   /* set CRC riscv mode */
+   rte_net_crc_set_alg(RTE_NET_CRC_ZBC);
+
+   ret = test_crc_calc();
+   if (ret < 0) {
+   printf("test crc (riscv64 zbc clmul): failed (%d)\n", ret);
+   return ret;
+   }
+
return 0;
 }
 
diff --git a/lib/net/meson.build b/lib/net/meson.build
index 0b69138949..f2ae019bea 100644
--- a/lib/net/meson.build
+++ b/lib/net/meson.build
@@ -125,4 +125,8 @@ elif (dpdk_conf.has('RTE_ARCH_ARM64') and
 cc.get_define('__ARM_FEATURE_CRYPTO', args: machine_args) != '')
 sources += files('net_crc_neon.c')
 cflags += ['-DCC_ARM64_NEON_PMULL_SUPPORT']
+elif (dpdk_conf.has('RTE_ARCH_RISCV') and dpdk_conf.has('RTE_RISCV_ZBC') and
+   dpdk_conf.get('RTE_RISCV_ZBC'))
+sources += files('net_crc_zbc.c')
+cflags += ['-DCC_RISCV64_ZBC_CLMUL_SUPPORT']
 endif
diff --git a/lib/net/net_crc.h b/lib/net/net_crc.h
index 7a74d5406c..06ae113b47 100644
--- a/lib/net/net_crc.h
+++ b/lib/net/net_crc.h
@@ -42,4 +42,15 @@ rte_crc16_ccitt_neon_handler(const uint8_t *data, uint32_t 
data_len);
 uint32_t
 rte_crc32_eth_neon_handler(const uint8_t *data, uint32_t data_len);
 
+/* RISCV64 Zbc */
+void
+rte_net_crc_zbc_init(void);
+
+uint32_t
+rte_crc16_ccitt_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+uint32_t
+rte_crc32_eth_zbc_handler(const uint8_t *data, uint32_t data_len);
+
+
 #endif /* _NET_CRC_H_ */
diff --git a/lib/net/net_crc_zbc.c b/lib/net/net_crc_zbc.c
new file mode 100644
index 00..5907d69471
--- /dev/null
+++ b/lib/net/net_crc_zbc.c
@@ -0,0 +1,202 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) ByteDance 2024
+ */
+
+#include <assert.h>
+#include <stdint.h>
+
+#include <riscv_bitmanip.h>
+#include <rte_common.h>
+
+#include "net_crc.h"
+
+/* CLMUL CRC computation context structure */
+struct crc_clmul_ctx {
+   uint64_t Pr;
+   uint64_t mu;
+   uint64_t k3;
+   uint64_t k4;
+   uint64_t k5;
+};
+
+struct crc_clmul_ctx crc32_eth_clmul;
+struct crc_clmul_ctx crc16_ccitt_clmul;
+
+/* Perform Barrett's reduction on 8, 16, 32 or 64-bit value */
+static inline uint32_t
+crc32_barrett_zbc(
+   const uint64_t data,
+   uint32_t crc,
+   uint32_t bits,
+   const struct crc_clmul_ctx *params)
+{
+   assert((bits == 64) || (bits == 32) || (bits == 16) || (bits == 8));
+
+   /* Combine data with the initial value */
+   uint64_t temp = (uint64_t)(data ^ crc) << (64 - bits);
+
+   /*
+* Multiply by mu, which is 2^96 / P. Division by 2^96 occurs by taking
+* the lower 64 bits of the result (remember we're inverted)
+*/
+   temp = __riscv_clmul_64(temp, params->mu);
+   /* Multiply by P */
+   temp = __riscv_clmulh_64(temp, params->Pr);
+
+   /* Subtract from original (only needed for smaller sizes) */
+   if (bits == 16 || bits == 8)
+   temp ^= crc >> bits;
+
+   return temp;
+}
+
+/* Repeat Barrett's reduction for short buffer sizes */
+static inline uint32_t
+crc32_repeated_barrett_zbc(
+   const uint8_t *data,
+   uint32_t data_len,
+   uint32_t crc,
+   const struct crc_clmul_ctx *params)
+{
+   while (data_len >= 8) {
+		crc = crc32_barrett_zbc(*(const uint64_t *)data, crc, 64, params);
+   data += 8;
+   data_len -= 8;
+   }
+   if (data_len &

[PATCH 4/5] examples/l3fwd: use accelerated crc on riscv

2024-06-18 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 examples/l3fwd/l3fwd_em.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/l3fwd/l3fwd_em.c b/examples/l3fwd/l3fwd_em.c
index d98e66ea2c..4cec2dc6a9 100644
--- a/examples/l3fwd/l3fwd_em.c
+++ b/examples/l3fwd/l3fwd_em.c
@@ -29,7 +29,7 @@
 #include "l3fwd_event.h"
 #include "em_route_parse.c"
 
-#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32)
+#if defined(RTE_ARCH_X86) || defined(__ARM_FEATURE_CRC32) || defined(RTE_RISCV_ZBC)
 #define EM_HASH_CRC 1
 #endif
 
-- 
2.39.2



[PATCH 5/5] ipfrag: use accelerated crc on riscv

2024-06-18 Thread Daniel Gregory
When the RISC-V Zbc (carryless multiplication) extension is present, an
implementation of CRC hashing using hardware instructions is available.
Use it rather than jhash.

Signed-off-by: Daniel Gregory 
---
 lib/ip_frag/ip_frag_internal.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ip_frag/ip_frag_internal.c b/lib/ip_frag/ip_frag_internal.c
index 7cbef647df..7806264078 100644
--- a/lib/ip_frag/ip_frag_internal.c
+++ b/lib/ip_frag/ip_frag_internal.c
@@ -45,14 +45,14 @@ ipv4_frag_hash(const struct ip_frag_key *key, uint32_t *v1, 
uint32_t *v2)
 
p = (const uint32_t *)&key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(key->id, v);
 #else
 
v = rte_jhash_3words(p[0], p[1], key->id, PRIME_VALUE);
-#endif /* RTE_ARCH_X86 */
+#endif /* RTE_ARCH_X86 || RTE_ARCH_ARM64 || RTE_RISCV_ZBC */
 
*v1 =  v;
*v2 = (v << 7) + (v >> 14);
@@ -66,7 +66,7 @@ ipv6_frag_hash(const struct ip_frag_key *key, uint32_t *v1, 
uint32_t *v2)
 
p = (const uint32_t *) &key->src_dst;
 
-#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64)
+#if defined(RTE_ARCH_X86) || defined(RTE_ARCH_ARM64) || defined(RTE_RISCV_ZBC)
v = rte_hash_crc_4byte(p[0], PRIME_VALUE);
v = rte_hash_crc_4byte(p[1], v);
v = rte_hash_crc_4byte(p[2], v);
-- 
2.39.2



RE: [PATCH 1/5] config/riscv: add flag for using Zbc extension

2024-06-19 Thread Daniel Gregory
On Wed, Jun 19, 2024 at 09:08:14AM +0200, Morten Brørup wrote:
> > From: Stephen Hemminger [mailto:step...@networkplumber.org]
> > Subject: RE: [PATCH 1/5] config/riscv: add flag for using Zbc extension
> > 
> > On Tue, 18 Jun 2024 18:41:29 +0100
> > Daniel Gregory  wrote:
> > 
> > > diff --git a/config/riscv/meson.build b/config/riscv/meson.build
> > > index 07d7d9da23..4bda4089bd 100644
> > > --- a/config/riscv/meson.build
> > > +++ b/config/riscv/meson.build
> > > @@ -26,6 +26,13 @@ flags_common = [
> > >  # read from /proc/device-tree/cpus/timebase-frequency. This property 
> > > is
> > >  # guaranteed on Linux, as riscv time_init() requires it.
> > >  ['RTE_RISCV_TIME_FREQ', 0],
> > > +
> > > +# Use RISC-V Carry-less multiplication extension (Zbc) for hardware
> > > +# implementations of CRC-32C (lib/hash/rte_crc_riscv64.h), CRC-32 and
> > CRC-16
> > > +# (lib/net/net_crc_zbc.c). Requires intrinsics available in GCC 
> > > 14.1.0+
> > and
> > > +# Clang 18.1.0+
> > > +# Make sure to add '_zbc' to your target's -march below
> > > +['RTE_RISCV_ZBC', false],
> > >  ]
> > 
> > Please do not add more config options via compile flags.
> > It makes it impossible for distros to ship one version.
> > 
> > Instead, detect at compile or runtime
> 
> Build time detection is not possible for cross builds.
> 

How about build time detection based on the target's configured
instruction set (either specified by cross-file or passed in through
-Dinstruction_set)? We could have a map from extensions present in the
ISA string to compile flags that should be enabled.

I suggested this whilst discussing a previous patch adding support for
the Zawrs extension, but haven't heard back from Stanisław yet:
https://lore.kernel.org/dpdk-dev/20240520094854.GA3672529@ste-uk-lab-gw/

As for runtime detection, newer kernel versions have a hardware probing
interface for detecting the presence of extensions; support could be
added to rte_cpuflags.c:
https://docs.kernel.org/arch/riscv/hwprobe.html

In combination, distros on newer kernels could ship a version that has
these optimisations baked in that falls back to a generic implementation
when the extension is detected to not be present, and systems without
the latest GCC/Clang can still compile by specifying a target ISA that
doesn't include "_zbc".


Re: [PATCH v2] eal/arm: replace RTE_BUILD_BUG on non-constant

2024-06-28 Thread Daniel Gregory
On Thu, Jun 27, 2024 at 05:08:51PM +0200, Thomas Monjalon wrote:
> 04/05/2024 02:59, Stephen Hemminger:
> > On Fri,  3 May 2024 19:27:30 +0100
> > Daniel Gregory  wrote:
> > 
> > > The ARM implementation of rte_pause uses RTE_BUILD_BUG_ON to check
> > > memorder, which is not constant. This causes compile errors when it is
> > > enabled with RTE_ARM_USE_WFE. eg.
> > > 
> > > ../lib/eal/arm/include/rte_pause_64.h: In function 
> > > ‘rte_wait_until_equal_16’:
> > > ../lib/eal/include/rte_common.h:530:56: error: expression in static 
> > > assertion is not constant
> > >   530 | #define RTE_BUILD_BUG_ON(condition) do { 
> > > static_assert(!(condition), #condition); } while (0)
> > >   |
> > > ^~~~
> > > ../lib/eal/arm/include/rte_pause_64.h:156:9: note: in expansion of macro 
> > > ‘RTE_BUILD_BUG_ON’
> > >   156 | RTE_BUILD_BUG_ON(memorder != rte_memory_order_acquire &&
> > >   | ^~~~
> > > 
> > > Fix the compile errors by replacing the check with an assert, like in
> > > the generic implementation (lib/eal/include/generic/rte_pause.h).
> > 
> > No, don't hide the problem.
> > 
> > What code is calling these. Looks like a real bug. Could be behind layers 
> > of wrappers.
> 
> I support Stephen's opinion.
> Please look for the real issue.

In DPDK, I have found 26 calls of rte_wait_until_equal_16, largely split
between app/test-bbdev/test_bbdev_perf.c and app/test/test_timer.c, with
a couple of calls in lib/eal/include/rte_pflock.h and
lib/eal/include/rte_ticketlock.h as well. There are 16 calls of
rte_wait_until_equal_32, spread amongst various test cases
(test_func_reentrancy.c, test_mcslock.c, test_mempool_perf.c, ...), two
drivers (drivers/event/opdl/opdl_ring.c and
drivers/net/thunderx/nicvf_rxrx.c), lib/eal/common/eal_common_mcfg.c,
lib/eal/include/generic/rte_spinlock.h, lib/ring/rte_ring_c11_pvt.h,
lib/ring/rte_ring_generic_pvt.h and lib/eal/include/rte_mcslock.h. There
is a single call to rte_wait_until_equal_64 in app/test/test_pmd_perf.c.

They all correctly use the primitives from rte_stdatomic.h

As I discussed on another chain
https://lore.kernel.org/dpdk-dev/20240509110251.GA3795959@ste-uk-lab-gw/
from what I've seen, it seems that neither Clang nor GCC allow for
static checks on the parameters of inline functions. For instance, the
following does not compile:

static inline __attribute__((always_inline)) int
fn(int val)
{
_Static_assert(val == 0, "val nonzero");
return 0;
}

int main(void) {
return fn(0);
}

( https://godbolt.org/z/TrfWqYoGo )

With the same "expression in static assertion is not constant" error
that I get when cross-compiling DPDK for ARM with WFE enabled on main:

diff --git a/config/arm/meson.build b/config/arm/meson.build
index a45aa9e466..661c735977 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -18,7 +18,7 @@ flags_common = [
 #['RTE_ARM64_MEMCPY_STRICT_ALIGN', false],
 
 # Enable use of ARM wait for event instruction.
-# ['RTE_ARM_USE_WFE', false],
+['RTE_ARM_USE_WFE', true],
 
 ['RTE_ARCH_ARM64', true],
 ['RTE_CACHE_LINE_SIZE', 128]


Re: [PATCH v3 0/9] riscv: implement accelerated crc using zbc

2024-09-17 Thread Daniel Gregory
Would it be possible to get a review on this patchset? I would be happy
to hear any feedback on the approach to RISC-V extension detection or
how I have implemented the hardware-optimised CRCs.

Kind regards,
Daniel