[PATCH 0/7] add Nitrox compress device support

2024-03-02 Thread Nagadheeraj Rottela
Add the Nitrox PMD to support the Nitrox compress device.
---
v5:
* Added missing entry for nitrox folder in compress meson.build

v4:
* Fixed checkpatch warnings.
* Updated release notes.

v3:
* Fixed ABI compatibility issue.

v2:
* Reformatted patches to minimize number of changes.
* Removed empty file with only copyright.
* Updated all feature flags in nitrox.ini file.
* Added separate gotos in nitrox_pci_probe() function.

Nagadheeraj Rottela (7):
  crypto/nitrox: move common code
  drivers/compress: add Nitrox driver
  common/nitrox: add compress hardware queue management
  crypto/nitrox: set queue type during queue pair setup
  compress/nitrox: add software queue management
  compress/nitrox: support stateless request
  compress/nitrox: support stateful request

 MAINTAINERS   |8 +
 doc/guides/compressdevs/features/nitrox.ini   |   17 +
 doc/guides/compressdevs/index.rst |1 +
 doc/guides/compressdevs/nitrox.rst|   50 +
 doc/guides/rel_notes/release_24_03.rst|3 +
 drivers/common/nitrox/meson.build |   19 +
 .../{crypto => common}/nitrox/nitrox_csr.h|   12 +
 .../{crypto => common}/nitrox/nitrox_device.c |   51 +-
 .../{crypto => common}/nitrox/nitrox_device.h |4 +-
 .../{crypto => common}/nitrox/nitrox_hal.c|  116 ++
 .../{crypto => common}/nitrox/nitrox_hal.h|  115 ++
 .../{crypto => common}/nitrox/nitrox_logs.c   |0
 .../{crypto => common}/nitrox/nitrox_logs.h   |0
 drivers/{crypto => common}/nitrox/nitrox_qp.c |   56 +-
 drivers/{crypto => common}/nitrox/nitrox_qp.h |   60 +-
 drivers/common/nitrox/version.map |9 +
 drivers/compress/meson.build  |1 +
 drivers/compress/nitrox/meson.build   |   16 +
 drivers/compress/nitrox/nitrox_comp.c |  604 +
 drivers/compress/nitrox/nitrox_comp.h |   35 +
 drivers/compress/nitrox/nitrox_comp_reqmgr.c  | 1194 +
 drivers/compress/nitrox/nitrox_comp_reqmgr.h  |   58 +
 drivers/crypto/nitrox/meson.build |   11 +-
 drivers/crypto/nitrox/nitrox_sym.c|1 +
 drivers/meson.build   |1 +
 25 files changed, 2412 insertions(+), 30 deletions(-)
 create mode 100644 doc/guides/compressdevs/features/nitrox.ini
 create mode 100644 doc/guides/compressdevs/nitrox.rst
 create mode 100644 drivers/common/nitrox/meson.build
 rename drivers/{crypto => common}/nitrox/nitrox_csr.h (67%)
 rename drivers/{crypto => common}/nitrox/nitrox_device.c (77%)
 rename drivers/{crypto => common}/nitrox/nitrox_device.h (81%)
 rename drivers/{crypto => common}/nitrox/nitrox_hal.c (65%)
 rename drivers/{crypto => common}/nitrox/nitrox_hal.h (59%)
 rename drivers/{crypto => common}/nitrox/nitrox_logs.c (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_logs.h (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_qp.c (67%)
 rename drivers/{crypto => common}/nitrox/nitrox_qp.h (55%)
 create mode 100644 drivers/common/nitrox/version.map
 create mode 100644 drivers/compress/nitrox/meson.build
 create mode 100644 drivers/compress/nitrox/nitrox_comp.c
 create mode 100644 drivers/compress/nitrox/nitrox_comp.h
 create mode 100644 drivers/compress/nitrox/nitrox_comp_reqmgr.c
 create mode 100644 drivers/compress/nitrox/nitrox_comp_reqmgr.h

-- 
2.42.0



[PATCH v5 1/7] crypto/nitrox: move common code

2024-03-02 Thread Nagadheeraj Rottela
A new compressdev Nitrox PMD will be added in the next few patches.
This patch moves the common code shared across the Nitrox crypto and
compress drivers to the drivers/common/nitrox folder.

Signed-off-by: Nagadheeraj Rottela 
---
 MAINTAINERS|  1 +
 drivers/common/nitrox/meson.build  | 18 ++
 drivers/{crypto => common}/nitrox/nitrox_csr.h |  0
 .../{crypto => common}/nitrox/nitrox_device.c  | 14 ++
 .../{crypto => common}/nitrox/nitrox_device.h  |  1 -
 drivers/{crypto => common}/nitrox/nitrox_hal.c |  0
 drivers/{crypto => common}/nitrox/nitrox_hal.h |  0
 .../{crypto => common}/nitrox/nitrox_logs.c|  0
 .../{crypto => common}/nitrox/nitrox_logs.h|  0
 drivers/{crypto => common}/nitrox/nitrox_qp.c  |  2 +-
 drivers/{crypto => common}/nitrox/nitrox_qp.h  | 11 ++-
 drivers/common/nitrox/version.map  |  9 +
 drivers/crypto/nitrox/meson.build  | 11 +--
 drivers/meson.build|  1 +
 14 files changed, 59 insertions(+), 9 deletions(-)
 create mode 100644 drivers/common/nitrox/meson.build
 rename drivers/{crypto => common}/nitrox/nitrox_csr.h (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_device.c (92%)
 rename drivers/{crypto => common}/nitrox/nitrox_device.h (94%)
 rename drivers/{crypto => common}/nitrox/nitrox_hal.c (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_hal.h (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_logs.c (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_logs.h (100%)
 rename drivers/{crypto => common}/nitrox/nitrox_qp.c (99%)
 rename drivers/{crypto => common}/nitrox/nitrox_qp.h (91%)
 create mode 100644 drivers/common/nitrox/version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 962c359cdd..d6abebc55c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1142,6 +1142,7 @@ Marvell Nitrox
 M: Nagadheeraj Rottela 
 M: Srikanth Jampala 
 F: drivers/crypto/nitrox/
+F: drivers/common/nitrox/
 F: doc/guides/cryptodevs/nitrox.rst
 F: doc/guides/cryptodevs/features/nitrox.ini
 
diff --git a/drivers/common/nitrox/meson.build 
b/drivers/common/nitrox/meson.build
new file mode 100644
index 00..99fadbbfc9
--- /dev/null
+++ b/drivers/common/nitrox/meson.build
@@ -0,0 +1,18 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright (c) 2024 Marvell.
+
+if not is_linux
+    build = false
+    reason = 'only supported on Linux'
+endif
+
+deps += ['bus_pci']
+
+sources += files(
+    'nitrox_device.c',
+    'nitrox_hal.c',
+    'nitrox_logs.c',
+    'nitrox_qp.c',
+)
+
+includes += include_directories('../../crypto/nitrox')
diff --git a/drivers/crypto/nitrox/nitrox_csr.h 
b/drivers/common/nitrox/nitrox_csr.h
similarity index 100%
rename from drivers/crypto/nitrox/nitrox_csr.h
rename to drivers/common/nitrox/nitrox_csr.h
diff --git a/drivers/crypto/nitrox/nitrox_device.c 
b/drivers/common/nitrox/nitrox_device.c
similarity index 92%
rename from drivers/crypto/nitrox/nitrox_device.c
rename to drivers/common/nitrox/nitrox_device.c
index 5b319dd681..b2f638ec8a 100644
--- a/drivers/crypto/nitrox/nitrox_device.c
+++ b/drivers/common/nitrox/nitrox_device.c
@@ -120,5 +120,19 @@ static struct rte_pci_driver nitrox_pmd = {
.remove = nitrox_pci_remove,
 };
 
+__rte_weak int
+nitrox_sym_pmd_create(struct nitrox_device *ndev)
+{
+   RTE_SET_USED(ndev);
+   return 0;
+}
+
+__rte_weak int
+nitrox_sym_pmd_destroy(struct nitrox_device *ndev)
+{
+   RTE_SET_USED(ndev);
+   return 0;
+}
+
 RTE_PMD_REGISTER_PCI(nitrox, nitrox_pmd);
 RTE_PMD_REGISTER_PCI_TABLE(nitrox, pci_id_nitrox_map);
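
For readers unfamiliar with the pattern: the __rte_weak stubs above let the
common PCI driver link even when only one of the two Nitrox PMDs is built; a
PMD that is present supplies a strong definition that overrides the stub at
link time. A minimal self-contained illustration (assuming GCC/Clang, where
DPDK's __rte_weak expands to __attribute__((weak))):

#include <stdio.h>

/* Weak default: used only if no strong definition is linked in. */
__attribute__((weak)) int pmd_create(void)
{
    return 0;   /* no-op when the corresponding PMD is not built */
}

int main(void)
{
    /* Prints 0 here; linking another object that defines a strong
     * pmd_create() would silently replace the weak default. */
    printf("pmd_create() = %d\n", pmd_create());
    return 0;
}
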
diff --git a/drivers/crypto/nitrox/nitrox_device.h 
b/drivers/common/nitrox/nitrox_device.h
similarity index 94%
rename from drivers/crypto/nitrox/nitrox_device.h
rename to drivers/common/nitrox/nitrox_device.h
index 1ff7c59b63..b7c7ffd772 100644
--- a/drivers/crypto/nitrox/nitrox_device.h
+++ b/drivers/common/nitrox/nitrox_device.h
@@ -6,7 +6,6 @@
 #define _NITROX_DEVICE_H_
 
 #include 
-#include 
 
 struct nitrox_sym_device;
 
diff --git a/drivers/crypto/nitrox/nitrox_hal.c 
b/drivers/common/nitrox/nitrox_hal.c
similarity index 100%
rename from drivers/crypto/nitrox/nitrox_hal.c
rename to drivers/common/nitrox/nitrox_hal.c
diff --git a/drivers/crypto/nitrox/nitrox_hal.h 
b/drivers/common/nitrox/nitrox_hal.h
similarity index 100%
rename from drivers/crypto/nitrox/nitrox_hal.h
rename to drivers/common/nitrox/nitrox_hal.h
diff --git a/drivers/crypto/nitrox/nitrox_logs.c 
b/drivers/common/nitrox/nitrox_logs.c
similarity index 100%
rename from drivers/crypto/nitrox/nitrox_logs.c
rename to drivers/common/nitrox/nitrox_logs.c
diff --git a/drivers/crypto/nitrox/nitrox_logs.h 
b/drivers/common/nitrox/nitrox_logs.h
similarity index 100%
rename from drivers/crypto/nitrox/nitrox_logs.h
rename to drivers/common/nitrox/nitrox_logs.h
diff --git a/drivers/crypto/nitrox/nitrox_qp.c 
b/drivers/common/nitrox/nitrox_qp.c

[PATCH v5 2/7] drivers/compress: add Nitrox driver

2024-03-02 Thread Nagadheeraj Rottela
Introduce the Nitrox compressdev driver.
This patch implements the following operations (a short usage sketch
follows the list):
- dev_configure
- dev_close
- dev_infos_get
- private_xform_create
- private_xform_free
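
As a rough sketch of how an application reaches the private_xform ops
through the public compressdev API (the DEFLATE parameters and helper name
are chosen for illustration; error handling trimmed):

#include <rte_compressdev.h>
#include <rte_comp.h>

/* Illustrative helper: create a DEFLATE compression private xform. */
static void *
create_deflate_xform(uint8_t dev_id)
{
    struct rte_comp_xform xform = {
        .type = RTE_COMP_COMPRESS,
        .compress = {
            .algo = RTE_COMP_ALGO_DEFLATE,
            .deflate.huffman = RTE_COMP_HUFFMAN_DYNAMIC,
            .level = RTE_COMP_LEVEL_PMD_DEFAULT,
            .window_size = 15,
            .chksum = RTE_COMP_CHECKSUM_NONE,
        },
    };
    void *priv_xform = NULL;

    /* Ends up in the PMD's private_xform_create op added here. */
    if (rte_compressdev_private_xform_create(dev_id, &xform, &priv_xform))
        return NULL;
    return priv_xform;
}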

Signed-off-by: Nagadheeraj Rottela 
---
 MAINTAINERS  |   7 +
 doc/guides/compressdevs/features/nitrox.ini  |  17 +
 doc/guides/compressdevs/index.rst|   1 +
 doc/guides/compressdevs/nitrox.rst   |  50 +++
 doc/guides/rel_notes/release_24_03.rst   |   3 +
 drivers/common/nitrox/meson.build|   1 +
 drivers/common/nitrox/nitrox_device.c|  37 +-
 drivers/common/nitrox/nitrox_device.h|   3 +
 drivers/compress/meson.build |   1 +
 drivers/compress/nitrox/meson.build  |  15 +
 drivers/compress/nitrox/nitrox_comp.c| 353 +++
 drivers/compress/nitrox/nitrox_comp.h|  33 ++
 drivers/compress/nitrox/nitrox_comp_reqmgr.h |  40 +++
 13 files changed, 556 insertions(+), 5 deletions(-)
 create mode 100644 doc/guides/compressdevs/features/nitrox.ini
 create mode 100644 doc/guides/compressdevs/nitrox.rst
 create mode 100644 drivers/compress/nitrox/meson.build
 create mode 100644 drivers/compress/nitrox/nitrox_comp.c
 create mode 100644 drivers/compress/nitrox/nitrox_comp.h
 create mode 100644 drivers/compress/nitrox/nitrox_comp_reqmgr.h

diff --git a/MAINTAINERS b/MAINTAINERS
index d6abebc55c..a6e2cf6eae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1215,6 +1215,13 @@ F: drivers/compress/isal/
 F: doc/guides/compressdevs/isal.rst
 F: doc/guides/compressdevs/features/isal.ini
 
+Marvell Nitrox
+M: Nagadheeraj Rottela 
+F: drivers/compress/nitrox/
+F: drivers/common/nitrox/
+F: doc/guides/compressdevs/nitrox.rst
+F: doc/guides/compressdevs/features/nitrox.ini
+
 NVIDIA mlx5
 M: Matan Azrad 
 F: drivers/compress/mlx5/
diff --git a/doc/guides/compressdevs/features/nitrox.ini 
b/doc/guides/compressdevs/features/nitrox.ini
new file mode 100644
index 00..1b6a96ac6d
--- /dev/null
+++ b/doc/guides/compressdevs/features/nitrox.ini
@@ -0,0 +1,17 @@
+;
+; Refer to default.ini for the full list of available PMD features.
+;
+; Supported features of 'nitrox' compression driver.
+;
+[Features]
+HW Accelerated = Y
+Stateful Compression   = Y
+Stateful Decompression = Y
+OOP SGL In SGL Out = Y
+OOP SGL In LB  Out = Y
+OOP LB  In SGL Out = Y
+Deflate= Y
+Adler32= Y
+Crc32  = Y
+Fixed  = Y
+Dynamic= Y
diff --git a/doc/guides/compressdevs/index.rst 
b/doc/guides/compressdevs/index.rst
index 54a3ef4273..849f211688 100644
--- a/doc/guides/compressdevs/index.rst
+++ b/doc/guides/compressdevs/index.rst
@@ -12,6 +12,7 @@ Compression Device Drivers
 overview
 isal
 mlx5
+nitrox
 octeontx
 qat_comp
 zlib
diff --git a/doc/guides/compressdevs/nitrox.rst 
b/doc/guides/compressdevs/nitrox.rst
new file mode 100644
index 00..840fd7241a
--- /dev/null
+++ b/doc/guides/compressdevs/nitrox.rst
@@ -0,0 +1,50 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Marvell.
+
+Marvell NITROX Compression Poll Mode Driver
+===========================================
+
+The Nitrox compression poll mode driver provides support for offloading
+compression and decompression operations to the NITROX V processor.
+Detailed information about the NITROX V processor can be obtained here:
+
+* 
https://www.marvell.com/security-solutions/nitrox-security-processors/nitrox-v/
+
+Features
+--------
+
+NITROX V compression PMD has support for:
+
+Compression/Decompression algorithm:
+
+* DEFLATE
+
+Huffman code type:
+
+* FIXED
+* DYNAMIC
+
+Window size support:
+
+* Min - 2 bytes
+* Max - 32KB
+
+Checksum generation:
+
+* CRC32, Adler
+
+Limitations
+-----------
+
+* Compressdev level 0, no compression, is not supported.
+
+Initialization
+--------------
+
+The Nitrox compression PMD depends on the Nitrox kernel PF driver being
+installed on the platform. The PF driver is required to create the VF
+devices which will be used by the PMD. Each VF device can enable one
+compressdev PMD.
+
+The Nitrox kernel PF driver is available as part of the CNN55XX-Driver SDK.
+The SDK and its installation instructions can be obtained from:
+`Marvell Customer Portal `_.
diff --git a/doc/guides/rel_notes/release_24_03.rst 
b/doc/guides/rel_notes/release_24_03.rst
index 879bb4944c..bb91953a23 100644
--- a/doc/guides/rel_notes/release_24_03.rst
+++ b/doc/guides/rel_notes/release_24_03.rst
@@ -138,6 +138,9 @@ New Features
 to support TLS v1.2, TLS v1.3 and DTLS v1.2.
   * Added PMD API to allow raw submission of instructions to CPT.
 
+* **Added Marvell NITROX compression PMD.**
+
+  * Added support for DEFLATE compression and decompression.
 
 Removed Items
 -
diff --git a/drivers/common/nitrox/meson.build 
b/drivers/common/nitrox/meson.build
index 99fadbbfc9..f3cb42f006 100644
--- a/drivers/commo

[PATCH v5 3/7] common/nitrox: add compress hardware queue management

2024-03-02 Thread Nagadheeraj Rottela
Added compress device hardware ring initialization.

Signed-off-by: Nagadheeraj Rottela 
---
 drivers/common/nitrox/nitrox_csr.h |  12 +++
 drivers/common/nitrox/nitrox_hal.c | 116 +
 drivers/common/nitrox/nitrox_hal.h | 115 
 drivers/common/nitrox/nitrox_qp.c  |  54 --
 drivers/common/nitrox/nitrox_qp.h  |  49 ++--
 5 files changed, 330 insertions(+), 16 deletions(-)
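
The ZQMQ register macros added below follow the same base-plus-stride layout
as the existing AQMQ macros: each ring's register block is offset by 0x40000
bytes from the previous one. A worked example (hypothetical ring index,
assuming that per-ring stride):

/* Queue-size register of ring 1:
 * ZQMQ_QSZX(1) = 0x30008UL + (1 * 0x40000UL) = 0x70008UL
 */
uint64_t qsz_offset = ZQMQ_QSZX(1);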

diff --git a/drivers/common/nitrox/nitrox_csr.h 
b/drivers/common/nitrox/nitrox_csr.h
index de7a3c6713..97c797c2e2 100644
--- a/drivers/common/nitrox/nitrox_csr.h
+++ b/drivers/common/nitrox/nitrox_csr.h
@@ -25,6 +25,18 @@
 /* AQM Virtual Function Registers */
 #define AQMQ_QSZX(_i)  (0x20008UL + ((_i) * 0x40000UL))
 
+/* ZQM virtual function registers */
+#define ZQMQ_DRBLX(_i) (0x30000UL + ((_i) * 0x40000UL))
+#define ZQMQ_QSZX(_i)  (0x30008UL + ((_i) * 0x40000UL))
+#define ZQMQ_BADRX(_i) (0x30010UL + ((_i) * 0x40000UL))
+#define ZQMQ_NXT_CMDX(_i)  (0x30018UL + ((_i) * 0x40000UL))
+#define ZQMQ_CMD_CNTX(_i)  (0x30020UL + ((_i) * 0x40000UL))
+#define ZQMQ_CMP_THRX(_i)  (0x30028UL + ((_i) * 0x40000UL))
+#define ZQMQ_CMP_CNTX(_i)  (0x30030UL + ((_i) * 0x40000UL))
+#define ZQMQ_TIMER_LDX(_i) (0x30038UL + ((_i) * 0x40000UL))
+#define ZQMQ_ENX(_i)   (0x30048UL + ((_i) * 0x40000UL))
+#define ZQMQ_ACTIVITY_STATX(_i)(0x30050UL + ((_i) * 0x40000UL))
+
 static inline uint64_t
 nitrox_read_csr(uint8_t *bar_addr, uint64_t offset)
 {
diff --git a/drivers/common/nitrox/nitrox_hal.c 
b/drivers/common/nitrox/nitrox_hal.c
index 433f3adb20..451549a664 100644
--- a/drivers/common/nitrox/nitrox_hal.c
+++ b/drivers/common/nitrox/nitrox_hal.c
@@ -9,6 +9,7 @@
 
 #include "nitrox_hal.h"
 #include "nitrox_csr.h"
+#include "nitrox_logs.h"
 
 #define MAX_VF_QUEUES  8
 #define MAX_PF_QUEUES  64
@@ -164,6 +165,121 @@ setup_nps_pkt_solicit_output_port(uint8_t *bar_addr, 
uint16_t port)
}
 }
 
+int
+zqmq_input_ring_disable(uint8_t *bar_addr, uint16_t ring)
+{
+   union zqmq_activity_stat zqmq_activity_stat;
+   union zqmq_en zqmq_en;
+   union zqmq_cmp_cnt zqmq_cmp_cnt;
+   uint64_t reg_addr;
+   int max_retries = 5;
+
+   /* clear queue enable */
+   reg_addr = ZQMQ_ENX(ring);
+   zqmq_en.u64 = nitrox_read_csr(bar_addr, reg_addr);
+   zqmq_en.s.queue_enable = 0;
+   nitrox_write_csr(bar_addr, reg_addr, zqmq_en.u64);
+   rte_delay_us_block(100);
+
+   /* wait for queue active to clear */
+   reg_addr = ZQMQ_ACTIVITY_STATX(ring);
+   zqmq_activity_stat.u64 = nitrox_read_csr(bar_addr, reg_addr);
+   while (zqmq_activity_stat.s.queue_active && max_retries--) {
+   rte_delay_ms(10);
+   zqmq_activity_stat.u64 = nitrox_read_csr(bar_addr, reg_addr);
+   }
+
+   if (zqmq_activity_stat.s.queue_active) {
+   NITROX_LOG(ERR, "Failed to disable zqmq ring %d\n", ring);
+   return -EBUSY;
+   }
+
+   /* clear commands completed count */
+   reg_addr = ZQMQ_CMP_CNTX(ring);
+   zqmq_cmp_cnt.u64 = nitrox_read_csr(bar_addr, reg_addr);
+   nitrox_write_csr(bar_addr, reg_addr, zqmq_cmp_cnt.u64);
+   rte_delay_us_block(CSR_DELAY);
+   return 0;
+}
+
+int
+setup_zqmq_input_ring(uint8_t *bar_addr, uint16_t ring, uint32_t rsize,
+ phys_addr_t raddr)
+{
+   union zqmq_drbl zqmq_drbl;
+   union zqmq_qsz zqmq_qsz;
+   union zqmq_en zqmq_en;
+   union zqmq_cmp_thr zqmq_cmp_thr;
+   union zqmq_timer_ld zqmq_timer_ld;
+   uint64_t reg_addr = 0;
+   int max_retries = 5;
+   int err = 0;
+
+   err = zqmq_input_ring_disable(bar_addr, ring);
+   if (err)
+   return err;
+
+   /* clear doorbell count */
+   reg_addr = ZQMQ_DRBLX(ring);
+   zqmq_drbl.u64 = 0;
+   zqmq_drbl.s.dbell_count = 0xFFFFFFFF;
+   nitrox_write_csr(bar_addr, reg_addr, zqmq_drbl.u64);
+   rte_delay_us_block(CSR_DELAY);
+
+   reg_addr = ZQMQ_NXT_CMDX(ring);
+   nitrox_write_csr(bar_addr, reg_addr, 0);
+   rte_delay_us_block(CSR_DELAY);
+
+   /* write queue length */
+   reg_addr = ZQMQ_QSZX(ring);
+   zqmq_qsz.u64 = 0;
+   zqmq_qsz.s.host_queue_size = rsize;
+   nitrox_write_csr(bar_addr, reg_addr, zqmq_qsz.u64);
+   rte_delay_us_block(CSR_DELAY);
+
+   /* write queue base address */
+   reg_addr = ZQMQ_BADRX(ring);
+   nitrox_write_csr(bar_addr, reg_addr, raddr);
+   rte_delay_us_block(CSR_DELAY);
+
+   /* write commands completed threshold */
+   reg_addr = ZQMQ_CMP_THRX(ring);
+   zqmq_cmp_thr.u64 = 0;
+   zqmq_cmp_thr.s.commands_completed_threshold = 0;
+   nitrox_write_csr(bar_addr, reg_addr, zqmq_cmp_thr.u64);
+   rte_delay_us_block(CSR_DELAY);
+
+   /* write timer load value */

[PATCH v5 4/7] crypto/nitrox: set queue type during queue pair setup

2024-03-02 Thread Nagadheeraj Rottela
Set the queue type to SE to initialize the symmetric hardware queue.

Signed-off-by: Nagadheeraj Rottela 
---
 drivers/crypto/nitrox/nitrox_sym.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/crypto/nitrox/nitrox_sym.c 
b/drivers/crypto/nitrox/nitrox_sym.c
index 1244317438..03652d3ade 100644
--- a/drivers/crypto/nitrox/nitrox_sym.c
+++ b/drivers/crypto/nitrox/nitrox_sym.c
@@ -198,6 +198,7 @@ nitrox_sym_dev_qp_setup(struct rte_cryptodev *cdev, 
uint16_t qp_id,
return -ENOMEM;
}
 
+   qp->type = NITROX_QUEUE_SE;
qp->qno = qp_id;
err = nitrox_qp_setup(qp, ndev->bar_addr, cdev->data->name,
  qp_conf->nb_descriptors, NPS_PKT_IN_INSTR_SIZE,
-- 
2.42.0



[PATCH v5 6/7] compress/nitrox: support stateless request

2024-03-02 Thread Nagadheeraj Rottela
Implement enqueue and dequeue burst operations
for stateless request support.

Signed-off-by: Nagadheeraj Rottela 
---
 drivers/compress/nitrox/meson.build  |   1 +
 drivers/compress/nitrox/nitrox_comp.c|  91 ++-
 drivers/compress/nitrox/nitrox_comp_reqmgr.c | 792 +++
 drivers/compress/nitrox/nitrox_comp_reqmgr.h |  10 +
 4 files changed, 885 insertions(+), 9 deletions(-)
 create mode 100644 drivers/compress/nitrox/nitrox_comp_reqmgr.c
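
For context, a minimal sketch of the application-side loop these burst ops
serve (public compressdev API; dev_id/qp_id 0 and an ops[] array filled via
rte_comp_op_bulk_alloc() are assumed, error handling trimmed):

uint16_t n_enq, n_deq;
struct rte_comp_op *deq_ops[BURST_SIZE];    /* BURST_SIZE: app-chosen */

/* Lands in nitrox_comp_dev_enq_burst() for this PMD. */
n_enq = rte_compressdev_enqueue_burst(0, 0, ops, n_ops);

do {
    n_deq = rte_compressdev_dequeue_burst(0, 0, deq_ops, BURST_SIZE);
    /* Check deq_ops[i]->status == RTE_COMP_OP_STATUS_SUCCESS. */
} while (n_deq == 0);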

diff --git a/drivers/compress/nitrox/meson.build 
b/drivers/compress/nitrox/meson.build
index f137303689..2c35aba60b 100644
--- a/drivers/compress/nitrox/meson.build
+++ b/drivers/compress/nitrox/meson.build
@@ -10,6 +10,7 @@ deps += ['common_nitrox', 'bus_pci', 'compressdev']
 
 sources += files(
 'nitrox_comp.c',
+   'nitrox_comp_reqmgr.c',
 )
 
 includes += include_directories('../../common/nitrox')
diff --git a/drivers/compress/nitrox/nitrox_comp.c 
b/drivers/compress/nitrox/nitrox_comp.c
index 299cb8e783..0ea5ed43ed 100644
--- a/drivers/compress/nitrox/nitrox_comp.c
+++ b/drivers/compress/nitrox/nitrox_comp.c
@@ -187,10 +187,17 @@ static int nitrox_comp_queue_pair_setup(struct 
rte_compressdev *dev,
if (unlikely(err))
goto qp_setup_err;
 
+   qp->sr_mp = nitrox_comp_req_pool_create(dev, qp->count, qp_id,
+   socket_id);
+   if (unlikely(!qp->sr_mp))
+   goto req_pool_err;
+
dev->data->queue_pairs[qp_id] = qp;
NITROX_LOG(DEBUG, "queue %d setup done\n", qp_id);
return 0;
 
+req_pool_err:
+   nitrox_qp_release(qp, ndev->bar_addr);
 qp_setup_err:
rte_free(qp);
return err;
@@ -224,6 +231,7 @@ static int nitrox_comp_queue_pair_release(struct 
rte_compressdev *dev,
 
dev->data->queue_pairs[qp_id] = NULL;
err = nitrox_qp_release(qp, ndev->bar_addr);
+   nitrox_comp_req_pool_free(qp->sr_mp);
rte_free(qp);
NITROX_LOG(DEBUG, "queue %d release done\n", qp_id);
return err;
@@ -349,24 +357,89 @@ static int nitrox_comp_private_xform_free(struct 
rte_compressdev *dev,
return 0;
 }
 
-static uint16_t nitrox_comp_dev_enq_burst(void *qp,
+static int nitrox_enq_single_op(struct nitrox_qp *qp, struct rte_comp_op *op)
+{
+   struct nitrox_softreq *sr;
+   int err;
+
+   if (unlikely(rte_mempool_get(qp->sr_mp, (void **)&sr)))
+   return -ENOMEM;
+
+   err = nitrox_process_comp_req(op, sr);
+   if (unlikely(err)) {
+   rte_mempool_put(qp->sr_mp, sr);
+   return err;
+   }
+
+   nitrox_qp_enqueue(qp, nitrox_comp_instr_addr(sr), sr);
+   return 0;
+}
+
+static uint16_t nitrox_comp_dev_enq_burst(void *queue_pair,
  struct rte_comp_op **ops,
  uint16_t nb_ops)
 {
-   RTE_SET_USED(qp);
-   RTE_SET_USED(ops);
-   RTE_SET_USED(nb_ops);
+   struct nitrox_qp *qp = queue_pair;
+   uint16_t free_slots = 0;
+   uint16_t cnt = 0;
+   bool err = false;
+
+   free_slots = nitrox_qp_free_count(qp);
+   if (nb_ops > free_slots)
+   nb_ops = free_slots;
+
+   for (cnt = 0; cnt < nb_ops; cnt++) {
+   if (unlikely(nitrox_enq_single_op(qp, ops[cnt]))) {
+   err = true;
+   break;
+   }
+   }
+
+   nitrox_ring_dbell(qp, cnt);
+   qp->stats.enqueued_count += cnt;
+   if (unlikely(err))
+   qp->stats.enqueue_err_count++;
+
+   return cnt;
+}
+
+static int nitrox_deq_single_op(struct nitrox_qp *qp,
+   struct rte_comp_op **op_ptr)
+{
+   struct nitrox_softreq *sr;
+   int err;
+
+   sr = nitrox_qp_get_softreq(qp);
+   err = nitrox_check_comp_req(sr, op_ptr);
+   if (err == -EAGAIN)
+   return err;
+
+   nitrox_qp_dequeue(qp);
+   rte_mempool_put(qp->sr_mp, sr);
+   if (err == 0)
+   qp->stats.dequeued_count++;
+   else
+   qp->stats.dequeue_err_count++;
+
return 0;
 }
 
-static uint16_t nitrox_comp_dev_deq_burst(void *qp,
+static uint16_t nitrox_comp_dev_deq_burst(void *queue_pair,
  struct rte_comp_op **ops,
  uint16_t nb_ops)
 {
-   RTE_SET_USED(qp);
-   RTE_SET_USED(ops);
-   RTE_SET_USED(nb_ops);
-   return 0;
+   struct nitrox_qp *qp = queue_pair;
+   uint16_t filled_slots = nitrox_qp_used_count(qp);
+   int cnt = 0;
+
+   if (nb_ops > filled_slots)
+   nb_ops = filled_slots;
+
+   for (cnt = 0; cnt < nb_ops; cnt++)
+   if (nitrox_deq_single_op(qp, &ops[cnt]))
+   break;
+
+   return cnt;
 }
 
 static struct rte_compressdev_ops nitrox_compressdev_ops = {
diff --git a/drivers/compress/nitrox/nitrox_comp_reqmgr.c 
b/drivers/compress/

[PATCH v5 5/7] compress/nitrox: add software queue management

2024-03-02 Thread Nagadheeraj Rottela
Added software queue management code corresponding to
queue pair setup and release functions.

Signed-off-by: Nagadheeraj Rottela 
---
 drivers/compress/nitrox/nitrox_comp.c | 115 +++---
 drivers/compress/nitrox/nitrox_comp.h |   1 +
 2 files changed, 105 insertions(+), 11 deletions(-)

diff --git a/drivers/compress/nitrox/nitrox_comp.c 
b/drivers/compress/nitrox/nitrox_comp.c
index e97a686fbf..299cb8e783 100644
--- a/drivers/compress/nitrox/nitrox_comp.c
+++ b/drivers/compress/nitrox/nitrox_comp.c
@@ -5,11 +5,13 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nitrox_comp.h"
 #include "nitrox_device.h"
 #include "nitrox_logs.h"
 #include "nitrox_comp_reqmgr.h"
+#include "nitrox_qp.h"
 
 static const char nitrox_comp_drv_name[] = 
RTE_STR(COMPRESSDEV_NAME_NITROX_PMD);
 static const struct rte_driver nitrox_rte_comp_drv = {
@@ -17,6 +19,9 @@ static const struct rte_driver nitrox_rte_comp_drv = {
.alias = nitrox_comp_drv_name
 };
 
+static int nitrox_comp_queue_pair_release(struct rte_compressdev *dev,
+ uint16_t qp_id);
+
 static const struct rte_compressdev_capabilities
nitrox_comp_pmd_capabilities[] = {
{   .algo = RTE_COMP_ALGO_DEFLATE,
@@ -84,8 +89,15 @@ static void nitrox_comp_dev_stop(struct rte_compressdev *dev)
 
 static int nitrox_comp_dev_close(struct rte_compressdev *dev)
 {
+   int i, ret;
struct nitrox_comp_device *comp_dev = dev->data->dev_private;
 
+   for (i = 0; i < dev->data->nb_queue_pairs; i++) {
+   ret = nitrox_comp_queue_pair_release(dev, i);
+   if (ret)
+   return ret;
+   }
+
rte_mempool_free(comp_dev->xform_pool);
comp_dev->xform_pool = NULL;
return 0;
@@ -94,13 +106,33 @@ static int nitrox_comp_dev_close(struct rte_compressdev 
*dev)
 static void nitrox_comp_stats_get(struct rte_compressdev *dev,
  struct rte_compressdev_stats *stats)
 {
-   RTE_SET_USED(dev);
-   RTE_SET_USED(stats);
+   int qp_id;
+
+   for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+   struct nitrox_qp *qp = dev->data->queue_pairs[qp_id];
+
+   if (!qp)
+   continue;
+
+   stats->enqueued_count += qp->stats.enqueued_count;
+   stats->dequeued_count += qp->stats.dequeued_count;
+   stats->enqueue_err_count += qp->stats.enqueue_err_count;
+   stats->dequeue_err_count += qp->stats.dequeue_err_count;
+   }
 }
 
 static void nitrox_comp_stats_reset(struct rte_compressdev *dev)
 {
-   RTE_SET_USED(dev);
+   int qp_id;
+
+   for (qp_id = 0; qp_id < dev->data->nb_queue_pairs; qp_id++) {
+   struct nitrox_qp *qp = dev->data->queue_pairs[qp_id];
+
+   if (!qp)
+   continue;
+
+   memset(&qp->stats, 0, sizeof(qp->stats));
+   }
 }
 
 static void nitrox_comp_dev_info_get(struct rte_compressdev *dev,
@@ -121,19 +153,80 @@ static int nitrox_comp_queue_pair_setup(struct 
rte_compressdev *dev,
uint16_t qp_id,
uint32_t max_inflight_ops, int 
socket_id)
 {
-   RTE_SET_USED(dev);
-   RTE_SET_USED(qp_id);
-   RTE_SET_USED(max_inflight_ops);
-   RTE_SET_USED(socket_id);
-   return -1;
+   struct nitrox_comp_device *comp_dev = dev->data->dev_private;
+   struct nitrox_device *ndev = comp_dev->ndev;
+   struct nitrox_qp *qp = NULL;
+   int err;
+
+   NITROX_LOG(DEBUG, "queue %d\n", qp_id);
+   if (qp_id >= ndev->nr_queues) {
+   NITROX_LOG(ERR, "queue %u invalid, max queues supported %d\n",
+  qp_id, ndev->nr_queues);
+   return -EINVAL;
+   }
+
+   if (dev->data->queue_pairs[qp_id]) {
+   err = nitrox_comp_queue_pair_release(dev, qp_id);
+   if (err)
+   return err;
+   }
+
+   qp = rte_zmalloc_socket("nitrox PMD qp", sizeof(*qp),
+   RTE_CACHE_LINE_SIZE,
+   socket_id);
+   if (!qp) {
+   NITROX_LOG(ERR, "Failed to allocate nitrox qp\n");
+   return -ENOMEM;
+   }
+
+   qp->type = NITROX_QUEUE_ZIP;
+   qp->qno = qp_id;
+   err = nitrox_qp_setup(qp, ndev->bar_addr, dev->data->name,
+ max_inflight_ops, ZIP_INSTR_SIZE,
+ socket_id);
+   if (unlikely(err))
+   goto qp_setup_err;
+
+   dev->data->queue_pairs[qp_id] = qp;
+   NITROX_LOG(DEBUG, "queue %d setup done\n", qp_id);
+   return 0;
+
+qp_setup_err:
+   rte_free(qp);
+   return err;
 }
 
 static int nitrox_comp_queue_pair_release(struct rte_compressdev *dev,
  uint16_t qp_id)
 {
-   RTE

[PATCH v5 7/7] compress/nitrox: support stateful request

2024-03-02 Thread Nagadheeraj Rottela
Implement enqueue and dequeue burst operations
for stateful request support.

Signed-off-by: Nagadheeraj Rottela 
---
 drivers/compress/nitrox/nitrox_comp.c|  97 +++-
 drivers/compress/nitrox/nitrox_comp.h|   1 +
 drivers/compress/nitrox/nitrox_comp_reqmgr.c | 550 ---
 drivers/compress/nitrox/nitrox_comp_reqmgr.h |   8 +
 4 files changed, 576 insertions(+), 80 deletions(-)
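
A rough sketch of the application-side stateful flow (public compressdev
API; the xform, op allocation, and error handling are abridged, and
last_chunk is an assumed flag):

void *stream;

/* Ends up in nitrox_comp_stream_create() for this PMD. */
rte_compressdev_stream_create(0 /* dev_id */, &xform, &stream);

op->op_type = RTE_COMP_OP_STATEFUL;
op->stream = stream;
/* RTE_COMP_FLUSH_NONE keeps history across chunks;
 * RTE_COMP_FLUSH_FINAL terminates the stream on the last chunk. */
op->flush_flag = last_chunk ? RTE_COMP_FLUSH_FINAL : RTE_COMP_FLUSH_NONE;
rte_compressdev_enqueue_burst(0, 0, &op, 1);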

diff --git a/drivers/compress/nitrox/nitrox_comp.c 
b/drivers/compress/nitrox/nitrox_comp.c
index 0ea5ed43ed..97d2c4a0e8 100644
--- a/drivers/compress/nitrox/nitrox_comp.c
+++ b/drivers/compress/nitrox/nitrox_comp.c
@@ -32,7 +32,9 @@ static const struct rte_compressdev_capabilities
  RTE_COMP_FF_SHAREABLE_PRIV_XFORM |
  RTE_COMP_FF_OOP_SGL_IN_SGL_OUT |
  RTE_COMP_FF_OOP_SGL_IN_LB_OUT |
- RTE_COMP_FF_OOP_LB_IN_SGL_OUT,
+ RTE_COMP_FF_OOP_LB_IN_SGL_OUT |
+ RTE_COMP_FF_STATEFUL_COMPRESSION |
+ RTE_COMP_FF_STATEFUL_DECOMPRESSION,
.window_size = {
.min = NITROX_COMP_WINDOW_SIZE_MIN,
.max = NITROX_COMP_WINDOW_SIZE_MAX,
@@ -334,6 +336,13 @@ static int nitrox_comp_private_xform_create(struct 
rte_compressdev *dev,
goto err_exit;
}
 
+   nxform->context = NULL;
+   nxform->history_window = NULL;
+   nxform->window_size = 0;
+   nxform->hlen = 0;
+   nxform->exn = 0;
+   nxform->exbits = 0;
+   nxform->bf = true;
return 0;
 err_exit:
memset(nxform, 0, sizeof(*nxform));
@@ -357,6 +366,74 @@ static int nitrox_comp_private_xform_free(struct 
rte_compressdev *dev,
return 0;
 }
 
+static int nitrox_comp_stream_free(struct rte_compressdev *dev, void *stream)
+{
+   struct nitrox_comp_xform *nxform = stream;
+
+   if (unlikely(nxform == NULL))
+   return -EINVAL;
+
+   rte_free(nxform->history_window);
+   nxform->history_window = NULL;
+   rte_free(nxform->context);
+   nxform->context = NULL;
+   return nitrox_comp_private_xform_free(dev, stream);
+}
+
+static int nitrox_comp_stream_create(struct rte_compressdev *dev,
+   const struct rte_comp_xform *xform, void **stream)
+{
+   int err;
+   struct nitrox_comp_xform *nxform;
+   struct nitrox_comp_device *comp_dev = dev->data->dev_private;
+
+   err = nitrox_comp_private_xform_create(dev, xform, stream);
+   if (unlikely(err))
+   return err;
+
+   nxform = *stream;
+   if (xform->type == RTE_COMP_COMPRESS) {
+   uint8_t window_size = xform->compress.window_size;
+
+   if (unlikely(window_size < NITROX_COMP_WINDOW_SIZE_MIN ||
+ window_size > NITROX_COMP_WINDOW_SIZE_MAX)) {
+   NITROX_LOG(ERR, "Invalid window size %d\n",
+  window_size);
+   return -EINVAL;
+   }
+
+   if (window_size == NITROX_COMP_WINDOW_SIZE_MAX)
+   nxform->window_size = NITROX_CONSTANTS_MAX_SEARCH_DEPTH;
+   else
+   nxform->window_size = RTE_BIT32(window_size);
+   } else {
+   nxform->window_size = NITROX_DEFAULT_DEFLATE_SEARCH_DEPTH;
+   }
+
+   nxform->history_window = rte_zmalloc_socket(NULL, nxform->window_size,
+   8, comp_dev->xform_pool->socket_id);
+   if (unlikely(nxform->history_window == NULL)) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+   if (xform->type == RTE_COMP_COMPRESS)
+   return 0;
+
+   nxform->context = rte_zmalloc_socket(NULL,
+   NITROX_DECOMP_CTX_SIZE, 8,
+   comp_dev->xform_pool->socket_id);
+   if (unlikely(nxform->context == NULL)) {
+   err = -ENOMEM;
+   goto err_exit;
+   }
+
+   return 0;
+err_exit:
+   nitrox_comp_stream_free(dev, *stream);
+   return err;
+}
+
 static int nitrox_enq_single_op(struct nitrox_qp *qp, struct rte_comp_op *op)
 {
struct nitrox_softreq *sr;
@@ -371,8 +448,12 @@ static int nitrox_enq_single_op(struct nitrox_qp *qp, 
struct rte_comp_op *op)
return err;
}
 
-   nitrox_qp_enqueue(qp, nitrox_comp_instr_addr(sr), sr);
-   return 0;
+   if (op->status == RTE_COMP_OP_STATUS_SUCCESS)
+   err = nitrox_qp_enqueue_sr(qp, sr);
+   else
+   nitrox_qp_enqueue(qp, nitrox_comp_instr_addr(sr), sr);
+
+   return err;
 }
 
 static uint16_t nitrox_comp_dev_enq_burst(void *queue_pair,
@@ -382,6 +463,7 @@ static uint16_t nitrox_comp_dev_enq_burst(void *queue_pair,
struct

RE: [PATCH v5 0/4] add pointer compression API

2024-03-02 Thread Morten Brørup
> From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
> Sent: Friday, 1 March 2024 20.57
> 
> > On Mar 1, 2024, at 5:16 AM, Morten Brørup 
> wrote:
> >
> >> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> >> Sent: Thursday, 22 February 2024 17.16
> >>
> >>> For some reason your email is not visible to me, even though it's in
> the
> >>> archive.
> >>
> >> No worries.
> >>
> >>>
> >>> On 02/11/2024 16:32, Konstantin Ananyev konstantin.v.ananyev wrote:
> >>>
> >>>> From one side the code itself is very small and straightforward,
> >>>> from other side - it is not clear to me what is intended usage for it
> >>>> within DPDK and its appliances?
> >>>> Konstantin
> >>>
> >>> The intended usage is explained in the cover email (see below) and
> >> demonstrated
> >>> in the test supplied in the following patch - when sending arrays of
> >> pointers
> >>> between cores as it happens in a forwarding example.
> >>
> >> Yes, I saw that. The thing is that test is a 'synthetic' one.
> >> My question was about how do you expect people to use it in more
> realistic
> >> scenarios?
> >> Let say user has a bunch of mbuf pointers, possibly from different
> mempools.
> >> How he can use this API: how to deduce the base pointer for all of
> them and
> >> what to
> >> do if it can't be done?
> >
> > I share Konstantin's concerns with this feature.
> >
> > If we want to compress mbuf pointers in applications with a few mbuf
> pools, e.g. an mbuf pool per CPU socket, the compression algorithm would
> be different.
> This feature is targeted for pipeline mode of applications. We see many
> customers using pipeline mode. This feature helps in reducing the cost
> of transferring the packets between cores by reducing the copies
> involved.

OK. I agree this is a very common use case, worth optimizing for.

> For an application with multiple pools, it depends on how the
> applications are using multiple pools. But, if there is a bunch of
> packets belonging to multiple mempools, compressing those mbufs may not
> be possible. But if those mbufs are grouped per mempool and are
> transferred on different queues, then it is possible. Hence the APIs are
> implemented very generically.

OK.


And for a possible future extension:
If there are very few mbuf pools, such as 2 or 4, it might be possible to 
develop similar functions to efficiently compress/decompress pointers in a 
shared queue. E.g. the highest bits could identify the pool, and the lowest 
bits could identify the pointer offset (with bit shift) in that pool. Or if the 
pools are less than 4 GB each, the lowest bits could identify the pool, and be 
masked away for getting the offset (no bit shift), taking advantage of lowest 
bits of the pointer address always being zero anyway.
I am mentioning this, so it can be taken into consideration when designing the 
pointer compression library and its API. I don't expect it to be implemented at 
this time. Also, it might not lead to any changes of the already proposed 
pointer compression API - just give it a few thoughts.
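
A minimal sketch of the two-pool variant described above (hypothetical
helpers, not part of the proposed API; assumes each pool's memory area fits
in 2^31 bytes so the top bit of a 32-bit token can select the pool):

#include <stdint.h>
#include <stddef.h>

static inline uint32_t
ptr_compress(const void *ptr, const uintptr_t base[2], const size_t len[2])
{
    uintptr_t p = (uintptr_t)ptr;
    /* Pool 0 if the pointer falls inside its area, else pool 1. */
    unsigned int pool = (p - base[0]) < len[0] ? 0 : 1;

    return ((uint32_t)pool << 31) | (uint32_t)(p - base[pool]);
}

static inline void *
ptr_decompress(uint32_t token, const uintptr_t base[2])
{
    unsigned int pool = token >> 31;

    return (void *)(base[pool] + (token & 0x7fffffffu));
}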


+1 for the simplicity of the functions and the API in this patch.
E.g. the bit_shift is most likely a known constant at build time, so inlining
allows the compiler to optimize for this. In many use cases, it might be 1, and 
thus optimized away.

> 
> >
> > I would like to add:
> > If we want to offer optimizations specifically for applications with a
> single mbuf pool, I think it should be considered in a system-wide
> context to determine if performance could be improved in more areas.
> > E.g. removing the pool field from the rte_mbuf structure might free up
> space to move hot fields from the second cache line to the first, so the
> second cache line rarely needs to be touched. (As an alternative to
> removing the pool field, it could be moved to the second cache line,
> only to be used if the global "single mbuf pool" is NULL.)
> Agree on this. The feedback I have received is on similar lines, many
> are using simple features. I also received feedback that 90% of the
> applications use less than 4GB of memory for mbuf and burst sizes are up
> to 256.

Interesting.
Keeping the most common use cases in mind is important for steering DPDK in the 
right direction as it evolves.

If a very large percentage of use cases use one mbuf pool of less than 4 GB, we 
should seriously consider the broader opportunity for optimizing by generally 
referencing mbufs by an uint32_t pointer offset (no bit shifting) instead of by 
pointers.

> 
> >
> > On the other hand, I agree that pointer compression can be useful for
> some applications, so we should accept it.
> >
> > However, pointer compression has nothing to do with the underlying
> hardware or operating system, so it does not belong in the EAL (which is
> already too bloated). It should be a separate library.
> Yes, this is generic (though there is SIMD code). We could move it out
> of EAL.

Thank you.

I think that a misconception that a

Re: [PATCH v2 00/71] replace use of fixed size rte_memcpy

2024-03-02 Thread Mattias Rönnblom

On 2024-03-01 18:14, Stephen Hemminger wrote:

The DPDK has a lot of "cargo cult" usage of rte_memcpy.
This patch set replaces cases where rte_memcpy is used with a fixed,
constant size.

Typical example is:
rte_memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
which can be replaced with:
memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);

This has two benefits. Gcc (and clang) are smart enough that for
all small fixed size values, they just generate the necessary instructions
to do it inline. It also means that fortify, Coverity, and ASAN
analyzers can check these memcpy's.



Instead of smearing out the knowledge of when to use rte_memcpy(), and 
when to use memcpy() across the code base, wouldn't it be better to 
*always* call rte_memcpy() in the fast path, and leave the policy 
decision to the rte_memcpy() implementation?


In rte_memcpy(), add:
if (__builtin_constant_p(n) && n < RTE_LIBC_MEMCPY_SIZE_THRESHOLD)
memcpy(/../);
..or something to that effect.

Could you have a #ifdef for dumb static analysis tools? To make it look 
like you are always using memcpy()?
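
Something like the following, perhaps (the guard macro name is made up, not
an existing DPDK build option):

#include <string.h>

#ifdef RTE_STATIC_ANALYSIS
/* Let fortify/Coverity/ASAN model the copy as plain memcpy(). */
#define rte_memcpy(dst, src, n) memcpy((dst), (src), (n))
#endif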



So faster, better, safer.



What is "faster" based on?

My experience with replacing rte_memcpy() with memcpy() (or vice versa) 
is mixed.


I've also tried just dropping the DPDK-custom memcpy() implementation 
altogether, and that caused a performance drop (in a particular app, on 
a particular compiler and CPU).



The first patch is a simple coccinelle script to do the replacement
and the rest are the results broken out by module.

The coccinelle script can be used again to make sure more bad
usage doesn't creep in with new drivers.

v2 - fix CI failure on some OS by adding string.h
  remove rte_memcpy.h if no longer used

Stephen Hemminger (71):
   cocci/rte_memcpy: add script to eliminate fixed size rte_memcpy
   eal: replace use of fixed size rte_memcpy
   ethdev: replace use of fixed size rte_memcpy
   eventdev: replace use of fixed size rte_memcpy
   cryptodev: replace use of fixed size rte_memcpy
   ip_frag: replace use of fixed size rte_memcpy
   net: replace use of fixed size rte_memcpy
   lpm: replace use of fixed size rte_memcpy
   node: replace use of fixed size rte_memcpy
   pdcp: replace use of fixed size rte_memcpy
   pipeline: replace use of fixed size rte_memcpy
   rib: replace use of fixed size rte_memcpy
   security: replace use of fixed size rte_memcpy
   net/mlx5: replace use of fixed size rte_memcpy
   net/nfp: replace use of fixed size rte_memcpy
   net/ngbe: replace use of fixed size rte_memcpy
   net/null: replace use of fixed size rte_memcpy
   net/pcap: replace use of fixed size rte_memcpy
   net/sfc: replace use of fixed size rte_memcpy
   net/tap: replace use of fixed size rte_memcpy
   net/txgbe: replace use of fixed size rte_memcpy
   raw/ifpga: replace use of fixed size rte_memcpy
   raw/skeleton: replace use of fixed size rte_memcpy
   net/hns3: replace use of fixed size rte_memcpy
   net/i40e: replace use of fixed size rte_memcpy
   net/iavf: replace use of fixed size rte_memcpy
   net/ice: replace use of fixed size rte_memcpy
   net/idpf: replace use of fixed size rte_memcpy
   net/ipn3ke: replace use of fixed size rte_memcpy
   net/ixgbe: replace use of fixed size rte_memcpy
   net/memif: replace use of fixed size rte_memcpy
   net/qede: replace use of fixed size rte_memcpy
   baseband/acc: replace use of fixed size rte_memcpy
   baseband/la12xx: replace use of fixed size rte_memcpy
   common/idpf: replace use of fixed size rte_memcpy
   common/qat: replace use of fixed size rte_memcpy
   compress/qat: replace use of fixed size rte_memcpy
   crypto/ccp: replace use of fixed size rte_memcpy
   crypto/cnxk: replace use of fixed size rte_memcpy
   crypto/dpaa_sec: replace use of fixed size rte_memcpy
   crypto/ipsec_mb: replace use of fixed size rte_memcpy
   crypto/qat: replace use of fixed size rte_memcpy
   crypto/scheduler: replace use of fixed size rte_memcpy
   event/cnxk: replace use of fixed size rte_memcpy
   event/dlb2: replace use of fixed size rte_memcpy
   event/dpaa2: replace use of fixed size rte_memcpy
   event/octeontx: replace use of fixed size rte_memcpy
   mempool/dpaa: replace use of fixed size rte_memcpy
   mempool/dpaa2: replace use of fixed size rte_memcpy
   ml/cnxk: replace use of fixed size rte_memcpy
   net/af_xdp: replace use of fixed size rte_memcpy
   net/avp: replace use of fixed size rte_memcpy
   net/axgbe: replace use of fixed size rte_memcpy
   net/bnx2x: replace use of fixed size rte_memcpy
   net/bnxt: replace use of fixed size rte_memcpy
   net/bonding: replace use of fixed size rte_memcpy
   net/cnxk: replace use of fixed size rte_memcpy
   net/cpfl: replace use of fixed size rte_memcpy
   net/cxgbe: replace use of fixed size rte_memcpy
   net/dpaa2: replace use of fixed size rte_memcpy
   net/e1000: replace use of fixed size rte_memcpy
   net/enic: replace use of fixed size rte_memcpy
   net/failsafe: replace use of fixed 

Re: [PATCH v2 01/71] cocci/rte_memcpy: add script to eliminate fixed size rte_memcpy

2024-03-02 Thread Mattias Rönnblom

On 2024-03-01 18:14, Stephen Hemminger wrote:

Rte_memcpy should not be used for the simple case of copying
a fixed size structure because it is slower and will hide problems
from code analysis tools. Coverity, fortify and other analyzers
special case memcpy().

Gcc (and Clang) are smart enough to inline copies which
will be faster.



Are you suggesting rte_memcpy() calls aren't inlined?


Signed-off-by: Stephen Hemminger 
---
  devtools/cocci/rte_memcpy.cocci | 11 +++
  1 file changed, 11 insertions(+)
  create mode 100644 devtools/cocci/rte_memcpy.cocci

diff --git a/devtools/cocci/rte_memcpy.cocci b/devtools/cocci/rte_memcpy.cocci
new file mode 100644
index ..fa1038fc066d
--- /dev/null
+++ b/devtools/cocci/rte_memcpy.cocci
@@ -0,0 +1,11 @@
+//
+// rte_memcpy should not be used for simple fixed size structures
+// because compilers are smart enough to inline these.
+//


What do you do in code where it's not known if the size will be constant 
or not?



+@@
+expression src, dst; constant size;
+@@
+(
+- rte_memcpy(dst, src, size)
++ memcpy(dst, src, size)
+)


Re: [PATCH v2 00/71] replace use of fixed size rte_memcpy

2024-03-02 Thread Mattias Rönnblom

On 2024-03-02 12:14, Mattias Rönnblom wrote:

On 2024-03-01 18:14, Stephen Hemminger wrote:

The DPDK has a lot of "cargo cult" usage of rte_memcpy.
This patch set replaces cases where rte_memcpy is used with a fixed
size constant size.

Typical example is:
rte_memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
which can be replaced with:
memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);

This has two benefits. Gcc (and clang) are smart enough that for
all small fixed size values, they just generate the necessary 
instructions

to do it inline. It also means that fortify, Coverity, and ASAN
analyzers can check these memcpy's.



Instead of smearing out the knowledge of when to use rte_memcpy(), and 
when to use memcpy() across the code base, wouldn't it be better to 
*always* call rte_memcpy() in the fast path, and leave the policy 
decision to the rte_memcpy() implementation?


In rte_memcpy(), add:
if (__builtin_constant_p(n) && n < RTE_LIBC_MEMCPY_SIZE_THRESHOLD)
 memcpy(/../);
..or something to that effect.

Could you have a #ifdef for dumb static analysis tools? To make it look 
like you are always using memcpy()?



So faster, better, safer.



What is "faster" based on?



I ran some DSW benchmarks, and if you add

diff --git a/lib/eal/x86/include/rte_memcpy.h 
b/lib/eal/x86/include/rte_memcpy.h

index 72a92290e0..64cd82d78d 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -862,6 +862,11 @@ rte_memcpy_aligned(void *dst, const void *src, 
size_t n)

 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
+   if (__builtin_constant_p(n) && n <= 32) {
+   memcpy(dst, src, n);
+   return dst;
+   }
+
if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
return rte_memcpy_aligned(dst, src, n);
else

...the overhead increases from roughly 48 core clock cycles/event to 59 
cc/event. The same for "n < 128". (I'm not sure what counts as "small" 
here.)


So something rte_memcpy() does for small and constant memory copies does 
make things go *significantly* faster, at least in certain cases.


(Linux, GCC 11.4, Intel Gracemont.)

My experience with replacing rte_memcpy() with memcpy() (or vice versa) 
is mixed.


I've also tried just dropping the DPDK-custom memcpy() implementation 
altogether, and that caused a performance drop (in a particular app, on 
a particular compiler and CPU).



The first patch is a simple coccinelle script to do the replacement
and the rest are the results broken out by module.

The coccinelle script can be used again to make sure more bad
usage doesn't creep in with new drivers.

v2 - fix CI failure on some OS by adding string.h
  remove rte_memcpy.h if no longer used

Stephen Hemminger (71):
   cocci/rte_memcpy: add script to eliminate fixed size rte_memcpy
   eal: replace use of fixed size rte_memcpy
   ethdev: replace use of fixed size rte_memcpy
   eventdev: replace use of fixed size rte_memcpy
   cryptodev: replace use of fixed size rte_memcpy
   ip_frag: replace use of fixed size rte_memcpy
   net: replace use of fixed size rte_memcpy
   lpm: replace use of fixed size rte_memcpy
   node: replace use of fixed size rte_memcpy
   pdcp: replace use of fixed size rte_memcpy
   pipeline: replace use of fixed size rte_memcpy
   rib: replace use of fixed size rte_memcpy
   security: replace use of fixed size rte_memcpy
   net/mlx5: replace use of fixed size rte_memcpy
   net/nfp: replace use of fixed size rte_memcpy
   net/ngbe: replace use of fixed size rte_memcpy
   net/null: replace use of fixed size rte_memcpy
   net/pcap: replace use of fixed size rte_memcpy
   net/sfc: replace use of fixed size rte_memcpy
   net/tap: replace use of fixed size rte_memcpy
   net/txgbe: replace use of fixed size rte_memcpy
   raw/ifpga: replace use of fixed size rte_memcpy
   raw/skeleton: replace use of fixed size rte_memcpy
   net/hns3: replace use of fixed size rte_memcpy
   net/i40e: replace use of fixed size rte_memcpy
   net/iavf: replace use of fixed size rte_memcpy
   net/ice: replace use of fixed size rte_memcpy
   net/idpf: replace use of fixed size rte_memcpy
   net/ipn3ke: replace use of fixed size rte_memcpy
   net/ixgbe: replace use of fixed size rte_memcpy
   net/memif: replace use of fixed size rte_memcpy
   net/qede: replace use of fixed size rte_memcpy
   baseband/acc: replace use of fixed size rte_memcpy
   baseband/la12xx: replace use of fixed size rte_memcpy
   common/idpf: replace use of fixed size rte_memcpy
   common/qat: replace use of fixed size rte_memcpy
   compress/qat: replace use of fixed size rte_memcpy
   crypto/ccp: replace use of fixed size rte_memcpy
   crypto/cnxk: replace use of fixed size rte_memcpy
   crypto/dpaa_sec: replace use of fixed size rte_memcpy
   crypto/ipsec_mb: replace use of fixed size rte_memcpy
   crypto/qat: replace use of fixed size rte_memcpy
   crypto/scheduler: replace use

RE: [PATCH v2 00/71] replace use of fixed size rte_memcpy

2024-03-02 Thread Morten Brørup
> From: Mattias Rönnblom [mailto:hof...@lysator.liu.se]
> Sent: Saturday, 2 March 2024 13.02
> 
> On 2024-03-02 12:14, Mattias Rönnblom wrote:
> > On 2024-03-01 18:14, Stephen Hemminger wrote:
> >> The DPDK has a lot of "cargo cult" usage of rte_memcpy.
> >> This patch set replaces cases where rte_memcpy is used with a fixed
> >> size constant size.
> >>
> >> Typical example is:
> >> rte_memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
> >> which can be replaced with:
> >> memcpy(mac_addrs, mac.addr_bytes, RTE_ETHER_ADDR_LEN);
> >>
> >> This has two benefits. Gcc (and clang) are smart enough that for
> >> all small fixed size values, they just generate the necessary
> >> instructions
> >> to do it inline. It also means that fortify, Coverity, and ASAN
> >> analyzers can check these memcpy's.
> >>
> >
> > Instead of smearing out the knowledge of when to use rte_memcpy(), and
> > when to use memcpy() across the code base, wouldn't it be better to
> > *always* call rte_memcpy() in the fast path, and leave the policy
> > decision to the rte_memcpy() implementation?
> >
> > In rte_memcpy(), add:
> > if (__builtin_constant_p(n) && n < RTE_LIBC_MEMCPY_SIZE_THRESHOLD)
> >  memcpy(/../);
> > ..or something to that effect.
> >
> > Could you have a #ifdef for dumb static analysis tools? To make it
> look
> > like you are always using memcpy()?
> >
> >> So faster, better, safer.
> >>
> >
> > What is "faster" based on?
> >
> 
> I ran some DSW benchmarks, and if you add
> 
> diff --git a/lib/eal/x86/include/rte_memcpy.h
> b/lib/eal/x86/include/rte_memcpy.h
> index 72a92290e0..64cd82d78d 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -862,6 +862,11 @@ rte_memcpy_aligned(void *dst, const void *src,
> size_t n)
>   static __rte_always_inline void *
>   rte_memcpy(void *dst, const void *src, size_t n)
>   {
> +   if (__builtin_constant_p(n) && n <= 32) {
> +   memcpy(dst, src, n);
> +   return dst;
> +   }
> +
>  if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
>  return rte_memcpy_aligned(dst, src, n);
>  else
> 
> ...the overhead increases from roughly 48 core clock cycles/event to 59
> cc/event. The same for "n < 128". (I'm not sure what counts as "small"
> here.)

Thank you for digging deep into this, Mattias.
Your performance data are very interesting.

> 
> So something rte_memcpy() does for small and constant memory copies does
> make things go *significantly* faster, at least in certain cases.

Interesting.
Perhaps something with aligned copies...
The performance benefits of well-known alignment were something I was looking 
into when working on non-temporal memcpy functions, because non-temporal 
load/store has some alignment requirements. (NB: NT memcpy development is on 
hold until I get more time to work on it again.)
Passing alignment information as a flag to an extended memcpy, to be used by 
__builtin_constant_p(n), could speed up copying when alignment is known by the 
developer, but impossible for the compiler to determine at build time.
The rte_memcpy() checks for one specific alignment criteria at runtime. I 
suppose the branch predictor makes it execute nearly as fast as if determined 
at build time, but it still consumes a lot more instruction memory.
Perhaps something else...?
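
To make the idea concrete, one possible shape of such a flag (illustrative
only, not a proposed API; the macro and function names are assumptions):

#include <string.h>

#define COPY_F_ALIGN16 1u   /* caller promises 16-byte aligned dst/src */

static inline void *
memcpy_ex(void *dst, const void *src, size_t n, unsigned int flags)
{
    /* With a compile-time constant flag, the compiler can pick an
     * aligned expansion without any runtime alignment test. */
    if (flags & COPY_F_ALIGN16) {
        dst = __builtin_assume_aligned(dst, 16);
        src = __builtin_assume_aligned(src, 16);
    }
    return memcpy(dst, src, n);
}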

> 
> (Linux, GCC 11.4, Intel Gracemont.)
> 
> > My experience with replacing rte_memcpy() with memcpy() (or vice
> versa)
> > is mixed.
> >
> > I've also tried just dropping the DPDK-custom memcpy() implementation
> > altogether, and that caused a performance drop (in a particular app,
> on
> > a particular compiler and CPU).

I guess the compilers are just not where we want them to be yet.

I don't mind generally replacing rte_memcpy() with memcpy() in the control 
plane.
But we should use whatever is more efficient in the data plane.

We must also keep in mind that DPDK supports old distros with old compilers. We 
should not remove a superfluous hand crafted optimization if a supported old 
compiler hasn't caught up with it yet, i.e. if it isn't superfluous on some of 
the old compilers supported by DPDK.



[RFC 1/7] eal: extend bit manipulation functions

2024-03-02 Thread Mattias Rönnblom
Add functionality to test, set, clear, and assign the value to
individual bits in 32-bit or 64-bit words.

These functions have no implications on memory ordering or atomicity,
do not use volatile, and thus do not prevent any compiler
optimizations.
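
A minimal sketch of what the 32-bit accessors presumably look like (the
actual definitions are further down in the patch and not quoted here; shown
only to illustrate the intended semantics):

#include <stdbool.h>
#include <stdint.h>

static inline bool
rte_bit_test32(const uint32_t *addr, unsigned int nr)
{
    return *addr & (UINT32_C(1) << nr);
}

static inline void
rte_bit_set32(uint32_t *addr, unsigned int nr)
{
    *addr |= UINT32_C(1) << nr;
}

static inline void
rte_bit_clear32(uint32_t *addr, unsigned int nr)
{
    *addr &= ~(UINT32_C(1) << nr);
}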

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 194 ++-
 1 file changed, 192 insertions(+), 2 deletions(-)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 449565eeae..9a368724d5 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -2,6 +2,7 @@
  * Copyright(c) 2020 Arm Limited
  * Copyright(c) 2010-2019 Intel Corporation
  * Copyright(c) 2023 Microsoft Corporation
+ * Copyright(c) 2024 Ericsson AB
  */
 
 #ifndef _RTE_BITOPS_H_
@@ -11,8 +12,9 @@
  * @file
  * Bit Operations
  *
- * This file defines a family of APIs for bit operations
- * without enforcing memory ordering.
+ * This file provides functionality for low-level, single-word
+ * arithmetic and bit-level operations, such as counting or
+ * setting individual bits.
  */
 
 #include 
@@ -105,6 +107,194 @@ extern "C" {
 #define RTE_FIELD_GET64(mask, reg) \
((typeof(mask))(((reg) & (mask)) >> rte_ctz64(mask)))
 
+/**
+ * Test if a particular bit in a 32-bit word is set.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to query.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+static inline bool
+rte_bit_test32(const uint32_t *addr, unsigned int nr);
+
+/**
+ * Set bit in 32-bit word.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to '1'.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ */
+static inline void
+rte_bit_set32(uint32_t *addr, unsigned int nr);
+
+/**
+ * Clear bit in 32-bit word.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to '0'.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ */
+static inline void
+rte_bit_clear32(uint32_t *addr, unsigned int nr);
+
+/**
+ * Assign a value to bit in a 32-bit word.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to the value indicated by @c value.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+static inline void
+rte_bit_assign32(uint32_t *addr, unsigned int nr, bool value)
+{
+   if (value)
+   rte_bit_set32(addr, nr);
+   else
+   rte_bit_clear32(addr, nr);
+}
+
+/**
+ * Test if a particular bit in a 64-bit word is set.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 64-bit word to query.
+ * @param nr
+ *   The index of the bit (0-63).
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+static inline bool
+rte_bit_test64(const uint64_t *addr, unsigned int nr);
+
+/**
+ * Set bit in 64-bit word.
+ *
+ * Set bit specified by @c nr in the 64-bit word pointed to by
+ * @c addr to '1'.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 64-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-63).
+ */
+static inline void
+rte_bit_set64(uint64_t *addr, unsigned int nr);
+
+/**
+ * Clear bit in 64-bit word.
+ *
+ * Set bit specified by @c nr in the 64-bit word pointed to by
+ * @c addr to '0'.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 64-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-63).
+ */
+static inline void
+rte_bit_clear64(uint64_t *addr, unsigned int nr);
+
+/**
+ * Assign a value to bit in a 64-bit word.
+ *
+ * Set bit specified by @c nr in the 64-bit word pointed to by
+ * @c addr to the value indicated by @c value.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 64-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-63).
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+static inline void
+rte_bit_assign64(uint64_t *addr, unsigned int nr, bool value)
+{
+ 

[RFC 4/7] eal: add generic once-type bit operations macros

2024-03-02 Thread Mattias Rönnblom
Add macros for once-type bit operations operating on both 32-bit and
64-bit words by means of C11 generic selection.
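
A sketch of where the exactly-one-load guarantee matters (illustrative;
do_work() and the stop-bit convention are assumptions, not from the patch):

/* Poll a stop bit set by another thread; the once-guarantee keeps the
 * compiler from hoisting the load out of the loop, without volatile. */
static void
worker_loop(uint64_t *control_word)
{
    while (!rte_bit_once_test(control_word, 0 /* stop bit */))
        do_work();
}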

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 101 +++
 1 file changed, 101 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 3118c51748..450334c751 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -188,6 +188,107 @@ extern "C" {
 uint32_t *: rte_bit_assign32,  \
 uint64_t *: rte_bit_assign64)(addr, nr, value)
 
+/**
+ * Test exactly once if a particular bit in a word is set.
+ *
+ * Generic selection macro to test, exactly once, the value of a bit
+ * in a 32-bit or 64-bit word. The type of operation depends on the
+ * type of the @c addr parameter.
+ *
+ * This macro is guaranteed to result in exactly one memory load. See
+ * rte_bit_once_test32() for more information and use cases for the
+ * "once" class of functions.
+ *
+ * rte_bit_once_test() does not give any guarantees in regards to
+ * memory ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to query.
+ * @param nr
+ *   The index of the bit.
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+#define rte_bit_once_test(addr, nr)\
+   _Generic((addr),\
+uint32_t *: rte_bit_once_test32,   \
+uint64_t *: rte_bit_once_test64)(addr, nr)
+
+/**
+ * Set bit in word exactly once.
+ *
+ * Set bit specified by @c nr in the word pointed to by @c addr to '1'
+ * exactly once.
+ *
+ * This macro is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit set operation.
+ *
+ * See rte_bit_once_test32() for more information and use cases for
+ * the "once" class of functions.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_once_set(addr, nr) \
+   _Generic((addr),\
+uint32_t *: rte_bit_once_set32,\
+uint64_t *: rte_bit_once_set64)(addr, nr)
+
+/**
+ * Clear bit in word exactly once.
+ *
+ * Set bit specified by @c nr in the word pointed to by @c addr to '0'
+ * exactly once.
+ *
+ * This macro is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit clear operation.
+ *
+ * See rte_bit_once_test32() for more information and use cases for
+ * the "once" class of functions.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_once_clear(addr, nr)   \
+   _Generic((addr),\
+uint32_t *: rte_bit_once_clear32,  \
+uint64_t *: rte_bit_once_clear64)(addr, nr)
+
+/**
+ * Assign a value to bit in a word exactly once.
+ *
+ * Set bit specified by @c nr in the word pointed to by @c addr to the
+ * value indicated by @c value exactly once.
+ *
+ * This macro is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit set or clear
+ * operation.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+#define rte_bit_once_assign(addr, nr, value)   \
+   _Generic((addr),\
+uint32_t *: rte_bit_once_assign32, \
+uint64_t *: rte_bit_once_assign64)(addr, nr, value)
+
 /**
  * Test if a particular bit in a 32-bit word is set.
  *
-- 
2.34.1

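For illustration, a minimal sketch of how the generic "once" macros
might be used to poll a device status word (the bit index and helper
name are hypothetical; only the rte_bit_once_*() names come from this
RFC):

#include <stdint.h>
#include <rte_bitops.h>
#include <rte_pause.h>

#define STATUS_READY_BIT 3 /* hypothetical device-specific bit */

static void
wait_until_ready(uint32_t *status)
{
	/* Each iteration performs an actual load of *status; an
	 * equivalent rte_bit_test() loop could legally be hoisted
	 * out of the loop by the compiler.
	 */
	while (!rte_bit_once_test(status, STATUS_READY_BIT))
		rte_pause();
}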


[RFC 2/7] eal: add generic bit manipulation macros

2024-03-02 Thread Mattias Rönnblom
Add bit-level test/set/clear/assign macros operating on both 32-bit
and 64-bit words by means of C11 generic selection.

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 81 
 1 file changed, 81 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 9a368724d5..afd0f11033 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -107,6 +107,87 @@ extern "C" {
 #define RTE_FIELD_GET64(mask, reg) \
((typeof(mask))(((reg) & (mask)) >> rte_ctz64(mask)))
 
+/**
+ * Test bit in word.
+ *
+ * Generic selection macro to test the value of a bit in a 32-bit or
+ * 64-bit word. The type of operation depends on the type of the @c
+ * addr parameter.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_test(addr, nr) \
+   _Generic((addr),\
+uint32_t *: rte_bit_test32,\
+uint64_t *: rte_bit_test64)(addr, nr)
+
+/**
+ * Set bit in word.
+ *
+ * Generic selection macro to set a bit in a 32-bit or 64-bit
+ * word. The type of operation depends on the type of the @c addr
+ * parameter.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_set(addr, nr)  \
+   _Generic((addr),\
+uint32_t *: rte_bit_set32, \
+uint64_t *: rte_bit_set64)(addr, nr)
+
+/**
+ * Clear bit in word.
+ *
+ * Generic selection macro to clear a bit in a 32-bit or 64-bit
+ * word. The type of operation depends on the type of the @c addr
+ * parameter.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ */
+#define rte_bit_clear(addr, nr)\
+   _Generic((addr),\
+uint32_t *: rte_bit_clear32,   \
+uint64_t *: rte_bit_clear64)(addr, nr)
+
+/**
+ * Assign a value to a bit in word.
+ *
+ * Generic selection macro to assign a value to a bit in a 32-bit or 64-bit
+ * word. The type of operation depends on the type of the @c addr parameter.
+ *
+ * This macro does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+#define rte_bit_assign(addr, nr, value)\
+   _Generic((addr),\
+uint32_t *: rte_bit_assign32,  \
+uint64_t *: rte_bit_assign64)(addr, nr, value)
+
 /**
  * Test if a particular bit in a 32-bit word is set.
  *
-- 
2.34.1

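A brief usage sketch (assuming the macro names above): the same call
sites work for 32-bit and 64-bit words, with _Generic dispatching on
the pointer type.

#include <stdint.h>
#include <rte_bitops.h>

static void
bitmap_example(void)
{
	uint32_t w32 = 0;
	uint64_t w64 = 0;

	rte_bit_set(&w32, 7);           /* dispatches to rte_bit_set32() */
	rte_bit_set(&w64, 55);          /* dispatches to rte_bit_set64() */
	rte_bit_assign(&w32, 7, false); /* same effect as rte_bit_clear() */

	if (rte_bit_test(&w64, 55))
		rte_bit_clear(&w64, 55);
}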


[RFC 3/7] eal: add bit manipulation functions which read or write once

2024-03-02 Thread Mattias Rönnblom
Add bit test/set/clear/assign functions which prevent certain
compiler optimizations and guarantee that program-level memory loads
and/or stores will actually occur.

These functions are useful when interacting with memory-mapped
hardware devices.

The "once" family of functions does not promise atomicity and provides
no memory ordering guarantees beyond the C11 relaxed memory model.

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 229 +++
 1 file changed, 229 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index afd0f11033..3118c51748 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -338,6 +338,227 @@ rte_bit_assign64(uint64_t *addr, unsigned int nr, bool value)
rte_bit_clear64(addr, nr);
 }
 
+/**
+ * Test exactly once if a particular bit in a 32-bit word is set.
+ *
+ * This function is guaranteed to result in exactly one memory load
+ * (e.g., it may not be eliminated or merged by the compiler).
+ *
+ * \code{.c}
+ * rte_bit_once_set32(addr, 17);
+ * if (rte_bit_once_test32(addr, 17)) {
+ * ...
+ * }
+ * \endcode
+ *
+ * In the above example, rte_bit_once_set32() may not be removed by
+ * the compiler, which would be allowed had rte_bit_set32() and
+ * rte_bit_test32() been used.
+ *
+ * \code{.c}
+ * while (rte_bit_once_test32(addr, 17))
+ * ;
+ * \endcode
+ *
+ * In case rte_bit_test32(addr, 17) was used instead, the resulting
+ * object code could (and in many cases would) be replaced with
+ * the equivalent of
+ * \code{.c}
+ * if (rte_bit_test32(addr, 17)) {
+ *   for (;;) // spin forever
+ *   ;
+ * }
+ * \endcode
+ *
+ * The regular bit operations (e.g., rte_bit_test32()) should be
+ * preferred over the "once" family of operations (e.g.,
+ * rte_bit_once_test32()), since the latter may prevent optimizations
+ * crucial for run-time performance.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering (except ordering from the same thread to the same memory
+ * location) or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to query.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+static inline bool
+rte_bit_once_test32(const volatile uint32_t *addr, unsigned int nr);
+
+/**
+ * Set bit in 32-bit word exactly once.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to '1'.
+ *
+ * This function is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit set operation.
+ *
+ * See rte_bit_once_test32() for more information and use cases for
+ * the "once" class of functions.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ */
+static inline void
+rte_bit_once_set32(volatile uint32_t *addr, unsigned int nr);
+
+/**
+ * Clear bit in 32-bit word exactly once.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by @c addr
+ * to '0'.
+ *
+ * This function is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit clear operation.
+ *
+ * See rte_bit_once_test32() for more information and use cases for the
+ * "once" class of functions.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ */
+static inline void
+rte_bit_once_clear32(volatile uint32_t *addr, unsigned int nr);
+
+/**
+ * Assign a value to bit in a 32-bit word exactly once.
+ *
+ * Set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to the value indicated by @c value.
+ *
+ * This function is guaranteed to result in exactly one memory load
+ * and exactly one memory store, *or* an atomic bit set or clear operation.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ */
+static inline void
+rte_bit_once_assign32(volatile uint32_t *addr, unsigned int nr, bool value)
+{
+   if (value)
+   rte_bit_once_set32(addr, nr);
+   else
+   rte_bit_once_clear32(addr, nr);
+}
+
+/**
+ * Test exactly once if a particular bit in a 64-bit word is set.
+ *
+ * This function is guaranteed to result in exactly one memory load.
+ * See rte_bit_once_test32() for more information and use cases for the
+ * "once" class of functions.
+ *
+ * rte_bit_once_test64() does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param 

[RFC 5/7] eal: add atomic bit operations

2024-03-02 Thread Mattias Rönnblom
Add atomic bit test/set/clear/assign and test-and-set/clear functions.

All atomic bit functions allow (and indeed, require) the caller to
specify a memory order.

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 337 +++
 1 file changed, 337 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 450334c751..7eb08bc768 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -20,6 +20,7 @@
 #include 
 
 #include 
+#include 
 
 #ifdef __cplusplus
 extern "C" {
@@ -706,6 +707,342 @@ __RTE_GEN_BIT_TEST(rte_bit_once_test64, 64, volatile)
 __RTE_GEN_BIT_SET(rte_bit_once_set64, 64, volatile)
 __RTE_GEN_BIT_CLEAR(rte_bit_once_clear64, 64, volatile)
 
+/**
+ * Test if a particular bit in a 32-bit word is set with a particular
+ * memory order.
+ *
+ * Test a bit with the resulting memory load ordered as per the
+ * specified memory order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to query.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+static inline bool
+rte_bit_atomic_test32(const uint32_t *addr, unsigned int nr, int memory_order);
+
+/**
+ * Atomically set bit in 32-bit word.
+ *
+ * Atomically set bit specified by @c nr in the 32-bit word pointed to by
+ * @c addr to '1', with the memory ordering as specified by @c
+ * memory_order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+static inline void
+rte_bit_atomic_set32(uint32_t *addr, unsigned int nr, int memory_order);
+
+/**
+ * Atomically clear bit in 32-bit word.
+ *
+ * Atomically set bit specified by @c nr in the 32-bit word pointed to
+ * by @c addr to '0', with the memory ordering as specified by @c
+ * memory_order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+static inline void
+rte_bit_atomic_clear32(uint32_t *addr, unsigned int nr, int memory_order);
+
+/**
+ * Atomically assign a value to bit in a 32-bit word.
+ *
+ * Atomically set bit specified by @c nr in the 32-bit word pointed to
+ * by @c addr to the value indicated by @c value, with the memory
+ * ordering as specified with @c memory_order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+static inline void
+rte_bit_atomic_assign32(uint32_t *addr, unsigned int nr, bool value,
+   int memory_order);
+
+/*
+ * Atomic test-and-assign is not considered useful enough to document
+ * and expose in the public API.
+ */
+static inline bool
+__rte_bit_atomic_test_and_assign32(uint32_t *addr, unsigned int nr, bool value,
+  int memory_order);
+
+/**
+ * Atomically test and set a bit in a 32-bit word.
+ *
+ * Atomically test and set bit specified by @c nr in the 32-bit word
+ * pointed to by @c addr to '1', with the memory ordering as
+ * specified with @c memory_order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit was previously set, and false otherwise.
+ */
+static inline bool
+rte_bit_atomic_test_and_set32(uint32_t *addr, unsigned int nr,
+ int memory_order)
+{
+   return __rte_bit_atomic_test_and_assign32(addr, nr, true, memory_order);
+}
+
+/**
+ * Atomically test and clear a bit in a 32-bit word.
+ *
+ * Atomically test and clear bit specified by @c nr in the 32-bit
+ * word pointed to by @c addr, i.e. set it to '0', with the memory
+ * ordering as specified with @c memory_order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to modify.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit was previously set, and false otherwise.
+ */
+static inline bool
+rte_bit_atomic_test_and_clear32(uint32_t *addr, unsigned int nr,
+   int memory_order)
+{
+   return __rte_bit_atomic_test_and_assign32(addr, nr, false, memory_order);
+}
+
+/**
+ * Test if a particular bit in a 32-bit word is set with a particular
+ * memory order.
+ *
+ * Test a bit with the resulting memory load ordered as per the
+ * specified memory order.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to query.
+ * @
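
A sketch of a typical use of the 32-bit variants (the init routine is
hypothetical, and the memory-order constant is shown as a C11
stdatomic.h value; the RFC may intend DPDK's own wrappers):

#include <stdatomic.h>
#include <stdint.h>
#include <rte_bitops.h>

static uint32_t subsys_flags; /* bit 0: "initialized" */

static void do_subsys_init(void); /* hypothetical one-time init */

static void
subsys_init_once(void)
{
	/* Only the first thread to flip bit 0 runs the initialization. */
	if (!rte_bit_atomic_test_and_set32(&subsys_flags, 0,
	    memory_order_acq_rel))
		do_subsys_init();
}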

[RFC 0/7] Improve EAL bit operations API

2024-03-02 Thread Mattias Rönnblom
This patch set represent an attempt to improve and extend the RTE
bitops API, in particular for functions that operate on individual
bits.

RFCv1 is submitted primarily to 1) receive general feedback on if
improvements in this area is worth working on, and 2) receive feedback
on the API.

No test cases are included in v1 and the various functions may well
not do what they are intended to.

The legacy  rte_bit_relaxed_*() family of functions is
replaced with three families:

rte_bit_[test|set|clear|assign][32|64]() which provides no memory
ordering or atomicity guarantees and no read-once or write-once
semantics (e.g., no use of volatile), but does provide the best
performance. The performance degradation resulting from the use of
volatile (e.g., forcing loads and stores to actually occur and in the
number specified) and atomic (e.g., LOCK instructions on x86) may be
significant.

rte_bit_once_*() which guarantees that program-level loads and stores
actually occur (i.e., prevents certain optimizations). The primary
use of these functions are in the context of memory mapped
I/O. Feedback on the details (semantics, naming) here would be greatly
appreciated, since the author is not much of a driver developer.

rte_bit_atomic_*() which provides atomic bit-level operations,
including the possibility to specifying memory ordering constraints
(or the lack thereof).

The atomic functions take non-_Atomic pointers, to be flexible, just
like the GCC builtins and default . The issue with
_Atomic APIs is that it may well be the case that the user wants to
perform both non-atomic and atomic operations on the same word.

Having _Atomic-marked addresses would complicate supporting atomic
bit-level operations in the proposed bitset API (and potentially other
APIs depending on RTE bitops for atomic bit-level ops). Either one
needs two bitset variants, one _Atomic bitset and one non-atomic one,
or the bitset code needs to cast the non-_Atomic pointer to an _Atomic
one. Having a separate _Atomic bitset would be bloat and also prevent
the user from both, in some situations, doing atomic operations
against a bit set, while in other situations (e.g., at times when MT
safety is not a concern) operating on the same words in a non-atomic
manner. That said, all this is still unclear to the author and much
depending on the future path of DPDK atomics.

Unlike rte_bit_relaxed_*(), individual bits are represented by bool,
not uint32_t or uint64_t. The author found the use of such large types
confusing, and also failed to see any performance benefits.

A set of functions rte_bit_*_assign*() are added, to assign a
particular boolean value to a particular bit.

All functions have properly documented semantics.

All functions are available in uint32_t and uint64_t variants.

In addition, for every function there is a generic selection variant
which operates on both 32-bit and 64-bit words (depending on the
pointer type). The use of C11 generic selection is the first in the
DPDK code base.

_Generic allows the user code to be a little more compact. Having a
generic atomic test/set/clear/assign bit API also seems consistent
with the "core" (word-size) atomics API, which is generic (both GCC
builtins and  are).

The _Generic versions also may avoid having explicit unsigned long
versions of all functions. If you have an unsigned long, it's safe to
use the generic version (e.g., rte_bit_set()) and _Generic will pick
the right function, provided long is either 32 or 64 bit on your
platform (which it is on all DPDK-supported ABIs).
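
For instance (a sketch; as noted, this relies on unsigned long being
the same type as uint32_t or uint64_t on the target ABI):

	unsigned long port_mask = 0;

	rte_bit_set(&port_mask, 5); /* _Generic picks the matching variant */
	if (rte_bit_test(&port_mask, 5))
		rte_bit_clear(&port_mask, 5);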

The generic rte_bit_set() is a macro, and not a function, but
nevertheless has been given a lower-case name. That's how C11 does it
(for atomics, and other _Generic), and . Its address
can't be taken, but it does not evaluate its parameters more than
once.

Things that are left out of this patch set, that may be included
in future versions:

 * Have all functions returning a bit number have the same return type
   (i.e., unsigned int).
 * Harmonize naming of some GCC builtin wrappers (i.e., rte_fls_u32()).
 * Add __builtin_ffsll()/ffs() wrapper and potentially other wrappers
   for useful/used bit-level GCC builtins.
 * Eliminate the MSVC #ifdef-induced documentation duplication.
 * _Generic versions of things like rte_popcount32(). (?)

ABI-breaking patches should probably go into a separate patch set (?).

Mattias Rönnblom (7):
  eal: extend bit manipulation functions
  eal: add generic bit manipulation macros
  eal: add bit manipulation functions which read or write once
  eal: add generic once-type bit operations macros
  eal: add atomic bit operations
  eal: add generic atomic bit operations
  eal: deprecate relaxed family of bit operations

 lib/eal/include/rte_bitops.h | 1115 +-
 1 file changed, 1113 insertions(+), 2 deletions(-)

-- 
2.34.1



[RFC 6/7] eal: add generic atomic bit operations

2024-03-02 Thread Mattias Rönnblom
Add atomic bit-level test/set/clear/assign macros operating on both
32-bit and 64-bit words by means of C11 generic selection.

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 125 +++
 1 file changed, 125 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 7eb08bc768..b5a9df5930 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -290,6 +290,131 @@ extern "C" {
 uint32_t *: rte_bit_once_assign32, \
 uint64_t *: rte_bit_once_assign64)(addr, nr, value)
 
+/**
+ * Test if a particular bit in a word is set with a particular memory
+ * order.
+ *
+ * Test a bit with the resulting memory load ordered as per the
+ * specified memory order.
+ *
+ * @param addr
+ *   A pointer to the word to query.
+ * @param nr
+ *   The index of the bit.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+#define rte_bit_atomic_test(addr, nr, memory_order)\
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_test32, \
+uint64_t *: rte_bit_atomic_test64)(addr, nr, memory_order)
+
+/**
+ * Atomically set bit in word.
+ *
+ * Atomically set bit specified by @c nr in the word pointed to by @c
+ * addr to '1', with the memory ordering as specified by @c
+ * memory_order.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+#define rte_bit_atomic_set(addr, nr, memory_order) \
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_set32,  \
+uint64_t *: rte_bit_atomic_set64)(addr, nr, memory_order)
+
+/**
+ * Atomically clear bit in word.
+ *
+ * Atomically set bit specified by @c nr in the word pointed to by @c
+ * addr to '0', with the memory ordering as specified by @c
+ * memory_order.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+#define rte_bit_atomic_clear(addr, nr, memory_order)   \
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_clear32,\
+uint64_t *: rte_bit_atomic_clear64)(addr, nr, memory_order)
+
+/**
+ * Atomically assign a value to bit in word.
+ *
+ * Atomically set bit specified by @c nr in the word pointed to by @c
+ * addr to the value indicated by @c value, with the memory ordering
+ * as specified with @c memory_order.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param value
+ *   The new value of the bit - true for '1', or false for '0'.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ */
+#define rte_bit_atomic_assign(addr, nr, value, memory_order)   \
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_assign32,   \
+uint64_t *: rte_bit_atomic_assign64)(addr, nr, value,  \
+ memory_order)
+
+/**
+ * Atomically test and set a bit in word.
+ *
+ * Atomically test and set bit specified by @c nr in the word pointed
+ * to by @c addr to '1', with the memory ordering as specified with @c
+ * memory_order.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit was set, and false otherwise.
+ */
+#define rte_bit_atomic_test_and_set(addr, nr, memory_order)\
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_test_and_set32, \
+uint64_t *: rte_bit_atomic_test_and_set64)(addr, nr, memory_order)
+
+/**
+ * Atomically test and clear a bit in word.
+ *
+ * Atomically test and clear bit specified by @c nr in the word
+ * pointed to by @c addr to '0', with the memory ordering as specified
+ * with @c memory_order.
+ *
+ * @param addr
+ *   A pointer to the word to modify.
+ * @param nr
+ *   The index of the bit.
+ * @param memory_order
+ *   The memory order to use. See  for details.
+ * @return
+ *   Returns true if the bit was set, and false otherwise.
+ */
+#define rte_bit_atomic_test_and_clear(addr, nr, memory_order)  \
+   _Generic((addr),\
+uint32_t *: rte_bit_atomic_test_and_clear32,   \
+uint64_t *: rte_bit_atomic_test_and_clear64)(addr, nr, memory_order)

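A short sketch of these generic macros in use, claiming and releasing
slots in a shared 64-entry occupancy mask (the allocator context is
illustrative; memory-order constants are shown as C11 stdatomic.h
values):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <rte_bitops.h>

/* Claim slot 'nr'; true on success (i.e., the bit was previously clear). */
static bool
claim_slot(uint64_t *mask, unsigned int nr)
{
	return !rte_bit_atomic_test_and_set(mask, nr, memory_order_acquire);
}

/* Release a previously claimed slot. */
static void
release_slot(uint64_t *mask, unsigned int nr)
{
	rte_bit_atomic_clear(mask, nr, memory_order_release);
}
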
[RFC 7/7] eal: deprecate relaxed family of bit operations

2024-03-02 Thread Mattias Rönnblom
Informally (by means of documentation) deprecate the
rte_bit_relaxed_*() family of bit-level operations.

rte_bit_relaxed_*() has been replaced by three new families of
bit-level query and manipulation functions.

rte_bit_relaxed_*() failed to deliver the atomicity guarantees their
name suggested. If deprecated, it will encourage the user to consider
whether the actual, implemented behavior (e.g., non-atomic
test-and-set with read/write-once semantics) or the semantics implied
by their names (i.e., atomic), or something else, is what's actually
needed.

Bugzilla ID: 1385

Signed-off-by: Mattias Rönnblom 
---
 lib/eal/include/rte_bitops.h | 48 
 1 file changed, 48 insertions(+)

diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index b5a9df5930..783dd0e1ee 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -1179,6 +1179,10 @@ __RTE_GEN_BIT_ATOMIC_OPS(64)
  *   The address holding the bit.
  * @return
  *   The target bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_test32(),
+ *   rte_bit_once_test32(), or rte_bit_atomic_test32() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline uint32_t
 rte_bit_relaxed_get32(unsigned int nr, volatile uint32_t *addr)
@@ -1196,6 +1200,10 @@ rte_bit_relaxed_get32(unsigned int nr, volatile uint32_t *addr)
  *   The target bit to set.
  * @param addr
  *   The address holding the bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_set32(),
+ *   rte_bit_once_set32(), or rte_bit_atomic_set32() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline void
 rte_bit_relaxed_set32(unsigned int nr, volatile uint32_t *addr)
@@ -1213,6 +1221,10 @@ rte_bit_relaxed_set32(unsigned int nr, volatile uint32_t *addr)
  *   The target bit to clear.
  * @param addr
  *   The address holding the bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_clear32(),
+ *   rte_bit_once_clear32(), or rte_bit_atomic_clear32() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline void
 rte_bit_relaxed_clear32(unsigned int nr, volatile uint32_t *addr)
@@ -1233,6 +1245,12 @@ rte_bit_relaxed_clear32(unsigned int nr, volatile uint32_t *addr)
  *   The address holding the bit.
  * @return
  *   The original bit.
+ * @note
+ *   This function is deprecated and replaced by
+ *   rte_bit_atomic_test_and_set32(), for use cases where the
+ *   operation needs to be atomic. For non-atomic/non-ordered use
+ *   cases, use rte_bit_test32() + rte_bit_set32() or
+ *   rte_bit_once_test32() + rte_bit_once_set32().
  */
 static inline uint32_t
 rte_bit_relaxed_test_and_set32(unsigned int nr, volatile uint32_t *addr)
@@ -1255,6 +1273,12 @@ rte_bit_relaxed_test_and_set32(unsigned int nr, volatile uint32_t *addr)
  *   The address holding the bit.
  * @return
  *   The original bit.
+ * @note
+ *   This function is deprecated and replaced by
+ *   rte_bit_atomic_test_and_clear32(), for use cases where the
+ *   operation needs to be atomic. For non-atomic/non-ordered use
+ *   cases, use rte_bit_test32() + rte_bit_clear32() or
+ *   rte_bit_once_test32() + rte_bit_once_clear32().
  */
 static inline uint32_t
 rte_bit_relaxed_test_and_clear32(unsigned int nr, volatile uint32_t *addr)
@@ -1278,6 +1302,10 @@ rte_bit_relaxed_test_and_clear32(unsigned int nr, volatile uint32_t *addr)
  *   The address holding the bit.
  * @return
  *   The target bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_test64(),
+ *   rte_bit_once_test64(), or rte_bit_atomic_test64() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline uint64_t
 rte_bit_relaxed_get64(unsigned int nr, volatile uint64_t *addr)
@@ -1295,6 +1323,10 @@ rte_bit_relaxed_get64(unsigned int nr, volatile uint64_t *addr)
  *   The target bit to set.
  * @param addr
  *   The address holding the bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_set64(),
+ *   rte_bit_once_set64(), or rte_bit_atomic_set64() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline void
 rte_bit_relaxed_set64(unsigned int nr, volatile uint64_t *addr)
@@ -1312,6 +1344,10 @@ rte_bit_relaxed_set64(unsigned int nr, volatile uint64_t *addr)
  *   The target bit to clear.
  * @param addr
  *   The address holding the bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_clear64(),
+ *   rte_bit_once_clear64(), or rte_bit_atomic_clear64() instead,
+ *   depending on exactly what guarantees are required.
  */
 static inline void
 rte_bit_relaxed_clear64(unsigned int nr, volatile uint64_t *addr)
@@ -1332,6 +1368,12 @@ rte_bit_relaxed_clear64(unsigned int nr, volatile uint64_t *addr)
  *   The address holding the bit.
  * @return
  *   The original bit.
+ * @note
+ *   This function is deprecated and replaced by
+ *   rte_bit_atomic_test_and_set64(), for use cases where the
+ *   operation needs to be atomic. For non-atomic/non-ordered use
+ *   cases, use rte_bit_test64() + rte_bit_set64() or
+ *   rte_bit_once_test64() + rte_bit_once_set64().
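
To make the migration concrete, a hedged before/after sketch (LOCK_BIT
and the flags word are illustrative; note that the argument order also
changes from (nr, addr) to (addr, nr)):

#include <stdatomic.h>
#include <stdint.h>
#include <rte_bitops.h>

#define LOCK_BIT 0 /* illustrative */
static uint32_t flags;

static void
take_lock_before(void)
{
	/* Before: the name suggests atomicity, but the operation is
	 * not atomic. */
	while (rte_bit_relaxed_test_and_set32(LOCK_BIT, &flags))
		;
}

static void
take_lock_after(void)
{
	/* After, where atomicity is actually required. */
	while (rte_bit_atomic_test_and_set32(&flags, LOCK_BIT,
	    memory_order_acquire))
		;
}

Where the word is only ever touched by a single thread, plain
rte_bit_test32() + rte_bit_set32() is the cheaper replacement.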

Re: [PATCH v2 00/71] replace use of fixed size rte_mempcy

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 13:01:51 +0100
Mattias Rönnblom  wrote:

> I ran some DSW benchmarks, and if you add
> 
> diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
> index 72a92290e0..64cd82d78d 100644
> --- a/lib/eal/x86/include/rte_memcpy.h
> +++ b/lib/eal/x86/include/rte_memcpy.h
> @@ -862,6 +862,11 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
>   static __rte_always_inline void *
>   rte_memcpy(void *dst, const void *src, size_t n)
>   {
> +   if (__builtin_constant_p(n) && n <= 32) {
> +   memcpy(dst, src, n);
> +   return dst;
> +   }
> +

The default GCC inline threshold is 64 bytes (i.e. cache line size)
and that makes sense.  So, using __builtin_constant_p, it could do:

	if (__builtin_constant_p(n) && n < RTE_CACHE_LINE_SIZE)
		return __builtin_memcpy(dst, src, n);


Re: [PATCH v2 00/71] replace use of fixed size rte_mempcy

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 14:05:45 +0100
Morten Brørup  wrote:

> >   
> > > My experience with replacing rte_memcpy() with memcpy() (or vice  
> > versa)  
> > > is mixed.
> > >
> > > I've also tried just dropping the DPDK-custom memcpy() implementation
> > > altogether, and that caused a performance drop (in a particular app,  
> > on  
> > > a particular compiler and CPU).  
> 
> I guess the compilers are just not where we want them to be yet.
> 
> I don't mind generally replacing rte_memcpy() with memcpy() in the control 
> plane.
> But we should use whatever is more efficient in the data plane.
> 
> We must also keep in mind that DPDK supports old distros with old compilers. 
> We should not remove a superfluous hand crafted optimization if a supported 
> old compiler hasn't caught up with it yet, i.e. if it isn't superfluous on 
> some of the old compilers supported by DPDK.

When I scanned the results:
1. Most copies were small (like Ether address or IPv6 address) and
   compiler inlining should beat a function call every time.
2. Larger structure copies were in control path.


Re: [PATCH v2 01/71] cocci/rte_memcpy: add script to eliminate fixed size rte_memcpy

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 12:19:13 +0100
Mattias Rönnblom  wrote:

> On 2024-03-01 18:14, Stephen Hemminger wrote:
> > Rte_memcpy should not be used for the simple case of copying
> > a fixed size structure because it is slower and will hide problems
> > from code analysis tools. Coverity, fortify and other analyzers
> > special case memcpy().
> > 
> > Gcc (and Clang) are smart enough to inline copies which
> > will be faster.
> >   
> 
> Are you suggesting rte_memcpy() calls aren't inlined?

With a simple made up example of copying an IPv6 address.

rte_memcpy() inlines to:
rte_copy_addr:
movdqu  xmm0, XMMWORD PTR [rsi]
movups  XMMWORD PTR [rdi], xmm0
movdqu  xmm0, XMMWORD PTR [rsi]
movups  XMMWORD PTR [rdi], xmm0
ret

Vs memcpy becomes.

copy_addr:
movdqu  xmm0, XMMWORD PTR [rsi]
movups  XMMWORD PTR [rdi], xmm0
ret

Looks like a bug in rte_memcpy_generic? Why is it not handling
16 bytes correctly.

For me, it was more about getting fortify and compiler warnings to
catch bugs. For most code, the output should be the same.
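
For reference, a plausible reconstruction of the compared functions (the
struct and function names follow the labels in the listings; the exact
source was not posted):

#include <stdint.h>
#include <string.h>
#include <rte_memcpy.h>

struct ip6_addr { uint8_t bytes[16]; }; /* illustrative 16-byte type */

void
rte_copy_addr(struct ip6_addr *dst, const struct ip6_addr *src)
{
	rte_memcpy(dst, src, sizeof(*dst)); /* two 16-byte moves, per above */
}

void
copy_addr(struct ip6_addr *dst, const struct ip6_addr *src)
{
	memcpy(dst, src, sizeof(*dst)); /* single 16-byte move, per above */
}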


Re: [RFC 1/7] eal: extend bit manipulation functions

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 14:53:22 +0100
Mattias Rönnblom  wrote:

> diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
> index 449565eeae..9a368724d5 100644
> --- a/lib/eal/include/rte_bitops.h
> +++ b/lib/eal/include/rte_bitops.h
> @@ -2,6 +2,7 @@
>   * Copyright(c) 2020 Arm Limited
>   * Copyright(c) 2010-2019 Intel Corporation
>   * Copyright(c) 2023 Microsoft Corporation
> + * Copyright(c) 2024 Ericsson AB
>   */
>  

Unless this is coming from another project code base, the common
practice is not to add copyright for each contributor in later versions.

> +/**
> + * Test if a particular bit in a 32-bit word is set.
> + *
> + * This function does not give any guarantees in regards to memory
> + * ordering or atomicity.
> + *
> + * @param addr
> + *   A pointer to the 32-bit word to query.
> + * @param nr
> + *   The index of the bit (0-31).
> + * @return
> + *   Returns true if the bit is set, and false otherwise.
> + */
> +static inline bool
> +rte_bit_test32(const uint32_t *addr, unsigned int nr);

Is it possible to reorder these inlines to avoid having
forward declarations?

Also, new functions should be marked __rte_experimental
for a release or two.


Re: [RFC 7/7] eal: deprecate relaxed family of bit operations

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 14:53:28 +0100
Mattias Rönnblom  wrote:

> diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
> index b5a9df5930..783dd0e1ee 100644
> --- a/lib/eal/include/rte_bitops.h
> +++ b/lib/eal/include/rte_bitops.h
> @@ -1179,6 +1179,10 @@ __RTE_GEN_BIT_ATOMIC_OPS(64)
>   *   The address holding the bit.
>   * @return
>   *   The target bit.
> + * @note
> + *   This function is deprecated. Use rte_bit_test32(),
> + *   rte_bit_once_test32(), or rte_bit_atomic_test32() instead,
> + *   depending on exactly what guarantees are required.
>   */
>  static inline uint32_t
>  rte_bit_relaxed_get32(unsigned int nr, volatile uint32_t *addr)

The DPDK process is:
- mark these as deprecated in release notes of release N.
- mark these as deprecated using __rte_deprecated in next LTS
- drop these in LTS release after that.

Don't use notes for this.


RE: [PATCH v2 00/71] replace use of fixed size rte_mempcy

2024-03-02 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Saturday, 2 March 2024 17.38
> 
> On Sat, 2 Mar 2024 14:05:45 +0100
> Morten Brørup  wrote:
> 
> > >
> > > > My experience with replacing rte_memcpy() with memcpy() (or vice
> > > versa)
> > > > is mixed.
> > > >
> > > > I've also tried just dropping the DPDK-custom memcpy()
> implementation
> > > > altogether, and that caused a performance drop (in a particular
> app,
> > > on
> > > > a particular compiler and CPU).
> >
> > I guess the compilers are just not where we want them to be yet.
> >
> > I don't mind generally replacing rte_memcpy() with memcpy() in the
> control plane.
> > But we should use whatever is more efficient in the data plane.
> >
> > We must also keep in mind that DPDK supports old distros with old
> compilers. We should not remove a superfluous hand crafted optimization
> if a supported old compiler hasn't caught up with it yet, i.e. if it
> isn't superfluous on some of the old compilers supported by DPDK.
> 
> When I scanned the results:
> 1. Most copies were small (like Ether address or IPv6 address)
>    and compiler inlining should beat a function call every time.

Please note that rte_memcpy() is inline, so no function call is involved.

> 2. Larger structure copies were in control path.

Yep, I saw the same two things when scanning v1 of the series before acking it.
If we didn't overlook any fast path copies, this series is a good clean-up.

I must admit that I assume that any compiler's built-in memcpy() is able to 
efficiently copy small structures of build time constant size.
Assumptions are the mother of all FU's, but being wrong on this would be a very 
big surprise to me.



RE: [PATCH v2 01/71] cocci/rte_memcpy: add script to eliminate fixed size rte_memcpy

2024-03-02 Thread Morten Brørup
+To: x86 maintainers, another bug in rte_memcpy()

> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Saturday, 2 March 2024 18.02
> 
> On Sat, 2 Mar 2024 12:19:13 +0100
> Mattias Rönnblom  wrote:
> 
> > On 2024-03-01 18:14, Stephen Hemminger wrote:
> > > Rte_memcpy should not be used for the simple case of copying
> > > a fixed size structure because it is slower and will hide problems
> > > from code analysis tools. Coverity, fortify and other analyzers
> > > special case memcpy().
> > >
> > > Gcc (and Clang) are smart enough to inline copies which
> > > will be faster.
> > >
> >
> > Are you suggesting rte_memcpy() calls aren't inlined?
> 
> With a simple made up example of copying an IPv6 address.
> 
> rte_memcpy() inlines to:
> rte_copy_addr:
> movdqu  xmm0, XMMWORD PTR [rsi]
> movups  XMMWORD PTR [rdi], xmm0
> movdqu  xmm0, XMMWORD PTR [rsi]
> movups  XMMWORD PTR [rdi], xmm0
> ret
> 
> Vs memcpy becomes.
> 
> copy_addr:
> movdqu  xmm0, XMMWORD PTR [rsi]
> movups  XMMWORD PTR [rdi], xmm0
> ret
> 
> Looks like a bug in rte_memcpy_generic? Why is it not handling
> 16 bytes correctly.

Looks like you have run into another bunch of off-by-one errors like this one:
https://git.dpdk.org/dpdk/commit/lib/eal/x86/include/rte_memcpy.h?id=2ef17be88e8b26f871cfb0265227341e36f486ea

> 
> For me, it was more about getting fortify and compiler warnings to
> catch bugs. For most code, the output should be the same.



[PATCH v7] mempool: test performance with larger bursts

2024-03-02 Thread Morten Brørup
Bursts of up to 64, 128 and 256 packets are not uncommon, so increase the
maximum tested get and put burst sizes from 32 to 256.
For convenience, also test get and put burst sizes of
RTE_MEMPOOL_CACHE_MAX_SIZE.

Some applications keep more than 512 objects, so increase the maximum
number of kept objects from 512 to 32768, still in jumps of factor four.
This exceeds the typical mempool cache size of 512 objects, so the test
also exercises the mempool driver.

Increased the precision of rate_persec calculation by timing the actual
duration of the test, instead of assuming it took exactly 5 seconds.

Added cache guard to per-lcore stats structure.

Signed-off-by: Morten Brørup 
Acked-by: Chengwen Feng 
---

v7:
* Increase max burst size to 256. (Inspired by Honnappa)
v6:
* Do not test with more lcores than available. (Thomas)
v5:
* Increased N, to reduce measurement overhead with large numbers of kept
  objects.
* Increased precision of rate_persec calculation.
* Added missing cache guard to per-lcore stats structure.
v4:
* v3 failed to apply; I had messed up something with git.
* Added ACK from Chengwen Feng.
v3:
* Increased max number of kept objects to 32768.
* Added get and put burst sizes of RTE_MEMPOOL_CACHE_MAX_SIZE objects.
* Print error if unable to allocate mempool.
* Initialize use_external_cache with each test.
  A previous version of this patch had a bug, where all test runs
  following the first would use external cache. (Chengwen Feng)
v2: Addressed feedback by Chengwen Feng
* Added get and put burst sizes of 64 objects, which is probably also not
  uncommon packet burst size.
* Fixed list of number of kept objects so list remains in jumps of factor
  four.
* Added three derivative test cases, for faster testing.
---
 app/test/test_mempool_perf.c | 144 +++
 1 file changed, 96 insertions(+), 48 deletions(-)

diff --git a/app/test/test_mempool_perf.c b/app/test/test_mempool_perf.c
index 96de347f04..bb40d1d911 100644
--- a/app/test/test_mempool_perf.c
+++ b/app/test/test_mempool_perf.c
@@ -54,22 +54,25 @@
  *
  *- Bulk size (*n_get_bulk*, *n_put_bulk*)
  *
- *  - Bulk get from 1 to 32
- *  - Bulk put from 1 to 32
- *  - Bulk get and put from 1 to 32, compile time constant
+ *  - Bulk get from 1 to 256, and RTE_MEMPOOL_CACHE_MAX_SIZE
+ *  - Bulk put from 1 to 256, and RTE_MEMPOOL_CACHE_MAX_SIZE
+ *  - Bulk get and put from 1 to 256, and RTE_MEMPOOL_CACHE_MAX_SIZE, compile time constant
  *
  *- Number of kept objects (*n_keep*)
  *
  *  - 32
  *  - 128
  *  - 512
+ *  - 2048
+ *  - 8192
+ *  - 32768
  */
 
-#define N 65536
 #define TIME_S 5
 #define MEMPOOL_ELT_SIZE 2048
-#define MAX_KEEP 512
-#define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE))-1)
+#define MAX_KEEP 32768
+#define N (128 * MAX_KEEP)
+#define MEMPOOL_SIZE ((rte_lcore_count()*(MAX_KEEP+RTE_MEMPOOL_CACHE_MAX_SIZE*2))-1)
 
 /* Number of pointers fitting into one cache line. */
 #define CACHE_LINE_BURST (RTE_CACHE_LINE_SIZE / sizeof(uintptr_t))
@@ -100,9 +103,11 @@ static unsigned n_keep;
 /* true if we want to test with constant n_get_bulk and n_put_bulk */
 static int use_constant_values;
 
-/* number of enqueues / dequeues */
+/* number of enqueues / dequeues, and time used */
 struct mempool_test_stats {
uint64_t enq_count;
+   uint64_t duration_cycles;
+   RTE_CACHE_GUARD;
 } __rte_cache_aligned;
 
 static struct mempool_test_stats stats[RTE_MAX_LCORE];
@@ -185,6 +190,7 @@ per_lcore_mempool_test(void *arg)
GOTO_ERR(ret, out);
 
stats[lcore_id].enq_count = 0;
+   stats[lcore_id].duration_cycles = 0;
 
/* wait synchro for workers */
if (lcore_id != rte_get_main_lcore())
@@ -204,6 +210,15 @@ per_lcore_mempool_test(void *arg)
CACHE_LINE_BURST, CACHE_LINE_BURST);
else if (n_get_bulk == 32)
ret = test_loop(mp, cache, n_keep, 32, 32);
+   else if (n_get_bulk == 64)
+   ret = test_loop(mp, cache, n_keep, 64, 64);
+   else if (n_get_bulk == 128)
+   ret = test_loop(mp, cache, n_keep, 128, 128);
+   else if (n_get_bulk == 256)
+   ret = test_loop(mp, cache, n_keep, 256, 256);
+   else if (n_get_bulk == RTE_MEMPOOL_CACHE_MAX_SIZE)
+   ret = test_loop(mp, cache, n_keep,
+   RTE_MEMPOOL_CACHE_MAX_SIZE, RTE_MEMPOOL_CACHE_MAX_SIZE);
else
ret = -1;
 
@@ -215,6 +230,8 @@ per_lcore_mempool_test(void *arg)
stats[lcore_id].enq_count += N;
}
 
+   stats[lcore_id].duration_cycles = time_diff;
+
 out:
if (use_external_cache) {
rte_mempool_cache_flush(cache, mp);
@@ -232,6 +249,7 @@ launch_cores(struct rte_mempool *mp, unsigned int cores)
uin
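
In outline, the higher-precision rate calculation the commit message
describes can be derived from the per-lcore stats along these lines (a
sketch; the literal hunk is cut off above):

	uint64_t enq_total = 0, cycles_total = 0;
	unsigned int lcore_id, n = 0;

	RTE_LCORE_FOREACH(lcore_id) {
		enq_total += stats[lcore_id].enq_count;
		cycles_total += stats[lcore_id].duration_cycles;
		n++;
	}
	/* Use the average measured duration instead of assuming TIME_S. */
	double seconds = (double)cycles_total / n / rte_get_tsc_hz();
	double rate_persec = enq_total / seconds;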

Re: [PATCH] lib/hash,lib/rcu: feature hidden key count in hash

2024-03-02 Thread Abdullah Ömer Yamaç
Sorry for the late reply. I understood what you mean. I will create only
the reclaim API for the hash library. Thanks for the explanation.

On Wed, Feb 28, 2024 at 5:51 PM Honnappa Nagarahalli <
honnappa.nagaraha...@arm.com> wrote:

>
>
> > On Feb 28, 2024, at 5:44 AM, Abdullah Ömer Yamaç 
> wrote:
> >
> > While I was implementing the new API, I realized one issue, and it would
> be good to discuss it here. First of all rte_rcu_qsbr_dq_reclaim function
> checks the state of the qsbr values. It means that all threads should
> report the quiescent states. It conflicts with my aim.
> >
> > Let's think about below scenario:
> > Eight threads use a hash table and periodically report their quiescent
> states. One additional thread (main thread) periodically reports the hash
> size. I implemented the reclaim function in that thread. I mean, the main
> thread calls reclaim before the rte_hash_count.
> >
> > Here is the exceptional case that I couldn't retrieve the correct hash
> size:
> > Assume that 6 of 8 threads reported quiescent states and 2 of them are
> still working on some process and haven't reported quiescent states yet.
> The main thread calls reclaim functions every time, but elements in dq will
> not be freed because 2 of the worker threads haven't reported their states
> (especially if they are waiting for some packets). So, my first proposed
> method is more suitable for this case. Any idea?
> If 2 out of 8 threads have not reported their quiescent state then the
> elements that have not been acknowledged by those 2 threads cannot be
> reclaimed and cannot be allocated for further use. Using this you can
> calculate the most accurate number of entries in the hash table available
> for allocation.
>
> >
> > On Thu, Feb 22, 2024 at 7:44 PM Honnappa Nagarahalli <
> honnappa.nagaraha...@arm.com> wrote:
> >
> >
> > > On Feb 22, 2024, at 6:39 AM, Abdullah Ömer Yamaç 
> wrote:
> > >
> > > As a final decision, I will add a new hash API that forces the
> reclaim. Is it ok for everyone?
> > Ack from my side
> >
> > >
> > > On Thu, Feb 22, 2024 at 5:37 AM Honnappa Nagarahalli <
> honnappa.nagaraha...@arm.com> wrote:
> > >
> > >
> > > > On Feb 21, 2024, at 3:51 PM, Abdullah Ömer Yamaç <
> aomerya...@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Wed, Feb 21, 2024 at 6:24 AM Honnappa Nagarahalli <
> honnappa.nagaraha...@arm.com> wrote:
> > > >
> > > >
> > > > > On Feb 20, 2024, at 12:58 PM, Abdullah Ömer Yamaç <
> aomerya...@gmail.com> wrote:
> > > > >
> > > > > I appreciate that you gave me suggestions and comments. I will
> make changes according to all your recommendations, but before that, I want
> to make everyone's minds clear. Then, I will apply modifications.
> > > > >
> > > > > On Tue, Feb 20, 2024 at 2:35 AM Honnappa Nagarahalli <
> honnappa.nagaraha...@arm.com> wrote:
> > > > >
> > > > >
> > > > > > On Feb 19, 2024, at 3:28 PM, Abdullah Ömer Yamaç <
> aomerya...@gmail.com> wrote:
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Let me explain a use case;
> > > > > >
> > > > > > I have a hash table whose key value is IP addresses, and data
> (let's say the username of the IP) is related to the IP address. The key
> point is matching these data with flows. Flows are dynamic, and this hash
> table is dynamic, as well; both can change anytime. For example, when a
> flow starts, we look up the hash table with the corresponding IP and
> retrieve the username. We need to hold this username until the flow
> terminates, although we removed this IP key from the hash table
> (multithread). That's why we have RCU and defer queue is necessary for high
> performance. In my application, I need to know the number of IP-username
> entries. These numbers can be calculated by rte_hash_count - defer queue
> size.
> > > > > The entries in the defer queue are not reclaimed (there is a
> probability that all of them can be reclaimed) and hence they are not
> available for allocation. So, rte_hash_count - defer queue size might not
> give you the correct number you are expecting.
> > > > >
> > > > > Currently, there is no API in hash library that forces a reclaim.
> Does it makes sense to have an API that just does the reclaim (and returns
> the number of entries pending in the defer queue)? A call to rte_hash_count
> should provide the exact count you are looking for.
> > > > > You are right; no API in the hash library forces a reclaim. In my
> application, I periodically call rte_count to retrieve hash size, and this
> data is shown in my GUI. So that means I need to call regularly reclaim. I
> am trying to figure out which is better, calling reclaim or retrieving the
> defer queue size. Any comment about this?
> > > > Retrieving the defer queue size will be cheaper. However, calling
> the reclaim API will ensure the entries are freed hence providing an
> accurate number. Calling the reclaim API on an empty defer queue does not
> consume many cycles. If needed we could add a check for empty defer queue
> in th

[PATCH] rte_memcpy: fix off by one for size 16 and 32

2024-03-02 Thread Stephen Hemminger
The rte_memcpy code would do extra instructions for size 16
and 32 which potentially could reference past end of data.

For size of 16, only a single mov16 is needed.
Same for size of 32, only a single mov32.

Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")

Suggested-by: Morten Brørup 
Signed-off-by: Stephen Hemminger 
---
 lib/eal/x86/include/rte_memcpy.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 72a92290e05d..00479a4bfbe6 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -233,13 +233,15 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 */
if (n <= 32) {
rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-   rte_mov16((uint8_t *)dst - 16 + n,
+   if (n > 16)
+   rte_mov16((uint8_t *)dst - 16 + n,
  (const uint8_t *)src - 16 + n);
return ret;
}
if (n <= 64) {
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-   rte_mov32((uint8_t *)dst - 32 + n,
+   if (n > 32)
+   rte_mov32((uint8_t *)dst - 32 + n,
  (const uint8_t *)src - 32 + n);
return ret;
}
-- 
2.43.0



[PATCH v2] lib/hash: feature reclaim defer queue

2024-03-02 Thread Abdullah Ömer Yamaç
This patch adds a new feature to the hash library to allow the user to
reclaim the defer queue. This is useful when the user wants to force
reclaim resources that are not being used. This API is only available
if the RCU is enabled.

Signed-off-by: Abdullah Ömer Yamaç 
Acked-by: Honnappa Nagarahalli 
---
 lib/hash/rte_cuckoo_hash.c | 23 +++
 lib/hash/rte_hash.h| 14 ++
 lib/hash/version.map   |  7 +++
 3 files changed, 44 insertions(+)

diff --git a/lib/hash/rte_cuckoo_hash.c b/lib/hash/rte_cuckoo_hash.c
index 9cf94645f6..254fa80cc5 100644
--- a/lib/hash/rte_cuckoo_hash.c
+++ b/lib/hash/rte_cuckoo_hash.c
@@ -1588,6 +1588,27 @@ rte_hash_rcu_qsbr_add(struct rte_hash *h, struct rte_hash_rcu_config *cfg)
return 0;
 }
 
+int
+rte_hash_rcu_qsbr_dq_reclaim(struct rte_hash *h)
+{
+   int ret;
+
+   if (h->hash_rcu_cfg == NULL || h->dq == NULL) {
+   rte_errno = EINVAL;
+   return -1;
+   }
+
+   ret = rte_rcu_qsbr_dq_reclaim(h->dq, h->hash_rcu_cfg->max_reclaim_size, NULL, NULL, NULL);
+   if (ret != 0) {
+   HASH_LOG(ERR,
+   "%s: could not reclaim the defer queue in hash table",
+   __func__);
+   return -1;
+   }
+
+   return 0;
+}
+
 static inline void
 remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt,
unsigned int i)
diff --git a/lib/hash/rte_hash.h b/lib/hash/rte_hash.h
index 7ecc02..c119477d50 100644
--- a/lib/hash/rte_hash.h
+++ b/lib/hash/rte_hash.h
@@ -674,6 +674,21 @@ rte_hash_iterate(const struct rte_hash *h, const void **key, void **data, uint32
  */
 int rte_hash_rcu_qsbr_add(struct rte_hash *h, struct rte_hash_rcu_config *cfg);
 
+/**
+ * Reclaim resources from the defer queue.
+ * This API reclaims the resources from the defer queue if RCU is enabled.
+ *
+ * @param h
+ *   the hash object to reclaim resources
+ * @return
+ *   On success - 0
+ *   On error - -1 with error code set in rte_errno.
+ *   Possible rte_errno codes are:
+ *   - EINVAL - invalid pointer or invalid rcu mode
+ */
+__rte_experimental
+int rte_hash_rcu_qsbr_dq_reclaim(struct rte_hash *h);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/lib/hash/version.map b/lib/hash/version.map
index 6b2afebf6b..cec0e8fc67 100644
--- a/lib/hash/version.map
+++ b/lib/hash/version.map
@@ -48,3 +48,10 @@ DPDK_24 {
 
local: *;
 };
+
+EXPERIMENTAL {
+   global:
+
+   #added in 24.1
+   rte_hash_rcu_qsbr_dq_reclaim;
+}
\ No newline at end of file
-- 
2.34.1

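Tying this back to the thread above, a minimal usage sketch (h is a
hypothetical rte_hash configured with rte_hash_rcu_qsbr_add(); forcing
a reclaim first makes rte_hash_count() reflect the entries actually
available):

	/* Periodic stats path, e.g. on the main lcore. */
	if (rte_hash_rcu_qsbr_dq_reclaim(h) < 0)
		printf("reclaim skipped: %s\n", rte_strerror(rte_errno));
	printf("entries in use: %d\n", rte_hash_count(h));
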


RE: [PATCH] rte_memcpy: fix off by one for size 16 and 32

2024-03-02 Thread Morten Brørup
I'm also working on a fix.

Med venlig hilsen / Kind regards,
-Morten Brørup

> -Original Message-
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Saturday, 2 March 2024 21.57
> To: dev@dpdk.org
> Cc: Morten Brørup; Bruce Richardson; Konstantin Ananyev; Zhihong Wang;
> Yuanhan Liu; Xiaoyun Li
> Subject: Re: [PATCH] rte_memcpy: fix off by one for size 16 and 32
> 
> On Sat,  2 Mar 2024 12:49:23 -0800
> Stephen Hemminger  wrote:
> 
> > The rte_memcpy code would do extra instructions for size 16
> > and 32 which potentially could reference past end of data.
> >
> > For size of 16, only single mov16 is needed.
> > same for size of 32, only single mov32.
> >
> > Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
> > Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-
> time")
> >
> > Suggested-by: Morten Brørup 
> > Signed-off-by: Stephen Hemminger 
> 
> Self-NAK, more is needed here.
> 
> The code has lots of pre-existing bugs where it will reference past the
> end
> of the data in some cases.


[PATCH] eal/x86: improve rte_memcpy const size 16 performance

2024-03-02 Thread Morten Brørup
When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
In the case where the size is known to be 16 at build time, omit the
duplicate copy.

Reduced the amount of effectively copy-pasted code by using #ifdef
inside functions instead of outside functions.

Suggested-by: Stephen Hemminger 
Signed-off-by: Morten Brørup 
---
 lib/eal/x86/include/rte_memcpy.h | 224 ---
 1 file changed, 54 insertions(+), 170 deletions(-)

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 72a92290e0..6cc0e8ee16 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2024 SmartShare Systems
  */
 
 #ifndef _RTE_MEMCPY_X86_64_H_
@@ -91,14 +92,6 @@ rte_mov15_or_less(void *dst, const void *src, size_t n)
return ret;
 }
 
-#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
-
-#define ALIGNMENT_MASK 0x3F
-
-/**
- * AVX512 implementation below
- */
-
 /**
  * Copy 16 bytes from one location to another,
  * locations should not overlap.
@@ -119,10 +112,15 @@ rte_mov16(uint8_t *dst, const uint8_t *src)
 static __rte_always_inline void
 rte_mov32(uint8_t *dst, const uint8_t *src)
 {
+#if (defined __AVX512F__ && defined RTE_MEMCPY_AVX512) || defined __AVX2__ || defined __AVX__
__m256i ymm0;
 
ymm0 = _mm256_loadu_si256((const __m256i *)src);
_mm256_storeu_si256((__m256i *)dst, ymm0);
+#else /* SSE implementation */
+   rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+   rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+#endif
 }
 
 /**
@@ -132,10 +130,15 @@ rte_mov32(uint8_t *dst, const uint8_t *src)
 static __rte_always_inline void
 rte_mov64(uint8_t *dst, const uint8_t *src)
 {
+#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
__m512i zmm0;
 
zmm0 = _mm512_loadu_si512((const void *)src);
_mm512_storeu_si512((void *)dst, zmm0);
+#else /* AVX2, AVX & SSE implementation */
+   rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+   rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+#endif
 }
 
 /**
@@ -156,12 +159,18 @@ rte_mov128(uint8_t *dst, const uint8_t *src)
 static __rte_always_inline void
 rte_mov256(uint8_t *dst, const uint8_t *src)
 {
-   rte_mov64(dst + 0 * 64, src + 0 * 64);
-   rte_mov64(dst + 1 * 64, src + 1 * 64);
-   rte_mov64(dst + 2 * 64, src + 2 * 64);
-   rte_mov64(dst + 3 * 64, src + 3 * 64);
+   rte_mov128(dst + 0 * 128, src + 0 * 128);
+   rte_mov128(dst + 1 * 128, src + 1 * 128);
 }
 
+#if defined __AVX512F__ && defined RTE_MEMCPY_AVX512
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
 /**
  * Copy 128-byte blocks from one location to another,
  * locations should not overlap.
@@ -231,12 +240,22 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
/**
 * Fast way when copy size doesn't exceed 512 bytes
 */
+   if (__builtin_constant_p(n) && n == 32) {
+   rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+   return ret;
+   }
if (n <= 32) {
rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+   if (__builtin_constant_p(n) && n == 16)
+   return ret; /* avoid (harmless) duplicate copy */
rte_mov16((uint8_t *)dst - 16 + n,
  (const uint8_t *)src - 16 + n);
return ret;
}
+   if (__builtin_constant_p(n) && n == 64) {
+   rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+   return ret;
+   }
if (n <= 64) {
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n,
@@ -321,73 +340,6 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
  * AVX2 implementation below
  */
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static __rte_always_inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-   __m128i xmm0;
-
-   xmm0 = _mm_loadu_si128((const __m128i *)(const void *)src);
-   _mm_storeu_si128((__m128i *)(void *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static __rte_always_inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-   __m256i ymm0;
-
-   ymm0 = _mm256_loadu_si256((const __m256i *)(const void *)src);
-   _mm256_storeu_si256((__m256i *)(void *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static __rte_always_inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-   rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-   rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
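
The constant-size pattern the patch relies on, as a standalone sketch
(the function name is illustrative, not from the patch): with a
compile-time-constant n the special case folds away to a single
fixed-size copy, while a runtime n fails the __builtin_constant_p()
test and takes the generic overlapped head/tail path.

#include <stdint.h>
#include <string.h>

/* Sketch: copy 16 < n <= 32 bytes. */
static inline void *
copy_up_to_32(void *dst, const void *src, size_t n)
{
	if (__builtin_constant_p(n) && n == 32) {
		memcpy(dst, src, 32);	/* compiles to one 32-byte move */
		return dst;
	}
	/* Overlapped head/tail copies; the ranges overlap by 32 - n bytes. */
	memcpy(dst, src, 16);
	memcpy((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n, 16);
	return dst;
}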

RE: [PATCH] rte_memcpy: fix off by one for size 16 and 32

2024-03-02 Thread Morten Brørup
> From: Stephen Hemminger [mailto:step...@networkplumber.org]
> Sent: Saturday, 2 March 2024 21.49
> 
> The rte_memcpy code would do extra instructions for size 16
> and 32 which potentially could reference past end of data.

It's a somewhat weird concept, but they don't reference past the end of data.
They reference data in chunks of 16.
E.g. when copying 17 bytes: first copy [0..15] and then copy [1..16] (as
"-16 + n" in the code).

By starting the second 16-byte access at "-16 + n", still inside the buffer,
they are not referencing past the end of data.
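
The same idea as a self-contained sketch (SSE2 intrinsics; illustrative
only, not the actual DPDK code):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Copy 17..32 bytes with two 16-byte moves whose ranges overlap by
 * 32 - n bytes; no byte outside [0, n) is read or written. */
static inline void
copy_17_to_32(uint8_t *dst, const uint8_t *src, size_t n)
{
	__m128i head = _mm_loadu_si128((const __m128i *)src);
	__m128i tail = _mm_loadu_si128((const __m128i *)(src + n - 16));

	_mm_storeu_si128((__m128i *)dst, head);
	_mm_storeu_si128((__m128i *)(dst + n - 16), tail);
}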

> 
> For size of 16, only a single mov16 is needed.
> Same for size of 32, only a single mov32.

I fixed the duplicate copies with my patch [1]. Please review.

[1]: https://inbox.dpdk.org/dev/20240302234812.9137-1...@smartsharesystems.com/T/#u



RE: [PATCH] eal/x86: improve rte_memcpy const size 16 performance

2024-03-02 Thread Morten Brørup
Recheck-request: iol-broadcom-Performance

Patch only modifies x86 code, but the performance test fails on aarch64.



Re: [PATCH] eal/x86: improve rte_memcpy const size 16 performance

2024-03-02 Thread Stephen Hemminger
On Sun,  3 Mar 2024 00:48:12 +0100
Morten Brørup  wrote:

> When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> In the case where the size is known to be 16 at build time, omit the
> duplicate copy.
> 
> Reduced the amount of effectively copy-pasted code by using #ifdef
> inside functions instead of outside functions.
> 
> Suggested-by: Stephen Hemminger 
> Signed-off-by: Morten Brørup 
> ---

Looks good, let me see how it looks in Godbolt vs. GCC.

One other issue is that for the non-constant case, rte_memcpy has an excessively
large inline code footprint. That is one of the reasons GCC doesn't always
inline. For > 128 bytes, it really should be a function.


Re: [PATCH] eal/x86: improve rte_memcpy const size 16 performance

2024-03-02 Thread Stephen Hemminger
On Sat, 2 Mar 2024 21:40:03 -0800
Stephen Hemminger  wrote:

> On Sun,  3 Mar 2024 00:48:12 +0100
> Morten Brørup  wrote:
> 
> > When the rte_memcpy() size is 16, the same 16 bytes are copied twice.
> > In the case where the size is known to be 16 at build time, omit the
> > duplicate copy.
> > 
> > Reduced the amount of effectively copy-pasted code by using #ifdef
> > inside functions instead of outside functions.
> > 
> > Suggested-by: Stephen Hemminger 
> > Signed-off-by: Morten Brørup 
> > ---  
> 
> Looks good, let me see how it looks in Godbolt vs. GCC.
> 
> One other issue is that for the non-constant case, rte_memcpy has an
> excessively large inline code footprint. That is one of the reasons GCC
> doesn't always inline. For > 128 bytes, it really should be a function.

For sizes of 4, 6, 8, 16, 32, 64, up to 128, the GCC inline and rte_memcpy match.

For size 128, it looks like GCC's code is simpler.

rte_copy_addr:
vmovdqu ymm0, YMMWORD PTR [rsi]
vextracti128    XMMWORD PTR [rdi+16], ymm0, 0x1
vmovdqu XMMWORD PTR [rdi], xmm0
vmovdqu ymm0, YMMWORD PTR [rsi+32]
vextracti128    XMMWORD PTR [rdi+48], ymm0, 0x1
vmovdqu XMMWORD PTR [rdi+32], xmm0
vmovdqu ymm0, YMMWORD PTR [rsi+64]
vextracti128    XMMWORD PTR [rdi+80], ymm0, 0x1
vmovdqu XMMWORD PTR [rdi+64], xmm0
vmovdqu ymm0, YMMWORD PTR [rsi+96]
vextracti128    XMMWORD PTR [rdi+112], ymm0, 0x1
vmovdqu XMMWORD PTR [rdi+96], xmm0
vzeroupper
ret
copy_addr:
vmovdqu ymm0, YMMWORD PTR [rsi]
vmovdqu YMMWORD PTR [rdi], ymm0
vmovdqu ymm1, YMMWORD PTR [rsi+32]
vmovdqu YMMWORD PTR [rdi+32], ymm1
vmovdqu ymm2, YMMWORD PTR [rsi+64]
vmovdqu YMMWORD PTR [rdi+64], ymm2
vmovdqu ymm3, YMMWORD PTR [rsi+96]
vmovdqu YMMWORD PTR [rdi+96], ymm3
vzeroupper
ret
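
The two functions were presumably compiled from something like the
following harness (reconstructed for context; the original test source
was not included in the mail):

#include <string.h>
#include <rte_memcpy.h>

/* Hypothetical Godbolt harness: the same 128-byte copy via the DPDK
 * inline path and via the plain compiler builtin. */
void rte_copy_addr(void *dst, const void *src)
{
	rte_memcpy(dst, src, 128);
}

void copy_addr(void *dst, const void *src)
{
	memcpy(dst, src, 128);
}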


Re: [RFC 1/7] eal: extend bit manipulation functions

2024-03-02 Thread Mattias Rönnblom

On 2024-03-02 18:05, Stephen Hemminger wrote:

On Sat, 2 Mar 2024 14:53:22 +0100
Mattias Rönnblom  wrote:


diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index 449565eeae..9a368724d5 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -2,6 +2,7 @@
   * Copyright(c) 2020 Arm Limited
   * Copyright(c) 2010-2019 Intel Corporation
   * Copyright(c) 2023 Microsoft Corporation
+ * Copyright(c) 2024 Ericsson AB
   */
  


Unless this is coming from another project code base, the common
practice is not to add copyright for each contributor in later versions.



Unless it's a large contribution (compared to the rest of the file)?

I guess that's why the 916c50d commit adds the Microsoft copyright notice.


+/**
+ * Test if a particular bit in a 32-bit word is set.
+ *
+ * This function does not give any guarantees in regards to memory
+ * ordering or atomicity.
+ *
+ * @param addr
+ *   A pointer to the 32-bit word to query.
+ * @param nr
+ *   The index of the bit (0-31).
+ * @return
+ *   Returns true if the bit is set, and false otherwise.
+ */
+static inline bool
+rte_bit_test32(const uint32_t *addr, unsigned int nr);


Is it possible to reorder these inlines to avoid having
forward declarations?



Yes, but I'm not sure it's a net gain.

A statement expression macro seems like a perfect tool for the job, but
then MSVC doesn't support statement expressions. You could also have a
macro that just generates the function body, as opposed to the whole function.


I'll consider whether I should just bite the bullet and expand all the
macros: 4x duplication.
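
One possible shape of the body-generating macro (a sketch under the
constraints above, not the RFC code; only the declared semantics of
rte_bit_test32() are taken from the RFC):

#include <stdbool.h>
#include <stdint.h>

/* Generate only the shared body; each function stays a plain inline
 * function, so no forward declarations and no statement expressions
 * (which MSVC lacks) are needed. */
#define __RTE_BIT_TEST_BODY(addr, nr) \
	((bool)((*(addr) >> (nr)) & 1))

static inline bool
rte_bit_test32(const uint32_t *addr, unsigned int nr)
{
	return __RTE_BIT_TEST_BODY(addr, nr);
}

static inline bool
rte_bit_test64(const uint64_t *addr, unsigned int nr)
{
	return __RTE_BIT_TEST_BODY(addr, nr);
}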



Also, new functions should be marked __rte_experimental
for a release or two.


Yes, thanks.


Re: [RFC 7/7] eal: deprecate relaxed family of bit operations

2024-03-02 Thread Mattias Rönnblom

On 2024-03-02 18:07, Stephen Hemminger wrote:

On Sat, 2 Mar 2024 14:53:28 +0100
Mattias Rönnblom  wrote:


diff --git a/lib/eal/include/rte_bitops.h b/lib/eal/include/rte_bitops.h
index b5a9df5930..783dd0e1ee 100644
--- a/lib/eal/include/rte_bitops.h
+++ b/lib/eal/include/rte_bitops.h
@@ -1179,6 +1179,10 @@ __RTE_GEN_BIT_ATOMIC_OPS(64)
   *   The address holding the bit.
   * @return
   *   The target bit.
+ * @note
+ *   This function is deprecated. Use rte_bit_test32(),
+ *   rte_bit_once_test32(), or rte_bit_atomic_test32() instead,
+ *   depending on exactly what guarantees are required.
   */
  static inline uint32_t
  rte_bit_relaxed_get32(unsigned int nr, volatile uint32_t *addr)


The DPDK process is:
- mark these as deprecated in release notes of release N.
- mark these as deprecated using __rte_deprecated in next LTS
- drop these in LTS release after that.

Don't use notes for this.


Don't use notes to replace the above process, or don't use notes at all?

A note seems useful to me, especially considering there is a choice to 
be made (not just mindlessly replacing one call with another).


Anyway, release notes updates have to wait so I'll just drop this patch 
for now.
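
For reference, step two of the process Stephen describes would look
roughly like this in the header (a sketch; the body shown is
illustrative, the real one is unchanged by the marking):

#include <stdint.h>
#include <rte_common.h>

/* Deprecation step 2: the attribute makes every use emit a compiler
 * warning for the duration of the deprecation period. */
__rte_deprecated
static inline uint32_t
rte_bit_relaxed_get32(unsigned int nr, volatile uint32_t *addr)
{
	uint32_t mask = UINT32_C(1) << nr;

	return (*addr) & mask;
}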


Re: [PATCH] rte_memcpy: fix off by one for size 16 and 32

2024-03-02 Thread Mattias Rönnblom

On 2024-03-02 21:56, Stephen Hemminger wrote:

On Sat,  2 Mar 2024 12:49:23 -0800
Stephen Hemminger  wrote:


The rte_memcpy code would do extra instructions for size 16
and 32 which potentially could reference past end of data.

For size of 16, only a single mov16 is needed.
Same for size of 32, only a single mov32.

Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")

Suggested-by: Morten Brørup 
Signed-off-by: Stephen Hemminger 


Self-NAK, more is needed here.

The code has lots of pre-existing bugs where it will reference past the end
of the data in some cases.


Memory beyond the buffer is not accessed in this case. The rte_mov16() 
copies just overlap.


A colleague pointed out the same "bug" to me a couple of years ago. We 
didn't realize what code would be generated in the n == 16 case though. 
That seems very much worth fixing.


Maybe it's worth adding a comment regarding the overlap.
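
For example, such a comment might read (a suggestion, not from the thread):

/* Copy the first and the last 16 bytes of the buffer; for n < 32 the
 * two ranges overlap, which is harmless, avoids a branch on n, and
 * never touches memory outside [0, n). */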


Re: [PATCH] rte_memcpy: fix off by one for size 16 and 32

2024-03-02 Thread Mattias Rönnblom

On 2024-03-02 21:49, Stephen Hemminger wrote:

The rte_memcpy code would do extra instructions for size 16
and 32 which potentially could reference past end of data.

For size of 16, only a single mov16 is needed.
Same for size of 32, only a single mov32.

Fixes: f5472703c0bd ("eal: optimize aligned memcpy on x86")
Fixes: d35cc1fe6a7a ("eal/x86: revert select optimized memcpy at run-time")

Suggested-by: Morten Brørup 
Signed-off-by: Stephen Hemminger 
---
  lib/eal/x86/include/rte_memcpy.h | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/eal/x86/include/rte_memcpy.h b/lib/eal/x86/include/rte_memcpy.h
index 72a92290e05d..00479a4bfbe6 100644
--- a/lib/eal/x86/include/rte_memcpy.h
+++ b/lib/eal/x86/include/rte_memcpy.h
@@ -233,13 +233,15 @@ rte_memcpy_generic(void *dst, const void *src, size_t n)
 */
if (n <= 32) {
rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-   rte_mov16((uint8_t *)dst - 16 + n,
+   if (n > 16)
+   rte_mov16((uint8_t *)dst - 16 + n,
  (const uint8_t *)src - 16 + n);
return ret;
}
if (n <= 64) {
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-   rte_mov32((uint8_t *)dst - 32 + n,
+   if (n > 32)


n is always > 32 here.


+   rte_mov32((uint8_t *)dst - 32 + n,
  (const uint8_t *)src - 32 + n);
return ret;
}