[RFC v2 0/2] add pointer compression API

2023-10-11 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8-byte-aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers; any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, adds minimal overhead.

The API accepts and returns arrays because the per-call overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two
cores. The gain depends on the bulk operation size. In this synthetic test
run on an Ampere Altra a substantial (up to 25%) performance gain is seen
for bulk sizes larger than 32. At 32 it breaks even and lower sizes incur
a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the DPDK l3 forwarding
example in pipeline mode this translated into a ~5% throughput
increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch

Paul Szczepanek (2):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test

 .mailmap   |   1 +
 app/test/test_ring.h   |  59 +-
 app/test/test_ring_perf.c  | 324 ++---
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 160 ++
 5 files changed, 421 insertions(+), 124 deletions(-)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

-- 
2.25.1



[RFC v2 1/2] eal: add pointer compression functions

2023-10-11 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 160 +
 3 files changed, 162 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 864d33ee46..3f0c9d32f5 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1058,6 +1058,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index a0463efac7..17d8373648 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h 
b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..73bde22973
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,160 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned to 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE
+   svuint64_t v_src_table;
+   svuint64_t v_dest_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_src_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_dest_table = svsub_x(pg, v_src_table, (uint64_t)ptr_base);
+   v_dest_table = svlsr_x(pg, v_dest_table, bit_shift);
+   svst1w(pg, &dest_table[i], v_dest_table);
+   i += svcntd();
+   pg = svwhilelt_b64(i, n);
+   } while (svptest_any(svptrue_b64(), pg));
+#elif defined __ARM_NEON
+   uint64_t ptr_diff;
+   uint64x2_t v_src_table;
+   uint64x2_t v_dest_table;
+   /* right shift is done by left shifting by negative int */
+   int64x2_t v_shift = vdupq_n_s64(-bit_shift);
+   uint64x2_t v_ptr_base = vdupq_n_u64((uint64_t)ptr_base);
+   for (; i < (n & ~0x1); i += 2) {
+   v_src_table = vld1q_u64((const uint64_t *)src_table + i);
+   v_dest_table = vsubq_u64(v_src_table, v_ptr_base);
+   v_dest_table = vshlq_u64(v_dest_table, v_shift);
+   vst1_u32(dest_table + i, vqmovn_u64(v_dest_table));
+   }
+   /* process leftover single item in case of odd number of n */
+   if (unlikely(n & 0x1)) {
+   ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base);
+   dest_table[i] = (uint32_t) (ptr_diff >> bit_shift);
+   }
+#else
+   uintptr_t ptr_diff;
+   for (; i < n; i++) {
+           ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base);
+           /* save extra bits that are redundant due to alignment */
+           ptr_diff = ptr_diff >> bit_shift;
+           /* make sure no truncation will happen when casting */
+           RTE_ASSERT(ptr_diff <= UINT32_MAX);
+           dest_table[i] = (uint32_t) ptr_diff;
+   }
+#endif
+}

[RFC v2 2/2] test: add pointer compress tests to ring perf test

2023-10-11 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads in order to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration counts to account for bulk sizes
so that runtime (rather than the number of operations) stays constant.

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  59 ++-
 app/test/test_ring_perf.c | 324 +++---
 2 files changed, 259 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..e8b7525c23 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */
 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,9 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32
 
+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 128
+
 #define TEST_RING_IGNORE_API_TYPE ~0U
 
 /* This function is placed here as it is required for both
@@ -101,6 +106,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +160,29 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +193,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -213,6 +247,29 @@ test_ring_dequeue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mc_dequeue_burst_elem(r, obj, esize,
n, NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_dequeue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy((char *)obj, zcd.ptr1, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy((char *)obj + zcd.n1 * esize,
+   zcd.ptr2,
+   (ret - zcd.n1) * esize);
+   rte_ring_dequeue_zc_finish(r, ret);
+   return ret;

Re: [RFC 1/2] eal: add pointer compression functions

2023-10-11 Thread Paul Szczepanek

On 11/10/2023 14:36, Honnappa Nagarahalli wrote:

-Original Message-
From: Thomas Monjalon 
Sent: Monday, October 9, 2023 10:54 AM
To: Paul Szczepanek 
Cc: dev@dpdk.org; Honnappa Nagarahalli ;
Kamalakshitha Aligeri 
Subject: Re: [RFC 1/2] eal: add pointer compression functions

[...]

I see it is providing some per-CPU optimizations, so it is in favor of
having it in DPDK.
Other than that, it looks very generic, so it is questionable to have in DPDK.

We had it done for mbuf pointers. But then, we thought it could be generic.

Right now the API results in 32b indices. We could make it more generic by
also allowing 16b indices. 8b indices do not make sense.


To add to this, I think this being generic is a good thing.

I think it belongs in DPDK as it will make it easy for other
architectures to add their versions and maintain the abstraction.




Re: [PATCH v2] config/arm: update aarch32 build with gcc13

2023-10-12 Thread Paul Szczepanek



On 09/10/2023 10:53, Juraj Linkeš wrote:

The aarch32 with gcc13 fails with:

Compiler for C supports arguments -march=armv8-a: NO

../config/arm/meson.build:714:12: ERROR: Problem encountered: No
suitable armv8 march version found.

This is because we test -march=armv8-a alone (without the -mpfu option),
which is no longer supported in gcc13 aarch32 builds.

The most recent recommendation from the compiler team is to build with
-march=armv8-a+simd -mfpu=auto, which should work for compilers old and
new. The suggestion is to first check -march=armv8-a+simd and only then
check -mfpu=auto.

To address this, add a way to force the architecture (the value of
the -march option).

Signed-off-by: Juraj Linkeš 
Acked-by: Ruifeng Wang 
---
  config/arm/meson.build | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 3f22d8a2fc..5303d0e969 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -43,7 +43,9 @@ implementer_generic = {
  },
  'generic_aarch32': {
  'march': 'armv8-a',
-'compiler_options': ['-mfpu=neon'],
+'force_march': true,
+'march_features': ['simd'],
+'compiler_options': ['-mfpu=auto'],
  'flags': [
  ['RTE_ARCH_ARM_NEON_MEMCPY', false],
  ['RTE_ARCH_STRICT_ALIGN', true],
@@ -711,7 +713,11 @@ if update_flags
  endif
  endforeach
  if candidate_march == ''
-error('No suitable armv8 march version found.')
+if part_number_config.get('force_march', false)
+candidate_march = part_number_config['march']
+else
+error('No suitable armv8 march version found.')
+endif
This section is only used when no candidate is found, so this would make
it not really a forced arch but more of a fallback arch. If we want the
user to be able to truly force the march string we'd need to put the
"is forced?" check higher. Am I reading the code right?

  endif
  if candidate_march != part_number_config['march']
  warning('Configuration march version is ' +
@@ -741,7 +747,7 @@ if update_flags
  # apply supported compiler options
  if part_number_config.has_key('compiler_options')
  foreach flag: part_number_config['compiler_options']
-if cc.has_argument(flag)
+if cc.has_multi_arguments(machine_args + [flag])
  machine_args += flag
  else
  warning('Configuration compiler option ' +


Re: [PATCH v3 1/2] doc: increase python max line length to 100

2023-10-12 Thread Paul Szczepanek



On 28/09/2023 13:18, Juraj Linkeš wrote:

Unify with C recommendations which allow line length of up to 100
characters.

Signed-off-by: Owen Hilyard 
Signed-off-by: Juraj Linkeš 
---
  .editorconfig| 2 +-
  doc/doc_build/meson-private/meson.lock   | 0
  doc/guides/contributing/coding_style.rst | 3 +++
  dts/pyproject.toml   | 4 ++--
  4 files changed, 6 insertions(+), 3 deletions(-)
  create mode 100644 doc/doc_build/meson-private/meson.lock

diff --git a/.editorconfig b/.editorconfig
index ab41c95085..1e7d74213f 100644
--- a/.editorconfig
+++ b/.editorconfig
@@ -16,7 +16,7 @@ max_line_length = 100
  [*.py]
  indent_style = space
  indent_size = 4
-max_line_length = 79
+max_line_length = 100
  
  [meson.build]

  indent_style = space
diff --git a/doc/doc_build/meson-private/meson.lock 
b/doc/doc_build/meson-private/meson.lock
new file mode 100644
index 00..e69de29bb2
diff --git a/doc/guides/contributing/coding_style.rst 
b/doc/guides/contributing/coding_style.rst
index 648849899d..a42cd3d58d 100644
--- a/doc/guides/contributing/coding_style.rst
+++ b/doc/guides/contributing/coding_style.rst
@@ -880,6 +880,9 @@ All Python code should be compliant with
  `PEP8 (Style Guide for Python Code) 
`_.
  
  The ``pep8`` tool can be used for testing compliance with the guidelines.

+Note that line lengths are acceptable up to 100 characters, which is in line 
with C recommendations.
+
+..


Presumably the bare ".." is some accidental leftover markup.


  
  Integrating with the Build System

  -
diff --git a/dts/pyproject.toml b/dts/pyproject.toml
index 6762edfa6b..980ac3c7db 100644
--- a/dts/pyproject.toml
+++ b/dts/pyproject.toml
@@ -41,7 +41,7 @@ build-backend = "poetry.core.masonry.api"
  [tool.pylama]
  linters = "mccabe,pycodestyle,pyflakes"
  format = "pylint"
-max_line_length = 88 # 
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length
+max_line_length = 100
  
  [tool.mypy]

  python_version = "3.10"
@@ -55,4 +55,4 @@ profile = "black"
  [tool.black]
  target-version = ['py310']
  include = '\.pyi?$'
-line-length = 88 # 
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length
+line-length = 100


The rest looks good. Wholeheartedly support longer line length.




Re: [PATCH v1] dts: add Dockerfile

2023-10-17 Thread Paul Szczepanek



On 03/11/2022 13:46, Juraj Linkeš wrote:

The Dockerfile defines development and CI runner images.

Signed-off-by: Juraj Linkeš 
Acked-by: Jeremy Spweock 
---
  dts/.devcontainer/devcontainer.json | 30 
  dts/Dockerfile  | 38 
  dts/README.md   | 55 +
  3 files changed, 123 insertions(+)
  create mode 100644 dts/.devcontainer/devcontainer.json
  create mode 100644 dts/Dockerfile
  create mode 100644 dts/README.md

[..]


Acked-by: Paul Szczepanek 



[RFC 0/2] add pointer compression API

2023-09-27 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8-byte-aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers; any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, adds minimal overhead.

The API accepts and returns arrays because the per-call overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two
cores. The gain depends on the bulk operation size. In this synthetic test
run on an Ampere Altra a substantial (up to 25%) performance gain is seen
for bulk sizes larger than 32. At 32 it breaks even and lower sizes incur
a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the DPDK l3 forwarding
example in pipeline mode this translated into a ~5% throughput
increase on an Ampere Altra.

Paul Szczepanek (2):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test

 .mailmap   |   1 +
 app/test/test_ring.h   |  59 +-
 app/test/test_ring_perf.c  | 324 ++---
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 158 ++
 5 files changed, 419 insertions(+), 124 deletions(-)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

-- 
2.25.1



[RFC 1/2] eal: add pointer compression functions

2023-09-27 Thread Paul Szczepanek
Add a new utility header for compressing pointers. Pointers are
compressed by taking advantage of their locality. Instead of
storing the full address only an offset from a known base is stored.

The provided functions can store pointers as 32-bit offsets.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 158 +
 3 files changed, 160 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 864d33ee46..3f0c9d32f5 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1058,6 +1058,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index a0463efac7..60b056ef96 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -35,6 +35,7 @@ headers += files(
 'rte_pci_dev_feature_defs.h',
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
+   'rte_ptr_compress.h',
 'rte_pflock.h',
 'rte_random.h',
 'rte_reciprocal.h',
diff --git a/lib/eal/include/rte_ptr_compress.h 
b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..6498587c0b
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef _RTE_PTR_COMPRESS_H_
+#define _RTE_PTR_COMPRESS_H_
+
+/**
+ * @file
+ * RTE pointer compression and decompression.
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note Offsets from the base pointer must fit within 32 bits. Alignment allows
+ * us to drop bits from the offsets - this means that pointers aligned to
+ * 8 bytes must be within 32GB of the base pointer. Unaligned pointers
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE
+   svuint64_t v_src_table;
+   svuint64_t v_dest_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_src_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_dest_table = svsub_x(pg, v_src_table, (uint64_t)ptr_base);
+   v_dest_table = svlsr_x(pg, v_dest_table, bit_shift);
+   svst1w(pg, &dest_table[i], v_dest_table);
+   i += svcntd();
+   pg = svwhilelt_b64(i, n);
+   } while (svptest_any(svptrue_b64(), pg));
+#elif defined __ARM_NEON
+   uint64_t ptr_diff;
+   uint64x2_t v_src_table;
+   uint64x2_t v_dest_table;
+   /* right shift is done by left shifting by negative int */
+   int64x2_t v_shift = vdupq_n_s64(-bit_shift);
+   uint64x2_t v_ptr_base = vdupq_n_u64((uint64_t)ptr_base);
+   for (; i < (n & ~0x1); i += 2) {
+   v_src_table = vld1q_u64((const uint64_t *)src_table + i);
+   v_dest_table = vsubq_u64(v_src_table, v_ptr_base);
+   v_dest_table = vshlq_u64(v_dest_table, v_shift);
+   vst1_u32(dest_table + i, vqmovn_u64(v_dest_table));
+   }
+   /* process leftover single item in case of odd number of n */
+   if (unlikely(n & 0x1)) {
+   ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base);
+   dest_table[i] = (uint32_t) (ptr_diff >> bit_shift);
+   }
+#else
+   uint64_t ptr_diff;
+   for (; i < n; i++) {
+   ptr_diff = RTE_PTR_DIFF(src_table[i], ptr_base);
+   /* save extra bits that are redundant due to alignment */
+   ptr_diff = ptr_diff >> bit_shift;
+   /* make sure no truncation will happen when casting */
+   RTE_ASSERT(ptr_diff <= UINT32_MAX);
+   dest_table[i] = (uint32_t) ptr_diff;
+   }
+#endif
+}
+
+/**
+ * Decompress po

[RFC 2/2] test: add pointer compress tests to ring perf test

2023-09-27 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of the newly added pointer compression APIs.

To reuse existing code, some refactoring was done to pass more
parameters to test threads. Additionally, more bulk sizes were
added to showcase their effects on compression. To keep runtime
reasonable, iterations were adjusted to take bulk sizes into account.

Old printfs are adjusted to match the new ones, which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  59 ++-
 app/test/test_ring_perf.c | 324 +++---
 2 files changed, 259 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..e8b7525c23 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */
 
 #include 
 #include 
 #include 
+#include 
+#include 
 
 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,9 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32
 
+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 128
+
 #define TEST_RING_IGNORE_API_TYPE ~0U
 
 /* This function is placed here as it is required for both
@@ -101,6 +106,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +160,29 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +193,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -213,6 +247,29 @@ test_ring_dequeue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mc_dequeue_burst_elem(r, obj, esize,
n, NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_dequeue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy((char *)obj, zcd.ptr1, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy((char *)obj + zcd.n1 * esize,
+   zcd.ptr2,
+   (ret - zcd.n1) * esize);
+   rte_ring_dequeue_zc_finish(r, ret);
+   return ret;

Re: [PATCH v7 4/4] test-pmd: add more packet verbose decode options

2024-08-22 Thread Paul Szczepanek
On 20/08/2024 16:54, Stephen Hemminger wrote:
> On Tue, 20 Aug 2024 14:42:56 +0100
> Alex Chapman  wrote:
> 
>> Hi Stephen,
>>
>> I have gone through your patch series and the hexdump option would be 
>> quite valuable for use in DTS.
>>
>> However I am currently facing the issue of distinguishing noise packets 
>> from intentional packets within the verbose output. Prior to your patch, 
>> the intention was to use the Layer 4 port to distinguish between them, 
>> however with the hexdump option, the plan is to now use a custom payload.
>>
>> The one issue is that verbose level 5 does not print the required
>> RSS hash and RSS queue information.
> 
> The queue is there, in the output. Not sure if the hash matters.
> I wanted to keep to tshark format as much as possible.
> 

We appreciate that but the RSS info is valuable to us. Seeing as the
different enum values offer different info rather than different levels,
maybe we could change the enum to flags:

+enum verbose_mode {
+   VERBOSE_OFF = 0,
+   VERBOSE_RX = 0x1,
+   VERBOSE_TX = 0x2,
+   VERBOSE_BOTH = VERBOSE_RX | VERBOSE_TX,
+   VERBOSE_DISSECT = 0x8,
+   VERBOSE_HEX = 0x10,
+   VERBOSE_RSS = 0x20
+};

Then the flags can be ORed together:

verbose_flags = VERBOSE_RX | VERBOSE_TX | VERBOSE_HEX | VERBOSE_RSS

And then, instead of a switch, each print is an if that checks whether its
flag is set in verbose_flags. This way you only get the RSS info if you
request it.


Re: [PATCH v5 0/4] add pointer compression API

2024-02-22 Thread Paul Szczepanek
For some reason your email is not visible to me, even though it's in the
archive.

On 02/11/2024 16:32, Konstantin Ananyev (konstantin.v.ananyev) wrote:


From one side the code itself is very small and straightforward; from the
other side it is not clear to me what the intended usage for it within
DPDK and its appliances is.
Konstantin


The intended usage is explained in the cover email (see below) and demonstrated
in the test supplied in the following patch - when sending arrays of pointers
between cores as it happens in a forwarding example.

On 01/11/2023 18:12, Paul Szczepanek wrote:


This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8-byte-aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers; any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, adds minimal overhead.

The API accepts and returns arrays because the per-call overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two
cores. The gain depends on the bulk operation size. In this synthetic test
run on an Ampere Altra a substantial (up to 25%) performance gain is seen
for bulk sizes larger than 32. At 32 it breaks even and lower sizes incur
a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the DPDK l3 forwarding
example in pipeline mode on two cores this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size

Paul Szczepanek (4):
   eal: add pointer compression functions
   test: add pointer compress tests to ring perf test
   docs: add pointer compression to the EAL guide
   test: add unit test for ptr compression

  .mailmap  |   1 +
  app/test/meson.build  |   1 +
  app/test/test_eal_ptr_compress.c  | 108 ++
  app/test/test_ring.h  |  94 -
  app/test/test_ring_perf.c | 354 --
  .../prog_guide/env_abstraction_layer.rst  | 142 +++
  lib/eal/include/meson.build   |   1 +
  lib/eal/include/rte_ptr_compress.h| 266 +
  8 files changed, 843 insertions(+), 124 deletions(-)
  create mode 100644 app/test/test_eal_ptr_compress.c
  create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v6 1/4] eal: add pointer compression functions

2024-02-29 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

This can be used, for example, when passing caches full of pointers
between threads. Memory containing the pointers is copied multiple
times, which is especially costly between cores. This compression
method allows us to shrink the memory size copied. Further
commits add a test to evaluate the effectiveness of this approach.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 266 +
 3 files changed, 268 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 3f5bab26a8..004751d27a 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1069,6 +1069,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..ce2c733633 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..47a72e4213
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,266 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting them
+ * to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For usage example and further explanation please see "Pointer Compression" in
+ * doc/guides/prog_guide/env_abstraction_layer.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32
+   svuint64_t v_ptr_table;
+ 

[PATCH v6 0/4] add pointer compression API

2024-02-29 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, incurs minimal overhead.

The API accepts and returns arrays because the fixed overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test, run on
an Ampere Altra, a substantial (up to 25%) performance gain is seen for bulk
sizes larger than 32. At 32 it breaks even, and lower sizes incur a small
(less than 5%) slowdown due to overhead.

In a more realistic mock application, running the l3 forwarding DPDK
example in pipeline mode on two cores, this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to the original numbers to keep runtime short
* fixed a pointer size warning on 32-bit arch
v3:
* added 16-bit versions of the compression functions and tests
* added documentation of the new utility functions in the EAL guide
v4:
* added a unit test
* fixed a bug in the NEON implementation of 32-bit decompress
v5:
* disabled the NEON and SVE implementations on AARCH32 due to wrong pointer size
v6:
* added example usage to the commit message of the initial commit

Paul Szczepanek (4):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test
  docs: add pointer compression to the EAL guide
  test: add unit test for ptr compression

 .mailmap  |   1 +
 app/test/meson.build  |   1 +
 app/test/test_eal_ptr_compress.c  | 108 ++
 app/test/test_ring.h  |  94 -
 app/test/test_ring_perf.c | 354 --
 .../prog_guide/env_abstraction_layer.rst  | 142 +++
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_ptr_compress.h| 266 +
 8 files changed, 843 insertions(+), 124 deletions(-)
 create mode 100644 app/test/test_eal_ptr_compress.c
 create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v6 2/4] test: add pointer compress tests to ring perf test

2024-02-29 Thread Paul Szczepanek
Add a test that runs a zero-copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase the performance benefits of the newly added pointer
compression APIs.

Refactored the threading code to pass more parameters to threads in
order to reuse existing code. Added more bulk sizes to showcase their
effect on compression. Adjusted loop iteration counts to take bulk
sizes into account, keeping runtime (rather than the number of
operations) constant.

Adjusted old printfs to match the new ones, which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  94 +-
 app/test/test_ring_perf.c | 354 +-
 2 files changed, 324 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..3b00f2465d 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */

 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_16(0, obj, zcd.ptr1, zcd.n1 * 2, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_16(0,
+   obj + (zcd.n1 * 2),
+   zcd.ptr2,
+   (ret - zcd.n1) * 2, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret * 2;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +211,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
 

[PATCH v6 3/4] docs: add pointer compression to the EAL guide

2024-02-29 Thread Paul Szczepanek
Documentation added to the EAL guide for the new
pointer compression utility functions,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 .../prog_guide/env_abstraction_layer.rst  | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6debf54efb..f04d032442 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -1192,3 +1192,145 @@ will not be deallocated.

 Any successful deallocation event will trigger a callback, for which user
 applications and other DPDK subsystems can register.
+
+.. _pointer_compression:
+
+Pointer Compression
+---
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3 bits
+can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allows the offset to be stored in
+the number of bits indicated by the function name (16 or 32). The start of
+mempool memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not include error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

[PATCH v6 4/4] test: add unit test for ptr compression

2024-02-29 Thread Paul Szczepanek
The test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies that the pointers are
recovered correctly.

Signed-off-by: Paul Szczepanek 
---
 app/test/meson.build |   1 +
 app/test/test_eal_ptr_compress.c | 108 +++
 2 files changed, 109 insertions(+)
 create mode 100644 app/test/test_eal_ptr_compress.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 4183d66b0e..3e172b154d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -66,6 +66,7 @@ source_file_deps = {
 'test_dmadev_api.c': ['dmadev'],
 'test_eal_flags.c': [],
 'test_eal_fs.c': [],
+'test_eal_ptr_compress.c': [],
 'test_efd.c': ['efd', 'net'],
 'test_efd_perf.c': ['efd', 'hash'],
 'test_errno.c': [],
diff --git a/app/test/test_eal_ptr_compress.c b/app/test/test_eal_ptr_compress.c
new file mode 100644
index 00..c1c9a98be7
--- /dev/null
+++ b/app/test/test_eal_ptr_compress.c
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define PTRS_SIZE 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_eal_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[PTRS_SIZE] = {0};
+   void *ptrs_out[PTRS_SIZE] = {0};
+   uint32_t offsets32[PTRS_SIZE] = {0};
+   uint16_t offsets16[PTRS_SIZE] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32(base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16(base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_eal_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < PTRS_SIZE; n++) {
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_16[j],
+   j /* exponent of alignment */,
+   n,
+   false
+   );
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_32[j],
+   j /* exponent of alignment */,
+   n,
+   true
+   );
+   if (ret != 0)
+   return ret;
+   }
+   }
+   }
+
+   return ret;
+}
+
+REGISTER_FAST_TEST(eal_ptr_compress_autotest, true, true, test_eal_ptr_compress);
--
2.25.1



[PATCH v7 1/4] eal: add pointer compression functions

2024-03-01 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

This can be used, for example, when passing caches full of pointers
between threads. Memory containing the pointers is copied multiple
times, which is especially costly between cores. This compression
method allows us to shrink the memory size copied. Further
commits add a test to evaluate the effectiveness of this approach.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 266 +
 2 files changed, 267 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..ce2c733633 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..47a72e4213
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,266 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting them
+ * to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For usage example and further explanation please see "Pointer Compression" in
+ * doc/guides/prog_guide/env_abstraction_layer.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32
+   svuint64_t v_ptr_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_ptr_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_ptr_table = svsub_x(pg, v_ptr_table, (uint64_t)ptr_base);
+   v_ptr_table = svlsr_x(pg, v_ptr_table, bit_shift);
+   svst1w(p

[PATCH v7 0/4] add pointer compression API

2024-03-01 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, incurs minimal overhead.

The API accepts and returns arrays because the fixed overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test, run on
an Ampere Altra, a substantial (up to 25%) performance gain is seen for bulk
sizes larger than 32. At 32 it breaks even, and lower sizes incur a small
(less than 5%) slowdown due to overhead.

In a more realistic mock application, running the l3 forwarding DPDK
example in pipeline mode on two cores, this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to the original numbers to keep runtime short
* fixed a pointer size warning on 32-bit arch
v3:
* added 16-bit versions of the compression functions and tests
* added documentation of the new utility functions in the EAL guide
v4:
* added a unit test
* fixed a bug in the NEON implementation of 32-bit decompress
v5:
* disabled the NEON and SVE implementations on AARCH32 due to wrong pointer size
v6:
* added example usage to the commit message of the initial commit
v7:
* rebased to remove clashing mailmap changes

Paul Szczepanek (4):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test
  docs: add pointer compression to the EAL guide
  test: add unit test for ptr compression

 app/test/meson.build  |   1 +
 app/test/test_eal_ptr_compress.c  | 108 ++
 app/test/test_ring.h  |  94 -
 app/test/test_ring_perf.c | 354 --
 .../prog_guide/env_abstraction_layer.rst  | 142 +++
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_ptr_compress.h| 266 +
 7 files changed, 842 insertions(+), 124 deletions(-)
 create mode 100644 app/test/test_eal_ptr_compress.c
 create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v7 2/4] test: add pointer compress tests to ring perf test

2024-03-01 Thread Paul Szczepanek
Add a test that runs a zero-copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase the performance benefits of the newly added pointer
compression APIs.

Refactored the threading code to pass more parameters to threads in
order to reuse existing code. Added more bulk sizes to showcase their
effect on compression. Adjusted loop iteration counts to take bulk
sizes into account, keeping runtime (rather than the number of
operations) constant.

Adjusted old printfs to match the new ones, which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  94 +-
 app/test/test_ring_perf.c | 354 +-
 2 files changed, 324 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..3b00f2465d 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */

 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_16(0, obj, zcd.ptr1, zcd.n1 * 2, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_16(0,
+   obj + (zcd.n1 * 2),
+   zcd.ptr2,
+   (ret - zcd.n1) * 2, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret * 2;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +211,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
 

[PATCH v7 3/4] docs: add pointer compression to the EAL guide

2024-03-01 Thread Paul Szczepanek
Documentation added to the EAL guide for the new
pointer compression utility functions,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 .../prog_guide/env_abstraction_layer.rst  | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9559c12a98..aaa1665c4b 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -1192,3 +1192,145 @@ will not be deallocated.

 Any successful deallocation event will trigger a callback, for which user
 applications and other DPDK subsystems can register.
+
+.. _pointer_compression:
+
+Pointer Compression
+---
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3 bits
+can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise, any pointer
+that precedes all pointers, is close enough, and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~~~~~~~~~~~~~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<<ALIGN_EXPONENT)

[PATCH v7 4/4] test: add unit test for ptr compression

2024-03-01 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly.

Signed-off-by: Paul Szczepanek 
---
 app/test/meson.build |   1 +
 app/test/test_eal_ptr_compress.c | 108 +++
 2 files changed, 109 insertions(+)
 create mode 100644 app/test/test_eal_ptr_compress.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..0d1b777199 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -68,6 +68,7 @@ source_file_deps = {
 'test_dmadev_api.c': ['dmadev'],
 'test_eal_flags.c': [],
 'test_eal_fs.c': [],
+'test_eal_ptr_compress.c': [],
 'test_efd.c': ['efd', 'net'],
 'test_efd_perf.c': ['efd', 'hash'],
 'test_errno.c': [],
diff --git a/app/test/test_eal_ptr_compress.c b/app/test/test_eal_ptr_compress.c
new file mode 100644
index 00..c1c9a98be7
--- /dev/null
+++ b/app/test/test_eal_ptr_compress.c
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define PTRS_SIZE 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_eal_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[PTRS_SIZE] = {0};
+   void *ptrs_out[PTRS_SIZE] = {0};
+   uint32_t offsets32[PTRS_SIZE] = {0};
+   uint16_t offsets16[PTRS_SIZE] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32(base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16(base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_eal_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < PTRS_SIZE; n++) {
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_16[j],
+   j /* exponent of alignment */,
+   n,
+   false
+   );
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_32[j],
+   j /* exponent of alignment */,
+   n,
+   true
+   );
+   if (ret != 0)
+   return ret;
+   }
+   }
+   }
+
+   return ret;
+}
+
REGISTER_FAST_TEST(eal_ptr_compress_autotest, true, true, test_eal_ptr_compress);
--
2.25.1



Re: RE: [PATCH v5 0/4] add pointer compression API

2024-05-15 Thread Paul Szczepanek
On 04/03/2024 14:44, Konstantin Ananyev wrote:
>> This feature is targeted for pipeline mode of applications. We see many 
>> customers using pipeline mode. This feature helps in reducing
>> the cost of transferring the packets between cores by reducing the copies 
>> involved.
> 
> I do understand the intention, and I am not arguing about usefulness of the 
> pipeline model. 
> My point is you are introducing new API: compress/decompress pointers,
> but don't provide (or even describe) any proper way for the developer to use 
> it in a safe and predictable manner.
> Which from my perspective make it nearly useless and misleading.

In the latest version there is an example in the docs showing how to use
it. There is an integration test that shows how to use it. The comments
in the header also provide detailed guidance.

>> For an application with multiple pools, it depends on how the applications 
>> are using multiple pools. But, if there is a bunch of packets
>> belonging to multiple mempools, compressing those mbufs may not be possible. 
>> But if those mbufs are grouped per mempool and
>> are transferred on different queues, then it is possible. Hence the APIs are 
>> implemented very generically.
> 
> Ok, let's consider even more simplistic scenario - all pointers belong to one 
> mempool.
> AFAIK, even one mempool can contain elements from different memzones,
> and these memzones are not guaranteed to have consecutive VAs.
> So even one mempool, with total size <=4GB can contain elements with 
> distances between them more than 4GB. 
> Now let say at startup user created a mempool, how he can determine 
> programmatically
> can he apply your compress API safely on it or not?
> I presume that if you are serious about this API usage, then such ability has 
> to be provided.
> Something like:
> 
> int compress_pointer_deduce_mempool_base(const struct rte_memepool *mp[],
>   uint32_t nb_mp, uint32_t compress_size, uintptr_t *base_ptr);
> 
> Or probably even more generic one:
> 
> struct mem_buf {uintptr_t base, size_t len;}; 
> int compress_pointer_deduce_base(const struct mem_buf *mem_buf[],
>   uint32_t nb_membuf, uint32_t compress_size, uintptr_t *base_ptr);
> 
> Even with these functions in-place, user has to be extra careful:
>  - he can't add new memory chunks to these mempools (or he'll need to 
> re-calcualte the new base_ptr)
>  - he needs to make sure that pointers from only these mempools will be used 
> by compress/decompress.
> But at least it provides some ability to use this feature in real apps.
> 
> With such API in place it should be possible to make the auto-test more 
> realistic:
> - allocate mempool 
> - deduce base_pointer
> - then we can have a loop with producer/consumer to mimic realistic workload.
> As an example:
>  producer(s):  mempool_alloc(); ; 
> ring_enqueue();  
>  consumer(s): ring_dequeue(); ; free_mbuf();
> - free mempool

I understand your objections and agree that the proposed API is limited
in its applicability due to its strict requirements.

AFAIK DPDK rte_mempool does require the addresses to be virtually
contiguous as the memory reservation is done during creation of the
mempool and a single memzone is reserved. However, I do not require
users to use the rte_mempool as the API accepts pointers so other
mempool implementations could indeed allow non-contiguous VAs.

To help decide at compile time if compression is allowed, I will add
extra macros to the header:

#define BITS_REQUIRED_TO_STORE_VALUE(x) \
((x) == 0 ? 1 : (sizeof(x) * CHAR_BIT - __builtin_clzl(x)))

#define BIT_SHIFT_FROM_ALIGNMENT(x) ((x) == 0 ? 0 : __builtin_ctzl(x))

#define CAN_USE_RTE_PTR_COMPRESS_16(memory_range, object_alignment) \
((BITS_REQUIRED_TO_STORE_VALUE(memory_range) - \
BIT_SHIFT_FROM_ALIGNMENT(object_alignment)) <= 16 ? 1 : 0)

#define CAN_USE_RTE_PTR_COMPRESS_32(memory_range, object_alignment) \
((BITS_REQUIRED_TO_STORE_VALUE(memory_range) - \
BIT_SHIFT_FROM_ALIGNMENT(object_alignment)) <= 32 ? 1 : 0)

And explain usage in the docs.

The API is very generic and does not even require you to use a mempool.
There are no runtime checks to verify or calculate if the pointers can
be compressed. This is because doing this at runtime would remove any
gains achieved through compression. The code doing the compression needs
to remain limited in size, branching and execution time to remain fast.

This is IMHO the nature of C applications: they trade off runtime checks
for performance. Program correctness needs to be enforced through other
means: linting, valgrind, tests, peer review, etc. It is up to the
programmer to calculate and decide on the viability of compression as it
cannot be done at compile time automatically. There is no way for me to
programmatically verify the alignment and distance of the pointers being
passed in at compile time as I don't require the user to use any
particular mempool implementation.

These limitation

Re: [PATCH v5 0/4] add pointer compression API

2024-05-16 Thread Paul Szczepanek
On 15/05/2024 23:34, Morten Brørup wrote:
>> From: Paul Szczepanek [mailto:paul.szczepa...@arm.com]
>>
>> AFAIK DPDK rte_mempool does require the addresses to be virtually
>> contiguous as the memory reservation is done during creation of the
>> mempool and a single memzone is reserved.
> 
> No, it does not.
> rte_pktmbuf_pool_create() first creates an empty mempool using 
> rte_mempool_create_empty(), and then populates it using 
> rte_mempool_populate_default(), which may use multiple memzones.
> 
> As one possible solution to this, the application can call 
> rte_pktmbuf_pool_create() as usual, and then check that mp->nb_mem_chunks == 
> 1 to confirm that all objects in the created mempool reside in one contiguous 
> chunk of memory and thus compression can be used.
> 
> Or even better, add a new mempool flag to the mempool library, e.g. 
> RTE_MEMPOOL_F_COMPRESSIBLE, specifying that the mempool objects must be 
> allocated as one chunk of memory with contiguous addresses.
> Unfortunately, rte_pktmbuf_pool_create() is missing the memzone flags 
> parameter, so a new rte_pktmbuf_pool_create() API with the flags parameter 
> added would also need to be added.
> 

You're right, my misunderstanding stemmed from only one mz being stored
in the rte_mempool struct, but nb_mem_chunks is in fact the variable to
check in mempool to verify contiguous VAs. I'll look into the
possibility of adding a mempool contiguous option to mempool.

> 
> For future proofing, please rename the compression functions to include the 
> compression algorithm, i.e. "shift" or similar, in the function names.
> 
> Specifically I'm thinking about an alternative "multiply" compression 
> algorithm based on division/multiplication by a constant "multiplier" 
> parameter (instead of the "bit_shift" parameter).
> This "multiplier" would typically be the object size of the packet mbuf 
> mempool.
> The "multiplier" could be constant at built time, e.g. 2368, or determined at 
> runtime.
> I don't know the performance of division/multiplication compared to bit 
> shifting for various CPUs, but it would make compression to 16 bit compressed 
> pointers viable for more applications.
> 
> The perf test in this series could be used to determine 
> compression/decompression performance of such an algorithm, and the 
> application developer can determine which algorithm to use; "shift" with 32 
> bit compressed pointers, or "multiply" with 16 bit compressed pointers.
> 

Will add shift to name.


Re: [PATCH v5 0/4] add pointer compression API

2024-05-24 Thread Paul Szczepanek
I have added macros to help find the parameters and I have added mempool
functions that allow you to determine if you can use the mempool and
what params it needs. The new mempool functions are mentioned in the
docs for ptr compress.

Please take a look at v11.

I did not add a new make pkt buf API that takes mempool flags as this
would be a much bigger task requiring a lot of new APIs and I think this
is something that is useful but not directly related to this patchset.
It's something I could develop separately without having to worry
about the 24.07 deadline.


[PATCH v11 0/6] add pointer compression API

2024-05-24 Thread Paul Szczepanek
This patchset proposes adding a new header-only library
with utility functions that allow compression of arrays of pointers.

Since this is a header-only library, a patch needed to be added to amend
the build system to allow adding libraries without source files.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region. We can compress them by converting them
to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the overhead is only worth it
when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two cores.
It shows the gain, which depends on the bulk operation size. In this
synthetic test run on an Ampere Altra, a substantial (up to 25%) performance
gain is seen for bulk sizes larger than 32. At 32 it breaks even, and lower
sizes create a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the L3 forwarding DPDK
example in pipeline mode on two cores, this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos
v9:
* added MAINTAINERS entries, release notes, doc indexes etc.
* added patch for build system to allow header only library
v10:
* fixed problem with meson build adding shared deps to static deps
v11:
* added mempool functions to get information about memory range and
alignment
* added tests for the new mempool functions
* added macros to help find the parameters for compression functions
* minor improvement in the SVE compression code
* amended documentation to reflect these changes

Paul Szczepanek (6):
  lib: allow libraries with no sources
  mempool: add functions to get extra mempool info
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 MAINTAINERS|   6 +
 app/test/meson.build   |  21 +-
 app/test/test_mempool.c|  61 
 app/test/test_ptr_compress.c   | 110 +++
 app/test/test_ring.h   |  94 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 ++
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/mempool/rte_mempool.c  |  39 +++
 lib/mempool/rte_mempool.h  |  37 +++
 lib/mempool/version.map|   3 +
 lib/meson.build|  17 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 278 
 17 files changed, 1058 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

--
2.25.1



[PATCH v11 1/6] lib: allow libraries with no sources

2024-05-24 Thread Paul Szczepanek
Allow header-only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Jack Bond-Preston 
Acked-by: Bruce Richardson 
---
 lib/meson.build | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..7c90602bf5 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,22 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

+# special case for header only libraries
+if sources.length() == 0
+shared_dep = declare_dependency(include_directories: includes,
+dependencies: shared_deps)
+static_dep = declare_dependency(include_directories: includes,
+dependencies: static_deps)
+set_variable('shared_rte_' + name, shared_dep)
+set_variable('static_rte_' + name, static_dep)
+dpdk_shared_lib_deps += shared_dep
+dpdk_static_lib_deps += static_dep
+if developer_mode
+message('lib/@0@: Defining dependency "@1@"'.format(l, name))
+endif
+continue
+endif
+
 if developer_mode and is_windows and use_function_versioning
        message('@0@: Function versioning is not supported by Windows.'.format(name))
 endif
--
2.25.1



[PATCH v11 2/6] mempool: add functions to get extra mempool info

2024-05-24 Thread Paul Szczepanek
Add two functions:
- rte_mempool_get_mem_range - get virtual memory range
of the objects in the mempool,
- rte_mempool_get_obj_alignment - get alignment of
objects in the mempool.

Add two tests that test these new functions.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Jack Bond-Preston 
Reviewed-by: Yoan Picchi 
Reviewed-by: Nathan Brown 
---
 app/test/test_mempool.c   | 61 +++
 lib/mempool/rte_mempool.c | 39 +
 lib/mempool/rte_mempool.h | 37 
 lib/mempool/version.map   |  3 ++
 4 files changed, 140 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index ad7ebd6363..16eeeb899c 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -843,12 +843,16 @@ test_mempool(void)
int ret = -1;
uint32_t nb_objs = 0;
uint32_t nb_mem_chunks = 0;
+   void *start = NULL;
+   size_t length = 0;
+   size_t alignment = 0;
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
struct rte_mempool *mp_stack_anon = NULL;
struct rte_mempool *mp_stack_mempool_iter = NULL;
struct rte_mempool *mp_stack = NULL;
struct rte_mempool *default_pool = NULL;
+   struct rte_mempool *mp_alignment = NULL;
struct mp_data cb_arg = {
.ret = -1
};
@@ -967,6 +971,62 @@ test_mempool(void)
}
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);

+   if (rte_mempool_get_mem_range(default_pool, &start, &length)) {
+   printf("cannot get mem range from default mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(NULL, NULL, NULL) != -EINVAL) {
+   printf("rte_mempool_get_mem_range failed to return -EINVAL "
+   "when passed invalid arguments\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (start == NULL || length < (MEMPOOL_SIZE * MEMPOOL_ELT_SIZE)) {
+   printf("mem range of default mempool is invalid\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* by default mempool objects are aligned by RTE_MEMPOOL_ALIGN */
+   alignment = rte_mempool_get_obj_alignment(default_pool);
+   if (alignment != RTE_MEMPOOL_ALIGN) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)RTE_MEMPOOL_ALIGN, alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   /* create a mempool with a RTE_MEMPOOL_F_NO_CACHE_ALIGN flag */
+   mp_alignment = rte_mempool_create("test_alignment", MEMPOOL_SIZE,
+   MEMPOOL_ELT_SIZE, 0, 0,
+   NULL, NULL,
+   my_obj_init, NULL,
+   SOCKET_ID_ANY, RTE_MEMPOOL_F_NO_CACHE_ALIGN);
+
+   if (mp_alignment == NULL) {
+   printf("cannot allocate mempool with "
+   "RTE_MEMPOOL_F_NO_CACHE_ALIGN flag\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* mempool was created with RTE_MEMPOOL_F_NO_CACHE_ALIGN
+* and minimum alignment is expected which is sizeof(uint64_t)
+*/
+   alignment = rte_mempool_get_obj_alignment(mp_alignment);
+   if (alignment != sizeof(uint64_t)) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)sizeof(uint64_t), alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   alignment = rte_mempool_get_obj_alignment(NULL);
+   if (alignment != 0) {
+   printf("rte_mempool_get_obj_alignment failed to return 0 for "
+   "an invalid mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
/* retrieve the mempool from its name */
if (rte_mempool_lookup("test_nocache") != mp_nocache) {
printf("Cannot lookup mempool from its name\n");
@@ -1039,6 +1099,7 @@ test_mempool(void)
rte_mempool_free(mp_stack_mempool_iter);
rte_mempool_free(mp_stack);
rte_mempool_free(default_pool);
+   rte_mempool_free(mp_alignment);

return ret;
 }
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 12390a2c81..7a4bafb664 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -1386,6 +1386,45 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
rte_mcfg_mempool_read_unlock();
 }

+int rte_mempool_get_mem_range(struct rte_mempool *mp,
+   void **mem_range_start, size_t *mem_range_length)
+{
+   if (mp == NULL || mem_range_start == NULL || mem_range_length == NULL)
+

[PATCH v11 3/6] ptr_compress: add pointer compression library

2024-05-24 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   4 +
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 278 +
 7 files changed, 294 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c9adff9846..27b2f03e6c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1694,6 +1694,10 @@ M: Chenbo Xia 
 M: Gaetan Rivet 
 F: lib/pci/

+Pointer Compression
+M: Paul Szczepanek 
+F: lib/ptr_compress/
+
 Power management
 M: Anatoly Burakov 
 M: David Hunt 
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..f9283154f8 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,6 +222,7 @@ The public API headers are grouped by topics:
   [config file](@ref rte_cfgfile.h),
   [key/value args](@ref rte_kvargs.h),
   [argument parsing](@ref rte_argparse.h),
+  [ptr_compress](@ref rte_ptr_compress.h),
   [string](@ref rte_string_fns.h),
   [thread](@ref rte_thread.h)

diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index 27afec8b3b..a8823c046f 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -71,6 +71,7 @@ INPUT   = @TOPDIR@/doc/api/doxy-api-index.md \
   @TOPDIR@/lib/pipeline \
   @TOPDIR@/lib/port \
   @TOPDIR@/lib/power \
+  @TOPDIR@/lib/ptr_compress \
   @TOPDIR@/lib/rawdev \
   @TOPDIR@/lib/rcu \
   @TOPDIR@/lib/regexdev \
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index a69f24cf99..4711792e61 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -55,6 +55,11 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===

+* **Introduced pointer compression library.**
+
+  The library provides functions to compress and decompress arrays of pointers,
+  which can improve application performance under certain conditions.
+  A performance test was added to help users evaluate performance on their setup.

 Removed Items
 -
diff --git a/lib/meson.build b/lib/meson.build
index 7c90602bf5..63becee142 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..f697f66075
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,278 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region. We compress them by converting them to offsets from
+ * a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To dete

[PATCH v11 4/6] test: add pointer compress tests to ring perf test

2024-05-24 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  94 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 334 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..9e97c5e3e7 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,47 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+ 

[PATCH v11 5/6] docs: add pointer compression guide

2024-05-24 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 +
 3 files changed, 162 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 27b2f03e6c..ed50121bd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
 M: Anatoly Burakov 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index d09d958e6c..6366849eb0 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -73,6 +73,7 @@ Programmer's Guide
 telemetry_lib
 bpf_lib
 graph_lib
+ptr_compress_lib
 build-sdk-meson
 meson_ut
 build_app
diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..349da2695e
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,160 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+Pointer Compression Library
+===========================
+
+Use ``rte_ptr_compress_16_shift()`` and ``rte_ptr_decompress_16_shift()`` to
+compress and decompress pointers into 16-bit offsets.
+Use ``rte_ptr_compress_32_shift()`` and ``rte_ptr_decompress_32_shift()`` to
+compress and decompress pointers into 32-bit offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned, 3 bits
+can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allows the offset to be stored in
+the number of bits indicated by the function name (16 or 32). The start of the
+mempool memory would be a good candidate for the base pointer. Otherwise any
+pointer that precedes all pointers, is close enough and has the same alignment
+as the pointers being compressed will work.
+
+Macros present in the rte_ptr_compress.h header may be used to evaluate whether
+compression is possible:
+
+*   BITS_REQUIRED_TO_STORE_VALUE
+
+*   BIT_SHIFT_FROM_ALIGNMENT
+
+*   CAN_USE_RTE_PTR_compress_16_shift
+
+*   CAN_USE_RTE_PTR_compress_32_shift
+
+These will help you calculate compression parameters and determine whether
+they are legal for a particular compression function.
+
+If using an rte_mempool you can get the parameters you need to use in the
+compression macros and functions by using ``rte_mempool_get_mem_range()``
+and ``rte_mempool_get_obj_alignment()``.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~~~~~~~~~~~~~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not include error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

[PATCH v11 6/6] test: add unit test for ptr compression

2024-05-24 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS  |   1 +
 app/test/meson.build |   1 +
 app/test/test_ptr_compress.c | 110 +++
 3 files changed, 112 insertions(+)
 create mode 100644 app/test/test_ptr_compress.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ed50121bd2..2565ef5f4b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: app/test/test_ptr_compress.c
 F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
diff --git a/app/test/meson.build b/app/test/meson.build
index df8cc00730..e29258e6ec 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -144,6 +144,7 @@ source_file_deps = {
 'test_power_intel_uncore.c': ['power'],
 'test_power_kvm_vm.c': ['power'],
 'test_prefetch.c': [],
+'test_ptr_compress.c': ['ptr_compress'],
 'test_rand_perf.c': [],
 'test_rawdev.c': ['rawdev', 'bus_vdev'],
 'test_rcu_qsbr.c': ['rcu', 'hash'],
diff --git a/app/test/test_ptr_compress.c b/app/test/test_ptr_compress.c
new file mode 100644
index 00..a917331d60
--- /dev/null
+++ b/app/test/test_ptr_compress.c
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define MAX_PTRS 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[MAX_PTRS] = {0};
+   void *ptrs_out[MAX_PTRS] = {0};
+   uint32_t offsets32[MAX_PTRS] = {0};
+   uint16_t offsets16[MAX_PTRS] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32_shift(
+   base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32_shift(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16_shift(
+   base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16_shift(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < MAX_PTRS; n++) {
+   ret |= test_ptr_compress_params(
+   bases[k],
+   region_sizes_16[j],
+   j /* exponent of alignment */,
+   n,
+   false
+   );
+

Re: [PATCH v5 0/4] add pointer compression API

2024-05-28 Thread Paul Szczepanek



On 24/05/2024 10:09, Konstantin Ananyev wrote:
> 
> 
>> I have added macros to help find the parameters and I have added mempool
>> functions that allow you to determine if you can use the mempool and
>> what params it needs. The new mempool functions are mentioned in the
>> docs for ptr compress.
>> Please take a look at v11.
> 
> Great, thanks.
> Will try to have a look in next few days. 
> With these functions in place, can we produce a unit-test that
> will use together these new mempool functions and compress API? 
> Something like: 
> - allocate mempool 
> - deduce base_pointer for it
> - main_loop_start:
> producer(s): mempool_get(); <compress>; ring_enqueue();
> consumer(s): ring_dequeue(); <decompress>; mempool_put();
> - main_loop_end
> - free mempool

The v11 already includes mempool base pointer and range calculation in
the mempool test and the functions are mentioned in the ptr compress lib
docs. The ptr compress test doesn't use a mempool to minimise dependencies.

I have a v12 pending (awaiting internal reviews) that addresses Morten's
comments (adds prefix, adds tests and doxygen for all the macros, uses
rte_bitops) and a fix for the guide which had the wrong letter case for
the MACRO.


Re: [PATCH v11 2/6] mempool: add functions to get extra mempool info

2024-05-28 Thread Paul Szczepanek


On 24/05/2024 13:20, Morten Brørup wrote:
>> From: Paul Szczepanek [mailto:paul.szczepa...@arm.com]
>> Sent: Friday, 24 May 2024 10.37
>>
>> +size_t rte_mempool_get_obj_alignment(struct rte_mempool *mp)
>> +{
>> +if (mp == NULL)
>> +return 0;
>> +
>> +if (mp->flags & RTE_MEMPOOL_F_NO_CACHE_ALIGN)
>> +return sizeof(uint64_t);
>> +else
>> +return RTE_MEMPOOL_ALIGN;
>> +}
> 
> The object alignment depends on the underlying mempool driver. You cannot 
> assume that it is either sizeof(uint64_t) or cache line aligned.
> 
> Refer to the calc_mem_size driver operation, which also provides object 
> alignment information:
> https://elixir.bootlin.com/dpdk/v24.03/source/lib/mempool/rte_mempool.h#L529
> 
> If you need this function, you need to add a new driver operation, and your 
> function above can be the default for this operation, like for the 
> calc_mem_size driver operation:
> https://elixir.bootlin.com/dpdk/v24.03/source/lib/mempool/rte_mempool_ops.c#L120
> 

As discussed on slack the alignment you mention is the memzone alignment
which is distinct from the object alignment which is enforced by the
mempool according to the RTE_MEMPOOL_F_NO_CACHE_ALIGN flag. Objects may
have higher alignment, the alignment returned by the new function is the
minimum guaranteed one.

I addressed your other comments in v12 (pending internal review).


[PATCH v12 1/6] lib: allow libraries with no sources

2024-05-29 Thread Paul Szczepanek
Allow header-only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Acked-by: Bruce Richardson 
---
 lib/meson.build | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..7c90602bf5 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,22 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

+# special case for header only libraries
+if sources.length() == 0
+shared_dep = declare_dependency(include_directories: includes,
+dependencies: shared_deps)
+static_dep = declare_dependency(include_directories: includes,
+dependencies: static_deps)
+set_variable('shared_rte_' + name, shared_dep)
+set_variable('static_rte_' + name, static_dep)
+dpdk_shared_lib_deps += shared_dep
+dpdk_static_lib_deps += static_dep
+if developer_mode
+message('lib/@0@: Defining dependency "@1@"'.format(l, name))
+endif
+continue
+endif
+
 if developer_mode and is_windows and use_function_versioning
message('@0@: Function versioning is not supported by Windows.'.format(name))
 endif
--
2.25.1



[PATCH v12 2/6] mempool: add functions to get extra mempool info

2024-05-29 Thread Paul Szczepanek
Add two functions:
- rte_mempool_get_mem_range - get virtual memory range
of the objects in the mempool,
- rte_mempool_get_obj_alignment - get alignment of
objects in the mempool.

Add two tests that test these new functions.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Jack Bond-Preston 
Reviewed-by: Nathan Brown 
---
 app/test/test_mempool.c   | 61 +++
 lib/mempool/rte_mempool.c | 39 +
 lib/mempool/rte_mempool.h | 37 
 lib/mempool/version.map   |  3 ++
 4 files changed, 140 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index ad7ebd6363..973f4318a8 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -843,12 +843,16 @@ test_mempool(void)
int ret = -1;
uint32_t nb_objs = 0;
uint32_t nb_mem_chunks = 0;
+   void *start = NULL;
+   size_t length = 0;
+   size_t alignment = 0;
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
struct rte_mempool *mp_stack_anon = NULL;
struct rte_mempool *mp_stack_mempool_iter = NULL;
struct rte_mempool *mp_stack = NULL;
struct rte_mempool *default_pool = NULL;
+   struct rte_mempool *mp_alignment = NULL;
struct mp_data cb_arg = {
.ret = -1
};
@@ -967,6 +971,62 @@ test_mempool(void)
}
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);

+   if (rte_mempool_get_mem_range(default_pool, &start, &length)) {
+   printf("cannot get mem range from default mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(NULL, NULL, NULL) != -EINVAL) {
+   printf("rte_mempool_get_mem_range failed to return -EINVAL "
+   "when passed invalid arguments\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (start == NULL || length < (MEMPOOL_SIZE * MEMPOOL_ELT_SIZE)) {
+   printf("mem range of default mempool is invalid\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* by default mempool objects are aligned by RTE_MEMPOOL_ALIGN */
+   alignment = rte_mempool_get_obj_alignment(default_pool);
+   if (alignment != RTE_MEMPOOL_ALIGN) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)RTE_MEMPOOL_ALIGN, alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   /* create a mempool with a RTE_MEMPOOL_F_NO_CACHE_ALIGN flag */
+   mp_alignment = rte_mempool_create("test_alignment", MEMPOOL_SIZE,
+   MEMPOOL_ELT_SIZE, 0, 0,
+   NULL, NULL,
+   my_obj_init, NULL,
+   SOCKET_ID_ANY, RTE_MEMPOOL_F_NO_CACHE_ALIGN);
+
+   if (mp_alignment == NULL) {
+   printf("cannot allocate mempool with "
+   "RTE_MEMPOOL_F_NO_CACHE_ALIGN flag\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* mempool was created with RTE_MEMPOOL_F_NO_CACHE_ALIGN
+* and minimum alignment is expected which is sizeof(uint64_t)
+*/
+   alignment = rte_mempool_get_obj_alignment(mp_alignment);
+   if (alignment != sizeof(uint64_t)) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)sizeof(uint64_t), alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   alignment = rte_mempool_get_obj_alignment(NULL);
+   if (alignment != 0) {
+   printf("rte_mempool_get_obj_alignment failed to return 0 for "
+   " an invalid mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
/* retrieve the mempool from its name */
if (rte_mempool_lookup("test_nocache") != mp_nocache) {
printf("Cannot lookup mempool from its name\n");
@@ -1039,6 +1099,7 @@ test_mempool(void)
rte_mempool_free(mp_stack_mempool_iter);
rte_mempool_free(mp_stack);
rte_mempool_free(default_pool);
+   rte_mempool_free(mp_alignment);

return ret;
 }
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 12390a2c81..7a4bafb664 100644
--- a/lib/mempool/rte_mempool.c
+++ b/lib/mempool/rte_mempool.c
@@ -1386,6 +1386,45 @@ void rte_mempool_walk(void (*func)(struct rte_mempool *, void *),
rte_mcfg_mempool_read_unlock();
 }

+int rte_mempool_get_mem_range(struct rte_mempool *mp,
+   void **mem_range_start, size_t *mem_range_length)
+{
+   if (mp == NULL || mem_range_start == NULL || mem_range_length == NULL)
+   return -EINVAL;
+
+

[PATCH v12 0/6] add pointer compression API

2024-05-29 Thread Paul Szczepanek
This patchset proposes adding a new header-only library
with utility functions that allow compression of arrays of pointers.

Since this is a header-only library, a patch was needed to amend
the build system to allow adding libraries without source files.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region. We can compress them by converting them
to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the overhead means compression is
only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
It shows the gain, which is dependent on the bulk operation size. In this
synthetic test run on an Ampere Altra a substantial (up to 25%) performance
gain is seen if done in bulk sizes larger than 32. At 32 it breaks even and
lower sizes create a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the l3 forwarding DPDK
example that works in pipeline mode on two cores, this translated into a
~5% throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos
v9
* added MAINTAINERS entries, release notes, doc indexes etc.
* added patch for build system to allow header only library
v10
* fixed problem with meson build adding shared deps to static deps
v11
* added mempool functions to get information about memory range and
alignment
* added tests for the new mempool functions
* added macros to help find the parameters for compression functions
* minor improvement in the SVE compression code
* amended documentation to reflect these changes
v12
* added doxygen and prefixes to macros
* use rte_bitops for clz and ctz
* added unit tests to verify macros
* fixed incorrect letter case in docs

Paul Szczepanek (6):
  lib: allow libraries with no sources
  mempool: add functions to get extra mempool info
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 MAINTAINERS|   6 +
 app/test/meson.build   |  21 +-
 app/test/test_mempool.c|  61 
 app/test/test_ptr_compress.c   | 193 +++
 app/test/test_ring.h   |  94 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 ++
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/mempool/rte_mempool.c  |  39 +++
 lib/mempool/rte_mempool.h  |  37 +++
 lib/mempool/version.map|   3 +
 lib/meson.build|  17 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +++
 17 files changed, 1187 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

--
2.25.1



[PATCH v12 4/6] test: add pointer compress tests to ring perf test

2024-05-29 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  94 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 334 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..9e97c5e3e7 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,47 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+ 

[PATCH v12 3/6] ptr_compress: add pointer compression library

2024-05-29 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   4 +
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +
 7 files changed, 340 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c9adff9846..27b2f03e6c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1694,6 +1694,10 @@ M: Chenbo Xia 
 M: Gaetan Rivet 
 F: lib/pci/

+Pointer Compression
+M: Paul Szczepanek 
+F: lib/ptr_compress/
+
 Power management
 M: Anatoly Burakov 
 M: David Hunt 
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..f9283154f8 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,6 +222,7 @@ The public API headers are grouped by topics:
   [config file](@ref rte_cfgfile.h),
   [key/value args](@ref rte_kvargs.h),
   [argument parsing](@ref rte_argparse.h),
+  [ptr_compress](@ref rte_ptr_compress.h),
   [string](@ref rte_string_fns.h),
   [thread](@ref rte_thread.h)

diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index 27afec8b3b..a8823c046f 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -71,6 +71,7 @@ INPUT   = @TOPDIR@/doc/api/doxy-api-index.md \
   @TOPDIR@/lib/pipeline \
   @TOPDIR@/lib/port \
   @TOPDIR@/lib/power \
+  @TOPDIR@/lib/ptr_compress \
   @TOPDIR@/lib/rawdev \
   @TOPDIR@/lib/rcu \
   @TOPDIR@/lib/regexdev \
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index a69f24cf99..4711792e61 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -55,6 +55,11 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===

+* **Introduced pointer compression library.**
+
+  The library provides functions to compress and decompress arrays of pointers,
+  which can improve application performance under certain conditions.
+  A performance test was added to help users evaluate performance on their setup.

 Removed Items
 -
diff --git a/lib/meson.build b/lib/meson.build
index 7c90602bf5..63becee142 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..11586246a0
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,324 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region. We compress them by converting them to offsets from
+ * a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To dete

[PATCH v12 6/6] test: add unit test for ptr compression

2024-05-29 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly. Additionally, it tests that the helper macros perform
their calculations correctly.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS  |   1 +
 app/test/meson.build |   1 +
 app/test/test_ptr_compress.c | 193 +++
 3 files changed, 195 insertions(+)
 create mode 100644 app/test/test_ptr_compress.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ed50121bd2..2565ef5f4b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: app/test/test_ptr_compress.c
 F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
diff --git a/app/test/meson.build b/app/test/meson.build
index df8cc00730..e29258e6ec 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -144,6 +144,7 @@ source_file_deps = {
 'test_power_intel_uncore.c': ['power'],
 'test_power_kvm_vm.c': ['power'],
 'test_prefetch.c': [],
+'test_ptr_compress.c': ['ptr_compress'],
 'test_rand_perf.c': [],
 'test_rawdev.c': ['rawdev', 'bus_vdev'],
 'test_rcu_qsbr.c': ['rcu', 'hash'],
diff --git a/app/test/test_ptr_compress.c b/app/test/test_ptr_compress.c
new file mode 100644
index 00..ab33815974
--- /dev/null
+++ b/app/test/test_ptr_compress.c
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define MAX_PTRS 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[MAX_PTRS] = {0};
+   void *ptrs_out[MAX_PTRS] = {0};
+   uint32_t offsets32[MAX_PTRS] = {0};
+   uint16_t offsets16[MAX_PTRS] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32_shift(
+   base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32_shift(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16_shift(
+   base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16_shift(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   /* the test is run with multiple memory regions and base addresses */
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   /* main test compresses and decompresses arrays of pointers
+* and compares the array before and after to verify that
+* pointers are successfully decompressed
+*/
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < MAX_PTRS; n++) {
+

[PATCH v12 5/6] docs: add pointer compression guide

2024-05-29 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 +
 3 files changed, 162 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 27b2f03e6c..ed50121bd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
 M: Anatoly Burakov 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index d09d958e6c..6366849eb0 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -73,6 +73,7 @@ Programmer's Guide
 telemetry_lib
 bpf_lib
 graph_lib
+ptr_compress_lib
 build-sdk-meson
 meson_ut
 build_app
diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..49e94e6c4e
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,160 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+Pointer Compression Library
+===========================
+
+Use ``rte_ptr_compress_16_shift()`` and ``rte_ptr_decompress_16_shift()`` to
+compress and decompress pointers into 16-bit offsets.
+Use ``rte_ptr_compress_32_shift()`` and ``rte_ptr_decompress_32_shift()`` to
+compress and decompress pointers into 32-bit offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. The number of bytes
+needed to store the offset is dictated by the memory region size and the
+alignment of the objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then
+3 bits can be dropped from the offset and a 32GB memory pool can now fit in
+32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allows the offset to be stored in
+the number of bits indicated by the function name (16 or 32). The start of
+mempool memory is a good candidate for the base pointer. Otherwise, any
+pointer that precedes all the pointers being compressed, is close enough and
+has the same alignment as them will work.
+
+Macros present in the ``rte_ptr_compress.h`` header may be used to evaluate
+whether compression is possible:
+
+*   ``RTE_PTR_COMPRESS_BITS_REQUIRED_TO_STORE_VALUE_IN_RANGE``
+
+*   ``RTE_PTR_COMPRESS_BIT_SHIFT_FROM_ALIGNMENT``
+
+*   ``RTE_PTR_COMPRESS_CAN_COMPRESS_16_SHIFT``
+
+*   ``RTE_PTR_COMPRESS_CAN_COMPRESS_32_SHIFT``
+
+These will help you calculate compression parameters and determine whether
+they are legal for a particular compression function.
+
+If using an rte_mempool you can get the parameters you need to use in the
+compression macros and functions by using ``rte_mempool_get_mem_range()``
+and ``rte_mempool_get_obj_alignment()``.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~~~~~~~~~~~~~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<<ALIGN_EXPONENT)

Re: [PATCH v5 0/4] add pointer compression API

2024-05-29 Thread Paul Szczepanek



On 28/05/2024 20:29, Paul Szczepanek wrote:
> 
> 
> On 24/05/2024 10:09, Konstantin Ananyev wrote:
>>
>>
>>> I have added macros to help find the parameters and I have added mempool
>>> functions that allow you to determine if you can use the mempool and
>>> what params it needs. The new mempool functions are mentioned in the
>>> docs for ptr compress.
>>> Please take a look at v11.
>>
>> Great, thanks.
>> Will try to have a look in next few days. 
>> With these functions in place, can we produce a unit-test that
>> will use together these new mempool functions and compress API? 
>> Something like: 
>> - allocate mempool 
>> - deduce base_pointer for it
>> - main_loop_start:
>> producer(s):  mempool_get(); ; 
>> ring_enqueue();  
>> consumer(s): ring_dequeue(); ; mempool_put();
>> - main_loop_end
>> - free mempool
> 
> The v11 already includes mempool base pointer and range calculation in
> the mempool test and the functions are mentioned in the ptr compress lib
> docs. The ptr compress test doesn't use a mempool to minimise dependencies.
> 
> I have a v12 pending (awaiting internal reviews) that addresses Morten's
> comments (adds prefix, adds tests and doxygen for all the macros, uses
> rte_bitops) and a fix for the guide which had the wrong letter case for
> the MACRO.

v12 is now up ready for your review. I hope that the explanation in the
ptr compress guide document is enough to show users how to use mempool
pointers with it. Between the guide and the doxygen it should be clear
what values are needed as parameters of the compression function.


Re: [PATCH] mempool: dump includes list of memory chunks

2024-05-29 Thread Paul Szczepanek



On 16/05/2024 09:59, Morten Brørup wrote:
> Added information about the memory chunks holding the objects in the
> mempool when dumping the status of the mempool to a file.
> 
> Signed-off-by: Morten Brørup 
> ---
>  lib/mempool/rte_mempool.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
> index 12390a2c81..e9a8a5b411 100644
> --- a/lib/mempool/rte_mempool.c
> +++ b/lib/mempool/rte_mempool.c
> @@ -1230,6 +1230,7 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
>  #endif
>   struct rte_mempool_memhdr *memhdr;
>   struct rte_mempool_ops *ops;
> + unsigned int n;
>   unsigned common_count;
>   unsigned cache_count;
>   size_t mem_len = 0;
> @@ -1264,6 +1265,15 @@ rte_mempool_dump(FILE *f, struct rte_mempool *mp)
>   (long double)mem_len / mp->size);
>   }
>  
> + fprintf(f, "  mem_list:\n");
> + n = 0;
> + STAILQ_FOREACH(memhdr, &mp->mem_list, next) {
> + fprintf(f, "addr[%u]=%p\n", n, memhdr->addr);
> + fprintf(f, "iova[%u]=0x%" PRIx64 "\n", n, memhdr->iova);
> + fprintf(f, "len[%u]=%zu\n", n, memhdr->len);
> + n++;
> + }
> +
>   cache_count = rte_mempool_dump_cache(f, mp);
>   common_count = rte_mempool_ops_get_count(mp);
>   if ((cache_count + common_count) > mp->size)

It's useful information to dump. Maybe consider adding something akin to
RTE_LIBRTE_MEMPOOL_STATS to gate this in case the prints become
overwhelming due to a high number of list elements.

Reviewed-by: Paul Szczepanek 


Re: [PATCH v12 2/6] mempool: add functions to get extra mempool info

2024-05-29 Thread Paul Szczepanek



On 29/05/2024 14:56, Morten Brørup wrote:
>> From: Paul Szczepanek [mailto:paul.szczepa...@arm.com]
>> Sent: Wednesday, 29 May 2024 12.23
>>
>> Add two functions:
>> - rte_mempool_get_mem_range - get virtual memory range
>> of the objects in the mempool,
>> - rte_mempool_get_obj_alignment - get alignment of
>> objects in the mempool.
>>
>> Add two tests that test these new functions.
>>
>> Signed-off-by: Paul Szczepanek 
>> Reviewed-by: Jack Bond-Preston 
>> Reviewed-by: Nathan Brown 
>> ---
>>
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Get information about the memory range used by the mempool.
>> + *
>> + * @param[in] mp
>> + *   Pointer to an initialized mempool.
>> + * @param[out] mem_range_start
>> + *   Returns lowest address in mempool.
>> + * @param[out] mem_range_length
>> + *   Returns the length of the memory range containing all the addresses
>> + *   in the memory pool.
>> + * @return
>> + *   0 on success, -EINVAL if arguments are not valid.
>> + *
>> + **/
>> +__rte_experimental
>> +int rte_mempool_get_mem_range(struct rte_mempool *mp,
>> +void **mem_range_start, size_t *mem_range_length);
> 
> Paul,
> 
> Could you please add one more output parameter "bool *mem_range_contiguous" 
> to this function, returning true if the memory chunks are contiguous.
> 
> It will be useful instead of implementing get_memhdr_info() locally in this 
> other patch series:
> https://inbox.dpdk.org/dev/mw4pr11mb58724ac82a34a3eefef78e898e...@mw4pr11mb5872.namprd11.prod.outlook.com/
> 
> Please coordinate this change directly with Frank Du .
> 
> -Morten
> 

Does this work for you?

int rte_mempool_get_mem_range(struct rte_mempool *mp,
void **mem_range_start, size_t *mem_range_length,
bool *contiguous)
{
if (mp == NULL)
return -EINVAL;

void *address_low = (void *)UINTPTR_MAX;
void *address_high = 0;
size_t address_diff = 0;
size_t mem_total_size = 0;
struct rte_mempool_memhdr *hdr;

/* go through memory chunks and find the lowest and highest addresses */
STAILQ_FOREACH(hdr, &mp->mem_list, next) {
if (address_low > hdr->addr)
address_low = hdr->addr;
if (address_high < RTE_PTR_ADD(hdr->addr, hdr->len))
address_high = RTE_PTR_ADD(hdr->addr, hdr->len);
mem_total_size += hdr->len;
}

/* check if mempool was not populated yet (no memory chunks) */
if (address_low == (void *)UINTPTR_MAX)
return -EINVAL;

address_diff = (size_t)RTE_PTR_DIFF(address_high, address_low);
if (mem_range_start != NULL)
*mem_range_start = address_low;
if (mem_range_length != NULL)
*mem_range_length = address_diff;
if (contiguous != NULL)
*contiguous = (mem_total_size == address_diff) ? true : false;

return 0;
}


[PATCH v13 1/6] lib: allow libraries with no sources

2024-05-30 Thread Paul Szczepanek
Allow header only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Acked-by: Bruce Richardson 
---
 lib/meson.build | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..7c90602bf5 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,22 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

+# special case for header only libraries
+if sources.length() == 0
+shared_dep = declare_dependency(include_directories: includes,
+dependencies: shared_deps)
+static_dep = declare_dependency(include_directories: includes,
+dependencies: static_deps)
+set_variable('shared_rte_' + name, shared_dep)
+set_variable('static_rte_' + name, static_dep)
+dpdk_shared_lib_deps += shared_dep
+dpdk_static_lib_deps += static_dep
+if developer_mode
+message('lib/@0@: Defining dependency "@1@"'.format(l, name))
+endif
+continue
+endif
+
 if developer_mode and is_windows and use_function_versioning
 message('@0@: Function versioning is not supported by Windows.'.format(name))
 endif
--
2.25.1



[PATCH v13 0/6] add pointer compression API

2024-05-30 Thread Paul Szczepanek
This patchset is proposing adding a new header only library
with utility functions that allow compression of arrays of pointers.

Since this is a header only library a patch needed to be added to amend
the build system to allow adding libraries without source files.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region. We can compress them by converting them
to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the overhead means it is only
worth doing in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
It shows the gain, which is dependent on the bulk operation size. In this
synthetic test run on an Ampere Altra a substantial (up to 25%) performance
gain is seen if done in bulk sizes larger than 32. At 32 it breaks even and
lower sizes create a small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the L3 forwarding DPDK
example that works in pipeline mode on two cores this translated into a
~5% throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos
v9
* added MAINTAINERS entries, release notes, doc indexes etc.
* added patch for build system to allow header only library
v10
* fixed problem with meson build adding shared deps to static deps
v11
* added mempool functions to get information about memory range and
alignment
* added tests for the new mempool functions
* added macros to help find the parameters for compression functions
* minor improvement in the SVE compression code
* amended documentation to reflect these changes
v12
* added doxygen and prefixes to macros
* use rte_bitops for clz and ctz
* added unit tests to verify macros
* fixed incorrect letter case in docs
v13
* added contiguous parameter to rte_mempool_get_mem_range
* made rte_mempool_get_mem_range parameters optional

Paul Szczepanek (6):
  lib: allow libraries with no sources
  mempool: add functions to get extra mempool info
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 MAINTAINERS|   6 +
 app/test/meson.build   |  21 +-
 app/test/test_mempool.c|  71 +
 app/test/test_ptr_compress.c   | 193 +++
 app/test/test_ring.h   |  94 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 ++
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/mempool/rte_mempool.c  |  48 +++
 lib/mempool/rte_mempool.h  |  41 +++
 lib/mempool/version.map|   3 +
 lib/meson.build|  17 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +++
 17 files changed, 1210 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

--
2.25.1



[PATCH v13 2/6] mempool: add functions to get extra mempool info

2024-05-30 Thread Paul Szczepanek
Add two functions:
- rte_mempool_get_mem_range - get virtual memory range
of the objects in the mempool,
- rte_mempool_get_obj_alignment - get alignment of
objects in the mempool.

Add two tests that test these new functions.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Jack Bond-Preston 
Reviewed-by: Nathan Brown 
Acked-by: Morten Brørup 
---
 app/test/test_mempool.c   | 71 +++
 lib/mempool/rte_mempool.c | 48 ++
 lib/mempool/rte_mempool.h | 41 ++
 lib/mempool/version.map   |  3 ++
 4 files changed, 163 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index ad7ebd6363..f32d4a3bb9 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -843,12 +843,17 @@ test_mempool(void)
int ret = -1;
uint32_t nb_objs = 0;
uint32_t nb_mem_chunks = 0;
+   void *start = NULL;
+   size_t length = 0;
+   size_t alignment = 0;
+   bool ret_bool = false;
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
struct rte_mempool *mp_stack_anon = NULL;
struct rte_mempool *mp_stack_mempool_iter = NULL;
struct rte_mempool *mp_stack = NULL;
struct rte_mempool *default_pool = NULL;
+   struct rte_mempool *mp_alignment = NULL;
struct mp_data cb_arg = {
.ret = -1
};
@@ -967,6 +972,71 @@ test_mempool(void)
}
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);

+   if (rte_mempool_get_mem_range(default_pool, &start, &length, NULL)) {
+   printf("cannot get mem range from default mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(NULL, NULL, NULL, NULL) != -EINVAL) {
+   printf("rte_mempool_get_mem_range failed to return -EINVAL "
+   "when passed invalid arguments\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (start == NULL || length < (MEMPOOL_SIZE * MEMPOOL_ELT_SIZE)) {
+   printf("mem range of default mempool is invalid\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* by default mempool objects are aligned by RTE_MEMPOOL_ALIGN */
+   alignment = rte_mempool_get_obj_alignment(default_pool);
+   if (alignment != RTE_MEMPOOL_ALIGN) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)RTE_MEMPOOL_ALIGN, alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   /* create a mempool with a RTE_MEMPOOL_F_NO_CACHE_ALIGN flag */
+   mp_alignment = rte_mempool_create("test_alignment",
+   1, 8, /* the small size guarantees single memory chunk */
+   0, 0, NULL, NULL, my_obj_init, NULL,
+   SOCKET_ID_ANY, RTE_MEMPOOL_F_NO_CACHE_ALIGN);
+
+   if (mp_alignment == NULL) {
+   printf("cannot allocate mempool with "
+   "RTE_MEMPOOL_F_NO_CACHE_ALIGN flag\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* mempool was created with RTE_MEMPOOL_F_NO_CACHE_ALIGN
+* and minimum alignment is expected which is sizeof(uint64_t)
+*/
+   alignment = rte_mempool_get_obj_alignment(mp_alignment);
+   if (alignment != sizeof(uint64_t)) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)sizeof(uint64_t), alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   alignment = rte_mempool_get_obj_alignment(NULL);
+   if (alignment != 0) {
+   printf("rte_mempool_get_obj_alignment failed to return 0 for "
+   "an invalid mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(mp_alignment, NULL, NULL, &ret_bool)) {
+   printf("cannot get mem range from mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (!ret_bool) {
+   printf("mempool not contiguous\n");
+   GOTO_ERR(ret, err);
+   }
+
/* retrieve the mempool from its name */
if (rte_mempool_lookup("test_nocache") != mp_nocache) {
printf("Cannot lookup mempool from its name\n");
@@ -1039,6 +1109,7 @@ test_mempool(void)
rte_mempool_free(mp_stack_mempool_iter);
rte_mempool_free(mp_stack);
rte_mempool_free(default_pool);
+   rte_mempool_free(mp_alignment);

return ret;
 }
diff --git a/lib/mempool/rte_mempool.c b/lib/mempool/rte_mempool.c
index 12390a2c81..b2551572ed 100644
--- a/lib/mempool

[PATCH v13 3/6] ptr_compress: add pointer compression library

2024-05-30 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
Acked-by: Morten Brørup 
---
 MAINTAINERS|   4 +
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +
 7 files changed, 340 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c9adff9846..27b2f03e6c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1694,6 +1694,10 @@ M: Chenbo Xia 
 M: Gaetan Rivet 
 F: lib/pci/

+Pointer Compression
+M: Paul Szczepanek 
+F: lib/ptr_compress/
+
 Power management
 M: Anatoly Burakov 
 M: David Hunt 
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..f9283154f8 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,6 +222,7 @@ The public API headers are grouped by topics:
   [config file](@ref rte_cfgfile.h),
   [key/value args](@ref rte_kvargs.h),
   [argument parsing](@ref rte_argparse.h),
+  [ptr_compress](@ref rte_ptr_compress.h),
   [string](@ref rte_string_fns.h),
   [thread](@ref rte_thread.h)

diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index 27afec8b3b..a8823c046f 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -71,6 +71,7 @@ INPUT   = @TOPDIR@/doc/api/doxy-api-index.md \
   @TOPDIR@/lib/pipeline \
   @TOPDIR@/lib/port \
   @TOPDIR@/lib/power \
+  @TOPDIR@/lib/ptr_compress \
   @TOPDIR@/lib/rawdev \
   @TOPDIR@/lib/rcu \
   @TOPDIR@/lib/regexdev \
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index a69f24cf99..4711792e61 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -55,6 +55,11 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===

+* **Introduced pointer compression library.**
+
+  The library provides functions to compress and decompress arrays of pointers
+  which can improve application performance under certain conditions.
+  A performance test was added to help users evaluate performance on their setup.

 Removed Items
 -
diff --git a/lib/meson.build b/lib/meson.build
index 7c90602bf5..63becee142 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..11586246a0
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,324 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region. We compress them by converting them to offsets from
+ * a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit a

[PATCH v13 4/6] test: add pointer compress tests to ring perf test

2024-05-30 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  94 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 334 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..9e97c5e3e7 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,47 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+ 

[PATCH v13 6/6] test: add unit test for ptr compression

2024-05-30 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly. Additionally tests that the helper macros
perform calculations correctly.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS  |   1 +
 app/test/meson.build |   1 +
 app/test/test_ptr_compress.c | 193 +++
 3 files changed, 195 insertions(+)
 create mode 100644 app/test/test_ptr_compress.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ed50121bd2..2565ef5f4b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: app/test/test_ptr_compress.c
 F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
diff --git a/app/test/meson.build b/app/test/meson.build
index df8cc00730..e29258e6ec 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -144,6 +144,7 @@ source_file_deps = {
 'test_power_intel_uncore.c': ['power'],
 'test_power_kvm_vm.c': ['power'],
 'test_prefetch.c': [],
+'test_ptr_compress.c': ['ptr_compress'],
 'test_rand_perf.c': [],
 'test_rawdev.c': ['rawdev', 'bus_vdev'],
 'test_rcu_qsbr.c': ['rcu', 'hash'],
diff --git a/app/test/test_ptr_compress.c b/app/test/test_ptr_compress.c
new file mode 100644
index 00..ab33815974
--- /dev/null
+++ b/app/test/test_ptr_compress.c
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define MAX_PTRS 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[MAX_PTRS] = {0};
+   void *ptrs_out[MAX_PTRS] = {0};
+   uint32_t offsets32[MAX_PTRS] = {0};
+   uint16_t offsets16[MAX_PTRS] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32_shift(
+   base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32_shift(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16_shift(
+   base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16_shift(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   /* the test is run with multiple memory regions and base addresses */
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   /* main test compresses and decompresses arrays of pointers
+* and compares the array before and after to verify that
+* pointers are successfully decompressed
+*/
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < MAX_PTRS; n++) {
+

[PATCH v13 5/6] docs: add pointer compression guide

2024-05-30 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 +
 3 files changed, 162 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 27b2f03e6c..ed50121bd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
 M: Anatoly Burakov 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index d09d958e6c..6366849eb0 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -73,6 +73,7 @@ Programmer's Guide
 telemetry_lib
 bpf_lib
 graph_lib
+ptr_compress_lib
 build-sdk-meson
 meson_ut
 build_app
diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst 
b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..49e94e6c4e
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,160 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+Pointer Compression Library
+===
+
+Use ``rte_ptr_compress_16_shift()`` and ``rte_ptr_decompress_16_shift()`` to
+compress and decompress pointers into 16-bit offsets.
+Use ``rte_ptr_compress_32_shift()`` and ``rte_ptr_decompress_32_shift()`` to
+compress and decompress pointers into 32-bit offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are
+needed to store the offset is dictated by the memory region size and the
+alignment of the objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then
+3 bits can be dropped from the offset and a 32GB memory pool can now fit in
+32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allows storing the offset in the
+number of bits indicated by the function name (16 or 32). The start of mempool
+memory is a good candidate for the base pointer. Otherwise, any pointer that
+precedes all the pointers being compressed, is close enough to them and has
+the same alignment will work.
+
+Macros present in the rte_ptr_compress.h header may be used to evaluate whether
+compression is possible:
+
+*   RTE_PTR_COMPRESS_BITS_REQUIRED_TO_STORE_VALUE_IN_RANGE
+
+*   RTE_PTR_COMPRESS_BIT_SHIFT_FROM_ALIGNMENT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_16_SHIFT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_32_SHIFT
+
+These will help you calculate compression parameters and check whether they
+are legal for a particular compression function.
+
+If using an rte_mempool, you can get the parameters needed by the compression
+macros and functions by using ``rte_mempool_get_mem_range()``
+and ``rte_mempool_get_obj_alignment()``.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

Re: [PATCH v13 0/6] add pointer compression API

2024-05-30 Thread Paul Szczepanek
Recheck-request: github-robot


Re: [PATCH v13 0/6] add pointer compression API

2024-06-04 Thread Paul Szczepanek
Recheck-request: iol-unit-amd64-testing


Re: [PATCH v13 6/6] test: add unit test for ptr compression

2024-06-04 Thread Paul Szczepanek
Recheck-request: github-robot


Re: [PATCH v13 6/6] test: add unit test for ptr compression

2024-06-04 Thread Paul Szczepanek
Recheck-request: iol-unit-amd64-testing


[PATCH v14 0/6] add pointer compression API

2024-06-07 Thread Paul Szczepanek
This patchset proposes adding a new header-only library
with utility functions that allow compression of arrays of pointers.

Since this is a header-only library, a patch was needed to amend
the build system to allow adding libraries without source files.

When passing caches full of pointers between threads, the memory containing
the pointers is copied multiple times, which is especially costly between
cores. A compression method allows us to shrink the memory size copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region. We can compress them by converting them
to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8-byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers; any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the per-call overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test run on an
Ampere Altra, a substantial (up to 25%) performance gain is seen for bulk
sizes larger than 32. At 32 it breaks even, and lower sizes incur a small
(less than 5%) slowdown due to overhead.

In a more realistic mock application running the DPDK l3 forwarding
example in pipeline mode on two cores, this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos
v9:
* added MAINTAINERS entries, release notes, doc indexes etc.
* added patch for build system to allow header only library
v10:
* fixed problem with meson build adding shared deps to static deps
v11:
* added mempool functions to get information about memory range and
alignment
* added tests for the new mempool functions
* added macros to help find the parameters for compression functions
* minor improvement in the SVE compression code
* amended documentation to reflect these changes
v12:
* added doxygen and prefixes to macros
* use rte_bitops for clz and ctz
* added unit tests to verify macros
* fixed incorrect letter case in docs
v13:
* added contiguous parameter to rte_mempool_get_mem_range
* made rte_mempool_get_mem_range parameters optional
v14:
* encapsulated parameters of rte_mempool_get_mem_range in a struct
* added consts to function parameters where appropriate

Paul Szczepanek (6):
  lib: allow libraries with no sources
  mempool: add functions to get extra mempool info
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 MAINTAINERS|   6 +
 app/test/meson.build   |  21 +-
 app/test/test_mempool.c|  70 
 app/test/test_ptr_compress.c   | 193 +++
 app/test/test_ring.h   |  94 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 ++
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/mempool/rte_mempool.c  |  45 +++
 lib/mempool/rte_mempool.h  |  47 +++
 lib/mempool/version.map|   3 +
 lib/meson.build|  17 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +++
 17 files changed, 1212 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

--
2.25.1



[PATCH v14 1/6] lib: allow libraries with no sources

2024-06-07 Thread Paul Szczepanek
Allow header-only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Acked-by: Bruce Richardson 
---
 lib/meson.build | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..7c90602bf5 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,22 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

+# special case for header only libraries
+if sources.length() == 0
+shared_dep = declare_dependency(include_directories: includes,
+dependencies: shared_deps)
+static_dep = declare_dependency(include_directories: includes,
+dependencies: static_deps)
+set_variable('shared_rte_' + name, shared_dep)
+set_variable('static_rte_' + name, static_dep)
+dpdk_shared_lib_deps += shared_dep
+dpdk_static_lib_deps += static_dep
+if developer_mode
+message('lib/@0@: Defining dependency "@1@"'.format(l, name))
+endif
+continue
+endif
+
 if developer_mode and is_windows and use_function_versioning
 message('@0@: Function versioning is not supported by 
Windows.'.format(name))
 endif
--
2.25.1



[PATCH v14 2/6] mempool: add functions to get extra mempool info

2024-06-07 Thread Paul Szczepanek
Add two functions:
- rte_mempool_get_mem_range - get virtual memory range
of the objects in the mempool,
- rte_mempool_get_obj_alignment - get alignment of
objects in the mempool.

Add two tests that test these new functions.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Jack Bond-Preston 
Reviewed-by: Nathan Brown 
Reviewed-by: Morten Brørup 
Acked-by: Morten Brørup 
---
 app/test/test_mempool.c   | 70 +++
 lib/mempool/rte_mempool.c | 45 +
 lib/mempool/rte_mempool.h | 47 ++
 lib/mempool/version.map   |  3 ++
 4 files changed, 165 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index ad7ebd6363..3f7ba5872d 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -843,16 +843,19 @@ test_mempool(void)
int ret = -1;
uint32_t nb_objs = 0;
uint32_t nb_mem_chunks = 0;
+   size_t alignment = 0;
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
struct rte_mempool *mp_stack_anon = NULL;
struct rte_mempool *mp_stack_mempool_iter = NULL;
struct rte_mempool *mp_stack = NULL;
struct rte_mempool *default_pool = NULL;
+   struct rte_mempool *mp_alignment = NULL;
struct mp_data cb_arg = {
.ret = -1
};
const char *default_pool_ops = rte_mbuf_best_mempool_ops();
+   struct rte_mempool_mem_range_info mem_range = { 0 };

/* create a mempool (without cache) */
mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
@@ -967,6 +970,72 @@ test_mempool(void)
}
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);

+   if (rte_mempool_get_mem_range(default_pool, &mem_range)) {
+   printf("cannot get mem range from default mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(NULL, NULL) != -EINVAL) {
+   printf("rte_mempool_get_mem_range failed to return -EINVAL "
+   "when passed invalid arguments\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (mem_range.start == NULL || mem_range.length <
+   (MEMPOOL_SIZE * MEMPOOL_ELT_SIZE)) {
+   printf("mem range of default mempool is invalid\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* by default mempool objects are aligned by RTE_MEMPOOL_ALIGN */
+   alignment = rte_mempool_get_obj_alignment(default_pool);
+   if (alignment != RTE_MEMPOOL_ALIGN) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)RTE_MEMPOOL_ALIGN, alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   /* create a mempool with a RTE_MEMPOOL_F_NO_CACHE_ALIGN flag */
+   mp_alignment = rte_mempool_create("test_alignment",
+   1, 8, /* the small size guarantees single memory chunk */
+   0, 0, NULL, NULL, my_obj_init, NULL,
+   SOCKET_ID_ANY, RTE_MEMPOOL_F_NO_CACHE_ALIGN);
+
+   if (mp_alignment == NULL) {
+   printf("cannot allocate mempool with "
+   "RTE_MEMPOOL_F_NO_CACHE_ALIGN flag\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* mempool was created with RTE_MEMPOOL_F_NO_CACHE_ALIGN
+* and minimum alignment is expected which is sizeof(uint64_t)
+*/
+   alignment = rte_mempool_get_obj_alignment(mp_alignment);
+   if (alignment != sizeof(uint64_t)) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)sizeof(uint64_t), alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   alignment = rte_mempool_get_obj_alignment(NULL);
+   if (alignment != 0) {
+   printf("rte_mempool_get_obj_alignment failed to return 0 for "
+   " an invalid mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(mp_alignment, &mem_range)) {
+   printf("cannot get mem range from mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (!mem_range.is_contiguous) {
+   printf("mempool not contiguous\n");
+   GOTO_ERR(ret, err);
+   }
+
/* retrieve the mempool from its name */
if (rte_mempool_lookup("test_nocache") != mp_nocache) {
printf("Cannot lookup mempool from its name\n");
@@ -1039,6 +1108,7 @@ test_mempool(void)
rte_mempool_free(mp_stack_mempool_iter);
rte_mempool_free(mp_stac

[PATCH v14 3/6] ptr_compress: add pointer compression library

2024-06-07 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8-byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
Acked-by: Morten Brørup 
---
 MAINTAINERS|   4 +
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 324 +
 7 files changed, 340 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c9adff9846..27b2f03e6c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1694,6 +1694,10 @@ M: Chenbo Xia 
 M: Gaetan Rivet 
 F: lib/pci/

+Pointer Compression
+M: Paul Szczepanek 
+F: lib/ptr_compress/
+
 Power management
 M: Anatoly Burakov 
 M: David Hunt 
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..f9283154f8 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,6 +222,7 @@ The public API headers are grouped by topics:
   [config file](@ref rte_cfgfile.h),
   [key/value args](@ref rte_kvargs.h),
   [argument parsing](@ref rte_argparse.h),
+  [ptr_compress](@ref rte_ptr_compress.h),
   [string](@ref rte_string_fns.h),
   [thread](@ref rte_thread.h)

diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index 27afec8b3b..a8823c046f 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -71,6 +71,7 @@ INPUT   = @TOPDIR@/doc/api/doxy-api-index.md \
   @TOPDIR@/lib/pipeline \
   @TOPDIR@/lib/port \
   @TOPDIR@/lib/power \
+  @TOPDIR@/lib/ptr_compress \
   @TOPDIR@/lib/rawdev \
   @TOPDIR@/lib/rcu \
   @TOPDIR@/lib/regexdev \
diff --git a/doc/guides/rel_notes/release_24_07.rst 
b/doc/guides/rel_notes/release_24_07.rst
index a69f24cf99..4711792e61 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -55,6 +55,11 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===

+* **Introduced pointer compression library.**
+
+  The library provides functions to compress and decompress arrays of pointers,
+  which can improve application performance under certain conditions.
+  A performance test was added to help users evaluate performance on their setup.

 Removed Items
 -
diff --git a/lib/meson.build b/lib/meson.build
index 7c90602bf5..63becee142 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h 
b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..bf9cfb0661
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,324 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region. We compress them by converting them to offsets from
+ * a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit a

[PATCH v14 4/6] test: add pointer compress tests to ring perf test

2024-06-07 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  94 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 334 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..9e97c5e3e7 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,47 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+ 

[PATCH v14 5/6] docs: add pointer compression guide

2024-06-07 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 +
 3 files changed, 162 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 27b2f03e6c..ed50121bd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
 M: Anatoly Burakov 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index d09d958e6c..6366849eb0 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -73,6 +73,7 @@ Programmer's Guide
 telemetry_lib
 bpf_lib
 graph_lib
+ptr_compress_lib
 build-sdk-meson
 meson_ut
 build_app
diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst 
b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..49e94e6c4e
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,160 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+Pointer Compression Library
+===
+
+Use ``rte_ptr_compress_16_shift()`` and ``rte_ptr_decompress_16_shift()`` to
+compress and decompress pointers into 16-bit offsets.
+Use ``rte_ptr_compress_32_shift()`` and ``rte_ptr_decompress_32_shift()`` to
+compress and decompress pointers into 32-bit offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are
+needed to store the offset is dictated by the memory region size and the
+alignment of the objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then
+3 bits can be dropped from the offset and a 32GB memory pool can now fit in
+32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allows storing the offset in the
+number of bits indicated by the function name (16 or 32). The start of mempool
+memory is a good candidate for the base pointer. Otherwise, any pointer that
+precedes all the pointers being compressed, is close enough to them and has
+the same alignment will work.
+
+Macros present in the rte_ptr_compress.h header may be used to evaluate whether
+compression is possible:
+
+*   RTE_PTR_COMPRESS_BITS_REQUIRED_TO_STORE_VALUE_IN_RANGE
+
+*   RTE_PTR_COMPRESS_BIT_SHIFT_FROM_ALIGNMENT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_16_SHIFT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_32_SHIFT
+
+These will help you calculate compression parameters and check whether they
+are legal for a particular compression function.
+
+If using an rte_mempool, you can get the parameters needed by the compression
+macros and functions by using ``rte_mempool_get_mem_range()``
+and ``rte_mempool_get_obj_alignment()``.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

[PATCH v14 6/6] test: add unit test for ptr compression

2024-06-07 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly. Additionally, tests that the helper macros
perform calculations correctly.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS  |   1 +
 app/test/meson.build |   1 +
 app/test/test_ptr_compress.c | 193 +++
 3 files changed, 195 insertions(+)
 create mode 100644 app/test/test_ptr_compress.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ed50121bd2..2565ef5f4b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: app/test/test_ptr_compress.c
 F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
diff --git a/app/test/meson.build b/app/test/meson.build
index df8cc00730..e29258e6ec 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -144,6 +144,7 @@ source_file_deps = {
 'test_power_intel_uncore.c': ['power'],
 'test_power_kvm_vm.c': ['power'],
 'test_prefetch.c': [],
+'test_ptr_compress.c': ['ptr_compress'],
 'test_rand_perf.c': [],
 'test_rawdev.c': ['rawdev', 'bus_vdev'],
 'test_rcu_qsbr.c': ['rcu', 'hash'],
diff --git a/app/test/test_ptr_compress.c b/app/test/test_ptr_compress.c
new file mode 100644
index 00..ab33815974
--- /dev/null
+++ b/app/test/test_ptr_compress.c
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define MAX_PTRS 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[MAX_PTRS] = {0};
+   void *ptrs_out[MAX_PTRS] = {0};
+   uint32_t offsets32[MAX_PTRS] = {0};
+   uint16_t offsets16[MAX_PTRS] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32_shift(
+   base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32_shift(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16_shift(
+   base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16_shift(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   /* the test is run with multiple memory regions and base addresses */
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   /* main test compresses and decompresses arrays of pointers
+* and compares the array before and after to verify that
+* pointers are successfully decompressed
+*/
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < MAX_PTRS; n++) {
+

Re: [PATCH v13 2/6] mempool: add functions to get extra mempool info

2024-06-07 Thread Paul Szczepanek


On 06/06/2024 13:28, Konstantin Ananyev wrote:
> 
> 
>> Add two functions:
>> - rte_mempool_get_mem_range - get virtual memory range
>> of the objects in the mempool,
>> - rte_mempool_get_obj_alignment - get alignment of
>> objects in the mempool.
>>
>> Add two tests that test these new functions.
> 
> LGTM in general, few nits/suggestions below.
>  

I have added your suggestions.
Can I please have an Acked-by or Reviewed-by from you?


[PATCH v15 1/6] lib: allow libraries with no sources

2024-06-11 Thread Paul Szczepanek
Allow header-only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Acked-by: Bruce Richardson 
---
 lib/meson.build | 16 
 1 file changed, 16 insertions(+)

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..7c90602bf5 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,22 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

+# special case for header only libraries
+if sources.length() == 0
+shared_dep = declare_dependency(include_directories: includes,
+dependencies: shared_deps)
+static_dep = declare_dependency(include_directories: includes,
+dependencies: static_deps)
+set_variable('shared_rte_' + name, shared_dep)
+set_variable('static_rte_' + name, static_dep)
+dpdk_shared_lib_deps += shared_dep
+dpdk_static_lib_deps += static_dep
+if developer_mode
+message('lib/@0@: Defining dependency "@1@"'.format(l, name))
+endif
+continue
+endif
+
 if developer_mode and is_windows and use_function_versioning
            message('@0@: Function versioning is not supported by Windows.'.format(name))
 endif
--
2.25.1



[PATCH v15 0/6] add pointer compression API

2024-06-11 Thread Paul Szczepanek
This patchset proposes adding a new header-only library
with utility functions that allow compression of arrays of pointers.

Since this is a header-only library, a patch was added to amend
the build system to allow adding libraries without source files.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region. We can compress them by converting them
to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the per-call overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test, run on
an Ampere Altra, a substantial (up to 25%) performance gain is seen for bulk
sizes larger than 32. At 32 it breaks even, and lower sizes incur a small
(less than 5%) slowdown due to overhead.

In a more realistic mock application running the L3 forwarding DPDK
example that works in pipeline mode on two cores this translated into a
~5% throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos
v9:
* added MAINTAINERS entries, release notes, doc indexes etc.
* added patch for build system to allow header only library
v10:
* fixed problem with meson build adding shared deps to static deps
v11:
* added mempool functions to get information about memory range and
alignment
* added tests for the new mempool functions
* added macros to help find the parameters for compression functions
* minor improvement in the SVE compression code
* amended documentation to reflect these changes
v12:
* added doxygen and prefixes to macros
* use rte_bitops for clz and ctz
* added unit tests to verify macros
* fixed incorrect letter case in docs
v13:
* added contiguous parameter to rte_mempool_get_mem_range
* made rte_mempool_get_mem_range parameters optional
v14:
* encapsulated parameters of rte_mempool_get_mem_range in a struct
* added consts to function parameters where appropriate
v15:
* fix whitespace in rel notes
* move parameter check to after variable declaration
* change the naming of the range variable in macros
* improve doxygen

Paul Szczepanek (6):
  lib: allow libraries with no sources
  mempool: add functions to get extra mempool info
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 MAINTAINERS|   6 +
 app/test/meson.build   |  21 +-
 app/test/test_mempool.c|  70 
 app/test/test_ptr_compress.c   | 193 +++
 app/test/test_ring.h   |  94 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 ++
 doc/guides/rel_notes/release_24_07.rst |   7 +
 lib/mempool/rte_mempool.c  |  45 +++
 lib/mempool/rte_mempool.h  |  49 +++
 lib/mempool/version.map|   3 +
 lib/meson.build|  17 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 325 +++
 17 files changed, 1217 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress

[PATCH v15 2/6] mempool: add functions to get extra mempool info

2024-06-11 Thread Paul Szczepanek
Add two functions:
- rte_mempool_get_mem_range - get virtual memory range
of the objects in the mempool,
- rte_mempool_get_obj_alignment - get alignment of
objects in the mempool.

Add two tests that test these new functions.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Jack Bond-Preston 
Reviewed-by: Nathan Brown 
Acked-by: Morten Brørup 
Acked-by: Konstantin Ananyev 
---
 app/test/test_mempool.c   | 70 +++
 lib/mempool/rte_mempool.c | 45 +
 lib/mempool/rte_mempool.h | 49 +++
 lib/mempool/version.map   |  3 ++
 4 files changed, 167 insertions(+)

diff --git a/app/test/test_mempool.c b/app/test/test_mempool.c
index ad7ebd6363..3f7ba5872d 100644
--- a/app/test/test_mempool.c
+++ b/app/test/test_mempool.c
@@ -843,16 +843,19 @@ test_mempool(void)
int ret = -1;
uint32_t nb_objs = 0;
uint32_t nb_mem_chunks = 0;
+   size_t alignment = 0;
struct rte_mempool *mp_cache = NULL;
struct rte_mempool *mp_nocache = NULL;
struct rte_mempool *mp_stack_anon = NULL;
struct rte_mempool *mp_stack_mempool_iter = NULL;
struct rte_mempool *mp_stack = NULL;
struct rte_mempool *default_pool = NULL;
+   struct rte_mempool *mp_alignment = NULL;
struct mp_data cb_arg = {
.ret = -1
};
const char *default_pool_ops = rte_mbuf_best_mempool_ops();
+   struct rte_mempool_mem_range_info mem_range = { 0 };

/* create a mempool (without cache) */
mp_nocache = rte_mempool_create("test_nocache", MEMPOOL_SIZE,
@@ -967,6 +970,72 @@ test_mempool(void)
}
rte_mempool_obj_iter(default_pool, my_obj_init, NULL);

+   if (rte_mempool_get_mem_range(default_pool, &mem_range)) {
+   printf("cannot get mem range from default mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(NULL, NULL) != -EINVAL) {
+   printf("rte_mempool_get_mem_range failed to return -EINVAL "
+   "when passed invalid arguments\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (mem_range.start == NULL || mem_range.length <
+   (MEMPOOL_SIZE * MEMPOOL_ELT_SIZE)) {
+   printf("mem range of default mempool is invalid\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* by default mempool objects are aligned by RTE_MEMPOOL_ALIGN */
+   alignment = rte_mempool_get_obj_alignment(default_pool);
+   if (alignment != RTE_MEMPOOL_ALIGN) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)RTE_MEMPOOL_ALIGN, alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   /* create a mempool with a RTE_MEMPOOL_F_NO_CACHE_ALIGN flag */
+   mp_alignment = rte_mempool_create("test_alignment",
+   1, 8, /* the small size guarantees single memory chunk */
+   0, 0, NULL, NULL, my_obj_init, NULL,
+   SOCKET_ID_ANY, RTE_MEMPOOL_F_NO_CACHE_ALIGN);
+
+   if (mp_alignment == NULL) {
+   printf("cannot allocate mempool with "
+   "RTE_MEMPOOL_F_NO_CACHE_ALIGN flag\n");
+   GOTO_ERR(ret, err);
+   }
+
+   /* mempool was created with RTE_MEMPOOL_F_NO_CACHE_ALIGN
+* and minimum alignment is expected which is sizeof(uint64_t)
+*/
+   alignment = rte_mempool_get_obj_alignment(mp_alignment);
+   if (alignment != sizeof(uint64_t)) {
+   printf("rte_mempool_get_obj_alignment returned wrong value, "
+   "expected %zu, returned %zu\n",
+   (size_t)sizeof(uint64_t), alignment);
+   GOTO_ERR(ret, err);
+   }
+
+   alignment = rte_mempool_get_obj_alignment(NULL);
+   if (alignment != 0) {
+   printf("rte_mempool_get_obj_alignment failed to return 0 for "
+   "an invalid mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (rte_mempool_get_mem_range(mp_alignment, &mem_range)) {
+   printf("cannot get mem range from mempool\n");
+   GOTO_ERR(ret, err);
+   }
+
+   if (!mem_range.is_contiguous) {
+   printf("mempool not contiguous\n");
+   GOTO_ERR(ret, err);
+   }
+
/* retrieve the mempool from its name */
if (rte_mempool_lookup("test_nocache") != mp_nocache) {
printf("Cannot lookup mempool from its name\n");
@@ -1039,6 +1108,7 @@ test_mempool(void)
rte_mempool_free(mp_stack_mempool_iter);
rte_mempool_free(mp_stac

[PATCH v15 3/6] ptr_compress: add pointer compression library

2024-06-11 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
Acked-by: Morten Brørup 
---
 MAINTAINERS|   4 +
 doc/api/doxy-api-index.md  |   1 +
 doc/api/doxy-api.conf.in   |   1 +
 doc/guides/rel_notes/release_24_07.rst |   5 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 325 +
 7 files changed, 341 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c9adff9846..27b2f03e6c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1694,6 +1694,10 @@ M: Chenbo Xia 
 M: Gaetan Rivet 
 F: lib/pci/

+Pointer Compression
+M: Paul Szczepanek 
+F: lib/ptr_compress/
+
 Power management
 M: Anatoly Burakov 
 M: David Hunt 
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 8c1eb8fafa..f9283154f8 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -222,6 +222,7 @@ The public API headers are grouped by topics:
   [config file](@ref rte_cfgfile.h),
   [key/value args](@ref rte_kvargs.h),
   [argument parsing](@ref rte_argparse.h),
+  [ptr_compress](@ref rte_ptr_compress.h),
   [string](@ref rte_string_fns.h),
   [thread](@ref rte_thread.h)

diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index 27afec8b3b..a8823c046f 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -71,6 +71,7 @@ INPUT   = @TOPDIR@/doc/api/doxy-api-index.md \
   @TOPDIR@/lib/pipeline \
   @TOPDIR@/lib/port \
   @TOPDIR@/lib/power \
+  @TOPDIR@/lib/ptr_compress \
   @TOPDIR@/lib/rawdev \
   @TOPDIR@/lib/rcu \
   @TOPDIR@/lib/regexdev \
diff --git a/doc/guides/rel_notes/release_24_07.rst b/doc/guides/rel_notes/release_24_07.rst
index a69f24cf99..4711792e61 100644
--- a/doc/guides/rel_notes/release_24_07.rst
+++ b/doc/guides/rel_notes/release_24_07.rst
@@ -55,6 +55,11 @@ New Features
  Also, make sure to start the actual text at the margin.
  ===

+* **Introduced pointer compression library.**
+
+  Library provides functions to compress and decompress arrays of pointers
+  which can improve application performance under certain conditions.
+  Performance test was added to help users evaluate performance on their setup.

 Removed Items
 -
diff --git a/lib/meson.build b/lib/meson.build
index 7c90602bf5..63becee142 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..b9ab17b2db
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,325 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region. We compress them by converting them to offsets from
+ * a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit a

[PATCH v15 5/6] docs: add pointer compression guide

2024-06-11 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS|   1 +
 doc/guides/prog_guide/index.rst|   1 +
 doc/guides/prog_guide/ptr_compress_lib.rst | 160 +
 doc/guides/rel_notes/release_24_07.rst |   2 +
 4 files changed, 164 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/MAINTAINERS b/MAINTAINERS
index 27b2f03e6c..ed50121bd2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
 M: Anatoly Burakov 
diff --git a/doc/guides/prog_guide/index.rst b/doc/guides/prog_guide/index.rst
index d09d958e6c..6366849eb0 100644
--- a/doc/guides/prog_guide/index.rst
+++ b/doc/guides/prog_guide/index.rst
@@ -73,6 +73,7 @@ Programmer's Guide
 telemetry_lib
 bpf_lib
 graph_lib
+ptr_compress_lib
 build-sdk-meson
 meson_ut
 build_app
diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..5edbc35e56
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,160 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+Pointer Compression Library
+===
+
+Use ``rte_ptr_compress_16_shift()`` and ``rte_ptr_decompress_16_shift()`` to
+compress and decompress pointers into 16-bit offsets.
+Use ``rte_ptr_compress_32_shift()`` and ``rte_ptr_decompress_32_shift()`` to
+compress and decompress pointers into 32-bit offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3 bits
+can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+Macros present in the rte_ptr_compress.h header may be used to evaluate whether
+compression is possible:
+
+*   RTE_PTR_COMPRESS_BITS_NEEDED_FOR_POINTER_WITHIN_RANGE
+
+*   RTE_PTR_COMPRESS_BIT_SHIFT_FROM_ALIGNMENT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_16_SHIFT
+
+*   RTE_PTR_COMPRESS_CAN_COMPRESS_32_SHIFT
+
+These will help you calculate compression parameters and check whether they are
+legal for a particular compression function.
+
+If using an rte_mempool you can get the parameters you need to use in the
+compression macros and functions by using ``rte_mempool_get_mem_range()``
+and ``rte_mempool_get_obj_alignment()``.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

[PATCH v15 4/6] test: add pointer compress tests to ring perf test

2024-06-11 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored the threading code to pass more parameters to threads so
existing code can be reused. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration counts to take bulk sizes into
account, keeping runtime constant (instead of the number of operations).

Adjusted old printfs to match the new ones, which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  94 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 334 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..9e97c5e3e7 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,47 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+ 

[PATCH v15 6/6] test: add unit test for ptr compression

2024-06-11 Thread Paul Szczepanek
The test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly. Additionally, it verifies that the helper macros
perform calculations correctly.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
Reviewed-by: Jack Bond-Preston 
---
 MAINTAINERS  |   1 +
 app/test/meson.build |   1 +
 app/test/test_ptr_compress.c | 193 +++
 3 files changed, 195 insertions(+)
 create mode 100644 app/test/test_ptr_compress.c

diff --git a/MAINTAINERS b/MAINTAINERS
index ed50121bd2..2565ef5f4b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1697,6 +1697,7 @@ F: lib/pci/
 Pointer Compression
 M: Paul Szczepanek 
 F: lib/ptr_compress/
+F: app/test/test_ptr_compress.c
 F: doc/guides/prog_guide/ptr_compress_lib.rst

 Power management
diff --git a/app/test/meson.build b/app/test/meson.build
index df8cc00730..e29258e6ec 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -144,6 +144,7 @@ source_file_deps = {
 'test_power_intel_uncore.c': ['power'],
 'test_power_kvm_vm.c': ['power'],
 'test_prefetch.c': [],
+'test_ptr_compress.c': ['ptr_compress'],
 'test_rand_perf.c': [],
 'test_rawdev.c': ['rawdev', 'bus_vdev'],
 'test_rcu_qsbr.c': ['rcu', 'hash'],
diff --git a/app/test/test_ptr_compress.c b/app/test/test_ptr_compress.c
new file mode 100644
index 00..807b19eaf6
--- /dev/null
+++ b/app/test/test_ptr_compress.c
@@ -0,0 +1,193 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define MAX_PTRS 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[MAX_PTRS] = {0};
+   void *ptrs_out[MAX_PTRS] = {0};
+   uint32_t offsets32[MAX_PTRS] = {0};
+   uint16_t offsets16[MAX_PTRS] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32_shift(
+   base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32_shift(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16_shift(
+   base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16_shift(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   /* the test is run with multiple memory regions and base addresses */
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   /* main test compresses and decompresses arrays of pointers
+* and compares the array before and after to verify that
+* pointers are successfully decompressed
+*/
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < MAX_PTRS; n++) {
+

Re: [PATCH v14 2/6] mempool: add functions to get extra mempool info

2024-06-11 Thread Paul Szczepanek



On 10/06/2024 15:24, Konstantin Ananyev wrote:
[snip]
> 
> With that changes in place:
> Acked-by: Konstantin Ananyev 
> 

I have applied your comments in v15.

[snip]

>> +
>> +size_t rte_mempool_get_obj_alignment(const struct rte_mempool *mp)
>> +{
>> +if (mp == NULL)
>> +return 0;
> 
> In case of mp==NULL, would it be better to return negative error code 
> (-EINVAL or so)?
> In that case the function should be changed to return ssize_t though.
> 

I have chosen size_t for portability and ease of use, as that is the type
most likely expected by the caller, and because no other error modes are
anticipated in the future that would require distinguishing between them.
I hope this is an acceptable solution and have kept it in v15.

[snip]


Re: [PATCH v14 3/6] ptr_compress: add pointer compression library

2024-06-11 Thread Paul Szczepanek



On 10/06/2024 16:18, David Marchand wrote:
> Hello,
> 
> On Fri, Jun 7, 2024 at 5:10 PM Paul Szczepanek  
> wrote:
>>
>> Add a new utility header for compressing pointers. The provided
>> functions can store pointers as 32-bit or 16-bit offsets.
>>
>> The compression takes advantage of the fact that pointers are
>> usually located in a limited memory region (like a mempool).
>> We can compress them by converting them to offsets from a base
>> memory address. Offsets can be stored in fewer bytes (dictated
>> by the memory region size and alignment of the pointer).
>> For example: an 8 byte aligned pointer which is part of a 32GB
>> memory pool can be stored in 4 bytes.
>>
>> Suggested-by: Honnappa Nagarahalli 
>> Signed-off-by: Paul Szczepanek 
>> Signed-off-by: Kamalakshitha Aligeri 
>> Reviewed-by: Honnappa Nagarahalli 
>> Reviewed-by: Nathan Brown 
>> Reviewed-by: Jack Bond-Preston 
>> Acked-by: Morten Brørup 
>> ---
>>  MAINTAINERS|   4 +
>>  doc/api/doxy-api-index.md  |   1 +
>>  doc/api/doxy-api.conf.in   |   1 +
>>  doc/guides/rel_notes/release_24_07.rst |   5 +
>>  lib/meson.build|   1 +
>>  lib/ptr_compress/meson.build   |   4 +
>>  lib/ptr_compress/rte_ptr_compress.h| 324 +
>>  7 files changed, 340 insertions(+)
>>  create mode 100644 lib/ptr_compress/meson.build
>>  create mode 100644 lib/ptr_compress/rte_ptr_compress.h
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index c9adff9846..27b2f03e6c 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1694,6 +1694,10 @@ M: Chenbo Xia 
>>  M: Gaetan Rivet 
>>  F: lib/pci/
>>
>> +Pointer Compression
>> +M: Paul Szczepanek 
>> +F: lib/ptr_compress/
>> +
>>  Power management
>>  M: Anatoly Burakov 
>>  M: David Hunt 
>> diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
>> index 8c1eb8fafa..f9283154f8 100644
>> --- a/doc/api/doxy-api-index.md
>> +++ b/doc/api/doxy-api-index.md
>> @@ -222,6 +222,7 @@ The public API headers are grouped by topics:
>>[config file](@ref rte_cfgfile.h),
>>[key/value args](@ref rte_kvargs.h),
>>[argument parsing](@ref rte_argparse.h),
>> +  [ptr_compress](@ref rte_ptr_compress.h),
>>[string](@ref rte_string_fns.h),
>>[thread](@ref rte_thread.h)
>>
>> diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
>> index 27afec8b3b..a8823c046f 100644
>> --- a/doc/api/doxy-api.conf.in
>> +++ b/doc/api/doxy-api.conf.in
>> @@ -71,6 +71,7 @@ INPUT   = 
>> @TOPDIR@/doc/api/doxy-api-index.md \
>>@TOPDIR@/lib/pipeline \
>>@TOPDIR@/lib/port \
>>@TOPDIR@/lib/power \
>> +  @TOPDIR@/lib/ptr_compress \
>>@TOPDIR@/lib/rawdev \
>>@TOPDIR@/lib/rcu \
>>@TOPDIR@/lib/regexdev \
>> diff --git a/doc/guides/rel_notes/release_24_07.rst 
>> b/doc/guides/rel_notes/release_24_07.rst
>> index a69f24cf99..4711792e61 100644
>> --- a/doc/guides/rel_notes/release_24_07.rst
>> +++ b/doc/guides/rel_notes/release_24_07.rst
>> @@ -55,6 +55,11 @@ New Features
>>   Also, make sure to start the actual text at the margin.
>>   ===
>>
>> +* **Introduced pointer compression library.**
>> +
>> +  Library provides functions to compress and decompress arrays of pointers
>> +  which can improve application performance under certain conditions.
>> +  Performance test was added to help users evaluate performance on their 
>> setup.
> 
> Please, double empty line before a new section in the release notes.
> 
> 
>>
>>  Removed Items
>>  -
>> diff --git a/lib/meson.build b/lib/meson.build
>> index 7c90602bf5..63becee142 100644
>> --- a/lib/meson.build
>> +++ b/lib/meson.build
>> @@ -14,6 +14,7 @@ libraries = [
>>  'argparse',
>>  'telemetry', # basic info querying
>>  'eal', # everything depends on eal
>> +'ptr_compress',
>>  'ring',
>>  'rcu', # rcu depends on ring
>>  'mempool',
>> diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
>> new file mode 100644
>> in

Re: [PATCH v15 0/6] add pointer compression API

2024-06-17 Thread Paul Szczepanek



On 17/06/2024 11:02, David Marchand wrote:
> Hello Paul,
> 
> On Fri, Jun 14, 2024 at 12:28 PM David Marchand
>  wrote:
>> Even if this library only contains a header, with no tie to other
>> public DPDK API, this library should be optional.
>> If no objection, please work on this change for -rc2,
> 
> We have a build error on armv7 (tested in OBS).
> https://build.opensuse.org/public/build/home:bluca:dpdk/openSUSE_Factory_ARM/armv7l/dpdk/_log
> 
> [  395s] [1730/2037] cc -Iapp/dpdk-test.p -Iapp -I../app -Ilib/cmdline
> -I../lib/cmdline -I. -I.. -Iconfig -I../config -Ilib/eal/include
> -I../lib/eal/include -Ilib/eal/linux/include
> -I../lib/eal/linux/include -Ilib/eal/arm/include
> -I../lib/eal/arm/include -Ilib/eal/common -I../lib/eal/common
> -Ilib/eal -I../lib/eal -Ilib/kvargs -I../lib/kvargs -Ilib/log
> -I../lib/log -Ilib/metrics -I../lib/metrics -Ilib/telemetry
> -I../lib/telemetry -Ilib/net -I../lib/net -Ilib/mbuf -I../lib/mbuf
> -Ilib/mempool -I../lib/mempool -Ilib/ring -I../lib/ring
> -Idrivers/net/ring -I../drivers/net/ring -Ilib/ethdev -I../lib/ethdev
> -Ilib/meter -I../lib/meter -Idrivers/bus/pci -I../drivers/bus/pci
> -I../drivers/bus/pci/linux -Ilib/pci -I../lib/pci -Idrivers/bus/vdev
> -I../drivers/bus/vdev -Ilib/acl -I../lib/acl -Ilib/argparse
> -I../lib/argparse -Ilib/hash -I../lib/hash -Ilib/rcu -I../lib/rcu
> -Ilib/bitratestats -I../lib/bitratestats -Ilib/bpf -I../lib/bpf
> -Ilib/compressdev -I../lib/compressdev -Ilib/cryptodev
> -I../lib/cryptodev -Ilib/security -I../lib/security -Ilib/dispatcher
> -I../lib/dispatcher -Ilib/eventdev -I../lib/eventdev -Ilib/timer
> -I../lib/timer -Ilib/dmadev -I../lib/dmadev -Ilib/distributor
> -I../lib/distributor -Ilib/efd -I../lib/efd -Ilib/fib -I../lib/fib
> -Ilib/rib -I../lib/rib -Ilib/table -I../lib/table -Ilib/port
> -I../lib/port -Ilib/sched -I../lib/sched -Ilib/ip_frag
> -I../lib/ip_frag -Ilib/lpm -I../lib/lpm -Ilib/graph -I../lib/graph
> -Ilib/pcapng -I../lib/pcapng -Ilib/ipsec -I../lib/ipsec
> -Ilib/latencystats -I../lib/latencystats -Idrivers/net/bonding
> -I../drivers/net/bonding -Ilib/member -I../lib/member -Ilib/pdcp
> -I../lib/pdcp -Ilib/reorder -I../lib/reorder -Ilib/pdump
> -I../lib/pdump -Ilib/power -I../lib/power -Ilib/ptr_compress
> -I../lib/ptr_compress -Ilib/rawdev -I../lib/rawdev -Ilib/stack
> -I../lib/stack -Ilib/pipeline -I../lib/pipeline
> -Idrivers/crypto/scheduler -I../drivers/crypto/scheduler
> -I/usr/include/libnl3 -I/usr/include/dbus-1.0
> -I/usr/lib/dbus-1.0/include -fdiagnostics-color=always
> -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=c11 -O3
> -include rte_config.h -Wcast-qual -Wdeprecated -Wformat
> -Wmissing-declarations -Wmissing-prototypes -Wnested-externs
> -Wold-style-definition -Wpointer-arith -Wsign-compare
> -Wstrict-prototypes -Wundef -Wwrite-strings
> -Wno-address-of-packed-member -Wno-packed-not-aligned
> -Wno-missing-field-initializers -Wno-zero-length-bounds
> -Wno-pointer-to-int-cast -D_GNU_SOURCE -fcommon -Werror -march=armv7-a
> -mfpu=neon -DALLOW_EXPERIMENTAL_API -Wno-format-truncation
> -fno-strict-aliasing -DALLOW_INTERNAL_API -MD -MQ
> app/dpdk-test.p/test_test_ptr_compress.c.o -MF
> app/dpdk-test.p/test_test_ptr_compress.c.o.d -o
> app/dpdk-test.p/test_test_ptr_compress.c.o -c
> ../app/test/test_ptr_compress.c
> [  395s] FAILED: app/dpdk-test.p/test_test_ptr_compress.c.o

[PATCH v1] ptr_compress: fix offset to use portable type

2024-06-19 Thread Paul Szczepanek
Fix the type of offset to use portable uintptr_t instead of uint64_t.

Fixes: 077596a4b077 ("ptr_compress: add pointer compression library")

Reviewed-by: Nick Connolly 
Signed-off-by: Paul Szczepanek 
---
 lib/ptr_compress/rte_ptr_compress.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
index ca746970c0..9742a9594a 100644
--- a/lib/ptr_compress/rte_ptr_compress.h
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -141,7 +141,7 @@ rte_ptr_compress_32_shift(void *ptr_base, void * const *src_table,
i += svcntd();
} while (i < n);
 #elif defined __ARM_NEON && !defined RTE_ARCH_ARMv8_AARCH32
-   uint64_t ptr_diff;
+   uintptr_t ptr_diff;
uint64x2_t v_ptr_table;
/* right shift is done by left shifting by negative int */
int64x2_t v_shift = vdupq_n_s64(-bit_shift);
@@ -202,7 +202,7 @@ rte_ptr_decompress_32_shift(void *ptr_base, uint32_t const *src_table,
i += svcntd();
} while (i < n);
 #elif defined __ARM_NEON && !defined RTE_ARCH_ARMv8_AARCH32
-   uint64_t ptr_diff;
+   uintptr_t ptr_diff;
uint64x2_t v_ptr_table;
int64x2_t v_shift = vdupq_n_s64(bit_shift);
uint64x2_t v_ptr_base = vdupq_n_u64((uint64_t)ptr_base);
@@ -215,7 +215,7 @@ rte_ptr_decompress_32_shift(void *ptr_base, uint32_t const *src_table,
}
/* process leftover single item in case of odd number of n */
if (unlikely(n & 0x1)) {
-   ptr_diff = ((uint64_t) src_table[i]) << bit_shift;
+   ptr_diff = ((uintptr_t) src_table[i]) << bit_shift;
dest_table[i] = RTE_PTR_ADD(ptr_base, ptr_diff);
}
 #else
--
2.25.1



[PATCH v3 0/3] add pointer compression API

2023-10-31 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression relies on a few fast operations and, especially when vector
instructions are leveraged, introduces minimal overhead.

The API accepts and returns arrays because the fixed overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test, run on
an Ampere Altra, a substantial (up to 25%) performance gain is seen when done
in bulk sizes larger than 32. At 32 it breaks even and lower sizes create a
small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the l3 forwarding DPDK example
in pipeline mode this translated into a ~5% throughput increase on an
Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16 bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide

Paul Szczepanek (3):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test
  docs: add pointer compression to the EAL guide

 .mailmap  |   1 +
 app/test/test_ring.h  |  94 -
 app/test/test_ring_perf.c | 354 --
 .../prog_guide/env_abstraction_layer.rst  | 142 +++
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_ptr_compress.h| 272 ++
 6 files changed, 740 insertions(+), 124 deletions(-)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v3 3/3] docs: add pointer compression to the EAL guide

2023-10-31 Thread Paul Szczepanek
Documentation added in the EAL guide for the new utility functions for
pointer compression, showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 .../prog_guide/env_abstraction_layer.rst  | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 89014789de..88cf1f16e2 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -1192,3 +1192,145 @@ will not be deallocated.

 Any successful deallocation event will trigger a callback, for which user
 applications and other DPDK subsystems can register.
+
+.. _pointer_compression:
+
+Pointer Compression
+---
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are
+needed to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3
+bits can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case, the code is simplified for demonstration purposes and
+does not include error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<

[PATCH v3 2/3] test: add pointer compress tests to ring perf test

2023-10-31 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  94 +-
 app/test/test_ring_perf.c | 354 +-
 2 files changed, 324 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..3b00f2465d 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */

 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t, so we use a uint32_t,
+* halve the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_16(0, obj, zcd.ptr1, zcd.n1 * 2, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_16(0,
+   obj + (zcd.n1 * 2),
+   zcd.ptr2,
+   (ret - zcd.n1) * 2, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret * 2;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +211,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
 

[PATCH v3 1/3] eal: add pointer compression functions

2023-10-31 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 272 +
 3 files changed, 274 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 864d33ee46..3f0c9d32f5 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1058,6 +1058,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index a0463efac7..17d8373648 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..ceb4662c14
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,272 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting
+ * them to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer, calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For a usage example and further explanation please see "Pointer Compression"
+ * in doc/guides/prog_guide/env_abstraction_layer.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee
+ * they must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE
+   svuint64_t v_src_table;
+   svuint64_t v_dest_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_src_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_dest_table = svsub_x(pg, v_src_table, (uint64_t)ptr_base);
+   v_dest_table = svlsr_x(pg, v_dest_table, bit_shift);
+   svst1w(pg, &dest_table[i], v_dest_table);
+ 

[PATCH v4 0/4] add pointer compression API

2023-11-01 Thread Paul Szczepanek
This patchset is proposing adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression relies on a few fast operations and, especially when vector
instructions are leveraged, introduces minimal overhead.

The API accepts and returns arrays because the fixed overhead means
compression is only worthwhile when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test, run on
an Ampere Altra, a substantial (up to 25%) performance gain is seen when done
in bulk sizes larger than 32. At 32 it breaks even and lower sizes create a
small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the l3 forwarding DPDK example
in pipeline mode on two cores this translated into a ~5% throughput increase
on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress

Paul Szczepanek (4):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test
  docs: add pointer compression to the EAL guide
  test: add unit test for ptr compression

 .mailmap  |   1 +
 app/test/meson.build  |   1 +
 app/test/test_eal_ptr_compress.c  | 108 ++
 app/test/test_ring.h  |  94 -
 app/test/test_ring_perf.c | 354 --
 .../prog_guide/env_abstraction_layer.rst  | 142 +++
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_ptr_compress.h| 266 +
 8 files changed, 843 insertions(+), 124 deletions(-)
 create mode 100644 app/test/test_eal_ptr_compress.c
 create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v4 2/4] test: add pointer compress tests to ring perf test

2023-11-01 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  94 +-
 app/test/test_ring_perf.c | 354 +-
 2 files changed, 324 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..3b00f2465d 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */

 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t, so we use a uint32_t,
+* halve the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_16(0, obj, zcd.ptr1, zcd.n1 * 2, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_16(0,
+   obj + (zcd.n1 * 2),
+   zcd.ptr2,
+   (ret - zcd.n1) * 2, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret * 2;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +211,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
 

[PATCH v4 4/4] test: add unit test for ptr compression

2023-11-01 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly.

Signed-off-by: Paul Szczepanek 
---
 app/test/meson.build |   1 +
 app/test/test_eal_ptr_compress.c | 108 +++
 2 files changed, 109 insertions(+)
 create mode 100644 app/test/test_eal_ptr_compress.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 05bae9216d..753de4bbd3 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -61,6 +61,7 @@ source_file_deps = {
 'test_dmadev_api.c': ['dmadev'],
 'test_eal_flags.c': [],
 'test_eal_fs.c': [],
+'test_eal_ptr_compress.c': [],
 'test_efd.c': ['efd', 'net'],
 'test_efd_perf.c': ['efd', 'hash'],
 'test_errno.c': [],
diff --git a/app/test/test_eal_ptr_compress.c b/app/test/test_eal_ptr_compress.c
new file mode 100644
index 00..c1c9a98be7
--- /dev/null
+++ b/app/test/test_eal_ptr_compress.c
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define PTRS_SIZE 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_eal_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[PTRS_SIZE] = {0};
+   void *ptrs_out[PTRS_SIZE] = {0};
+   uint32_t offsets32[PTRS_SIZE] = {0};
+   uint16_t offsets16[PTRS_SIZE] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32(base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16(base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_eal_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < PTRS_SIZE; n++) {
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_16[j],
+   j /* exponent of alignment */,
+   n,
+   false
+   );
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_32[j],
+   j /* exponent of alignment */,
+   n,
+   true
+   );
+   if (ret != 0)
+   return ret;
+   }
+   }
+   }
+
+   return ret;
+}
+
REGISTER_FAST_TEST(eal_ptr_compress_autotest, true, true, test_eal_ptr_compress);
--
2.25.1



[PATCH v4 1/4] eal: add pointer compression functions

2023-11-01 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 266 +
 3 files changed, 268 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 864d33ee46..3f0c9d32f5 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1058,6 +1058,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index a0463efac7..17d8373648 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h 
b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..6697385113
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,266 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting 
them
+ * to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For usage example and further explanation please see "Pointer Compression" 
in
+ * doc/guides/prog_guide/env_abstraction_layer.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee 
they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE
+   svuint64_t v_ptr_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_ptr_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_ptr_table = svsub_x(pg, v_ptr_table, (uint64_t)ptr_base);
+   v_ptr_table = svlsr_x(pg, v_ptr_table, bit_shift);
+   svst1w(pg, &dest_table[i], v_ptr_table);
+   i += svcntd();
+  

[PATCH v4 3/4] docs: add pointer compression to the EAL guide

2023-11-01 Thread Paul Szczepanek
Documentation added in the EAL guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 .../prog_guide/env_abstraction_layer.rst  | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst 
b/doc/guides/prog_guide/env_abstraction_layer.rst
index 89014789de..cc56784e3d 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -1192,3 +1192,145 @@ will not be deallocated.

 Any successful deallocation event will trigger a callback, for which user
 applications and other DPDK subsystems can register.
+
+.. _pointer_compression:
+
+Pointer Compression
+-------------------
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are 
needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3
+bits can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<<ALIGN_EXPONENT)

Re: [PATCH v3 0/3] add pointer compression API

2023-11-01 Thread Paul Szczepanek

On 01/11/2023 07:42, Morten Brørup wrote:

From: Paul Szczepanek [mailto:paul.szczepa...@arm.com]
Sent: Tuesday, 31 October 2023 19.11

[...]


In a more realistic mock application running the l3 forwarding dpdk
example that works in pipeline mode this translated into a ~5%
throughput
increase on an ampere altra.

What was the bulk size in this test?

And were the pipeline stages running on the same lcore or individual lcores per 
pipeline stage?



The pipeline mode was run on separate cores and used 128 as the bulk size.



Re: [PATCH v3] config/arm: update aarch32 build with gcc13

2023-11-01 Thread Paul Szczepanek



On 25/10/2023 13:57, Juraj Linkeš wrote:

The aarch32 with gcc13 fails with:

Compiler for C supports arguments -march=armv8-a: NO

../config/arm/meson.build:714:12: ERROR: Problem encountered: No
suitable armv8 march version found.

This is because we test -march=armv8-a alone (without the -mpfu option),
which is no longer supported in gcc13 aarch32 builds.

The most recent recommendation from the compiler team is to build with
-march=armv8-a+simd -mfpu=auto, which should work for compilers old and
new. The suggestion is to first check -march=armv8-a+simd and only then
check -mfpu=auto.

To address this, add a way to force the architecture (the value of
the -march option).

Signed-off-by: Juraj Linkeš 
---
  config/arm/meson.build | 40 +++-
  1 file changed, 23 insertions(+), 17 deletions(-)

diff --git a/config/arm/meson.build b/config/arm/meson.build
index 3f22d8a2fc..c3f763764a 100644
--- a/config/arm/meson.build
+++ b/config/arm/meson.build
@@ -43,7 +43,9 @@ implementer_generic = {
  },
  'generic_aarch32': {
  'march': 'armv8-a',
-'compiler_options': ['-mfpu=neon'],
+'force_march': true,
+'march_features': ['simd'],
+'compiler_options': ['-mfpu=auto'],
  'flags': [
  ['RTE_ARCH_ARM_NEON_MEMCPY', false],
  ['RTE_ARCH_STRICT_ALIGN', true],
@@ -695,21 +697,25 @@ if update_flags
  # probe supported archs and their features
  candidate_march = ''
  if part_number_config.has_key('march')
-supported_marchs = ['armv8.6-a', 'armv8.5-a', 'armv8.4-a', 'armv8.3-a',
-'armv8.2-a', 'armv8.1-a', 'armv8-a']
-check_compiler_support = false
-foreach supported_march: supported_marchs
-if supported_march == part_number_config['march']
-# start checking from this version downwards
-check_compiler_support = true
-endif
-if (check_compiler_support and
-cc.has_argument('-march=' + supported_march))
-candidate_march = supported_march
-# highest supported march version found
-break
-endif
-endforeach
+if part_number_config.get('force_march', false)
+candidate_march = part_number_config['march']
+else
+supported_marchs = ['armv8.6-a', 'armv8.5-a', 'armv8.4-a', 
'armv8.3-a',
+'armv8.2-a', 'armv8.1-a', 'armv8-a']
+check_compiler_support = false
+foreach supported_march: supported_marchs
+if supported_march == part_number_config['march']
+# start checking from this version downwards
+check_compiler_support = true
+endif
+if (check_compiler_support and
+cc.has_argument('-march=' + supported_march))
+candidate_march = supported_march
+# highest supported march version found
+break
+endif
+endforeach
+endif
  if candidate_march == ''
  error('No suitable armv8 march version found.')
  endif
@@ -741,7 +747,7 @@ if update_flags
  # apply supported compiler options
  if part_number_config.has_key('compiler_options')
  foreach flag: part_number_config['compiler_options']
-if cc.has_argument(flag)
+if cc.has_multi_arguments(machine_args + [flag])
  machine_args += flag
  else
  warning('Configuration compiler option ' +



Reviewed-by: Paul Szczepanek 



[PATCH v5 0/4] add pointer compression API

2023-11-01 Thread Paul Szczepanek
This patchset proposes adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression relies on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the fixed overhead means
compression is only worth it when done in bulk.

A test is added that shows the potential performance gain from compression.
In this test an array of pointers is passed through a ring between two cores.
The gain is dependent on the bulk operation size. In this synthetic test run
on an Ampere Altra a substantial (up to 25%) performance gain is seen for
bulk sizes larger than 32. At 32 it breaks even and lower sizes create a
small (less than 5%) slowdown due to overhead.

In a more realistic mock application running the l3 forwarding DPDK example
that works in pipeline mode on two cores this translated into a ~5%
throughput increase on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size

Paul Szczepanek (4):
  eal: add pointer compression functions
  test: add pointer compress tests to ring perf test
  docs: add pointer compression to the EAL guide
  test: add unit test for ptr compression

 .mailmap  |   1 +
 app/test/meson.build  |   1 +
 app/test/test_eal_ptr_compress.c  | 108 ++
 app/test/test_ring.h  |  94 -
 app/test/test_ring_perf.c | 354 --
 .../prog_guide/env_abstraction_layer.rst  | 142 +++
 lib/eal/include/meson.build   |   1 +
 lib/eal/include/rte_ptr_compress.h| 266 +
 8 files changed, 843 insertions(+), 124 deletions(-)
 create mode 100644 app/test/test_eal_ptr_compress.c
 create mode 100644 lib/eal/include/rte_ptr_compress.h

--
2.25.1



[PATCH v5 1/4] eal: add pointer compression functions

2023-11-01 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers as 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
---
 .mailmap   |   1 +
 lib/eal/include/meson.build|   1 +
 lib/eal/include/rte_ptr_compress.h | 266 +
 3 files changed, 268 insertions(+)
 create mode 100644 lib/eal/include/rte_ptr_compress.h

diff --git a/.mailmap b/.mailmap
index 3f5bab26a8..004751d27a 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1069,6 +1069,7 @@ Paul Greenwalt 
 Paulis Gributs 
 Paul Luse 
 Paul M Stillwell Jr 
+Paul Szczepanek 
 Pavan Kumar Linga 
 Pavan Nikhilesh  
 Pavel Belous 
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index e94b056d46..ce2c733633 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -36,6 +36,7 @@ headers += files(
 'rte_pci_dev_features.h',
 'rte_per_lcore.h',
 'rte_pflock.h',
+   'rte_ptr_compress.h',
 'rte_random.h',
 'rte_reciprocal.h',
 'rte_seqcount.h',
diff --git a/lib/eal/include/rte_ptr_compress.h 
b/lib/eal/include/rte_ptr_compress.h
new file mode 100644
index 00..47a72e4213
--- /dev/null
+++ b/lib/eal/include/rte_ptr_compress.h
@@ -0,0 +1,266 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting 
them
+ * to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For usage example and further explanation please see "Pointer Compression" 
in
+ * doc/guides/prog_guide/env_abstraction_layer.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is the programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee 
they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, unsigned int n, unsigned int bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32
+   svuint64_t v_ptr_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_ptr_table = svld1_u64(pg, (uint64_t *)src_table + i);
+   v_ptr_table = svsub_x(pg, v_ptr_table, (uint64_t)ptr_base);
+   v_ptr_table = svlsr_x(pg, v_ptr_table, bit_shift);
+   svst1w(pg, &dest_table[i], v_ptr_ta

[PATCH v5 3/4] docs: add pointer compression to the EAL guide

2023-11-01 Thread Paul Szczepanek
Documentation added in the EAL guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 .../prog_guide/env_abstraction_layer.rst  | 142 ++
 1 file changed, 142 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst 
b/doc/guides/prog_guide/env_abstraction_layer.rst
index 6debf54efb..f04d032442 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -1192,3 +1192,145 @@ will not be deallocated.

 Any successful deallocation event will trigger a callback, for which user
 applications and other DPDK subsystems can register.
+
+.. _pointer_compression:
+
+Pointer Compression
+-------------------
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are 
needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as a
+32-bit offset. If the pointer points to memory that is 8-byte aligned then 3
+bits can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<<ALIGN_EXPONENT)

[PATCH v5 2/4] test: add pointer compress tests to ring perf test

2023-11-01 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
---
 app/test/test_ring.h  |  94 +-
 app/test/test_ring_perf.c | 354 +-
 2 files changed, 324 insertions(+), 124 deletions(-)

diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..3b00f2465d 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -1,10 +1,12 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2019 Arm Limited
+ * Copyright(c) 2019-2023 Arm Limited
  */

 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int 
esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_16(0, obj, zcd.ptr1, zcd.n1 * 2, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_16(0,
+   obj + (zcd.n1 * 2),
+   zcd.ptr2,
+   (ret - zcd.n1) * 2, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret * 2;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_ptr_compress_32(0, obj, zcd.ptr1, zcd.n1, 3);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_ptr_compress_32(0, obj + zcd.n1,
+   zcd.ptr2, ret - zcd.n1, 3);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
default:
printf("Invalid API type\n");
return 0;
@@ -162,6 +211,9 @@ static inline unsigned int
 test_ring_dequeue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
 

[PATCH v5 4/4] test: add unit test for ptr compression

2023-11-01 Thread Paul Szczepanek
Test compresses and decompresses pointers with various combinations
of memory regions and alignments and verifies the pointers are
recovered correctly.

Signed-off-by: Paul Szczepanek 
---
 app/test/meson.build |   1 +
 app/test/test_eal_ptr_compress.c | 108 +++
 2 files changed, 109 insertions(+)
 create mode 100644 app/test/test_eal_ptr_compress.c

diff --git a/app/test/meson.build b/app/test/meson.build
index 4183d66b0e..3e172b154d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -66,6 +66,7 @@ source_file_deps = {
 'test_dmadev_api.c': ['dmadev'],
 'test_eal_flags.c': [],
 'test_eal_fs.c': [],
+'test_eal_ptr_compress.c': [],
 'test_efd.c': ['efd', 'net'],
 'test_efd_perf.c': ['efd', 'hash'],
 'test_errno.c': [],
diff --git a/app/test/test_eal_ptr_compress.c b/app/test/test_eal_ptr_compress.c
new file mode 100644
index 00..c1c9a98be7
--- /dev/null
+++ b/app/test/test_eal_ptr_compress.c
@@ -0,0 +1,108 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Arm Limited
+ */
+
+#include "test.h"
+#include 
+#include 
+
+#include 
+
+#define MAX_ALIGN_EXPONENT 3
+#define PTRS_SIZE 16
+#define NUM_BASES 2
+#define NUM_REGIONS 4
+#define MAX_32BIT_REGION ((uint64_t)UINT32_MAX + 1)
+#define MAX_16BIT_REGION (UINT16_MAX + 1)
+
+static int
+test_eal_ptr_compress_params(
+   void *base,
+   uint64_t mem_sz,
+   unsigned int align_exp,
+   unsigned int num_ptrs,
+   bool use_32_bit)
+{
+   unsigned int i;
+   unsigned int align = 1 << align_exp;
+   void *ptrs[PTRS_SIZE] = {0};
+   void *ptrs_out[PTRS_SIZE] = {0};
+   uint32_t offsets32[PTRS_SIZE] = {0};
+   uint16_t offsets16[PTRS_SIZE] = {0};
+
+   for (i = 0; i < num_ptrs; i++) {
+   /* make pointers point at memory in steps of align */
+   /* alternate steps from the start and end of memory region */
+   if ((i & 1) == 1)
+   ptrs[i] = (char *)base + mem_sz - i * align;
+   else
+   ptrs[i] = (char *)base + i * align;
+   }
+
+   if (use_32_bit) {
+   rte_ptr_compress_32(base, ptrs, offsets32, num_ptrs, align_exp);
+   rte_ptr_decompress_32(base, offsets32, ptrs_out, num_ptrs,
+   align_exp);
+   } else {
+   rte_ptr_compress_16(base, ptrs, offsets16, num_ptrs, align_exp);
+   rte_ptr_decompress_16(base, offsets16, ptrs_out, num_ptrs,
+   align_exp);
+   }
+
+   TEST_ASSERT_BUFFERS_ARE_EQUAL(ptrs, ptrs_out, sizeof(void *) * num_ptrs,
+   "Decompressed pointers corrupted\nbase pointer: %p, "
+   "memory region size: %" PRIu64 ", alignment exponent: %u, "
+   "num of pointers: %u, using %s offsets",
+   base, mem_sz, align_exp, num_ptrs,
+   use_32_bit ? "32-bit" : "16-bit");
+
+   return 0;
+}
+
+static int
+test_eal_ptr_compress(void)
+{
+   unsigned int j, k, n;
+   int ret = 0;
+   void * const bases[NUM_BASES] = { (void *)0, (void *)UINT16_MAX };
+   /* maximum size for pointers aligned by consecutive powers of 2 */
+   const uint64_t region_sizes_16[NUM_REGIONS] = {
+   MAX_16BIT_REGION,
+   MAX_16BIT_REGION * 2,
+   MAX_16BIT_REGION * 4,
+   MAX_16BIT_REGION * 8,
+   };
+   const uint64_t region_sizes_32[NUM_REGIONS] = {
+   MAX_32BIT_REGION,
+   MAX_32BIT_REGION * 2,
+   MAX_32BIT_REGION * 4,
+   MAX_32BIT_REGION * 8,
+   };
+
+   for (j = 0; j < NUM_REGIONS; j++) {
+   for (k = 0; k < NUM_BASES; k++) {
+   for (n = 1; n < PTRS_SIZE; n++) {
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_16[j],
+   j /* exponent of alignment */,
+   n,
+   false
+   );
+   ret |= test_eal_ptr_compress_params(
+   bases[k],
+   region_sizes_32[j],
+   j /* exponent of alignment */,
+   n,
+   true
+   );
+   if (ret != 0)
+   return ret;
+   }
+   }
+   }
+
+   return ret;
+}
+
+REGISTER_FAST_TEST(eal_ptr_compress_autotest, true, true, 
test_eal_ptr_compress);
--
2.25.1



[RFC v1 0/1] allow header only libraries

2024-03-06 Thread Paul Szczepanek
The current devtools has a check that errors on any
library (except the drivers library, which is exempted)
that does not export any symbols. I want to create
a header-only library, so I want the check to
ignore libraries which have no `global:` section.

Paul Szczepanek (1):
  devtools: allow libraries with no global section

 devtools/check-symbol-maps.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--
2.25.1



[RFC v1 1/1] devtools: allow libraries with no global section

2024-03-06 Thread Paul Szczepanek
If a library has no global section in its version.map,
allow it to have no symbols without reporting it as an error.
This happens when a library doesn't export any functions
because they're all inline.

Signed-off-by: Paul Szczepanek 
---
 devtools/check-symbol-maps.sh | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/devtools/check-symbol-maps.sh b/devtools/check-symbol-maps.sh
index ba2f892f56..380a251aea 100755
--- a/devtools/check-symbol-maps.sh
+++ b/devtools/check-symbol-maps.sh
@@ -63,7 +63,9 @@ fi
 find_empty_maps ()
 {
 for map in $@ ; do
-[ $(buildtools/map-list-symbol.sh $map | wc -l) != '0' ] || echo $map
+# ignore maps that do not have a 'global:' section since they are 
empty by design
+[ $(buildtools/map-list-symbol.sh $map | wc -l) != '0' ] ||
+! grep -q 'global:' $map || echo $map
 done
 }

--
2.25.1



[PATCH v1 0/1] allow libraries with no sources

2024-03-06 Thread Paul Szczepanek
I want to add a library which is header-only.
Attempting to build such a library causes errors during checks.
This skips building libraries with no sources while retaining
other library functionality.
The added lines are the first and last 5. The rest is indentation.

Paul Szczepanek (1):
  lib: allow libraries with no sources

 lib/meson.build | 176 +---
 1 file changed, 91 insertions(+), 85 deletions(-)

--
2.25.1



[PATCH v1 1/1] lib: allow libraries with no sources

2024-03-06 Thread Paul Szczepanek
Allow header-only libraries.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Dhruv Tripathi 

---
 lib/meson.build | 176 +---
 1 file changed, 91 insertions(+), 85 deletions(-)

diff --git a/lib/meson.build b/lib/meson.build
index 4fb01f059b..0fcf3336d1 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -220,105 +220,111 @@ foreach l:libraries
 includes += include_directories(l)
 dpdk_includes += include_directories(l)

-if developer_mode and is_windows and use_function_versioning
-message('@0@: Function versioning is not supported by 
Windows.'.format(name))
-endif
+if sources.length() != 0
+if developer_mode and is_windows and use_function_versioning
+message('@0@: Function versioning is not supported by 
Windows.'.format(name))
+endif

-if use_function_versioning
-cflags += '-DRTE_USE_FUNCTION_VERSIONING'
-endif
-cflags += '-DRTE_LOG_DEFAULT_LOGTYPE=lib.' + l
-if annotate_locks and cc.get_id() == 'clang' and 
cc.version().version_compare('>=3.5.0')
-cflags += '-DRTE_ANNOTATE_LOCKS'
-cflags += '-Wthread-safety'
-endif
+if use_function_versioning
+cflags += '-DRTE_USE_FUNCTION_VERSIONING'
+endif
+cflags += '-DRTE_LOG_DEFAULT_LOGTYPE=lib.' + l
+if annotate_locks and cc.get_id() == 'clang' and cc.version().version_compare('>=3.5.0')
+cflags += '-DRTE_ANNOTATE_LOCKS'
+cflags += '-Wthread-safety'
+endif

-# first build static lib
-static_lib = static_library(libname,
-sources,
-objects: objs,
-c_args: cflags,
-dependencies: static_deps,
-include_directories: includes,
-install: true)
-static_dep = declare_dependency(
-include_directories: includes,
-dependencies: static_deps)
+# first build static lib
+static_lib = static_library(libname,
+sources,
+objects: objs,
+c_args: cflags,
+dependencies: static_deps,
+include_directories: includes,
+install: true)
+static_dep = declare_dependency(
+include_directories: includes,
+dependencies: static_deps)

-if not use_function_versioning or is_windows
-# use pre-build objects to build shared lib
-sources = []
-objs += static_lib.extract_all_objects(recursive: false)
-else
-# for compat we need to rebuild with
-# RTE_BUILD_SHARED_LIB defined
-cflags += '-DRTE_BUILD_SHARED_LIB'
-endif
+if not use_function_versioning or is_windows
+# use pre-build objects to build shared lib
+sources = []
+objs += static_lib.extract_all_objects(recursive: false)
+else
+# for compat we need to rebuild with
+# RTE_BUILD_SHARED_LIB defined
+cflags += '-DRTE_BUILD_SHARED_LIB'
+endif

-version_map = '@0@/@1@/version.map'.format(meson.current_source_dir(), l)
-lk_deps = [version_map]
+version_map = '@0@/@1@/version.map'.format(meson.current_source_dir(), l)
+lk_deps = [version_map]

-if is_ms_linker
-def_file = custom_target(libname + '_def',
-command: [map_to_win_cmd, '@INPUT@', '@OUTPUT@'],
-input: version_map,
-output: '@0@_exports.def'.format(libname))
-lk_deps += [def_file]
+if is_ms_linker
+def_file = custom_target(libname + '_def',
+command: [map_to_win_cmd, '@INPUT@', '@OUTPUT@'],
+input: version_map,
+output: '@0@_exports.def'.format(libname))
+lk_deps += [def_file]

-if is_ms_compiler
-lk_args = ['/def:' + def_file.full_path()]
-if meson.version().version_compare('<0.54.0')
-lk_args += ['/implib:lib\\librte_' + l + '.dll.a']
+if is_ms_compiler
+lk_args = ['/def:' + def_file.full_path()]
+if meson.version().version_compare('<0.54.0')
+lk_args += ['/implib:lib\\librte_' + l + '.dll.a']
+endif
+else
+lk_args = ['-Wl,/def:' + def_file.full_path()]
+if meson.version().version_compare('<0.54.0')
+lk_args += ['-Wl,/implib:lib\\librte_' + l + '.dll.a']
+endif
 endif
 

Re: [RFC v1 1/1] devtools: allow libraries with no global section

2024-03-06 Thread Paul Szczepanek




On 06/03/2024 16:51, David Marchand wrote:

On Wed, Mar 6, 2024 at 5:40 PM Bruce Richardson
 wrote:


On Wed, Mar 06, 2024 at 05:14:15PM +0100, David Marchand wrote:

On Wed, Mar 6, 2024 at 3:36 PM Paul Szczepanek  wrote:


If a library has no global section in the version.map,
allow it to have no symbols and do not report that as an error.
This happens when a library doesn't export any functions
because they are all inline.

Signed-off-by: Paul Szczepanek 


Added Bruce.

Having a library without any actual code compiled is (I think) new in DPDK.

On the other hand, for headers only, there should be no need for a
version.map file at all.

The current meson code expects that every library provides some files
to compile via the sources variable and it expects a version.map file
too.
I wonder if we could skip the whole library generation at the
lib/meson.build level.
Something like:

diff --git a/lib/meson.build b/lib/meson.build
index 179a272932..f0bbab6658 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -222,6 +222,10 @@ foreach l:libraries
  includes += include_directories(l)
  dpdk_includes += include_directories(l)

+if sources.length() == 0
+continue
+endif
+
  if developer_mode and is_windows and use_function_versioning
  message('@0@: Function versioning is not supported by Windows.'.format(name))
  endif

No version.map, no check to update :-)


Two thoughts/suggestions here:

* in original meson port we did have support for header only libraries - I
   think for rte_compat.h, but that was done away with when the header was
   just merged into EAL. See [1]
* for a header only lib - if we are prepared to forego being able to
   disable it - the easiest enablement path may be to not add the directory
   to the list of libraries, and just add the header file path to the global
   include path, or perhaps some other library include path. How to make it
   work best may depend on what the library does and what other DPDK libs, if
   any, it depends upon.


If the goal is to provide those headers as public API, you still need
to call install_headers() somewhere.
And I don't like losing control over disabling about what is shipped.

I prefer [1].




I have uploaded a PATCH that follows [1]:
https://patches.dpdk.org/project/dpdk/patch/20240306221709.166722-2-paul.szczepa...@arm.com/
It might be easier to review by applying the patch first, as most of the diff
is just a tab indentation change caused by the new if.

I have tested it with my header only library and it works.
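
For reference, the meson.build of such a header-only library can reduce to
just listing the headers to install, with no 'sources' variable at all. A
minimal sketch (the header name here is a placeholder, not a real DPDK file):

```meson
# SPDX-License-Identifier: BSD-3-Clause
# Header-only library: nothing to compile, only headers to install.
# With the patch above, lib/meson.build skips the static/shared build
# steps when sources is empty, so no version.map is needed either.
headers = files('rte_example_header.h')
```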


Re: [PATCH v5 0/4] add pointer compression API

2024-03-06 Thread Paul Szczepanek

On 02/03/2024 10:33, Morten Brørup wrote:

I think that a misconception that arch specific optimizations (such as SIMD 
code) required stuff to go into EAL has been prevailing, and this misconception 
is a main reason why EAL has become so bloated.
Moving features like pointer compression out of EAL, thereby showing 
alternative design patterns for code containing arch specific optimizations, 
will help eliminate that misconception.


I have a patch ready that moves the ptr compress into its own library 
but I must first land this patch:

https://patches.dpdk.org/project/dpdk/patch/20240306221709.166722-2-paul.szczepa...@arm.com/
which is required to have header only libraries - otherwise errors stop 
the build.


[PATCH v7 0/4] add pointer compression API

2024-03-07 Thread Paul Szczepanek
This patchset proposes adding a new EAL header with utility functions
that allow compression of arrays of pointers.

When passing caches full of pointers between threads, memory containing
the pointers is copied multiple times which is especially costly between
cores. A compression method will allow us to shrink the memory size
copied.

The compression takes advantage of the fact that pointers are usually
located in a limited memory region (like a mempool). We can compress them
by converting them to offsets from a base memory address.

Offsets can be stored in fewer bytes (dictated by the memory region size
and alignment of the pointer). For example: an 8 byte aligned pointer
which is part of a 32GB memory pool can be stored in 4 bytes. The API is
very generic and does not assume mempool pointers, any pointer can be
passed in.

Compression is based on a few fast operations and, especially when vector
instructions are leveraged, creates minimal overhead.

The API accepts and returns arrays because the fixed per-call overhead means
compression is only worth it when done in bulk.

A test is added that shows the potential performance gain from compression. In
this test an array of pointers is passed through a ring between two cores.
The gain depends on the bulk operation size. In this synthetic test run on an
Ampere Altra, a substantial (up to 25%) performance gain is seen for bulk
sizes larger than 32. At 32 it breaks even, and lower sizes create a small
(less than 5%) slowdown due to overhead.

In a more realistic mock application running the L3 forwarding DPDK example
in pipeline mode on two cores, this translated into a ~5% throughput increase
on an Ampere Altra.

v2:
* addressed review comments (style, explanations and typos)
* lowered bulk iterations closer to original numbers to keep runtime short
* fixed pointer size warning on 32-bit arch
v3:
* added 16-bit versions of compression functions and tests
* added documentation of these new utility functions in the EAL guide
v4:
* added unit test
* fix bug in NEON implementation of 32-bit decompress
v5:
* disable NEON and SVE implementation on AARCH32 due to wrong pointer size
v6:
* added example usage to commit message of the initial commit
v7:
* rebase to remove clashing mailmap changes
v8:
* put ptr compress into its own library
* add depends-on tag
* remove copyright bumps
* typos

Paul Szczepanek (4):
  ptr_compress: add pointer compression library
  test: add pointer compress tests to ring perf test
  docs: add pointer compression guide
  test: add unit test for ptr compression

 app/test/meson.build   |  21 +-
 app/test/test_ptr_compress.c   | 108 +++
 app/test/test_ring.h   |  92 ++
 app/test/test_ring_perf.c  | 352 ++---
 doc/guides/prog_guide/ptr_compress_lib.rst | 144 +
 lib/meson.build|   1 +
 lib/ptr_compress/meson.build   |   4 +
 lib/ptr_compress/rte_ptr_compress.h| 266 
 lib/ptr_compress/version.map   |   3 +
 9 files changed, 859 insertions(+), 132 deletions(-)
 create mode 100644 app/test/test_ptr_compress.c
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h
 create mode 100644 lib/ptr_compress/version.map

--
2.25.1



[PATCH v8 1/4] ptr_compress: add pointer compression library

2024-03-07 Thread Paul Szczepanek
Add a new utility header for compressing pointers. The provided
functions can store pointers in 32-bit or 16-bit offsets.

The compression takes advantage of the fact that pointers are
usually located in a limited memory region (like a mempool).
We can compress them by converting them to offsets from a base
memory address. Offsets can be stored in fewer bytes (dictated
by the memory region size and alignment of the pointer).
For example: an 8 byte aligned pointer which is part of a 32GB
memory pool can be stored in 4 bytes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Paul Szczepanek 
Signed-off-by: Kamalakshitha Aligeri 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
---

Depends-on: patch-138071 ("lib: allow libraries with no sources")

 lib/meson.build |   1 +
 lib/ptr_compress/meson.build|   4 +
 lib/ptr_compress/rte_ptr_compress.h | 266 
 lib/ptr_compress/version.map|   3 +
 4 files changed, 274 insertions(+)
 create mode 100644 lib/ptr_compress/meson.build
 create mode 100644 lib/ptr_compress/rte_ptr_compress.h
 create mode 100644 lib/ptr_compress/version.map

diff --git a/lib/meson.build b/lib/meson.build
index 0fcf3336d1..b46d3e15c6 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -14,6 +14,7 @@ libraries = [
 'argparse',
 'telemetry', # basic info querying
 'eal', # everything depends on eal
+'ptr_compress',
 'ring',
 'rcu', # rcu depends on ring
 'mempool',
diff --git a/lib/ptr_compress/meson.build b/lib/ptr_compress/meson.build
new file mode 100644
index 00..e92706a45f
--- /dev/null
+++ b/lib/ptr_compress/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2024 Arm Limited
+
+headers = files('rte_ptr_compress.h')
diff --git a/lib/ptr_compress/rte_ptr_compress.h b/lib/ptr_compress/rte_ptr_compress.h
new file mode 100644
index 00..97c084003d
--- /dev/null
+++ b/lib/ptr_compress/rte_ptr_compress.h
@@ -0,0 +1,266 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Arm Limited
+ */
+
+#ifndef RTE_PTR_COMPRESS_H
+#define RTE_PTR_COMPRESS_H
+
+/**
+ * @file
+ * Pointer compression and decompression functions.
+ *
+ * When passing arrays full of pointers between threads, memory containing
+ * the pointers is copied multiple times which is especially costly between
+ * cores. These functions allow us to compress the pointers.
+ *
+ * Compression takes advantage of the fact that pointers are usually located in
+ * a limited memory region (like a mempool). We compress them by converting them
+ * to offsets from a base memory address. Offsets can be stored in fewer bytes.
+ *
+ * The compression functions come in two varieties: 32-bit and 16-bit.
+ *
+ * To determine how many bits are needed to compress the pointer calculate
+ * the biggest offset possible (highest value pointer - base pointer)
+ * and shift the value right according to alignment (shift by exponent of the
+ * power of 2 of alignment: aligned by 4 - shift by 2, aligned by 8 - shift by
+ * 3, etc.). The resulting value must fit in either 32 or 16 bits.
+ *
+ * For usage example and further explanation please see "Pointer Compression" in
+ * doc/guides/prog_guide/ptr_compress_lib.rst
+ */
+
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Compress pointers into 32-bit offsets from base pointer.
+ *
+ * @note It is programmer's responsibility to ensure the resulting offsets fit
+ * into 32 bits. Alignment of the structures pointed to by the pointers allows
+ * us to drop bits from the offsets. This is controlled by the bit_shift
+ * parameter. This means that if structures are aligned by 8 bytes they must be
+ * within 32GB of the base pointer. If there is no such alignment guarantee they
+ * must be within 4GB.
+ *
+ * @param ptr_base
+ *   A pointer used to calculate offsets of pointers in src_table.
+ * @param src_table
+ *   A pointer to an array of pointers.
+ * @param dest_table
+ *   A pointer to an array of compressed pointers returned by this function.
+ * @param n
+ *   The number of objects to compress, must be strictly positive.
+ * @param bit_shift
+ *   Byte alignment of memory pointed to by the pointers allows for
+ *   bits to be dropped from the offset and hence widen the memory region that
+ *   can be covered. This controls how many bits are right shifted.
+ **/
+static __rte_always_inline void
+rte_ptr_compress_32(void *ptr_base, void **src_table,
+   uint32_t *dest_table, size_t n, uint8_t bit_shift)
+{
+   unsigned int i = 0;
+#if defined RTE_HAS_SVE_ACLE && !defined RTE_ARCH_ARMv8_AARCH32
+   svuint64_t v_ptr_table;
+   svbool_t pg = svwhilelt_b64(i, n);
+   do {
+   v_ptr_

[PATCH v8 2/4] test: add pointer compress tests to ring perf test

2024-03-07 Thread Paul Szczepanek
Add a test that runs a zero copy burst enqueue and dequeue on a ring
of raw pointers and compressed pointers at different burst sizes to
showcase performance benefits of newly added pointer compression APIs.

Refactored threading code to pass more parameters to threads to
reuse existing code. Added more bulk sizes to showcase their effects
on compression. Adjusted loop iteration numbers to take into account
bulk sizes to keep runtime constant (instead of number of operations).

Adjusted old printfs to match new ones which have aligned numbers.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
---
 app/test/meson.build  |  20 +--
 app/test/test_ring.h  |  92 ++
 app/test/test_ring_perf.c | 352 +-
 3 files changed, 332 insertions(+), 132 deletions(-)

diff --git a/app/test/meson.build b/app/test/meson.build
index 7d909039ae..df8cc00730 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -156,16 +156,16 @@ source_file_deps = {
 #'test_resource.c': [],
 'test_rib.c': ['net', 'rib'],
 'test_rib6.c': ['net', 'rib'],
-'test_ring.c': [],
-'test_ring_hts_stress.c': [],
-'test_ring_mpmc_stress.c': [],
-'test_ring_mt_peek_stress.c': [],
-'test_ring_mt_peek_stress_zc.c': [],
-'test_ring_perf.c': [],
-'test_ring_rts_stress.c': [],
-'test_ring_st_peek_stress.c': [],
-'test_ring_st_peek_stress_zc.c': [],
-'test_ring_stress.c': [],
+'test_ring.c': ['ptr_compress'],
+'test_ring_hts_stress.c': ['ptr_compress'],
+'test_ring_mpmc_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress.c': ['ptr_compress'],
+'test_ring_mt_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_perf.c': ['ptr_compress'],
+'test_ring_rts_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress.c': ['ptr_compress'],
+'test_ring_st_peek_stress_zc.c': ['ptr_compress'],
+'test_ring_stress.c': ['ptr_compress'],
 'test_rwlock.c': [],
 'test_sched.c': ['net', 'sched'],
 'test_security.c': ['net', 'security'],
diff --git a/app/test/test_ring.h b/app/test/test_ring.h
index 45c263f3ff..f90662818c 100644
--- a/app/test/test_ring.h
+++ b/app/test/test_ring.h
@@ -5,6 +5,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 

 /* API type to call
  * rte_ring__enqueue_
@@ -25,6 +27,10 @@
 #define TEST_RING_ELEM_BULK 16
 #define TEST_RING_ELEM_BURST 32

+#define TEST_RING_ELEM_BURST_ZC 64
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16 128
+#define TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_32 256
+
 #define TEST_RING_IGNORE_API_TYPE ~0U

 /* This function is placed here as it is required for both
@@ -101,6 +107,9 @@ static inline unsigned int
 test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
unsigned int api_type)
 {
+   unsigned int ret;
+   struct rte_ring_zc_data zcd = {0};
+
/* Legacy queue APIs? */
if (esize == -1)
switch (api_type) {
@@ -152,6 +161,46 @@ test_ring_enqueue(struct rte_ring *r, void **obj, int esize, unsigned int n,
case (TEST_RING_THREAD_MPMC | TEST_RING_ELEM_BURST):
return rte_ring_mp_enqueue_burst_elem(r, obj, esize, n,
NULL);
+   case (TEST_RING_ELEM_BURST_ZC):
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, esize, n, &zcd, NULL);
+   if (unlikely(ret == 0))
+   return 0;
+   rte_memcpy(zcd.ptr1, (char *)obj, zcd.n1 * esize);
+   if (unlikely(zcd.ptr2 != NULL))
+   rte_memcpy(zcd.ptr2,
+   (char *)obj + zcd.n1 * esize,
+   (ret - zcd.n1) * esize);
+   rte_ring_enqueue_zc_finish(r, ret);
+   return ret;
+   case (TEST_RING_ELEM_BURST_ZC_COMPRESS_PTR_16):
+   /* rings cannot store uint16_t so we use a uint32_t
+* and half the requested number of elements
+* and compensate by doubling the returned numbers
+*/
+   ret = rte_ring_enqueue_zc_burst_elem_start(
+   r, sizeof(uint32_t), n / 2, &zcd, NULL);
+   if

[PATCH v8 3/4] docs: add pointer compression guide

2024-03-07 Thread Paul Szczepanek
Documentation added in the prog guide for the new
utility functions for pointer compression,
showing example code and potential use cases.

Signed-off-by: Paul Szczepanek 
Reviewed-by: Honnappa Nagarahalli 
Reviewed-by: Nathan Brown 
---
 doc/guides/prog_guide/ptr_compress_lib.rst | 144 +
 1 file changed, 144 insertions(+)
 create mode 100644 doc/guides/prog_guide/ptr_compress_lib.rst

diff --git a/doc/guides/prog_guide/ptr_compress_lib.rst b/doc/guides/prog_guide/ptr_compress_lib.rst
new file mode 100644
index 00..00bc8a3697
--- /dev/null
+++ b/doc/guides/prog_guide/ptr_compress_lib.rst
@@ -0,0 +1,144 @@
+..  SPDX-License-Identifier: BSD-3-Clause
+Copyright(c) 2024 Arm Limited.
+
+.. _Pointer_Compression_Library:
+
+Pointer Compression Library
+===
+
+Use ``rte_ptr_compress_16()`` and ``rte_ptr_decompress_16()`` to compress and
+decompress pointers into 16-bit offsets. Use ``rte_ptr_compress_32()`` and
+``rte_ptr_decompress_32()`` to compress and decompress pointers into 32-bit
+offsets.
+
+Compression takes advantage of the fact that pointers are usually located in a
+limited memory region (like a mempool). By converting them to offsets from a
+base memory address they can be stored in fewer bytes. How many bytes are 
needed
+to store the offset is dictated by the memory region size and alignment of
+objects the pointers point to.
+
+For example, a pointer which is part of a 4GB memory pool can be stored as 32
+bit offset. If the pointer points to memory that is 8 bytes aligned then 3 bits
+can be dropped from the offset and a 32GB memory pool can now fit in 32 bits.
+
+For performance reasons these requirements are not enforced programmatically.
+The programmer is responsible for ensuring that the combination of distance
+from the base pointer and memory alignment allow for storing of the offset in
+the number of bits indicated by the function name (16 or 32). Start of mempool
+memory would be a good candidate for the base pointer. Otherwise any pointer
+that precedes all pointers, is close enough and has the same alignment as the
+pointers being compressed will work.
+
+.. note::
+
+Performance gains depend on the batch size of pointers and CPU capabilities
+such as vector extensions. It's important to measure the performance
+increase on target hardware. A test called ``ring_perf_autotest`` in
+``dpdk-test`` can provide the measurements.
+
+Example usage
+~
+
+In this example we send pointers between two cores through a ring. While this
+is a realistic use case the code is simplified for demonstration purposes and
+does not have error handling.
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#define ITEMS_ARRAY_SIZE (1024)
+#define BATCH_SIZE (128)
+#define ALIGN_EXPONENT (3)
+#define ITEM_ALIGN (1<
