RE: Inquiry Regarding Sending Patches to DPDK

2024-08-15 Thread 王颢
Dear Morten,

I have successfully resolved this problem. Prior to the tests, I overlooked the 
necessity of disabling smtpencryption. However, upon modifying my .gitconfig 
file as demonstrated below, everything now functions flawlessly. If IT allows 
anonymous email sending, then the 2FA problem will be solved.

[user]
name = howard_wang
email = howard_w...@realsil.com.cn
[core]
editor = vim
[sendemail]
smtpserverport = 25
smtpserver = smtpsrv.realsil.com.cn
smtpdomain = 172.29.32.27

Thanks!
Howard Wang

-----Original Message-----
From: Morten Brørup
Sent: Wednesday, 14 August 2024 18:15
To: 王颢 ; Stephen Hemminger
Cc: dev@dpdk.org
Subject: RE: Inquiry Regarding Sending Patches to DPDK


External mail.



Howard,

I'm using .gitconfig to configure my git send-email options. Try this in your
.gitconfig:

[user]
name = Howard Wang
email = howard_w...@realsil.com.cn

[sendemail]
from = Howard Wang 
envelopeSender = howard_w...@realsil.com.cn
smtpServer = smtpsrv.realsil.com.cn


Med venlig hilsen / Kind regards,
-Morten Brørup


> -Original Message-
> From: 王颢 [mailto:howard_w...@realsil.com.cn]
> Sent: Wednesday, 14 August 2024 11.52
> To: Stephen Hemminger
> Cc: dev@dpdk.org
> Subject: 答复: Inquiry Regarding Sending Patches to DPDK
>
> Dear Stephen,
>
> Now I have a better understanding of the anonymous sending suggested
> by the company's IT department. Since the two-factor authentication
> for the email account is Microsoft's Okta, which does not seem
> straightforward to configure with an account and password, they have
> enabled anonymous sending for me. Roughly, here is how it works:
> when I send emails, I don't need to input an account or password.
> Instead, I just need to configure the server and port number, and I can
> send emails. Attached below is the script I've written.
> However, it seems there are some issues, and perhaps I need to conduct
> further research.
>
> test result: 
> https://mails.dpdk.org/archives/dev/2024-August/299466.html
> python:
> #!/usr/bin/env python3
> import smtplib
> from email.mime.multipart import MIMEMultipart
> from email.mime.text import MIMEText
> from email.mime.base import MIMEBase
> from email import encoders
>
> smtp_server = 'smtpsrv.realsil.com.cn'
> smtp_port = 25
>
> from_addr = 'howard_w...@realsil.com.cn'
> to_addr = 'dev@dpdk.org'
>
> msg = MIMEMultipart()
> msg['From'] = from_addr
> msg['To'] = to_addr
> #msg['Subject'] = 'test anonymous send mail'
>
> filename = '0001-net-r8169-add-PMD-driver-skeleton.patch'
> with open(filename, 'rb') as attachment:
>     part = MIMEBase('application', 'octet-stream')
>     part.set_payload(attachment.read())
>     encoders.encode_base64(part)
>     part.add_header('Content-Disposition',
>                     f"attachment; filename={filename}")
>     msg.attach(part)
>
> try:
>     server = smtplib.SMTP(smtp_server, smtp_port)
>     server.sendmail(from_addr, [to_addr], msg.as_string())
>     server.quit()
>     print('Mail sent successfully!')
> except Exception as e:
>     print(f'Failed to send mail: {e}')
>
> Thanks!
> Howard Wang
>
> -----Original Message-----
> From: Stephen Hemminger
> Sent: Monday, 12 August 2024 22:56
> To: 王颢
> Cc: dev@dpdk.org
> Subject: Re: Inquiry Regarding Sending Patches to DPDK
>
>
> External mail.
>
>
>
> On Mon, 12 Aug 2024 07:52:39 +
> 王颢  wrote:
>
> > Dear all,
> >
> > I hope this message finds you well.
> >
> > I would like to seek your advice on an issue I've encountered. Our
> > company has recently enabled two-factor authentication (2FA) for our
> > email accounts. The IT department has suggested that I abandon using
> > the "git send-email" method, as configured through git config, to send
> > patches to DPDK. Instead, they have recommended using "Exchange
> > anonymous send mail." However, I believe this approach might not be
> > feasible.
> >
> > I wanted to confirm this with you and see if you could provide any
> > guidance on the matter. I look forward to your response.
> >
> > Thank you very much for your time and assistance.
> >
> > Best regards,
> > Howard Wang
>
> There are two issues here:
> Using git send-email is not required. You can generate patch files and 
> put them in your email.
> BUT Microsoft Exchange does not preserve text formatting in messages. 
> Any patches sent that way are usually corrupted.
>
> At Microsoft, we ended up using a special server (not Exchange) to 
> send Linux and DPDK patches. Or using non-corporate accounts.


[PATCH v2 0/2] examples/l3fwd fixes for ACL mode

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

As Song Jiale pointed out, the previous fix is not enough to fix
the problem he is observing with l3fwd in ACL mode:
https://bugs.dpdk.org/show_bug.cgi?id=1502
This is a second attempt to fix it.

Konstantin Ananyev (2):
  examples/l3fwd: fix read beyond array boundaries
  examples/l3fwd: fix read beyond array boundaries in ACL mode

 examples/l3fwd/l3fwd_acl.c   | 37 
 examples/l3fwd/l3fwd_altivec.h   |  6 -
 examples/l3fwd/l3fwd_common.h|  7 ++
 examples/l3fwd/l3fwd_em_hlm.h|  2 +-
 examples/l3fwd/l3fwd_em_sequential.h |  2 +-
 examples/l3fwd/l3fwd_fib.c   |  2 +-
 examples/l3fwd/l3fwd_lpm_altivec.h   |  2 +-
 examples/l3fwd/l3fwd_lpm_neon.h  |  2 +-
 examples/l3fwd/l3fwd_lpm_sse.h   |  2 +-
 examples/l3fwd/l3fwd_neon.h  |  6 -
 examples/l3fwd/l3fwd_sse.h   |  6 -
 11 files changed, 55 insertions(+), 19 deletions(-)

-- 
2.35.3



[RFC 0/6] Stage-Ordered API and other extensions for ring library

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Konstantin Ananyev (6):
  ring: common functions for 'move head' ops
  ring: make copying functions generic
  ring/soring: introduce Staged Ordered Ring
  app/test: add unit tests for soring API
  examples/l3fwd: make ACL work in pipeline and eventdev modes
  ring: minimize reads of the counterpart cache-line

The main aim of this series is to extend the ring library with
a new API that allows users to create/use a Staged-Ordered-Ring (SORING)
abstraction. In addition to that, there are a few other patches that serve
different purposes:
- first two patches are just code reordering to de-duplicate
  and generalize existing rte_ring code.
- next two patches introduce SORING API into the ring library and
  provide UT for it.
- patch #5 extends the l3fwd sample app to work in pipeline (worker-pool) mode.
  Right now it is done for demonstration and performance comparison purposes:
  it makes it possible to run l3fwd in different modes:
  run-to-completion, eventdev, pipeline
  and perform sort-of 'apples-to-apples' performance comparisons.
  I am aware that the general community consensus on l3fwd is to keep its
  functionality simple and limited. On the other hand, we already have an
  eventdev mode for it, so why should pipeline mode be prohibited?
  Though if l3fwd is not an option, then we need to select some other
  existing sample app to integrate with. Probably ipsec-secgw would be the
  second best choice from my perspective, though it would require much more
  effort.
  Have to say that the current l3fwd patch is way too big and unfinished,
  so if we decide to go forward with it, it has to be split and reworked.
- patch #6 - an attempt to optimize (by caching the counterpart tail value)
  enqueue/dequeue operations for vanilla rte_ring. Logically it is not linked
  with patches 3-5 and probably should be in a separate series.
  I put it here for now just to minimize 'Depends-on' hassle, so everyone
  can build/try everything in one go.

Seeking community help/feedback (apart from usual patch review activity):
==========================================================================
- While we tested these changes quite extensively, our platform coverage
  is limited to x86 right now.
  So we would appreciate feedback on how it behaves on other architectures
  DPDK supports (ARM, PPC, etc.).
  Especially for patch #6: so far we didn't observe a noticeable performance
  improvement with it on x86_64,
  so if there is no real gain on other platforms (or scenarios) -
  I am ok to drop that patch.
- Adding a new (pipeline) mode for the l3fwd sample app:
  is it worth it? If not, what other sample app should be used to
  demonstrate the new functionality we worked on? ipsec-secgw? Something else?

SORING overview
===
Staged-Ordered-Ring (SORING) provides a SW abstraction for 'ordered' queues
with multiple processing 'stages'. It is based on conventional DPDK rte_ring,
re-uses many of its concepts, and even substantial part of its code.
It can be viewed as an 'extension' of rte_ring functionality.
In particular, main SORING properties:
- circular ring buffer with fixed size objects
- producer, consumer plus multiple processing stages in between.
- allows splitting object processing into multiple stages.
- objects remain in the same ring while moving from one stage to the other,
  initial order is preserved, no extra copying needed.
- preserves the ingress order of objects within the queue across multiple
  stages
- each stage (and producer/consumer) can be served by single and/or
  multiple threads.
- number of stages, size and number of objects in the ring are
  configurable at ring initialization time.

Data-path API provides four main operations (a usage sketch follows this list):
- enqueue/dequeue work in the same manner as for conventional rte_ring,
  all rte_ring synchronization types are supported.
- acquire/release - for each stage there is an acquire (start) and
  release (finish) operation. After some objects are 'acquired', a given
  thread can safely assume that it has exclusive ownership of these
  objects until it invokes 'release' for them.
  After 'release', objects can be 'acquired' by the next stage and/or
  dequeued by the consumer (in the case of the last stage).
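As a usage sketch of a single-stage pipeline (names are not confirmed API:
the acquire/release signatures follow the unit tests in patch #4, while
rte_soring_enqueue()/rte_soring_dequeue() are assumed by analogy with
rte_ring):

#include <stdint.h>
#include <rte_soring.h>

#define BURST 32

/* producer: same semantics as rte_ring enqueue (name assumed) */
static void
produce(struct rte_soring *sor, void **objs, uint32_t n)
{
	rte_soring_enqueue(sor, objs, NULL, n, RTE_RING_QUEUE_FIXED);
}

/* stage-0 worker: take exclusive ownership, process, pass downstream */
static void
process_stage0(struct rte_soring *sor)
{
	uint32_t ftoken, n;
	void *objs[BURST];

	n = rte_soring_acquire(sor, objs, NULL, 0, BURST,
			RTE_RING_QUEUE_FIXED, &ftoken, NULL);
	if (n != 0) {
		/* objs[0..n-1] are exclusively owned until released */
		rte_soring_release(sor, NULL, NULL, 0, n, ftoken);
	}
}

/* consumer: objects come out in their original ingress order (name assumed) */
static uint32_t
consume(struct rte_soring *sor, void **objs, uint32_t n)
{
	return rte_soring_dequeue(sor, objs, NULL, n, RTE_RING_QUEUE_FIXED);
}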

Expected use-case: applications that use a pipeline model
(probably with multiple stages) for packet processing, when preserving
incoming packet order is important.

The concept of 'ring with stages' is similar to the DPDK OPDL eventdev PMD [1],
but the internals are different.
In particular, SORING maintains an internal array of 'states' for each element
in the ring that is shared by all threads/processes that access the ring.
That allows 'release' to avoid excessive waits on the tail value and helps
to improve performance and scalability.
In terms of performance, with our measurements rte_soring and
conventional rte_ring provide nearly identical numbers.
As an example, on our SUT: Intel ICX CPU @ 2.00GHz, 
l3fwd (--lookup=acl) in pipeline mode (see patch #5 for detai

[RFC 1/6] ring: common functions for 'move head' ops

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Note upfront: this change doesn't introduce any functional or
performance changes.
It is just a code reordering for:
 - code deduplication
 - the ability to re-use the same code in the future to introduce new functionality

For each sync mode, the corresponding move_prod_head() and
move_cons_head() are nearly identical to each other;
the only differences are:
 - whether we need to use @capacity to calculate the number of entries or not.
 - what we need to update (prod/cons) and what is used as the
   read-only counterpart.
So instead of having two copies of nearly identical functions,
introduce a new common one that can be used by both:
move_prod_head() and move_cons_head().

As another positive, the new common sub-function no longer needs to
reference the whole rte_ring structure.
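To illustrate, with the new helper both old entry points can become thin
wrappers along the lines of the sketch below (based on the signature in the
diff that follows; the actual wrapper bodies in the patch may differ):

static __rte_always_inline unsigned int
__rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
		unsigned int n, enum rte_ring_queue_behavior behavior,
		uint32_t *old_head, uint32_t *new_head, uint32_t *free_entries)
{
	/* producer head moves, consumer tail is the read-only counterpart */
	return __rte_ring_headtail_move_head(&r->prod, &r->cons, r->capacity,
			is_sp, n, behavior, old_head, new_head, free_entries);
}

static __rte_always_inline unsigned int
__rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
		unsigned int n, enum rte_ring_queue_behavior behavior,
		uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
{
	/* consumer has no capacity term, so 0 is passed for @capacity and
	 * *entries becomes prod.tail - cons.head, i.e. the filled count.
	 */
	return __rte_ring_headtail_move_head(&r->cons, &r->prod, 0,
			is_sc, n, behavior, old_head, new_head, entries);
}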

Signed-off-by: Konstantin Ananyev 
---
 lib/ring/rte_ring_c11_pvt.h  | 134 +--
 lib/ring/rte_ring_elem_pvt.h |  66 +++
 lib/ring/rte_ring_generic_pvt.h  | 121 
 lib/ring/rte_ring_hts_elem_pvt.h |  85 ++--
 lib/ring/rte_ring_rts_elem_pvt.h |  85 ++--
 5 files changed, 149 insertions(+), 342 deletions(-)

diff --git a/lib/ring/rte_ring_c11_pvt.h b/lib/ring/rte_ring_c11_pvt.h
index 629b2d9288..048933ddc6 100644
--- a/lib/ring/rte_ring_c11_pvt.h
+++ b/lib/ring/rte_ring_c11_pvt.h
@@ -28,41 +28,19 @@ __rte_ring_update_tail(struct rte_ring_headtail *ht, uint32_t old_val,
rte_atomic_store_explicit(&ht->tail, new_val, rte_memory_order_release);
 }
 
-/**
- * @internal This function updates the producer head for enqueue
- *
- * @param r
- *   A pointer to the ring structure
- * @param is_sp
- *   Indicates whether multi-producer path is needed or not
- * @param n
- *   The number of elements we will want to enqueue, i.e. how far should the
- *   head be moved
- * @param behavior
- *   RTE_RING_QUEUE_FIXED:Enqueue a fixed number of items from a ring
- *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
- * @param old_head
- *   Returns head value as it was before the move, i.e. where enqueue starts
- * @param new_head
- *   Returns the current/new head value i.e. where enqueue finishes
- * @param free_entries
- *   Returns the amount of free space in the ring BEFORE head was moved
- * @return
- *   Actual number of objects enqueued.
- *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
- */
 static __rte_always_inline unsigned int
-__rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
-   unsigned int n, enum rte_ring_queue_behavior behavior,
-   uint32_t *old_head, uint32_t *new_head,
-   uint32_t *free_entries)
+__rte_ring_headtail_move_head(struct rte_ring_headtail *d,
+   const struct rte_ring_headtail *s, uint32_t capacity,
+   unsigned int is_st, unsigned int n,
+   enum rte_ring_queue_behavior behavior,
+   uint32_t *old_head, uint32_t *new_head, uint32_t *entries)
 {
-   const uint32_t capacity = r->capacity;
-   uint32_t cons_tail;
-   unsigned int max = n;
+   uint32_t stail;
int success;
+   unsigned int max = n;
 
-   *old_head = rte_atomic_load_explicit(&r->prod.head, rte_memory_order_relaxed);
+   *old_head = rte_atomic_load_explicit(&d->head,
+   rte_memory_order_relaxed);
do {
/* Reset n to the initial burst count */
n = max;
@@ -73,112 +51,36 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
/* load-acquire synchronize with store-release of ht->tail
 * in update_tail.
 */
-   cons_tail = rte_atomic_load_explicit(&r->cons.tail,
+   stail = rte_atomic_load_explicit(&s->tail,
rte_memory_order_acquire);
 
/* The subtraction is done between two unsigned 32bits value
 * (the result is always modulo 32 bits even if we have
-* *old_head > cons_tail). So 'free_entries' is always between 0
+* *old_head > s->tail). So 'free_entries' is always between 0
 * and capacity (which is < size).
 */
-   *free_entries = (capacity + cons_tail - *old_head);
+   *entries = (capacity + stail - *old_head);
 
/* check that we have enough room in ring */
-   if (unlikely(n > *free_entries))
+   if (unlikely(n > *entries))
n = (behavior == RTE_RING_QUEUE_FIXED) ?
-   0 : *free_entries;
+   0 : *entries;
 
if (n == 0)
return 0;
 
*new_head = *old_head + n;
-   if (is_sp) {
-   r->prod.head = *new_head;
+   if

[RFC 2/6] ring: make copying functions generic

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Note upfront: this change doesn't introduce any functional
or performance changes.
It is just a code reordering for:
 - improved code modularity and re-usability
 - the ability to re-use the same code in the future to introduce new functionality

There is no real need for enqueue_elems()/dequeue_elems()
to get a pointer to the actual rte_ring structure; instead it is enough to pass
a pointer to the actual elements buffer inside the ring.
In return, we get copying functions that can be used for other
queueing abstractions that have a circular ring buffer inside.
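For illustration, after this rework the rte_ring-specific wrapper only has to
resolve the buffer address and index, roughly as below (a sketch; the actual
wrapper in the patch may differ in details):

static __rte_always_inline void
__rte_ring_enqueue_elems(struct rte_ring *r, uint32_t prod_head,
		const void *obj_table, uint32_t esize, uint32_t num)
{
	/* the elements buffer starts right after the ring header (&r[1]);
	 * any other abstraction with a circular buffer can call
	 * __rte_ring_do_enqueue_elems() with its own buffer pointer.
	 */
	__rte_ring_do_enqueue_elems(&r[1], obj_table, r->size,
			prod_head & r->mask, esize, num);
}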

Signed-off-by: Konstantin Ananyev 
---
 lib/ring/rte_ring_elem_pvt.h | 117 ---
 1 file changed, 68 insertions(+), 49 deletions(-)

diff --git a/lib/ring/rte_ring_elem_pvt.h b/lib/ring/rte_ring_elem_pvt.h
index 3a83668a08..216cb6089f 100644
--- a/lib/ring/rte_ring_elem_pvt.h
+++ b/lib/ring/rte_ring_elem_pvt.h
@@ -17,12 +17,14 @@
 #endif
 
 static __rte_always_inline void
-__rte_ring_enqueue_elems_32(struct rte_ring *r, const uint32_t size,
-   uint32_t idx, const void *obj_table, uint32_t n)
+__rte_ring_enqueue_elems_32(void *ring_table, const void *obj_table,
+   uint32_t size, uint32_t idx, uint32_t n)
 {
unsigned int i;
-   uint32_t *ring = (uint32_t *)&r[1];
+
+   uint32_t *ring = ring_table;
const uint32_t *obj = (const uint32_t *)obj_table;
+
if (likely(idx + n <= size)) {
for (i = 0; i < (n & ~0x7); i += 8, idx += 8) {
ring[idx] = obj[i];
@@ -60,14 +62,14 @@ __rte_ring_enqueue_elems_32(struct rte_ring *r, const uint32_t size,
 }
 
 static __rte_always_inline void
-__rte_ring_enqueue_elems_64(struct rte_ring *r, uint32_t prod_head,
-   const void *obj_table, uint32_t n)
+__rte_ring_enqueue_elems_64(void *ring_table, const void *obj_table,
+   uint32_t size, uint32_t idx, uint32_t n)
 {
unsigned int i;
-   const uint32_t size = r->size;
-   uint32_t idx = prod_head & r->mask;
-   uint64_t *ring = (uint64_t *)&r[1];
+
+   uint64_t *ring = ring_table;
const unaligned_uint64_t *obj = (const unaligned_uint64_t *)obj_table;
+
if (likely(idx + n <= size)) {
for (i = 0; i < (n & ~0x3); i += 4, idx += 4) {
ring[idx] = obj[i];
@@ -93,14 +95,14 @@ __rte_ring_enqueue_elems_64(struct rte_ring *r, uint32_t prod_head,
 }
 
 static __rte_always_inline void
-__rte_ring_enqueue_elems_128(struct rte_ring *r, uint32_t prod_head,
-   const void *obj_table, uint32_t n)
+__rte_ring_enqueue_elems_128(void *ring_table, const void *obj_table,
+   uint32_t size, uint32_t idx, uint32_t n)
 {
unsigned int i;
-   const uint32_t size = r->size;
-   uint32_t idx = prod_head & r->mask;
-   rte_int128_t *ring = (rte_int128_t *)&r[1];
+
+   rte_int128_t *ring = ring_table;
const rte_int128_t *obj = (const rte_int128_t *)obj_table;
+
if (likely(idx + n <= size)) {
for (i = 0; i < (n & ~0x1); i += 2, idx += 2)
memcpy((void *)(ring + idx),
@@ -126,37 +128,47 @@ __rte_ring_enqueue_elems_128(struct rte_ring *r, uint32_t prod_head,
  * single and multi producer enqueue functions.
  */
 static __rte_always_inline void
-__rte_ring_enqueue_elems(struct rte_ring *r, uint32_t prod_head,
-   const void *obj_table, uint32_t esize, uint32_t num)
+__rte_ring_do_enqueue_elems(void *ring_table, const void *obj_table,
+   uint32_t size, uint32_t idx, uint32_t esize, uint32_t num)
 {
/* 8B and 16B copies implemented individually to retain
 * the current performance.
 */
if (esize == 8)
-   __rte_ring_enqueue_elems_64(r, prod_head, obj_table, num);
+   __rte_ring_enqueue_elems_64(ring_table, obj_table, size,
+   idx, num);
else if (esize == 16)
-   __rte_ring_enqueue_elems_128(r, prod_head, obj_table, num);
+   __rte_ring_enqueue_elems_128(ring_table, obj_table, size,
+   idx, num);
else {
-   uint32_t idx, scale, nr_idx, nr_num, nr_size;
+   uint32_t scale, nr_idx, nr_num, nr_size;
 
/* Normalize to uint32_t */
scale = esize / sizeof(uint32_t);
nr_num = num * scale;
-   idx = prod_head & r->mask;
nr_idx = idx * scale;
-   nr_size = r->size * scale;
-   __rte_ring_enqueue_elems_32(r, nr_size, nr_idx,
-   obj_table, nr_num);
+   nr_size = size * scale;
+   __rte_ring_enqueue_elems_32(ring_table, obj_table, nr_size,
+   nr_idx, nr_num);
}
 }
 
 static __rte_always_inline void
-__rte_ring_dequeue_elems_32(struct rte_ring *r, const uint32_t size,
-   uint32_t idx, void *obj_table, uint32_t n)

[RFC 3/6] ring/soring: introduce Staged Ordered Ring

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Staged-Ordered-Ring (SORING) provides a SW abstraction for 'ordered' queues
with multiple processing 'stages'.
It is based on conventional DPDK rte_ring, re-uses many of its concepts,
and even substantial part of its code.
It can be viewed as an 'extension' of rte_ring functionality.
In particular, main SORING properties:
- circular ring buffer with fixed size objects
- producer, consumer plus multiple processing stages in the middle.
- allows splitting object processing into multiple stages.
- objects remain in the same ring while moving from one stage to the other,
  initial order is preserved, no extra copying needed.
- preserves the ingress order of objects within the queue across multiple
  stages, i.e.:
  at the same stage multiple threads can process objects from the ring in
  any order, but for the next stage objects will always appear in the
  original order.
- each stage (and producer/consumer) can be served by single and/or
  multiple threads.
- number of stages, size and number of objects in the ring are
  configurable at ring initialization time.

Data-path API provides four main operations:
- enqueue/dequeue work in the same manner as for conventional rte_ring,
  all rte_ring synchronization types are supported.
- acquire/release - for each stage there is an acquire (start) and
  release (finish) operation.
  After some objects are 'acquired', a given thread can safely assume that
  it has exclusive possession of these objects until 'release' is invoked
  for them.
  Note that right now the user has to release exactly the same number of
  objects that were acquired before.
  After 'release', objects can be 'acquired' by the next stage and/or dequeued
  by the consumer (in the case of the last stage).

Expected use-case: applications that use a pipeline model
(probably with multiple stages) for packet processing, when preserving
incoming packet order is important, e.g. IPsec processing.
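A sketch of a middle-stage worker loop (the acquire/release signatures follow
the unit tests added later in this series; burst size and loop structure are
illustrative):

#include <rte_common.h>
#include <rte_soring.h>

static void
stage_worker_loop(struct rte_soring *sor, uint32_t stage)
{
	uint32_t ftoken, n;
	void *objs[32];

	for (;;) {
		/* objects are handed to this stage in ingress order */
		n = rte_soring_acquire(sor, objs, NULL, stage, RTE_DIM(objs),
				RTE_RING_QUEUE_FIXED, &ftoken, NULL);
		if (n == 0)
			continue;
		/* exclusive possession of objs[0..n-1] until released */
		/* ... process objects, any order, any thread ... */
		/* must release exactly the same number that was acquired */
		rte_soring_release(sor, NULL, NULL, stage, n, ftoken);
	}
}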

Signed-off-by: Konstantin Ananyev 
---
 lib/ring/meson.build  |   4 +-
 lib/ring/rte_soring.c | 144 ++
 lib/ring/rte_soring.h | 270 ++
 lib/ring/soring.c | 431 ++
 lib/ring/soring.h | 124 
 lib/ring/version.map  |  13 ++
 6 files changed, 984 insertions(+), 2 deletions(-)
 create mode 100644 lib/ring/rte_soring.c
 create mode 100644 lib/ring/rte_soring.h
 create mode 100644 lib/ring/soring.c
 create mode 100644 lib/ring/soring.h

diff --git a/lib/ring/meson.build b/lib/ring/meson.build
index 7fca958ed7..21f2c12989 100644
--- a/lib/ring/meson.build
+++ b/lib/ring/meson.build
@@ -1,8 +1,8 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-sources = files('rte_ring.c')
-headers = files('rte_ring.h')
+sources = files('rte_ring.c', 'rte_soring.c', 'soring.c')
+headers = files('rte_ring.h', 'rte_soring.h')
 # most sub-headers are not for direct inclusion
 indirect_headers += files (
 'rte_ring_core.h',
diff --git a/lib/ring/rte_soring.c b/lib/ring/rte_soring.c
new file mode 100644
index 00..17b1b73a42
--- /dev/null
+++ b/lib/ring/rte_soring.c
@@ -0,0 +1,144 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Huawei Technologies Co., Ltd
+ */
+
+#include "soring.h"
+#include 
+
+RTE_LOG_REGISTER_DEFAULT(soring_logtype, INFO);
+#define RTE_LOGTYPE_SORING soring_logtype
+#define SORING_LOG(level, ...) \
+   RTE_LOG_LINE(level, SORING, "" __VA_ARGS__)
+
+static uint32_t
+soring_calc_elem_num(uint32_t count)
+{
+   return rte_align32pow2(count + 1);
+}
+
+static int
+soring_check_param(uint32_t esize, uint32_t stsize, uint32_t count,
+   uint32_t stages)
+{
+   if (stages == 0) {
+   SORING_LOG(ERR, "invalid number of stages: %u", stages);
+   return -EINVAL;
+   }
+
+   /* Check if element size is a multiple of 4B */
+   if (esize == 0 || esize % 4 != 0) {
+   SORING_LOG(ERR, "invalid element size: %u", esize);
+   return -EINVAL;
+   }
+
+   /* Check if ret-code size is a multiple of 4B */
+   if (stsize % 4 != 0) {
+   SORING_LOG(ERR, "invalid retcode size: %u", stsize);
+   return -EINVAL;
+   }
+
+/* count must be a power of 2 */
+   if (rte_is_power_of_2(count) == 0 ||
+   (count > RTE_SORING_ELEM_MAX + 1)) {
+   SORING_LOG(ERR, "invalid number of elements: %u", count);
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+/*
+ * Calculate size offsets for SORING internal data layout.
+ */
+static size_t
+soring_get_szofs(uint32_t esize, uint32_t stsize, uint32_t count,
+   uint32_t stages, size_t *elst_ofs, size_t *state_ofs,
+   size_t *stage_ofs)
+{
+   size_t sz;
+   const struct rte_soring * const r = NULL;
+
+   sz = sizeof(r[0]) + (size_t)count * esize;
+   sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
+
+   if (elst_ofs != NULL)
+   

[RFC 4/6] app/test: add unit tests for soring API

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Add both functional and stress test cases for the soring API.
The stress test serves as both a functional and performance test of soring
enqueue/dequeue/acquire/release operations under high contention
(for both over-committed and non-over-committed scenarios).

Signed-off-by: Eimear Morrissey 
Signed-off-by: Konstantin Ananyev 
---
 app/test/meson.build   |   3 +
 app/test/test_soring.c | 452 
 app/test/test_soring_mt_stress.c   |  45 ++
 app/test/test_soring_stress.c  |  48 ++
 app/test/test_soring_stress.h  |  35 ++
 app/test/test_soring_stress_impl.h | 832 +
 6 files changed, 1415 insertions(+)
 create mode 100644 app/test/test_soring.c
 create mode 100644 app/test/test_soring_mt_stress.c
 create mode 100644 app/test/test_soring_stress.c
 create mode 100644 app/test/test_soring_stress.h
 create mode 100644 app/test/test_soring_stress_impl.h

diff --git a/app/test/meson.build b/app/test/meson.build
index e29258e6ec..c290162e43 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -175,6 +175,9 @@ source_file_deps = {
 'test_security_proto.c' : ['cryptodev', 'security'],
 'test_seqlock.c': [],
 'test_service_cores.c': [],
+'test_soring.c': [],
+'test_soring_mt_stress.c': [],
+'test_soring_stress.c': [],
 'test_spinlock.c': [],
 'test_stack.c': ['stack'],
 'test_stack_perf.c': ['stack'],
diff --git a/app/test/test_soring.c b/app/test/test_soring.c
new file mode 100644
index 00..381979bc6f
--- /dev/null
+++ b/app/test/test_soring.c
@@ -0,0 +1,452 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2024 Huawei Technologies Co., Ltd
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "test.h"
+
+#define MAX_ACQUIRED 20
+
+#define SORING_TEST_ASSERT(val, expected) do { \
+   RTE_TEST_ASSERT(expected == val, \
+   "%s: expected %u got %u\n", #val, expected, val); \
+} while (0)
+
+static void
+set_soring_init_param(struct rte_soring_param *prm,
+   const char *name, uint32_t esize, uint32_t elems,
+   uint32_t stages, uint32_t stsize,
+   enum rte_ring_sync_type rst_prod,
+   enum rte_ring_sync_type rst_cons)
+{
+   prm->name = name;
+   prm->esize = esize;
+   prm->elems = elems;
+   prm->stages = stages;
+   prm->stsize = stsize;
+   prm->prod_synt = rst_prod;
+   prm->cons_synt = rst_cons;
+}
+
+static int
+move_forward_stage(struct rte_soring *sor,
+   uint32_t num_packets, uint32_t stage)
+{
+   uint32_t acquired;
+   uint32_t ftoken;
+   uint32_t *acquired_objs[MAX_ACQUIRED];
+
+   acquired = rte_soring_acquire(sor, acquired_objs, NULL, stage,
+   num_packets, RTE_RING_QUEUE_FIXED, &ftoken, NULL);
+   SORING_TEST_ASSERT(acquired, num_packets);
+   rte_soring_release(sor, NULL, NULL, stage, num_packets,
+   ftoken);
+
+   return 0;
+}
+
+/*
+ * struct rte_soring_param param checking.
+ */
+static int
+test_soring_init(void)
+{
+   struct rte_soring *sor = NULL;
+   struct rte_soring_param prm;
+   int rc;
+   size_t sz;
+   memset(&prm, 0, sizeof(prm));
+
+/*init memory*/
+   set_soring_init_param(&prm, "alloc_memory", sizeof(uintptr_t),
+   4, 1, 4, RTE_RING_SYNC_MT, RTE_RING_SYNC_MT);
+   sz = rte_soring_get_memsize(&prm);
+   sor = rte_zmalloc(NULL, sz, RTE_CACHE_LINE_SIZE);
+   RTE_TEST_ASSERT_NOT_NULL(sor, "could not allocate memory for soring");
+
+   set_soring_init_param(&prm, "test_invalid_stages", sizeof(uintptr_t),
+   4, 0, 4, RTE_RING_SYNC_MT, RTE_RING_SYNC_MT);
+   rc = rte_soring_init(sor, &prm);
+   RTE_TEST_ASSERT_FAIL(rc, "initted soring with invalid num stages");
+
+   set_soring_init_param(&prm, "test_invalid_esize", 0,
+   4, 1, 4, RTE_RING_SYNC_MT, RTE_RING_SYNC_MT);
+   rc = rte_soring_init(sor, &prm);
+   RTE_TEST_ASSERT_FAIL(rc, "initted soring with 0 esize");
+
+   set_soring_init_param(&prm, "test_invalid_esize", 9,
+   4, 1, 4, RTE_RING_SYNC_MT, RTE_RING_SYNC_MT);
+   rc = rte_soring_init(sor, &prm);
+   RTE_TEST_ASSERT_FAIL(rc, "initted soring with esize not multiple of 4");
+
+   set_soring_init_param(&prm, "test_invalid_rsize", sizeof(uintptr_t),
+   4, 1, 3, RTE_RING_SYNC_MT, RTE_RING_SYNC_MT);
+   rc = rte_soring_init(sor, &prm);
+   RTE_TEST_ASSERT_FAIL(rc, "initted soring with rcsize not multiple of 4");
+
+   set_soring_init_param(&prm, "test_invalid_elems", sizeof(uintptr_t),
+   RTE_SORING_ELEM_MAX + 1, 1, 4, RTE_RING_SYNC_M

[RFC 5/6] examples/l3fwd: make ACL work in pipeline and eventdev modes

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Note upfront:
This is a huge commit that is combined from several ones.
For now, I submit it just for reference and demonstration purposes and
will probably remove it in future versions.
If we decide to go ahead with it, then it needs to be reworked and split
into several proper commits.

It adds for l3fwd:
 - an eventdev mode for ACL lookup-mode
 - a worker-pool mode
   (right now implemented for ACL lookup-mode only).
Worker-Pool mode is a simple pipeline model, with the following stages:
 1) I/O thread receives packets from NIC RX HW queues and enqueues them
    into the work queue
 2) Worker thread reads packets from the work queue(s),
    processes them and then puts processed packets back into the
    work queue along with the processing status (routing info/error code).
 3) I/O thread dequeues packets and their status from the work queue,
    and based on it either TXes or drops each packet.
Very similar to the l3fwd-eventdev working model.

Note that there can be several I/O threads, each serving one or multiple
HW RX queues. Also there can be several Worker threads, each of them
processing packets from multiple work queues in round-robin fashion.

Work queue can be one of the following types:
 - wqorder: allows Worker threads to process packets in any order,
   but guarantees that on the dequeue stage the ingress order of packets
   will be preserved, i.e. at stage #3, the I/O thread will get packets
   exactly in the same order as they were enqueued at stage #1.
 - wqunorder: doesn't provide any ordering guarantees.

'wqunorder' mode is implemented using two rte_ring structures per queue.
'wqorder' mode is implemented using an rte_soring structure per queue.

To facilitate this new functionality, command line parameters were
extended:
 --mode:
   Possible values, one of: poll/eventdev/wqorder/wqorderS/wqunorder/wqunorderS
   Default value: poll
   - wqorder: Worker-Pool ordered mode with a separate work queue for each
     HW RX queue.
   - wqorderS: Worker-Pool ordered mode with one work queue per I/O thread.
   - wqunorder: Worker-Pool un-ordered mode with a separate work queue for each
     HW RX queue.
   - wqunorderS: Worker-Pool un-ordered mode with one work queue per I/O thread.
 --wqsize: number of elements for each worker queue.
 --lookup-iter: forces the ACL lookup to be performed several times over the
   same packet. This is an artificial parameter added temporarily for
   benchmarking purposes. It will be removed in later versions (if any).

Note that in Worker-Pool mode all free lcores that were not assigned as
I/O threads will be used as Worker threads.
As an example:
dpdk-l3fwd --lcores=53,55,57,59,61 ... -- \
-P -p f --config '(0,0,53)(1,0,53)(2,0,53)(3,0,53)' --lookup acl \
--parse-ptype --mode=wqorder ...
In that case lcore 53 will be used as I/O thread (stages #1,3)
to serve 4 HW RX queues,
while lcores 55,57,59,61 will serve as Worker threads (stage #2).
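Condensed, the wqorder data path maps onto soring roughly as below
(rx_burst()/tx_or_drop() are hypothetical stand-ins for the real RX/TX code,
and the rte_soring_enqueue()/rte_soring_dequeue() names are assumed by analogy
with rte_ring):

#include <rte_mbuf.h>
#include <rte_soring.h>

#define WQ_BURST 32

extern uint16_t rx_burst(struct rte_mbuf **pkts, uint16_t n); /* hypothetical */
extern void tx_or_drop(struct rte_mbuf **pkts, uint16_t n);   /* hypothetical */

/* I/O lcore: stage #1 (enqueue RX burst) and stage #3 (ordered dequeue) */
static void
io_lcore_iter(struct rte_soring *sor)
{
	uint32_t n;
	struct rte_mbuf *pkts[WQ_BURST];

	n = rx_burst(pkts, WQ_BURST);
	rte_soring_enqueue(sor, (void **)pkts, NULL, n, RTE_RING_QUEUE_VARIABLE);

	/* packets come back in their original ingress order */
	n = rte_soring_dequeue(sor, (void **)pkts, NULL, WQ_BURST,
			RTE_RING_QUEUE_VARIABLE);
	tx_or_drop(pkts, n);
}

/* Worker lcore: stage #2 (process, status travels along with the packets) */
static void
worker_lcore_iter(struct rte_soring *sor)
{
	uint32_t ftoken, n;
	struct rte_mbuf *pkts[WQ_BURST];

	n = rte_soring_acquire(sor, (void **)pkts, NULL, 0, WQ_BURST,
			RTE_RING_QUEUE_VARIABLE, &ftoken, NULL);
	if (n != 0) {
		/* ... ACL lookup, store routing result / error code ... */
		rte_soring_release(sor, NULL, NULL, 0, n, ftoken);
	}
}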

Signed-off-by: Konstantin Ananyev 
---
 examples/l3fwd/l3fwd.h   |  55 +++
 examples/l3fwd/l3fwd_acl.c   | 125 +++---
 examples/l3fwd/l3fwd_acl_event.h | 258 +
 examples/l3fwd/l3fwd_event.c |  14 ++
 examples/l3fwd/l3fwd_event.h |   1 +
 examples/l3fwd/l3fwd_sse.h   |  49 +-
 examples/l3fwd/l3fwd_wqp.c   | 274 +++
 examples/l3fwd/l3fwd_wqp.h   | 132 +++
 examples/l3fwd/main.c|  75 -
 examples/l3fwd/meson.build   |   1 +
 10 files changed, 956 insertions(+), 28 deletions(-)
 create mode 100644 examples/l3fwd/l3fwd_acl_event.h
 create mode 100644 examples/l3fwd/l3fwd_wqp.c
 create mode 100644 examples/l3fwd/l3fwd_wqp.h

diff --git a/examples/l3fwd/l3fwd.h b/examples/l3fwd/l3fwd.h
index 93ce652d02..218f363764 100644
--- a/examples/l3fwd/l3fwd.h
+++ b/examples/l3fwd/l3fwd.h
@@ -77,6 +77,42 @@ struct __rte_cache_aligned lcore_rx_queue {
uint16_t queue_id;
 };
 
+enum L3FWD_WORKER_MODE {
+   L3FWD_WORKER_POLL,
+   L3FWD_WORKER_UNQUE,
+   L3FWD_WORKER_ORQUE,
+};
+
+struct l3fwd_wqp_param {
+   enum L3FWD_WORKER_MODE mode;
+   uint32_t qsize;/**< Number of elems in worker queue */
+   int32_t single;/**< use single queue per I/O (poll) thread */
+};
+
+extern struct l3fwd_wqp_param l3fwd_wqp_param;
+
+enum {
+   LCORE_WQ_IN,
+   LCORE_WQ_OUT,
+   LCORE_WQ_NUM,
+};
+
+union lcore_wq {
+   struct rte_ring *r[LCORE_WQ_NUM];
+   struct {
+   struct rte_soring *sor;
   /* used by WQ, sort of thread-local var */
+   uint32_t ftoken;
+   };
+};
+
+struct lcore_wq_pool {
+   uint32_t nb_queue;
+   uint32_t qmask;
+   union lcore_wq queue[MAX_RX_QUEUE_PER_LCORE];
+   struct l3fwd_wqp_param prm;
+};
+
 struct __rte_cache_aligned lcore_conf {
uint16_t n_rx_queue;
struct lcore_rx_queue rx_queue_list[MAX_RX_QUEUE_PER_LCORE];
@@ -86,6 +122,7 @@ struct __rte_cache_aligned l

[RFC 6/6] ring: minimize reads of the counterpart cache-line

2024-08-15 Thread Konstantin Ananyev
From: Konstantin Ananyev 

Note upfront: this change shouldn't affect the rte_ring public API.
Though as the layout of public structures has changed, it is an ABI
breakage.

This is an attempt to implement an rte_ring optimization
that was suggested by Morten and discussed on this mailing
list a while ago.
The idea is to optimize MP/SP & MC/SC ring enqueue/dequeue ops
by storing along with the head its Cached Foreign Tail (CFT) value,
i.e. for the producer we cache the consumer tail value and vice versa.
To avoid races, head and CFT values are read/written using atomic
64-bit ops.
In theory that might help by reducing the number of times the producer
needs to access the consumer's cache-line and vice versa.
In practice, I didn't see any impressive boost so far:
 - ring_per_autotest micro-bench - results are a mixed bag,
   some are a bit better, some are worse.
 - [so]ring_stress_autotest micro-benchmarks: ~10-15% improvement
 - l3fwd in wqorder/wqunorder (see previous patch for details):
   no real difference.

Though so far my testing scope was quite limited: I tried it only
on x86 machines. So can I ask all interested parties,
different platform vendors (ARM, PPC, etc.)
and people who use rte_ring extensively, to give it a try and come up
with feedback.

If there are no real performance improvements on
any platform we support, or some problems are encountered -
I am ok to drop that patch.
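For orientation before the diff: the head layout implied by the fragments
below could look roughly like this (only the .raw and .val.pos accessors are
visible in the truncated diff, so the name of the cached-tail field is an
assumption):

union __rte_ring_head_cft {
	/* raw 8B value, so head and CFT are read/written in one atomic op */
	RTE_ATOMIC(uint64_t) raw;
	struct {
		uint32_t pos; /* head position (the old plain 'head' value) */
		uint32_t cft; /* cached tail of the counterpart head/tail pair */
	} val;
};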

Signed-off-by: Konstantin Ananyev 
---
 drivers/net/mlx5/mlx5_hws_cnt.h   |  5 ++--
 drivers/net/ring/rte_eth_ring.c   |  2 +-
 lib/ring/rte_ring.c   |  6 ++--
 lib/ring/rte_ring_core.h  | 12 +++-
 lib/ring/rte_ring_generic_pvt.h   | 46 +--
 lib/ring/rte_ring_peek_elem_pvt.h |  4 +--
 lib/ring/soring.c | 31 +++--
 lib/ring/soring.h |  4 +--
 8 files changed, 77 insertions(+), 33 deletions(-)

diff --git a/drivers/net/mlx5/mlx5_hws_cnt.h b/drivers/net/mlx5/mlx5_hws_cnt.h
index 996ac8dd9a..663146563c 100644
--- a/drivers/net/mlx5/mlx5_hws_cnt.h
+++ b/drivers/net/mlx5/mlx5_hws_cnt.h
@@ -388,11 +388,12 @@ __mlx5_hws_cnt_pool_enqueue_revert(struct rte_ring *r, unsigned int n,
 
MLX5_ASSERT(r->prod.sync_type == RTE_RING_SYNC_ST);
MLX5_ASSERT(r->cons.sync_type == RTE_RING_SYNC_ST);
-   current_head = rte_atomic_load_explicit(&r->prod.head, rte_memory_order_relaxed);
+   current_head = rte_atomic_load_explicit(&r->prod.head.val.pos,
+   rte_memory_order_relaxed);
MLX5_ASSERT(n <= r->capacity);
MLX5_ASSERT(n <= rte_ring_count(r));
revert2head = current_head - n;
-   r->prod.head = revert2head; /* This ring should be SP. */
+   r->prod.head.val.pos = revert2head; /* This ring should be SP. */
__rte_ring_get_elem_addr(r, revert2head, sizeof(cnt_id_t), n,
&zcd->ptr1, &zcd->n1, &zcd->ptr2);
/* Update tail */
diff --git a/drivers/net/ring/rte_eth_ring.c b/drivers/net/ring/rte_eth_ring.c
index 1346a0dba3..31009e90d2 100644
--- a/drivers/net/ring/rte_eth_ring.c
+++ b/drivers/net/ring/rte_eth_ring.c
@@ -325,7 +325,7 @@ eth_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
 */
pmc->addr = &rng->prod.head;
pmc->size = sizeof(rng->prod.head);
-   pmc->opaque[0] = rng->prod.head;
+   pmc->opaque[0] = rng->prod.head.val.pos;
pmc->fn = ring_monitor_callback;
return 0;
 }
diff --git a/lib/ring/rte_ring.c b/lib/ring/rte_ring.c
index aebb6d6728..cb2c39c7ad 100644
--- a/lib/ring/rte_ring.c
+++ b/lib/ring/rte_ring.c
@@ -102,7 +102,7 @@ reset_headtail(void *p)
switch (ht->sync_type) {
case RTE_RING_SYNC_MT:
case RTE_RING_SYNC_ST:
-   ht->head = 0;
+   ht->head.raw = 0;
ht->tail = 0;
break;
case RTE_RING_SYNC_MT_RTS:
@@ -373,9 +373,9 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
fprintf(f, "  size=%"PRIu32"\n", r->size);
fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-   fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+   fprintf(f, "  ch=%"PRIu32"\n", r->cons.head.val.pos);
fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-   fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+   fprintf(f, "  ph=%"PRIu32"\n", r->prod.head.val.pos);
fprintf(f, "  used=%u\n", rte_ring_count(r));
fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/ring/rte_ring_core.h b/lib/ring/rte_ring_core.h
index 270869d214..b88a1bc352 100644
--- a/lib/ring/rte_ring_core.h
+++ b/lib/ring/rte_ring_core.h
@@ -66,8 +66,17 @@ enum rte_ring_sync_type {
  * Depending on sync_type format of that structure might be different,
  * but offset for *sync_type* and *tail* values should remain the same.
  */
+union __rte_ring_head_cft {
+   /** raw 8B value to read/write *cnt* and *pos* as one atomic op */
+ 

RE: [PATCH v2 0/2] examples/l3fwd fixes for ACL mode

2024-08-15 Thread Konstantin Ananyev
Sorry, that's a dup, sent by mistake this time.
Please disregard.
Konstantin

> -Original Message-
> From: Konstantin Ananyev 
> Sent: Thursday, August 15, 2024 9:53 AM
> To: dev@dpdk.org
> Cc: honnappa.nagaraha...@arm.com; jer...@marvell.com; hemant.agra...@nxp.com; 
> bruce.richard...@intel.com;
> d...@linux.vnet.ibm.com; ruifeng.w...@arm.com; m...@smartsharesystems.com; 
> Konstantin Ananyev
> 
> Subject: [PATCH v2 0/2] examples/l3fwd fixes for ACL mode
> 
> From: Konstantin Ananyev 
> 
> As Song Jiale pointed out, the previous fix is not enough to fix
> the problem he is observing with l3fwd in ACL mode:
> https://bugs.dpdk.org/show_bug.cgi?id=1502
> This is a second attempt to fix it.
> 
> Konstantin Ananyev (2):
>   examples/l3fwd: fix read beyond array boundaries
>   examples/l3fwd: fix read beyond array boundaries in ACL mode
> 
>  examples/l3fwd/l3fwd_acl.c   | 37 
>  examples/l3fwd/l3fwd_altivec.h   |  6 -
>  examples/l3fwd/l3fwd_common.h|  7 ++
>  examples/l3fwd/l3fwd_em_hlm.h|  2 +-
>  examples/l3fwd/l3fwd_em_sequential.h |  2 +-
>  examples/l3fwd/l3fwd_fib.c   |  2 +-
>  examples/l3fwd/l3fwd_lpm_altivec.h   |  2 +-
>  examples/l3fwd/l3fwd_lpm_neon.h  |  2 +-
>  examples/l3fwd/l3fwd_lpm_sse.h   |  2 +-
>  examples/l3fwd/l3fwd_neon.h  |  6 -
>  examples/l3fwd/l3fwd_sse.h   |  6 -
>  11 files changed, 55 insertions(+), 19 deletions(-)
> 
> --
> 2.35.3



crc stripping for vf on same pf

2024-08-15 Thread Yaron Illouz

I have 2 pods running on the same worker.
Pod1 sends to pod2.

Pod2 receives with 4 bytes missing at the end of the packet.
This problem happens only if the 2 NICs are on the same PF;
if they are on different PFs, the problem doesn't occur.

I tried with DPDK 21 and DPDK 22.
The code is using the net_iavf driver.
NIC: E810-C
driver: ice
firmware-version: 4.00 0x800139bc 21.5.9
Who does the stripping? The DPDK code or the card?
Why is the behavior different for the same PF vs. different PFs?
What should I change or check?


port_conf.rxmode.offloads |= RTE_ETH_RX_OFFLOAD_KEEP_CRC;   //Don't strip CRC
port_conf.rxmode.offloads &= pi_devInfo.rx_offload_capa;
int ret = rte_eth_dev_configure(pi_nPort, nRxQueues, nTxQueues, &port_conf);



struct rte_eth_rxconf rx_conf;
rx_conf.offloads = RTE_ETH_RX_OFFLOAD_KEEP_CRC;

int ret = rte_eth_rx_queue_setup(
pi_nPort,
nQueue,
nRxRingSize,
socket,
performanceMode?NULL:&rx_conf,
pool);



Re: 22.11.6 patches review and test

2024-08-15 Thread Luca Boccassi
On Wed, 31 Jul 2024 at 20:37,  wrote:
>
> Hi all,
>
> Here is a list of patches targeted for stable release 22.11.6.
>
> The planned date for the final release is August 20th.
>
> Please help with testing and validation of your use cases and report
> any issues/results with reply-all to this mail. For the final release
> the fixes and reported validations will be added to the release notes.
>
> A release candidate tarball can be found at:
>
> https://dpdk.org/browse/dpdk-stable/tag/?id=v22.11.6-rc1
>
> These patches are located at branch 22.11 of dpdk-stable repo:
> https://dpdk.org/browse/dpdk-stable/
>
> Thanks.
>
> Luca Boccassi

Hi Ali,

As the deadline is approaching, I wanted to double check whether
NVIDIA is planning to run regression tests for 22.11.6? If you need
more time it's fine to extend the deadline, but if you do not have the
bandwidth for this cycle that's ok too, just let me know and I'll go
ahead with the release without waiting. Thanks!


RE: [RFC 3/6] ring/soring: introduce Staged Ordered Ring

2024-08-15 Thread Morten Brørup
> From: Konstantin Ananyev 
> 
> Staged-Ordered-Ring (SORING) provides a SW abstraction for 'ordered' queues
> with multiple processing 'stages'.
> It is based on conventional DPDK rte_ring, re-uses many of its concepts,
> and even substantial part of its code.
> It can be viewed as an 'extension' of rte_ring functionality.
> In particular, main SORING properties:
> - circular ring buffer with fixed size objects
> - producer, consumer plus multiple processing stages in the middle.
> - allows to split objects processing into multiple stages.
> - objects remain in the same ring while moving from one stage to the other,
>   initial order is preserved, no extra copying needed.
> - preserves the ingress order of objects within the queue across multiple
>   stages, i.e.:
>   at the same stage multiple threads can process objects from the ring in
>   any order, but for the next stage objects will always appear in the
>   original order.
> - each stage (and producer/consumer) can be served by single and/or
>   multiple threads.
> - number of stages, size and number of objects in the ring are
>   configurable at ring initialization time.
> 
> Data-path API provides four main operations:
> - enqueue/dequeue works in the same manner as for conventional rte_ring,
>   all rte_ring synchronization types are supported.
> - acquire/release - for each stage there is an acquire (start) and
>   release (finish) operation.
>   after some objects are 'acquired' - given thread can safely assume that
>   it has exclusive possession of these objects till 'release' for them is
>   invoked.
>   Note that right now user has to release exactly the same number of
>   objects that was acquired before.
>   After 'release', objects can be 'acquired' by next stage and/or dequeued
>   by the consumer (in case of last stage).
> 
> Expected use-case: applications that uses pipeline model
> (probably with multiple stages) for packet processing, when preserving
> incoming packet order is important. I.E.: IPsec processing, etc.
> 
> Signed-off-by: Konstantin Ananyev 
> ---

The existing RING library is for a ring of objects.

It is very confusing that the new SORING library is for a ring of object pairs 
(obj, objst).

The new SORING library should be for a ring of objects, like the existing RING 
library. Please get rid of all the objst stuff.

This might also improve performance when not using the optional secondary 
object.


With that in place, you can extend the SORING library with additional APIs for 
object pairs.

I suggest calling the secondary object "metadata" instead of "status" or 
"state" or "ret-value".

I agree that data passed as {obj[num], meta[num]} is more efficient than {obj, 
meta}[num] in some use cases, which is why your API uses two vector pointers 
instead of one.


Furthermore, you should consider semi-zero-copy APIs for the 
"acquire"/"release" functions:

The "acquire" function can use a concept similar to rte_pktmbuf_read(), where a 
vector is provided for copying (if the ring wraps), and the return value either 
points directly to the objects in the ring (zero-copy), or to the vector where 
the objects were copied to.

And the "release" function does not need to copy the object vector back if the 
"acquire" function returned a zero-copy pointer.



[PATCH] net/af_packet: add explicit flush for Tx

2024-08-15 Thread vignesh.purushotham.srinivas
From: Vignesh PS 

The af_packet PMD uses system calls to transmit packets. Separate the
transmit function into two different calls so it's possible to avoid
syscalls during transmit.
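A usage sketch of the resulting split (port/queue ids are illustrative; both
calls are existing ethdev API):

#include <rte_ethdev.h>

#define PORT_ID 0
#define TXQ_ID  0

/* worker lcore: with explicit_flush=1 this only buffers mbufs, no syscall */
static uint16_t
worker_tx(struct rte_mbuf **pkts, uint16_t n)
{
	return rte_eth_tx_burst(PORT_ID, TXQ_ID, pkts, n);
}

/* control-plane lcore: triggers the actual transmit syscall */
static void
control_flush(void)
{
	/* free_cnt (last arg) is ignored: the PMD flushes all buffered packets */
	rte_eth_tx_done_cleanup(PORT_ID, TXQ_ID, 0);
}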

Signed-off-by: Vignesh PS 
---
 .mailmap  |  1 +
 doc/guides/nics/af_packet.rst | 26 ++-
 drivers/net/af_packet/rte_eth_af_packet.c | 90 ++-
 3 files changed, 110 insertions(+), 7 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4a508bafad..5e9462b7cd 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1548,6 +1548,7 @@ Viacheslav Ovsiienko  

 Victor Kaplansky 
 Victor Raj 
 Vidya Sagar Velumuri 
+Vignesh PS 
 Vignesh Sridhar 
 Vijayakumar Muthuvel Manickam 
 Vijaya Mohan Guvva 
diff --git a/doc/guides/nics/af_packet.rst b/doc/guides/nics/af_packet.rst
index 66b977e1a2..fe92ef231f 100644
--- a/doc/guides/nics/af_packet.rst
+++ b/doc/guides/nics/af_packet.rst
@@ -29,6 +29,7 @@ Some of these, in turn, will be used to configure the PACKET_MMAP settings.
 *   ``framesz`` - PACKET_MMAP frame size (optional, default 2048B; Note: multiple
     of 16B);
 *   ``framecnt`` - PACKET_MMAP frame count (optional, default 512).
+*   ``explicit_flush`` - enable two stage packet transmit.
 
 Because this implementation is based on PACKET_MMAP, and PACKET_MMAP has its
 own pre-requisites, it should be noted that the inner workings of PACKET_MMAP
@@ -39,6 +40,9 @@ As an example, if one changes ``framesz`` to be 1024B, it is expected that
 ``blocksz`` is set to at least 1024B as well (although 2048B in this case would
 allow two "frames" per "block").
 
+When ``explicit_flush`` is enabled, the PMD will temporarily buffer mbufs in a
+ring buffer until ``rte_eth_tx_done_cleanup`` is called on the TX queue.
+
 This restriction happens because PACKET_MMAP expects each single "frame" to fit
 inside of a "block". And although multiple "frames" can fit inside of a single
 "block", a "frame" may not span across two "blocks".
@@ -64,11 +68,25 @@ framecnt=512):
 
 .. code-block:: console
 
-    --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0
+    --vdev=eth_af_packet0,iface=tap0,blocksz=4096,framesz=2048,framecnt=512,qpairs=1,qdisc_bypass=0,explicit_flush=1
 
 Features and Limitations
 ------------------------
 
-The PMD will re-insert the VLAN tag transparently to the packet if the kernel
-strips it, as long as the ``RTE_ETH_RX_OFFLOAD_VLAN_STRIP`` is not enabled by the
-application.
+* The PMD will re-insert the VLAN tag transparently to the packet if the kernel
+  strips it, as long as the ``RTE_ETH_RX_OFFLOAD_VLAN_STRIP`` is not enabled by the
+  application.
+* The PMD relies on the sendto() system call to transmit packets from the PACKET_MMAP socket.
+  This system call can cause head-of-line blocking. Hence, it is advantageous to buffer the
+  packets in the driver instead of immediately triggering packet transmits on calling
+  ``rte_eth_tx_burst()``. Therefore, the PMD splits the functionality of ``rte_eth_tx_burst()``
+  into two functional stages, where ``rte_eth_tx_burst()`` causes packets to be buffered
+  in the driver, and a subsequent call to ``rte_eth_tx_done_cleanup()`` triggers the actual
+  packet transmits. With such a disaggregated PMD design, it is possible to call
+  ``rte_eth_tx_burst()`` on workers and trigger transmits (by calling
+  ``rte_eth_tx_done_cleanup()``) from a control plane worker and eliminate
+  head-of-line blocking.
+* To enable the two-stage packet transmit, the PMD should be started with explicit_flush=1
+  (default explicit_flush=0).
+* When calling ``rte_eth_tx_done_cleanup()``, the free_cnt parameter has no effect on how
+  many packets are flushed. The PMD will flush all the packets present in the buffer.
diff --git a/drivers/net/af_packet/rte_eth_af_packet.c b/drivers/net/af_packet/rte_eth_af_packet.c
index 6b7b16f348..cdbe43313a 100644
--- a/drivers/net/af_packet/rte_eth_af_packet.c
+++ b/drivers/net/af_packet/rte_eth_af_packet.c
@@ -36,9 +36,11 @@
 #define ETH_AF_PACKET_FRAMESIZE_ARG    "framesz"
 #define ETH_AF_PACKET_FRAMECOUNT_ARG   "framecnt"
 #define ETH_AF_PACKET_QDISC_BYPASS_ARG "qdisc_bypass"
+#define ETH_AF_PACKET_EXPLICIT_FLUSH_ARG   "explicit_flush"
 
 #define DFLT_FRAME_SIZE    (1 << 11)
 #define DFLT_FRAME_COUNT   (1 << 9)
+#define DFLT_FRAME_BURST   (32)
 
 struct __rte_cache_aligned pkt_rx_queue {
int sockfd;
@@ -62,8 +64,10 @@ struct __rte_cache_aligned pkt_tx_queue {
 
struct iovec *rd;
uint8_t *map;
+   struct rte_ring *buf;
unsigned int framecount;
unsigned int framenum;
+   unsigned int explicit_flush;
 
volatile unsigned long tx_pkts;
volatile unsigned long err_pkts;
@@ -91,6 +95,7 @@ static const char *valid_arguments[] = {
ETH_AF_PACKET_FRAMESIZE_ARG,
ETH_AF_PACKET_FRAMECOUNT_ARG,
ETH_AF_PACKET_QDISC_BYPASS_ARG,
+   ETH_AF_PACKET_EXPLICIT

[PATCH] net/bonding: add user callback for bond xmit policy

2024-08-15 Thread vignesh.purushotham.srinivas
From: Vignesh PS 

Add support to the bonding PMD to allow user callback
function registration for the TX transmit policy.
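A usage sketch (the callback body is illustrative; the registration and
policy-set calls are the ones added/used by this patch):

#include <rte_common.h>
#include <rte_eth_bond.h>

/* user-defined xmit hash: pick a slave port for every packet */
static void
my_xmit_hash(struct rte_mbuf **buf, uint16_t nb_pkts,
		uint16_t slave_count, uint16_t *slaves)
{
	uint16_t i;

	RTE_SET_USED(buf);
	for (i = 0; i < nb_pkts; i++)
		slaves[i] = i % slave_count; /* trivial spray, for illustration */
}

static int
setup_user_policy(uint16_t bond_port)
{
	/* register first: the policy set fails if no callback is registered */
	if (rte_eth_bond_xmit_policy_cb_register(my_xmit_hash) != 0)
		return -1;

	return rte_eth_bond_xmit_policy_set(bond_port, BALANCE_XMIT_POLICY_USER);
}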

Signed-off-by: Vignesh PS 
---
 .mailmap|  1 +
 drivers/net/bonding/eth_bond_private.h  |  6 ++
 drivers/net/bonding/rte_eth_bond.h  | 17 +
 drivers/net/bonding/rte_eth_bond_api.c  | 15 +++
 drivers/net/bonding/rte_eth_bond_args.c |  2 ++
 drivers/net/bonding/rte_eth_bond_pmd.c  |  2 +-
 drivers/net/bonding/version.map |  1 +
 7 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/.mailmap b/.mailmap
index 4a508bafad..69b229a5b7 100644
--- a/.mailmap
+++ b/.mailmap
@@ -1548,6 +1548,7 @@ Viacheslav Ovsiienko  

 Victor Kaplansky 
 Victor Raj 
 Vidya Sagar Velumuri 
+Vignesh PS  

 Vignesh Sridhar 
 Vijayakumar Muthuvel Manickam 
 Vijaya Mohan Guvva 
diff --git a/drivers/net/bonding/eth_bond_private.h b/drivers/net/bonding/eth_bond_private.h
index e688894210..4141b6e09f 100644
--- a/drivers/net/bonding/eth_bond_private.h
+++ b/drivers/net/bonding/eth_bond_private.h
@@ -32,6 +32,7 @@
 #define PMD_BOND_XMIT_POLICY_LAYER2_KVARG  ("l2")
 #define PMD_BOND_XMIT_POLICY_LAYER23_KVARG ("l23")
 #define PMD_BOND_XMIT_POLICY_LAYER34_KVARG ("l34")
+#define PMD_BOND_XMIT_POLICY_USER_KVARG("user")
 
 extern int bond_logtype;
 
@@ -101,9 +102,6 @@ struct rte_flow {
uint8_t rule_data[];
 };
 
-typedef void (*burst_xmit_hash_t)(struct rte_mbuf **buf, uint16_t nb_pkts,
-   uint16_t member_count, uint16_t *members);
-
 /** Link Bonding PMD device private configuration Structure */
 struct bond_dev_private {
uint16_t port_id;   /**< Port Id of Bonding Port */
@@ -118,7 +116,7 @@ struct bond_dev_private {
/**< Flag for whether primary port is user defined or not */
 
uint8_t balance_xmit_policy;
-   /**< Transmit policy - l2 / l23 / l34 for operation in balance mode */
+   /**< Transmit policy - l2 / l23 / l34 / user for operation in balance mode */
burst_xmit_hash_t burst_xmit_hash;
/**< Transmit policy hash function */
 
diff --git a/drivers/net/bonding/rte_eth_bond.h b/drivers/net/bonding/rte_eth_bond.h
index f10165f2c6..66bc41097a 100644
--- a/drivers/net/bonding/rte_eth_bond.h
+++ b/drivers/net/bonding/rte_eth_bond.h
@@ -91,6 +91,11 @@ extern "C" {
 /**< Layer 2+3 (Ethernet MAC + IP Addresses) transmit load balancing */
 #define BALANCE_XMIT_POLICY_LAYER34(2)
 /**< Layer 3+4 (IP Addresses + UDP Ports) transmit load balancing */
+#define BALANCE_XMIT_POLICY_USER   (3)
+/**< User callback function to transmit load balancing */
+
+typedef void (*burst_xmit_hash_t)(struct rte_mbuf **buf, uint16_t nb_pkts,
+   uint16_t slave_count, uint16_t *slaves);
 
 /**
  * Create a bonding rte_eth_dev device
@@ -351,6 +356,18 @@ rte_eth_bond_link_up_prop_delay_set(uint16_t bonding_port_id,
 int
 rte_eth_bond_link_up_prop_delay_get(uint16_t bonding_port_id);
 
+/**
+ * Register transmit callback function for bonded device to use when it is operating in
+ * balance mode. The callback is ignored in other modes of operation.
+ *
+ * @param cb_fn   User defined callback function to determine the xmit slave
+ *
+ * @return
+ * 0 on success, negative value otherwise.
+ */
+__rte_experimental
+int
+rte_eth_bond_xmit_policy_cb_register(burst_xmit_hash_t cb_fn);
 
 #ifdef __cplusplus
 }
diff --git a/drivers/net/bonding/rte_eth_bond_api.c b/drivers/net/bonding/rte_eth_bond_api.c
index 99e496556a..b53038eeda 100644
--- a/drivers/net/bonding/rte_eth_bond_api.c
+++ b/drivers/net/bonding/rte_eth_bond_api.c
@@ -15,6 +15,8 @@
 #include "eth_bond_private.h"
 #include "eth_bond_8023ad_private.h"
 
+static burst_xmit_hash_t burst_xmit_user_hash;
+
 int
 check_for_bonding_ethdev(const struct rte_eth_dev *eth_dev)
 {
@@ -972,6 +974,13 @@ rte_eth_bond_mac_address_reset(uint16_t bonding_port_id)
return 0;
 }
 
+int
+rte_eth_bond_xmit_policy_cb_register(burst_xmit_hash_t cb_fn)
+{
+   burst_xmit_user_hash = cb_fn;
+   return 0;
+}
+
 int
 rte_eth_bond_xmit_policy_set(uint16_t bonding_port_id, uint8_t policy)
 {
@@ -995,6 +1004,12 @@ rte_eth_bond_xmit_policy_set(uint16_t bonding_port_id, uint8_t policy)
internals->balance_xmit_policy = policy;
internals->burst_xmit_hash = burst_xmit_l34_hash;
break;
+   case BALANCE_XMIT_POLICY_USER:
+   if (burst_xmit_user_hash == NULL)
+   return -1;
+   internals->balance_xmit_policy = policy;
+   internals->burst_xmit_hash = burst_xmit_user_hash;
+   break;
 
default:
return -1;
diff --git a/drivers/net/bonding/rte_eth_bond_args.c b/drivers/net/bonding/rte_eth_bond_args.c
index bdec5d61d4..eaa313bf73 100644
--- a/drivers/net/bonding/rte_eth_bond_args.c
+++ b/drivers/net/bonding/rte_eth_bond_args.c
@@ -261,6 +261,8 @@ bond

RE: [RFC 3/6] ring/soring: introduce Staged Ordered Ring

2024-08-15 Thread Konstantin Ananyev



> > From: Konstantin Ananyev 
> >
> > Staged-Ordered-Ring (SORING) provides a SW abstraction for 'ordered' queues
> > with multiple processing 'stages'.
> > It is based on conventional DPDK rte_ring, re-uses many of its concepts,
> > and even substantial part of its code.
> > It can be viewed as an 'extension' of rte_ring functionality.
> > In particular, main SORING properties:
> > - circular ring buffer with fixed size objects
> > - producer, consumer plus multiple processing stages in the middle.
> > - allows to split objects processing into multiple stages.
> > - objects remain in the same ring while moving from one stage to the other,
> >   initial order is preserved, no extra copying needed.
> > - preserves the ingress order of objects within the queue across multiple
> >   stages, i.e.:
> >   at the same stage multiple threads can process objects from the ring in
> >   any order, but for the next stage objects will always appear in the
> >   original order.
> > - each stage (and producer/consumer) can be served by single and/or
> >   multiple threads.
> > - number of stages, size and number of objects in the ring are
> >   configurable at ring initialization time.
> >
> > Data-path API provides four main operations:
> > - enqueue/dequeue works in the same manner as for conventional rte_ring,
> >   all rte_ring synchronization types are supported.
> > - acquire/release - for each stage there is an acquire (start) and
> >   release (finish) operation.
> >   after some objects are 'acquired' - given thread can safely assume that
> >   it has exclusive possession of these objects till 'release' for them is
> >   invoked.
> >   Note that right now user has to release exactly the same number of
> >   objects that was acquired before.
> >   After 'release', objects can be 'acquired' by next stage and/or dequeued
> >   by the consumer (in case of last stage).
> >
> > Expected use-case: applications that uses pipeline model
> > (probably with multiple stages) for packet processing, when preserving
> > incoming packet order is important. I.E.: IPsec processing, etc.
> >
> > Signed-off-by: Konstantin Ananyev 
> > ---
> 
> The existing RING library is for a ring of objects.
> 
> It is very confusing that the new SORING library is for a ring of object 
> pairs (obj, objst).
> 
> The new SORING library should be for a ring of objects, like the existing 
> RING library. Please get rid of all the objst stuff.
> 
> This might also improve performance when not using the optional secondary 
> object.
> 
> 
> With that in place, you can extend the SORING library with additional APIs 
> for object pairs.
> 
> I suggest calling the secondary object "metadata" instead of "status" or 
> "state" or "ret-value".
> I agree that data passed as {obj[num], meta[num]} is more efficient than 
> {obj, meta}[num] in some use cases, which is why your API
> uses two vector pointers instead of one.

I suppose what you suggest is to have 2 sets of functions: one that takes both 
objs[] and meta[], and a second that takes just objs[]?
If so, yes, I can do that - in fact I was thinking about the same thing.
BTW, right now meta[] is optional anyway.
I will also probably get rid of the explicit 'behavior' argument and have 
'_burst_' and '_bulk_' versions instead, same as rte_ring.

> 
> Furthermore, you should consider semi-zero-copy APIs for the 
> "acquire"/"release" functions:
> 
> The "acquire" function can use a concept similar to rte_pktmbuf_read(), where 
> a vector is provided for copying (if the ring wraps), and
> the return value either points directly to the objects in the ring 
> (zero-copy), or to the vector where the objects were copied to.

You mean to introduce an analog of the rte_ring '_zc_' functions?
Yes, I considered that, but decided to leave it for the future.
First, because we do need a generic and simple function that copies things 
anyway.
Second, I am not convinced that _zc_ will give much performance gain, while it 
definitely makes the API less straightforward.

> And the "release" function does not need to copy the object vector back if 
> the "acquire" function returned a zero-copy pointer.

For "release" you don't need to *always* copy objs[] and meta[].
It is optional and is left for the user to decide based on the use-case.
If he doesn't need to update objs[] or meta[] he can just pass a NULL ptr here.
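
To make the acquire/release flow above concrete, a rough single-stage sketch;
every rte_soring_* name and signature below is an illustrative assumption
drawn from this discussion, not a settled API:

struct rte_mbuf *pkts[32];
uint32_t n, ftoken;

/* producer: same semantics as rte_ring enqueue */
n = rte_soring_enqueue_burst(sor, (void **)pkts, RTE_DIM(pkts), NULL);

/* stage 0 worker: acquire exclusive possession of a run of objects */
n = rte_soring_acquire_burst(sor, (void **)pkts, 0, RTE_DIM(pkts),
		&ftoken, NULL);
if (n != 0) {
	/* process pkts[0..n-1] in any order */
	rte_soring_release(sor, NULL, 0, n, ftoken);
}

/* consumer: objects come out in their original ingress order */
n = rte_soring_dequeue_burst(sor, (void **)pkts, RTE_DIM(pkts), NULL);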

 



RE: [RFC 3/6] ring/soring: introduce Staged Ordered Ring

2024-08-15 Thread Morten Brørup
> From: Konstantin Ananyev [mailto:konstantin.anan...@huawei.com]
> 
> > > From: Konstantin Ananyev 
> > >
> > > Staged-Ordered-Ring (SORING) provides a SW abstraction for 'ordered'
> queues
> > > with multiple processing 'stages'.
> > > It is based on conventional DPDK rte_ring, re-uses many of its concepts,
> > > and even substantial part of its code.
> > > It can be viewed as an 'extension' of rte_ring functionality.
> > > In particular, main SORING properties:
> > > - circular ring buffer with fixed size objects
> > > - producer, consumer plus multiple processing stages in the middle.
> > > - allows to split objects processing into multiple stages.
> > > - objects remain in the same ring while moving from one stage to the
> other,
> > >   initial order is preserved, no extra copying needed.
> > > - preserves the ingress order of objects within the queue across multiple
> > >   stages, i.e.:
> > >   at the same stage multiple threads can process objects from the ring in
> > >   any order, but for the next stage objects will always appear in the
> > >   original order.
> > > - each stage (and producer/consumer) can be served by single and/or
> > >   multiple threads.
> > > - number of stages, size and number of objects in the ring are
> > >   configurable at ring initialization time.
> > >
> > > Data-path API provides four main operations:
> > > - enqueue/dequeue works in the same manner as for conventional rte_ring,
> > >   all rte_ring synchronization types are supported.
> > > - acquire/release - for each stage there is an acquire (start) and
> > >   release (finish) operation.
> > >   after some objects are 'acquired' - given thread can safely assume that
> > >   it has exclusive possession of these objects till 'release' for them is
> > >   invoked.
> > >   Note that right now user has to release exactly the same number of
> > >   objects that was acquired before.
> > >   After 'release', objects can be 'acquired' by next stage and/or dequeued
> > >   by the consumer (in case of last stage).
> > >
> > > Expected use-case: applications that uses pipeline model
> > > (probably with multiple stages) for packet processing, when preserving
> > > incoming packet order is important. I.E.: IPsec processing, etc.
> > >
> > > Signed-off-by: Konstantin Ananyev 
> > > ---
> >
> > The existing RING library is for a ring of objects.
> >
> > It is very confusing that the new SORING library is for a ring of object
> pairs (obj, objst).
> >
> > The new SORING library should be for a ring of objects, like the existing
> RING library. Please get rid of all the objst stuff.
> >
> > This might also improve performance when not using the optional secondary
> object.
> >
> >
> > With that in place, you can extend the SORING library with additional APIs
> for object pairs.
> >
> > I suggest calling the secondary object "metadata" instead of "status" or
> "state" or "ret-value".
> > I agree that data passed as {obj[num], meta[num]} is more efficient than
> {obj, meta}[num] in some use cases, which is why your API
> > uses two vector pointers instead of one.
> 
> I suppose what you suggest is to have 2 sets of functions: one that takes both
> objs[] and meta[], and a second that takes just objs[]?
> If so, yes, I can do that - in fact I was thinking about the same thing.

Yes, please.
Mainly for readability/familiarity; it makes the API much more similar to the 
Ring API.

> BTW, right now meta[] is optional anyway.

I noticed that meta[] is optional, but it is confusing that the APIs are so 
different from the Ring APIs.

With two sets of functions, the basic set will resemble the Ring APIs much more.

> I will also probably get rid of the explicit 'behavior' argument and have
> '_burst_' and '_bulk_' versions instead, same as rte_ring.

+1

> 
> >
> > Furthermore, you should consider semi-zero-copy APIs for the
> "acquire"/"release" functions:
> >
> > The "acquire" function can use a concept similar to rte_pktmbuf_read(),
> where a vector is provided for copying (if the ring wraps), and
> > the return value either points directly to the objects in the ring (zero-
> copy), or to the vector where the objects were copied to.
> 
> You mean to introduce an analog of the rte_ring '_zc_' functions?
> Yes, I considered that, but decided to leave it for the future.

Somewhat similar, but I think the (semi-)zero-copy "acquire"/"release" APIs 
will be simpler than rte_ring's _zc_ functions because we know that no 
other thread can dequeue the objects out of the ring before the processing 
stage has released them, i.e. no additional locking is required.

Anyway, leave it for the future.
I don't think it will require changes to the underlying implementation, so we 
don't need to consider it in advance.

> First, because we do need a generic and simple function that copies things
> anyway.
> Second, I am not convinced that _zc_ will give much performance gain,
> while it definitely makes the API less straightforward.
> 
> > And the "rel

[PATCH] app/testpmd: add L4 port to verbose output

2024-08-15 Thread Alex Chapman
To help distinguish packets, we want to add more identifiable
information and print the L4 port number for all packets.
This will make the packet metadata more uniform, as previously the
port number was printed only for encapsulated packets.

Bugzilla-ID: 1517

Signed-off-by: Alex Chapman 
Reviewed-by: Luca Vizzarro 
Reviewed-by: Paul Szczepanek 
---
 app/test-pmd/util.c | 71 +++--
 1 file changed, 42 insertions(+), 29 deletions(-)

diff --git a/app/test-pmd/util.c b/app/test-pmd/util.c
index bf9b639d95..5fa05fad16 100644
--- a/app/test-pmd/util.c
+++ b/app/test-pmd/util.c
@@ -81,7 +81,6 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
char buf[256];
struct rte_net_hdr_lens hdr_lens;
uint32_t sw_packet_type;
-   uint16_t udp_port;
uint32_t vx_vni;
const char *reason;
int dynf_index;
@@ -234,49 +233,63 @@ dump_pkt_burst(uint16_t port_id, uint16_t queue, struct rte_mbuf *pkts[],
if (sw_packet_type & RTE_PTYPE_INNER_L4_MASK)
MKDUMPSTR(print_buf, buf_size, cur_len,
  " - inner_l4_len=%d", hdr_lens.inner_l4_len);
-   if (is_encapsulation) {
-   struct rte_ipv4_hdr *ipv4_hdr;
-   struct rte_ipv6_hdr *ipv6_hdr;
-   struct rte_udp_hdr *udp_hdr;
-   uint8_t l2_len;
-   uint8_t l3_len;
-   uint8_t l4_len;
-   uint8_t l4_proto;
-   struct  rte_vxlan_hdr *vxlan_hdr;
-
-   l2_len  = sizeof(struct rte_ether_hdr);
-
-   /* Do not support ipv4 option field */
-   if (RTE_ETH_IS_IPV4_HDR(packet_type)) {
-   l3_len = sizeof(struct rte_ipv4_hdr);
-   ipv4_hdr = rte_pktmbuf_mtod_offset(mb,
+
+   struct rte_ipv4_hdr *ipv4_hdr;
+   struct rte_ipv6_hdr *ipv6_hdr;
+   struct rte_udp_hdr *udp_hdr;
+   struct rte_tcp_hdr *tcp_hdr;
+   uint8_t l2_len;
+   uint8_t l3_len;
+   uint8_t l4_len;
+   uint8_t l4_proto;
+   uint16_t l4_port;
+   struct  rte_vxlan_hdr *vxlan_hdr;
+
+   l2_len  = sizeof(struct rte_ether_hdr);
+
+   /* Do not support ipv4 option field */
+   if (RTE_ETH_IS_IPV4_HDR(packet_type)) {
+   l3_len = sizeof(struct rte_ipv4_hdr);
+   ipv4_hdr = rte_pktmbuf_mtod_offset(mb,
struct rte_ipv4_hdr *,
l2_len);
-   l4_proto = ipv4_hdr->next_proto_id;
-   } else {
-   l3_len = sizeof(struct rte_ipv6_hdr);
-   ipv6_hdr = rte_pktmbuf_mtod_offset(mb,
+   l4_proto = ipv4_hdr->next_proto_id;
+   } else {
+   l3_len = sizeof(struct rte_ipv6_hdr);
+   ipv6_hdr = rte_pktmbuf_mtod_offset(mb,
struct rte_ipv6_hdr *,
l2_len);
-   l4_proto = ipv6_hdr->proto;
-   }
-   if (l4_proto == IPPROTO_UDP) {
-   udp_hdr = rte_pktmbuf_mtod_offset(mb,
+   l4_proto = ipv6_hdr->proto;
+   }
+   if (l4_proto == IPPROTO_UDP) {
+   udp_hdr = rte_pktmbuf_mtod_offset(mb,
struct rte_udp_hdr *,
l2_len + l3_len);
+   l4_port = RTE_BE_TO_CPU_16(udp_hdr->dst_port);
+   if (is_encapsulation) {
l4_len = sizeof(struct rte_udp_hdr);
vxlan_hdr = rte_pktmbuf_mtod_offset(mb,
-   struct rte_vxlan_hdr *,
-   l2_len + l3_len + l4_len);
-   udp_port = RTE_BE_TO_CPU_16(udp_hdr->dst_port);
+   struct rte_vxlan_hdr *,
+   l2_len + l3_len + l4_len);
vx_vni = rte_be_to_cpu_32(vxlan_hdr->vx_vni);
MKDUMPSTR(print_buf, buf_size, cur_len,
  " - VXLAN packet: packet type =%d, "
  "Destination UDP port =%d, VNI = %d, "
  "last_rsvd = %d", packet_type,
- udp_port, vx_vni >> 8, vx_vni & 0xff);
+ l4_port, vx_vni >> 8, vx_vni & 0xff);
+   } else {
+   MKDUMPSTR(p

Re: [PATCH] app/testpmd: add L4 port to verbose output

2024-08-15 Thread Stephen Hemminger
On Thu, 15 Aug 2024 15:20:51 +0100
Alex Chapman  wrote:

> To help distinguish packets we want to add more identifiable
> information and print port number for all packets.
> This will make packet metadata more uniform as previously it
> only printed port number for encapsulated packets.
> 
> Bugzilla-ID: 1517
> 
> Signed-off-by: Alex Chapman 
> Reviewed-by: Luca Vizzarro 
> Reviewed-by: Paul Szczepanek 

The verbose output is already too verbose.
Maybe you would like the simpler format (which does include the port number);
see the network packet dissector patches.



Re: [PATCH] net/bonding: add user callback for bond xmit policy

2024-08-15 Thread Patrick Robb
Recheck-request: iol-marvell-Functional

Putting in a retest for this.


RE: 22.11.6 patches review and test

2024-08-15 Thread Ali Alnubani
> -Original Message-
> From: Luca Boccassi 
> Sent: Thursday, August 15, 2024 2:11 PM
> To: sta...@dpdk.org
> Cc: dev@dpdk.org; Ali Alnubani ; John McNamara
> ; Raslan Darawsheh ; NBU-
> Contact-Thomas Monjalon (EXTERNAL) 
> Subject: Re: 22.11.6 patches review and test
> 
> On Wed, 31 Jul 2024 at 20:37,  wrote:
> >
> > Hi all,
> >
> > Here is a list of patches targeted for stable release 22.11.6.
> >
> > The planned date for the final release is August 20th.
> >
> > Please help with testing and validation of your use cases and report
> > any issues/results with reply-all to this mail. For the final release
> > the fixes and reported validations will be added to the release notes.
> >
> > A release candidate tarball can be found at:
> >
> > https://dpdk.org/browse/dpdk-stable/tag/?id=v22.11.6-rc1
> >
> > These patches are located at branch 22.11 of dpdk-stable repo:
> > https://dpdk.org/browse/dpdk-stable/
> >
> > Thanks.
> >
> > Luca Boccassi
> 
> Hi Ali,
> 
> As the deadline is approaching, I wanted to double check whether
> NVIDIA is planning to run regression tests for 22.11.6? If you need
> more time it's fine to extend the deadline, but if you do not have the
> bandwidth for this cycle that's ok too, just let me know and I'll go
> ahead with the release without waiting. Thanks!

Hi Luca,

We will report our results hopefully by Monday. Apologies for the delay.

Thanks,
Ali


Re: 22.11.6 patches review and test

2024-08-15 Thread Luca Boccassi
On Thu, 15 Aug 2024 at 17:19, Ali Alnubani  wrote:
>
> > -Original Message-
> > From: Luca Boccassi 
> > Sent: Thursday, August 15, 2024 2:11 PM
> > To: sta...@dpdk.org
> > Cc: dev@dpdk.org; Ali Alnubani ; John McNamara
> > ; Raslan Darawsheh ; NBU-
> > Contact-Thomas Monjalon (EXTERNAL) 
> > Subject: Re: 22.11.6 patches review and test
> >
> > On Wed, 31 Jul 2024 at 20:37,  wrote:
> > >
> > > Hi all,
> > >
> > > Here is a list of patches targeted for stable release 22.11.6.
> > >
> > > The planned date for the final release is August 20th.
> > >
> > > Please help with testing and validation of your use cases and report
> > > any issues/results with reply-all to this mail. For the final release
> > > the fixes and reported validations will be added to the release notes.
> > >
> > > A release candidate tarball can be found at:
> > >
> > > https://dpdk.org/browse/dpdk-stable/tag/?id=v22.11.6-rc1
> > >
> > > These patches are located at branch 22.11 of dpdk-stable repo:
> > > https://dpdk.org/browse/dpdk-stable/
> > >
> > > Thanks.
> > >
> > > Luca Boccassi
> >
> > Hi Ali,
> >
> > As the deadline is approaching, I wanted to double check whether
> > NVIDIA is planning to run regression tests for 22.11.6? If you need
> > more time it's fine to extend the deadline, but if you do not have the
> > bandwidth for this cycle that's ok too, just let me know and I'll go
> > ahead with the release without waiting. Thanks!
>
> Hi Luca,
>
> We will report our results hopefully by Monday. Apologies for the delay.
>
> Thanks,
> Ali

No problem at all, just wanted to check. Thank you for the update.


Re: [dpdk-dev] [PATCH v3 5/5] devtools: test different build types

2024-08-15 Thread Stephen Hemminger
On Sun,  8 Aug 2021 14:51:38 +0200
Thomas Monjalon  wrote:

> All builds were of type debugoptimized.
> It is kept only for builds having an ABI check.
> Others will have the default build type (release),
> except if specified differently as in the x86 generic build
> which will be a test of the non-optimized debug build type.
> Some static builds will test the minsize build type.
> 
> Signed-off-by: Thomas Monjalon 
> Acked-by: Andrew Rybchenko 
> 
> ---
> 
> This patch cannot be merged now because it makes clang 11.1.0 crash.
> ---

Dropping this patch from patchwork because of the clang crash.


DTS WG Meeting Minutes - August 15, 2024

2024-08-15 Thread Patrick Robb
#
August 15, 2024
Attendees
* Patrick Robb
* Jeremy Spewock
* Alex Chapman
* Juraj Linkeš
* Tomas Durovec
* Dean Marx
* Luca Vizzarro
* Paul Szczepanek
* Nicholas Pratte

#
Minutes

=
General Discussion
* DTS Roadmap: 
https://docs.google.com/document/d/1Rcp1-gZWzGGCCSkbEsigrd0-NoQmknv6ZS7V2CPdgFo/edit
   * Will email out after this meeting
* Speakers are all signed up for the CI and DTS talks at DPDK Summit

=
Patch discussions
* Testpmd shell method names: should they align with existing testpmd
runtime commands? I.e. should the “flow create” runtime command be
implemented via a method named flow_create_*() or the more intuitive
English word order create_flow_*()?
   * One option is to implement both, and have one method call the other
  * This potentially creates confusion as people read different
testsuites and see different functions used, not realizing they may be
the same
   * The group agrees it is best to name methods in a human-readable,
intuitive way, e.g. create_flow_*() from the example above
* Testpmd verbose parser
   * If we read the port from testpmd to identify packets, they must have
a TCP/UDP layer, which may be limiting. If, for whatever reason,
packets for a testsuite cannot be built with an L4 layer, individual
testsuites may have to match on source MAC address, checksum, etc.
   * In almost all cases, packets can be built with an L4 layer
* Checksum offload suite is submitted
   * Depends on the existing testpmd verbose parser
   * RX-side testcases work fine, but TX-side behavior does not align
with what is described in the testsuite, so feedback on this is
appreciated
   * Checksum offload command
  * csum set {layer name} hw {port number}
  * Returns that SCTP offload is not supported
  * TCP/UDP packets are working
* Port assignment:
   * Physical ports are defined in the nodes conf section, then port
IDs are referred to in the testrun config
   * Also includes splitting the nodes and testrun configs into different files
  * Discussion on a ticket regarding having a conf directory to contain these
   * Still some work to be done removing unneeded configuration from conf.yaml
* VXLAN-GPE testsuite is now canceled as the feature is removed as of DPDK 24.07
* API Docs
   * Juraj needs reviews and testing
   * UNH people please rebuild the docs and provide your experience
  * Should specifically test meson install
   * Aim is to make it simple to use (and it is)
   * It builds with DPDK docs
* L2fwd
   * Jeremy provided a review; more people at UNH please run this and
provide feedback
   * When reviewing, people should also review the dependency - the "add
pktgen and testpmd change" series
* Tomas and Juraj have begun work on producing the testrun results json

=
Bugzilla discussions
* None

=
Any other business
* Next meeting Aug 29, 2024


Ethdev tracepoints optimization

2024-08-15 Thread Adel Belkhiri
Hi DPDK Community,

I am currently working on developing performance analyses for applications
using the ethdev library. These analyses are being implemented in Trace
Compass, an open-source performance analyzer. One of the views I’ve
implemented shows the rate of traffic received or sent by an Ethernet port,
measured in packets per second. However, I've encountered an issue with the
lib.ethdev.rx.burst event, which triggers even when no packets are polled,
leading to a significant number of irrelevant events in the trace. This
becomes problematic as these "empty" events can overwhelm the tracer
buffer, potentially causing the loss of more critical events due to their
high frequency.

To address this, I've modified the DPDK code in lib/ethdev/rte_ethdev.h to
add a conditional statement that only triggers the event when nb_rx > 0. My
question to the community is whether there are use cases where an "empty"
lib.ethdev.rx.burst event could be useful. If not, would there be interest
in submitting a patch with this modification?
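
For reference, the change amounts to something like the following sketch.
The surrounding code in rte_eth_rx_burst() is abbreviated; the tracepoint
call matches what lib/ethdev/rte_ethdev.h does in recent releases, and the
guard is the addition:

nb_rx = p->rx_pkt_burst(qd, rx_pkts, nb_pkts);
/* ... rx callback processing elided ... */
if (nb_rx > 0)  /* skip tracing empty polls */
	rte_ethdev_trace_rx_burst(port_id, queue_id, (void **)rx_pkts, nb_rx);
return nb_rx;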

Moreover, I am looking to develop an analysis that calculates the
throughput (in kb/s, mb/s, etc.) per NIC, utilizing the same events (i.e.,
lib.ethdev.rx.burst and lib.ethdev.tx.burst). These tracepoints do not
provide packet size directly, only a pointer to the packet array. My
attempt to use an eBPF program to iterate through that array to access the
packet sizes was unsuccessful, as I found no method to export the computed
data (e.g., via a custom tracepoint). Does anyone have suggestions or
alternative approaches for achieving a throughput measurement?
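
As a sketch, deriving byte counts from the mbuf array at the tracepoint site
is a small loop; note that exporting 'bytes' would require a custom or
extended tracepoint, which is an assumption rather than existing DPDK code:

uint32_t i, bytes = 0;

for (i = 0; i < nb_rx; i++)
	bytes += rx_pkts[i]->pkt_len;  /* total bytes in this burst */
/* emit 'bytes' through the custom tracepoint for the analyzer */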

I would be grateful for any insights or suggestions you might have.

Thank you!
Adel


RE: [EXTERNAL] Re: [PATCH v5 1/1] examples/l2fwd-jobstats: fix lock availability

2024-08-15 Thread Rakesh Kudurumalla



> -Original Message-
> From: Stephen Hemminger 
> Sent: Sunday, August 11, 2024 9:47 PM
> To: Rakesh Kudurumalla 
> Cc: ferruh.yi...@amd.com; andrew.rybche...@oktetlabs.ru;
> or...@nvidia.com; tho...@monjalon.net; dev@dpdk.org; Jerin Jacob
> ; Nithin Kumar Dabilpuram
> ; sta...@dpdk.org
> Subject: [EXTERNAL] Re: [PATCH v5 1/1] examples/l2fwd-jobstats: fix lock
> availability
> 
> On Sun, 11 Aug 2024 21:29:57 +0530
> Rakesh Kudurumalla  wrote:
> 
> > Race condition between jobstats and time metrics for forwarding and
> > flushing is maintained using a spinlock.
> > Timer metrics are not displayed properly due to the frequent
> > unavailability of the lock. This patch fixes the issue by introducing a
> > delay before acquiring the lock in the loop. This delay allows for
> > better availability of the lock, ensuring that show_lcore_stats() can
> > periodically update the statistics even when forwarding jobs are
> > running.
> >
> > Fixes: 204896f8d66c ("examples/l2fwd-jobstats: add new example")
> > Cc: sta...@dpdk.org
> >
> > Signed-off-by: Rakesh Kudurumalla 
> 
> Would be better if this code used RCU and not a lock

Currently the jobstats app uses the lock only for collecting a single snapshot 
of different statistics and printing it from the main core. Since with RCU we 
cannot pause the worker core to collect such a single snapshot, integrating RCU 
would need a full redesign of the application and would take a lot of effort.
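
For context, the fix described in the commit message boils down to something
like this sketch; the qconf->lock field name and the 10 us value are
illustrative, not taken verbatim from the patch:

/* worker loop: brief pause so show_lcore_stats() can win the lock */
rte_delay_us(10);
rte_spinlock_lock(&qconf->lock);
/* run forwarding and flush jobs ... */
rte_spinlock_unlock(&qconf->lock);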