Re: Re: Re: [PATCH v2 1/3] ethdev: add API for direct rearm mode

2022-10-26 Thread Feifei Wang


> -----Original Message-----
> From: Konstantin Ananyev 
> Sent: Thursday, October 13, 2022 5:49 PM
> To: Feifei Wang ; tho...@monjalon.net;
> Ferruh Yigit ; Andrew Rybchenko
> ; Ray Kinsella 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> 
> Subject: Re: Re: Re: [PATCH v2 1/3] ethdev: add API for direct rearm mode
> 
> 12/10/2022 14:38, Feifei Wang wrote:
> >
> >
> >> -----Original Message-----
> >> From: Konstantin Ananyev 
> >> Sent: Wednesday, October 12, 2022 6:21 AM
> >> To: Feifei Wang ; tho...@monjalon.net;
> Ferruh
> >> Yigit ; Andrew Rybchenko
> >> ; Ray Kinsella 
> >> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> >> ; Ruifeng Wang
> 
> >> Subject: Re: Re: [PATCH v2 1/3] ethdev: add API for direct rearm mode
> >>
> >>
> >>
> >>>>> Add API for enabling direct rearm mode and for mapping RX and TX
> >>>>> queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >>>>>
> >>>>> Furthermore, to avoid the Rx side loading Tx data directly, add an
> >>>>> API called 'rte_eth_txq_data_get' to get the Tx sw_ring and its information.
> >>>>>
> >>>>> Suggested-by: Honnappa Nagarahalli
> 
> >>>>> Suggested-by: Ruifeng Wang 
> >>>>> Signed-off-by: Feifei Wang 
> >>>>> Reviewed-by: Ruifeng Wang 
> >>>>> Reviewed-by: Honnappa Nagarahalli
> 
> >>>>> ---
> >>>>> lib/ethdev/ethdev_driver.h   |  9 
> >>>>> lib/ethdev/ethdev_private.c  |  1 +
> >>>>> lib/ethdev/rte_ethdev.c  | 37 ++
> >>>>> lib/ethdev/rte_ethdev.h  | 95
> >>>> 
> >>>>> lib/ethdev/rte_ethdev_core.h |  5 ++
> >>>>> lib/ethdev/version.map   |  4 ++
> >>>>> 6 files changed, 151 insertions(+)
> >>>>>
> >>>>> diff --git a/lib/ethdev/ethdev_driver.h
> >>>>> b/lib/ethdev/ethdev_driver.h index 47a55a419e..14f52907c1 100644
> >>>>> --- a/lib/ethdev/ethdev_driver.h
> >>>>> +++ b/lib/ethdev/ethdev_driver.h
> >>>>> @@ -58,6 +58,8 @@ struct rte_eth_dev {
> >>>>> eth_rx_descriptor_status_t rx_descriptor_status;
> >>>>> /** Check the status of a Tx descriptor */
> >>>>> eth_tx_descriptor_status_t tx_descriptor_status;
> >>>>> +   /**  Use Tx mbufs for Rx to rearm */
> >>>>> +   eth_rx_direct_rearm_t rx_direct_rearm;
> >>>>>
> >>>>> /**
> >>>>>  * Device data that is shared between primary and secondary
> >>>>> processes @@ -486,6 +488,11 @@ typedef int
> >>>> (*eth_rx_enable_intr_t)(struct rte_eth_dev *dev,
> >>>>> typedef int (*eth_rx_disable_intr_t)(struct rte_eth_dev *dev,
> >>>>> uint16_t rx_queue_id);
> >>>>>
> >>>>> +/**< @internal Get Tx information of a transmit queue of an
> >>>>> +Ethernet device. */
> >>>>> +typedef void (*eth_txq_data_get_t)(struct rte_eth_dev *dev,
> >>>>> + uint16_t tx_queue_id,
> >>>>> + struct rte_eth_txq_data
> *txq_data);
> >>>>> +
> >>>>> /** @internal Release memory resources allocated by given
> >>>>> Rx/Tx
> >> queue.
> >>>> */
> >>>>> typedef void (*eth_queue_release_t)(struct rte_eth_dev *dev,
> >>>>> uint16_t queue_id);
> >>>>> @@ -1138,6 +1145,8 @@ struct eth_dev_ops {
> >>>>> eth_rxq_info_get_t rxq_info_get;
> >>>>> /** Retrieve Tx queue information */
> >>>>> eth_txq_info_get_t txq_info_get;
> >>>>> +   /** Get the address where Tx data is stored */
> >>>>> +   eth_txq_data_get_t txq_data_get;
> >>>>> eth_burst_mode_get_t   rx_burst_mode_get; /**< Get Rx
> burst
> >>>> mode */
> >>>>> eth_burst_mode_get_t   tx_burst_mode_get; /**< Get Tx
> burst
> >>>> mode */
> >>>>> e

Re: Re: [PATCH v1 3/3] examples/l3fwd-power: enable PMD power mgmt on Arm

2022-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Thomas Monjalon 
> Sent: Friday, October 21, 2022 4:42 AM
> To: David Marchand 
> Cc: Hunt, David ; Ruifeng Wang
> ; dev@dpdk.org; nd ; Feifei
> Wang 
> Subject: Re: Re: [PATCH v1 3/3] examples/l3fwd-power: enable PMD power
> mgmt on Arm
> 
> 11/10/2022 09:56, Feifei Wang:
> > David Marchand 
> > > > On 25/08/2022 07:42, Feifei Wang wrote:
> > > > > --- a/examples/l3fwd-power/main.c
> > > > > +++ b/examples/l3fwd-power/main.c
> > > > > @@ -432,8 +432,16 @@ static void
> > > > >   signal_exit_now(int sigtype)
> > > > >   {
> > > > >
> > > > > - if (sigtype == SIGINT)
> > > > > + if (sigtype == SIGINT) {
> > > > > +#if defined(RTE_ARCH_ARM64)
> > >
> > > Having an arch-specific behavior in the application shows that there
> > > is something wrong either in the API, or in the Arm implementation of
> the API.
> > > I don't think this is a good solution.
> > >
> > > Can't we find a better alternative? By changing the API probably?
> > Sorry, I do not understand 'shows that there is something wrong either in
> the API'.
> 
> David means the application developer should not have to be aware of the
> arch differences.
> When you use an API, you don't check how it is implemented, and you are
> not supposed to use #ifdef.
> The API must be arch-agnostic.

OK, understood. Thanks for the explanation.
> 
> > Here we call the 'rte_power_monitor_wakeup' API because we need to
> > wake up all cores from WFE instructions on Arm, so that l3fwd can
> > exit correctly.
> >
> > This is because the Arm architecture is different from x86: if no
> > packets are received, x86's 'UMONITOR' can automatically exit the
> > energy-saving state after waiting for a period of time.
> > But Arm's 'WFE' cannot exit automatically. It waits for the 'SEV'
> > instruction issued by the wake-up API.
> >
> > Finally, if the user wants to exit l3fwd via 'SIGINT' on Arm, the
> > main core should first call the wake-up API to force the worker
> > cores out of the energy-saving state.
> >
> > Otherwise, the workers will stay in the energy-saving state forever
> > if no packets are received.
> 
> Please find a way to have a common API,
> even if the API implementation is empty in the x86 case.

Yes, I think what we need to do is not to create a new API, but to find
the correct location to call 'rte_power_monitor_wakeup'.

> 
> > >
> > >
> > > > > + /**
> > > > > +  * wake_up api does not need input parameter on Arm,
> > > > > +  * so 0 is meaningless here.
> > > > > +  */
> > > > > + rte_power_monitor_wakeup(0); #endif
> > > > >   quit_signal = true;
> > > > > + }
> 
> 



Re: [PATCH v1 3/3] examples/l3fwd-power: enable PMD power mgmt on Arm

2022-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Stephen Hemminger 
> Sent: Friday, October 21, 2022 6:10 AM
> To: Feifei Wang 
> Cc: David Hunt ; dev@dpdk.org; nd
> ; Ruifeng Wang 
> Subject: Re: [PATCH v1 3/3] examples/l3fwd-power: enable PMD power mgmt
> on Arm
> 
> On Thu, 25 Aug 2022 14:42:51 +0800
> Feifei Wang  wrote:
> 
> > diff --git a/examples/l3fwd-power/main.c b/examples/l3fwd-
> power/main.c
> > index 887c6eae3f..2bd0d700f0 100644
> > --- a/examples/l3fwd-power/main.c
> > +++ b/examples/l3fwd-power/main.c
> > @@ -432,8 +432,16 @@ static void
> >  signal_exit_now(int sigtype)
> >  {
> >
> > -   if (sigtype == SIGINT)
> > +   if (sigtype == SIGINT) {
> > +#if defined(RTE_ARCH_ARM64)
> > +   /**
> > +* wake_up api does not need input parameter on Arm,
> > +* so 0 is meaningless here.
> > +*/
> > +   rte_power_monitor_wakeup(0);
> > +#endif
> > quit_signal = true;
> > +   }
> >
> 
> This method is problematic. There is no guarantee that power monitor is
> async signal safe.

Agreed. We will move 'rte_power_monitor_wakeup' out of 'signal_exit_now'
and call it after the main lcore exits 'main_empty_poll_loop':
-
rte_eal_mp_remote_launch(main_telemetry_loop, NULL, CALL_MAIN);

/* Wake up all worker cores from the low power state. */
rte_power_monitor_wakeup(0);

if (app_mode == APP_MODE_EMPTY_POLL || app_mode == APP_MODE_TELEMETRY)
	launch_timer(rte_lcore_id());

RTE_LCORE_FOREACH_WORKER(lcore_id) {
	if (rte_eal_wait_lcore(lcore_id) < 0)
		return -1;
}
-
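
For reference, a minimal sketch of the async-signal-safe pattern under
discussion: the handler only sets a flag, and the wakeup call runs in
the normal program flow after the main lcore returns. Names follow
l3fwd-power; this is an illustration, not the final patch:
-
static volatile bool quit_signal;

static void
signal_exit_now(int sigtype)
{
	/* Async-signal-safe: only record the request and return. */
	if (sigtype == SIGINT)
		quit_signal = true;
}

/* In main(), after the main lcore returns from its poll loop: */
rte_power_monitor_wakeup(0);	/* wake workers blocked in WFE */
-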




Re: [PATCH v3 0/3] Direct re-arming of buffers on receive side

2023-01-30 Thread Feifei Wang
+Ping Konstantin,

Would you please give some comments on this patch series?
Thanks very much. 

Best Regards
Feifei


Re: Re: [PATCH v3 0/3] Direct re-arming of buffers on receive side

2023-01-31 Thread Feifei Wang
That's all right. Thanks very much for your attention.

> -----Original Message-----
> From: Konstantin Ananyev 
> Sent: Wednesday, February 1, 2023 9:11 AM
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd 
> Subject: Re: Re: [PATCH v3 0/3] Direct re-arming of buffers on receive side
> 
> Hi Feifei,
> 
> > +Ping Konstantin,
> >
> > Would you please give some comments on this patch series?
> > Thanks very much.
> 
> Sure, will have a look in the next few days.
> Apologies for the delay.



RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-01 Thread Feifei Wang



> -Original Message-
> From: Konstantin Ananyev 
> Sent: Friday, September 1, 2023 1:25 AM
> To: Feifei Wang ; Konstantin Ananyev
> 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > >
> > > Define specific function implementation for i40e driver.
> > > Currently, mbufs recycle mode can support 128bit vector path and avx2
> path.
> > > And can be enabled both in fast free and no fast free mode.
> > >
> > > Suggested-by: Honnappa Nagarahalli 
> > > Signed-off-by: Feifei Wang 
> > > Reviewed-by: Ruifeng Wang 
> > > Reviewed-by: Honnappa Nagarahalli 
> > > ---
> > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > ++
> > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > >  drivers/net/i40e/meson.build  |   1 +
> > >  6 files changed, 187 insertions(+)
> > >  create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > >
> > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > b/drivers/net/i40e/i40e_ethdev.c index 8271bbb394..50ba9aac94 100644
> > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops
> = {
> > >   .flow_ops_get = i40e_dev_flow_ops_get,
> > >   .rxq_info_get = i40e_rxq_info_get,
> > >   .txq_info_get = i40e_txq_info_get,
> > > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> > >   .rx_burst_mode_get= i40e_rx_burst_mode_get,
> > >   .tx_burst_mode_get= i40e_tx_burst_mode_get,
> > >   .timesync_enable  = i40e_timesync_enable,
> > > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > > b/drivers/net/i40e/i40e_ethdev.h index 6f65d5e0ac..af758798e1 100644
> > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev
> > > *dev, uint16_t queue_id,
> > >   struct rte_eth_rxq_info *qinfo);
> > >  void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > >   struct rte_eth_txq_info *qinfo);
> > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t
> queue_id,
> > > + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > >  struct rte_eth_burst_mode *mode);  int
> > > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > new file mode 100644
> > > index 00..5663ecccde
> > > --- /dev/null
> > > +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > @@ -0,0 +1,147 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright (c) 2023 Arm Limited.
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > > +
> > > +#include "base/i40e_prototype.h"
> > > +#include "base/i40e_type.h"
> > > +#include "i40e_ethdev.h"
> > > +#include "i40e_rxtx.h"
> > > +
> > > +#pragma GCC diagnostic ignored "-Wcast-qual"
> > > +
> > > +void
> > > +i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t
> > > +nb_mbufs) {
> > > + struct i40e_rx_queue *rxq = rx_queue;
> > > + struct i40e_rx_entry *rxep;
> > > + volatile union i40e_rx_desc *rxdp;
> > > + uint16_t rx_id;
> > > + uint64_t paddr;
> > > + uint64_t dma_addr;
> > > + uint16_t i;
> > > +
> > > + rxdp = rxq->rx_ring + rxq->rxrearm_start;
> > > + rxep = &rxq->sw_ring[rxq->rxrearm_start];
> > > +
> > > + for (i = 0; i < nb_mbufs; i++) {
> > > + /* Initialize rxdp descs. */
> > > + paddr = (rxep[i].mbuf)->buf_iova +
> > > RTE_PKTMBUF_HEADROOM;
> > > + dma_addr = rte_cpu_to_le_64(paddr);
> > > +

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-03 Thread Feifei Wang



> -Original Message-
> From: Konstantin Ananyev 
> Sent: Friday, September 1, 2023 10:23 PM
> To: Feifei Wang ; Konstantin Ananyev
> 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > > > >
> > > > > Define specific function implementation for i40e driver.
> > > > > Currently, mbufs recycle mode can support 128bit vector path and
> > > > > avx2
> > > path.
> > > > > And can be enabled both in fast free and no fast free mode.
> > > > >
> > > > > Suggested-by: Honnappa Nagarahalli
> > > > > 
> > > > > Signed-off-by: Feifei Wang 
> > > > > Reviewed-by: Ruifeng Wang 
> > > > > Reviewed-by: Honnappa Nagarahalli
> 
> > > > > ---
> > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > > > ++
> > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > >  6 files changed, 187 insertions(+)  create mode 100644
> > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > >
> > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > b/drivers/net/i40e/i40e_ethdev.c index 8271bbb394..50ba9aac94
> > > > > 100644
> > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > > i40e_eth_dev_ops
> > > = {
> > > > >   .flow_ops_get = i40e_dev_flow_ops_get,
> > > > >   .rxq_info_get = i40e_rxq_info_get,
> > > > >   .txq_info_get = i40e_txq_info_get,
> > > > > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> > > > >   .rx_burst_mode_get= i40e_rx_burst_mode_get,
> > > > >   .tx_burst_mode_get= i40e_tx_burst_mode_get,
> > > > >   .timesync_enable  = i40e_timesync_enable,
> > > > > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > > > > b/drivers/net/i40e/i40e_ethdev.h index 6f65d5e0ac..af758798e1
> > > > > 100644
> > > > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev
> > > > > *dev, uint16_t queue_id,
> > > > >   struct rte_eth_rxq_info *qinfo);  void
> > > > > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > > >   struct rte_eth_txq_info *qinfo);
> > > > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev,
> > > > > +uint16_t
> > > queue_id,
> > > > > + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > > > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t
> queue_id,
> > > > >  struct rte_eth_burst_mode *mode);  int
> > > > > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t
> > > > > queue_id,
> > > > > diff --git
> > > > > a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > new file mode 100644
> > > > > index 00..5663ecccde
> > > > > --- /dev/null
> > > > > +++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > @@ -0,0 +1,147 @@
> > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > + * Copyright (c) 2023 Arm Limited.
> > > > > + */
> > > > > +
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +#include "base/i40e_prototype.h"
> > > > > +#include "base/i40e_type.h"
> > > > > +#include "i40e_ethdev.h"
> > > > > +#include "i40e_rxtx.h"
> > > > > +
> > &g

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-04 Thread Feifei Wang



> -Original Message-
> From: Konstantin Ananyev 
> Sent: Monday, September 4, 2023 3:50 PM
> To: Feifei Wang ; Konstantin Ananyev
> 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> Hi Feifei,
> 
> > > > > > > Define specific function implementation for i40e driver.
> > > > > > > Currently, mbufs recycle mode can support 128bit vector path
> > > > > > > and
> > > > > > > avx2
> > > > > path.
> > > > > > > And can be enabled both in fast free and no fast free mode.
> > > > > > >
> > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > 
> > > > > > > Signed-off-by: Feifei Wang 
> > > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > 
> > > > > > > ---
> > > > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > > > > > ++
> > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > > > >  6 files changed, 187 insertions(+)  create mode 100644
> > > > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > >
> > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > b/drivers/net/i40e/i40e_ethdev.c index
> > > > > > > 8271bbb394..50ba9aac94
> > > > > > > 100644
> > > > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > > > > i40e_eth_dev_ops
> > > > > = {
> > > > > > >   .flow_ops_get = i40e_dev_flow_ops_get,
> > > > > > >   .rxq_info_get = i40e_rxq_info_get,
> > > > > > >   .txq_info_get = i40e_txq_info_get,
> > > > > > > + .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
> > > > > > >   .rx_burst_mode_get= i40e_rx_burst_mode_get,
> > > > > > >   .tx_burst_mode_get= i40e_tx_burst_mode_get,
> > > > > > >   .timesync_enable  = i40e_timesync_enable,
> > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > b/drivers/net/i40e/i40e_ethdev.h index
> > > > > > > 6f65d5e0ac..af758798e1
> > > > > > > 100644
> > > > > > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > > > > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct
> > > > > > > rte_eth_dev *dev, uint16_t queue_id,
> > > > > > >   struct rte_eth_rxq_info *qinfo);  void
> > > > > > > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > > > > >   struct rte_eth_txq_info *qinfo);
> > > > > > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev,
> > > > > > > +uint16_t
> > > > > queue_id,
> > > > > > > + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > > > > > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev,
> > > > > > > uint16_t
> > > queue_id,
> > > > > > >  struct rte_eth_burst_mode *mode);  int
> > > > > > > i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t
> > > > > > > queue_id,
> > > > > > > diff --git
> > > > > > > a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > new file mode 100644
> > > > > > > index 00..5663ecccde
> > > > > > > --- /dev/null

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-04 Thread Feifei Wang



> -Original Message-
> From: Konstantin Ananyev 
> Sent: Monday, September 4, 2023 6:22 PM
> To: Feifei Wang ; Konstantin Ananyev
> 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > > > > > > > > Define specific function implementation for i40e driver.
> > > > > > > > > Currently, mbufs recycle mode can support 128bit vector
> > > > > > > > > path and
> > > > > > > > > avx2
> > > > > > > path.
> > > > > > > > > And can be enabled both in fast free and no fast free mode.
> > > > > > > > >
> > > > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Feifei Wang 
> > > > > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > 
> > > > > > > > > ---
> > > > > > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > > > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > > > > > > > ++
> > > > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > > > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > > > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > > > > > >  6 files changed, 187 insertions(+)  create mode 100644
> > > > > > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > b/drivers/net/i40e/i40e_ethdev.c index
> > > > > > > > > 8271bbb394..50ba9aac94
> > > > > > > > > 100644
> > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > > > > > > i40e_eth_dev_ops
> > > > > > > = {
> > > > > > > > >   .flow_ops_get = i40e_dev_flow_ops_get,
> > > > > > > > >   .rxq_info_get = i40e_rxq_info_get,
> > > > > > > > >   .txq_info_get = i40e_txq_info_get,
> > > > > > > > > + .recycle_rxq_info_get = 
> > > > > > > > > i40e_recycle_rxq_info_get,
> > > > > > > > >   .rx_burst_mode_get= i40e_rx_burst_mode_get,
> > > > > > > > >   .tx_burst_mode_get= i40e_tx_burst_mode_get,
> > > > > > > > >   .timesync_enable  = i40e_timesync_enable,
> > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > b/drivers/net/i40e/i40e_ethdev.h index
> > > > > > > > > 6f65d5e0ac..af758798e1
> > > > > > > > > 100644
> > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct
> > > > > > > > > rte_eth_dev *dev, uint16_t queue_id,
> > > > > > > > >   struct rte_eth_rxq_info *qinfo);  void
> > > > > > > > > i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
> > > > > > > > >   struct rte_eth_txq_info *qinfo);
> > > > > > > > > +void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev,
> > > > > > > > > +uint16_t
> > > > > > > queue_id,
> > > > > > > > > + struct rte_eth_recycle_rxq_info *recycle_rxq_info);
> > > > > > > > >  int i40e_rx_burst_mode_get(struct rte_eth_dev *dev,
> >

[PATCH v3 0/3] fix test-pipeline issues

2023-09-11 Thread Feifei Wang
The test-pipeline application has some problems that affect the normal
operation of the program, as well as safety issues. These patches fix
those issues.

v3: fix SIGINT handling issue and add dev close operation

Feifei Wang (3):
  app/test-pipeline: relax RSS hash requirement
  app/test-pipeline: fix SIGINT handling issue
  app/test-pipeline: add dev close operation

 app/test-pipeline/init.c  |  22 -
 app/test-pipeline/main.c  |  33 +++
 app/test-pipeline/main.h  |   2 +
 app/test-pipeline/pipeline_acl.c  |   6 +-
 app/test-pipeline/pipeline_hash.c | 110 ++---
 app/test-pipeline/pipeline_lpm.c  |   6 +-
 app/test-pipeline/pipeline_lpm_ipv6.c |   6 +-
 app/test-pipeline/pipeline_stub.c |   6 +-
 app/test-pipeline/runtime.c   | 132 ++
 9 files changed, 198 insertions(+), 125 deletions(-)

-- 
2.25.1



[PATCH v3 1/3] app/test-pipeline: relax RSS hash requirement

2023-09-11 Thread Feifei Wang
For drivers which cannot support the configured RSS hash functions,
the thread reports 'invalid rss_hf' during device configuration.

For example, the i40e driver cannot support 'RTE_ETH_RSS_IPV4',
'RTE_ETH_RSS_IPV6' and 'RTE_ETH_RSS_NONFRAG_IPV6_OTHER', so test-pipeline
cannot run successfully with an XL710 NIC and reports the issue:
-
Ethdev port_id=0 invalid rss_hf: 0xa38c, valid value: 0x7ef8
PANIC in app_init_ports():
Cannot init NIC port 0 (-22)
-

To fix this, follow the l3fwd approach: adjust the 'rss_hf' based on the
device capability and just report a warning.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
Reviewed-by: Trevor Tao 
Acked-by: Huisong Li 
---
 app/test-pipeline/init.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/app/test-pipeline/init.c b/app/test-pipeline/init.c
index d146c44be0..84a1734519 100644
--- a/app/test-pipeline/init.c
+++ b/app/test-pipeline/init.c
@@ -188,21 +188,41 @@ static void
 app_init_ports(void)
 {
uint32_t i;
+   struct rte_eth_dev_info dev_info;
+
 
/* Init NIC ports, then start the ports */
for (i = 0; i < app.n_ports; i++) {
uint16_t port;
int ret;
+   struct rte_eth_conf local_port_conf = port_conf;
 
port = app.ports[i];
RTE_LOG(INFO, USER1, "Initializing NIC port %u ...\n", port);
 
+   ret = rte_eth_dev_info_get(port, &dev_info);
+   if (ret != 0)
+   rte_panic("Error during getting device (port %u) info: %s\n",
+   port, rte_strerror(-ret));
+
/* Init port */
+   local_port_conf.rx_adv_conf.rss_conf.rss_hf &=
+   dev_info.flow_type_rss_offloads;
+   if (local_port_conf.rx_adv_conf.rss_conf.rss_hf !=
+   port_conf.rx_adv_conf.rss_conf.rss_hf) {
+   printf("Warning:"
+   "Port %u modified RSS hash function based on hardware support,"
+   "requested:%#"PRIx64" configured:%#"PRIx64"\n",
+   port,
+   port_conf.rx_adv_conf.rss_conf.rss_hf,
+   local_port_conf.rx_adv_conf.rss_conf.rss_hf);
+   }
+
ret = rte_eth_dev_configure(
port,
1,
1,
-   &port_conf);
+   &local_port_conf);
if (ret < 0)
rte_panic("Cannot init NIC port %u (%d)\n", port, ret);
 
-- 
2.25.1



[PATCH v3 2/3] app/test-pipeline: fix SIGINT handling issue

2023-09-11 Thread Feifei Wang
In test-pipeline, if the main core receives a SIGINT signal, it kills
all the threads immediately without waiting for the other threads to
finish their jobs.

To fix this, add a 'signal_handler' function.

Fixes: 48f31ca50cc4 ("app/pipeline: packet framework benchmark")
Cc: cristian.dumitre...@intel.com
Cc: sta...@dpdk.org

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
Reviewed-by: Matthew Dirba 
---
 app/test-pipeline/main.c  |  14 +++
 app/test-pipeline/main.h  |   2 +
 app/test-pipeline/pipeline_acl.c  |   6 +-
 app/test-pipeline/pipeline_hash.c | 110 ++---
 app/test-pipeline/pipeline_lpm.c  |   6 +-
 app/test-pipeline/pipeline_lpm_ipv6.c |   6 +-
 app/test-pipeline/pipeline_stub.c |   6 +-
 app/test-pipeline/runtime.c   | 132 ++
 8 files changed, 158 insertions(+), 124 deletions(-)

diff --git a/app/test-pipeline/main.c b/app/test-pipeline/main.c
index 1e16794183..8633933fd9 100644
--- a/app/test-pipeline/main.c
+++ b/app/test-pipeline/main.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -41,6 +42,15 @@
 
 #include "main.h"
 
+bool force_quit;
+
+static void
+signal_handler(int signum)
+{
+   if (signum == SIGINT || signum == SIGTERM)
+   force_quit = true;
+}
+
 int
 main(int argc, char **argv)
 {
@@ -54,6 +64,10 @@ main(int argc, char **argv)
argc -= ret;
argv += ret;
 
+   force_quit = false;
+   signal(SIGINT, signal_handler);
+   signal(SIGTERM, signal_handler);
+
/* Parse application arguments (after the EAL ones) */
ret = app_parse_args(argc, argv);
if (ret < 0) {
diff --git a/app/test-pipeline/main.h b/app/test-pipeline/main.h
index 59dcfddbf4..9df157de22 100644
--- a/app/test-pipeline/main.h
+++ b/app/test-pipeline/main.h
@@ -60,6 +60,8 @@ struct app_params {
 
 extern struct app_params app;
 
+extern bool force_quit;
+
 int app_parse_args(int argc, char **argv);
 void app_print_usage(void);
 void app_init(void);
diff --git a/app/test-pipeline/pipeline_acl.c b/app/test-pipeline/pipeline_acl.c
index 2f04868e3e..9eb4053e23 100644
--- a/app/test-pipeline/pipeline_acl.c
+++ b/app/test-pipeline/pipeline_acl.c
@@ -236,14 +236,16 @@ app_main_loop_worker_pipeline_acl(void) {
 
/* Run-time */
 #if APP_FLUSH == 0
-   for ( ; ; )
+   while (!force_quit)
rte_pipeline_run(p);
 #else
-   for (i = 0; ; i++) {
+   i = 0;
+   while (!force_quit) {
rte_pipeline_run(p);
 
if ((i & APP_FLUSH) == 0)
rte_pipeline_flush(p);
+   i++;
}
 #endif
 }
diff --git a/app/test-pipeline/pipeline_hash.c b/app/test-pipeline/pipeline_hash.c
index 2dd8928d43..cab9c20980 100644
--- a/app/test-pipeline/pipeline_hash.c
+++ b/app/test-pipeline/pipeline_hash.c
@@ -366,14 +366,16 @@ app_main_loop_worker_pipeline_hash(void) {
 
/* Run-time */
 #if APP_FLUSH == 0
-   for ( ; ; )
+   while (!force_quit)
rte_pipeline_run(p);
 #else
-   for (i = 0; ; i++) {
+   i = 0;
+   while (!force_quit) {
rte_pipeline_run(p);
 
if ((i & APP_FLUSH) == 0)
rte_pipeline_flush(p);
+   i++;
}
 #endif
 }
@@ -411,59 +413,61 @@ app_main_loop_rx_metadata(void) {
RTE_LOG(INFO, USER1, "Core %u is doing RX (with meta-data)\n",
rte_lcore_id());
 
-   for (i = 0; ; i = ((i + 1) & (app.n_ports - 1))) {
-   uint16_t n_mbufs;
-
-   n_mbufs = rte_eth_rx_burst(
-   app.ports[i],
-   0,
-   app.mbuf_rx.array,
-   app.burst_size_rx_read);
-
-   if (n_mbufs == 0)
-   continue;
-
-   for (j = 0; j < n_mbufs; j++) {
-   struct rte_mbuf *m;
-   uint8_t *m_data, *key;
-   struct rte_ipv4_hdr *ip_hdr;
-   struct rte_ipv6_hdr *ipv6_hdr;
-   uint32_t ip_dst;
-   uint8_t *ipv6_dst;
-   uint32_t *signature, *k32;
-
-   m = app.mbuf_rx.array[j];
-   m_data = rte_pktmbuf_mtod(m, uint8_t *);
-   signature = RTE_MBUF_METADATA_UINT32_PTR(m,
-   APP_METADATA_OFFSET(0));
-   key = RTE_MBUF_METADATA_UINT8_PTR(m,
-   APP_METADATA_OFFSET(32));
-
-   if (RTE_ETH_IS_IPV4_HDR(m->packet_type)) {
-   ip_hdr = (struct rte_ipv4_hdr *)
-   &m_data[sizeof(struct rte_ether_hdr)];
-   ip_dst = ip_hdr-&g

[PATCH v3 3/3] app/test-pipeline: add dev close operation

2023-09-11 Thread Feifei Wang
In test-pipeline, the ports are started but never stopped or closed when
the threads exit. This is not safe; to fix it, add device stop and close
operations.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 app/test-pipeline/main.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/app/test-pipeline/main.c b/app/test-pipeline/main.c
index 8633933fd9..73f6d31f82 100644
--- a/app/test-pipeline/main.c
+++ b/app/test-pipeline/main.c
@@ -55,6 +55,7 @@ int
 main(int argc, char **argv)
 {
uint32_t lcore;
+   uint32_t i;
int ret;
 
/* Init EAL */
@@ -85,6 +86,24 @@ main(int argc, char **argv)
return -1;
}
 
+   /* Close ports */
+   for (i = 0; i < app.n_ports; i++) {
+   uint16_t port;
+   int ret;
+
+   port = app.ports[i];
+   printf("Closing port %d...", port);
+   ret = rte_eth_dev_stop(port);
+   if (ret != 0)
+   printf("rte_eth_dev_stop: err=%d, port=%u\n",
+ret, port);
+   rte_eth_dev_close(port);
+   printf("Done\n");
+   }
+
+   /* Clean up the EAL */
+   rte_eal_cleanup();
+
return 0;
 }
 
-- 
2.25.1



RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-22 Thread Feifei Wang


Hi, Konstantin

> -Original Message-
> From: Feifei Wang
> Sent: Tuesday, September 5, 2023 11:11 AM
> To: Konstantin Ananyev ; Konstantin
> Ananyev 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> ; nd ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > -Original Message-
> > From: Konstantin Ananyev 
> > Sent: Monday, September 4, 2023 6:22 PM
> > To: Feifei Wang ; Konstantin Ananyev
> > 
> > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > ; Ruifeng Wang
> ;
> > Yuying Zhang ; Beilei Xing
> > ; nd ; nd ; nd
> > ; nd 
> > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> >
> >
> >
> > > > > > > > > > Define specific function implementation for i40e driver.
> > > > > > > > > > Currently, mbufs recycle mode can support 128bit
> > > > > > > > > > vector path and
> > > > > > > > > > avx2
> > > > > > > > path.
> > > > > > > > > > And can be enabled both in fast free and no fast free mode.
> > > > > > > > > >
> > > > > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > > > > 
> > > > > > > > > > Signed-off-by: Feifei Wang 
> > > > > > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > 
> > > > > > > > > > ---
> > > > > > > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > > > > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > > > > > > > > ++
> > > > > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > > > > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > > > > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > > > > > > >  6 files changed, 187 insertions(+)  create mode
> > > > > > > > > > 100644
> > > > > > > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > b/drivers/net/i40e/i40e_ethdev.c index
> > > > > > > > > > 8271bbb394..50ba9aac94
> > > > > > > > > > 100644
> > > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > > > > > > > i40e_eth_dev_ops
> > > > > > > > = {
> > > > > > > > > > .flow_ops_get = i40e_dev_flow_ops_get,
> > > > > > > > > > .rxq_info_get = i40e_rxq_info_get,
> > > > > > > > > > .txq_info_get = i40e_txq_info_get,
> > > > > > > > > > +   .recycle_rxq_info_get =
> i40e_recycle_rxq_info_get,
> > > > > > > > > > .rx_burst_mode_get=
> i40e_rx_burst_mode_get,
> > > > > > > > > > .tx_burst_mode_get=
> i40e_tx_burst_mode_get,
> > > > > > > > > > .timesync_enable  = i40e_timesync_enable,
> > > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > > b/drivers/net/i40e/i40e_ethdev.h index
> > > > > > > > > > 6f65d5e0ac..af758798e1
> > > > > > > > > > 100644
> > > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.h
> > > > > > > > > > @@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct
> > > > > > > > > > rte_eth_dev *dev,

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-22 Thread Feifei Wang



> -Original Message-
> From: Feifei Wang
> Sent: Friday, September 22, 2023 10:59 PM
> To: Konstantin Ananyev ; Konstantin
> Ananyev 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> ; nd ; nd ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> Hi, Konstantin
> 
> > -Original Message-
> > From: Feifei Wang
> > Sent: Tuesday, September 5, 2023 11:11 AM
> > To: Konstantin Ananyev ; Konstantin
> > Ananyev 
> > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > ; Ruifeng Wang
> ;
> > Yuying Zhang ; Beilei Xing
> > ; nd ; nd ; nd
> > ; nd ; nd 
> > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> >
> >
> >
> > > -Original Message-
> > > From: Konstantin Ananyev 
> > > Sent: Monday, September 4, 2023 6:22 PM
> > > To: Feifei Wang ; Konstantin Ananyev
> > > 
> > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > ; Ruifeng Wang
> > ;
> > > Yuying Zhang ; Beilei Xing
> > > ; nd ; nd ; nd
> > > ; nd 
> > > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> > >
> > >
> > >
> > > > > > > > > > > Define specific function implementation for i40e driver.
> > > > > > > > > > > Currently, mbufs recycle mode can support 128bit
> > > > > > > > > > > vector path and
> > > > > > > > > > > avx2
> > > > > > > > > path.
> > > > > > > > > > > And can be enabled both in fast free and no fast free 
> > > > > > > > > > > mode.
> > > > > > > > > > >
> > > > > > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > > > > > 
> > > > > > > > > > > Signed-off-by: Feifei Wang 
> > > > > > > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > > 
> > > > > > > > > > > ---
> > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > > > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147
> > > > > > > > > > > ++
> > > > > > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > > > > > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > > > > > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > > > > > > > >  6 files changed, 187 insertions(+)  create mode
> > > > > > > > > > > 100644
> > > > > > > > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > > b/drivers/net/i40e/i40e_ethdev.c index
> > > > > > > > > > > 8271bbb394..50ba9aac94
> > > > > > > > > > > 100644
> > > > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > > @@ -496,6 +496,7 @@ static const struct eth_dev_ops
> > > > > > > > > > > i40e_eth_dev_ops
> > > > > > > > > = {
> > > > > > > > > > >   .flow_ops_get = i40e_dev_flow_ops_get,
> > > > > > > > > > >   .rxq_info_get = i40e_rxq_info_get,
> > > > > > > > > > >   .txq_info_get = i40e_txq_info_get,
> > > > > > > > > > > + .recycle_rxq_info_get =
> > i40e_recycle_rxq_info_get,
> > > > > > > > > > >   .rx_burst_mode_get=
> > i40e_rx_burst_mode_get,
> > > > > > > > > > >   .tx_burst_mode_get  

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-22 Thread Feifei Wang



> -Original Message-
> From: Konstantin Ananyev 
> Sent: Saturday, September 23, 2023 12:41 AM
> To: Feifei Wang ; Konstantin Ananyev
> 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> ; nd ; nd ; nd ;
> nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> Hi Feifei,
> 
> > > > -Original Message-
> > > > From: Feifei Wang
> > > > Sent: Tuesday, September 5, 2023 11:11 AM
> > > > To: Konstantin Ananyev ; Konstantin
> > > > Ananyev 
> > > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > > ; Ruifeng Wang
> > > ;
> > > > Yuying Zhang ; Beilei Xing
> > > > ; nd ; nd ; nd
> > > > ; nd ; nd 
> > > > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle
> > > > mode
> > > >
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Konstantin Ananyev 
> > > > > Sent: Monday, September 4, 2023 6:22 PM
> > > > > To: Feifei Wang ; Konstantin Ananyev
> > > > > 
> > > > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > > > ; Ruifeng Wang
> > > > ;
> > > > > Yuying Zhang ; Beilei Xing
> > > > > ; nd ; nd ; nd
> > > > > ; nd 
> > > > > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle
> > > > > mode
> > > > >
> > > > >
> > > > >
> > > > > > > > > > > > > Define specific function implementation for i40e 
> > > > > > > > > > > > > driver.
> > > > > > > > > > > > > Currently, mbufs recycle mode can support 128bit
> > > > > > > > > > > > > vector path and
> > > > > > > > > > > > > avx2
> > > > > > > > > > > path.
> > > > > > > > > > > > > And can be enabled both in fast free and no fast free
> mode.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Signed-off-by: Feifei Wang
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > > > > 
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.c|   1 +
> > > > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.h|   2 +
> > > > > > > > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c  |
> > > > > > > > > > > > > 147
> > > > > > > > > > > > > ++
> > > > > > > > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  32 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >  drivers/net/i40e/i40e_rxtx.h  |   4 +
> > > > > > > > > > > > >  drivers/net/i40e/meson.build  |   1 +
> > > > > > > > > > > > >  6 files changed, 187 insertions(+)  create mode
> > > > > > > > > > > > > 100644
> > > > > > > > > > > > > drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > > > > b/drivers/net/i40e/i40e_ethdev.c index
> > > > > > > > > > > > > 8271bbb394..50ba9aac94
> > > > > > > > > > > > > 100644
> > > > > > > > > > > > > --- a/drivers/net/i40e/i40e_ethdev.c
> > > > > > > > > > > > > +++ b/drivers/net/i40e/i40e_ethdev.c
> > > &g

[PATCH v13 0/4] Recycle mbufs from Tx queue into Rx queue

2023-09-24 Thread Feifei Wang
v4:
1. Function name and variable name are changed to make this mode more
general for all drivers. (Konstantin, Morten)
2. Add ring wrapping check (Konstantin)

v5:
1. some change for ethdev API (Morten)
2. add support for avx2, sse, altivec path

v6:
1. fix ixgbe build issue in ppc
2. remove 'recycle_tx_mbufs_reuse' and 'recycle_rx_descriptors_refill'
   API wrapper (Tech Board meeting)
3. add recycle_mbufs engine in testpmd (Tech Board meeting)
4. add namespace in the functions related to mbufs recycle (Ferruh)

v7:
1. move 'rxq/txq data' pointers to the beginning of eth_dev structure,
in order to keep them in the same cache line as rx/tx_burst function
pointers (Morten)
2. add the extra description for 'rte_eth_recycle_mbufs' to show it can
support feeding 1 Rx queue from 2 Tx queues in the same thread
(Konstantin)
3. For i40e/ixgbe driver, make the previous copied buffers as invalid if
there are Tx buffers refcnt > 1 or from unexpected mempool (Konstantin)
4. add check for the return value of 'rte_eth_recycle_rx_queue_info_get'
in testpmd fwd engine (Morten)

v8:
1. add arm/x86 build option to fix ixgbe build issue in ppc

v9:
1. delete duplicate file name for ixgbe

v10:
1. fix compile issue on windows

v11:
1. fix doc warning

v12:
1. replace rx queue check code with eth_dev_validate_rx_queue
function (Stephen)
2. put port and queue check before function call (Konstantin)

v13:
1. for i40e and ixgbe drivers, reset nb_recycle_mbufs to zero
when rxep[i] == NULL, no matter what value refill_requirement
is (Konstantin)

Feifei Wang (4):
  ethdev: add API for mbufs recycle mode
  net/i40e: implement mbufs recycle mode
  net/ixgbe: implement mbufs recycle mode
  app/testpmd: add recycle mbufs engine

 app/test-pmd/meson.build  |   1 +
 app/test-pmd/recycle_mbufs.c  |  58 ++
 app/test-pmd/testpmd.c|   1 +
 app/test-pmd/testpmd.h|   3 +
 doc/guides/rel_notes/release_23_11.rst|  15 ++
 doc/guides/testpmd_app_ug/run_app.rst |   1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst   |   5 +-
 drivers/net/i40e/i40e_ethdev.c|   1 +
 drivers/net/i40e/i40e_ethdev.h|   2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147 ++
 drivers/net/i40e/i40e_rxtx.c  |  32 
 drivers/net/i40e/i40e_rxtx.h  |   4 +
 drivers/net/i40e/meson.build  |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.c  |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h  |   3 +
 .../ixgbe/ixgbe_recycle_mbufs_vec_common.c| 143 ++
 drivers/net/ixgbe/ixgbe_rxtx.c|  37 +++-
 drivers/net/ixgbe/ixgbe_rxtx.h|   4 +
 drivers/net/ixgbe/meson.build |   2 +
 lib/ethdev/ethdev_driver.h|  10 +
 lib/ethdev/ethdev_private.c   |   2 +
 lib/ethdev/rte_ethdev.c   |  22 +++
 lib/ethdev/rte_ethdev.h   | 180 ++
 lib/ethdev/rte_ethdev_core.h  |  23 ++-
 lib/ethdev/version.map|   3 +
 25 files changed, 692 insertions(+), 9 deletions(-)
 create mode 100644 app/test-pmd/recycle_mbufs.c
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
 create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c

-- 
2.25.1



[PATCH v13 1/4] ethdev: add API for mbufs recycle mode

2023-09-24 Thread Feifei Wang
Add 'rte_eth_recycle_rx_queue_info_get' and 'rte_eth_recycle_mbufs'
APIs to recycle used mbufs from a transmit queue of an Ethernet device,
and move these mbufs into a mbuf ring for a receive queue of an Ethernet
device. This can bypass mempool 'put/get' operations and hence save CPU
cycles.

When recycling mbufs, the rte_eth_recycle_mbufs() function performs the
following operations:
- Copy used *rte_mbuf* buffer pointers from the Tx mbuf ring into the
Rx mbuf ring.
- Replenish the Rx descriptors with the recycled *rte_mbuf* buffers
freed from the Tx mbuf ring.
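
A rough usage sketch (not part of this patch): the application queries
the Rx queue's recycle info once, then calls the recycle API in its
forwarding loop. Here 'rx_port', 'rx_queue', 'tx_port', 'tx_queue' and
'quit' are placeholders, and error handling is trimmed:
-
struct rte_eth_recycle_rxq_info rxq_info;
struct rte_mbuf *pkts[32];
uint16_t nb_rx;

if (rte_eth_recycle_rx_queue_info_get(rx_port, rx_queue, &rxq_info) != 0)
	rte_exit(EXIT_FAILURE, "mbufs recycle mode not supported\n");

while (!quit) {
	/* Move used mbufs from the Tx ring into the Rx mbuf ring. */
	rte_eth_recycle_mbufs(rx_port, rx_queue, tx_port, tx_queue,
			&rxq_info);

	nb_rx = rte_eth_rx_burst(rx_port, rx_queue, pkts, RTE_DIM(pkts));
	if (nb_rx != 0)
		rte_eth_tx_burst(tx_port, tx_queue, pkts, nb_rx);
}
-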

Suggested-by: Honnappa Nagarahalli 
Suggested-by: Ruifeng Wang 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
Reviewed-by: Honnappa Nagarahalli 
Acked-by: Morten Brørup 
Acked-by: Konstantin Ananyev 
Acked-by: Ferruh Yigit 
---
 doc/guides/rel_notes/release_23_11.rst |  15 +++
 lib/ethdev/ethdev_driver.h |  10 ++
 lib/ethdev/ethdev_private.c|   2 +
 lib/ethdev/rte_ethdev.c|  22 +++
 lib/ethdev/rte_ethdev.h| 180 +
 lib/ethdev/rte_ethdev_core.h   |  23 +++-
 lib/ethdev/version.map |   3 +
 7 files changed, 249 insertions(+), 6 deletions(-)

diff --git a/doc/guides/rel_notes/release_23_11.rst b/doc/guides/rel_notes/release_23_11.rst
index 9746809a66..3c2bed73aa 100644
--- a/doc/guides/rel_notes/release_23_11.rst
+++ b/doc/guides/rel_notes/release_23_11.rst
@@ -78,6 +78,13 @@ New Features
 * build: Optional libraries can now be selected with the new ``enable_libs``
   build option similarly to the existing ``enable_drivers`` build option.
 
+* **Add mbufs recycling support.**
+
+  Added ``rte_eth_recycle_rx_queue_info_get`` and ``rte_eth_recycle_mbufs``
+  APIs which allow the user to copy used mbufs from the Tx mbuf ring
+  into the Rx mbuf ring. This feature supports the case where the Rx Ethernet
+  device is different from the Tx Ethernet device, with respective driver
+  callback functions in ``rte_eth_recycle_mbufs``.
 
 Removed Items
 -
@@ -135,6 +142,14 @@ ABI Changes
Also, make sure to start the actual text at the margin.
===
 
+* ethdev: Added ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+  fields to ``rte_eth_dev`` structure.
+
+* ethdev: Structure ``rte_eth_fp_ops`` was affected to add
+  ``recycle_tx_mbufs_reuse`` and ``recycle_rx_descriptors_refill``
+  fields, to move ``rxq`` and ``txq`` fields, to change the size of
+  ``reserved1`` and ``reserved2`` fields.
+
 
 Known Issues
 
diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 3fa8b309c1..deb23ada18 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -60,6 +60,10 @@ struct rte_eth_dev {
eth_rx_descriptor_status_t rx_descriptor_status;
/** Check the status of a Tx descriptor */
eth_tx_descriptor_status_t tx_descriptor_status;
+   /** Pointer to PMD transmit mbufs reuse function */
+   eth_recycle_tx_mbufs_reuse_t recycle_tx_mbufs_reuse;
+   /** Pointer to PMD receive descriptors refill function */
+   eth_recycle_rx_descriptors_refill_t recycle_rx_descriptors_refill;
 
/**
 * Device data that is shared between primary and secondary processes
@@ -509,6 +513,10 @@ typedef void (*eth_rxq_info_get_t)(struct rte_eth_dev *dev,
 typedef void (*eth_txq_info_get_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id, struct rte_eth_txq_info *qinfo);
 
+typedef void (*eth_recycle_rxq_info_get_t)(struct rte_eth_dev *dev,
+   uint16_t rx_queue_id,
+   struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 typedef int (*eth_burst_mode_get_t)(struct rte_eth_dev *dev,
uint16_t queue_id, struct rte_eth_burst_mode *mode);
 
@@ -1252,6 +1260,8 @@ struct eth_dev_ops {
eth_rxq_info_get_t rxq_info_get;
/** Retrieve Tx queue information */
eth_txq_info_get_t txq_info_get;
+   /** Retrieve mbufs recycle Rx queue information */
+   eth_recycle_rxq_info_get_t recycle_rxq_info_get;
eth_burst_mode_get_t   rx_burst_mode_get; /**< Get Rx burst mode */
eth_burst_mode_get_t   tx_burst_mode_get; /**< Get Tx burst mode */
eth_fw_version_get_t   fw_version_get; /**< Get firmware version */
diff --git a/lib/ethdev/ethdev_private.c b/lib/ethdev/ethdev_private.c
index 14ec8c6ccf..f8ab64f195 100644
--- a/lib/ethdev/ethdev_private.c
+++ b/lib/ethdev/ethdev_private.c
@@ -277,6 +277,8 @@ eth_dev_fp_ops_setup(struct rte_eth_fp_ops *fpo,
fpo->rx_queue_count = dev->rx_queue_count;
fpo->rx_descriptor_status = dev->rx_descriptor_status;
fpo->tx_descriptor_status = dev->tx_descriptor_status;
+   fpo->recycle_tx_mbufs_reuse = dev->recycle_tx_mbufs_reuse;
+   fpo->recycle_rx_descriptors_refill = dev->recycle_rx_descriptors_refill;

[PATCH v13 2/4] net/i40e: implement mbufs recycle mode

2023-09-24 Thread Feifei Wang
Define the specific function implementations for the i40e driver.
Currently, mbufs recycle mode supports the 128-bit vector path and the
avx2 path, and can be enabled in both fast-free and no-fast-free modes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
Reviewed-by: Honnappa Nagarahalli 
---
 drivers/net/i40e/i40e_ethdev.c|   1 +
 drivers/net/i40e/i40e_ethdev.h|   2 +
 .../net/i40e/i40e_recycle_mbufs_vec_common.c  | 147 ++
 drivers/net/i40e/i40e_rxtx.c  |  32 
 drivers/net/i40e/i40e_rxtx.h  |   4 +
 drivers/net/i40e/meson.build  |   1 +
 6 files changed, 187 insertions(+)
 create mode 100644 drivers/net/i40e/i40e_recycle_mbufs_vec_common.c

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 8271bbb394..50ba9aac94 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -496,6 +496,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.flow_ops_get = i40e_dev_flow_ops_get,
.rxq_info_get = i40e_rxq_info_get,
.txq_info_get = i40e_txq_info_get,
+   .recycle_rxq_info_get = i40e_recycle_rxq_info_get,
.rx_burst_mode_get= i40e_rx_burst_mode_get,
.tx_burst_mode_get= i40e_tx_burst_mode_get,
.timesync_enable  = i40e_timesync_enable,
diff --git a/drivers/net/i40e/i40e_ethdev.h b/drivers/net/i40e/i40e_ethdev.h
index 8d7e50287f..1bbe7ad376 100644
--- a/drivers/net/i40e/i40e_ethdev.h
+++ b/drivers/net/i40e/i40e_ethdev.h
@@ -1355,6 +1355,8 @@ void i40e_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_rxq_info *qinfo);
 void i40e_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
+void i40e_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+   struct rte_eth_recycle_rxq_info *recycle_rxq_info);
 int i40e_rx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
   struct rte_eth_burst_mode *mode);
 int i40e_tx_burst_mode_get(struct rte_eth_dev *dev, uint16_t queue_id,
diff --git a/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
new file mode 100644
index 00..14424c9921
--- /dev/null
+++ b/drivers/net/i40e/i40e_recycle_mbufs_vec_common.c
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include 
+#include 
+
+#include "base/i40e_prototype.h"
+#include "base/i40e_type.h"
+#include "i40e_ethdev.h"
+#include "i40e_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+i40e_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+   struct i40e_rx_queue *rxq = rx_queue;
+   struct i40e_rx_entry *rxep;
+   volatile union i40e_rx_desc *rxdp;
+   uint16_t rx_id;
+   uint64_t paddr;
+   uint64_t dma_addr;
+   uint16_t i;
+
+   rxdp = rxq->rx_ring + rxq->rxrearm_start;
+   rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+   for (i = 0; i < nb_mbufs; i++) {
+   /* Initialize rxdp descs. */
+   paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+   dma_addr = rte_cpu_to_le_64(paddr);
+   /* flush desc with pa dma_addr */
+   rxdp[i].read.hdr_addr = 0;
+   rxdp[i].read.pkt_addr = dma_addr;
+   }
+
+   /* Update the descriptor initializer index */
+   rxq->rxrearm_start += nb_mbufs;
+   rx_id = rxq->rxrearm_start - 1;
+
+   if (unlikely(rxq->rxrearm_start >= rxq->nb_rx_desc)) {
+   rxq->rxrearm_start = 0;
+   rx_id = rxq->nb_rx_desc - 1;
+   }
+
+   rxq->rxrearm_nb -= nb_mbufs;
+
+   rte_io_wmb();
+   /* Update the tail pointer on the NIC */
+   I40E_PCI_REG_WRITE_RELAXED(rxq->qrx_tail, rx_id);
+}
+
+uint16_t
+i40e_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+   struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+   struct i40e_tx_queue *txq = tx_queue;
+   struct i40e_tx_entry *txep;
+   struct rte_mbuf **rxep;
+   int i, n;
+   uint16_t nb_recycle_mbufs;
+   uint16_t avail = 0;
+   uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+   uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+   uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+   uint16_t refill_head = *recycle_rxq_info->refill_head;
+   uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+   /* Get available recycling Rx buffers. */
+   avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+   /* Check Tx free thresh and Rx available space. */
+   if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)

[PATCH v13 3/4] net/ixgbe: implement mbufs recycle mode

2023-09-24 Thread Feifei Wang
Define the specific function implementations for the ixgbe driver.
Currently, recycle buffer mode supports the 128-bit vector path, and can
be enabled in both fast-free and no-fast-free modes.

Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
Reviewed-by: Honnappa Nagarahalli 
---
 drivers/net/ixgbe/ixgbe_ethdev.c  |   1 +
 drivers/net/ixgbe/ixgbe_ethdev.h  |   3 +
 .../ixgbe/ixgbe_recycle_mbufs_vec_common.c| 143 ++
 drivers/net/ixgbe/ixgbe_rxtx.c|  37 -
 drivers/net/ixgbe/ixgbe_rxtx.h|   4 +
 drivers/net/ixgbe/meson.build |   2 +
 6 files changed, 188 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index cd4a85b3a7..d6cf00317e 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -543,6 +543,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.set_mc_addr_list = ixgbe_dev_set_mc_addr_list,
.rxq_info_get = ixgbe_rxq_info_get,
.txq_info_get = ixgbe_txq_info_get,
+   .recycle_rxq_info_get = ixgbe_recycle_rxq_info_get,
.timesync_enable  = ixgbe_timesync_enable,
.timesync_disable = ixgbe_timesync_disable,
.timesync_read_rx_timestamp = ixgbe_timesync_read_rx_timestamp,
diff --git a/drivers/net/ixgbe/ixgbe_ethdev.h b/drivers/net/ixgbe/ixgbe_ethdev.h
index 1291e9099c..22fc3be3d8 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.h
+++ b/drivers/net/ixgbe/ixgbe_ethdev.h
@@ -626,6 +626,9 @@ void ixgbe_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
 void ixgbe_txq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
struct rte_eth_txq_info *qinfo);
 
+void ixgbe_recycle_rxq_info_get(struct rte_eth_dev *dev, uint16_t queue_id,
+   struct rte_eth_recycle_rxq_info *recycle_rxq_info);
+
 int ixgbevf_dev_rx_init(struct rte_eth_dev *dev);
 
 void ixgbevf_dev_tx_init(struct rte_eth_dev *dev);
diff --git a/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
new file mode 100644
index 00..d451562269
--- /dev/null
+++ b/drivers/net/ixgbe/ixgbe_recycle_mbufs_vec_common.c
@@ -0,0 +1,143 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include 
+#include 
+
+#include "ixgbe_ethdev.h"
+#include "ixgbe_rxtx.h"
+
+#pragma GCC diagnostic ignored "-Wcast-qual"
+
+void
+ixgbe_recycle_rx_descriptors_refill_vec(void *rx_queue, uint16_t nb_mbufs)
+{
+   struct ixgbe_rx_queue *rxq = rx_queue;
+   struct ixgbe_rx_entry *rxep;
+   volatile union ixgbe_adv_rx_desc *rxdp;
+   uint16_t rx_id;
+   uint64_t paddr;
+   uint64_t dma_addr;
+   uint16_t i;
+
+   rxdp = rxq->rx_ring + rxq->rxrearm_start;
+   rxep = &rxq->sw_ring[rxq->rxrearm_start];
+
+   for (i = 0; i < nb_mbufs; i++) {
+   /* Initialize rxdp descs. */
+   paddr = (rxep[i].mbuf)->buf_iova + RTE_PKTMBUF_HEADROOM;
+   dma_addr = rte_cpu_to_le_64(paddr);
+   /* Flush descriptors with pa dma_addr */
+   rxdp[i].read.hdr_addr = 0;
+   rxdp[i].read.pkt_addr = dma_addr;
+   }
+
+   /* Update the descriptor initializer index */
+   rxq->rxrearm_start += nb_mbufs;
+   if (rxq->rxrearm_start >= rxq->nb_rx_desc)
+   rxq->rxrearm_start = 0;
+
+   rxq->rxrearm_nb -= nb_mbufs;
+
+   rx_id = (uint16_t)((rxq->rxrearm_start == 0) ?
+   (rxq->nb_rx_desc - 1) : (rxq->rxrearm_start - 1));
+
+   /* Update the tail pointer on the NIC */
+   IXGBE_PCI_REG_WRITE(rxq->rdt_reg_addr, rx_id);
+}
+
+uint16_t
+ixgbe_recycle_tx_mbufs_reuse_vec(void *tx_queue,
+   struct rte_eth_recycle_rxq_info *recycle_rxq_info)
+{
+   struct ixgbe_tx_queue *txq = tx_queue;
+   struct ixgbe_tx_entry *txep;
+   struct rte_mbuf **rxep;
+   int i, n;
+   uint32_t status;
+   uint16_t nb_recycle_mbufs;
+   uint16_t avail = 0;
+   uint16_t mbuf_ring_size = recycle_rxq_info->mbuf_ring_size;
+   uint16_t mask = recycle_rxq_info->mbuf_ring_size - 1;
+   uint16_t refill_requirement = recycle_rxq_info->refill_requirement;
+   uint16_t refill_head = *recycle_rxq_info->refill_head;
+   uint16_t receive_tail = *recycle_rxq_info->receive_tail;
+
+   /* Get available recycling Rx buffers. */
+   avail = (mbuf_ring_size - (refill_head - receive_tail)) & mask;
+
+   /* Check Tx free thresh and Rx available space. */
+   if (txq->nb_tx_free > txq->tx_free_thresh || avail <= txq->tx_rs_thresh)
+   return 0;
+
+   /* check DD bits on threshold descriptor */
+   stat

[PATCH v13 4/4] app/testpmd: add recycle mbufs engine

2023-09-24 Thread Feifei Wang
Add a recycle mbufs engine for testpmd. This engine forwards packets
in I/O mode, but enables the mbuf recycle feature to recycle used
txq mbufs into the rxq mbuf ring, which can bypass the mempool path
and save CPU cycles.

Suggested-by: Jerin Jacob 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 app/test-pmd/meson.build|  1 +
 app/test-pmd/recycle_mbufs.c| 58 +
 app/test-pmd/testpmd.c  |  1 +
 app/test-pmd/testpmd.h  |  3 ++
 doc/guides/testpmd_app_ug/run_app.rst   |  1 +
 doc/guides/testpmd_app_ug/testpmd_funcs.rst |  5 +-
 6 files changed, 68 insertions(+), 1 deletion(-)
 create mode 100644 app/test-pmd/recycle_mbufs.c

diff --git a/app/test-pmd/meson.build b/app/test-pmd/meson.build
index d2e3f60892..6e5f067274 100644
--- a/app/test-pmd/meson.build
+++ b/app/test-pmd/meson.build
@@ -22,6 +22,7 @@ sources = files(
 'macswap.c',
 'noisy_vnf.c',
 'parameters.c',
+   'recycle_mbufs.c',
 'rxonly.c',
 'shared_rxq_fwd.c',
 'testpmd.c',
diff --git a/app/test-pmd/recycle_mbufs.c b/app/test-pmd/recycle_mbufs.c
new file mode 100644
index 00..6e9e1c5eb6
--- /dev/null
+++ b/app/test-pmd/recycle_mbufs.c
@@ -0,0 +1,58 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2023 Arm Limited.
+ */
+
+#include "testpmd.h"
+
+/*
+ * Forwarding of packets in I/O mode.
+ * Enable mbufs recycle mode to recycle txq used mbufs
+ * for rxq mbuf ring. This can bypass mempool path and
+ * save CPU cycles.
+ */
+static bool
+pkt_burst_recycle_mbufs(struct fwd_stream *fs)
+{
+   struct rte_mbuf *pkts_burst[MAX_PKT_BURST];
+   uint16_t nb_rx;
+
+   /* Recycle used mbufs from the txq, and move these mbufs into
+* the rxq mbuf ring.
+*/
+   rte_eth_recycle_mbufs(fs->rx_port, fs->rx_queue,
+   fs->tx_port, fs->tx_queue, &(fs->recycle_rxq_info));
+
+   /*
+* Receive a burst of packets and forward them.
+*/
+   nb_rx = common_fwd_stream_receive(fs, pkts_burst, nb_pkt_per_burst);
+   if (unlikely(nb_rx == 0))
+   return false;
+
+   common_fwd_stream_transmit(fs, pkts_burst, nb_rx);
+
+   return true;
+}
+
+static void
+recycle_mbufs_stream_init(struct fwd_stream *fs)
+{
+   int rc;
+
+   /* Retrieve information about given port's Rx queue
+* for recycling mbufs.
+*/
+   rc = rte_eth_recycle_rx_queue_info_get(fs->rx_port,
+   fs->rx_queue, &(fs->recycle_rxq_info));
+   if (rc != 0)
+   TESTPMD_LOG(WARNING,
+   "Failed to get rx queue mbufs recycle info\n");
+
+   common_fwd_stream_init(fs);
+}
+
+struct fwd_engine recycle_mbufs_engine = {
+   .fwd_mode_name  = "recycle_mbufs",
+   .stream_init= recycle_mbufs_stream_init,
+   .packet_fwd = pkt_burst_recycle_mbufs,
+};
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 938ca035d4..5d0f9ca119 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -199,6 +199,7 @@ struct fwd_engine * fwd_engines[] = {
&icmp_echo_engine,
&noisy_vnf_engine,
&five_tuple_swap_fwd_engine,
+   &recycle_mbufs_engine,
 #ifdef RTE_LIBRTE_IEEE1588
&ieee1588_fwd_engine,
 #endif
diff --git a/app/test-pmd/testpmd.h b/app/test-pmd/testpmd.h
index f1df6a8faf..0eb8d7883a 100644
--- a/app/test-pmd/testpmd.h
+++ b/app/test-pmd/testpmd.h
@@ -188,6 +188,8 @@ struct fwd_stream {
struct pkt_burst_stats rx_burst_stats;
struct pkt_burst_stats tx_burst_stats;
struct fwd_lcore *lcore; /**< Lcore being scheduled. */
+   /**< Rx queue information for recycling mbufs */
+   struct rte_eth_recycle_rxq_info recycle_rxq_info;
 };
 
 /**
@@ -449,6 +451,7 @@ extern struct fwd_engine csum_fwd_engine;
 extern struct fwd_engine icmp_echo_engine;
 extern struct fwd_engine noisy_vnf_engine;
 extern struct fwd_engine five_tuple_swap_fwd_engine;
+extern struct fwd_engine recycle_mbufs_engine;
 #ifdef RTE_LIBRTE_IEEE1588
 extern struct fwd_engine ieee1588_fwd_engine;
 #endif
diff --git a/doc/guides/testpmd_app_ug/run_app.rst b/doc/guides/testpmd_app_ug/run_app.rst
index 6e9c552e76..24a086401e 100644
--- a/doc/guides/testpmd_app_ug/run_app.rst
+++ b/doc/guides/testpmd_app_ug/run_app.rst
@@ -232,6 +232,7 @@ The command line options are:
noisy
5tswap
shared-rxq
+   recycle_mbufs
 
 *   ``--rss-ip``
 
diff --git a/doc/guides/testpmd_app_ug/testpmd_funcs.rst b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
index a182479ab2..aef4de3e0e 100644
--- a/doc/guides/testpmd_app_ug/testpmd_funcs.rst
+++ b/doc/guides/testpmd_app_ug/testpmd_funcs.rst
@@ -318,7 +318,7 @@ set fwd
 Set t
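
For reference, once testpmd is built with this patch applied, the new
engine is selected like any other forwarding mode (a minimal usage
sketch; command names follow the testpmd user guide):

	testpmd> set fwd recycle_mbufs
	testpmd> start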

RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode

2023-09-24 Thread Feifei Wang
For Konstantin

> -Original Message-
> From: Feifei Wang
> Sent: Saturday, September 23, 2023 1:52 PM
> To: Konstantin Ananyev ; Konstantin
> Ananyev 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; Ruifeng Wang
> ; Yuying Zhang ; Beilei
> Xing ; nd ; nd ; nd
> ; nd ; nd ; nd ;
> nd ; nd 
> Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> 
> 
> 
> > -Original Message-
> > From: Konstantin Ananyev 
> > Sent: Saturday, September 23, 2023 12:41 AM
> > To: Feifei Wang ; Konstantin Ananyev
> > 
> > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > ; Ruifeng Wang
> ;
> > Yuying Zhang ; Beilei Xing
> > ; nd ; nd ; nd
> > ; nd ; nd ; nd
> ; nd
> > 
> > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle mode
> >
> >
> > Hi Feifei,
> >
> > > > > -Original Message-
> > > > > From: Feifei Wang
> > > > > Sent: Tuesday, September 5, 2023 11:11 AM
> > > > > To: Konstantin Ananyev ;
> > > > > Konstantin Ananyev 
> > > > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > > > ; Ruifeng Wang
> > > > ;
> > > > > Yuying Zhang ; Beilei Xing
> > > > > ; nd ; nd ; nd
> > > > > ; nd ; nd 
> > > > > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle
> > > > > mode
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Konstantin Ananyev 
> > > > > > Sent: Monday, September 4, 2023 6:22 PM
> > > > > > To: Feifei Wang ; Konstantin Ananyev
> > > > > > 
> > > > > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > > > > ; Ruifeng Wang
> > > > > ;
> > > > > > Yuying Zhang ; Beilei Xing
> > > > > > ; nd ; nd ; nd
> > > > > > ; nd 
> > > > > > Subject: RE: [PATCH v11 2/4] net/i40e: implement mbufs recycle
> > > > > > mode
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > > > > > > > Define specific function implementation for i40e 
> > > > > > > > > > > > > > driver.
> > > > > > > > > > > > > > Currently, mbufs recycle mode can support
> > > > > > > > > > > > > > 128bit vector path and
> > > > > > > > > > > > > > avx2
> > > > > > > > > > > > path.
> > > > > > > > > > > > > > And can be enabled both in fast free and no
> > > > > > > > > > > > > > fast free
> > mode.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Suggested-by: Honnappa Nagarahalli
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Signed-off-by: Feifei Wang
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Reviewed-by: Ruifeng Wang
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Reviewed-by: Honnappa Nagarahalli
> > > > > > > > > > 
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.c|   
> > > > > > > > > > > > > > 1 +
> > > > > > > > > > > > > >  drivers/net/i40e/i40e_ethdev.h|   
> > > > > > > > > > > > > > 2 +
> > > > > > > > > > > > > >  .../net/i40e/i40e_recycle_mbufs_vec_common.c
> > > > > > > > > > > > > > |
> > > > > > > > > > > > > > 147
> > > > > > > > > > > > > > ++
> > > > > > > > > > > > > >  drivers/net/i40e/i40e_rxtx.c  |  
> > > > > > > > > > > > > > 32 
> > > > > > > > > > > > 

[dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

2021-04-19 Thread Feifei Wang
Hi, Slava

Thanks very much for your explanation.

I can understand that the app can wait until all mbufs are returned to the
memory pool, and only then free these mbufs; I agree with this.
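
For reference, a minimal sketch of such a wait using only the generic
mempool API (mp is the pool in question; the polling interval is an
arbitrary choice):

	/* Block until every mbuf has returned to its pool. */
	while (rte_mempool_avail_count(mp) < mp->size)
		rte_delay_us_sleep(100);
	/* Only now is it safe to free the memory backing this pool. */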

As a result, I will remove the bug fix patch from this series and just
replace the smp barrier with a C11 thread fence. Thanks very much for your
patient explanation again.

Best Regards
Feifei

> -Original Message-
> From: Slava Ovsiienko 
> Sent: Tuesday, April 20, 2021 2:51
> To: Feifei Wang ; Matan Azrad
> ; Shahaf Shuler 
> Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> ; nd 
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> 
> Hi, Feifei
> 
> Please, see below
> 
> 
> 
> > > Hi, Feifei
> > >
> > > Sorry, I do not follow what this patch fixes. Do we have some
> > > issue/bug with MR cache in practice?
> >
> > This patch fixes the bug which is based on logical deduction, and it
> > doesn't actually happen.
> >
> > >
> > > Each Tx queue has its own dedicated "local" cache for MRs to convert
> > > buffer address in mbufs being transmitted to LKeys (HW-related
> > > entity
> > > handle) and the "global" cache for all MR registered on the device.
> > >
> > > AFAIK, how conversion happens in datapath:
> > > - check the local queue cache flush request
> > > - lookup in local cache
> > > - if not found:
> > > - acquire lock for global cache read access
> > > - lookup in global cache
> > > - release lock for global cache
> > >
> > > How cache update on memory freeing/unregistering happens:
> > > - acquire lock for global cache write access
> > > - [a] remove relevant MRs from the global cache
> > > - [b] set local caches flush request
> > > - free global cache lock
> > >
> > > If I understand correctly, your patch swaps [a] and [b], and local
> > > caches flush is requested earlier. What problem does it solve?
> > > It is not supposed there are in datapath some mbufs referencing to
> > > the memory being freed. Application must ensure this and must not
> > > allocate new mbufs from this memory regions being freed. Hence, the
> > > lookups for these MRs in caches should not occur.
> >
> > For your first point that, application can take charge of preventing
> > MR freed memory being allocated to data path.
> >
> > Does it mean that if there is an emergency of MR fragmentation, such as
> > hotplug, the application must inform the data path in advance, and this
> > memory will not be allocated, and then the control path will free this
> > memory? If the application can do like this, I agree that this bug cannot
> > happen.
> 
> Actually,  this is the only correct way for application to operate.
> Let's suppose we have some memory area that application wants to free. ALL
> references to this area must be removed. If we have some mbufs allocated
> from this area, it means that we have memory pool created there.
> 
> What application should do:
> - notify all its components/agents the memory area is going to be freed
> - all components/agents free the mbufs they might own
> - PMD might not support freeing for some mbufs (for example being sent
> and awaiting for completion), so app should just wait
> - wait till all mbufs are returned to the memory pool (by monitoring available
> obj == pool size)
> 
> Otherwise - it is dangerous to free the memory. There are just some mbufs
> still allocated, it is regardless to buf address to MR translation. We just 
> can't
> free the memory - the mapping will be destroyed and might cause the
> segmentation fault by SW or some HW issues on DMA access to unmapped
> memory.  It is very generic safety approach - do not free the memory that is
> still in use. Hence, at the moment of freeing and unregistering the MR, there
> MUST BE NO any mbufs in flight referencing to the addresses being freed.
> No translation to MR being invalidated can happen.
> 
> >
> > > For other side, the cache flush has negative effect - the local
> > > cache is getting empty and can't provide translation for other valid
> > > (not being removed) MRs, and the translation has to look up in the
> > > global cache, that is locked now for rebuilding, this causes the
> > > delays in datapatch
> > on acquiring global cache lock.
> > > So, I see some potential performance impact.
> >
> > If above assumption is true, we can go to your second point. I think
> > this is a problem of the tradeoff between cache coherence and
> performance.
> >

[dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

2021-04-20 Thread Feifei Wang
Hi, Slava

Another question suddenly occurred to me: in order to keep the order that
the global cache is rebuilt before "dev_gen" is updated, the wmb should be
placed before updating "dev_gen" rather than after it. Otherwise, on
out-of-order platforms, the current order cannot be kept.

Thus, we should change the code to:
a) rebuild the global cache;
b) rte_smp_wmb();
c) update dev_gen
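
In code, this ordering looks roughly like the sketch below (names follow
the mlx5 MR code; treat the rebuild helper as illustrative):

	rte_rwlock_write_lock(&sh->share_cache.rwlock);
	/* a) rebuild the global MR cache */
	mlx5_mr_rebuild_cache(&sh->share_cache);
	/* b) make the rebuilt cache visible before the guard changes */
	rte_smp_wmb();
	/* c) update the guard variable to trigger local cache flushes */
	sh->share_cache.dev_gen++;
	rte_rwlock_write_unlock(&sh->share_cache.rwlock);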

Best Regards
Feifei
> -Original Message-
> From: Feifei Wang
> Sent: Tuesday, April 20, 2021 13:54
> To: Slava Ovsiienko ; Matan Azrad
> ; Shahaf Shuler 
> Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> ; nd ; nd 
> Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> cache
> 
> Hi, Slava
> 
> Thanks very much for your explanation.
> 
> I can understand the app can wait all mbufs are returned to the memory pool,
> and then it can free this mbufs, I agree with this.
> 
> As a result, I will remove the bug fix patch from this series and just replace
> the smp barrier with C11 thread fence. Thanks very much for your patient
> explanation again.
> 
> Best Regards
> Feifei
> 
> > -Original Message-
> > From: Slava Ovsiienko 
> > Sent: Tuesday, April 20, 2021 2:51
> > To: Feifei Wang ; Matan Azrad
> > ; Shahaf Shuler 
> > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > ; nd 
> > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > cache
> >
> > Hi, Feifei
> >
> > Please, see below
> >
> > 
> >
> > > > Hi, Feifei
> > > >
> > > > Sorry, I do not follow what this patch fixes. Do we have some
> > > > issue/bug with MR cache in practice?
> > >
> > > This patch fixes the bug which is based on logical deduction, and it
> > > doesn't actually happen.
> > >
> > > >
> > > > Each Tx queue has its own dedicated "local" cache for MRs to
> > > > convert buffer address in mbufs being transmitted to LKeys
> > > > (HW-related entity
> > > > handle) and the "global" cache for all MR registered on the device.
> > > >
> > > > AFAIK, how conversion happens in datapath:
> > > > - check the local queue cache flush request
> > > > - lookup in local cache
> > > > - if not found:
> > > > - acquire lock for global cache read access
> > > > - lookup in global cache
> > > > - release lock for global cache
> > > >
> > > > How cache update on memory freeing/unregistering happens:
> > > > - acquire lock for global cache write access
> > > > - [a] remove relevant MRs from the global cache
> > > > - [b] set local caches flush request
> > > > - free global cache lock
> > > >
> > > > If I understand correctly, your patch swaps [a] and [b], and local
> > > > caches flush is requested earlier. What problem does it solve?
> > > > It is not supposed there are in datapath some mbufs referencing to
> > > > the memory being freed. Application must ensure this and must not
> > > > allocate new mbufs from this memory regions being freed. Hence,
> > > > the lookups for these MRs in caches should not occur.
> > >
> > > For your first point that, application can take charge of preventing
> > > MR freed memory being allocated to data path.
> > >
> > > Does it mean that if there is an emergency of MR fragmentation, such as
> > > hotplug, the application must inform the data path in advance, and
> > > this memory will not be allocated, and then the control path will
> > > free this memory? If the application can do like this, I agree that this bug
> cannot happen.
> >
> > Actually,  this is the only correct way for application to operate.
> > Let's suppose we have some memory area that application wants to free.
> > ALL references to this area must be removed. If we have some mbufs
> > allocated from this area, it means that we have memory pool created there.
> >
> > What application should do:
> > - notify all its components/agents the memory area is going to be
> > freed
> > - all components/agents free the mbufs they might own
> > - PMD might not support freeing for some mbufs (for example being sent
> > and awaiting for completion), so app should just wait
> > - wait till all mbufs are returned to the memory pool (by monitoring
> > available obj == pool size)
> >
> > Otherwise - it is dangerous to free the memory. There are just some
> > mbufs still allocated, it is regardless to buf address to MR
> >

[dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

2021-04-20 Thread Feifei Wang
Hi, Slava

I think the second wmb can be removed.
As I understand it, wmb is just a barrier that keeps the order between
writes; it cannot tell the CPU when it should commit the changes.

It is usually placed before a guard variable, to ensure the guard variable
is updated only after the changes you want to release have been done.

For example, the wmb after the global cache update/before altering dev_gen
ensures that the global cache is updated before dev_gen is altered:
1) If other agents load the changed "dev_gen", they know the global cache
has been updated.
2) If other agents load the unchanged "dev_gen", the global cache has not
been updated yet, and the local cache will not be flushed.

As a result, we use a wmb plus the guard variable "dev_gen" to make the
global cache update "visible". "Visible" means that once the updated guard
variable "dev_gen" is observed by other agents, they can also be certain
that the global cache has been updated. Thus, a single wmb before altering
dev_gen can ensure this.
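
For completeness, the reader side this pairs with looks roughly like the
sketch below (per-queue datapath check; the field and helper names follow
the mlx5 MR code and should be treated as illustrative):

	/* Datapath: check the guard before trusting the local cache. */
	if (unlikely(*mr_ctrl->dev_gen_ptr != mr_ctrl->cur_gen)) {
		/* dev_gen changed: the rebuilt global cache is already
		 * visible, so flush the local cache and look up again. */
		mlx5_mr_flush_local_cache(mr_ctrl);
	}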

Best Regards
Feifei 

> -Original Message-
> From: Slava Ovsiienko 
> Sent: Tuesday, April 20, 2021 15:54
> To: Feifei Wang ; Matan Azrad
> ; Shahaf Shuler 
> Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> ; nd ; nd ; nd
> 
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> 
> Hi, Feifei
> 
> In my opinion, there should be 2 barriers:
>  - after global cache update/before altering dev_gen, to ensure the correct
> order
>  - after altering dev_gen to make this change visible for other agents and to
> trigger local cache update
> 
> With best regards,
> Slava
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Tuesday, April 20, 2021 10:30
> > To: Slava Ovsiienko ; Matan Azrad
> > ; Shahaf Shuler 
> > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > ; nd ; nd ; nd
> > 
> > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > Region cache
> >
> > Hi, Slava
> >
> > Another question suddenly occurred to me, in order to keep the order
> > that rebuilding global cache before updating ”dev_gen“, the wmb should
> > be before updating "dev_gen" rather than after it.
> > Otherwise, in the out-of-order platforms, current order cannot be kept.
> >
> > Thus, we should change the code as:
> > a) rebuild global cache;
> > b) rte_smp_wmb();
> > c) updating dev_gen
> >
> > Best Regards
> > Feifei
> > > -Original Message-
> > > From: Feifei Wang
> > > Sent: Tuesday, April 20, 2021 13:54
> > > To: Slava Ovsiienko ; Matan Azrad
> > > ; Shahaf Shuler 
> > > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > > ; nd ; nd 
> > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > > cache
> > >
> > > Hi, Slava
> > >
> > > Thanks very much for your explanation.
> > >
> > > I can understand the app can wait all mbufs are returned to the
> > > memory pool, and then it can free this mbufs, I agree with this.
> > >
> > > As a result, I will remove the bug fix patch from this series and
> > > just replace the smp barrier with C11 thread fence. Thanks very much
> > > for your patient explanation again.
> > >
> > > Best Regards
> > > Feifei
> > >
> > > > -Original Message-
> > > > From: Slava Ovsiienko 
> > > > Sent: Tuesday, April 20, 2021 2:51
> > > > To: Feifei Wang ; Matan Azrad
> > > > ; Shahaf Shuler 
> > > > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > > > ; nd 
> > > > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > > > cache
> > > >
> > > > Hi, Feifei
> > > >
> > > > Please, see below
> > > >
> > > > 
> > > >
> > > > > > Hi, Feifei
> > > > > >
> > > > > > Sorry, I do not follow what this patch fixes. Do we have some
> > > > > > issue/bug with MR cache in practice?
> > > > >
> > > > > This patch fixes the bug which is based on logical deduction,
> > > > > and it doesn't actually happen.
> > > > >
> > > > > >
> > > > > > Each Tx queue has its own dedicated "local" cache for MRs to
> > > > > > convert buffer address in mbufs being transmitted to LKeys
> > > > > > (HW-related entity

[dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

2021-05-05 Thread Feifei Wang
Hi, Slava

Would you have more comments about this patch?
From my point of view, only one wmb before updating "dev_gen" is enough for
synchronization.

Thanks very much for your attention.


Best Regards
Feifei

> -Original Message-
> From: Feifei Wang
> Sent: Tuesday, April 20, 2021 16:42
> To: Slava Ovsiienko ; Matan Azrad
> ; Shahaf Shuler 
> Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> ; nd 
> Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> cache
> 
> Hi, Slava
> 
> I think the second wmb can be removed.
> As I know, wmb is just a barrier to keep the order between write and write.
> and it cannot tell the CPU when it should commit the changes.
> 
> It is usually used before guard variable to keep the order that updating guard
> variable after some changes, which you want to release, have been done.
> 
> For example, for the wmb  after global cache update/before altering
> dev_gen, it can ensure the order that updating global cache before altering
> dev_gen:
> 1)If other agent load the changed "dev_gen", it can know the global cache
> has been updated.
> 2)If other agents load the unchanged, "dev_gen", it means the global cache
> has not been updated, and the local cache will not be flushed.
> 
> As a result, we use  wmb and guard variable "dev_gen" to ensure the global
> cache updating is "visible".
> The "visible" means when updating guard variable "dev_gen" is known by
> other agents, they also can confirm global cache has been updated in  the
> meanwhile. Thus, just one wmb before altering  dev_gen can ensure this.
> 
> Best Regards
> Feifei
> 
> > -Original Message-
> > From: Slava Ovsiienko 
> > Sent: Tuesday, April 20, 2021 15:54
> > To: Feifei Wang ; Matan Azrad
> > ; Shahaf Shuler 
> > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > ; nd ; nd ; nd
> > 
> > Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > cache
> >
> > Hi, Feifei
> >
> > In my opinion, there should be 2 barriers:
> >  - after global cache update/before altering dev_gen, to ensure the
> > correct order
> >  - after altering dev_gen to make this change visible for other agents
> > and to trigger local cache update
> >
> > With best regards,
> > Slava
> >
> > > -Original Message-
> > > From: Feifei Wang 
> > > Sent: Tuesday, April 20, 2021 10:30
> > > To: Slava Ovsiienko ; Matan Azrad
> > > ; Shahaf Shuler 
> > > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > > ; nd ; nd ; nd
> > > 
> > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory
> > > Region cache
> > >
> > > Hi, Slava
> > >
> > > Another question suddenly occurred to me, in order to keep the order
> > > that rebuilding global cache before updating ”dev_gen“, the wmb
> > > should be before updating "dev_gen" rather than after it.
> > > Otherwise, in the out-of-order platforms, current order cannot be kept.
> > >
> > > Thus, we should change the code as:
> > > a) rebuild global cache;
> > > b) rte_smp_wmb();
> > > c) updating dev_gen
> > >
> > > Best Regards
> > > Feifei
> > > > -Original Message-
> > > > From: Feifei Wang
> > > > Sent: Tuesday, April 20, 2021 13:54
> > > > To: Slava Ovsiienko ; Matan Azrad
> > > > ; Shahaf Shuler 
> > > > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> > > > ; nd ; nd 
> > > > Subject: Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > > > cache
> > > >
> > > > Hi, Slava
> > > >
> > > > Thanks very much for your explanation.
> > > >
> > > > I can understand the app can wait all mbufs are returned to the
> > > > memory pool, and then it can free this mbufs, I agree with this.
> > > >
> > > > As a result, I will remove the bug fix patch from this series and
> > > > just replace the smp barrier with C11 thread fence. Thanks very
> > > > much for your patient explanation again.
> > > >
> > > > Best Regards
> > > > Feifei
> > > >
> > > > > -Original Message-
> > > > > From: Slava Ovsiienko 
> > > > > Sent: Tuesday, April 20, 2021 2:51
> > > > > To: Feifei Wang ; Matan Azrad
> > > > > ; Shahaf Shuler 
> > > > > Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang

[dpdk-dev] Re: [PATCH v1 0/2] net/i40e: improve free mbuf

2021-06-21 Thread Feifei Wang
Hi, Qi

Can you help review these patches?
Thanks very much.

Best Regards
Feifei

> -Original Message-
> From: Feifei Wang 
> Sent: Thursday, May 27, 2021 16:17
> Cc: dev@dpdk.org; nd ; Feifei Wang
> 
> Subject: [PATCH v1 0/2] net/i40e: improve free mbuf
> 
> For i40e Tx path, use bulk free of the buffers when mbuf fast free mode is
> enabled. This can efficiently improve the performance.
> 
> Feifei Wang (2):
>   net/i40e: improve performance for scalar Tx
>   net/i40e: improve performance for vector Tx
> 
>  drivers/net/i40e/i40e_rxtx.c|  5 -
>  drivers/net/i40e/i40e_rxtx_vec_common.h | 11 +++
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> --
> 2.25.1



[dpdk-dev] Re: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA unmap

2021-06-21 Thread Feifei Wang
Hi, Slava

Would you please help review this patch?
Thanks.

Best Regards
Feifei

> -Original Message-
> From: Feifei Wang 
> Sent: Thursday, May 27, 2021 17:48
> To: Matan Azrad ; Shahaf Shuler
> ; Viacheslav Ovsiienko 
> Cc: dev@dpdk.org; nd ; Feifei Wang
> ; shah...@mellanox.com; sta...@dpdk.org;
> Ruifeng Wang 
> Subject: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA unmap
> 
> For mlx5 DMA unmap, write lock should be used for rebuilding memory
> region cache table rather than read lock.
> 
> Fixes: 989e999d9305 ("net/mlx5: support PCI device DMA map and unmap")
> Cc: shah...@mellanox.com
> Cc: sta...@dpdk.org
> 
> Signed-off-by: Feifei Wang 
> Reviewed-by: Ruifeng Wang 
> ---
>  drivers/net/mlx5/mlx5_mr.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c index
> e791b6338d..45a122f4f9 100644
> --- a/drivers/net/mlx5/mlx5_mr.c
> +++ b/drivers/net/mlx5/mlx5_mr.c
> @@ -395,10 +395,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev,
> void *addr,
>   }
>   priv = dev->data->dev_private;
>   sh = priv->sh;
> - rte_rwlock_read_lock(&sh->share_cache.rwlock);
> + rte_rwlock_write_lock(&sh->share_cache.rwlock);
>   mr = mlx5_mr_lookup_list(&sh->share_cache, &entry,
> (uintptr_t)addr);
>   if (!mr) {
> - rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> + rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>   DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't
> registered "
>"to PCI device %p", (uintptr_t)addr,
>(void *)pdev);
> @@ -423,7 +423,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev, void
> *addr,
>   DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d",
> sh->share_cache.dev_gen);
>   rte_smp_wmb();
> - rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> + rte_rwlock_write_unlock(&sh->share_cache.rwlock);
>   return 0;
>  }
> 
> --
> 2.25.1



[dpdk-dev] Re: [PATCH v1 1/2] devtools: add relative path support for ABI compatibility check

2021-06-21 Thread Feifei Wang
Hi, Bruce

Would you please help review this patch series?
Thanks.

Best Regards
Feifei

> -Original Message-
> From: Feifei Wang 
> Sent: Tuesday, June 1, 2021 9:57
> To: Bruce Richardson 
> Cc: dev@dpdk.org; nd ; Phil Yang ;
> Feifei Wang ; Juraj Linkeš
> ; Ruifeng Wang 
> Subject: [PATCH v1 1/2] devtools: add relative path support for ABI
> compatibility check
> 
> From: Phil Yang 
> 
> Because the DPDK guide does not restrict the relative path for the ABI
> compatibility check, users may set 'DPDK_ABI_REF_DIR' to a relative
> path:
> 
> ~/dpdk/devtools$ DPDK_ABI_REF_VERSION=v19.11
> DPDK_ABI_REF_DIR=build-gcc-shared ./test-meson-builds.sh
> 
> And if the DESTDIR is not an absolute path, ninja complains:
> + install_target build-gcc-shared/v19.11/build
> + build-gcc-shared/v19.11/build-gcc-shared
> + rm -rf build-gcc-shared/v19.11/build-gcc-shared
> + echo 'DESTDIR=build-gcc-shared/v19.11/build-gcc-shared ninja -C build-gcc-
> shared/v19.11/build install'
> + DESTDIR=build-gcc-shared/v19.11/build-gcc-shared
> + ninja -C build-gcc-shared/v19.11/build install
> ...
> ValueError: dst_dir must be absolute, got build-gcc-shared/v19.11/build-gcc-
> shared/usr/local/share/dpdk/
> examples/bbdev_app
> ...
> Error: install directory 'build-gcc-shared/v19.11/build-gcc-shared' does not
> exist.
> 
> To fix this, add relative path support using 'readlink -f'.
> 
> Signed-off-by: Phil Yang 
> Signed-off-by: Feifei Wang 
> Reviewed-by: Juraj Linkeš 
> Reviewed-by: Ruifeng Wang 
> ---
>  devtools/test-meson-builds.sh | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/devtools/test-meson-builds.sh b/devtools/test-meson-builds.sh
> index daf817ac3e..43b906598d 100755
> --- a/devtools/test-meson-builds.sh
> +++ b/devtools/test-meson-builds.sh
> @@ -168,7 +168,8 @@ build () #check> [meson options]
>   config $srcdir $builds_dir/$targetdir $cross --werror $*
>   compile $builds_dir/$targetdir
>   if [ -n "$DPDK_ABI_REF_VERSION" -a "$abicheck" = ABI ] ; then
> - abirefdir=${DPDK_ABI_REF_DIR:-
> reference}/$DPDK_ABI_REF_VERSION
> + abirefdir=$(readlink -f \
> + ${DPDK_ABI_REF_DIR:-
> reference}/$DPDK_ABI_REF_VERSION)
>   if [ ! -d $abirefdir/$targetdir ]; then
>   # clone current sources
>   if [ ! -d $abirefdir/src ]; then
> --
> 2.25.1



[dpdk-dev] Re: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

2021-06-22 Thread Feifei Wang
Hi, Beilei

Thanks for your comments, please see below.

> -Original Message-
> From: Xing, Beilei 
> Sent: Tuesday, June 22, 2021 14:08
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> 
> Subject: RE: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx
> 
> 
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Thursday, May 27, 2021 4:17 PM
> > To: Xing, Beilei 
> > Cc: dev@dpdk.org; n...@arm.com; Feifei Wang ;
> > Ruifeng Wang 
> > Subject: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx
> >
> > For i40e scalar Tx path, if implement FAST_FREE_MBUF mode, it means
> > per- queue all mbufs come from the same mempool and have refcnt = 1.
> >
> > Thus we can use bulk free of the buffers when mbuf fast free mode is
> > enabled.
> >
> > For scalar path in arm platform:
> > In n1sdp, performance is improved by 7.8%; In thunderx2, performance
> > is improved by 6.7%.
> >
> > For scalar path in x86 platform,
> > performance is improved by 6%.
> >
> > Suggested-by: Ruifeng Wang 
> > Signed-off-by: Feifei Wang 
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > b/drivers/net/i40e/i40e_rxtx.c index
> > 6c58decece..fe7b20f750 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -1295,6 +1295,7 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)  {
> > struct i40e_tx_entry *txep;
> > uint16_t i;
> > +   struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> >
> > if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> >
>   rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) != @@ -1308,9
> +1309,11
> > @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> >
> > if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
> > for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
> > -   rte_mempool_put(txep->mbuf->pool, txep->mbuf);
> > +   free[i] = txep->mbuf;
> 
> The tx_rs_thresh can be 'nb_desc - 3', so if tx_rs_thres >
> RTE_I40E_TX_MAX_FREE_BUF_SZ, there'll be out of bounds, right?

Actually tx_rs_thresh <= tx_free_thresh < nb_desc - 3
(i40e_dev_tx_queue_setup).
However, I don't know how it affects the relationship between tx_rs_thresh and
RTE_I40E_TX_MAX_FREE_BUF_SZ.

Furthermore, I think you are right that tx_rs_thresh can be greater than
RTE_I40E_TX_MAX_FREE_BUF_SZ in tx_simple_mode (i40e_set_tx_function_flag).

Thus, in the scalar path, we can change it like:
---
int n = txq->tx_rs_thresh;
int32_t i = 0, j = 0;
const int32_t k = RTE_ALIGN_FLOOR(n, RTE_I40E_TX_MAX_FREE_BUF_SZ);
const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];

For FAST_FREE_MODE:

if (k) {
	for (j = 0; j != k - RTE_I40E_TX_MAX_FREE_BUF_SZ;
			j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
		for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
			free[i] = txep->mbuf;
			txep->mbuf = NULL;
		}
		rte_mempool_put_bulk(free[0]->pool, (void **)free,
				RTE_I40E_TX_MAX_FREE_BUF_SZ);
	}
} else {
	for (i = 0; i < m; ++i, ++txep) {
		free[i] = txep->mbuf;
		txep->mbuf = NULL;
	}
	rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
}
---

Best Regards
Feifei


[dpdk-dev] Re: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

2021-06-22 Thread Feifei Wang
Sorry, there was a mistake in the code; it should be:

int n = txq->tx_rs_thresh;
int32_t i = 0, j = 0;
const int32_t k = RTE_ALIGN_FLOOR(n, RTE_I40E_TX_MAX_FREE_BUF_SZ);
const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];

For FAST_FREE_MODE:

if (k) {
	for (j = 0; j != k - RTE_I40E_TX_MAX_FREE_BUF_SZ;
			j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
		for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
			free[i] = txep->mbuf;
			txep->mbuf = NULL;
		}
		rte_mempool_put_bulk(free[0]->pool, (void **)free,
				RTE_I40E_TX_MAX_FREE_BUF_SZ);
	}
}

if (m) {
	for (i = 0; i < m; ++i, ++txep) {
		free[i] = txep->mbuf;
		txep->mbuf = NULL;
	}
	rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
}
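
As a quick sanity check of the split (using the defaults from this series):
with tx_rs_thresh = 32 and RTE_I40E_TX_MAX_FREE_BUF_SZ = 64, RTE_ALIGN_FLOOR
gives k = 0 and m = 32, so only the remainder branch runs; with
tx_rs_thresh = 256, k = 256 and m = 0, so only the bulk loop runs.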



[dpdk-dev] Re: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

2021-06-25 Thread Feifei Wang


> > int n = txq->tx_rs_thresh;
> > int32_t i = 0, j = 0;
> > const int32_t k = RTE_ALIGN_FLOOR(n, RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
> > struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> >
> > For FAST_FREE_MODE:
> >
> > if (k) {
> > 	for (j = 0; j != k - RTE_I40E_TX_MAX_FREE_BUF_SZ;
> > 			j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
> > 		for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
> > 			free[i] = txep->mbuf;
> > 			txep->mbuf = NULL;
> > 		}
> > 		rte_mempool_put_bulk(free[0]->pool, (void **)free,
> > 				RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > 	}
> > }
> >
> > if (m) {
> > 	for (i = 0; i < m; ++i, ++txep) {
> > 		free[i] = txep->mbuf;
> > 		txep->mbuf = NULL;
> > 	}
> > 	rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
> > }

> Seems no logical problem, but the code looks heavy due to for loops.
> Did you run performance with this change when tx_rs_thresh >
> RTE_I40E_TX_MAX_FREE_BUF_SZ?

Sorry for my late reply. It took me some time to run the test for this path;
following are my test results:

First, I came up with another way to solve this bug and compared it with
the "loop" approach (where the size of 'free' is 64).
That is, set the size of 'free' to a large constant. We know:
tx_rs_thresh < ring_desc_size < I40E_MAX_RING_DESC(4096), so we can directly
define:
struct rte_mbuf *free[I40E_MAX_RING_DESC];

[1]Test Config:
MRR Test: two ports & bi-directional flows & one core
RX API: i40e_recv_pkts_bulk_alloc
TX API: i40e_xmit_pkts_simple
ring_descs_size: 1024
Ring_I40E_TX_MAX_FREE_SZ: 64

[2]Scheme:
tx_rs_thresh =  I40E_DEFAULT_TX_RSBIT_THRESH
tx_free_thresh = I40E_DEFAULT_TX_FREE_THRESH
tx_rs_thresh <= tx_free_thresh < nb_tx_desc
So we change the value of 'tx_rs_thresh' by adjusting I40E_DEFAULT_TX_RSBIT_THRESH.

[3] Test Results (performance improvement):

In X86:
tx_rs_thresh/tx_free_thresh                32/32    256/256   512/512
1. mempool_put (base)                      0        0         0
2. mempool_put_bulk: loop                  +4.7%    +5.6%     +7.0%
3. mempool_put_bulk: large size for free   +3.8%    +2.3%     -2.0%
   (free[I40E_MAX_RING_DESC])

In Arm:
N1SDP:
tx_rs_thresh/tx_free_thresh                32/32    256/256   512/512
1. mempool_put (base)                      0        0         0
2. mempool_put_bulk: loop                  +7.9%    +9.1%     +2.9%
3. mempool_put_bulk: large size for free   +7.1%    +8.7%     +3.4%
   (free[I40E_MAX_RING_DESC])

Thunderx2:
tx_rs_thresh/tx_free_thresh                32/32    256/256   512/512
1. mempool_put (base)                      0        0         0
2. mempool_put_bulk: loop                  +7.6%    +10.5%    +7.6%
3. mempool_put_bulk: large size for free   +1.7%    +18.4%    +10.2%
   (free[I40E_MAX_RING_DESC])

As a result, I feel the 'loop' approach is better, and it does not seem
very heavy according to the test.
What are your views? Looking forward to your reply.
Thanks a lot.


[dpdk-dev] Re: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx

2021-06-27 Thread Feifei Wang

> -Original Message-
> From: Xing, Beilei 
> Sent: Monday, June 28, 2021 10:27
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> ; nd ; nd 
> Subject: RE: [PATCH v1 1/2] net/i40e: improve performance for scalar Tx
> 
> 
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Friday, June 25, 2021 5:40 PM
> > To: Xing, Beilei 
> > Cc: dev@dpdk.org; nd ; Ruifeng Wang
> > ; nd ; nd 
> > Subject: Re: [PATCH v1 1/2] net/i40e: improve performance for scalar
> > Tx
> >
> > 
> >
> > > > int n = txq->tx_rs_thresh;
> > > >  int32_t i = 0, j = 0;
> > > > const int32_t k = RTE_ALIGN_FLOOR(n,
> RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > > > const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ; struct
> rte_mbuf
> > > > *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> > > >
> > > > For FAST_FREE_MODE:
> > > >
> > > > if (k) {
> > > > for (j = 0; j != k - RTE_I40E_TX_MAX_FREE_BUF_SZ;
> > > > j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
> > > > for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
> > > > free[i] = txep->mbuf;
> > > > txep->mbuf = NULL;
> > > > }
> > > > rte_mempool_put_bulk(free[0]->pool, (void **)free,
> > > > RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > > > }
> > > >  }
> > > >
> > > > if (m) {
> > > > for (i = 0; i < m; ++i, ++txep) {
> > > > free[i] = txep->mbuf;
> > > > txep->mbuf = NULL;
> > > > }
> > > >  }
> > > >  rte_mempool_put_bulk(free[0]->pool, (void **)free, m); }
> >
> > > Seems no logical problem, but the code looks heavy due to for loops.
> > > Did you run performance with this change when tx_rs_thresh >
> > > RTE_I40E_TX_MAX_FREE_BUF_SZ?
> >
> > Sorry for my late rely. It takes me some time to do the test for this
> > path and following is my test results:
> >
> > First, I come up with another way to solve this bug and compare it
> > with "loop"(size of 'free' is 64).
> > That is set the size of 'free' as a large constant. We know:
> > tx_rs_thresh < ring_desc_size < I40E_MAX_RING_DESC(4096), so we can
> > directly define as:
> > struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> >
> > [1]Test Config:
> > MRR Test: two porst & bi-directional flows & one core RX API:
> > i40e_recv_pkts_bulk_alloc TX API: i40e_xmit_pkts_simple
> > ring_descs_size: 1024
> > Ring_I40E_TX_MAX_FREE_SZ: 64
> >
> > [2]Scheme:
> > tx_rs_thresh =  I40E_DEFAULT_TX_RSBIT_THRESH tx_free_thresh =
> > I40E_DEFAULT_TX_FREE_THRESH tx_rs_thresh <= tx_free_thresh <
> > nb_tx_desc So we change the value of 'tx_rs_thresh' by adjust
> > I40E_DEFAULT_TX_RSBIT_THRESH
> >
> > [3]Test Results (performance improve):
> > In X86:
> > tx_rs_thresh/ tx_free_thresh   32/32  256/256   
> >512/512
> > 1.mempool_put(base)   0  0  
> >   0
> > 2.mempool_put_bulk:loop   +4.7% +5.6%   
> > +7.0%
> > 3.mempool_put_bulk:large size for free   +3.8%  +2.3%   
> > -2.0%
> > (free[I40E_MAX_RING_DESC])
> >
> > In Arm:
> > N1SDP:
> > tx_rs_thresh/ tx_free_thresh   32/32  256/256   
> >512/512
> > 1.mempool_put(base)   0  0  
> >   0
> > 2.mempool_put_bulk:loop   +7.9% +9.1%   
> > +2.9%
> > 3.mempool_put_bulk:large size for free+7.1% +8.7%   
> > +3.4%
> > (free[I40E_MAX_RING_DESC])
> >
> > Thunderx2:
> > tx_rs_thresh/ tx_free_thresh   32/32  256/256   
> >512/512
> > 1.mempool_put(base)   0  0  
> >   0
> > 2.mempool_put_bulk:loop   +7.6% +10.5%  
> >+7.6%
> > 3.mempool_put_bulk:large size for free+1.7% +18.4% 
> > +10.2%
> > (free[I40E_MAX_RING_DESC])
> >
> > As a result, I feel maybe 'loop' is better and it seems not very heavy
> > according to the test.
> > What about your views and look forward to your reply.
> > Thanks a lot.
> 
> Thanks for your patch and test.
> It looks OK for me, please send V2.
Thanks for the review, I will send the v2 version.


[dpdk-dev] [PATCH v2 0/2] net/i40e: improve free mbuf for Tx

2021-06-29 Thread Feifei Wang
For the i40e Tx path, use bulk free of the buffers when mbuf fast free
mode is enabled. This can effectively improve performance.

v2:
1. fix bug when tx_rs_thres > RTE_I40E_TX_MAX_FREE_BUF_SZ (Beilei)

Feifei Wang (2):
  net/i40e: improve performance for scalar Tx
  net/i40e: improve performance for vector Tx

 drivers/net/i40e/i40e_rxtx.c| 26 +
 drivers/net/i40e/i40e_rxtx_vec_common.h | 11 +++
 2 files changed, 33 insertions(+), 4 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v2 1/2] net/i40e: improve performance for scalar Tx

2021-06-29 Thread Feifei Wang
For the i40e scalar Tx path, if FAST_FREE_MBUF mode is implemented, it
means all mbufs of a queue come from the same mempool and have refcnt = 1.

Thus we can use bulk free of the buffers when mbuf fast free mode is
enabled.

Following are the test results with this patch:

MRR L3FWD Test:
two ports & bi-directional flows & one core
RX API: i40e_recv_pkts_bulk_alloc
TX API: i40e_xmit_pkts_simple
ring_descs_size = 1024;
Ring_I40E_TX_MAX_FREE_SZ = 64;
tx_rs_thresh = I40E_DEFAULT_TX_RSBIT_THRESH = 32;
tx_free_thresh = I40E_DEFAULT_TX_FREE_THRESH = 32;

For scalar path in arm platform with default 'tx_rs_thresh':
In n1sdp, performance is improved by 7.9%;
In thunderx2, performance is improved by 7.6%.

For scalar path in x86 platform with default 'tx_rs_thresh':
performance is improved by 4.7%.

Suggested-by: Ruifeng Wang 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 drivers/net/i40e/i40e_rxtx.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..8c72391cde 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1294,7 +1294,11 @@ static __rte_always_inline int
 i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 {
struct i40e_tx_entry *txep;
-   uint16_t i;
+   int n = txq->tx_rs_thresh;
+   uint16_t i = 0, j = 0;
+   struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+   const int32_t k = RTE_ALIGN_FLOOR(n, RTE_I40E_TX_MAX_FREE_BUF_SZ);
+   const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
 
if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
@@ -1307,9 +1311,23 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
rte_prefetch0((txep + i)->mbuf);
 
if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
-   for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-   rte_mempool_put(txep->mbuf->pool, txep->mbuf);
-   txep->mbuf = NULL;
+   if (k) {
+   for (j = 0; j != k; j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
+   for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
+   free[i] = txep->mbuf;
+   txep->mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free,
+   RTE_I40E_TX_MAX_FREE_BUF_SZ);
+   }
+   }
+
+   if (m) {
+   for (i = 0; i < m; ++i, ++txep) {
+   free[i] = txep->mbuf;
+   txep->mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
}
} else {
for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-- 
2.25.1



[dpdk-dev] [PATCH v2 2/2] net/i40e: improve performance for vector Tx

2021-06-29 Thread Feifei Wang
For the i40e vector Tx path, if tx_offload is set to FAST_FREE_MBUF mode,
no mbuf fast free operations are executed. To fix this, add mbuf fast
free mode to the vector Tx path.

Furthermore, for the i40e vector Tx path, if FAST_FREE_MBUF mode is
implemented, all mbufs of a queue come from the same mempool and have
refcnt = 1. Thus we can use bulk free of the buffers when mbuf fast free
mode is enabled.

For vector path in arm platform:
In n1sdp, performance is improved by 18.4%;
In thunderx2, performance is improved by 23%.

For vector path in x86 platform:
No performance changes.

Suggested-by: Ruifeng Wang 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 16fcf0aec6..f52ed98d62 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -99,6 +99,16 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
  * tx_next_dd - (tx_rs_thresh-1)
  */
txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+   if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
+   for (i = 0; i < n; i++) {
+   free[i] = txep[i].mbuf;
+   txep[i].mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
+   goto done;
+   }
+
m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
if (likely(m != NULL)) {
free[0] = m;
@@ -126,6 +136,7 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
}
}
 
+done:
/* buffers were freed, update counters */
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
-- 
2.25.1



[dpdk-dev] Re: [PATCH v2 1/2] net/i40e: improve performance for scalar Tx

2021-06-29 Thread Feifei Wang

> -Original Message-
> From: Xing, Beilei 
> Sent: Wednesday, June 30, 2021 11:43
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> 
> Subject: RE: [PATCH v2 1/2] net/i40e: improve performance for scalar Tx
> 
> 
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Wednesday, June 30, 2021 10:04 AM
> > To: Xing, Beilei 
> > Cc: dev@dpdk.org; n...@arm.com; Feifei Wang ;
> > Ruifeng Wang 
> > Subject: [PATCH v2 1/2] net/i40e: improve performance for scalar Tx
> >
> > For i40e scalar Tx path, if implement FAST_FREE_MBUF mode, it means
> > per- queue all mbufs come from the same mempool and have refcnt = 1.
> >
> > Thus we can use bulk free of the buffers when mbuf fast free mode is
> > enabled.
> >
> > Following are the test results with this patch:
> >
> > MRR L3FWD Test:
> > two ports & bi-directional flows & one core RX API:
> > i40e_recv_pkts_bulk_alloc TX API: i40e_xmit_pkts_simple
> > ring_descs_size = 1024; Ring_I40E_TX_MAX_FREE_SZ = 64; tx_rs_thresh =
> > I40E_DEFAULT_TX_RSBIT_THRESH = 32; tx_free_thresh =
> > I40E_DEFAULT_TX_FREE_THRESH = 32;
> >
> > For scalar path in arm platform with default 'tx_rs_thresh':
> > In n1sdp, performance is improved by 7.9%; In thunderx2, performance
> > is improved by 7.6%.
> >
> > For scalar path in x86 platform with default 'tx_rs_thresh':
> > performance is improved by 4.7%.
> >
> > Suggested-by: Ruifeng Wang 
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >  drivers/net/i40e/i40e_rxtx.c | 26 ++
> >  1 file changed, 22 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx.c
> > b/drivers/net/i40e/i40e_rxtx.c index 6c58decece..8c72391cde 100644
> > --- a/drivers/net/i40e/i40e_rxtx.c
> > +++ b/drivers/net/i40e/i40e_rxtx.c
> > @@ -1294,7 +1294,11 @@ static __rte_always_inline int
> > i40e_tx_free_bufs(struct i40e_tx_queue *txq)  {
> > struct i40e_tx_entry *txep;
> > -   uint16_t i;
> > +   int n = txq->tx_rs_thresh;
> 
> Thanks for the patch, just little comment, can we use 'tx_rs_thresh' to
> replace 'n' to make it more readable?
Good comment, I will update it, thanks.

> 
> > +   uint16_t i = 0, j = 0;
> > +   struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
> > +   const int32_t k = RTE_ALIGN_FLOOR(n,
> > RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > +   const int32_t m = n % RTE_I40E_TX_MAX_FREE_BUF_SZ;
> >
> > if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
> >
>   rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) != @@ -1307,9
> +1311,23
> > @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> > rte_prefetch0((txep + i)->mbuf);
> >
> > if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
> > -   for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
> > -   rte_mempool_put(txep->mbuf->pool, txep->mbuf);
> > -   txep->mbuf = NULL;
> > +   if (k) {
> > +   for (j = 0; j != k; j +=
> RTE_I40E_TX_MAX_FREE_BUF_SZ)
> > {
> > +   for (i = 0; i <
> RTE_I40E_TX_MAX_FREE_BUF_SZ;
> > ++i, ++txep) {
> > +   free[i] = txep->mbuf;
> > +   txep->mbuf = NULL;
> > +   }
> > +   rte_mempool_put_bulk(free[0]->pool, (void
> > **)free,
> > +
> > RTE_I40E_TX_MAX_FREE_BUF_SZ);
> > +   }
> > +   }
> > +
> > +   if (m) {
> > +   for (i = 0; i < m; ++i, ++txep) {
> > +   free[i] = txep->mbuf;
> > +   txep->mbuf = NULL;
> > +   }
> > +   rte_mempool_put_bulk(free[0]->pool, (void **)free,
> > m);
> > }
> > } else {
> > for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
> > --
> > 2.25.1



[dpdk-dev] [PATCH v3 0/2] net/i40e: improve free mbuf for Tx

2021-06-29 Thread Feifei Wang
For the i40e Tx path, use bulk free of the buffers when mbuf fast free
mode is enabled. This can effectively improve performance.

v2:
1. fix bug when tx_rs_thres > RTE_I40E_TX_MAX_FREE_BUF_SZ (Beilei)

v3:
1. change variable name for more readable (Beilei)

Feifei Wang (2):
  net/i40e: improve performance for scalar Tx
  net/i40e: improve performance for vector Tx

 drivers/net/i40e/i40e_rxtx.c| 30 -
 drivers/net/i40e/i40e_rxtx_vec_common.h | 11 +
 2 files changed, 35 insertions(+), 6 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v3 1/2] net/i40e: improve performance for scalar Tx

2021-06-29 Thread Feifei Wang
For the i40e scalar Tx path, if FAST_FREE_MBUF mode is implemented, it
means all mbufs of a queue come from the same mempool and have refcnt = 1.

Thus we can use bulk free of the buffers when mbuf fast free mode is
enabled.

Following are the test results with this patch:

MRR L3FWD Test:
two ports & bi-directional flows & one core
RX API: i40e_recv_pkts_bulk_alloc
TX API: i40e_xmit_pkts_simple
ring_descs_size = 1024;
Ring_I40E_TX_MAX_FREE_SZ = 64;
tx_rs_thresh = I40E_DEFAULT_TX_RSBIT_THRESH = 32;
tx_free_thresh = I40E_DEFAULT_TX_FREE_THRESH = 32;

For scalar path in arm platform with default 'tx_rs_thresh':
In n1sdp, performance is improved by 7.9%;
In thunderx2, performance is improved by 7.6%.

For scalar path in x86 platform with default 'tx_rs_thresh':
performance is improved by 4.7%.

Suggested-by: Ruifeng Wang 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 drivers/net/i40e/i40e_rxtx.c | 30 --
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx.c b/drivers/net/i40e/i40e_rxtx.c
index 6c58decece..0d3482a9d2 100644
--- a/drivers/net/i40e/i40e_rxtx.c
+++ b/drivers/net/i40e/i40e_rxtx.c
@@ -1294,22 +1294,40 @@ static __rte_always_inline int
 i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 {
struct i40e_tx_entry *txep;
-   uint16_t i;
+   uint16_t tx_rs_thresh = txq->tx_rs_thresh;
+   uint16_t i = 0, j = 0;
+   struct rte_mbuf *free[RTE_I40E_TX_MAX_FREE_BUF_SZ];
+   const uint16_t k = RTE_ALIGN_FLOOR(tx_rs_thresh, RTE_I40E_TX_MAX_FREE_BUF_SZ);
+   const uint16_t m = tx_rs_thresh % RTE_I40E_TX_MAX_FREE_BUF_SZ;
 
if ((txq->tx_ring[txq->tx_next_dd].cmd_type_offset_bsz &
rte_cpu_to_le_64(I40E_TXD_QW1_DTYPE_MASK)) !=
rte_cpu_to_le_64(I40E_TX_DESC_DTYPE_DESC_DONE))
return 0;
 
-   txep = &(txq->sw_ring[txq->tx_next_dd - (txq->tx_rs_thresh - 1)]);
+   txep = &txq->sw_ring[txq->tx_next_dd - (tx_rs_thresh - 1)];
 
-   for (i = 0; i < txq->tx_rs_thresh; i++)
+   for (i = 0; i < tx_rs_thresh; i++)
rte_prefetch0((txep + i)->mbuf);
 
if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
-   for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-   rte_mempool_put(txep->mbuf->pool, txep->mbuf);
-   txep->mbuf = NULL;
+   if (k) {
+   for (j = 0; j != k; j += RTE_I40E_TX_MAX_FREE_BUF_SZ) {
+   for (i = 0; i < RTE_I40E_TX_MAX_FREE_BUF_SZ; ++i, ++txep) {
+   free[i] = txep->mbuf;
+   txep->mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free,
+   RTE_I40E_TX_MAX_FREE_BUF_SZ);
+   }
+   }
+
+   if (m) {
+   for (i = 0; i < m; ++i, ++txep) {
+   free[i] = txep->mbuf;
+   txep->mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free, m);
}
} else {
for (i = 0; i < txq->tx_rs_thresh; ++i, ++txep) {
-- 
2.25.1



[dpdk-dev] [PATCH v3 2/2] net/i40e: improve performance for vector Tx

2021-06-29 Thread Feifei Wang
For the i40e vector Tx path, if tx_offload is set to FAST_FREE_MBUF mode,
no mbuf fast free operations are executed. To fix this, add mbuf fast
free mode to the vector Tx path.

Furthermore, for the i40e vector Tx path, if FAST_FREE_MBUF mode is
implemented, all mbufs of a queue come from the same mempool and have
refcnt = 1. Thus we can use bulk free of the buffers when mbuf fast free
mode is enabled.

For vector path in arm platform:
In n1sdp, performance is improved by 18.4%;
In thunderx2, performance is improved by 23%.

For vector path in x86 platform:
No performance changes.

Suggested-by: Ruifeng Wang 
Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 16fcf0aec6..f52ed98d62 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -99,6 +99,16 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
  * tx_next_dd - (tx_rs_thresh-1)
  */
txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+
+   if (txq->offloads & DEV_TX_OFFLOAD_MBUF_FAST_FREE) {
+   for (i = 0; i < n; i++) {
+   free[i] = txep[i].mbuf;
+   txep[i].mbuf = NULL;
+   }
+   rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
+   goto done;
+   }
+
m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
if (likely(m != NULL)) {
free[0] = m;
@@ -126,6 +136,7 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
}
}
 
+done:
/* buffers were freed, update counters */
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free + txq->tx_rs_thresh);
txq->tx_next_dd = (uint16_t)(txq->tx_next_dd + txq->tx_rs_thresh);
-- 
2.25.1



[dpdk-dev] Re: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA unmap

2021-07-01 Thread Feifei Wang
Hi, Slava

That's OK. Thanks for your review.

Best Regards
Feifei

> -Original Message-
> From: Slava Ovsiienko 
> Sent: Thursday, July 1, 2021 22:27
> To: Feifei Wang ; Matan Azrad
> ; Shahaf Shuler 
> Cc: dev@dpdk.org; nd ; Shahaf Shuler
> ; sta...@dpdk.org; Ruifeng Wang
> ; nd 
> Subject: RE: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA unmap
> 
> Hi, Feifei
> 
> Sorry for the delayed review.
> I think it is a good catch, thank you for the patch.
> 
> Acked-by: Viacheslav Ovsiienko 
> 
> With best regards,
> Slava
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Tuesday, June 22, 2021 4:54
> > To: Feifei Wang ; Matan Azrad
> > ; Shahaf Shuler ; Slava
> > Ovsiienko 
> > Cc: dev@dpdk.org; nd ; Shahaf Shuler
> ;
> > sta...@dpdk.org; Ruifeng Wang ; nd
> 
> > Subject: Re: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA
> > unmap
> >
> > Hi, Slava
> >
> > Would you please help review this patch?
> > Thanks.
> >
> > Best Regards
> > Feifei
> >
> > > -Original Message-
> > > From: Feifei Wang 
> > > Sent: Thursday, May 27, 2021 17:48
> > > To: Matan Azrad ; Shahaf Shuler
> > > ; Viacheslav Ovsiienko 
> > > Cc: dev@dpdk.org; nd ; Feifei Wang
> > ;
> > > shah...@mellanox.com; sta...@dpdk.org; Ruifeng Wang
> > > 
> > > Subject: [PATCH] net/mlx5: fix incorrect r/w lock usage in DMA unmap
> > >
> > > For mlx5 DMA unmap, write lock should be used for rebuilding memory
> > > region cache table rather than read lock.
> > >
> > > Fixes: 989e999d9305 ("net/mlx5: support PCI device DMA map and
> > > unmap")
> > > Cc: shah...@mellanox.com
> > > Cc: sta...@dpdk.org
> > >
> > > Signed-off-by: Feifei Wang 
> > > Reviewed-by: Ruifeng Wang 
> > > ---
> > >  drivers/net/mlx5/mlx5_mr.c | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/mlx5/mlx5_mr.c b/drivers/net/mlx5/mlx5_mr.c
> > > index
> > > e791b6338d..45a122f4f9 100644
> > > --- a/drivers/net/mlx5/mlx5_mr.c
> > > +++ b/drivers/net/mlx5/mlx5_mr.c
> > > @@ -395,10 +395,10 @@ mlx5_dma_unmap(struct rte_pci_device *pdev,
> > void
> > > *addr,
> > >   }
> > >   priv = dev->data->dev_private;
> > >   sh = priv->sh;
> > > - rte_rwlock_read_lock(&sh->share_cache.rwlock);
> > > + rte_rwlock_write_lock(&sh->share_cache.rwlock);
> > >   mr = mlx5_mr_lookup_list(&sh->share_cache, &entry,
> > (uintptr_t)addr);
> > >   if (!mr) {
> > > - rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> > > + rte_rwlock_write_unlock(&sh->share_cache.rwlock);
> > >   DRV_LOG(WARNING, "address 0x%" PRIxPTR " wasn't
> > registered "
> > >"to PCI device %p", (uintptr_t)addr,
> > >(void *)pdev);
> > > @@ -423,7 +423,7 @@ mlx5_dma_unmap(struct rte_pci_device *pdev,
> void
> > > *addr,
> > >   DRV_LOG(DEBUG, "broadcasting local cache flush, gen=%d",
> > > sh->share_cache.dev_gen);
> > >   rte_smp_wmb();
> > > - rte_rwlock_read_unlock(&sh->share_cache.rwlock);
> > > + rte_rwlock_write_unlock(&sh->share_cache.rwlock);
> > >   return 0;
> > >  }
> > >
> > > --
> > > 2.25.1



Re: [RFC 1/2] eal: add llc aware functions

2024-08-28 Thread Feifei Wang
Hi,

> -Original Message-
> From: Wathsala Wathawana Vithanage 
> Sent: August 28, 2024 4:56
> To: Vipin Varghese ; ferruh.yi...@amd.com;
> dev@dpdk.org
> Cc: nd ; nd 
> Subject: RE: [RFC 1/2] eal: add llc aware functions
> 主题: RE: [RFC 1/2] eal: add llc aware functions
> 
> > -unsigned int rte_get_next_lcore(unsigned int i, int skip_main, int wrap)
> > +#define LCORE_GET_LLC   \
> > +   "ls -d /sys/bus/cpu/devices/cpu%u/cache/index[0-9] | sort  -r
> > | grep -m1 index[0-9] | awk -F '[x]' '{print $2}' "
> >
> 
> This won't work for some SOCs.
> How can we ensure the index you got is for an LLC? Some SOCs may only show
> upper-level caches here, so it cannot be used blindly without knowing the
> SOC.
> Also, unacceptable to execute a shell script, consider implementing in C.

Maybe:
For Arm, maybe we can read the MPIDR_EL1 register to obtain the CPU cluster topology.
MPIDR_EL1 register bit meaning:
[39:32] AFF3 (Level 3 affinity)
[23:16] AFF2 (Level 2 affinity)
[15:8]  AFF1 (Level 1 affinity)
[7:0]   AFF0 (Level 0 affinity)

For x86, we can use the APIC ID:
the APIC ID encodes the cluster ID, die ID, SMT ID and core ID.

This bypasses executing a shell script, and for Arm and x86 we take different
paths to implement it.
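
For illustration, a minimal user-space sketch of the Arm side (my assumption,
not tested code: it relies on the kernel trapping and emulating EL0 reads of
MPIDR_EL1, which recent Linux arm64 kernels do):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read MPIDR_EL1 and split out the affinity fields (aarch64 only). */
static inline uint64_t
read_mpidr(void)
{
	uint64_t mpidr;

	asm volatile("mrs %0, MPIDR_EL1" : "=r"(mpidr));
	return mpidr;
}

int
main(void)
{
	uint64_t m = read_mpidr();

	printf("Aff3=%" PRIu64 " Aff2=%" PRIu64 " Aff1=%" PRIu64
	       " Aff0=%" PRIu64 "\n",
	       (m >> 32) & 0xff,	/* AFF3: bits [39:32] */
	       (m >> 16) & 0xff,	/* AFF2: bits [23:16] */
	       (m >> 8) & 0xff,		/* AFF1: bits [15:8]  */
	       m & 0xff);		/* AFF0: bits [7:0]   */
	return 0;
}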

Best Regards
Feifei
> --wathsala
> 



[RFC PATCH v1] net/i40e: put mempool cache out of API

2022-06-12 Thread Feifei Wang
Referring to "i40e_tx_free_bufs_avx512", this patch puts the mempool cache
out of the API to free buffers directly. There are two changes compared
with the previous version:
1. change txep from "i40e_entry" to "i40e_vec_entry"
2. put cache out of "mempool_bulk" API to copy buffers into it directly

Performance Test with l3fwd neon path:
with this patch
n1sdp:  no performance change
ampere-altra:   +4.0%

Suggested-by: Konstantin Ananyev 
Suggested-by: Honnappa Nagarahalli 
Signed-off-by: Feifei Wang 
---
 drivers/net/i40e/i40e_rxtx_vec_common.h | 36 -
 drivers/net/i40e/i40e_rxtx_vec_neon.c   | 10 ---
 2 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h 
b/drivers/net/i40e/i40e_rxtx_vec_common.h
index 959832ed6a..e418225b4e 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -81,7 +81,7 @@ reassemble_packets(struct i40e_rx_queue *rxq, struct rte_mbuf 
**rx_bufs,
 static __rte_always_inline int
 i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 {
-   struct i40e_tx_entry *txep;
+   struct i40e_vec_tx_entry *txep;
uint32_t n;
uint32_t i;
int nb_free = 0;
@@ -98,17 +98,39 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 /* first buffer to free from S/W ring is at index
  * tx_next_dd - (tx_rs_thresh-1)
  */
-   txep = &txq->sw_ring[txq->tx_next_dd - (n - 1)];
+   txep = (void *)txq->sw_ring;
+   txep += txq->tx_next_dd - (n - 1);
 
if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
-   for (i = 0; i < n; i++) {
-   free[i] = txep[i].mbuf;
-   /* no need to reset txep[i].mbuf in vector path */
+   struct rte_mempool *mp = txep[0].mbuf->pool;
+   void **cache_objs;
+   struct rte_mempool_cache *cache = rte_mempool_default_cache(mp,
+   rte_lcore_id());
+
+   if (!cache || cache->len == 0)
+   goto normal;
+
+   cache_objs = &cache->objs[cache->len];
+
+   if (n > RTE_MEMPOOL_CACHE_MAX_SIZE) {
+   rte_mempool_ops_enqueue_bulk(mp, (void *)txep, n);
+   goto done;
+   }
+
+   rte_memcpy(cache_objs, txep, sizeof(void *) * n);
+   /* no need to reset txep[i].mbuf in vector path */
+   cache->len += n;
+
+   if (cache->len >= cache->flushthresh) {
+   rte_mempool_ops_enqueue_bulk
+   (mp, &cache->objs[cache->size],
+   cache->len - cache->size);
+   cache->len = cache->size;
}
-   rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
goto done;
}
 
+normal:
m = rte_pktmbuf_prefree_seg(txep[0].mbuf);
if (likely(m != NULL)) {
free[0] = m;
@@ -147,7 +169,7 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
 }
 
 static __rte_always_inline void
-tx_backlog_entry(struct i40e_tx_entry *txep,
+tx_backlog_entry(struct i40e_vec_tx_entry *txep,
 struct rte_mbuf **tx_pkts, uint16_t nb_pkts)
 {
int i;
diff --git a/drivers/net/i40e/i40e_rxtx_vec_neon.c 
b/drivers/net/i40e/i40e_rxtx_vec_neon.c
index 12e6f1cbcb..d2d61e8ef4 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_neon.c
+++ b/drivers/net/i40e/i40e_rxtx_vec_neon.c
@@ -680,12 +680,15 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 {
struct i40e_tx_queue *txq = (struct i40e_tx_queue *)tx_queue;
volatile struct i40e_tx_desc *txdp;
-   struct i40e_tx_entry *txep;
+   struct i40e_vec_tx_entry *txep;
uint16_t n, nb_commit, tx_id;
uint64_t flags = I40E_TD_CMD;
uint64_t rs = I40E_TX_DESC_CMD_RS | I40E_TD_CMD;
int i;
 
+   /* cross rx_thresh boundary is not allowed */
+   nb_pkts = RTE_MIN(nb_pkts, txq->tx_rs_thresh);
+
if (txq->nb_tx_free < txq->tx_free_thresh)
i40e_tx_free_bufs(txq);
 
@@ -695,7 +698,8 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 
tx_id = txq->tx_tail;
txdp = &txq->tx_ring[tx_id];
-   txep = &txq->sw_ring[tx_id];
+   txep = (void *)txq->sw_ring;
+   txep += tx_id;
 
txq->nb_tx_free = (uint16_t)(txq->nb_tx_free - nb_pkts);
 
@@ -715,7 +719,7 @@ i40e_xmit_fixed_burst_vec(void *__rte_restrict tx_queue,
 
/* avoid reach the end of ring */
txdp = &txq->tx_ring[tx_id];
-   txep = &txq->sw_ring[tx_id];
+   txep = (void *)txq->sw_ring;
}
 
tx_backlog_entry(txep, tx_pkts, nb_commit);
-- 
2.25.1



Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side

2022-06-12 Thread Feifei Wang


> -Original Message-
> From: Konstantin Ananyev 
> Sent: Tuesday, May 24, 2022 9:26 AM
> To: Feifei Wang 
> Cc: nd ; dev@dpdk.org; Ruifeng Wang
> ; Honnappa Nagarahalli
> 
> Subject: Re: [PATCH v1 0/5] Direct re-arming of buffers on receive side
> 
> [konstantin.v.anan...@yandex.ru appears similar to someone who previously
> sent you email, but may not be that person. Learn why this could be a risk at
> https://aka.ms/LearnAboutSenderIdentification.]
> 
> 16/05/2022 07:10, Feifei Wang пишет:
> >
> >>> Currently, the transmit side frees the buffers into the lcore cache
> >>> and the receive side allocates buffers from the lcore cache. The
> >>> transmit side typically frees 32 buffers resulting in 32*8=256B of
> >>> stores to lcore cache. The receive side allocates 32 buffers and
> >>> stores them in the receive side software ring, resulting in
> >>> 32*8=256B of stores and 256B of load from the lcore cache.
> >>>
> >>> This patch proposes a mechanism to avoid freeing to/allocating from
> >>> the lcore cache. i.e. the receive side will free the buffers from
> >>> transmit side directly into it's software ring. This will avoid the
> >>> 256B of loads and stores introduced by the lcore cache. It also
> >>> frees up the cache lines used by the lcore cache.
> >>>
> >>> However, this solution poses several constraints:
> >>>
> >>> 1)The receive queue needs to know which transmit queue it should
> >>> take the buffers from. The application logic decides which transmit
> >>> port to use to send out the packets. In many use cases the NIC might
> >>> have a single port ([1], [2], [3]), in which case a given transmit
> >>> queue is always mapped to a single receive queue (1:1 Rx queue: Tx
> >>> queue). This is easy to configure.
> >>>
> >>> If the NIC has 2 ports (there are several references), then we will
> >>> have
> >>> 1:2 (RX queue: TX queue) mapping which is still easy to configure.
> >>> However, if this is generalized to 'N' ports, the configuration can
> >>> be long. More over the PMD would have to scan a list of transmit
> >>> queues to pull the buffers from.
> >
> >> Just to re-iterate some generic concerns about this proposal:
> >>   - We effectively link RX and TX queues - when this feature is enabled,
> >> user can't stop TX queue without stopping linked RX queue first.
> >> Right now user is free to start/stop any queues at his will.
> >> If that feature will allow to link queues from different ports,
> >> then even ports will become dependent and user will have to pay extra
> >> care when managing such ports.
> >
> > [Feifei] When direct rearm is enabled, there are two paths for the thread
> > to choose. If there are enough Tx freed buffers, Rx can take buffers
> > from Tx. Otherwise, Rx will take buffers from the mempool as usual.
> > Thus, users do not need to pay much attention to managing ports.
> 
> What I am talking about: right now different port or different queues of the
> same port can be treated as independent entities:
> in general user is free to start/stop (and even reconfigure in some
> cases) one entity without need to stop other entity.
> I.E user can stop and re-configure TX queue while keep receiving packets
> from RX queue.
> With direct re-arm enabled, I think it wouldn't be possible any more:
> before stopping/reconfiguring TX queue user would have make sure that
> corresponding RX queue wouldn't be used by datapath.
> 
> >
> >> - very limited usage scenario - it will have a positive effect only
> >>when we have a fixed forwarding mapping: all (or nearly all) packets
> >>from the RX queue are forwarded into the same TX queue.
> >
> > [Feifei] Although the usage scenario is limited, it still covers a wide
> > range of applications, such as NICs with a single port.
> 
> yes, there are NICs with one port, but no guarantee there wouldn't be several
> such NICs within the system.
> 
> > Furthermore, I think this is a tradeoff between performance and
> > flexibility.
> > Our goal is to achieve the best performance; this means we need to give up
> > some flexibility decisively. For example, 'FAST_FREE mode' deletes most
> > of the buffer checks (refcnt > 1, external buffer, chained buffer),
> > chooses the shortest path, and thus achieves a significant performance
> > improvement.
> >> Wonder did you had a cha

Re: [RFC PATCH v1] net/i40e: put mempool cache out of API

2022-07-06 Thread Feifei Wang


> -Original Message-
> From: Konstantin Ananyev 
> Sent: Sunday, July 3, 2022 8:20 PM
> To: Feifei Wang ; Yuying Zhang
> ; Beilei Xing ; Ruifeng
> Wang 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> 
> Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache out
> > of API to free buffers directly. There are two changes different with
> > previous version:
> > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > directly
> >
> > Performance Test with l3fwd neon path:
> > with this patch
> > n1sdp:  no perforamnce change
> > amper-altra:+4.0%
> >
> 
> 
Thanks for your detailed comments.

> Thanks for RFC, appreciate your effort.
> So, as I understand - bypassing mempool put/get itself gives about 7-10%
> speedup for RX/TX on ARM platforms, correct?
[Feifei] Yes.

> 
> About direct-rearm RX approach you propose:
> After another thought, probably it is possible to re-arrange it in a way that
> would help avoid related negatives.
> The basic idea as follows:
> 
> 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> Also make sw_ring de-coupled from RXQ itself, i.e:
> when RXQ is stopped or even destroyed, related sw_ring may still
> exist (probably ref-counter or RCU would be sufficient here).
> All that means we need a common layout/api for rxq_sw_ring
> and PMDs that would like to support direct-rearming will have to
> use/obey it.
[Feifei] De-coupling the sw-ring from the RXQ may cause a dangerous case:
the RXQ is stopped but its sw-ring is still kept, and we may forget to free
this sw-ring in the end.
Furthermore, if we apply this, we need to separate the operations when closing
the RXQ and add an Rx sw-ring free operation when closing the TXQ. This will be
complex, and it is not conducive to subsequent maintenance if the maintainer
does not understand direct-rearm mode very well.

> 
> 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> at txq_free_bufs() try to store released mbufs inside attached
> sw_ring directly. If there is no attached sw_ring, or not enough
> free space in it - continue with mempool_put() as usual.
> Note that actual arming of HW RXDs still remains responsibility
> of RX code-path:
> rxq_rearm(rxq) {
>   ...
>   - check are there are N already filled entries inside rxq_sw_ring.
> if not, populate them from mempool (usual mempool_get()).
>   - arm related RXDs and mark these sw_ring entries as managed by HW.
>   ...
> }
> 
[Feifei] We tried to create two modes: one is direct-rearm and the other is
direct-free, as described above.
By performance comparison, we selected direct-rearm, which improves performance
by 7% - 14%, compared with 3.6% - 7% for direct-free, on n1sdp.
Furthermore, I think putting the direct mode on the Tx or the Rx side is
equivalent. For direct-rearm, if there is no Tx sw-ring, Rx will get mbufs from
the mempool. For direct-free, if there is no Rx sw-ring, Tx will put mbufs into
the mempool. In the end, what drives our decision is the performance
improvement.
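
To make the fallback symmetry concrete, here is a minimal sketch of the
direct-rearm side (the demo_* types and fields are hypothetical, invented only
for illustration; rte_mempool_get_bulk() is the only real DPDK call here; the
direct-free side is symmetric, with a put instead of a get):

#include <stdint.h>
#include <rte_mempool.h>

/* Hypothetical queue shapes, only to illustrate the control flow. */
struct demo_txq {
	void **free_bufs;	/* mbufs already freed by Tx completion */
	uint16_t nb_free;
};

struct demo_rxq {
	struct rte_mempool *mp;	/* fallback buffer source */
	struct demo_txq *txq;	/* mapped Tx queue, NULL if unmapped */
	void *ring[32];
	uint16_t n;		/* rearm burst size, assumed <= 32 */
};

static int
demo_rearm(struct demo_rxq *rxq)
{
	uint16_t i, n = rxq->n;

	/* Direct-rearm path: take mbufs freed by the mapped Tx queue. */
	if (rxq->txq != NULL && rxq->txq->nb_free >= n) {
		for (i = 0; i < n; i++)
			rxq->ring[i] =
				rxq->txq->free_bufs[--rxq->txq->nb_free];
		return 0;
	}
	/* No mapped Tx queue, or not enough freed buffers: mempool path. */
	return rte_mempool_get_bulk(rxq->mp, rxq->ring, n);
}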

> 
> So rxq_sw_ring will serve two purposes:
> - track mbufs that are managed by HW (that what it does now)
> - private (per RXQ) mbuf cache
> 
> Now, if TXQ is stopped while RXQ is running - no extra synchronization is
> required, RXQ would just use
> mempool_get() to rearm its sw_ring itself.
> 
> If RXQ is stopped while TXQ is still running - TXQ can still continue to 
> populate
> related sw_ring till it gets full.
> Then it will continue with mempool_put() as usual.
> Of-course it means that user who wants to use this feature should probably
> account some extra mbufs for such case, or might be rxq_sw_ring can have
> enable/disable flag to mitigate such situation.
> 
[Feifei] For direct-rearm, the key point should be the communication between
the TXQ and the RXQ when the TXQ is stopped. A de-coupled sw-ring is complex;
maybe we can simplify this and assign it to the application. My thought is that
if direct-rearm is enabled, when users want to stop the Tx port, they must
first stop the mapped Rx port and disable the direct-rearm feature; then they
can restart the Rx port (sketched below).
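
A minimal sketch of that teardown order, expressed at queue level (the
rte_eth_dev_*_queue_stop/start calls are standard ethdev API;
direct_rearm_disable() is a hypothetical placeholder for whatever mechanism
unmaps the queues):

#include <stdint.h>
#include <rte_ethdev.h>

/* Hypothetical unmap hook, standing in for the real disable mechanism. */
static void
direct_rearm_disable(uint16_t port, uint16_t rxq, uint16_t txq)
{
	(void)port; (void)rxq; (void)txq;	/* placeholder */
}

static int
demo_stop_txq(uint16_t port, uint16_t rxq, uint16_t txq)
{
	int ret;

	/* 1. Stop the mapped Rx queue first and break the mapping. */
	ret = rte_eth_dev_rx_queue_stop(port, rxq);
	if (ret != 0)
		return ret;
	direct_rearm_disable(port, rxq, txq);

	/* 2. Now the Tx queue can be stopped safely. */
	ret = rte_eth_dev_tx_queue_stop(port, txq);
	if (ret != 0)
		return ret;

	/* 3. Restart Rx; it now rearms from the mempool as usual. */
	return rte_eth_dev_rx_queue_start(port, rxq);
}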

> As another benefit here - such approach makes possible to use several TXQs
> (even from different devices) to rearm same RXQ.
[Feifei] Actually, for direct-rearm, several RXQs can rearm from the same TXQ,
so this is equivalent for direct-rearm and direct-free. Furthermore, if
multiple cores are used, I think we need to consider synchronization of
variables, and a lock is necessary.

> 
> Ha

Re: [RFC PATCH v1] net/i40e: put mempool cache out of API

2022-07-06 Thread Feifei Wang


> -Original Message-
> From: Feifei Wang
> Sent: Wednesday, July 6, 2022 4:53 PM
> To: Konstantin Ananyev ; Yuying
> Zhang ; Beilei Xing ;
> Ruifeng Wang 
> Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> ; nd 
> Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> 
> > -Original Message-
> > From: Konstantin Ananyev 
> > Sent: Sunday, July 3, 2022 8:20 PM
> > To: Feifei Wang ; Yuying Zhang
> > ; Beilei Xing ; Ruifeng
> > Wang 
> > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > 
> > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> >
> > > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> > > out of API to free buffers directly. There are two changes different
> > > with previous version:
> > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > > directly
> > >
> > > Performance Test with l3fwd neon path:
> > >   with this patch
> > > n1sdp:no performance change
> > > ampere-altra: +4.0%
> > >
> >
> >
> Thanks for your detailed comments.
> 
> > Thanks for RFC, appreciate your effort.
> > So, as I understand - bypassing mempool put/get itself gives about
> > 7-10% speedup for RX/TX on ARM platforms, correct?
> [Feifei] Yes.
> 
> >
> > About direct-rearm RX approach you propose:
> > After another thought, probably it is possible to re-arrange it in a
> > way that would help avoid related negatives.
> > The basic idea as follows:
> >
> > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> > Also make sw_ring de-coupled from RXQ itself, i.e:
> > when RXQ is stopped or even destroyed, related sw_ring may still
> > exist (probably ref-counter or RCU would be sufficient here).
> > All that means we need a common layout/api for rxq_sw_ring
> > and PMDs that would like to support direct-rearming will have to
> > use/obey it.
> [Feifei] de-coupled sw-ring and RXQ may cause dangerous case due to RXQ is
> stopped but elements of it (sw-ring) is still kept and we may forget to free
> this sw-ring in the end.
> Furthermore,  if we apply this, we need to separate operation when closing
> RXQ and add Rx sw-ring free operation when closing TXQ. This will be
> complex and it is not conducive to subsequent maintenance if maintainer
> does not understand direct-rearm mode very well.
> 
> >
> > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> > at txq_free_bufs() try to store released mbufs inside attached
> > sw_ring directly. If there is no attached sw_ring, or not enough
> > free space in it - continue with mempool_put() as usual.
> > Note that actual arming of HW RXDs still remains responsibility
> > of RX code-path:
> > rxq_rearm(rxq) {
> >   ...
> >   - check are there are N already filled entries inside rxq_sw_ring.
> > if not, populate them from mempool (usual mempool_get()).
> >   - arm related RXDs and mark these sw_ring entries as managed by HW.
> >   ...
> > }
> >
> [Feifei] We try to create two modes, one is direct-rearm and the other is
> direct-free like above.
> And by performance comparison, we select direct-rearm which improve
> performance by 7% - 14% compared with direct-free by 3.6% - 7% in n1sdp.
> Furthermore, I think put direct mode in Tx or Rx is equivalent. For direct-
> rearm, if there is no Tx sw-ring, Rx will get mbufs from mempool. For direct-
> fee, if there is no Rx sw-ring, Tx will put mbufs into mempool. At last, what
> affects our decision-making is the improvement of performance.
> 
> >
> > So rxq_sw_ring will serve two purposes:
> > - track mbufs that are managed by HW (that what it does now)
> > - private (per RXQ) mbuf cache
> >
> > Now, if TXQ is stopped while RXQ is running - no extra synchronization
> > is required, RXQ would just use
> > mempool_get() to rearm its sw_ring itself.
> >
> > If RXQ is stopped while TXQ is still running - TXQ can still continue
> > to populate related sw_ring till it gets full.
> > Then it will continue with mempool_put() as usual.
> > Of-course it means that user who wants to use this feature should
> > probably account some extra mbufs for such case, or might be
> > rxq_sw_ring can have enable/disable flag to mitigate such situation.
> >
> [Feifei] For direct-rearm, the key point 

Re: [RFC PATCH v1] net/i40e: put mempool cache out of API

2022-07-10 Thread Feifei Wang


> -Original Message-
> From: Feifei Wang
> Sent: Wednesday, July 6, 2022 7:36 PM
> To: 'Konstantin Ananyev' ; 'Yuying
> Zhang' ; 'Beilei Xing' ;
> Ruifeng Wang 
> Cc: 'dev@dpdk.org' ; nd ; Honnappa
> Nagarahalli ; nd ; nd
> 
> Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> 
> 
> 
> > -Original Message-
> > From: Feifei Wang
> > Sent: Wednesday, July 6, 2022 4:53 PM
> > To: Konstantin Ananyev ; Yuying
> Zhang
> > ; Beilei Xing ; Ruifeng
> > Wang 
> > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > ; nd 
> > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> >
> >
> >
> > > -Original Message-
> > > From: Konstantin Ananyev 
> > > Sent: Sunday, July 3, 2022 8:20 PM
> > > To: Feifei Wang ; Yuying Zhang
> > > ; Beilei Xing ;
> > > Ruifeng Wang 
> > > Cc: dev@dpdk.org; nd ; Honnappa Nagarahalli
> > > 
> > > Subject: Re: [RFC PATCH v1] net/i40e: put mempool cache out of API
> > >
> > >
> > > > Refer to "i40e_tx_free_bufs_avx512", this patch puts mempool cache
> > > > out of API to free buffers directly. There are two changes
> > > > different with previous version:
> > > > 1. change txep from "i40e_entry" to "i40e_vec_entry"
> > > > 2. put cache out of "mempool_bulk" API to copy buffers into it
> > > > directly
> > > >
> > > > Performance Test with l3fwd neon path:
> > > > with this patch
> > > > n1sdp:  no performance change
> > > > ampere-altra:+4.0%
> > > >
> > >
> > >
> > Thanks for your detailed comments.
> >
> > > Thanks for RFC, appreciate your effort.
> > > So, as I understand - bypassing mempool put/get itself gives about
> > > 7-10% speedup for RX/TX on ARM platforms, correct?
> > [Feifei] Yes.
[Feifei] Sorry, I need to correct this. Actually, according to our tests in the
direct-rearm cover letter, the improvement is 7% to 14% on N1SDP and 14% to 17%
on Ampere Altra.
> >
> > >
> > > About direct-rearm RX approach you propose:
> > > After another thought, probably it is possible to re-arrange it in a
> > > way that would help avoid related negatives.
> > > The basic idea as follows:
> > >
> > > 1. Make RXQ sw_ring visible and accessible by 'attached' TX queues.
> > > Also make sw_ring de-coupled from RXQ itself, i.e:
> > > when RXQ is stopped or even destroyed, related sw_ring may still
> > > exist (probably ref-counter or RCU would be sufficient here).
> > > All that means we need a common layout/api for rxq_sw_ring
> > > and PMDs that would like to support direct-rearming will have to
> > > use/obey it.
> > [Feifei] de-coupled sw-ring and RXQ may cause dangerous case due to
> > RXQ is stopped but elements of it (sw-ring) is still kept and we may
> > forget to free this sw-ring in the end.
> > Furthermore,  if we apply this, we need to separate operation when
> > closing RXQ and add Rx sw-ring free operation when closing TXQ. This
> > will be complex and it is not conducive to subsequent maintenance if
> > maintainer does not understand direct-rearm mode very well.
> >
> > >
> > > 2. Make RXQ sw_ring 'direct' rearming driven by TXQ itself, i.e:
> > > at txq_free_bufs() try to store released mbufs inside attached
> > > sw_ring directly. If there is no attached sw_ring, or not enough
> > > free space in it - continue with mempool_put() as usual.
> > > Note that actual arming of HW RXDs still remains responsibility
> > > of RX code-path:
> > > rxq_rearm(rxq) {
> > >   ...
> > >   - check are there are N already filled entries inside rxq_sw_ring.
> > > if not, populate them from mempool (usual mempool_get()).
> > >   - arm related RXDs and mark these sw_ring entries as managed by HW.
> > >   ...
> > > }
> > >
> > [Feifei] We try to create two modes, one is direct-rearm and the other
> > is direct-free like above.
> > And by performance comparison, we select direct-rearm which improve
> > performance by 7% - 14% compared with direct-free by 3.6% - 7% in n1sdp.
> > Furthermore, I think put direct mode in Tx or Rx is equivalent. For
> > direct- rearm, if there is no Tx sw-ring, Rx will get mbufs from
> > mempool. For direct- fee, if there

[dpdk-dev] Re: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache

2021-04-12 Thread Feifei Wang
Hi, Slava

Thanks very much for your attention.





Best Regards
Feifei

> -Original Message-
> From: Slava Ovsiienko 
> Sent: April 12, 2021 16:28
> To: Feifei Wang ; Matan Azrad
> ; Shahaf Shuler ;
> ys...@mellanox.com
> Cc: dev@dpdk.org; nd ; sta...@dpdk.org; Ruifeng Wang
> 
> Subject: RE: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region cache
> 
> Hi, Feifei
> 
> Sorry, I do not follow what this patch fixes. Do we have some issue/bug with
> MR cache in practice?

This patch fixes a bug found by logical deduction; it has not actually been
observed in practice.

> 
> Each Tx queue has its own dedicated "local" cache for MRs to convert buffer
> address in mbufs being transmitted to LKeys (HW-related entity handle) and
> the "global" cache for all MR registered on the device.
> 
> AFAIK, how conversion happens in datapath:
> - check the local queue cache flush request
> - lookup in local cache
> - if not found:
> - acquire lock for global cache read access
> - lookup in global cache
> - release lock for global cache
> 
> How cache update on memory freeing/unregistering happens:
> - acquire lock for global cache write access
> - [a] remove relevant MRs from the global cache
> - [b] set local caches flush request
> - free global cache lock
> 
> If I understand correctly, your patch swaps [a] and [b], and local caches 
> flush
> is requested earlier. What problem does it solve?
> It is not supposed there are in datapath some mbufs referencing to the
> memory being freed. Application must ensure this and must not allocate new
> mbufs from this memory regions being freed. Hence, the lookups for these
> MRs in caches should not occur.

For your first point: the application can take charge of preventing MR-freed
memory from being allocated to the data path.

Does it mean that if there is an emergency MR event, such as hotplug, the
application must inform the data path in advance so that this memory will not
be allocated, and then the control path will free this memory? If the
application can do this, I agree that this bug cannot happen.

> For other side, the cache flush has negative effect - the local cache is 
> getting
> empty and can't provide translation for other valid (not being removed) MRs,
> and the translation has to look up in the global cache, that is locked now for
> rebuilding, this causes the delays in datapatch on acquiring global cache 
> lock.
> So, I see some potential performance impact.

If the above assumption is true, we can go to your second point. I think this
is a tradeoff between cache coherence and performance.

I understand your point: although the global cache has changed, we should keep
the valid MRs in the local cache as long as possible to keep lookups fast.
Meanwhile, the local cache can be rebuilt later to reduce its waiting time for
acquiring the global cache lock.

However, this mechanism only keeps performance unchanged for the first few
mbufs. For the lkey lookups that follow a 'dev_gen' update, it is still
necessary to update the local cache, so performance first drops and then
recovers. Thus, with or without this patch, performance will jitter for a
certain period of time.

Finally, in conclusion, I tend to think that the bottom layer can do more to
ensure correct execution of the program. This may hurt performance for a short
time, but in the long run the performance will eventually come back.
Furthermore, maybe we should pay attention to performance in the steady state,
and try our best to ensure the correctness of the program in case of
emergencies.
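
To make the ordering argument concrete, a sketch of the control-path side (my
illustration only, not the merged patch; it assumes mlx5's shared-cache layout
and its mlx5_mr_rebuild_cache() helper, both driver-internal):

/*
 * Both the flush request (dev_gen) and the cache rebuild happen inside
 * one write-lock critical section, so a worker that still sees the old
 * dev_gen can only race against a fully consistent global cache.
 */
static void
demo_flush_and_rebuild(struct mlx5_dev_ctx_shared *sh)
{
	rte_rwlock_write_lock(&sh->share_cache.rwlock);
	sh->share_cache.dev_gen++;		/* [b] request local flush */
	rte_smp_wmb();				/* publish dev_gen update  */
	mlx5_mr_rebuild_cache(&sh->share_cache); /* [a] rebuild cache      */
	rte_rwlock_write_unlock(&sh->share_cache.rwlock);
}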

Best Regards
Feifei

> With best regards,
> Slava
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Thursday, March 18, 2021 9:19
> > To: Matan Azrad ; Shahaf Shuler
> > ; Slava Ovsiienko ;
> > Yongseok Koh 
> > Cc: dev@dpdk.org; n...@arm.com; Feifei Wang ;
> > sta...@dpdk.org; Ruifeng Wang 
> > Subject: [PATCH v1 3/4] net/mlx5: fix rebuild bug for Memory Region
> > cache
> >
> > 'dev_gen' is a variable to inform other cores to flush their local
> > cache when global cache is rebuilt.
> >
> > However, if 'dev_gen' is updated after global cache is rebuilt, other
> > cores may load a wrong memory region lkey value from old local cache.
> >
> > Timeslotmain core   worker core
> >   1 rebuild global cache
> >   2  load unchanged dev_gen
> >   3update dev_gen
> >   4  look up old local cache
> >
> > From the example abo

Re: [PATCH v2] net/i40e: reduce redundant store operation

2022-01-26 Thread Feifei Wang


> -Original Message-
> From: Zhang, Qi Z 
> Sent: Wednesday, January 26, 2022 10:28 PM
> To: Feifei Wang ; Xing, Beilei
> 
> Cc: dev@dpdk.org; Wang, Haiyue ; nd
> ; Ruifeng Wang 
> Subject: RE: [PATCH v2] net/i40e: reduce redundant store operation
> 
> 
> 
> > -Original Message-
> > From: Feifei Wang 
> > Sent: Tuesday, December 21, 2021 4:11 PM
> > To: Xing, Beilei 
> > Cc: dev@dpdk.org; Wang, Haiyue ; n...@arm.com;
> > Feifei Wang ; Ruifeng Wang
> > 
> > Subject: [PATCH v2] net/i40e: reduce redundant store operation
> >
> > For free buffer operation in i40e vector path, it is unnecessary to store
> 'NULL'
> > into txep.mbuf. This is because when putting mbuf into Tx queue,
> > tx_tail is the sentinel. And when doing tx_free, tx_next_dd is the
> > sentinel. In all processes, mbuf==NULL is not a condition in check.
> > Thus reset of mbuf is unnecessary and can be omitted.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >
> > v2: remove the change for scalar path due to scalar path needs to
> > check whether the mbuf is 'NULL' to release and clean up (Haiyue)
> >
> >  drivers/net/i40e/i40e_rxtx_vec_common.h | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > index f9a7f46550..26deb59fc4 100644
> > --- a/drivers/net/i40e/i40e_rxtx_vec_common.h
> > +++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
> > @@ -103,7 +103,6 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
> > if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
> > for (i = 0; i < n; i++) {
> > free[i] = txep[i].mbuf;
> > -   txep[i].mbuf = NULL;
> 
> I will suggest to still add some comment here just for explaining, this may 
> help
> to avoid unnecessary suspect when someone reading or debug on these code
> 😊
> 
Thanks for your comments. I agree with this, and I will add a comment to
explain why this store operation is unnecessary here.
> 
> > }
> > rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
> > goto done;
> > --
> > 2.25.1



[PATCH v3] net/i40e: reduce redundant reset operation

2022-01-26 Thread Feifei Wang
For the free buffer operation in the i40e vector path, it is unnecessary to
store 'NULL' into txep.mbuf. This is because when putting an mbuf into the Tx
queue, tx_tail is the sentinel, and when doing tx_free, tx_next_dd is the
sentinel. In neither process is mbuf==NULL used as a check condition.
Thus the reset of mbuf is unnecessary and can be omitted.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---

v2: remove the change for scalar path due to scalar path needs to check
whether the mbuf is 'NULL' to release and clean up (Haiyue)

v3: add comments to remind reset mbuf is unnecessary here (Qi Zhang)

 drivers/net/i40e/i40e_rxtx_vec_common.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/i40e/i40e_rxtx_vec_common.h 
b/drivers/net/i40e/i40e_rxtx_vec_common.h
index f9a7f46550..959832ed6a 100644
--- a/drivers/net/i40e/i40e_rxtx_vec_common.h
+++ b/drivers/net/i40e/i40e_rxtx_vec_common.h
@@ -103,7 +103,7 @@ i40e_tx_free_bufs(struct i40e_tx_queue *txq)
if (txq->offloads & RTE_ETH_TX_OFFLOAD_MBUF_FAST_FREE) {
for (i = 0; i < n; i++) {
free[i] = txep[i].mbuf;
-   txep[i].mbuf = NULL;
+   /* no need to reset txep[i].mbuf in vector path */
}
rte_mempool_put_bulk(free[0]->pool, (void **)free, n);
goto done;
-- 
2.25.1



[dpdk-dev] Re: [RFC PATCH v3 1/5] eal: add new definitions for wait scheme

2021-10-12 Thread Feifei Wang
> -Original Message-
> From: Ananyev, Konstantin 
> Sent: Friday, October 8, 2021 12:19 AM
> To: Feifei Wang ; Ruifeng Wang
> 
> Cc: dev@dpdk.org; nd 
> Subject: RE: [dpdk-dev] [RFC PATCH v3 1/5] eal: add new definitions for wait
> scheme

[snip]

> > diff --git a/lib/eal/include/generic/rte_pause.h
> > b/lib/eal/include/generic/rte_pause.h
> > index 668ee4a184..4e32107eca 100644
> > --- a/lib/eal/include/generic/rte_pause.h
> > +++ b/lib/eal/include/generic/rte_pause.h
> > @@ -111,6 +111,84 @@ rte_wait_until_equal_64(volatile uint64_t *addr,
> uint64_t expected,
> > while (__atomic_load_n(addr, memorder) != expected)
> > rte_pause();
> >  }
> > +
> > +/*
> > + * Wait until a 16-bit *addr breaks the condition, with a relaxed
> > +memory
> > + * ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param mask
> > + *  A mask of value bits in interest
> > + * @param expected
> > + *  A 16-bit expected value to be in the memory location.
> > + * @param cond
> > + *  A symbol representing the condition (==, !=).
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard
> > +or
> > + *  the GCC wiki on atomic synchronization for detailed definition.
> > + */
> 
> Hmm, so now we have 2 APIs doing similar thing:
> rte_wait_until_equal_n() and rte_wait_event_n().
> Can we probably unite them somehow?
> At least make rte_wait_until_equal_n() to use rte_wait_event_n() underneath.
> 
You are right. We plan to change the rte_wait_until_equal API after this new
scheme is achieved. Then we will merge wait_until into the wait_event
definition in the next patch series.
 
> > +#define rte_wait_event_16(addr, mask, expected, cond, memorder)
>  \
> > +do {   
> >\
> > +   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > +__ATOMIC_RELAXED);  \
> 
> And why user is not allowed to use __ATOMIC_SEQ_CST here?
Actually this is just a load operation, and acquire here is enough to make sure
the load of *addr is ordered before the operations that follow it.
 
> BTW, if we expect memorder to always be a constant, might be better
> BUILD_BUG_ON()?
If I understand correctly, you mean we can replace 'assert' with 'build_bug_on':
RTE_BUILD_BUG_ON(memorder != __ATOMIC_ACQUIRE && memorder !=__ATOMIC_RELAXED);  

> 
> > +  \
> > +   while ((__atomic_load_n(addr, memorder) & mask) cond expected)
>  \
> > +   rte_pause();   \
> > +} while (0)
> 
> Two thoughts with these macros:
> 1. It is a goof practise to put () around macro parameters in the macro body.
> Will save from a lot of unexpected troubles.
> 2. I think these 3 macros can be united into one.
> Something like:
> 
> #define rte_wait_event(addr, mask, expected, cond, memorder) do {\
> typeof (*(addr)) val = __atomic_load_n((addr), (memorder)); \
> if ((val & (typeof(val))(mask)) cond (typeof(val))(expected)) \
> break; \
> rte_pause(); \
> } while (1);
For this point, I think it is because different sizes need different assembly
instructions on the Arm architecture. For example,
the 16-bit load instruction is "ldxrh %w[tmp], [%x[addr]]"
the 32-bit load instruction is "ldxr %w[tmp], [%x[addr]]"
the 64-bit load instruction is "ldxr %x[tmp], [%x[addr]]"
And for consistency, we also use three APIs in the generic path.
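
For reference, a hedged sketch of the single generic macro idea in plain C (my
illustration only; it trades away the Arm WFE/LDXR power saving that the three
size-specific assembly variants provide):

#include <rte_pause.h>

/*
 * typeof() makes one definition cover 16/32/64-bit types.
 *
 * Usage example: wait until me->locked becomes 0, acquire semantics:
 *   demo_wait_event(&me->locked, UINT32_MAX, 0, !=, __ATOMIC_ACQUIRE);
 */
#define demo_wait_event(addr, mask, expected, cond, memorder)          \
do {                                                                   \
	typeof(*(addr)) __val;                                         \
	do {                                                           \
		__val = __atomic_load_n((addr), (memorder));           \
		if (!((__val & (typeof(__val))(mask)) cond             \
				(typeof(__val))(expected)))            \
			break;                                         \
		rte_pause();                                           \
	} while (1);                                                   \
} while (0)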
> 
> 
> > +
> > +/*
> > + * Wait until a 32-bit *addr breaks the condition, with a relaxed
> > +memory
> > + * ordering model meaning the loads around this API can be reordered.
> > + *
> > + * @param addr
> > + *  A pointer to the memory location.
> > + * @param mask
> > + *  A mask of value bits in interest.
> > + * @param expected
> > + *  A 32-bit expected value to be in the memory location.
> > + * @param cond
> > + *  A symbol representing the condition (==, !=).
> > + * @param memorder
> > + *  Two different memory orders that can be specified:
> > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > + *  C++11 memory orders with the same names, see the C++11 standard
> > +or
> > + 

[dpdk-dev] Re: [RFC PATCH v3 1/5] eal: add new definitions for wait scheme

2021-10-13 Thread Feifei Wang
> -Original Message-
> From: Ananyev, Konstantin 
> Sent: Wednesday, October 13, 2021 11:04 PM
> To: Feifei Wang ; Ruifeng Wang
> 
> Cc: dev@dpdk.org; nd ; nd 
> Subject: RE: [dpdk-dev] [RFC PATCH v3 1/5] eal: add new definitions for wait
> scheme
> 
> >
> > [snip]
> >
> > > > diff --git a/lib/eal/include/generic/rte_pause.h
> > > > b/lib/eal/include/generic/rte_pause.h
> > > > index 668ee4a184..4e32107eca 100644
> > > > --- a/lib/eal/include/generic/rte_pause.h
> > > > +++ b/lib/eal/include/generic/rte_pause.h
> > > > @@ -111,6 +111,84 @@ rte_wait_until_equal_64(volatile uint64_t
> > > > *addr,
> > > uint64_t expected,
> > > > while (__atomic_load_n(addr, memorder) != expected)
> > > > rte_pause();
> > > >  }
> > > > +
> > > > +/*
> > > > + * Wait until a 16-bit *addr breaks the condition, with a relaxed
> > > > +memory
> > > > + * ordering model meaning the loads around this API can be reordered.
> > > > + *
> > > > + * @param addr
> > > > + *  A pointer to the memory location.
> > > > + * @param mask
> > > > + *  A mask of value bits in interest
> > > > + * @param expected
> > > > + *  A 16-bit expected value to be in the memory location.
> > > > + * @param cond
> > > > + *  A symbol representing the condition (==, !=).
> > > > + * @param memorder
> > > > + *  Two different memory orders that can be specified:
> > > > + *  __ATOMIC_ACQUIRE and __ATOMIC_RELAXED. These map to
> > > > + *  C++11 memory orders with the same names, see the C++11
> > > > +standard or
> > > > + *  the GCC wiki on atomic synchronization for detailed definition.
> > > > + */
> > >
> > > Hmm, so now we have 2 APIs doing similar thing:
> > > rte_wait_until_equal_n() and rte_wait_event_n().
> > > Can we probably unite them somehow?
> > > At least make rte_wait_until_equal_n() to use rte_wait_event_n()
> underneath.
> > >
> > You are right. We plan to change rte_wait_until_equal API after this
> > new scheme is achieved.  And then, we will merge wait_unil into
> > wait_event definition in the next new patch series.
> >
> > > > +#define rte_wait_event_16(addr, mask, expected, cond, memorder)
> > >  \
> > > > +do {
>  \
> > > > +   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > > > +__ATOMIC_RELAXED);  \
> > >
> > > And why user is not allowed to use __ATOMIC_SEQ_CST here?
> > Actually this is just a load operation, and acquire here is enough to
> > make sure 'load addr value' can be before other operations.
> >
> > > BTW, if we expect memorder to always be a constant, might be better
> > > BUILD_BUG_ON()?
> > If I understand correctly, you means we can replace 'assert' by
> 'build_bug_on':
> > RTE_BUILD_BUG_ON(memorder != __ATOMIC_ACQUIRE && memorder
> > !=__ATOMIC_RELAXED);
> 
> Yes, that was my thought.
> In that case I think we should be able to catch wrong memorder at compilation
> stage.
> 
> >
> > >
> > > > +   
> > > >\
> > > > +   while ((__atomic_load_n(addr, memorder) & mask) cond expected)
> > >  \
> > > > +   rte_pause();
> > > >\
> > > > +} while (0)
> > >
> > > Two thoughts with these macros:
> > > 1. It is a goof practise to put () around macro parameters in the macro
> body.
> > > Will save from a lot of unexpected troubles.
> > > 2. I think these 3 macros can be united into one.
> > > Something like:
> > >
> > > #define rte_wait_event(addr, mask, expected, cond, memorder) do {\
> > > typeof (*(addr)) val = __atomic_load_n((addr), (memorder)); \
> > > if ((val & (typeof(val))(mask)) cond (typeof(val))(expected)) \
> > > break; \
> > > rte_pause(); \
> > > } while (1);
> > For this point, I think it is due to different size need to use
> > different assembly instructions in arm architecture. For example, load
> > 16 bits instruction is "ldxrh %w[tmp], [%x[addr]"
> > load 32 bits instruction is 

[dpdk-dev] Re: [RFC PATCH v3 1/5] eal: add new definitions for wait scheme

2021-10-13 Thread Feifei Wang


> -Original Message-
> From: Stephen Hemminger 
> Sent: Thursday, October 14, 2021 1:00 AM
> To: Ananyev, Konstantin 
> Cc: Feifei Wang ; Ruifeng Wang
> ; dev@dpdk.org; nd 
> Subject: Re: [dpdk-dev] [RFC PATCH v3 1/5] eal: add new definitions for wait
> scheme
> 
> On Wed, 13 Oct 2021 15:03:56 +
> "Ananyev, Konstantin"  wrote:
> 
> > > addr value' can be before other operations.
> > >
> > > > BTW, if we expect memorder to always be a constant, might be
> > > > better BUILD_BUG_ON()?
> > > If I understand correctly, you means we can replace 'assert' by
> 'build_bug_on':
> > > RTE_BUILD_BUG_ON(memorder != __ATOMIC_ACQUIRE && memorder
> > > !=__ATOMIC_RELAXED);
> >
> > Yes, that was my thought.
> > In that case I think we should be able to catch wrong memorder at
> compilation stage.
> 
> Maybe:
>RTE_BUILD_BUG_ON(!_constant_p(memorder));
>RTE_BUILD_BUG_ON(memorder != __ATOMIC_ACQUIRE &&
> memorder !=__ATOMIC_RELAXED);
> 
Thanks for your comments. One question about this: I do not know why we should
check whether memorder is a constant. Is it to check whether memorder has been
assigned, or is NULL?
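
For context, a small sketch of the guard being discussed (it assumes GCC/Clang's
__builtin_constant_p, which may fold late, so behaviour at -O0 can vary): the
first check fails the build when memorder is not a compile-time constant at
all, and the second rejects constants other than the two allowed orders.

#include <rte_common.h>

#define demo_assert_memorder(memorder)                                \
do {                                                                  \
	RTE_BUILD_BUG_ON(!__builtin_constant_p(memorder));            \
	RTE_BUILD_BUG_ON((memorder) != __ATOMIC_ACQUIRE &&            \
			 (memorder) != __ATOMIC_RELAXED);             \
} while (0)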


[dpdk-dev] Re: [PATCH v6 0/6] hide eth dev related structures

2021-10-14 Thread Feifei Wang
es in PMD is required).
> > One extra note - with new implementation RX/TX callback invocation
> > will cost one extra function call with this changes. That might cause
> > some slowdown for code-path with RX/TX callbacks heavily involved.
> > Hope such trade-off is acceptable for the community.
> > 3. Move rte_eth_dev, rte_eth_dev_data, rte_eth_rxtx_callback and related
> > things into internal header: .
> >
> > That approach was selected to:
> >- Avoid(/minimize) possible performance losses.
> >- Minimize required changes inside PMDs.
> >
> > Performance testing results (ICX 2.0GHz, E810 (ice)):
> >   - testpmd macswap fwd mode, plus
> > a) no RX/TX callbacks:
> >no actual slowdown observed
> > b) bpf-load rx 0 0 JM ./dpdk.org/examples/bpf/t3.o:
> >~2% slowdown
> >   - l3fwd: no actual slowdown observed
> >
> > Would like to thank everyone who already reviewed and tested previous
> > versions of these series. All other interested parties please don't be
> > shy and provide your feedback.
> >
> > Konstantin Ananyev (6):
> >ethdev: allocate max space for internal queue array
> >ethdev: change input parameters for rx_queue_count
> >ethdev: copy fast-path API into separate structure
> >ethdev: make fast-path functions to use new flat array
> >ethdev: add API to retrieve multiple ethernet addresses
> >ethdev: hide eth dev related structures
> >
> 
> For series,
> Reviewed-by: Ferruh Yigit 
> 
> No performance regression detected on my testing.
> 
> I am merging the series to next-net now which helps testing, but before
> merging to main repo it will be good to get more ack and test results (I can
> squash new tags later).
> 
> @Jerin, @Ajit, @Raslan, @Andrew, @Qi, @Honnappa, Can you please test
> this set for any possible regression?
> 
> Series applied to dpdk-next-net/main, thanks.
> 

For the series, there is no performance regression on n1sdp/thunderx2
with i40e and mlx5 40G NICs for l3fwd and testpmd.

Tested-by: Feifei Wang 



[dpdk-dev] Re: [PATCH v2 1/1] devtools: add relative path support for ABI compatibility check

2021-10-15 Thread Feifei Wang
Hi,

Sorry to disturb you. Are there any more comments on this patch, or can it be applied?
Thanks very much.

 Best Regards
Feifei

> -Original Message-
> From: Feifei Wang 
> Sent: Wednesday, August 11, 2021 2:17 PM
> To: Bruce Richardson 
> Cc: dev@dpdk.org; nd ; Phil Yang ;
> Feifei Wang ; Juraj Linkeš
> ; Ruifeng Wang 
> Subject: [PATCH v2 1/1] devtools: add relative path support for ABI compatibility
> check
> 
> From: Phil Yang 
> 
> Because dpdk guide does not limit the relative path for ABI compatibility
> check, users maybe set 'DPDK_ABI_REF_DIR' as a relative
> path:
> 
> ~/dpdk/devtools$ DPDK_ABI_REF_VERSION=v19.11
> DPDK_ABI_REF_DIR=build-gcc-shared ./test-meson-builds.sh
> 
> And if the DESTDIR is not an absolute path, ninja complains:
> + install_target build-gcc-shared/v19.11/build
> + build-gcc-shared/v19.11/build-gcc-shared
> + rm -rf build-gcc-shared/v19.11/build-gcc-shared
> + echo 'DESTDIR=build-gcc-shared/v19.11/build-gcc-shared ninja -C build-gcc-
> shared/v19.11/build install'
> + DESTDIR=build-gcc-shared/v19.11/build-gcc-shared
> + ninja -C build-gcc-shared/v19.11/build install
> ...
> ValueError: dst_dir must be absolute, got build-gcc-shared/v19.11/build-gcc-
> shared/usr/local/share/dpdk/
> examples/bbdev_app
> ...
> Error: install directory 'build-gcc-shared/v19.11/build-gcc-shared' does not
> exist.
> 
> To fix this, add relative path support using 'readlink -f'.
> 
> Signed-off-by: Phil Yang 
> Signed-off-by: Feifei Wang 
> Reviewed-by: Juraj Linkeš 
> Reviewed-by: Ruifeng Wang 
> Acked-by: Bruce Richardson 
> ---
>  devtools/test-meson-builds.sh | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/devtools/test-meson-builds.sh b/devtools/test-meson-builds.sh
> index 9ec8e2bc7e..8ddde95276 100755
> --- a/devtools/test-meson-builds.sh
> +++ b/devtools/test-meson-builds.sh
> @@ -168,7 +168,8 @@ build () #check> [meson options]
>   config $srcdir $builds_dir/$targetdir $cross --werror $*
>   compile $builds_dir/$targetdir
>   if [ -n "$DPDK_ABI_REF_VERSION" -a "$abicheck" = ABI ] ; then
> - abirefdir=${DPDK_ABI_REF_DIR:-
> reference}/$DPDK_ABI_REF_VERSION
> + abirefdir=$(readlink -f \
> + ${DPDK_ABI_REF_DIR:-
> reference}/$DPDK_ABI_REF_VERSION)
>   if [ ! -d $abirefdir/$targetdir ]; then
>   # clone current sources
>   if [ ! -d $abirefdir/src ]; then
> --
> 2.25.1



[dpdk-dev] Re: [PATCH v1 0/2] replace tight loop with wait until equal api

2021-10-15 Thread Feifei Wang
Hi,

Would you please help review this patch series?
Thanks very much.

Best Regards
Feifei
> -Original Message-
> From: Feifei Wang 
> Sent: Wednesday, August 25, 2021 4:01 PM
> Cc: dev@dpdk.org; nd ; Feifei Wang
> 
> Subject: [PATCH v1 0/2] replace tight loop with wait until equal api
> 
> For dpdk/lib, directly use wait_until_equal API to replace tight loop.
> 
> Feifei Wang (2):
>   eal/common: use wait until equal API for tight loop
>   mcslock: use wait until equal API for tight loop
> 
>  lib/eal/common/eal_common_mcfg.c  | 3 +--
>  lib/eal/include/generic/rte_mcslock.h | 4 ++--
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> --
> 2.25.1



[dpdk-dev] Re: [PATCH v1 2/2] mcslock: use wait until equal API for tight loop

2021-10-19 Thread Feifei Wang
> -Original Message-
> From: dev  on behalf of David Marchand
> Sent: Tuesday, October 19, 2021 7:10 PM
> To: Feifei Wang 
> Cc: Honnappa Nagarahalli ; dev
> ; nd ; Ruifeng Wang
> 
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] mcslock: use wait until equal API for
> tight loop
> 
> On Wed, Aug 25, 2021 at 10:02 AM Feifei Wang 
> wrote:
> >
> > Instead of polling for previous lock holder unlocking, use
> > wait_until_equal API.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >  lib/eal/include/generic/rte_mcslock.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/eal/include/generic/rte_mcslock.h
> > b/lib/eal/include/generic/rte_mcslock.h
> > index 9f323bd2a2..c99343f22c 100644
> > --- a/lib/eal/include/generic/rte_mcslock.h
> > +++ b/lib/eal/include/generic/rte_mcslock.h
> > @@ -84,8 +84,8 @@ rte_mcslock_lock(rte_mcslock_t **msl,
> rte_mcslock_t *me)
> >  * to spin on me->locked until the previous lock holder resets
> >  * the me->locked using mcslock_unlock().
> >  */
> > -   while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
> > -   rte_pause();
> > +   rte_wait_until_equal_32((volatile uint32_t *)&me->locked,
> > +   0, __ATOMIC_ACQUIRE);
> 
> Why do you need to cast as volatile?
Thanks for the comments.
This is firstly because the rte_wait_until_equal API defines the variable as
volatile.
However, with your comment, I find 'me->locked' is not volatile. And by the test,
I think you are right, it is necessary to add volatile here.
> 
> 
> --
> David Marchand



[dpdk-dev] Re: [PATCH v1 2/2] mcslock: use wait until equal API for tight loop

2021-10-19 Thread Feifei Wang


> -Original Message-
> From: Feifei Wang
> Sent: Wednesday, October 20, 2021 10:46 AM
> To: David Marchand 
> Cc: Honnappa Nagarahalli ; dev
> ; nd ; Ruifeng Wang
> ; nd 
> Subject: Re: [dpdk-dev] [PATCH v1 2/2] mcslock: use wait until equal API for
> tight loop
> 
> > -Original Message-
> > From: dev  on behalf of David Marchand
> > Sent: Tuesday, October 19, 2021 7:10 PM
> > To: Feifei Wang 
> > Cc: Honnappa Nagarahalli ; dev
> > ; nd ; Ruifeng Wang
> 
> > Subject: Re: [dpdk-dev] [PATCH v1 2/2] mcslock: use wait until equal API
> > for tight loop
> >
> > On Wed, Aug 25, 2021 at 10:02 AM Feifei Wang 
> > wrote:
> > >
> > > Instead of polling for previous lock holder unlocking, use
> > > wait_until_equal API.
> > >
> > > Signed-off-by: Feifei Wang 
> > > Reviewed-by: Ruifeng Wang 
> > > ---
> > >  lib/eal/include/generic/rte_mcslock.h | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/eal/include/generic/rte_mcslock.h
> > > b/lib/eal/include/generic/rte_mcslock.h
> > > index 9f323bd2a2..c99343f22c 100644
> > > --- a/lib/eal/include/generic/rte_mcslock.h
> > > +++ b/lib/eal/include/generic/rte_mcslock.h
> > > @@ -84,8 +84,8 @@ rte_mcslock_lock(rte_mcslock_t **msl,
> > rte_mcslock_t *me)
> > >  * to spin on me->locked until the previous lock holder resets
> > >  * the me->locked using mcslock_unlock().
> > >  */
> > > -   while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
> > > -   rte_pause();
> > > +   rte_wait_until_equal_32((volatile uint32_t *)&me->locked,
> > > +   0, __ATOMIC_ACQUIRE);
> >
> > Why do you need to cast as volatile?
> Thanks for the comments.
> This is firstly because rte_wait_until_equal API defines the variable as 
> volatile.
> However, with your comment, I find 'me->locked' is not volatile. And by the 
> test,
> I think you are right, it is necessary to add volatile here.

Sorry, correcting a writing mistake:
'It is unnecessary to add volatile here.'
> >
> >
> > --
> > David Marchand



[dpdk-dev] [PATCH v2 0/2] replace tight loop with wait until equal api

2021-10-19 Thread Feifei Wang
For dpdk/lib, directly use wait_until_equal API to replace tight loop.

v2:
1. delete wrong 'volatile' in mcslock (David)

Feifei Wang (2):
  eal/common: use wait until equal API for tight loop
  mcslock: use wait until equal API for tight loop

 lib/eal/common/eal_common_mcfg.c  | 3 +--
 lib/eal/include/generic/rte_mcslock.h | 3 +--
 2 files changed, 2 insertions(+), 4 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v2 1/2] eal/common: use wait until equal API for tight loop

2021-10-19 Thread Feifei Wang
Instead of polling for mcfg->magic to be updated, use wait_until_equal
API.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/common/eal_common_mcfg.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/eal/common/eal_common_mcfg.c b/lib/eal/common/eal_common_mcfg.c
index c77ba97a9f..cf4a279905 100644
--- a/lib/eal/common/eal_common_mcfg.c
+++ b/lib/eal/common/eal_common_mcfg.c
@@ -30,8 +30,7 @@ eal_mcfg_wait_complete(void)
struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
 
/* wait until shared mem_config finish initialising */
-   while (mcfg->magic != RTE_MAGIC)
-   rte_pause();
+   rte_wait_until_equal_32(&mcfg->magic, RTE_MAGIC, __ATOMIC_RELAXED);
 }
 
 int
-- 
2.25.1



[dpdk-dev] [PATCH v2 2/2] mcslock: use wait until equal API for tight loop

2021-10-19 Thread Feifei Wang
Instead of polling for previous lock holder unlocking, use
wait_until_equal API.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_mcslock.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/eal/include/generic/rte_mcslock.h 
b/lib/eal/include/generic/rte_mcslock.h
index 9f323bd2a2..34f33c64a5 100644
--- a/lib/eal/include/generic/rte_mcslock.h
+++ b/lib/eal/include/generic/rte_mcslock.h
@@ -84,8 +84,7 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
 * to spin on me->locked until the previous lock holder resets
 * the me->locked using mcslock_unlock().
 */
-   while (__atomic_load_n(&me->locked, __ATOMIC_ACQUIRE))
-   rte_pause();
+   rte_wait_until_equal_32((uint32_t *)&me->locked, 0, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.25.1



[dpdk-dev] Re: [RFC PATCH v3 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-19 Thread Feifei Wang
> -Original Message-
> From: dev  on behalf of Ananyev, Konstantin
> Sent: Friday, October 8, 2021 1:40 AM
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> 
> Subject: Re: [dpdk-dev] [RFC PATCH v3 4/5] lib/bpf: use wait event scheme for
> Rx/Tx iteration
> 
> 
> 
> > -Original Message-
> > From: Ananyev, Konstantin
> > Sent: Thursday, October 7, 2021 4:50 PM
> > To: Feifei Wang 
> > Cc: dev@dpdk.org; n...@arm.com; Ruifeng Wang 
> > Subject: RE: [RFC PATCH v3 4/5] lib/bpf: use wait event scheme for
> > Rx/Tx iteration
> >
> >
> >
> > >
> > > Signed-off-by: Feifei Wang 
> > > Reviewed-by: Ruifeng Wang 
> > > ---
> > >  lib/bpf/bpf_pkt.c | 9 +++--
> > >  1 file changed, 3 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c index
> > > 6e8248f0d6..3af15ae97b 100644
> > > --- a/lib/bpf/bpf_pkt.c
> > > +++ b/lib/bpf/bpf_pkt.c
> > > @@ -113,7 +113,7 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
> > > static void  bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)  {
> > > - uint32_t nuse, puse;
> > > + uint32_t puse;
> > >
> > >   /* make sure all previous loads and stores are completed */
> > >   rte_smp_mb();
> > > @@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> > >
> > >   /* in use, busy wait till current RX/TX iteration is finished */
> > >   if ((puse & BPF_ETH_CBI_INUSE) != 0) {
> > > - do {
> > > - rte_pause();
> > > - rte_compiler_barrier();
> > > - nuse = cbi->use;
> > > - } while (nuse == puse);
> > > + rte_compiler_barrier();
> > > + rte_wait_event_32(&cbi->use, UINT_MAX, puse, ==,
> > > +__ATOMIC_RELAXED);
> 
> Probably UINT32_MAX will be a bit better here.
That's right, UINT32_MAX is more suitable.
> 
> >
> > If we do use atomic load, while we still need a compiler_barrier() here?
Yes, the compiler barrier can be removed here, since the atomic load forces a
fresh read of the value on every iteration.
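
A small before/after sketch of why (paraphrasing the diff above; struct
demo_cbi mirrors only the field used here):

#include <stdint.h>
#include <rte_atomic.h>
#include <rte_pause.h>

/* Minimal mirror of the relevant field, just for this sketch. */
struct demo_cbi { uint32_t use; };

/* Before: with a plain load, the compiler may hoist cbi->use out of
 * the loop, so an explicit compiler barrier is needed. */
static void
demo_wait_plain(const struct demo_cbi *cbi, uint32_t puse)
{
	uint32_t nuse;

	do {
		rte_pause();
		rte_compiler_barrier();	/* force re-read of cbi->use */
		nuse = cbi->use;
	} while (nuse == puse);
}

/* After: __atomic_load_n() itself forces a fresh read each iteration,
 * so the separate compiler barrier becomes redundant. */
static void
demo_wait_atomic(const struct demo_cbi *cbi, uint32_t puse)
{
	while (__atomic_load_n(&cbi->use, __ATOMIC_RELAXED) == puse)
		rte_pause();
}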
> >
> > >   }
> > >  }
> > >
> > > --
> > > 2.25.1



[dpdk-dev] [PATCH v4 0/5] add new definitions for wait scheme

2021-10-20 Thread Feifei Wang
Add new definitions for wait scheme, and apply this new definitions into
lib to replace rte_pause.

v2:
1. use macro to create new wait scheme (Stephen)

v3:
1. delete unnecessary bug fix in bpf (Konstantin)

v4:
1. put size into the macro body (Konstantin)
2. replace assert with BUILD_BUG_ON (Stephen)
3. delete unnecessary compiler barrier for bpf (Konstantin)

Feifei Wang (5):
  eal: add new definitions for wait scheme
  eal: use wait event for read pflock
  eal: use wait event scheme for mcslock
  lib/bpf: use wait event scheme for Rx/Tx iteration
  lib/distributor: use wait event scheme

 lib/bpf/bpf_pkt.c|   9 +-
 lib/distributor/rte_distributor_single.c |  10 +-
 lib/eal/arm/include/rte_pause_64.h   | 126 +--
 lib/eal/include/generic/rte_mcslock.h|   9 +-
 lib/eal/include/generic/rte_pause.h  |  32 ++
 lib/eal/include/generic/rte_pflock.h |   4 +-
 6 files changed, 119 insertions(+), 71 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v4 1/5] eal: add new definitions for wait scheme

2021-10-20 Thread Feifei Wang
Introduce macros as generic interface for address monitoring.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/arm/include/rte_pause_64.h  | 126 
 lib/eal/include/generic/rte_pause.h |  32 +++
 2 files changed, 104 insertions(+), 54 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index e87d10b8cc..23954c2de2 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -31,20 +31,12 @@ static inline void rte_pause(void)
 /* Put processor into low power WFE(Wait For Event) state. */
 #define __WFE() { asm volatile("wfe" : : : "memory"); }
 
-static __rte_always_inline void
-rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-   int memorder)
-{
-   uint16_t value;
-
-   assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
-
-   /*
-* Atomic exclusive load from addr, it returns the 16-bit content of
-* *addr while making it 'monitored',when it is written by someone
-* else, the 'monitored' state is cleared and a event is generated
-* implicitly to exit WFE.
-*/
+/*
+ * Atomic exclusive load from addr, it returns the 16-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and a event is generated
+ * implicitly to exit WFE.
+ */
 #define __LOAD_EXC_16(src, dst, memorder) {   \
if (memorder == __ATOMIC_RELAXED) {   \
asm volatile("ldxrh %w[tmp], [%x[addr]]"  \
@@ -58,6 +50,52 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
: "memory");  \
} }
 
+/*
+ * Atomic exclusive load from addr, it returns the 32-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and a event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_32(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+/*
+ * Atomic exclusive load from addr, it returns the 64-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and a event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_64(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+static __rte_always_inline void
+rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
+   int memorder)
+{
+   uint16_t value;
+
+   assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
+
__LOAD_EXC_16(addr, value, memorder)
if (value != expected) {
__SEVL()
@@ -66,7 +104,6 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
__LOAD_EXC_16(addr, value, memorder)
} while (value != expected);
}
-#undef __LOAD_EXC_16
 }
 
 static __rte_always_inline void
@@ -77,25 +114,6 @@ rte_wait_until_equal_32(volatile uint32_t *addr, uint32_t 
expected,
 
assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
 
-   /*
-* Atomic exclusive load from addr, it returns the 32-bit content of
-* *addr while making it 'monitored',when it is written by someone
-* else, the 'monitored' state is cleared and a event is generated
-   

[dpdk-dev] [PATCH v4 2/5] eal: use wait event for read pflock

2021-10-20 Thread Feifei Wang
Instead of polling for read pflock update, use wait event scheme for
this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_pflock.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/eal/include/generic/rte_pflock.h 
b/lib/eal/include/generic/rte_pflock.h
index e57c179ef2..c1c230d131 100644
--- a/lib/eal/include/generic/rte_pflock.h
+++ b/lib/eal/include/generic/rte_pflock.h
@@ -121,9 +121,7 @@ rte_pflock_read_lock(rte_pflock_t *pf)
return;
 
/* Wait for current write phase to complete. */
-   while ((__atomic_load_n(&pf->rd.in, __ATOMIC_ACQUIRE)
-   & RTE_PFLOCK_WBITS) == w)
-   rte_pause();
+   rte_wait_event(&pf->rd.in, RTE_PFLOCK_WBITS, w, ==, __ATOMIC_ACQUIRE, 
16);
 }
 
 /**
-- 
2.25.1



[dpdk-dev] [PATCH v4 3/5] eal: use wait event scheme for mcslock

2021-10-20 Thread Feifei Wang
Instead of polling for mcslock to be updated, use wait event scheme
for this case.
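
Note on the cast below: me->next is a pointer, while the new macro monitors an
integer location, so the patch observes the field through a fixed-width integer
alias chosen per architecture. Conceptually (a sketch using uintptr_t for
illustration; the actual diff spells out the two widths explicitly):

	/* busy-wait while the successor's 'next' pointer is still NULL,
	 * viewing the pointer field as a pointer-width integer */
	while (__atomic_load_n((uintptr_t *)&me->next, __ATOMIC_RELAXED) == 0)
		rte_pause();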

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_mcslock.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/eal/include/generic/rte_mcslock.h 
b/lib/eal/include/generic/rte_mcslock.h
index 34f33c64a5..08137c361b 100644
--- a/lib/eal/include/generic/rte_mcslock.h
+++ b/lib/eal/include/generic/rte_mcslock.h
@@ -116,8 +116,13 @@ rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
/* More nodes added to the queue by other CPUs.
 * Wait until the next pointer is set.
 */
-   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL)
-   rte_pause();
+#ifdef RTE_ARCH_32
+   rte_wait_event((uint32_t *)&me->next, UINT32_MAX, 0, ==,
+   __ATOMIC_RELAXED, 32);
+#else
+   rte_wait_event((uint64_t *)&me->next, UINT64_MAX, 0, ==,
+   __ATOMIC_RELAXED, 64);
+#endif
}
 
/* Pass lock to next waiter. */
-- 
2.25.1



[dpdk-dev] [PATCH v4 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-20 Thread Feifei Wang
Instead of polling for cbi->use to be updated, use wait event scheme.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/bpf/bpf_pkt.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c
index 6e8248f0d6..00a5748061 100644
--- a/lib/bpf/bpf_pkt.c
+++ b/lib/bpf/bpf_pkt.c
@@ -113,7 +113,7 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
 static void
 bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
 {
-   uint32_t nuse, puse;
+   uint32_t puse;
 
/* make sure all previous loads and stores are completed */
rte_smp_mb();
@@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
 
/* in use, busy wait till current RX/TX iteration is finished */
if ((puse & BPF_ETH_CBI_INUSE) != 0) {
-   do {
-   rte_pause();
-   rte_compiler_barrier();
-   nuse = cbi->use;
-   } while (nuse == puse);
+   rte_wait_event(&cbi->use, UINT32_MAX, puse, ==,
+   __ATOMIC_RELAXED, 32);
}
 }
 
-- 
2.25.1



[dpdk-dev] [PATCH v4 5/5] lib/distributor: use wait event scheme

2021-10-20 Thread Feifei Wang
Instead of polling for bufptr64 to be updated, use
wait event for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/distributor/rte_distributor_single.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/distributor/rte_distributor_single.c 
b/lib/distributor/rte_distributor_single.c
index f4725b1d0b..c623bb135d 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -33,9 +33,8 @@ rte_distributor_request_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_GET_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   0, !=, __ATOMIC_RELAXED, 64);
 
/* Sync with distributor on GET_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -74,9 +73,8 @@ rte_distributor_return_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
uint64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_RETURN_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   0, !=, __ATOMIC_RELAXED, 64);
 
/* Sync with distributor on RETURN_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
-- 
2.25.1



RE: [PATCH v1 3/5] ethdev: add API for direct rearm mode

2022-05-01 Thread Feifei Wang


> -----Original Message-----
> From: Jerin Jacob 
> Sent: Wednesday, April 20, 2022 6:50 PM
> To: Feifei Wang 
> Cc: tho...@monjalon.net; Ferruh Yigit ; Andrew
> Rybchenko ; Ray Kinsella
> ; dpdk-dev ; nd ;
> Honnappa Nagarahalli ; Ruifeng Wang
> 
> Subject: Re: [PATCH v1 3/5] ethdev: add API for direct rearm mode
> 
> On Wed, Apr 20, 2022 at 1:47 PM Feifei Wang 
> wrote:
> >
> > Add API for enabling direct rearm mode and for mapping RX and TX
> > queues. Currently, the API supports 1:1(txq : rxq) mapping.
> >
> > Suggested-by: Honnappa Nagarahalli 
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > Reviewed-by: Honnappa Nagarahalli 
> > ---
> 
> > + *
> > + * @return
> > + *   - (0) if successful.
> > + */
> > +__rte_experimental
> > +int rte_eth_direct_rxrearm_map(uint16_t rx_port_id, uint16_t
> rx_queue_id,
> > +  uint16_t tx_port_id, uint16_t
> > +tx_queue_id);
> 
> Won't existing rte_eth_hairpin_* APIs work to achieve the same?
[Feifei] Thanks for the comment. I have looked at the hairpin feature, which
is enabled in the MLX5 driver.

I think the most important difference is that hairpin just redirects packets
from an Rx queue to a Tx queue on the same port, so the Rx/Tx queues only need
to record the peer queue id. Direct rearm, by contrast, can map an Rx queue to
a Tx queue on a different port, which requires the Rx queue to record the
paired port id as well as the queue id.

Furthermore, hairpin needs to set up dedicated hairpin queues before it can
bind an Rx queue to a Tx queue, while direct rearm can map normal queues. This
is because direct rearm only needs the used buffers and does not care about
the packets themselves.
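
For illustration, a minimal usage sketch of the proposed mapping API (the port
and queue ids here are hypothetical; error handling elided):

	/* Map Rx queue 0 of port 0 to Tx queue 0 of port 1 (1:1 mapping), so
	 * the Rx side can rearm its descriptors directly with the used mbufs
	 * of the paired Tx queue. Normal queues are used, and only the
	 * buffers are reused; the packet contents are never touched. */
	if (rte_eth_direct_rxrearm_map(0, 0, 1, 0) != 0)
		rte_exit(EXIT_FAILURE, "direct rearm mapping failed\n");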


[dpdk-dev] RE: [PATCH v4 1/5] eal: add new definitions for wait scheme

2021-10-25 Thread Feifei Wang


> -----Original Message-----
> From: Ananyev, Konstantin 
> Sent: Friday, October 22, 2021 12:25 AM
> To: Feifei Wang ; Ruifeng Wang
> 
> Cc: dev@dpdk.org; nd 
> Subject: RE: [PATCH v4 1/5] eal: add new definitions for wait scheme
> 
> > Introduce macros as generic interface for address monitoring.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >  lib/eal/arm/include/rte_pause_64.h  | 126
> >   lib/eal/include/generic/rte_pause.h |
> > 32 +++
> >  2 files changed, 104 insertions(+), 54 deletions(-)
> >
> > diff --git a/lib/eal/arm/include/rte_pause_64.h
> > b/lib/eal/arm/include/rte_pause_64.h
> > index e87d10b8cc..23954c2de2 100644
> > --- a/lib/eal/arm/include/rte_pause_64.h
> > +++ b/lib/eal/arm/include/rte_pause_64.h
> > @@ -31,20 +31,12 @@ static inline void rte_pause(void)
> >  /* Put processor into low power WFE(Wait For Event) state. */
> > #define __WFE() { asm volatile("wfe" : : : "memory"); }
> >
> > -static __rte_always_inline void
> > -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > -   int memorder)
> > -{
> > -   uint16_t value;
> > -
> > -   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> __ATOMIC_RELAXED);
> > -
> > -   /*
> > -* Atomic exclusive load from addr, it returns the 16-bit content of
> > -* *addr while making it 'monitored',when it is written by someone
> > -* else, the 'monitored' state is cleared and a event is generated
> > -* implicitly to exit WFE.
> > -*/
> > +/*
> > + * Atomic exclusive load from addr, it returns the 16-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> >  #define __LOAD_EXC_16(src, dst, memorder) {   \
> > if (memorder == __ATOMIC_RELAXED) {   \
> > asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@ -58,6 +50,52
> @@
> > rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > : "memory");  \
> > } }
> >
> > +/*
> > + * Atomic exclusive load from addr, it returns the 32-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> > +#define __LOAD_EXC_32(src, dst, memorder) {  \
> > +   if (memorder == __ATOMIC_RELAXED) {  \
> > +   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } else { \
> > +   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } }
> > +
> > +/*
> > + * Atomic exclusive load from addr, it returns the 64-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> > +#define __LOAD_EXC_64(src, dst, memorder) {  \
> > +   if (memorder == __ATOMIC_RELAXED) {  \
> > +   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } else { \
> > +   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } }
> > +
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +   int 

[dpdk-dev] RE: [PATCH v4 1/5] eal: add new definitions for wait scheme

2021-10-25 Thread Feifei Wang
> -----Original Message-----
> From: Jerin Jacob 
> Sent: Friday, October 22, 2021 8:10 AM
> To: Feifei Wang 
> Cc: Ruifeng Wang ; Ananyev, Konstantin
> ; dpdk-dev ; nd
> 
> Subject: Re: [dpdk-dev] [PATCH v4 1/5] eal: add new definitions for wait scheme
> 
> On Wed, Oct 20, 2021 at 2:16 PM Feifei Wang 
> wrote:
> >
> > Introduce macros as generic interface for address monitoring.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >  lib/eal/arm/include/rte_pause_64.h  | 126
> >   lib/eal/include/generic/rte_pause.h |
> > 32 +++
> >  2 files changed, 104 insertions(+), 54 deletions(-)
> >
> > diff --git a/lib/eal/arm/include/rte_pause_64.h
> > b/lib/eal/arm/include/rte_pause_64.h
> > index e87d10b8cc..23954c2de2 100644
> > --- a/lib/eal/arm/include/rte_pause_64.h
> > +++ b/lib/eal/arm/include/rte_pause_64.h
> > @@ -31,20 +31,12 @@ static inline void rte_pause(void)
> >  /* Put processor into low power WFE(Wait For Event) state. */
> > #define __WFE() { asm volatile("wfe" : : : "memory"); }
> >
> > -static __rte_always_inline void
> > -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > -   int memorder)
> > -{
> > -   uint16_t value;
> > -
> > -   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> __ATOMIC_RELAXED);
> > -
> > -   /*
> > -* Atomic exclusive load from addr, it returns the 16-bit content of
> > -* *addr while making it 'monitored',when it is written by someone
> > -* else, the 'monitored' state is cleared and a event is generated
> 
> a event -> an event in all the occurrence.
> 
> > -* implicitly to exit WFE.
> > -*/
> > +/*
> > + * Atomic exclusive load from addr, it returns the 16-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> >  #define __LOAD_EXC_16(src, dst, memorder) {   \
> > if (memorder == __ATOMIC_RELAXED) {   \
> > asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@ -58,6
> > +50,52 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t
> expected,
> > : "memory");  \
> > } }
> >
> > +/*
> > + * Atomic exclusive load from addr, it returns the 32-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> > +#define __LOAD_EXC_32(src, dst, memorder) {  \
> > +   if (memorder == __ATOMIC_RELAXED) {  \
> > +   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } else { \
> > +   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } }
> > +
> > +/*
> > + * Atomic exclusive load from addr, it returns the 64-bit content of
> > + * *addr while making it 'monitored', when it is written by someone
> > + * else, the 'monitored' state is cleared and a event is generated
> > + * implicitly to exit WFE.
> > + */
> > +#define __LOAD_EXC_64(src, dst, memorder) {  \
> > +   if (memorder == __ATOMIC_RELAXED) {  \
> > +   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [addr] "r"(src)\
> > +   : "memory"); \
> > +   } else { \
> > +   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
> > +   : [tmp] "=&r" (dst)  \
> > +   : [

[dpdk-dev] RE: [PATCH v4 1/5] eal: add new definitions for wait scheme

2021-10-25 Thread Feifei Wang


> -----Original Message-----
> From: Ananyev, Konstantin 
> Sent: Monday, October 25, 2021 10:29 PM
> To: Feifei Wang ; Ruifeng Wang
> 
> Cc: dev@dpdk.org; nd ; nd 
> Subject: RE: [PATCH v4 1/5] eal: add new definitions for wait scheme
> 
> 
> > > > Introduce macros as generic interface for address monitoring.
> > > >
> > > > Signed-off-by: Feifei Wang 
> > > > Reviewed-by: Ruifeng Wang 
> > > > ---
> > > >  lib/eal/arm/include/rte_pause_64.h  | 126
> > > >   lib/eal/include/generic/rte_pause.h
> > > > |
> > > > 32 +++
> > > >  2 files changed, 104 insertions(+), 54 deletions(-)
> > > >
> > > > diff --git a/lib/eal/arm/include/rte_pause_64.h
> > > > b/lib/eal/arm/include/rte_pause_64.h
> > > > index e87d10b8cc..23954c2de2 100644
> > > > --- a/lib/eal/arm/include/rte_pause_64.h
> > > > +++ b/lib/eal/arm/include/rte_pause_64.h
> > > > @@ -31,20 +31,12 @@ static inline void rte_pause(void)
> > > >  /* Put processor into low power WFE(Wait For Event) state. */
> > > > #define __WFE() { asm volatile("wfe" : : : "memory"); }
> > > >
> > > > -static __rte_always_inline void
> > > > -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > > -   int memorder)
> > > > -{
> > > > -   uint16_t value;
> > > > -
> > > > -   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > > __ATOMIC_RELAXED);
> > > > -
> > > > -   /*
> > > > -* Atomic exclusive load from addr, it returns the 16-bit 
> > > > content of
> > > > -* *addr while making it 'monitored',when it is written by 
> > > > someone
> > > > -* else, the 'monitored' state is cleared and a event is 
> > > > generated
> > > > -* implicitly to exit WFE.
> > > > -*/
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 16-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and a event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > >  #define __LOAD_EXC_16(src, dst, memorder) {   \
> > > > if (memorder == __ATOMIC_RELAXED) {   \
> > > > asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@ -58,6 
> > > > +50,52
> > > @@
> > > > rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > > : "memory");  \
> > > > } }
> > > >
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 32-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and a event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > > +#define __LOAD_EXC_32(src, dst, memorder) {  \
> > > > +   if (memorder == __ATOMIC_RELAXED) {  \
> > > > +   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > > > +   : [tmp] "=&r" (dst)  \
> > > > +   : [addr] "r"(src)\
> > > > +   : "memory"); \
> > > > +   } else { \
> > > > +   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > > > +   : [tmp] "=&r" (dst)  \
> > > > +   : [addr] "r"(src)\
> > > > +   : "memory"); \
> > > > +   } }
> > > > +
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 64-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, 

[dpdk-dev] RE: [PATCH v4 1/5] eal: add new definitions for wait scheme

2021-10-25 Thread Feifei Wang

> -----Original Message-----
> From: dev  on behalf of Jerin Jacob
> Sent: Monday, October 25, 2021 5:44 PM
> To: Feifei Wang 
> Cc: Ruifeng Wang ; Ananyev, Konstantin
> ; dpdk-dev ; nd
> 
> Subject: Re: [dpdk-dev] [PATCH v4 1/5] eal: add new definitions for wait scheme
> 
> On Mon, Oct 25, 2021 at 3:01 PM Feifei Wang 
> wrote:
> >
> > > -----Original Message-----
> > > From: Jerin Jacob 
> > > Sent: Friday, October 22, 2021 8:10 AM
> > > To: Feifei Wang 
> > > Cc: Ruifeng Wang ; Ananyev, Konstantin
> > > ; dpdk-dev ; nd
> > > 
> > > Subject: Re: [dpdk-dev] [PATCH v4 1/5] eal: add new definitions for wait
> > > scheme
> > >
> > > On Wed, Oct 20, 2021 at 2:16 PM Feifei Wang 
> > > wrote:
> > > >
> > > > Introduce macros as generic interface for address monitoring.
> > > >
> > > > Signed-off-by: Feifei Wang 
> > > > Reviewed-by: Ruifeng Wang 
> > > > ---
> > > >  lib/eal/arm/include/rte_pause_64.h  | 126
> > > >   lib/eal/include/generic/rte_pause.h
> > > > |
> > > > 32 +++
> > > >  2 files changed, 104 insertions(+), 54 deletions(-)
> > > >
> > > > diff --git a/lib/eal/arm/include/rte_pause_64.h
> > > > b/lib/eal/arm/include/rte_pause_64.h
> > > > index e87d10b8cc..23954c2de2 100644
> > > > --- a/lib/eal/arm/include/rte_pause_64.h
> > > > +++ b/lib/eal/arm/include/rte_pause_64.h
> > > > @@ -31,20 +31,12 @@ static inline void rte_pause(void)
> > > >  /* Put processor into low power WFE(Wait For Event) state. */
> > > > #define __WFE() { asm volatile("wfe" : : : "memory"); }
> > > >
> > > > -static __rte_always_inline void
> > > > -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > > -   int memorder)
> > > > -{
> > > > -   uint16_t value;
> > > > -
> > > > -   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > > __ATOMIC_RELAXED);
> > > > -
> > > > -   /*
> > > > -* Atomic exclusive load from addr, it returns the 16-bit 
> > > > content of
> > > > -* *addr while making it 'monitored',when it is written by 
> > > > someone
> > > > -* else, the 'monitored' state is cleared and a event is 
> > > > generated
> > >
> > > a event -> an event in all the occurrence.
> > >
> > > > -* implicitly to exit WFE.
> > > > -*/
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 16-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and a event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > >  #define __LOAD_EXC_16(src, dst, memorder) {   \
> > > > if (memorder == __ATOMIC_RELAXED) {   \
> > > > asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@
> > > > -58,6
> > > > +50,52 @@ rte_wait_until_equal_16(volatile uint16_t *addr,
> > > > +uint16_t
> > > expected,
> > > > : "memory");  \
> > > > } }
> > > >
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 32-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and a event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > > +#define __LOAD_EXC_32(src, dst, memorder) {  \
> > > > +   if (memorder == __ATOMIC_RELAXED) {  \
> > > > +   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > > > +   : [tmp] "=&r" (dst)  \
> > > > +   : [addr] "r"(src)\
> > > > +   : "memory"); \
> > > > +   } else { \
&

[dpdk-dev] [PATCH v5 0/5] add new definitions for wait scheme

2021-10-26 Thread Feifei Wang
Add new definitions for the wait scheme, and apply these new definitions in
the libs to replace rte_pause. (A usage sketch of the updated macro follows
the changelog below.)

v2:
1. use macro to create new wait scheme (Stephen)

v3:
1. delete unnecessary bug fix in bpf (Konstantin)

v4:
1. put size into the macro body (Konstantin)
2. replace assert with BUILD_BUG_ON (Stephen)
3. delete unnecessary compiler barrier for bpf (Konstantin)

v5:
1. 'size' is no longer a parameter (Konstantin)
2. put () around macro parameters (Konstantin)
3. fix some original typo issues (Jerin)
4. swap 'rte_wait_event' parameter location (Jerin)
5. add new macro '__LOAD_EXC'
6. delete 'undef' to prevent compilation warning
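
For reference, a minimal usage sketch of the v5 macro as applied in this
series (the comparison operator now precedes the expected value, and 'size' is
no longer an explicit parameter):

	/* pflock: wait while ((pf->rd.in & RTE_PFLOCK_WBITS) == w) holds */
	rte_wait_event(&pf->rd.in, RTE_PFLOCK_WBITS, ==, w, __ATOMIC_ACQUIRE);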
 
Feifei Wang (5):
  eal: add new definitions for wait scheme
  eal: use wait event for read pflock
  eal: use wait event scheme for mcslock
  lib/bpf: use wait event scheme for Rx/Tx iteration
  lib/distributor: use wait event scheme

 lib/bpf/bpf_pkt.c|  11 +-
 lib/distributor/rte_distributor_single.c |  10 +-
 lib/eal/arm/include/rte_pause_64.h   | 135 +--
 lib/eal/include/generic/rte_mcslock.h|   9 +-
 lib/eal/include/generic/rte_pause.h  |  27 +
 lib/eal/include/generic/rte_pflock.h |   4 +-
 6 files changed, 121 insertions(+), 75 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v5 1/5] eal: add new definitions for wait scheme

2021-10-26 Thread Feifei Wang
Introduce macros as generic interface for address monitoring.
For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.

Furthermore, to prevent compilation warning in arm:
--
'warning: implicit declaration of function ...'
--
Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and '__WFE'.

This is because the original macros are undefined at the end of the file.
If new macro 'rte_wait_event' calls them in other files, they will be
seen as 'not defined'.
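
For illustration, a minimal sketch of the problem the 'undef' removal avoids
(hypothetical file layout, not the actual header contents):

	/* rte_pause_64.h, old layout */
	#define __LOAD_EXC_16(src, dst, memorder) { /* ... asm ... */ }
	/* ... rte_wait_until_equal_16() uses the macro here ... */
	#undef __LOAD_EXC_16	/* the macro stops existing at this point */

	/* some_other_file.c */
	#include <rte_pause.h>
	/* rte_wait_event() expands to __LOAD_EXC_16(...) here; since the
	 * macro was #undef'd, the compiler treats the unexpanded token as a
	 * function call and warns: implicit declaration of function. */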

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/arm/include/rte_pause_64.h  | 135 
 lib/eal/include/generic/rte_pause.h |  27 ++
 2 files changed, 105 insertions(+), 57 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index e87d10b8cc..1fea0dec63 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -31,20 +31,12 @@ static inline void rte_pause(void)
 /* Put processor into low power WFE(Wait For Event) state. */
 #define __WFE() { asm volatile("wfe" : : : "memory"); }
 
-static __rte_always_inline void
-rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-   int memorder)
-{
-   uint16_t value;
-
-   assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
-
-   /*
-* Atomic exclusive load from addr, it returns the 16-bit content of
-* *addr while making it 'monitored',when it is written by someone
-* else, the 'monitored' state is cleared and a event is generated
-* implicitly to exit WFE.
-*/
+/*
+ * Atomic exclusive load from addr, it returns the 16-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
 #define __LOAD_EXC_16(src, dst, memorder) {   \
if (memorder == __ATOMIC_RELAXED) {   \
asm volatile("ldxrh %w[tmp], [%x[addr]]"  \
@@ -58,6 +50,62 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
: "memory");  \
} }
 
+/*
+ * Atomic exclusive load from addr, it returns the 32-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_32(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+/*
+ * Atomic exclusive load from addr, it returns the 64-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_64(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+#define __LOAD_EXC(src, dst, memorder, size) {  \
+   assert(size == 16 || size == 32 || size == 64); \
+   if (size == 16) \
+   __LOAD_EXC_16(src, dst, memorder)   \
+   else if (size == 32)\
+   __LOAD_EXC_32(src, dst, memorder)   \
+   else if (size == 64)\
+   

[dpdk-dev] [PATCH v5 2/5] eal: use wait event for read pflock

2021-10-26 Thread Feifei Wang
Instead of polling for read pflock update, use wait event scheme for
this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_pflock.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/eal/include/generic/rte_pflock.h 
b/lib/eal/include/generic/rte_pflock.h
index e57c179ef2..7573b036bf 100644
--- a/lib/eal/include/generic/rte_pflock.h
+++ b/lib/eal/include/generic/rte_pflock.h
@@ -121,9 +121,7 @@ rte_pflock_read_lock(rte_pflock_t *pf)
return;
 
/* Wait for current write phase to complete. */
-   while ((__atomic_load_n(&pf->rd.in, __ATOMIC_ACQUIRE)
-   & RTE_PFLOCK_WBITS) == w)
-   rte_pause();
+   rte_wait_event(&pf->rd.in, RTE_PFLOCK_WBITS, ==, w, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.25.1



[dpdk-dev] [PATCH v5 3/5] eal: use wait event scheme for mcslock

2021-10-26 Thread Feifei Wang
Instead of polling for mcslock to be updated, use wait event scheme
for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_mcslock.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/eal/include/generic/rte_mcslock.h 
b/lib/eal/include/generic/rte_mcslock.h
index 34f33c64a5..806a2b2c7e 100644
--- a/lib/eal/include/generic/rte_mcslock.h
+++ b/lib/eal/include/generic/rte_mcslock.h
@@ -116,8 +116,13 @@ rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
/* More nodes added to the queue by other CPUs.
 * Wait until the next pointer is set.
 */
-   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL)
-   rte_pause();
+#ifdef RTE_ARCH_32
+   rte_wait_event((uint32_t *)&me->next, UINT32_MAX, ==, 0,
+   __ATOMIC_RELAXED);
+#else
+   rte_wait_event((uint64_t *)&me->next, UINT64_MAX, ==, 0,
+   __ATOMIC_RELAXED);
+#endif
}
 
/* Pass lock to next waiter. */
-- 
2.25.1



[dpdk-dev] [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-26 Thread Feifei Wang
Instead of polling for cbi->use to be updated, use wait event scheme.

Furthermore, delete 'const' for 'bpf_eth_cbi_wait'. This is because of
a compilation error:
---
../lib/eal/include/rte_common.h:36:13: error: read-only variable ‘value’
used as ‘asm’ output
   36 | #define asm __asm__
      | ^~~

../lib/eal/arm/include/rte_pause_64.h:66:3: note: in expansion of macro
‘asm’
   66 |   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
      |   ^~~

../lib/eal/arm/include/rte_pause_64.h:96:3: note: in expansion of macro
‘__LOAD_EXC_32’
   96 |   __LOAD_EXC_32((src), dst, memorder) \
      |   ^

../lib/eal/arm/include/rte_pause_64.h:167:4: note: in expansion of macro
‘__LOAD_EXC’
  167 |    __LOAD_EXC((addr), value, memorder, size) \
      |    ^~

../lib/bpf/bpf_pkt.c:125:3: note: in expansion of macro ‘rte_wait_event’
  125 |   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
---

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/bpf/bpf_pkt.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c
index 6e8248f0d6..213d44a75a 100644
--- a/lib/bpf/bpf_pkt.c
+++ b/lib/bpf/bpf_pkt.c
@@ -111,9 +111,9 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
  * Waits till datapath finished using given callback.
  */
 static void
-bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
+bpf_eth_cbi_wait(struct bpf_eth_cbi *cbi)
 {
-   uint32_t nuse, puse;
+   uint32_t puse;
 
/* make sure all previous loads and stores are completed */
rte_smp_mb();
@@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
 
/* in use, busy wait till current RX/TX iteration is finished */
if ((puse & BPF_ETH_CBI_INUSE) != 0) {
-   do {
-   rte_pause();
-   rte_compiler_barrier();
-   nuse = cbi->use;
-   } while (nuse == puse);
+   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
+   __ATOMIC_RELAXED);
}
 }
 
-- 
2.25.1



[dpdk-dev] [PATCH v5 5/5] lib/distributor: use wait event scheme

2021-10-26 Thread Feifei Wang
Instead of polling for bufptr64 to be updated, use
wait event for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/distributor/rte_distributor_single.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/distributor/rte_distributor_single.c 
b/lib/distributor/rte_distributor_single.c
index f4725b1d0b..d52b24a453 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -33,9 +33,8 @@ rte_distributor_request_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_GET_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on GET_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -74,9 +73,8 @@ rte_distributor_return_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
uint64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_RETURN_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on RETURN_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
-- 
2.25.1



[dpdk-dev] RE: [PATCH v5 1/5] eal: add new definitions for wait scheme

2021-10-26 Thread Feifei Wang
> -----Original Message-----
> From: Feifei Wang 
> Sent: Tuesday, October 26, 2021 4:02 PM
> To: Ruifeng Wang 
> Cc: dev@dpdk.org; nd ; Feifei Wang
> 
> Subject: [PATCH v5 1/5] eal: add new definitions for wait scheme
> 
> Introduce macros as generic interface for address monitoring.
> For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
> and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.
> 
> Furthermore, to prevent compilation warning in arm:
> --
> 'warning: implicit declaration of function ...'
> --
> Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and '__WFE'.
> 
> This is because the original macros are undefined at the end of the file.
> If new macro 'rte_wait_event' calls them in other files, they will be seen as
> 'not defined'.
> 
> Signed-off-by: Feifei Wang 
> Reviewed-by: Ruifeng Wang 
> ---
>  lib/eal/arm/include/rte_pause_64.h  | 135 
> lib/eal/include/generic/rte_pause.h |  27 ++
>  2 files changed, 105 insertions(+), 57 deletions(-)
> 
> diff --git a/lib/eal/arm/include/rte_pause_64.h
> b/lib/eal/arm/include/rte_pause_64.h
> index e87d10b8cc..1fea0dec63 100644
> --- a/lib/eal/arm/include/rte_pause_64.h
> +++ b/lib/eal/arm/include/rte_pause_64.h
> @@ -31,20 +31,12 @@ static inline void rte_pause(void)
>  /* Put processor into low power WFE(Wait For Event) state. */  #define
> __WFE() { asm volatile("wfe" : : : "memory"); }
> 
> -static __rte_always_inline void
> -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> - int memorder)
> -{
> - uint16_t value;
> -
> - assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> __ATOMIC_RELAXED);
> -
> - /*
> -  * Atomic exclusive load from addr, it returns the 16-bit content of
> -  * *addr while making it 'monitored',when it is written by someone
> -  * else, the 'monitored' state is cleared and a event is generated
> -  * implicitly to exit WFE.
> -  */
> +/*
> + * Atomic exclusive load from addr, it returns the 16-bit content of
> + * *addr while making it 'monitored', when it is written by someone
> + * else, the 'monitored' state is cleared and an event is generated
> + * implicitly to exit WFE.
> + */
>  #define __LOAD_EXC_16(src, dst, memorder) {   \
>   if (memorder == __ATOMIC_RELAXED) {   \
>   asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@ -58,6 +50,62
> @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
>   : "memory");  \
>   } }
> 
> +/*
> + * Atomic exclusive load from addr, it returns the 32-bit content of
> + * *addr while making it 'monitored', when it is written by someone
> + * else, the 'monitored' state is cleared and an event is generated
> + * implicitly to exit WFE.
> + */
> +#define __LOAD_EXC_32(src, dst, memorder) {  \
> + if (memorder == __ATOMIC_RELAXED) {  \
> + asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> + : [tmp] "=&r" (dst)  \
> + : [addr] "r"(src)\
> + : "memory"); \
> + } else { \
> + asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> + : [tmp] "=&r" (dst)  \
> + : [addr] "r"(src)\
> + : "memory"); \
> + } }
> +
> +/*
> + * Atomic exclusive load from addr, it returns the 64-bit content of
> + * *addr while making it 'monitored', when it is written by someone
> + * else, the 'monitored' state is cleared and an event is generated
> + * implicitly to exit WFE.
> + */
> +#define __LOAD_EXC_64(src, dst, memorder) {  \
> + if (memorder == __ATOMIC_RELAXED) {  \
> + asm volatile("ldxr %x[tmp], [%x[addr]]"  \
> + : [tmp] "=&r" (dst)  \
> + : [addr] "r"(src)\
> + : "memory"); \
> + } else { \
> + asm volatile("ldaxr %x[tmp], [%x[addr]]" \
> + : [tmp]

[dpdk-dev] RE: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-26 Thread Feifei Wang


> -----Original Message-----
> From: Feifei Wang 
> Sent: Tuesday, October 26, 2021 4:02 PM
> To: Konstantin Ananyev 
> Cc: dev@dpdk.org; nd ; Feifei Wang
> ; Ruifeng Wang 
> Subject: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration
> 
> Instead of polling for cbi->use to be updated, use wait event scheme.
> 
> Furthermore, delete 'const' for 'bpf_eth_cbi_wait'. This is because of a
> compilation error:
> ---
> ../lib/eal/include/rte_common.h:36:13: error: read-only variable ‘value’
> used as ‘asm’ output
>36 | #define asm __asm__
>   | ^~~
> 
> ../lib/eal/arm/include/rte_pause_64.h:66:3: note: in expansion of macro ‘asm’
>66 |   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
>   |   ^~~
> 
> ../lib/eal/arm/include/rte_pause_64.h:96:3: note: in expansion of macro
> ‘__LOAD_EXC_32’
>96 |   __LOAD_EXC_32((src), dst, memorder) \
>   |   ^
> 
> ../lib/eal/arm/include/rte_pause_64.h:167:4: note: in expansion of macro
> ‘__LOAD_EXC’
>   167 |__LOAD_EXC((addr), value, memorder, size) \
>   |^~
> 
> ../lib/bpf/bpf_pkt.c:125:3: note: in expansion of macro ‘rte_wait_event’
>   125 |   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> ---
> 
> Signed-off-by: Feifei Wang 
> Reviewed-by: Ruifeng Wang 
> ---
>  lib/bpf/bpf_pkt.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c index
> 6e8248f0d6..213d44a75a 100644
> --- a/lib/bpf/bpf_pkt.c
> +++ b/lib/bpf/bpf_pkt.c
> @@ -111,9 +111,9 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
>   * Waits till datapath finished using given callback.
>   */
>  static void
> -bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> +bpf_eth_cbi_wait(struct bpf_eth_cbi *cbi)

Hi, Konstantin

For this bpf patch, I deleted 'const', though this is contrary to what we
discussed earlier. This is because if we keep 'const' here and use the new
'rte_wait_event' macro, the compiler reports an error. And earlier the arm
version could not be compiled because I forgot to enable the "wfe" config in
the meson file, so this issue could not show up before.

>  {
> - uint32_t nuse, puse;
> + uint32_t puse;
> 
>   /* make sure all previous loads and stores are completed */
>   rte_smp_mb();
> @@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> 
>   /* in use, busy wait till current RX/TX iteration is finished */
>   if ((puse & BPF_ETH_CBI_INUSE) != 0) {
> - do {
> - rte_pause();
> - rte_compiler_barrier();
> - nuse = cbi->use;
> - } while (nuse == puse);
> + rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> + __ATOMIC_RELAXED);
>   }
>  }
> 
> --
> 2.25.1



[dpdk-dev] RE: [PATCH v5 1/5] eal: add new definitions for wait scheme

2021-10-26 Thread Feifei Wang


> -----Original Message-----
> From: Ananyev, Konstantin 
> Sent: Tuesday, October 26, 2021 5:59 PM
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; nd 
> Subject: RE: [PATCH v5 1/5] eal: add new definitions for wait scheme
> 
> 
> > > > Introduce macros as generic interface for address monitoring.
> > > > For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
> > > > and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.
> > > >
> > > > Furthermore, to prevent compilation warning in arm:
> > > > --
> > > > 'warning: implicit declaration of function ...'
> > > > --
> > > > Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and
> '__WFE'.
> > > >
> > > > This is because the original macros are undefined at the end of the file.
> > > > If new macro 'rte_wait_event' calls them in other files, they will
> > > > be seen as 'not defined'.
> > > >
> > > > Signed-off-by: Feifei Wang 
> > > > Reviewed-by: Ruifeng Wang 
> > > > ---
> > > >  lib/eal/arm/include/rte_pause_64.h  | 135
> > > >  lib/eal/include/generic/rte_pause.h |
> > > > 27 ++
> > > >  2 files changed, 105 insertions(+), 57 deletions(-)
> > > >
> > > > diff --git a/lib/eal/arm/include/rte_pause_64.h
> > > > b/lib/eal/arm/include/rte_pause_64.h
> > > > index e87d10b8cc..1fea0dec63 100644
> > > > --- a/lib/eal/arm/include/rte_pause_64.h
> > > > +++ b/lib/eal/arm/include/rte_pause_64.h
> > > > @@ -31,20 +31,12 @@ static inline void rte_pause(void)
> > > >  /* Put processor into low power WFE(Wait For Event) state. */
> > > > #define
> > > > __WFE() { asm volatile("wfe" : : : "memory"); }
> > > >
> > > > -static __rte_always_inline void
> > > > -rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > > -   int memorder)
> > > > -{
> > > > -   uint16_t value;
> > > > -
> > > > -   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > > > __ATOMIC_RELAXED);
> > > > -
> > > > -   /*
> > > > -* Atomic exclusive load from addr, it returns the 16-bit 
> > > > content of
> > > > -* *addr while making it 'monitored',when it is written by 
> > > > someone
> > > > -* else, the 'monitored' state is cleared and a event is 
> > > > generated
> > > > -* implicitly to exit WFE.
> > > > -*/
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 16-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and an event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > >  #define __LOAD_EXC_16(src, dst, memorder) {   \
> > > > if (memorder == __ATOMIC_RELAXED) {   \
> > > > asm volatile("ldxrh %w[tmp], [%x[addr]]"  \ @@ -58,6 
> > > > +50,62
> @@
> > > > rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > > > : "memory");  \
> > > > } }
> > > >
> > > > +/*
> > > > + * Atomic exclusive load from addr, it returns the 32-bit content
> > > > +of
> > > > + * *addr while making it 'monitored', when it is written by
> > > > +someone
> > > > + * else, the 'monitored' state is cleared and an event is
> > > > +generated
> > > > + * implicitly to exit WFE.
> > > > + */
> > > > +#define __LOAD_EXC_32(src, dst, memorder) {  \
> > > > +   if (memorder == __ATOMIC_RELAXED) {  \
> > > > +   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > > > +   : [tmp] "=&r" (dst)  \
> > > > +   : [addr] "r"(src)\
> > &g

[dpdk-dev] RE: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-27 Thread Feifei Wang


> -----Original Message-----
> From: dev  on behalf of Ananyev, Konstantin
> Sent: Tuesday, October 26, 2021 8:57 PM
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> ; nd 
> Subject: Re: [dpdk-dev] [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx
> iteration
> 
> 
> > Hi Feifei,
> >
> > > > Instead of polling for cbi->use to be updated, use wait event scheme.
> > > >
> > > > Furthermore, delete 'const' for 'bpf_eth_cbi_wait'. This is
> > > > because of a compilation error:
> > > > --
> > > > -
> > > > ../lib/eal/include/rte_common.h:36:13: error: read-only variable ‘value’
> > > > used as ‘asm’ output
> > > >36 | #define asm __asm__
> > > >   | ^~~
> > > >
> > > > ../lib/eal/arm/include/rte_pause_64.h:66:3: note: in expansion of
> macro ‘asm’
> > > >66 |   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > > >   |   ^~~
> > > >
> > > > ../lib/eal/arm/include/rte_pause_64.h:96:3: note: in expansion of
> > > > macro ‘__LOAD_EXC_32’
> > > >96 |   __LOAD_EXC_32((src), dst, memorder) \
> > > >   |   ^
> > > >
> > > > ../lib/eal/arm/include/rte_pause_64.h:167:4: note: in expansion of
> > > > macro ‘__LOAD_EXC’
> > > >   167 |__LOAD_EXC((addr), value, memorder, size) \
> > > >   |^~
> > > >
> > > > ../lib/bpf/bpf_pkt.c:125:3: note: in expansion of macro ‘rte_wait_event’
> > > >   125 |   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> > > > --
> > > > -
> > > >
> > > > Signed-off-by: Feifei Wang 
> > > > Reviewed-by: Ruifeng Wang 
> > > > ---
> > > >  lib/bpf/bpf_pkt.c | 11 ---
> > > >  1 file changed, 4 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c index
> > > > 6e8248f0d6..213d44a75a 100644
> > > > --- a/lib/bpf/bpf_pkt.c
> > > > +++ b/lib/bpf/bpf_pkt.c
> > > > @@ -111,9 +111,9 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
> > > >   * Waits till datapath finished using given callback.
> > > >   */
> > > >  static void
> > > > -bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> > > > +bpf_eth_cbi_wait(struct bpf_eth_cbi *cbi)
> > >
> > > Hi, Konstantin
> > >
> > > For this bpf patch, I deleted 'const', though this is contrary to
> > > what we discussed earlier. This is because if we keep 'const' here
> > > and use the new 'rte_wait_event' macro, the compiler will report an
> > > error. And earlier the arm version could not be compiled because I
> > > forgot to enable the "wfe" config in the meson file, so this issue
> > > could not show up before.
> >
> > Honestly, I don't understand why we have to remove perfectly valid 'const'
> qualifier here.
> > If this macro can't be used with pointers to const (still don't
> > understand why), then let's just not use this macro here.
> > Strictly speaking I don't see much benefit here from it.
> >
> > >
> > > >  {
> > > > -   uint32_t nuse, puse;
> > > > +   uint32_t puse;
> > > >
> > > > /* make sure all previous loads and stores are completed */
> > > > rte_smp_mb();
> > > > @@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi
> > > > *cbi)
> > > >
> > > > /* in use, busy wait till current RX/TX iteration is finished */
> > > > if ((puse & BPF_ETH_CBI_INUSE) != 0) {
> > > > -   do {
> > > > -   rte_pause();
> > > > -   rte_compiler_barrier();
> > > > -   nuse = cbi->use;
> > > > -   } while (nuse == puse);
> > > > +   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> > > > +   __ATOMIC_RELAXED);
> 
> After another thought, if we do type conversion at macro invocation time:
> 
> bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi) {
>   ...
>   rte_wait_event((uint32_t *)&cbi->use, UINT32_MAX, ==, puse,
> __ATOMIC_RELAXED);
> 
> would that help?

I try to with this and it will report compiler warning:
' cast discards ‘const’ qualifier'.
I think this is due to that in rte_wait_event macro, we use
typeof(*(addr)) value = 0;
 and value is defined as "const uint32_t",
but it should be able to be updated.

Furthermore, this reflects the limitations of the new macro, it cannot be 
applied
when 'addr' is type of 'const'. Finally, I think I should give up the change 
for "bpf".
> 
> 
> > > > }
> > > >  }
> > > >
> > > > --
> > > > 2.25.1



[dpdk-dev] RE: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Feifei Wang
> Sent: Wednesday, October 27, 2021 3:04 PM
> To: Ananyev, Konstantin 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> ; nd ; nd 
> Subject: RE: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration
> 
> 
> 
> > -----Original Message-----
> > From: dev  on behalf of Ananyev, Konstantin
> > Sent: Tuesday, October 26, 2021 8:57 PM
> > To: Feifei Wang 
> > Cc: dev@dpdk.org; nd ; Ruifeng Wang
> > ; nd 
> > Subject: Re: [dpdk-dev] [PATCH v5 4/5] lib/bpf: use wait event scheme for
> > Rx/Tx iteration
> >
> >
> > > Hi Feifei,
> > >
> > > > > Instead of polling for cbi->use to be updated, use wait event scheme.
> > > > >
> > > > > Furthermore, delete 'const' for 'bpf_eth_cbi_wait'. This is
> > > > > because of a compilation error:
> > > > > 
> > > > > --
> > > > > -
> > > > > ../lib/eal/include/rte_common.h:36:13: error: read-only variable
> ‘value’
> > > > > used as ‘asm’ output
> > > > >36 | #define asm __asm__
> > > > >   | ^~~
> > > > >
> > > > > ../lib/eal/arm/include/rte_pause_64.h:66:3: note: in expansion
> > > > > of
> > macro ‘asm’
> > > > >66 |   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > > > >   |   ^~~
> > > > >
> > > > > ../lib/eal/arm/include/rte_pause_64.h:96:3: note: in expansion
> > > > > of macro ‘__LOAD_EXC_32’
> > > > >96 |   __LOAD_EXC_32((src), dst, memorder) \
> > > > >   |   ^
> > > > >
> > > > > ../lib/eal/arm/include/rte_pause_64.h:167:4: note: in expansion
> > > > > of macro ‘__LOAD_EXC’
> > > > >   167 |__LOAD_EXC((addr), value, memorder, size) \
> > > > >   |^~
> > > > >
> > > > > ../lib/bpf/bpf_pkt.c:125:3: note: in expansion of macro
> ‘rte_wait_event’
> > > > >   125 |   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> > > > > 
> > > > > --
> > > > > -
> > > > >
> > > > > Signed-off-by: Feifei Wang 
> > > > > Reviewed-by: Ruifeng Wang 
> > > > > ---
> > > > >  lib/bpf/bpf_pkt.c | 11 ---
> > > > >  1 file changed, 4 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c index
> > > > > 6e8248f0d6..213d44a75a 100644
> > > > > --- a/lib/bpf/bpf_pkt.c
> > > > > +++ b/lib/bpf/bpf_pkt.c
> > > > > @@ -111,9 +111,9 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
> > > > >   * Waits till datapath finished using given callback.
> > > > >   */
> > > > >  static void
> > > > > -bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> > > > > +bpf_eth_cbi_wait(struct bpf_eth_cbi *cbi)
> > > >
> > > > Hi, Konstantin
> > > >
> > > > For this bpf patch, I deleted 'const', though this is contrary to
> > > > what we discussed earlier. This is because if we keep 'const' here
> > > > and use the new 'rte_wait_event' macro, the compiler will report an
> > > > error. And earlier the arm version could not be compiled because I
> > > > forgot to enable the "wfe" config in the meson file, so this issue
> > > > could not show up before.
> > >
> > >
> > > Honestly, I don't understand why we have to remove perfectly valid
> 'const'
> > qualifier here.
> > > If this macro can't be used with pointers to const (still don't
> > > understand why), then let's just not use this macro here.
> > > Strictly speaking I don't see much benefit here from it.
> > >
> > > >
> > > > >  {
> > > > > - uint32_t nuse, puse;
> > > > > + uint32_t puse;
> > > > >
> > > > >   /* make sure all previous loads and stores are completed */
> > > > >   rte_smp_mb();
> > > > > @@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi
> > > > > *cbi)

[dpdk-dev] [PATCH v6 0/4] add new definitions for wait scheme

2021-10-27 Thread Feifei Wang
Add new definitions for the wait scheme, and apply these new definitions in
the libs to replace rte_pause.

v2:
1. use macro to create new wait scheme (Stephen)

v3:
1. delete unnecessary bug fix in bpf (Konstantin)

v4:
1. put size into the macro body (Konstantin)
2. replace assert with BUILD_BUG_ON (Stephen)
3. delete unnecessary compiler barrier for bpf (Konstantin)

v5:
1. 'size' is no longer a parameter (Konstantin)
2. put () around macro parameters (Konstantin)
3. fix some original typo issues (Jerin)
4. swap 'rte_wait_event' parameter location (Jerin)
5. add new macro '__LOAD_EXC'
6. delete 'undef' to prevent compilation warning

v6:
1. fix patch style check warnings
2. delete the 'bpf' patch due to the 'const' limitation

Feifei Wang (4):
  eal: add new definitions for wait scheme
  eal: use wait event for read pflock
  eal: use wait event scheme for mcslock
  lib/distributor: use wait event scheme

 lib/distributor/rte_distributor_single.c |  10 +-
 lib/eal/arm/include/rte_pause_64.h   | 136 +--
 lib/eal/include/generic/rte_mcslock.h|   9 +-
 lib/eal/include/generic/rte_pause.h  |  28 +
 lib/eal/include/generic/rte_pflock.h |   4 +-
 5 files changed, 119 insertions(+), 68 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v6 1/4] eal: add new definitions for wait scheme

2021-10-27 Thread Feifei Wang
Introduce macros as generic interface for address monitoring.
For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.

Furthermore, to prevent compilation warning in arm:
--
'warning: implicit declaration of function ...'
--
Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and '__WFE'.

This is because the original macros are undefined at the end of the file.
If new macro 'rte_wait_event' calls them in other files, they will be
seen as 'not defined'.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/arm/include/rte_pause_64.h  | 136 
 lib/eal/include/generic/rte_pause.h |  28 ++
 2 files changed, 107 insertions(+), 57 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index e87d10b8cc..87df224ac1 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -31,20 +31,12 @@ static inline void rte_pause(void)
 /* Put processor into low power WFE(Wait For Event) state. */
 #define __WFE() { asm volatile("wfe" : : : "memory"); }
 
-static __rte_always_inline void
-rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-   int memorder)
-{
-   uint16_t value;
-
-   assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
-
-   /*
-* Atomic exclusive load from addr, it returns the 16-bit content of
-* *addr while making it 'monitored',when it is written by someone
-* else, the 'monitored' state is cleared and a event is generated
-* implicitly to exit WFE.
-*/
+/*
+ * Atomic exclusive load from addr, it returns the 16-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
 #define __LOAD_EXC_16(src, dst, memorder) {   \
if (memorder == __ATOMIC_RELAXED) {   \
asm volatile("ldxrh %w[tmp], [%x[addr]]"  \
@@ -58,6 +50,62 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
: "memory");  \
} }
 
+/*
+ * Atomic exclusive load from addr, it returns the 32-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_32(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+/*
+ * Atomic exclusive load from addr, it returns the 64-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __LOAD_EXC_64(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+#define __LOAD_EXC(src, dst, memorder, size) {  \
+   assert(size == 16 || size == 32 || size == 64); \
+   if (size == 16) \
+   __LOAD_EXC_16(src, dst, memorder)   \
+   else if (size == 32)\
+   __LOAD_EXC_32(src, dst, memorder)   \
+   else if (size == 64)\
+   

[dpdk-dev] [PATCH v6 2/4] eal: use wait event for read pflock

2021-10-27 Thread Feifei Wang
Instead of polling for read pflock update, use wait event scheme for
this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_pflock.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/eal/include/generic/rte_pflock.h 
b/lib/eal/include/generic/rte_pflock.h
index e57c179ef2..7573b036bf 100644
--- a/lib/eal/include/generic/rte_pflock.h
+++ b/lib/eal/include/generic/rte_pflock.h
@@ -121,9 +121,7 @@ rte_pflock_read_lock(rte_pflock_t *pf)
return;
 
/* Wait for current write phase to complete. */
-   while ((__atomic_load_n(&pf->rd.in, __ATOMIC_ACQUIRE)
-   & RTE_PFLOCK_WBITS) == w)
-   rte_pause();
+   rte_wait_event(&pf->rd.in, RTE_PFLOCK_WBITS, ==, w, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.25.1



[dpdk-dev] [PATCH v6 3/4] eal: use wait event scheme for mcslock

2021-10-27 Thread Feifei Wang
Instead of polling for mcslock to be updated, use wait event scheme
for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_mcslock.h | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/lib/eal/include/generic/rte_mcslock.h 
b/lib/eal/include/generic/rte_mcslock.h
index 34f33c64a5..806a2b2c7e 100644
--- a/lib/eal/include/generic/rte_mcslock.h
+++ b/lib/eal/include/generic/rte_mcslock.h
@@ -116,8 +116,13 @@ rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
/* More nodes added to the queue by other CPUs.
 * Wait until the next pointer is set.
 */
-   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL)
-   rte_pause();
+#ifdef RTE_ARCH_32
+   rte_wait_event((uint32_t *)&me->next, UINT32_MAX, ==, 0,
+   __ATOMIC_RELAXED);
+#else
+   rte_wait_event((uint64_t *)&me->next, UINT64_MAX, ==, 0,
+   __ATOMIC_RELAXED);
+#endif
}
 
/* Pass lock to next waiter. */
-- 
2.25.1



[dpdk-dev] [PATCH v6 4/4] lib/distributor: use wait event scheme

2021-10-27 Thread Feifei Wang
Instead of polling for bufptr64 to be updated, use
wait event for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/distributor/rte_distributor_single.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/distributor/rte_distributor_single.c 
b/lib/distributor/rte_distributor_single.c
index f4725b1d0b..d52b24a453 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -33,9 +33,8 @@ rte_distributor_request_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_GET_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on GET_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -74,9 +73,8 @@ rte_distributor_return_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
uint64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_RETURN_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on RETURN_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
-- 
2.25.1
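
A note on the memory orderings above: the wait itself can be
__ATOMIC_RELAXED because the worker only polls the flag bits, while the
following store is __ATOMIC_RELEASE so that, as the "Sync with
distributor" comments indicate, it can pair with an acquire load on the
distributor side. A sketch of the intended pairing (distributor side
abbreviated and assumed, not quoted from this patch):

/* worker: wait for the flags to clear, then publish the request */
rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
		!=, 0, __ATOMIC_RELAXED);
__atomic_store_n(&buf->bufptr64, req, __ATOMIC_RELEASE);

/* distributor: the acquire load observes req and everything the
 * worker wrote before its release store */
int64_t v = __atomic_load_n(&buf->bufptr64, __ATOMIC_ACQUIRE);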



[dpdk-dev] Re: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Ananyev, Konstantin 
> Sent: Wednesday, October 27, 2021 10:48 PM
> To: Feifei Wang 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> ; nd ; nd 
> Subject: RE: [PATCH v5 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration
> 
> 
> 
> >
> > > -----Original Message-----
> > > From: dev  on behalf of Ananyev, Konstantin
> > > Sent: Tuesday, October 26, 2021 8:57 PM
> > > To: Feifei Wang 
> > > Cc: dev@dpdk.org; nd ; Ruifeng Wang
> > > ; nd 
> > > Subject: Re: [dpdk-dev] [PATCH v5 4/5] lib/bpf: use wait event scheme for
> > > Rx/Tx iteration
> > >
> > >
> > > > Hi Feifei,
> > > >
> > > > > > Instead of polling for cbi->use to be updated, use wait event 
> > > > > > scheme.
> > > > > >
> > > > > > Furthermore, delete 'const' for 'bpf_eth_cbi_wait'. This is
> > > > > > because of a compilation error:
> > > > > > ----------------------------------------------------------------
> > > > > > ../lib/eal/include/rte_common.h:36:13: error: read-only variable
> > > > > > ‘value’ used as ‘asm’ output
> > > > > >36 | #define asm __asm__
> > > > > >   | ^~~
> > > > > >
> > > > > > ../lib/eal/arm/include/rte_pause_64.h:66:3: note: in expansion
> > > > > > of macro ‘asm’
> > > > > >66 |   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > > > > >   |   ^~~
> > > > > >
> > > > > > ../lib/eal/arm/include/rte_pause_64.h:96:3: note: in expansion
> > > > > > of macro ‘__LOAD_EXC_32’
> > > > > >96 |   __LOAD_EXC_32((src), dst, memorder) \
> > > > > >   |   ^
> > > > > >
> > > > > > ../lib/eal/arm/include/rte_pause_64.h:167:4: note: in
> > > > > > expansion of macro ‘__LOAD_EXC’
> > > > > >   167 |__LOAD_EXC((addr), value, memorder, size) \
> > > > > >   |^~
> > > > > >
> > > > > > ../lib/bpf/bpf_pkt.c:125:3: note: in expansion of macro
> > > > > > ‘rte_wait_event’
> > > > > >   125 |   rte_wait_event(&cbi->use, UINT32_MAX, ==, puse,
> > > > > > ----------------------------------------------------------------
> > > > > >
> > > > > > Signed-off-by: Feifei Wang 
> > > > > > Reviewed-by: Ruifeng Wang 
> > > > > > ---
> > > > > >  lib/bpf/bpf_pkt.c | 11 ---
> > > > > >  1 file changed, 4 insertions(+), 7 deletions(-)
> > > > > >
> > > > > > diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c index
> > > > > > 6e8248f0d6..213d44a75a 100644
> > > > > > --- a/lib/bpf/bpf_pkt.c
> > > > > > +++ b/lib/bpf/bpf_pkt.c
> > > > > > @@ -111,9 +111,9 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
> > > > > >   * Waits till datapath finished using given callback.
> > > > > >   */
> > > > > >  static void
> > > > > > -bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
> > > > > > +bpf_eth_cbi_wait(struct bpf_eth_cbi *cbi)
> > > > >
> > > > > Hi, Konstantin
> > > > >
> > > > > For this bpf patch, I delete 'const' though this is contrary to
> > > > > what we discussed earlier. This is because if we keep 'const'
> > > > > here and use the new 'rte_wait_event' macro, the compiler will
> > > > > report an error. And earlier the arm version could not be
> > > > > compiled because I forgot to enable the "wfe" config in the
> > > > > meson file, so this issue did not appear before.
> > > >
> > > > Honestly, I don't understand why we have to remove perfectly valid
> 'const'
> > > qualifier here.
> > > > If this macro can't be used with pointers to const (still don't
> > > > understand why), then let's just not use this macro here.
> > > > Strictly speaking I don't see much benefit here from it.
> > > >
> > > >
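
For readers following the 'const' question: the failure mode is
presumably that the macro derives its temporary variable from the
pointed-to type, so a const-qualified pointer yields a const temporary,
which cannot be used as an inline-asm output. A hypothetical reduction
(names illustrative, matching the error log quoted above):

const uint32_t *addr;	/* e.g. &cbi->use when cbi is const-qualified */
typeof(*addr) value;	/* deduced as 'const uint32_t' */
asm volatile("ldaxr %w[tmp], [%x[addr]]"
		: [tmp] "=&r" (value)	/* error: read-only variable as asm output */
		: [addr] "r" (addr)
		: "memory");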

[dpdk-dev] Re: [PATCH v6 3/4] eal: use wait event scheme for mcslock

2021-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Mattias Rönnblom 
> Sent: Wednesday, October 27, 2021 7:16 PM
> To: Feifei Wang ; Honnappa Nagarahalli
> 
> Cc: dev@dpdk.org; nd ; Ruifeng Wang
> 
> Subject: Re: [dpdk-dev] [PATCH v6 3/4] eal: use wait event scheme for mcslock
> 
> On 2021-10-27 10:10, Feifei Wang wrote:
> > Instead of polling for mcslock to be updated, use wait event scheme
> > for this case.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >   lib/eal/include/generic/rte_mcslock.h | 9 +++--
> >   1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/eal/include/generic/rte_mcslock.h
> > b/lib/eal/include/generic/rte_mcslock.h
> > index 34f33c64a5..806a2b2c7e 100644
> > --- a/lib/eal/include/generic/rte_mcslock.h
> > +++ b/lib/eal/include/generic/rte_mcslock.h
> > @@ -116,8 +116,13 @@ rte_mcslock_unlock(rte_mcslock_t **msl,
> rte_mcslock_t *me)
> > /* More nodes added to the queue by other CPUs.
> >  * Wait until the next pointer is set.
> >  */
> > -   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) ==
> NULL)
> > -   rte_pause();
> > +#ifdef RTE_ARCH_32
> > +   rte_wait_event((uint32_t *)&me->next, UINT32_MAX, ==, 0,
> > +   __ATOMIC_RELAXED);
> > +#else
> > +   rte_wait_event((uint64_t *)&me->next, UINT64_MAX, ==, 0,
> > +   __ATOMIC_RELAXED);
> > +#endif
> > }
> >
> > /* Pass lock to next waiter. */
> 
> You could do something like
> 
> rte_wait_event((uintptr_t *)&me->next, UINTPTR_MAX, ==, 0,
> __ATOMIC_RELAXED);
> 
> and avoid the #ifdef.
Good comment, it can fix the problem. Thanks for this suggestion.



[dpdk-dev] Re: [PATCH v6 0/4] add new definitions for wait scheme

2021-10-27 Thread Feifei Wang


> -----Original Message-----
> From: Jerin Jacob 
> Sent: Wednesday, October 27, 2021 6:58 PM
> To: Feifei Wang ; Ananyev, Konstantin
> ; Stephen Hemminger
> ; David Marchand
> ; tho...@monjalon.net
> Cc: dpdk-dev ; nd 
> Subject: Re: [dpdk-dev] [PATCH v6 0/4] add new definitions for wait scheme
> 
> On Wed, Oct 27, 2021 at 1:40 PM Feifei Wang 
> wrote:
> >
> > Add new definitions for wait scheme, and apply these new definitions
> > in the libraries to replace rte_pause.
> >
> > v2:
> > 1. use macro to create new wait scheme (Stephen)
> >
> > v3:
> > 1. delete unnecessary bug fix in bpf (Konstantin)
> >
> > v4:
> > 1. put size into the macro body (Konstantin)
> > 2. replace assert with BUILD_BUG_ON (Stephen)
> > 3. delete unnecessary compiler barrier for bpf (Konstantin)
> >
> > v5:
> > 1. 'size' is not the parameter (Konstantin)
> > 2. put () around macro parameters (Konstantin)
> > 3. fix some original typo issue (Jerin)
> > 4. swap 'rte_wait_event' parameter location (Jerin)
> > 5. add new macro '__LOAD_EXC'
> > 6. delete 'undef' to prevent compilation warning
> 
> + David, Konstantin, Stephen,
> 
> Please make it a practice to add existing reviewers.
That's Ok.
> 
> Undef'ing the local macro may result in conflicts with other libraries.
> Please add __RTE_ARM_ to the existing macros (mark as internal) to fix the
> namespace if we are taking that path.
Thanks for the comments, I will update this in the next version.
> 
> >
> > v6:
> > 1. fix patch style check warning
> > 2. delete 'bpf' patch due to 'const' limit
> >
> > Feifei Wang (4):
> >   eal: add new definitions for wait scheme
> >   eal: use wait event for read pflock
> >   eal: use wait event scheme for mcslock
> >   lib/distributor: use wait event scheme
> >
> >  lib/distributor/rte_distributor_single.c |  10 +-
> >  lib/eal/arm/include/rte_pause_64.h   | 136 +--
> >  lib/eal/include/generic/rte_mcslock.h|   9 +-
> >  lib/eal/include/generic/rte_pause.h  |  28 +
> >  lib/eal/include/generic/rte_pflock.h |   4 +-
> >  5 files changed, 119 insertions(+), 68 deletions(-)
> >
> > --
> > 2.25.1
> >


[dpdk-dev] [PATCH v7 0/5] add new definitions for wait scheme

2021-10-27 Thread Feifei Wang
Add new definitions for wait scheme, and apply these new definitions
in the libraries to replace rte_pause.

v2:
1. use macro to create new wait scheme (Stephen)

v3:
1. delete unnecessary bug fix in bpf (Konstantin)

v4:
1. put size into the macro body (Konstantin)
2. replace assert with BUILD_BUG_ON (Stephen)
3. delete unnecessary compiler barrier for bpf (Konstantin)

v5:
1. 'size' is not the parameter (Konstantin)
2. put () around macro parameters (Konstantin)
3. fix some original typo issue (Jerin)
4. swap 'rte_wait_event' parameter location (Jerin)
5. add new macro '__LOAD_EXC'
6. delete 'undef' to prevent compilation warning

v6:
1. fix patch style check warning
2. delete 'bpf' patch due to 'const' limit

v7:
1. add __RTE_ARM prefix to fix the namespace (Jerin)
2. use 'uintptr_t *' in mcslock for different
architectures (32/64) (Mattias)
3. add a new pointer 'next' in mcslock to fix
a compiler issue
4. add bpf patch and use 'uintptr_t' to fix the const
discard warning (Konstantin)

Feifei Wang (5):
  eal: add new definitions for wait scheme
  eal: use wait event for read pflock
  eal: use wait event scheme for mcslock
  lib/bpf: use wait event scheme for Rx/Tx iteration
  lib/distributor: use wait event scheme

 lib/bpf/bpf_pkt.c|   9 +-
 lib/distributor/rte_distributor_single.c |  10 +-
 lib/eal/arm/include/rte_pause_64.h   | 166 +--
 lib/eal/include/generic/rte_mcslock.h|   5 +-
 lib/eal/include/generic/rte_pause.h  |  28 
 lib/eal/include/generic/rte_pflock.h |   4 +-
 6 files changed, 133 insertions(+), 89 deletions(-)

-- 
2.25.1



[dpdk-dev] [PATCH v7 1/5] eal: add new definitions for wait scheme

2021-10-27 Thread Feifei Wang
Introduce macros as a generic interface for address monitoring.
For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.

Furthermore, to prevent a compilation warning on Arm:
--
'warning: implicit declaration of function ...'
--
Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and '__WFE'.
And add a ‘__RTE_ARM’ prefix to these macros to fix the namespace.

This is because the original macros are undefined at the end of the file.
If the new macro 'rte_wait_event' calls them in other files, they will be
seen as 'not defined'.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/arm/include/rte_pause_64.h  | 166 
 lib/eal/include/generic/rte_pause.h |  28 +
 2 files changed, 122 insertions(+), 72 deletions(-)

diff --git a/lib/eal/arm/include/rte_pause_64.h 
b/lib/eal/arm/include/rte_pause_64.h
index e87d10b8cc..d547226a8d 100644
--- a/lib/eal/arm/include/rte_pause_64.h
+++ b/lib/eal/arm/include/rte_pause_64.h
@@ -26,26 +26,18 @@ static inline void rte_pause(void)
 #ifdef RTE_WAIT_UNTIL_EQUAL_ARCH_DEFINED
 
 /* Send an event to quit WFE. */
-#define __SEVL() { asm volatile("sevl" : : : "memory"); }
+#define __RTE_ARM_SEVL() { asm volatile("sevl" : : : "memory"); }
 
 /* Put processor into low power WFE(Wait For Event) state. */
-#define __WFE() { asm volatile("wfe" : : : "memory"); }
+#define __RTE_ARM_WFE() { asm volatile("wfe" : : : "memory"); }
 
-static __rte_always_inline void
-rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
-   int memorder)
-{
-   uint16_t value;
-
-   assert(memorder == __ATOMIC_ACQUIRE || memorder == __ATOMIC_RELAXED);
-
-   /*
-* Atomic exclusive load from addr, it returns the 16-bit content of
-* *addr while making it 'monitored',when it is written by someone
-* else, the 'monitored' state is cleared and a event is generated
-* implicitly to exit WFE.
-*/
-#define __LOAD_EXC_16(src, dst, memorder) {   \
+/*
+ * Atomic exclusive load from addr, it returns the 16-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __RTE_ARM_LOAD_EXC_16(src, dst, memorder) {   \
if (memorder == __ATOMIC_RELAXED) {   \
asm volatile("ldxrh %w[tmp], [%x[addr]]"  \
: [tmp] "=&r" (dst)   \
@@ -58,15 +50,70 @@ rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t 
expected,
: "memory");  \
} }
 
-   __LOAD_EXC_16(addr, value, memorder)
+/*
+ * Atomic exclusive load from addr, it returns the 32-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __RTE_ARM_LOAD_EXC_32(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }
+
+/*
+ * Atomic exclusive load from addr, it returns the 64-bit content of
+ * *addr while making it 'monitored', when it is written by someone
+ * else, the 'monitored' state is cleared and an event is generated
+ * implicitly to exit WFE.
+ */
+#define __RTE_ARM_LOAD_EXC_64(src, dst, memorder) {  \
+   if (memorder == __ATOMIC_RELAXED) {  \
+   asm volatile("ldxr %x[tmp], [%x[addr]]"  \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } else { \
+   asm volatile("ldaxr %x[tmp], [%x[addr]]" \
+   : [tmp] "=&r" (dst)  \
+   : [addr] "r"(src)\
+   : "memory"); \
+   } }

[dpdk-dev] [PATCH v7 2/5] eal: use wait event for read pflock

2021-10-27 Thread Feifei Wang
Instead of polling for read pflock update, use wait event scheme for
this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_pflock.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/lib/eal/include/generic/rte_pflock.h 
b/lib/eal/include/generic/rte_pflock.h
index e57c179ef2..7573b036bf 100644
--- a/lib/eal/include/generic/rte_pflock.h
+++ b/lib/eal/include/generic/rte_pflock.h
@@ -121,9 +121,7 @@ rte_pflock_read_lock(rte_pflock_t *pf)
return;
 
/* Wait for current write phase to complete. */
-   while ((__atomic_load_n(&pf->rd.in, __ATOMIC_ACQUIRE)
-   & RTE_PFLOCK_WBITS) == w)
-   rte_pause();
+   rte_wait_event(&pf->rd.in, RTE_PFLOCK_WBITS, ==, w, __ATOMIC_ACQUIRE);
 }
 
 /**
-- 
2.25.1



[dpdk-dev] [PATCH v7 3/5] eal: use wait event scheme for mcslock

2021-10-27 Thread Feifei Wang
Instead of polling for mcslock to be updated, use wait event scheme
for this case.

Furthermore, 'uintptr_t *' is used to handle the different pointer
sizes on 32/64-bit architectures.

And define a new pointer 'next' to avoid the compilation error:
---
'dereferencing type-punned pointer will break strict-aliasing rules'
---

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/eal/include/generic/rte_mcslock.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/eal/include/generic/rte_mcslock.h 
b/lib/eal/include/generic/rte_mcslock.h
index 34f33c64a5..d5b9b293cd 100644
--- a/lib/eal/include/generic/rte_mcslock.h
+++ b/lib/eal/include/generic/rte_mcslock.h
@@ -116,8 +116,9 @@ rte_mcslock_unlock(rte_mcslock_t **msl, rte_mcslock_t *me)
/* More nodes added to the queue by other CPUs.
 * Wait until the next pointer is set.
 */
-   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) == NULL)
-   rte_pause();
+   uintptr_t *next = NULL;
+   next = (uintptr_t *)&me->next;
+   rte_wait_event(next, UINTPTR_MAX, ==, 0, __ATOMIC_RELAXED);
}
 
/* Pass lock to next waiter. */
-- 
2.25.1
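
A short note on the 'type-punned pointer' workaround above: presumably
GCC's strict-aliasing diagnostic fires on an immediate
cast-and-dereference such as *(uintptr_t *)&me->next inside the macro
expansion, while routing the cast through a named pointer first is
enough to silence it here; the access pattern itself is unchanged.
Illustrative comparison (not from the patch):

/* would warn: dereferencing type-punned pointer breaks strict-aliasing */
/* rte_wait_event((uintptr_t *)&me->next, UINTPTR_MAX, ==, 0, ...); */

/* accepted: cast first, then hand the typed pointer to the macro */
uintptr_t *next = (uintptr_t *)&me->next;
rte_wait_event(next, UINTPTR_MAX, ==, 0, __ATOMIC_RELAXED);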



[dpdk-dev] [PATCH v7 4/5] lib/bpf: use wait event scheme for Rx/Tx iteration

2021-10-27 Thread Feifei Wang
Instead of polling for cbi->use to be updated, use wait event scheme.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/bpf/bpf_pkt.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/lib/bpf/bpf_pkt.c b/lib/bpf/bpf_pkt.c
index 6e8248f0d6..c8a1cd1eb8 100644
--- a/lib/bpf/bpf_pkt.c
+++ b/lib/bpf/bpf_pkt.c
@@ -113,7 +113,7 @@ bpf_eth_cbi_unuse(struct bpf_eth_cbi *cbi)
 static void
 bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
 {
-   uint32_t nuse, puse;
+   uint32_t puse;
 
/* make sure all previous loads and stores are completed */
rte_smp_mb();
@@ -122,11 +122,8 @@ bpf_eth_cbi_wait(const struct bpf_eth_cbi *cbi)
 
/* in use, busy wait till current RX/TX iteration is finished */
if ((puse & BPF_ETH_CBI_INUSE) != 0) {
-   do {
-   rte_pause();
-   rte_compiler_barrier();
-   nuse = cbi->use;
-   } while (nuse == puse);
+   rte_wait_event((uint32_t *)(uintptr_t)&cbi->use, UINT32_MAX,
+   ==, puse, __ATOMIC_RELAXED);
}
 }
 
-- 
2.25.1
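
The (uint32_t *)(uintptr_t)&cbi->use double cast above is what lets
bpf_eth_cbi_wait() keep its 'const' parameter: round-tripping through
uintptr_t strips the qualifier without the direct cast-away-const
warning. A minimal illustration, with names following the patch above:

const uint32_t *cp = &cbi->use;			/* const view of the counter */
uint32_t *p = (uint32_t *)(uintptr_t)cp;	/* const dropped via the integer */
rte_wait_event(p, UINT32_MAX, ==, puse, __ATOMIC_RELAXED);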



[dpdk-dev] [PATCH v7 5/5] lib/distributor: use wait event scheme

2021-10-27 Thread Feifei Wang
Instead of polling for bufptr64 to be updated, use
wait event for this case.

Signed-off-by: Feifei Wang 
Reviewed-by: Ruifeng Wang 
---
 lib/distributor/rte_distributor_single.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/lib/distributor/rte_distributor_single.c 
b/lib/distributor/rte_distributor_single.c
index f4725b1d0b..d52b24a453 100644
--- a/lib/distributor/rte_distributor_single.c
+++ b/lib/distributor/rte_distributor_single.c
@@ -33,9 +33,8 @@ rte_distributor_request_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
int64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_GET_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on GET_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
@@ -74,9 +73,8 @@ rte_distributor_return_pkt_single(struct 
rte_distributor_single *d,
union rte_distributor_buffer_single *buf = &d->bufs[worker_id];
uint64_t req = (((int64_t)(uintptr_t)oldpkt) << RTE_DISTRIB_FLAG_BITS)
| RTE_DISTRIB_RETURN_BUF;
-   while (unlikely(__atomic_load_n(&buf->bufptr64, __ATOMIC_RELAXED)
-   & RTE_DISTRIB_FLAGS_MASK))
-   rte_pause();
+   rte_wait_event(&buf->bufptr64, RTE_DISTRIB_FLAGS_MASK,
+   !=, 0, __ATOMIC_RELAXED);
 
/* Sync with distributor on RETURN_BUF flag. */
__atomic_store_n(&(buf->bufptr64), req, __ATOMIC_RELEASE);
-- 
2.25.1



[dpdk-dev] Re: [PATCH v7 3/5] eal: use wait event scheme for mcslock

2021-10-28 Thread Feifei Wang


> -----Original Message-----
> From: dev  on behalf of Jerin Jacob
> Sent: Thursday, October 28, 2021 3:02 PM
> To: Feifei Wang 
> Cc: Honnappa Nagarahalli ; dpdk-dev
> ; nd ; Ananyev, Konstantin
> ; Stephen Hemminger
> ; David Marchand
> ; tho...@monjalon.net; Mattias Rönnblom
> ; Ruifeng Wang 
> Subject: Re: [dpdk-dev] [PATCH v7 3/5] eal: use wait event scheme for mcslock
> 
> On Thu, Oct 28, 2021 at 12:27 PM Feifei Wang 
> wrote:
> >
> > Instead of polling for mcslock to be updated, use wait event scheme
> > for this case.
> >
> > Furthermore, 'uintptr_t *' is used to handle the different pointer
> > sizes on 32/64-bit architectures.
> >
> > And define a new pointer 'next' to avoid the compilation error:
> > ---
> > 'dereferencing type-punned pointer will break strict-aliasing rules'
> > ---
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> >  lib/eal/include/generic/rte_mcslock.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/eal/include/generic/rte_mcslock.h
> > b/lib/eal/include/generic/rte_mcslock.h
> > index 34f33c64a5..d5b9b293cd 100644
> > --- a/lib/eal/include/generic/rte_mcslock.h
> > +++ b/lib/eal/include/generic/rte_mcslock.h
> > @@ -116,8 +116,9 @@ rte_mcslock_unlock(rte_mcslock_t **msl,
> rte_mcslock_t *me)
> > /* More nodes added to the queue by other CPUs.
> >  * Wait until the next pointer is set.
> >  */
> > -   while (__atomic_load_n(&me->next, __ATOMIC_RELAXED) ==
> NULL)
> > -   rte_pause();
> > +   uintptr_t *next = NULL;
> 
> It is going to be updated on the next line. Why the explicit NULL assignment?
You are right, it is unnecessary to initialize it as NULL. I will update this.
> 
> > +   next = (uintptr_t *)&me->next;
> > +   rte_wait_event(next, UINTPTR_MAX, ==, 0,
> > + __ATOMIC_RELAXED);
> > }
> >
> > /* Pass lock to next waiter. */
> > --
> > 2.25.1
> >


[dpdk-dev] Re: [PATCH v7 1/5] eal: add new definitions for wait scheme

2021-10-28 Thread Feifei Wang


> -----Original Message-----
> From: Jerin Jacob 
> Sent: Thursday, October 28, 2021 3:16 PM
> To: Feifei Wang 
> Cc: Ruifeng Wang ; dpdk-dev ;
> nd ; Ananyev, Konstantin ;
> Stephen Hemminger ; David Marchand
> ; tho...@monjalon.net; Mattias Rönnblom
> 
> Subject: Re: [PATCH v7 1/5] eal: add new definitions for wait scheme
> 
> On Thu, Oct 28, 2021 at 12:26 PM Feifei Wang 
> wrote:
> >
> > Introduce macros as a generic interface for address monitoring.
> > For different sizes, encapsulate '__LOAD_EXC_16', '__LOAD_EXC_32'
> > and '__LOAD_EXC_64' into a new macro '__LOAD_EXC'.
> >
> > Furthermore, to prevent a compilation warning on Arm:
> > --
> > 'warning: implicit declaration of function ...'
> > --
> > Delete 'undef' constructions for '__LOAD_EXC_xx', '__SEVL' and '__WFE'.
> > And add a ‘__RTE_ARM’ prefix to these macros to fix the namespace.
> >
> > This is because the original macros are undefined at the end of the file.
> > If the new macro 'rte_wait_event' calls them in other files, they will be
> > seen as 'not defined'.
> >
> > Signed-off-by: Feifei Wang 
> > Reviewed-by: Ruifeng Wang 
> > ---
> 
> > +static __rte_always_inline void
> > +rte_wait_until_equal_16(volatile uint16_t *addr, uint16_t expected,
> > +   int memorder)
> > +{
> > +   uint16_t value;
> > +
> > +   assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > + __ATOMIC_RELAXED);
> 
> Assert is not good in a library. Why not RTE_BUILD_BUG_ON here?
[Feifei] This line is from the original code and has nothing to do with this
patch; I can change it in the next version.
> 
> 
> > +
> > +   __RTE_ARM_LOAD_EXC_16(addr, value, memorder)
> > if (value != expected) {
> > -   __SEVL()
> > +__RTE_ARM_SEVL()
> > do {
> > -   __WFE()
> > -   __LOAD_EXC_16(addr, value, memorder)
> > +   __RTE_ARM_WFE()
> > +   __RTE_ARM_LOAD_EXC_16(addr, value, memorder)
> > } while (value != expected);
> > }
> > -#undef __LOAD_EXC_16
> >  }
> >
> >  static __rte_always_inline void
> > @@ -77,34 +124,14 @@ rte_wait_until_equal_32(volatile uint32_t *addr,
> > uint32_t expected,
> >
> > assert(memorder == __ATOMIC_ACQUIRE || memorder ==
> > __ATOMIC_RELAXED);
> >
> > -   /*
> > -* Atomic exclusive load from addr, it returns the 32-bit content of
> > -* *addr while making it 'monitored',when it is written by someone
> > -* else, the 'monitored' state is cleared and a event is generated
> > -* implicitly to exit WFE.
> > -*/
> > -#define __LOAD_EXC_32(src, dst, memorder) {  \
> > -   if (memorder == __ATOMIC_RELAXED) {  \
> > -   asm volatile("ldxr %w[tmp], [%x[addr]]"  \
> > -   : [tmp] "=&r" (dst)  \
> > -   : [addr] "r"(src)\
> > -   : "memory"); \
> > -   } else { \
> > -   asm volatile("ldaxr %w[tmp], [%x[addr]]" \
> > -   : [tmp] "=&r" (dst)  \
> > -   : [addr] "r"(src)\
> > -   : "memory"); \
> > -   } }
> > -
> > -   __LOAD_EXC_32(addr, value, memorder)
> > +   __RTE_ARM_LOAD_EXC_32(addr, value, memorder)
> > if (value != expected) {
> > -   __SEVL()
> > +   __RTE_ARM_SEVL()
> > do {
> > -   __WFE()
> > -   __LOAD_EXC_32(addr, value, memorder)
> > +   __RTE_ARM_WFE()
> > +   __RTE_ARM_LOAD_EXC_32(addr, value, memorder)
> > } while (value != expected);
> > }
> > -#undef __LOAD_EXC_32
> >  }
> >
> >  static __rte_always_inline void
> > @@ -115,38 +142,33 @@ rte_wait_until_equal_64(volatile uint64_t *addr,
> > uint64_t expected,
> >
> > assert(memorder == __ATOMIC_ACQUIRE || memord
