from:"Spike Du"

[PATCH] app/testpmd: fix testpmd crash when quit with mlx5 avail_thresh enabled

2022-10-23 Thread Spike Du

When testpmd quits with mlx5 avail_thresh enabled, an rte timer handler
is still pending to reconfigure the Rx queue and re-arm this event, while
at the same time testpmd is destroying the Rx queues.
This is never a valid use case for mlx5 avail_thresh. Before testpmd quits,
the user should disable the avail_thresh configuration so that the events
are no longer handled. This is documented in the mlx5 driver guide.

To avoid the crash in this use case, check the port status: if it is not
RTE_PORT_STARTED, don't process the avail_thresh event.

Fixes: 0edfc9b08316 ("app/testpmd: add Host Shaper command")

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5_testpmd.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
index ed84583..1a9ec78 100644
--- a/drivers/net/mlx5/mlx5_testpmd.c
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -25,6 +25,7 @@
 
 static uint8_t host_shaper_avail_thresh_triggered[RTE_MAX_ETHPORTS];
#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
+extern struct rte_port *ports;
 
 /**
  * Disable the host shaper and re-arm available descriptor threshold event.
@@ -39,7 +40,15 @@
uint16_t port_id = port_rxq_id & 0xffff;
uint16_t qid = (port_rxq_id >> 16) & 0xffff;
struct rte_eth_rxq_info qinfo;
+   struct rte_port *port;
 
+   port = &ports[port_id];
+   if (port->port_status != RTE_PORT_STARTED) {
+   printf("%s port_status(%d) is incorrect, stop avail_thresh "
+  "event processing.\n",
+  __func__, port->port_status);
+   return;
+   }
printf("%s disable shaper\n", __func__);
if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
printf("rx_queue_info_get returns error\n");
-- 
1.8.3.1

[PATCH v2] mlx5/testpmd: fix crash on quit with avail thresh enabled

2022-11-02 Thread Spike Du

When testpmd quits with mlx5 avail_thresh enabled, an rte timer handler
is still pending to reconfigure the Rx queue and re-arm this event, while
at the same time testpmd is destroying the Rx queues.
This is never a valid use case for mlx5 avail_thresh. Before testpmd quits,
the user should disable the avail_thresh configuration so that the events
are no longer handled. This is documented in the mlx5 driver guide.

To avoid the crash in this use case, check the port status: if it is not
RTE_PORT_STARTED, don't process the avail_thresh event.

Fixes: f41a5092e6ae ("app/testpmd: add host shaper command")
Cc: sta...@dpdk.org

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5_testpmd.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5/mlx5_testpmd.c
index ed84583..879ea28 100644
--- a/drivers/net/mlx5/mlx5_testpmd.c
+++ b/drivers/net/mlx5/mlx5_testpmd.c
@@ -39,7 +39,15 @@
uint16_t port_id = port_rxq_id & 0xffff;
uint16_t qid = (port_rxq_id >> 16) & 0xffff;
struct rte_eth_rxq_info qinfo;
+   struct rte_port *port;
 
+   port = &ports[port_id];
+   if (port->port_status != RTE_PORT_STARTED) {
+   printf("%s port_status(%d) is incorrect, stop avail_thresh "
+  "event processing.\n",
+  __func__, port->port_status);
+   return;
+   }
printf("%s disable shaper\n", __func__);
if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
printf("rx_queue_info_get returns error\n");
-- 
1.8.3.1

RE: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold

2022-06-06 Thread Spike Du

Hi Andrew,
Please see below for the "fill threshold" concept. I'm OK with the other
comments about the code.

Regards,
Spike.

> -----Original Message-----
> From: Andrew Rybchenko 
> Sent: Saturday, June 4, 2022 8:46 PM
> To: Spike Du ; Matan Azrad ;
> Slava Ovsiienko ; Ori Kam ;
> NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Wenzhuo Lu ; Beilei Xing ;
> Bernard Iremonger ; Ray Kinsella
> ; Neil Horman 
> Cc: step...@networkplumber.org; m...@smartsharesystems.com;
> dev@dpdk.org; Raslan Darawsheh 
> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
> 
> 
> On 6/3/22 15:48, Spike Du wrote:
> > Fill threshold describes the fullness of a Rx queue. If the Rx queue
> > fullness is above the threshold, the device will trigger the event
> > RTE_ETH_EVENT_RX_FILL_THRESH.
> 
> Sorry, I'm not sure that I understand. As far as I know the process to add
> more Rx buffers to Rx queue is called 'refill' in many drivers. So fill level
> is a number (or percentage) of free buffers in an Rx queue.
> If so, fill threshold should be a minimum fill level and below the level we
> should generate an event.
> 
> However reading the first paragraph of the description it looks like you mean
> the opposite thing - a number (or percentage) of ready Rx buffers with received
> packets.
> 
> I think that the term "fill threshold" is suggested by me, but I did it with
> my understanding of the added feature. Now I'm confused.
> 
> Moreover, I don't understand how "fill threshold" could be in terms of ready
> Rx buffers. HW simply doesn't really know when ready Rx buffers are
> processed by SW. So, HW can't say for sure how many ready Rx buffers are
> pending. It could be calculated as Rx queue size minus number of free Rx
> buffers, but it is imprecise. First of all not all Rx descriptors could be
> used.
> Second, HW ring size could differ from queue size specified in SW.
> Queue size specified in SW could just limit maximum number of free Rx
> buffers provided by the driver.
> 

Let me use other terms, because "fill"/"refill" is also ambiguous to me.
In an Rx ring there are Rx buffers with received packets, which you call
"ready Rx buffers". There is an RTE API, rte_eth_rx_queue_count(), to get their
number. They are also called "used descriptors" in the code.
There are also Rx buffers provided by SW to allow HW to "fill in" received
packets. We can call them "usable Rx buffers" (here "usable" means usable by HW).
Let's define Rx queue "fullness":
Fullness = ready-Rx-buffers/Rxq-size
Conversely, we have "emptiness":
Emptiness = usable-Rx-buffers/Rxq-size
Here "fill threshold" describes "fullness"; it is not the "refill" described in
your words above. In your words, "refill" is the opposite: it is filling
"usable/free Rx buffers", i.e. "emptiness".

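To make the two ratios above concrete, here is a minimal C sketch. The struct and helper names are invented for this illustration only; they are not ethdev or mlx5 API, and the snapshot values would have to come from the driver in practice.

```c
#include <stdint.h>

/* Hypothetical snapshot of one Rx queue, for illustration only. */
struct rxq_snapshot {
	uint16_t size;   /* Rxq-size: number of descriptors in the ring */
	uint16_t ready;  /* ready Rx buffers: hold received packets ("used") */
	uint16_t usable; /* usable Rx buffers: posted by SW, free for HW */
};

/* Fullness = ready-Rx-buffers / Rxq-size, expressed as a percentage. */
static inline uint32_t
rxq_fullness_pct(const struct rxq_snapshot *q)
{
	return q->size ? (uint32_t)q->ready * 100 / q->size : 0;
}

/* Emptiness = usable-Rx-buffers / Rxq-size, expressed as a percentage. */
static inline uint32_t
rxq_emptiness_pct(const struct rxq_snapshot *q)
{
	return q->size ? (uint32_t)q->usable * 100 / q->size : 0;
}
```

For a 256-entry queue with 64 ready and 128 usable buffers, these helpers give 25% fullness and 50% emptiness.
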
I can only briefly explain how mlx5 gets the LWM, because I'm not a
firmware guy.
An mlx5 Rx queue is basically an RDMA queue. It has two indexes: a producer
index, which increases when HW fills in a packet, and a consumer index, which
increases when SW consumes the packet.
The queue size is known when the queue is created. The fullness is something
like (producer_index - consumer_index) (I don't consider wrap-around here).
So mlx5 has a way to get the fullness or emptiness in HW or FW.
Another detail is that mlx5 uses the term "LWM" (limit watermark), which
describes "emptiness". When usable-Rx-buffers is below the LWM, an event is triggered.
But Thomas thinks "fullness" is easier to understand, so we use "fullness" in
the rte APIs and we'll translate it to LWM in the mlx5 PMD.
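
A rough sketch of the index arithmetic and the LWM check follows. It is an illustration under assumptions, not the mlx5 implementation: the 16-bit index width, the struct layout and all function names are hypothetical, and the fullness-to-LWM translation assumes every descriptor is either ready or usable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ring state; field names are illustrative, not the mlx5 layout. */
struct ring_idx {
	uint16_t producer; /* advanced by HW when it fills in a packet */
	uint16_t consumer; /* advanced by SW when it consumes a packet */
};

/* Ready entries; the unsigned 16-bit subtraction absorbs index wrap-around. */
static inline uint16_t
ring_ready(const struct ring_idx *r)
{
	return (uint16_t)(r->producer - r->consumer);
}

/*
 * Translate a fullness threshold (percent, at most 100) into an emptiness
 * watermark (percent), assuming each descriptor is either ready or usable.
 */
static inline uint8_t
fullness_to_lwm_pct(uint8_t fullness_pct)
{
	return (uint8_t)(100 - fullness_pct);
}

/* LWM semantics: the event fires when usable buffers drop below the mark. */
static inline bool
lwm_event(uint16_t usable, uint16_t lwm)
{
	return usable < lwm;
}
```

With a producer index of 5 and a consumer index of 65530, ring_ready() returns 11, which is the wrap-around case the plain subtraction above ignores.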

> > Fill threshold is defined as a percentage of Rx queue size with valid
> > value of [0,99].
> > Setting fill threshold to 0 means disable it, which is the default.
> > Add fill threshold configuration and query driver callbacks in eth_dev_ops.
> > Add command line options to support fill_thresh per-rxq configure.
> > - Command syntax:
> >set port  rxq  fill_thresh 
> >
> > - Example commands:
> > To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
> > testpmd> set port 1 rxq 0 fill_thresh 30
> >
> > To disable fill_thresh on port 1 rxq 0:
> > testpmd> set port 1 rxq 0 fill_thresh 0
> >
> > Signed-off-by: Spike Du 
> > ---
> >   app/test-pmd/cmdline.c | 68
> +++
> >   app/test-pmd/config.c  | 21 ++
> >   app/test

RE: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold

2022-06-06 Thread Spike Du

If you think HW knows "emptiness" and we want to stick to DPDK descriptor
terms, can we use the name "avail_thresh"?
When HW sees that the number of available descriptors is under this threshold,
an event is triggered.

> -----Original Message-----
> From: Andrew Rybchenko 
> Sent: Tuesday, June 7, 2022 1:16 AM
> To: Spike Du ; Matan Azrad ;
> Slava Ovsiienko ; Ori Kam ;
> NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Wenzhuo Lu ; Beilei Xing ;
> Bernard Iremonger ; Ray Kinsella
> ; Neil Horman 
> Cc: step...@networkplumber.org; m...@smartsharesystems.com;
> dev@dpdk.org; Raslan Darawsheh 
> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold
> 
> 
> On 6/6/22 16:16, Spike Du wrote:
> > Hi Andrew,
> >   Please see below for "fill threshold" concept, I'm ok with other
> > comments about code.
> >
> > Regards,
> > Spike.
> >
> >
> >> -----Original Message-----
> >> From: Andrew Rybchenko 
> >> Sent: Saturday, June 4, 2022 8:46 PM
> >> To: Spike Du ; Matan Azrad ;
> >> Slava Ovsiienko ; Ori Kam ;
> >> NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Wenzhuo
> >> Lu ; Beilei Xing ;
> >> Bernard Iremonger ; Ray Kinsella
> >> ; Neil Horman 
> >> Cc: step...@networkplumber.org; m...@smartsharesystems.com;
> >> dev@dpdk.org; Raslan Darawsheh 
> >> Subject: Re: [PATCH v4 3/7] ethdev: introduce Rx queue based fill
> >> threshold
> >>
> >>
> >> On 6/3/22 15:48, Spike Du wrote:
> >>> Fill threshold describes the fullness of a Rx queue. If the Rx queue
> >>> fullness is above the threshold, the device will trigger the event
> >>> RTE_ETH_EVENT_RX_FILL_THRESH.
> >>
> >> Sorry, I'm not sure that I understand. As far as I know the process
> >> to add more Rx buffers to Rx queue is called 'refill' in many
> >> drivers. So fill level is a number (or percentage) of free buffers in an
> >> Rx queue.
> >> If so, fill threshold should be a minimum fill level and below the
> >> level we should generate an event.
> >>
> >> However reading the first paragraph of the description it looks like
> >> you mean the opposite thing - a number (or percentage) of ready Rx buffers
> >> with received packets.
> >>
> >> I think that the term "fill threshold" is suggested by me, but I did
> >> it with my understanding of the added feature. Now I'm confused.
> >>
> >> Moreover, I don't understand how "fill threshold" could be in terms
> >> of ready Rx buffers. HW simply doesn't really know when ready Rx
> >> buffers are processed by SW. So, HW can't say for sure how many ready
> >> Rx buffers are pending. It could be calculated as Rx queue size minus
> >> number of free Rx buffers, but it is imprecise. First of all not all Rx
> >> descriptors could be used.
> >> Second, HW ring size could differ from queue size specified in SW.
> >> Queue size specified in SW could just limit maximum number of free Rx
> >> buffers provided by the driver.
> >>
> >
> > Let me use other terms because "fill"/"refill" is also ambiguous to me.
> > In a RX ring, there are Rx buffers with received packets, you call it
> > "ready Rx buffers", there is a RTE api rte_eth_rx_queue_count() to get the
> > number, It's also called "used descriptors" in the code.
> > Also there are Rx buffers provided by SW to allow HW "fill in" received
> > packets, we can call it "usable Rx buffers" (here "usable" means usable for
> > HW).
> 
> Maybe it is better to stick to Rx descriptor status terminology?
> Available - Rx descriptor available to HW to put received packet to
> Done - Rx descriptor with received packet reported to SW
> Unavailable - other (e.g. gap which cannot be used or just processed Done,
> but not refilled (made available to HW)).
> 
> > Let's define Rx queue "fullness":
> >   Fullness = ready-Rx-buffers/Rxq-size
> 
> i.e. number of DONE descriptors divided by RxQ size
> 
> > On the opposite, we have "emptiness"
> >   Emptiness = usable-Rx-buffers/Rxq-size
> 
> i.e. number of AVAIL descriptors divided by RxQ size.
> Note that AVAIL != RxQ-size - DONE
> 
> HW really knows number of available descriptors by its nature.
> It is a space between latest done and la

[PATCH v5 0/7] introduce per-queue available descriptor threshold and host shaper

2022-06-07 Thread Spike Du

The available descriptor threshold (ADT for short) is a per-Rx-queue attribute:
when the number of Rx queue descriptors available to HW drops below the ADT, HW
sends an event to the application.
The host shaper can configure the shaper rate and avail_thresh-triggered for a
host port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx
port on the Nvidia BlueField 2 NIC.
If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an available descriptor threshold
event.

These two features can be combined to control traffic from the host port to the
wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow on the ARM system is: configure the available descriptor threshold
on the Rx queue and enable the avail_thresh-triggered flag in the host shaper;
after receiving an available descriptor threshold event, wait a while until the
Rx queue is empty, then disable the shaper. We repeat this workflow to reduce
Rx queue drops on the ARM system.
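
The workflow above can be pictured as a two-step cycle. The following C sketch only models the state transitions; the struct and function names are hypothetical and nothing here calls into DPDK or the mlx5 PMD. It assumes the event has to be re-armed after it fires, and uses a rate unit of 100Mbps for the shaper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one host port; not a DPDK or mlx5 structure. */
struct host_port_model {
	bool avail_thresh_triggered; /* flag set in the host shaper */
	bool event_armed;            /* ADT event armed on the Rx queue */
	uint32_t shaper_rate;        /* in units of 100Mbps, 0 = no shaper */
};

/* Step 1: HW raises the ADT event; the 100Mbps shaper is applied automatically. */
static inline void
model_on_avail_thresh_event(struct host_port_model *p)
{
	if (!p->event_armed)
		return;
	p->event_armed = false;
	if (p->avail_thresh_triggered)
		p->shaper_rate = 1; /* 1 unit = 100Mbps */
}

/* Step 2: after a delay that lets the Rx queue drain, lift the shaper and re-arm. */
static inline void
model_on_delay_expired(struct host_port_model *p)
{
	p->shaper_rate = 0;
	p->event_armed = true;
}
```

Calling the two helpers alternately reproduces the repeated cycle: event, shaper on, delay, shaper off, re-arm.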

Add a new ethdev API to set the available descriptor threshold, and add the
event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle the available descriptor
threshold event. The host shaper doesn't align with the existing DPDK framework
and is specific to Nvidia NICs, so it uses a PMD private API.

For integration with testpmd, put the private cmdline function and the available
descriptor threshold event handler in the mlx5 PMD directory by adding a new
file, mlx5_testpmd.c. Follow David Marchand's driver-specific commands framework
to add the mlx5-specific commands.


Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based available descriptor threshold
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based available descriptor threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/cmdline.c   |  68 +++
 app/test-pmd/config.c|  21 ++
 app/test-pmd/testpmd.c   |  24 +++
 app/test-pmd/testpmd.h   |   2 +
 doc/guides/nics/mlx5.rst |  93 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 ---
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +
 drivers/net/mlx5/meson.build |   4 +
 drivers/net/mlx5/mlx5.c  |  68 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 289 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c  | 201 +++
 drivers/net/mlx5/mlx5_testpmd.h  |  26 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 +++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +
 lib/ethdev/ethdev_driver.h   |  22 ++
 lib/ethdev/rte_ethdev.c  |  52 +
 lib/ethdev/rte_ethdev.h  |  73 +++
 lib/ethdev/version.map   |   2 +
 33 files changed, 1318 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1

[PATCH v5 1/7] net/mlx5: add LWM support for Rxq

2022-06-07 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object, which indicates the
percentage of the Rx queue size used by HW to raise the LWM event to the user.
Allow setting the LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[PATCH v5 3/7] ethdev: introduce Rx queue based available descriptor threshold

2022-06-07 Thread Spike Du

The available descriptor threshold describes the availability of an Rx queue
for hardware.
If the availability is below the threshold, the device will trigger the
event RTE_ETH_EVENT_RX_AVAIL_THRESH.
The available descriptor threshold is defined as a percentage of the Rx queue
size, with valid values in [0,99].
Setting the available descriptor threshold to 0 disables it, which is
the default.
Add available descriptor threshold configuration and query driver
callbacks in eth_dev_ops.
Add command line options to support per-rxq avail_thresh configuration.
- Command syntax:
  set port <port_id> rxq <rxq_id> avail_thresh <avail_thresh_num>

- Example commands:
To configure avail_thresh as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 avail_thresh 30

To disable avail_thresh on port 1 rxq 0:
testpmd> set port 1 rxq 0 avail_thresh 0
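
As an illustration of the percentage semantics described above, a driver-side conversion might look like the sketch below. The helper name is hypothetical and the rounding (truncating integer division) is an assumption, not something this patch mandates.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Convert an avail_thresh percentage into a descriptor count for a queue
 * of nb_desc descriptors. Returns -EINVAL outside [0,99]; a result of 0
 * for avail_thresh == 0 means the threshold is disabled.
 */
static inline int
avail_thresh_to_desc(uint16_t nb_desc, uint8_t avail_thresh)
{
	if (avail_thresh > 99)
		return -EINVAL;
	return (int)((uint32_t)nb_desc * avail_thresh / 100);
}
```

For example, avail_thresh 30 on a 1024-descriptor queue maps to 307 descriptors with this rounding.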

Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c | 68 ++
 app/test-pmd/config.c  | 20 +
 app/test-pmd/testpmd.c | 14 +
 app/test-pmd/testpmd.h |  2 ++
 lib/ethdev/ethdev_driver.h | 22 ++
 lib/ethdev/rte_ethdev.c| 44 
 lib/ethdev/rte_ethdev.h| 73 ++
 lib/ethdev/version.map |  2 ++
 8 files changed, 245 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 0410bad..bbf5835 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
}
 };
 
+/* *** SET AVAIL THRESHOLD FOR A RXQ OF A PORT *** */
+struct cmd_rxq_avail_thresh_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t avail_thresh;
+   uint8_t avail_thresh_num;
+};
+
+static void cmd_rxq_avail_thresh_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_avail_thresh_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->avail_thresh, "avail_thresh") == 0))
+   ret = set_rxq_avail_thresh(res->port_num, res->rxq_num,
+ res->avail_thresh_num);
+   if (ret < 0)
+   printf("rxq_avail_thresh_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   set, "set");
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   port, "port");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   port_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   rxq, "rxq");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   rxq_num, RTE_UINT16);
+static cmdline_parse_token_string_t cmd_rxq_avail_thresh_avail_thresh =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   avail_thresh, "avail_thresh");
+static cmdline_parse_token_num_t cmd_rxq_avail_thresh_avail_threshnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_avail_thresh_result,
+   avail_thresh_num, RTE_UINT8);
+
+static cmdline_parse_inst_t cmd_rxq_avail_thresh = {
+   .f = cmd_rxq_avail_thresh_parsed,
+   .data = (void *)0,
+   .help_str = "set port  rxq  avail_thresh 
"
+   "Set avail_thresh for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_avail_thresh_set,
+   (void *)&cmd_rxq_avail_thresh_port,
+   (void *)&cmd_rxq_avail_thresh_portnum,
+   (void *)&cmd_rxq_avail_thresh_rxq,
+   (void *)&cmd_rxq_avail_thresh_rxqnum,
+   (void *)&cmd_rxq_avail_thresh_avail_thresh,
+   (void *)&cmd_rxq_avail_thresh_avail_threshnum,
+   NULL,
+   },
+};
+
 /* 

 */
 
 /* list of instructions */
@@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
(cmdline_parse_inst_t *)&cmd_show_capability,
(cmdline_p

[PATCH v5 5/7] net/mlx5: support Rx queue based available descriptor threshold

2022-06-07 Thread Spike Du

Add mlx5-specific available descriptor threshold configuration
and query handlers.
In the mlx5 PMD, the available descriptor threshold is also called
LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
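
The query behaviour described above can be sketched as a scan over per-queue pending flags. This is a simplified model with a hypothetical function name; in particular, wrapping around to queue 0 after the last queue is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Scan nb_queues pending flags starting at *queue_id, wrapping around once.
 * Returns 1 and updates *queue_id if a pending queue is found, 0 otherwise.
 */
static inline int
next_pending_queue(const bool *pending, uint16_t nb_queues, uint16_t *queue_id)
{
	uint16_t i;

	if (nb_queues == 0 || *queue_id >= nb_queues)
		return 0;
	for (i = 0; i < nb_queues; i++) {
		uint16_t q = (uint16_t)((*queue_id + i) % nb_queues);

		if (pending[q]) {
			*queue_id = q;
			return 1;
		}
	}
	return 0;
}
```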

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 151 +
 drivers/net/mlx5/mlx5_rx.h |   5 ++
 6 files changed, 172 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..9163b78 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue available descriptor threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Available descriptor threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo "0000:82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Available descriptor threshold introduction
+-------------------------------------------
+
+Available descriptor threshold is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92..46fd73a 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue available descriptor threshold support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5..3b5e605 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..998846a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_avail_thresh_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_avail_thresh_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 197d708..2cb7006 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -25,6 +25,7 @@
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +129,16 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+   return rxq->lwm * 100 / wqe_cnt;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +161,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +181,8 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->avail_thresh = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1202,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_eth_dev *dev,
+

[PATCH v5 7/7] app/testpmd: add Host Shaper command

2022-06-07 Thread Spike Du

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port <port_id> host_shaper avail_thresh_triggered <0|1> rate <rate_num>

- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50

Add sample code to handle the rxq available descriptor threshold event: it
waits a while so that the rxq empties, then disables the host shaper and
re-arms the available descriptor threshold event.
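
For reference, the rate unit quoted above translates as in the small sketch below. The macro and helper names are hypothetical, and the 8-bit width of the rate argument is an assumption.

```c
#include <stdint.h>

/* One host shaper rate unit, per the commit message above. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert a host shaper rate value into Mbps; 0 means no shaper. */
static inline uint32_t
host_shaper_rate_to_mbps(uint8_t rate)
{
	return (uint32_t)rate * HOST_SHAPER_RATE_UNIT_MBPS;
}
```

So "rate 50" in the last example command corresponds to 5000Mbps, i.e. 5Gbps.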

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |   6 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 201 
 drivers/net/mlx5/mlx5_testpmd.h |  26 ++
 5 files changed, 283 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 33d9b85..e15882d 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include <rte_eth_bond.h>
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3659,6 +3662,9 @@ struct pmd_test_command {
break;
printf("Received avail_thresh event, port:%d rxq_id:%d\n",
   port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a1e13e7..b5a3ee3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+There are sample command lines to configure available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle available descriptor threshold event.
+The typical workflow is: testpmd configure available descriptor threshold for Rx queues, enable
+avail_thresh_triggered in host shaper and register a callback, when traffic from host is
+too high and Rx queue emptiness is below available descriptor threshold, PMD receives an event and
+firmware configures a 100Mbps shaper on host port automatically, then PMD call
+the callback registered previously, which will delay a while to let Rx queue
+empty, then disable host shaper.
+
+Let's assume we have a simple Blue Field 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables current host shaper, and enables available descriptor threshold triggered mode.
+The left commands configure available descriptor threshold to 70% of Rx queue size for both Rx queues,
+When traffic from host is too high, you can see testpmd console prints log
+about available descriptor threshold event receiving, then host shaper is disabled.
+The traffic rate from host is controlled and less drop happens in Rx queues.
+
+When disable available descriptor threshold and avail_thresh_triggered, we can invoke below commands in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It's recommended an application disables available descriptor threshold and avail_thresh_triggered before exit,
+if it enables them before.
+
+We can also configure the shaper with a value, the rate unit is 100Mbps, below
+command sets current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug

[PATCH v5 6/7] net/mlx5: add private API to config host port shaper

2022-06-07 Thread Spike Du

Host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives available descriptor
threshold event.
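
As a rough illustration of the rate semantics described above, here is a minimal standalone C sketch. It assumes that the rate is expressed in units of 100Mbps (as stated in the testpmd patch of this series) and that only rate 0 or 1 (100Mbps) is accepted in lwm-triggered mode (as stated in the mlx5.rst limitation of this patch). The names `HOST_SHAPER_RATE_UNIT_MBPS`, `host_shaper_rate_mbps` and `host_shaper_cfg_valid` are hypothetical and are not part of the driver API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: one rate unit is 100Mbps, per the patch description. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert a rate given in 100Mbps units to Mbps; 0 means no shaper. */
static inline uint32_t
host_shaper_rate_mbps(uint8_t rate_units)
{
	return (uint32_t)rate_units * HOST_SHAPER_RATE_UNIT_MBPS;
}

/*
 * Hypothetical validity check (assumption, not driver code):
 * in triggered (deferred) mode only rate 0 and rate 1 (100Mbps)
 * are accepted, while immediate mode accepts any rate.
 */
static inline bool
host_shaper_cfg_valid(uint8_t rate_units, bool triggered)
{
	if (triggered)
		return rate_units <= 1;
	return true;
}
```

With these helpers, `assert(host_shaper_rate_mbps(50) == 5000);` holds, which matches the "rate 50" = 5Gbps example used elsewhere in the series.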

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  35 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 104 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 212 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 9163b78..a1e13e7 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue available descriptor threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When configuring host shaper with the MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED flag set,
+    only rate 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,31 @@ Available descriptor threshold is a per Rx queue attribute, it should be configu
 a percentage of the Rx queue size.
 When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+The host shaper register is a per-host-port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper: immediate, and deferred to
+available descriptor threshold event trigger. In immediate mode, the rate limit is configured
+immediately to host shaper. When deferring to available descriptor threshold trigger, the shaper
+is not set until an available descriptor threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the available descriptor threshold event, which allows throttling host traffic on
+available descriptor threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 46fd73a..3349cda 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue available descriptor threshold support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [

[PATCH v5 2/7] common/mlx5: share interrupt management

2022-06-07 Thread Spike Du

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this and replace all related PMD code with this
API.

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 ++
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[PATCH v5 4/7] net/mlx5: add LWM event handling support

2022-06-07 Thread Spike Du

When LWM meets RQ WQE, the kernel driver raises an event to SW.
Use devx event_channel to catch this and to notify the user.
Allocate this channel per shared device.
The channel has a cookie that informs the specific event port and queue.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 +++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1

RE: [PATCH v5 7/7] app/testpmd: add Host Shaper command

2022-06-09 Thread Spike Du



> -Original Message-
> From: Andrew Rybchenko 
> Sent: Thursday, June 9, 2022 3:55 PM
> To: Spike Du ; Matan Azrad ;
> Slava Ovsiienko ; Ori Kam ;
> NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Xiaoyun Li ; Aman Singh
> ; Yuying Zhang 
> Cc: step...@networkplumber.org; m...@smartsharesystems.com;
> dev@dpdk.org; Raslan Darawsheh 
> Subject: Re: [PATCH v5 7/7] app/testpmd: add Host Shaper command
> 
> 
> Since ethdev patch is factored out from the patch series the rest could go to
> mlx5 maintainers.
> 
> On 6/7/22 15:59, Spike Du wrote:
> > Add command line options to support host shaper configure.
> > - Command syntax:
> >mlx5 set port  host_shaper avail_thresh_triggered <0|1>
> > rate 
> >
> > - Example commands:
> > To enable avail_thresh_triggered on port 1 and disable current host
> > shaper:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
> >
> > To disable avail_thresh_triggered and current host shaper on port 1:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
> >
> > The rate unit is 100Mbps.
> > To disable avail_thresh_triggered and configure a shaper of 5Gbps on
> > port 1:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
> >
> > Add sample code to handle rxq available descriptor threshold event, it
> > delays a while so that rxq empties, then disables host shaper and
> > rearms available descriptor threshold event.
> >
> > Signed-off-by: Spike Du 
> 
> [snip]
> 
> > +
> > +static uint8_t host_shaper_avail_thresh_triggered[RTE_MAX_ETHPORTS];
> > +#define SHAPER_DISABLE_DELAY_US 100000 /* 100ms */
> > +
> > +/**
> > + * Disable the host shaper and re-arm available descriptor threshold event.
> > + *
> > + * @param[in] args
> > + *   uint32_t integer combining port_id and rxq_id.
> > + */
> > +static void
> > +mlx5_test_host_shaper_disable(void *args) {
> > + uint32_t port_rxq_id = (uint32_t)(uintptr_t)args;
> > + uint16_t port_id = port_rxq_id & 0xffff;
> > + uint16_t qid = (port_rxq_id >> 16) & 0xffff;
> > + struct rte_eth_rxq_info qinfo;
> > +
> > + printf("%s disable shaper\n", __func__);
> > + if (rte_eth_rx_queue_info_get(port_id, qid, &qinfo)) {
> > + printf("rx_queue_info_get returns error\n");
> > + return;
> > + }
> > + /* Rearm the available descriptor threshold event. */
> > + if (rte_eth_rx_avail_thresh_set(port_id, qid, qinfo.avail_thresh)) {
> > + printf("config avail_thresh returns error\n");
> > + return;
> > + }
> > + /* Only disable the shaper when avail_thresh_triggered is set. */
> > + if (host_shaper_avail_thresh_triggered[port_id] &&
> > + rte_pmd_mlx5_host_shaper_config(port_id, 0, 0))
> > + printf("%s disable shaper returns error\n", __func__); }
> > +
> > +void
> > +mlx5_test_avail_thresh_event_handler(uint16_t port_id, uint16_t
> > +rxq_id) {
> > + uint32_t port_rxq_id = port_id | (rxq_id << 16);
> 
> Nobody guarantees here that port_id refers to an mlx5 port.
> It could be any port with avail_thresh support.
I think we can add the check below at the beginning of this function:

	if (rte_eth_dev_info_get(port_id, &dev_info) != 0 ||
	    strncmp(dev_info.driver_name, "mlx5", 4) != 0)
		return;

Is that OK?
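
For reference, the prefix test in the check above can be tried in isolation. The following is only a sketch without any DPDK dependency: in testpmd the name would come from rte_eth_dev_info_get(), while here it is passed in directly. The helper name `driver_is_mlx5` and the sample driver names in the note below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true when the driver name starts with "mlx5".
 * A NULL name is treated as "not mlx5" so the caller can bail out early.
 */
static inline bool
driver_is_mlx5(const char *driver_name)
{
	return driver_name != NULL &&
	       strncmp(driver_name, "mlx5", 4) == 0;
}
```

A handler would return immediately when `driver_is_mlx5()` is false, for example for a name such as "net_ixgbe", and continue for a name such as "mlx5_pci".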
> 
> > +
> > + rte_eal_alarm_set(SHAPER_DISABLE_DELAY_US,
> > +   mlx5_test_host_shaper_disable,
> > +   (void *)(uintptr_t)port_rxq_id);
> > + printf("%s port_id:%u rxq_id:%u\n", __func__, port_id, rxq_id);
> > +}
> 
> [snip]
> 
> > +cmdline_parse_token_string_t cmd_port_host_shaper_mlx5 =
> > + TOKEN_STRING_INITIALIZER(struct cmd_port_host_shaper_result,
> > + mlx5, "mlx5");
> 
> I think it lacks 'static' keyword (many cases below).

Sure, will add 'static'.
> 
> [snip]

[PATCH v6] app/testpmd: add Host Shaper command

2022-06-12 Thread Spike Du

This patch is taken out of the series "introduce per-queue available
descriptor threshold and host shaper"
to simplify the review, and it is the last one with non-PMD changes.
However, it depends on a PMD commit for the host shaper config API and should be merged
after the PMD patches.

--
v6:
 - add 'static' keyword for cmdline structs
 - in AVAIL_THRESH event callback, add check to ensure it's mlx5 port

Spike Du (1):
  app/testpmd: add Host Shaper command

 app/test-pmd/testpmd.c  |  11 +++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 206 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 293 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1

[PATCH v6] app/testpmd: add Host Shaper command

2022-06-12 Thread Spike Du

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port  host_shaper avail_thresh_triggered <0|1> rate


- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50

Add sample code to handle the rxq available descriptor threshold event. It
delays a while so that the rxq empties, then disables the host shaper and
rearms the available descriptor threshold event.
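
The sample handler passes both the port ID and the Rx queue ID to the delayed callback through a single pointer-sized argument. A minimal sketch of that packing is shown below; the helper names `port_rxq_pack`, `port_rxq_port` and `port_rxq_queue` are hypothetical and do not appear in the patch, which open-codes the same shifts and masks.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a port ID (low 16 bits) and an Rx queue ID (high 16 bits). */
static inline uint32_t
port_rxq_pack(uint16_t port_id, uint16_t rxq_id)
{
	/* Cast before shifting so the shift is done on an unsigned 32-bit value. */
	return (uint32_t)port_id | ((uint32_t)rxq_id << 16);
}

/* Extract the port ID from the packed value. */
static inline uint16_t
port_rxq_port(uint32_t packed)
{
	return (uint16_t)(packed & 0xffff);
}

/* Extract the Rx queue ID from the packed value. */
static inline uint16_t
port_rxq_queue(uint32_t packed)
{
	return (uint16_t)((packed >> 16) & 0xffff);
}
```

The explicit cast to uint32_t is a deliberate choice in this sketch: shifting a promoted uint16_t left by 16 is done in signed int arithmetic, which is not well defined for queue IDs of 0x8000 and above.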

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |  11 +++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 206 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 293 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 33d9b85..e1ac75a 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3659,6 +3662,14 @@ struct pmd_test_command {
break;
printf("Received avail_thresh event, port:%d rxq_id:%d\n",
   port_id, rxq_id);
+
+   struct rte_eth_dev_info dev_info;
+   if (rte_eth_dev_info_get(port_id, &dev_info) != 0 ||
+   (strncmp(dev_info.driver_name, "mlx5", 4) != 0))
+   printf("%s\n", dev_info.driver_name);
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a1e13e7..b5a3ee3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+There are sample command lines to configure available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle available descriptor threshold event.
+The typical workflow is: testpmd configures available descriptor threshold for Rx queues, enables
+avail_thresh_triggered in host shaper and registers a callback. When traffic from host is
+too high and Rx queue emptiness is below available descriptor threshold, PMD receives an event and
+firmware configures a 100Mbps shaper on host port automatically, then PMD calls
+the callback registered previously, which will delay a while to let Rx queue
+empty, then disable host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables the current host shaper and enables available descriptor threshold triggered mode.
+The remaining commands configure the available descriptor threshold to 70% of Rx queue size for both Rx queues.
+When traffic from host is too high, the testpmd console prints a log
+about receiving the available descriptor threshold event, then the host shaper is disabled.
+The traffic rate from host is controlled and fewer drops happen in Rx queues.
+
+To disable available descriptor threshold and avail_thresh_triggered, we can invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It's recommended that an application disables available descriptor threshold and avail_thresh_triggered before exit,
+if it enabled them before.
+
+We can also configure the shaper with a value. The rate unit is 100Mbps; the command below
+sets the current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5

RE: [PATCH v6] app/testpmd: add Host Shaper command

2022-06-14 Thread Spike Du



> -Original Message-
> From: Singh, Aman Deep 
> Sent: Tuesday, June 14, 2022 5:44 PM
> To: Spike Du ; Matan Azrad ;
> Slava Ovsiienko ; Ori Kam ;
> NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Wenzhuo Lu ; Beilei Xing ;
> Bernard Iremonger ; Shahaf Shuler
> 
> Cc: andrew.rybche...@oktetlabs.ru; step...@networkplumber.org;
> m...@smartsharesystems.com; dev@dpdk.org; Raslan Darawsheh
> 
> Subject: Re: [PATCH v6] app/testpmd: add Host Shaper command
> 
> 
> Hi Spike,
> 
> 
> On 6/13/2022 8:20 AM, Spike Du wrote:
> > Add command line options to support host shaper configure.
> > - Command syntax:
> >mlx5 set port  host_shaper avail_thresh_triggered <0|1>
> > rate 
> >
> > - Example commands:
> > To enable avail_thresh_triggered on port 1 and disable current host
> > shaper:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
> >
> > To disable avail_thresh_triggered and current host shaper on port 1:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
> >
> > The rate unit is 100Mbps.
> > To disable avail_thresh_triggered and configure a shaper of 5Gbps on
> > port 1:
> > testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
> >
> > Add sample code to handle rxq available descriptor threshold event, it
> > delays a while so that rxq empties, then disables host shaper and
> > rearms available descriptor threshold event.
> >
> > Signed-off-by: Spike Du 
> > ---
> >   app/test-pmd/testpmd.c  |  11 +++
> >   doc/guides/nics/mlx5.rst|  46 +
> >   drivers/net/mlx5/meson.build|   4 +
> >   drivers/net/mlx5/mlx5_testpmd.c | 206
> 
> >   drivers/net/mlx5/mlx5_testpmd.h |  26 +
> >   5 files changed, 293 insertions(+)
> >   create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
> >   create mode 100644 drivers/net/mlx5/mlx5_testpmd.h
> >
> > diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index
> > 33d9b85..e1ac75a 100644
> > --- a/app/test-pmd/testpmd.c
> > +++ b/app/test-pmd/testpmd.c
> > @@ -69,6 +69,9 @@
> >   #ifdef RTE_NET_BOND
> >   #include 
> >   #endif
> > +#ifdef RTE_NET_MLX5
> > +#include "mlx5_testpmd.h"
> > +#endif
> >
> >   #include "testpmd.h"
> >
> > @@ -3659,6 +3662,14 @@ struct pmd_test_command {
> >   break;
> >   printf("Received avail_thresh event, port:%d 
> > rxq_id:%d\n",
> >  port_id, rxq_id);
> > +
> > + struct rte_eth_dev_info dev_info;
> > + if (rte_eth_dev_info_get(port_id, &dev_info) != 0 ||
> > + (strncmp(dev_info.driver_name, "mlx5", 4) != 0))
> > + printf("%s\n", dev_info.driver_name);
> > +#ifdef RTE_NET_MLX5
> > + mlx5_test_avail_thresh_event_handler(port_id,
> > +rxq_id); #endif
> >   }
> 
> Wanted to check the intent of the above "if-statement". Currently I think only
> print() is dependent on it.
> Do we want to call mlx5 event_handler, only if driver_name is mlx5 ?
> 
> >   break;
> >   default:
> >
> 

Sorry, it is test code that I should have removed. The check of the mlx5 driver_name
is done in mlx5_test_avail_thresh_event_handler(); if the port is not an mlx5 port,
we simply return there.
Will update the patch soon, thanks for the catch!

[PATCH v7] app/testpmd: add Host Shaper command

2022-06-14 Thread Spike Du

This patch is taken out of the series "introduce per-queue available
descriptor threshold and host shaper"
to simplify the review, and it is the last one with non-PMD changes.
However, it depends on a PMD commit for the host shaper config API and should be merged
after the PMD patches.

--
v7:
 - remove some test code.


Spike Du (1):
  app/testpmd: add Host Shaper command

 app/test-pmd/testpmd.c  |   7 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 206 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 289 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1

[PATCH v7] app/testpmd: add Host Shaper command

2022-06-14 Thread Spike Du

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port  host_shaper avail_thresh_triggered <0|1> rate


- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50

Add sample code to handle the rxq available descriptor threshold event. It
delays a while so that the rxq empties, then disables the host shaper and
rearms the available descriptor threshold event.

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |   7 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 206 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 289 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 33d9b85..b491719 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3659,6 +3662,10 @@ struct pmd_test_command {
break;
printf("Received avail_thresh event, port:%d rxq_id:%d\n",
   port_id, rxq_id);
+
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a1e13e7..b5a3ee3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+There are sample command lines to configure available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle available descriptor threshold event.
+The typical workflow is: testpmd configures available descriptor threshold for Rx queues, enables
+avail_thresh_triggered in host shaper and registers a callback. When traffic from host is
+too high and Rx queue emptiness is below available descriptor threshold, PMD receives an event and
+firmware configures a 100Mbps shaper on host port automatically, then PMD calls
+the callback registered previously, which will delay a while to let Rx queue
+empty, then disable host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables the current host shaper and enables available descriptor threshold triggered mode.
+The remaining commands configure the available descriptor threshold to 70% of Rx queue size for both Rx queues.
+When traffic from host is too high, the testpmd console prints a log
+about receiving the available descriptor threshold event, then the host shaper is disabled.
+The traffic rate from host is controlled and fewer drops happen in Rx queues.
+
+To disable available descriptor threshold and avail_thresh_triggered, we can invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It's recommended that an application disables available descriptor threshold and avail_thresh_triggered before exit,
+if it enabled them before.
+
+We can also configure the shaper with a value. The rate unit is 100Mbps; the command below
+sets the current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug

[PATCH v8 2/6] common/mlx5: share interrupt management

2022-06-15 Thread Spike Du

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this and replace all related PMD code with this
API.

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 ++
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[PATCH v8 0/6] introduce per-queue available descriptor threshold and host shaper

2022-06-15 Thread Spike Du

Available descriptor threshold (ADT for short) is a per Rx queue attribute. When
the number of Rx queue descriptors available to HW drops below the ADT, HW sends
an event to the application.
Host shaper can configure the shaper rate and avail_thresh-triggered for a host
port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx
port on the Nvidia BlueField 2 NIC.
If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an available descriptor threshold
event.

These two features can be combined to control traffic from the host port to the
wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow on the ARM system is: configure the available descriptor threshold
on the Rx queue and enable the avail_thresh-triggered flag in the host shaper;
after receiving an available descriptor threshold event, delay a while until the
Rx queue is empty, then disable the shaper. This workflow is repeated to reduce
Rx queue drops on the ARM system.
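
The workflow above can be sketched as a small state machine. This is only an
illustration in plain C with no DPDK dependency: struct adt_flow and the
adt_flow_*() helpers are hypothetical names, not part of this series, and the
sketch assumes the event is one-shot and must be re-armed after each trigger.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue plus the host shaper it triggers. */
struct adt_flow {
	uint8_t avail_thresh;  /* Threshold in percent, 0 means disabled. */
	bool event_armed;      /* True while the threshold event can fire. */
	bool shaper_on;        /* True while the 100Mbps shaper is active. */
	unsigned int cycles;   /* Completed throttle/re-arm cycles. */
};

/* Configure the threshold; a non-zero value arms the event. */
static inline void
adt_flow_init(struct adt_flow *f, uint8_t avail_thresh)
{
	f->avail_thresh = avail_thresh;
	f->event_armed = avail_thresh != 0;
	f->shaper_on = false;
	f->cycles = 0;
}

/* Threshold event: the shaper turns on and the event is consumed. */
static inline void
adt_flow_on_event(struct adt_flow *f)
{
	if (!f->event_armed)
		return; /* Not armed, nothing to do. */
	f->event_armed = false;
	f->shaper_on = true;
}

/* Delay expired (queue drained): disable the shaper and re-arm. */
static inline void
adt_flow_on_delay_expired(struct adt_flow *f)
{
	if (!f->shaper_on)
		return; /* No throttling in progress. */
	f->shaper_on = false;
	f->event_armed = f->avail_thresh != 0;
	f->cycles++;
}
```

Calling adt_flow_on_event() and then adt_flow_on_delay_expired() models one
full cycle; a second event before the delay expires is ignored because the
event has not been re-armed yet.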

Add a new libethdev API to set the available descriptor threshold, and add the
event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle the available descriptor threshold
event. For host shaper, because it doesn't fit the existing DPDK framework and
is specific to Nvidia NICs, use a PMD private API.

For integration with testpmd, put the private cmdline function and the available
descriptor threshold event handler in the mlx5 PMD directory by adding a new file
mlx5_testpmd.c. Follow David Marchand's driver-specific commands framework to
add mlx5-specific commands.

Spike Du (6):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based available descriptor threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/testpmd.c   |   7 +
 doc/guides/nics/mlx5.rst |  93 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 ---
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +
 drivers/net/mlx5/meson.build |   4 +
 drivers/net/mlx5/mlx5.c  |  68 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 288 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c  | 205 +++
 drivers/net/mlx5/mlx5_testpmd.h  |  26 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 +++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +
 26 files changed, 1064 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1

[PATCH v8 5/6] net/mlx5: add private API to config host port shaper

2022-06-15 Thread Spike Du

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether this function is enabled.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure the shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an available descriptor
threshold event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  35 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 104 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 212 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index cceaddf..5f7b060 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue available descriptor threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When the host shaper is configured with the MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 --
 
@@ -1692,3 +1699,31 @@ Available descriptor threshold is a per Rx queue 
attribute, it should be configu
 a percentage of the Rx queue size.
 When Rx queue available descriptors for hardware are below the threshold, an 
event is sent to PMD.
 
+Host shaper introduction
+
+
+The host shaper register is a per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to the same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from the host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+available descriptor threshold event trigger. In immediate mode, the rate 
limit is configured
+immediately to host shaper. When deferring to available descriptor threshold 
trigger, the shaper
+is not set until an available descriptor threshold event is received by any Rx 
queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the available descriptor threshold event, which allows throttling 
host traffic on
+available descriptor threshold events at minimum latency, preventing excess 
drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+---
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on 
``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst 
b/doc/guides/rel_notes/release_22_07.rst
index 46fd73a..3349cda 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue available descriptor threshold support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build 
b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', 
'--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [

[PATCH v8 1/6] net/mlx5: add LWM support for Rxq

2022-06-15 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object. It indicates the percentage
of the Rx queue size used by HW to raise an LWM event to the user.
Allow setting LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= 
MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[PATCH v8 3/6] net/mlx5: add LWM event handling support

2022-06-15 Thread Spike Du

When LWM meets RQ WQE, the kernel driver raises an event to SW.
Use devx event_channel to catch this and to notify the user.
Allocate this channel per shared device.
The channel has a cookie that identifies the port and queue of the event.
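
As an illustration of how a single cookie can identify both the port and the
queue, here is a minimal sketch in plain C. The layout (port ID in the low 16
bits, Rx queue index in the high 16 bits) and the lwm_cookie_*() names are
assumptions made for this example; the layout actually used by the driver may
differ.

```c
#include <stdint.h>

/*
 * Hypothetical cookie layout, for illustration only:
 * bits 0..15 hold the port ID, bits 16..31 hold the Rx queue index.
 */
static inline uint32_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint32_t)rxq_idx << 16) | port_id;
}

/* Recover the port ID and Rx queue index from a packed cookie. */
static inline void
lwm_cookie_unpack(uint32_t cookie, uint16_t *port_id, uint16_t *rxq_idx)
{
	*port_id = cookie & 0xffff;
	*rxq_idx = (cookie >> 16) & 0xffff;
}
```

With a cookie like this, the event handler does not need a lookup table: it
decodes the value delivered with the event and passes the port and queue on to
the user callback.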

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 +++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before freeing this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int 
*port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1

[PATCH v8 4/6] net/mlx5: support Rx queue based available descriptor threshold

2022-06-15 Thread Spike Du

Add mlx5-specific available descriptor threshold configuration
and query handlers.
In the mlx5 PMD, the available descriptor threshold is also called
LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
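
The relation between the percentage given by the user and the LWM value can be
sketched as below. The percentage direction mirrors the
mlx5_rxq_lwm_to_percentage() helper in this patch (lwm * 100 / wqe_cnt); the
inverse direction and both function names are assumptions made for
illustration, using plain integers instead of the driver structures.

```c
#include <stdint.h>

/* Hypothetical helper: percentage of the queue size to an LWM count. */
static inline uint32_t
lwm_from_percentage(uint32_t wqe_cnt, uint8_t percent)
{
	/* 64-bit intermediate avoids overflow for large queues. */
	return (uint32_t)((uint64_t)wqe_cnt * percent / 100);
}

/* LWM count back to a percentage, truncating like the driver helper. */
static inline uint8_t
lwm_to_percentage(uint32_t wqe_cnt, uint32_t lwm)
{
	if (wqe_cnt == 0)
		return 0; /* Guard against division by zero. */
	return (uint8_t)((uint64_t)lwm * 100 / wqe_cnt);
}
```

Because both directions use integer division, a round trip can lose one
percent: with 1024 WQEs, 70% gives an LWM of 716, which reads back as 69%.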

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 151 +
 drivers/net/mlx5/mlx5_rx.h |   5 ++
 6 files changed, 172 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..cceaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue available descriptor threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Available descriptor threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 --
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 
adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Available descriptor threshold introduction
+---
+
+Available descriptor threshold is a per Rx queue attribute, it should be 
configured as
+a percentage of the Rx queue size.
+When Rx queue available descriptors for hardware are below the threshold, an 
event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst 
b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92..46fd73a 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue available descriptor threshold support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5..3b5e605 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..998846a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_avail_thresh_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_avail_thresh_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 197d708..2cb7006 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -25,6 +25,7 @@
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +129,16 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+   return rxq->lwm * 100 / wqe_cnt;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +161,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +181,8 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->avail_thresh = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1202,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct 
rte_power_monitor_cond *pmc)
return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struct rte_

[PATCH v8 6/6] app/testpmd: add Host Shaper command

2022-06-15 Thread Spike Du

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port  host_shaper avail_thresh_triggered <0|1> rate


- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
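
The rate arithmetic can be sketched as follows. This is an illustration only:
the helper names are hypothetical, the 8-bit width of the rate argument is an
assumption, and the validity check just restates the limitation from the mlx5
guide text in patch 5/6 (with the triggered flag set, only rate 0 and 100Mbps
are supported).

```c
#include <stdbool.h>
#include <stdint.h>

/* One rate unit is 100Mbps, as stated above. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert the "rate" command argument to Mbps (hypothetical helper). */
static inline uint32_t
host_shaper_rate_to_mbps(uint8_t rate)
{
	return (uint32_t)rate * HOST_SHAPER_RATE_UNIT_MBPS;
}

/*
 * Hypothetical validity check: with avail_thresh_triggered set, only
 * rate 0 (shaper off) and rate 1 (100Mbps) are accepted; otherwise
 * any rate is allowed.
 */
static inline bool
host_shaper_rate_is_valid(bool avail_thresh_triggered, uint8_t rate)
{
	if (avail_thresh_triggered)
		return rate <= 1;
	return true;
}
```

For the last example command, rate 50 converts to 5000Mbps, i.e. 5Gbps, and is
valid because avail_thresh_triggered is 0.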

Add sample code to handle the rxq available descriptor threshold event. It
delays a while so that the rxq empties, then disables the host shaper and
rearms the available descriptor threshold event.

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |   7 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 205 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 288 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 33d9b85..b491719 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3659,6 +3662,10 @@ struct pmd_test_command {
break;
printf("Received avail_thresh event, port:%d 
rxq_id:%d\n",
   port_id, rxq_id);
+
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 5f7b060..64eaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+-
+
+There is a command to configure the available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle the available descriptor threshold event.
+The typical workflow is: testpmd configures the available descriptor threshold for Rx queues, enables
+avail_thresh_triggered in the host shaper and registers a callback. When traffic from the host is
+too high and Rx queue emptiness drops below the available descriptor threshold, the PMD receives an event and
+firmware configures a 100Mbps shaper on the host port automatically. The PMD then calls
+the callback registered previously, which delays a while to let the Rx queue
+empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available 
descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables the current host shaper and enables available descriptor threshold triggered mode.
+The other commands configure the available descriptor threshold to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console prints a log
+about receiving the available descriptor threshold event, then the host shaper is disabled.
+The traffic rate from the host is controlled and fewer drops happen in Rx queues.
+
+The threshold event and shaper can be disabled like this:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It is recommended that an application disable the available descriptor threshold and avail_thresh_triggered before exit,
+if it enabled them earlier.
+
+The shaper can also be configured with a value. The rate unit is 100Mbps; the
+command below sets the current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug')
 else
 cflags += [ '-UPEDANTIC&

[PATCH v9 1/6] net/mlx5: add LWM support for Rxq

2022-06-15 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object. It indicates the percentage
of the Rx queue size used by HW to raise an LWM event to the user.
Allow setting LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= 
MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[PATCH v9 0/6] introduce per-queue available descriptor threshold and host shaper

2022-06-15 Thread Spike Du

Available descriptor threshold (ADT for short) is a per Rx queue attribute. When
the number of Rx queue descriptors available to HW drops below the ADT, HW sends
an event to the application.
Host shaper can configure the shaper rate and avail_thresh-triggered for a host
port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx
port on the Nvidia BlueField 2 NIC.
If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an available descriptor threshold
event.

These two features can be combined to control traffic from the host port to the
wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow on the ARM system is: configure the available descriptor threshold
on the Rx queue and enable the avail_thresh-triggered flag in the host shaper;
after receiving an available descriptor threshold event, delay a while until the
Rx queue is empty, then disable the shaper. This workflow is repeated to reduce
Rx queue drops on the ARM system.

Add a new libethdev API to set the available descriptor threshold, and add the
event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle the available descriptor threshold
event. For host shaper, because it doesn't fit the existing DPDK framework and
is specific to Nvidia NICs, use a PMD private API.

For integration with testpmd, put the private cmdline function and the available
descriptor threshold event handler in the mlx5 PMD directory by adding a new file
mlx5_testpmd.c. Follow David Marchand's driver-specific commands framework to
add mlx5-specific commands.

Spike Du (6):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based available descriptor threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/testpmd.c   |   7 +
 doc/guides/nics/mlx5.rst |  93 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 ---
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +
 drivers/net/mlx5/meson.build |   4 +
 drivers/net/mlx5/mlx5.c  |  68 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 288 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c  | 205 +++
 drivers/net/mlx5/mlx5_testpmd.h  |  26 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 +++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +
 26 files changed, 1064 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1
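
Editorial sketch (not part of the patch): the workflow described in the
cover letter above can be modelled in plain C as below. Everything here is
hypothetical and for illustration only: struct adt_model and both functions
are invented names, and the "one event per arming" behaviour is an
assumption drawn from the series' mention of re-arming the event. No DPDK
or mlx5 API is used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one Rx queue plus the host shaper state. */
struct adt_model {
	uint32_t queue_size;  /* total Rx descriptors in the queue */
	uint32_t avail;       /* descriptors currently available to HW */
	uint8_t thresh_pct;   /* threshold as % of queue_size, 0 = disabled */
	bool event_armed;     /* assumption: one event per arming */
	bool shaper_on;       /* 100Mbps shaper enabled on the host port */
};

/* HW side: consume n descriptors; return true if a threshold event fires. */
static bool
adt_hw_consume(struct adt_model *m, uint32_t n)
{
	assert(n <= m->avail);
	m->avail -= n;
	if (m->thresh_pct == 0 || !m->event_armed)
		return false;
	/* Compare avail/queue_size against thresh_pct/100 without division. */
	if ((uint64_t)m->avail * 100 >= (uint64_t)m->queue_size * m->thresh_pct)
		return false;
	m->event_armed = false;
	m->shaper_on = true; /* avail_thresh-triggered: shaper turns on */
	return true;
}

/* Application side, run after the delay: queue drained, shaper off, re-arm. */
static void
adt_app_handle_event(struct adt_model *m)
{
	m->avail = m->queue_size;
	m->shaper_on = false;
	m->event_armed = true;
}
```

In this model the shaper stays on between the event and the handler call,
which is the window in which host traffic is throttled while the Rx queue
drains.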

[PATCH v9 2/6] common/mlx5: share interrupt management

2022-06-15 Thread Spike Du

There is a lot of duplicated code for creating and initializing
rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code
with this API.

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 ++
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[PATCH v9 3/6] net/mlx5: add LWM event handling support

2022-06-15 Thread Spike Du

When the number of available RQ WQEs reaches the LWM, the kernel driver
raises an event to SW.
Use a devx event_channel to catch this event and to notify the user.
Allocate this channel per shared device.
The channel carries a cookie that identifies the port and queue the event
belongs to.
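
Editorial sketch (not part of the patch): one way a single cookie can carry
both a port and a queue index is to pack two 16-bit fields into a 32-bit
value. The layout, macro names and function names below are assumptions made
for illustration; the bit positions actually used by the driver are not
shown in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: low 16 bits = Rx queue index, high 16 bits = port. */
#define COOKIE_RXQ_SHIFT  0
#define COOKIE_PORT_SHIFT 16
#define COOKIE_FIELD_MASK 0xffffu

/* Build the cookie attached to an event subscription. */
static uint32_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint32_t)port_id << COOKIE_PORT_SHIFT) |
	       ((uint32_t)rxq_idx << COOKIE_RXQ_SHIFT);
}

/* Recover port and queue from a cookie delivered with an event. */
static void
lwm_cookie_unpack(uint32_t cookie, uint16_t *port_id, uint16_t *rxq_idx)
{
	assert(port_id != NULL && rxq_idx != NULL);
	*port_id = (uint16_t)((cookie >> COOKIE_PORT_SHIFT) & COOKIE_FIELD_MASK);
	*rxq_idx = (uint16_t)((cookie >> COOKIE_RXQ_SHIFT) & COOKIE_FIELD_MASK);
}
```

Because the channel is shared by all Rx queues of the device context, the
handler needs some such decoding step to find out which queue to report.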

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 +++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1

[PATCH v9 4/6] net/mlx5: support Rx queue based available descriptor threshold

2022-06-15 Thread Spike Du

Add the mlx5-specific available descriptor threshold configuration
and query handlers.
In the mlx5 PMD, the available descriptor threshold is also called
LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 151 +
 drivers/net/mlx5/mlx5_rx.h |   5 ++
 6 files changed, 172 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..cceaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue available descriptor threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Available descriptor threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Available descriptor threshold introduction
+-------------------------------------------
+
+Available descriptor threshold is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 6fc044e..7fb98cd 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -151,6 +151,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue available descriptor threshold support.
 
 * **Updated VMware vmxnet3 networking driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 654e5f4..7c4030a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3294,6 +3294,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..998846a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_avail_thresh_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_avail_thresh_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 197d708..2cb7006 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -25,6 +25,7 @@
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +129,16 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+   return rxq->lwm * 100 / wqe_cnt;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +161,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +181,8 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->avail_thresh = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1202,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -ENOTSUP;
 }
 
+int
+mlx5_rx_queue_lwm_query(struc

[PATCH v9 5/6] net/mlx5: add private API to config host port shaper

2022-06-15 Thread Spike Du

The host port shaper can be configured through the QSHR (QoS Shaper Host
Register).
Add a check in the build files to decide whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and lwm-triggered for a host
port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an available descriptor
threshold event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  35 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 104 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 212 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index cceaddf..5f7b060 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue available descriptor threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Support BlueField series NIC from BlueField 2.
+  - When configure host shaper with MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED flag set,
+only rate 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,31 @@ Available descriptor threshold is a per Rx queue attribute, it should be configu
 a percentage of the Rx queue size.
 When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
 
+Host shaper introduction
+
+
+Host shaper register is per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+available descriptor threshold event trigger. In immediate mode, the rate limit is configured
+immediately to host shaper. When deferring to available descriptor threshold trigger, the shaper
+is not set until an available descriptor threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the available descriptor threshold event, which allows throttling host traffic on
+available descriptor threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 7fb98cd..199a775 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -152,6 +152,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue available descriptor threshold support.
+  * Added host shaper support.
 
 * **Updated VMware vmxnet3 networking driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+

[PATCH v9 6/6] app/testpmd: add Host Shaper command

2022-06-15 Thread Spike Du

Add command line options to support host shaper configuration.
- Command syntax:
  mlx5 set port  host_shaper avail_thresh_triggered <0|1> rate


- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
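
Editorial sketch (not part of the patch): the rate argument in the examples
above is expressed in units of 100Mbps, so "rate 50" means 5Gbps. The helper
names below are hypothetical, and the 8-bit limit on the rate value is an
assumption made for the sketch rather than something stated in this message.

```c
#include <assert.h>
#include <stdint.h>

/* Rate unit stated in the commit message: 100Mbps per unit. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert a rate argument to Mbps; 0 means the shaper is disabled. */
static uint32_t
host_shaper_rate_to_mbps(uint32_t rate)
{
	assert(rate <= UINT8_MAX); /* assumption: rate fits in 8 bits */
	return rate * HOST_SHAPER_RATE_UNIT_MBPS;
}

/*
 * Convert Mbps to a rate argument.
 * Returns -1 if mbps is not a whole number of units or does not fit.
 */
static int
host_shaper_mbps_to_rate(uint32_t mbps)
{
	if (mbps % HOST_SHAPER_RATE_UNIT_MBPS != 0)
		return -1;
	if (mbps / HOST_SHAPER_RATE_UNIT_MBPS > UINT8_MAX)
		return -1;
	return (int)(mbps / HOST_SHAPER_RATE_UNIT_MBPS);
}
```

With these helpers, 5000 Mbps maps to the value 50 used in the last example
command, and a request such as 150 Mbps is rejected because it is not a
multiple of the unit.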

Add sample code to handle the rxq available descriptor threshold event: it
delays a while so that the rxq empties, then disables the host shaper and
re-arms the available descriptor threshold event.

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |   7 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 205 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 288 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 205d98e..e6321bd 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3726,6 +3729,10 @@ struct pmd_test_command {
break;
printf("Received avail_thresh event, port: %u, rxq_id: %u\n",
   port_id, rxq_id);
+
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
}
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 5f7b060..64eaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+There is a command to configure available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle available descriptor threshold event.
+The typical workflow is: testpmd configure available descriptor threshold for Rx queues, enable
+avail_thresh_triggered in host shaper and register a callback, when traffic from host is
+too high and Rx queue emptiness is below available descriptor threshold, PMD receives an event and
+firmware configures a 100Mbps shaper on host port automatically, then PMD call
+the callback registered previously, which will delay a while to let Rx queue
+empty, then disable host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables current host shaper, and enables available descriptor threshold triggered mode.
+The other commands configure available descriptor threshold to 70% of Rx queue size for both Rx queues,
+When traffic from host is too high, you can see testpmd console prints log
+about available descriptor threshold event receiving, then host shaper is disabled.
+The traffic rate from host is controlled and less drop happens in Rx queues.
+
+The threshold event and shaper can be disabled like this:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It's recommended an application disables available descriptor threshold and avail_thresh_triggered before exit,
+if it enables them before.
+
+We can also configure the shaper with a value, the rate unit is 100Mbps, below
+command sets current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug')
 else
 cflags += [ '-UPEDANTIC&

[PATCH v10 0/6] introduce per-queue available descriptor threshold and host shaper

2022-06-16 Thread Spike Du

The available descriptor threshold (ADT for short) is a per-Rx-queue
attribute. When the number of Rx queue descriptors available to HW drops
below the ADT, HW sends an event to the application.
The host shaper can configure the shaper rate and the
avail_thresh-triggered flag for a host port.
The shaper limits the rate of traffic from the host port to the embedded
ARM Rx port on the Nvidia BlueField 2 NIC.
If avail_thresh-triggered is enabled, a 100Mbps shaper is enabled
automatically when one of the host port's Rx queues receives an available
descriptor threshold event.

These two features can be combined to control traffic from the host port to
the wire port on the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical
port.
The workflow on the ARM system is: configure the available descriptor
threshold on the Rx queue and enable the avail_thresh-triggered flag in the
host shaper; after receiving an available descriptor threshold event, delay
a while until the Rx queue is empty, then disable the shaper. This workflow
is repeated to reduce Rx queue drops on the ARM system.

Add a new libethdev API to set the available descriptor threshold, and add
the rte event RTE_ETH_EVENT_RX_AVAIL_THRESH to handle the available
descriptor threshold event.
For the host shaper, because it doesn't fit the existing DPDK framework and
is specific to Nvidia NICs, use a PMD private API.

For integration with testpmd, put the private cmdline function and the
available descriptor threshold event handler in the mlx5 PMD directory by
adding a new file, mlx5_testpmd.c. Follow David Marchand's driver-specific
commands framework to add mlx5-specific commands.

Spike Du (6):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based available descriptor threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/testpmd.c   |   7 +
 doc/guides/nics/mlx5.rst |  93 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 ---
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +
 drivers/net/mlx5/meson.build |   3 +
 drivers/net/mlx5/mlx5.c  |  68 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 288 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c  | 205 +++
 drivers/net/mlx5/mlx5_testpmd.h  |  26 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 +++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +
 26 files changed, 1063 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1
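
Editorial sketch (not part of the patch): patch 4/6 of this series reports
the threshold as a percentage computed from the stored LWM with
lwm * 100 / wqe_cnt, where wqe_cnt is 1 << (elts_n - sges_n). The standalone
functions below reproduce that integer arithmetic. The function names are
invented here, and percentage_to_lwm() is a hypothetical floor-rounding
inverse added only to show the truncation effect; it is not claimed to be
what the driver does when the threshold is set.

```c
#include <assert.h>
#include <stdint.h>

/* WQE count from the log2 element and segment counts. */
static uint32_t
rxq_wqe_cnt(unsigned int elts_n, unsigned int sges_n)
{
	assert(elts_n >= sges_n && elts_n - sges_n < 32);
	return UINT32_C(1) << (elts_n - sges_n);
}

/* LWM (a WQE count) to percentage, truncating like integer division. */
static uint8_t
lwm_to_percentage(uint32_t lwm, uint32_t wqe_cnt)
{
	assert(wqe_cnt != 0 && lwm <= wqe_cnt);
	return (uint8_t)((uint64_t)lwm * 100 / wqe_cnt);
}

/* Hypothetical inverse: percentage to LWM, rounding down. */
static uint32_t
percentage_to_lwm(uint8_t pct, uint32_t wqe_cnt)
{
	assert(pct <= 100);
	return (uint32_t)((uint64_t)wqe_cnt * pct / 100);
}
```

For a queue of 1024 WQEs, 70% rounds down to an LWM of 716, and 716 converts
back to 69%, so a floor-in-both-directions round trip can lose one percent.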

[PATCH v10 1/6] net/mlx5: add LWM support for Rxq

2022-06-16 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object, which indicates the
percentage of the Rx queue size used by HW to raise an LWM event to the user.
Allow setting the LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1
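
Editorial sketch (not part of the patch): the switch added to
mlx5_devx_modify_rq() above can be read as a small pure function from
(modify type, LWM) to a set of attributes. The types, enum values and the
bitmask bit below are stand-ins invented for this sketch and do not match
the real PRM definitions; only the branching mirrors the diff. Note that
RST2RDY writes the LWM only when it is non-zero, while RDY2RDY always
writes it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the driver's states and modify types. */
enum rq_state { RQ_STATE_RST, RQ_STATE_RDY, RQ_STATE_ERR };
enum rxq_mod_type {
	RXQ_MOD_RST2RDY,
	RXQ_MOD_RDY2ERR,
	RXQ_MOD_RDY2RST,
	RXQ_MOD_RDY2RDY,
};

/* Placeholder bit, not the PRM value. */
#define MODIFY_BITMASK_WQ_LWM (UINT64_C(1) << 0)

struct rq_modify_attr {
	enum rq_state cur_state;  /* state the RQ is expected to be in */
	enum rq_state new_state;  /* state requested by the command */
	uint64_t modify_bitmask;  /* which optional fields are written */
	uint16_t lwm;             /* 16 bits wide, like the lwm:16 field */
};

static struct rq_modify_attr
rq_build_modify_attr(enum rxq_mod_type type, uint16_t lwm)
{
	struct rq_modify_attr attr = { RQ_STATE_RST, RQ_STATE_RST, 0, 0 };

	switch (type) {
	case RXQ_MOD_RST2RDY:
		attr.cur_state = RQ_STATE_RST;
		attr.new_state = RQ_STATE_RDY;
		if (lwm != 0) { /* only carry a configured LWM */
			attr.modify_bitmask |= MODIFY_BITMASK_WQ_LWM;
			attr.lwm = lwm;
		}
		break;
	case RXQ_MOD_RDY2ERR:
		attr.cur_state = RQ_STATE_RDY;
		attr.new_state = RQ_STATE_ERR;
		break;
	case RXQ_MOD_RDY2RST:
		attr.cur_state = RQ_STATE_RDY;
		attr.new_state = RQ_STATE_RST;
		break;
	case RXQ_MOD_RDY2RDY:
		attr.cur_state = RQ_STATE_RDY;
		attr.new_state = RQ_STATE_RDY;
		attr.modify_bitmask |= MODIFY_BITMASK_WQ_LWM; /* always written */
		attr.lwm = lwm;
		break;
	default:
		assert(0 && "unhandled modify type");
		break;
	}
	return attr;
}
```

Keeping the state unchanged in RDY2RDY is what lets the LWM be changed on a
queue that is already running.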

[PATCH v10 2/6] common/mlx5: share interrupt management

2022-06-16 Thread Spike Du

There is a lot of duplicated code for creating and initializing
rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code
with this API.

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 ++
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregister

[PATCH v10 3/6] net/mlx5: add LWM event handling support

2022-06-16 Thread Spike Du

When the RQ WQE count reaches the LWM, the kernel driver raises an event
to SW. Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
The channel carries a cookie that identifies the port and queue of the event.
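
A minimal sketch of how such a cookie could encode the port and queue is
shown below. The bit layout (port ID in the upper 16 bits, queue index in
the lower 16 bits) and all names are assumptions for illustration only;
the actual layout is defined in the driver code, which is not quoted in
this message.

```c
#include <stdint.h>

/* Assumed layout, illustration only: port in bits 16..31, rxq in 0..15. */
#define COOKIE_RXQ_SHIFT 0
#define COOKIE_RXQ_MASK 0xffffu
#define COOKIE_PORT_SHIFT 16
#define COOKIE_PORT_MASK 0xffffu

/* Build the cookie passed when subscribing a queue to the channel. */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint64_t)port_id << COOKIE_PORT_SHIFT) |
	       ((uint64_t)rxq_idx << COOKIE_RXQ_SHIFT);
}

/* Recover port and queue from the cookie returned with an event. */
static void
lwm_cookie_unpack(uint64_t cookie, uint16_t *port_id, uint16_t *rxq_idx)
{
	*port_id = (uint16_t)((cookie >> COOKIE_PORT_SHIFT) & COOKIE_PORT_MASK);
	*rxq_idx = (uint16_t)((cookie >> COOKIE_RXQ_SHIFT) & COOKIE_RXQ_MASK);
}
```

Since one channel is shared by all Rx queues of the device context, the
cookie is the only information the handler has to route the event.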

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 +++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *de

[PATCH v10 4/6] net/mlx5: support Rx queue based available descriptor threshold

2022-06-16 Thread Spike Du

Add mlx5-specific available descriptor threshold configuration
and query handlers.
In the mlx5 PMD, the available descriptor threshold is also called
LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
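
The conversion between the percentage seen by the user and the WQE count
kept in the queue can be sketched as follows. The percentage-to-count
direction is an assumption (the setter is not quoted in this message);
the count-to-percentage direction mirrors mlx5_rxq_lwm_to_percentage()
in the diff. Function names are hypothetical and the percentage is
assumed to be below 100.

```c
#include <stdint.h>

/* Number of WQEs in the queue: 2^(elts_n - sges_n), as in the diff. */
static uint32_t
rxq_wqe_count(unsigned int elts_n, unsigned int sges_n)
{
	return UINT32_C(1) << (elts_n - sges_n);
}

/* Assumed inverse of the diff's helper: percentage to WQE count. */
static uint16_t
lwm_from_percentage(uint32_t wqe_cnt, uint8_t pct)
{
	/* Integer division rounds down. */
	return (uint16_t)((uint64_t)wqe_cnt * pct / 100);
}

/* Same arithmetic as mlx5_rxq_lwm_to_percentage(): WQE count to percent. */
static uint8_t
lwm_to_percentage(uint32_t wqe_cnt, uint16_t lwm)
{
	return (uint8_t)((uint32_t)lwm * 100 / wqe_cnt);
}
```

Because both directions round down, a round trip can read back one
percent lower than the configured value: with 1024 WQEs, 70% becomes 716
WQEs, and 716 WQEs is reported as 69%.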

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 doc/guides/nics/mlx5.rst   |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 151 +
 drivers/net/mlx5/mlx5_rx.h |   5 ++
 6 files changed, 172 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..cceaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue available descriptor threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Available descriptor threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 --
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Available descriptor threshold introduction
+-------------------------------------------
+
+Available descriptor threshold is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 6fc044e..7fb98cd 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -151,6 +151,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue available descriptor threshold support.
 
 * **Updated VMware vmxnet3 networking driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 654e5f4..7c4030a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3294,6 +3294,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..998846a 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_avail_thresh_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_avail_thresh_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 197d708..2cb7006 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -25,6 +25,7 @@
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +129,16 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+   return rxq->lwm * 100 / wqe_cnt;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +161,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +181,8 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->avail_thresh = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1202,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
return -ENOTSUP;
 }

[PATCH v10 5/6] net/mlx5: add private API to config host port shaper

2022-06-16 Thread Spike Du

Host port shaper can be configured with QSHR(QoS Shaper Host Register).
Add check in build files to enable this function or not.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives available descriptor
threshold event.

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 doc/guides/nics/mlx5.rst   |  35 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 104 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 212 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index cceaddf..5f7b060 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue available descriptor threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When configuring host shaper with the MLX5_HOST_SHAPER_FLAG_AVAIL_THRESH_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 --
 
@@ -1692,3 +1699,31 @@ Available descriptor threshold is a per Rx queue attribute, it should be configu
 a percentage of the Rx queue size.
 When Rx queue available descriptors for hardware are below the threshold, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+Host shaper register is per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+available descriptor threshold event trigger. In immediate mode, the rate limit is configured
+immediately to host shaper. When deferring to available descriptor threshold trigger, the shaper
+is not set until an available descriptor threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the available descriptor threshold event, which allows throttling host traffic on
+available descriptor threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 7fb98cd..199a775 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -152,6 +152,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue available descriptor threshold support.
+  * Added host shaper support.
 
 * **Updated VMware vmxnet3 networking driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_

[PATCH v10 6/6] app/testpmd: add Host Shaper command

2022-06-16 Thread Spike Du

Add a testpmd command to configure the host shaper.
- Command syntax:
  mlx5 set port <port_id> host_shaper avail_thresh_triggered <0|1> rate <rate_num>

- Example commands:
To enable avail_thresh_triggered on port 1 and disable current host
shaper:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0

To disable avail_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable avail_thresh_triggered and configure a shaper of 5Gbps on
port 1:
testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50

Add sample code to handle the Rx queue available descriptor threshold
event: it waits a while so that the Rx queue empties, then disables the
host shaper and re-arms the available descriptor threshold event.
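
The rate argument rules described above (unit of 100Mbps, and only 0 or
100Mbps when avail_thresh_triggered is set, per the mlx5.rst limitation
added in patch 5/6) can be sketched as below. The helper names are
hypothetical and exist only for this illustration; they are not part of
testpmd or the PMD.

```c
#include <stdbool.h>
#include <stdint.h>

/* The command's rate argument is expressed in units of 100Mbps. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Hypothetical helper: convert the command's rate value to Mbps. */
static uint32_t
host_shaper_rate_mbps(uint8_t rate)
{
	return rate * HOST_SHAPER_RATE_UNIT_MBPS;
}

/*
 * Hypothetical check: in avail_thresh_triggered mode only rate 0
 * (shaper disabled) and rate 1 (100Mbps) are accepted.
 */
static bool
host_shaper_args_valid(uint8_t rate, bool avail_thresh_triggered)
{
	if (avail_thresh_triggered)
		return rate == 0 || rate == 1;
	return true;
}
```

For example, "rate 50" corresponds to 5000Mbps (5Gbps) and is only
meaningful with avail_thresh_triggered set to 0.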

Signed-off-by: Spike Du 
Acked-by: Matan Azrad 
---
 app/test-pmd/testpmd.c  |   7 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   3 +
 drivers/net/mlx5/mlx5_testpmd.c | 205 
 drivers/net/mlx5/mlx5_testpmd.h |  26 +
 5 files changed, 287 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 205d98e..e6321bd 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3726,6 +3729,10 @@ struct pmd_test_command {
break;
 printf("Received avail_thresh event, port: %u, rxq_id: %u\n",
   port_id, rxq_id);
+
+#ifdef RTE_NET_MLX5
+   mlx5_test_avail_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
}
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 5f7b060..64eaddf 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use available descriptor threshold and Host Shaper
+---------------------------------------------------------
+
+There is a command to configure the available descriptor threshold in testpmd.
+Testpmd also contains sample logic to handle the available descriptor threshold event.
+The typical workflow is: testpmd configures the available descriptor threshold for Rx queues, enables
+avail_thresh_triggered in the host shaper and registers a callback. When traffic from the host is
+too high and Rx queue emptiness is below the available descriptor threshold, the PMD receives an event and
+the firmware configures a 100Mbps shaper on the host port automatically. The PMD then calls
+the callback registered previously, which waits a while to let the Rx queue
+empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable available descriptor threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 70
+   testpmd> set port 1 rxq 1 avail_thresh 70
+
+The first command disables the current host shaper and enables available descriptor threshold triggered mode.
+The other commands configure the available descriptor threshold to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console logs that an
+available descriptor threshold event was received, then the host shaper is disabled.
+The traffic rate from the host is controlled and fewer drops happen in the Rx queues.
+
+The threshold event and shaper can be disabled like this:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 avail_thresh 0
+   testpmd> set port 1 rxq 1 avail_thresh 0
+
+It's recommended that an application disables the available descriptor threshold and avail_thresh_triggered before exit,
+if it enabled them before.
+
+We can also configure the shaper with a value. The rate unit is 100Mbps; the command
+below sets the current shaper to 5Gbps and disables avail_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper avail_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..6a84d96 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,7 @@ if get_option('buildtype').contains('debug')
 else
 cflags +

[PATCH] vdpa/mlx5: refactor with common interrupt management API

2022-07-05 Thread Spike Du

Replace vdpa interrupt handle creation logic with common interrupt
management API.

Signed-off-by: Spike Du 
---
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c | 25 -
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
index ed17fb5..607e290 100644
--- a/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
+++ b/drivers/vdpa/mlx5/mlx5_vdpa_virtq.c
@@ -492,30 +492,13 @@
 mlx5_vdpa_virtq_doorbell_setup(struct mlx5_vdpa_virtq *virtq,
struct rte_vhost_vring *vq, int index)
 {
-   virtq->intr_handle =
-   rte_intr_instance_alloc(RTE_INTR_INSTANCE_F_SHARED);
+   virtq->intr_handle = mlx5_os_interrupt_handler_create(
+ RTE_INTR_INSTANCE_F_SHARED, false,
+ vq->kickfd, mlx5_vdpa_virtq_kick_handler, virtq);
if (virtq->intr_handle == NULL) {
-   DRV_LOG(ERR, "Fail to allocate intr_handle");
+   DRV_LOG(ERR, "Fail to allocate intr_handle for virtq %d.", index);
return -1;
}
-   if (rte_intr_fd_set(virtq->intr_handle, vq->kickfd))
-   return -1;
-   if (rte_intr_fd_get(virtq->intr_handle) == -1) {
-   DRV_LOG(WARNING, "Virtq %d kickfd is invalid.", index);
-   } else {
-   if (rte_intr_type_set(virtq->intr_handle,
-   RTE_INTR_HANDLE_EXT))
-   return -1;
-   if (rte_intr_callback_register(virtq->intr_handle,
-   mlx5_vdpa_virtq_kick_handler, virtq)) {
-   (void)rte_intr_fd_set(virtq->intr_handle, -1);
-   DRV_LOG(ERR, "Failed to register virtq %d interrupt.",
-   index);
-   return -1;
-   }
-   DRV_LOG(DEBUG, "Register fd %d interrupt for virtq %d.",
-   rte_intr_fd_get(virtq->intr_handle), index);
-   }
return 0;
 }
 
-- 
1.8.3.1

RE: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper

2022-05-01 Thread Spike Du

Hi Jerin,

> > For case two(host shaper), I think we can't use RX meter, because it's
> actually TX shaper on a remote system. It's quite specific to Mellanox/Nvidia
> BlueField 2(BF2 for short) NIC. The NIC contains an ARM system. We have
> two terms here: Host-system stands for the system the BF2 NIC is inserted;
> ARM-system stands for the embedded ARM in BF2. ARM-system is doing the
> forwarding. This is the way host shaper works: we configure the register on
> ARM-system, but it affects Host-system's TX shaper, which means the
> shaper is working on the remote port, it's not a RX meter concept, hence we
> can't use DPDK RX meter framework. I'd suggest to still use private API.
> 
> OK. If the host is using the DPDK application then rte_tm can be used on the
> egress side to enable the same. If it is not DPDK, then yes, we need private
> APIs.
I see your point. The RX drop happens on the ARM-system, so it would be too
late to notify the Host-system to reduce its traffic rate. To achieve dropless
operation, MLX developed this feature to configure the host shaper on the
remote port. The Host-system is flexible: it may or may not use DPDK.

Regards,
Spike.


> -Original Message-
> From: Jerin Jacob 
> Sent: Sunday, May 1, 2022 8:51 PM
> To: Spike Du 
> Cc: Andrew Rybchenko ; Cristian
> Dumitrescu ; Ferruh Yigit
> ; techbo...@dpdk.org; Matan Azrad
> ; Slava Ovsiienko ; Ori Kam
> ; NBU-Contact-Thomas Monjalon (EXTERNAL)
> ; dpdk-dev ; Raslan Darawsheh
> 
> Subject: Re: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
> 
> 
> On Tue, Apr 26, 2022 at 8:12 AM Spike Du  wrote:
> >
> > Hi Jerin,
> 
> Hi Spike,
> 
> > Thanks for your comments and sorry for the late response.
> >
> > For case one, I think I can refine the design and add LWM(limit
> watermark) in rte_eth_rxconf, and add a new rte_eth_event_type event.
> 
> OK.
> 
> >
> > For case two(host shaper), I think we can't use RX meter, because it's
> actually TX shaper on a remote system. It's quite specific to Mellanox/Nvidia
> BlueField 2(BF2 for short) NIC. The NIC contains an ARM system. We have
> two terms here: Host-system stands for the system the BF2 NIC is inserted;
> ARM-system stands for the embedded ARM in BF2. ARM-system is doing the
> forwarding. This is the way host shaper works: we configure the register on
> ARM-system, but it affects Host-system's TX shaper, which means the
> shaper is working on the remote port, it's not a RX meter concept, hence we
> can't use DPDK RX meter framework. I'd suggest to still use private API.
> 
> OK. If the host is using the DPDK application then rte_tm can be used on the
> egress side to enable the same. If it is not DPDK, then yes, we need private
> APIs.
> 
> >
> > For testpmd part, I understand your concern. Because we need one
> private API for host shaper, and we need testpmd's forwarding code to show
> how it works to user, we need to call the private API in testpmd. If current
> patch is not acceptable, what's the correct way to do it? Any framework to
> isolate the PMD private logic from testpmd common code, but still give a
> chance to call private APIs in testpmd?
> 
> Please check "PMD API" item in
> http://mails.dpdk.org/archives/dev/2022-April/239191.html
> 
> >
> >
> > Regards,
> > Spike.
> >
> >
> >
> > > -Original Message-
> > > From: Jerin Jacob 
> > > Sent: Tuesday, April 5, 2022 4:59 PM
> > > To: Spike Du ; Andrew Rybchenko
> > > ; Cristian Dumitrescu
> > > ; Ferruh Yigit
> > > ; techbo...@dpdk.org
> > > Cc: Matan Azrad ; Slava Ovsiienko
> > > ; Ori Kam ; NBU-Contact-
> > > Thomas Monjalon (EXTERNAL) ; dpdk-dev
> > > ; Raslan Darawsheh 
> > > Subject: Re: [RFC 0/6] net/mlx5: introduce limit watermark and host
> > > shaper
> > >
> > >
> > > On Fri, Apr 1, 2022 at 8:53 AM Spike Du  wrote:
> > > >
> > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > > fullness reach the LWM limit, HW sends an event to dpdk application.
> > > > Host shaper can configure shaper rate and lwm-triggered for a host port.
> > > > The shaper limits the rate of traffic from host port to wire port.
> > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > > automatically

[RFC v1 0/7] net/mlx5: introduce limit watermark and host shaper

2022-05-05 Thread Spike Du

LWM (limit watermark) is a per RX queue attribute: when the RX queue fullness
reaches the LWM limit, HW sends an event to the DPDK application.
Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives LWM event.

These two features can be combined to control traffic from host port to wire port.
The workflow is: configure LWM on the RX queue and enable the lwm-triggered flag in
the host shaper; after receiving an LWM event, wait until the RX queue is empty,
then disable the shaper. This workflow is repeated to reduce RX queue drops.
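
As a rough illustration of this cycle, the state changes can be modelled
with a small simulation. This is a toy model under stated assumptions,
not driver code: the struct, its fields and both functions are
hypothetical, and the LWM event is assumed to be one-shot, so it has to
be re-armed after each occurrence.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one host port with a single RX queue. */
struct sim_port {
	bool shaper_on;    /* host shaper currently throttling the host */
	bool lwm_armed;    /* LWM event armed on the RX queue */
	uint32_t rxq_fill; /* descriptors still waiting to be processed */
};

/* Firmware side: an armed LWM event fires once and enables the shaper. */
static bool
sim_lwm_event(struct sim_port *p)
{
	if (!p->lwm_armed)
		return false;
	p->lwm_armed = false;
	p->shaper_on = true;
	return true;
}

/*
 * Application side, run after the delay: once the queue has drained,
 * disable the shaper and re-arm the event. Returns false if the queue
 * is not empty yet and the caller should wait longer.
 */
static bool
sim_rearm(struct sim_port *p)
{
	if (p->rxq_fill != 0)
		return false;
	p->shaper_on = false;
	p->lwm_armed = true;
	return true;
}
```

The model makes the ordering explicit: the shaper is only released after
the queue has drained, and the event is re-armed in the same step so the
next burst triggers the cycle again.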

Add new libethdev API to set LWM, add rte event RTE_ETH_EVENT_RXQ_LIMIT_REACHED
to handle LWM event. For host shaper, because it doesn't align to existing DPDK
framework and is specific to Nvidia NIC, use PMD private API.

For integration with testpmd, put the private cmdline function and LWM event
handler in mlx5 PMD directory by adding a new file mlx5_test.c. Only add minimal
code in testpmd to invoke interfaces from mlx5_test.c.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c   |  74 
 app/test-pmd/config.c|  23 +++
 app/test-pmd/meson.build |   3 +
 app/test-pmd/testpmd.c   |  13 ++
 app/test-pmd/testpmd.h   |   1 +
 doc/guides/nics/mlx5.rst |  87 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  44 +++--
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++---
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +-
 drivers/net/mlx5/meson.build |   7 +-
 drivers/net/mlx5/mlx5.c  |  62 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 ++-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 253 +++
 drivers/net/mlx5/mlx5_rx.h   |  11 ++
 drivers/net/mlx5/mlx5_test.c | 191 
 drivers/net/mlx5/mlx5_test.h |  27 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 ---
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  52 +-
 lib/ethdev/ethdev_driver.h   |   7 +
 lib/ethdev/rte_ethdev.c  |  28 +++
 lib/ethdev/rte_ethdev.h  |  30 +++-
 lib/ethdev/version.map   |   3 +
 34 files changed, 1193 insertions(+), 331 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_test.c
 create mode 100644 drivers/net/mlx5/mlx5_test.h

-- 
1.8.3.1

[RFC v1 1/7] net/mlx5: add LWM support for Rxq

2022-05-05 Thread Spike Du

Add an lwm (limit watermark) field to the Rxq object. It indicates the
percentage of the RX queue size used by HW to raise an LWM event to the user.
Allow setting LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.
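
The mapping from modify type to RQ state transition and LWM handling can
be sketched stand-alone as below. The enums and the struct are simplified
stand-ins for the driver's definitions (assumed names, illustration
only); the switch follows the cases visible in the mlx5_devx.c hunk of
this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the driver's RQ states and modify types. */
enum rq_state { RQ_RST, RQ_RDY, RQ_ERR };
enum rq_mod { MOD_RST2RDY, MOD_RDY2ERR, MOD_RDY2RST, MOD_RDY2RDY };

struct rq_modify {
	enum rq_state cur;  /* state the RQ must currently be in */
	enum rq_state next; /* state requested by the command */
	bool set_lwm;       /* whether the LWM field is modified */
	uint16_t lwm;
};

static struct rq_modify
rq_modify_build(enum rq_mod type, uint16_t lwm)
{
	struct rq_modify m = { RQ_RST, RQ_RST, false, 0 };

	switch (type) {
	case MOD_RST2RDY:
		m.cur = RQ_RST;
		m.next = RQ_RDY;
		/* On queue start, LWM is only programmed if configured. */
		if (lwm) {
			m.set_lwm = true;
			m.lwm = lwm;
		}
		break;
	case MOD_RDY2ERR:
		m.cur = RQ_RDY;
		m.next = RQ_ERR;
		break;
	case MOD_RDY2RST:
		m.cur = RQ_RDY;
		m.next = RQ_RST;
		break;
	case MOD_RDY2RDY:
		/* State is unchanged; LWM is always written, even as 0. */
		m.cur = RQ_RDY;
		m.next = RQ_RDY;
		m.set_lwm = true;
		m.lwm = lwm;
		break;
	}
	return m;
}
```

Writing the LWM unconditionally in the RDY2RDY case means a value of 0
is also pushed to HW on a running queue, unlike the RST2RDY case, which
skips the field when no LWM is configured.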

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 23a28f6..f3e6682 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1391,6 +1391,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 03c0fac..4fbfcaa 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[RFC v1 2/7] common/mlx5: share interrupt management

2022-05-05 Thread Spike Du

There is a lot of duplicated code that creates and initializes rte_intr_handle.
Add a new mlx5_os API to do this and replace the PMD-related code with calls to
this API.
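
One step the shared helper performs is switching the fd to non-blocking mode. A minimal POSIX-only sketch of that step follows; the function name fd_set_nonblock is hypothetical, and unlike the helper in this patch it also checks the result of F_GETFL before using it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Add O_NONBLOCK to the file status flags of fd.
 * Returns 0 on success, a negative errno value otherwise.
 */
static int
fd_set_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}
```

Reading the current flags first keeps whatever other status flags the fd already has.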

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  52 ++-
 11 files changed, 217 insertions(+), 312 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[RFC v1 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-05 Thread Spike Du

LWM (limit watermark) is a per-Rx-queue attribute. When the number of usable
descriptors in the Rx queue falls below the watermark, the DPDK application is
notified with the RTE_ETH_EVENT_RXQ_LIMIT_REACHED event.
To simplify configuration, LWM is expressed as a percentage of the Rx queue
descriptor count, with valid values in [0,99].
Setting LWM to 0 disables it.
Add the LWM configuration handler to eth_dev_ops.

Signed-off-by: Spike Du 
---
 lib/ethdev/ethdev_driver.h |  7 +++
 lib/ethdev/rte_ethdev.c| 28 
 lib/ethdev/rte_ethdev.h| 30 +-
 lib/ethdev/version.map |  3 +++
 4 files changed, 67 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc2..1e9cdbf 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,10 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool);
 
+typedef int (*eth_rx_queue_set_lwm_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ uint8_t lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id,
@@ -1283,6 +1287,9 @@ struct eth_dev_ops {
 
/** Dump private info from device */
eth_dev_priv_dump_t eth_dev_priv_dump;
+
+   /** Set Rx queue limit watermark */
+   eth_rx_queue_set_lwm_t rx_queue_set_lwm;
 };
 
 /**
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 29a3d80..1e4fc6a 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4414,6 +4414,34 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
queue_idx, tx_rate));
 }
 
+int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
+uint8_t lwm)
+{
+   struct rte_eth_dev *dev;
+   struct rte_eth_dev_info dev_info;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+   dev = &rte_eth_devices[port_id];
+
+   ret = rte_eth_dev_info_get(port_id, &dev_info);
+   if (ret != 0)
+   return ret;
+
+   if (queue_idx >= dev_info.max_rx_queues) {
+   RTE_ETHDEV_LOG(ERR,
+   "Set Rx queue LWM: port %u: invalid queue ID=%u\n",
+   port_id, queue_idx);
+   return -EINVAL;
+   }
+
+   if (lwm > 99)
+   return -EINVAL;
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_set_lwm, -ENOTSUP);
+   return eth_err(port_id, (*dev->dev_ops->rx_queue_set_lwm)(dev,
+queue_idx, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8e..f29e53b 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1249,8 +1249,12 @@ struct rte_eth_rxconf {
 */
union rte_eth_rxseg *rx_seg;
 
-   uint64_t reserved_64s[2]; /**< Reserved for future fields */
+   uint64_t reserved_64s;
+   uint32_t reserved_32s;
+   uint32_t lwm:8;
+   uint32_t reserved_bits:24;
void *reserved_ptrs[2];   /**< Reserved for future fields */
+
 };
 
 /**
@@ -3668,6 +3672,29 @@ int rte_eth_dev_set_vlan_ether_type(uint16_t port_id,
  */
 int rte_eth_dev_set_vlan_pvid(uint16_t port_id, uint16_t pvid, int on);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set Rx queue based limit watermark.
+ *
+ * @param port_id
+ *  The port identifier of the Ethernet device.
+ * @param queue_idx
+ *  The index of the receive queue
+ * @param lwm
+ *  The limit watermark percentage of Rx queue descriptor size.
+ *  The valid range is [0,99].
+ *  Setting 0 means disable limit watermark.
+ *
+ * @return
+ *   - (0) if successful.
+ *   - negative if failed.
+ */
+__rte_experimental
+int rte_eth_rx_queue_set_lwm(uint16_t port_id, uint16_t queue_idx,
+   uint8_t lwm);
+
 typedef void (*buffer_tx_error_fn)(struct rte_mbuf **unsent, uint16_t count,
void *userdata);
 
@@ -3873,6 +3900,7 @@ enum rte_eth_event_type {
RTE_ETH_EVENT_DESTROY,  /**< port is released */
RTE_ETH_EVENT_IPSEC,/**< IPsec offload related event */
RTE_ETH_EVENT_FLOW_AGED,/**< New aged-out flows is detected */
+   RTE_ETH_EVENT_RXQ_LIMIT_REACHED,/**< RX queue limit reached */
RTE_ETH_EVENT_MAX   /**< max value of this enum */
 };
 
diff --git a/lib/ethdev/version.map b/lib/ethdev/version.map
index 20391ab..8b85ad8 100644
--- a/lib/ethdev/versio

[RFC v1 4/7] net/mlx5: add LWM event handling support

2022-05-05 Thread Spike Du

When the RQ WQE count hits the LWM, the kernel driver raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate one channel per shared device.
The channel carries a cookie that identifies the port and queue of the event.
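
A cookie that identifies both port and queue can be built by packing the two 16-bit ids into one integer. The sketch below shows one such layout. The layout itself (port in bits 16..31, queue in bits 0..15) and the function names are assumptions made for illustration; the cookie format used by this patch is not visible in this excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack port and Rx queue index into one cookie (assumed layout). */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint64_t)port_id << 16) | rxq_idx;
}

/* Recover port and Rx queue index from a cookie built by the packer. */
static void
lwm_cookie_unpack(uint64_t cookie, uint16_t *port_id, uint16_t *rxq_idx)
{
	assert(port_id != NULL && rxq_idx != NULL);
	*port_id = (uint16_t)((cookie >> 16) & 0xffff);
	*rxq_idx = (uint16_t)(cookie & 0xffff);
}
```

With this layout a single event channel shared by all Rx queues of a device can still report which queue crossed its watermark.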

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 61 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 ++
 drivers/net/mlx5/mlx5_rx.c   | 27 
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 149 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 72b1e35..334223e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1521,6 +1523,64 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   mlx5_lwm_unset(priv->sh);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1597,6 +1657,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4821ff0..515ff33 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1264,6 +1264,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1401,6 +1404,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1409,6 +1413,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1599,6 +1604,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *priv);
+v

[RFC v1 5/7] net/mlx5: support Rx queue based limit watermark

2022-05-05 Thread Spike Du

Add an mlx5-specific LWM (limit watermark) configuration handler.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
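
The handler has to translate between the percentage given by the user and a WQE count. The sketch below illustrates that integer arithmetic for a queue of 2^elts_n entries. The WQE-to-percentage direction follows the expression used in this patch's diff; the percentage-to-WQE direction and both function names are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Percentage [0,99] -> number of WQEs, for a queue of 2^elts_n entries. */
static uint32_t
lwm_percent_to_wqes(uint8_t elts_n, uint8_t percent)
{
	assert(elts_n < 24 && percent <= 99);
	return ((uint32_t)1 << elts_n) * percent / 100;
}

/* Number of WQEs -> percentage, using the same integer division. */
static uint8_t
lwm_wqes_to_percent(uint8_t elts_n, uint32_t wqes)
{
	uint32_t wqe_cnt = (uint32_t)1 << elts_n;

	assert(elts_n < 24 && wqes <= wqe_cnt);
	return (uint8_t)(wqes * 100 / wqe_cnt);
}
```

Because both directions truncate, a round trip can lose a point: for a 1024-entry queue, 30% becomes 307 WQEs, and 307 WQEs reads back as 29%.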

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |   4 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   1 +
 drivers/net/mlx5/mlx5_rx.c | 123 +
 drivers/net/mlx5/mlx5_rx.h |   3 +
 6 files changed, 133 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 4805d08..a7698c9 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -92,6 +92,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -518,6 +519,9 @@ Limitations
 - The NIC egress flow rules on representor port are not supported.
 
 
+- LWM:
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
+
 Statistics
 --
 
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 88d6e96..f3cf2f1 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -64,6 +64,7 @@ New Features
 
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
+  * Added Rx queue LWM(Limit WaterMark) support.
 
 
 Removed Items
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 44b1822..23b13e3 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3290,6 +3290,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 334223e..628003d 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2062,6 +2062,7 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_set_lwm = mlx5_rx_queue_set_lwm,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 6b2ef45..68564ea 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,16 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+   return (rxq->lwm * 100 / wqe_cnt);
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +162,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +182,7 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->conf.lwm = mlx5_rxq_lwm_to_percentage(rxq_priv);
 }
 
 /**
@@ -1214,3 +1228,112 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_RXQ_LIMIT_REACHED,
 (void *)(uintptr_t)rxq_idx);
 }
+
+/**
+ * DPDK callback to arm an Rx queue LWM(limit watermark) event.
+ * While the Rx queue fullness reaches the LWM limit, the driver catches
+ * an HW event and invokes the user event callback.
+ * After the last event handling, the user needs to call this API again
+ * to arm an additional event.
+ *
+ * @param dev
+ *   Pointer to the device structure.
+ * @param[in] rx_queue_id
+ *   Rx queue identificator.
+ * @param[in] lwm
+ *   The LWM value, is defined by a percentage of the Rx queue size.
+ *   [1-99] to set a new LWM (update the old value).
+ *   0 to unarm the event.
+ *
+ * @return
+ *   0 : operation success.
+ *   Otherwise:
+ *   - ENOMEM - not enough memory to create LWM event channel.
+ *   - EINVAL - the input Rxq is not created by devx.
+ *   - E2BIG  - lwm is bigger than 99.
+ */
+int
+mlx5_rx_queue_set_lwm(str

[RFC v1 6/7] net/mlx5: add private API to config host port shaper

2022-05-05 Thread Spike Du

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files that decides whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives LWM(Limit Watermark) event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |   7 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  44 +-
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 103 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 199 insertions(+), 15 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a7698c9..4e2ebff 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -522,6 +523,12 @@ Limitations
 - LWM:
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Support BlueField series NIC from BlueField 2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED flag set,
+only rate 0 and 100Mbps are supported.
+
 Statistics
 --
 
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index f3cf2f1..96083eb 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -65,6 +65,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 
 Removed Items
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index ed48245..e332261 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -16,23 +16,24 @@ if dlopen_ibverbs
 ]
 endif
 
-libnames = [ 'mlx5', 'ibverbs' ]
+libnames = [ 'mlx5', 'ibverbs']
 libs = []
 foreach libname:libnames
-lib = dependency('lib' + libname, static:static_ibverbs, required:false, method: 'pkg-config')
-if not lib.found() and not static_ibverbs
-lib = cc.find_library(libname, required:false)
-endif
-if lib.found()
-libs += lib
-if not static_ibverbs and not dlopen_ibverbs
-ext_deps += lib
-endif
-else
-build = false
-reason = 'missing dependency, "' + libname + '"'
-subdir_done()
-endif
+   lib = dependency('lib' + libname, static:static_ibverbs,
+   required:false, method: 'pkg-config')
+   if not lib.found() and not static_ibverbs
+   lib = cc.find_library(libname, required:false)
+   endif
+   if lib.found()
+   libs += lib
+   if not static_ibverbs and not dlopen_ibverbs
+   ext_deps += lib
+   endif
+   else
+   build = false
+   reason = 'missing dependency, "' + libname + '"'
+   subdir_done()
+   endif
 endforeach
 if static_ibverbs or dlopen_ibverbs
 # Build without adding shared libs to Requires.private
@@ -45,6 +46,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+   libmtcr_ul_found = true
+   ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -205,6 +213,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [
+[  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+'mopen'],
+]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
 config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 23b13e3..3559927 100644
--- a/drivers/common/mlx5/mlx5_prm.

[RFC v1 7/7] app/testpmd: add LWM and Host Shaper command

2022-05-05 Thread Spike Du

Add testpmd commands to configure LWM per Rx queue and the host shaper.
- Command syntax:
  set port <port_id> rxq <rxq_id> lwm <lwm>
  mlx5 set port <port_id> host_shaper lwm_triggered <0|1> rate <rate>

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code that handles the rxq LWM event: it waits for a while so that
the rxq drains, then disables the host shaper and re-arms the LWM event.
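
The rate argument in the commands above is expressed in units of 100Mbps. The following standalone sketch spells out that arithmetic, together with the restriction from the host shaper documentation in this series that only rate 0 and 100Mbps are supported when lwm_triggered is set. The function names and the uint8_t type chosen for the rate are assumptions, not the PMD's private API.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert the command-line rate value into Mbps. */
static uint32_t
host_shaper_rate_mbps(uint8_t rate)
{
	return (uint32_t)rate * HOST_SHAPER_RATE_UNIT_MBPS;
}

/*
 * Check a (lwm_triggered, rate) pair: with lwm_triggered set, only
 * rate 0 and rate 1 (100Mbps) are accepted.
 */
static int
host_shaper_args_check(bool lwm_triggered, uint8_t rate)
{
	if (lwm_triggered && rate > 1)
		return -EINVAL;
	return 0;
}
```

For example, "rate 50" corresponds to 50 * 100Mbps = 5000Mbps, which is the 5Gbps shaper from the last example command.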

Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c   |  74 +
 app/test-pmd/config.c|  23 ++
 app/test-pmd/meson.build |   3 +
 app/test-pmd/testpmd.c   |  13 +++
 app/test-pmd/testpmd.h   |   1 +
 doc/guides/nics/mlx5.rst |  76 +
 drivers/net/mlx5/meson.build |   7 +-
 drivers/net/mlx5/mlx5_test.c | 191 +++
 drivers/net/mlx5/mlx5_test.h |  27 ++
 9 files changed, 413 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_test.c
 create mode 100644 drivers/net/mlx5/mlx5_test.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 6ffea8e..f98cdf5 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_test.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17807,6 +17810,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
}
 };
 
+/* *** SET LIMIT WATER MARK FOR A RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t lwm;
+   uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_lwm_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->lwm, "lwm") == 0))
+   ret = set_rxq_lwm(res->port_num, res->rxq_num,
+ res->lwm_num);
+   if (ret < 0)
+   printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+   .f = cmd_rxq_lwm_parsed,
+   .data = (void *)0,
+   .help_str = "set port  rxq  lwm "
+   "Set lwm for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_lwm_set,
+   (void *)&cmd_rxq_lwm_port,
+   (void *)&cmd_rxq_lwm_portnum,
+   (void *)&cmd_rxq_lwm_rxq,
+   (void *)&cmd_rxq_lwm_rxqnum,
+   (void *)&cmd_rxq_lwm_lwm,
+   (void *)&cmd_rxq_lwm_lwmnum,
+   NULL,
+   },
+};
+
 /* 

 */
 
 /* list of instructions */
@@ -18093,6 +18163,10 @@ struct cmd_show_port_flow_transfer_proxy_result {
(cmdline_parse_inst_t *)&cmd_show_capability,
(cmdline_parse_inst_t *)&am

[RFC v2 0/7] introduce per-queue limit watermark and host shaper

2022-05-21 Thread Spike Du

LWM (limit watermark) is a per-Rx-queue attribute: when the Rx queue fullness
reaches the LWM limit, HW sends an event to the DPDK application.
Host shaper can configure shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically when one
of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from host port to wire
port. The workflow is: configure LWM on the Rx queue and enable the
lwm-triggered flag in the host shaper; after receiving an LWM event, wait until
the Rx queue is empty, then disable the shaper. Repeating this cycle reduces Rx
queue drops.
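
The cycle described in this paragraph can be modelled as a tiny state machine. The sketch below is a pure model with no DPDK calls; all names are hypothetical, and it assumes the LWM event is one-shot, so it must be re-armed after each event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of one Rx queue plus the host shaper it is tied to. */
struct flow_ctl {
	uint8_t lwm;     /* configured watermark, 0 means disabled */
	bool lwm_armed;  /* an LWM event can currently fire */
	bool shaper_on;  /* the lwm-triggered shaper is active */
};

/* Configure the watermark and arm the event. */
static void
flow_ctl_start(struct flow_ctl *fc, uint8_t lwm)
{
	assert(fc != NULL && lwm <= 99);
	fc->lwm = lwm;
	fc->lwm_armed = (lwm != 0);
	fc->shaper_on = false;
}

/* LWM event: the shaper kicks in and the event is consumed. */
static void
flow_ctl_on_lwm_event(struct flow_ctl *fc)
{
	assert(fc != NULL);
	if (!fc->lwm_armed)
		return;
	fc->lwm_armed = false;
	fc->shaper_on = true;
}

/* Delay expired (queue assumed drained): shaper off, event re-armed. */
static void
flow_ctl_on_delay_expired(struct flow_ctl *fc)
{
	assert(fc != NULL);
	if (!fc->shaper_on)
		return;
	fc->shaper_on = false;
	fc->lwm_armed = (fc->lwm != 0);
}
```

An event that arrives while the shaper is already active is ignored in this model, since the watermark is not armed at that point.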

Add a new libethdev API to set LWM and a new event,
RTE_ETH_EVENT_RXQ_LIMIT_REACHED, to handle the LWM event. The host shaper does
not align with the existing DPDK framework and is specific to Nvidia NICs, so it
uses a PMD private API.

For integration with testpmd, put the private cmdline function and LWM event
handler in the mlx5 PMD directory by adding a new file mlx5_test.c. Only add
minimal code in testpmd to invoke interfaces from mlx5_test.c.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c   |  74 +
 app/test-pmd/config.c|  23 ++
 app/test-pmd/meson.build |   4 +
 app/test-pmd/testpmd.c   |  24 ++
 app/test-pmd/testpmd.h   |   1 +
 doc/guides/nics/mlx5.rst |  84 ++
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 +
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 ++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 -
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++---
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +---
 drivers/net/mlx5/mlx5.c  |  68 +
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +++-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 292 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 +
 drivers/net/mlx5/mlx5_testpmd.c  | 184 
 drivers/net/mlx5/mlx5_testpmd.h  |  27 ++
 drivers/net/mlx5/mlx5_txpp.c |  28 +-
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 ++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +--
 lib/ethdev/ethdev_driver.h   |  22 ++
 lib/ethdev/rte_ethdev.c  |  52 
 lib/ethdev/rte_ethdev.h  |  74 -
 lib/ethdev/version.map   |   4 +
 33 files changed, 1305 insertions(+), 309 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
2.27.0

[RFC v2 1/7] net/mlx5: add LWM support for Rxq

2022-05-21 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object. It indicates the
percentage of the Rx queue size at which HW raises an LWM event to the user.
Allow LWM to be set in the modify_rq command.
Allow LWM to be reconfigured dynamically by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee8cf..305edffe71 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f9433a..c918a50ae9 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= 
MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a6b9..ebd1da455a 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@ int mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t 
idx);
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6b62..25a5f2c1fa 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
2.27.0

[RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-21 Thread Spike Du

LWM (limit watermark) describes the fullness of an Rx queue. If the Rx
queue fullness is above LWM, the device triggers the event
RTE_ETH_EVENT_RX_LWM.
LWM is defined as a percentage of the Rx queue size, with valid values in
[0,99].
Setting LWM to 0 disables it, which is the default.
When the percentage is translated to a number of queue descriptors, the
number should be bigger than 0 and less than the queue size.
Add LWM configuration and query driver callbacks in eth_dev_ops.

Signed-off-by: Spike Du 
---
 lib/ethdev/ethdev_driver.h | 22 
 lib/ethdev/rte_ethdev.c| 52 +++
 lib/ethdev/rte_ethdev.h| 74 +-
 lib/ethdev/version.map |  4 +++
 4 files changed, 151 insertions(+), 1 deletion(-)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..12ec5e7e19 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev 
*dev,
const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue limit watermark.
+ * see @rte_eth_rx_lwm_set()
+ */
+typedef int (*eth_rx_queue_lwm_set_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ uint8_t lwm);
+
+/**
+ * @internal Query queue limit watermark.
+ * see @rte_eth_rx_lwm_query()
+ */
+
+typedef int (*eth_rx_queue_lwm_query_t)(struct rte_eth_dev *dev,
+   uint16_t *rx_queue_id,
+   uint8_t *lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@ struct eth_dev_ops {
/** Priority flow control queue configure */
priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+   /** Set Rx queue limit watermark */
+   eth_rx_queue_lwm_set_t rx_queue_lwm_set;
+   /** Query Rx queue limit watermark */
+   eth_rx_queue_lwm_query_t rx_queue_lwm_query;
+
/** Set Unicast Table Array */
eth_uc_hash_table_set_tuc_hash_table_set;
/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index 8520aec561..0a46c71288 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4429,6 +4429,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, 
uint16_t queue_idx,
queue_idx, tx_rate));
 }
 
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id,
+  uint8_t lwm)
+{
+   struct rte_eth_dev *dev;
+   struct rte_eth_dev_info dev_info;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+   dev = &rte_eth_devices[port_id];
+
+   ret = rte_eth_dev_info_get(port_id, &dev_info);
+   if (ret != 0)
+   return ret;
+
+   if (queue_id >= dev_info.max_rx_queues) {
+   RTE_ETHDEV_LOG(ERR,
+   "Set queue LWM:port %u: invalid queue ID=%u.\n",
+   port_id, queue_id);
+   return -EINVAL;
+   }
+
+   if (lwm > 99)
+   return -EINVAL;
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_set, -ENOTSUP);
+   return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_set)(dev,
+queue_id, lwm));
+}
+
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+uint8_t *lwm)
+{
+   struct rte_eth_dev_info dev_info;
+   struct rte_eth_dev *dev;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+   dev = &rte_eth_devices[port_id];
+
+   ret = rte_eth_dev_info_get(port_id, &dev_info);
+   if (ret != 0)
+   return ret;
+
+   if (queue_id == NULL)
+   return -EINVAL;
+   if (*queue_id >= dev_info.max_rx_queues)
+   *queue_id = 0;
+
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_query, -ENOTSUP);
+   return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_query)(dev,
+queue_id, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04cff8ee10..687ae5ff29 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
 */
union rte_eth_rxseg *rx_seg;
 
-   uint64_t reserved_64s[2]; /**< Reserved for future fields */
+   /**
+* Per-queue Rx limit watermark defined as percentage of Rx queue
+

[RFC v2 5/7] net/mlx5: support Rx queue based limit watermark

2022-05-21 Thread Spike Du

Add mlx5-specific LWM (limit watermark) configuration and query handlers.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 156 +
 drivers/net/mlx5/mlx5_rx.h |   5 +
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56de11..79f56018ef 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- LWM:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 --
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 
adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+LWM introduction
+
+
+LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue fullness is above LWM, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst 
b/doc/guides/rel_notes/release_22_07.rst
index a60a0d5f16..253bc7e381 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -80,6 +80,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue LWM(Limit WaterMark) support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5100..3b5e60532a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a66625e..35ae51b3af 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ const struct eth_dev_ops mlx5_dev_ops = {
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_lwm_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_lwm_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 7d556c2b45..d30522e6df 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@ mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+   /* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+   return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t 
rx_queue_id,
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +183,8 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t 
rx_queue_id,
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->conf.lwm = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204

[RFC v2 4/7] net/mlx5: add LWM event handling support

2022-05-21 Thread Spike Du

When the RQ WQE count reaches the LWM, the kernel driver raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
The channel has a cookie that identifies the port and queue of the event.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 
 drivers/net/mlx5/mlx5_devx.c | 47 +
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f0988712df..e04a66625e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1524,6 +1526,69 @@ mlx5_alloc_shared_dev_ctx(const struct 
mlx5_dev_spawn_data *spawn,
return NULL;
 }
 
+/**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
 /**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
@@ -1601,6 +1666,7 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc961..a76f2fed3d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int 
*port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_

[RFC v2 2/7] common/mlx5: share interrupt management

2022-05-21 Thread Spike Du

There is a lot of duplicated code for creating and initializing
rte_intr_handle. Add a new mlx5_os API to do this and replace all related
PMD code with this API.

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ---
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +---
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 +---
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +--
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c 
b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5cd1..f10a981a37 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@ mlx5_os_wrapped_mkey_destroy(struct mlx5_pmd_wrapped_mr 
*pmd_mr)
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+

[RFC v2 6/7] net/mlx5: add private API to config host port shaper

2022-05-21 Thread Spike Du

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and the lwm-triggered flag for
a host port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  26 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 
 drivers/common/mlx5/mlx5_prm.h |  25 ++
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 103 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 +++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 202 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 79f56018ef..3da6f5a03c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Support BlueField series NIC from BlueField 2.
+  - When configure host shaper with MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED flag 
set,
+only rate 0 and 100Mbps are supported.
+
 Statistics
 --
 
@@ -1692,3 +1699,22 @@ LWM (Limit WaterMark) is a per Rx queue attribute, it 
should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above LWM, an event is sent to PMD.
 
+Host shaper introduction
+
+
+Host shaper register is per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+LWM event trigger. In immediate mode, the rate limit is configured immediately
+to host shaper. When deferring to LWM trigger, the shaper is not set until an
+LWM event is received by any Rx queue in a VF representor belonging to the host
+port. The only rate supported for deferred mode is 100Mbps (there is no limit
+on the supported rates for immediate mode). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst 
b/doc/guides/rel_notes/release_22_07.rst
index 253bc7e381..21879bda41 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -81,6 +81,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build 
b/drivers/common/mlx5/linux/meson.build
index 5335f5b027..51c6e5dd2e 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', 
'--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [
+[  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+'mopen'],
+]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
 config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: 
libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e60532a..92d05a7368 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC

[RFC v2 7/7] app/testpmd: add LWM and Host Shaper command

2022-05-21 Thread Spike Du

Add command line options to support per-rxq LWM configuration.
- Command syntax:
  set port  rxq  lwm 
  mlx5 set port  host_shaper lwm_triggered <0|1> rate 

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code to handle the rxq LWM event: it delays a while so that the
rxq empties, then disables the host shaper and rearms the LWM event.

Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c  |  74 +
 app/test-pmd/config.c   |  23 
 app/test-pmd/meson.build|   4 +
 app/test-pmd/testpmd.c  |  24 +
 app/test-pmd/testpmd.h  |   1 +
 doc/guides/nics/mlx5.rst|  46 
 drivers/net/mlx5/mlx5_testpmd.c | 184 
 drivers/net/mlx5/mlx5_testpmd.h |  27 +
 8 files changed, 383 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 91e4090582..e8663dd797 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17803,6 +17806,73 @@ cmdline_parse_inst_t cmd_show_port_flow_transfer_proxy 
= {
}
 };
 
+/* *** SET LIMIT WATER MARK FOR A RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t lwm;
+   uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_lwm_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->lwm, "lwm") == 0))
+   ret = set_rxq_lwm(res->port_num, res->rxq_num,
+ res->lwm_num);
+   if (ret < 0)
+   printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+   .f = cmd_rxq_lwm_parsed,
+   .data = (void *)0,
+   .help_str = "set port  rxq  lwm "
+   "Set lwm for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_lwm_set,
+   (void *)&cmd_rxq_lwm_port,
+   (void *)&cmd_rxq_lwm_portnum,
+   (void *)&cmd_rxq_lwm_rxq,
+   (void *)&cmd_rxq_lwm_rxqnum,
+   (void *)&cmd_rxq_lwm_lwm,
+   (void *)&cmd_rxq_lwm_lwmnum,
+   NULL,
+   },
+};
+
 /* 

 */
 
 /* list of instructions */
@@ -18089,6 +18159,10 @@ cmdline_parse_ctx_t main_ctx[] = {
(cmdline_parse_inst_t *)&cmd_show_capability,
(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
(cmdli

RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-22 Thread Spike Du

Hi,

> -Original Message-
> From: Stephen Hemminger 
> Sent: Sunday, May 22, 2022 11:25 PM
> To: Spike Du 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; NBU-Contact-
> Thomas Monjalon (EXTERNAL) ; dev@dpdk.org;
> Raslan Darawsheh 
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du  wrote:
> 
> > LWM(limit watermark) describes the fullness of a Rx queue. If the Rx
> > queue fullness is above LWM, the device will trigger the event
> > RTE_ETH_EVENT_RX_LWM.
> > LWM is defined as a percentage of Rx queue size with valid value of
> > [0,99].
> > Setting LWM to 0 means disable it, which is the default.
> > When translate the percentage to queue descriptor number, the number
> > should be bigger than 0 and less than queue size.
> > Add LWM's configuration and query driver callbacks in eth_dev_ops.
> >
> > Signed-off-by: Spike Du 
> 
> One other objection, please don't invent yet another event channel for this.
> It should be part of existing Rx interrupt logic.

I think this is a misunderstanding: the "event channel" is a concept specific
to the MLX5 PMD.
For the DPDK common code, like testpmd and event register/callback, I'm using
the standard DPDK interfaces.

RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-22 Thread Spike Du

Hi, pls see below.

> -Original Message-
> From: Stephen Hemminger 
> Sent: Sunday, May 22, 2022 11:23 PM
> To: Spike Du 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; NBU-Contact-
> Thomas Monjalon (EXTERNAL) ; dev@dpdk.org;
> Raslan Darawsheh 
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> On Sun, 22 May 2022 08:58:56 +0300
> Spike Du  wrote:
> 
> > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h index
> > 04cff8ee10..687ae5ff29 100644
> > --- a/lib/ethdev/rte_ethdev.h
> > +++ b/lib/ethdev/rte_ethdev.h
> > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> >*/
> >   union rte_eth_rxseg *rx_seg;
> >
> > - uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > + /**
> > +  * Per-queue Rx limit watermark defined as percentage of Rx queue
> > +  * size. If Rx queue receives traffic higher than this percentage,
> > +  * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > +  */
> > + uint8_t lwm;
> > +
> > + uint8_t reserved_bits[3];
> > + uint32_t reserved_32s;
> > + uint64_t reserved_64s;
> >   void *reserved_ptrs[2];   /**< Reserved for future fields */
> >  };
> >
> 
> Ok but, this is an ABI risk about this because reserved stuff was never
> required before.
> Whenever is a reserved field is introduced the code (in this case
> rte_ethdev_configure).
> 
> Best practice would have been to have the code require all reserved fields be
> 0 in earlier releases. In this case an application is likely to define a
> watermark of zero; how will your code handle it.
A watermark of 0 is desired; it is the default. An LWM of 0 means the Rx
queue's watermark is not monitored, hence no LWM event is generated.
> 
> Also, using 8 bits as percentage is different than how other API's handle
> this. Since Rx queue size is in packets, why is this not in packets?
The short answer is: to simplify the LWM configuration.
Rx queue descriptors are complex nowadays.
For a normal queue, the user can easily configure LWM as a number of queue
descriptors.
But for the queues below, it's not easy.
Take MPRQ as an example; the testpmd command options can be " -a 
:03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,mprq_log_stride_size=8,mprq_log_stride_num=3
-- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ".
In the MLX5 implementation, the minimum "unit" in the queue has 64 descriptors
and the number of "units" is 16. If you configure according to the descriptor
number (1024), you may easily set LWM to something like 512, but HW doesn't
allow it, because 512 > 16. If you want the watermark to be half, the correct
value is 8.
The same issue applies to features like "Rx queue buffer split", where a packet
can be split across multiple descriptors.
Using a percentage doesn't have such issues; the PMD covers all the details.
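
To illustrate the translation the PMD would hide from the user, here is a
minimal sketch in plain C. It is not the mlx5 code: the helper name
lwm_percent_to_units() and its signature are made up for this example, and it
assumes the HW threshold counts still-free "units" (emptiness) while the API
percentage describes fullness.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a DPDK or mlx5 API): translate an LWM fullness
 * percentage into a HW threshold counted in queue "units".
 * Returns 0 on success, -1 on invalid input.
 * On success *units is 0 when LWM is disabled (percent == 0), otherwise a
 * value bigger than 0 and less than nb_units.
 */
static inline int
lwm_percent_to_units(uint8_t percent, uint32_t nb_units, uint32_t *units)
{
	uint64_t n;

	if (units == NULL || percent > 99 || nb_units < 2)
		return -1;
	if (percent == 0) {
		*units = 0; /* LWM disabled */
		return 0;
	}
	/* Fullness percentage -> number of units that are still free. */
	n = (uint64_t)nb_units * (100u - percent) / 100u;
	/* Clamp so the threshold stays inside (0, nb_units). */
	if (n == 0)
		n = 1;
	if (n >= nb_units)
		n = nb_units - 1;
	*units = (uint32_t)n;
	return 0;
}
```

With the MPRQ example above (16 units), a request of 50% maps to 8 units,
while the same 50% on a plain 1024-descriptor queue maps to 512.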

> Also document what behavior of 0 is.
Sure. The behavior is the same as before this feature existed; please see above.

> Why introduce new query/set operations? This should just be part of the
> overall device configuration.
Because the implementations differ. LWM can be a dynamic configuration, which
helps the user design flexible flow control.
The user may be fine with an LWM of 80% to get high throughput, and later
switch to 50% to throttle the traffic responsively by handling the LWM event
in order to reduce drops.
Some drivers, like mlx5, may implement the LWM event as one-shot. When you
receive an LWM event, you need to reconfigure LWM in order to receive the
event again, so you are unlikely to be overwhelmed by events.
All of these require a set operation.
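
The one-shot behaviour can be illustrated with a small software model. This is
only a sketch under assumptions: struct rxq_model and both functions are
hypothetical names, and the real arming happens in the HW, not in code like
this. The event fires once when the fullness rises above the watermark and
stays silent until the watermark is set again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one Rx queue with a one-shot LWM event. */
struct rxq_model {
	uint16_t nb_desc; /* queue size in descriptors */
	uint8_t lwm;      /* watermark in percent, 0 = disabled */
	bool armed;       /* true while the event can still fire */
};

/* Set (or re-arm) the watermark; values above 99 are rejected. */
int
rxq_model_lwm_set(struct rxq_model *q, uint8_t lwm)
{
	assert(q != NULL);
	if (lwm > 99)
		return -1;
	q->lwm = lwm;
	q->armed = (lwm != 0);
	return 0;
}

/* Report a new fill level; returns true if an LWM event is raised. */
bool
rxq_model_fill(struct rxq_model *q, uint16_t used)
{
	assert(q != NULL);
	if (!q->armed)
		return false;
	/* Fullness must be strictly above lwm percent of the queue size. */
	if ((uint32_t)used * 100 <= (uint32_t)q->nb_desc * q->lwm)
		return false;
	q->armed = false; /* one-shot: stay silent until set again */
	return true;
}
```

In this model a second crossing is ignored until rxq_model_lwm_set() is called
again, which is why a dynamic "set" operation is needed.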

As for the query operation: the rte_event API rte_eth_dev_callback_process()
is a per-port API and doesn't carry much information when an event happens.
When an LWM event happens, we need to know on which Rx queue it happened and,
optionally, the current LWM percentage of that queue.
The query operation serves this purpose.
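
As an illustration only, the query can be thought of as a scan over the port's
Rx queues that starts at a caller-supplied queue and reports the first queue
with a pending event. The model below is an assumption about the semantics,
not the driver code; struct port_model, MODEL_NB_RXQ and the return values
(1 = found, 0 = nothing pending, -1 = bad argument) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NB_RXQ 4 /* hypothetical number of Rx queues on the port */

struct port_model {
	bool lwm_pending[MODEL_NB_RXQ]; /* event raised, not yet consumed */
	uint8_t lwm[MODEL_NB_RXQ];      /* configured percentage per queue */
};

/*
 * *queue_id is the start queue on input and the queue that raised the
 * event on output. An out-of-range start wraps to queue 0.
 */
int
port_model_lwm_query(struct port_model *p, uint16_t *queue_id, uint8_t *lwm)
{
	uint16_t start, i;

	assert(p != NULL);
	if (queue_id == NULL)
		return -1;
	start = (*queue_id >= MODEL_NB_RXQ) ? 0 : *queue_id;
	for (i = 0; i < MODEL_NB_RXQ; i++) {
		uint16_t q = (uint16_t)((start + i) % MODEL_NB_RXQ);

		if (!p->lwm_pending[q])
			continue;
		p->lwm_pending[q] = false; /* consume the event */
		*queue_id = q;
		if (lwm != NULL)
			*lwm = p->lwm[q];
		return 1;
	}
	return 0;
}
```

An event handler would call such a query in a loop until it returns 0, handling
each reported queue in turn.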


Regards,
Spike.

RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-23 Thread Spike Du




> -Original Message-
> From: Thomas Monjalon 
> Sent: Monday, May 23, 2022 6:59 PM
> To: Spike Du ; Morten Brørup
> 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; dev@dpdk.org;
> Raslan Darawsheh 
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> External email: Use caution opening links or attachments
> 
> 
> 23/05/2022 08:07, Morten Brørup:
> > > +   uint8_t lwm;
> >
> > Why percentage, why not 1/128th, or 1/16th? 2^N seems more logical, and
> I wonder if such high granularity is really necessary. Just a thought, it's 
> not
> important.
> 
> I think percentage is the easiest to understand and to share with other teams
> in design documents.
> 
> > If you stick with percentage, it only needs 7 bits, and you can make the
> remaining one bit reserved.

Agree, will change to use 7 bits.
> >
> > Also, please add here that 0 means disable.

Sure.
> 
> Good idea.
>

RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-23 Thread Spike Du




> -Original Message-
> From: Thomas Monjalon 
> Sent: Tuesday, May 24, 2022 5:46 AM
> To: Stephen Hemminger ; Spike Du
> 
> Cc: dev@dpdk.org; Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; Raslan Darawsheh
> ; ferruh.yi...@amd.com;
> andrew.rybche...@oktetlabs.ru
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> 
> 23/05/2022 05:01, Spike Du:
> > From: Stephen Hemminger 
> > > Spike Du  wrote:
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >*/
> > > >   union rte_eth_rxseg *rx_seg;
> > > >
> > > > - uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > + /**
> > > > +  * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +  * size. If Rx queue receives traffic higher than this percentage,
> > > > +  * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +  */
> > > > + uint8_t lwm;
> > > > +
> > > > + uint8_t reserved_bits[3];
> > > > + uint32_t reserved_32s;
> > > > + uint64_t reserved_64s;
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> 
> An ABI compatibility issue would be for an application compiled with an old
> DPDK, and loading a new DPDK at runtime.
> Let's think what would happen in such a case.
> 
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> 
> rte_eth_rx_queue_setup() is called with rx_conf->lwm not initialized.
> Then the library and drivers may interpret a wrong value.
> 
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> >
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> 
> The problem is to have a value not initialized.
> I think the best approach is to not expose the LWM value through this
> configuration structure.
> If the need is to get the current value, we should better add a field in the
> struct rte_eth_rxq_info.

At least in all the DPDK app/example code, rxconf is initialized to 0 before
the Rx queue is set up. If users follow these examples, we should not have an
ABI issue.
Since many people are concerned about the rxconf change, it's OK to remove the
LWM field there.
Yes, I think we can add lwm to rte_eth_rxq_info. If we can set an Rx queue's
attribute, we should have a way to get it.

> 
> [...]
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> 
> Thanks to the "set" function, we can avoid the ABI compat issue.
> 
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsively by handling LWM event in order to
> reduce drop.
> > Some driver like mlx5 may implement LWM event as one-time shot. When
> > you receive LWM event, you need to reconfigure LWM in order to receive
> the event again, thus you will not likely to be overwhelmed by the events.
> > These all require set operation.
> 
> Yes it is better to allow dynamic watermark configuration, not using the
> function rte_eth_rx_queue_setup().
> 
> > For the query operation. The rte_event API
> rte_eth_dev_callback_process() is per-port API, it doesn't carry much
> information when an event happens.
> > When a LWM event happens, we need to know in which Rx queue it
> happens or optionally what's the current LWM percentage of this queue.
> > The query operation serves this purpose.
> 
> Yes "query" has to be called in the event handler because event structure is
> not specific to any event type.
> 
>

RE: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-23 Thread Spike Du




> -Original Message-
> From: Stephen Hemminger 
> Sent: Tuesday, May 24, 2022 6:55 AM
> To: Spike Du 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; NBU-Contact-
> Thomas Monjalon (EXTERNAL) ; dev@dpdk.org;
> Raslan Darawsheh 
> Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit watermark
> 
> 
> On Mon, 23 May 2022 03:01:20 +
> Spike Du  wrote:
> 
> > Hi, pls see below.
> >
> > > -Original Message-
> > > From: Stephen Hemminger 
> > > Sent: Sunday, May 22, 2022 11:23 PM
> > > To: Spike Du 
> > > Cc: Matan Azrad ; Slava Ovsiienko
> > > ; Ori Kam ; NBU-Contact-
> > > Thomas Monjalon (EXTERNAL) ; dev@dpdk.org;
> > > Raslan Darawsheh 
> > > Subject: Re: [RFC v2 3/7] ethdev: introduce Rx queue based limit
> > > watermark
> > >
> > >
> > > On Sun, 22 May 2022 08:58:56 +0300
> > > Spike Du  wrote:
> > >
> > > > diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
> > > > index
> > > > 04cff8ee10..687ae5ff29 100644
> > > > --- a/lib/ethdev/rte_ethdev.h
> > > > +++ b/lib/ethdev/rte_ethdev.h
> > > > @@ -1249,7 +1249,16 @@ struct rte_eth_rxconf {
> > > >*/
> > > >   union rte_eth_rxseg *rx_seg;
> > > >
> > > > - uint64_t reserved_64s[2]; /**< Reserved for future fields */
> > > > + /**
> > > > +  * Per-queue Rx limit watermark defined as percentage of Rx queue
> > > > +  * size. If Rx queue receives traffic higher than this percentage,
> > > > +  * the event RTE_ETH_EVENT_RX_LWM is triggered.
> > > > +  */
> > > > + uint8_t lwm;
> > > > +
> > > > + uint8_t reserved_bits[3];
> > > > + uint32_t reserved_32s;
> > > > + uint64_t reserved_64s;
> > > >   void *reserved_ptrs[2];   /**< Reserved for future fields */
> > > >  };
> > > >
> > >
> > > Ok but, this is an ABI risk about this because reserved stuff was
> > > never required before.
> > > Whenever is a reserved field is introduced the code (in this case
> > > rte_ethdev_configure).
> > >
> > > Best practice would have been to have the code require all reserved
> > > fields be
> > > 0 in earlier releases. In this case an application is like to define
> > > a watermark of zero; how will your code handle it.
> > Having watermark of 0 is desired, which is the default. LWM of 0 means
> > the Rx Queue's watermark is not monitored, hence no LWM event is
> generated.
> > >
> > > Also, using 8 bits as percentage is different than how other API's handle
> this.
> > > Since Rx queue size is in packets, why is this not in packets?
> > The short answer is to simply the LWM configuration.
> > Rx queue descriptor is complex nowadays.
> > For normal queue, user may configure LWM according to queue descriptor
> number easily.
> > But for below queues, it's not easy:
> > Take mprq as example, the testpmd cmd  options can be " -a
> >
> :03:00.0,rxqs_min_mprq=1,mprq_en=1,mprq_max_memcpy_len=465,
> mprq_lo
> > g_stride_size=8,mprq_log_stride_num=3
> > -- --mbcache=512 -i  --nb-cores=7  --txd=1024 --rxd=1024 ", For MLX5
> > implementation,  the minimum "unit" in queue has 64 descriptors, the
> > "unit" number is 16,  if you configure according to descriptor number(1024)
> Here, you may easily set LWM as something like 512, but HW doesn't allow it,
> because 512 > 16. If you want the watermark to be half, the correct value is 
> 8.
> > The same issue happens to feature like "Rx queue buffer split" where a
> packet can be split to multiple descriptors.
> > Using percentage doesn't have such issues, PMD will cover all the details.
> >
> > > Also document what behavior of 0 is.
> > Sure. The behavior is like the old days without this feature, pls see above.
> >
> > > Why introduce new query/set operations? This should just be part of
> > > the overall device configuration.
> > Due to different implementation. LWM can be a dynamic configuration
> which can help user design a flexible flow control.
> > User may feel ok with LWM of 80% to get high throughput, or later on with
> 50% to throttle the traffic responsivel

[PATCH v3 0/7] introduce per-queue limit watermark and host shaper

2022-05-24 Thread Spike Du

LWM (limit watermark) is a per-Rx-queue attribute. When the Rx queue fullness
reaches the LWM limit, the HW sends an event to the DPDK application.
The host shaper can configure the shaper rate and lwm-triggered for a host
port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically when one
of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from the host port to
the wire port.
The workflow is: configure LWM on the Rx queue and enable the lwm-triggered
flag in the host shaper; after receiving an LWM event, wait until the Rx queue
is empty, then disable the shaper. This workflow is repeated to reduce Rx
queue drops.
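
The workflow can be summarized with a small model. It is a sketch under
assumptions, not the mlx5 implementation: struct host_port_model and both
functions are hypothetical, the rate is kept in units of 100Mbps, and the
"HW" side is just a function call here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a host port's shaper state. */
struct host_port_model {
	bool lwm_triggered; /* shaper reacts to LWM events when true */
	uint8_t rate;       /* shaper rate in 100Mbps units, 0 = disabled */
};

/* What happens when any Rx queue of the host port raises an LWM event. */
void
host_port_on_lwm_event(struct host_port_model *hp)
{
	assert(hp != NULL);
	if (hp->lwm_triggered)
		hp->rate = 1; /* the 100Mbps shaper kicks in automatically */
}

/*
 * What the application does once the Rx queue has drained: disable the
 * shaper. The caller is expected to re-arm LWM on the queue afterwards.
 */
void
host_port_after_drain(struct host_port_model *hp)
{
	assert(hp != NULL);
	hp->rate = 0;
}
```

One cycle is therefore: event, shaper at 100Mbps, queue drains, shaper
disabled, LWM re-armed.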

Add a new libethdev API to set LWM, and add the rte event
RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle the LWM event. Because the host
shaper doesn't align with the existing DPDK framework and is specific to
Nvidia NICs, use a PMD private API for it.

For integration with testpmd, put the private cmdline function and the LWM
event handler in the mlx5 PMD directory by adding a new file mlx5_testpmd.c.
Only add minimal code in testpmd to invoke the interfaces from mlx5_testpmd.c.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based limit watermark
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based limit watermark
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c   |  74 +
 app/test-pmd/config.c|  21 ++
 app/test-pmd/meson.build |   4 +
 app/test-pmd/testpmd.c   |  24 ++
 app/test-pmd/testpmd.h   |   1 +
 doc/guides/nics/mlx5.rst |  84 ++
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 +
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 +
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 ++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 ++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 -
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++---
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +---
 drivers/net/mlx5/mlx5.c  |  68 +
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +++-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 292 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 +
 drivers/net/mlx5/mlx5_testpmd.c  | 184 
 drivers/net/mlx5/mlx5_testpmd.h  |  27 ++
 drivers/net/mlx5/mlx5_txpp.c |  28 +-
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 ++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +--
 lib/ethdev/ethdev_driver.h   |  22 ++
 lib/ethdev/rte_ethdev.c  |  52 
 lib/ethdev/rte_ethdev.h  |  71 +
 lib/ethdev/version.map   |   2 +
 33 files changed, 1299 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
2.27.0

[PATCH v3 1/7] net/mlx5: add LWM support for Rxq

2022-05-24 Thread Spike Du

Add an lwm (Limit WaterMark) field to the Rxq object. It indicates the
percentage of the Rx queue size used by the HW to raise an LWM event to the
user.
Allow setting LWM in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee8cf..305edffe71 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f9433a..c918a50ae9 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@ mlx5_rxq_obj_modify_rq_vlan_strip(struct mlx5_rxq_priv *rxq, int on)
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@ mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a6b9..ebd1da455a 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@ int mlx5_txq_devx_obj_new(struct rte_eth_dev *dev, uint16_t idx);
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6b62..25a5f2c1fa 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
2.27.0

[PATCH v3 3/7] ethdev: introduce Rx queue based limit watermark

2022-05-24 Thread Spike Du

LWM (limit watermark) describes the fullness of an Rx queue. If the Rx
queue fullness is above LWM, the device will trigger the event
RTE_ETH_EVENT_RX_LWM.
LWM is defined as a percentage of the Rx queue size, with valid values in
[0,99].
Setting LWM to 0 disables it, which is the default.
Add LWM configuration and query driver callbacks in eth_dev_ops.

Signed-off-by: Spike Du 
---
 lib/ethdev/ethdev_driver.h | 22 
 lib/ethdev/rte_ethdev.c| 52 
 lib/ethdev/rte_ethdev.h| 71 ++
 lib/ethdev/version.map |  2 ++
 4 files changed, 147 insertions(+)

diff --git a/lib/ethdev/ethdev_driver.h b/lib/ethdev/ethdev_driver.h
index 69d9dc21d8..49e4ef0fbb 100644
--- a/lib/ethdev/ethdev_driver.h
+++ b/lib/ethdev/ethdev_driver.h
@@ -470,6 +470,23 @@ typedef int (*eth_rx_queue_setup_t)(struct rte_eth_dev *dev,
const struct rte_eth_rxconf *rx_conf,
struct rte_mempool *mb_pool);
 
+/**
+ * @internal Set Rx queue limit watermark.
+ * @see rte_eth_rx_lwm_set()
+ */
+typedef int (*eth_rx_queue_lwm_set_t)(struct rte_eth_dev *dev,
+ uint16_t rx_queue_id,
+ uint8_t lwm);
+
+/**
+ * @internal Query queue limit watermark event.
+ * @see rte_eth_rx_lwm_query()
+ */
+
+typedef int (*eth_rx_queue_lwm_query_t)(struct rte_eth_dev *dev,
+   uint16_t *rx_queue_id,
+   uint8_t *lwm);
+
 /** @internal Setup a transmit queue of an Ethernet device. */
 typedef int (*eth_tx_queue_setup_t)(struct rte_eth_dev *dev,
uint16_t tx_queue_id,
@@ -1168,6 +1185,11 @@ struct eth_dev_ops {
/** Priority flow control queue configure */
priority_flow_ctrl_queue_config_t priority_flow_ctrl_queue_config;
 
+   /** Set Rx queue limit watermark. */
+   eth_rx_queue_lwm_set_t rx_queue_lwm_set;
+   /** Query Rx queue limit watermark event. */
+   eth_rx_queue_lwm_query_t rx_queue_lwm_query;
+
/** Set Unicast Table Array */
eth_uc_hash_table_set_t uc_hash_table_set;
/** Set Unicast hash bitmap */
diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c
index a175867651..e10e874aae 100644
--- a/lib/ethdev/rte_ethdev.c
+++ b/lib/ethdev/rte_ethdev.c
@@ -4424,6 +4424,58 @@ int rte_eth_set_queue_rate_limit(uint16_t port_id, uint16_t queue_idx,
queue_idx, tx_rate));
 }
 
+int rte_eth_rx_lwm_set(uint16_t port_id, uint16_t queue_id,
+  uint8_t lwm)
+{
+   struct rte_eth_dev *dev;
+   struct rte_eth_dev_info dev_info;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+   dev = &rte_eth_devices[port_id];
+
+   ret = rte_eth_dev_info_get(port_id, &dev_info);
+   if (ret != 0)
+   return ret;
+
+   if (queue_id > dev_info.max_rx_queues) {
+   RTE_ETHDEV_LOG(ERR,
+   "Set queue LWM: port %u: invalid queue ID=%u.\n",
+   port_id, queue_id);
+   return -EINVAL;
+   }
+
+   if (lwm > 99)
+   return -EINVAL;
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_set, -ENOTSUP);
+   return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_set)(dev,
+queue_id, lwm));
+}
+
+int rte_eth_rx_lwm_query(uint16_t port_id, uint16_t *queue_id,
+uint8_t *lwm)
+{
+   struct rte_eth_dev_info dev_info;
+   struct rte_eth_dev *dev;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -ENODEV);
+   dev = &rte_eth_devices[port_id];
+
+   ret = rte_eth_dev_info_get(port_id, &dev_info);
+   if (ret != 0)
+   return ret;
+
+   if (queue_id == NULL)
+   return -EINVAL;
+   if (*queue_id >= dev_info.max_rx_queues)
+   *queue_id = 0;
+
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->rx_queue_lwm_query, -ENOTSUP);
+   return eth_err(port_id, (*dev->dev_ops->rx_queue_lwm_query)(dev,
+queue_id, lwm));
+}
+
 RTE_INIT(eth_dev_init_fp_ops)
 {
uint32_t i;
diff --git a/lib/ethdev/rte_ethdev.h b/lib/ethdev/rte_ethdev.h
index 04225bba4d..541178fa76 100644
--- a/lib/ethdev/rte_ethdev.h
+++ b/lib/ethdev/rte_ethdev.h
@@ -1931,6 +1931,14 @@ struct rte_eth_rxq_info {
uint8_t queue_state;/**< one of RTE_ETH_QUEUE_STATE_*. */
uint16_t nb_desc;   /**< configured number of RXDs. */
uint16_t rx_buf_size;   /**< hardware receive buffer size. */
+   /**
+* Per-queue Rx limit watermark defined as percentage of Rx queue
+* size. If Rx

[PATCH v3 4/7] net/mlx5: add LWM event handling support

2022-05-24 Thread Spike Du

When LWM meets RQ WQE, the kernel driver raises an event to SW.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
The channel has a cookie that identifies the port and queue of the event.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 
 drivers/net/mlx5/mlx5_devx.c | 47 +
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f0988712df..e04a66625e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1524,6 +1526,69 @@ mlx5_alloc_shared_dev_ctx(const struct mlx5_dev_spawn_data *spawn,
return NULL;
 }
 
+/**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
 /**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
@@ -1601,6 +1666,7 @@ mlx5_free_shared_dev_ctx(struct mlx5_dev_ctx_shared *sh)
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc961..a76f2fed3d 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_

[PATCH v3 6/7] net/mlx5: add private API to config host port shaper

2022-05-24 Thread Spike Du

The host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether to enable this function.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper can configure the shaper rate and lwm-triggered for a host
port.
The shaper limits the rate of traffic from the host port to the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  26 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 
 drivers/common/mlx5/mlx5_prm.h |  25 ++
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 103 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 +++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 202 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 79f56018ef..3da6f5a03c 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Support BlueField series NIC from BlueField 2.
+  - When configuring host shaper with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED
+    flag set, only rate 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,22 @@ LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above LWM, an event is sent to PMD.
 
+Host shaper introduction
+
+
+Host shaper register is per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+LWM event trigger. In immediate mode, the rate limit is configured immediately
+to host shaper. When deferring to LWM trigger, the shaper is not set until an
+LWM event is received by any Rx queue in a VF representor belonging to the host
+port. The only rate supported for deferred mode is 100Mbps (there is no limit
+on the supported rates for immediate mode). In deferred mode, the shaper is set
+on the host port by the firmware upon receiving the LWM event, which allows
+throttling host traffic on LWM events at minimum latency, preventing excess
+drops in the Rx queue.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 34f86eaffa..94720af3af 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b027..51c6e5dd2e 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [
+[  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+'mopen'],
+]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
 config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 3b5e60532a..92d05a7368 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3771,6 +3771,7 @@ enum {
MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC

[PATCH v3 7/7] app/testpmd: add LWM and Host Shaper command

2022-05-24 Thread Spike Du

Add command line options to support per-rxq LWM configuration.
- Command syntax:
  set port  rxq  lwm 
  mlx5 set port  host_shaper lwm_triggered <0|1> rate 

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper lwm_triggered 0 rate 50
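
Since the rate argument is expressed in units of 100Mbps, the conversion is a
single multiplication. The helper below is a hypothetical illustration (its
name and the uint8_t type of the rate are assumptions, not taken from the
patch).

```c
#include <stdint.h>

/* Hypothetical helper: convert a host shaper rate value to Mbps. */
uint32_t
host_shaper_rate_to_mbps(uint8_t rate)
{
	/* One rate unit is 100Mbps; a rate of 0 means the shaper is off. */
	return (uint32_t)rate * 100u;
}
```

For the command above, rate 50 corresponds to 5000Mbps, that is 5Gbps.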

Add sample code to handle the rxq LWM event. It waits a while so that the rxq
empties, then disables the host shaper and re-arms the LWM event.
Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c  |  74 +
 app/test-pmd/config.c   |  21 
 app/test-pmd/meson.build|   4 +
 app/test-pmd/testpmd.c  |  24 +
 app/test-pmd/testpmd.h  |   1 +
 doc/guides/nics/mlx5.rst|  46 
 drivers/net/mlx5/mlx5_testpmd.c | 184 
 drivers/net/mlx5/mlx5_testpmd.h |  27 +
 8 files changed, 381 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 1e5b294ab3..86342f2ac6 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -67,6 +67,9 @@
 #include "cmdline_mtr.h"
 #include "cmdline_tm.h"
 #include "bpf_cmd.h"
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 static struct cmdline *testpmd_cl;
 
@@ -17804,6 +17807,73 @@ cmdline_parse_inst_t cmd_show_port_flow_transfer_proxy = {
}
 };
 
+/* *** SET LIMIT WATER MARK FOR A RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t lwm;
+   uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_lwm_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->lwm, "lwm") == 0))
+   ret = set_rxq_lwm(res->port_num, res->rxq_num,
+ res->lwm_num);
+   if (ret < 0)
+   printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+   .f = cmd_rxq_lwm_parsed,
+   .data = (void *)0,
+   .help_str = "set port  rxq  lwm "
+   "Set lwm for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_lwm_set,
+   (void *)&cmd_rxq_lwm_port,
+   (void *)&cmd_rxq_lwm_portnum,
+   (void *)&cmd_rxq_lwm_rxq,
+   (void *)&cmd_rxq_lwm_rxqnum,
+   (void *)&cmd_rxq_lwm_lwm,
+   (void *)&cmd_rxq_lwm_lwmnum,
+   NULL,
+   },
+};
+
 /* 

 */
 
 /* list of instructions */
@@ -18091,6 +18161,10 @@ cmdline_parse_ctx_t main_ctx[] = {
(cmdline_parse_inst_t *)&cmd_show_capability,
(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
(cmdli

[PATCH v3 5/7] net/mlx5: support Rx queue based limit watermark

2022-05-24 Thread Spike Du

Add mlx5 specific LWM (limit watermark) configuration and query handlers.
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 ++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 156 +
 drivers/net/mlx5/mlx5_rx.h |   5 +
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56de11..79f56018ef 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- LWM:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+LWM introduction
+----------------
+
+LWM (Limit WaterMark) is a per Rx queue attribute, it should be configured as
+a percentage of the Rx queue size.
+When Rx queue fullness is above LWM, an event is sent to PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92820..34f86eaffa 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue LWM(Limit WaterMark) support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5100..3b5e60532a 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a66625e..35ae51b3af 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ const struct eth_dev_ops mlx5_dev_ops = {
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_lwm_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_lwm_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index 7d556c2b45..406eae9b39 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@ mlx5_rx_descriptor_status(void *rx_queue, uint16_t offset)
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << rxq_data->elts_n;
+
+   /* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+   return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +183,8 @@ mlx5_rxq_info_get(struct rte_eth_dev *dev, uint16_t rx_queue_id,
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->lwm = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204

[PATCH v3 2/7] common/mlx5: share interrupt management

2022-05-24 Thread Spike Du

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code with this
API.
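
One step the shared helper performs is optionally switching the file
descriptor to non-blocking mode. A minimal POSIX-only sketch of that step is
below; the helper name is hypothetical, and unlike the patch it also checks
the result of F_GETFL before using it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: add O_NONBLOCK to a descriptor's status flags.
 * Returns 0 on success or a negative errno value on failure.
 */
static int
set_fd_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -errno; /* e.g. -EBADF for an invalid descriptor */
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}
```

Returning a negative errno value keeps the failure reason available to the
caller, which could then store it in rte_errno as the helper in this patch
does for the F_SETFL failure.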

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ---
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +---
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 +---
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +--
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5cd1..f10a981a37 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@ mlx5_os_wrapped_mkey_destroy(struct mlx5_pmd_wrapped_mr *pmd_mr)
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+

RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper

2022-05-25 Thread Spike Du




> -Original Message-
> From: Morten Brørup 
> Sent: Wednesday, May 25, 2022 3:00 AM
> To: NBU-Contact-Thomas Monjalon (EXTERNAL) ;
> Spike Du 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; dev@dpdk.org;
> Raslan Darawsheh ; step...@networkplumber.org;
> andrew.rybche...@oktetlabs.ru; ferruh.yi...@amd.com;
> david.march...@redhat.com
> Subject: RE: [PATCH v3 0/7] introduce per-queue limit watermark and host
> shaper
> 
> 
> > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > Sent: Tuesday, 24 May 2022 17.59
> >
> > +Cc people involved in previous versions
> >
> > 24/05/2022 17:20, Spike Du:
> > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > fullness reach the LWM limit, HW sends an event to dpdk application.
> > > Host shaper can configure shaper rate and lwm-triggered for a host
> > port.
> 
> Please ignore this comment, it is not important, but I had to get it out of my
> system: I assume that the "LWM" name is from the NIC datasheet; otherwise
> I would probably prefer something with "threshold"... LWM is easily
> confused with "low water mark", which is the opposite of what the LWM
> does. Names are always open for discussion, so I won't object to it.
> 
> > > The shaper limits the rate of traffic from host port to wire port.
> 
> From host to wire? It is RX, so you must mean from wire to host.

The host shaper is quite specific to Nvidia's BlueField 2 NIC. The NIC is inserted
in a server, which we call the host-system, and the NIC has an embedded Arm-system
which does the forwarding.
The traffic flows from host-system to wire like this:
the host-system generates traffic and sends it to the Arm-system, and the Arm-system
sends it to the physical/wire port.
So the RX happens between the host-system and the Arm-system, and the traffic
direction is host to wire.
The shaper also works in a special way: you configure it on the Arm-system, but it
takes effect on the host-system's TX side.

> 
> > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > automatically when one of the host port's Rx queues receives LWM event.
> > >
> > > These two features can combine to control traffic from host port to
> > wire port.
> 
> Again, you mean from wire to host?

Please see above.

> 
> > > The work flow is configure LWM to RX queue and enable lwm-triggered
> > flag in host shaper, after receiving LWM event, delay a while until RX
> > queue is empty , then disable the shaper. We recycle this work flow to
> > reduce RX queue drops.
> 
> You delay while RX queue gets drained by some other threads, I assume.

The PMD thread drains the Rx queue. The PMD keeps receiving as normal, because the
PMD implementation uses the rte interrupt thread to handle the LWM event.

> 
> Surely, the excess packets must be dropped somewhere, e.g. by the shaper?
> 
> > >
> > > Add new libethdev API to set LWM, add rte event
> > RTE_ETH_EVENT_RXQ_LIMIT_REACHED to handle LWM event.
> 
> Makes sense to make it public; could be usable for other purposes, similar to
> interrupt coalescing, as mentioned by Stephen.
> 
> > > For host shaper,
> > because it doesn't align to existing DPDK framework and is specific to
> > Nvidia NIC, use PMD private API.
> 
> Makes sense to keep it private.
> 
> > >
> > > For integration with testpmd, put the private cmdline function and
> > LWM event handler in mlx5 PMD directory by adding a new file
> > mlx5_test.c. Only add minimal code in testpmd to invoke interfaces
> > from mlx5_test.c.
> > >
> > > Spike Du (7):
> > >   net/mlx5: add LWM support for Rxq
> > >   common/mlx5: share interrupt management
> > >   ethdev: introduce Rx queue based limit watermark
> > >   net/mlx5: add LWM event handling support
> > >   net/mlx5: support Rx queue based limit watermark
> > >   net/mlx5: add private API to config host port shaper
> > >   app/testpmd: add LWM and Host Shaper command
> > >

RE: [PATCH v3 0/7] introduce per-queue limit watermark and host shaper

2022-05-25 Thread Spike Du




> -Original Message-
> From: Morten Brørup 
> Sent: Wednesday, May 25, 2022 9:40 PM
> To: Spike Du ; NBU-Contact-Thomas Monjalon
> (EXTERNAL) 
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; dev@dpdk.org;
> Raslan Darawsheh ; step...@networkplumber.org;
> andrew.rybche...@oktetlabs.ru; ferruh.yi...@amd.com;
> david.march...@redhat.com
> Subject: RE: [PATCH v3 0/7] introduce per-queue limit watermark and host
> shaper
> 
> 
> > From: Spike Du [mailto:spi...@nvidia.com]
> > Sent: Wednesday, 25 May 2022 15.15
> >
> > > From: Morten Brørup 
> > > Sent: Wednesday, May 25, 2022 3:00 AM
> > >
> > > > From: Thomas Monjalon [mailto:tho...@monjalon.net]
> > > > Sent: Tuesday, 24 May 2022 17.59
> > > >
> > > > +Cc people involved in previous versions
> > > >
> > > > 24/05/2022 17:20, Spike Du:
> > > > > LWM(limit watermark) is per RX queue attribute, when RX queue
> > > > fullness reach the LWM limit, HW sends an event to dpdk
> > application.
> > > > > Host shaper can configure shaper rate and lwm-triggered for a
> > host
> > > > port.
> > >
> > > Please ignore this comment, it is not important, but I had to get it
> > out of my
> > > system: I assume that the "LWM" name is from the NIC datasheet;
> > otherwise
> > > I would probably prefer something with "threshold"... LWM is easily
> > > confused with "low water mark", which is the opposite of what the
> > > LWM does. Names are always open for discussion, so I won't object to it.
> > >
> > > > > The shaper limits the rate of traffic from host port to wire
> > port.
> > >
> > > From host to wire? It is RX, so you must mean from wire to host.
> >
> > The host shaper is quite private to Nvidia's BlueField 2 NIC. The NIC
> > is inserted In a server which we call it host-system, and the NIC has
> > an embedded Arm-system Which does the forwarding.
> > The traffic flows from host-system to wire like this:
> > Host-system generates traffic, send it to Arm-system, Arm sends it to
> > physical/wire port.
> > So the RX happens between host-system and Arm-system, and the traffic
> > is host to wire.
> > The shaper also works in a special way: you configure it on
> > Arm-system, but it takes effect on host-system's TX side.
> >
> > >
> > > > > If lwm-triggered is enabled, a 100Mbps shaper is enabled
> > > > automatically when one of the host port's Rx queues receives LWM
> > event.
> > > > >
> > > > > These two features can combine to control traffic from host port
> > to
> > > > wire port.
> > >
> > > Again, you mean from wire to host?
> >
> > Pls see above.
> >
> > >
> > > > > The work flow is configure LWM to RX queue and enable lwm-
> > triggered
> > > > flag in host shaper, after receiving LWM event, delay a while
> > > > until
> > RX
> > > > queue is empty , then disable the shaper. We recycle this work
> > > > flow
> > to
> > > > reduce RX queue drops.
> > >
> > > You delay while RX queue gets drained by some other threads, I
> > assume.
> >
> > The PMD thread drains the Rx queue, the PMD receiving  as normal, as
> > the PMD Implementation uses rte interrupt thread to handle LWM event.
> >
> 
> Thank you for the explanation, Spike. It really clarifies a lot!
> 
> If this patch is intended for DPDK running on the host-system, then the LWM
> attribute is associated with a TX queue, not an RX queue. The packets are
> egressing from the host-system, so TX from the host-system's perspective.
> 
> Otherwise, if this patch is for DPDK running on the embedded ARM-system,
> it should be highlighted somewhere.

The host-shaper patch runs on the ARM-system; I think that patch has some
explanation in mlx5.rst.
The LWM patch is common and should work on any Rx queue (right now mlx5 doesn't
support Hairpin Rx queues and shared Rx queues).
On the ARM-system, we can use it to monitor traffic from the host (representor port)
or from the wire (physical port).
LWM can also work on the host-system if DPDK is running there; for example, it can
monitor traffic from the Arm-system to the host-system.

> 
> > >
> > > Surely, the excess packets must be dropped somewhere, e.g. by the
> > shaper?
> 
> I guess the shaper doesn't have to drop any packets, but the host-system will
> simply be unable to put more packets into the queue if it runs full.
> 

When an LWM event happens, the host-shaper throttles traffic from the host-system to
the Arm-system. Yes, the shaper doesn't drop packets.
Normally the shaper rate is small, and if the PMD thread on Arm keeps working, the Rx
queue is dropless.
But if the PMD thread doesn't receive fast enough, or if the host-system sends a
burst even with a small shaper rate, the Rx queue may still drop on Arm.
Anyway, even if drops still happen sometimes, the cooperation of host-shaper and
LWM greatly reduces the Rx drops on Arm.

[PATCH v4 0/7] introduce per-queue fill threshold and host shaper

2022-06-03 Thread Spike Du

Fill threshold is a per-Rx-queue attribute: when the Rx queue fullness reaches the
fill threshold, HW sends an event to the application.
Host shaper can configure the shaper rate and fill_thresh-triggered for a host port.
The shaper limits the rate of traffic from the host port to the embedded ARM Rx port
on the Nvidia BlueField 2 NIC.
If fill_thresh-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives a fill threshold event.

These two features can be combined to control traffic from host port to wire port
for the BlueField 2 NIC.
The traffic flows from the host to the embedded ARM, then to the physical port.
The workflow, on the ARM system, is: configure the fill threshold on the Rx queue and
enable the fill_thresh-triggered flag in the host shaper; after receiving a fill
threshold event, delay a while until the Rx queue is empty, then disable the
shaper. We repeat this workflow to reduce Rx queue drops on the ARM system.
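
The cycle described above (arm, event, delay, disable shaper, re-arm) can be
modelled as a tiny state machine. This is a sketch only: the struct and
function names are hypothetical, it does not call any DPDK API, and it models
the event as one-shot until it is re-armed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the per-queue workflow; not the driver's code. */
struct rxq_flow_ctl {
	bool thresh_armed;  /* threshold event armed on the Rx queue */
	bool shaper_active; /* host shaper currently throttling */
	bool timer_pending; /* delayed "disable shaper" job scheduled */
};

/* Step 1: configure the threshold and enable the triggered flag. */
static void
flow_ctl_arm(struct rxq_flow_ctl *fc)
{
	assert(fc != NULL);
	fc->thresh_armed = true;
	fc->shaper_active = false;
	fc->timer_pending = false;
}

/* Step 2: threshold event. HW turns the shaper on; SW schedules a delay. */
static void
flow_ctl_on_event(struct rxq_flow_ctl *fc)
{
	assert(fc != NULL);
	if (!fc->thresh_armed)
		return; /* ignore events while not armed */
	fc->thresh_armed = false; /* modelled as one-shot until re-armed */
	fc->shaper_active = true;
	fc->timer_pending = true;
}

/* Step 3: delay elapsed, queue assumed drained: disable shaper, re-arm. */
static void
flow_ctl_on_timer(struct rxq_flow_ctl *fc)
{
	assert(fc != NULL);
	if (!fc->timer_pending)
		return;
	fc->timer_pending = false;
	fc->shaper_active = false;
	fc->thresh_armed = true;
}
```

In a real application, steps 2 and 3 would run from the event callback and a
delayed timer handler respectively; here they are plain functions so the
ordering of the state changes is easy to follow.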

Add new libethdev API to set fill threshold, add rte event 
RTE_ETH_EVENT_RX_FILL_THRESH to handle fill threshold event. For host shaper, 
because it doesn't align to existing DPDK framework and is specific to Nvidia 
NIC, use PMD private API.

For integration with testpmd, put the private cmdline function and fill 
threshold event handler in mlx5 PMD directory by adding a new file 
mlx5_testpmd.c. Follow David Marchand's driver specific commands framework to 
add mlx5 specific commands.

Spike Du (7):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  ethdev: introduce Rx queue based fill threshold
  net/mlx5: add LWM event handling support
  net/mlx5: support Rx queue based fill threshold
  net/mlx5: add private API to config host port shaper
  app/testpmd: add Host Shaper command

 app/test-pmd/cmdline.c   |  68 +++
 app/test-pmd/config.c|  21 ++
 app/test-pmd/testpmd.c   |  24 +++
 app/test-pmd/testpmd.h   |   2 +
 doc/guides/nics/mlx5.rst |  93 +
 doc/guides/rel_notes/release_22_07.rst   |   2 +
 drivers/common/mlx5/linux/meson.build|  13 ++
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +
 drivers/common/mlx5/mlx5_prm.h   |  26 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 ---
 drivers/net/mlx5/linux/mlx5_os.c | 132 +++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +
 drivers/net/mlx5/meson.build |   4 +
 drivers/net/mlx5/mlx5.c  |  68 +++
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  60 +-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 292 +++
 drivers/net/mlx5/mlx5_rx.h   |  13 ++
 drivers/net/mlx5/mlx5_testpmd.c  | 201 ++
 drivers/net/mlx5/mlx5_testpmd.h  |  26 +++
 drivers/net/mlx5/mlx5_txpp.c |  28 +--
 drivers/net/mlx5/rte_pmd_mlx5.h  |  30 +++
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 --
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 +
 lib/ethdev/ethdev_driver.h   |  22 ++
 lib/ethdev/rte_ethdev.c  |  52 +
 lib/ethdev/rte_ethdev.h  |  72 +++
 lib/ethdev/version.map   |   2 +
 33 files changed, 1320 insertions(+), 308 deletions(-)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

-- 
1.8.3.1

[PATCH v4 1/7] net/mlx5: add LWM support for Rxq

2022-06-03 Thread Spike Du

Add lwm(Limit WaterMark) field to Rxq object which indicates the percentage
of RX queue size used by HW to raise LWM event to the user.
Allow LWM setting in modify_rq command.
Allow the LWM configuration dynamically by adding RDY2RDY state change.
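
For illustration, v3 5/7 of this series converts the stored value back to a
percentage with `100 - lwm * 100 / wqe_cnt`, noting that ethdev LWM describes
fullness while mlx5 LWM describes emptiness. The sketch below pairs that
formula with a forward conversion; the forward direction and both function
names are assumptions, not code from this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fullness percentage -> free-descriptor count.
 * This direction is an assumption made for illustration.
 */
static uint16_t
fullness_pct_to_lwm(uint32_t wqe_cnt, uint8_t pct)
{
	assert(pct <= 99 && wqe_cnt <= UINT16_MAX);
	return pct ? (uint16_t)(wqe_cnt * (100u - pct) / 100u) : 0;
}

/* Free-descriptor count -> fullness percentage (formula from v3 5/7). */
static uint8_t
lwm_to_fullness_pct(uint32_t wqe_cnt, uint16_t lwm)
{
	assert(wqe_cnt > 0 && lwm <= wqe_cnt);
	return lwm ? (uint8_t)(100u - lwm * 100u / wqe_cnt) : 0;
}
```

Because both directions use integer division, a round trip can be off by one:
with 1024 descriptors, 30% becomes 716 descriptors, which converts back to 31%.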

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 13 -
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index ef755ee..305edff 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1395,6 +1395,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index 4b48f94..c918a50 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,11 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   if (rxq->lwm) {
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   }
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +90,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |=
+   MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index e715ed6..25a5f2c 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -175,6 +175,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[PATCH v4 2/7] common/mlx5: share interrupt management

2022-06-03 Thread Spike Du

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code with this
API.

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   2 +
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  48 ++
 11 files changed, 217 insertions(+), 307 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index d40cfd5..f10a981 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -964,3 +965,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[PATCH v4 3/7] ethdev: introduce Rx queue based fill threshold

2022-06-03 Thread Spike Du

Fill threshold describes the fullness of a Rx queue. If the Rx
queue fullness is above the threshold, the device will trigger the event
RTE_ETH_EVENT_RX_FILL_THRESH.
Fill threshold is defined as a percentage of Rx queue size with valid
value of [0,99].
Setting fill threshold to 0 means disable it, which is the default.
Add fill threshold configuration and query driver callbacks in eth_dev_ops.
Add command line options to support per-rxq fill_thresh configuration.
- Command syntax:
  set port  rxq  fill_thresh 

- Example commands:
To configure fill_thresh as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 fill_thresh 30

To disable fill_thresh on port 1 rxq 0:
testpmd> set port 1 rxq 0 fill_thresh 0
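
The rules stated above (valid range [0,99], 0 disables, event when fullness is
above the threshold) can be sketched as below. The function names are
hypothetical and this is not the ethdev implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define FILL_THRESH_MAX 99 /* valid range is [0, 99] per the commit message */

/* Hypothetical check: 0 on a valid percentage, -EINVAL otherwise. */
static int
fill_thresh_check(uint8_t pct)
{
	return pct <= FILL_THRESH_MAX ? 0 : -EINVAL;
}

/* Hypothetical predicate: true when fullness is strictly above threshold. */
static bool
fill_thresh_exceeded(uint16_t nb_desc, uint16_t used, uint8_t pct)
{
	assert(nb_desc > 0 && used <= nb_desc);
	if (pct == 0)
		return false; /* 0 disables the threshold */
	/* Compare used/nb_desc > pct/100 without division. */
	return (uint32_t)used * 100u > (uint32_t)pct * nb_desc;
}
```

For a 1024-descriptor queue with a 30% threshold, this predicate is false at
307 used descriptors and true at 308.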

Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c | 68 +++
 app/test-pmd/config.c  | 21 ++
 app/test-pmd/testpmd.c | 18 
 app/test-pmd/testpmd.h |  2 ++
 lib/ethdev/ethdev_driver.h | 22 ++
 lib/ethdev/rte_ethdev.c| 52 +
 lib/ethdev/rte_ethdev.h| 72 ++
 lib/ethdev/version.map |  2 ++
 8 files changed, 257 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 0410bad..918581e 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17823,6 +17823,73 @@ struct cmd_show_port_flow_transfer_proxy_result {
}
 };
 
+/* *** SET FILL THRESHOLD FOR A RXQ OF A PORT *** */
+struct cmd_rxq_fill_thresh_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t fill_thresh;
+   uint16_t fill_thresh_num;
+};
+
+static void cmd_rxq_fill_thresh_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_fill_thresh_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->fill_thresh, "fill_thresh") == 0))
+   ret = set_rxq_fill_thresh(res->port_num, res->rxq_num,
+ res->fill_thresh_num);
+   if (ret < 0)
+   printf("rxq_fill_thresh_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   set, "set");
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   port, "port");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_fill_thresh_fill_thresh =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   fill_thresh, "fill_thresh");
+cmdline_parse_token_num_t cmd_rxq_fill_thresh_fill_threshnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_fill_thresh_result,
+   fill_thresh_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_fill_thresh = {
+   .f = cmd_rxq_fill_thresh_parsed,
+   .data = (void *)0,
+   .help_str = "set port <port_id> rxq <rxq_id> fill_thresh <fill_thresh_num> "
+   "Set fill_thresh for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_fill_thresh_set,
+   (void *)&cmd_rxq_fill_thresh_port,
+   (void *)&cmd_rxq_fill_thresh_portnum,
+   (void *)&cmd_rxq_fill_thresh_rxq,
+   (void *)&cmd_rxq_fill_thresh_rxqnum,
+   (void *)&cmd_rxq_fill_thresh_fill_thresh,
+   (void *)&cmd_rxq_fill_thresh_fill_threshnum,
+   NULL,
+   },
+};
+
 /* 

 */
 
 /* list of instructions */
@@ -18110,6 +18177,7 @@ struct cmd_show_port_flow_transfer_proxy_result {
(cmdline_parse_inst_t *)&cmd_show_capability,
(cmdline_parse_inst_t *)&cmd_set_flex_is_pattern,
(cmdline_parse_inst_t *)&cmd_set_flex_spec_pattern,
+   (cmdline_parse_inst_t *)&cmd_rxq

[PATCH v4 4/7] net/mlx5: add LWM event handling support

2022-06-03 Thread Spike Du

When the number of available RQ WQEs reaches the LWM, the kernel driver
raises an event to software.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
Each event carries a cookie that identifies the port and queue that raised it.
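
One way to carry both the port and the queue in a single cookie is to pack them into separate bit fields. The sketch below assumes a layout with the port ID in bits 16..31 and the Rx queue index in bits 0..15; this layout and the names `lwm_cookie_pack` and `lwm_cookie_unpack` are assumptions for illustration and are not taken from the patch.

```c
#include <stdint.h>

/*
 * Assumed layout (illustrative only): port ID in bits 16..31,
 * Rx queue index in bits 0..15 of the cookie.
 */
#define LWM_COOKIE_PORT_SHIFT 16
#define LWM_COOKIE_MASK 0xffffu

/* Build a cookie from a port ID and an Rx queue index. */
static uint64_t
lwm_cookie_pack(uint16_t port_id, uint16_t rxq_idx)
{
	return ((uint64_t)port_id << LWM_COOKIE_PORT_SHIFT) | rxq_idx;
}

/* Recover the port ID and Rx queue index from a cookie. */
static void
lwm_cookie_unpack(uint64_t cookie, uint16_t *port_id, uint16_t *rxq_idx)
{
	*port_id = (uint16_t)((cookie >> LWM_COOKIE_PORT_SHIFT) & LWM_COOKIE_MASK);
	*rxq_idx = (uint16_t)(cookie & LWM_COOKIE_MASK);
}
```

Because all Rx queues of a shared device context use the same event channel, a handler reading an event would unpack the cookie to find out which port and queue to report.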

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 66 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 +++
 drivers/net/mlx5/mlx5_rx.c   | 33 ++
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 160 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index f098871..e04a666 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1525,6 +1527,69 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   if (priv->sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (priv->sh->devx_channel_lwm);
+   priv->sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&priv->sh->lwm_config_lock);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1601,6 +1666,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 7ebb2cc..a76f2fe 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1268,6 +1268,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1405,6 +1408,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1413,6 +1417,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1

[PATCH v4 5/7] net/mlx5: support Rx queue based fill threshold

2022-06-03 Thread Spike Du

Add mlx5 specific fill threshold configuration and query handler.
In mlx5 PMD, fill threshold is also called LWM (limit watermark).
When the Rx queue fullness reaches the LWM limit, the driver catches
a HW event and invokes the user callback.
The query handler finds the next Rx queue with a pending LWM event,
if any, starting from the given Rx queue index.
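
The relation between the ethdev percentage (fullness) and the mlx5 LWM (emptiness, counted in WQEs) can be sketched as follows. `lwm_to_fill_percent` mirrors the arithmetic of `mlx5_rxq_lwm_to_percentage()` in this patch; `fill_percent_to_lwm` is an assumed inverse written for illustration and may differ from what the driver actually does.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as mlx5_rxq_lwm_to_percentage(): LWM counts free WQEs. */
static uint8_t
lwm_to_fill_percent(uint32_t lwm, uint32_t wqe_cnt)
{
	assert(wqe_cnt > 0);
	return lwm ? (uint8_t)(100 - lwm * 100 / wqe_cnt) : 0;
}

/* Assumed inverse (illustrative): fullness percentage to free-WQE count. */
static uint32_t
fill_percent_to_lwm(uint8_t percent, uint32_t wqe_cnt)
{
	assert(percent <= 99);
	return percent ? wqe_cnt * (100 - percent) / 100 : 0;
}
```

With 1024 WQEs, 50% maps to an LWM of 512 and converts back to 50. A threshold of 70% maps to 307 and converts back to 71, because both directions truncate in integer division, so a queried value can differ by one from the configured one.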

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  12 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/mlx5_prm.h |   1 +
 drivers/net/mlx5/mlx5.c|   2 +
 drivers/net/mlx5/mlx5_rx.c | 156 +
 drivers/net/mlx5/mlx5_rx.h |   5 ++
 6 files changed, 177 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index d83c56d..ea393fb 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue fill threshold configuration.
 
 
 Limitations
@@ -520,6 +521,9 @@ Limitations
 
 - The NIC egress flow rules on representor port are not supported.
 
+- Fill threshold:
+
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
 
 Statistics
 ----------
@@ -1680,3 +1684,11 @@ The procedure below is an example of using a ConnectX-5 adapter card (pf0) with
 #. For each VF PCIe, using the following command to bind the driver::
 
$ echo ":82:00.2" >> /sys/bus/pci/drivers/mlx5_core/bind
+
+Fill threshold introduction
+---------------------------
+
+Fill threshold is a per Rx queue attribute; it should be configured as
+a percentage of the Rx queue size.
+When Rx queue fullness is above the threshold, an event is sent to the PMD.
+
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 0ed4f92..62a8874 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -89,6 +89,7 @@ New Features
   * Added support for promiscuous mode on Windows.
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
+  * Added Rx queue fill threshold support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 630b2c5..3b5e605 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3293,6 +3293,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index e04a666..a4a39ab 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -2071,6 +2071,8 @@ struct mlx5_dev_ctx_shared *
.dev_supported_ptypes_get = mlx5_dev_supported_ptypes_get,
.vlan_filter_set = mlx5_vlan_filter_set,
.rx_queue_setup = mlx5_rx_queue_setup,
+   .rx_queue_fill_thresh_set = mlx5_rx_queue_lwm_set,
+   .rx_queue_fill_thresh_query = mlx5_rx_queue_lwm_query,
.rx_hairpin_queue_setup = mlx5_rx_hairpin_queue_setup,
.tx_queue_setup = mlx5_tx_queue_setup,
.tx_hairpin_queue_setup = mlx5_tx_hairpin_queue_setup,
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index aacb43e..4099496 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,12 +19,14 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
 
@@ -128,6 +130,17 @@
return RTE_ETH_RX_DESC_AVAIL;
 }
 
+/* Get rxq lwm percentage according to lwm number. */
+static uint8_t
+mlx5_rxq_lwm_to_percentage(struct mlx5_rxq_priv *rxq)
+{
+   struct mlx5_rxq_data *rxq_data = &rxq->ctrl->rxq;
+   uint32_t wqe_cnt = 1 << (rxq_data->elts_n - rxq_data->sges_n);
+
+   /* ethdev LWM describes fullness, mlx5 LWM describes emptiness. */
+   return rxq->lwm ? (100 - rxq->lwm * 100 / wqe_cnt) : 0;
+}
+
 /**
  * DPDK callback to get the RX queue information.
  *
@@ -150,6 +163,7 @@
 {
struct mlx5_rxq_ctrl *rxq_ctrl = mlx5_rxq_ctrl_get(dev, rx_queue_id);
struct mlx5_rxq_data *rxq = mlx5_rxq_data_get(dev, rx_queue_id);
+   struct mlx5_rxq_priv *rxq_priv = mlx5_rxq_get(dev, rx_queue_id);
 
if (!rxq)
return;
@@ -169,6 +183,8 @@
qinfo->nb_desc = mlx5_rxq_mprq_enabled(rxq) ?
RTE_BIT32(rxq->elts_n) * RTE_BIT32(rxq->log_strd_num) :
RTE_BIT32(rxq->elts_n);
+   qinfo->fill_thresh = rxq_priv ?
+   mlx5_rxq_lwm_to_percentage(rxq_priv) : 0;
 }
 
 /**
@@ -1188,6 +1204,34 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)

[PATCH v4 6/7] net/mlx5: add private API to config host port shaper

2022-06-03 Thread Spike Du

Host port shaper can be configured with QSHR (QoS Shaper Host Register).
Add a check in the build files to decide whether this function is enabled.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

Host shaper can configure the shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives a fill threshold event.

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  35 +++
 doc/guides/rel_notes/release_22_07.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  13 +
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 103 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   2 +
 8 files changed, 211 insertions(+)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index ea393fb..39bfebb 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -94,6 +94,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue fill threshold configuration.
+- Host shaper support.
 
 
 Limitations
@@ -525,6 +526,12 @@ Limitations
 
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs starting from BlueField 2.
+  - When configuring the host shaper with the MLX5_HOST_SHAPER_FLAG_FILL_THRESH_TRIGGERED flag set,
+    only rates 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
@@ -1692,3 +1699,31 @@ Fill threshold is a per Rx queue attribute, it should be configured as
 a percentage of the Rx queue size.
 When Rx queue fullness is above the threshold, an event is sent to PMD.
 
+Host shaper introduction
+------------------------
+
+Host shaper register is per host port register which sets a shaper
+on the host port.
+All VF/hostPF representors belonging to one host port share one host shaper.
+For example, if representor 0 and representor 1 belong to same host port,
+and a host shaper rate of 1Gbps is configured, the shaper throttles both
+representors' traffic from host.
+Host shaper has two modes for setting the shaper, immediate and deferred to
+fill threshold event trigger. In immediate mode, the rate limit is configured
+immediately to host shaper. When deferring to fill threshold trigger, the shaper
+is not set until a fill threshold event is received by any Rx queue in a VF
+representor belonging to the host port. The only rate supported for deferred
+mode is 100Mbps (there is no limit on the supported rates for immediate mode).
+In deferred mode, the shaper is set on the host port by the firmware upon
+receiving the fill threshold event, which allows throttling host traffic on
+fill threshold events at minimum latency, preventing excess drops in the
+Rx queue.
+
+Host shaper dependency for mstflint package
+-------------------------------------------
+
+In order to configure host shaper register, ``librte_net_mlx5`` depends on ``libmtcr_ul``
+which can be installed from OFED mstflint package.
+Meson detects ``libmtcr_ul`` existence at configure stage.
+If the library is detected, the application must link with ``-lmtcr_ul``,
+as done by the pkg-config file libdpdk.pc.
diff --git a/doc/guides/rel_notes/release_22_07.rst b/doc/guides/rel_notes/release_22_07.rst
index 62a8874..eaf074c 100644
--- a/doc/guides/rel_notes/release_22_07.rst
+++ b/doc/guides/rel_notes/release_22_07.rst
@@ -90,6 +90,7 @@ New Features
   * Added support for MTU on Windows.
   * Added matching and RSS on IPsec ESP.
   * Added Rx queue fill threshold support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto driver.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index 5335f5b..51c6e5d 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -45,6 +45,13 @@ if static_ibverbs
 ext_deps += declare_dependency(link_args:ibv_ldflags.split())
 endif
 
+libmtcr_ul_found = false
+lib = cc.find_library('mtcr_ul', required:false)
+if lib.found() and run_command('meson', '--version').stdout().version_compare('>= 0.49.2')
+libmtcr_ul_found = true
+ext_deps += lib
+endif
+
 sources += files('mlx5_nl.c')
 sources += files('mlx5_common_auxiliary.c')
 sources += files('mlx5_common_os.c')
@@ -207,6 +214,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [
+[  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+'mopen'],
+]
+endif
 config = configuration_data()
 foreach arg:has_sym_

[PATCH v4 7/7] app/testpmd: add Host Shaper command

2022-06-03 Thread Spike Du

Add command line options to support host shaper configure.
- Command syntax:
  mlx5 set port <port_id> host_shaper fill_thresh_triggered <0|1> rate <rate_num>

- Example commands:
To enable fill_thresh_triggered on port 1 and disable current host shaper:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 1 rate 0

To disable fill_thresh_triggered and current host shaper on port 1:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 0

The rate unit is 100Mbps.
To disable fill_thresh_triggered and configure a shaper of 5Gbps on port 1:
testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 50
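
The rate argument is a count of 100Mbps units, so the arithmetic behind the example above can be sketched as below. The macro and function names are hypothetical, and the 8-bit width of the rate value is an assumption made for this sketch.

```c
#include <stdint.h>

/* Hypothetical name: one rate unit equals 100Mbps, per the text above. */
#define HOST_SHAPER_RATE_UNIT_MBPS 100u

/* Convert a host shaper rate value (in 100Mbps units) to Mbps. */
static uint32_t
host_shaper_rate_to_mbps(uint8_t rate)
{
	return (uint32_t)rate * HOST_SHAPER_RATE_UNIT_MBPS;
}
```

A rate of 50 therefore corresponds to 5000Mbps (5Gbps), and a rate of 0 corresponds to no shaper.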

Add sample code to handle the rxq fill_thresh event: it waits a while so
that the rxq empties, then disables the host shaper and re-arms the
fill_thresh event.

Signed-off-by: Spike Du 
---
 app/test-pmd/testpmd.c  |   6 ++
 doc/guides/nics/mlx5.rst|  46 +
 drivers/net/mlx5/meson.build|   4 +
 drivers/net/mlx5/mlx5_testpmd.c | 201 
 drivers/net/mlx5/mlx5_testpmd.h |  26 ++
 5 files changed, 283 insertions(+)
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.c
 create mode 100644 drivers/net/mlx5/mlx5_testpmd.h

diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index 1209230..babbc94 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -69,6 +69,9 @@
 #ifdef RTE_NET_BOND
 #include 
 #endif
+#ifdef RTE_NET_MLX5
+#include "mlx5_testpmd.h"
+#endif
 
 #include "testpmd.h"
 
@@ -3663,6 +3666,9 @@ struct pmd_test_command {
break;
printf("Received fill_thresh event, port:%d rxq_id:%d\n",
   port_id, rxq_id);
+#ifdef RTE_NET_MLX5
+   mlx5_test_fill_thresh_event_handler(port_id, rxq_id);
+#endif
}
break;
default:
diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 39bfebb..cdeeef3 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -1727,3 +1727,49 @@ which can be installed from OFED mstflint package.
 Meson detects ``libmtcr_ul`` existence at configure stage.
 If the library is detected, the application must link with ``-lmtcr_ul``,
 as done by the pkg-config file libdpdk.pc.
+
+How to use fill threshold and Host Shaper
+-----------------------------------------
+
+There are sample command lines to configure fill threshold in testpmd.
+Testpmd also contains sample logic to handle the fill threshold event.
+The typical workflow is: testpmd configures the fill threshold for Rx queues,
+enables fill_thresh_triggered in the host shaper and registers a callback.
+When traffic from the host is too high and Rx queue fullness rises above the
+fill threshold, the PMD receives an event and firmware automatically configures
+a 100Mbps shaper on the host port. The PMD then calls the registered callback,
+which waits a while to let the Rx queue empty, then disables the host shaper.
+
+Let's assume we have a simple BlueField 2 setup: port 0 is uplink, port 1
+is VF representor. Each port has 2 Rx queues.
+In order to control traffic from host to ARM, we can enable fill threshold in testpmd by:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 1 rate 0
+   testpmd> set port 1 rxq 0 fill_thresh 70
+   testpmd> set port 1 rxq 1 fill_thresh 70
+
+The first command disables the current host shaper and enables fill threshold triggered mode.
+The remaining commands configure the fill threshold to 70% of the Rx queue size for both Rx queues.
+When traffic from the host is too high, the testpmd console logs that a fill
+threshold event was received, then the host shaper is disabled.
+The traffic rate from the host is controlled and fewer drops happen in the Rx queues.
+
+To disable fill threshold and fill_thresh_triggered, invoke the commands below in testpmd:
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 0
+   testpmd> set port 1 rxq 0 fill_thresh 0
+   testpmd> set port 1 rxq 1 fill_thresh 0
+
+It's recommended that an application disables fill threshold and fill_thresh_triggered before exit,
+if it enabled them before.
+
+We can also configure the shaper with a value; the rate unit is 100Mbps. The command below
+sets the current shaper to 5Gbps and disables fill_thresh_triggered.
+
+.. code-block:: console
+
+   testpmd> mlx5 set port 1 host_shaper fill_thresh_triggered 0 rate 50
diff --git a/drivers/net/mlx5/meson.build b/drivers/net/mlx5/meson.build
index 99210fd..941642b 100644
--- a/drivers/net/mlx5/meson.build
+++ b/drivers/net/mlx5/meson.build
@@ -68,4 +68,8 @@ if get_option('buildtype').contains('debug')
 else
 cflags += [ '-UPEDANTIC' ]
 endif
+
+testpmd_sources += files('mlx5_testpmd.c')
+testpmd_drivers_deps += 'net_mlx5'
+
 subdir(exec_env)
diff --git a/drivers/net/mlx5/mlx5_testpmd.c b/drivers/net/mlx5

[RFC 0/6] net/mlx5: introduce limit watermark and host shaper

2022-03-31 Thread Spike Du

LWM (limit watermark) is a per Rx queue attribute. When Rx queue fullness reaches
the LWM limit, HW sends an event to the DPDK application.
Host shaper can configure the shaper rate and lwm-triggered for a host port.
The shaper limits the rate of traffic from host port to wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM event.

These two features can be combined to control traffic from host port to wire port.
The workflow is: configure LWM on the Rx queue and enable the lwm-triggered flag in
the host shaper; after receiving an LWM event, wait until the Rx queue is empty,
then disable the shaper. Repeating this workflow reduces Rx queue drops.
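
The cycle described above can be modelled as a two-state machine. The sketch below is a plain-C simulation under stated assumptions: it calls no DPDK or mlx5 API, and every type and function name in it is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model of the LWM and host shaper cycle (no real API calls). */
enum throttle_state {
	THROTTLE_ARMED,   /* LWM event armed, shaper off. */
	THROTTLE_SHAPING, /* Event fired, shaper on, waiting for the queue to drain. */
};

struct throttle {
	enum throttle_state state;
	bool shaper_on;
	bool lwm_armed;
};

/* Initial setup: LWM configured and lwm-triggered enabled. */
static void
throttle_init(struct throttle *t)
{
	t->state = THROTTLE_ARMED;
	t->shaper_on = false;
	t->lwm_armed = true;
}

/* LWM event: the shaper is switched on and the event is consumed. */
static void
throttle_on_lwm_event(struct throttle *t)
{
	if (t->state != THROTTLE_ARMED)
		return; /* Ignore events while already shaping. */
	t->shaper_on = true;
	t->lwm_armed = false;
	t->state = THROTTLE_SHAPING;
}

/* Delay expired: the queue is assumed drained; disable shaper and re-arm. */
static void
throttle_on_delay_expired(struct throttle *t)
{
	if (t->state != THROTTLE_SHAPING)
		return;
	t->shaper_on = false;
	t->lwm_armed = true;
	t->state = THROTTLE_ARMED;
}
```

Ignoring events that arrive in the wrong state keeps the model from enabling the shaper twice or re-arming an event that never fired.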

Spike Du (6):
  net/mlx5: add LWM support for Rxq
  common/mlx5: share interrupt management
  net/mlx5: add LWM event handling support
  net/mlx5: add private API to configure Rxq LWM
  net/mlx5: add private API to config host port shaper
  app/testpmd: add LWM and Host Shaper command

 app/test-pmd/cmdline.c   | 149 ++
 app/test-pmd/config.c| 122 +++
 app/test-pmd/meson.build |   3 +
 app/test-pmd/testpmd.c   |   3 +
 app/test-pmd/testpmd.h   |   5 +
 doc/guides/nics/mlx5.rst |  87 +++
 doc/guides/rel_notes/release_22_03.rst   |   7 +
 drivers/common/mlx5/linux/meson.build|  21 ++-
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 ++
 drivers/common/mlx5/mlx5_prm.h   |  26 
 drivers/common/mlx5/version.map  |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +++
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 -
 drivers/net/mlx5/linux/mlx5_os.c | 132 
 drivers/net/mlx5/linux/mlx5_socket.c |  53 +--
 drivers/net/mlx5/mlx5.c  |  61 
 drivers/net/mlx5/mlx5.h  |  12 +-
 drivers/net/mlx5/mlx5_devx.c |  57 ++-
 drivers/net/mlx5/mlx5_devx.h |   1 +
 drivers/net/mlx5/mlx5_rx.c   | 221 ++-
 drivers/net/mlx5/mlx5_rx.h   |   9 ++
 drivers/net/mlx5/mlx5_txpp.c |  28 +---
 drivers/net/mlx5/rte_pmd_mlx5.h  |  62 
 drivers/net/mlx5/version.map |   2 +
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 ---
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  52 +--
 27 files changed, 1057 insertions(+), 318 deletions(-)

-- 
1.8.3.1

[RFC 1/6] net/mlx5: add LWM support for Rxq

2022-03-31 Thread Spike Du

Add an lwm (limit watermark) field to the Rxq object, which indicates the percentage
of Rx queue size used by HW to raise an LWM event to the user.
Allow LWM setting in the modify_rq command.
Allow dynamic LWM configuration by adding the RDY2RDY state change.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.h  |  1 +
 drivers/net/mlx5/mlx5_devx.c | 10 +-
 drivers/net/mlx5/mlx5_devx.h |  1 +
 drivers/net/mlx5/mlx5_rx.h   |  1 +
 4 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 23a28f6..f3e6682 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1391,6 +1391,7 @@ enum mlx5_rxq_modify_type {
MLX5_RXQ_MOD_RST2RDY, /* modify state from reset to ready. */
MLX5_RXQ_MOD_RDY2ERR, /* modify state from ready to error. */
MLX5_RXQ_MOD_RDY2RST, /* modify state from ready to reset. */
+   MLX5_RXQ_MOD_RDY2RDY, /* modify state from ready to ready. */
 };
 
 enum mlx5_txq_modify_type {
diff --git a/drivers/net/mlx5/mlx5_devx.c b/drivers/net/mlx5/mlx5_devx.c
index af106bd..d6de882 100644
--- a/drivers/net/mlx5/mlx5_devx.c
+++ b/drivers/net/mlx5/mlx5_devx.c
@@ -62,7 +62,7 @@
  * @return
  *   0 on success, a negative errno value otherwise and rte_errno is set.
  */
-static int
+int
 mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type)
 {
struct mlx5_devx_modify_rq_attr rq_attr;
@@ -76,6 +76,8 @@
case MLX5_RXQ_MOD_RST2RDY:
rq_attr.rq_state = MLX5_RQC_STATE_RST;
rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
break;
case MLX5_RXQ_MOD_RDY2ERR:
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
@@ -85,6 +87,12 @@
rq_attr.rq_state = MLX5_RQC_STATE_RDY;
rq_attr.state = MLX5_RQC_STATE_RST;
break;
+   case MLX5_RXQ_MOD_RDY2RDY:
+   rq_attr.rq_state = MLX5_RQC_STATE_RDY;
+   rq_attr.state = MLX5_RQC_STATE_RDY;
+   rq_attr.modify_bitmask |= MLX5_MODIFY_RQ_IN_MODIFY_BITMASK_WQ_LWM;
+   rq_attr.lwm = rxq->lwm;
+   break;
default:
break;
}
diff --git a/drivers/net/mlx5/mlx5_devx.h b/drivers/net/mlx5/mlx5_devx.h
index a95207a..ebd1da4 100644
--- a/drivers/net/mlx5/mlx5_devx.h
+++ b/drivers/net/mlx5/mlx5_devx.h
@@ -11,6 +11,7 @@
 int mlx5_txq_devx_modify(struct mlx5_txq_obj *obj,
 enum mlx5_txq_modify_type type, uint8_t dev_port);
 void mlx5_txq_devx_obj_release(struct mlx5_txq_obj *txq_obj);
+int mlx5_devx_modify_rq(struct mlx5_rxq_priv *rxq, uint8_t type);
 
 extern struct mlx5_obj_ops devx_obj_ops;
 
diff --git a/drivers/net/mlx5/mlx5_rx.h b/drivers/net/mlx5/mlx5_rx.h
index acebe33..98d7cae 100644
--- a/drivers/net/mlx5/mlx5_rx.h
+++ b/drivers/net/mlx5/mlx5_rx.h
@@ -174,6 +174,7 @@ struct mlx5_rxq_priv {
struct mlx5_devx_rq devx_rq;
struct rte_eth_hairpin_conf hairpin_conf; /* Hairpin configuration. */
uint32_t hairpin_status; /* Hairpin binding status. */
+   uint32_t lwm:16;
 };
 
 /* External RX queue descriptor. */
-- 
1.8.3.1

[RFC 2/6] common/mlx5: share interrupt management

2022-03-31 Thread Spike Du

There is a lot of duplicated code for creating and initializing rte_intr_handle.
Add a new mlx5_os API to do this, and replace all the related PMD code with this
API.
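
One step such a helper performs is switching the event file descriptor to non-blocking mode. A minimal POSIX sketch of that step follows; the function name `set_fd_nonblock` is hypothetical, and unlike the patch this version also checks the result of the `F_GETFL` call before using it.

```c
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical helper: add O_NONBLOCK to the flags of a file descriptor.
 * Returns 0 on success, a negative errno value on failure.
 */
static int
set_fd_nonblock(int fd)
{
	int flags = fcntl(fd, F_GETFL);

	if (flags == -1)
		return -errno;
	if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
		return -errno;
	return 0;
}
```

Checking the first call matters because a failed `F_GETFL` returns -1, and OR-ing `O_NONBLOCK` into that value would pass a meaningless flag set to `F_SETFL`.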

Signed-off-by: Spike Du 
---
 drivers/common/mlx5/linux/mlx5_common_os.c   | 131 ++
 drivers/common/mlx5/linux/mlx5_common_os.h   |  11 +++
 drivers/common/mlx5/version.map  |   3 +-
 drivers/common/mlx5/windows/mlx5_common_os.h |  24 +
 drivers/net/mlx5/linux/mlx5_ethdev_os.c  |  71 --
 drivers/net/mlx5/linux/mlx5_os.c | 132 ++-
 drivers/net/mlx5/linux/mlx5_socket.c |  53 ++-
 drivers/net/mlx5/mlx5.h  |   2 -
 drivers/net/mlx5/mlx5_txpp.c |  28 ++
 drivers/net/mlx5/windows/mlx5_ethdev_os.c|  22 -
 drivers/vdpa/mlx5/mlx5_vdpa_virtq.c  |  52 ++-
 11 files changed, 217 insertions(+), 312 deletions(-)

diff --git a/drivers/common/mlx5/linux/mlx5_common_os.c b/drivers/common/mlx5/linux/mlx5_common_os.c
index 030ceb5..6e7c59b 100644
--- a/drivers/common/mlx5/linux/mlx5_common_os.c
+++ b/drivers/common/mlx5/linux/mlx5_common_os.c
@@ -11,6 +11,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -952,3 +953,133 @@
claim_zero(mlx5_glue->dereg_mr(pmd_mr->obj));
memset(pmd_mr, 0, sizeof(*pmd_mr));
 }
+
+/**
+ * Rte_intr_handle create and init helper.
+ *
+ * @param[in] mode
+ *   interrupt instance can be shared between primary and secondary
+ *   processes or not.
+ * @param[in] set_fd_nonblock
+ *   Whether to set fd to O_NONBLOCK.
+ * @param[in] fd
+ *   Fd to set in created intr_handle.
+ * @param[in] cb
+ *   Callback to register for intr_handle.
+ * @param[in] cb_arg
+ *   Callback argument for cb.
+ *
+ * @return
+ *  - Interrupt handle on success.
+ *  - NULL on failure, with rte_errno set.
+ */
+struct rte_intr_handle *
+mlx5_os_interrupt_handler_create(int mode, bool set_fd_nonblock, int fd,
+rte_intr_callback_fn cb, void *cb_arg)
+{
+   struct rte_intr_handle *tmp_intr_handle;
+   int ret, flags;
+
+   tmp_intr_handle = rte_intr_instance_alloc(mode);
+   if (!tmp_intr_handle) {
+   rte_errno = ENOMEM;
+   goto err;
+   }
+   if (set_fd_nonblock) {
+   flags = fcntl(fd, F_GETFL);
+   ret = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
+   if (ret) {
+   rte_errno = errno;
+   goto err;
+   }
+   }
+   ret = rte_intr_fd_set(tmp_intr_handle, fd);
+   if (ret)
+   goto err;
+   ret = rte_intr_type_set(tmp_intr_handle, RTE_INTR_HANDLE_EXT);
+   if (ret)
+   goto err;
+   ret = rte_intr_callback_register(tmp_intr_handle, cb, cb_arg);
+   if (ret) {
+   rte_errno = -ret;
+   goto err;
+   }
+   return tmp_intr_handle;
+err:
+   if (tmp_intr_handle)
+   rte_intr_instance_free(tmp_intr_handle);
+   return NULL;
+}
+
+/* Safe unregistration for interrupt callback. */
+static void
+mlx5_intr_callback_unregister(const struct rte_intr_handle *handle,
+ rte_intr_callback_fn cb_fn, void *cb_arg)
+{
+   uint64_t twait = 0;
+   uint64_t start = 0;
+
+   do {
+   int ret;
+
+   ret = rte_intr_callback_unregister(handle, cb_fn, cb_arg);
+   if (ret >= 0)
+   return;
+   if (ret != -EAGAIN) {
+   DRV_LOG(INFO, "failed to unregister interrupt"
+ " handler (error: %d)", ret);
+   MLX5_ASSERT(false);
+   return;
+   }
+   if (twait) {
+   struct timespec onems;
+
+   /* Wait one millisecond and try again. */
+   onems.tv_sec = 0;
+   onems.tv_nsec = NS_PER_S / MS_PER_S;
+   nanosleep(&onems, 0);
+   /* Check whether one second elapsed. */
+   if ((rte_get_timer_cycles() - start) <= twait)
+   continue;
+   } else {
+   /*
+* We get the amount of timer ticks for one second.
+* If this amount elapsed it means we spent one
+* second in waiting. This branch is executed once
+* on first iteration.
+*/
+   twait = rte_get_timer_hz();
+   MLX5_ASSERT(twait);
+   }
+   /*
+* Timeout elapsed, show message (once a second) and retry.
+* We have no other acceptable option here, if we ignore
+* the unregistering return code the ha

[RFC 3/6] net/mlx5: add LWM event handling support

2022-03-31 Thread Spike Du

When the number of available RQ WQEs reaches the LWM, the kernel driver
raises an event to software.
Use a devx event_channel to catch this event and notify the user.
Allocate this channel per shared device.
Each event carries a cookie that identifies the port and queue that raised it.

Signed-off-by: Spike Du 
---
 drivers/net/mlx5/mlx5.c  | 61 
 drivers/net/mlx5/mlx5.h  |  7 +
 drivers/net/mlx5/mlx5_devx.c | 47 ++
 drivers/net/mlx5/mlx5_rx.c   | 29 +
 drivers/net/mlx5/mlx5_rx.h   |  7 +
 5 files changed, 151 insertions(+)

diff --git a/drivers/net/mlx5/mlx5.c b/drivers/net/mlx5/mlx5.c
index 72b1e35..334223e 100644
--- a/drivers/net/mlx5/mlx5.c
+++ b/drivers/net/mlx5/mlx5.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -22,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1521,6 +1523,64 @@ struct mlx5_dev_ctx_shared *
 }
 
 /**
+ * Create LWM event_channel and interrupt handle for shared device
+ * context. All rxqs sharing the device context share the event_channel.
+ * A callback is registered in interrupt thread to receive the LWM event.
+ *
+ * @param[in] priv
+ *   Pointer to mlx5_priv instance.
+ *
+ * @return
+ *   0 on success, negative with rte_errno set.
+ */
+int
+mlx5_lwm_setup(struct mlx5_priv *priv)
+{
+   int fd_lwm;
+
+   pthread_mutex_init(&priv->sh->lwm_config_lock, NULL);
+   priv->sh->devx_channel_lwm = mlx5_os_devx_create_event_channel
+   (priv->sh->cdev->ctx,
+MLX5DV_DEVX_CREATE_EVENT_CHANNEL_FLAGS_OMIT_EV_DATA);
+   if (!priv->sh->devx_channel_lwm)
+   goto err;
+   fd_lwm = mlx5_os_get_devx_channel_fd(priv->sh->devx_channel_lwm);
+   priv->sh->intr_handle_lwm = mlx5_os_interrupt_handler_create
+   (RTE_INTR_INSTANCE_F_SHARED, true,
+fd_lwm, mlx5_dev_interrupt_handler_lwm, priv);
+   if (!priv->sh->intr_handle_lwm)
+   goto err;
+   return 0;
+err:
+   mlx5_lwm_unset(priv->sh);
+   return -rte_errno;
+}
+
+/**
+ * Destroy LWM event_channel and interrupt handle for shared device
+ * context before free this context. The interrupt handler is also
+ * unregistered.
+ *
+ * @param[in] sh
+ *   Pointer to shared device context.
+ */
+void
+mlx5_lwm_unset(struct mlx5_dev_ctx_shared *sh)
+{
+   if (sh->intr_handle_lwm) {
+   mlx5_os_interrupt_handler_destroy(sh->intr_handle_lwm,
+   mlx5_dev_interrupt_handler_lwm, (void *)-1);
+   sh->intr_handle_lwm = NULL;
+   }
+   if (sh->devx_channel_lwm) {
+   mlx5_os_devx_destroy_event_channel
+   (sh->devx_channel_lwm);
+   sh->devx_channel_lwm = NULL;
+   }
+   pthread_mutex_destroy(&sh->lwm_config_lock);
+}
+
+/**
  * Free shared IB device context. Decrement counter and if zero free
  * all allocated resources and close handles.
  *
@@ -1597,6 +1657,7 @@ struct mlx5_dev_ctx_shared *
claim_zero(mlx5_devx_cmd_destroy(sh->td));
MLX5_ASSERT(sh->geneve_tlv_option_resource == NULL);
pthread_mutex_destroy(&sh->txpp.mutex);
+   mlx5_lwm_unset(sh);
mlx5_free(sh);
return;
 exit:
diff --git a/drivers/net/mlx5/mlx5.h b/drivers/net/mlx5/mlx5.h
index 4821ff0..515ff33 100644
--- a/drivers/net/mlx5/mlx5.h
+++ b/drivers/net/mlx5/mlx5.h
@@ -1264,6 +1264,9 @@ struct mlx5_dev_ctx_shared {
struct mlx5_lb_ctx self_lb; /* QP to enable self loopback for Devx. */
unsigned int flow_max_priority;
enum modify_reg flow_mreg_c[MLX5_MREG_C_NUM];
+   void *devx_channel_lwm;
+   struct rte_intr_handle *intr_handle_lwm;
+   pthread_mutex_t lwm_config_lock;
/* Availability of mreg_c's. */
struct mlx5_dev_shared_port port[]; /* per device port data array. */
 };
@@ -1401,6 +1404,7 @@ enum mlx5_txq_modify_type {
 };
 
 struct mlx5_rxq_priv;
+struct mlx5_priv;
 
 /* HW objects operations structure. */
 struct mlx5_obj_ops {
@@ -1409,6 +1413,7 @@ struct mlx5_obj_ops {
int (*rxq_event_get)(struct mlx5_rxq_obj *rxq_obj);
int (*rxq_obj_modify)(struct mlx5_rxq_priv *rxq, uint8_t type);
void (*rxq_obj_release)(struct mlx5_rxq_priv *rxq);
+   int (*rxq_event_get_lwm)(struct mlx5_priv *priv, int *rxq_idx, int *port_id);
int (*ind_table_new)(struct rte_eth_dev *dev, const unsigned int log_n,
 struct mlx5_ind_table_obj *ind_tbl);
int (*ind_table_modify)(struct rte_eth_dev *dev,
@@ -1599,6 +1604,8 @@ int mlx5_udp_tunnel_port_add(struct rte_eth_dev *dev,
 bool mlx5_is_hpf(struct rte_eth_dev *dev);
 bool mlx5_is_sf_repr(struct rte_eth_dev *dev);
 void mlx5_age_event_prepare(struct mlx5_dev_ctx_shared *sh);
+int mlx5_lwm_setup(struct mlx5_priv *pri

[RFC 4/6] net/mlx5: add private API to configure Rxq LWM

2022-03-31 Thread Spike Du

The new API allows setting, unsetting, or modifying an LWM (limit watermark)
event per Rxq.
When the Rx queue fullness reaches the LWM limit, the driver catches
an HW event and invokes the user callback.
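
As a rough illustration of the percentage semantics (not taken from the
patch itself), the sketch below shows how an LWM percentage could map onto
a number of Rx descriptors. The wqe_cnt expression and the 0..99 range
check mirror the rte_pmd_mlx5_config_rxq_lwm() hunk in this patch.
struct rxq_sketch, rxq_wqe_cnt() and lwm_percent_to_wqes() are hypothetical
names, and the floor division is an assumption, since the real conversion
is cut off in this excerpt.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few mlx5_rxq_data fields used in the patch. */
struct rxq_sketch {
	uint8_t cqe_n;        /* log2 of the number of CQ entries */
	uint8_t log_strd_num; /* log2 of strides per WQE (MPRQ only) */
	bool mprq_enabled;    /* multi-packet Rx queue in use */
};

/* Number of Rx WQEs, mirroring the wqe_cnt expression in the patch. */
uint32_t
rxq_wqe_cnt(const struct rxq_sketch *rxq)
{
	return rxq->mprq_enabled ?
	       UINT32_C(1) << (rxq->cqe_n - rxq->log_strd_num) :
	       UINT32_C(1) << rxq->cqe_n;
}

/*
 * Convert an LWM percentage (0..99) into a descriptor count.
 * Returns -E2BIG above 99, like the range check in the patch.
 * ASSUMPTION: floor division; the driver's own rounding is not shown here.
 */
int
lwm_percent_to_wqes(const struct rxq_sketch *rxq, uint8_t lwm_percent,
		    uint32_t *wqes)
{
	if (lwm_percent > 99)
		return -E2BIG;
	/* 64-bit intermediate avoids overflow for very large queues. */
	*wqes = (uint32_t)((uint64_t)rxq_wqe_cnt(rxq) * lwm_percent / 100);
	return 0;
}
```

Under these assumptions, a 1024-entry queue with an LWM of 30 yields a
threshold of 307 descriptors, and an LWM of 0 yields 0, which matches the
"0 disables the event" convention used by the testpmd command.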

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |  4 ++
 doc/guides/rel_notes/release_22_03.rst |  6 +++
 drivers/common/mlx5/mlx5_prm.h |  1 +
 drivers/net/mlx5/mlx5_rx.c | 88 +-
 drivers/net/mlx5/mlx5_rx.h |  1 +
 drivers/net/mlx5/rte_pmd_mlx5.h| 32 +
 drivers/net/mlx5/version.map   |  1 +
 7 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index a734d10..0e983a6 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -92,6 +92,7 @@ Features
 - Connection tracking.
 - Sub-Function representors.
 - Sub-Function.
+- Rx queue LWM (Limit WaterMark) configuration.
 
 
 Limitations
@@ -507,6 +508,9 @@ Limitations
 - The NIC egress flow rules on representor port are not supported.
 
 
+- LWM:
+  - Doesn't support shared Rx queue and Hairpin Rx queue.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 60e5b4f..0c9d3b6 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -187,6 +187,12 @@ New Features
 
   An API was added to get/set an asymmetric crypto session's user data.
 
+* **Updated Mellanox mlx5 driver.**
+
+  Updated the Mellanox mlx5 driver with new features and improvements, including:
+
+  * Added Rx queue LWM(Limit WaterMark) support.
+
 * **Updated Marvell cnxk crypto PMD.**
 
   * Added SHA256-HMAC support in lookaside protocol (IPsec) for CN10K.
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 44b1822..23b13e3 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3290,6 +3290,7 @@ struct mlx5_aso_wqe {
 
 enum {
MLX5_EVENT_TYPE_OBJECT_CHANGE = 0x27,
+   MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED = 0x14,
 };
 
 enum {
diff --git a/drivers/net/mlx5/mlx5_rx.c b/drivers/net/mlx5/mlx5_rx.c
index f72364e..0390412 100644
--- a/drivers/net/mlx5/mlx5_rx.c
+++ b/drivers/net/mlx5/mlx5_rx.c
@@ -19,15 +19,16 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "mlx5_autoconf.h"
 #include "mlx5_defs.h"
 #include "mlx5.h"
 #include "mlx5_utils.h"
 #include "mlx5_rxtx.h"
+#include "mlx5_devx.h"
 #include "mlx5_rx.h"
 
-
 static __rte_always_inline uint32_t
 rxq_cq_to_pkt_type(struct mlx5_rxq_data *rxq, volatile struct mlx5_cqe *cqe,
   volatile struct mlx5_mini_cqe8 *mcqe);
@@ -1216,3 +1217,88 @@ int mlx5_get_monitor_addr(void *rx_queue, struct rte_power_monitor_cond *pmc)
if (rxq && rxq->lwm_event_rxq_limit_reached)
rxq->lwm_event_rxq_limit_reached(port_id, rxq_idx);
 }
+
+int
+rte_pmd_mlx5_config_rxq_lwm(uint16_t port_id, uint16_t rx_queue_id,
+   uint8_t lwm,
+   lwm_event_rxq_limit_reached_t cb)
+{
+   struct rte_eth_dev *dev = &rte_eth_devices[port_id];
+   struct mlx5_rxq_priv *rxq = mlx5_rxq_get(dev, rx_queue_id);
+   uint16_t event_nums[1] = {MLX5_EVENT_TYPE_SRQ_LIMIT_REACHED};
+   struct mlx5_rxq_data *rxq_data;
+   struct mlx5_priv *priv;
+   uint32_t wqe_cnt;
+   uint64_t cookie;
+   int ret = 0;
+
+   if (!rxq) {
+   rte_errno = EINVAL;
+   return -rte_errno;
+   }
+   rxq_data = &rxq->ctrl->rxq;
+   priv = rxq->priv;
+   /* Ensure the Rq is created by devx. */
+   if (priv->obj_ops.rxq_obj_new != devx_obj_ops.rxq_obj_new) {
+   rte_errno = EINVAL;
+   return -rte_errno;
+   }
+   if (lwm > 99) {
+   DRV_LOG(WARNING, "Too big LWM configuration.");
+   rte_errno = E2BIG;
+   return -rte_errno;
+   }
+   /* Start config LWM. */
+   pthread_mutex_lock(&priv->sh->lwm_config_lock);
+   if (rxq->lwm == 0 && lwm == 0) {
+   /* Both old/new values are 0, do nothing. */
+   ret = 0;
+   goto end;
+   }
+   wqe_cnt = mlx5_rxq_mprq_enabled(rxq_data)
+   ? RTE_BIT32(rxq_data->cqe_n - rxq_data->log_strd_num) :
+   RTE_BIT32(rxq_data->cqe_n);
+   if (lwm) {
+   if (!priv->sh->devx_channel_lwm) {
+   ret = mlx5_lwm_setup(priv);
+   if (ret) {
+   DRV_LOG(WARNING,
+   "Failed to create shared_lwm.");
+   rte_errno = ENOMEM;
+   ret = -rte_errno;
+

[RFC 5/6] net/mlx5: add private API to config host port shaper

2022-03-31 Thread Spike Du

The host port shaper can be configured through QSHR (QoS Shaper Host Register).
Add a check in the build files so that this function is enabled only when
the mtcr_ul library (mstflint) is found.

The host shaper configuration affects all the ethdev ports belonging to the
same host port.

The host shaper API can configure the shaper rate and the lwm-triggered flag
for a host port. The shaper limits the rate of traffic from the host port to
the wire port.
If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
when one of the host port's Rx queues receives an LWM (Limit Watermark) event.
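
To make the rate and lwm-triggered semantics concrete, here is a minimal
model of the behaviour described above. It is only a sketch: struct
host_shaper_sketch and both functions are hypothetical names, not the PMD
API, and the 100Mbps rate unit is the one stated for the testpmd command
in this series.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHAPER_RATE_UNIT_MBPS 100u /* one rate unit is 100 Mbps */

/* Hypothetical model of one host port's shaper state. */
struct host_shaper_sketch {
	bool lwm_triggered; /* arm the automatic shaper on LWM events */
	uint8_t rate;       /* current rate in 100 Mbps units, 0 = disabled */
};

/* Translate a rate in 100 Mbps units to Mbps; 0 means no shaping. */
uint32_t
host_shaper_rate_mbps(uint8_t rate)
{
	return rate * SHAPER_RATE_UNIT_MBPS;
}

/* Model an LWM event arriving from any Rx queue of the host port. */
void
host_shaper_on_lwm_event(struct host_shaper_sketch *hs)
{
	/* Only an armed shaper reacts; it then drops to one unit (100 Mbps). */
	if (hs->lwm_triggered)
		hs->rate = 1;
}
```

In this model a rate of 50 corresponds to 5 Gbps, and an LWM event changes
the rate only when lwm_triggered was set beforehand.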

Signed-off-by: Spike Du 
---
 doc/guides/nics/mlx5.rst   |   7 +++
 doc/guides/rel_notes/release_22_03.rst |   1 +
 drivers/common/mlx5/linux/meson.build  |  21 +--
 drivers/common/mlx5/mlx5_prm.h |  25 
 drivers/net/mlx5/mlx5.h|   2 +
 drivers/net/mlx5/mlx5_rx.c | 104 +
 drivers/net/mlx5/rte_pmd_mlx5.h|  30 ++
 drivers/net/mlx5/version.map   |   1 +
 8 files changed, 187 insertions(+), 4 deletions(-)

diff --git a/doc/guides/nics/mlx5.rst b/doc/guides/nics/mlx5.rst
index 0e983a6..35210c1 100644
--- a/doc/guides/nics/mlx5.rst
+++ b/doc/guides/nics/mlx5.rst
@@ -93,6 +93,7 @@ Features
 - Sub-Function representors.
 - Sub-Function.
 - Rx queue LWM (Limit WaterMark) configuration.
+- Host shaper support.
 
 
 Limitations
@@ -511,6 +512,12 @@ Limitations
 - LWM:
   - Doesn't support shared Rx queue and Hairpin Rx queue.
 
+- Host shaper:
+
+  - Supported on BlueField series NICs, starting from BlueField 2.
+  - When the host shaper is configured with the MLX5_HOST_SHAPER_FLAG_LWM_TRIGGERED flag set,
+only rate 0 and 100Mbps are supported.
+
 Statistics
 ----------
 
diff --git a/doc/guides/rel_notes/release_22_03.rst b/doc/guides/rel_notes/release_22_03.rst
index 0c9d3b6..3ab4388 100644
--- a/doc/guides/rel_notes/release_22_03.rst
+++ b/doc/guides/rel_notes/release_22_03.rst
@@ -192,6 +192,7 @@ New Features
   Updated the Mellanox mlx5 driver with new features and improvements, including:
 
   * Added Rx queue LWM(Limit WaterMark) support.
+  * Added host shaper support.
 
 * **Updated Marvell cnxk crypto PMD.**
 
diff --git a/drivers/common/mlx5/linux/meson.build b/drivers/common/mlx5/linux/meson.build
index ed48245..c88c184 100644
--- a/drivers/common/mlx5/linux/meson.build
+++ b/drivers/common/mlx5/linux/meson.build
@@ -16,8 +16,9 @@ if dlopen_ibverbs
 ]
 endif
 
-libnames = [ 'mlx5', 'ibverbs' ]
+libnames = [ 'mlx5', 'ibverbs', 'mtcr_ul' ]
 libs = []
+libmtcr_ul_found = false
 foreach libname:libnames
 lib = dependency('lib' + libname, static:static_ibverbs, required:false, method: 'pkg-config')
 if not lib.found() and not static_ibverbs
@@ -28,10 +29,16 @@ foreach libname:libnames
 if not static_ibverbs and not dlopen_ibverbs
 ext_deps += lib
 endif
+if libname == 'mtcr_ul'
+libmtcr_ul_found = true
+ext_deps += lib
+endif
 else
-build = false
-reason = 'missing dependency, "' + libname + '"'
-subdir_done()
+if libname != 'mtcr_ul'
+build = false
+reason = 'missing dependency, "' + libname + '"'
+subdir_done()
+endif
 endif
 endforeach
 if static_ibverbs or dlopen_ibverbs
@@ -205,6 +212,12 @@ has_sym_args = [
 [ 'HAVE_MLX5_IBV_IMPORT_CTX_PD_AND_MR', 'infiniband/verbs.h',
 'ibv_import_device' ],
 ]
+if  libmtcr_ul_found
+has_sym_args += [
+[  'HAVE_MLX5_MSTFLINT', 'mstflint/mtcr.h',
+'mopen'],
+]
+endif
 config = configuration_data()
 foreach arg:has_sym_args
 config.set(arg[0], cc.has_header_symbol(arg[1], arg[2], dependencies: libs))
diff --git a/drivers/common/mlx5/mlx5_prm.h b/drivers/common/mlx5/mlx5_prm.h
index 23b13e3..3559927 100644
--- a/drivers/common/mlx5/mlx5_prm.h
+++ b/drivers/common/mlx5/mlx5_prm.h
@@ -3768,6 +3768,7 @@ enum {
MLX5_CRYPTO_COMMISSIONING_REGISTER_ID = 0xC003,
MLX5_IMPORT_KEK_HANDLE_REGISTER_ID = 0xC004,
MLX5_CREDENTIAL_HANDLE_REGISTER_ID = 0xC005,
+   MLX5_QSHR_REGISTER_ID = 0x4030,
 };
 
 struct mlx5_ifc_register_mtutc_bits {
@@ -3782,6 +3783,30 @@ struct mlx5_ifc_register_mtutc_bits {
u8 time_adjustment[0x20];
 };
 
+struct mlx5_ifc_ets_global_config_register_bits {
+   u8 reserved_at_0[0x2];
+   u8 rate_limit_update[0x1];
+   u8 reserved_at_3[0x29];
+   u8 max_bw_units[0x4];
+   u8 reserved_at_48[0x8];
+   u8 max_bw_value[0x8];
+};
+
+#define ETS_GLOBAL_CONFIG_BW_UNIT_DISABLED  0x0
+#define ETS_GLOBAL_CONFIG_BW_UNIT_HUNDREDS_MBPS 0x3
+#define ETS_GLOBAL_CONFIG_BW_UNIT_GBPS  0x4
+
+struct mlx5_ifc_register_qshr_bits {
+   u8 reserved_at_0[0x4];
+   u8 connected_host[0

[RFC 6/6] app/testpmd: add LWM and Host Shaper command

2022-03-31 Thread Spike Du

Add testpmd commands to configure the per-rxq LWM and the host shaper.
- Command syntax:
  set port  rxq  lwm 
  set port  host_shaper lwm_triggered <0|1> rate 

- Example commands:
To configure LWM as 30% of rxq size on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 30

To disable LWM on port 1 rxq 0:
testpmd> set port 1 rxq 0 lwm 0

To enable lwm_triggered on port 1 and disable current host shaper:
testpmd> set port 1 host_shaper lwm_triggered 1 rate 0

To disable lwm_triggered and current host shaper on port 1:
testpmd> set port 1 host_shaper lwm_triggered 0 rate 0

The rate unit is 100Mbps.
To disable lwm_triggered and configure a shaper of 5Gbps on port 1:
testpmd> set port 1 host_shaper lwm_triggered 0 rate 50

Add sample code to handle the rxq LWM event: it waits for a while so that
the rxq drains, then disables the host shaper and re-arms the LWM event.
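
The event-handling flow can be sketched as a tiny state machine. This is
not the testpmd sample code, only a hedged model of it: struct
lwm_port_sketch, lwm_event() and lwm_delayed_rearm() are hypothetical
names. Treating the event as one-shot is an assumption drawn from the need
to re-arm it, and the "port started" check reflects the later crash fix
posted to this list, not this RFC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-port state; not the testpmd structures. */
struct lwm_port_sketch {
	bool started;       /* port is in the started state */
	bool lwm_triggered; /* host shaper armed to react to LWM events */
	bool shaper_active; /* 100 Mbps shaper currently applied */
	uint8_t lwm;        /* configured LWM percentage, 0 = disabled */
	bool lwm_armed;     /* LWM event can still fire */
};

/* Step 1: the Rx queue reached the LWM and HW raised the event. */
void
lwm_event(struct lwm_port_sketch *p)
{
	if (!p->lwm_armed)
		return;
	p->lwm_armed = false; /* ASSUMPTION: the event is one-shot */
	if (p->lwm_triggered)
		p->shaper_active = true; /* shaper kicks in automatically */
}

/*
 * Step 2: runs after the delay, once the queue had time to drain.
 * Returns 0 on success, -1 if the port is not started and nothing is done.
 */
int
lwm_delayed_rearm(struct lwm_port_sketch *p)
{
	if (!p->started)
		return -1; /* queues may be going away, do not touch them */
	p->shaper_active = false;     /* disable the host shaper */
	p->lwm_armed = (p->lwm != 0); /* re-arm only if LWM is configured */
	return 0;
}
```

Calling lwm_event() followed by lwm_delayed_rearm() returns the model to
its initial state, which is the cycle the commit message describes.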

Signed-off-by: Spike Du 
---
 app/test-pmd/cmdline.c   | 149 +++
 app/test-pmd/config.c| 122 ++
 app/test-pmd/meson.build |   3 +
 app/test-pmd/testpmd.c   |   3 +
 app/test-pmd/testpmd.h   |   5 ++
 doc/guides/nics/mlx5.rst |  76 
 6 files changed, 358 insertions(+)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index 7ab0575..8a5fe26 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -17807,6 +17807,151 @@ struct cmd_show_port_flow_transfer_proxy_result {
}
 };
 
+#ifdef RTE_NET_MLX5
+
+/* *** SET LIMIT WATER MARK FOR A RXQ OF A PORT *** */
+struct cmd_rxq_lwm_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t rxq;
+   uint16_t rxq_num;
+   cmdline_fixed_string_t lwm;
+   uint16_t lwm_num;
+};
+
+static void cmd_rxq_lwm_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_rxq_lwm_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+   && (strcmp(res->rxq, "rxq") == 0)
+   && (strcmp(res->lwm, "lwm") == 0))
+   ret = set_rxq_lwm(res->port_num, res->rxq_num,
+ res->lwm_num);
+   if (ret < 0)
+   printf("rxq_lwm_cmd error: (%s)\n", strerror(-ret));
+
+}
+
+cmdline_parse_token_string_t cmd_rxq_lwm_set =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   set, "set");
+cmdline_parse_token_string_t cmd_rxq_lwm_port =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   port, "port");
+cmdline_parse_token_num_t cmd_rxq_lwm_portnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   port_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_rxq =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq, "rxq");
+cmdline_parse_token_num_t cmd_rxq_lwm_rxqnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   rxq_num, RTE_UINT16);
+cmdline_parse_token_string_t cmd_rxq_lwm_lwm =
+   TOKEN_STRING_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm, "lwm");
+cmdline_parse_token_num_t cmd_rxq_lwm_lwmnum =
+   TOKEN_NUM_INITIALIZER(struct cmd_rxq_lwm_result,
+   lwm_num, RTE_UINT16);
+
+cmdline_parse_inst_t cmd_rxq_lwm = {
+   .f = cmd_rxq_lwm_parsed,
+   .data = (void *)0,
+   .help_str = "set port  rxq  lwm "
+   "Set lwm for rxq on port_id",
+   .tokens = {
+   (void *)&cmd_rxq_lwm_set,
+   (void *)&cmd_rxq_lwm_port,
+   (void *)&cmd_rxq_lwm_portnum,
+   (void *)&cmd_rxq_lwm_rxq,
+   (void *)&cmd_rxq_lwm_rxqnum,
+   (void *)&cmd_rxq_lwm_lwm,
+   (void *)&cmd_rxq_lwm_lwmnum,
+   NULL,
+   },
+};
+
+/* *** SET HOST_SHAPER LWM TRIGGERED FOR A PORT *** */
+struct cmd_port_host_shaper_result {
+   cmdline_fixed_string_t set;
+   cmdline_fixed_string_t port;
+   uint16_t port_num;
+   cmdline_fixed_string_t host_shaper;
+   cmdline_fixed_string_t lwm_triggered;
+   uint16_t fr;
+   cmdline_fixed_string_t rate;
+   uint8_t rate_num;
+};
+
+static void cmd_port_host_shaper_parsed(void *parsed_result,
+   __rte_unused struct cmdline *cl,
+   __rte_unused void *data)
+{
+   struct cmd_port_host_shaper_result *res = parsed_result;
+   int ret = 0;
+
+   if ((strcmp(res->set, "set") == 0) && (strcmp(res->port, "port") == 0)
+

RE: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper

2022-04-25 Thread Spike Du

Hi Jerin,
Thanks for your comments, and sorry for the late response.

For case one, I think I can refine the design: add LWM (limit watermark)
to rte_eth_rxconf and add a new rte_eth_event_type event.

For case two (host shaper), I think we can't use the RX meter, because it is
actually a TX shaper on a remote system. It is quite specific to the
Mellanox/Nvidia BlueField 2 (BF2 for short) NIC. The NIC contains an ARM
system. We have two terms here: Host-system stands for the system the BF2 NIC
is inserted into; ARM-system stands for the embedded ARM in BF2. The
ARM-system does the forwarding. This is how the host shaper works: we
configure the register on the ARM-system, but it affects the Host-system's TX
shaper, which means the shaper works on the remote port. It is not an RX
meter concept, hence we can't use the DPDK RX meter framework. I'd suggest
still using a private API.

For the testpmd part, I understand your concern. Because we need one private
API for the host shaper, and we need testpmd's forwarding code to show users
how it works, we need to call the private API in testpmd. If the current
patch is not acceptable, what is the correct way to do it? Is there any
framework to isolate the PMD private logic from testpmd common code, but
still allow calling private APIs in testpmd?

Regards,
Spike.

> -Original Message-
> From: Jerin Jacob 
> Sent: Tuesday, April 5, 2022 4:59 PM
> To: Spike Du ; Andrew Rybchenko
> ; Cristian Dumitrescu
> ; Ferruh Yigit ;
> techbo...@dpdk.org
> Cc: Matan Azrad ; Slava Ovsiienko
> ; Ori Kam ; NBU-Contact-
> Thomas Monjalon (EXTERNAL) ; dpdk-dev
> ; Raslan Darawsheh 
> Subject: Re: [RFC 0/6] net/mlx5: introduce limit watermark and host shaper
> 
> On Fri, Apr 1, 2022 at 8:53 AM Spike Du  wrote:
> >
> > LWM(limit watermark) is per RX queue attribute, when RX queue fullness
> > reach the LWM limit, HW sends an event to dpdk application.
> > Host shaper can configure shaper rate and lwm-triggered for a host port.
> > The shaper limits the rate of traffic from host port to wire port.
> > If lwm-triggered is enabled, a 100Mbps shaper is enabled automatically
> > when one of the host port's Rx queues receives LWM event.
> >
> > These two features can combine to control traffic from host port to
> > wire port.
> > The work flow is configure LWM to RX queue and enable lwm-triggered
> > flag in host shaper, after receiving LWM event, delay a while until RX
> > queue is empty, then disable the shaper. We recycle this work flow to
> > reduce RX queue drops.
> >
> > Spike Du (6):
> >   net/mlx5: add LWM support for Rxq
> >   common/mlx5: share interrupt management
> >   net/mlx5: add LWM event handling support
> >   net/mlx5: add private API to configure Rxq LWM
> >   net/mlx5: add private API to config host port shaper
> >   app/testpmd: add LWM and Host Shaper command
> 
> + @Andrew Rybchenko  @Ferruh Yigit cristian.dumitre...@intel.com
> 
> I think, case one, can be easily abstracted via adding new
> rte_eth_event_type event and case two can be abstracted via the existing
> Rx meter framework in ethdev.
> 
> Also, Updating generic testpmd to support PMD specific API should be
> avoided, I know there is existing stuff in testpmd, I think, we should have
> the policy to add PMD specific commands to testpmd.
> 
> There are around 56PMDs in ethdev now, If PMDs try to add PMD specific
> API in testpmd it will be bloated or at minimum, it should a separate file in
> testpmd if we choose to take that path.
> 
> + @techbo...@dpdk.org
