Re: [dpdk-dev] Question on mlx5 PMD txq memory registration

2017-07-23 Thread Shahaf Shuler
Thursday, July 20, 2017 7:22 PM, Sagi Grimberg:
> >> As I said, there are primitives which are designed to handle frequent
> >> reads and rare mutations.
> >
> > Even with such primitives, rarely lock is more than never lock.
> 
> You do realize that the cache mutation involves ibv_dereg_mr() right?
> Any locking scheme for mutation is negligible compared to that, and
> rcu_read_lock is practically free.

The fix for the issue you presented includes not to dereg MR on datapath. MR 
will be deregistered only on device close.
MRs replace on cache can still happen, but it won't involve any 
de-registration. 

> 
> >>> It is one possibility discussed also with Mellanox guys, the point
> >>> is this breaks the security point of view which is also an important 
> >>> stuff.
> >>
> >> What security aspect? The entire dpdk model builds on top of physical
> >> addresses awareness running under root permissions.
> >> I'm not saying
> >> exposing it to the application nor granting remote permissions to the
> >> physical space.
> >>
> >> mlx5_pmd is a network driver, and as a driver, it should allowed to
> >> use the device dma lkey as it sees fit. I honestly think its pretty
> >> much mandatory considering the work-arounds mlx5_pmd tries to do
> >> (which we agreed are broken).
> >
> > True, There are many PMDs which can work only with physical memory.
> > However Mellanox NICs have the option to work with virtual one thus
> provide more security.
> 
> I don't understand the security argument. Its completely private to the
> driver. anything under librte is equivalent to an OS wrt networking, so I 
> fail to
> see what is the security feature your talking about.

You are correct that as a root you are able to do whatever you want on the 
server.
The security I refer to is to protect against badly written code.

The fact the PMD only registers the mempools, and use the device engine to 
translate the VA, provide some protection.
For example, one DPDK process will not be able to access the memory of other 
DPDK process *by mistake*.

I am not saying using the reserved lkey is not a good suggestion, and we plan 
to test its value.
All I am saying is there are maybe other option to provide the same performance 
with the extra protection mentioned above.
One of them can be to use indirect keys. One indirect key to represent 64b 
memory area, and other regular keys for the hugepages.
The rest of the memory area can be filled with keys pointing to /dev/null. 

> 
> > The fact running under root doesn't mean you have privileges to access
> every physical page on the server (even if you try very hard to be aware).
> 
> But dpdk core mbufs structs are built this way.
> 
> > The issue here, AFAIU, is performance.
> > We are now looking into ways to provide the same performance as if it was
> only a single lkey, while preserving the security feature.
> 
> Hmm, What exactly do you have in mind?
> 
> I'm hoping that you are not referring to ODP. If you are, I think that latency
> unpredictability would be a huge non-starter, page-faults are way too
> expensive for dpdk users.

No ODP :). 
As all relevant DPDK memory is on top of hugepages, there is no reason to avoid 
registration and pinning  in advance. 

> 
> Again, personally, I don't think that using virtual memory registration are 
> very
> useful for a network driver, It's a driver, not an application.
> 
> >>> If this is added in the future it will certainly be as an option,
> >>> this way both will be possible, the application could then choose
> >>> about security vs performance.
> >>
> >> Why should the application even be aware of that? Does any other
> >> driver expose the user how it maps pkt mbufs to the NIC? Just like
> >> the MR handling, its 100% internal to mlx5 and no reason why the user
> >> should ever be exposed to any of these details.
> >
> > Other option is the reserved lkey as you suggested, but it will lose the
> security guarantees.
> 
> OK, obviously I'm missing something. Can you articulate what do you mean
> by "security guarantees"?
> 
> > Like every performance optimization it should be the application decision.
> 
> I tend to disagree with you on this. The application must never be aware of
> the nitty details of the underlying driver implementation. In fact, 
> propagating
> this information to the application sounds like a layering violation.
> 
> > In fact, there are some discussions on the ML of exposing the option
> > to use va instead of pa. [1]
> 
> My understanding is that the correct proposal was to hide this knowledge
> from the user. Also the use-case is not security (which I'm not even sure
> what it means yet).
> 
> >>> This is also a in progress in PMD part, it should be part of the
> >>> next DPDK release.
> >>
> >> That is *very* good to hear! Can you guys share a branch? I'm willing
> >> to take it for testing.
> >
> > The branch is still pre-mature. It may be good enough for external testing 
> > in
> about two weeks.
> > Contac

Re: [dpdk-dev] Question on mlx5 PMD txq memory registration

2017-07-23 Thread Sagi Grimberg



I don't understand the security argument. Its completely private to the
driver. anything under librte is equivalent to an OS wrt networking, so I fail 
to
see what is the security feature your talking about.


You are correct that as a root you are able to do whatever you want on the 
server.
The security I refer to is to protect against badly written code.

The fact the PMD only registers the mempools, and use the device engine to 
translate the VA, provide some protection.
For example, one DPDK process will not be able to access the memory of other 
DPDK process *by mistake*.


Well, this is a fair argument, but without a *complete* solution for all
of dpdk peripherals, it has very little merit (if at all). A badly
written code can just as easily crash a server by passing a mbuf to
a crypto device or another network device that co-exists with mlx5.

So, while I understand the argument, I think its value is not worth the
hassle that mlx5_pmd needs to take to achieve it. Did this come from a
real requirement (from a real implementation)?


I am not saying using the reserved lkey is not a good suggestion, and we plan 
to test its value.
All I am saying is there are maybe other option to provide the same performance 
with the extra protection mentioned above.
One of them can be to use indirect keys. One indirect key to represent 64b 
memory area, and other regular keys for the hugepages.
The rest of the memory area can be filled with keys pointing to /dev/null.


If I understand what you are suggesting, this would trigger out-of-order
transfers on an indirect memory key just about always (each transfer can 
originate from a different hugepage and SGL resolution alone will

require a walk on the memory key context SGL list). I'm afraid this
would introduce a very bad performance scaling due to the fact that a
SGL context (klm) will need to be fetched from the ICM for essentially
every send operation.

Having said that, its just my 2 cents, if your solution works then I
don't really care. You are the one testing it...


The fact running under root doesn't mean you have privileges to access

every physical page on the server (even if you try very hard to be aware).

But dpdk core mbufs structs are built this way.


The issue here, AFAIU, is performance.
We are now looking into ways to provide the same performance as if it was

only a single lkey, while preserving the security feature.

Hmm, What exactly do you have in mind?

I'm hoping that you are not referring to ODP. If you are, I think that latency
unpredictability would be a huge non-starter, page-faults are way too
expensive for dpdk users.


No ODP :).
As all relevant DPDK memory is on top of hugepages, there is no reason to avoid 
registration and pinning  in advance.


Agree.


[dpdk-dev] [PATCH v9 3/5] net/i40e: add support of reset

2017-07-23 Thread Wei Dai
Reset a NIC by calling dev_uninit() and then dev_init().
Go through the same way in NIC PCI remove without release
of ethdev resource and then NIC PCI probe function without
ethdev resource allocation.

Signed-off-by: Wei Dai 
---
 drivers/net/i40e/i40e_ethdev.c| 28 
 drivers/net/i40e/i40e_ethdev_vf.c | 19 +++
 2 files changed, 47 insertions(+)

diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c
index 9fcccda..1641e00 100644
--- a/drivers/net/i40e/i40e_ethdev.c
+++ b/drivers/net/i40e/i40e_ethdev.c
@@ -250,6 +250,7 @@ static int i40e_dev_configure(struct rte_eth_dev *dev);
 static int i40e_dev_start(struct rte_eth_dev *dev);
 static void i40e_dev_stop(struct rte_eth_dev *dev);
 static void i40e_dev_close(struct rte_eth_dev *dev);
+static int  i40e_dev_reset(struct rte_eth_dev *dev);
 static void i40e_dev_promiscuous_enable(struct rte_eth_dev *dev);
 static void i40e_dev_promiscuous_disable(struct rte_eth_dev *dev);
 static void i40e_dev_allmulticast_enable(struct rte_eth_dev *dev);
@@ -449,6 +450,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = {
.dev_start= i40e_dev_start,
.dev_stop = i40e_dev_stop,
.dev_close= i40e_dev_close,
+   .dev_reset= i40e_dev_reset,
.promiscuous_enable   = i40e_dev_promiscuous_enable,
.promiscuous_disable  = i40e_dev_promiscuous_disable,
.allmulticast_enable  = i40e_dev_allmulticast_enable,
@@ -2155,6 +2157,32 @@ i40e_dev_close(struct rte_eth_dev *dev)
I40E_WRITE_FLUSH(hw);
 }
 
+/*
+ * Reest PF device only to re-initialize resources in PMD layer
+ */
+static int
+i40e_dev_reset(struct rte_eth_dev *dev)
+{
+   int ret;
+
+   /* When a DPDK PMD PF begin to reset PF port, it should notify all
+* its VF to make them align with it. The detailed notification
+* mechanism is PMD specific. As to i40e PF, it is rather complex.
+* To avoid unexpected behavior in VF, currently reset of PF with
+* SR-IOV activation is not supported. It might be supported later.
+*/
+   if (dev->data->sriov.active)
+   return -ENOTSUP;
+
+   ret = eth_i40e_dev_uninit(dev);
+   if (ret)
+   return ret;
+
+   ret = eth_i40e_dev_init(dev);
+
+   return ret;
+}
+
 static void
 i40e_dev_promiscuous_enable(struct rte_eth_dev *dev)
 {
diff --git a/drivers/net/i40e/i40e_ethdev_vf.c 
b/drivers/net/i40e/i40e_ethdev_vf.c
index f6d8293..f951d4e 100644
--- a/drivers/net/i40e/i40e_ethdev_vf.c
+++ b/drivers/net/i40e/i40e_ethdev_vf.c
@@ -122,6 +122,7 @@ static void i40evf_vlan_offload_set(struct rte_eth_dev 
*dev, int mask);
 static int i40evf_vlan_pvid_set(struct rte_eth_dev *dev, uint16_t pvid,
int on);
 static void i40evf_dev_close(struct rte_eth_dev *dev);
+static int  i40evf_dev_reset(struct rte_eth_dev *dev);
 static void i40evf_dev_promiscuous_enable(struct rte_eth_dev *dev);
 static void i40evf_dev_promiscuous_disable(struct rte_eth_dev *dev);
 static void i40evf_dev_allmulticast_enable(struct rte_eth_dev *dev);
@@ -203,6 +204,7 @@ static const struct eth_dev_ops i40evf_eth_dev_ops = {
.xstats_get_names = i40evf_dev_xstats_get_names,
.xstats_reset = i40evf_dev_xstats_reset,
.dev_close= i40evf_dev_close,
+   .dev_reset= i40evf_dev_reset,
.dev_infos_get= i40evf_dev_info_get,
.dev_supported_ptypes_get = i40e_dev_supported_ptypes_get,
.vlan_filter_set  = i40evf_vlan_filter_set,
@@ -2373,6 +2375,23 @@ i40evf_dev_close(struct rte_eth_dev *dev)
i40evf_disable_irq0(hw);
 }
 
+/*
+ * Reest VF device only to re-initialize resources in PMD layer
+ */
+static int
+i40evf_dev_reset(struct rte_eth_dev *dev)
+{
+   int ret;
+
+   ret = i40evf_dev_uninit(dev);
+   if (ret)
+   return ret;
+
+   ret = i40evf_dev_init(dev);
+
+   return ret;
+}
+
 static int
 i40evf_get_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, uint16_t lut_size)
 {
-- 
2.7.5



[dpdk-dev] [PATCH v9 2/5] net/ixgbe: add support of reset

2017-07-23 Thread Wei Dai
Reset a NIC by calling dev_uninit and then dev_init.
Go through same way in NIC PCI remove without release of
ethdev resource and then NIC PCI probe function without
ethdev resource allocation.

Signed-off-by: Wei Dai 
---
 drivers/net/ixgbe/ixgbe_ethdev.c | 47 
 1 file changed, 47 insertions(+)

diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
index 194058f..3f176fe 100644
--- a/drivers/net/ixgbe/ixgbe_ethdev.c
+++ b/drivers/net/ixgbe/ixgbe_ethdev.c
@@ -169,6 +169,7 @@ static void ixgbe_dev_stop(struct rte_eth_dev *dev);
 static int  ixgbe_dev_set_link_up(struct rte_eth_dev *dev);
 static int  ixgbe_dev_set_link_down(struct rte_eth_dev *dev);
 static void ixgbe_dev_close(struct rte_eth_dev *dev);
+static int  ixgbe_dev_reset(struct rte_eth_dev *dev);
 static void ixgbe_dev_promiscuous_enable(struct rte_eth_dev *dev);
 static void ixgbe_dev_promiscuous_disable(struct rte_eth_dev *dev);
 static void ixgbe_dev_allmulticast_enable(struct rte_eth_dev *dev);
@@ -265,6 +266,7 @@ static int ixgbevf_dev_link_update(struct rte_eth_dev *dev,
   int wait_to_complete);
 static void ixgbevf_dev_stop(struct rte_eth_dev *dev);
 static void ixgbevf_dev_close(struct rte_eth_dev *dev);
+static int  ixgbevf_dev_reset(struct rte_eth_dev *dev);
 static void ixgbevf_intr_disable(struct ixgbe_hw *hw);
 static void ixgbevf_intr_enable(struct ixgbe_hw *hw);
 static void ixgbevf_dev_stats_get(struct rte_eth_dev *dev,
@@ -523,6 +525,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = {
.dev_set_link_up= ixgbe_dev_set_link_up,
.dev_set_link_down  = ixgbe_dev_set_link_down,
.dev_close= ixgbe_dev_close,
+   .dev_reset= ixgbe_dev_reset,
.promiscuous_enable   = ixgbe_dev_promiscuous_enable,
.promiscuous_disable  = ixgbe_dev_promiscuous_disable,
.allmulticast_enable  = ixgbe_dev_allmulticast_enable,
@@ -613,6 +616,7 @@ static const struct eth_dev_ops ixgbevf_eth_dev_ops = {
.xstats_reset = ixgbevf_dev_stats_reset,
.xstats_get_names = ixgbevf_dev_xstats_get_names,
.dev_close= ixgbevf_dev_close,
+   .dev_reset= ixgbevf_dev_reset,
.allmulticast_enable  = ixgbevf_dev_allmulticast_enable,
.allmulticast_disable = ixgbevf_dev_allmulticast_disable,
.dev_infos_get= ixgbevf_dev_info_get,
@@ -2857,6 +2861,32 @@ ixgbe_dev_close(struct rte_eth_dev *dev)
ixgbe_set_rar(hw, 0, hw->mac.addr, 0, IXGBE_RAH_AV);
 }
 
+/*
+ * Reest PF device.
+ */
+static int
+ixgbe_dev_reset(struct rte_eth_dev *dev)
+{
+   int ret;
+
+   /* When a DPDK PMD PF begin to reset PF port, it should notify all
+* its VF to make them align with it. The detailed notification
+* mechanism is PMD specific. As to ixgbe PF, it is rather complex.
+* To avoid unexpected behavior in VF, currently reset of PF with
+* SR-IOV activation is not supported. It might be supported later.
+*/
+   if (dev->data->sriov.active)
+   return -ENOTSUP;
+
+   ret = eth_ixgbe_dev_uninit(dev);
+   if (ret)
+   return ret;
+
+   ret = eth_ixgbe_dev_init(dev);
+
+   return ret;
+}
+
 static void
 ixgbe_read_stats_registers(struct ixgbe_hw *hw,
   struct ixgbe_hw_stats *hw_stats,
@@ -5058,6 +5088,23 @@ ixgbevf_dev_close(struct rte_eth_dev *dev)
ixgbevf_remove_mac_addr(dev, 0);
 }
 
+/*
+ * Reset VF device
+ */
+static int
+ixgbevf_dev_reset(struct rte_eth_dev *dev)
+{
+   int ret;
+
+   ret = eth_ixgbevf_dev_uninit(dev);
+   if (ret)
+   return ret;
+
+   ret = eth_ixgbevf_dev_init(dev);
+
+   return ret;
+}
+
 static void ixgbevf_set_vfta_all(struct rte_eth_dev *dev, bool on)
 {
struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
-- 
2.7.5



[dpdk-dev] [PATCH v9 0/5] Support of NIC reset and keep same port id

2017-07-23 Thread Wei Dai
This patch set adds a function rte_eth_dev_reset( ) in rte_ethdev layer.
Sometimes a port have to be reset passively. For example a PF is reset,
all its VFs should also be reset by application itself to align with 
the PF.
A DPDK application also can call this function to trigger an initative
port reset.

When processing reset, if the port goes through PCI remove() and then
PCI probe() for restoration, its port id may be changed and this is not
expected by some DPDK application.

Normally, PCI probe() includes two parts: one is in rte_ethdev layer
to allocate resource in rte_ethdev layer and the other is calling PMD
specific dev_init() to allocate and initialize resource in PMD layer.
PCI remove( ) releases all resource allocated from rte_ethdev layer
in PCI probe( ) and calls PMD specific dev_uninit( ) to releaase
resource allocated by dev_init( ) in PMD layer.

To keep same port id and reset the port, only dev_uninit() and
dev_init( ) in PMD can be called and keep all resources allocated
from rte_ethdev layer poart in PCI probe( ). All these are what
rte_eth_dev_reset() does.

The rte_eth_dev_reset( ) calls rte_eth_dev_stop( ), PMD dev_uninit( )
and then PMD dev_init( ) to reset a port and keep same port id.

After calling rte_eth_dev_reset( ), the application can go through
rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ),
rte_eth_tx_queue_setup( ) and rte_eth_dev_start( ) again to restore
its previous settings or to reconfigure itself with different settings.

To test this new feature, a testpmd command "port reset port_id" is added.
The mapping between port number and its PCI address can be monitored to
confirm its port number is kept.
And following test case can also be used to confirm the port can work
again after reset.

A typical test steps are listed as follows:
For example, run "ifconfig PF-name down" will trigger a reset to VF.
1.  run testpmd with 2 ixgbe VF ports belonging to same PF
2.  testpmd > set verbose 1 //to observe VF working
3.  testpmd > show port info all //show port number and MAC addr
4.  testpmd > start
5.  let all ports forwarding work for a while
6.  testpmd > show port stats all
7.  ifconfig name-of-PF down
8.  A message is shown in testmd to indicate PF reset
9.  ifconfig name-of-PF up
10. testpmd > stop // stop forwarding to avoid crash during reset
11. testpmd > port reset all
12. testpmd > port stop all
13. testpmd > port start all //recofnig all ports
14. testpmd > show port info all
//get mapping of port id and MAC addr for forwarding
15. testpmd > start // restore forwarding
14. let all ports forwarding work for a while
15. testpmd > show port stats all //confirm all port can work again
16. repeat above step 7 - 15

chagnes:
v9:
  rebase to master branch
  modify some text after it is reviewed by an English native speaker

v8:
  modify the comments before the declaration of rte_eth_dev_reset( ) to
  make doc generated from doxygen be concise.
 
v7:
  add description of NIC reset in doc/poll_mode_drv.rst

v6:
  add more comments to explain why the rte_eth_dev_reset( ) is needeed,
  when it is called and what the application should do after calling it.
  add more comments to explain why reset of PF with SR-IOV is not
  supported currently.
 
v5:
  remove PCI address output to align with other modification which
will output it in other way
  disable PF reset when its VF is ative to avoid unexpected VF behavior
v4:
  add PCI address to confirm its port number keep same
  correct test method in cover letter
v3:
  update testpmd command
v2:
  only reset PMD layer resource and keep same port id, but
  not restore settings

Signed-off-by: Wei Dai 
Tested-by: Yuan Peng 
Acked-by: Jingjing Wu 
Reviewed-by: Remy Horton 

Wei Dai (5):
  ethdev: add support of NIC reset
  net/ixgbe: add support of reset
  net/i40e: add support of reset
  app/testpmd: enhance command to test NIC reset
  doc: add description of the NIC reset API

 app/test-pmd/cmdline.c  | 12 ++---
 app/test-pmd/testpmd.c  | 41 
 app/test-pmd/testpmd.h  |  1 +
 doc/guides/prog_guide/poll_mode_drv.rst | 41 
 drivers/net/i40e/i40e_ethdev.c  | 28 
 drivers/net/i40e/i40e_ethdev_vf.c   | 19 +
 drivers/net/ixgbe/ixgbe_ethdev.c| 47 +
 lib/librte_ether/rte_ethdev.c   | 17 
 lib/librte_ether/rte_ethdev.h   | 34 
 lib/librte_ether/rte_ether_version.map  |  1 +
 10 files changed, 237 insertions(+), 4 deletions(-)

-- 
2.7.5



[dpdk-dev] [PATCH v9 1/5] ethdev: add support of NIC reset

2017-07-23 Thread Wei Dai
This patch adds a new eth_dev layer API function rte_eth_dev_reset(),
which a DPDK application can call to reset a NIC and keep its port id
afterwards. It means that all software resources allocated in the ethdev
layer are kept, and software & hardware resources of the NIC within the
NIC's PMD are reset to a state simular to that obtained by calling the
PCI dev_uninit() and then dev_init(). This effective sequence of
dev_uninit() and dev_init() is packed into a single API function
rte_eth_dev_reset().

Please see the comments before the declaration of rte_eht_dev_reset()
in lib/librte_ether/rte_ethdev.h to get more details on why this
function is needed, what it does, when it should be called
and what an application should do after calling this function.

Signed-off-by: Wei Dai 
Reviewed-by: Remy Horton 
---
 lib/librte_ether/rte_ethdev.c  | 17 +
 lib/librte_ether/rte_ethdev.h  | 34 ++
 lib/librte_ether/rte_ether_version.map |  1 +
 3 files changed, 52 insertions(+)

diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index d4ebb1b..68ba64d 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -1004,6 +1004,23 @@ rte_eth_dev_close(uint8_t port_id)
 }
 
 int
+rte_eth_dev_reset(uint8_t port_id)
+{
+   struct rte_eth_dev *dev;
+   int ret;
+
+   RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL);
+   dev = &rte_eth_devices[port_id];
+
+   RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dev_reset, -ENOTSUP);
+
+   rte_eth_dev_stop(port_id);
+   ret = dev->dev_ops->dev_reset(dev);
+
+   return ret;
+}
+
+int
 rte_eth_rx_queue_setup(uint8_t port_id, uint16_t rx_queue_id,
   uint16_t nb_rx_desc, unsigned int socket_id,
   const struct rte_eth_rxconf *rx_conf,
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 0e99090..fde92a1 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1115,6 +1115,9 @@ typedef int  (*eth_dev_set_link_down_t)(struct 
rte_eth_dev *dev);
 typedef void (*eth_dev_close_t)(struct rte_eth_dev *dev);
 /**< @internal Function used to close a configured Ethernet device. */
 
+typedef int (*eth_dev_reset_t)(struct rte_eth_dev *dev);
+/** <@internal Function used to reset a configured Ethernet device. */
+
 typedef void (*eth_promiscuous_enable_t)(struct rte_eth_dev *dev);
 /**< @internal Function used to enable the RX promiscuous mode of an Ethernet 
device. */
 
@@ -1435,6 +1438,7 @@ struct eth_dev_ops {
eth_dev_set_link_up_t  dev_set_link_up;   /**< Device link up. */
eth_dev_set_link_down_tdev_set_link_down; /**< Device link down. */
eth_dev_close_tdev_close; /**< Close device. */
+   eth_dev_reset_tdev_reset; /**< Reset device. */
eth_link_update_t  link_update;   /**< Get device link state. */
 
eth_promiscuous_enable_t   promiscuous_enable; /**< Promiscuous ON. */
@@ -2140,6 +2144,36 @@ int rte_eth_dev_set_link_down(uint8_t port_id);
 void rte_eth_dev_close(uint8_t port_id);
 
 /**
+ * Reset a Ethernet device and keep its port id.
+ *
+ * When a port has to be reset passively, the DPDK application can invoke
+ * this function. For example when a PF is reset, all its VFs should also
+ * be reset. Normally a DPDK application can invoke this function when
+ * RTE_ETH_EVENT_INTR_RESET event is detected, but can also use it to start
+ * a port reset in other circumstances.
+ *
+ * When this function is called, it first stops the port and then calls the
+ * PMD specific dev_uninit( ) and dev_init( ) to return the port to initial
+ * state, in which no Tx and Rx queues are setup, as if the port has been
+ * reset and not started. The port keeps the port id it had before the
+ * function call.
+ *
+ * After calling rte_eth_dev_reset( ), the application should use
+ * rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ),
+ * rte_eth_tx_queue_setup( ), and rte_eth_dev_start( )
+ * to reconfigure the device as appropriate.
+ *
+ * Note: To avoid unexpected behavior, the application should stop calling
+ * Tx and Rx functions before calling rte_eth_dev_reset( ). For thread
+ * safety, all these controlling functions should be called from the same
+ * thread.
+ *
+ * @param port_id
+ *   The port identifier of the Ethernet device.
+ */
+int rte_eth_dev_reset(uint8_t port_id);
+
+/**
  * Enable receipt in promiscuous mode for an Ethernet device.
  *
  * @param port_id
diff --git a/lib/librte_ether/rte_ether_version.map 
b/lib/librte_ether/rte_ether_version.map
index 4283728..e86d51e 100644
--- a/lib/librte_ether/rte_ether_version.map
+++ b/lib/librte_ether/rte_ether_version.map
@@ -155,6 +155,7 @@ DPDK_17.08 {
rte_eth_dev_adjust_nb_rx_tx_desc;
rte_flow_copy;
rte_flow_isolate;
+   rte_eth_dev_reset;
rte_tm_capabilities_get;
rte_tm_get_leaf_n

[dpdk-dev] [PATCH v9 4/5] app/testpmd: enhance command to test NIC reset

2017-07-23 Thread Wei Dai
When PF is reset, a message will show it and all its
VF need to be reset.
User can run the command "port reset port_id"
to reset the VF port and to keep same port id without
any configuration. Then user can run "port stop port_id"
and "port start port_id" to reconfigure its forwarding
mode and parmaters as previous ones.
To avoid crash, current forwarding should be stopped
before running "port reset port_id".

Signed-off-by: Wei Dai 
Tested-by: Yuan Peng 
Acked-by: Jingjing Wu 
---
 app/test-pmd/cmdline.c | 12 
 app/test-pmd/testpmd.c | 41 +
 app/test-pmd/testpmd.h |  1 +
 3 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c
index b1b36c1..e37d6d7 100644
--- a/app/test-pmd/cmdline.c
+++ b/app/test-pmd/cmdline.c
@@ -972,6 +972,8 @@ static void cmd_operate_port_parsed(void *parsed_result,
stop_port(RTE_PORT_ALL);
else if (!strcmp(res->name, "close"))
close_port(RTE_PORT_ALL);
+   else if (!strcmp(res->name, "reset"))
+   reset_port(RTE_PORT_ALL);
else
printf("Unknown parameter\n");
 }
@@ -981,14 +983,14 @@ cmdline_parse_token_string_t cmd_operate_port_all_cmd =
"port");
 cmdline_parse_token_string_t cmd_operate_port_all_port =
TOKEN_STRING_INITIALIZER(struct cmd_operate_port_result, name,
-   "start#stop#close");
+   "start#stop#close#reset");
 cmdline_parse_token_string_t cmd_operate_port_all_all =
TOKEN_STRING_INITIALIZER(struct cmd_operate_port_result, value, "all");
 
 cmdline_parse_inst_t cmd_operate_port = {
.f = cmd_operate_port_parsed,
.data = NULL,
-   .help_str = "port start|stop|close all: Start/Stop/Close all ports",
+   .help_str = "port start|stop|close all: Start/Stop/Close/Reset all 
ports",
.tokens = {
(void *)&cmd_operate_port_all_cmd,
(void *)&cmd_operate_port_all_port,
@@ -1016,6 +1018,8 @@ static void cmd_operate_specific_port_parsed(void 
*parsed_result,
stop_port(res->value);
else if (!strcmp(res->name, "close"))
close_port(res->value);
+   else if (!strcmp(res->name, "reset"))
+   reset_port(res->value);
else
printf("Unknown parameter\n");
 }
@@ -1025,7 +1029,7 @@ cmdline_parse_token_string_t 
cmd_operate_specific_port_cmd =
keyword, "port");
 cmdline_parse_token_string_t cmd_operate_specific_port_port =
TOKEN_STRING_INITIALIZER(struct cmd_operate_specific_port_result,
-   name, "start#stop#close");
+   name, "start#stop#close#reset");
 cmdline_parse_token_num_t cmd_operate_specific_port_id =
TOKEN_NUM_INITIALIZER(struct cmd_operate_specific_port_result,
value, UINT8);
@@ -1033,7 +1037,7 @@ cmdline_parse_token_num_t cmd_operate_specific_port_id =
 cmdline_parse_inst_t cmd_operate_specific_port = {
.f = cmd_operate_specific_port_parsed,
.data = NULL,
-   .help_str = "port start|stop|close : Start/Stop/Close port_id",
+   .help_str = "port start|stop|close : Start/Stop/Close/Reset 
port_id",
.tokens = {
(void *)&cmd_operate_specific_port_cmd,
(void *)&cmd_operate_specific_port_port,
diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c
index e754d12..59fc441 100644
--- a/app/test-pmd/testpmd.c
+++ b/app/test-pmd/testpmd.c
@@ -1683,6 +1683,47 @@ close_port(portid_t pid)
 }
 
 void
+reset_port(portid_t pid)
+{
+   int diag;
+   portid_t pi;
+   struct rte_port *port;
+
+   if (port_id_is_invalid(pid, ENABLED_WARN))
+   return;
+
+   printf("Resetting ports...\n");
+
+   RTE_ETH_FOREACH_DEV(pi) {
+   if (pid != pi && pid != (portid_t)RTE_PORT_ALL)
+   continue;
+
+   if (port_is_forwarding(pi) != 0 && test_done == 0) {
+   printf("Please remove port %d from forwarding "
+  "configuration.\n", pi);
+   continue;
+   }
+
+   if (port_is_bonding_slave(pi)) {
+   printf("Please remove port %d from bonded device.\n",
+  pi);
+   continue;
+   }
+
+   diag = rte_eth_dev_reset(pi);
+   if (diag == 0) {
+   port = &ports[pi];
+   port->need_reconfig = 1;
+   port->need_reconfig_queues = 1;
+   } else {
+   printf("Failed to reset port %d. diag=%d\n", pi, diag);

[dpdk-dev] [PATCH v9 5/5] doc: add description of the NIC reset API

2017-07-23 Thread Wei Dai
This patch add the description of NIC reset API in
doc/guides/prog_guide/poll_mode_drv.rst .
It explains why this API is needed, when it should
be called and some noticeable information.

Signed-off-by: Wei Dai 
Reviewed-by: Remy Horton 
---
 doc/guides/prog_guide/poll_mode_drv.rst | 41 +
 1 file changed, 41 insertions(+)

diff --git a/doc/guides/prog_guide/poll_mode_drv.rst 
b/doc/guides/prog_guide/poll_mode_drv.rst
index 1ac8f7e..7c95b3e 100644
--- a/doc/guides/prog_guide/poll_mode_drv.rst
+++ b/doc/guides/prog_guide/poll_mode_drv.rst
@@ -536,3 +536,44 @@ call. As an end result, the application is able to achieve 
its goal of
 monitoring a single statistic ("rx_errors" in this case), and if that shows
 packets being dropped, it can easily retrieve a "set" of statistics using the
 IDs array parameter to ``rte_eth_xstats_get_by_id`` function.
+
+NIC Reset API
+~
+
+.. code-block:: c
+
+int rte_eth_dev_reset(uint8_t port_id);
+
+Sometimes a port has to be reset passively. For example when a PF is
+reset, all its VFs should also be reset by the application to make them
+consistent with the PF. A DPDK application also can call this function
+to trigger a port reset. Normally, a DPDK application would invokes this
+function when an RTE_ETH_EVENT_INTR_RESET event is detected.
+
+It is the duty of the PMD to trigger RTE_ETH_EVENT_INTR_RESET events and
+the application should register a callback function to handle these
+events. When a PMD needs to trigger a reset, it can trigger an
+RTE_ETH_EVENT_INTR_RESET event. On receiving an RTE_ETH_EVENT_INTR_RESET
+event, applications can handle it as follows: Stop working queues, stop
+calling Rx and Tx functions, and then call rte_eth_dev_reset( ). For
+thread safety all these operations should be called from the same thread.
+
+For example when PF is reset, the PF sends a message to notify VFs of
+this event and also trigger an interrupt to VFs. Then in the interrupt
+service routine the VFs detects this notification message and calls
+_rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET, NULL,
+NULL). This means that a PF reset triggers an RTE_ETH_EVENT_INTR_RESET
+event within VFs. The function _rte_eth_dev_callback_process( ) will
+call the registered callback function. The callback function can trigger
+the application to handle all operations the VF reset requires including
+stopping Rx/Tx queues and calling rte_eth_dev_reset().
+
+The rte_eth_dev_reset( ) itself is a generic function which only does
+some hardware reset operations through calling dev_unint() and
+dev_init(), and itself does not handle synchronization, which is handled
+by application.
+
+The PMD itself should not call rte_eth_dev_reset( ). The PMD can trigger
+the application to handle reset event. It is duty of application to
+handle all synchronization before it calls rte_eth_dev_reset( ).
+
-- 
2.7.5



Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call

2017-07-23 Thread Sagi Grimberg

mlx5_tx_complete() polls completion queue multiple times until it
encounters an invalid entry. As Tx completions are suppressed by
MLX5_TX_COMP_THRESH, it is waste of cycles to expect multiple completions
in a poll. And freeing too many buffers in a call can cause high jitter.
This patch improves throughput a little.


What if the device generates burst of completions?

mlx5 PMD suppresses completions anyway. It requests a completion per every
MLX5_TX_COMP_THRESH Tx mbufs, not every single mbuf. So, the size of completion
queue is even much small.


Yes I realize that, but can't the device still complete in a burst (of
unsuppressed completions)? I mean its not guaranteed that for every
txq_complete a signaled completion is pending right? What happens if
the device has inconsistent completion pacing? Can't the sw grow a
batch of completions if txq_complete will process a single completion
unconditionally?


Holding these completions un-reaped can theoretically cause resource stress on
the corresponding mempool(s).

Can you make your point clearer? Do you think the "stress" can impact
performance? I think stress doesn't matter unless it is depleted. And app is
responsible for supplying enough mbufs considering the depth of all queues (max
# of outstanding mbufs).


I might be missing something, but # of outstanding mbufs should be
relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs right?
Why should the pool account for the entire TX queue depth (which can
be very large)?

Is there a hard requirement documented somewhere that the application
needs to account for the entire TX queue depths for sizing its mbuf
pool?

My question is with the proposed change, doesn't this mean that the 
application might need to allocate a bigger TX mbuf pool? Because the

pmd can theoretically consume completions slower (as in multiple TX
burst calls)?


I totally get the need for a stopping condition, but is "loop once"
the best stop condition?

Best for what?


Best condition to stop consuming TX completions. As I said, I think that
leaving TX completions un-reaped can (at least in theory) slow down the
mbuf reclamation, which impacts the application. (unless I'm not
understanding something fundamental)


Perhaps an adaptive budget (based on online stats) perform better?

Please bring up any suggestion or submit a patch if any.


I was simply providing a review for the patch. I don't have the time
to come up with a better patch unfortunately, but I still think its
fair to raise a point.


Does "budget" mean the
threshold? If so, calculation of stats for adaptive threshold can impact single
core performance. With multiple cores, adjusting threshold doesn't affect much.


If you look at mlx5e driver in the kernel, it maintains online stats on
its RX and TX queues. It maintain these stats mostly for adaptive
interrupt moderation control (but not only).

I was suggesting maintaining per TX queue stats on average completions
consumed for each TX burst call, and adjust the stopping condition
according to a calculated stat.


[dpdk-dev] [PATCH] doc: postpone unaccomplished deprecation notice

2017-07-23 Thread Shahaf Shuler
The work on ethdev Rx and Tx offloads has not been completed for 17.08.
It will be completed on 17.11

The notice is kept and postponed until the end of this work.

Signed-off-by: Shahaf Shuler 
---
 doc/guides/rel_notes/deprecation.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/deprecation.rst 
b/doc/guides/rel_notes/deprecation.rst
index 257dcba32..f7fd0e699 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -39,7 +39,7 @@ Deprecation Notices
   PKT_RX_QINQ_STRIPPED, that are better described. The old flags and
   their behavior will be kept until 17.05 and will be removed in 17.08.
 
-* ethdev: Tx offloads will no longer be enabled by default in 17.08.
+* ethdev: Tx offloads will no longer be enabled by default in 17.11.
   Instead, the ``rte_eth_txmode`` structure will be extended with
   bit field to enable each Tx offload.
   Besides of making the Rx/Tx configuration API more consistent for the
-- 
2.12.0



Re: [dpdk-dev] [PATCH v8 1/5] ethdev: add support of NIC reset

2017-07-23 Thread Dai, Wei
Many thanks, Remy.
My v9 patch set has followed your guidance.

> -Original Message-
> From: Horton, Remy
> Sent: Thursday, July 20, 2017 9:22 PM
> To: Dai, Wei ; dev@dpdk.org
> Cc: tho...@monjalon.net; Lu, Wenzhuo ;
> Ananyev, Konstantin ; Xing, Beilei
> ; Wu, Jingjing ; Mcnamara,
> John 
> Subject: Re: [dpdk-dev] [PATCH v8 1/5] ethdev: add support of NIC reset
> 
> See suggested wording inline
> 
> On 17/07/2017 16:14, Wei Dai wrote:
> > This patch adds a new eth_dev layer API function rte_eth_dev_reset().
> > A DPDK application can call this function to reset a NIC and keep
> > its port id afterwards.
> > It means that all SW resources allocated in ethdev layer should be
> > kept and SW and HW resources of the NIC in PMD need to be reset in
> > similar way that it runs PCI dev_uninit() and then dev_init().
> > The sequence of dev_uninit() and dev_init() can be packed into a
> > single interface API rte_eth_dev_reset().
> > Please See the comments before the declaration of rte_eht_dev_reset()
> > in lib/librte_ether/rte_ethdev.h to get more details on why this
> > function is needed, what it does, when it should be called
> > and what an application should do after calling this function.
> 
> This patch adds a new eth_dev layer API function rte_eth_dev_reset(),
> which a DPDK application can call to reset a NIC and keep its port id
> afterwards. It means that all software resources allocated in the ethdev
> layer are kept, and software & hardware resources of the NIC within the
> NIC's PMD are reset to a state simular to that obtained by calling the
> PCI dev_uninit() and then dev_init(). This effective sequence of
> dev_uninit() and dev_init() is packed into a single API function
> rte_eth_dev_reset().
> 
> Please see the comments before the declaration of rte_eht_dev_reset()
> in lib/librte_ether/rte_ethdev.h to get more details on why this
> function is needed, what it does, when it should be called
> and what an application should do after calling this function.
> 
> 
> > Signed-off-by: Wei Dai 
> > ---
> >  lib/librte_ether/rte_ethdev.c  | 17 +
> >  lib/librte_ether/rte_ethdev.h  | 33
> +
> >  lib/librte_ether/rte_ether_version.map |  1 +
> >  3 files changed, 51 insertions(+)
> 
> Reviewed-by: Remy Horton 
> 
> 
> 
> >  /**
> > + * Reset a Ethernet device and keep its port id.
> > + *
> > + * When a port has to be reset passively, the DPDK application can invoke
> this
> > + * function. For example a PF is reset, all its VFs should also be reset.
> > + * Normally, a DPDK application can invoke this function when
> > + * RTE_ETH_EVENT_INTR_RESET event is detected. A DPDK application
> can also call
> > + * this function to start an initiative port reset.
> 
> When a port has to be reset passively, the DPDK application can invoke
> this function. For example when a PF is reset, all its VFs should also
> be reset. Normally a DPDK application can invoke this function when
> RTE_ETH_EVENT_INTR_RESET event is detected, but can also use it to start
> a port reset in other circumstances.
> 
> > + * When this function is called, it first stops the port and then call PMD
> > + * specific dev_uninit( ) and  dev_init( ) to makes the port return to 
> > initial
> > + * status in which no any Tx queue and no Rx queue are setup and the port
> has
> > + * just be reset and not started. And the port keeps its port id after 
> > calling
> > + * this function.
> 
> When this function is called, it first stops the port and then calls the
> PMD specific dev_uninit( ) and dev_init( ) to return the port to initial
> state, in which no Tx and Rx queues are setup, as if the port has been
> reset and not started. The port keeps the port id it had before the
> function call.
> 
> > + * After calling rte_eth_dev_reset( ), the application should go through
> > + * rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ),
> > + * rte_eth_tx_queue_setup( ) and rte_eth_dev_start( ) again to restore
> > + * its previous settings or to reconfigure itself with different settings.
> 
> After calling rte_eth_dev_reset( ), the application should use
> rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ),
> rte_eth_tx_queue_setup( ), and rte_eth_dev_start( )
> to reconfigure the device as appropriate.
> 
> 
> > + * Note: to avoid unexpected behaviour, the application should stop
> calling Rx
> > + *   and Rx function before calling rte_eth_dev_reset( ).For thread
> safety,
> > + *   all these controlling operations had better be made in same
> thread.
> 
> Note: To avoid unexpected behavior, the application should stop calling
> Tx and Rx functions before calling rte_eth_dev_reset( ). For thread
> safety, all these controlling functions should be called from the same
> thread.
> 



Re: [dpdk-dev] [PATCH v1] net/i40e: fix sync phy type by adding retry

2017-07-23 Thread Wu, Jingjing


> -Original Message-
> From: Hunt, David
> Sent: Friday, July 21, 2017 10:41 PM
> To: dev@dpdk.org
> Cc: Wu, Jingjing ; Hunt, David 
> Subject: [PATCH v1] net/i40e: fix sync phy type by adding retry
> 
> Some phy's take longer than others to come up. Add a retry to give more phy's
> a chance to come up before returning an error.
> 
> Signed-off-by: David Hunt 

Patch looks fine, but you missed the Fixes line and it's better to Cc 
sta...@dpdk.org

Thanks
Jingjing


[dpdk-dev] LACP bond link broken chain due to the timeout

2017-07-23 Thread Zhangkun (K)
Hi, all

I use dpdk LACP bond for long time large flow test, there are LACP broken 
chain. The dpdk log is as follow:
Bond 1: slave id 0 distributing stopped.
Bond 1: slave id 1 distributing stopped.

Through the analysis of code , LACP protocol packets are handled by 
eal-intr-thread thread, at the same time, the thread will also deal with the 
other driver interrupt event and query the NIC card state .
And each polling thread is waiting for a long time that is leading to the 
switch disconnection timeout.

On the generic x86 OS server, has it been considered that the eal-intr-thread 
thread can 't get a timely scheduling, and that would result in LACP timeout 
and Link broken chain?
Or has it been considered that LACP bond would be put on a single thread, not 
shared with others?