Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
Thursday, July 20, 2017 7:22 PM, Sagi Grimberg: > >> As I said, there are primitives which are designed to handle frequent > >> reads and rare mutations. > > > > Even with such primitives, rarely lock is more than never lock. > > You do realize that the cache mutation involves ibv_dereg_mr() right? > Any locking scheme for mutation is negligible compared to that, and > rcu_read_lock is practically free. The fix for the issue you presented includes not to dereg MR on datapath. MR will be deregistered only on device close. MRs replace on cache can still happen, but it won't involve any de-registration. > > >>> It is one possibility discussed also with Mellanox guys, the point > >>> is this breaks the security point of view which is also an important > >>> stuff. > >> > >> What security aspect? The entire dpdk model builds on top of physical > >> addresses awareness running under root permissions. > >> I'm not saying > >> exposing it to the application nor granting remote permissions to the > >> physical space. > >> > >> mlx5_pmd is a network driver, and as a driver, it should allowed to > >> use the device dma lkey as it sees fit. I honestly think its pretty > >> much mandatory considering the work-arounds mlx5_pmd tries to do > >> (which we agreed are broken). > > > > True, There are many PMDs which can work only with physical memory. > > However Mellanox NICs have the option to work with virtual one thus > provide more security. > > I don't understand the security argument. Its completely private to the > driver. anything under librte is equivalent to an OS wrt networking, so I > fail to > see what is the security feature your talking about. You are correct that as a root you are able to do whatever you want on the server. The security I refer to is to protect against badly written code. The fact the PMD only registers the mempools, and use the device engine to translate the VA, provide some protection. For example, one DPDK process will not be able to access the memory of other DPDK process *by mistake*. I am not saying using the reserved lkey is not a good suggestion, and we plan to test its value. All I am saying is there are maybe other option to provide the same performance with the extra protection mentioned above. One of them can be to use indirect keys. One indirect key to represent 64b memory area, and other regular keys for the hugepages. The rest of the memory area can be filled with keys pointing to /dev/null. > > > The fact running under root doesn't mean you have privileges to access > every physical page on the server (even if you try very hard to be aware). > > But dpdk core mbufs structs are built this way. > > > The issue here, AFAIU, is performance. > > We are now looking into ways to provide the same performance as if it was > only a single lkey, while preserving the security feature. > > Hmm, What exactly do you have in mind? > > I'm hoping that you are not referring to ODP. If you are, I think that latency > unpredictability would be a huge non-starter, page-faults are way too > expensive for dpdk users. No ODP :). As all relevant DPDK memory is on top of hugepages, there is no reason to avoid registration and pinning in advance. > > Again, personally, I don't think that using virtual memory registration are > very > useful for a network driver, It's a driver, not an application. > > >>> If this is added in the future it will certainly be as an option, > >>> this way both will be possible, the application could then choose > >>> about security vs performance. > >> > >> Why should the application even be aware of that? Does any other > >> driver expose the user how it maps pkt mbufs to the NIC? Just like > >> the MR handling, its 100% internal to mlx5 and no reason why the user > >> should ever be exposed to any of these details. > > > > Other option is the reserved lkey as you suggested, but it will lose the > security guarantees. > > OK, obviously I'm missing something. Can you articulate what do you mean > by "security guarantees"? > > > Like every performance optimization it should be the application decision. > > I tend to disagree with you on this. The application must never be aware of > the nitty details of the underlying driver implementation. In fact, > propagating > this information to the application sounds like a layering violation. > > > In fact, there are some discussions on the ML of exposing the option > > to use va instead of pa. [1] > > My understanding is that the correct proposal was to hide this knowledge > from the user. Also the use-case is not security (which I'm not even sure > what it means yet). > > >>> This is also a in progress in PMD part, it should be part of the > >>> next DPDK release. > >> > >> That is *very* good to hear! Can you guys share a branch? I'm willing > >> to take it for testing. > > > > The branch is still pre-mature. It may be good enough for external testing > > in > about two weeks. > > Contac
Re: [dpdk-dev] Question on mlx5 PMD txq memory registration
I don't understand the security argument. Its completely private to the driver. anything under librte is equivalent to an OS wrt networking, so I fail to see what is the security feature your talking about. You are correct that as a root you are able to do whatever you want on the server. The security I refer to is to protect against badly written code. The fact the PMD only registers the mempools, and use the device engine to translate the VA, provide some protection. For example, one DPDK process will not be able to access the memory of other DPDK process *by mistake*. Well, this is a fair argument, but without a *complete* solution for all of dpdk peripherals, it has very little merit (if at all). A badly written code can just as easily crash a server by passing a mbuf to a crypto device or another network device that co-exists with mlx5. So, while I understand the argument, I think its value is not worth the hassle that mlx5_pmd needs to take to achieve it. Did this come from a real requirement (from a real implementation)? I am not saying using the reserved lkey is not a good suggestion, and we plan to test its value. All I am saying is there are maybe other option to provide the same performance with the extra protection mentioned above. One of them can be to use indirect keys. One indirect key to represent 64b memory area, and other regular keys for the hugepages. The rest of the memory area can be filled with keys pointing to /dev/null. If I understand what you are suggesting, this would trigger out-of-order transfers on an indirect memory key just about always (each transfer can originate from a different hugepage and SGL resolution alone will require a walk on the memory key context SGL list). I'm afraid this would introduce a very bad performance scaling due to the fact that a SGL context (klm) will need to be fetched from the ICM for essentially every send operation. Having said that, its just my 2 cents, if your solution works then I don't really care. You are the one testing it... The fact running under root doesn't mean you have privileges to access every physical page on the server (even if you try very hard to be aware). But dpdk core mbufs structs are built this way. The issue here, AFAIU, is performance. We are now looking into ways to provide the same performance as if it was only a single lkey, while preserving the security feature. Hmm, What exactly do you have in mind? I'm hoping that you are not referring to ODP. If you are, I think that latency unpredictability would be a huge non-starter, page-faults are way too expensive for dpdk users. No ODP :). As all relevant DPDK memory is on top of hugepages, there is no reason to avoid registration and pinning in advance. Agree.
[dpdk-dev] [PATCH v9 3/5] net/i40e: add support of reset
Reset a NIC by calling dev_uninit() and then dev_init(). Go through the same way in NIC PCI remove without release of ethdev resource and then NIC PCI probe function without ethdev resource allocation. Signed-off-by: Wei Dai --- drivers/net/i40e/i40e_ethdev.c| 28 drivers/net/i40e/i40e_ethdev_vf.c | 19 +++ 2 files changed, 47 insertions(+) diff --git a/drivers/net/i40e/i40e_ethdev.c b/drivers/net/i40e/i40e_ethdev.c index 9fcccda..1641e00 100644 --- a/drivers/net/i40e/i40e_ethdev.c +++ b/drivers/net/i40e/i40e_ethdev.c @@ -250,6 +250,7 @@ static int i40e_dev_configure(struct rte_eth_dev *dev); static int i40e_dev_start(struct rte_eth_dev *dev); static void i40e_dev_stop(struct rte_eth_dev *dev); static void i40e_dev_close(struct rte_eth_dev *dev); +static int i40e_dev_reset(struct rte_eth_dev *dev); static void i40e_dev_promiscuous_enable(struct rte_eth_dev *dev); static void i40e_dev_promiscuous_disable(struct rte_eth_dev *dev); static void i40e_dev_allmulticast_enable(struct rte_eth_dev *dev); @@ -449,6 +450,7 @@ static const struct eth_dev_ops i40e_eth_dev_ops = { .dev_start= i40e_dev_start, .dev_stop = i40e_dev_stop, .dev_close= i40e_dev_close, + .dev_reset= i40e_dev_reset, .promiscuous_enable = i40e_dev_promiscuous_enable, .promiscuous_disable = i40e_dev_promiscuous_disable, .allmulticast_enable = i40e_dev_allmulticast_enable, @@ -2155,6 +2157,32 @@ i40e_dev_close(struct rte_eth_dev *dev) I40E_WRITE_FLUSH(hw); } +/* + * Reest PF device only to re-initialize resources in PMD layer + */ +static int +i40e_dev_reset(struct rte_eth_dev *dev) +{ + int ret; + + /* When a DPDK PMD PF begin to reset PF port, it should notify all +* its VF to make them align with it. The detailed notification +* mechanism is PMD specific. As to i40e PF, it is rather complex. +* To avoid unexpected behavior in VF, currently reset of PF with +* SR-IOV activation is not supported. It might be supported later. +*/ + if (dev->data->sriov.active) + return -ENOTSUP; + + ret = eth_i40e_dev_uninit(dev); + if (ret) + return ret; + + ret = eth_i40e_dev_init(dev); + + return ret; +} + static void i40e_dev_promiscuous_enable(struct rte_eth_dev *dev) { diff --git a/drivers/net/i40e/i40e_ethdev_vf.c b/drivers/net/i40e/i40e_ethdev_vf.c index f6d8293..f951d4e 100644 --- a/drivers/net/i40e/i40e_ethdev_vf.c +++ b/drivers/net/i40e/i40e_ethdev_vf.c @@ -122,6 +122,7 @@ static void i40evf_vlan_offload_set(struct rte_eth_dev *dev, int mask); static int i40evf_vlan_pvid_set(struct rte_eth_dev *dev, uint16_t pvid, int on); static void i40evf_dev_close(struct rte_eth_dev *dev); +static int i40evf_dev_reset(struct rte_eth_dev *dev); static void i40evf_dev_promiscuous_enable(struct rte_eth_dev *dev); static void i40evf_dev_promiscuous_disable(struct rte_eth_dev *dev); static void i40evf_dev_allmulticast_enable(struct rte_eth_dev *dev); @@ -203,6 +204,7 @@ static const struct eth_dev_ops i40evf_eth_dev_ops = { .xstats_get_names = i40evf_dev_xstats_get_names, .xstats_reset = i40evf_dev_xstats_reset, .dev_close= i40evf_dev_close, + .dev_reset= i40evf_dev_reset, .dev_infos_get= i40evf_dev_info_get, .dev_supported_ptypes_get = i40e_dev_supported_ptypes_get, .vlan_filter_set = i40evf_vlan_filter_set, @@ -2373,6 +2375,23 @@ i40evf_dev_close(struct rte_eth_dev *dev) i40evf_disable_irq0(hw); } +/* + * Reest VF device only to re-initialize resources in PMD layer + */ +static int +i40evf_dev_reset(struct rte_eth_dev *dev) +{ + int ret; + + ret = i40evf_dev_uninit(dev); + if (ret) + return ret; + + ret = i40evf_dev_init(dev); + + return ret; +} + static int i40evf_get_rss_lut(struct i40e_vsi *vsi, uint8_t *lut, uint16_t lut_size) { -- 2.7.5
[dpdk-dev] [PATCH v9 2/5] net/ixgbe: add support of reset
Reset a NIC by calling dev_uninit and then dev_init. Go through same way in NIC PCI remove without release of ethdev resource and then NIC PCI probe function without ethdev resource allocation. Signed-off-by: Wei Dai --- drivers/net/ixgbe/ixgbe_ethdev.c | 47 1 file changed, 47 insertions(+) diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c index 194058f..3f176fe 100644 --- a/drivers/net/ixgbe/ixgbe_ethdev.c +++ b/drivers/net/ixgbe/ixgbe_ethdev.c @@ -169,6 +169,7 @@ static void ixgbe_dev_stop(struct rte_eth_dev *dev); static int ixgbe_dev_set_link_up(struct rte_eth_dev *dev); static int ixgbe_dev_set_link_down(struct rte_eth_dev *dev); static void ixgbe_dev_close(struct rte_eth_dev *dev); +static int ixgbe_dev_reset(struct rte_eth_dev *dev); static void ixgbe_dev_promiscuous_enable(struct rte_eth_dev *dev); static void ixgbe_dev_promiscuous_disable(struct rte_eth_dev *dev); static void ixgbe_dev_allmulticast_enable(struct rte_eth_dev *dev); @@ -265,6 +266,7 @@ static int ixgbevf_dev_link_update(struct rte_eth_dev *dev, int wait_to_complete); static void ixgbevf_dev_stop(struct rte_eth_dev *dev); static void ixgbevf_dev_close(struct rte_eth_dev *dev); +static int ixgbevf_dev_reset(struct rte_eth_dev *dev); static void ixgbevf_intr_disable(struct ixgbe_hw *hw); static void ixgbevf_intr_enable(struct ixgbe_hw *hw); static void ixgbevf_dev_stats_get(struct rte_eth_dev *dev, @@ -523,6 +525,7 @@ static const struct eth_dev_ops ixgbe_eth_dev_ops = { .dev_set_link_up= ixgbe_dev_set_link_up, .dev_set_link_down = ixgbe_dev_set_link_down, .dev_close= ixgbe_dev_close, + .dev_reset= ixgbe_dev_reset, .promiscuous_enable = ixgbe_dev_promiscuous_enable, .promiscuous_disable = ixgbe_dev_promiscuous_disable, .allmulticast_enable = ixgbe_dev_allmulticast_enable, @@ -613,6 +616,7 @@ static const struct eth_dev_ops ixgbevf_eth_dev_ops = { .xstats_reset = ixgbevf_dev_stats_reset, .xstats_get_names = ixgbevf_dev_xstats_get_names, .dev_close= ixgbevf_dev_close, + .dev_reset= ixgbevf_dev_reset, .allmulticast_enable = ixgbevf_dev_allmulticast_enable, .allmulticast_disable = ixgbevf_dev_allmulticast_disable, .dev_infos_get= ixgbevf_dev_info_get, @@ -2857,6 +2861,32 @@ ixgbe_dev_close(struct rte_eth_dev *dev) ixgbe_set_rar(hw, 0, hw->mac.addr, 0, IXGBE_RAH_AV); } +/* + * Reest PF device. + */ +static int +ixgbe_dev_reset(struct rte_eth_dev *dev) +{ + int ret; + + /* When a DPDK PMD PF begin to reset PF port, it should notify all +* its VF to make them align with it. The detailed notification +* mechanism is PMD specific. As to ixgbe PF, it is rather complex. +* To avoid unexpected behavior in VF, currently reset of PF with +* SR-IOV activation is not supported. It might be supported later. +*/ + if (dev->data->sriov.active) + return -ENOTSUP; + + ret = eth_ixgbe_dev_uninit(dev); + if (ret) + return ret; + + ret = eth_ixgbe_dev_init(dev); + + return ret; +} + static void ixgbe_read_stats_registers(struct ixgbe_hw *hw, struct ixgbe_hw_stats *hw_stats, @@ -5058,6 +5088,23 @@ ixgbevf_dev_close(struct rte_eth_dev *dev) ixgbevf_remove_mac_addr(dev, 0); } +/* + * Reset VF device + */ +static int +ixgbevf_dev_reset(struct rte_eth_dev *dev) +{ + int ret; + + ret = eth_ixgbevf_dev_uninit(dev); + if (ret) + return ret; + + ret = eth_ixgbevf_dev_init(dev); + + return ret; +} + static void ixgbevf_set_vfta_all(struct rte_eth_dev *dev, bool on) { struct ixgbe_hw *hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private); -- 2.7.5
[dpdk-dev] [PATCH v9 0/5] Support of NIC reset and keep same port id
This patch set adds a function rte_eth_dev_reset( ) in rte_ethdev layer. Sometimes a port have to be reset passively. For example a PF is reset, all its VFs should also be reset by application itself to align with the PF. A DPDK application also can call this function to trigger an initative port reset. When processing reset, if the port goes through PCI remove() and then PCI probe() for restoration, its port id may be changed and this is not expected by some DPDK application. Normally, PCI probe() includes two parts: one is in rte_ethdev layer to allocate resource in rte_ethdev layer and the other is calling PMD specific dev_init() to allocate and initialize resource in PMD layer. PCI remove( ) releases all resource allocated from rte_ethdev layer in PCI probe( ) and calls PMD specific dev_uninit( ) to releaase resource allocated by dev_init( ) in PMD layer. To keep same port id and reset the port, only dev_uninit() and dev_init( ) in PMD can be called and keep all resources allocated from rte_ethdev layer poart in PCI probe( ). All these are what rte_eth_dev_reset() does. The rte_eth_dev_reset( ) calls rte_eth_dev_stop( ), PMD dev_uninit( ) and then PMD dev_init( ) to reset a port and keep same port id. After calling rte_eth_dev_reset( ), the application can go through rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ), rte_eth_tx_queue_setup( ) and rte_eth_dev_start( ) again to restore its previous settings or to reconfigure itself with different settings. To test this new feature, a testpmd command "port reset port_id" is added. The mapping between port number and its PCI address can be monitored to confirm its port number is kept. And following test case can also be used to confirm the port can work again after reset. A typical test steps are listed as follows: For example, run "ifconfig PF-name down" will trigger a reset to VF. 1. run testpmd with 2 ixgbe VF ports belonging to same PF 2. testpmd > set verbose 1 //to observe VF working 3. testpmd > show port info all //show port number and MAC addr 4. testpmd > start 5. let all ports forwarding work for a while 6. testpmd > show port stats all 7. ifconfig name-of-PF down 8. A message is shown in testmd to indicate PF reset 9. ifconfig name-of-PF up 10. testpmd > stop // stop forwarding to avoid crash during reset 11. testpmd > port reset all 12. testpmd > port stop all 13. testpmd > port start all //recofnig all ports 14. testpmd > show port info all //get mapping of port id and MAC addr for forwarding 15. testpmd > start // restore forwarding 14. let all ports forwarding work for a while 15. testpmd > show port stats all //confirm all port can work again 16. repeat above step 7 - 15 chagnes: v9: rebase to master branch modify some text after it is reviewed by an English native speaker v8: modify the comments before the declaration of rte_eth_dev_reset( ) to make doc generated from doxygen be concise. v7: add description of NIC reset in doc/poll_mode_drv.rst v6: add more comments to explain why the rte_eth_dev_reset( ) is needeed, when it is called and what the application should do after calling it. add more comments to explain why reset of PF with SR-IOV is not supported currently. v5: remove PCI address output to align with other modification which will output it in other way disable PF reset when its VF is ative to avoid unexpected VF behavior v4: add PCI address to confirm its port number keep same correct test method in cover letter v3: update testpmd command v2: only reset PMD layer resource and keep same port id, but not restore settings Signed-off-by: Wei Dai Tested-by: Yuan Peng Acked-by: Jingjing Wu Reviewed-by: Remy Horton Wei Dai (5): ethdev: add support of NIC reset net/ixgbe: add support of reset net/i40e: add support of reset app/testpmd: enhance command to test NIC reset doc: add description of the NIC reset API app/test-pmd/cmdline.c | 12 ++--- app/test-pmd/testpmd.c | 41 app/test-pmd/testpmd.h | 1 + doc/guides/prog_guide/poll_mode_drv.rst | 41 drivers/net/i40e/i40e_ethdev.c | 28 drivers/net/i40e/i40e_ethdev_vf.c | 19 + drivers/net/ixgbe/ixgbe_ethdev.c| 47 + lib/librte_ether/rte_ethdev.c | 17 lib/librte_ether/rte_ethdev.h | 34 lib/librte_ether/rte_ether_version.map | 1 + 10 files changed, 237 insertions(+), 4 deletions(-) -- 2.7.5
[dpdk-dev] [PATCH v9 1/5] ethdev: add support of NIC reset
This patch adds a new eth_dev layer API function rte_eth_dev_reset(), which a DPDK application can call to reset a NIC and keep its port id afterwards. It means that all software resources allocated in the ethdev layer are kept, and software & hardware resources of the NIC within the NIC's PMD are reset to a state simular to that obtained by calling the PCI dev_uninit() and then dev_init(). This effective sequence of dev_uninit() and dev_init() is packed into a single API function rte_eth_dev_reset(). Please see the comments before the declaration of rte_eht_dev_reset() in lib/librte_ether/rte_ethdev.h to get more details on why this function is needed, what it does, when it should be called and what an application should do after calling this function. Signed-off-by: Wei Dai Reviewed-by: Remy Horton --- lib/librte_ether/rte_ethdev.c | 17 + lib/librte_ether/rte_ethdev.h | 34 ++ lib/librte_ether/rte_ether_version.map | 1 + 3 files changed, 52 insertions(+) diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c index d4ebb1b..68ba64d 100644 --- a/lib/librte_ether/rte_ethdev.c +++ b/lib/librte_ether/rte_ethdev.c @@ -1004,6 +1004,23 @@ rte_eth_dev_close(uint8_t port_id) } int +rte_eth_dev_reset(uint8_t port_id) +{ + struct rte_eth_dev *dev; + int ret; + + RTE_ETH_VALID_PORTID_OR_ERR_RET(port_id, -EINVAL); + dev = &rte_eth_devices[port_id]; + + RTE_FUNC_PTR_OR_ERR_RET(*dev->dev_ops->dev_reset, -ENOTSUP); + + rte_eth_dev_stop(port_id); + ret = dev->dev_ops->dev_reset(dev); + + return ret; +} + +int rte_eth_rx_queue_setup(uint8_t port_id, uint16_t rx_queue_id, uint16_t nb_rx_desc, unsigned int socket_id, const struct rte_eth_rxconf *rx_conf, diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h index 0e99090..fde92a1 100644 --- a/lib/librte_ether/rte_ethdev.h +++ b/lib/librte_ether/rte_ethdev.h @@ -1115,6 +1115,9 @@ typedef int (*eth_dev_set_link_down_t)(struct rte_eth_dev *dev); typedef void (*eth_dev_close_t)(struct rte_eth_dev *dev); /**< @internal Function used to close a configured Ethernet device. */ +typedef int (*eth_dev_reset_t)(struct rte_eth_dev *dev); +/** <@internal Function used to reset a configured Ethernet device. */ + typedef void (*eth_promiscuous_enable_t)(struct rte_eth_dev *dev); /**< @internal Function used to enable the RX promiscuous mode of an Ethernet device. */ @@ -1435,6 +1438,7 @@ struct eth_dev_ops { eth_dev_set_link_up_t dev_set_link_up; /**< Device link up. */ eth_dev_set_link_down_tdev_set_link_down; /**< Device link down. */ eth_dev_close_tdev_close; /**< Close device. */ + eth_dev_reset_tdev_reset; /**< Reset device. */ eth_link_update_t link_update; /**< Get device link state. */ eth_promiscuous_enable_t promiscuous_enable; /**< Promiscuous ON. */ @@ -2140,6 +2144,36 @@ int rte_eth_dev_set_link_down(uint8_t port_id); void rte_eth_dev_close(uint8_t port_id); /** + * Reset a Ethernet device and keep its port id. + * + * When a port has to be reset passively, the DPDK application can invoke + * this function. For example when a PF is reset, all its VFs should also + * be reset. Normally a DPDK application can invoke this function when + * RTE_ETH_EVENT_INTR_RESET event is detected, but can also use it to start + * a port reset in other circumstances. + * + * When this function is called, it first stops the port and then calls the + * PMD specific dev_uninit( ) and dev_init( ) to return the port to initial + * state, in which no Tx and Rx queues are setup, as if the port has been + * reset and not started. The port keeps the port id it had before the + * function call. + * + * After calling rte_eth_dev_reset( ), the application should use + * rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ), + * rte_eth_tx_queue_setup( ), and rte_eth_dev_start( ) + * to reconfigure the device as appropriate. + * + * Note: To avoid unexpected behavior, the application should stop calling + * Tx and Rx functions before calling rte_eth_dev_reset( ). For thread + * safety, all these controlling functions should be called from the same + * thread. + * + * @param port_id + * The port identifier of the Ethernet device. + */ +int rte_eth_dev_reset(uint8_t port_id); + +/** * Enable receipt in promiscuous mode for an Ethernet device. * * @param port_id diff --git a/lib/librte_ether/rte_ether_version.map b/lib/librte_ether/rte_ether_version.map index 4283728..e86d51e 100644 --- a/lib/librte_ether/rte_ether_version.map +++ b/lib/librte_ether/rte_ether_version.map @@ -155,6 +155,7 @@ DPDK_17.08 { rte_eth_dev_adjust_nb_rx_tx_desc; rte_flow_copy; rte_flow_isolate; + rte_eth_dev_reset; rte_tm_capabilities_get; rte_tm_get_leaf_n
[dpdk-dev] [PATCH v9 4/5] app/testpmd: enhance command to test NIC reset
When PF is reset, a message will show it and all its VF need to be reset. User can run the command "port reset port_id" to reset the VF port and to keep same port id without any configuration. Then user can run "port stop port_id" and "port start port_id" to reconfigure its forwarding mode and parmaters as previous ones. To avoid crash, current forwarding should be stopped before running "port reset port_id". Signed-off-by: Wei Dai Tested-by: Yuan Peng Acked-by: Jingjing Wu --- app/test-pmd/cmdline.c | 12 app/test-pmd/testpmd.c | 41 + app/test-pmd/testpmd.h | 1 + 3 files changed, 50 insertions(+), 4 deletions(-) diff --git a/app/test-pmd/cmdline.c b/app/test-pmd/cmdline.c index b1b36c1..e37d6d7 100644 --- a/app/test-pmd/cmdline.c +++ b/app/test-pmd/cmdline.c @@ -972,6 +972,8 @@ static void cmd_operate_port_parsed(void *parsed_result, stop_port(RTE_PORT_ALL); else if (!strcmp(res->name, "close")) close_port(RTE_PORT_ALL); + else if (!strcmp(res->name, "reset")) + reset_port(RTE_PORT_ALL); else printf("Unknown parameter\n"); } @@ -981,14 +983,14 @@ cmdline_parse_token_string_t cmd_operate_port_all_cmd = "port"); cmdline_parse_token_string_t cmd_operate_port_all_port = TOKEN_STRING_INITIALIZER(struct cmd_operate_port_result, name, - "start#stop#close"); + "start#stop#close#reset"); cmdline_parse_token_string_t cmd_operate_port_all_all = TOKEN_STRING_INITIALIZER(struct cmd_operate_port_result, value, "all"); cmdline_parse_inst_t cmd_operate_port = { .f = cmd_operate_port_parsed, .data = NULL, - .help_str = "port start|stop|close all: Start/Stop/Close all ports", + .help_str = "port start|stop|close all: Start/Stop/Close/Reset all ports", .tokens = { (void *)&cmd_operate_port_all_cmd, (void *)&cmd_operate_port_all_port, @@ -1016,6 +1018,8 @@ static void cmd_operate_specific_port_parsed(void *parsed_result, stop_port(res->value); else if (!strcmp(res->name, "close")) close_port(res->value); + else if (!strcmp(res->name, "reset")) + reset_port(res->value); else printf("Unknown parameter\n"); } @@ -1025,7 +1029,7 @@ cmdline_parse_token_string_t cmd_operate_specific_port_cmd = keyword, "port"); cmdline_parse_token_string_t cmd_operate_specific_port_port = TOKEN_STRING_INITIALIZER(struct cmd_operate_specific_port_result, - name, "start#stop#close"); + name, "start#stop#close#reset"); cmdline_parse_token_num_t cmd_operate_specific_port_id = TOKEN_NUM_INITIALIZER(struct cmd_operate_specific_port_result, value, UINT8); @@ -1033,7 +1037,7 @@ cmdline_parse_token_num_t cmd_operate_specific_port_id = cmdline_parse_inst_t cmd_operate_specific_port = { .f = cmd_operate_specific_port_parsed, .data = NULL, - .help_str = "port start|stop|close : Start/Stop/Close port_id", + .help_str = "port start|stop|close : Start/Stop/Close/Reset port_id", .tokens = { (void *)&cmd_operate_specific_port_cmd, (void *)&cmd_operate_specific_port_port, diff --git a/app/test-pmd/testpmd.c b/app/test-pmd/testpmd.c index e754d12..59fc441 100644 --- a/app/test-pmd/testpmd.c +++ b/app/test-pmd/testpmd.c @@ -1683,6 +1683,47 @@ close_port(portid_t pid) } void +reset_port(portid_t pid) +{ + int diag; + portid_t pi; + struct rte_port *port; + + if (port_id_is_invalid(pid, ENABLED_WARN)) + return; + + printf("Resetting ports...\n"); + + RTE_ETH_FOREACH_DEV(pi) { + if (pid != pi && pid != (portid_t)RTE_PORT_ALL) + continue; + + if (port_is_forwarding(pi) != 0 && test_done == 0) { + printf("Please remove port %d from forwarding " + "configuration.\n", pi); + continue; + } + + if (port_is_bonding_slave(pi)) { + printf("Please remove port %d from bonded device.\n", + pi); + continue; + } + + diag = rte_eth_dev_reset(pi); + if (diag == 0) { + port = &ports[pi]; + port->need_reconfig = 1; + port->need_reconfig_queues = 1; + } else { + printf("Failed to reset port %d. diag=%d\n", pi, diag);
[dpdk-dev] [PATCH v9 5/5] doc: add description of the NIC reset API
This patch add the description of NIC reset API in doc/guides/prog_guide/poll_mode_drv.rst . It explains why this API is needed, when it should be called and some noticeable information. Signed-off-by: Wei Dai Reviewed-by: Remy Horton --- doc/guides/prog_guide/poll_mode_drv.rst | 41 + 1 file changed, 41 insertions(+) diff --git a/doc/guides/prog_guide/poll_mode_drv.rst b/doc/guides/prog_guide/poll_mode_drv.rst index 1ac8f7e..7c95b3e 100644 --- a/doc/guides/prog_guide/poll_mode_drv.rst +++ b/doc/guides/prog_guide/poll_mode_drv.rst @@ -536,3 +536,44 @@ call. As an end result, the application is able to achieve its goal of monitoring a single statistic ("rx_errors" in this case), and if that shows packets being dropped, it can easily retrieve a "set" of statistics using the IDs array parameter to ``rte_eth_xstats_get_by_id`` function. + +NIC Reset API +~ + +.. code-block:: c + +int rte_eth_dev_reset(uint8_t port_id); + +Sometimes a port has to be reset passively. For example when a PF is +reset, all its VFs should also be reset by the application to make them +consistent with the PF. A DPDK application also can call this function +to trigger a port reset. Normally, a DPDK application would invokes this +function when an RTE_ETH_EVENT_INTR_RESET event is detected. + +It is the duty of the PMD to trigger RTE_ETH_EVENT_INTR_RESET events and +the application should register a callback function to handle these +events. When a PMD needs to trigger a reset, it can trigger an +RTE_ETH_EVENT_INTR_RESET event. On receiving an RTE_ETH_EVENT_INTR_RESET +event, applications can handle it as follows: Stop working queues, stop +calling Rx and Tx functions, and then call rte_eth_dev_reset( ). For +thread safety all these operations should be called from the same thread. + +For example when PF is reset, the PF sends a message to notify VFs of +this event and also trigger an interrupt to VFs. Then in the interrupt +service routine the VFs detects this notification message and calls +_rte_eth_dev_callback_process(dev, RTE_ETH_EVENT_INTR_RESET, NULL, +NULL). This means that a PF reset triggers an RTE_ETH_EVENT_INTR_RESET +event within VFs. The function _rte_eth_dev_callback_process( ) will +call the registered callback function. The callback function can trigger +the application to handle all operations the VF reset requires including +stopping Rx/Tx queues and calling rte_eth_dev_reset(). + +The rte_eth_dev_reset( ) itself is a generic function which only does +some hardware reset operations through calling dev_unint() and +dev_init(), and itself does not handle synchronization, which is handled +by application. + +The PMD itself should not call rte_eth_dev_reset( ). The PMD can trigger +the application to handle reset event. It is duty of application to +handle all synchronization before it calls rte_eth_dev_reset( ). + -- 2.7.5
Re: [dpdk-dev] [PATCH] net/mlx5: poll completion queue once per a call
mlx5_tx_complete() polls completion queue multiple times until it encounters an invalid entry. As Tx completions are suppressed by MLX5_TX_COMP_THRESH, it is waste of cycles to expect multiple completions in a poll. And freeing too many buffers in a call can cause high jitter. This patch improves throughput a little. What if the device generates burst of completions? mlx5 PMD suppresses completions anyway. It requests a completion per every MLX5_TX_COMP_THRESH Tx mbufs, not every single mbuf. So, the size of completion queue is even much small. Yes I realize that, but can't the device still complete in a burst (of unsuppressed completions)? I mean its not guaranteed that for every txq_complete a signaled completion is pending right? What happens if the device has inconsistent completion pacing? Can't the sw grow a batch of completions if txq_complete will process a single completion unconditionally? Holding these completions un-reaped can theoretically cause resource stress on the corresponding mempool(s). Can you make your point clearer? Do you think the "stress" can impact performance? I think stress doesn't matter unless it is depleted. And app is responsible for supplying enough mbufs considering the depth of all queues (max # of outstanding mbufs). I might be missing something, but # of outstanding mbufs should be relatively small as the pmd reaps every MLX5_TX_COMP_THRESH mbufs right? Why should the pool account for the entire TX queue depth (which can be very large)? Is there a hard requirement documented somewhere that the application needs to account for the entire TX queue depths for sizing its mbuf pool? My question is with the proposed change, doesn't this mean that the application might need to allocate a bigger TX mbuf pool? Because the pmd can theoretically consume completions slower (as in multiple TX burst calls)? I totally get the need for a stopping condition, but is "loop once" the best stop condition? Best for what? Best condition to stop consuming TX completions. As I said, I think that leaving TX completions un-reaped can (at least in theory) slow down the mbuf reclamation, which impacts the application. (unless I'm not understanding something fundamental) Perhaps an adaptive budget (based on online stats) perform better? Please bring up any suggestion or submit a patch if any. I was simply providing a review for the patch. I don't have the time to come up with a better patch unfortunately, but I still think its fair to raise a point. Does "budget" mean the threshold? If so, calculation of stats for adaptive threshold can impact single core performance. With multiple cores, adjusting threshold doesn't affect much. If you look at mlx5e driver in the kernel, it maintains online stats on its RX and TX queues. It maintain these stats mostly for adaptive interrupt moderation control (but not only). I was suggesting maintaining per TX queue stats on average completions consumed for each TX burst call, and adjust the stopping condition according to a calculated stat.
[dpdk-dev] [PATCH] doc: postpone unaccomplished deprecation notice
The work on ethdev Rx and Tx offloads has not been completed for 17.08. It will be completed on 17.11 The notice is kept and postponed until the end of this work. Signed-off-by: Shahaf Shuler --- doc/guides/rel_notes/deprecation.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst index 257dcba32..f7fd0e699 100644 --- a/doc/guides/rel_notes/deprecation.rst +++ b/doc/guides/rel_notes/deprecation.rst @@ -39,7 +39,7 @@ Deprecation Notices PKT_RX_QINQ_STRIPPED, that are better described. The old flags and their behavior will be kept until 17.05 and will be removed in 17.08. -* ethdev: Tx offloads will no longer be enabled by default in 17.08. +* ethdev: Tx offloads will no longer be enabled by default in 17.11. Instead, the ``rte_eth_txmode`` structure will be extended with bit field to enable each Tx offload. Besides of making the Rx/Tx configuration API more consistent for the -- 2.12.0
Re: [dpdk-dev] [PATCH v8 1/5] ethdev: add support of NIC reset
Many thanks, Remy. My v9 patch set has followed your guidance. > -Original Message- > From: Horton, Remy > Sent: Thursday, July 20, 2017 9:22 PM > To: Dai, Wei ; dev@dpdk.org > Cc: tho...@monjalon.net; Lu, Wenzhuo ; > Ananyev, Konstantin ; Xing, Beilei > ; Wu, Jingjing ; Mcnamara, > John > Subject: Re: [dpdk-dev] [PATCH v8 1/5] ethdev: add support of NIC reset > > See suggested wording inline > > On 17/07/2017 16:14, Wei Dai wrote: > > This patch adds a new eth_dev layer API function rte_eth_dev_reset(). > > A DPDK application can call this function to reset a NIC and keep > > its port id afterwards. > > It means that all SW resources allocated in ethdev layer should be > > kept and SW and HW resources of the NIC in PMD need to be reset in > > similar way that it runs PCI dev_uninit() and then dev_init(). > > The sequence of dev_uninit() and dev_init() can be packed into a > > single interface API rte_eth_dev_reset(). > > Please See the comments before the declaration of rte_eht_dev_reset() > > in lib/librte_ether/rte_ethdev.h to get more details on why this > > function is needed, what it does, when it should be called > > and what an application should do after calling this function. > > This patch adds a new eth_dev layer API function rte_eth_dev_reset(), > which a DPDK application can call to reset a NIC and keep its port id > afterwards. It means that all software resources allocated in the ethdev > layer are kept, and software & hardware resources of the NIC within the > NIC's PMD are reset to a state simular to that obtained by calling the > PCI dev_uninit() and then dev_init(). This effective sequence of > dev_uninit() and dev_init() is packed into a single API function > rte_eth_dev_reset(). > > Please see the comments before the declaration of rte_eht_dev_reset() > in lib/librte_ether/rte_ethdev.h to get more details on why this > function is needed, what it does, when it should be called > and what an application should do after calling this function. > > > > Signed-off-by: Wei Dai > > --- > > lib/librte_ether/rte_ethdev.c | 17 + > > lib/librte_ether/rte_ethdev.h | 33 > + > > lib/librte_ether/rte_ether_version.map | 1 + > > 3 files changed, 51 insertions(+) > > Reviewed-by: Remy Horton > > > > > /** > > + * Reset a Ethernet device and keep its port id. > > + * > > + * When a port has to be reset passively, the DPDK application can invoke > this > > + * function. For example a PF is reset, all its VFs should also be reset. > > + * Normally, a DPDK application can invoke this function when > > + * RTE_ETH_EVENT_INTR_RESET event is detected. A DPDK application > can also call > > + * this function to start an initiative port reset. > > When a port has to be reset passively, the DPDK application can invoke > this function. For example when a PF is reset, all its VFs should also > be reset. Normally a DPDK application can invoke this function when > RTE_ETH_EVENT_INTR_RESET event is detected, but can also use it to start > a port reset in other circumstances. > > > + * When this function is called, it first stops the port and then call PMD > > + * specific dev_uninit( ) and dev_init( ) to makes the port return to > > initial > > + * status in which no any Tx queue and no Rx queue are setup and the port > has > > + * just be reset and not started. And the port keeps its port id after > > calling > > + * this function. > > When this function is called, it first stops the port and then calls the > PMD specific dev_uninit( ) and dev_init( ) to return the port to initial > state, in which no Tx and Rx queues are setup, as if the port has been > reset and not started. The port keeps the port id it had before the > function call. > > > + * After calling rte_eth_dev_reset( ), the application should go through > > + * rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ), > > + * rte_eth_tx_queue_setup( ) and rte_eth_dev_start( ) again to restore > > + * its previous settings or to reconfigure itself with different settings. > > After calling rte_eth_dev_reset( ), the application should use > rte_eth_dev_configure( ), rte_eth_rx_queue_setup( ), > rte_eth_tx_queue_setup( ), and rte_eth_dev_start( ) > to reconfigure the device as appropriate. > > > > + * Note: to avoid unexpected behaviour, the application should stop > calling Rx > > + * and Rx function before calling rte_eth_dev_reset( ).For thread > safety, > > + * all these controlling operations had better be made in same > thread. > > Note: To avoid unexpected behavior, the application should stop calling > Tx and Rx functions before calling rte_eth_dev_reset( ). For thread > safety, all these controlling functions should be called from the same > thread. >
Re: [dpdk-dev] [PATCH v1] net/i40e: fix sync phy type by adding retry
> -Original Message- > From: Hunt, David > Sent: Friday, July 21, 2017 10:41 PM > To: dev@dpdk.org > Cc: Wu, Jingjing ; Hunt, David > Subject: [PATCH v1] net/i40e: fix sync phy type by adding retry > > Some phy's take longer than others to come up. Add a retry to give more phy's > a chance to come up before returning an error. > > Signed-off-by: David Hunt Patch looks fine, but you missed the Fixes line and it's better to Cc sta...@dpdk.org Thanks Jingjing
[dpdk-dev] LACP bond link broken chain due to the timeout
Hi, all I use dpdk LACP bond for long time large flow test, there are LACP broken chain. The dpdk log is as follow: Bond 1: slave id 0 distributing stopped. Bond 1: slave id 1 distributing stopped. Through the analysis of code , LACP protocol packets are handled by eal-intr-thread thread, at the same time, the thread will also deal with the other driver interrupt event and query the NIC card state . And each polling thread is waiting for a long time that is leading to the switch disconnection timeout. On the generic x86 OS server, has it been considered that the eal-intr-thread thread can 't get a timely scheduling, and that would result in LACP timeout and Link broken chain? Or has it been considered that LACP bond would be put on a single thread, not shared with others?