On 08/20/15 11:41, Ananyev, Konstantin wrote:
> Hi Vlad,
>
>> -----Original Message-----
>> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
>> Sent: Wednesday, August 19, 2015 11:03 AM
>> To: Ananyev, Konstantin; Lu, Wenzhuo
>> Cc: dev at dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1
>> for all NICs but 82598
>>
>>
>>
>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
>>> Hi Vlad,
>>> Sorry for delay with review, I am OOO till next week.
>>> Meanwhile, few questions/comments from me.
>> Hi, Konstantin, long time no see... ;)
>>
>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a seastar-based
>>>>>>>> application on x540 NIC.
>>>>>>> Could you help to share with us how to reproduce the tx hang issue, with using
>>>>>>> typical DPDK examples?
>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to help u
>>>>>> here. However this is quite irrelevant since without this patch
>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained above.
>>>>>>
>>>>>> We saw the issue when u stressed the xmit path with a lot of highly
>>>>>> fragmented TCP frames (packets with up to 33 fragments with non-headers
>>>>>> fragments as small as 4 bytes) with all offload features enabled.
>>> Could you provide us with the pcap file to reproduce the issue?
>> Well, the thing is it takes some time to reproduce it (a few minutes of
>> heavy load) therefore a pcap would be quite large.
> Probably you can upload it to some place, from which we will be able to
> download it?

I'll see what I can do but no promises...

> Or might be you have some sort of scapy script to generate it?
> I suppose we'll need something to reproduce the issue and verify the fix.

Since the original code abuses the HW spec u don't have to... ;)

>
>>> My concern with your approach is that it would affect TX performance.
>> It certainly will ;) But it seems inevitable. See below.
>>
>>> Right now, for simple TX PMD usually reads only (nb_tx_desc/tx_rs_thresh) TXDs,
>>> While with your patch (if I understand it correctly) it has to read all
>>> TXDs in the HW TX ring.
>> If by "simple" u refer an always single fragment per Tx packet - then u
>> are absolutely correct.
>>
>> My initial patch was to only set RS on every EOP descriptor without
>> changing the rs_thresh value and this patch worked.
>> However HW spec doesn't ensure in a general case that packets are always
>> handled/completion write-back completes in the same order the packets
>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
>> spec for instance). Therefore AFAIU one should not assume that if
>> packet[x+1] DD bit is set then packet[x] is completed too.
> From my understanding, TX arbitration controls the order in which TXDs from
> different queues are fetched/processed.
> But descriptors from the same TX queue are processed in FIFO order.
> So, I think that - yes, if TXD[x+1] DD bit is set, then TXD[x] is completed too,
> and setting RS on every EOP TXD should be enough.

Ok. I'll rework the patch under this assumption then.

>
>> That's why I changed the patch to be as u see it now. However if I miss
>> something here and your HW people ensure the in-order completion this of
>> course may be changed back.
>>
>>> Even if we really need to setup RS bit in each TXD (I still doubt we really do) - ,
>> Well, if u doubt u may ask the guys from the Intel networking division
>> that wrote the 82599 and x540 HW specs where they clearly state that.
>> ;)
> Good point, we'll see what we can do here :)
> Konstantin
>
>>> I think inside PMD it still should be possible to check TX completion in
>>> chunks.
>>> Konstantin
>>>
>>>
>>>>>> Thanks,
>>>>>> vlad
>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
>>>>>>>> ---
>>>>>>>>  drivers/net/ixgbe/ixgbe_ethdev.c |  9 +++++++++
>>>>>>>>  drivers/net/ixgbe/ixgbe_rxtx.c   | 23 ++++++++++++++++++++++-
>>>>>>>>  2 files changed, 31 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> index b8ee1e9..6714fd9 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev *dev, struct rte_eth_dev_info *dev_info)
>>>>>>>>                  .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
>>>>>>>>                                  ETH_TXQ_FLAGS_NOOFFLOADS,
>>>>>>>>          };
>>>>>>>> +
>>>>>>>> +        /*
>>>>>>>> +         * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>> +         * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>> +         * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +         */
>>>>>>>> +        if (hw->mac.type > ixgbe_mac_82598EB)
>>>>>>>> +                dev_info->default_txconf.tx_rs_thresh = 1;
>>>>>>>> +
>>>>>>>>          dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX * sizeof(uint32_t);
>>>>>>>>          dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
>>>>>>>>          dev_info->flow_type_rss_offloads = IXGBE_RSS_OFFLOAD_ALL;
>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_rxtx.c b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> index 91023b9..8dbdffc 100644
>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>          struct ixgbe_tx_queue *txq;
>>>>>>>>          struct ixgbe_hw *hw;
>>>>>>>>          uint16_t tx_rs_thresh, tx_free_thresh;
>>>>>>>> +        bool rs_deferring_allowed;
>>>>>>>>
>>>>>>>>          PMD_INIT_FUNC_TRACE();
>>>>>>>>          hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
>>>>>>>>
>>>>>>>>          /*
>>>>>>>> +         * According to 82599 and x540 specifications RS bit *must* be set on the
>>>>>>>> +         * last descriptor of *every* packet. Therefore we will not allow the
>>>>>>>> +         * tx_rs_thresh above 1 for all NICs newer than 82598.
>>>>>>>> +         */
>>>>>>>> +        rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
>>>>>>>> +
>>>>>>>> +        /*
>>>>>>>>           * Validate number of transmit descriptors.
>>>>>>>>           * It must not exceed hardware maximum, and must be multiple
>>>>>>>>           * of IXGBE_ALIGN.
>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>           * to transmit a packet is greater than the number of free TX
>>>>>>>>           * descriptors.
>>>>>>>>           * The following constraints must be satisfied:
>>>>>>>> +         *  tx_rs_thresh must be less than 2 for NICs for which RS deferring is
>>>>>>>> +         *  forbidden (all but 82598).
>>>>>>>>           *  tx_rs_thresh must be greater than 0.
>>>>>>>>           *  tx_rs_thresh must be less than the size of the ring minus 2.
>>>>>>>>           *  tx_rs_thresh must be less than or equal to tx_free_thresh.
>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct rte_eth_dev *dev,
>>>>>>>>           * When set to zero use default values.
>>>>>>>>           */
>>>>>>>>          tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
>>>>>>>> -                        tx_conf->tx_rs_thresh : DEFAULT_TX_RS_THRESH);
>>>>>>>> +                        tx_conf->tx_rs_thresh :
>>>>>>>> +                        (rs_deferring_allowed ? DEFAULT_TX_RS_THRESH : 1));
>>>>>>>>          tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
>>>>>>>>                          tx_conf->tx_free_thresh : DEFAULT_TX_FREE_THRESH);
>>>>>>>> +
>>>>>>>> +        if (!rs_deferring_allowed && tx_rs_thresh > 1) {
>>>>>>>> +                PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than 2 since RS "
>>>>>>>> +                                  "must be set for every packet for this HW. "
>>>>>>>> +                                  "(tx_rs_thresh=%u port=%d queue=%d)",
>>>>>>>> +                             (unsigned int)tx_rs_thresh,
>>>>>>>> +                             (int)dev->data->port_id, (int)queue_idx);
>>>>>>>> +                return -(EINVAL);
>>>>>>>> +        }
>>>>>>>> +
>>>>>>>>          if (tx_rs_thresh >= (nb_desc - 2)) {
>>>>>>>>                  PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than the number "
>>>>>>>>                               "of TX descriptors minus 2. (tx_rs_thresh=%u "
>>>>>>>> --
>>>>>>>> 2.1.0
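
For anyone following the thread, here is a rough sketch of the alternative discussed above: set RS only on the EOP descriptor of each packet and still reclaim descriptors in chunks. It is a toy model and not ixgbe code. The names `toy_txd`, `toy_txq`, `toy_xmit`, `toy_hw_complete` and `toy_clean` are hypothetical, the descriptor layout is invented, and the model assumes (as Konstantin states, not as something verified here) that descriptors of a single TX queue complete in FIFO order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 16u /* hypothetical ring size */

/* Hypothetical, simplified descriptor; NOT the real ixgbe layout. */
struct toy_txd {
    bool eop; /* last descriptor of a packet */
    bool rs;  /* software asks for a status write-back */
    bool dd;  /* "descriptor done", written by the modelled hardware */
};

struct toy_txq {
    struct toy_txd ring[RING_SIZE];
    uint16_t tail;       /* next slot software will fill */
    uint16_t hw_head;    /* next slot the modelled hardware will process */
    uint16_t hw_pending; /* queued but not yet processed by hardware */
    uint16_t next_clean; /* oldest slot not yet reclaimed */
    uint16_t nb_free;    /* free slots */
};

static void toy_txq_init(struct toy_txq *q)
{
    memset(q, 0, sizeof(*q));
    q->nb_free = RING_SIZE;
}

/* Queue one packet of nb_segs descriptors, RS only on its EOP descriptor. */
static bool toy_xmit(struct toy_txq *q, uint16_t nb_segs)
{
    assert(nb_segs > 0);
    if (nb_segs > q->nb_free)
        return false;
    for (uint16_t i = 0; i < nb_segs; i++) {
        struct toy_txd *d = &q->ring[q->tail];
        bool last = (i == nb_segs - 1);

        d->eop = last;
        d->rs = last;
        d->dd = false;
        q->tail = (uint16_t)((q->tail + 1u) % RING_SIZE);
    }
    q->nb_free = (uint16_t)(q->nb_free - nb_segs);
    q->hw_pending = (uint16_t)(q->hw_pending + nb_segs);
    return true;
}

/* Modelled hardware: process up to n descriptors in FIFO order (assumed). */
static void toy_hw_complete(struct toy_txq *q, uint16_t n)
{
    while (n > 0 && q->hw_pending > 0) {
        struct toy_txd *d = &q->ring[q->hw_head];

        if (d->rs)
            d->dd = true; /* write-back only where RS was requested */
        q->hw_head = (uint16_t)((q->hw_head + 1u) % RING_SIZE);
        q->hw_pending--;
        n--;
    }
}

/*
 * Reclaim at least `chunk` descriptors by checking a single DD bit: the one
 * on the first EOP descriptor at or after the chunk boundary. Under the FIFO
 * assumption, everything before that descriptor is complete as well.
 * Returns the number of descriptors reclaimed (0 if nothing can be freed).
 */
static uint16_t toy_clean(struct toy_txq *q, uint16_t chunk)
{
    uint16_t used = (uint16_t)(RING_SIZE - q->nb_free);
    uint16_t n = chunk;
    uint16_t idx;

    if (chunk == 0 || used < chunk)
        return 0;
    idx = (uint16_t)((q->next_clean + chunk - 1u) % RING_SIZE);
    while (!q->ring[idx].eop) { /* EOP positions are known to software */
        if (n == used)
            return 0; /* defensive: packets are always queued whole */
        idx = (uint16_t)((idx + 1u) % RING_SIZE);
        n++;
    }
    if (!q->ring[idx].dd)
        return 0;
    q->next_clean = (uint16_t)((idx + 1u) % RING_SIZE);
    q->nb_free = (uint16_t)(q->nb_free + n);
    return n;
}
```

With a 3-segment packet followed by a 2-segment packet queued, `toy_clean(&q, 2)` rounds the chunk up to the first packet boundary and reads only that packet's DD bit. This is the sense in which completion could still be checked in chunks even though RS is set per packet. Whether real hardware guarantees the in-order write-back this relies on is exactly the open question in the thread.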