Hi Vlad,

> -----Original Message-----
> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
> Sent: Thursday, August 20, 2015 10:07 AM
> To: Ananyev, Konstantin; Lu, Wenzhuo
> Cc: dev at dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh above 1 for
> all NICs but 82598
>
>
>
> On 08/20/15 12:05, Vlad Zolotarov wrote:
> >
> >
> > On 08/20/15 11:56, Vlad Zolotarov wrote:
> >>
> >>
> >> On 08/20/15 11:41, Ananyev, Konstantin wrote:
> >>> Hi Vlad,
> >>>
> >>>> -----Original Message-----
> >>>> From: Vlad Zolotarov [mailto:vladz at cloudius-systems.com]
> >>>> Sent: Wednesday, August 19, 2015 11:03 AM
> >>>> To: Ananyev, Konstantin; Lu, Wenzhuo
> >>>> Cc: dev at dpdk.org
> >>>> Subject: Re: [dpdk-dev] [PATCH v1] ixgbe_pmd: forbid tx_rs_thresh
> >>>> above 1 for all NICs but 82598
> >>>>
> >>>>
> >>>>
> >>>> On 08/19/15 10:43, Ananyev, Konstantin wrote:
> >>>>> Hi Vlad,
> >>>>> Sorry for delay with review, I am OOO till next week.
> >>>>> Meanwhile, few questions/comments from me.
> >>>> Hi, Konstantin, long time no see... ;)
> >>>>
> >>>>>>>>>> This patch fixes the Tx hang we were constantly hitting with a
> >>>>>> seastar-based
> >>>>>>>>>> application on x540 NIC.
> >>>>>>>>> Could you help to share with us how to reproduce the tx hang
> >>>>>>>>> issue,
> >>>>>> with using
> >>>>>>>>> typical DPDK examples?
> >>>>>>>> Sorry. I'm not very familiar with the typical DPDK examples to
> >>>>>>>> help u
> >>>>>>>> here. However this is quite irrelevant since without this this
> >>>>>>>> patch
> >>>>>>>> ixgbe PMD obviously abuses the HW spec as has been explained
> >>>>>>>> above.
> >>>>>>>>
> >>>>>>>> We saw the issue when u stressed the xmit path with a lot of
> >>>>>>>> highly
> >>>>>>>> fragmented TCP frames (packets with up to 33 fragments with
> >>>>>>>> non-headers
> >>>>>>>> fragments as small as 4 bytes) with all offload features enabled.
> >>>>> Could you provide us with the pcap file to reproduce the issue?
> >>>> Well, the thing is it takes some time to reproduce it (a few
> >>>> minutes of
> >>>> heavy load) therefore a pcap would be quite large.
> >>> Probably you can upload it to some place, from which we will be able
> >>> to download it?
> >>
> >> I'll see what I can do but no promises...
> >
> > On a second thought pcap file won't help u much since in order to
> > reproduce the issue u have to reproduce exactly the same structure of
> > clusters i give to HW and it's not what u see on wire in a TSO case.
>
> And not only in a TSO case... ;)

I understand that, but my thought was that you could add some sort of TX callback
for rte_eth_tx_burst() to your code that would write each packet into a pcap file,
and then re-run your hang scenario.
I know that it means extra work for you, but I think it would be very helpful
if we were able to reproduce your hang scenario:
- if the HW guys confirm that setting the RS bit for every EOP packet is not
  really required, then we probably have to look at what else can cause it.
- it could be added to our validation cycle, to prevent hitting a similar
  problem in the future.
Thanks
Konstantin

>
> >
> >>
> >>> Or might be you have some sort of scapy script to generate it?
> >>> I suppose we'll need something to reproduce the issue and verify the
> >>> fix.
> >>
> >> Since the original code abuses the HW spec u don't have to... ;)
> >>
> >>>
> >>>>> My concern with you approach is that it would affect TX performance.
> >>>> It certainly will ;) But it seem inevitable. See below.
> >>>>
> >>>>> Right now, for simple TX PMD usually reads only
> >>>>> (nb_tx_desc/tx_rs_thresh) TXDs,
> >>>>> While with your patch (if I understand it correctly) it has to
> >>>>> read all TXDs in the HW TX ring.
> >>>> If by "simple" u refer an always single fragment per Tx packet -
> >>>> then u
> >>>> are absolutely correct.
> >>>>
> >>>> My initial patch was to only set RS on every EOP descriptor without
> >>>> changing the rs_thresh value and this patch worked.
> >>>> However HW spec doesn't ensure in a general case that packets are
> >>>> always
> >>>> handled/completion write-back completes in the same order the packets
> >>>> are placed on the ring (see "Tx arbitration schemes" chapter in 82599
> >>>> spec for instance). Therefore AFAIU one should not assume that if
> >>>> packet[x+1] DD bit is set then packet[x] is completed too.
> >>> From my understanding, TX arbitration controls the order in which
> >>> TXDs from
> >>> different queues are fetched/processed.
> >>> But descriptors from the same TX queue are processed in FIFO order.
> >>> So, I think that - yes, if TXD[x+1] DD bit is set, then TXD[x] is
> >>> completed too,
> >>> and setting RS on every EOP TXD should be enough.
> >>
> >> Ok. I'll rework the patch under this assumption then.
> >>
> >>>
> >>>> That's why I changed the patch to be as u see it now. However if I
> >>>> miss
> >>>> something here and your HW people ensure the in-order completion
> >>>> this of
> >>>> course may be changed back.
> >>>>
> >>>>> Even if we really need to setup RS bit in each TXD (I still doubt
> >>>>> we really do) - ,
> >>>> Well, if u doubt u may ask the guys from the Intel networking division
> >>>> that wrote the 82599 and x540 HW specs where they clearly state
> >>>> that. ;)
> >>> Good point, we'll see what we can do here :)
> >>> Konstantin
> >>>
> >>>>> I think inside PMD it still should be possible to check TX
> >>>>> completion in chunks.
> >>>>> Konstantin
> >>>>>
> >>>>>
> >>>>>>>> Thanks,
> >>>>>>>> vlad
> >>>>>>>>>> Signed-off-by: Vlad Zolotarov <vladz at cloudius-systems.com>
> >>>>>>>>>> ---
> >>>>>>>>>> drivers/net/ixgbe/ixgbe_ethdev.c | 9 +++++++++
> >>>>>>>>>> drivers/net/ixgbe/ixgbe_rxtx.c | 23
> >>>>>>>>>> ++++++++++++++++++++++-
> >>>>>>>>>> 2 files changed, 31 insertions(+), 1 deletion(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> index b8ee1e9..6714fd9 100644
> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_ethdev.c
> >>>>>>>>>> @@ -2414,6 +2414,15 @@ ixgbe_dev_info_get(struct rte_eth_dev
> >>>>>>>>>> *dev,
> >>>>>>>> struct
> >>>>>>>>>> rte_eth_dev_info *dev_info)
> >>>>>>>>>> .txq_flags = ETH_TXQ_FLAGS_NOMULTSEGS |
> >>>>>>>>>> ETH_TXQ_FLAGS_NOOFFLOADS,
> >>>>>>>>>> };
> >>>>>>>>>> +
> >>>>>>>>>> + /*
> >>>>>>>>>> + * According to 82599 and x540 specifications RS bit
> >>>>>>>>>> *must* be
> >>>>>> set on
> >>>>>>>> the
> >>>>>>>>>> + * last descriptor of *every* packet. Therefore we will
> >>>>>>>>>> not allow
> >>>>>> the
> >>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598.
> >>>>>>>>>> + */
> >>>>>>>>>> + if (hw->mac.type > ixgbe_mac_82598EB)
> >>>>>>>>>> + dev_info->default_txconf.tx_rs_thresh = 1;
> >>>>>>>>>> +
> >>>>>>>>>> dev_info->hash_key_size = IXGBE_HKEY_MAX_INDEX *
> >>>>>>>>>> sizeof(uint32_t);
> >>>>>>>>>> dev_info->reta_size = ETH_RSS_RETA_SIZE_128;
> >>>>>>>>>> dev_info->flow_type_rss_offloads =
> >>>>>>>>>> IXGBE_RSS_OFFLOAD_ALL; diff --
> >>>>>>>> git
> >>>>>>>>>> a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>> index
> >>>>>>>>>> 91023b9..8dbdffc 100644
> >>>>>>>>>> --- a/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> +++ b/drivers/net/ixgbe/ixgbe_rxtx.c
> >>>>>>>>>> @@ -2085,11 +2085,19 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>>>> *dev,
> >>>>>>>>>> struct ixgbe_tx_queue *txq;
> >>>>>>>>>> struct ixgbe_hw *hw;
> >>>>>>>>>> uint16_t tx_rs_thresh, tx_free_thresh;
> >>>>>>>>>> + bool rs_deferring_allowed;
> >>>>>>>>>>
> >>>>>>>>>> PMD_INIT_FUNC_TRACE();
> >>>>>>>>>> hw = IXGBE_DEV_PRIVATE_TO_HW(dev->data->dev_private);
> >>>>>>>>>>
> >>>>>>>>>> /*
> >>>>>>>>>> + * According to 82599 and x540 specifications RS bit
> >>>>>>>>>> *must* be
> >>>>>> set on
> >>>>>>>> the
> >>>>>>>>>> + * last descriptor of *every* packet. Therefore we will
> >>>>>>>>>> not allow
> >>>>>> the
> >>>>>>>>>> + * tx_rs_thresh above 1 for all NICs newer than 82598.
> >>>>>>>>>> + */
> >>>>>>>>>> + rs_deferring_allowed = (hw->mac.type <= ixgbe_mac_82598EB);
> >>>>>>>>>> +
> >>>>>>>>>> + /*
> >>>>>>>>>> * Validate number of transmit descriptors.
> >>>>>>>>>> * It must not exceed hardware maximum, and must be
> >>>>>>>>>> multiple
> >>>>>>>>>> * of IXGBE_ALIGN.
> >>>>>>>>>> @@ -2110,6 +2118,8 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>> *dev,
> >>>>>>>>>> * to transmit a packet is greater than the number of
> >>>>>>>>>> free TX
> >>>>>>>>>> * descriptors.
> >>>>>>>>>> * The following constraints must be satisfied:
> >>>>>>>>>> + * tx_rs_thresh must be less than 2 for NICs for which RS
> >>>>>> deferring is
> >>>>>>>>>> + * forbidden (all but 82598).
> >>>>>>>>>> * tx_rs_thresh must be greater than 0.
> >>>>>>>>>> * tx_rs_thresh must be less than the size of the ring
> >>>>>>>>>> minus 2.
> >>>>>>>>>> * tx_rs_thresh must be less than or equal to
> >>>>>>>>>> tx_free_thresh.
> >>>>>>>>>> @@ -2121,9 +2131,20 @@ ixgbe_dev_tx_queue_setup(struct
> >>>>>>>>>> rte_eth_dev
> >>>>>>>> *dev,
> >>>>>>>>>> * When set to zero use default values.
> >>>>>>>>>> */
> >>>>>>>>>> tx_rs_thresh = (uint16_t)((tx_conf->tx_rs_thresh) ?
> >>>>>>>>>> - tx_conf->tx_rs_thresh :
> >>>>>>>>>> DEFAULT_TX_RS_THRESH);
> >>>>>>>>>> + tx_conf->tx_rs_thresh :
> >>>>>>>>>> + (rs_deferring_allowed ?
> >>>>>>>>>> DEFAULT_TX_RS_THRESH :
> >>>>>> 1));
> >>>>>>>>>> tx_free_thresh = (uint16_t)((tx_conf->tx_free_thresh) ?
> >>>>>>>>>> tx_conf->tx_free_thresh :
> >>>>>>>>>> DEFAULT_TX_FREE_THRESH);
> >>>>>>>>>> +
> >>>>>>>>>> + if (!rs_deferring_allowed && tx_rs_thresh > 1) {
> >>>>>>>>>> + PMD_INIT_LOG(ERR, "tx_rs_thresh must be less than
> >>>>>>>>>> 2 since
> >>>>>> RS
> >>>>>>>> "
> >>>>>>>>>> + "must be set for every packet for this
> >>>>>> HW. "
> >>>>>>>>>> + "(tx_rs_thresh=%u port=%d queue=%d)",
> >>>>>>>>>> + (unsigned int)tx_rs_thresh,
> >>>>>>>>>> + (int)dev->data->port_id, (int)queue_idx);
> >>>>>>>>>> + return -(EINVAL);
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> if (tx_rs_thresh >= (nb_desc - 2)) {
> >>>>>>>>>> PMD_INIT_LOG(ERR, "tx_rs_thresh must be less
> >>>>>>>>>> than the
> >>>>>>>> number "
> >>>>>>>>>> "of TX descriptors minus 2. (tx_rs_thresh=%u
> >>>>>> "
> >>>>>>>>>> --
> >>>>>>>>>> 2.1.0
> >>
> >
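
P.S. To make the capture idea above a bit more concrete, here is a rough,
untested-on-hardware sketch of the pcap-writing part in plain C. It is only an
illustration under assumptions: `struct seg` is a hypothetical stand-in for a
chained mbuf, and `pcap_write_file_hdr()` / `pcap_write_chain()` are made-up
helper names, not DPDK or libpcap API. Because a pcap record flattens the
packet, the sketch also logs the per-segment lengths to an optional side file,
since the segment layout handed to the HW is what matters for this hang.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one segment of a chained mbuf. */
struct seg {
	const uint8_t *data;
	uint32_t len;
	const struct seg *next;
};

/* Classic pcap file header (24 bytes), written in host byte order. */
struct pcap_file_hdr {
	uint32_t magic;     /* 0xa1b2c3d4 */
	uint16_t ver_major; /* 2 */
	uint16_t ver_minor; /* 4 */
	int32_t  thiszone;
	uint32_t sigfigs;
	uint32_t snaplen;
	uint32_t linktype;  /* 1 = Ethernet */
};

/* Per-packet record header (16 bytes). */
struct pcap_rec_hdr {
	uint32_t ts_sec;
	uint32_t ts_usec;
	uint32_t incl_len;
	uint32_t orig_len;
};

/* Write the file header once, right after opening the capture file. */
static int
pcap_write_file_hdr(FILE *f)
{
	struct pcap_file_hdr h = { 0xa1b2c3d4u, 2, 4, 0, 0, 65535, 1 };

	assert(f != NULL);
	return fwrite(&h, sizeof(h), 1, f) == 1 ? 0 : -1;
}

/*
 * Write one packet: a record header, then every segment's bytes in order.
 * If seglog is not NULL, also write the segment lengths as one text line,
 * because the pcap record itself does not keep the segment boundaries.
 */
static int
pcap_write_chain(FILE *f, FILE *seglog, const struct seg *head,
		 uint32_t ts_sec, uint32_t ts_usec)
{
	const struct seg *s;
	struct pcap_rec_hdr r;
	uint32_t total = 0;

	assert(f != NULL && head != NULL);

	for (s = head; s != NULL; s = s->next)
		total += s->len;

	r.ts_sec = ts_sec;
	r.ts_usec = ts_usec;
	r.incl_len = total;
	r.orig_len = total;
	if (fwrite(&r, sizeof(r), 1, f) != 1)
		return -1;

	for (s = head; s != NULL; s = s->next) {
		if (s->len != 0 && fwrite(s->data, 1, s->len, f) != s->len)
			return -1;
		if (seglog != NULL)
			fprintf(seglog, "%u ", (unsigned int)s->len);
	}
	if (seglog != NULL)
		fputc('\n', seglog);
	return 0;
}
```

In a real application something like pcap_write_chain() would be called for
each packet from whatever hook runs just before rte_eth_tx_burst(), walking
the real mbuf chain instead of `struct seg`. Doing file I/O on the TX path will
change timing, so the hang may become harder to hit with the capture enabled.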