On 3/6/2023 10:32 AM, Konstantin Ananyev wrote: > > >>> -----Original Message----- >>> From: Ferruh Yigit <ferruh.yi...@amd.com> >>> Sent: Saturday, March 4, 2023 1:19 AM >>> To: Konstantin Ananyev <konstantin.v.anan...@yandex.ru>; dev@dpdk.org; >>> fengchengwen >>> <fengcheng...@huawei.com>; Konstantin Ananyev >>> <konstantin.anan...@huawei.com>; Honnappa >>> Nagarahalli <honnappa.nagaraha...@arm.com>; Stephen Hemminger >>> <step...@networkplumber.org>; >>> Ruifeng Wang <ruifeng.w...@arm.com>; Ajit Khaparde >>> (ajit.khapa...@broadcom.com) >>> <ajit.khapa...@broadcom.com> >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup >>> >>> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote: >>>> >>>>>>>>>>>>>> If ethdev enqueue or dequeue function is called during >>>>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting >>>>>>>>>>>>>> the function pointers, but before setting the pointer to port >>>>>>>>>>>>>> data. >>>>>>>>>>>>>> In this case the newly registered enqueue/dequeue function >>>>>>>>>>>>>> will use dummy port data and end up in seg fault. >>>>>>>>>>>>>> >>>>>>>>>>>>>> This patch moves the updation of each data pointers before >>>>>>>>>>>>>> updating corresponding function pointers. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into >>>>>>>>>>>>>> separate >>>>>>>>>>>>>> structure") >>>>>>>>>>>>>> Cc: sta...@dpdk.org >>>>>>>>>>> >>>>>>>>>>> Why is something calling enqueue/dequeue when device is not >>>>>>>>>>> fully >>>>>>>>> started. >>>>>>>>>>> A correctly written application would not call rx/tx burst >>>>>>>>>>> until after ethdev start had finished. >>>>>>>>>> >>>>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error >>>>>>>>>> handling mode), when driver recover itself, the application may >>>>>>>>>> still invoke >>>>>>>>> enqueue/dequeue API. >>>>>>>>> >>>>>>>>> Right now DPDK ethdev layer *does not* provide synchronization >>>>>>>>> mechanisms between data-path and control-path functions. >>>>>>>>> That was a deliberate deisgn choice. If we want to change that >>>>>>>>> rule, then I suppose we need a community consensus for it. >>>>>>>>> I think that if the driver wants to provide some sort of error >>>>>>>>> recovery procedure, then it has to provide some synchronization >>>>>>>>> mechanism inside it between data-path and control-path functions. >>>>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error >>>>>>>>> handling mode), and following patches I wonder how it creeped in? >>>>>>>>> It seems we just introduced a loophole for race condition with >>>>>>>>> this approach... >>>>>>> >>>>>>> Could you try to describe the specific scenario of loophole ? >>>>>> >>>>>> Ok, as I understand the existing mechanism: >>>>>> >>>>>> When PMD wants to start a recovery it has to: >>>>>> - invoke >>>>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); >>>>>> That supposed to call user provided callback. After callback is >>>>>> finished PMD assumes >>>>>> that user is aware that recovery is about to start and should >>>>>> make some precautions. >>>>>> - when recovery is finished it invokes another callback: >>>>>> RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either >>>>>> can continue to >>>>>> use port or have to treat is as faulty. >>>>>> >>>>>> The idea is ok in principle, but there is a problem. >>>>>> >>>>>> lib/ethdev/rte_ethdev.h: >>>>>> /** Port recovering from a hardware or firmware error. >>>>>> * If PMD supports proactive error recovery, >>>>>> * it should trigger this event to notify application >>>>>> * that it detected an error and the recovery is being started. >>>>>> >>>>>> <<< !!!!! >>>>>> * Upon receiving the event, the application should not >>>>>> invoke any control path API >>>>>> * (such as rte_eth_dev_configure/rte_eth_dev_stop...) >>>>>> until receiving >>>>>> * RTE_ETH_EVENT_RECOVERY_SUCCESS or >>>>>> RTE_ETH_EVENT_RECOVERY_FAILED event. >>>>>> * The PMD will set the data path pointers to dummy >>>>>> functions, >>>>>> * and re-set the data path pointers to non-dummy functions >>>>>> * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event. >>>>>> <<< !!!!! >>>>>> >>>>>> That part is just wrong I believe. >>>>>> It should be: >>>>>> Upon receiving the event, the application should not invoke any >>>>>> *both control and data-path* API until receiving >>>>>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED >>>>>> event. >>>>>> Resetting data path pointers to dummy functions by PMD *before* >>>>>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); >>>>>> introduces a race-condition with data-path threads, as such thread >>>>>> could already be inside RX/TX function or can already read RX/TX >>>>>> function/data pointers and be about to use them. >>>>> >>>>> Current practices: the PMDs already add some delay after set Rx/Tx >>>>> callback to dummy, and plus the DPDK worker thread is busypolling, >>>>> the probability of occurence in reality is zero. But in theoretically >>>>> exist the above race-condition. >>>> >>>> >>>> Adding delay might make a problem a bit less reproducible, but it >>>> doesn't fix it. >>>> The bug is still there. >>>> >>>> >>>>> >>>>>> And right now rte_ethdev layer doesn't provide any mechanism to >>>>>> check it or wait when they'll finish, etc. >>>>> >>>>> Yes >>>>> >>>>>> >>>>>> So, probably the simplest way to fix it with existing DPDK design: >>>>>> - user level callback RTE_ETH_EVENT_ERR_RECOVERING should return >>>>>> only after it ensures that *all* >>>>>> application threads (and processes) stopped using either control >>>>>> or data-path functions for that port >>>>> >>>>> Agree >>>>> >>>>>> (yes it means that application that wants to use this feature has >>>>>> to provide its own synchronization mechanism >>>>>> around data-path functions (RX/TX) that it is going to use). >>>>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy ones. >>>>>> >>>>>> And message to all PMD developers: >>>>>> *please stop updating rte_eth_fp_ops[] on your own*. >>>>>> That's a bad practice and it is not supposed to do things that way. >>>>>> There is a special API provided for these purposes: >>>>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it. >>>>> >>>>> This two function is in private.h, so it should be expose to public >>>>> header file. >>>> >>>> You mean we need to move these functions declarations into ethdev_driver.h? >>>> If so, then yes, I think we probably do. >>>> >>>> >>> >>> >>> What about making slightly different version available to drivers, which >>> only updates >>> function pointers, but not 'fpo->rxq' / 'fpo->txq'. >>> >>> This way driver can switch to between dummy and real burst function without >>> worrying Rx/Tx >>> queue validity. >>> >>> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering >>> systems? >> >> Yes, updating only function pointers removes the synchronization requirement >> between function >> pointer and qdata. > > Lads, that wouldn't work anyway. > The race between recovery procedure and data-path persists: > Recovery still has no idea is at given moment any thread doing RX/TX or not, > and there is no > way for it to know when such thread will finish.
Yes race condition persists, but as long as data (rxq/txq) stays valid, does it cause a trouble? At lest this fixes the potential crash I think. > We do need some synchronization mechanism between control(recovery) and > data-path threads. > I believe it is unavoidable. > >>> >>> >>> >>>>>> >>>>>> BTW, I don't see any implementation for >>>>>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other >>>>>> example apps. >>>>>> Am I missing something? >>>>> >>>>> Currently it just promote the event. >>>> >>>> >>>> Ok, can I suggest then to add a proper usage for into in testpmd? >>>> It looks really strange that we add new feature into ethdev (and 2 >>>> PMDs), but didn't provide any way for users to test it. >>>> >>>>> >>>>>> If not, then probably it could be a good starting point - let's >>>>>> incorporate it inside testpmd (new forwarding engine probably) so >>>>>> everyone can test/try it. >>>>>> >>>>>> * It means that the application cannot send or receive any >>>>>> packets >>>>>> * during this period. >>>>>> * @note Before the PMD reports the recovery result, >>>>>> * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING >>>>>> event again, >>>>>> * because a larger error may occur during the recovery. >>>>>> */ >>>>>> RTE_ETH_EVENT_ERR_RECOVERING, >>>>>> >>>>>>>>> It probably needs to be either deprecated or reworked. >>>>>>>> Looking at the commit, it does not say anything about the data >>>>>>>> plane functions which probably means, the error recovery is >>>>>>> happening within the data plane thread. What happens to other data >>>>>>> plane threads that are polling the same port on which the error >>>>>>> recovery is happening? >>>>>>> >>>>>>> The commit log says: "the PMD sets the data path pointers to dummy >>>>>>> functions". >>>>>>> >>>>>>> So the data plane threads will receive non-packet and send zero >>>>>>> with port which in error recovery. >>>>>>> >>>>>>>> >>>>>>>> Also, the commit log says that while the error recovery is under >>>>>>>> progress, the application should not call any control plane APIs. >>>>>>>> Does >>>>>>> that mean, the application has to check for error condition every >>>>>>> time it calls a control plane API? >>>>>>> >>>>>>> If application has not register event >>>>>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control >>>>>>> plane API, but it will return failed. >>>>>>> If application has register above callback, it can wait for >>>>>>> recovery result, or direct call without wait but this will return >>>>>>> failed. >>>>>>> >>>>>>>> >>>>>>>> The commit message also says that "PMD makes sure the control path >>>>>>>> operations failed with retcode -EBUSY". It does not say how it >>>>>>> does this. But, any communication from the PMD thread to control >>>>>>> plane thread may introduce race conditions if not done correctly. >>>>>>> >>>>>>> First there are no PMD thread, do you mean eal-intr-thread ? >>>>>>> >>>>>>> As for this question, you can see PMDs which already implement it, >>>>>>> they both provides mutual exclusion protection. >>>>>>> >>>>>>>> >>>>>>>>> >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Would something like this work better? >>>>>>>>>>> >>>>>>>>>>> Note: there is another bug in current code. The check for link >>>>>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave >>>>>>>>>>> device in >>>>>>>>> indeterminate state. >>>>>>>>>>> The check should be done before calling PMD. >>>>>>>>>>> >>>>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c >>>>>>>>>>> index >>>>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644 >>>>>>>>>>> --- a/lib/ethdev/rte_ethdev.c >>>>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c >>>>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id) >>>>>>>>>>> return 0; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> + if (dev->data->dev_conf.intr_conf.lsc == 0 && >>>>>>>>>>> + dev->dev_ops->link_update == NULL) { >>>>>>>>>>> + RTE_ETHDEV_LOG(INFO, >>>>>>>>>>> + "Device with port_id=%"PRIu16" link update >>>>>>>>>>> +not >>>>>>>>> supported\n", >>>>>>>>>>> + port_id); >>>>>>>>>>> + return -ENOTSUP; >>>>>>>>>>> + } >>>>>>>>>>> + >>>>>>>>>>> ret = rte_eth_dev_info_get(port_id, &dev_info); >>>>>>>>>>> if (ret != 0) >>>>>>>>>>> return ret; >>>>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id) >>>>>>>>>>> eth_dev_mac_restore(dev, &dev_info); >>>>>>>>>>> >>>>>>>>>>> diag = (*dev->dev_ops->dev_start)(dev); >>>>>>>>>>> - if (diag == 0) >>>>>>>>>>> - dev->data->dev_started = 1; >>>>>>>>>>> - else >>>>>>>>>>> + if (diag != 0) >>>>>>>>>>> return eth_err(port_id, diag); >>>>>>>>>>> >>>>>>>>>>> ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ >>>>>>>>>>> -1611,16 >>>>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id) >>>>>>>>>>> return ret; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> - if (dev->data->dev_conf.intr_conf.lsc == 0) { >>>>>>>>>>> - if (*dev->dev_ops->link_update == NULL) >>>>>>>>>>> - return -ENOTSUP; >>>>>>>>>>> - (*dev->dev_ops->link_update)(dev, 0); >>>>>>>>>>> - } >>>>>>>>>>> - >>>>>>>>>>> /* expose selection of PMD fast-path functions */ >>>>>>>>>>> eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev); >>>>>>>>>>> >>>>>>>>>>> + /* ensure state is set before marking device ready */ >>>>>>>>>>> + rte_smp_wmb(); >>>>>>>>>>> + >>>>>>>>>>> rte_ethdev_trace_start(port_id); >>>>>>>>>>> + >>>>>>>>>>> + /* Update current link state */ >>>>>>>>>>> + if (dev->data->dev_conf.intr_conf.lsc == 0) >>>>>>>>>>> + (*dev->dev_ops->link_update)(dev, 0); >>>>>>>>>>> + >>>>>>>>>>> return 0; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> . >>>>>>>>>>> >>>>>>>> >>>> >