> On 3/6/2023 10:32 AM, Konstantin Ananyev wrote: > > > > > >>> -----Original Message----- > >>> From: Ferruh Yigit <ferruh.yi...@amd.com> > >>> Sent: Saturday, March 4, 2023 1:19 AM > >>> To: Konstantin Ananyev <konstantin.v.anan...@yandex.ru>; dev@dpdk.org; > >>> fengchengwen > >>> <fengcheng...@huawei.com>; Konstantin Ananyev > >>> <konstantin.anan...@huawei.com>; Honnappa > >>> Nagarahalli <honnappa.nagaraha...@arm.com>; Stephen Hemminger > >>> <step...@networkplumber.org>; > >>> Ruifeng Wang <ruifeng.w...@arm.com>; Ajit Khaparde > >>> (ajit.khapa...@broadcom.com) > >>> <ajit.khapa...@broadcom.com> > >>> Subject: Re: [PATCH 2/2] ethdev: fix race condition in fast-path ops setup > >>> > >>> On 2/26/2023 5:22 PM, Konstantin Ananyev wrote: > >>>> > >>>>>>>>>>>>>> If ethdev enqueue or dequeue function is called during > >>>>>>>>>>>>>> eth_dev_fp_ops_setup(), it may get pre-empted after setting > >>>>>>>>>>>>>> the function pointers, but before setting the pointer to port > >>>>>>>>>>>>>> data. > >>>>>>>>>>>>>> In this case the newly registered enqueue/dequeue function > >>>>>>>>>>>>>> will use dummy port data and end up in seg fault. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> This patch moves the updation of each data pointers before > >>>>>>>>>>>>>> updating corresponding function pointers. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Fixes: c87d435a4d79 ("ethdev: copy fast-path API into > >>>>>>>>>>>>>> separate > >>>>>>>>>>>>>> structure") > >>>>>>>>>>>>>> Cc: sta...@dpdk.org > >>>>>>>>>>> > >>>>>>>>>>> Why is something calling enqueue/dequeue when device is not > >>>>>>>>>>> fully > >>>>>>>>> started. > >>>>>>>>>>> A correctly written application would not call rx/tx burst > >>>>>>>>>>> until after ethdev start had finished. > >>>>>>>>>> > >>>>>>>>>> Please refer the eb0d471a894 (ethdev: add proactive error > >>>>>>>>>> handling mode), when driver recover itself, the application may > >>>>>>>>>> still invoke > >>>>>>>>> enqueue/dequeue API. > >>>>>>>>> > >>>>>>>>> Right now DPDK ethdev layer *does not* provide synchronization > >>>>>>>>> mechanisms between data-path and control-path functions. > >>>>>>>>> That was a deliberate deisgn choice. If we want to change that > >>>>>>>>> rule, then I suppose we need a community consensus for it. > >>>>>>>>> I think that if the driver wants to provide some sort of error > >>>>>>>>> recovery procedure, then it has to provide some synchronization > >>>>>>>>> mechanism inside it between data-path and control-path functions. > >>>>>>>>> Actually looking at eb0d471a894 (ethdev: add proactive error > >>>>>>>>> handling mode), and following patches I wonder how it creeped in? > >>>>>>>>> It seems we just introduced a loophole for race condition with > >>>>>>>>> this approach... > >>>>>>> > >>>>>>> Could you try to describe the specific scenario of loophole ? > >>>>>> > >>>>>> Ok, as I understand the existing mechanism: > >>>>>> > >>>>>> When PMD wants to start a recovery it has to: > >>>>>> - invoke > >>>>>> rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); > >>>>>> That supposed to call user provided callback. After callback is > >>>>>> finished PMD assumes > >>>>>> that user is aware that recovery is about to start and should > >>>>>> make some precautions. > >>>>>> - when recovery is finished it invokes another callback: > >>>>>> RTE_ETH_EVENT_RECOVERY_(SUCCESS/FAILED). After that user either > >>>>>> can continue to > >>>>>> use port or have to treat is as faulty. > >>>>>> > >>>>>> The idea is ok in principle, but there is a problem. > >>>>>> > >>>>>> lib/ethdev/rte_ethdev.h: > >>>>>> /** Port recovering from a hardware or firmware error. > >>>>>> * If PMD supports proactive error recovery, > >>>>>> * it should trigger this event to notify application > >>>>>> * that it detected an error and the recovery is being > >>>>>> started. > >>>>>> > >>>>>> <<< !!!!! > >>>>>> * Upon receiving the event, the application should not > >>>>>> invoke any control path API > >>>>>> * (such as rte_eth_dev_configure/rte_eth_dev_stop...) > >>>>>> until receiving > >>>>>> * RTE_ETH_EVENT_RECOVERY_SUCCESS or > >>>>>> RTE_ETH_EVENT_RECOVERY_FAILED event. > >>>>>> * The PMD will set the data path pointers to dummy > >>>>>> functions, > >>>>>> * and re-set the data path pointers to non-dummy functions > >>>>>> * before reporting RTE_ETH_EVENT_RECOVERY_SUCCESS event. > >>>>>> <<< !!!!! > >>>>>> > >>>>>> That part is just wrong I believe. > >>>>>> It should be: > >>>>>> Upon receiving the event, the application should not invoke any > >>>>>> *both control and data-path* API until receiving > >>>>>> RTE_ETH_EVENT_RECOVERY_SUCCESS or RTE_ETH_EVENT_RECOVERY_FAILED > >>>>>> event. > >>>>>> Resetting data path pointers to dummy functions by PMD *before* > >>>>>> invoking rte_eth_dev_callback_process(RTE_ETH_EVENT_ERR_RECOVERING); > >>>>>> introduces a race-condition with data-path threads, as such thread > >>>>>> could already be inside RX/TX function or can already read RX/TX > >>>>>> function/data pointers and be about to use them. > >>>>> > >>>>> Current practices: the PMDs already add some delay after set Rx/Tx > >>>>> callback to dummy, and plus the DPDK worker thread is busypolling, > >>>>> the probability of occurence in reality is zero. But in theoretically > >>>>> exist the above race-condition. > >>>> > >>>> > >>>> Adding delay might make a problem a bit less reproducible, but it > >>>> doesn't fix it. > >>>> The bug is still there. > >>>> > >>>> > >>>>> > >>>>>> And right now rte_ethdev layer doesn't provide any mechanism to > >>>>>> check it or wait when they'll finish, etc. > >>>>> > >>>>> Yes > >>>>> > >>>>>> > >>>>>> So, probably the simplest way to fix it with existing DPDK design: > >>>>>> - user level callback RTE_ETH_EVENT_ERR_RECOVERING should return > >>>>>> only after it ensures that *all* > >>>>>> application threads (and processes) stopped using either control > >>>>>> or data-path functions for that port > >>>>> > >>>>> Agree > >>>>> > >>>>>> (yes it means that application that wants to use this feature has > >>>>>> to provide its own synchronization mechanism > >>>>>> around data-path functions (RX/TX) that it is going to use). > >>>>>> - after that PMD is safe to reset rte_eth_fp_ops[] values to dummy > >>>>>> ones. > >>>>>> > >>>>>> And message to all PMD developers: > >>>>>> *please stop updating rte_eth_fp_ops[] on your own*. > >>>>>> That's a bad practice and it is not supposed to do things that way. > >>>>>> There is a special API provided for these purposes: > >>>>>> eth_dev_fp_ops_reset(), eth_dev_fp_ops_setup(), so use it. > >>>>> > >>>>> This two function is in private.h, so it should be expose to public > >>>>> header file. > >>>> > >>>> You mean we need to move these functions declarations into > >>>> ethdev_driver.h? > >>>> If so, then yes, I think we probably do. > >>>> > >>>> > >>> > >>> > >>> What about making slightly different version available to drivers, which > >>> only updates > >>> function pointers, but not 'fpo->rxq' / 'fpo->txq'. > >>> > >>> This way driver can switch to between dummy and real burst function > >>> without worrying Rx/Tx > >>> queue validity. > >>> > >>> @Chengwen, @Ruifeng, can this solve the issue for relaxed memory ordering > >>> systems? > >> > >> Yes, updating only function pointers removes the synchronization > >> requirement between function > >> pointer and qdata. > > > > Lads, that wouldn't work anyway. > > The race between recovery procedure and data-path persists: > > Recovery still has no idea is at given moment any thread doing RX/TX or > > not, and there is no > > way for it to know when such thread will finish. > > > Yes race condition persists, but as long as data (rxq/txq) stays valid, > does it cause a trouble? At lest this fixes the potential crash I think.
Yes, I believe it still would cause the trouble. We still have control thread and RX/TX threads simultaneously accessing rxq/txq data and probably trying to access/modify the same HW registers. With current ethdev design (no sync between control and daya-path) dev_fp_ops_setup() and RX/TX functions should not be called simultaneously. > > > We do need some synchronization mechanism between control(recovery) and > > data-path threads. > > I believe it is unavoidable. > > > >>> > >>> > >>> > >>>>>> > >>>>>> BTW, I don't see any implementation for > >>>>>> RTE_ETH_EVENT_ERR_RECOVERING within either testpmd or any other > >>>>>> example apps. > >>>>>> Am I missing something? > >>>>> > >>>>> Currently it just promote the event. > >>>> > >>>> > >>>> Ok, can I suggest then to add a proper usage for into in testpmd? > >>>> It looks really strange that we add new feature into ethdev (and 2 > >>>> PMDs), but didn't provide any way for users to test it. > >>>> > >>>>> > >>>>>> If not, then probably it could be a good starting point - let's > >>>>>> incorporate it inside testpmd (new forwarding engine probably) so > >>>>>> everyone can test/try it. > >>>>>> > >>>>>> * It means that the application cannot send or receive any > >>>>>> packets > >>>>>> * during this period. > >>>>>> * @note Before the PMD reports the recovery result, > >>>>>> * the PMD may report the RTE_ETH_EVENT_ERR_RECOVERING > >>>>>> event again, > >>>>>> * because a larger error may occur during the recovery. > >>>>>> */ > >>>>>> RTE_ETH_EVENT_ERR_RECOVERING, > >>>>>> > >>>>>>>>> It probably needs to be either deprecated or reworked. > >>>>>>>> Looking at the commit, it does not say anything about the data > >>>>>>>> plane functions which probably means, the error recovery is > >>>>>>> happening within the data plane thread. What happens to other data > >>>>>>> plane threads that are polling the same port on which the error > >>>>>>> recovery is happening? > >>>>>>> > >>>>>>> The commit log says: "the PMD sets the data path pointers to dummy > >>>>>>> functions". > >>>>>>> > >>>>>>> So the data plane threads will receive non-packet and send zero > >>>>>>> with port which in error recovery. > >>>>>>> > >>>>>>>> > >>>>>>>> Also, the commit log says that while the error recovery is under > >>>>>>>> progress, the application should not call any control plane APIs. > >>>>>>>> Does > >>>>>>> that mean, the application has to check for error condition every > >>>>>>> time it calls a control plane API? > >>>>>>> > >>>>>>> If application has not register event > >>>>>>> (RTE_ETH_EVENT_ERR_RECOVERING) callback, it could calls control > >>>>>>> plane API, but it will return failed. > >>>>>>> If application has register above callback, it can wait for > >>>>>>> recovery result, or direct call without wait but this will return > >>>>>>> failed. > >>>>>>> > >>>>>>>> > >>>>>>>> The commit message also says that "PMD makes sure the control path > >>>>>>>> operations failed with retcode -EBUSY". It does not say how it > >>>>>>> does this. But, any communication from the PMD thread to control > >>>>>>> plane thread may introduce race conditions if not done correctly. > >>>>>>> > >>>>>>> First there are no PMD thread, do you mean eal-intr-thread ? > >>>>>>> > >>>>>>> As for this question, you can see PMDs which already implement it, > >>>>>>> they both provides mutual exclusion protection. > >>>>>>> > >>>>>>>> > >>>>>>>>> > >>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> Would something like this work better? > >>>>>>>>>>> > >>>>>>>>>>> Note: there is another bug in current code. The check for link > >>>>>>>>>>> state interrupt and link_ops could return -ENOTSUP and leave > >>>>>>>>>>> device in > >>>>>>>>> indeterminate state. > >>>>>>>>>>> The check should be done before calling PMD. > >>>>>>>>>>> > >>>>>>>>>>> diff --git a/lib/ethdev/rte_ethdev.c b/lib/ethdev/rte_ethdev.c > >>>>>>>>>>> index > >>>>>>>>>>> 0266cc82acb6..d6c163ed85e7 100644 > >>>>>>>>>>> --- a/lib/ethdev/rte_ethdev.c > >>>>>>>>>>> +++ b/lib/ethdev/rte_ethdev.c > >>>>>>>>>>> @@ -1582,6 +1582,14 @@ rte_eth_dev_start(uint16_t port_id) > >>>>>>>>>>> return 0; > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> + if (dev->data->dev_conf.intr_conf.lsc == 0 && > >>>>>>>>>>> + dev->dev_ops->link_update == NULL) { > >>>>>>>>>>> + RTE_ETHDEV_LOG(INFO, > >>>>>>>>>>> + "Device with port_id=%"PRIu16" link update > >>>>>>>>>>> +not > >>>>>>>>> supported\n", > >>>>>>>>>>> + port_id); > >>>>>>>>>>> + return -ENOTSUP; > >>>>>>>>>>> + } > >>>>>>>>>>> + > >>>>>>>>>>> ret = rte_eth_dev_info_get(port_id, &dev_info); > >>>>>>>>>>> if (ret != 0) > >>>>>>>>>>> return ret; > >>>>>>>>>>> @@ -1591,9 +1599,7 @@ rte_eth_dev_start(uint16_t port_id) > >>>>>>>>>>> eth_dev_mac_restore(dev, &dev_info); > >>>>>>>>>>> > >>>>>>>>>>> diag = (*dev->dev_ops->dev_start)(dev); > >>>>>>>>>>> - if (diag == 0) > >>>>>>>>>>> - dev->data->dev_started = 1; > >>>>>>>>>>> - else > >>>>>>>>>>> + if (diag != 0) > >>>>>>>>>>> return eth_err(port_id, diag); > >>>>>>>>>>> > >>>>>>>>>>> ret = eth_dev_config_restore(dev, &dev_info, port_id); @@ > >>>>>>>>>>> -1611,16 > >>>>>>>>>>> +1617,18 @@ rte_eth_dev_start(uint16_t port_id) > >>>>>>>>>>> return ret; > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> - if (dev->data->dev_conf.intr_conf.lsc == 0) { > >>>>>>>>>>> - if (*dev->dev_ops->link_update == NULL) > >>>>>>>>>>> - return -ENOTSUP; > >>>>>>>>>>> - (*dev->dev_ops->link_update)(dev, 0); > >>>>>>>>>>> - } > >>>>>>>>>>> - > >>>>>>>>>>> /* expose selection of PMD fast-path functions */ > >>>>>>>>>>> eth_dev_fp_ops_setup(rte_eth_fp_ops + port_id, dev); > >>>>>>>>>>> > >>>>>>>>>>> + /* ensure state is set before marking device ready */ > >>>>>>>>>>> + rte_smp_wmb(); > >>>>>>>>>>> + > >>>>>>>>>>> rte_ethdev_trace_start(port_id); > >>>>>>>>>>> + > >>>>>>>>>>> + /* Update current link state */ > >>>>>>>>>>> + if (dev->data->dev_conf.intr_conf.lsc == 0) > >>>>>>>>>>> + (*dev->dev_ops->link_update)(dev, 0); > >>>>>>>>>>> + > >>>>>>>>>>> return 0; > >>>>>>>>>>> } > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> . > >>>>>>>>>>> > >>>>>>>> > >>>> > >