Hi, Owing to my current development schedule and obligations, I see no opportunity to make this set of changes in the near future.
Sorry, Elad. בתאריך יום ב׳, 15 במרץ 2021, 19:17, מאת Ferruh Yigit < ferruh.yi...@intel.com>: > On 2/25/2021 2:32 PM, Elad Nachman wrote: > > This part of the series includes my fixes for the issues reported > > by Ferruh and Igor (and Igor comments for v3 of the patch) > > on top of part 1 of the patch series: > > > > A. KNI sync lock is being locked while rtnl is held. > > If two threads are calling kni_net_process_request() , > > then the first one will take the sync lock, release rtnl lock then sleep. > > The second thread will try to lock sync lock while holding rtnl. > > The first thread will wake, and try to lock rtnl, resulting in a > deadlock. > > The remedy is to release rtnl before locking the KNI sync lock. > > Since in between nothing is accessing Linux network-wise, > > no rtnl locking is needed. > > > > B. There is a race condition in __dev_close_many() processing the > > close_list while the application terminates. > > It looks like if two vEth devices are terminating, > > and one releases the rtnl lock, the other takes it, > > updating the close_list in an unstable state, > > causing the close_list to become a circular linked list, > > hence list_for_each_entry() will endlessly loop inside > > __dev_close_many() . > > Since the description for the original patch indicate the > > original motivation was bringing the device up, > > I have changed kni_net_process_request() to hold the rtnl mutex > > in case of bringing the device down since this is the path called > > from __dev_close_many() , causing the corruption of the close_list. > > In order to prevent deadlock in Mellanox device in this case, the > > code has been modified not to wait for user-space while holding > > the rtnl lock. > > Instead, after the request has been sent, all locks are relinquished > > and the function exits immediately with return code of zero (success). > > > > To summarize: > > request != interface down : unlock rtnl, send request to user-space, > > wait for response, send the response error code to caller in user-space. > > > > request == interface down: send request to user-space, return immediately > > with error code of 0 (success) to user-space. > > > > Signed-off-by: Elad Nachman <ela...@gmail.com> > > > > > > --- > > v4: > > * for if down case, send asynchronously with rtnl locked and without > > wait, returning immediately to avoid both kernel race conditions > > and deadlock in user-space > > v3: > > * Include original patch and new patch as a series of patch, added a > > comment to the new patch > > v2: > > * rebuild the patch as increment from patch 64106 > > * fix comment and blank lines > > --- > > kernel/linux/kni/kni_net.c | 41 +++++++++++++++++++++++++++------ > > lib/librte_kni/rte_kni.c | 7 ++++-- > > lib/librte_kni/rte_kni_common.h | 1 + > > 3 files changed, 40 insertions(+), 9 deletions(-) > > > > diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c > > index f0b6e9a8d..ba991802b 100644 > > --- a/kernel/linux/kni/kni_net.c > > +++ b/kernel/linux/kni/kni_net.c > > @@ -110,12 +110,34 @@ kni_net_process_request(struct net_device *dev, > struct rte_kni_request *req) > > void *resp_va; > > uint32_t num; > > int ret_val; > > + int req_is_dev_stop = 0; > > + > > + /* For configuring the interface to down, > > + * rtnl must be held all the way to prevent race condition > > + * inside __dev_close_many() between two netdev instances of KNI > > + */ > > + if (req->req_id == RTE_KNI_REQ_CFG_NETWORK_IF && > > + req->if_up == 0) > > + req_is_dev_stop = 1; > > Having this request type checks in the 'kni_net_process_request()' > function > looks like hack. > Since adding a new field into the "struct rte_kni_request", that can be a > more > generic 'asnyc' field, and the requested function, like > 'kni_net_release()' can > set it to support async requests. > > And can you please separate the function to add a more generic async > request > support on its patch, which should do: > - add new 'asnyc' field to "struct rte_kni_request" > - in 'kni_net_process_request()', if 'req->async' set, do not wait for > response > - in library, 'lib/librte_kni/rte_kni.c', in 'rte_kni_handle_request()' > function, if the request is async don't put the response > (These are already done in this patch) > > Overall it can be three patch set: > 1) Function parameter change > 2) Add generic async request support (with documentation update) > 3) rtnl unlock and make 'kni_net_release()' request async (actual fix) > (We can discuss more if to make 'kni_net_release()' async with a kernel > parameter or not) > > What do you think, does it make sense? > >