On 12/22/2019 5:55 PM, Stephen Hemminger wrote:
> This fixes a deadlock when using KNI with bifurcated drivers.
> Bringing kni device up always times out when using Mellanox
> devices.
>
> The kernel KNI driver sends message to userspace to complete
> the request. For the case of bifurcated driver, this may involve
> an additional request to kernel to change state. This request
> would deadlock because KNI was holding the RTNL mutex.
>
> This was a bad design which goes back to the original code.
> A workaround is for KNI driver to drop RTNL while waiting.
> To prevent the device from disappearing while the operation
> is in progress, it needs to hold reference to network device
> while waiting.
>
> As an added benefit, an useless error check can also be removed.
>
> Fixes: 3fc5ca2f6352 ("kni: initial import")
> Cc: sta...@dpdk.org
> Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
> ---
>  kernel/linux/kni/kni_net.c | 34 ++++++++++++++++++----------------
>  1 file changed, 18 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/linux/kni/kni_net.c b/kernel/linux/kni/kni_net.c
> index 1ba9b1b99f66..b7337c1410b8 100644
> --- a/kernel/linux/kni/kni_net.c
> +++ b/kernel/linux/kni/kni_net.c
> @@ -17,6 +17,7 @@
>  #include <linux/skbuff.h>
>  #include <linux/kthread.h>
>  #include <linux/delay.h>
> +#include <linux/rtnetlink.h>
>
>  #include <rte_kni_common.h>
>  #include <kni_fifo.h>
> @@ -102,17 +103,15 @@ get_data_kva(struct kni_dev *kni, void *pkt_kva)
>   * It can be called to process the request.
>   */
>  static int
> -kni_net_process_request(struct kni_dev *kni, struct rte_kni_request *req)
> +kni_net_process_request(struct net_device *dev, struct rte_kni_request *req)
>  {
> +	struct kni_dev *kni = netdev_priv(dev);
>  	int ret = -1;
>  	void *resp_va;
>  	uint32_t num;
>  	int ret_val;
>
> -	if (!kni || !req) {
> -		pr_err("No kni instance or request\n");
> -		return -EINVAL;
> -	}
> +	ASSERT_RTNL();
>
>  	mutex_lock(&kni->sync_lock);
>
> @@ -125,8 +124,17 @@ kni_net_process_request(struct kni_dev *kni, struct rte_kni_request *req)
>  		goto fail;
>  	}
>
> +	/* Since we need to wait and RTNL mutex is held
> +	 * drop the mutex and hold refernce to keep device
> +	 */
> +	dev_hold(dev);
> +	rtnl_unlock();
> +
>  	ret_val = wait_event_interruptible_timeout(kni->wq,
>  			kni_fifo_count(kni->resp_q), 3 * HZ);
> +	rtnl_lock();
> +	dev_put(dev);
> +
>  	if (signal_pending(current) || ret_val <= 0) {
>  		ret = -ETIME;
>  		goto fail;
<...>

This patch causes a hang on my server. I am not sure what exactly the problem was, but the kernel log was continuously printing "Cannot send to req_q". Will dig more.