Hi,

Didn't see this patch previously, but we came up with the same idea internally and also hit a hang during application shutdown. We didn't dig deep, but it occurred in the kni_release function.
Igor

On Mon, Jul 27, 2020 at 8:53 PM Stephen Hemminger <step...@networkplumber.org> wrote:

> On Mon, 27 Jul 2020 18:33:08 +0100
> Ferruh Yigit <ferruh.yi...@intel.com> wrote:
>
> > On 5/6/2020 1:14 AM, Stephen Hemminger wrote:
> > > On Wed, 18 Mar 2020 16:17:57 +0100
> > > Thomas Monjalon <tho...@monjalon.net> wrote:
> > >
> > >> 17/01/2020 17:43, Ferruh Yigit:
> > >>> On 12/22/2019 5:55 PM, Stephen Hemminger wrote:
> > >>>> This fixes a deadlock when using KNI with bifurcated drivers.
> > >>>> Bringing kni device up always times out when using Mellanox
> > >>>> devices.
> > >>>>
> > >>>> The kernel KNI driver sends message to userspace to complete
> > >>>> the request. For the case of bifurcated driver, this may involve
> > >>>> an additional request to kernel to change state. This request
> > >>>> would deadlock because KNI was holding the RTNL mutex.
> > >>>>
> > >>>> This was a bad design which goes back to the original code.
> > >>>> A workaround is for KNI driver to drop RTNL while waiting.
> > >>>> To prevent the device from disappearing while the operation
> > >>>> is in progress, it needs to hold reference to network device
> > >>>> while waiting.
> > >>>>
> > >>>> As an added benefit, a useless error check can also be removed.
> > >>>>
> > >>>> Fixes: 3fc5ca2f6352 ("kni: initial import")
> > >>>> Cc: sta...@dpdk.org
> > >>>> Signed-off-by: Stephen Hemminger <step...@networkplumber.org>
> > >>>> ---
> > >>>
> > >>> This patch causes a hang on my server, not sure what exactly was the problem but
> > >>> kernel log was continuously printing "Cannot send to req_q". Will dig more.
> > >>
> > >> Ferruh, did you have a chance to check what is hanging?
> > >> Stephen, is there any news on your side?
> > >>
> > >>
> > >
> > > It did not hang when I tested it.
> > > The bug report is still open
> > >
> >
> > Sorry for the delay, since I am working remotely I was worried about losing the
> > connection to my server, finally I did create a virtual environment to test again.
> >
> > I confirm the hang is observed 100% when two different processes update the kni
> > interface, like two different processes set the mtu. Without this patch this
> > works fine.
> >
> > I understand the motivation of the patch, but with this change there is a possibility
> > to hang the server, which we can't allow, need to find another way. Can updating
> > mlx interface wait KNI interface operation to complete?
>
> Still KNI driver is broken. Calling userspace with RTNL held is fundamentally
> broken design. If KNI were to be incorporated in upstream kernel, then the netdev
> developer would see this.
>
> Whatever solution you think is best.
> I will continue to recommend against anyone using KNI.
>
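
For anyone following the thread, the deadlock described in the quoted commit message (waiting on userspace while holding RTNL, when the userspace handler itself needs RTNL) can be illustrated with a small userspace analogy. This is only a sketch under stated assumptions, not the KNI driver code and not the patch: `rtnl`, `Dev`, and `kni_request` are hypothetical names, a Python lock stands in for the RTNL mutex, and a thread stands in for the userspace request handler. It does not model the two-process hang Ferruh reported or the hang in kni_release.

```python
import threading

# Stand-in for the kernel's RTNL mutex (assumption: a plain lock is enough
# to show the ordering problem).
rtnl = threading.Lock()


class Dev:
    """Hypothetical stand-in for a network device with a reference count."""

    def __init__(self):
        self.refs = 1
        self.state = "down"


def kni_request(dev, drop_rtnl, wait=0.5):
    """Simulate a request issued with RTNL held.

    Returns True if the handler answered before the timeout.
    """
    done = threading.Event()

    def handler():
        # Bifurcated-driver case: completing the request needs RTNL too.
        with rtnl:
            dev.state = "up"
        done.set()

    rtnl.acquire()                 # the caller enters with RTNL held
    worker = threading.Thread(target=handler)
    worker.start()

    if drop_rtnl:
        dev.refs += 1              # hold a reference so dev cannot vanish
        rtnl.release()             # let the handler take RTNL
        ok = done.wait(wait)
        rtnl.acquire()             # restore the caller's locking state
        dev.refs -= 1
    else:
        ok = done.wait(wait)       # handler is blocked on RTNL: times out

    rtnl.release()
    worker.join()
    return ok


if __name__ == "__main__":
    print("RTNL held while waiting:   ", kni_request(Dev(), drop_rtnl=False))
    print("RTNL dropped while waiting:", kni_request(Dev(), drop_rtnl=True))
```

With the lock held across the wait, the handler cannot make progress until the request has already timed out, which matches the "always times out" symptom in the commit message. Dropping the lock lets the handler finish, at the cost of a window in which other RTNL users can run, which is presumably where the reported hangs come from.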