Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

Michael S. Tsirkin Wed, 30 Nov 2016 19:21:41 -0800

On Thu, Dec 01, 2016 at 02:48:59AM +0000, wangyunjian wrote:
> 
> >-----Original Message-----
> >From: Michael S. Tsirkin [mailto:m...@redhat.com] 
> >Sent: Wednesday, November 30, 2016 9:41 PM
> >To: wangyunjian
> >Cc: jasow...@redhat.com; netdev@vger.kernel.org; 
> >linux-ker...@vger.kernel.org; caihe
> >Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when 
> >meet errors
> >
> >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> >> When we meet an error(err=-EBADFD) recvmsg,
> >
> >How do you get EBADFD? Won't vhost_net_rx_peek_head_len
> >return 0 in this case, breaking the loop?
> 
> We started many guest VMs while attaching/detaching some virtio-net nics for 
> loop. 
> The soft lockup might happened. The err is -EBADFD.
>


OK, I'd like to figure out what happened here. why don't
we get 0 when we peek at the head?

EBADFD is from here:
        struct tun_struct *tun = __tun_get(tfile);
...
        if (!tun)
                return -EBADFD;

but then:
static int tun_peek_len(struct socket *sock)
{       

...

        struct tun_struct *tun;
...
        
        tun = __tun_get(tfile);
        if (!tun) 
                return 0;


so peek len should return 0.

then while will exit:
        while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk)))
...


> meesage log:
> kernel:[609608.510180]BUG: soft lockup - CPU#18 stuck for 23s! 
> [vhost-60898:126093]
> call trace:
> [<fffffffa0132967>]vhost_get_vq_desc+0x1e7/0x984 [vhost]
> [<fffffffa02037e6>]handle_rx+0x226/0x810 [vhost_net]
> [<fffffffa0203de5>]handle_rx_net+0x15/0x20 [vhost_net]
> [<fffffffa013160b>]vhost_worker+0xfb/0x1e0 [vhost]
> [<fffffffa0131510>]? vhost_dev_reset_owner+0x50/0x50 [vhost]
> [<fffffff810a5c7f>]kthread+0xcf/0xe0
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140
> [<fffffff81648898>]ret_from_fork+0x58/0x90
> [<fffffff810a5bb0>]? kthread_create_on_node+0x140/0x140

So somehow you keep seeing something in tun when we peek.
IMO this is the problem we should try to fix.
Could you try debugging the root cause here?

> >> the error handling in vhost
> >> handle_rx() will continue. This will cause a soft CPU lockup in vhost 
> >> thread.
> >> 
> >> Signed-off-by: Yunjian Wang <wangyunj...@huawei.com>
> >> ---
> >>  drivers/vhost/net.c | 3 +++
> >>  1 file changed, 3 insertions(+)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 5dc128a..edc470b 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> >>                    pr_debug("Discarded rx packet: "
> >>                             " len %d, expected %zd\n", err, sock_len);
> >>                    vhost_discard_vq_desc(vq, headcount);
> >> +                  /* Don't continue to do, when meet errors. */
> >> +                  if (err < 0)
> >> +                          goto out;
> >
> >You might get e.g. EAGAIN and I think you need to retry
> >in this case.
> >
> >>                    continue;
> >>            }
> >>            /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> >> -- 
> >> 1.9.5.msysgit.1
> >>

Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

Reply via email to