Maxime, First of all, thank you very much for your detailed analysis and pinpoint the possible problem area. Is it fair for me to ask the question that when the client processes the message VHOST_USER_GET_VRING_BASE, it must respond it with the existing last_avail_idx but not 0? I could not find any document about this except this statement in the url
Client must start ring upon receiving a kick (that is, detecting that file descriptor is readable) on the descriptor specified by VHOST_USER_SET_VRING_KICK, and stop ring upon receiving VHOST_USER_GET_VRING_BASE. https://github.com/qemu/qemu/blob/master/docs/specs/vhost-user.txt As you see, the spec only mentions stop the ring, but nothing about the last_avail_idx. If you know where the spec/document says that the client must respond it with the existing last_avail_idx, please point me to read about it. Steven On 3/8/17, 8:10 AM, "Maxime Coquelin" <maxime.coque...@redhat.com> wrote: >Steven, > >On 03/08/2017 04:44 PM, Steven Luong (sluong) wrote: >> Maxime, >> >> Last I heard csit uses qemu 2.5. Could you please check if the >> aforementioned patch exists in that qemu version? >In upstream QEMU, the patch landed into v2.7.0 tag, so if you are using >upstream version then the patch is not there. > >Regards, >Maxime > >> >> Steven >> >> On 3/8/17, 7:36 AM, "vpp-dev-boun...@lists.fd.io on behalf of Maxime >> Coquelin" <vpp-dev-boun...@lists.fd.io on behalf of >> maxime.coque...@redhat.com> wrote: >> >>> Tom, >>> >>> On 03/08/2017 03:59 PM, Thomas F Herbert wrote: >>>> Maxime, >>>> >>>> The >>>> >>>> OK. Thanks. I am forwarding this to Pierre as he has been working on >>>> vhost-user. >>>> >>>> I thought we were now using dpdk's vhost-user driver as of 17.01. >>> No it isn't. >>> I would be interested in the reasons not doing the move, and propose my >>> help if this is in VPP project plans. >>> >>>> >>>> It is interesting how we haven't seen this in our csit testing and we >>>> are testing multi-queue. Also will this manifest only on certain >>>> revisions of Qemu/KVM? >>> >>> My understanding is: >>> Before VPP 17.01, migration was fine whatever the QEMU version. >>> The bug has been introduced in VPP 17.01 in commit e21c5286. From this >>> commit, VPP will always reply 0 to GET_VRING_BASE whatever the real >>> last_avail_idx value. >>> >>> But, depending on QEMU version, the problem will or won't be detected. >>> If QEMU doesn't contain commit "virtio: recalculate vq->inuse after >>> migration", then the bug will be silent and last_avail_idx will be >>> wrongly restored in destination backend. >>> If QEMU contains this patch, problem is detected and migration fails. >>> >>> Do you know which QEMU version you are using in your CSIT testing? >>> I can check whether it contains this patch. >>> >>> Regards, >>> Maxime >>>> >>>> --TFH >>>> >>>> -------- Forwarded Message -------- >>>> Subject: [Bug 1422534] vdev->vq[i] .used_idx does not consider the >>>> right value for vhostuser >>>> Date: Wed, 08 Mar 2017 08:56:49 +0000 >>>> From: bugzi...@redhat.com >>>> To: therb...@redhat.com >>>> >>>> >>>> >>>> https://bugzilla.redhat.com/show_bug.cgi?id=1422534 >>>> >>>> >>>> >>>> --- Comment #27 from Maxime Coquelin <maxime.coque...@redhat.com> --- >>>> Tom, >>>> >>>> I had a look at mainline VPP: >>>> $ build-root/scripts/version >>>> 17.04-rc0~382-ga0b34a7 >>>> >>>> Looking at 16.06 release, it seems to be a regression, as the vring >>>> struct is >>>> not reset before getting last_avail_idx, only the enable bit is >>>>cleared: >>>> >>>> case VHOST_USER_GET_VRING_BASE: >>>> DBG_SOCK("if %d msg VHOST_USER_GET_VRING_BASE idx %d num %d", >>>> vui->hw_if_index, msg.state.index, msg.state.num); >>>> >>>> /* Spec says: Client must [...] stop ring upon receiving >>>> VHOST_USER_GET_VRING_BASE. */ >>>> vui->vrings[msg.state.index].enabled = 0; >>>> >>>> msg.state.num = vui->vrings[msg.state.index].last_avail_idx; >>>> msg.flags |= 4; >>>> msg.size = sizeof(msg.state); >>>> break; >>>> >>>> The regression seems to have been introduced by below commit, which >>>> landed into >>>> v17.01 release: >>>> >>>> $ git blame src/vnet/devices/virtio/vhost-user.c -L 1031 >>>> e21c5286 vnet/vnet/devices/virtio/vhost-user.c (Pierre Pfister >>>> 2016-09-21 >>>> 08:04:59 +0100 1031) vhost_user_vring_close (vui, >>>> msg.state.index); >>>> >>>> $ git show e21c5286 >>>> commit e21c52861d7c503bef7fc464f23bbc7891e150d7 >>>> Author: Pierre Pfister <ppfis...@cisco.com> >>>> Date: Wed Sep 21 08:04:59 2016 +0100 >>>> >>>> vhost-user: multiqueue support >>>> >>>> This patch adds multi-queue support to non-dpdk's vhost-user >>>> driver. >>>> Waiting for a unified way to manage threads, this patch >>>> defines a way to assign threads to interfaces that is >>>> specific to vhost. >>>> >>>> Change-Id: I86298788b1a4e886c5431f187dc17175d12c7a8b >>>> Signed-off-by: Pierre Pfister <ppfis...@cisco.com> >>>> >>>> $ git tag --contains e21c5286 >>>> v17.01 >>>> v17.01-rc1 >>>> v17.01-rc2 >>>> v17.01.1 >>>> v17.04-rc0 >>>> We should invite Pierre Pfister from Cisco in the discussion, >>>> as he might have some ideas on the fix to do. >>>> >>>> I guess that reverting to just clearing the enabled bit is not the >>>> right fix. >>>> Maybe the vring struct shouldn't be re-initialized in the close >>>> function, >>>> but should be called directly when needed. >>>> >>>> I'm attaching a debug patch to confirm the root cause if someone want >>>> to have a >>>> try. >>>> It is not intended to be the fix for this issue, and has not neither >>>> been >>>> compiled nor tested. >>>> >>>> Regards, >>>> Maxime >>>> >>>> -- >>>> You are receiving this mail because: >>>> You are on the CC list for the bug. >>>> Unsubscribe from this bug >>>> https://bugzilla.redhat.com/token.cgi?t=Y79DLPsrVV&a=cc_unsubscribe >>>> >>> _______________________________________________ >>> vpp-dev mailing list >>> vpp-dev@lists.fd.io >>> https://lists.fd.io/mailman/listinfo/vpp-dev >> _______________________________________________ vpp-dev mailing list vpp-dev@lists.fd.io https://lists.fd.io/mailman/listinfo/vpp-dev