Hi Kyle,
On 9/2/2016 8:35 PM, Kyle Larose wrote:
>
>> -----Original Message-----
>> From: Tan, Jianfeng [mailto:jianfeng.tan at intel.com]
>> Sent: Friday, September 02, 2016 2:56 AM
>> To: Kyle Larose; dev at dpdk.org
>> Cc: huawei.xie at intel.com; yuanhan.liu at linux.intel.com
>> Subject: Re: virtio kills qemu VM after stopping/starting ports
>>
>> Hi Kyle,
>>
>> On 9/2/2016 4:53 AM, Kyle Larose wrote:
>>> Hello everyone,
>>>
>>> In my own testing, I recently stumbled across an issue where I could
>>> get qemu to exit when sending traffic to my application. To do this, I
>>> simply needed to do the following:
>>>
>>> 1) Start my virtio interfaces
>>> 2) Send some traffic into/out of the interfaces
>>> 3) Stop the interfaces
>>> 4) Start the interfaces
>>> 5) Send some more traffic
>>>
>>> At this point, I would lose connectivity to my VM. Further
>>> investigation revealed qemu exiting with the following log:
>>>
>>> 2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1
>>>
>>> I found the following bug report against qemu, reported by a user of
>>> DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
>>>
>>> That thread seems to have stalled out, so I think we probably should
>>> deal with the problem within DPDK itself. Either way, later in the bug
>>> report chain, we see a link to this patch to DPDK:
>>> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of
>>> the bug report claims that this patch fixes the problem. Perhaps it does.
>>
>> I once got a chance to analyze the bug you referred to here. The patch
>> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix
>> that bug. The root cause of that bug is: when the DPDK application gets
>> killed, nobody tells the vhost backend to stop. So when the DPDK app is
>> restarted, those hugepages are reused and initialized to all zero, and
>> unavoidably the "idx" in that memory is reset to 0. I have written a
>> patch to notify the vhost backend to stop when the DPDK app gets killed
>> suddenly (more precisely, when /dev/uioX gets closed), and that patch
>> will be sent out soon. It does not fix your problem either: you did not
>> kill the app. (I should not go into details about that bug here; I only
>> mean that they are different problems.)
>>
>>> However, it introduces a new problem: if I remove the patch, I cannot
>>> reproduce the problem. So, that leads me to believe that it has caused
>>> a regression.
>>>
>>> To summarize the patch's changes, it basically changes the
>>> virtio_dev_stop function to flag the device as stopped, and stops the
>>> device when closing/uninitializing it. However, there is a seemingly
>>> unintended side-effect.
>>>
>>> In virtio_dev_start, we have the following block of code:
>>>
>>> 	/* On restart after stop do not touch queues */
>>> 	if (hw->started)
>>> 		return 0;
>>>
>>> 	/* Do final configuration before rx/tx engine starts */
>>> 	virtio_dev_rxtx_start(dev);
>>>
>>> 	...
>>>
>>> Prior to the patch, if an interface were stopped and then started,
>>> without restarting the application, the queues would be left as-is,
>>> because hw->started would be set to 1.
>>
>> Yes, my previous patch did break this behavior (by stopping and
>> re-starting the device, the queues would be left as-is) and leads to the
>> problem here, and you are proposing to restore it.
>>
>> But is this the right behavior to follow?
>> On the one hand, when we stop the virtio device, should we notify the
>> vhost backend to stop too? Currently, we just flag it as stopped.
>> On the other hand, in many PMDs we now mix the device
>> initialization/configuration code into the dev_start functions, that is
>> to say, we re-configure the device every time we start it (maybe to make
>> sure that a changed configuration takes effect). Then what if we
>> increase the number of virtio queues?
> Hi Jianfeng,
>
> Let me see if my understanding is correct. You're saying that there is a
> problem introduced by your patch, but going back to the original
> behaviour is not ideal: it still leaves a gap with respect to changing
> the number of queues with virtio. The ideal solution would have us
> properly tear down the device when it is stopped, and then properly
> reinitialize it when it is started? That seems reasonable. However, I'm
> not sure what it would take.

Yes, exactly.
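To make that a bit more concrete, I imagine the stop/start paths could end
up looking roughly like the sketch below. Please take it as pseudo-code for
discussion rather than a tested patch: a real change also has to re-do
feature negotiation and re-share the memory regions with the backend, and I
am only using vtpci_reset() here as a stand-in for "tell the backend to
stop".

/* Sketch only: really stop the device when the port is stopped, and
 * rebuild the queues every time it is started, so that a changed
 * configuration (e.g. more queues) takes effect. */
static void
virtio_dev_stop(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* Resetting makes the backend forget its avail/used indexes,
	 * so they cannot appear to move "backwards" on restart. */
	vtpci_reset(hw);
	hw->started = 0;
}

static int
virtio_dev_start(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* No early return on hw->started: always re-setup the queues
	 * and share them with the backend again. */
	virtio_dev_rxtx_start(dev);
	hw->started = 1;

	return 0;
}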
> Does it make sense to submit a patch like I suggested in the short term,
> and have somebody work on a more robust long-term solution? How could we
> go about making sure that happens? (I can't allocate much time to fixing
> this, so I likely won't be able to do anything more complicated than what
> I've proposed.)

Let's see whether the maintainers have any suggestions here. (And I
personally do not oppose your proposal for the short term.)

Thanks,
Jianfeng
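PS: just to make sure we are talking about the same short-term change, I
read your proposal as roughly the below (illustrative only, untested, and
the unrelated parts of both functions are omitted): keep the queues
untouched across stop/start, and treat only close/uninit as a real stop.

static void
virtio_dev_stop(struct rte_eth_dev *dev)
{
	/* Existing link-down/interrupt handling stays as it is today
	 * (omitted here).  The point is that hw->started is NOT
	 * cleared, so a later virtio_dev_start() still takes the
	 * "on restart after stop do not touch queues" early return
	 * and the backend's used index stays consistent. */
}

static void
virtio_dev_close(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* Only a full close/uninit marks the device as stopped. */
	hw->started = 0;
}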