Hi Kyle,
On 9/2/2016 8:35 PM, Kyle Larose wrote:
>
>> -----Original Message-----
>> From: Tan, Jianfeng [mailto:jianfeng.tan at intel.com]
>> Sent: Friday, September 02, 2016 2:56 AM
>> To: Kyle Larose; dev at dpdk.org
>> Cc: huawei.xie at intel.com; yuanhan.liu at linux.intel.com
>> Subject: Re: virtio kills qemu VM after stopping/starting ports
>>
>> Hi Kyle,
>>
>> On 9/2/2016 4:53 AM, Kyle Larose wrote:
>>> Hello everyone,
>>>
>>> In my own testing, I recently stumbled across an issue where I could
>>> get qemu to exit when sending traffic to my application. To do this, I
>>> simply needed to do the following:
>>>
>>> 1) Start my virtio interfaces
>>> 2) Send some traffic into/out of the interfaces
>>> 3) Stop the interfaces
>>> 4) Start the interfaces
>>> 5) Send some more traffic
>>>
>>> At this point, I would lose connectivity to my VM. Further
>>> investigation revealed qemu exiting with the following log:
>>>
>>> 2016-09-01T15:45:32.119059Z qemu-kvm: Guest moved used index from 5 to 1
>>>
>>> I found the following bug report against qemu, reported by a user of
>>> DPDK: https://bugs.launchpad.net/qemu/+bug/1558175
>>>
>>> That thread seems to have stalled out, so I think we probably should
>>> deal with the problem within DPDK itself. Either way, later in the bug
>>> report chain, we see a link to this patch to DPDK:
>>> http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648. The submitter of
>>> the bug report claims that this patch fixes the problem. Perhaps it does.
>>
>> I once got a chance to analyze the bug you referred to here. The patch
>> (http://dpdk.org/browse/dpdk/commit/?id=9a0615af774648) does not fix
>> that bug. The root cause of that bug is: when the DPDK application gets
>> killed, nobody tells the vhost backend to stop. So when the DPDK app is
>> restarted, those hugepages are reused and initialized to all zero, and
>> unavoidably the "idx" in that memory is reset to 0. I have written a
>> patch to notify the vhost backend to stop when the DPDK app gets killed
>> suddenly (more precisely, when /dev/uioX gets closed), and that patch
>> will be sent out soon. It does not fix your problem either: you did not
>> kill the app. (I should not go into details about that bug here; I only
>> mean that they are different problems.)
>>
>>> However, it introduces a new problem: if I remove the patch, I cannot
>>> reproduce the problem. So, that leads me to believe that it has caused
>>> a regression.
>>>
>>> To summarize the patch's changes, it basically changes the
>>> virtio_dev_stop function to flag the device as stopped, and stops the
>>> device when closing/uninitializing it. However, there is a seemingly
>>> unintended side-effect.
>>>
>>> In virtio_dev_start, we have the following block of code:
>>>
>>> 	/* On restart after stop do not touch queues */
>>> 	if (hw->started)
>>> 		return 0;
>>>
>>> 	/* Do final configuration before rx/tx engine starts */
>>> 	virtio_dev_rxtx_start(dev);
>>>
>>> 	...
>>>
>>> Prior to the patch, if an interface were stopped and then started,
>>> without restarting the application, the queues would be left as-is,
>>> because hw->started would be set to 1.
>>
>> Yes, my previous patch did break this behavior (by stopping and
>> re-starting the device, the queues would be left as-is) and leads to the
>> problem here, and you are proposing to restore it.
>>
>> But is this the right behavior to follow?
>> On the one hand, when we stop the virtio device, should we notify the
>> vhost backend to stop too? Currently, we just flag it as stopped.
>> On the other hand, in many PMDs we now mix the device
>> initialization/configuration code into the dev_start functions, that is
>> to say, we re-configure the device every time we start it (maybe to make
>> sure that a changed configuration takes effect). Then what if we
>> increase the number of virtio queues?
> Hi Jianfeng,
>
> Let me see if my understanding is correct. You're saying that there is a
> problem introduced by your patch, but going back to the original
> behaviour is not ideal: it still leaves a gap with respect to changing
> the number of queues with virtio. The ideal solution would have us
> properly tear down the device when it is stopped, and then properly
> reinitialize it when it is started? That seems reasonable. However, I'm
> not sure what it would take.

Yes, exactly.
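To make that a bit more concrete, I imagine the stop/start paths could end
up looking roughly like the sketch below. Please take it as pseudo-code for
discussion rather than a tested patch: a real change also has to re-do
feature negotiation and re-share the memory regions with the backend, and I
am only using vtpci_reset() here as a stand-in for "tell the backend to
stop".

/* Sketch only: really stop the device when the port is stopped, and
 * rebuild the queues every time it is started, so that a changed
 * configuration (e.g. more queues) takes effect. */
static void
virtio_dev_stop(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* Resetting makes the backend forget its avail/used indexes,
	 * so they cannot appear to move "backwards" on restart. */
	vtpci_reset(hw);
	hw->started = 0;
}

static int
virtio_dev_start(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* No early return on hw->started: always re-setup the queues
	 * and share them with the backend again. */
	virtio_dev_rxtx_start(dev);
	hw->started = 1;

	return 0;
}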
> Does it make sense to submit a patch like I suggested in the short term,
> and have somebody work on a more robust long-term solution? How could we
> go about making sure that happens? (I can't allocate much time to fixing
> this, so I likely won't be able to do anything more complicated than what
> I've proposed.)

Let's see whether the maintainers have any suggestions here. (And I
personally do not oppose your proposal for the short term.)

Thanks,
Jianfeng
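PS: just to make sure we are talking about the same short-term change, I
read your proposal as roughly the below (illustrative only, untested, and
the unrelated parts of both functions are omitted): keep the queues
untouched across stop/start, and treat only close/uninit as a real stop.

static void
virtio_dev_stop(struct rte_eth_dev *dev)
{
	/* Existing link-down/interrupt handling stays as it is today
	 * (omitted here).  The point is that hw->started is NOT
	 * cleared, so a later virtio_dev_start() still takes the
	 * "on restart after stop do not touch queues" early return
	 * and the backend's used index stays consistent. */
}

static void
virtio_dev_close(struct rte_eth_dev *dev)
{
	struct virtio_hw *hw = dev->data->dev_private;

	/* Only a full close/uninit marks the device as stopped. */
	hw->started = 0;
}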