Hi Daniel,

On 2020/9/14 22:42, Daniel P. Berrangé wrote:
> On Tue, Aug 11, 2020 at 09:54:08PM +0800, Zhenyu Ye wrote:
>> Hi Kevin,
>>
>> On 2020/8/10 23:38, Kevin Wolf wrote:
>>> On 10.08.2020 at 16:52, Zhenyu Ye wrote:
>>>> Before doing QMP actions, we need to lock the qemu_global_mutex,
>>>> so the QMP actions should not take too long.
>>>>
>>>> Unfortunately, some QMP actions need to acquire an AioContext,
>>>> and this may take a long time. The VM will soft lockup if this
>>>> time is too long.
>>>
>>> Do you have a specific situation in mind where getting the lock of
>>> an AioContext can take a long time? I know that the main thread can
>>> block for considerable time, but QMP commands run in the main
>>> thread, so this patch doesn't change anything for this case. It
>>> would be effective if an iothread blocks, but shouldn't everything
>>> running in an iothread be asynchronous and therefore keep the
>>> AioContext lock only for a short time?
>>
>> Theoretically, everything running in an iothread is asynchronous.
>> However, some 'asynchronous' actions are not entirely non-blocking;
>> io_submit() is one example. It can block when the iodepth is large
>> and the I/O pressure is high. If we run a QMP command such as
>> 'info block' at that moment, it may cause a VM soft lockup. This
>> series makes such QMP commands safer.
>>
>> I reproduced the problem as follows:
>>
>> 1. Create a VM with 4 disks, using an iothread.
>> 2. Put pressure on the host CPU. In my setup, the CPU usage
>>    exceeded 95%.
>> 3. Put pressure on all 4 disks in the VM at the same time, using
>>    fio with parameters such as:
>>
>>    fio -rw=randrw -bs=1M -size=1G -iodepth=512 -ioengine=libaio -numjobs=4
>>
>> 4. Run block query commands, for example via virsh:
>>
>>    virsh qemu-monitor-command [vm name] --hmp info block
>>
>> The VM will then soft lockup; the calltrace is:
>
> [snip]
>
>> This problem can be avoided after this series is applied.
>
> At what cost, though? With this timeout, QEMU is going to start
> reporting bogus failures for various QMP commands when running
> under high load, even if those commands would actually run
> successfully. This will turn into an error report from libvirt,
> which will in turn probably cause an error in the mgmt application
> using libvirt, and in turn could break the user's automation.
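To make the io_submit() blocking point above concrete before getting
to your concern: below is a minimal standalone program (my own
illustration, not code from this series) that times a single
io_submit() call against a file opened with O_DIRECT. On a slow or
saturated disk, the "asynchronous" submission itself can take
noticeable time. The file name "testfile" and the queue depth of 512
are arbitrary choices; the file must live on a filesystem that
supports O_DIRECT (not tmpfs). Build with:
gcc -O2 -o aio-test aio-test.c -laio

/* Sketch only: time one io_submit() call under a deep queue. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define DEPTH   512             /* matches the fio iodepth above */
#define BLKSIZE (1024 * 1024)   /* matches the fio block size above */

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    io_context_t ctx = 0;
    struct iocb cbs[DEPTH], *cbp[DEPTH];
    struct io_event events[DEPTH];
    void *buf;
    double start;
    int fd, i, ret;

    fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0600);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (posix_memalign(&buf, 4096, BLKSIZE)) {  /* O_DIRECT alignment */
        return 1;
    }
    memset(buf, 0xab, BLKSIZE);

    ret = io_setup(DEPTH, &ctx);
    if (ret < 0) {
        fprintf(stderr, "io_setup: %s\n", strerror(-ret));
        return 1;
    }
    for (i = 0; i < DEPTH; i++) {
        /* All requests share one buffer; only the timing matters. */
        io_prep_pwrite(&cbs[i], fd, buf, BLKSIZE, (long long)i * BLKSIZE);
        cbp[i] = &cbs[i];
    }

    start = now_sec();
    ret = io_submit(ctx, DEPTH, cbp);       /* "async", but can block */
    printf("io_submit() returned %d after %.3f seconds\n",
           ret, now_sec() - start);

    if (ret > 0) {
        io_getevents(ctx, ret, ret, events, NULL);  /* reap completions */
    }
    io_destroy(ctx);
    close(fd);
    return 0;
}

While the iothread is stuck inside that call, it is holding its
AioContext lock, so a QMP query that needs the same AioContext has to
wait for however long the submission takes.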
As for the cost: I think reporting an error is worth it to avoid the
VM soft lockup. The VM may even crash if kernel.softlockup_panic is
configured! We can increase the timeout value (to something close to
the VM's CPU soft-lockup threshold) to avoid unnecessary errors.
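Concretely, I have in mind something like the following hypothetical
sketch (plain pthreads for illustration only, not the actual patch;
in QEMU the lock in question is the AioContext's mutex, and the
16-second budget is an arbitrary value chosen to stay below a typical
guest soft-lockup window):

/* Hypothetical sketch: take a lock with a deadline and fail the
 * QMP command instead of stalling forever. Not the series itself. */
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define LOCK_TIMEOUT_SEC 16  /* arbitrary; below the soft-lockup window */

static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 on success, ETIMEDOUT if the lock was not taken in time. */
static int lock_with_timeout(pthread_mutex_t *lock)
{
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);  /* timedlock uses CLOCK_REALTIME */
    deadline.tv_sec += LOCK_TIMEOUT_SEC;
    return pthread_mutex_timedlock(lock, &deadline);
}

int main(void)
{
    int ret = lock_with_timeout(&ctx_lock);

    if (ret == ETIMEDOUT) {
        /* Report a recoverable error to the QMP caller instead of
         * letting the guest hit its soft-lockup threshold. */
        fprintf(stderr, "context busy, command timed out\n");
        return 1;
    }
    /* ... run the block query while holding the lock ... */
    pthread_mutex_unlock(&ctx_lock);
    return 0;
}

As long as the deadline stays below the guest's soft-lockup threshold,
the error path only fires in situations where the guest would
otherwise have locked up anyway; under ordinary load the lock is
acquired long before the deadline, so management applications should
rarely see the failure.

Thanks,
Zhenyu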