On Mon, May 23, 2022 at 3:56 PM Jag Raman <jag.ra...@oracle.com> wrote:
>
> > On May 23, 2022, at 11:09 AM, Alexander Duyck <alexander.du...@gmail.com> wrote:
> >
> > From: Alexander Duyck <alexanderdu...@fb.com>
> >
> > When I run Multi-process QEMU with an e1000 as the remote device and SMP
> > enabled, I see the combination lock up and become unresponsive. The QEMU
> > build is a fairly standard x86_64-softmmu setup. After doing some digging,
> > I tracked the lockup down to what appears to be a race between the
> > mpqemu-link msg_send and msg_receive functions and the reacquisition of
> > the lock.
> >
> > I am assuming the issue is some sort of lock inversion, though I haven't
> > identified exactly what the other lock involved is yet. For now, removing
> > the logic to unlock the iothread and then reacquire the lock seems to
> > resolve the issue. I am assuming the releasing of the lock was some form
> > of optimization, but I am not certain, so I am submitting this as an RFC.
>
> Hi Alexander,
>
> We are working on moving away from Multi-process QEMU and to using a
> vfio-user-based approach. The vfio-user patches are under review. I believe
> we would drop the Multi-process support once vfio-user is merged.
>
> We release the lock here while communicating with the remote process via
> the QIOChannel. It is to prevent lockup of the VM in case the QIOChannel
> hangs.
>
> I was able to reproduce this issue at my end. There is a deadlock between
> "mpqemu_msg_send() -> qemu_mutex_lock_iothread()" and
> "mpqemu_msg_send_and_await_reply() -> QEMU_LOCK_GUARD(&pdev->io_mutex)".
>
> From what I can tell, as soon as one vcpu thread drops the iothread lock,
> another thread running mpqemu_msg_send_and_await_reply() holds on to it.
> That prevents the first thread from completing. Attaching backtrace below.
>
> To avoid the deadlock, I think we should drop both the iothread lock and
> io_mutex and reacquire them in the correct order - first the iothread lock
> and then io_mutex. Given that multiprocess QEMU will be dropped in the near
> future, I suppose we don't have to proceed further along these lines.
>
> I tested your patch, and it fixes the e1000 issue at my end as well. I
> believe we could adopt it.
Hi Jag,

I will go take a look at the vfio-user patches. I hadn't been following the
list lately, so I didn't realize things were headed in that direction. It may
work out better for me depending on what is enabled, as I was looking at
working with PCIe config and MSI-X anyway; if vfio-user supports those
already, it may save me some work.

What you describe as the issue doesn't sound too surprising, as I had
mentioned that disabling SMP also solved the problem for me. It is likely due
to the fact that the e1000 has separate threads running for handling stats
and such, and then the main thread handling the networking.

I would think we cannot drop the io_mutex, though, as it is needed to enforce
the ordering of the request send and the reply receive. Rather, if we need to
drop the iothread lock, I would think it would be better to drop it before
acquiring the io_mutex, or to at least wait until after we release the
io_mutex before reacquiring it.

If you want, I can resubmit my patch for acceptance, but it sounds like the
code will be deprecated soon, so I am not sure there is much point. I will
turn my focus to vfio-user and see if I can get it up and running for my use
case.

Thanks,

- Alex