On Thu, Aug 29, 2024 at 02:45:45PM +0530, Prasad Pandit wrote:
> Hello Michael,
>
> On Thu, 29 Aug 2024 at 13:12, Michael S. Tsirkin <m...@redhat.com> wrote:
> > Weird. Seems to indicate some kind of deadlock?
>
> * Such a deadlock should occur across all environments I guess, not
> sure why it happens selectively. It is strange.
>
> > So maybe vhost_user_postcopy_end should take the BQL?
> ===
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e7c1215671..31acda3818 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2050,7 +2050,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>           */
>          qemu_event_wait(&mis->main_thread_load_event);
>      }
> +    bql_lock();
>      postcopy_ram_incoming_cleanup(mis);
> +    bql_unlock();
>
>      if (load_res < 0) {
>          /*
> ===
>
> * Actually a BQL patch above was tested and it worked fine. But not
> sure if it is an acceptable solution. Another contention was taking
> BQL could make things more complicated, so a local vhost-user specific
> lock should be better.
>
> ...wdyt?
I think Michael was suggesting taking the BQL in vhost_user_postcopy_end(),
not in the postcopy code directly.

I've recently been looking at how to make precopy load take the BQL less,
and even run in a separate thread. The patch above is definitely going
backwards, as we already discussed internally. I value that postcopy
doesn't need to take the BQL on its own in most paths, and we shouldn't add
an unnecessary BQL requirement that applies even when vhost-user isn't used.

Personally I'd still prefer that we look into why a separate mutex doesn't
work and why that timed out; that could be part of the work for whoever is
going to investigate the whole issue (including the hang later on).
Otherwise I'm OK, from a migration point of view, with taking the BQL in
the vhost-user hook, but not in savevm.c.

Thanks,

-- 
Peter Xu
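
To make the difference between the two placements concrete, here is a
minimal standalone sketch. It is not QEMU code: a pthread mutex stands in
for the BQL, and every model_* name is hypothetical and only loosely
mirrors vhost_user_postcopy_end() and postcopy_ram_incoming_cleanup(). It
assumes the device end-of-postcopy handler is reached through a callback
from the generic cleanup path.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the BQL (assumption: a plain mutex is enough for the model). */
static pthread_mutex_t model_bql = PTHREAD_MUTEX_INITIALIZER;
static bool model_bql_held;        /* true while the model lock is held */
static int model_bql_acquisitions; /* how many times the lock was taken */

static void model_bql_lock(void)
{
    pthread_mutex_lock(&model_bql);
    model_bql_held = true;
    model_bql_acquisitions++;
}

static void model_bql_unlock(void)
{
    model_bql_held = false;
    pthread_mutex_unlock(&model_bql);
}

/* Hypothetical per-device state touched at the end of postcopy. */
struct model_device {
    int end_calls;
    bool saw_bql;
};

typedef void (*model_postcopy_end_hook)(void *opaque);

/*
 * Device-side hook: the lock requirement lives here, so only users of
 * this device pay for it.
 */
static void model_vhost_user_postcopy_end(void *opaque)
{
    struct model_device *dev = opaque;

    model_bql_lock();
    dev->saw_bql = model_bql_held;
    dev->end_calls++;
    model_bql_unlock();
}

/*
 * Generic cleanup path: takes no lock of its own. With no hook
 * registered, the model lock is never touched.
 */
static void model_postcopy_incoming_cleanup(model_postcopy_end_hook hook,
                                            void *opaque)
{
    if (hook) {
        hook(opaque);
    }
}
```

Calling model_postcopy_incoming_cleanup(NULL, NULL) leaves the lock alone,
while passing model_vhost_user_postcopy_end takes it only around the device
work. That is the property the savevm.c patch would lose.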