On Thu, Aug 29, 2024 at 10:29:24AM -0400, Peter Xu wrote:
> On Thu, Aug 29, 2024 at 02:45:45PM +0530, Prasad Pandit wrote:
> > Hello Michael,
> >
> > On Thu, 29 Aug 2024 at 13:12, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > Weird. Seems to indicate some kind of deadlock?
> >
> > * Such a deadlock should occur across all environments I guess, not
> > sure why it happens selectively. It is strange.
> >
> > > So maybe vhost_user_postcopy_end should take the BQL?
> > ===
> > diff --git a/migration/savevm.c b/migration/savevm.c
> > index e7c1215671..31acda3818 100644
> > --- a/migration/savevm.c
> > +++ b/migration/savevm.c
> > @@ -2050,7 +2050,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
> >           */
> >          qemu_event_wait(&mis->main_thread_load_event);
> >      }
> > +    bql_lock();
> >      postcopy_ram_incoming_cleanup(mis);
> > +    bql_unlock();
> >
> >      if (load_res < 0) {
> >          /*
> > ===
> >
> > * Actually a BQL patch above was tested and it worked fine. But not
> > sure if it is an acceptable solution. Another contention was taking
> > BQL could make things more complicated, so a local vhost-user specific
> > lock should be better.
> >
> > ...wdyt?
>
> I think Michael was suggesting taking bql in vhost_user_postcopy_end(), not
> in postcopy code directly.

Maybe that's better, OK.

> I'm recently looking at how to make precopy
> load even take less bql and even make it a separate thread.  Above is
> definitely going backwards, per we discussed already internally.

At the same time, a small bugfix is better: it can be backported.

> I cherish postcopy doesn't need to take bql on its own in most paths, and
> we shouldn't add unnecessary bql requirement even if vhost-user isn't used.
>
> Personally I still prefer we look into why a separate mutex won't work and
> why that timed out; that could be part of whoever is going to investigate
> the whole issue (including the hang later on).  Otherwise I'm ok from
> migration pov that we take bql in the vhost-user hook, but not in savevm.c.
>
> Thanks,

OK.

> --
> Peter Xu