On Thu, Aug 29, 2024 at 02:45:45PM +0530, Prasad Pandit wrote:
> Hello Michael,
> 
> On Thu, 29 Aug 2024 at 13:12, Michael S. Tsirkin <m...@redhat.com> wrote:
> > Weird.  Seems to indicate some kind of deadlock?
> 
> * Such a deadlock should occur across all environments I guess, not
> sure why it happens selectively. It is strange.
> 
> > So maybe vhost_user_postcopy_end should take the BQL?
> ===
> diff --git a/migration/savevm.c b/migration/savevm.c
> index e7c1215671..31acda3818 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2050,7 +2050,9 @@ static void *postcopy_ram_listen_thread(void *opaque)
>           */
>          qemu_event_wait(&mis->main_thread_load_event);
>      }
> +    bql_lock();
>      postcopy_ram_incoming_cleanup(mis);
> +    bql_unlock();
> 
>      if (load_res < 0) {
>          /*
> ===
> 
> * Actually a BQL patch above was tested and it worked fine. But not
> sure if it is an acceptable solution. Another contention was taking
> BQL could make things more complicated, so a local vhost-user specific
> lock should be better.
> 
> ...wdyt?

I think Michael was suggesting taking the BQL in vhost_user_postcopy_end(),
not in the postcopy code directly.  I've recently been looking at how to make
precopy load take the BQL less often, and even move it into a separate
thread.  The patch above is definitely going backwards, as we already
discussed internally.

I value the fact that postcopy doesn't need to take the BQL on its own in
most paths, and we shouldn't add an unnecessary BQL requirement that applies
even when vhost-user isn't used.

Personally, I'd still prefer that we look into why a separate mutex won't
work and why that timed out; that could be part of the work for whoever is
going to investigate the whole issue (including the hang later on).
Otherwise, from the migration point of view, I'm OK with taking the BQL in
the vhost-user hook, but not in savevm.c.
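
To illustrate the shape I have in mind, here is a toy sketch, not QEMU
code.  Every toy_* name is hypothetical, and a plain pthread mutex stands in
for the BQL.  The only point is that the generic cleanup path stays
lock-free and the lock is taken inside the vhost-user hook, so setups
without vhost-user never pay for it:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the BQL; this is not QEMU's implementation. */
static pthread_mutex_t toy_bql = PTHREAD_MUTEX_INITIALIZER;
static _Thread_local bool toy_bql_held;
static int toy_bql_acquisitions;   /* counts lock takes, for illustration */

static void toy_bql_lock(void)
{
    pthread_mutex_lock(&toy_bql);
    toy_bql_held = true;
    toy_bql_acquisitions++;        /* updated while holding the lock */
}

static void toy_bql_unlock(void)
{
    toy_bql_held = false;
    pthread_mutex_unlock(&toy_bql);
}

typedef void (*toy_postcopy_hook)(void);

/* Hypothetical vhost-user end-of-postcopy hook: takes the lock itself. */
static void toy_vhost_user_postcopy_end(void)
{
    toy_bql_lock();
    assert(toy_bql_held);          /* device state would be touched here */
    toy_bql_unlock();
}

/* Generic cleanup: takes no lock on its own, only runs registered hooks. */
static void toy_postcopy_cleanup(const toy_postcopy_hook *hooks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        hooks[i]();
    }
}
```

With no hooks registered, toy_postcopy_cleanup() never touches the lock;
with the vhost-user hook registered, the lock is taken once, around the
hook's own work.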

Thanks,

-- 
Peter Xu

