On Mon, Oct 21, 2024 at 05:14:11PM +0200, Cédric Le Goater wrote:
> Hello Avihai,
> 
> On 10/20/24 15:01, Avihai Horon wrote:
> > When the VM is shut down vfio_vmstate_change/_prepare() are called to
> > transition the VFIO device state to STOP. They are called after
> > migration_shutdown() and thus, by this time, the migration object is
> > already freed (more specifically, MigrationState->qemu_file_lock is
> > already destroyed).
> > 
> > In this case, if there is an error in vfio_vmstate_change/_prepare(), it
> > calls migration_file_set_error() which tries to lock the already
> > destroyed MigrationState->qemu_file_lock, leading to the following
> > assert:
> > 
> >    qemu-system-x86_64: ../util/qemu-thread-posix.c:92: 
> > qemu_mutex_lock_impl: Assertion `mutex->initialized' failed.
> > 
> > Fix this by not setting migration file error in the shut down flow.
> > 
> > Fixes: 20c64c8a51a4 ("migration: migration_file_set_error")
> > Signed-off-by: Avihai Horon <avih...@nvidia.com>
> > ---
> >   hw/vfio/migration.c | 31 +++++++++++++++++++++----------
> >   1 file changed, 21 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> > index 992dc3b102..1c44b036ea 100644
> > --- a/hw/vfio/migration.c
> > +++ b/hw/vfio/migration.c
> > @@ -783,6 +783,25 @@ static const SaveVMHandlers savevm_vfio_handlers = {
> >   /* ---------------------------------------------------------------------- 
> > */
> > +static void vfio_vmstate_change_error_report(int ret, Error *err,
> > +                                             RunState state)
> > +{
> > +    if (state == RUN_STATE_SHUTDOWN) {
> > +        /*
> > +         * If VM is being shut down, migration object might have already 
> > been
> > +         * freed, so just report the error.
> > +         */
> > +        error_report_err(err);
> > +        return;
> > +    }
> > +
> > +    /*
> > +     * Migration should be aborted in this case, but vm_state_notify()
> > +     * currently does not support reporting failures.
> > +     */
> > +    migration_file_set_error(ret, err);
> > +}
> 
> This seems correct, but testing the machine's runtime state to
> work around the fact that 'current_migration' is not safe to
> use reveals a flaw.
> 
> In vfio, migration_is_setup_or_active() is unsafe. So are the
> calls to vfio_set_migration_error().
> 
> 
> This comment seems in contradiction with the migration code :
> 
> /* When we add fault tolerance, we could have several
>    migrations at once.  For now we don't need to add
>    dynamic creation of migration */
> 
> Why is 'current_migration' allocated and destroyed today ?
> 
> Could we replace it with a singleton and switch the state to
> MIGRATION_STATUS_NONE instead of destroying it ?
> 
> If not, should 'current_migration' be set to NULL in
> migration_shutdown() and its value tested before accessing
> to any of its fields, in a thread safe manner if necessary.

IIUC the migration thread should always see valid migration object, as it
takes one refcount at the entrance of migration_thread():

    object_ref(OBJECT(s));

So it's not yet clear to me on why the mutex was destroyed, if the main
object should still be there; logically it was only destroyed in the
finalize phase (migration_instance_finalize), but that should be after
migration thread quits and releases the refcount..

-- 
Peter Xu


Reply via email to