On Wed, Oct 23, 2024 at 04:32:01PM -0300, Fabiano Rosas wrote:
> Peter Xu <pet...@redhat.com> writes:
> 
> > This is a follow up of below patch from Avihai as a replacement:
> >
> > https://lore.kernel.org/qemu-devel/20241020130108.27148-3-avih...@nvidia.com/
> >
> > This is v2 of the series, and it became a more generic rework on how we do
> > migration object refcounts, so I skipped a changelog because most of this
> > is new things.
> >
> > To put it simple, now I introduced another pointer to migration object, and
> > here's a simple explanation for both after all change applied (copy-paste
> > from one of the patch):
> >
> > /*
> >  * We have two pointers for the global migration objects.  Both of them are
> >  * initialized early during QEMU starts, but they have different lifecycles.
> 
> The very next person that needs to access one of those in migration code
> will get this wrong.

Migration code should never access global_migration except during init /
shutdown (or a renamed version of it), that's the plan.. so no one should
use it within migration/.

> 
> >  *
> >  * - current_migration
> >  *
> >  *   This variable reflects the whole lifecycle of the migration object
> >  *   (which each QEMU can only have one).  It is valid until the migration
> >  *   object is destroyed.
> >  *
> >  *   This is the object that internal migration so far use.  For example,
> >  *   internal helper migrate_get_current() references it.
> >  *
> >  *   When all migration code can always pass over a MigrationState* around,
> >  *   this variable can logically be dropped.  But we're not yet there.
> 
> Won't dropping it just bring us back to the situation before this patch?
> I'd say we need the opposite, to stop using migrate_get_current()
> everywhere in the migration code and instead expose the
> current_migration via an internal header.

I meant dropping all the global access to current_migration within
migration/ dir.

Consider all functions has the 1st parameter as MigrationState*, for
example.  Then we don't need that for internal use, while global_migration
is still needed for external use, but only for external use.

> 
> >  *
> >  * - global_migration
> 
> Both are global, we can't prefix one of them with global_.

I can rename it to migration_export, but the question is whether the name
is the issue, or you'd think having two global variables pointing to
migration object is the issue / concern.

So I think fundaementally we indeed only need one global var there, if we
can cleanup the migration/ everywhere to always take the pointer from the
caller, then migration thread will take 1 refcount and that guarantees it's
available for migration/.

> 
> >  *
> >  *   This is valid only until the migration object is still valid to the
> >  *   outside-migration world (until migration_shutdown()).
> >  *
> >  *   This should normally be always set, cleared or accessed by the main
> >  *   thread only, rather than the migration thread.
> >  *
> >  *   All the exported functions (in include/migration) should reference the
> >  *   exported migration object only to avoid race conditions, as
> >  *   current_migration can be freed concurrently by migration thread when
> >  *   the migration thread holds the last refcount.
> >  */
> 
> Have you thought of locking the migration object instead?

Yes, logically we could use RCU if we want, then take BQL for example only
if update.  but I worry that's an overkill: we'll need too many rcu read
lock all over the places..

> 
> Having two global pointers to the same object with one having slightly
> different lifecycle than the other will certainly lead to misuse.

That's indeed tricky, it's just that this is so far the easiest change I
can think of.

> 
> I worry about mixing the usage of both globals due to some migration
> code that needs to call these exported functions or external code
> reaching into migration/ through some indirect path. Check global and
> try to use current sort of scenarios (and vice-versa).

Yeh I get your concern.  I actually tried to observe such usages (for now,
especially when migration/ uses the misc.h exported functions) and they're
all safe.  I should have mentioned that.

For the other way round, I don't yet expect that to happen: the plan is
anything outside must call a function in include/migration/* and that must
only use global_migration.

> 
> Similarly, what about when a lingering reference still exists, but
> global_migration is already clear? Any migration code looking at
> global_migration would do the wrong thing.

That's how it works: migration thread will take one refcount and that'll
happen when migration is running but when VM shutdowns itself.  I checked
that all the migration code should be fine when using the exported
functions even if they should reference current_migration.

So logically if with such design, indeed internal migration/ code shouldn't
reference global_migration at all just to be always safe.

Any idea in your mind that can make this easier?  I'm definitely open to
suggestions.

> 
> >
> > It allows all misc.h exported helpers to be used for the whole VM
> > lifecycle, so as to never crash QEMU with freed migration objects.
> 
> Isn't there a race with the last reference being dropped at
> migration_shutdown() and global_migration still being set?

It needs to be protected by BQL in this case, so anyone using exported
functions need to take BQL first.

> 
> >
> > Thanks,
> >
> > Peter Xu (4):
> >   migration: Unexport dirty_bitmap_mig_init() in misc.h
> >   migration: Reset current_migration properly
> >   migration: Add global_migration
> >   migration: Make all helpers in misc.h safe to use without migration
> >
> >  include/migration/misc.h | 29 ++++++++----
> >  migration/migration.h    |  4 ++
> >  migration/migration.c    | 99 +++++++++++++++++++++++++++++++++++-----
> >  3 files changed, 113 insertions(+), 19 deletions(-)
> 

-- 
Peter Xu


Reply via email to