On Tue, Feb 18, 2025 at 05:30:40PM -0300, Fabiano Rosas wrote:
> Li Zhijian via <qemu-devel@nongnu.org> writes:
> 
> > Address an error in RDMA-based migration by ensuring RDMA is prioritized
> > when saving pages in `ram_save_target_page()`.
> >
> > Previously, the RDMA protocol's page-saving step was placed after other
> > protocols due to a refactoring in commit bc38dc2f5f3. This led to migration
> > failures characterized by unknown control messages and state loading errors
> > destination:
> > (qemu) qemu-system-x86_64: Unknown control message QEMU FILE
> > qemu-system-x86_64: error while loading state section id 1(ram)
> > qemu-system-x86_64: load of migration failed: Operation not permitted
> > source:
> > (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to 
> > abort!
> > qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1
> > qemu-system-x86_64: rdma migration: recv polling control error!
> > qemu-system-x86_64: warning: Early error. Sending error.
> > qemu-system-x86_64: warning: rdma migration: send polling control error
> >
> > RDMA migration implemented its own protocol/method to send pages to
> > destination side, hand over to RDMA first to prevent pages being saved by
> > other protocol.
> >
> > Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions")
> > Signed-off-by: Li Zhijian <lizhij...@fujitsu.com>
> > ---
> >  migration/ram.c | 9 +++++----
> >  1 file changed, 5 insertions(+), 4 deletions(-)
> >
> > diff --git a/migration/ram.c b/migration/ram.c
> > index 6f460fd22d2..635a2fe443a 100644
> > --- a/migration/ram.c
> > +++ b/migration/ram.c
> > @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, 
> > PageSearchStatus *pss)
> >      ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;
> >      int res;
> >  
> > +    /* Hand over to RDMA first */
> > +    if (control_save_page(pss, offset, &res)) {
> > +        return res;
> > +    }
> > +
> 
> Can we hoist that migrate_rdma() from inside the function? Since the
> other paths already check first before calling their functions.

If we're talking about hoist and stuff.. and if we want to go slightly
further, I wonder if we could also drop RAM_SAVE_CONTROL_NOT_SUPP.

    if (!migrate_rdma() || migration_in_postcopy()) {
        return RAM_SAVE_CONTROL_NOT_SUPP;
    }

We should make sure rdma_control_save_page() won't get invoked at all in
either case above..  For postcopy, maybe we could fail in the QMP migrate /
migrate_incoming cmd, at migration_channels_and_transport_compatible().

> 
> >      if (!migrate_multifd()
> >          || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) {
> >          if (save_zero_page(rs, pss, offset)) {
> > @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, 
> > PageSearchStatus *pss)
> >          return ram_save_multifd_page(block, offset);
> >      }
> >  
> > -    if (control_save_page(pss, offset, &res)) {
> > -        return res;
> > -    }
> > -
> >      return ram_save_page(rs, pss);
> >  }
> 

-- 
Peter Xu


Reply via email to