On Tue, Feb 18, 2025 at 05:30:40PM -0300, Fabiano Rosas wrote: > Li Zhijian via <qemu-devel@nongnu.org> writes: > > > Address an error in RDMA-based migration by ensuring RDMA is prioritized > > when saving pages in `ram_save_target_page()`. > > > > Previously, the RDMA protocol's page-saving step was placed after other > > protocols due to a refactoring in commit bc38dc2f5f3. This led to migration > > failures characterized by unknown control messages and state loading errors > > destination: > > (qemu) qemu-system-x86_64: Unknown control message QEMU FILE > > qemu-system-x86_64: error while loading state section id 1(ram) > > qemu-system-x86_64: load of migration failed: Operation not permitted > > source: > > (qemu) qemu-system-x86_64: RDMA is in an error state waiting migration to > > abort! > > qemu-system-x86_64: failed to save SaveStateEntry with id(name): 1(ram): -1 > > qemu-system-x86_64: rdma migration: recv polling control error! > > qemu-system-x86_64: warning: Early error. Sending error. > > qemu-system-x86_64: warning: rdma migration: send polling control error > > > > RDMA migration implemented its own protocol/method to send pages to > > destination side, hand over to RDMA first to prevent pages being saved by > > other protocol. > > > > Fixes: bc38dc2f5f3 ("migration: refactor ram_save_target_page functions") > > Signed-off-by: Li Zhijian <lizhij...@fujitsu.com> > > --- > > migration/ram.c | 9 +++++---- > > 1 file changed, 5 insertions(+), 4 deletions(-) > > > > diff --git a/migration/ram.c b/migration/ram.c > > index 6f460fd22d2..635a2fe443a 100644 > > --- a/migration/ram.c > > +++ b/migration/ram.c > > @@ -1964,6 +1964,11 @@ static int ram_save_target_page(RAMState *rs, > > PageSearchStatus *pss) > > ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS; > > int res; > > > > + /* Hand over to RDMA first */ > > + if (control_save_page(pss, offset, &res)) { > > + return res; > > + } > > + > > Can we hoist that migrate_rdma() from inside the function? Since the > other paths already check first before calling their functions.
If we're talking about hoist and stuff.. and if we want to go slightly further, I wonder if we could also drop RAM_SAVE_CONTROL_NOT_SUPP. if (!migrate_rdma() || migration_in_postcopy()) { return RAM_SAVE_CONTROL_NOT_SUPP; } We should make sure rdma_control_save_page() won't get invoked at all in either case above.. For postcopy, maybe we could fail in the QMP migrate / migrate_incoming cmd, at migration_channels_and_transport_compatible(). > > > if (!migrate_multifd() > > || migrate_zero_page_detection() == ZERO_PAGE_DETECTION_LEGACY) { > > if (save_zero_page(rs, pss, offset)) { > > @@ -1976,10 +1981,6 @@ static int ram_save_target_page(RAMState *rs, > > PageSearchStatus *pss) > > return ram_save_multifd_page(block, offset); > > } > > > > - if (control_save_page(pss, offset, &res)) { > > - return res; > > - } > > - > > return ram_save_page(rs, pss); > > } > -- Peter Xu