On Sat, Feb 08, 2025 at 04:06:56PM +0530, Prasad Pandit wrote: > Hello Peter, > > On Fri, 7 Feb 2025 at 21:16, Peter Xu <pet...@redhat.com> wrote: > > This is not easy to follow neither with the current name, nor that you > > "assumed this is main channel" and test it. > > * It is not my doing, nor is there any assumption, but that is how > current implementation works. > === > static bool migration_should_start_incoming(bool main_channel) > { > /* Multifd doesn't start unless all channels are established */ > if (migrate_multifd()) { > return migration_has_all_channels(); > } > > /* Preempt channel only starts when the main channel is created */ > if (migrate_postcopy_preempt()) { > return main_channel; > } > > /* > * For all the rest types of migration, we should only reach here when > * it's the main channel that's being created, and we should always > * proceed with this channel. > */ > assert(main_channel); > return true; > } > === > * Above code returns 'true' for 'multifd' and 'main_channel' cases. > When migrate_postcopy_preempt() is true, main_channel is 'false', so > it returns false. All I have done is reused the > migration_should_start_incoming() function to simplify the 'if' > conditional at the top.
Yes, and I suggest a rename or introduce a new helper, per previous reply. > > > I think you may want to split > > migration_has_all_channels() into migration_has_essential_channels() which > > only covers main and multifd cases. Then you can check if (!has_esential) > > here. You'd better also add a comment that all "essential channels" can be > > peeked. > > > > You may also want to bypass a few things, e.g. "postcopy paused stage" here > > rather than inside, because postcopy-recover only happens: > > > > - First with a main channel, that is not peekable as no header when resume > > - Then with preempt channel, that is also not peekable > > > > [a] > > > > You may also need to keep the mapped-ram check. They also don't support > > peek. > > * Instead of adding specific conditions and splitting functions, my > request is, let's work towards consistent channel behaviour that will > automatically simplify these conditions and channel handling. Maybe we > can do that in a subsequent series. I didn't follow, sorry - do you mean this patch is correct on dropping the mapped-ram check? I don't yet understand how it can work if without. > > > > > > > > > + } else if (mis->from_src_file > > > > > + && (!strcmp(ioc->name, "migration-tls-incoming") > > > > > + || !strcmp(ioc->name, "migration-file-incoming"))) { > > > > > + channel = CH_MULTIFD; > > > > > > > > Confused here too. Why do we need to check ioc name? Shouldn't multifd > > > > has > > > > the headers? > > > > > > * Because they are not 'multifd' channels, tls/file channels don't > > > send magic values, but are still handled by > > > > It might be because you have a bug where you removed mapped-ram check at > > [b] above. I think we need to keep it. > > * ie. Because I removed the mapped-ram check, so tls/file channels are > handled by multifd_recv_new_channel()? No, that's not the case. > Rather, that is how it works currently. I have not changed anything, > only made it more explicit to see that when it is tls/file channel, > handle it as a CH_MULTIFD type. Looking at the current code, one can > not see clearly how tls/file channels are handled. > > > Why TLS channels don't send magic? > > * Probably because they do TLS hand-shake while establishing a connection? I meant tls channels should have these magics too. Do you mean they're not? > > > > because if multifd page is getting late, that network > > > latency should affect 'postcopy' channel too, no? But still if it is > > > > I don't think so. postcopy doesn't use any multifd channels. > > * Yes, but it uses the same wire/network. > > > > possible, do we want to call - multifd_ram_flush_and_sync() before > > > postcopy_start()? Will that help? I'll check if/how it works. > > > > Note that all things flushed may or may not be enough, because IIUC the > > flush only makes sure all threads are synced. > > * We are again using 'flush' and 'sync' interchangeably. What does - > flush only makes sure all threads are synced - mean really? Is it not > writing all socket data onto the wire/channel? > > * Flush should write all socket data onto the network wire/channel. > The _order_ in which threads flush/write their socket data onto the > wire/channel is to synchronise them, maintaining/controlling that > _order_ is not flush. > > > It may not make sure the order of things to happen in multifd threads and > > postcopy thread. The > > latter is what we need - we need to make sure no page land in postcopy > > threads. > > * Let's see: > 1) When migration is in Postcopy mode, ram_save_multifd_page() is > _not_ called on the source side. ie. no multifd data gets enqueued on > the multifd queue. > 1.1) multifd_queue_page() function also calls multifd_send() if > the queue is full, before enqueueing new pages. > 2) If a multifd page reaches the destination during Postcopy mode, > it must have been sent/written on the multifd channel before Postcopy > mode started, right? Yes. > 3) In this case, writing/flushing all multifd socket data onto the > wire/channel, before calling postcopy_start() function should help > IIUC. > 3.1) ie. calling multifd_send() before postcopy_start() should > help to clear the multifd queue, before Postcopy begins. > 3.2) Same can be done by - multifd_ram_flush_and_sync() -> > multifd_send() - sequence. No I don't think so. Flushing sending side makes sure send buffer is empty. It doesn't guarantee recv buffer is empty on the other side. > > * If all multifd pages are sent/written/flushed onto the multifd > channels before postcopy_start() is called, then multifd pages should > not arrive at the destination after postcopy starts IIUC. If that is > happening, we need a reproducer for such a case. Do we have such a > reproducer? With or without a reproducer, we need to at least justify it in theory. If it doesn't even work in theory, it's a problem. > > > That's why I was requesting to add an assert() in multifd recv thread to > > make sure we will never receive a page during postcopy. > > * ie. Add assert(!migrate_in_postcopy()) in multifd_recv_thread() > function? Okay. Yes, probably before multifd_recv_state->ops->recv(). Thanks, -- Peter Xu