Hi,

On Wed, 12 Feb 2025 at 20:08, Peter Xu <pet...@redhat.com> wrote:
> I think it's not because TLS channels don't send magic, but TLS channels
> are not prone to ordering issues.
>
> In general, I'm not convinced we need to check names of iochannels.
* If the channel does not set the '_READ_MSG_SEEK' flag, which magic value
  are we going to read? As for checking the names of the channels, it tells
  the reader how an incoming channel is processed.

>> It did not work, as qemu would crash during multifd_shutdown().
>
> Have we debugged the crash? I'm not saying we should go with this, but
> crash isn't a reason to not choose a design.

* Yes, I did try to debug it, but couldn't get to a conclusion in time.

> No this is not. As I mentioned tons of times.. copying page in socket
> buffers directly into guest page during vcpus running / postcopy is a very
> hard to debug issue. If it's possible to happen, the design is flawed.

* Sure, agreed. So far in my testing it does not seem to happen. The theory
  of multifd_recv_threads getting scheduled out and causing a guest page
  overwrite seems remote to me, but I see the possibility/probability. One
  possible solution is to have the destination side send an
  'acknowledgement' to the source side.

> I don't think passing the unit tests prove the series is correct and should
> be merged. We need to understand how it work, or we can't merge it.

* Well, passing the unit tests should confirm that it does not break
  existing functionality. Is there any metric/limit to such understanding?
  Everybody understands/sees things differently. Understanding is an
  ever-evolving thing. Saying that the merge should happen based on
  understanding sounds weird to me.

> I feel very frustrated multiple times that you seem to want to ignore what
> I comment. I don't know why you rush to repost things.

* I guess we are seeing/understanding things differently here.

> After a 2nd thought, I think maybe multifd flush and sync could work on src
> side indeed, because when flush and sync there'll be a message
> (MULTIFD_FLUSH) on main channel and that should order against the rest
> postcopy messages that will also be sent on the main channel (if we do
> multifd flush before all the postcopy processes). Then it should guarantee
> when postcopy starts on dest, dest multifd recv threads flushed all
> messages, and no multifd message will arrive anymore.

* After a 2nd thought - that's evolving understanding right there. :)

* To mention here, to flush the multifd channels with
  'multifd_ram_flush_and_sync()' I tried calling

    migration_iteration_run
        -> multifd_ram_flush_and_sync(s->to_dst_file)

  It does not work, it crashes. Is 's->to_dst_file' the right parameter
  there?

  But calling the function below works:

===
/* Send enqueued data pages onto next available multifd channel */
int multifd_send_flush(void)
{
    if (!multifd_payload_empty(multifd_ram_send)) {
        if (!multifd_send(&multifd_ram_send)) {
            error_report("%s: multifd_send fail", __func__);
            return -1;
        }
    }
    return 0;
}
===

    migration_iteration_run
        -> multifd_send_flush()

  This works, but it does not send the 'RAM_SAVE_FLAG_MULTIFD_FLUSH'
  message that 'multifd_ram_flush_and_sync()' sends.

> But again we should guard it with an assert() in recv threads if you want
> to postpone recycling of multifd threads, just to double check no outliers
> we overlooked.

* Yes, the revised patch works with the assert(!migrate_in_postcopy) in
  multifd_recv_thread.

* I was going to send a revised series with these changes, but will wait on
  that for now.


Thank you.
---
  - Prasad
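
PS: For completeness, here is a rough sketch of the direction I was
experimenting with on the source side: flush the queued multifd pages (as
multifd_send_flush() above does) and then put the RAM_SAVE_FLAG_MULTIFD_FLUSH
marker on the main channel directly. The per-channel sync that
multifd_ram_flush_and_sync() also performs is not shown, the helper name is
made up, and it assumes RAM_SAVE_FLAG_MULTIFD_FLUSH is visible at the call
site, so please treat it strictly as an untested illustration:

===
/*
 * Sketch only (hypothetical helper): flush queued multifd pages and emit
 * the RAM_SAVE_FLAG_MULTIFD_FLUSH marker on the main channel, so the
 * destination flushes its recv threads before postcopy begins. Assumes
 * RAM_SAVE_FLAG_MULTIFD_FLUSH is visible here; the per-channel sync done
 * by multifd_ram_flush_and_sync() is intentionally left out.
 */
static int multifd_send_flush_and_mark(QEMUFile *f)
{
    /* Push any pages still queued for the multifd channels */
    if (!multifd_payload_empty(multifd_ram_send)) {
        if (!multifd_send(&multifd_ram_send)) {
            error_report("%s: multifd_send fail", __func__);
            return -1;
        }
    }

    /* Tell the destination to flush/sync its recv side */
    qemu_put_be64(f, RAM_SAVE_FLAG_MULTIFD_FLUSH);
    qemu_fflush(f);
    return 0;
}
===

The idea would be to call it from migration_iteration_run with
s->to_dst_file, the same place where I tried multifd_ram_flush_and_sync()
above.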