On 11/03/2015 09:47 PM, Dr. David Alan Gilbert wrote:
> * Juan Quintela (quint...@redhat.com) wrote:
>> "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
>>> Hi,
>>>   I'm trying to understand why migration_bitmap_extend is correct/safe;
>>> If I understand correctly, you're arguing that:
>>>
>>>   1) the migration_bitmap_mutex around the extend, stops any sync's
>>>      happening and so no new bits will be set during the extend.
>>>
>>>   2) If migration sends a page and clears a bitmap entry, it doesn't
>>>      matter if we lose the 'clear' because we're copying it as
>>>      we extend it, because losing the clear just means the page
>>>      gets resent, and so the data is OK.
>>>
>>> However, doesn't (2) mean that migration_dirty_pages might be wrong?
>>> If a page was sent, the bit cleared, and migration_dirty_pages decremented,
>>> then if we copy over that bitmap and 'set' that bit again then
>>> migration_dirty_pages is too small; that means that either migration
>>> would finish too early, or more likely, migration_dirty_pages would
>>> wrap-around -ve and never finish.
>>>
>>> Is there a reason it's really safe?
>>
>> No.  It is reasonably safe.  Various values of reasonably.
>>
>> migration_dirty_pages should never arrive at values near zero.  Because
>> we move to the completion stage way before it gets a value near zero.
>> (We could have very, very bad luck, as in it is not safe).
>
> That's only true if we hit the qemu_file_rate_limit() in ram_save_iterate;
> if we don't hit the rate limit (e.g. because we're CPU or network limited
> to slower than the set limit) then I think ram_save_iterate will go all the
> way to sending every page; if that happens it'll go once more
> around the main migration loop, and call the pending routine, and now get
> a -ve (very +ve) number of pending pages, so continuously do ram_save_iterate
> again.
>
> We've had that type of bug before when we messed up the dirty-pages
> calculation during hotplug.

IIUC, migration_bitmap_extend() is called when migration is running and we
hotplug a device.

In this case, I think we hold the iothread mutex when
migration_bitmap_extend() is called.

ram_save_complete() is also protected by the iothread mutex. So if
migration_bitmap_extend() is called, the migration thread may be blocked in
migration_completion(), waiting for it. qemu_savevm_state_complete() will be
called after migration_completion() returns.

Thanks
Wen Congyang

>
>> Now, do we really care if migration_dirty_pages is exact?  Not really,
>> we just use it to calculate if we should start the throttle or not.
>> We only test that once per second, so if we have written a couple of
>> pages that we are not accounting for, things should be reasonably safe.
>>
>> That said, I don't know why we didn't catch that problem during
>> review (yes, I am guilty here).  Not sure how to really fix it,
>> though.  I think that the problem is more theoretical than real, but
>
> Dave
>
>> ....
>>
>> Thanks, Juan.
>>
>>>
>>> Dave
>>>
>>> --
>>> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
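
A minimal sketch of the lost-clear scenario from this thread, to make the
counter arithmetic concrete. It assumes the dirty-page counter is an unsigned
64-bit integer, which matches the "-ve (very +ve)" remark quoted above. All
names here (struct dirty_state, send_page(), lose_clear(), the demo
functions) are hypothetical stand-ins; none of this is QEMU code, and it
models only the accounting, not the locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: NOT QEMU's real data structures. */
#define NPAGES 8

struct dirty_state {
    bool bitmap[NPAGES];   /* one flag per page, true = dirty */
    uint64_t dirty_pages;  /* stands in for migration_dirty_pages (assumed unsigned) */
};

/* Start with every page dirty and the counter in agreement with the bitmap. */
static void init_all_dirty(struct dirty_state *s)
{
    for (int i = 0; i < NPAGES; i++) {
        s->bitmap[i] = true;
    }
    s->dirty_pages = NPAGES;
}

/* Send one page: if it was dirty, clear its bit and decrement the counter. */
static void send_page(struct dirty_state *s, int page)
{
    assert(page >= 0 && page < NPAGES);
    if (s->bitmap[page]) {
        s->bitmap[page] = false;
        s->dirty_pages--;
    }
}

/*
 * Model of the lost 'clear': the extend copied the old bitmap, so the bit
 * comes back as set, but nothing re-increments the counter.
 */
static void lose_clear(struct dirty_state *s, int page)
{
    assert(page >= 0 && page < NPAGES);
    s->bitmap[page] = true;
}

/* Send every page still marked dirty and return the resulting counter. */
static uint64_t send_all(struct dirty_state *s)
{
    for (int i = 0; i < NPAGES; i++) {
        send_page(s, i);
    }
    return s->dirty_pages;
}

/* No race: bitmap and counter stay in step, so the counter ends at zero. */
uint64_t demo_no_race(void)
{
    struct dirty_state s;
    init_all_dirty(&s);
    return send_all(&s);
}

/* One lost clear: 8 bits are set but the counter says 7, so it wraps. */
uint64_t demo_lost_clear(void)
{
    struct dirty_state s;
    init_all_dirty(&s);
    send_page(&s, 0);   /* counter: 8 -> 7, bit 0 cleared */
    lose_clear(&s, 0);  /* bit 0 set again, counter still 7 */
    return send_all(&s);
}
```

In this model demo_no_race() returns 0, while demo_lost_clear() returns
UINT64_MAX: the counter is decremented once more than it was ever
incremented, so it wraps to a huge positive value rather than reaching zero.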