* Balamuruhan S (bal...@linux.vnet.ibm.com) wrote:
> On Fri, Jul 06, 2018 at 11:56:59AM +0100, Dr. David Alan Gilbert wrote:
> > * Dr. David Alan Gilbert (dgilb...@redhat.com) wrote:
> > > * Peter Xu (pet...@redhat.com) wrote:
> > > > Based-on: <20180627132246.5576-1-pet...@redhat.com>
> > > >
> > > > Based on the series to unbreak postcopy:
> > > >   Subject: [PATCH v3 0/4] migation: unbreak postcopy recovery
> > > >   Message-Id: <20180627132246.5576-1-pet...@redhat.com>
> > > >
> > > > This series introduce a new postcopy recovery test.  The new test
> > > > actually helped me to identify two bugs there so fix them as well
> > > > before 3.0 release.
> > > >
> > > > Patch 1: a trivial cleanup for existing postcopy ram load, which I
> > > >          found a bit confusing during debugging the problem.
> > > >
> > > > Patch 2-3: two bug fixes that address different issues.  Please see
> > > >            the commit log for more information.
> > > >
> > > > Patch 4-9: add the postcopy recovery unit test.
> > > >
> > > > Please review.  Thanks,
> > >
> > > Queued
> >
> > Hi Peter,
> >   There's a problem in there somewhere; I'm getting
> > an intermittent failure of the test if I run a make check -j 8 on my
> > laptop.  Just running two copies of tests/migration-test in parallel
> > sometimes triggers it (but not if I turn on QTEST_LOG!).
> > But it's always failing with:
> >
> >
> > ERROR:/home/dgilbert/git/migpull/tests/migration-test.c:373:migrate_recover:
> > assertion failed: (qdict_haskey(rsp, "return"))
> >
> > Dave
>
> Hi Peter, Dave,

Hi Bala,

> I have applied this patchset in upstream Qemu to test postcopy
> pause/recovery.

Are you still seeing this with the set that got merged into 3.0-rc0?
The second of your errors looks similar to problems with the race we had
before Peter fixed it; but the set that I merged passed a 'make check'
on a Power box.

Dave

> I observed error after triggering recovery command from source monitor
> where the target is lost and the source remains to be in `postcopy-pause`
> state.
>
> Please find my observation below,
>
> Source:
>
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
> pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
> -drive file=/home/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1234,server,nowait -net nic,model=virtio -net user \
> -redir tcp:2000::22
>
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
>
> Source Monitor:
>
> (qemu) migrate_set_capability postcopy-ram on
> (qemu) migrate_set_parameter max-postcopy-bandwidth 4096
> (qemu) migrate -d tcp:127.0.0.1:4444
> (qemu) migrate_start_postcopy
> (qemu) migrate_pause
> (qemu) migrate -r tcp:127.0.0.1:4446
>
> After triggering recovery, target is lost with the error mentioned below
> and source remains to be in `postcopy-paused` state
>
> (qemu) info migrate
> globals:
> store-global-state: on
> only-migratable: off
> send-configuration: on
> send-section-footer: on
> decompress-error-check: on
> capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
> zero-blocks: off \
> compress: off events: off postcopy-ram: on x-colo: off release-ram: off
> block: off return-path: off pause-before-switchover: off x-multifd: off \
> dirty-bitmaps: off
> postcopy-blocktime: off late-block-activate: off
> Migration status: postcopy-recover
> total time: 78818 milliseconds
> expected downtime: 300 milliseconds
> setup: 169 milliseconds
> transferred ram: 177749 kbytes
> throughput: 63.72 mbps
> remaining ram: 28061376 kbytes
> total ram: 67109120 kbytes
> duplicate: 9742102 pages
> skipped: 0 pages
> normal: 22986 pages
> normal bytes: 91944 kbytes
> dirty sync count: 2
> page size: 4 kbytes
> multifd bytes: 0 kbytes
> dirty pages rate: 1273187 pages
> postcopy request count: 236
>
>
> Target:
>
> # ppc64-softmmu/qemu-system-ppc64 --enable-kvm --nographic -vga none -machine \
> pseries -m 64G,slots=128,maxmem=128G -smp 16,maxcpus=32 -device virtio-blk-pci,drive=rootdisk \
> -drive file=/home/bala/sharing/hostos-ppc64le.qcow2,if=none,cache=none,format=qcow2,id=rootdisk \
> -monitor telnet:127.0.0.1:1235,server,nowait -net nic,model=virtio -net user \
> -redir tcp:2001::22 -incoming tcp:127.0.0.1:4444
>
>
> qemu-system-ppc64: check_section_footer: Read section footer failed: -5
> qemu-system-ppc64: Detected IO failure for postcopy. Migration paused.
> qemu-system-ppc64: Not a migration stream
> qemu-system-ppc64: load of migration failed: Invalid argument
>
>
> Target Monitor:
>
> (qemu) migrate_set_capability postcopy-ram on
> (qemu) migrate_recover tcp:127.0.0.1:4446
> (qemu) Connection closed by foreign host.
>
> QTest:
>
> Also with respect to Qtest, I have tested it and the recovery test
> doesn't complete as it waits on the source for "completed" but due to this
> issue source remains to be in `postcopy-paused`
>
> `migrate_postcopy_complete(from, to);`
>
> but it actually doesn't end.
>
> As it did not complete, I cancelled it forcefully
>
> # time QTEST_QEMU_BINARY=./ppc64-softmmu/qemu-system-ppc64 ./tests/migration-test
> /ppc64/migration/deprecated: OK
> /ppc64/migration/bad_dest: OK
> /ppc64/migration/postcopy/unix: OK
> /ppc64/migration/postcopy/recovery: ^C
>
> real 21m55.176s
> user 2m28.800s
> sys 4m55.980s
>
> -- Bala
>
>
> > > > Peter Xu (9):
> > > >   migration: simplify check to use qemu file buffer
> > > >   migration: loosen recovery check when load vm
> > > >   migration: fix incorrect bitmap size calculation
> > > >   tests: introduce migrate_postcopy_* helpers
> > > >   tests: allow migrate() to take extra flags
> > > >   tests: introduce migrate_query*() helpers
> > > >   tests: introduce wait_for_migration_status()
> > > >   tests: add postcopy recovery test
> > > >   tests: hide stderr for postcopy recovery test
> > > >
> > > >  migration/ram.c        |  21 +++--
> > > >  migration/savevm.c     |  16 ++--
> > > >  tests/migration-test.c | 198 ++++++++++++++++++++++++++++++++---------
> > > >  3 files changed, 176 insertions(+), 59 deletions(-)
> > > >
> > > > --
> > > > 2.17.1
> > > >
> > > >
> > > --
> > > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> > --
> > Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> >
>
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK