On 01.07.20 18:16, Vladimir Sementsov-Ogievskiy wrote: > 29.06.2020 18:00, Max Reitz wrote: >> On 29.06.20 16:18, Vladimir Sementsov-Ogievskiy wrote: >>> 29.06.2020 16:48, Max Reitz wrote: >>>> Hi, >>>> >>>> In an iotest, I’m trying to quit qemu immediately after a migration has >>>> failed. Unfortunately, that doesn’t seem to be possible in a clean >>>> way: >>>> migrate_fd_cleanup() runs only at some point after the migration state >>>> is already “failed”, so if I just wait for that “failed” state and >>>> immediately quit, some cleanup functions may not have been run yet. >>>> >>>> This is a problem with dirty bitmap migration at least, because it >>>> increases the refcount on all block devices that are to be migrated, so >>>> if we don’t call the cleanup function before quitting, the refcount >>>> will >>>> stay elevated and bdrv_close_all() will hit an assertion because those >>>> block devices are still around after blk_remove_all_bs() and >>>> blockdev_close_all_bdrv_states(). >>>> >>>> In practice this particular issue might not be that big of a problem, >>>> because it just means qemu aborts when the user intended to let it quit >>>> anyway. But on one hand I could imagine that there are other clean-up >>>> paths that should definitely run before qemu quits (although I don’t >>>> know), and on the other, it’s a problem for my test. >>>> >>>> I tried working around the problem for my test by waiting on “Unable to >>>> write” appearing on stderr, because that indicates that >>>> migrate_fd_cleanup()’s error_report_err() has been reached. But on one >>>> hand, that isn’t really nice, and on the other, it doesn’t even work >>>> when the failure is on the source side (because then there is no >>>> s->error for migrate_fd_cleanup() to report). >> >> (I’ve now managed to work around it by invoking blockdev-del on a node >> affected by bitmap migration until it succeeds, because blockdev-del can >> only succeed once the bitmap migration code has dropped its reference to >> it.) >> >>>> In all, I’m asking: >>>> (1) Is there a nice solution for me now to delay quitting qemu until >>>> the >>>> failed migration has been fully resolved, including the clean-up? >>>> >>>> (2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at >>>> the wrong time? Like, maybe lingering subprocesses when using “exec”? >>>> >>>> >>> >>> I'll look more closely tomorrow, but as a short answer: try my series >>> "[PATCH v2 00/22] Fix error handling during bitmap postcopy" it >>> handles different problems around migration failures & qemu shutdown, >>> probably it will help. >> >> Not, it doesn’t seem to. >> >> I’m not sure what exactly that series addresses, but FWIW I’m hitting >> the problem in non-postcopy migration. What my simplest reproducer >> does is: > > Bitmaps migration is postcopy by nature (and it may not work without > migrate-start-postcopy, but work in most simple cases, as when we have > small bitmap data to migrate, it can migrate during migration downtime). > Most complicated part of the series were about postcopy. Still it fixes > some other things. > > It seems, that patch (see the second paragraph) > "[PATCH v2 10/22] migration/block-dirty-bitmap: cancel migration on > shutdown" > >> If target is turned of prior to postcopy finished, target crashes >> because busy bitmaps are found at shutdown. >> Canceling incoming migration helps, as it removes all unfinished (and >> therefore busy) bitmaps. > >> Similarly on source we crash in bdrv_close_all which asserts that all >> bdrv states are removed, because bdrv states involved into dirty >> bitmap >> migration are referenced by it. So, we need to cancel outgoing >> migration as well. > should address exactly your issue.
Hm. I’ve tested your series and still hit the issue. I could imagine that my problem lies with a failed migration that is automatically “cancelled” by nature, so the problem isn’t that it isn’t cancelled, but that the clean-up runs after accepting further QMP commands (like quit). >> >> On the source VM: >> >> blockdev-add node-name='foo' driver='null-co' >> block-dirty-bitmap-add node='foo' name='bmap0' >> >> (Launch destination VM with some -incoming, e.g. >> -incoming 'exec: cat /tmp/mig_file') >> >> Both on source and destination: >> >> migrate-set-capabilities capabilities=[ >> {capability='events', state=true}, >> {capability='dirty-bitmaps', state=true} >> ] >> >> On source: >> >> migrate uri='exec: cat > /tmp/mig_file' >> >> Then wait for a MIGRATION event with data/status == 'failed', and then >> issue 'quit'. >> >> Max >> > > Can you post exact reproducer iotest? I didn’t have an iotest until now (because it was a simple test written up in Ruby), but what I’ve attached should work. Note that you need system load to trigger the problem (or the clean-up code is scheduled too quickly), so I usually just run like three instances concurrently. (while TEST_DIR=/tmp/t$i ./check 400; do; done) Max
>From ce0cacd21058f27fcb18aa632bfd5bc4fb3feadf Mon Sep 17 00:00:00 2001 From: Max Reitz <mre...@redhat.com> Date: Thu, 2 Jul 2020 09:21:14 +0200 Subject: [PATCH] Quit crash reproducer --- tests/qemu-iotests/400 | 42 ++++++++++++++++++++++++++++++++++++++ tests/qemu-iotests/400.out | 5 +++++ tests/qemu-iotests/group | 1 + 3 files changed, 48 insertions(+) create mode 100755 tests/qemu-iotests/400 create mode 100644 tests/qemu-iotests/400.out diff --git a/tests/qemu-iotests/400 b/tests/qemu-iotests/400 new file mode 100755 index 0000000000..a32b2c3064 --- /dev/null +++ b/tests/qemu-iotests/400 @@ -0,0 +1,42 @@ +#!/usr/bin/env python3 + +import os +import iotests + +mig_sock = os.path.join(iotests.sock_dir, 'mig.sock') + +class TestMigQuit(iotests.QMPTestCase): + def setUp(self): + self.vm_a = iotests.VM(path_suffix='a') + self.vm_a.launch() + + self.vm_a.qmp('blockdev-add', node_name='foo', driver='null-co') + self.vm_a.qmp('block-dirty-bitmap-add', node='foo', name='bmap0') + + self.vm_b = iotests.VM(path_suffix='b') + self.vm_b.add_incoming(f'unix:{mig_sock}') + self.vm_b.launch() + + for vm in (self.vm_a, self.vm_b): + vm.qmp('migrate-set-capabilities', + capabilities=[{'capability': 'events', 'state': True}, + {'capability': 'dirty-bitmaps', + 'state': True}]) + + def tearDown(self): + self.vm_a.shutdown() + self.vm_b.shutdown() + + try: + os.remove(mig_sock) + except OSError: + pass + + def test_mig_quit(self): + self.vm_a.qmp('migrate', uri=f'unix:{mig_sock}') + + while self.vm_a.event_wait('MIGRATION')['data']['status'] != 'failed': + pass + +if __name__ == '__main__': + iotests.main() diff --git a/tests/qemu-iotests/400.out b/tests/qemu-iotests/400.out new file mode 100644 index 0000000000..ae1213e6f8 --- /dev/null +++ b/tests/qemu-iotests/400.out @@ -0,0 +1,5 @@ +. +---------------------------------------------------------------------- +Ran 1 tests + +OK diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group index d886fa0cb3..cdb785b034 100644 --- a/tests/qemu-iotests/group +++ b/tests/qemu-iotests/group @@ -302,3 +302,4 @@ 291 rw quick 292 rw auto quick 297 meta +400 -- 2.26.2
signature.asc
Description: OpenPGP digital signature