On 01.07.20 18:16, Vladimir Sementsov-Ogievskiy wrote:
> 29.06.2020 18:00, Max Reitz wrote:
>> On 29.06.20 16:18, Vladimir Sementsov-Ogievskiy wrote:
>>> 29.06.2020 16:48, Max Reitz wrote:
>>>> Hi,
>>>>
>>>> In an iotest, I’m trying to quit qemu immediately after a migration has
>>>> failed.  Unfortunately, that doesn’t seem to be possible in a clean
>>>> way:
>>>> migrate_fd_cleanup() runs only at some point after the migration state
>>>> is already “failed”, so if I just wait for that “failed” state and
>>>> immediately quit, some cleanup functions may not have been run yet.
>>>>
>>>> This is a problem with dirty bitmap migration at least, because it
>>>> increases the refcount on all block devices that are to be migrated, so
>>>> if we don’t call the cleanup function before quitting, the refcount
>>>> will
>>>> stay elevated and bdrv_close_all() will hit an assertion because those
>>>> block devices are still around after blk_remove_all_bs() and
>>>> blockdev_close_all_bdrv_states().
>>>>
>>>> In practice this particular issue might not be that big of a problem,
>>>> because it just means qemu aborts when the user intended to let it quit
>>>> anyway.  But on one hand I could imagine that there are other clean-up
>>>> paths that should definitely run before qemu quits (although I don’t
>>>> know), and on the other, it’s a problem for my test.
>>>>
>>>> I tried working around the problem for my test by waiting on “Unable to
>>>> write” appearing on stderr, because that indicates that
>>>> migrate_fd_cleanup()’s error_report_err() has been reached.  But on one
>>>> hand, that isn’t really nice, and on the other, it doesn’t even work
>>>> when the failure is on the source side (because then there is no
>>>> s->error for migrate_fd_cleanup() to report).
>>
>> (I’ve now managed to work around it by invoking blockdev-del on a node
>> affected by bitmap migration until it succeeds, because blockdev-del can
>> only succeed once the bitmap migration code has dropped its reference to
>> it.)
>>
>>>> In all, I’m asking:
>>>> (1) Is there a nice solution for me now to delay quitting qemu until
>>>> the
>>>> failed migration has been fully resolved, including the clean-up?
>>>>
>>>> (2) Isn’t it a problem if qemu crashes when you issue “quit” via QMP at
>>>> the wrong time?  Like, maybe lingering subprocesses when using “exec”?
>>>>
>>>>
>>>
>>> I'll look more closely tomorrow, but as a short answer: try my series
>>> "[PATCH v2 00/22] Fix error handling during bitmap postcopy". It
>>> handles various problems around migration failures & qemu shutdown, so
>>> it will probably help.
>>
>> No, it doesn’t seem to.
>>
>> I’m not sure what exactly that series addresses, but FWIW I’m hitting
>> the problem in non-postcopy migration.  What my simplest reproducer
>> does is:
> 
> Bitmap migration is postcopy by nature (it may not work without
> migrate-start-postcopy, but it does work in the simplest cases: when the
> bitmap data to migrate is small, it can be migrated during the migration
> downtime). The most complicated parts of the series were about postcopy,
> but it fixes some other things as well.
> 
> It seems that this patch (see the second paragraph of its commit message)
> "[PATCH v2 10/22] migration/block-dirty-bitmap: cancel migration on
> shutdown"
> 
>>    If the target is turned off before postcopy has finished, the target
>>    crashes because busy bitmaps are found at shutdown.
>>    Canceling the incoming migration helps, as it removes all unfinished
>>    (and therefore busy) bitmaps.
> 
>>    Similarly, on the source we crash in bdrv_close_all(), which asserts
>>    that all bdrv states have been removed, because bdrv states involved
>>    in dirty bitmap migration are still referenced by it. So we need to
>>    cancel the outgoing migration as well.
>     should address exactly your issue.

Hm.  I’ve tested your series and still hit the issue.

I could imagine that my problem lies with a migration that fails (and is
thus automatically cancelled), so the issue is not that it isn’t
cancelled, but that the clean-up runs only after qemu has started
accepting further QMP commands (like quit).
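For reference, the blockdev-del polling workaround mentioned earlier could
be sketched like this (the helper name and the fake-QMP harness below are
mine, not from the thread; in a real iotest, `qmp` would be something like
`vm_a.qmp`):

```python
# Hypothetical sketch of the workaround: blockdev-del only succeeds once
# the bitmap migration clean-up code has dropped its extra reference to
# the node, so we simply retry until it does (and only then quit).
def wait_for_node_release(qmp, node_name, max_tries=1000):
    """Retry blockdev-del until it succeeds, i.e. until the node's
    refcount has dropped back to the test's own reference."""
    for _ in range(max_tries):
        result = qmp('blockdev-del', node_name=node_name)
        if 'error' not in result:
            return True
    return False
```

A real version would probably also want a short sleep between retries
rather than spinning on QMP as fast as possible.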

>>
>> On the source VM:
>>
>> blockdev-add node-name='foo' driver='null-co'
>> block-dirty-bitmap-add node='foo' name='bmap0'
>>
>> (Launch destination VM with some -incoming, e.g.
>> -incoming 'exec: cat /tmp/mig_file')
>>
>> Both on source and destination:
>>
>> migrate-set-capabilities capabilities=[
>>      {capability='events', state=true},
>>      {capability='dirty-bitmaps', state=true}
>> ]
>>
>> On source:
>>
>> migrate uri='exec: cat > /tmp/mig_file'
>>
>> Then wait for a MIGRATION event with data/status == 'failed', and then
>> issue 'quit'.
>>
>> Max
>>
> 
> Can you post exact reproducer iotest?

I didn’t have an iotest until now (because it was a simple test written
up in Ruby), but what I’ve attached should work.

Note that you need some system load to trigger the problem (otherwise the
clean-up code is scheduled too quickly), so I usually just run about three
instances concurrently:

(while TEST_DIR=/tmp/t$i ./check 400; do :; done)

Max
>From ce0cacd21058f27fcb18aa632bfd5bc4fb3feadf Mon Sep 17 00:00:00 2001
From: Max Reitz <mre...@redhat.com>
Date: Thu, 2 Jul 2020 09:21:14 +0200
Subject: [PATCH] Quit crash reproducer

---
 tests/qemu-iotests/400     | 42 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/400.out |  5 +++++
 tests/qemu-iotests/group   |  1 +
 3 files changed, 48 insertions(+)
 create mode 100755 tests/qemu-iotests/400
 create mode 100644 tests/qemu-iotests/400.out

diff --git a/tests/qemu-iotests/400 b/tests/qemu-iotests/400
new file mode 100755
index 0000000000..a32b2c3064
--- /dev/null
+++ b/tests/qemu-iotests/400
@@ -0,0 +1,42 @@
+#!/usr/bin/env python3
+
+import os
+import iotests
+
+mig_sock = os.path.join(iotests.sock_dir, 'mig.sock')
+
+class TestMigQuit(iotests.QMPTestCase):
+    def setUp(self):
+        self.vm_a = iotests.VM(path_suffix='a')
+        self.vm_a.launch()
+
+        self.vm_a.qmp('blockdev-add', node_name='foo', driver='null-co')
+        self.vm_a.qmp('block-dirty-bitmap-add', node='foo', name='bmap0')
+
+        self.vm_b = iotests.VM(path_suffix='b')
+        self.vm_b.add_incoming(f'unix:{mig_sock}')
+        self.vm_b.launch()
+
+        for vm in (self.vm_a, self.vm_b):
+            vm.qmp('migrate-set-capabilities',
+                    capabilities=[{'capability': 'events', 'state': True},
+                                  {'capability': 'dirty-bitmaps',
+                                   'state': True}])
+
+    def tearDown(self):
+        self.vm_a.shutdown()
+        self.vm_b.shutdown()
+
+        try:
+            os.remove(mig_sock)
+        except OSError:
+            pass
+
+    def test_mig_quit(self):
+        self.vm_a.qmp('migrate', uri=f'unix:{mig_sock}')
+
+        while self.vm_a.event_wait('MIGRATION')['data']['status'] != 'failed':
+            pass
+
+if __name__ == '__main__':
+    iotests.main()
diff --git a/tests/qemu-iotests/400.out b/tests/qemu-iotests/400.out
new file mode 100644
index 0000000000..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/400.out
@@ -0,0 +1,5 @@
+.
+----------------------------------------------------------------------
+Ran 1 tests
+
+OK
diff --git a/tests/qemu-iotests/group b/tests/qemu-iotests/group
index d886fa0cb3..cdb785b034 100644
--- a/tests/qemu-iotests/group
+++ b/tests/qemu-iotests/group
@@ -302,3 +302,4 @@
 291 rw quick
 292 rw auto quick
 297 meta
+400
-- 
2.26.2
