On Fri, 22 Nov 2019 09:46:46 +0000 "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:
> * Lukas Straub (lukasstra...@web.de) wrote:
> > Hello Everyone,
> > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > high-level test utilizing it for testing qemu COLO.
> >
> > The resource agent manages qemu COLO including continuous replication.
> >
> > Currently the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication related block nodes.
> > Note that this also happens in real world test when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is this it's trying to finish a
> read/write to the dead partner?
>
> Dave

I haven't looked into it too closely yet, but it often hangs in
bdrv_flush() while removing the replication blockdev, and that's probably
because the nbd client is waiting for a reply. So I tried the workaround
below, which actively kills the TCP connection. With it the test passes,
though I haven't tested it in the real world yet.

A proper solution would probably be a "force" parameter for blockdev-del,
which skips all flushing and aborts all inflight io. Or we could add a
timeout to the nbd client.

Regards,
Lukas Straub

diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
             and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
         fd = qmp_open()
         peer = qmp_get_nbd_remote(fd)
+        os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
         if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
             if qmp_check_resync(fd) != None:
                 qmp_cancel_resync(fd)
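
For reference, the same workaround could be written without going through a shell. This is only a rough sketch, assuming Python 3.5+ for subprocess.run(); build_ss_kill_cmd() and kill_nbd_connection() are hypothetical names and not part of the patch above. Passing the arguments as a list keeps the peer address from being interpreted by the shell, and the timeout keeps the notify action from blocking if ss itself stalls. Note that ss -K only closes sockets when the kernel is built with CONFIG_INET_DIAG_DESTROY.

```python
import subprocess


def build_ss_kill_cmd(peer, port):
    """Return the argv list equivalent to: sudo ss -K dst PEER dport = PORT."""
    return ["sudo", "ss", "-K", "dst", str(peer), "dport", "=", str(port)]


def kill_nbd_connection(peer, port, timeout=10):
    """Try to close TCP sockets to peer:port; return True if ss exited with 0.

    Hypothetical helper, shown only to illustrate the workaround.
    """
    try:
        result = subprocess.run(
            build_ss_kill_cmd(peer, port),
            stdin=subprocess.DEVNULL,    # never wait for a sudo password prompt
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,             # give up if ss or sudo stalls
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ss or sudo is missing, or the command did not finish in time
        return False
    return result.returncode == 0
```

In qemu_colo_notify() this would be called as kill_nbd_connection(peer, NBD_PORT) in place of the os.system() line. A zero exit status only means ss ran successfully, not that a socket was actually closed.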