Re: [PATCH 0/4] colo: Introduce resource agent and high-level test

Lukas Straub Wed, 27 Nov 2019 13:19:55 -0800

On Fri, 22 Nov 2019 09:46:46 +0000
"Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote:

> * Lukas Straub (lukasstra...@web.de) wrote:
> > Hello Everyone,
> > These patches introduce a resource agent for use with the Pacemaker CRM and a
> > high-level test utilizing it for testing qemu COLO.
> >
> > The resource agent manages qemu COLO including continuous replication.
> >
> > Currently the second test case (where the peer qemu is frozen) fails on primary
> > failover, because qemu hangs while removing the replication related block nodes.
> > Note that this also happens in real world test when cutting power to the peer
> > host, so this needs to be fixed.
>
> Do you understand why that happens? Is it trying to finish a
> read/write to the dead partner?
>
> Dave

I haven't looked into it too closely yet, but it often hangs in bdrv_flush()
while removing the replication blockdev, probably because the nbd client is
waiting for a reply that never arrives. So I tried the workaround below, which
actively kills the TCP connection. With it the test passes, though I haven't
tested it in the real world yet.

A proper solution would probably be a "force" parameter for blockdev-del that
skips all flushing and aborts all in-flight I/O. Or we could add a timeout to
the nbd client.
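
To illustrate the timeout idea in isolation, here is a minimal Python sketch.
It is not qemu's nbd client: the helper name wait_for_reply and the socketpair
standing in for the NBD connection are assumptions made purely for
illustration. It only shows how a bounded wait turns an indefinite hang on a
silent peer into an error the caller can handle.

```python
import socket


def wait_for_reply(sock, timeout_s):
    """Wait for one reply byte; return None if the peer stays silent.

    Hypothetical helper, for illustration only.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(1)
    except socket.timeout:
        # Without the timeout, recv() would block for as long as the
        # connection stays open and the peer sends nothing.
        return None


# Simulate a frozen peer: the other end stays open but never answers.
client, frozen_peer = socket.socketpair()
reply = wait_for_reply(client, 0.2)
print("reply:", reply)
client.close()
frozen_peer.close()
```

With a frozen peer the call returns after the timeout instead of blocking
forever, which is roughly the behaviour a client-side timeout would give the
flush path.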

Regards,
Lukas Straub

diff --git a/scripts/colo-resource-agent/colo b/scripts/colo-resource-agent/colo
index 5fd9cfc0b5..62210af2a1 100755
--- a/scripts/colo-resource-agent/colo
+++ b/scripts/colo-resource-agent/colo
@@ -935,6 +935,7 @@ def qemu_colo_notify():
            and HOSTNAME == str.strip(OCF_RESKEY_CRM_meta_notify_master_uname):
             fd = qmp_open()
             peer = qmp_get_nbd_remote(fd)
+            os.system("sudo ss -K dst %s dport = %s" % (peer, NBD_PORT))
             if peer == str.strip(OCF_RESKEY_CRM_meta_notify_stop_uname):
                 if qmp_check_resync(fd) != None:
                     qmp_cancel_resync(fd)
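
As a possible refinement of the one-line workaround above, the ss call could
go through subprocess with an argument list, so the peer name is never
interpreted by a shell and a failed kill is visible to the caller. This is
only a sketch under assumptions: build_ss_kill_cmd and kill_nbd_connection are
hypothetical names, not part of the posted patch, and it assumes Python 3.5+
for subprocess.run and a kernel that supports socket destruction via ss -K.

```python
import subprocess


def build_ss_kill_cmd(peer, port):
    """Build the same ss invocation as the patch, as an argument list.

    Hypothetical helper, not part of the posted patch.
    """
    return ["sudo", "ss", "-K", "dst", str(peer), "dport", "=", str(port)]


def kill_nbd_connection(peer, port, timeout_s=5):
    """Forcibly close TCP connections to peer:port; True if ss exited 0.

    Hypothetical helper, not part of the posted patch.
    """
    try:
        result = subprocess.run(
            build_ss_kill_cmd(peer, port),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        # sudo/ss missing, or the command itself got stuck.
        return False
    return result.returncode == 0
```

In qemu_colo_notify() this would be called as kill_nbd_connection(peer, NBD_PORT)
in place of the os.system() line, with the return value logged if it is False.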
