Re: [pve-devel] [PATCH v2 guest-common 2/2] fix 3111: replicate guest on rollback if there are replication jobs for it

Fabian Ebner Thu, 12 Aug 2021 02:13:46 -0700

I'll likely send the next version of the series today, but wanted to address some points from here first (so I don't have to quote everything there).


On 22.06.21 at 09:41, Fabian Grünbichler wrote:

On June 9, 2021 11:18 am, Fabian Ebner wrote:

so that there will be a valid replication snapshot again.


Otherwise, replication will be broken after a rollback if the last
(non-replication) snapshot is removed before replication can run again.


I still see issues with these two patches applied..

A: snapshot 'test2', successful replication afterwards, then rollback to
'test2':

----snip----

zfs error: could not find any snapshots to destroy; check snapshot names.
end replication job

two misleading errors from attempting to delete already-cleaned up
snapshots (this is just confusing, likely caused by the state file being
outdated after prepare(), replication is still working as expected
afterwards)

Those errors are triggered by $cleanup_local_snapshots in Replication.pm's replicate() and there currently is no easy way to know if we are in an after-rollback situation (or other situation where snapshots might already be deleted) there. We could just ignore this specific error, but then we won't detect if an actually wrong snapshot name was passed in anymore.


B: successful replication, snapshot 'test3', rollback to 'test3'


----snip----


now the replication is broken and requires manual intervention to be
fixed:

source: no replication snapshots, regular snapshots test, test2, test3

target: one replication snapshot, regular snapshots test, test2 (test3
is missing)


Will be fixed in the next version of the series.

----snip----


I) maybe prepare or at least this calling context should hold the
replication lock so that it can't race with concurrent replication runs?

Ideally, all snapshot operations would need to hold the lock, right? Otherwise, it might happen that some volumes are replicated before the snapshot operation was done with them, and some after.


I'll look at that in detail some time and send it as its own series.
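
To illustrate the race mentioned above, here is a minimal sketch in Python (not the actual Perl code). The lock object and both function names are hypothetical stand-ins; the point is only that when the snapshot operation and the replication run take the same lock, neither can see the volumes in a half-processed state:

```python
import threading

# Hypothetical stand-in for the per-guest replication lock.
replication_lock = threading.Lock()


def snapshot_all_volumes(volumes, snapname, log):
    # Hold the lock for the whole operation, not per volume, so a
    # replication run cannot start in the middle of it.
    with replication_lock:
        for vol in volumes:
            log.append(("snapshot", vol, snapname))


def replicate_all_volumes(volumes, log):
    # Takes the same lock, so it sees either no volume or every volume
    # already handled by a concurrent snapshot operation.
    with replication_lock:
        for vol in volumes:
            log.append(("replicate", vol))
```

Without the shared lock, the two loops could interleave, which is exactly the "some volumes before, some after" case.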

II) maybe prepare should update the state file in general (if the last
snapshot referenced there is stale and gets removed, the next
replication run gets confused) - might/should fix A

The problem is that the state is not aware of the individual volumes and rollback might not remove replication snapshots from all replicated volumes. It does currently, but that's wrong and is what causes this bug in the first place.

III) maybe prepare and/or run_replication need to learn how to deal with
"only regular snapshots, but not the last one match" (we could match on
name+guid to ensure the common base is actually a common, previously
replicated base/the same snapshot and not just the same name with
different content) and construct a send stream from that shared snapshot
instead of attempting a full sync where the target already (partially)
exists.. that would fix B, and improve replication robustness in
general (allowing resuming from a partial shared state instead of having
to remove, start over from scratch..)

implementing III would also avoid the need for doing a replication after
rollback - the next replication would handle the situation just fine
unless ALL previously shared snapshots are removed - we could check for
that in the remove snapshot code path though.. or we could just schedule
a replication here instead of directly doing it. rollback is an almost
instant action (as long as no vmstate is involved), and a replication
can take a long time so it seems a bit strange to conflate the two..

I went with a similar approach as also discussed off-list, sans the guid matching as that's not really possible to get from the current volume_snapshot_list while being backwards compatible. And I'm not sure it's even possible to trigger a mismatch with the new approach, because of the "prevent removal after rollback until replication is run again" restriction.
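
As a rough illustration of matching by name only, here is a sketch in Python rather than the actual Perl code. The helper find_common_base is hypothetical, and it assumes both snapshot lists are ordered from oldest to newest, which is an assumption about the listing and not something guaranteed here:

```python
def find_common_base(source_snaps, target_snaps):
    """Return the newest snapshot name present on both sides, or None.

    Both arguments are lists of snapshot names, assumed to be ordered
    from oldest to newest. Matching is by name only (no guid check).
    """
    on_target = set(target_snaps)
    # Walk the source list from newest to oldest and stop at the first hit.
    for name in reversed(source_snaps):
        if name in on_target:
            return name
    return None


# Scenario B from above: the source has test, test2, test3; the target has
# test, test2 and one replication snapshot (its name is made up here).
source = ["test", "test2", "test3"]
target = ["test", "test2", "__replicate_example__"]
base = find_common_base(source, target)
```

With the lists from scenario B this selects 'test2' as the base, so an incremental send from there would be possible instead of a full sync onto an already existing target.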


----snip----


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
