failed-snapshot hooks

Stefan Hanreich Wed, 21 Dec 2022 03:26:47 -0800


On 12/21/22 11:44, Fabian Grünbichler wrote:

this is v2, right? ;)


Oh no - for some reason it's only in the cover letter..


On December 12, 2022 2:43 pm, Stefan Hanreich wrote:

This commit adds hooks to the snapshotting process, which can be used
to run additional setup scripts to prepare the VM for snapshotting.

Examples for use cases include:
* forcing processes to flush their writes
* blocking processes from writing
* altering the configuration of the VM to make snapshotting possible

The prepare step has been split into two parts, so the configuration
can be locked a bit earlier during the snapshotting process. Doing it
this way ensures that the configuration is already locked during the
pre-snapshot hook. Because of this split, the VM config gets written
in two stages now, rather than one.

In case of failure during the preparation step - after the lock is
written - error handling has been added so the lock gets released
properly. The failed-snapshot hook runs when the snapshot fails, if
the pre-snapshot hook ran already. This enables users to revert any
changes done during the pre-snapshot hookscript.


see below

The preparation step assumes that the hook does not convert the
current VM into a template, which is why the basic checks are not
re-run after the pre-snapshot hook. The storage check runs after the
pre-snapshot hook, because the hook might get used to setup the
storage for snapshotting. If the hook would run after the storage
checks, this becomes impossible.

cfs_update() gets called after every invocation of a hookscript, since
it is impossible to know which changes get made by the hookscript.
Doing this ensures that we see the updated state of the CFS after the
hookscript got invoked.

Signed-off-by: Stefan Hanreich <s.hanre...@proxmox.com>
---
  src/PVE/AbstractConfig.pm | 49 ++++++++++++++++++++++++++++++++++-----
  1 file changed, 43 insertions(+), 6 deletions(-)

diff --git a/src/PVE/AbstractConfig.pm b/src/PVE/AbstractConfig.pm
index a0c0bc6..3bff600 100644
--- a/src/PVE/AbstractConfig.pm
+++ b/src/PVE/AbstractConfig.pm
@@ -710,8 +710,7 @@ sub __snapshot_prepare {

my $snap;- my $updatefn = sub {

-
+    my $run_checks = sub {
        my $conf = $class->load_config($vmid);

die "you can't take a snapshot if it's a template\n"

@@ -721,15 +720,21 @@ sub __snapshot_prepare {

$conf->{lock} = 'snapshot';- my $snapshots = $conf->{snapshots};

-
        die "snapshot name '$snapname' already used\n"
-           if defined($snapshots->{$snapname});
+           if defined($conf->{snapshots}->{$snapname});
+
+       $class->write_config($vmid, $conf);
+    };

+ my $updatefn = sub {

+       my $conf = $class->load_config($vmid);
        my $storecfg = PVE::Storage::config();
+
        die "snapshot feature is not available\n"
            if !$class->has_feature('snapshot', $conf, $storecfg, undef, undef, 
$snapname eq 'vzdump');

+ my $snapshots = $conf->{snapshots};

+
        for my $snap (sort keys %$snapshots) {
            my $parent_name = $snapshots->{$snap}->{parent} // '';
            if ($snapname eq $parent_name) {
@@ -753,7 +758,32 @@ sub __snapshot_prepare {
        $class->write_config($vmid, $conf);
      };

- $class->lock_config($vmid, $updatefn);

+    $class->lock_config($vmid, $run_checks);
+
+    eval {
+       my $conf = $class->load_config($vmid);
+       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "pre-snapshot", 1);
+    };
+    my $err = $@;
+
+    PVE::Cluster::cfs_update();
+
+    if ($err) {
+       $class->remove_lock($vmid, 'snapshot');
+       die $err;
+    }
+


I wonder if we don't also want to call the 'failed-snapshot' phase when just the
pre-snapshot invocation failed? might be possible to combine the error handling
then, although I am not sure it makes it more readable if combined..

I thought about it, but I thought that if the user die's in his perlscript he should be able to run any cleanup code before that. Thisdoesn't consider any problems in the hookscript unforeseen by the userthough, so I think your approach is better, since it is easier to use.This places less burden on the author of the hookscript. Might make thecode a bit more convoluted though (depending on how we want to handleerrors in failed-snapshot), but the upsides are way better imo.

One thing that would be easier with making the user do his cleanup inpre-snapshot would be that the pre-snapshot hook knows exactly whatfailed in pre-snapshot, so cleanup-code could use that information toskip certain steps. But again, it assumes that pre-snapshot willproperly handle any possible error, which might be a bit much to assume.

+
+    if (my $err = $@) {
+       my $conf = $class->load_config($vmid);
+       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");


this exec_hookscript needs to be inside an eval {}, with warn in case it fails..

Isn't this already handled by the exec_hookscript function, since I amnot passing $stop_on_error ? It should exit with warn instead of diethen. Maybe I am misunderstanding something.


See:
https://git.proxmox.com/?p=pve-guest-common.git;a=blob;f=src/PVE/GuestHelpers.pm;h=b4ccbaa73a3fd08ba5d34350ebd57ee31355035b;hb=HEAD#l125


also, this call here happens when preparing for making the snapshot, after
possibly saving the VM state, but before taking the volume snapshots..

This should be alleviated by the envvars you proposed below, becausethen we could pass that information to the hookscript and the userdecides what to do with this information, right?

+       PVE::Cluster::cfs_update();
+
+       $class->remove_lock($vmid, 'snapshot');
+       die $err;
+    }

return $snap;

  }
@@ -837,11 +867,18 @@ sub snapshot_create {

if ($err) {

        warn "snapshot create failed: starting cleanup\n";
+
+       PVE::GuestHelpers::exec_hookscript($conf, $vmid, "failed-snapshot");


eval + warn as well


see above


this call here happens when the volume snapshots might or might not have been
created already (depending on what exactly the error cause is).


same here - should be alleviated by adding envvars, right?

+       PVE::Cluster::cfs_update();
+
        eval { $class->snapshot_delete($vmid, $snapname, 1, $drivehash); };
        warn "$@" if $@;
        die "$err\n";
      }

+ PVE::GuestHelpers::exec_hookscript($conf, $vmid, "post-snapshot");


and here we have a similar issue (no eval), what should happen if post-snapshot
fails?

A die immediately (very likely wrong, current)
B eval + warn but proceed with commit (possibly leaving leftover hook changes 
around)
C eval + warn, call failed-snapshot but proceed with commit (gives the
   hookscript a chance to cleanup, but how does it differentiate between the
   different failed-snapshot call sites?)
D eval + delete snapshot (seems suboptimal)
E eval + call failed-snapshot + delete snapshot (same, and also the issue of the
   hookscript being able to know what's going on again)

B and C seem most sensible to me, but C adds to the issue of "missing
failed-snapshot context", depending on what the hookscript is doing..

again, see above - I think it currently actually behaves like B becauseof how exec_hookscript works if I understand correctly.

Similar idea to not running the failed-snapshot hook if pre-snapshotfails. I thought that the user should be aware that his hookscriptfailed at some point and run possible cleanup code before returning. AsI said above that's probably a worse idea than just runningfailed-snapshot. It also enables the user to just have all the cleanuphandled by failed-snapshot instead of having to add it to pre/post/failed.


one way to pass information is via the environment, we do that for the migration
case already (setting PVE_MIGRATED_FROM, so that the pre-start/post-start
hookscript can know the start happens in a migration context, and where to
(possibly) find the guest config..

for example, we could set PVE_SNAPSHOT_PHASE here, and have prepare/commit/post
as sub-phases, or even pass a list of volumes already snapshotted (or created,

in case of vmstate), or

That's a good idea, I'll look into sensible values forPVE_SNAPSHOT_PHASE as well as look into how we could pass theinformation about volumes to the hookscript best.

obviously setting the environment is only allowed in a forked worker context,
else it would affect the next API endpoint handled by the pveproxy/pvedaemon/..
process, so it might be worth double-checking and cleaning up to avoid
side-effects with replication/migration/.. if we go down that route..

very good remark - thanks. I would not have thought of it even though itis kinda obvious now you pointed it out.

+    PVE::Cluster::cfs_update();
+
      $class->__snapshot_commit($vmid, $snapname);
  }

--

2.30.2


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel



_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel


Many thanks for the review!


_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel

Re: [pve-devel] [PATCH pve-guest-common 1/1] partially fix #2530: snapshots: add pre/post/failed-snapshot hooks

Reply via email to