Re: [pve-devel] [guest-common] fix #1694: Replication risks permanently losing sync in high loads due to timeout bug

Thomas Lamprecht Wed, 28 Mar 2018 21:23:31 -0700

On 3/23/18 12:15 PM, Wolfgang Link wrote:
> If the pool is under heavy load, ZFS gives deletion jobs a low priority.
> The deletion then runs into a timeout and the program logic deletes the
> current sync snapshot.
> On the next run the former sync snapshots are removed as well, because
> they are not in the state file.
> In this state an incremental sync is no longer possible and a full sync
> has to be performed.
> 
> We do not delete the former snapshot at the end of the replication run,
> because prepare_local_job will delete it anyway, and
> when a timeout happens at that point we can ignore it and start the
> replication.
>


So why was it ever done this way? We always get the remote side snapshots
and clean up the stale ones there, so what purpose did the
remote_finalize_local_job method have? Cleaning up in some edge case that
the remote_prepare_local_job method didn't reach?
Just asking so that I can better understand any possible side effects of
your proposed changes...

> The snapshot deletion error will be logged in the replication log.
> ---
>  PVE/Replication.pm | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/PVE/Replication.pm b/PVE/Replication.pm
> index 9bc4e61..1eb853d 100644
> --- a/PVE/Replication.pm
> +++ b/PVE/Replication.pm
> @@ -136,8 +136,18 @@ sub prepare {
>               $last_snapshots->{$volid}->{$snap} = 1;
>           } elsif ($snap =~ m/^\Q$prefix\E/) {
>               $logfunc->("delete stale replication snapshot '$snap' on $volid");
> -             PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> -             $cleaned_replicated_volumes->{$volid} = 1;
> +
> +             eval {
> +                 PVE::Storage::volume_snapshot_delete($storecfg, $volid, $snap);
> +                 $cleaned_replicated_volumes->{$volid} = 1;
> +             };
> +
> +             # If deleting the snapshot fails, we cannot be sure whether it was due to an error or a timeout.
> +             # On a timeout it is likely that the delete still succeeded.
> +             # If it really failed, the next run will try to remove it again.
> +             warn $@ if $@;
> +
> +             $logfunc->("delete stale replication snapshot error: $@") if $@;
>           }
>       }
>      }
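
To make the intent of this hunk concrete: the delete becomes best effort, a
failure is logged, and the loop carries on with the next snapshot. A minimal
standalone sketch of that pattern follows. Note that delete_snapshot and
cleanup_stale are hypothetical stand-ins, not PVE functions, and the result
hash is keyed by snapshot name here (the patch keys it by $volid).

```perl
use strict;
use warnings;

# Hypothetical stand-in for PVE::Storage::volume_snapshot_delete; it dies
# for one snapshot name to simulate a timeout on a busy pool.
sub delete_snapshot {
    my ($volid, $snap) = @_;
    die "timeout deleting '$snap' on $volid\n" if $snap eq 'snap-busy';
    return 1;
}

# Try to delete every stale snapshot. A failure is logged and the loop
# continues, so a single timeout does not abort the whole prepare step.
sub cleanup_stale {
    my ($volid, $snaps, $logfunc) = @_;
    my $cleaned = {};
    for my $snap (@$snaps) {
        eval {
            delete_snapshot($volid, $snap);
            $cleaned->{$snap} = 1;    # only reached if the delete did not die
        };
        if (my $err = $@) {
            $logfunc->("delete stale replication snapshot error: $err");
        }
    }
    return $cleaned;
}

my @log;
my $cleaned = cleanup_stale(
    'vm-100-disk-1',
    ['snap-old', 'snap-busy'],
    sub { push @log, $_[0] },
);
print "cleaned: ", join(', ', sort keys %$cleaned), "\n";
print "log: $_" for @log;
```

A snapshot that failed to delete is simply not marked as cleaned, so a later
run can pick it up again.
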
> @@ -282,24 +292,16 @@ sub replicate {
>           }
>  
>           replicate_volume($ssh_info, $storecfg, $volid, $base_snapname, $sync_snapname, $rate, $insecure, $logfunc);
> +         # old snapshots will be removed by prepare_local_job on the next run.
>       }
>      };
> -    $err = $@;
> -
> -    if ($err) {
> +    if ($@) {
>       $cleanup_local_snapshots->($replicate_snapshots, $sync_snapname); # try to cleanup
>       # we do not cleanup the remote side here - this is done in
>       # next run of prepare_local_job
> -     die $err;
> +     die $@;
>      }

The above change is unnecessary; I could fix it up and amend your patch.
If there is a v2 of this, please address that there.
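
As a side note on why I would keep the "$err = $@" copy: $@ is a global, and
any eval that succeeds in between resets it to the empty string. If the
cleanup closure uses eval internally (an assumption on my side, I did not
re-check it for this mail), a later "die $@" could end up dying with an
empty message. A small standalone sketch of the effect, where cleanup,
with_copy and without_copy are hypothetical names used only for illustration:

```perl
use strict;
use warnings;

# Hypothetical cleanup helper that uses eval internally, as best-effort
# cleanup code often does. A successful eval resets $@ to "".
sub cleanup {
    eval { 1 };
    return;
}

# Copy $@ into a lexical before doing anything else.
sub with_copy {
    eval { die "replication failed\n" };
    my $err = $@;
    if ($err) {
        cleanup();
        return $err;    # the original error is still intact
    }
    return '';
}

# Use $@ directly across the cleanup call.
sub without_copy {
    eval { die "replication failed\n" };
    if ($@) {
        cleanup();      # the eval inside clobbers $@
        return $@;      # the original error is gone by now
    }
    return '';
}

printf "with copy:    '%s'\n", with_copy() =~ s/\n//r;
printf "without copy: '%s'\n", without_copy() =~ s/\n//r;
```
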

>  
> -    # remove old snapshots because they are no longer needed
> -    $cleanup_local_snapshots->($last_snapshots, $last_sync_snapname);
> -
> -    remote_finalize_local_job($ssh_info, $jobid, $vmid, $sorted_volids, $start_time, $logfunc);
> -

This makes remote_finalize_local_job an orphan: it has no callers anymore.
The finalize-local-job CLI command has no caller left either.
So if this approach is deemed good, those would need to be cleaned up, or
at least deprecated?

> -    die $err if $err;
> -
>      return $volumes;
>  }
>  
> 

