On October 10, 2019 8:55 am, Fabian Ebner wrote:
> On 10/1/19 12:28 PM, Fabian Grünbichler wrote:
>> On October 1, 2019 12:17 pm, Fabian Ebner wrote:
>>> Seems like 'zfs destroy' can take longer than 5 seconds, see [0].
>>> I changed the timeout to 15 seconds and also changed the default
>>> timeout to 10 instead of 5 seconds, to be on the safe side
>>> for other commands like 'zfs create'.
>>>
>>> [0]: https://forum.proxmox.com/threads/timeout-beim-l%C3%B6schen-des-entfernen-replikats-bei-entfernung-der-replikation.58467/
>>
>> NAK, we have a 30s timeout for synchronous API requests that call this,
>> and they might do more than 2 zfs_requests. we'd rather time out and
>> have time to do error handling than finish successfully sometimes and
>> die without cleanup other times.
>>
>> the real solution for this is to convert the remaining synchronous API
>> calls that trigger storage operations to async ones, and switch the GUI
>> over to use the new variants. we had previous discussions about this
>> issue already. to implement it really nicely, we'd need two things:
>> - tasks with a structured return value (to implement async content
>>   listing and similar operations, and to make conversion of sync to
>>   async API calls easier)
>> - light-weight / ephemeral tasks (not included in the regular task
>>   lists/log, cleaned up after the final poll / 1 day after finishing / ...)
>>
>> in case of the mentioned report, please investigate whether this call
>> in 'pvesr' context is wrongly treated as a sync API call (i.e., should
>> is_worker maybe return true for pvesr calls?)
>
> I looked at setting the worker flag for the (whole) pvesr environment,
> but it seems a bit much. E.g., in prepare we need to delete stale
> snapshots before starting the replication, and we probably don't want
> to wait all too long for that to complete, i.e. not have the worker
> flag set when calling volume_snapshot_delete.
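For reference, the timeout policy under discussion boils down to the following (a simplified sketch of the logic quoted in the patch below; pick_timeout is a made-up name for illustration, not a real PVE function):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simplified sketch of zfs_request's timeout selection: an explicitly
# requested timeout always wins; otherwise workers get a generous hour
# and synchronous API calls only the short 5-second default.
sub pick_timeout {
    my ($requested, $is_worker) = @_;
    return $requested if $requested;   # caller passed a timeout
    return $is_worker ? 60*60 : 5;     # worker vs. synchronous default
}

print pick_timeout(15, 0), "\n";      # explicit timeout: 15
print pick_timeout(undef, 1), "\n";   # worker default: 3600
print pick_timeout(undef, 0), "\n";   # sync default: 5
```

This is why flipping is_worker for pvesr would raise the ceiling for every zfs_request made in that context, not just for 'zfs destroy'.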
>
> The situation is different with the vdisk_free call, since there we
> don't need to sit and wait for the result, so there it would make sense
> for it to be treated as a worker. So we could set the worker flag
> locally before that call and use our own timeout, but that feels like a
> bad workaround. Also, we can't use run_with_timeout to set our own
> timeout, since run_command is called inside vdisk_free, and it resets
> the alarm.
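The alarm-reset problem described above can be demonstrated in isolation. This is a minimal sketch, not the real PVE::Tools helpers: a process has only one alarm(), so when an inner helper (standing in for run_command) installs its own longer alarm, the outer timeout silently never fires.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of why nesting two alarm()-based timeouts fails: there is only
# one alarm timer per process, so the inner alarm() cancels the outer one.
sub with_alarm_timeout {
    my ($timeout, $code) = @_;
    my $res = eval {
        local $SIG{ALRM} = sub { die "got timeout\n" };
        alarm($timeout);
        my $r = $code->();
        alarm(0);
        $r;
    };
    alarm(0);
    die $@ if $@;
    return $res;
}

# outer "run_with_timeout" of 1 second...
my $err;
eval {
    with_alarm_timeout(1, sub {
        # ...but the inner command (like run_command inside vdisk_free)
        # installs its own, longer alarm, resetting the outer one:
        with_alarm_timeout(5, sub { sleep 2; return 'done' });
    });
};
$err = $@;

# the work took 2s, yet the outer 1s timeout never fired, because the
# inner alarm(5) replaced it
print $err ? "outer timed out\n" : "outer timeout was clobbered\n";
```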
but we could use run_fork_with_timeout ;)

> Since we have a mechanism to spawn a worker in vdisk_free already
> (currently only used by LVM with saferemove), wouldn't it make sense to
> use that for zfs as well? I.e., have free_image in the zfs plugin
> return a subroutine instead of executing zfs destroy directly. Other
> callers of vdisk_free should also be fine with such a change, since it
> can already happen that vdisk_free spawns a worker (but I haven't
> looked in detail).

but that is the non-default behaviour, not the default one. I'd rather
not switch all ZFS vdisk removals to forking a task if there are other
solutions.

AFAICT, this vdisk_free happens only for previously, but no-longer
replicated volumes? a simple solution might be to refactor lines 139ff
of pvesr.pm to fork a worker if any vdisks need to be removed, and
remove them all asynchronously in a single task?

OTOH, if vdisk_free runs into this problem, it's only a question of load
until other zfs_requests (like listing volumes, listing snapshots,
creating snapshots, removing snapshots) also run into the low timeouts..
so the big question still remains - do we want to increase those
timeouts in general for pvesr, and if we do, how do we do it?

> It would introduce a bit of noise, since it would create a task on
> every vdisk_free for a zfs volume. There, the light-weight tasks you
> mentioned would be ideal, but for now we don't have those. And maybe
> the timeout is hit so rarely that it's not worth the change - what do
> you think?
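The fork-based approach hinted at above sidesteps the single-alarm limitation entirely. Here is a hedged sketch in the spirit of run_fork_with_timeout (this is illustrative, not the actual PVE::Tools implementation): the slow command runs in a child process, so nothing it does internally to the alarm can clobber the parent's deadline.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ':sys_wait_h';

# Sketch of a fork-based timeout: the child does the work (and may set
# its own alarms freely); the parent enforces the deadline with waitpid
# plus its own, untouched alarm.
sub fork_with_timeout {
    my ($timeout, $code) = @_;

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        $code->();    # child: do the actual work
        exit(0);
    }

    # parent: wait for the child, but no longer than $timeout seconds
    my $timed_out = 0;
    eval {
        local $SIG{ALRM} = sub { die "timeout\n" };
        alarm($timeout);
        waitpid($pid, 0);
        alarm(0);
    };
    if ($@) {
        $timed_out = 1;
        kill 'KILL', $pid;    # give up on the child
        waitpid($pid, 0);     # and reap it
    }
    return $timed_out ? undef : ($? >> 8);
}

# a fast child finishes in time, a slow one gets killed
my $ok   = fork_with_timeout(3, sub { sleep 1 });
my $slow = fork_with_timeout(1, sub { sleep 5 });
print defined $ok   ? "fast child exited with $ok\n" : "fast child timed out\n";
print defined $slow ? "slow child exited with $slow\n" : "slow child timed out\n";
```

The trade-off, as noted, is the cost of a fork per call, whereas spawning a full worker task per vdisk_free would additionally pollute the task list.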
>
>>> Signed-off-by: Fabian Ebner <[email protected]>
>>> ---
>>>  PVE/Storage/ZFSPoolPlugin.pm | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/PVE/Storage/ZFSPoolPlugin.pm b/PVE/Storage/ZFSPoolPlugin.pm
>>> index f66b277..3ce06be 100644
>>> --- a/PVE/Storage/ZFSPoolPlugin.pm
>>> +++ b/PVE/Storage/ZFSPoolPlugin.pm
>>> @@ -182,7 +182,7 @@ sub zfs_request {
>>>      my $msg = '';
>>>      my $output = sub { $msg .= "$_[0]\n" };
>>>
>>> -    $timeout = PVE::RPCEnvironment->is_worker() ? 60*60 : 5 if !$timeout;
>>> +    $timeout = PVE::RPCEnvironment->is_worker() ? 60*60 : 10 if !$timeout;
>>>
>>>      run_command($cmd, errmsg => "zfs error", outfunc => $output, timeout => $timeout);
>>>
>>> @@ -346,7 +346,7 @@ sub zfs_delete_zvol {
>>>
>>>      for (my $i = 0; $i < 6; $i++) {
>>>
>>> -        eval { $class->zfs_request($scfg, undef, 'destroy', '-r', "$scfg->{pool}/$zvol"); };
>>> +        eval { $class->zfs_request($scfg, 15, 'destroy', '-r', "$scfg->{pool}/$zvol"); };
>>>          if ($err = $@) {
>>>              if ($err =~ m/^zfs error:(.*): dataset is busy.*/) {
>>>                  sleep(1);
>>> --
>>> 2.20.1
>>>
>>> _______________________________________________
>>> pve-devel mailing list
>>> [email protected]
>>> https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
