Hi all,

Gabriel is taking a look into this one, but I would suggest that anybody else available also take a look, please - especially KVM users, who are affected as described below.
Summary so far:
- Xen + NFS - fixed/working fine - tested both with and without backing up the snap to the secondary store - deleted from the appropriate stores fine.
- VMware + NFS - tested - no issues (perhaps it was not even broken).
- KVM + Ceph - per the latest testing, works fine - but, if I'm not mistaken, the behaviour here differs from KVM + NFS (see below).
- KVM + NFS - broken - you can't even delete the snapshot, exceptions are raised (please see more details on the PR itself).

The difference I believe I noticed between KVM + Ceph and KVM + NFS is that the snapshot on the RBD Primary store is NOT deleted when the snapshot is created and copied over to the Secondary Storage (again, from memory...) - but with NFS/QCOW2, the snapshot IS REMOVED from the QCOW2 file once it has been copied (qemu-img converted) to the Secondary Storage. This is specifically what happens when **snapshot.backup.to.secondary=TRUE** (when this is FALSE, the snap is kept on the QCOW2 as the only logical choice), and this is what currently causes issues when you try to delete a snap from a KVM + NFS setup.

I proposed that this be unified: if snapshot.backup.to.secondary=TRUE, then I believe it's better for Ceph to also NOT keep the snap on RBD/Primary once it has been copied over to the secondary storage - too many snaps on Ceph = destroyed performance on the Ceph cluster (at least in older Ceph versions - @Gabriel Beims Bräscher <gabrasc...@gmail.com>, @Wido den Hollander <w...@42on.com>).

With KVM/NFS, if you want to really restore the volume (new API in... 4.11.x?), the qcow2 snap is copied back in the background from Secondary to Primary storage and is used to "revert" the volume. The same can be done for Ceph, I guess - it's a qemu-img convert. (A rough sketch of this flow is at the very bottom of this mail, below the quoted thread.)

If we don't unify the behaviour, that can still work, but then when creating KVM + NFS snapshots, the row in the snapshots_store_ref table should not exist (or should be marked as "Destroyed") for the PRIMARY store_role, as we have actually deleted the snap from the QCOW2 by the time the createSnapshot API completes - this is what raises the exception when trying to delete a snap (more details in the PR).

Thanks,
Andrija

On Tue, 4 Feb 2020 at 14:05, Andrija Panic <andrija.pa...@gmail.com> wrote:

> +1, but we need to test different HV+storage combinations... that is some effort.
>
> On Tue, 4 Feb 2020 at 13:56, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>
>> so having seen the discussions here and on the PR, do we agree to try and get @Gabriel Beims Bräscher <gabrasc...@gmail.com>'s PR in and leave it at that for this release?
>>
>> On Tue, Feb 4, 2020 at 10:10 AM Gabriel Beims Bräscher <gabrasc...@gmail.com> wrote:
>>
>> > Hello folks,
>> >
>> > Just to give you an update. I deployed a XenServer cluster and performed a few tests on PR #3649. After upgrading a 4.13.0.0 Zone with this fix, the XenServer snapshot was deleted on primary and secondary storage (NFS).
>> >
>> > On Mon, 3 Feb 2020 at 14:11, Gabriel Beims Bräscher <gabrasc...@gmail.com> wrote:
>> >
>> > > I would try as much as possible to have it merged into 4.14, considering that it is not simple to map all the garbage snapshots on secondary storage.
>> > >
>> > > The proposed PR [1] should, in theory, also fix this for XenServer. However, I have not tested it for XenServer so far.
>> > > Today I am deploying a XenServer cluster to check it.
>> > > If someone else could also hammer that PR and see if it works fine, that would be great :-)
>> > >
>> > > [1] https://github.com/apache/cloudstack/pull/3649
>> > >
>> > > On Mon, 3 Feb 2020 at 14:02, Paul Angus <paul.an...@shapeblue.com> wrote:
>> > >
>> > >> Thanks. My vote would be that it is a blocker, as there is no way to clean up, and so storage filling up and crashing is a very real possibility.
>> > >>
>> > >> -----Original Message-----
>> > >> From: Andrija Panic <andrija.pa...@gmail.com>
>> > >> Sent: 03 February 2020 16:58
>> > >> To: dev <dev@cloudstack.apache.org>
>> > >> Subject: Re: [DISCUSS] blocker issue 3646 for 4.14/4.13.1
>> > >>
>> > >> I believe not - i.e. you can go and delete the files manually (but in some cases there are also records not properly removed from the snapshots_store_ref, for either the primary or secondary kind, which makes it more complicated...)
>> > >>
>> > >> I can see Simon has asked his colleague to check it (comments on the PR) - fingers crossed.
>> > >>
>> > >> On Mon, 3 Feb 2020 at 17:37, Paul Angus <paul.an...@shapeblue.com> wrote:
>> > >>
>> > >> > Is there any kind of workaround or way to 'force' snapshots to be cleaned up (one that doesn't create inconsistencies in CloudStack's view of the world vs the physical world)?
>> > >> >
>> > >> > -----Original Message-----
>> > >> > From: Andrija Panic <andrija.pa...@gmail.com>
>> > >> > Sent: 03 February 2020 16:35
>> > >> > To: dev <dev@cloudstack.apache.org>
>> > >> > Subject: Re: [DISCUSS] blocker issue 3646 for 4.14/4.13.1
>> > >> >
>> > >> > This issue has been here from before (i.e. it is not new to 4.14), so we can argue it's not technically a blocker, since the regression happened in some previous release, and I can live with it being moved to 4.15.
>> > >> >
>> > >> > That being said, it would be great to see it solved if this rings any bells for anyone who might have played with the related code...
>> > >> >
>> > >> > On Mon, 3 Feb 2020 at 13:21, Daan Hoogland <daan.hoogl...@gmail.com> wrote:
>> > >> >
>> > >> > > People,
>> > >> > > A ticket has been raised as a blocker, but I don't think anybody here has the resources to fix it. It is a regression of kinds, and a known issue, but in my not so humble opinion it won't block anybody from using a future release. The issue [1] describes the problem and a PR [2] gives a partial solution. It is known to work for a KVM/Ceph environment and thus might be too specific.
>> > >> > >
>> > >> > > I move that we either
>> > >> > > 1. find the PR that caused this and revert it, and/or
>> > >> > > 2. postpone fixing it till after 4.14 (unless someone has the resources and volunteers to address it) and, as an ugly workaround exists (creating a cron job for your env that deletes stale images), unmark it as blocker.
>> > >> > >
>> > >> > > [1] https://github.com/apache/cloudstack/issues/3646
>> > >> > > [2] https://github.com/apache/cloudstack/pull/3649
>> > >> > >
>> > >> > > any comments, please?
>> > >> > > --
>> > >> > > Daan
>> > >> >
>> > >> > --
>> > >> > Andrija Panić
>> > >>
>> > >> --
>> > >> Andrija Panić
>> >
>>
>> --
>> Daan
>
> --
> Andrija Panić

--
Andrija Panić
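
P.S. For anyone less familiar with the KVM + NFS flow described at the top of this mail, here is a rough, untested sketch (plain Python wrapping qemu-img) of the "back up to secondary, then drop the snap from the QCOW2" sequence when snapshot.backup.to.secondary=true. All paths and names below are made up for illustration only - they are not taken from the actual CloudStack code:

import subprocess

# Hypothetical example paths/names, not CloudStack's real layout.
PRIMARY_VOLUME = "/mnt/primary/volume-1234.qcow2"            # volume on NFS primary storage
SECONDARY_COPY = "/mnt/secondary/snapshots/snap-1234.qcow2"  # backup target on secondary storage
SNAP_NAME = "snap-1234"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Take an internal snapshot inside the QCOW2 on primary storage.
run(["qemu-img", "snapshot", "-c", SNAP_NAME, PRIMARY_VOLUME])

# 2. Copy ("qemu-img convert") that snapshot to secondary storage as a standalone QCOW2.
run(["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2",
     "-l", f"snapshot.name={SNAP_NAME}", PRIMARY_VOLUME, SECONDARY_COPY])

# 3. On KVM + NFS the internal snapshot is then removed from the QCOW2 on primary,
#    so the copy on secondary storage is the only one left (on Ceph/RBD the snap
#    currently also stays on primary - the behaviour I'm proposing we unify).
run(["qemu-img", "snapshot", "-d", SNAP_NAME, PRIMARY_VOLUME])

# Reverting later means copying the file back from secondary to primary, roughly:
# run(["qemu-img", "convert", "-f", "qcow2", "-O", "qcow2", SECONDARY_COPY, PRIMARY_VOLUME])

The point of the sketch is only to show why, after createSnapshot completes on KVM + NFS, there is nothing left on the primary store for that snapshot - hence the snapshots_store_ref row for the PRIMARY store_role should not exist (or should be marked "Destroyed").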