Thanks for the feedback, Andrija. It looks like delete was not fully supported then (am I missing something?). I will look into this and open a PR adding proper support for RBD snapshot deletion if necessary.
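For context, the primary-storage side of that deletion should boil down to something like the sketch below, using rados-java (the library CloudStack 4.13.0.0 already uses for RBD). This is only a minimal, hand-written illustration of the idea, not the eventual PR: the monitor address, cephx key, pool, image, and snapshot names are placeholders (the volume UUID is the one from this thread), and error handling is left out.

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;

public class RbdSnapshotDeleteSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; in CloudStack they come from the primary storage pool definition.
        Rados rados = new Rados("admin");
        rados.confSet("mon_host", "ceph-mon.example:6789");
        rados.confSet("key", "replace-with-cephx-key");
        rados.connect();

        IoCTX io = rados.ioCtxCreate("rbd"); // RBD pool name
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("ac510428-5d09-4e86-9d34-9dfab3715b7c"); // volume UUID
            try {
                String snapName = "my-snapshot"; // snapshot to remove (placeholder)
                // A protected snapshot (e.g. the base of a clone) must be unprotected before removal.
                if (image.snapIsProtected(snapName)) {
                    image.snapUnprotect(snapName);
                }
                image.snapRemove(snapName); // actually removes the snapshot on the RBD image
            } finally {
                rbd.close(image);
            }
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}

In other words, the agent needs to receive enough information (pool, image, snapshot name) to run the equivalent of "rbd snap rm rbd/ac510428-5d09-4e86-9d34-9dfab3715b7c@my-snapshot", instead of the snapshot only being marked as removed in the database.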
Regarding the rollback, I have tested it several times and it worked; however, I see a weak point in the Ceph rollback implementation. It looks like Li Jerry was able to execute the rollback without any problem. Li, could you please post here the log output: "Attempting to rollback RBD snapshot [name:%s], [pool:%s], [volumeid:%s], [snapshotid:%s]"? Andrija will not be able to see that log entry, as the exception happens before it is reached; in his case the only way to check those values is via remote debugging. If you are able to post those values, it would also help in sorting out what is wrong.

I am checking the code base, running a few tests, and evaluating the log that you (Andrija) sent. What I can say for now is that the parameter "snapshotRelPath = snapshot.getPath()" [1] is a critical piece of code that can definitely break the rollback execution flow. My tests had pointed to one pattern, but now I see other possibilities. I will probably add a few parameters to the rollback/revert command instead of relying on the path, or review the path life cycle and the different execution flows, to make it safer to use (I have put a rough sketch of the Ceph-side call at the very end of this message, below the quoted thread).

[1] https://github.com/apache/cloudstack/blob/50fc045f366bd9769eba85c4bc3ecdc0b7035c11/plugins/hypervisors/kvm/src/main/java/com/cloud/hypervisor/kvm/resource/wrapper

A few details on the test environment and the Ceph/RBD versions:
- CloudStack, KVM, and Ceph nodes are running Ubuntu 18.04
- Ceph version 13.2.5 (cbff874f9007f1869bfd3821b7e33b2a6ffd4988) mimic (stable)
- RADOS Block Devices have had snapshot rollback support since Ceph v10.0.2 [https://github.com/ceph/ceph/pull/6878]
- Rados-java [https://github.com/ceph/rados-java] has supported snapshot rollback since 0.5.0; rados-java 0.5.0 is the version used by CloudStack 4.13.0.0

I will be updating here soon.

On Sun, Sep 8, 2019 at 12:28, Wido den Hollander <w...@widodh.nl> wrote:
>
>
> On 9/8/19 5:26 AM, Andrija Panic wrote:
> > Maaany release ago, deleting Ceph volume snap, was also only deleting it
> > in DB, so the RBD performance become terrible with many tens of (i. e.
> > Hourly) snapshots. I'll try to verify this on 4.13 myself, but Wido and
> > the guys will know better...
>
> I pinged Gabriel and he's looking into it. He'll get back to it.
>
> Wido
>
> >
> > On Sat, Sep 7, 2019, 08:34 li jerry <div...@hotmail.com> wrote:
> >
> >> I found it had nothing to do with storage.cleanup.delay and
> >> storage.cleanup.interval.
> >>
> >> The reason is that when DeleteSnapshot Cmd is executed, because the RBD
> >> snapshot does not have Copy to secondary storage, it only changes the
> >> database information, and does not enter the main storage to delete the
> >> snapshot.
> >>
> >> Log===========================
> >>
> >> 2019-09-07 23:27:00,118 DEBUG [c.c.a.ApiServlet] (qtp504527234-17:ctx-2e407b61) (logid:445cbea8) ===START=== 192.168.254.3 -- GET command=deleteSnapshot&id=0b50eb7e-4f42-4de7-96c2-1fae137c8c9f&response=json&_=1567869534480
> >>
> >> 2019-09-07 23:27:00,139 DEBUG [c.c.a.ApiServer] (qtp504527234-17:ctx-2e407b61 ctx-679fd276) (logid:445cbea8) CIDRs from which account 'Acct[2f96c108-9408-11e9-a820-0200582b001a-admin]' is allowed to perform API calls: 0.0.0.0/0,::/0
> >>
> >> 2019-09-07 23:27:00,204 DEBUG [c.c.a.ApiServer] (qtp504527234-17:ctx-2e407b61 ctx-679fd276) (logid:445cbea8) Retrieved cmdEventType from job info: SNAPSHOT.DELETE
> >>
> >> 2019-09-07 23:27:00,217 INFO [o.a.c.f.j.i.AsyncJobMonitor] (API-Job-Executor-2:ctx-f0843047 job-1378) (logid:c34a368a) Add job-1378 into job monitoring
> >>
> >> 2019-09-07 23:27:00,219 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (qtp504527234-17:ctx-2e407b61 ctx-679fd276) (logid:445cbea8) submit async job-1378, details: AsyncJobVO {id:1378, userId: 2, accountId: 2, instanceType: Snapshot, instanceId: 13, cmd: org.apache.cloudstack.api.command.user.snapshot.DeleteSnapshotCmd, cmdInfo: {"response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"1237","id":"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f","ctxDetails":"{\"interface com.cloud.storage.Snapshot\":\"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f\"}","ctxAccountId":"2","uuid":"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f","cmdEventType":"SNAPSHOT.DELETE","_":"1567869534480"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 2200502468634, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
> >>
> >> 2019-09-07 23:27:00,220 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-2:ctx-f0843047 job-1378) (logid:1cee5097) Executing AsyncJobVO {id:1378, userId: 2, accountId: 2, instanceType: Snapshot, instanceId: 13, cmd: org.apache.cloudstack.api.command.user.snapshot.DeleteSnapshotCmd, cmdInfo: {"response":"json","ctxUserId":"2","httpmethod":"GET","ctxStartEventId":"1237","id":"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f","ctxDetails":"{\"interface com.cloud.storage.Snapshot\":\"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f\"}","ctxAccountId":"2","uuid":"0b50eb7e-4f42-4de7-96c2-1fae137c8c9f","cmdEventType":"SNAPSHOT.DELETE","_":"1567869534480"}, cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0, result: null, initMsid: 2200502468634, completeMsid: null, lastUpdated: null, lastPolled: null, created: null, removed: null}
> >>
> >> 2019-09-07 23:27:00,221 DEBUG [c.c.a.ApiServlet] (qtp504527234-17:ctx-2e407b61 ctx-679fd276) (logid:445cbea8) ===END=== 192.168.254.3 -- GET command=deleteSnapshot&id=0b50eb7e-4f42-4de7-96c2-1fae137c8c9f&response=json&_=1567869534480
> >>
> >> 2019-09-07 23:27:00,305 DEBUG [c.c.a.m.ClusteredAgentAttache] (AgentManager-Handler-12:null) (logid:) Seq 1-8660140608456756853: Routing from 2199066247173
> >>
> >> 2019-09-07 23:27:00,305 DEBUG [o.a.c.s.s.XenserverSnapshotStrategy] (API-Job-Executor-2:ctx-f0843047 job-1378 ctx-f50e25a4) (logid:1cee5097) Can't find snapshot on backup storage, delete it in db
> >>
> >> -Jerry
> >>
> >> ________________________________
> >> From: Andrija Panic <andrija.pa...@gmail.com>
> >> Sent: Saturday, September 7, 2019 1:07:19 AM
> >> To: users <us...@cloudstack.apache.org>
> >> Cc: dev@cloudstack.apache.org <dev@cloudstack.apache.org>
> >> Subject: Re: 4.13 rbd snapshot delete failed
> >>
> >> storage.cleanup.delay
> >> storage.cleanup.interval
> >>
> >> put both to 60 (seconds) and wait for up to 2min - should be deleted just
> >> fine...
> >>
> >> cheers
> >>
> >> On Fri, 6 Sep 2019 at 18:52, li jerry <div...@hotmail.com> wrote:
> >>
> >>> Hello All
> >>>
> >>> When I tested ACS 4.13 KVM + CEPH snapshot, I found that snapshots could
> >>> be created and rolled back (using API alone), but deletion could not be
> >>> completed.
> >>>
> >>> After executing the deletion API, the snapshot will disappear from the
> >>> list Snapshots, but the snapshot on CEPH RBD will not be deleted (rbd snap
> >>> list rbd/ac510428-5d09-4e86-9d34-9dfab3715b7c)
> >>>
> >>> Is there any way we can completely delete the snapshot?
> >>>
> >>> -Jerry
> >>
> >> --
> >>
> >> Andrija Panić
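PS: about the extra parameters for the rollback/revert command mentioned above: on the Ceph side the rollback itself only needs the monitor/auth details, the pool, the image name, and the snapshot name, which is why passing those explicitly (instead of deriving them from snapshot.getPath()) looks safer to me. Below is a rough, hand-written sketch of that Ceph-side call with rados-java; note that the rollback method name (snapRollBack) is my assumption here, it would wrap librbd's rbd_snap_rollback, so please double check the exact name exposed by rados-java 0.5.0.

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;

public class RbdSnapshotRollbackSketch {

    // Rolls an RBD image back to a named snapshot. Everything the agent needs is
    // passed in explicitly instead of being parsed out of the stored snapshot path.
    static void rollback(String monHost, String authUser, String key,
                         String pool, String imageName, String snapName) throws Exception {
        Rados rados = new Rados(authUser);
        rados.confSet("mon_host", monHost);
        rados.confSet("key", key);
        rados.connect();

        IoCTX io = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(imageName);
            try {
                // Method name assumed for illustration; the real call wraps librbd's rbd_snap_rollback.
                image.snapRollBack(snapName);
            } finally {
                rbd.close(image);
            }
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}

This is the same thing "rbd snap rollback <pool>/<image>@<snapshot>" does from the CLI, so if the command carries those values we would not depend on the path format at all.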