On 10-09-15 14:07, Andrija Panic wrote:
> Wido,
>
> The part of the code that deletes a volume checks whether the volume is
> of type RBD - and then tries to list the snapshots, delete the
> snapshots, and finally remove the image. The first step, listing the
> snapshots, fails if more than 16 snapshots are present - the number 16
> is hardcoded elsewhere in the code and an RBD exception is thrown...
> then the agent crashes, and then VMs go down, etc.
>

Hmm, that seems like a bug in rados-java indeed. I don't know if there
is a release of rados-java where this is fixed. Looking at the code of
rados-java it should be, but I'm not 100% certain.
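If memory serves, the listing allocates room for 16 entries up front and
throws as soon as librbd answers -ERANGE, while it should simply retry
with a bigger buffer. Roughly like this (an untested sketch inside
RbdImage; the JNA call and field names are from memory, so double-check
them against the source):

    // Untested sketch: grow the buffer on -ERANGE instead of failing.
    // Inside RbdImage.java - needs java.util.*, com.sun.jna.ptr.IntByReference.
    // librbd's rbd_snap_list() returns -ERANGE when the buffer is too
    // small and writes the required size into numSnaps.
    public List<RbdSnapInfo> snapList() throws RbdException {
        final int ERANGE = 34;
        int bufSize = 16; // the old hardcoded value, now only a starting point

        while (true) {
            IntByReference numSnaps = new IntByReference(bufSize);
            RbdSnapInfo[] snaps = (RbdSnapInfo[]) new RbdSnapInfo().toArray(bufSize);

            int r = rbd.rbd_snap_list(this.getPointer(), snaps, numSnaps);
            if (r >= 0) {
                // r is the number of snapshots actually filled in
                List<RbdSnapInfo> list = new ArrayList<RbdSnapInfo>(r);
                for (int i = 0; i < r; i++) {
                    list.add(snaps[i]);
                }
                return list;
            } else if (r == -ERANGE) {
                // Buffer too small; librbd told us how big it has to be
                bufSize = numSnaps.getValue();
            } else {
                throw new RbdException("Failed listing snapshots", r);
            }
        }
    }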
> So our current quick fix is to invoke an external script which will
> also list and remove all the snapshots, but will not fail.
>

Yes, but we should fix it upstream. I understand that you will use a
temporary script to clean up everything.

> I'm not sure why 16 is the hardcoded limit - I will try to provide the
> part of the code where this is present... We can increase this number
> (from 16 to e.g. 200), but that doesn't make any sense, since we still
> have a lot of garbage left on Ceph (snapshots that were removed in ACS
> (DB and secondary NFS) but not removed from Ceph). In my understanding
> this needs to be implemented, so we don't hit the exceptions that I
> originally described...
>
> Any thoughts on this?
>

A cleanup script for now should be OK indeed. Afterwards the Java code
should be able to do this. You can try it manually by using rados-java
and fix it there. This is the part where the listing is done:

https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
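And once snapList() survives more than 16 snapshots, the whole cleanup
can be done through the public rados-java API, so no external script
should be needed. A rough, untested example - the client id, pool and
image names are made up, and a snapshot that CloudStack protected has to
be unprotected before it can be removed:

    import java.io.File;
    import java.util.List;

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    public class PurgeSnapshots {
        public static void main(String[] args) throws Exception {
            Rados rados = new Rados("admin"); // example client id
            rados.confReadFile(new File("/etc/ceph/ceph.conf"));
            rados.connect();

            IoCTX io = rados.ioCtxCreate("cloudstack"); // example pool name
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open("volume-1234");   // example image name
            try {
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    // CloudStack protects snapshots it clones from,
                    // so unprotect before removing
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
                rados.ioCtxDestroy(io);
            }
        }
    }

That is essentially what should happen before the image is removed; the
agent already uses these bindings elsewhere, so it should fit into the
volume-deletion path.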
Wido

> Thanks for the input!
>
> On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>
>> On 10-09-15 12:17, Andrija Panic wrote:
>>> We are testing a [dirty?] patch on our dev system and we will share
>>> it for review soon.
>>>
>>> Basically, we are using an external Python script that is invoked in
>>> a certain part of the code to delete the needed Ceph snapshots, and
>>> after that the code proceeds with the volume deletion etc...
>>>
>>
>> That shouldn't be required. The Java bindings for librbd and librados
>> should be able to remove the snapshots.
>>
>> There is no need to invoke external code; this can all be handled in
>> Java.
>>
>>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com>
>>> wrote:
>>>
>>>> Eh, OK. Thanks for the info.
>>>>
>>>> BTW, why is the limit of 16 snapshots hardcoded - is there any
>>>> reason for that?
>>>>
>>>> Not cleaning up snapshots on Ceph and trying to delete a volume
>>>> after having more than 16 snapshots in Ceph = the agent crashing on
>>>> the KVM side... and some VMs being rebooted etc. - which means
>>>> downtime :|
>>>>
>>>> Thanks,
>>>>
>>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>
>>>>> Andrija,
>>>>>
>>>>> Ceph snapshot deletion is not currently implemented.
>>>>>
>>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>
>>>>> - Si
>>>>>
>>>>> ________________________________________
>>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>>> Secondary NFS and DB)
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> We have encountered an issue in ACS 4.5.1 (perhaps other versions
>>>>> are also affected) - when we delete a volume snapshot in ACS, ACS
>>>>> marks it as deleted in the DB and deletes it from NFS secondary
>>>>> storage, but fails to delete the snapshot on Ceph primary storage
>>>>> (it doesn't even try to delete it, AFAIK).
>>>>>
>>>>> So we end up having, say, 5 live snapshots in the DB while there
>>>>> are actually more than e.g. 16 snapshots in Ceph.
>>>>>
>>>>> Worse, when the ACS agent tries to obtain the list of snapshots
>>>>> from Ceph for some volume, it raises an exception if the number of
>>>>> snapshots is over 16 (and perhaps this is the reason the agent
>>>>> crashed for us - I need to check with my colleagues who are
>>>>> investigating this in detail). This number 16 is for whatever
>>>>> reason hardcoded in the ACS code.
>>>>>
>>>>> Wondering if anyone has experienced this, or has any info - we plan
>>>>> to try to fix this, and I will include my dev colleagues here, but
>>>>> we might need some help, at least for guidance.
>>>>>
>>>>> Any help is really appreciated, or at least confirmation that this
>>>>> is a known issue etc.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>>
>>>>> Andrija Panić
>>>>
>>>> --
>>>>
>>>> Andrija Panić