On 10-09-15 14:21, Andrija Panic wrote:
> Thx Wido,
>
> I will have my colleagues Igor and Dmytro join with details on this.
>

Great!

> I agree we need to fix this upstream, that is the main purpose from our side!
>

I'd love to see a pull request for rados-java :) 0.1.5 should be released then.

> With this temp fix, we just avoid the agent crashing (the agent somehow
> restarts again fine :) ) but VMs also go down on that host, at least some of
> them.
>

True, but I think the fix in rados-java won't be that hard.

> Do you see any lifecycle/workflow issue if we implement deleting the SNAP from
> CEPH after you SNAP a volume in ACS and successfully move it to Secondary NFS -
> or perhaps only delete the SNAP from CEPH as part of the actual SNAP deletion
> (when you delete a snapshot from DB and NFS, manually or via scheduled
> snapshots)? Maybe the second option is better, I don't know how you guys handle
> this for regular NFS as primary storage etc...
>

No, there is no problem. You can remove the RBD snapshot afterwards, ACS
will never touch it.

So it's fine to remove any RBD snapshot(s) from volumes without telling ACS.

Wido

>
> Any guidance is most welcome, and our team will try to code all this.
>
> Thx Wido again
>
> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:
>
>>
>>
>> On 10-09-15 14:07, Andrija Panic wrote:
>>> Wido,
>>>
>>> the part of the code that deletes a volume checks if the volume is of type
>>> RBD - and then tries to list snapshots, delete snapshots, and finally
>>> remove the image. Here the first step - listing snapshots - fails if there
>>> are more than 16 snapshots present - the number 16 is hardcoded elsewhere
>>> in the code and it throws an RBD exception...then the agent crashes... and
>>> then VMs go down etc.
>>>
>>
>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
>> is a release of rados-java where this is fixed.
>>
>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>
>>> So our current quick fix is to invoke an external script which will
>>> also list and remove all snapshots, but will not fail.
>>>
>>
>> Yes, but we should fix it upstream. I understand that you will use a
>> temp script to clean up everything.
>>
>>> I'm not sure why 16 is the hardcoded limit - will try to provide the
>>> part of the code where this is present...we can increase this number but it
>>> doesn't make any sense (from 16 to e.g. 200), since we still have a lot of
>>> garbage left on CEPH (snapshots that were removed in ACS (DB and Secondary
>>> NFS) - but not removed from CEPH). And in my understanding this needs to be
>>> implemented, so we don't hit the exceptions that I originally described...
>>>
>>> Any thoughts on this ?
>>>
>>
>> A cleanup script for now should be OK indeed. Afterwards the Java code
>> should be able to do this.
>>
>> You can try manually by using rados-java and fix that.
>>
>> This is the part where the listing is done:
>>
>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>
>> Wido
>>
>>> Thx for input!
>>>
>>> On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>>>
>>>>
>>>> On 10-09-15 12:17, Andrija Panic wrote:
>>>>> We are testing some [dirty?] patch on our dev system and we shall soon
>>>>> share it for review.
>>>>>
>>>>> Basically, we are using an external python script that is invoked in some
>>>>> part of code execution to delete the needed CEPH snapshots and then after
>>>>> that proceeds with volume deletion etc...
>>>>>
>>>>
>>>> That shouldn't be required. The Java bindings for librbd and librados
>>>> should be able to remove the snapshots.
>>>>
>>>> There is no need to invoke external code, this can all be handled in Java.
>>>>
>>>>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Eh, OK. Thx for the info.
>>>>>>
>>>>>> BTW why is the 16 snapshot limit hardcoded - any reason for that ?
>>>>>>
>>>>>> Not cleaning snapshots on CEPH and trying to delete a volume after having
>>>>>> more than 16 snapshots in CEPH = Agent crashing on KVM side...and some VMs
>>>>>> being rebooted etc - which means downtime :|
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>>>
>>>>>>> Andrija,
>>>>>>>
>>>>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>>>
>>>>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>
>>>>>>> - Si
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>>>>> Secondaryt NFS and DB)
>>>>>>>
>>>>>>> Hi folks,
>>>>>>>
>>>>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also
>>>>>>> affected) - when we delete a volume snapshot in ACS, ACS marks it as
>>>>>>> deleted in DB, deletes it from NFS Secondary Storage but fails to delete
>>>>>>> the snapshot on CEPH primary storage (doesn't even try to delete it AFAIK)
>>>>>>>
>>>>>>> So we end up having 5 live snapshots in DB (just an example) but actually
>>>>>>> in CEPH there are more than e.g. 16 snapshots.
>>>>>>>
>>>>>>> To make it worse, when the ACS agent tries to obtain the list of snapshots
>>>>>>> from CEPH for some volume - if the number of snapshots is over 16, it
>>>>>>> raises an exception (and perhaps this is the reason the Agent crashed for
>>>>>>> us - need to check with my colleagues who are investigating this in
>>>>>>> detail). This number 16 is for whatever reason hardcoded in ACS code.
>>>>>>>
>>>>>>> Wondering if anyone has experienced this, or has any info - we plan to
>>>>>>> try to fix this, and I will include my dev colleagues here, but we might
>>>>>>> need some help, at least for guidance.
>>>>>>>
>>>>>>> Any help is really appreciated, or at least confirmation that this is a
>>>>>>> known issue etc.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Andrija Panić
>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Andrija Panić
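
For reference, the listing failure discussed in this thread can be sketched independently of rados-java. The snippet below is only an illustration in Python, not the actual rados-java or CloudStack code: `FakeImage`, `snap_list_into`, `list_snapshots` and `purge_snapshots` are hypothetical names, and it assumes the underlying list call signals a too-small buffer with ERANGE and reports the size it needs. Rather than failing once more than 16 snapshots exist, the caller grows the buffer and retries, then removes every snapshot before the image itself would be deleted.

```python
import errno


class FakeImage:
    """Hypothetical stand-in for an RBD image (not a real librbd/rados-java class)."""

    def __init__(self, snap_names):
        self.snap_names = list(snap_names)

    def snap_list_into(self, capacity):
        """Mimic a C-style list call and return (rc, needed, snaps).

        When capacity is too small, rc is -ERANGE and needed is the
        number of entries the caller must make room for.
        """
        needed = len(self.snap_names)
        if capacity < needed:
            return -errno.ERANGE, needed, []
        return needed, needed, list(self.snap_names)

    def snap_remove(self, name):
        self.snap_names.remove(name)


def list_snapshots(image, initial_capacity=16):
    """List all snapshots, growing the buffer instead of failing above 16."""
    capacity = initial_capacity
    while True:
        rc, needed, snaps = image.snap_list_into(capacity)
        if rc >= 0:
            return snaps
        if rc != -errno.ERANGE:
            raise OSError(-rc, "snapshot listing failed")
        # Buffer too small: retry with at least the size the call asked for.
        capacity = max(needed, capacity * 2)


def purge_snapshots(image):
    """Remove every snapshot so the image can be deleted afterwards."""
    removed = []
    for name in list_snapshots(image):
        image.snap_remove(name)
        removed.append(name)
    return removed


image = FakeImage(f"snap-{i}" for i in range(40))
removed = purge_snapshots(image)
print(len(removed), len(image.snap_names))  # prints: 40 0
```

The retry loop, rather than simply raising the constant from 16 to a larger number, is the point of the sketch: any fixed limit eventually fails again once enough leftover snapshots accumulate on the Ceph side.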