Thx Wido,

I will have my colleagues Igor and Dmytro join in with details on this.

I agree we need to fix this upstream - that is the main goal from our side!

With this temporary fix we just avoid the agent crashing. When it does crash,
the agent somehow restarts fine again :) , but some of the VMs on that host
also go down.

Do you see any lifecycle/workflow issue if we implement deleting the snapshot
from Ceph right after ACS snapshots a volume and successfully moves it to
secondary NFS - or should we perhaps only delete the snapshot from Ceph as
part of the actual snapshot deletion (when the snapshot is deleted from the
DB and NFS, either manually or via scheduled snapshots)? Maybe the second
option is better; I don't know how you guys handle this for regular NFS as
primary storage etc...
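
Either way, the Ceph-side cleanup itself would boil down to something like the
rough sketch below (using rados-java; the method, its parameters and the exact
place where it gets hooked into the snapshot workflow are just assumptions on
our side for illustration):

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;

public class CephSnapshotCleanup {

    // Rough sketch: remove a single RBD snapshot from primary storage after it
    // has been successfully copied to secondary NFS (or as part of the actual
    // snapshot deletion). All parameters here are placeholders.
    public static void removeRbdSnapshot(String monHost, String authUser, String authSecret,
                                         String pool, String volumeUuid, String snapshotName)
            throws Exception {
        Rados rados = new Rados(authUser);
        rados.confSet("mon_host", monHost);
        rados.confSet("key", authSecret);
        rados.connect();

        IoCTX io = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(volumeUuid);
            try {
                // If the snapshot is protected (e.g. used as a clone base), it
                // would have to be unprotected with snapUnprotect() first.
                image.snapRemove(snapshotName);
            } finally {
                rbd.close(image);
            }
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}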


Any guidance is most welcome, and our team will try to code all of this.

Thx again, Wido.

On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl> wrote:

>
>
> On 10-09-15 14:07, Andrija Panic wrote:
> > Wido,
> >
> > The part of the code that deletes a volume checks whether the volume is of
> > type RBD, and then tries to list the snapshots, delete the snapshots, and
> > finally remove the image. The first step - listing the snapshots - fails if
> > there are more than 16 snapshots present (the number 16 is hardcoded
> > elsewhere in the code) and throws an RBD exception... then the agent
> > crashes... and then VMs go down etc.
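
For reference, that failing path boils down to roughly the following (a
simplified sketch using rados-java with placeholder names, not the actual
agent code):

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdVolumeDeletePath {

    // Simplified sketch of the RBD volume delete path; connection details and
    // the volume name are placeholders.
    public static void deleteRbdVolume(String monHost, String authUser, String authSecret,
                                       String pool, String volumeUuid) throws Exception {
        Rados rados = new Rados(authUser);
        rados.confSet("mon_host", monHost);
        rados.confSet("key", authSecret);
        rados.connect();

        IoCTX io = rados.ioCtxCreate(pool);
        Rbd rbd = new Rbd(io);
        RbdImage image = rbd.open(volumeUuid);

        // This is the call that throws an RBD exception for us once the image
        // has more than 16 snapshots (the hardcoded limit mentioned above).
        List<RbdSnapInfo> snaps = image.snapList();

        // Remove every snapshot before removing the image itself.
        for (RbdSnapInfo snap : snaps) {
            image.snapRemove(snap.name);
        }

        rbd.close(image);
        rbd.remove(volumeUuid);
        rados.ioCtxDestroy(io);
    }
}
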
> >
>
> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
> is a release of rados-java where this is fixed.
>
> Looking at the code of rados-java it should be, but I'm not 100% certain.
>
> > So our current quick fix is to invoke an external script which will also
> > list and remove all snapshots, but will not fail.
> >
>
> Yes, but we should fix it upstream. I understand that you will use a
> temp script to clean up everything.
>
> > I'm not sure why 16 is the hardcoded limit - I will try to provide the part
> > of the code where this is present... We could increase this number (from 16
> > to e.g. 200), but that doesn't make much sense, since we would still have a
> > lot of garbage left on Ceph (snapshots that were removed in ACS (DB and
> > secondary NFS) but not removed from Ceph). In my understanding this cleanup
> > needs to be implemented, so we don't hit the exceptions that I originally
> > described...
> >
> > Any thoughts on this?
> >
>
> A cleanup script for now should be OK indeed. Afterwards the Java code
> should be able to do this.
>
> You can try this manually using rados-java and fix it there.
>
> This is the part where the listing is done:
>
> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
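
We will look at that - a minimal way to reproduce it against that listing code
should be something along these lines (a rough sketch; the cluster details and
test image name are just placeholders, and the image is assumed to already
exist):

import java.util.List;

import com.ceph.rados.IoCTX;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class SnapListReproducer {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details for a test cluster.
        Rados rados = new Rados("admin");
        rados.confSet("mon_host", "mon1.example.com:6789");
        rados.confSet("key", "<cephx secret>");
        rados.connect();

        IoCTX io = rados.ioCtxCreate("test-pool");
        Rbd rbd = new Rbd(io);

        // Open an existing test image and create more snapshots than the limit ...
        RbdImage image = rbd.open("snaplist-test-image");
        for (int i = 0; i < 20; i++) {
            image.snapCreate("snap-" + i);
        }

        // ... then list them; with the bug present this throws an RBD exception.
        List<RbdSnapInfo> snaps = image.snapList();
        System.out.println("Listed " + snaps.size() + " snapshots");

        // Clean up the test snapshots afterwards.
        for (int i = 0; i < 20; i++) {
            image.snapRemove("snap-" + i);
        }

        rbd.close(image);
        rados.ioCtxDestroy(io);
    }
}
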
>
> Wido
>
> > Thx for input!
> >
> > On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl>
> wrote:
> >
> >>
> >>
> >> On 10-09-15 12:17, Andrija Panic wrote:
> >>> We are testing some [dirty?] patch on our dev system and we shall soon
> >>> share it for review.
> >>>
> >>> Basically, we are using an external Python script that is invoked at a
> >>> certain point in the code execution to delete the needed Ceph snapshots,
> >>> and after that the code proceeds with the volume deletion etc...
> >>>
> >>
> >> That shouldn't be required. The Java bindings for librbd and librados
> >> should be able to remove the snapshots.
> >>
> >> There is no need to invoke external code, this can all be handled in Java.
> >>
> >>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com>
> >>> wrote:
> >>>
> >>>> Eh, OK. Thx for the info.
> >>>>
> >>>> BTW, why is the limit of 16 snapshots hardcoded - is there any reason
> >>>> for that?
> >>>>
> >>>> Not cleaning up snapshots on Ceph and then trying to delete a volume
> >>>> that has more than 16 snapshots in Ceph = the agent crashing on the KVM
> >>>> side... and some VMs being rebooted etc. - which means downtime :|
> >>>>
> >>>> Thanks,
> >>>>
> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
> >>>>
> >>>>> Andrija,
> >>>>>
> >>>>> The Ceph snapshot deletion is not currently implemented.
> >>>>>
> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
> >>>>>
> >>>>> - Si
> >>>>>
> >>>>> ________________________________________
> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
> >>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
> >>>>> secondary NFS and DB)
> >>>>>
> >>>>> Hi folks,
> >>>>>
> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also
> >>>>> affected) - when we delete a volume snapshot in ACS, ACS marks it as
> >>>>> deleted in the DB and deletes it from NFS secondary storage, but it
> >>>>> fails to delete the snapshot on the Ceph primary storage (it doesn't
> >>>>> even try to delete it, AFAIK).
> >>>>>
> >>>>> So we end up having, for example, 5 live snapshots in the DB, but in
> >>>>> Ceph there are actually more than, say, 16 snapshots.
> >>>>>
> >>>>> On top of that, when the ACS agent tries to obtain the list of
> >>>>> snapshots from Ceph for some volume - if the number of snapshots is
> >>>>> over 16, it raises an exception (and perhaps this is the reason the
> >>>>> agent crashed for us - I need to check with my colleagues who are
> >>>>> investigating this in detail). For whatever reason, this number 16 is
> >>>>> hardcoded in the ACS code.
> >>>>>
> >>>>> I'm wondering if anyone has experienced this, or has any info - we
> >>>>> plan to try to fix this, and I will include my dev colleagues here,
> >>>>> but we might need some help, at least for guidance.
> >>>>>
> >>>>> Any help is really appreciated, or at least a confirmation that this
> >>>>> is a known issue etc.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> --
> >>>>>
> >>>>> Andrija Panić
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>>
> >>>> Andrija Panić
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
> >
>



-- 

Andrija Panić
