On 10-09-15 14:07, Andrija Panic wrote:
> Wido,
> 
> the part of the code that deletes a volume checks whether the volume is of
> type RBD - and then tries to list the snapshots, delete the snapshots, and
> finally remove the image. Here the first step - listing the snapshots -
> fails if there are more than 16 snapshots present - the number 16 is
> hardcoded elsewhere in the code - and throws an RBD exception... then the
> agent crashes... and then VMs go down etc.
> 

Hmmm, that seems like a bug in rados-java indeed. I don't know if there
is a release of rados-java where this is fixed.

Looking at the code of rados-java it should be fixed, but I'm not 100% certain.
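
For reference, the whole cleanup that code path needs can be done through
rados-java. A minimal sketch of what I mean (class and parameter names are
mine, error handling trimmed):

import java.util.List;

import com.ceph.rados.IoCtx;
import com.ceph.rados.Rados;
import com.ceph.rbd.Rbd;
import com.ceph.rbd.RbdImage;
import com.ceph.rbd.jna.RbdSnapInfo;

public class RbdVolumeCleanup {
    /* Remove all snapshots of an RBD image, then the image itself,
     * in the same order the agent follows. */
    public static void deleteVolume(String monHost, String authId, String key,
                                    String pool, String imageName) throws Exception {
        Rados rados = new Rados(authId);
        rados.confSet("mon_host", monHost);
        rados.confSet("key", key);
        rados.connect();

        IoCtx io = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(imageName);
            try {
                // This is the call that currently blows up past 16 snapshots
                List<RbdSnapInfo> snaps = image.snapList();
                for (RbdSnapInfo snap : snaps) {
                    if (image.snapIsProtected(snap.name)) {
                        image.snapUnprotect(snap.name);
                    }
                    image.snapRemove(snap.name);
                }
            } finally {
                rbd.close(image);
            }
            rbd.remove(imageName);
        } finally {
            rados.ioCtxDestroy(io);
        }
    }
}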

> So our current quick fix is to invoke an external script which will also
> list and remove all snapshots, but will not fail.
> 

Yes, but we should fix it upstream. I understand that you will use a
temp script to clean up everything.
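
If you want to keep the temp script contained, invoking it from the agent
can stay trivial until the real fix lands. A sketch with plain
ProcessBuilder (the script path and arguments are made up, adjust to yours):

import java.io.IOException;

public class RbdSnapshotCleanupScript {
    /* Stop-gap: shell out to the external cleanup script.
     * Returns the script's exit code; non-zero means cleanup failed. */
    public static int purgeSnapshots(String pool, String image)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "/usr/local/bin/rbd-snap-cleanup.sh", pool, image);
        pb.inheritIO(); // let the script's output go to the agent's log
        Process p = pb.start();
        return p.waitFor();
    }
}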

> I'm not sure why 16 is the hardcoded limit - I will try to provide the
> part of the code where this is present... We could increase this number
> (from 16 to e.g. 200), but that doesn't make any sense, since we still
> have a lot of garbage left on CEPH (snapshots that were removed in ACS (DB
> and Secondary NFS) but not removed from CEPH). In my understanding proper
> cleanup needs to be implemented, so we don't hit the exceptions I
> originally described...
> 
> Any thoughts on this ?
> 

A cleanup script for now should be OK indeed. Afterwards the Java code
should be able to do this.

You can try it manually by using rados-java and fix it there.

This is the part where the listing is done:
https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
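
In snapList() there the buffer starts at 16 entries. librbd's
rbd_snap_list() returns -ERANGE and writes the required count back into the
size argument when the buffer is too small, so the fix should be a retry
loop instead of bailing out. Rough sketch of the shape (JNA plumbing
simplified, errno value is ERANGE=34):

import com.sun.jna.ptr.IntByReference;
import com.sun.jna.ptr.PointerByReference;

// Inside RbdImage.snapList(); 'rbd' is the JNA interface to librbd.
IntByReference numSnaps = new IntByReference(16); // initial guess, not a cap
PointerByReference snaps = new PointerByReference();
while (true) {
    int r = rbd.rbd_snap_list(this.getPointer(), snaps, numSnaps);
    if (r >= 0) {
        // numSnaps.getValue() entries are valid; copy them out and return
        break;
    } else if (r != -34) { // anything other than -ERANGE is a real error
        throw new RbdException("Failed listing snapshots", r);
    }
    // -ERANGE: librbd wrote the required count into numSnaps, just retry
}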

Wido

> Thx for input!
> 
> On 10 September 2015 at 13:56, Wido den Hollander <w...@widodh.nl> wrote:
> 
>>
>>
>> On 10-09-15 12:17, Andrija Panic wrote:
>>> We are testing some [dirty?] patch on our dev system and we shall soon
>>> share it for review.
>>>
>>> Basically, we are using an external python script that is invoked at a
>>> certain point of the code execution to delete the needed CEPH snapshots,
>>> and then it proceeds with the volume deletion etc...
>>>
>>
>> That shouldn't be required. The Java bindings for librbd and librados
>> should be able to remove the snapshots.
>>
>> There is no need to invoke external code, this can all be handled in Java.
>>
>>> On 10 September 2015 at 11:26, Andrija Panic <andrija.pa...@gmail.com>
>>> wrote:
>>>
>>>> Eh, OK. Thx for the info.
>>>>
>>>> BTW, why is the 16-snapshot limit hardcoded - any reason for that?
>>>>
>>>> Not cleaning snapshots on CEPH and then trying to delete a volume that
>>>> has more than 16 snapshots in CEPH = the agent crashing on the KVM
>>>> side... and some VMs being rebooted etc. - which means downtime :|
>>>>
>>>> Thanks,
>>>>
>>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com> wrote:
>>>>
>>>>> Andrija,
>>>>>
>>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>
>>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>
>>>>> - Si
>>>>>
>>>>> ________________________________________
>>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>> To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
>>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>>> Secondary NFS and DB)
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are also
>>>>> affected) - when we delete a snapshot (volume snapshot) in ACS, ACS
>>>>> marks it as deleted in the DB and deletes it from NFS Secondary
>>>>> Storage, but fails to delete the snapshot on the CEPH primary storage
>>>>> (it doesn't even try to delete it, AFAIK).
>>>>>
>>>>> So we end up having 5 live snapshots in the DB (just an example) while
>>>>> there are actually more than 16 snapshots in CEPH.
>>>>>
>>>>> On top of that, when the ACS agent tries to obtain the list of
>>>>> snapshots for a volume from CEPH - if the number of snapshots is over
>>>>> 16, it raises an exception (and perhaps this is the reason the agent
>>>>> crashed for us - I need to check with my colleagues who are
>>>>> investigating this in detail). This number 16 is, for whatever reason,
>>>>> hardcoded in the ACS code.
>>>>>
>>>>> Wondering if anyone has experienced this, or has any info - we plan to
>>>>> try to fix this, and I will include my dev colleagues here, but we
>>>>> might need some help, at least for guidance.
>>>>>
>>>>> Any help is really appreciated, or at least a confirmation that this
>>>>> is a known issue etc.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>>
>>>>> Andrija Panić
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Andrija Panić
>>>>
>>>
>>>
>>>
>>
> 
> 
> 
