On 09/11/2015 05:08 PM, Andrija Panic wrote:
> Thx a lot Wido !!! - we will patch this. For my understanding - is this
> a "temporary" solution, since it raises the limit to 256 snaps? Or am I
> wrong? I mean, since we still don't have proper snapshot removal etc.,
> after i.e. 3-6 months we will again have 256 snapshots of a single
> volume on CEPH?
>
No, it will also work with >256 snapshots. I've tested it with 256 and that
worked fine. I see no reason why it won't work with 1024 or 2048, for example.

> BTW we also have another exception that causes the same consequences - the agent
> disconnecting and VMs going down...
> As Dmytro explained, unprotecting a snapshot causes the same consequence...
>
> From my understanding, any RBD exception might cause the Agent to disconnect
> (or actually the mgmt server to disconnect the agent)...
>
> Any clue on this, recommendation?
>

No, I don't have a clue. It could be that the job hangs somewhere inside the
Agent due to an uncaught exception though.

> Thx a lot for fixing the rados-java stuff !
>

You're welcome!

Wido

> Andrija
>
> On 11 September 2015 at 15:28, Wido den Hollander <w...@widodh.nl> wrote:
>
>>
>> On 11-09-15 14:43, Dmytro Shevchenko wrote:
>>> Thanks a lot Wido! Any chance to find out why the management server decided
>>> that it lost the connection to the agent after those exceptions? It's not as
>>> critical as this bug with 16 snapshots, but during the last week we caught a
>>> situation where the Agent failed to unprotect a snapshot, raised an exception,
>>> and that was the reason for a disconnect a bit later. (It is not clear why
>>> CS decided to remove that volume; it was a template with one 'gold' snapshot
>>> with several active clones.)
>>>
>>
>> No, I didn't look at CS at all. I just spent the day improving the RADOS
>> bindings.
>>
>> Wido
>>
>>> On 09/11/2015 03:20 PM, Wido den Hollander wrote:
>>>>
>>>> On 11-09-15 10:19, Wido den Hollander wrote:
>>>>>
>>>>> On 10-09-15 23:15, Andrija Panic wrote:
>>>>>> Wido,
>>>>>>
>>>>>> could you maybe follow up on what my colleague Dmytro just sent?
>>>>>>
>>>>> Yes, seems logical.
>>>>>
>>>>>> It's not only a matter of fixing rados-java (the 16-snaps limit) - it
>>>>>> seems that for any RBD exception, ACS will freak out...
>>>>>>
>>>>> No, an RbdException will be caught, but the Rados bindings shouldn't
>>>>> throw NegativeArraySizeException in any case.
>>>>>
>>>>> That's the main problem.
>>>>>
>>>> Seems to be fixed with this commit:
>>>> https://github.com/ceph/rados-java/commit/5584f3961c95d998d2a9eff947a5b7b4d4ba0b64
>>>>
>>>> Just tested it with 256 snapshots:
>>>>
>>>> -------------------------------------------------------
>>>>  T E S T S
>>>> -------------------------------------------------------
>>>> Running com.ceph.rbd.TestRbd
>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 521.014 sec
>>>>
>>>> Results :
>>>>
>>>> Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
>>>>
>>>> The bindings should now be capable of listing more than 16 snapshots.
>>>>
>>>> You can build the bindings manually and replace rados.jar on your
>>>> running systems.
>>>>
>>>> For 4.6 I'll try to get the updated rados-java included.
>>>>
>>>> Wido
>>>>
>>>>> Wido
>>>>>
>>>>>> Thx
>>>>>>
>>>>>> On 10 September 2015 at 17:06, Dmytro Shevchenko <
>>>>>> dmytro.shevche...@safeswisscloud.com> wrote:
>>>>>>
>>>>>>> Hello everyone, some clarification about this. Configuration:
>>>>>>> CS: 4.5.1
>>>>>>> Primary storage: Ceph
>>>>>>>
>>>>>>> Actually we have 2 separate bugs:
>>>>>>>
>>>>>>> 1. When you remove a volume with more than 16 snapshots (it doesn't
>>>>>>> matter whether they are destroyed or active - they are always present on
>>>>>>> Ceph), on the next storage garbage collector cycle it invokes
>>>>>>> 'deletePhysicalDisk' from LibvirtStorageAdaptor.java. On line 854 we call
>>>>>>> the snapshot listing in the external rados-java library and get an exception.
>>>>>>>
>>>>>>> https://github.com/apache/cloudstack/blob/4.5.1/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L854
>>>>>>>
>>>>>>> This exception is not caught in the current function, but the Agent does
>>>>>>> NOT crash at this moment and continues working fine. The Agent forms a
>>>>>>> proper answer to the server and sends it; the text in the answer is the
>>>>>>> Java stack trace. Log from the Agent side:
>>>>>>>
>>>>>>> 2015-09-10 02:32:35,312 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>> (agentRequest-Handler-4:null) Trying to fetch storage pool
>>>>>>> 33ebaf83-5d09-3038-b63b-742e759a992e from libvirt
>>>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor]
>>>>>>> (agentRequest-Handler-4:null) Attempting to remove volume
>>>>>>> 4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 from pool 33ebaf83-5d09-3038-b63b-742e759a992e
>>>>>>> 2015-09-10 02:32:35,431 INFO [kvm.storage.LibvirtStorageAdaptor]
>>>>>>> (agentRequest-Handler-4:null) Unprotecting and Removing RBD snapshots of
>>>>>>> image cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8 prior to removing the image
>>>>>>> 2015-09-10 02:32:35,436 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>> (agentRequest-Handler-4:null) Succesfully connected to Ceph cluster at 10.10.1.26:6789
>>>>>>> 2015-09-10 02:32:35,454 DEBUG [kvm.storage.LibvirtStorageAdaptor]
>>>>>>> (agentRequest-Handler-4:null) Fetching list of snapshots of RBD image
>>>>>>> cloudstack-storage/4c6a2092-056c-4446-a2ca-d6bba9f7f7f8
>>>>>>> 2015-09-10 02:32:35,457 WARN [cloud.agent.Agent]
>>>>>>> (agentRequest-Handler-4:null) Caught: java.lang.NegativeArraySizeException
>>>>>>>         at com.ceph.rbd.RbdImage.snapList(Unknown Source)
>>>>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)
>>>>>>>         at com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)
>>>>>>>         at com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)
>>>>>>> 2015-09-10 02:32:35,458 DEBUG [cloud.agent.Agent]
>>>>>>> (agentRequest-Handler-4:null) Seq 1-1743737480722513946: { Ans: , MgmtId:
>>>>>>> 90520739779588, via: 1, Ver: v1, Flags: 10,
>>>>>>> [{"com.cloud.agent.api.Answer":{"result":false,"details":"java.lang.NegativeArraySizeException\n\tat
>>>>>>> com.ceph.rbd.RbdImage.snapList(Unknown Source)\n\tat
>>>>>>> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.deletePhysicalDisk(LibvirtStorageAdaptor.java:854)\n\tat
>>>>>>> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.deletePhysicalDisk(LibvirtStoragePool.java:175)\n\tat
>>>>>>> com.cloud.hypervisor.kvm.storage.KVMStorageProcessor.deleteVolume(KVMStorageProcessor.java:1206)\n\tat
>>>>>>> com.cloud.storage.resource.StorageSubsystemCommandHandlerBase.execute(StorageSubsystemCommandHandlerBase.java:124)\n\tat
>>>>>>> com.cloud.storage.re.....
>>>>>>>
>>>>>>> So this volume and its snapshots will never be removed.
>>>>>>>
>>>>>>> 2. Second bug. Experimentally we found that about 50 minutes after the
>>>>>>> exception on the Agent, for some unknown reason the Management server
>>>>>>> decided it had lost the connection to this agent, started the HA process
>>>>>>> and started the Agent process again.
>>>>>>> Log on the Agent side:
>>>>>>>
>>>>>>> 2015-09-10 02:57:12,664 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>> (agentRequest-Handler-2:null) Executing: /bin/bash -c free|grep Mem:|awk '{print $2}'
>>>>>>> 2015-09-10 02:57:12,667 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>> (agentRequest-Handler-2:null) Execution is successful.
>>>>>>> 2015-09-10 02:57:40,502 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>> (UgentTask-5:null) Executing:
>>>>>>> /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms
>>>>>>> 2015-09-10 02:57:40,572 DEBUG [kvm.resource.LibvirtComputingResource]
>>>>>>> (UgentTask-5:null) Execution is successful.
>>>>>>> 2015-09-10 02:57:54,135 INFO [cloud.agent.AgentShell] (main:null) Agent started
>>>>>>> 2015-09-10 02:57:54,136 INFO [cloud.agent.AgentShell] (main:null) Implementation Version is 4.5.1
>>>>>>> 2015-09-10 02:57:54,138 INFO [cloud.agent.AgentShell] (main:null) agent.properties found at /etc/cloudstack/agent/agent.properties
>>>>>>> .....
>>>>>>>
>>>>>>> Log on the Server side:
>>>>>>>
>>>>>>> 2015-09-10 02:57:53,710 INFO [c.c.a.m.AgentManagerImpl]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Investigating why host 1 has disconnected
>>>>>>> with event AgentDisconnected
>>>>>>> 2015-09-10 02:57:53,714 DEBUG [c.c.a.m.AgentManagerImpl]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) checking if agent (1) is alive
>>>>>>> 2015-09-10 02:57:53,723 DEBUG [c.c.a.t.Request]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Sending { Cmd ,
>>>>>>> MgmtId: 90520739779588, via: 1(ix1-c7-2), Ver: v1, Flags: 100011,
>>>>>>> [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
>>>>>>> 2015-09-10 02:57:53,724 INFO [c.c.a.m.AgentAttache]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Unable to send
>>>>>>> due to Resource [Host:1] is unreachable: Host 1: Channel is closed
>>>>>>> 2015-09-10 02:57:53,724 DEBUG [c.c.a.m.AgentAttache]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Seq 1-1743737480722513988: Cancelling.
>>>>>>> 2015-09-10 02:57:53,724 WARN [c.c.a.m.AgentManagerImpl]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) Resource [Host:1] is unreachable: Host 1:
>>>>>>> Channel is closed
>>>>>>> 2015-09-10 02:57:53,728 DEBUG [c.c.h.HighAvailabilityManagerImpl]
>>>>>>> (AgentTaskPool-1:ctx-2127ada4) SimpleInvestigator unable to determine the
>>>>>>> state of the host. Moving on.
>>>>>>>
>>>>>>> It looks like a connection problem, but it appears only when we have this
>>>>>>> RBD exception on the agent side, and only with this node. I tried to play
>>>>>>> with the "storage.cleanup.interval" parameter and set it to 5 minutes; now
>>>>>>> we get the exception on the agent side every 5 minutes, but the disconnects
>>>>>>> still happen every 50 minutes and I can't find out why.
>>>>>>>
>>>>>>> On 09/10/2015 03:21 PM, Andrija Panic wrote:
>>>>>>>
>>>>>>>> Thx Wido,
>>>>>>>>
>>>>>>>> I will have my colleagues Igor and Dmytro join with details on this.
>>>>>>>>
>>>>>>>> I agree we need to fix this upstream, that is the main purpose from our side!
>>>>>>>>
>>>>>>>> With this temp fix, we just avoid the agent crashing (the agent somehow
>>>>>>>> restarts again fine :) ) but VMs also go down on that host, at least some
>>>>>>>> of them.
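[On the uncaught-exception side of the disconnects discussed above, one defensive option is to treat the rados-java calls in the agent's delete path as best-effort and catch unchecked exceptions as well as RbdException, so a bindings bug such as the NegativeArraySizeException cannot escape the request handler. The sketch below is only an illustration, not the actual CloudStack patch; the rados-java calls used (Rbd.open(), RbdImage.snapList(), Rbd.close()) follow the bindings as discussed in this thread, while the helper class, method name and logging are made up.]

    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdException;
    import com.ceph.rbd.RbdImage;
    import com.ceph.rbd.jna.RbdSnapInfo;

    import java.util.Collections;
    import java.util.List;

    public class SafeSnapshotListing {

        /** Best-effort snapshot listing: never lets an exception escape the caller. */
        public static List<RbdSnapInfo> listSnapshotsSafely(Rbd rbd, String imageName) {
            RbdImage image = null;
            try {
                image = rbd.open(imageName);
                return image.snapList();
            } catch (RbdException e) {
                // "Normal" librbd failure: report it and let the caller decide what to do.
                System.err.println("RBD error listing snapshots of " + imageName + ": " + e.getMessage());
                return Collections.emptyList();
            } catch (RuntimeException e) {
                // Covers unchecked errors such as NegativeArraySizeException from the
                // pre-fix bindings, so the agent's request handler is never taken down.
                System.err.println("Unexpected error listing snapshots of " + imageName + ": " + e);
                return Collections.emptyList();
            } finally {
                if (image != null) {
                    try {
                        rbd.close(image);
                    } catch (Exception ignored) {
                        // closing a handle we could not fully use is best-effort too
                    }
                }
            }
        }
    }

[In the real adaptor such a result would presumably feed the existing unprotect/remove loop, and a failed listing would make deletePhysicalDisk return a failed Answer instead of letting the exception propagate.]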
>>>>>>>>
>>>>>>>> Do you see any lifecycle/workflow issue if we implement deleting the SNAP
>>>>>>>> from CEPH after you SNAP a volume in ACS and successfully move it to
>>>>>>>> Secondary NFS - or perhaps only delete the SNAP from CEPH as part of the
>>>>>>>> actual SNAP deletion (when you, manually or via scheduled snapshots,
>>>>>>>> delete the snapshot from the DB and NFS)? Maybe the second option is
>>>>>>>> better; I don't know how you guys handle this for regular NFS as primary
>>>>>>>> storage etc...
>>>>>>>>
>>>>>>>> Any guidance is most welcome, and our team will try to code all this.
>>>>>>>>
>>>>>>>> Thx Wido again
>>>>>>>>
>>>>>>>> On 10 September 2015 at 14:14, Wido den Hollander <w...@widodh.nl
>>>>>>>> <mailto:w...@widodh.nl>> wrote:
>>>>>>>>
>>>>>>>> On 10-09-15 14:07, Andrija Panic wrote:
>>>>>>>> > Wido,
>>>>>>>> >
>>>>>>>> > the part of the code where you want to delete some volume checks if the
>>>>>>>> > volume is of type RBD - and then tries to list snapshots, delete
>>>>>>>> > snapshots, and finally remove the image. Here the first step - listing
>>>>>>>> > snapshots - fails if there are more than 16 snapshots present - the
>>>>>>>> > number 16 is hardcoded elsewhere in the code and it throws an RBD
>>>>>>>> > exception... then the agent crashes... and then VMs go down etc.
>>>>>>>> >
>>>>>>>>
>>>>>>>> Hmmm, that seems like a bug in rados-java indeed. I don't know if there
>>>>>>>> is a release of rados-java where this is fixed.
>>>>>>>>
>>>>>>>> Looking at the code of rados-java it should be, but I'm not 100% certain.
>>>>>>>>
>>>>>>>> > So our current quick fix is to invoke an external script which will
>>>>>>>> > also list and remove all snapshots, but will not fail.
>>>>>>>> >
>>>>>>>>
>>>>>>>> Yes, but we should fix it upstream. I understand that you will use a
>>>>>>>> temp script to clean up everything.
>>>>>>>>
>>>>>>>> > I'm not sure why 16 is the hardcoded limit - I will try to provide the
>>>>>>>> > part of the code where this is present... we can increase this number,
>>>>>>>> > but it doesn't make any sense (from 16 to i.e. 200), since we still
>>>>>>>> > have a lot of garbage left on CEPH (snapshots that were removed in ACS
>>>>>>>> > (DB and Secondary NFS) but not removed from CEPH). And in my
>>>>>>>> > understanding this needs to be implemented, so we don't hit any of the
>>>>>>>> > exceptions that I originally described...
>>>>>>>> >
>>>>>>>> > Any thoughts on this ?
>>>>>>>> >
>>>>>>>>
>>>>>>>> A cleanup script for now should be OK indeed. Afterwards the Java code
>>>>>>>> should be able to do this.
>>>>>>>>
>>>>>>>> You can try manually by using rados-java and fix that.
>>>>>>>>
>>>>>>>> This is the part where the listing is done:
>>>>>>>> https://github.com/ceph/rados-java/blob/master/src/main/java/com/ceph/rbd/RbdImage.java
>>>>>>>>
>>>>>>>> Wido
>>>>>>>>
>>>>>>>> > Thx for input!
>>>>>>>> >
>>>>>>>> > On 10 September 2015 at 13:56, Wido den Hollander
>>>>>>>> > <w...@widodh.nl <mailto:w...@widodh.nl>> wrote:
>>>>>>>> >
>>>>>>>> >>
>>>>>>>> >> On 10-09-15 12:17, Andrija Panic wrote:
>>>>>>>> >>> We are testing some [dirty?] patch on our dev system and we shall
>>>>>>>> >>> soon share it for review.
>>>>>>>> >>>
>>>>>>>> >>> Basically, we are using an external python script that is invoked in
>>>>>>>> >>> some part of the code execution to delete the needed CEPH snapshots
>>>>>>>> >>> and then proceeds with the volume deletion etc...
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >> That shouldn't be required. The Java bindings for librbd and librados
>>>>>>>> >> should be able to remove the snapshots.
>>>>>>>> >>
>>>>>>>> >> There is no need to invoke external code, this can all be handled in Java.
>>>>>>>> >>
>>>>>>>> >>> On 10 September 2015 at 11:26, Andrija Panic
>>>>>>>> >>> <andrija.pa...@gmail.com <mailto:andrija.pa...@gmail.com>> wrote:
>>>>>>>> >>>
>>>>>>>> >>>> Eh, OK. Thx for the info.
>>>>>>>> >>>>
>>>>>>>> >>>> BTW why is the 16-snapshot limit hardcoded - any reason for that ?
>>>>>>>> >>>>
>>>>>>>> >>>> Not cleaning snapshots on CEPH and trying to delete a volume after
>>>>>>>> >>>> having more than 16 snapshots in CEPH = Agent crashing on the KVM
>>>>>>>> >>>> side... and some VMs being rebooted etc - which means downtime :|
>>>>>>>> >>>>
>>>>>>>> >>>> Thanks,
>>>>>>>> >>>>
>>>>>>>> >>>> On 9 September 2015 at 22:05, Simon Weller <swel...@ena.com
>>>>>>>> >>>> <mailto:swel...@ena.com>> wrote:
>>>>>>>> >>>>
>>>>>>>> >>>>> Andrija,
>>>>>>>> >>>>>
>>>>>>>> >>>>> The Ceph snapshot deletion is not currently implemented.
>>>>>>>> >>>>>
>>>>>>>> >>>>> See: https://issues.apache.org/jira/browse/CLOUDSTACK-8302
>>>>>>>> >>>>>
>>>>>>>> >>>>> - Si
>>>>>>>> >>>>>
>>>>>>>> >>>>> ________________________________________
>>>>>>>> >>>>> From: Andrija Panic <andrija.pa...@gmail.com>
>>>>>>>> >>>>> Sent: Wednesday, September 9, 2015 3:03 PM
>>>>>>>> >>>>> To: dev@cloudstack.apache.org <mailto:dev@cloudstack.apache.org>;
>>>>>>>> >>>>> us...@cloudstack.apache.org <mailto:us...@cloudstack.apache.org>
>>>>>>>> >>>>> Subject: ACS 4.5 - volume snapshots NOT removed from CEPH (only from
>>>>>>>> >>>>> Secondary NFS and DB)
>>>>>>>> >>>>>
>>>>>>>> >>>>> Hi folks,
>>>>>>>> >>>>>
>>>>>>>> >>>>> we encountered an issue in ACS 4.5.1 (perhaps other versions are
>>>>>>>> >>>>> also affected) - when we delete a snapshot (volume snapshot) in ACS,
>>>>>>>> >>>>> ACS marks it as deleted in the DB and deletes it from NFS Secondary
>>>>>>>> >>>>> Storage, but fails to delete the snapshot on the CEPH primary
>>>>>>>> >>>>> storage (it doesn't even try to delete it AFAIK).
>>>>>>>> >>>>>
>>>>>>>> >>>>> So we end up having 5 live snapshots in the DB (just an example),
>>>>>>>> >>>>> but actually in CEPH there are more than i.e. 16 snapshots.
>>>>>>>> >>>>>
>>>>>>>> >>>>> More of an issue, when the ACS agent tries to obtain the list of
>>>>>>>> >>>>> snapshots from CEPH for some volume - if the number of snapshots is
>>>>>>>> >>>>> over 16, it raises an exception (and perhaps this is the reason the
>>>>>>>> >>>>> Agent crashed for us - I need to check with my colleagues who are
>>>>>>>> >>>>> investigating this in detail). This number 16 is for whatever
>>>>>>>> >>>>> reason hardcoded in the ACS code.
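[For context on where that hardcoded 16 most likely bites: librbd's rbd_snap_list() is called with a caller-sized buffer and, when the buffer is too small, returns -ERANGE together with the required size, so the caller is expected to grow the buffer and retry. If a binding instead sizes a Java array from that negative return value, you get exactly the NegativeArraySizeException seen in the agent log above. The sketch below only illustrates the retry pattern in plain Java; RbdSnapLister is a made-up stand-in for the native call, not the actual rados-java JNA interface.]

    import java.util.Arrays;

    public class SnapListRetrySketch {

        /**
         * Made-up stand-in for rbd_snap_list(): fills 'names' and returns the number
         * of snapshots, or -ERANGE with required[0] set to the needed capacity when
         * 'names' is too small.
         */
        public interface RbdSnapLister {
            int rbdSnapList(String[] names, int[] required);
        }

        private static final int ERANGE = 34;

        public static String[] listAllSnapshots(RbdSnapLister rbd) {
            int capacity = 16; // initial guess, as in the old bindings
            while (true) {
                String[] names = new String[capacity];
                int[] required = new int[] { capacity };
                int rc = rbd.rbdSnapList(names, required);
                if (rc >= 0) {
                    return Arrays.copyOf(names, rc); // rc snapshots were actually returned
                }
                if (rc == -ERANGE) {
                    capacity = required[0]; // grow to the size librbd asked for...
                    continue;               // ...and retry instead of failing
                }
                throw new IllegalStateException("rbd_snap_list failed: rc=" + rc);
            }
        }
    }

[The rados-java commit Wido links earlier in the thread appears to fix the bindings along these lines, which is why the fix is not limited to 256 snapshots.]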
>>>>>>>> >>>>>
>>>>>>>> >>>>> Wondering if anyone has experienced this, or has any info - we plan
>>>>>>>> >>>>> to try to fix this, and I will include my dev colleagues here, but
>>>>>>>> >>>>> we might need some help, at least for guidance.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Any help is really appreciated, or at least confirmation that this
>>>>>>>> >>>>> is a known issue etc.
>>>>>>>> >>>>>
>>>>>>>> >>>>> Thanks,
>>>>>>>> >>>>>
>>>>>>>> >>>>> --
>>>>>>>> >>>>> Andrija Panić
>>>>>>>> >>>>>
>>>>>>>> >>>>
>>>>>>>> >>>> --
>>>>>>>> >>>> Andrija Panić
>>>>>>>> >>>>
>>>>>>>> >>>
>>>>>>>> >>
>>>>>>>> >
>>>>>>>>
>>>>>>>> --
>>>>>>>> Andrija Panić
>>>>>>>>
>>>>>>> --
>>>>>>> ---
>>>>>>> Best regards
>>>>>>> Dmytro Shevchenko
>>>>>>> dshevchenko.m...@gmail.com
>>>>>>> skype: demonsh_mk
>>>>>>> +380(66)2426648
>>>>>>>
>>>>>>
>>>
>>
>
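[Finally, on the root issue referenced above (CLOUDSTACK-8302: volume snapshots deleted in ACS but left behind on Ceph), the snippet below is a minimal sketch of what removing a single RBD snapshot through rados-java could look like, assuming the connection details come from the primary storage pool definition. The monitor address, pool name and volume UUID are placeholders copied from the logs earlier in the thread, the snapshot name and cephx key are invented, and this is not the eventual CloudStack implementation.]

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    public class DeleteRbdSnapshotSketch {

        public static void main(String[] args) throws Exception {
            // Connection details would come from the ACS primary storage pool; these
            // values are placeholders (monitor and pool taken from the logs above).
            Rados rados = new Rados("cloudstack");            // cephx user id (placeholder)
            rados.confSet("mon_host", "10.10.1.26:6789");
            rados.confSet("key", "REPLACE_WITH_CEPHX_KEY");
            rados.connect();

            IoCTX io = rados.ioCtxCreate("cloudstack-storage");
            try {
                Rbd rbd = new Rbd(io);
                RbdImage image = rbd.open("4c6a2092-056c-4446-a2ca-d6bba9f7f7f8"); // volume UUID
                try {
                    // If the snapshot is protected (e.g. it backs clones), it would need
                    // snapUnprotect() first; here we assume a plain volume snapshot.
                    image.snapRemove("my-volume-snapshot"); // placeholder snapshot name
                } finally {
                    rbd.close(image);
                }
            } finally {
                rados.ioCtxDestroy(io);
            }
        }
    }

[Whether this cleanup belongs in the snapshot-deletion path on the management server or in the KVM agent is exactly the lifecycle question Andrija raises above.]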