> On Jul 6, 2016, at 5:14 PM, Wido den Hollander <w...@widodh.nl> wrote:
>
>> Op 6 juli 2016 om 16:18 schreef Aaron Hurt <ah...@ena.com>:
>>
>>> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <w...@widodh.nl> wrote:
>>>
>>>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ah...@ena.com>:
>>>>
>>>> In preparation to roll out a new platform built on 4.8 with a Ceph
>>>> storage backend, we’ve been encountering segfaults that appear to be
>>>> related to snapshot operations via rados-java (librbd) on the host
>>>> agent. We’ve been able to isolate this to two possible places in the
>>>> code.
>>>>
>>>> Lines ~866-875 in
>>>> plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java:
>>>>
>>>>     for (RbdSnapInfo snap : snaps) {
>>>>         if (image.snapIsProtected(snap.name)) {
>>>>             s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>             image.snapUnprotect(snap.name);
>>>>         } else {
>>>>             s_logger.debug("Snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name + " is not protected.");
>>>>         }
>>>>         s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>         image.snapRemove(snap.name);
>>>>     }
>>>>
>>>> Should we be checking whether the unprotect actually succeeded before
>>>> attempting to remove the snapshot? (A sketch of what we mean follows
>>>> the workaround below.)
>>>>
>>>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230)
>>>> duplicates some of this functionality, and there doesn’t seem to be
>>>> anything preventing deletePhysicalDisk and the cleanup routine from
>>>> running simultaneously.
>>>>
>>>> To Reproduce (with ceph/rbd primary storage)
>>>>
>>>> 1. Set the global concurrent.snapshots.threshold.perhost to the default NULL value
>>>> 2. Set the globals snapshot.poll.interval and storage.cleanup.interval to a low interval, e.g. 10 seconds
>>>> 3. Restart the management server
>>>> 4. Deploy several VMs from templates
>>>> 5. Destroy and expunge the VMs after they are running
>>>> 6. Observe segfaults from the host agent
>>>>
>>>> Workaround
>>>>
>>>> We’ve been able to eliminate the host agent segfaults in our testing by
>>>> simply setting concurrent.snapshots.threshold.perhost to 1, even with
>>>> the decreased poll intervals.
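>>>>
>>>> Going back to the first code path above, a more defensive version of
>>>> that loop might look like the following (a hypothetical, untested
>>>> sketch, not code from our tree; it keeps the same rados-java calls and
>>>> simply skips the remove when the unprotect fails):
>>>>
>>>>     for (RbdSnapInfo snap : snaps) {
>>>>         try {
>>>>             if (image.snapIsProtected(snap.name)) {
>>>>                 s_logger.debug("Unprotecting snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>                 image.snapUnprotect(snap.name);
>>>>             }
>>>>         } catch (RbdException e) {
>>>>             // If the unprotect failed (for example, the snapshot still
>>>>             // has child images), the remove would fail as well, so log
>>>>             // and skip this snapshot instead of pressing on.
>>>>             s_logger.warn("Failed to unprotect snapshot " + snap.name + ", skipping removal", e);
>>>>             continue;
>>>>         }
>>>>         s_logger.debug("Removing snapshot " + pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>>>         image.snapRemove(snap.name);
>>>>     }
>>>>
>>>> That would at least surface the failure in the agent log instead of
>>>> proceeding straight into snapRemove().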
>>>>
>>>> Segfault Logs
>>>>
>>>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d
>>>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56
>>>>
>>>> We would really appreciate any feedback and/or confirmation from the
>>>> community on the above issues. I’d also be happy to provide any
>>>> additional information needed to get this addressed.
>>>
>>> What seems to be happening is that it failed to unprotect the snapshot
>>> of the volume. That can have various causes, for example a child image
>>> of the snapshot, but I don’t think that’s the case here.
>>>
>>> It could still be that it tries to remove the master/golden image of
>>> the template while children are still attached to that snapshot.
>>>
>>> I’m not sure whether this is due to rados-java or a bug in librados.
>>> The Java code should just throw an exception and not crash the JVM
>>> entirely. This happens lower in the stack, not in Java.
>>>
>>> The assert shows this also happens while Java is talking to libvirt. My
>>> guess is a librados bug, but I’m not completely sure.
>>>
>>> Wido
>>
>> We’re seeing this happen around other issues as well, and it does seem
>> to be related to rados-java and the JNA wrappings around librbd. This is
>> a crash that occurred just this morning while performing a load balancer
>> update:
>>
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) (logid:7b48049b) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh checkrouter.sh 169.254.3.93
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662: { Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, [{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: .0.11","router.name":"r-167-QA"},"wait":0}}] }
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:) (logid:aec84669) Processing command: com.cloud.agent.api.routing.LoadBalancerConfigCommand
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Transforming com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: log 127.0.0.1:3914 local0 warning
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: maxconn 4096
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: maxpipes 1024
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: chroot /var/lib/haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: user haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: group haproxy
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) global section: daemon
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: defaults
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: log global
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: mode tcp
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option dontlognull
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: retries 3
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option redispatch
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option forwardfor
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: option forceclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout connect 5000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout client 50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) default section: timeout server 50000
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxy mode http enabled
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) Haproxystats rule:
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 10.107.0.6:8081
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri /admin?stats
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm Haproxy\ Statistics
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth admin1:AdMiN123
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) (logid:4a2bd0ba) Execution is successful.
>> Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661: { Ans: , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status: BACKUP\n","wait":0}}] }
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) (logid:aec84669) Processing FileConfigItem, copying 1315 characters to load_balancer.json took 237ms
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) (logid:aec84669) Executing: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh update_config.py 169.254.0.11 load_balancer.json
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' thread 7f530dff3700 time 2016-07-06 08:59:09.143659
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: FAILED assert(sub < m_subsys.size())
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) [0x7f534a94bda5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) [0x7f534a712028]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) [0x7f534a743ff3]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) [0x7f534ab9ae8e]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) [0x7f534aba817d]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) [0x7f5455e78dc5]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) [0x7f5455793ced]
>> Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 Aborted (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m -Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: main process exited, code=exited, status=134/n/a
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit cloudstack-agent.service entered failed state.
>> Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service failed.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service holdoff time over, scheduling restart.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
>> Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…
>>
>> Is there something obvious I’m missing here?
>
> I am confused.
> In this path you are not even touching Ceph, but it still crashes in
> librados.
>
> Searching around, I found this issue in the Ceph tracker:
> http://tracker.ceph.com/issues/14314
>
> Isn't there a package version mismatch in your Ceph cluster?
>
> Wido
I agree it’s very confusing, and I’m running out of ideas as to what the cause may be. Here are the package versions on all our related boxes in the lab:

http://pastie.org/private/5t5p61ryqbaqm6mw07bw9g

I’ve also collected the most recent instances of our two segfaults/aborts below.

journalctl -u cloudstack-agent.service --no-pager | grep -B30 -A2 Aborted
http://sprunge.us/LcYA

journalctl -u cloudstack-agent.service --no-pager | grep -B5 -A20 com.ceph.rbd.RbdException
http://sprunge.us/SiCf

I also went back to look at the two places in our tree where snapshot cleanup takes place.

The place where the "failed to unprotect" exceptions seem to be triggered:
https://github.com/myENA/cloudstack/blob/release/ENA-4.8/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java#L840

The cleanup code for rbd snapshots:
https://github.com/myENA/cloudstack/blob/679840ae674cc1c655c256e8047187fa3b157ce7/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/KVMStorageProcessor.java#L1289

I’ve rolled a small patch into our testing tree that moves the context and image cleanup into a ‘finally’ block, because I thought those might not be closed/freed when the unprotect threw an exception. The code in LibvirtStorageAdaptor.java is therefore slightly different from the mainline 4.8 branch. Here is the PR/diff in our fork that shows the changes:

https://github.com/myENA/cloudstack/pull/11/files

Does this make sense? Is this even possibly related to the issue I’m seeing?

Once again, all the advice and attention is definitely appreciated.

— Aaron
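P.S. In rough terms, the patched cleanup has the shape sketched below. This is a simplified illustration of the idea, not the exact diff; it assumes the rados-java Rados/IoCTX/Rbd/RbdImage handles the surrounding method already uses (r being the connected Rados client), and it differs from mainline mainly in doing the close/destroy in the finally block:

    IoCTX io = null;
    Rbd rbd = null;
    RbdImage image = null;
    try {
        io = r.ioCtxCreate(pool.getSourceDir());
        rbd = new Rbd(io);
        image = rbd.open(uuid);
        for (RbdSnapInfo snap : image.snapList()) {
            if (image.snapIsProtected(snap.name)) {
                image.snapUnprotect(snap.name);
            }
            image.snapRemove(snap.name);
        }
        rbd.remove(uuid);
    } finally {
        // Close the image and destroy the ioctx even when snapUnprotect()
        // or snapRemove() throws, so a failed unprotect no longer leaks
        // the native librbd/librados handles.
        try {
            if (rbd != null && image != null) {
                rbd.close(image);
            }
        } catch (RbdException e) {
            s_logger.warn("Failed to close RBD image " + uuid, e);
        }
        if (io != null) {
            r.ioCtxDestroy(io);
        }
    }

The thinking is that if those handles leak on every failed unprotect, repeated cleanup passes could leave librados in a bad state, which might be related to the asserts we are seeing.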