> On Jul 2, 2016, at 11:37 AM, Wido den Hollander <w...@widodh.nl> wrote:
> 
>> 
>> Op 30 juni 2016 om 18:29 schreef Aaron Hurt <ah...@ena.com 
>> <mailto:ah...@ena.com>>:
>> 
>> 
>> In preparation to roll a new platform built on 4.8 with a Ceph storage 
>> backend we’ve been encountering segfaults that appear to be related to 
>> snapshot operations via java-jados (librbd) on the host agent.  We’ve been 
>> able to isolate this to two possible places in the code:
>> 
>> lines ~866-875 in 
>> plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/storage/LibvirtStorageAdaptor.java
>> 
>>                for (RbdSnapInfo snap : snaps) {
>>                    if (image.snapIsProtected(snap.name)) {
>>                        s_logger.debug("Unprotecting snapshot " + 
>> pool.getSourceDir() + "/" + uuid + "@" + snap.name);
>>                        image.snapUnprotect(snap.name);
>>                    } else {
>>                        s_logger.debug("Snapshot " + pool.getSourceDir() + 
>> "/" + uuid + "@" + snap.name + " is not protected.");
>>                    }
>>                    s_logger.debug("Removing snapshot " + pool.getSourceDir() 
>> + "/" + uuid + "@" + snap.name);
>>                    image.snapRemove(snap.name);
>>                }
>> 
>> Should we be checking if the unprotect actually failed/succeeded before 
>> attempting to remove the snapshot?
>> 
>> Code from PR #1230 (https://github.com/apache/cloudstack/pull/1230 
>> <https://github.com/apache/cloudstack/pull/1230> 
>> <https://github.com/apache/cloudstack/pull/1230 
>> <https://github.com/apache/cloudstack/pull/1230>>) duplicates some of this 
>> functionality and there doesn’t seem to be any protection preventing 
>> deletePhysicalDisk and the cleanup routine being run simultaneously.
>> 
>> 
>> To Reproduce (with ceph/rbd primary storage)
>> 
>> 1.  Set global concurrent.snapshots.threshold.perhost to the default NULL 
>> value
>> 2.  Set global snapshot.poll.interval and storage.cleanup.interval to a low 
>> interval … 10 seconds
>> 3.  Restart management server
>> 4.  Deploy several VMs from templates
>> 5.  Destroy+expunge the VMs after they are running
>> 6.  Observe segfaults in management server
>> 
>> 
>> Workaround
>> 
>> We’ve been able to eliminate the segfaults of the host agent in our testing 
>> by simply setting concurrent.snapshots.threshold.perhost to 1 even with the 
>> decreased poll intervals.
>> 
>> Segfault Logs
>> 
>> https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d 
>> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d> 
>> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d 
>> <https://slack-files.com/T0RJECUV7-F1M39K4F5-f9c6b3986d>>
>> 
>> https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 
>> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56> 
>> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56 
>> <https://slack-files.com/T0RJECUV7-F1KCTRNNN-8d36665b56>>
>> 
>> We would really appreciate any feedback and/or confirmation from the 
>> community around the above issues.  I’d also be happy to provide any 
>> additional information needed to get this addressed.
> 
> What seems to be happening is that it failed to unprotect the snapshot of the 
> volume. This could have various reasons, for example if there is a child 
> image of the snapshot. I don't think it's the case however.
> 
> It could still be that it tries to remove the master/golden image from the 
> template while it still has childs attached to that snapshot.
> 
> I'm not sure if this is due to rados-java or a bug in librados. The Java 
> could should just throw a exception and not completely crash the JVM. This 
> happens lower in the code and not in Java.
> 
> The assert shows this also happens when Java is talking to libvirt. I guess a 
> librados bug, but now completely sure.
> 
> Wido


We’re seeing this happen around other issues and it does seem to be related to 
java-rados and the JNA wrappings around librbd.  This is an exception that just 
occurred this morning when performing a load balancer update.

Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[kvm.resource.LibvirtComputingResource] (agentRequest-Handler-3:) 
(logid:7b48049b) Execution is successful.
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) 
(logid:4a2bd0ba) Executing: 
/usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh 
checkrouter.sh 169.254.3.93
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] 
(agentRequest-Handler-3:) (logid:7b48049b) Seq 1-5871286539207573659:  { Ans: , 
MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, 
[{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
 BACKUP\n","wait":0}}] }
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] 
(agentRequest-Handler-1:) (logid:aec84669) Request:Seq 1-5871286539207573662:  
{ Cmd , MgmtId: 52239507206, via: 1, Ver: v1, Flags: 100001, 
[{"com.cloud.agent.api.routing.LoadBalancerConfigCommand":{"loadBalancers":[{"uuid":"733276c0-df18-4110-a4ed-5efc1f523eb2","srcIp":"10.107.0.12","srcPort":80,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[]},{"uuid":"1dcba169-922c-45d2-9f85-33d63ef6f0e7","srcIp":"10.107.0.11","srcPort":300,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.191","destPort":400,"revoked":false,"alreadyAdded":false}]},{"uuid":"580dd6d7-b12a-4e14-93d8-a6f87dd75763","srcIp":"10.107.0.13","srcPort":5000,"protocol":"tcp","algorithm":"roundrobin","revoked":false,"alreadyAdded":false,"inline":false,"destinations":[{"destIp":"10.90.0.35","destPort":80,"revoked":false,"alreadyAdded":false},{"destIp":"10.90.0.36","destPort":80,"revoked":false,"alreadyAdded":false}]},{"uuid":"6b8f4872-1d05-4942-b715-3b0bf92e9d20","srcIp":"10.107.0.19","srcPort":111,"protocol":"tcp","algorithm":"roundrobin","revoked":true,"alreadyAdded":false,"inline":false,"destinations":[]}],"lbStatsVisibility":"global","lbStatsPublicIP":"10.107.0.6","lbStatsPrivateIP":"169.254.0.11","lbStatsGuestIP":"10.90.0.14","lbStatsPort":"8081","lbStatsSrcCidrs":"0/0","lbStatsAuth":"admin1:AdMiN123","lbStatsUri":"/admin?stats","maxconn":"4096","keepAliveEnabled":false,"nic":{"deviceId":3,"networkRateMbps":200,"defaultNic":false,"pxeDisable":true,"nicUuid":"503eca28-76fb-4e0a-aaf1-66bb63fae4b5","uuid":"e0b77f27-83b3-4ce4-b81d-05b0c559b395","ip":"10.90.0.14","netmask":"255.255.255.0","gateway":"10.90.0.1","mac":"02:00:42:58:00:04","broadcastType":"Vxlan","type":"Guest","broadcastUri":"vxlan://6054","isolationUri":"vxlan://6054","isSecurityGroupEnabled":false,"name":"bond0.109"},"vpcId":10,"accessDetails":{"router.guest.ip":"10.90.0.14","zone.network.type":"Advanced","router.ip":"169.254
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: 
.0.11","router.name":"r-167-QA"},"wait":0}}] }
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] 
(agentRequest-Handler-1:) (logid:aec84669) Processing command: 
com.cloud.agent.api.routing.LoadBalancerConfigCommand
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) 
(logid:aec84669) Transforming 
com.cloud.agent.api.routing.LoadBalancerConfigCommand to ConfigItems
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section: global
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         log 127.0.0.1:3914   local0 warning
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         maxconn 4096
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         maxpipes 1024
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         chroot /var/lib/haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         user haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         group haproxy
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
global section:         daemon
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section: defaults
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         log     global
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         mode    tcp
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         option  dontlognull
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         retries 3
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         option redispatch
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         option forwardfor
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         option forceclose
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         timeout connect    5000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         timeout client     50000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
default section:         timeout server     50000
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: INFO  
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
Haproxy mode http enabled
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[cloud.network.HAProxyConfigurator] (agentRequest-Handler-1:) (logid:aec84669) 
Haproxystats rule:
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: listen stats_on_public 
10.107.0.6:8081
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: mode http
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: option httpclose
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats enable
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats uri     /admin?stats
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats realm   Haproxy\ 
Statistics
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: stats auth    admin1:AdMiN123
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:) 
(logid:4a2bd0ba) Execution is successful.
Jul 06 08:59:08 njcloudhost.dev.ena.net sh[13457]: DEBUG [cloud.agent.Agent] 
(agentRequest-Handler-2:) (logid:4a2bd0ba) Seq 1-5871286539207573661:  { Ans: , 
MgmtId: 52239507206, via: 1, Ver: v1, Flags: 10, 
[{"com.cloud.agent.api.CheckRouterAnswer":{"state":"BACKUP","result":true,"details":"Status:
 BACKUP\n","wait":0}}] }
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-1:) 
(logid:aec84669) Processing FileConfigItem, copying 1315 characters to 
load_balancer.json took 237ms
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: DEBUG 
[kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:) 
(logid:aec84669) Executing: 
/usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh 
update_config.py 169.254.0.11 load_balancer.json
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: In 
function 'bool ceph::log::SubsystemMap::should_gather(unsigned int, int)' 
thread 7f530dff3700 time 2016-07-06 08:59:09.143659
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ./log/SubsystemMap.h: 62: 
FAILED assert(sub < m_subsys.size())
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: ceph version 10.2.2 
(45107e21c568dd033c2f0a3107dec8f0b0e58374)
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 1: (()+0x289da5) 
[0x7f534a94bda5]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 2: (()+0x50028) 
[0x7f534a712028]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 3: (()+0x81ff3) 
[0x7f534a743ff3]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 4: (()+0x4d8e8e) 
[0x7f534ab9ae8e]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 5: (()+0x4e617d) 
[0x7f534aba817d]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 6: (()+0x7dc5) 
[0x7f5455e78dc5]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: 7: (clone()+0x6d) 
[0x7f5455793ced]
Jul 06 08:59:09 njcloudhost.dev.ena.net sh[13457]: NOTE: a copy of the 
executable, or `objdump -rdS <executable>` is needed to interpret this.
Jul 06 08:59:14 njcloudhost.dev.ena.net sh[13457]: /bin/sh: line 1: 13461 
Aborted                 (core dumped) /usr/lib/jvm/jre/bin/java -Xms256m 
-Xmx2048m -cp "$CLASSPATH" $JAVA_CLASS
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service: 
main process exited, code=exited, status=134/n/a
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: Unit 
cloudstack-agent.service entered failed state.
Jul 06 08:59:14 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service 
failed.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: cloudstack-agent.service 
holdoff time over, scheduling restart.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Started CloudStack Agent.
Jul 06 08:59:24 njcloudhost.dev.ena.net systemd[1]: Starting CloudStack Agent…

Is there something obvious I’m missing here?

— Aaron

Attachment: signature.asc
Description: Message signed with OpenPGP using GPGMail

Reply via email to