Hello, colleagues.

Over the last few days I have found two bugs which I believe are critical for
the 4.11 release. I would like to share them here and get help if possible:

1. CPVM bug. I use a wildcard certificate issued by the Comodo CA. I uploaded
it to CloudStack via the UI and destroyed the CPVM to force it to pick the
certificate up. At first it works like a charm, but after some time the CPVM
loses the certificate and console proxy connections are no longer possible.
After the CPVM is rebooted or recreated, everything works again. I'm not
familiar with the CPVM internals at all and cannot even imagine what could be
wrong here.
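
In case it helps with debugging, here is the small check I plan to run against
the CPVM to see exactly when it stops serving the uploaded certificate (just a
diagnostic sketch in Python; the IP, port and PEM path are placeholders for my
setup, not anything CloudStack-specific):

# Diagnostic sketch (not part of CloudStack): fetch whatever certificate the
# console proxy currently presents and compare it with the wildcard PEM I
# uploaded via the UI. Host, port and file path are placeholders.
import ssl

CPVM_HOST = "203.0.113.10"   # placeholder: CPVM public IP
CPVM_PORT = 443              # placeholder: console proxy HTTPS port

pem = ssl.get_server_certificate((CPVM_HOST, CPVM_PORT))

with open("wildcard.pem") as f:   # placeholder path to the uploaded cert
    expected = f.read()

# Ignore line-wrapping differences when comparing the PEM blocks.
same = "".join(pem.split()) == "".join(expected.split())
print("uploaded certificate is being served" if same
      else "certificate has been replaced:\n" + pem)

Running this from cron every few minutes should at least tell me whether the
certificate disappears at a predictable moment.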

1a. The CPVM has debug logging enabled, and the logs include tons of messages
like:

2018-01-06 06:13:57,069 DEBUG
[cloud.consoleproxy.ConsoleProxyAjaxImageHandler] (Thread-4159:null)
AjaxImageHandler
/ajaximg?token=RcHSrvzegyrjZAlc1Wjifcwv9P8WwK3eH63SuIS8WFFGssxymmjdYkZ4-S4ilY1UHxX612Lt_5Xi1Z5JaoCfDSf_UCi8lTIsPEBlDpUEWQg1IblYu0HxvoDugX9J4XgAdpj74qg_U4pOs74dzdZFB50PB_HxcMhzUqd5plH914PmRDw5k0ONaa183CsGa7DcGVvWaR_eYP_8_CArahGAjHt04Kx227tjyMx4Zaju7iNyxpBWxtBC5YJyj8rjv7IeA_0Pevz91pWn6OE1pkeLwGeFSV8pZw4BWg95SG97A-I&key=2020&ts=1515219237015
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null)
decode token. host: 10.252.2.10
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null)
decode token. port: 5903
2018-01-06 06:13:57,070 DEBUG
[cloud.consoleproxy.ConsoleProxyHttpHandlerHelper] (Thread-4159:null)
decode token. tag: 375c62b5-74d9-4494-8b79-0d7c76cff10f

Every opened session is dumped to the logs. I suppose this is dangerous and
could fill the filesystem, eventually breaking the CPVM:

/dev/vda10                                              368M   63M  287M
19% /var/log
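
To get a rough idea of how quickly that partition can fill at the current
DEBUG rate, I sample the free space on the CPVM and extrapolate (a quick
sketch; the mount point and sample interval are just my own choices):

# Rough estimate of how long /var/log survives at the current logging rate.
import os, time

LOG_MOUNT = "/var/log"
SAMPLE_SECONDS = 60

def free_bytes(path):
    st = os.statvfs(path)
    return st.f_bavail * st.f_frsize

before = free_bytes(LOG_MOUNT)
time.sleep(SAMPLE_SECONDS)
after = free_bytes(LOG_MOUNT)

rate = (before - after) / SAMPLE_SECONDS   # bytes written per second
if rate > 0:
    print("~%.1f hours until the partition is full" % (after / rate / 3600.0))
else:
    print("no measurable growth during the sample window")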

Might it be that (1) is a consequence of (1a)?

2. High CPU utilization bug. Right after the management server is launched it
uses almost no CPU, since this is a development cloud with no real load. After
two days I see two cores at roughly 50% utilization by the management server
processes, and several days ago I saw the management server processes consume
almost all available CPU. Surprisingly, it continues to function (API, UI),
and there is no noticeable API activity in the logs.
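
Next time it spikes I plan to map the busiest native threads of the management
server JVM to Java stack frames, roughly like this (a sketch; the pgrep
pattern is a guess for my installation, and it assumes the JDK's jstack is
available on the host):

# Sketch: find the hottest threads of the management server JVM and print the
# matching jstack frames. "cloudstack-management" as a pgrep pattern and the
# availability of jstack on the host are my assumptions.
import subprocess

pid = subprocess.check_output(
    ["pgrep", "-f", "cloudstack-management"], text=True).split()[0]

# Per-thread CPU usage; LWP is the native thread id jstack reports as "nid".
lines = subprocess.check_output(
    ["ps", "-L", "-o", "lwp,pcpu,comm", "-p", pid], text=True).splitlines()[1:]
busiest = sorted(lines, key=lambda l: float(l.split()[1]), reverse=True)[:5]

jstack = subprocess.check_output(["jstack", pid], text=True)

for entry in busiest:
    lwp, pcpu = entry.split()[0], entry.split()[1]
    nid = hex(int(lwp))                     # jstack prints nid in hex
    print("LWP %s (%s%% CPU):" % (lwp, pcpu))
    for frame in jstack.splitlines():
        if ("nid=%s " % nid) in frame:
            print("  " + frame.strip())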

The only two suspicious things I found for the last incident are:

root@cs2-head1:/var/log/cloudstack/management# zgrep ERROR
management-server.log.2018-01-04.gz
2018-01-04 12:58:23,391 ERROR [c.c.c.ClusterManagerImpl]
(localhost-startStop-1:null) (logid:) Unable to ping management server at
10.252.2.2:9090 due to ConnectException
2018-01-04 12:58:25,743 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:36:23,874 ERROR [c.c.u.PropertiesUtil]
(localhost-startStop-1:null) (logid:) Unable to find properties file:
commands.properties
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188 ctx-b1887051)
(logid:be4b64e0) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a
deployment for VM[SecondaryStorageVm|s-24-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,043 ERROR [c.c.v.VmWorkJobHandlerProxy]
(Work-Job-Executor-4:ctx-faf69614 job-38155/job-38185 ctx-83290fa8)
(logid:65010252) Invocation exception, caused by:
com.cloud.exception.InsufficientServerCapacityException: Unable to create a
deployment for VM[ConsoleProxy|v-10-VM]Scope=interface
com.cloud.dc.DataCenter; id=1
2018-01-04 14:43:23,044 ERROR [c.c.v.VmWorkJobDispatcher]
(Work-Job-Executor-5:ctx-e566f561 job-38158/job-38188) (logid:be4b64e0)
Unable to complete AsyncJobVO {id:38188, userId: 1, accountId: 1,
instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkStart,
cmdInfo:
rO0ABXNyABhjb20uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAAGHQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwcHA,
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
result: null, initMsid: 2485138019287, completeMsid: null, lastUpdated:
null, lastPolled: null, created: Thu Jan 04 14:43:22 KRAT 2018}, job
origin:38158
2018-01-04 14:43:23,044 ERROR [c.c.v.VmWorkJobDispatcher]
(Work-Job-Executor-4:ctx-faf69614 job-38155/job-38185) (logid:65010252)
Unable to complete AsyncJobVO {id:38185, userId: 1, accountId: 1,
instanceType: null, instanceId: null, cmd: com.cloud.vm.VmWorkStart,
cmdInfo:
rO0ABXNyABhjb20uY2xvdWQudm0uVm1Xb3JrU3RhcnR9cMGsvxz73gIAC0oABGRjSWRMAAZhdm9pZHN0ADBMY29tL2Nsb3VkL2RlcGxveS9EZXBsb3ltZW50UGxhbm5lciRFeGNsdWRlTGlzdDtMAAljbHVzdGVySWR0ABBMamF2YS9sYW5nL0xvbmc7TAAGaG9zdElkcQB-AAJMAAtqb3VybmFsTmFtZXQAEkxqYXZhL2xhbmcvU3RyaW5nO0wAEXBoeXNpY2FsTmV0d29ya0lkcQB-AAJMAAdwbGFubmVycQB-AANMAAVwb2RJZHEAfgACTAAGcG9vbElkcQB-AAJMAAlyYXdQYXJhbXN0AA9MamF2YS91dGlsL01hcDtMAA1yZXNlcnZhdGlvbklkcQB-AAN4cgATY29tLmNsb3VkLnZtLlZtV29ya5-ZtlbwJWdrAgAESgAJYWNjb3VudElkSgAGdXNlcklkSgAEdm1JZEwAC2hhbmRsZXJOYW1lcQB-AAN4cAAAAAAAAAABAAAAAAAAAAEAAAAAAAAACnQAGVZpcnR1YWxNYWNoaW5lTWFuYWdlckltcGwAAAAAAAAAAHBwcHBwcHBwcHA,
cmdVersion: 0, status: IN_PROGRESS, processStatus: 0, resultCode: 0,
result: null, initMsid: 2485138019287, completeMsid: null, lastUpdated:
null, lastPolled: null, created: Thu Jan 04 14:43:21 KRAT 2018}, job
origin:38155
2018-01-04 14:43:25,127 ERROR
[o.a.c.s.d.d.CloudStackPrimaryDataStoreDriverImpl]
(consoleproxy-1:ctx-6f2f9b7b) (logid:25acd369) No remote endpoint to send
DeleteCommand, check if host or ssvm is down?
2018-01-04 14:43:25,127 ERROR
[o.a.c.s.d.d.CloudStackPrimaryDataStoreDriverImpl]
(secstorage-1:ctx-ae3adf87) (logid:4db1e2a0) No remote endpoint to send
DeleteCommand, check if host or ssvm is down?


The worst thing is that I don't even have an idea how to catch it. Also, since
my second management server is down, I see a lot of events like:
2017-12-28 05:08:34,927 DEBUG [c.c.c.ClusterManagerImpl]
(Cluster-Heartbeat-1:ctx-9bbae21c) (logid:9a1b0b21) Management server
heartbeat takes too long to finish. profiler: Done. Duration: 1935ms,
profilerHeartbeatUpdate: Done. Duration: 617ms, profilerPeerScan: Done.
Duration: 1317ms

Could this be a cause of the high CPU utilization?
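
To check whether these slow heartbeats line up in time with the CPU spikes, I
will simply extract them from the log; the sketch below only assumes the
message format shown above (run it against management-server.log):

# Print timestamp and heartbeat duration for every "takes too long" record,
# so the times can be compared with the CPU graphs.
import re, sys

pat = re.compile(r"^(\S+ \S+).*heartbeat takes too long.*?Duration: (\d+)ms")

for line in open(sys.argv[1]):
    m = pat.match(line)
    if m:
        print(m.group(1), m.group(2) + "ms")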

-- 
With best regards, Ivan Kudryavtsev
Bitworks Software, Ltd.
Cell: +7-923-414-1515
WWW: http://bitworks.software/
