Hi carlos,
Did you ever find a fix for this?
I'm seeing a same issue on 4.1.1 with Vmware ESXi.
On 29 October 2013 04:54, Carlos Reategui <[email protected]> wrote:
Update. I cleared out the async_job table and also reset the system
vms
it
thought where in starting mode from my previous attempts by setting
them
to
Stopped from starting. I also re-set the XS pool master to be the
one XS thinks it is.
Now when I start the CS MS here are the logs leading up to the first
exception about the Unable to reach the pool:
2013-10-28 21:27:11,040 DEBUG [cloud.alert.ClusterAlertAdapter]
(Cluster-Notification-1:null) Management server node 172.30.45.2 is
up, send alert
2013-10-28 21:27:11,045 WARN [cloud.cluster.ClusterManagerImpl]
(Cluster-Notification-1:null) Notifying management server join event
took 9
ms
2013-10-28 21:27:23,236 DEBUG [cloud.server.StatsCollector]
(StatsCollector-2:null) HostStatsCollector is running...
2013-10-28 21:27:23,243 DEBUG [cloud.server.StatsCollector]
(StatsCollector-3:null) VmStatsCollector is running...
2013-10-28 21:27:23,247 DEBUG [cloud.server.StatsCollector]
(StatsCollector-1:null) StorageCollector is running...
2013-10-28 21:27:23,255 DEBUG [cloud.server.StatsCollector]
(StatsCollector-1:null) There is no secondary storage VM for
secondary storage host nfs://172.30.45.2/store/secondary
2013-10-28 21:27:23,273 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Forwarding null to
233845174730255
2013-10-28 21:27:23,274 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-9:null) Seq 1-201916421: Routing from
233845174730253
2013-10-28 21:27:23,275 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-9:null) Seq 1-201916421: Link is closed
2013-10-28 21:27:23,275 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-9:null) Seq 1-201916421: MgmtId 233845174730253:
Req:
Resource [Host:1] is unreachable: Host 1: Link is c
losed
2013-10-28 21:27:23,275 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-9:null) Seq 1--1: MgmtId 233845174730253: Req:
Routing to peer
2013-10-28 21:27:23,277 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-11:null) Seq 1--1: MgmtId 233845174730253: Req:
Cancel request received
2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-11:null) Seq 1-201916421: Cancelling.
2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Waiting some more time
because this is the current command
2013-10-28 21:27:23,277 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Waiting some more time
because this is the current command
2013-10-28 21:27:23,277 INFO [utils.exception.CSExceptionErrorCode]
(StatsCollector-2:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list
for exceptions
2013-10-28 21:27:23,277 WARN [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Timed out on null
2013-10-28 21:27:23,278 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 1-201916421: Cancelling.
2013-10-28 21:27:23,278 WARN [agent.manager.AgentManagerImpl]
(StatsCollector-2:null) Operation timed out: Commands 201916421 to
Host 1 timed out after 3600
2013-10-28 21:27:23,278 WARN [cloud.resource.ResourceManagerImpl]
(StatsCollector-2:null) Unable to obtain host 1 statistics.
2013-10-28 21:27:23,278 WARN [cloud.server.StatsCollector]
(StatsCollector-2:null) Received invalid host stats for host: 1
2013-10-28 21:27:23,281 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Forwarding null to
233845174730255
2013-10-28 21:27:23,283 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 1-201916422: Routing from
233845174730253
2013-10-28 21:27:23,283 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 1-201916422: Link is closed
2013-10-28 21:27:23,283 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-12:null) Seq 1-201916422: MgmtId 233845174730253:
Req: Resource [Host:1] is unreachable: Host 1: Link is
closed
2013-10-28 21:27:23,284 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-12:null) Seq 1--1: MgmtId 233845174730253: Req:
Routing to peer
2013-10-28 21:27:23,286 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-13:null) Seq 1--1: MgmtId 233845174730253: Req:
Cancel request received
2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-13:null) Seq 1-201916422: Cancelling.
2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Waiting some more time
because this is the current command
2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Waiting some more time
because this is the current command
2013-10-28 21:27:23,286 INFO [utils.exception.CSExceptionErrorCode]
(StatsCollector-1:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list
for exceptions
2013-10-28 21:27:23,286 WARN [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Timed out on null
2013-10-28 21:27:23,286 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 1-201916422: Cancelling.
2013-10-28 21:27:23,288 DEBUG [cloud.storage.StorageManagerImpl]
(StatsCollector-1:null) Unable to send storage pool command to
Pool[200|LVM] via 1
com.cloud.exception.OperationTimedoutException: Commands 201916422
to
Host
1 timed out after 3600
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:429)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
464)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2347)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:4
71)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 21:27:23,289 INFO [cloud.server.StatsCollector]
(StatsCollector-1:null) Unable to reach Pool[200|LVM]
com.cloud.exception.StorageUnavailableException: Resource
[StoragePool:200]
is unreachable: Unable to send command to the pool
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2357)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:4
71)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 21:27:23,300 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-2:null) Seq 2-1168703496: Forwarding null to
233845174730255
2013-10-28 21:27:23,301 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-14:null) Seq 2-1168703496: Routing from
233845174730253
2013-10-28 21:27:23,301 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-14:null) Seq 2-1168703496: Link is closed
2013-10-28 21:27:23,302 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-14:null) Seq 2-1168703496: MgmtId
Req: Resource [Host:2] is unreachable: Host 2: Link is closed
2013-10-28 21:27:23,302 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-14:null) Seq 2--1: MgmtId 233845174730253: Req:
Routing to peer
2013-10-28 21:27:23,303 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-15:null) Seq 2--1: MgmtId 233845174730253: Req:
Cancel request received
2013-10-28 21:27:23,303 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-15:null) Seq 2-1168703496: Cancelling.
2013-10-28 21:27:23,303 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 2-1168703496: Waiting some more time
because this is the current command
2013-10-28 21:27:23,304 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 2-1168703496: Waiting some more time
because this is the current command
2013-10-28 21:27:23,304 INFO [utils.exception.CSExceptionErrorCode]
(StatsCollector-2:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list
for exceptions
2013-10-28 21:27:23,304 WARN [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 2-1168703496: Timed out on null
2013-10-28 21:27:23,304 DEBUG [agent.manager.AgentAttache]
(StatsCollector-2:null) Seq 2-1168703496: Cancelling.
2013-10-28 21:27:23,304 WARN [agent.manager.AgentManagerImpl]
(StatsCollector-2:null) Operation timed out: Commands 1168703496 to
Host
2
timed out after 3600
2013-10-28 21:27:23,304 WARN [cloud.resource.ResourceManagerImpl]
(StatsCollector-2:null) Unable to obtain host 2 statistics.
2013-10-28 21:27:23,304 WARN [cloud.server.StatsCollector]
(StatsCollector-2:null) Received invalid host stats for host: 2
2013-10-28 21:27:23,307 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-1:null) Seq 2-1168703497: Forwarding null to
233845174730255
2013-10-28 21:27:23,308 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-1:null) Seq 2-1168703497: Routing from
233845174730253
2013-10-28 21:27:23,308 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-1:null) Seq 2-1168703497: Link is closed
2013-10-28 21:27:23,308 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-1:null) Seq 2-1168703497: MgmtId 233845174730253:
Req: Resource [Host:2] is unreachable: Host 2: Link is closed
2013-10-28 21:27:23,308 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-1:null) Seq 2--1: MgmtId 233845174730253: Req:
Routing to peer
2013-10-28 21:27:23,310 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-5:null) Seq 2--1: MgmtId 233845174730253: Req:
Cancel
request received
2013-10-28 21:27:23,310 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-5:null) Seq 2-1168703497: Cancelling.
2013-10-28 21:27:23,310 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703497: Waiting some more time
because this is the current command
2013-10-28 21:27:23,310 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703497: Waiting some more time
because this is the current command
2013-10-28 21:27:23,310 INFO [utils.exception.CSExceptionErrorCode]
(StatsCollector-1:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list
for exceptions
2013-10-28 21:27:23,310 WARN [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703497: Timed out on null
2013-10-28 21:27:23,310 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703497: Cancelling.
2013-10-28 21:27:23,310 DEBUG [cloud.storage.StorageManagerImpl]
(StatsCollector-1:null) Unable to send storage pool command to
Pool[201|LVM] via 2
com.cloud.exception.OperationTimedoutException: Commands 1168703497
to
Host
2 timed out after 3600
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:429)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
464)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2347)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:4
71)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 21:27:23,311 INFO [cloud.server.StatsCollector]
(StatsCollector-1:null) Unable to reach Pool[201|LVM]
com.cloud.exception.StorageUnavailableException: Resource
[StoragePool:201]
is unreachable: Unable to send command to the pool
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2357)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:4
71)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 21:27:23,328 DEBUG [agent.manager.ClusteredAgentAttache]
(StatsCollector-1:null) Seq 2-1168703498: Forwarding null to
233845174730255
2013-10-28 21:27:23,329 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-3:null) Seq 2-1168703498: Routing from
233845174730253
2013-10-28 21:27:23,329 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-3:null) Seq 2-1168703498: Link is closed
2013-10-28 21:27:23,329 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-3:null) Seq 2-1168703498: MgmtId 233845174730253:
Req: Resource [Host:2] is unreachable: Host 2: Link is closed
2013-10-28 21:27:23,330 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-3:null) Seq 2--1: MgmtId 233845174730253: Req:
Routing to peer
2013-10-28 21:27:23,331 DEBUG
[agent.manager.ClusteredAgentManagerImpl]
(AgentManager-Handler-4:null) Seq 2--1: MgmtId 233845174730253: Req:
Cancel
request received
2013-10-28 21:27:23,331 DEBUG [agent.manager.AgentAttache]
(AgentManager-Handler-4:null) Seq 2-1168703498: Cancelling.
2013-10-28 21:27:23,331 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703498: Waiting some more time
because this is the current command
2013-10-28 21:27:23,331 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703498: Waiting some more time
because this is the current command
2013-10-28 21:27:23,331 INFO [utils.exception.CSExceptionErrorCode]
(StatsCollector-1:null) Could not find exception:
com.cloud.exception.OperationTimedoutException in error code list
for exceptions
2013-10-28 21:27:23,332 WARN [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703498: Timed out on null
2013-10-28 21:27:23,332 DEBUG [agent.manager.AgentAttache]
(StatsCollector-1:null) Seq 2-1168703498: Cancelling.
2013-10-28 21:27:23,332 DEBUG [cloud.storage.StorageManagerImpl]
(StatsCollector-1:null) Unable to send storage pool command to
Pool[202|NetworkFilesystem] via 2
com.cloud.exception.OperationTimedoutException: Commands 1168703498
to
Host
2 timed out after 3600
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:429)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:
464)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2347)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:4
71)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
IP tables is disable on the XS hosts so the connection prob is not a
firewall issue.
If I do an xe se-list I see all 3 of the above SRs and the hosts
have mounted the NFS SR and can access it.
On Mon, Oct 28, 2013 at 9:05 PM, Carlos Reategui
<[email protected]
wrote:
Using CS 4.1.1 with 2 hosts running XS 6.0.2
Had to shut everything down and now I am having problems bringing
things
up.
As suggested I used CS to stop all my instances as well as the
system
VMs
and the SR. Then I shutdown the XS 6.02 servers after enabling
maintenance
mode from the CS console.
After bringing things up, my XS servers had the infamous
interface-rename
issue which I resolved by editing the udev rules file manually.
Now I have my XS servers up but for some reason my pool master got
changed
so I used xe pool-designate-new-master to switch it back.
I did not notice that this designation change had been picked up
by CS
and
when starting it up it keeps trying to connect to the wrong pool
master.
Should I switch XS to match CS or what do I need to change in CS
to
tell
it what the pool master is?
I tried putting the server that CS thinks is the master in
maintenance mode from CS but that just ends up in an apparent
infinite cycle
spitting
out endless lines like these:
2013-10-28 20:39:02,059 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-1:null) Seq 2-855048230: Forwarding Seq
2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,060 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-11:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1,
Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,062 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-13:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1,
Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,063 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-7:null) Seq 2-855048230: Forwarding Seq
2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,064 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-15:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1,
Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,066 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-14:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1,
Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,067 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-2:null) Seq 2-855048230: Forwarding Seq
2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,068 DEBUG
[agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1,
Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
After stopping and restarting the MS, the first error I see is:
2013-10-28 20:41:53,749 DEBUG [cloud.api.ApiServlet]
(catalina-exec-1:null) ===START=== 10.110.3.70 -- GET
command=queryAsyncJobResult&jobId=d695b8ba-53b5-4e22-8e97-54e5ed236f88
&response=json&sessi
onkey=r4nsNGoidS8enQWHRKbV2AUNeac%3D&_=1383018110624
2013-10-28 20:41:53,756 ERROR [cloud.api.ApiServlet]
(catalina-exec-1:null) unknown exception writing api response
java.lang.NullPointerException
at
com.cloud.user.AccountManagerImpl.getSystemUser(AccountManagerImpl.jav
a:280)
at
com.cloud.user.AccountManagerImpl.getSystemUser(AccountManagerImpl.jav
a:143)
at
com.cloud.api.ApiServlet.processRequest(ApiServlet.java:238)
at com.cloud.api.ApiServlet.doGet(ApiServlet.java:66)
at
javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
at
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli
cationFilterChain.java:290)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi
lterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperVa
lve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextVa
lve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.ja
va:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.ja
va:102)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:6
15)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValv
e.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
:293)
at
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor
.java:889)
at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.pro
cess(Http11NioProtocol.java:744)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint
.java:2282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 20:41:53,761 DEBUG [cloud.api.ApiServlet]
(catalina-exec-1:null) ===END=== 10.110.3.70 -- GET
command=queryAsyncJobResult&jobId=d695b8ba-53b5-4e22-8e97-54e5ed236f88
&response=json&session
key=r4nsNGoidS8enQWHRKbV2AUNeac%3D&_=1383018110624
Then I see a few of these:
2013-10-28 20:42:01,464 WARN
[agent.manager.ClusteredAgentManagerImpl]
(HA-Worker-4:work-10) Unable to connect to peer management server:
233845174730255, ip: 172.30.45.2 due to Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:530)
at
java.nio.channels.SocketChannel.open(SocketChannel.java:164)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.connectToPeer(Cluste
redAgentManagerImpl.java:477)
at
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttac
he.java:172)
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:388)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:51
1)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:46
4)
at
com.cloud.ha.CheckOnAgentInvestigator.isVmAlive(CheckOnAgentInvestigat
or.java:53)
at
com.cloud.ha.HighAvailabilityManagerImpl.restart(HighAvailabilityManag
erImpl.java:434)
at
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabi
lityManagerImpl.java:829)
2013-10-28 20:42:01,468 WARN
[agent.manager.ClusteredAgentManagerImpl]
(HA-Worker-2:work-11) Unable to connect to peer management server:
233845174730255, ip: 172.30.45.2 due to Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:530)
at
java.nio.channels.SocketChannel.open(SocketChannel.java:164)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.connectToPeer(Cluste
redAgentManagerImpl.java:477)
at
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttac
he.java:172)
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:388)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:51
1)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:46
4)
at
com.cloud.ha.CheckOnAgentInvestigator.isVmAlive(CheckOnAgentInvestigat
or.java:53)
at
com.cloud.ha.HighAvailabilityManagerImpl.restart(HighAvailabilityManag
erImpl.java:434)
at
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabi
lityManagerImpl.java:829)
The next error is:
2013-10-28 20:42:01,845 WARN [utils.nio.Task]
(AgentManager-Handler-6:null) Caught the following exception but
pushing
on
java.lang.NullPointerException
at
com.google.gson.FieldAttributes.getAnnotationFromArray(FieldAttributes
.java:231)
at
com.google.gson.FieldAttributes.getAnnotation(FieldAttributes.java
:150)
at
com.google.gson.VersionExclusionStrategy.shouldSkipField(VersionExclus
ionStrategy.java:38)
at
com.google.gson.DisjunctionExclusionStrategy.shouldSkipField(Disjuncti
onExclusionStrategy.java:38)
at
com.google.gson.ReflectingFieldNavigator.visitFieldsReflectively(Refle
ctingFieldNavigator.java:58)
at
com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:120)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializ
ationContextDefault.java:62)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializ
ationContextDefault.java:53)
at com.google.gson.Gson.toJsonTree(Gson.java:220)
at com.google.gson.Gson.toJsonTree(Gson.java:197)
at
com.cloud.agent.transport.ArrayTypeAdaptor.serialize(ArrayTypeAdaptor.
java:56)
at
com.cloud.agent.transport.ArrayTypeAdaptor.serialize(ArrayTypeAdaptor.
java:37)
at
com.google.gson.JsonSerializationVisitor.findAndInvokeCustomSerializer
(JsonSerializationVisitor.java:184)
at
com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonS
erializationVisitor.java:160)
at
com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:101)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializ
ationContextDefault.java:62)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializ
ationContextDefault.java:53)
at com.google.gson.Gson.toJsonTree(Gson.java:220)
at com.google.gson.Gson.toJson(Gson.java:260)
at
com.cloud.agent.transport.Request.toBytes(Request.java:316)
at
com.cloud.agent.transport.Request.getBytes(Request.java:332)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.cancel(ClusteredAgen
tManagerImpl.java:435)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl$ClusteredAgentHandle
r.doTask(ClusteredAgentManagerImpl.java:641)
at com.cloud.utils.nio.Task.run(Task.java:83)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
and then the next set of errors I see over and over are:
2013-10-28 20:42:16,433 DEBUG [cloud.storage.StorageManagerImpl]
(StatsCollector-2:null) Unable to send storage pool command to
Pool[200|LVM] via 1
com.cloud.exception.OperationTimedoutException: Commands
1112277002 to Host 1 timed out after 3600
at
com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:429)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:51
1)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:46
4)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2347)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java
:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 20:42:16,434 INFO [cloud.server.StatsCollector]
(StatsCollector-2:null) Unable to reach Pool[200|LVM]
com.cloud.exception.StorageUnavailableException: Resource
[StoragePool:200] is unreachable: Unable to send command to the
pool
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:2357)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.jav
a:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.ja
va:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java
:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:
351)
at
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.a
ccess$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.r
un(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
ava:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.
java:615)
at java.lang.Thread.run(Thread.java:679)
I have tried to force reconnect to both hosts but that ends up
maxing
out
a CPU core and filling up the log file with endless log lines.
Any thoughts on how to recover my system?