Using CS 4.1.1 with 2 hosts running XS 6.0.2
Had to shut everything down and now I am having problems bringing things up.
As suggested I used CS to stop all my instances as well as the system VMs
and the SR. Then I shutdown the XS 6.02 servers after enabling maintenance
mode from the CS console.
After bringing things up, my XS servers had the infamous interface-rename
issue which I resolved by editing the udev rules file manually.
Now I have my XS servers up but for some reason my pool master got changed
so I used xe pool-designate-new-master to switch it back.
I did not notice that this designation change had been picked up by CS and
when starting it up it keeps trying to connect to the wrong pool master.
Should I switch XS to match CS or what do I need to change in CS to tell
it what the pool master is?
I tried putting the server that CS thinks is the master in maintenance mode
from CS but that just ends up in an apparent infinite cycle spitting out
endless lines like these:
2013-10-28 20:39:02,059 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-1:null) Seq 2-855048230: Forwarding Seq 2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,060 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-11:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,062 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-13:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,063 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-7:null) Seq 2-855048230: Forwarding Seq 2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,064 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-15:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,066 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-14:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,067 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-2:null) Seq 2-855048230: Forwarding Seq 2-855048230:
{ Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flags
: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
2013-10-28 20:39:02,068 DEBUG [agent.manager.ClusteredAgentAttache]
(AgentManager-Handler-12:null) Seq 2-855048230: Forwarding Seq
2-855048230: { Cmd , MgmtId: 233845174730253, via: 2, Ver: v1, Flag
s: 100111, [{"MaintainCommand":{"wait":0}}] } to 233845174730255
After stopping and restarting the MS, the first error I see is:
2013-10-28 20:41:53,749 DEBUG [cloud.api.ApiServlet] (catalina-exec-1:null)
===START=== 10.110.3.70 -- GET
command=queryAsyncJobResult&jobId=d695b8ba-53b5-4e22-8e97-54e5ed236f88&response=json&sessi
onkey=r4nsNGoidS8enQWHRKbV2AUNeac%3D&_=1383018110624
2013-10-28 20:41:53,756 ERROR [cloud.api.ApiServlet] (catalina-exec-1:null)
unknown exception writing api response
java.lang.NullPointerException
at
com.cloud.user.AccountManagerImpl.getSystemUser(AccountManagerImpl.java:280)
at
com.cloud.user.AccountManagerImpl.getSystemUser(AccountManagerImpl.java:143)
at com.cloud.api.ApiServlet.processRequest(ApiServlet.java:238)
at com.cloud.api.ApiServlet.doGet(ApiServlet.java:66)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:617)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:615)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
at
org.apache.coyote.http11.Http11NioProcessor.process(Http11NioProcessor.java:889)
at
org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:744)
at
org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:2282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 20:41:53,761 DEBUG [cloud.api.ApiServlet] (catalina-exec-1:null)
===END=== 10.110.3.70 -- GET
command=queryAsyncJobResult&jobId=d695b8ba-53b5-4e22-8e97-54e5ed236f88&response=json&session
key=r4nsNGoidS8enQWHRKbV2AUNeac%3D&_=1383018110624
Then I see a few of these:
2013-10-28 20:42:01,464 WARN [agent.manager.ClusteredAgentManagerImpl]
(HA-Worker-4:work-10) Unable to connect to peer management server:
233845174730255, ip: 172.30.45.2 due to Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:530)
at java.nio.channels.SocketChannel.open(SocketChannel.java:164)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.connectToPeer(ClusteredAgentManagerImpl.java:477)
at
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:172)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:388)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:464)
at
com.cloud.ha.CheckOnAgentInvestigator.isVmAlive(CheckOnAgentInvestigator.java:53)
at
com.cloud.ha.HighAvailabilityManagerImpl.restart(HighAvailabilityManagerImpl.java:434)
at
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabilityManagerImpl.java:829)
2013-10-28 20:42:01,468 WARN [agent.manager.ClusteredAgentManagerImpl]
(HA-Worker-2:work-11) Unable to connect to peer management server:
233845174730255, ip: 172.30.45.2 due to Connection refused
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect(Native Method)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:530)
at java.nio.channels.SocketChannel.open(SocketChannel.java:164)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.connectToPeer(ClusteredAgentManagerImpl.java:477)
at
com.cloud.agent.manager.ClusteredAgentAttache.send(ClusteredAgentAttache.java:172)
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:388)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:464)
at
com.cloud.ha.CheckOnAgentInvestigator.isVmAlive(CheckOnAgentInvestigator.java:53)
at
com.cloud.ha.HighAvailabilityManagerImpl.restart(HighAvailabilityManagerImpl.java:434)
at
com.cloud.ha.HighAvailabilityManagerImpl$WorkerThread.run(HighAvailabilityManagerImpl.java:829)
The next error is:
2013-10-28 20:42:01,845 WARN [utils.nio.Task]
(AgentManager-Handler-6:null) Caught the following exception but pushing on
java.lang.NullPointerException
at
com.google.gson.FieldAttributes.getAnnotationFromArray(FieldAttributes.java:231)
at
com.google.gson.FieldAttributes.getAnnotation(FieldAttributes.java:150)
at
com.google.gson.VersionExclusionStrategy.shouldSkipField(VersionExclusionStrategy.java:38)
at
com.google.gson.DisjunctionExclusionStrategy.shouldSkipField(DisjunctionExclusionStrategy.java:38)
at
com.google.gson.ReflectingFieldNavigator.visitFieldsReflectively(ReflectingFieldNavigator.java:58)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:120)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:62)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:53)
at com.google.gson.Gson.toJsonTree(Gson.java:220)
at com.google.gson.Gson.toJsonTree(Gson.java:197)
at
com.cloud.agent.transport.ArrayTypeAdaptor.serialize(ArrayTypeAdaptor.java:56)
at
com.cloud.agent.transport.ArrayTypeAdaptor.serialize(ArrayTypeAdaptor.java:37)
at
com.google.gson.JsonSerializationVisitor.findAndInvokeCustomSerializer(JsonSerializationVisitor.java:184)
at
com.google.gson.JsonSerializationVisitor.visitUsingCustomHandler(JsonSerializationVisitor.java:160)
at com.google.gson.ObjectNavigator.accept(ObjectNavigator.java:101)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:62)
at
com.google.gson.JsonSerializationContextDefault.serialize(JsonSerializationContextDefault.java:53)
at com.google.gson.Gson.toJsonTree(Gson.java:220)
at com.google.gson.Gson.toJson(Gson.java:260)
at com.cloud.agent.transport.Request.toBytes(Request.java:316)
at com.cloud.agent.transport.Request.getBytes(Request.java:332)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl.cancel(ClusteredAgentManagerImpl.java:435)
at
com.cloud.agent.manager.ClusteredAgentManagerImpl$ClusteredAgentHandler.doTask(ClusteredAgentManagerImpl.java:641)
at com.cloud.utils.nio.Task.run(Task.java:83)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
and then the next set of errors I see over and over are:
2013-10-28 20:42:16,433 DEBUG [cloud.storage.StorageManagerImpl]
(StatsCollector-2:null) Unable to send storage pool command to
Pool[200|LVM] via 1
com.cloud.exception.OperationTimedoutException: Commands 1112277002 to Host
1 timed out after 3600
at com.cloud.agent.manager.AgentAttache.send(AgentAttache.java:429)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:511)
at
com.cloud.agent.manager.AgentManagerImpl.send(AgentManagerImpl.java:464)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:2347)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.java:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
2013-10-28 20:42:16,434 INFO [cloud.server.StatsCollector]
(StatsCollector-2:null) Unable to reach Pool[200|LVM]
com.cloud.exception.StorageUnavailableException: Resource [StoragePool:200]
is unreachable: Unable to send command to the pool
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:2357)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:422)
at
com.cloud.storage.StorageManagerImpl.sendToPool(StorageManagerImpl.java:436)
at
com.cloud.server.StatsCollector$StorageCollector.run(StatsCollector.java:316)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:679)
I have tried to force reconnect to both hosts but that ends up maxing out a
CPU core and filling up the log file with endless log lines.
Any thoughts on how to recover my system?