Irrelevant if mgmt sends a (TCP) disconnect to the SSVM (which should not affect regular OS pings from mgmt to the SSVM) - if you have issues pinging from mgmt to the hypervisor node, then again I would bet on misbehaving switch equipment.
On Thu, 25 Jul 2019 at 18:52, Rakesh v <www.rakeshv....@gmail.com> wrote:

> True, but I thought that since mgt sends a disconnect to ssvm, it would
> reset the interface on which it is connecting.
>
> Sent from my iPhone
>
>> On 25-Jul-2019, at 4:57 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>
>> In your previous mail, I understood that you used the OS tool "ping" and
>> were NOT referring to internal ACS pings?
>>
>> "Out of all these, the ping drops were observed from MGT server to ssvm
>> and mgt server to nodes. Basically all nodes lost connection. Then it
>> recovered itself after 1 minute."
>>
>>> On Thu, 25 Jul 2019 at 16:55, Rakesh v <www.rakeshv....@gmail.com> wrote:
>>>
>>> The ping between the mgt server and ssvm fails because mgt sends a
>>> disconnect message to all nodes. If you look at the logs I pasted in the
>>> first email, the mgt server thinks ssvm is lagging behind on ping and
>>> sends a disconnect message to all nodes without investigation. It also
>>> happens at the beginning of every hour.
>>>
>>> So I'm sure the network is not the issue here.
>>>
>>>> On 25-Jul-2019, at 4:46 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>>
>>>> Since basic network connectivity (pings) was failing between the mgmt
>>>> servers and the nodes (and the SSVM on them), I would point my finger
>>>> at your networking equipment - i.e. I expect zero problems with ACS
>>>> (since the pings fail).
>>>>
>>>> Let us know how it goes.
>>>>
>>>> Andrija
>>>>
>>>>> On Thu, 25 Jul 2019 at 16:04, Rakesh v <www.rakeshv....@gmail.com> wrote:
>>>>>
>>>>> Yes, I was monitoring it continuously. Below are the checks I was
>>>>> running when the issue happened:
>>>>>
>>>>> 1. Ping from MGT server to ssvm
>>>>> 2. Ping from ssvm to the secondary storage IP
>>>>> 3. Ping from ssvm to a public IP like 8.8.8.8
>>>>> 4. Ping from MGT server to the node on which ssvm was running
>>>>>
>>>>> Out of all these, the ping drops were observed from MGT server to ssvm
>>>>> and mgt server to nodes. Basically all nodes lost connection. Then it
>>>>> recovered itself after 1 minute.
>>>>>
>>>>>> On 25-Jul-2019, at 3:48 PM, Andrija Panic <andrija.pa...@gmail.com> wrote:
>>>>>>
>>>>>> Can you observe the status of the SSVM (is it
>>>>>> Up/Connecting/Disconnected/Down) while you have the issues?
>>>>>>
>>>>>> I would advise checking your Secondary Storage itself - and also
>>>>>> running the SSVM diagnostics script
>>>>>> /usr/local/cloud/systemvm/ssvm-check.sh - and observing whether there
>>>>>> are any errors with NFS or anything else.
>>>>>>
>>>>>> Lastly - and don't laugh - check that you don't have issues with your
>>>>>> networking equipment (some of us had VEEEERY strange connectivity
>>>>>> issues some years ago with crappy QCT/Quanta switches in an MLAG
>>>>>> setup).
>>>>>>
>>>>>> Andrija
>>>>>>
>>>>>>> On Thu, 25 Jul 2019 at 15:42, Rakesh v <www.rakeshv....@gmail.com> wrote:
>>>>>>>
>>>>>>> Yes, I have set the IPs of the three MGT servers in the "host" field.
>>>>>>>
>>>>>>>> On 25-Jul-2019, at 2:14 PM, Pierre-Luc Dion <pdion...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Do you have a load balancer in front of CloudStack? Did you set the
>>>>>>>> global setting "host" to the IP of the mgmt server?
>>>>>>>>
>>>>>>>> On Thu, 25 Jul 2019 at 03:24, Rakesh Venkatesh
>>>>>>>> <www.rakeshv....@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello People
>>>>>>>>>
>>>>>>>>> I have a strange issue where the mgt server times out sending a
>>>>>>>>> command to the secondary storage VM every hour, and because of
>>>>>>>>> this the UI is not accessible for a short period of time.
>>>>>>>>> Sometimes I have to restart the mgt server to get it back to a
>>>>>>>>> working state, and sometimes I don't need to restart it. I also
>>>>>>>>> see some exceptions while fetching the storage stats.
>>>>>>>>>
>>>>>>>>> The log says the secondary storage VM is lagging behind the mgt
>>>>>>>>> server on ping, and it sends a disconnect message to the other
>>>>>>>>> components. Can you let me know how to troubleshoot this issue? I
>>>>>>>>> destroyed the secondary storage VM but the issue still persists. I
>>>>>>>>> checked the date/time on the mgt server and the SSVM and they are
>>>>>>>>> the same. This has been happening for quite a few days now. Below
>>>>>>>>> are the logs:
>>>>>>>>>
>>>>>>>>> 2019-07-25 04:01:22,769 INFO [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Found the following agents behind on ping: [183]
>>>>>>>>> 2019-07-25 04:01:22,775 WARN [c.c.a.m.AgentManagerImpl] (AgentMonitor-1:ctx-c33dbe74) (logid:5442158c) Disconnect agent for CPVM/SSVM due to physical connection close. host: 183
>>>>>>>>> 2019-07-25 04:01:22,778 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Host 183 is disconnecting with event ShutdownRequested
>>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) The next status of agent 183is Disconnected, current status is Up
>>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Deregistering link for 183 with state Disconnected
>>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Remove Agent : 183
>>>>>>>>> 2019-07-25 04:01:22,781 DEBUG [c.c.a.m.ConnectedAgentAttache] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Processing Disconnect.
>>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Seq 183-7541559051008607242: Sending disconnect to class com.cloud.agent.manager.SynchronousListener
>>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.hypervisor.xenserver.discoverer.XcpServerDiscoverer
>>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.u.n.NioConnection] (pool-2-thread-1:null) (logid:) Closing socket Socket[addr=/172.30.32.16,port=38250,localport=8250]
>>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Waiting some more time because this is the current command
>>>>>>>>> 2019-07-25 04:01:22,782 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.hypervisor.hyperv.discoverer.HypervServerDiscoverer
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Waiting some more time because this is the current command
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.deploy.DeploymentPlanningManagerImpl
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.security.SecurityGroupListener
>>>>>>>>> 2019-07-25 04:01:22,783 INFO [c.c.u.e.CSExceptionErrorCode] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: org.apache.cloudstack.engine.orchestration.NetworkOrchestrator
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.vm.ClusteredVirtualMachineManagerImpl
>>>>>>>>> 2019-07-25 04:01:22,783 WARN [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Timed out on null
>>>>>>>>> 2019-07-25 04:01:22,783 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.listener.StoragePoolMonitor
>>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentAttache] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Seq 183-7541559051008607242: Cancelling.
>>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.secondary.SecondaryStorageListener
>>>>>>>>> 2019-07-25 04:01:22,784 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
>>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [o.a.c.s.RemoteHostEndPoint] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Failed to send command, due to Agent:183, com.cloud.exception.OperationTimedoutException: Commands 7541559051008607242 to Host 183 timed out after 3600
>>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.router.VpcVirtualNetworkApplianceManagerImpl
>>>>>>>>> 2019-07-25 04:01:22,785 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.download.DownloadListener
>>>>>>>>>
>>>>>>>>> 2019-07-25 04:01:22,785 ERROR [c.c.s.StatsCollector] (StatsCollector-2:ctx-b55657a9) (logid:dafc4881) Error trying to retrieve storage stats
>>>>>>>>> com.cloud.utils.exception.CloudRuntimeException: Failed to send command, due to Agent:183, com.cloud.exception.OperationTimedoutException: Commands 7541559051008607242 to Host 183 timed out after 3600
>>>>>>>>>     at org.apache.cloudstack.storage.RemoteHostEndPoint.sendMessage(RemoteHostEndPoint.java:133)
>>>>>>>>>     at com.cloud.server.StatsCollector$StorageCollector.runInContext(StatsCollector.java:1139)
>>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:49)
>>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:56)
>>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:103)
>>>>>>>>>     at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:53)
>>>>>>>>>     at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:46)
>>>>>>>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>>>>>>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>>>>>>>>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>>>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>>>>> 2019-07-25 04:01:22,786 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.consoleproxy.ConsoleProxyListener
>>>>>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.LocalStoragePoolListener
>>>>>>>>> 2019-07-25 04:01:22,789 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.storage.upload.UploadListener
>>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.capacity.StorageCapacityListener
>>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.capacity.ComputeCapacityListener
>>>>>>>>> 2019-07-25 04:01:22,790 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.SshKeysDistriMonitor
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.router.VirtualNetworkApplianceManagerImpl
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.network.NetworkUsageManagerImpl$DirectNetworkStatsListener
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.n.NetworkUsageManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Disconnected called on 183 with status Disconnected
>>>>>>>>>
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$BehindOnPingListener
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Sending Disconnect to listener: com.cloud.agent.manager.AgentManagerImpl$SetHostParamsListener
>>>>>>>>> 2019-07-25 04:01:22,791 DEBUG [c.c.h.Status] (AgentTaskPool-1:ctx-66de2057) (logid:841d2a63) Transition:[Resource state = Enabled, Agent event = ShutdownRequested, Host id = 183, name = s-2775-VM]
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks and regards
>>>>>>>>> Rakesh venkatesh

--
Andrija Panić