I destroyed the SSVM and then tried hacking the database to make CloudStack realize that the console proxy is in fact stopped:

mysql> update vm_instance set state='Stopped' where name='v-2-VM';
mysql> update host set status='Up' where name='v-2-VM';

Now they're both running and I can see the console. There's got to be a better way to use this system without having to reboot or hack the database daily.
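
For next time, assuming the management server itself is still healthy enough to take API calls, something like this via cloudmonkey ought to keep the state transitions inside CloudStack instead of going behind its back (untested sketch; the uuid placeholder comes from the list output):

cloudmonkey list systemvms filter=id,name,state    # show the system VMs and their states
cloudmonkey stop systemvm id=<uuid-of-v-2-VM> forced=true    # force-stop the wedged proxy via the API
cloudmonkey destroy systemvm id=<uuid-of-v-2-VM>    # or destroy it outright; CloudStack recreates it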

On Fri, May 23, 2014 at 10:42 AM, Ian Young <[email protected]> wrote:

> Also, is this normal? Every time the server is rebooted, it adds another
> record to the mshost table, but the "removed" field is always NULL.
>
> http://pastebin.com/q5zDCu4b
>
> On Fri, May 23, 2014 at 10:39 AM, Ian Young <[email protected]> wrote:
>
>> The SSVM is stopped. If I try to start it, it complains about
>> insufficient capacity. CPU? RAM? I have plenty of both available.
>>
>> 2014-05-23 10:36:51,196 DEBUG [c.c.d.FirstFitPlanner]
>> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Listing clusters in order of
>> aggregate capacity, that have (atleast one host with) enough CPU and RAM
>> capacity under this Pod: 1
>> 2014-05-23 10:36:51,198 DEBUG [c.c.d.FirstFitPlanner]
>> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Removing from the clusterId list
>> these clusters from avoid set: [1]
>> 2014-05-23 10:36:51,198 DEBUG [c.c.d.FirstFitPlanner]
>> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) No clusters found after removing
>> disabled clusters and clusters in avoid list, returning.
>> 2014-05-23 10:36:51,201 DEBUG [c.c.c.CapacityManagerImpl]
>> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) VM state transitted from
>> :Starting to Stopped with event: OperationFailedvm's original host id: 1
>> new host id: null host id before state transition: null
>> 2014-05-23 10:36:51,201 WARN [c.c.s.s.SecondaryStorageManagerImpl]
>> (Job-Executor-2:ctx-ababac38 ctx-0906d3c3) Exception while trying to start
>> secondary storage vm
>> com.cloud.exception.InsufficientServerCapacityException: Unable to create
>> a deployment for VM[SecondaryStorageVm|s-1-VM]Scope=interface
>> com.cloud.dc.DataCenter; id=1
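>>
>> Note the planner isn't actually saying the host is out of CPU or RAM:
>> the only cluster is in the avoid set, so everything gets filtered out
>> before capacity is even compared. To see what CloudStack thinks is
>> allocated, I've been poking at the capacity table, something like this
>> (sketch; column meanings from memory, where capacity_type 0 is memory
>> and 1 is CPU):
>>
>> mysql> select host_id, capacity_type, used_capacity, total_capacity from cloud.op_host_capacity;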
>>
>> On Fri, May 23, 2014 at 10:35 AM, Ian Young <[email protected]> wrote:
>>
>>> I rebooted it and now it's in an even more broken state. It's
>>> repeatedly trying to stop the console proxy but can't, because its
>>> state is "Starting." Here is an excerpt from the management log:
>>>
>>> http://pastebin.com/FiaDzKXb
>>>
>>> The agent log keeps repeating these messages:
>>>
>>> http://pastebin.com/yDidSbrz
>>>
>>> What's wrong with it?
>>>
>>> On Thu, May 22, 2014 at 12:55 PM, Ian Young <[email protected]> wrote:
>>>
>>>> I wonder if something is wrong with the NFS mount. I see this error
>>>> periodically in /var/log/messages, even though I have set the Domain
>>>> in /etc/idmapd.conf to the host's FQDN:
>>>>
>>>> May 20 19:30:22 virthost1 rpc.idmapd[1790]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 20 19:36:20 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 20 19:44:35 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 21 10:21:25 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 21 12:46:40 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 21 13:52:42 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 21 13:55:20 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107' does not map into domain 'redacted.com'
>>>> May 21 20:31:51 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107' does not map into domain 'redacted.com'
>>>> May 22 10:14:18 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 22 10:18:40 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107' does not map into domain 'redacted.com'
>>>> May 22 10:19:23 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '0' does not map into domain 'redacted.com'
>>>> May 22 10:25:16 virthost1 rpc.idmapd[1731]: nss_getpwnam: name '107' does not map into domain 'redacted.com'
>>>>
>>>> name '107' just started appearing in the log yesterday, which looks
>>>> unusual. Up until then, the error was always name '0'.
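>>>>
>>>> In case it helps anyone else, this id mapping only applies to NFSv4,
>>>> so the knobs I know of are roughly these (sketch; the export and
>>>> mount point are just examples from my setup):
>>>>
>>>> # /etc/idmapd.conf, [General] section; must match on client and server
>>>> Domain = redacted.com
>>>>
>>>> # pick up the change on EL6
>>>> service rpcidmapd restart
>>>>
>>>> # or sidestep idmapd entirely by forcing NFSv3, which uses numeric ids
>>>> mount -o vers=3 nfsserver:/export/secondary /mnt/secondary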
>>>>
>>>> On Thu, May 22, 2014 at 11:15 AM, Andrija Panic <[email protected]> wrote:
>>>>
>>>>> I have observed this kind of problem ("process blocked for more than
>>>>> xx sec...") when I had issues with storage - check your disks,
>>>>> smartctl, etc.
>>>>> best
>>>>>
>>>>> Sent from Google Nexus 4
>>>>>
>>>>> On May 22, 2014 7:49 PM, "Ian Young" <[email protected]> wrote:
>>>>>
>>>>> > And this is in /var/log/messages right before that event:
>>>>> >
>>>>> > May 22 10:16:07 virthost1 kernel: INFO: task qemu-kvm:2971 blocked for more than 120 seconds.
>>>>> > May 22 10:16:07 virthost1 kernel: Not tainted 2.6.32-431.11.2.el6.x86_64 #1
>>>>> > May 22 10:16:07 virthost1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>>> > May 22 10:16:07 virthost1 kernel: qemu-kvm D 0000000000000002 0 2971 1 0x00000080
>>>>> > May 22 10:16:07 virthost1 kernel: ffff8810724e9be8 0000000000000082 0000000000000000 ffff88106b6529d8
>>>>> > May 22 10:16:07 virthost1 kernel: ffff880871b3e8d8 ffff880871b3e8f0 ffffffff8100bb8e ffff8810724e9be8
>>>>> > May 22 10:16:07 virthost1 kernel: ffff881073525058 ffff8810724e9fd8 000000000000fbc8 ffff881073525058
>>>>> > May 22 10:16:07 virthost1 kernel: Call Trace:
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff810555ef>] ? mutex_spin_on_owner+0x9f/0xc0
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8152969e>] __mutex_lock_slowpath+0x13e/0x180
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8152953b>] mutex_lock+0x2b/0x50
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021c2cf>] memory_access_ok+0x7f/0xc0 [vhost_net]
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021d89c>] vhost_dev_ioctl+0x2ec/0xa50 [vhost_net]
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021c411>] ? vhost_work_flush+0xe1/0x120 [vhost_net]
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8122db91>] ? avc_has_perm+0x71/0x90
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa021f11a>] vhost_net_ioctl+0x7a/0x5d0 [vhost_net]
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8122f914>] ? inode_has_perm+0x54/0xa0
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffffa01a28b7>] ? kvm_vcpu_ioctl+0x1e7/0x580 [kvm]
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8108b14e>] ? send_signal+0x3e/0x90
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119dc12>] vfs_ioctl+0x22/0xa0
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119ddb4>] do_vfs_ioctl+0x84/0x580
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8119e331>] sys_ioctl+0x81/0xa0
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
>>>>> > May 22 10:16:07 virthost1 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
>>>>> >
>>>>> > On Thu, May 22, 2014 at 10:39 AM, Ian Young <[email protected]> wrote:
>>>>> >
>>>>> > > The console proxy became unavailable again yesterday afternoon. I
>>>>> > > could SSH into it via its link-local address, and nothing seemed
>>>>> > > to be wrong inside the VM itself. However, the qemu-kvm process
>>>>> > > for that VM was at almost 100% CPU. Inside the VM, the CPU usage
>>>>> > > was minimal and the java process was running and listening on
>>>>> > > port 443. So there seems to be something wrong down at the
>>>>> > > KVM/QEMU level. It's weird how this keeps happening to the
>>>>> > > console proxy only and not any of the other VMs. I tried to
>>>>> > > reboot it from the management UI and after about 15 minutes, it
>>>>> > > finally did. Now the console proxy is working, but I don't know
>>>>> > > how long it will last before it breaks again. I found this in
>>>>> > > libvirtd.log, which corresponds with the time the console proxy
>>>>> > > rebooted:
>>>>> > >
>>>>> > > 2014-05-22 17:17:04.362+0000: 25195: info : libvirt version: 0.10.2, package: 29.el6_5.7 (CentOS BuildSystem <http://bugs.centos.org>, 2014-04-07-07:42:04, c6b9.bsys.dev.centos.org)
>>>>> > > 2014-05-22 17:17:04.362+0000: 25195: error : qemuMonitorIO:614 : internal error End of file from monitor
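>>>>> > >
>>>>> > > Next time it wedges, I'll try to capture where qemu-kvm is
>>>>> > > spinning before rebooting anything, roughly like this (sketch;
>>>>> > > the domain name is from my setup):
>>>>> > >
>>>>> > > pid=$(pgrep -f 'v-2-VM')    # qemu-kvm process for the proxy
>>>>> > > top -b -n 1 -p "$pid"       # confirm the CPU burn
>>>>> > > gstack "$pid"               # user-space stacks (gdb package)
>>>>> > > virsh qemu-monitor-command v-2-VM --hmp 'info status'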
>>>>> > >
>>>>> > > On Wed, May 21, 2014 at 2:07 PM, Ian Young <[email protected]> wrote:
>>>>> > >
>>>>> > >> I built and installed a libvirt 1.0.4 package from the Fedora src
>>>>> > >> rpm. It installed fine inside a test VM, but installing it on the
>>>>> > >> real hypervisor was a bad idea, and I doubt I'll be pursuing it
>>>>> > >> further. All VMs promptly stopped and this appeared in
>>>>> > >> libvirtd.log:
>>>>> > >>
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: info : libvirt version: 1.0.4, package: 1.el6 (Unknown, 2014-05-21-11:36:09, redacted.com)
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_network.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_storage.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_nodedev.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_secret.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_nwfilter.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_interface.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so not accessible
>>>>> > >> 2014-05-21 20:36:19.260+0000: 23567: warning : virDriverLoadModule:72 : Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
>>>>> > >> 2014-05-21 20:36:49.471+0000: 23570: error : do_open:1220 : no connection driver available for qemu:///system
>>>>> > >> 2014-05-21 20:36:49.472+0000: 23567: error : virNetSocketReadWire:1370 : End of file while reading data: Input/output error
>>>>> > >> 2014-05-21 20:36:49.473+0000: 23571: error : do_open:1220 : no connection driver available for lxc:///
>>>>> > >> 2014-05-21 20:36:49.474+0000: 23567: error : virNetSocketReadWire:1370 : End of file while reading data: Input/output error
>>>>> > >> 2014-05-21 20:36:49.475+0000: 23568: error : do_open:1220 : no connection driver available for qemu:///system
>>>>> > >> 2014-05-21 20:36:49.476+0000: 23567: error : virNetSocketReadWire:1370 : End of file while reading data: Input/output error
>>>>> > >> 2014-05-21 20:36:49.678+0000: 23575: error : do_open:1220 : no connection driver available for qemu:///system
>>>>> > >> 2014-05-21 20:36:49.678+0000: 23567: error : virNetSocketReadWire:1370 : End of file while reading data: Input/output error
>>>>> > >> 2014-05-21 20:36:49.681+0000: 23572: error : do_open:1220 : no connection driver available for qemu:///system
>>>>> > >> 2014-05-21 20:36:49.682+0000: 23567: error : virNetSocketReadWire:1370 : End of file while reading data: Input/output error
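>>>>> > >>
>>>>> > >> In hindsight, "not accessible" probably means my rebuild never
>>>>> > >> shipped the connection-driver modules (or put them in a
>>>>> > >> different path), so libvirtd came up with no qemu driver at all.
>>>>> > >> A sanity check before installing would have been something like
>>>>> > >> this (the rpm filename is a guess from the log's "1.el6" tag):
>>>>> > >>
>>>>> > >> rpm -qlp libvirt-1.0.4-1.el6.x86_64.rpm | grep connection-driver
>>>>> > >> ls -l /usr/lib64/libvirt/connection-driver/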
>>>>> > >>
>>>>> > >> On Wed, May 21, 2014 at 10:45 AM, Ian Young <[email protected]> wrote:
>>>>> > >>
>>>>> > >>> I was able to get it working by following these steps (sketched
>>>>> > >>> as a script below):
>>>>> > >>>
>>>>> > >>> 1. stop all instances
>>>>> > >>> 2. service cloudstack-management stop
>>>>> > >>> 3. service cloudstack-agent stop
>>>>> > >>> 4. virsh shutdown {domain} (for each of the system VMs)
>>>>> > >>> 5. service libvirtd stop
>>>>> > >>> 6. umount primary and secondary
>>>>> > >>> 7. reboot
>>>>> > >>>
>>>>> > >>> The console proxy is working again. I expect it will probably
>>>>> > >>> break again in a day or two. I have a feeling it's a result of
>>>>> > >>> this libvirtd bug, since I've seen "cannot acquire state change
>>>>> > >>> lock" several times:
>>>>> > >>>
>>>>> > >>> https://bugs.launchpad.net/nova/+bug/1254872
>>>>> > >>>
>>>>> > >>> I might try building my own libvirtd 1.0.3 for EL6.
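>>>>> > >>>
>>>>> > >>> For my own notes, here is that sequence as a script (untested
>>>>> > >>> as a whole; the mount points are just the ones from my setup):
>>>>> > >>>
>>>>> > >>> #!/bin/sh
>>>>> > >>> # stop CloudStack first so it doesn't fight the shutdown
>>>>> > >>> service cloudstack-management stop
>>>>> > >>> service cloudstack-agent stop
>>>>> > >>> # cleanly shut down every running domain, system VMs included
>>>>> > >>> for dom in $(virsh list --name); do
>>>>> > >>>     virsh shutdown "$dom"
>>>>> > >>> done
>>>>> > >>> service libvirtd stop
>>>>> > >>> umount /mnt/primary /mnt/secondary   # primary/secondary storage
>>>>> > >>> reboot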
>>>>> > >>>
>>>>> > >>> On Tue, May 20, 2014 at 6:21 PM, Ian Young <[email protected]> wrote:
>>>>> > >>>
>>>>> > >>>> So I got the console proxy working via HTTPS (by managing my
>>>>> > >>>> own "realhostip.com" DNS) last week, and everything was working
>>>>> > >>>> fine. Today, all of a sudden, the console proxy stopped working
>>>>> > >>>> again. The browser says, "Connecting to
>>>>> > >>>> 192-168-100-159.realhostip.com..." and eventually times out. I
>>>>> > >>>> tried to restart it and it went into a "Stopping" state that
>>>>> > >>>> never completed, and the Agent State was "Disconnected." I
>>>>> > >>>> could not shut down the VM using virsh or with "kill -9"
>>>>> > >>>> because libvirtd kept saying, "cannot acquire state change
>>>>> > >>>> lock," so I gracefully shut down the remaining instances and
>>>>> > >>>> rebooted the entire management server/hypervisor. Start over.
>>>>> > >>>>
>>>>> > >>>> When it came back up, the SSVM and console proxy started, but
>>>>> > >>>> the virtual router was stopped. I was able to manually start it
>>>>> > >>>> from the UI. The console proxy still times out when I try to
>>>>> > >>>> access it from a browser. I don't see any errors in the
>>>>> > >>>> management or agent logs, just this:
>>>>> > >>>>
>>>>> > >>>> 2014-05-20 18:04:27,632 DEBUG [c.c.a.t.Request] (catalina-exec-10:null) Seq 1-2130378876: Sending { Cmd , MgmtId: 55157049428734, via: 1(virthost1.redacted.com), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.GetVncPortCommand":{"id":4,"name":"r-4-VM","wait":0}}] }
>>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.a.t.Request] (AgentManager-Handler-3:null) Seq 1-2130378876: Processing: { Ans: , MgmtId: 55157049428734, via: 1, Ver: v1, Flags: 10, [{"com.cloud.agent.api.GetVncPortAnswer":{"address":"192.168.100.6","port":5902,"result":true,"wait":0}}] }
>>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.a.t.Request] (catalina-exec-10:null) Seq 1-2130378876: Received: { Ans: , MgmtId: 55157049428734, via: 1, Ver: v1, Flags: 10, { GetVncPortAnswer } }
>>>>> > >>>> 2014-05-20 18:04:27,684 DEBUG [c.c.s.ConsoleProxyServlet] (catalina-exec-10:null) Port info 192.168.100.6
>>>>> > >>>> 2014-05-20 18:04:27,684 INFO [c.c.s.ConsoleProxyServlet] (catalina-exec-10:null) Parse host info returned from executing GetVNCPortCommand. host info: 192.168.100.6
>>>>> > >>>> 2014-05-20 18:04:27,686 DEBUG [c.c.s.ConsoleProxyServlet] (catalina-exec-10:null) Compose console url: https://192-168-100-159.realhostip.com/ajax?token=CsPhU4m_R2ZoLIdXOtjo3y3humnQN20wt5fSPjbZOHtRh7nli7tiq0ZiWUuwCVIn7VwF6503ziEqCRejlRsVcsyQcUfemTRXlhAOpJUyRugyCuTjmbUIX3EY1cHnFMKwF8FXXZr_PgwyXGPEoOHhkdRgsyRiczbk_Unuh4KmRngATr0FPCLtqhwIMpnbLSYwpnFDz65k9lEJmK6IlXYKVpWXg2rpVEsQvaNlulrZdhMQ7qUbacn82EG43OY8nmwm1SYB8TrUFH5Btb1RHpJm9A
>>>>> > >>>> 2014-05-20 18:04:27,686 DEBUG [c.c.s.ConsoleProxyServlet] (catalina-exec-10:null) the console url is :: <html><title>r-4-VM</title><frameset><frame src="https://192-168-100-159.realhostip.com/ajax?token=CsPhU4m_R2ZoLIdXOtjo3y3humnQN20wt5fSPjbZOHtRh7nli7tiq0ZiWUuwCVIn7VwF6503ziEqCRejlRsVcsyQcUfemTRXlhAOpJUyRugyCuTjmbUIX3EY1cHnFMKwF8FXXZr_PgwyXGPEoOHhkdRgsyRiczbk_Unuh4KmRngATr0FPCLtqhwIMpnbLSYwpnFDz65k9lEJmK6IlXYKVpWXg2rpVEsQvaNlulrZdhMQ7qUbacn82EG43OY8nmwm1SYB8TrUFH5Btb1RHpJm9A"></frame></frameset></html>
>>>>> > >>>> 2014-05-20 18:04:29,216 DEBUG [c.c.a.m.AgentManagerImpl] (AgentManager-Handler-4:null) SeqA 2-545: Processing Seq 2-545: { Cmd , MgmtId: -1, via: 2, Ver: v1, Flags: 11, [{"com.cloud.agent.api.ConsoleProxyLoadReportCommand":{"_proxyVmId":2,"_loadInfo":"{\n \"connections\": []\n}","wait":0}}] }
>>>>> > >>>>
>>>>> > >>>> If I try to restart the system VMs with cloudstack-sysvmadm, it
>>>>> > >>>> says:
>>>>> > >>>>
>>>>> > >>>> Stopping and starting 1 secondary storage vm(s)...
>>>>> > >>>> curl: (7) couldn't connect to host
>>>>> > >>>> ERROR: Failed to stop secondary storage vm with id 1
>>>>> > >>>>
>>>>> > >>>> Done stopping and starting secondary storage vm(s)
>>>>> > >>>>
>>>>> > >>>> Stopping and starting 1 console proxy vm(s)...
>>>>> > >>>> curl: (7) couldn't connect to host
>>>>> > >>>> ERROR: Failed to stop console proxy vm with id 2
>>>>> > >>>>
>>>>> > >>>> Done stopping and starting console proxy vm(s) .
>>>>> > >>>>
>>>>> > >>>> Stopping and starting 1 running routing vm(s)...
>>>>> > >>>> curl: (7) couldn't connect to host
>>>>> > >>>> 2
>>>>> > >>>> Done restarting router(s).
>>>>> > >>>>
>>>>> > >>>> I notice there are now four entries for the same management
>>>>> > >>>> server in the mshost table, all in an "Up" state with the
>>>>> > >>>> "removed" field NULL. What's wrong with this system?
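>>>>> > >>>>
>>>>> > >>>> The "curl: (7) couldn't connect to host" lines make me think
>>>>> > >>>> cloudstack-sysvmadm can't reach the unauthenticated integration
>>>>> > >>>> API, which I believe it expects on port 8096. Sketch of what I
>>>>> > >>>> plan to check (setting name from the docs):
>>>>> > >>>>
>>>>> > >>>> # set integration.api.port=8096 in Global Settings, then:
>>>>> > >>>> service cloudstack-management restart
>>>>> > >>>> curl 'http://localhost:8096/client/api?command=listSystemVms'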
