If this is a production cloud and you need help monitoring this DB inconsistency (together with some other DB inconsistencies) via Jenkins, ping me in a PM; I might be able to provide you the script, but I need to check.
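For anyone reading this in the archives: the check such a script performs amounts to a single query against the cloud database. Below is a minimal sketch of a Jenkins-style job, assuming the standard cloud.nics and cloud.vm_instance tables, a mysql client on the management server, and placeholder credentials. Note that some network types legitimately leave these URIs NULL, so the filter would need tuning per deployment.

    #!/bin/bash
    # Hypothetical consistency check: list NICs of live VMs whose
    # broadcast_uri or isolation_uri has been nulled out (the symptom
    # discussed in this thread). Credentials are placeholders.
    mysql -u cloud -p"$CLOUD_DB_PASS" cloud -e "
        SELECT n.id, n.instance_id, v.instance_name,
               n.broadcast_uri, n.isolation_uri, n.state
        FROM nics n
        JOIN vm_instance v ON v.id = n.instance_id
        WHERE n.removed IS NULL
          AND v.removed IS NULL
          AND (n.broadcast_uri IS NULL OR n.isolation_uri IS NULL);"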
On 2 February 2018 at 04:46, David Mabry <dma...@ena.com.invalid> wrote:

> Andrija,
>
> You were right! The isolation_uri and the broadcast_uri were both blank
> for the problem VMs. Once I corrected the issue, I was able to migrate
> them inside of CS without issue. Thanks for helping me get to the root
> cause of this issue.
>
> Thanks,
> David Mabry
>
> On 2/1/18, 3:27 PM, "David Mabry" <dma...@ena.com.INVALID> wrote:
>
>> Andrija,
>>
>> Thanks for the tip. I'll check that out and let you know what I find.
>>
>> Thanks,
>> David Mabry
>>
>> On 2/1/18, 2:04 PM, "Andrija Panic" <andrija.pa...@gmail.com> wrote:
>>
>>> The customer with the serial number here :)
>>>
>>> So, another issue which I noticed: when you have KVM host
>>> disconnections (agent disconnect), then in some cases in the
>>> cloud.nics table there will be a missing broadcast_uri, isolation_uri,
>>> state or similar field that is NULL instead of holding the correct
>>> value for a specific NIC of the affected VM.
>>>
>>> In this case the VM will not live migrate via ACS (but you can of
>>> course migrate it manually)... the fix is to repair the nics table
>>> with proper values (copy the values from other NICs in the same
>>> network).
>>>
>>> Check if this might be the case...
>>>
>>> Cheers
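The repair Andrija describes above can be sketched the same way. The two NIC ids below are placeholders, and the join on network_id is an assumption about how "same network" maps onto the schema; verify with a SELECT and take a database backup before running anything like this.

    #!/bin/bash
    # Sketch of the manual fix described above: copy broadcast_uri and
    # isolation_uri from a healthy NIC (placeholder id 123) in the same
    # network onto the broken NIC (placeholder id 456).
    mysql -u cloud -p"$CLOUD_DB_PASS" cloud -e "
        UPDATE nics AS broken
        JOIN nics AS healthy
          ON healthy.network_id = broken.network_id AND healthy.id = 123
        SET broken.broadcast_uri = healthy.broadcast_uri,
            broken.isolation_uri = healthy.isolation_uri
        WHERE broken.id = 456;"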
>>> On 31 January 2018 at 15:49, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:
>>>
>>>> Glad to hear you fixed the issue! :)
>>>>
>>>> On Jan 31, 2018, at 7:16 AM, David Mabry <dma...@ena.com.INVALID> wrote:
>>>>
>>>>> Mike and Wei,
>>>>>
>>>>> Good news! I was able to manually live migrate these VMs following
>>>>> the steps outlined below:
>>>>>
>>>>> 1.) virsh dumpxml 38 --migratable > 38.xml
>>>>> 2.) Change the VNC information in 38.xml to match the destination
>>>>>     host IP and an available VNC port
>>>>> 3.) virsh migrate --verbose --live 38 --xml 38.xml qemu+tcp://destination.host.net/system
>>>>>
>>>>> To my surprise, Cloudstack was able to discover and properly handle
>>>>> the fact that this VM was live migrated to a new host without issue.
>>>>> Very cool.
>>>>>
>>>>> Wei, I suspect you are correct when you said this was an issue with
>>>>> the cloudstack agent code. After digging a little deeper, the agent
>>>>> never attempts to talk to libvirt at all after prepping the dxml to
>>>>> send to the destination host. I'm going to attempt to reproduce this
>>>>> in my lab, attach a remote debugger, and see if I can get to the
>>>>> bottom of it.
>>>>>
>>>>> Thanks again for the help, guys! I really appreciate it.
>>>>>
>>>>> Thanks,
>>>>> David Mabry
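For reference, David's three steps translate almost directly into a small script. This is a sketch, not a drop-in tool: the domain, destination host, listen IP, and VNC port are arguments, and the sed edits of the <graphics> element are assumptions that should be checked against the actual dumped XML (autoport and websocket attributes can also match the pattern).

    #!/bin/bash
    # Sketch of the manual live-migration procedure above.
    # Usage: ./manual-migrate.sh <domain> <dest-host> <dest-ip> <vnc-port>
    DOMAIN="$1"; DEST="$2"; DEST_IP="$3"; VNC_PORT="$4"

    # 1.) Dump a migratable copy of the domain XML.
    virsh dumpxml "$DOMAIN" --migratable > "$DOMAIN.xml"

    # 2.) Point the VNC console at the destination IP and a free port.
    #     Assumed expressions; inspect <graphics> in the XML and adjust.
    sed -i -e "s/port='-*[0-9]*'/port='$VNC_PORT'/" \
           -e "s/listen='[^']*'/listen='$DEST_IP'/" "$DOMAIN.xml"

    # 3.) Live migrate using the edited XML.
    virsh migrate --verbose --live "$DOMAIN" --xml "$DOMAIN.xml" "qemu+tcp://$DEST/system"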
>>>>> On 1/30/18, 9:55 AM, "David Mabry" <dma...@ena.com.INVALID> wrote:
>>>>>
>>>>>> Ah, understood. I'll take a closer look at the logs and make sure
>>>>>> that I didn't accidentally miss those lines when I pulled together
>>>>>> the logs for this email chain.
>>>>>>
>>>>>> Thanks,
>>>>>> David Mabry
>>>>>>
>>>>>> On 1/30/18, 8:34 AM, "Wei ZHOU" <ustcweiz...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I encountered the UnsupportedAnswer once before, when I made some
>>>>>>> changes in the kvm plugin.
>>>>>>>
>>>>>>> Normally there should be some network configuration in the
>>>>>>> agent.log, but I do not see it.
>>>>>>>
>>>>>>> -Wei
>>>>>>>
>>>>>>> 2018-01-30 15:00 GMT+01:00 David Mabry <dma...@ena.com.invalid>:
>>>>>>>
>>>>>>>> Hi Wei,
>>>>>>>>
>>>>>>>> I detached the ISO and received the same error. Just out of
>>>>>>>> curiosity, what leads you to believe it is something in the vxlan
>>>>>>>> code? I guess at this point, attaching a remote debugger to the
>>>>>>>> agent in question might be the best way to get to the bottom of
>>>>>>>> what is going on.
>>>>>>>>
>>>>>>>> Thanks in advance for the help. I really, really appreciate it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David Mabry
>>>>>>>>
>>>>>>>> On 1/30/18, 3:30 AM, "Wei ZHOU" <ustcweiz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The answer should be caused by an exception in the cloudstack
>>>>>>>>> agent. I tried to migrate a vm in our testing env; it is working.
>>>>>>>>>
>>>>>>>>> There are some differences between our env and yours:
>>>>>>>>> (1) vlan vs vxlan
>>>>>>>>> (2) no ISO vs attached ISO
>>>>>>>>> (3) both of us use ceph and centos7
>>>>>>>>>
>>>>>>>>> I suspect it is caused by the code on vxlan.
>>>>>>>>> However, could you detach the ISO and try again?
>>>>>>>>>
>>>>>>>>> -Wei
>>>>>>>>>
>>>>>>>>> 2018-01-29 19:48 GMT+01:00 David Mabry <dma...@ena.com.invalid>:
>>>>>>>>>
>>>>>>>>>> Good day Cloudstack Devs,
>>>>>>>>>>
>>>>>>>>>> I've run across a real head scratcher. I have two VMs (initially
>>>>>>>>>> 3 VMs, but more on that later) on a single host that I cannot
>>>>>>>>>> live migrate to any other host in the same cluster. We discovered
>>>>>>>>>> this after attempting to roll out patches going from CentOS 7.2
>>>>>>>>>> to CentOS 7.4. Initially, we thought it had something to do with
>>>>>>>>>> the new version of libvirtd or qemu-kvm on the other hosts in the
>>>>>>>>>> cluster preventing these VMs from migrating, but we are able to
>>>>>>>>>> live migrate other VMs to and from this host without issue. We
>>>>>>>>>> can even create new VMs on this specific host and live migrate
>>>>>>>>>> them after creation with no issue. We've put the migration source
>>>>>>>>>> agent, migration destination agent and the management server in
>>>>>>>>>> debug and don't seem to get anything useful other than
>>>>>>>>>> "Unsupported command". Luckily, we did have one VM that was shut
>>>>>>>>>> down and restarted; this is the 3rd VM mentioned above. Since
>>>>>>>>>> that VM has been restarted, it has no issues live migrating to
>>>>>>>>>> any other host in the cluster.
>>>>>>>>>>
>>>>>>>>>> I'm at a loss as to what to try next, and I'm hoping that someone
>>>>>>>>>> out there might have had a similar issue and could shed some
>>>>>>>>>> light on what to do. Obviously, I can contact the customer and
>>>>>>>>>> have them shut down their VMs, but that will potentially just
>>>>>>>>>> delay this problem to another day. Even if shutting down the VMs
>>>>>>>>>> is ultimately the solution, I'd still like to understand what
>>>>>>>>>> happened to cause this issue in the first place, with the hope of
>>>>>>>>>> preventing it in the future.
>>>>>>>>>>
>>>>>>>>>> Here's some information about my setup:
>>>>>>>>>> Cloudstack 4.8 Advanced Networking
>>>>>>>>>> CentOS 7.2 and 7.4 Hosts
>>>>>>>>>> Ceph RBD Primary Storage
>>>>>>>>>> NFS Secondary Storage
>>>>>>>>>> Instance in Question for Debug: i-532-1392-NSVLTN
>>>>>>>>>>
>>>>>>>>>> I have attached relevant debug logs to this email if anyone
>>>>>>>>>> wishes to take a look. I think the most interesting error message
>>>>>>>>>> that I have received is the following:
>>>>>>>>>>
>>>>>>>>>> 468390:2018-01-27 08:59:35,172 DEBUG [c.c.a.t.Request] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Seq 22-942378222027276319: Received: { Ans: , MgmtId: 14038012703634, via: 22(csh02c01z01.nsvltn.ena.net), Ver: v1, Flags: 110, { UnsupportedAnswer } }
>>>>>>>>>> 468391:2018-01-27 08:59:35,172 WARN [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Unsupported Command: Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>> 468392:2018-01-27 08:59:35,179 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Invocation exception, caused by: com.cloud.exception.AgentUnavailableException: Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration due to Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>> 468393:2018-01-27 08:59:35,179 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Rethrow exception com.cloud.exception.AgentUnavailableException: Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration due to Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>>
>>>>>>>>>> I've tracked this "Unsupported command" down in the CS 4.8 code
>>>>>>>>>> to cloudstack/api/src/com/cloud/agent/api/Answer.java, which is
>>>>>>>>>> the generic answer class. I believe where the error is really
>>>>>>>>>> being spawned from is
>>>>>>>>>> cloudstack/engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java.
>>>>>>>>>> Specifically:
>>>>>>>>>>
>>>>>>>>>>     Answer pfma = null;
>>>>>>>>>>     try {
>>>>>>>>>>         pfma = _agentMgr.send(dstHostId, pfmc);
>>>>>>>>>>         if (pfma == null || !pfma.getResult()) {
>>>>>>>>>>             final String details = pfma != null ? pfma.getDetails() : "null answer returned";
>>>>>>>>>>             final String msg = "Unable to prepare for migration due to " + details;
>>>>>>>>>>             pfma = null;
>>>>>>>>>>             throw new AgentUnavailableException(msg, dstHostId);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> The pfma returned must be in error, or it is never returned and
>>>>>>>>>> therefore still null. That answer appears to come from the
>>>>>>>>>> destination agent, but for the life of me I can't figure out what
>>>>>>>>>> the root cause of this error is beyond "Unsupported command
>>>>>>>>>> issued". What command is unsupported? My guess is that it could
>>>>>>>>>> be something wrong with the dxml that is generated and passed to
>>>>>>>>>> the destination host, but I have as yet been unable to catch that
>>>>>>>>>> dxml in debug.
>>>>>>>>>>
>>>>>>>>>> Any help or guidance is greatly appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> David Mabry
>>>
>>> --
>>> Andrija Panić

--
Andrija Panić
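One closing note for anyone who lands here from the archives: the "Are you sure you got the right type of server?" text is the generic UnsupportedAnswer the agent falls back to, so the real failure (here, the agent-side exception triggered by the NULL broadcast/isolation URIs) should be visible in the destination agent's log just before that answer is sent back. Something like the following can pull the surrounding context; the log path is the usual default and may differ on your install.

    # On the destination host: show what the agent did with the
    # PrepareForMigrationCommand right before answering "unsupported".
    grep -n -B 5 -A 30 "PrepareForMigrationCommand" /var/log/cloudstack/agent/agent.log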