If this is a production cloud and you need help monitoring this DB inconsistency (together with some other DB inconsistencies) via Jenkins, ping me in a PM; I might be able to provide you the script, but I need to check.
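For anyone reading this in the archives: the check such a script performs amounts to a single query against the cloud database. Below is a minimal sketch of a Jenkins-style job, assuming the standard cloud.nics and cloud.vm_instance tables, a mysql client on the management server, and placeholder credentials. Note that some network types legitimately leave these URIs NULL, so the filter would need tuning per deployment.

    #!/bin/bash
    # Hypothetical consistency check: list NICs of live VMs whose
    # broadcast_uri or isolation_uri has been nulled out (the symptom
    # discussed in this thread). Credentials are placeholders.
    mysql -u cloud -p"$CLOUD_DB_PASS" cloud -e "
        SELECT n.id, n.instance_id, v.instance_name,
               n.broadcast_uri, n.isolation_uri, n.state
        FROM nics n
        JOIN vm_instance v ON v.id = n.instance_id
        WHERE n.removed IS NULL
          AND v.removed IS NULL
          AND (n.broadcast_uri IS NULL OR n.isolation_uri IS NULL);"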
On 2 February 2018 at 04:46, David Mabry <dma...@ena.com.invalid> wrote:

> Andrija,
>
> You were right! The isolation_uri and the broadcast_uri were both blank
> for the problem VMs. Once I corrected the issue, I was able to migrate
> them inside of CS without issue. Thanks for helping me get to the root
> cause of this issue.
>
> Thanks,
> David Mabry
>
> On 2/1/18, 3:27 PM, "David Mabry" <dma...@ena.com.INVALID> wrote:
>
>> Andrija,
>>
>> Thanks for the tip. I'll check that out and let you know what I find.
>>
>> Thanks,
>> David Mabry
>>
>> On 2/1/18, 2:04 PM, "Andrija Panic" <andrija.pa...@gmail.com> wrote:
>>
>>> The customer with the serial number here :)
>>>
>>> So, another issue which I noticed: when you have KVM host
>>> disconnections (agent disconnect), then in some cases in the
>>> cloud.nics table there will be a missing broadcast_uri, isolation_uri,
>>> state or similar field that is NULL instead of holding the correct
>>> value for a specific NIC of the affected VM.
>>>
>>> In this case the VM will not live migrate via ACS (but you can of
>>> course migrate it manually)... the fix is to repair the nics table
>>> with proper values (copy the values from other NICs in the same
>>> network).
>>>
>>> Check if this might be the case...
>>>
>>> Cheers
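The repair Andrija describes above can be sketched the same way. The two NIC ids below are placeholders, and the join on network_id is an assumption about how "same network" maps onto the schema; verify with a SELECT and take a database backup before running anything like this.

    #!/bin/bash
    # Sketch of the manual fix described above: copy broadcast_uri and
    # isolation_uri from a healthy NIC (placeholder id 123) in the same
    # network onto the broken NIC (placeholder id 456).
    mysql -u cloud -p"$CLOUD_DB_PASS" cloud -e "
        UPDATE nics AS broken
        JOIN nics AS healthy
          ON healthy.network_id = broken.network_id AND healthy.id = 123
        SET broken.broadcast_uri = healthy.broadcast_uri,
            broken.isolation_uri = healthy.isolation_uri
        WHERE broken.id = 456;"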
>>> On 31 January 2018 at 15:49, Tutkowski, Mike <mike.tutkow...@netapp.com> wrote:
>>>
>>>> Glad to hear you fixed the issue! :)
>>>>
>>>> On Jan 31, 2018, at 7:16 AM, David Mabry <dma...@ena.com.INVALID> wrote:
>>>>
>>>>> Mike and Wei,
>>>>>
>>>>> Good news! I was able to manually live migrate these VMs following
>>>>> the steps outlined below:
>>>>>
>>>>> 1.) virsh dumpxml 38 --migratable > 38.xml
>>>>> 2.) Change the VNC information in 38.xml to match the destination
>>>>>     host IP and an available VNC port
>>>>> 3.) virsh migrate --verbose --live 38 --xml 38.xml qemu+tcp://destination.host.net/system
>>>>>
>>>>> To my surprise, Cloudstack was able to discover and properly handle
>>>>> the fact that this VM was live migrated to a new host without issue.
>>>>> Very cool.
>>>>>
>>>>> Wei, I suspect you are correct when you said this was an issue with
>>>>> the cloudstack agent code. After digging a little deeper, the agent
>>>>> never attempts to talk to libvirt at all after prepping the dxml to
>>>>> send to the destination host. I'm going to attempt to reproduce this
>>>>> in my lab, attach a remote debugger, and see if I can get to the
>>>>> bottom of it.
>>>>>
>>>>> Thanks again for the help, guys! I really appreciate it.
>>>>>
>>>>> Thanks,
>>>>> David Mabry
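For reference, David's three steps translate almost directly into a small script. This is a sketch, not a drop-in tool: the domain, destination host, listen IP, and VNC port are arguments, and the sed edits of the <graphics> element are assumptions that should be checked against the actual dumped XML (autoport and websocket attributes can also match the pattern).

    #!/bin/bash
    # Sketch of the manual live-migration procedure above.
    # Usage: ./manual-migrate.sh <domain> <dest-host> <dest-ip> <vnc-port>
    DOMAIN="$1"; DEST="$2"; DEST_IP="$3"; VNC_PORT="$4"

    # 1.) Dump a migratable copy of the domain XML.
    virsh dumpxml "$DOMAIN" --migratable > "$DOMAIN.xml"

    # 2.) Point the VNC console at the destination IP and a free port.
    #     Assumed expressions; inspect <graphics> in the XML and adjust.
    sed -i -e "s/port='-*[0-9]*'/port='$VNC_PORT'/" \
           -e "s/listen='[^']*'/listen='$DEST_IP'/" "$DOMAIN.xml"

    # 3.) Live migrate using the edited XML.
    virsh migrate --verbose --live "$DOMAIN" --xml "$DOMAIN.xml" "qemu+tcp://$DEST/system"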
>>>>> On 1/30/18, 9:55 AM, "David Mabry" <dma...@ena.com.INVALID> wrote:
>>>>>
>>>>>> Ah, understood. I'll take a closer look at the logs and make sure
>>>>>> that I didn't accidentally miss those lines when I pulled together
>>>>>> the logs for this email chain.
>>>>>>
>>>>>> Thanks,
>>>>>> David Mabry
>>>>>>
>>>>>> On 1/30/18, 8:34 AM, "Wei ZHOU" <ustcweiz...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I encountered the UnsupportedAnswer once before, when I made some
>>>>>>> changes in the kvm plugin.
>>>>>>>
>>>>>>> Normally there should be some network configuration in the
>>>>>>> agent.log, but I do not see it.
>>>>>>>
>>>>>>> -Wei
>>>>>>>
>>>>>>> 2018-01-30 15:00 GMT+01:00 David Mabry <dma...@ena.com.invalid>:
>>>>>>>
>>>>>>>> Hi Wei,
>>>>>>>>
>>>>>>>> I detached the ISO and received the same error. Just out of
>>>>>>>> curiosity, what leads you to believe it is something in the vxlan
>>>>>>>> code? I guess at this point, attaching a remote debugger to the
>>>>>>>> agent in question might be the best way to get to the bottom of
>>>>>>>> what is going on.
>>>>>>>>
>>>>>>>> Thanks in advance for the help. I really, really appreciate it.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David Mabry
>>>>>>>>
>>>>>>>> On 1/30/18, 3:30 AM, "Wei ZHOU" <ustcweiz...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The answer should be caused by an exception in the cloudstack
>>>>>>>>> agent. I tried to migrate a vm in our testing env; it is working.
>>>>>>>>>
>>>>>>>>> There are some differences between our env and yours:
>>>>>>>>> (1) vlan vs vxlan
>>>>>>>>> (2) no ISO vs attached ISO
>>>>>>>>> (3) both of us use ceph and centos7
>>>>>>>>>
>>>>>>>>> I suspect it is caused by the code on vxlan.
>>>>>>>>> However, could you detach the ISO and try again?
>>>>>>>>>
>>>>>>>>> -Wei
>>>>>>>>>
>>>>>>>>> 2018-01-29 19:48 GMT+01:00 David Mabry <dma...@ena.com.invalid>:
>>>>>>>>>
>>>>>>>>>> Good day Cloudstack Devs,
>>>>>>>>>>
>>>>>>>>>> I've run across a real head scratcher. I have two VMs (initially
>>>>>>>>>> 3 VMs, but more on that later) on a single host that I cannot
>>>>>>>>>> live migrate to any other host in the same cluster. We discovered
>>>>>>>>>> this after attempting to roll out patches going from CentOS 7.2
>>>>>>>>>> to CentOS 7.4. Initially, we thought it had something to do with
>>>>>>>>>> the new version of libvirtd or qemu-kvm on the other hosts in the
>>>>>>>>>> cluster preventing these VMs from migrating, but we are able to
>>>>>>>>>> live migrate other VMs to and from this host without issue. We
>>>>>>>>>> can even create new VMs on this specific host and live migrate
>>>>>>>>>> them after creation with no issue. We've put the migration source
>>>>>>>>>> agent, migration destination agent and the management server in
>>>>>>>>>> debug and don't seem to get anything useful other than
>>>>>>>>>> "Unsupported command". Luckily, we did have one VM that was shut
>>>>>>>>>> down and restarted; this is the 3rd VM mentioned above. Since
>>>>>>>>>> that VM has been restarted, it has no issues live migrating to
>>>>>>>>>> any other host in the cluster.
>>>>>>>>>>
>>>>>>>>>> I'm at a loss as to what to try next, and I'm hoping that someone
>>>>>>>>>> out there might have had a similar issue and could shed some
>>>>>>>>>> light on what to do. Obviously, I can contact the customer and
>>>>>>>>>> have them shut down their VMs, but that will potentially just
>>>>>>>>>> delay this problem to another day. Even if shutting down the VMs
>>>>>>>>>> is ultimately the solution, I'd still like to understand what
>>>>>>>>>> happened to cause this issue in the first place, with the hope of
>>>>>>>>>> preventing it in the future.
>>>>>>>>>>
>>>>>>>>>> Here's some information about my setup:
>>>>>>>>>> Cloudstack 4.8 Advanced Networking
>>>>>>>>>> CentOS 7.2 and 7.4 Hosts
>>>>>>>>>> Ceph RBD Primary Storage
>>>>>>>>>> NFS Secondary Storage
>>>>>>>>>> Instance in Question for Debug: i-532-1392-NSVLTN
>>>>>>>>>>
>>>>>>>>>> I have attached relevant debug logs to this email if anyone
>>>>>>>>>> wishes to take a look. I think the most interesting error message
>>>>>>>>>> that I have received is the following:
>>>>>>>>>>
>>>>>>>>>> 468390:2018-01-27 08:59:35,172 DEBUG [c.c.a.t.Request] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Seq 22-942378222027276319: Received: { Ans: , MgmtId: 14038012703634, via: 22(csh02c01z01.nsvltn.ena.net), Ver: v1, Flags: 110, { UnsupportedAnswer } }
>>>>>>>>>> 468391:2018-01-27 08:59:35,172 WARN [c.c.a.m.AgentManagerImpl] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Unsupported Command: Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>> 468392:2018-01-27 08:59:35,179 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Invocation exception, caused by: com.cloud.exception.AgentUnavailableException: Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration due to Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>> 468393:2018-01-27 08:59:35,179 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-6:ctx-188ea30f job-181792/job-181802 ctx-8e7f45ad) (logid:f0888362) Rethrow exception com.cloud.exception.AgentUnavailableException: Resource [Host:22] is unreachable: Host 22: Unable to prepare for migration due to Unsupported command issued: com.cloud.agent.api.PrepareForMigrationCommand. Are you sure you got the right type of server?
>>>>>>>>>>
>>>>>>>>>> I've tracked this "Unsupported command" down in the CS 4.8 code
>>>>>>>>>> to cloudstack/api/src/com/cloud/agent/api/Answer.java, which is
>>>>>>>>>> the generic answer class. I believe where the error is really
>>>>>>>>>> being spawned from is
>>>>>>>>>> cloudstack/engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java.
>>>>>>>>>> Specifically:
>>>>>>>>>>
>>>>>>>>>>     Answer pfma = null;
>>>>>>>>>>     try {
>>>>>>>>>>         pfma = _agentMgr.send(dstHostId, pfmc);
>>>>>>>>>>         if (pfma == null || !pfma.getResult()) {
>>>>>>>>>>             final String details = pfma != null ? pfma.getDetails() : "null answer returned";
>>>>>>>>>>             final String msg = "Unable to prepare for migration due to " + details;
>>>>>>>>>>             pfma = null;
>>>>>>>>>>             throw new AgentUnavailableException(msg, dstHostId);
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>> The pfma returned must be in error, or it is never returned and
>>>>>>>>>> therefore still null. That answer appears to come from the
>>>>>>>>>> destination agent, but for the life of me I can't figure out what
>>>>>>>>>> the root cause of this error is beyond "Unsupported command
>>>>>>>>>> issued". What command is unsupported? My guess is that it could
>>>>>>>>>> be something wrong with the dxml that is generated and passed to
>>>>>>>>>> the destination host, but I have as yet been unable to catch that
>>>>>>>>>> dxml in debug.
>>>>>>>>>>
>>>>>>>>>> Any help or guidance is greatly appreciated.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> David Mabry
>>>
>>> --
>>> Andrija Panić

--
Andrija Panić
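One closing note for anyone who lands here from the archives: the "Are you sure you got the right type of server?" text is the generic UnsupportedAnswer the agent falls back to, so the real failure (here, the agent-side exception triggered by the NULL broadcast/isolation URIs) should be visible in the destination agent's log just before that answer is sent back. Something like the following can pull the surrounding context; the log path is the usual default and may differ on your install.

    # On the destination host: show what the agent did with the
    # PrepareForMigrationCommand right before answering "unsupported".
    grep -n -B 5 -A 30 "PrepareForMigrationCommand" /var/log/cloudstack/agent/agent.log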