Hi,

Update: after a reboot of all hosts over the weekend (resulting in a reboot of all VMs), the problematic router VM is OK now. Not sure what caused it.

Thanks.

On 2016-12-22 14:03, Syahrul Sazli Shaharir wrote:
On 2016-12-21 23:26, Linas Žilinskas wrote:
At this point I'm not sure what the issue could be in your case. Did you
try recreating the failing vrouter?

Yes, multiple times, by destroying it and/or restarting the network -
it failed every time.

Also, just in case, check if there's free disk space on it. We had
some vrouters stuck due to this, and I saw another thread here
discussing it.

Plenty of space in the stuck VM:-

root@r-691-VM:~# df -h
Filesystem                                              Size  Used Avail Use% Mounted on
rootfs                                                  461M  157M  281M  36% /
udev                                                     10M     0   10M   0% /dev
tmpfs                                                    50M  236K   50M   1% /run
/dev/disk/by-uuid/6a0427bc-6052-48de-a4b8-c82d8217ed1d  461M  157M  281M  36% /
tmpfs                                                   5.0M     0  5.0M   0% /run/lock
tmpfs                                                   207M     0  207M   0% /run/shm
/dev/vda1                                                73M   23M   47M  33% /boot
/dev/vda6                                                92M  5.6M   81M   7% /home
/dev/vda8                                               184M  6.2M  169M   4% /opt
/dev/vda11                                               92M  5.6M   81M   7% /tmp
/dev/vda7                                               751M  493M  219M  70% /usr
/dev/vda9                                               563M  157M  377M  30% /var
/dev/vda10                                              184M  7.2M  168M   5% /var/log

Thanks.


Basically the /var/log/ partition fills up, since it's relatively
small. And if you had issues for a period of time with that specific
router and restarted it multiple times, the log partition might be
full.
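
If it helps, here is a minimal sketch of how such a check could be scripted instead of eyeballing df on every router. It assumes Python is present on the system VM (the 4.6+ templates ship it for the JSON config scripts) and approximates the df percentage via statvfs; the 90% threshold is an arbitrary choice:

#!/usr/bin/env python
# Rough sketch: flag any mount on the router VM that is nearly full.
# Assumes Python is available on the system VM; THRESHOLD is arbitrary.
import os

THRESHOLD = 90  # percent used

def mountpoints():
    # /proc/mounts: device mountpoint fstype options dump pass
    with open('/proc/mounts') as f:
        for line in f:
            yield line.split()[1]

for mp in sorted(set(mountpoints())):
    try:
        st = os.statvfs(mp)
    except OSError:
        continue
    if st.f_blocks == 0:
        continue
    used_pct = 100.0 * (st.f_blocks - st.f_bfree) / st.f_blocks
    if used_pct >= THRESHOLD:
        print('%s is about %.0f%% full' % (mp, used_pct))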

On 21/12/16 06:35, Syahrul Sazli Shaharir wrote:

On 2016-12-20 17:53, Wei ZHOU wrote:

Hi Syahrul,

Could you upload /var/log/cloud.log?

Sure:-

Working router VM: http://pastebin.com/hwwk86ve

Non-working router VM: http://pastebin.com/G4nv09ab

Thanks.

-Wei

2016-12-20 3:18 GMT+01:00 Syahrul Sazli Shaharir <sa...@nocser.net>:


On 2016-12-19 18:10, Syahrul Sazli Shaharir wrote:

On 2016-12-19 17:03, Linas Žilinskas wrote:

From the logs it doesn't seem that the script times out. "Execution is
successful", so it manages to pass the data over the socket.

I guess the systemvm just doesn't configure itself for some reason.

You are right - I was able to enter the router VM console at some point
during the timeout loops, and was able to capture syslog output during
the loop:-

http://pastebin.com/n37aHeSa

I restarted another network, and that network's router VM was recreated
successfully, even on the same host as the failed network (both networks
have exactly the same configuration; only the VLAN & subnet are different).
Comparing the two syslog outputs during boot shows that the problematic
network's router VM self-configuration got stuck at vm_dhcp_entry.json.


1. Working network router VM : http://pastebin.com/Y6zpDa6M
2. Non-working network router VM : http://pastebin.com/jzfGMGQB

Thanks.

Also, in my personal tests, I noticed some different behaviour with
different kernels. I don't remember the specifics right now, but on some
combinations (qemu / kernel) the socket acted differently. For example,
the data was sent over the socket, but wasn't visible inside the VM.
Other times the socket would be stuck from the host side.

So I would suggest testing different kernels (3.x, 4.4.x, 4.8.x), or
trying to log in to the system VM and seeing what's happening from
inside.

Will do this next and report back the results here.

Thanks for your help! :)

On 12/16/16 03:46, Syahrul Sazli Shaharir wrote:

On 2016-12-16 11:27, Syahrul Sazli Shaharir wrote:
On Wed, 26 Oct 2016, Linas Žilinskas wrote:

So after some investigation I've found out that qemu 2.3.0 is indeed
broken, at least in the way CS uses the qemu chardev/socket.

Not sure in which specific version it broke, but it was fixed in
2.4.0-rc3; the commit specifically notes that CloudStack 4.2 was not working.

qemu git commit: 4bf1cb03fbc43b0055af60d4ff093d6894aa4338

Also attaching the patch from that commit.

For our own purposes I've included the patch in the qemu-kvm-ev
package (2.3.0) and all is well.
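
In case it is useful to anyone, below is a rough sketch of how one could check which HVs still run a pre-fix qemu. It assumes the CentOS layout with the binary at /usr/libexec/qemu-kvm, and of course a locally patched 2.3.0 like ours will still show up as "old" here:

#!/usr/bin/env python
# Rough sketch: report whether this host's qemu predates 2.4.0, the release
# that carries commit 4bf1cb03 fixing the chardev/socket behaviour.
# Assumes the CentOS binary path /usr/libexec/qemu-kvm; adjust if different.
import re
import subprocess

QEMU = '/usr/libexec/qemu-kvm'
FIXED_IN = (2, 4, 0)

out = subprocess.check_output([QEMU, '--version']).decode()
match = re.search(r'version (\d+)\.(\d+)\.(\d+)', out)
if not match:
    raise SystemExit('could not parse qemu version from: ' + out.strip())
version = tuple(int(part) for part in match.groups())
print('%s reports version %s' % (QEMU, '.'.join(map(str, version))))
if version < FIXED_IN:
    print('older than 2.4.0 - needs the chardev patch or an upgrade')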

Hi,

I am facing the exact same issue on the latest CloudStack 4.9.0.1, on the
latest CentOS 7.3.1611, with the latest qemu-kvm-ev-2.6.0-27.1.el7
package.

The issue initially surfaced following a heartbeat-induced reset of
all hosts, when it was on CS 4.8 @ CentOS 7.0 and stock
qemu-kvm-1.5.3. Since then, the patchviasocket.pl/py timeouts have
persisted for 1 out of 4 router VMs/networks, even after upgrading to
the latest code. (I have checked the qemu-kvm-ev-2.6.0-27.1.el7 source,
and the patched code is pretty much still intact, as per the
2.4.0-rc3 commit.)

Any help would be greatly appreciated.

Thanks.

(Attached are some debug logs from the host's agent.log)

Here are the debug logs as mentioned: http://pastebin.com/yHdsMNzZ

Thanks.

--sazli

On 2016-10-20 09:59, Linas Žilinskas wrote:

Hi.

We have made an upgrade to 4.9.

Custom-built packages with our own patches, which in my mind (I'm the only
one patching those) should not affect the issue I'll describe.

I'm not sure whether we simply didn't notice it before, or whether it's
actually related to something in 4.9.

Basically our system VMs were unable to be patched via the qemu socket.
The script simply errored out with a timeout while trying to push the
data to the socket.

Executing it manually (with the cmd line from the logs) gave the same
result. I even tried the old perl variant, which also behaved the same way.
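
For anyone who wants to reproduce this without going through the agent, the write the patch script performs boils down to pushing data into the VM's host-side unix socket. Here's a rough sketch; the socket path below is made up - take the real one from the command line the agent records in agent.log:

#!/usr/bin/env python
# Rough sketch of the socket write patchviasocket performs, with a timeout,
# to see whether the host-side chardev socket accepts data at all.
# SOCKET_PATH is hypothetical - use the path from the agent.log command line.
import socket

SOCKET_PATH = '/var/lib/libvirt/qemu/r-691-VM.agent'  # hypothetical path
PAYLOAD = b'test payload\n'  # the real script sends the cmdline/keys data

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.settimeout(5)  # give up quickly instead of hanging like the agent script
try:
    s.connect(SOCKET_PATH)
    s.sendall(PAYLOAD)
    print('write succeeded - data reached the host-side socket')
except socket.timeout:
    print('timed out - same behaviour we saw on qemu 2.3.0')
finally:
    s.close()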

So finally we found out that this issue happens only on our HVs which run
qemu 2.3.0, from the CentOS 7 special interest virtualization repo. The
other ones, which run qemu 1.5 from the official repos, can patch the
system VMs fine.

So I'm wondering if anyone has tested 4.9 with KVM and qemu >= 2.x?
Maybe it's something else special in our setup, e.g. we're running the
HVs from a preconfigured netboot image (PXE) - but all of them are,
including those with qemu 1.5, so I have no idea.

Linas Žilinskas
Head of Development
website <http://www.host1plus.com/> | facebook <https://www.facebook.com/Host1Plus> | twitter <https://twitter.com/Host1Plus> | linkedin <https://www.linkedin.com/company/digital-energy-technologies-ltd.>

Host1Plus is a division of Digital Energy Technologies Ltd.

26 York Street, London W1U 6PZ, United Kingdom
 --
--sazli

Linas Žilinskas
Head of Development

--
--sazli
