John — I’ve submitted our patch to work around the issue to Review Board and tied it to the original ticket we found. I submitted it against 4.3, but I know you’ve been testing the patch on 4.2; if someone could give it a sanity check, please do. It looks like it would be an issue in all versions of CloudStack using CentOS/KVM as the hypervisor.
Review Request 25585: [CLOUDSTACK-2823] SystemVMs start fail on CentOS 6.4

Thanks,
David Bierce
Senior System Administrator | Appcore
Office +1.800.735.7104
Direct +1.515.612.7801
www.appcore.com

On Sep 12, 2014, at 1:23 PM, John Skinner <john.skin...@appcore.com> wrote:

> Actually, I believe the kernel is the problem. The hosts are running CentOS
> 6, the systemvm is the stock template, Debian 7. This does not seem to be an
> issue on Ubuntu KVM hypervisors.
>
> The fact that you are rebuilding systemvms on reboot is exactly why you are
> not seeing this issue. New system VMs are usually successful; it’s when you
> reboot them or start a stopped one that this issue shows up.
>
> The serial port is loading, but I think the behavior is different after
> initial boot, because if you access the system VM after you reboot it there
> is nothing on /proc/cmdline, and the /var/cache/cloud/cmdline file is old and
> does not contain the new control network IP address. However, I am able to
> netcat the serial port between the hypervisor and the systemvm after it comes
> up - but CloudStack will eventually force stop the VM, since it doesn’t get
> the new control network IP address and assumes it never started.
>
> That is why, when we wrap that while loop to check for an empty string on
> $cmd, it works every time after that.
>
> Change that global setting from true to false, and try to reboot a few
> routers. I guarantee you will see this issue.
>
> John Skinner
> Appcore
>
> On Sep 12, 2014, at 10:48 AM, Marcus <shadow...@gmail.com> wrote:
>
>> You may also want to investigate whether you are seeing a race condition
>> between /dev/vport0p1 coming online and cloud-early-config running. It will
>> be indicated by a log line in the systemvm /var/log/cloud.log:
>>
>> log_it "/dev/vport0p1 not loaded, perhaps guest kernel is too old."
>>
>> Actually, if it has anything to do with the virtio-serial socket, that would
>> probably be logged. Can you open a bug in Jira and provide the logs?
>>
>> On Fri, Sep 12, 2014 at 9:36 AM, Marcus <shadow...@gmail.com> wrote:
>>
>>> Can you provide more info? Is the host running CentOS 6.x, or is your
>>> systemvm? What is rebooted, the host or the router, and how is it rebooted?
>>> We have what sounds like the same config (CentOS 6.x hosts, stock
>>> community-provided systemvm) and are running thousands of virtual routers,
>>> rebooted regularly with no issue (both hosts and virtual routers). One
>>> setting we have that you may not is that our system VMs are rebuilt from
>>> scratch on every reboot (recreate.systemvm.enabled=true in global
>>> settings). Not that I expect this to be the problem, but it might be
>>> something to look at.
>>>
>>> On Fri, Sep 12, 2014 at 8:49 AM, John Skinner <john.skin...@appcore.com>
>>> wrote:
>>>
>>>> I have found that on CloudStack 4.2+ (when we changed to using the
>>>> virtio socket to send data to the systemvm), cloud-early-config fails
>>>> when running CentOS 6.x. On new systemvm creation there is a high chance
>>>> of success, but still a chance of failure. After the systemvm has been
>>>> created, a simple reboot will cause start to fail every time. This has
>>>> been confirmed on 2 separate CloudStack 4.2 environments: one running
>>>> CentOS 6.3 KVM, and another running CentOS 6.2 KVM. This can be fixed
>>>> with a simple modification to the get_boot_params function in the
>>>> cloud-early-config script.
>>>>
>>>> If you wrap the "while read line" inside another while loop that checks
>>>> whether $cmd is an empty string, it fixes the issue.
>>>>
>>>> This is a pretty nasty issue for anyone running CloudStack 4.2+ on
>>>> CentOS 6.x.
>>>>
>>>> John Skinner
>>>> Appcore
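For anyone following along, here is a rough sketch of the workaround John describes, as it might look in the KVM branch of get_boot_params in cloud-early-config. The variable names and the cmdline: prefix handling below are paraphrased assumptions, not copied from the shipped script; the point is only the outer retry loop that keeps re-reading the virtio-serial port until $cmd is non-empty, so a port that isn't ready yet no longer leaves the systemvm without its boot parameters:

    # Sketch only: details differ from the real get_boot_params.
    VPORT=/dev/vport0p1          # virtio-serial port the host writes to
    cmd=""

    while [ -z "$cmd" ]          # retry until we actually received a cmdline
    do
        while read line
        do
            case "$line" in
                cmdline:*)
                    cmd=${line#cmdline:}
                    echo "$cmd" > /var/cache/cloud/cmdline
                    ;;
            esac
        done < $VPORT
        [ -z "$cmd" ] && sleep 1 # port empty or not ready yet; try again
    done

The patch on Review Board is the authoritative version; this is just to illustrate the shape of the change.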