Looks good, and I was going to suggest waiting like that if it turned out
to be a race condition. The only thing I would suggest is maybe a sleep so
it doesn't kill CPU if for some reason the system vm never gets cmdline.
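Something along these lines, perhaps. This is only a sketch of the idea,
not the submitted patch; it assumes get_boot_params reads the virtio-serial
port line by line and parses the cmdline into $cmd, and the "cmdline:"
prefix handling below is illustrative:

    cmd=
    while [ -z "$cmd" ]; do
        while read -r line; do
            if [[ $line == cmdline:* ]]; then
                cmd=${line#cmdline:}
                echo "$cmd" > /var/cache/cloud/cmdline
            fi
        done < /dev/vport0p1
        # if the port has not delivered the cmdline yet, back off
        # instead of spinning at 100% CPU
        [ -z "$cmd" ] && sleep 1
    done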
On Sep 12, 2014 1:28 PM, "David Bierce" <david.bie...@appcore.com> wrote:

> John —
>
> I’ve submitted our patch to work around the issue to Review Board and
> tied it to the original ticket we found. I submitted it against 4.3, but
> I know you’ve been testing the patch on 4.2. If someone could take a
> look at it for sanity, please do. It looks like it would be an issue in
> all versions of CloudStack on CentOS/KVM as a hypervisor.
>
> Review Request 25585: [CLOUDSTACK-2823] SystemVMs start fail on CentOS 6.4
>
> Thanks,
> David Bierce
> Senior System Administrator | Appcore
>
> Office +1.800.735.7104
> Direct +1.515.612.7801
> www.appcore.com
>
> On Sep 12, 2014, at 1:23 PM, John Skinner <john.skin...@appcore.com>
> wrote:
>
> > Actually, I believe the kernel is the problem. The hosts are running
> > CentOS 6; the systemvm is the stock template, Debian 7. This does not
> > seem to be an issue on Ubuntu KVM hypervisors.
> >
> > The fact that you are rebuilding systemvms on reboot is exactly why
> > you are not seeing this issue. New system VMs are usually successful;
> > it’s when you reboot them or start a stopped one that this issue
> > shows up.
> >
> > The serial port is loading, but I think the behavior is different
> > after the initial boot, because if you access the system VM after you
> > reboot it you do not have anything in /proc/cmdline, and in
> > /var/cache/cloud/cmdline the file is old and does not contain the new
> > control network IP address. However, I am able to netcat the serial
> > port between the hypervisor and the systemvm after it comes up - but
> > CloudStack will eventually force stop the VM: since it doesn’t get
> > the new control network IP address, it assumes it never started.
> >
> > Which is why, when we wrap that while loop to check for an empty
> > string on $cmd, it works every time after that.
> >
> > Change that global setting from true to false, and try to reboot a
> > few routers. I guarantee you will see this issue.
> >
> > John Skinner
> > Appcore
> >
> > On Sep 12, 2014, at 10:48 AM, Marcus <shadow...@gmail.com> wrote:
> >
> >> You may also want to investigate whether you are seeing a race
> >> condition with /dev/vport0p1 coming online and cloud-early-config
> >> running. It will be indicated by a log line in the systemvm
> >> /var/log/cloud.log:
> >>
> >> log_it "/dev/vport0p1 not loaded, perhaps guest kernel is too old."
> >>
> >> Actually, if it has anything to do with the virtio-serial socket,
> >> that would probably be logged. Can you open a bug in Jira and
> >> provide the logs?
> >>
> >> On Fri, Sep 12, 2014 at 9:36 AM, Marcus <shadow...@gmail.com> wrote:
> >>
> >>> Can you provide more info? Is the host running CentOS 6.x, or is
> >>> your systemvm? What is rebooted, the host or the router, and how is
> >>> it rebooted? We have what sounds like the same config (CentOS 6.x
> >>> hosts, stock community-provided systemvm), and are running
> >>> thousands of virtual routers, rebooted regularly with no issue
> >>> (both hosts and virtual routers). One setting we may have that you
> >>> may not is that our system vms are rebuilt from scratch on every
> >>> reboot (recreate.systemvm.enabled=true in global settings); not
> >>> that I expect this to be the problem, but it might be something to
> >>> look at.
> >>>
> >>> On Fri, Sep 12, 2014 at 8:49 AM, John Skinner
> >>> <john.skin...@appcore.com> wrote:
> >>>
> >>>> I have found that on CloudStack 4.2+ (when we changed to using the
> >>>> virtio socket to send data to the systemvm), when running CentOS
> >>>> 6.X, cloud-early-config fails.
> >>>> On new systemvm creation there is a high chance of success, but
> >>>> still a chance of failure. After the systemvm has been created, a
> >>>> simple reboot will cause start to fail every time. This has been
> >>>> confirmed on 2 separate CloudStack 4.2 environments: one running
> >>>> CentOS 6.3 KVM, and another running CentOS 6.2 KVM. This can be
> >>>> fixed with a simple modification to the get_boot_params function
> >>>> in the cloud-early-config script. If you wrap the while read line
> >>>> inside of another while that checks whether $cmd is an empty
> >>>> string, it fixes the issue.
> >>>>
> >>>> This is a pretty nasty issue for anyone running CloudStack 4.2+ on
> >>>> CentOS 6.X.
> >>>>
> >>>> John Skinner
> >>>> Appcore
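For reference, the single-pass read that the workaround wraps would look
roughly like this. This is a paraphrase for illustration, not the actual
cloud-early-config source, and the "cmdline:" prefix handling is an
assumption:

    # one pass over the virtio-serial port; if the host has not written
    # the data yet, $cmd stays empty and boot configuration fails
    cmd=
    while read -r line; do
        if [[ $line == cmdline:* ]]; then
            cmd=${line#cmdline:}
            echo "$cmd" > /var/cache/cloud/cmdline
        fi
    done < /dev/vport0p1

Wrapping that loop in an outer while [ -z "$cmd" ] retries the read until
the cmdline actually arrives, which is the workaround described above.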