On Tue, Jan 17, 2017, at 03:41 PM, Jay Faulkner wrote:
> Hi all,
>
> Back in late October, Vasyl wrote support for devstack to auto detect,
> and when possible, use kvm to power Ironic gate jobs
> (0036d83b330d98e64d656b156001dd2209ab1903). This has lowered some job
> time when it works, but has caused failures. How many is hard to
> quantify, as the log messages that show the error don't appear to be
> indexed by Elasticsearch. It's something seen often enough that the
> issue has become a permanent staple on our gate whiteboard, and doesn't
> appear to be decreasing in quantity.
>
> I pushed up a patch, https://review.openstack.org/#/c/421581, which keeps
> the auto detection behavior, but defaults devstack to use qemu emulation
> instead of kvm.
>
> I have two questions:
> 1) Is there any way I'm not aware of that we can quantify the number of
> failures this is causing? The key log message, "KVM: entry failed,
> hardware error 0x0", shows up in logs/libvirt/qemu/node-*.txt.gz.
> 2) Are these failures avoidable or visible in any way?
>
> IMO, if we can't fix these failures, we have to make a change to avoid
> using nested KVM altogether. Lower reliability for our jobs is not worth
> a small decrease in job run time.
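On question 1, until those libvirt instance logs are indexed, one rough way to put a number on this is to pull down a batch of job logs and count the error string by hand. Below is a minimal sketch in Python; the one-directory-per-job-run layout and the script itself are assumptions, and only the error string and the logs/libvirt/qemu/node-*.txt.gz location come from Jay's mail above.

#!/usr/bin/env python
# Rough sketch: count occurrences of the nested KVM error across a batch of
# downloaded Ironic job logs. Only the error string and the
# logs/libvirt/qemu/node-*.txt.gz location come from the mail above; the
# one-directory-per-job-run layout is an assumption.
import glob
import gzip
import os
import sys

ERROR = 'KVM: entry failed, hardware error 0x0'


def count_failures(job_log_root):
    """Return (jobs_hit, total_hits) for job runs under job_log_root."""
    jobs_hit = 0
    total_hits = 0
    for job_dir in sorted(glob.glob(os.path.join(job_log_root, '*'))):
        pattern = os.path.join(job_dir, 'logs', 'libvirt', 'qemu',
                               'node-*.txt.gz')
        hits = 0
        for log in glob.glob(pattern):
            with gzip.open(log, 'rt', errors='replace') as f:
                hits += sum(1 for line in f if ERROR in line)
        if hits:
            jobs_hit += 1
            total_hits += hits
    return jobs_hit, total_hits


if __name__ == '__main__':
    root = sys.argv[1] if len(sys.argv) > 1 else '.'
    jobs, hits = count_failures(root)
    print('%d job run(s) hit "%s" (%d total occurrences)' % (jobs, ERROR, hits))

Run against a directory of downloaded job logs, that at least puts a number on how often the error shows up in the runs you can still get logs for.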
Part of the problem with nested KVM failures is that in many cases they
destroy the test nodes in unrecoverable ways, in which case you don't get
any logs and Zuul will restart the job for you. I think that Graphite will
capture this as a job that resulted in a Null/None status, though (rather
than SUCCESS/FAILURE).

As for collecting info when you do get logs, we don't index the libvirt
instance logs currently, and I am not sure we want to; we already struggle
to keep up with the existing set of logs when we are busy. Instead we might
have job cleanup do a quick grep for known nested virt problem indicators
and then log that to the console log, which will be indexed. I think Trove
has also seen kernel-panic-type errors in syslog that we hypothesized were
a result of using nested virt.

The infra team explicitly attempts to force qemu instead of kvm on jobs
using devstack-gate for these reasons: we know it doesn't work reliably,
and not all clouds support it. Unfortunately, my understanding of the
situation is that the base hypervisor CPU and kernel, the second-level
hypervisor kernel, and the nested guest kernel all come into play here,
and there can be nasty interactions between them causing a variety of
problems.

Put another way:

2017-01-14T00:42:00 <mnaser> if we're talking nested kvm
2017-01-14T00:42:04 <mnaser> it's kindof a nightmare

from http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2017-01-14.log

Clark
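For illustration, the cleanup-time grep described above could be roughly as small as the sketch below. The indicator list, function name, and one-line marker format are assumptions, not existing devstack-gate behaviour; only the "KVM: entry failed, hardware error 0x0" string and the libvirt instance log location come from this thread.

#!/usr/bin/env python
# Hedged sketch of a cleanup-time check: grep the libvirt instance logs for
# known nested virt failure indicators and print a one-line marker to stdout
# (i.e. the console log, which is indexed). The indicator list, marker
# format, and file handling are assumptions, not existing devstack-gate
# behaviour.
import glob
import gzip
import io

# The only indicator taken from this thread; anything added here would be
# a further assumption.
INDICATORS = [
    'KVM: entry failed, hardware error 0x0',
]


def _open_log(path):
    # Depending on when cleanup runs, logs may or may not be compressed yet.
    if path.endswith('.gz'):
        return gzip.open(path, 'rt', errors='replace')
    return io.open(path, 'r', errors='replace')


def check_nested_virt(pattern='logs/libvirt/qemu/node-*.txt*'):
    found = False
    for path in sorted(glob.glob(pattern)):
        with _open_log(path) as f:
            for line in f:
                hit = next((i for i in INDICATORS if i in line), None)
                if hit:
                    # Stable, greppable marker for the indexed console log.
                    print('NESTED_VIRT_FAILURE: %s (%s)' % (hit, path))
                    found = True
                    break  # one marker per log file is enough
    return found


if __name__ == '__main__':
    check_nested_virt()

Printing a fixed marker string to the console log keeps the check cheap while giving the log indexer a stable string to search for.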